Using Uris to determine valid URLs with minimal RegEx
If there is one aspect of programming I understand the least (apart from JavaScript
and Ajax), it is Regular Expressions. For simple stuff they are relatively
easy, but I wanted to write a robust URL checker, and it seems everyone has a
different way of doing it with Regular Expressions.
Is there an Alternative?
You betcha: use the Uri class to grab all the pertinent parts of the
URL, and use a Regular Expression only to determine whether the domain extension is
valid. It saves the frustration of juggling various expressions and, in my opinion,
makes for more readable code. I have not yet tested how the various methods
compare performance-wise, but since I wrote this code
for administrative purposes, I was looking more for code that I could
understand.
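As a rough illustration of what the Uri class hands you without any RegEx, here is a minimal sketch. The class name UriPartsDemo, the Describe helper, and the example.com address are made up for this illustration and are not part of my validation code.

```csharp
using System;

public static class UriPartsDemo
{
    // Hypothetical helper: lists the parts of an absolute URL in one string.
    public static string Describe(string url)
    {
        // The Uri constructor throws UriFormatException on malformed input.
        Uri uri = new Uri(url);
        return "Scheme=" + uri.Scheme
            + " Host=" + uri.Host
            + " Authority=" + uri.Authority   // host plus any non-default port
            + " Path=" + uri.AbsolutePath
            + " Query=" + uri.Query
            + " Fragment=" + uri.Fragment;
    }

    public static void Main()
    {
        Console.WriteLine(Describe("https://www.example.com:8443/admin/default.aspx?id=7#top"));
    }
}
```

Note that Authority includes the port when it is not the default for the scheme, while Host never does.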
This is sample code so you will need to dress it up for your own use.
First I create a simple result class that lets me pass the various results
around the application.
public class EMAIL_URL_VALIDATION
{
    public bool IsValidEmailAddressFormat;
    public bool IsValidEmail;
    public bool IsValidDomain;
    public bool IsValidScheme;

    /// <summary>
    /// Used for checking to see if page actually exists
    /// </summary>
    public bool IsValidPageSource;

    /// <summary>
    /// Used to determine if page requested has moved - if so populate this
    /// with the new url.
    /// </summary>
    public string RedirectedPage;

    public string Domain_Error;
    public string Email_Error;
    public bool IsYourDomain;
    public bool UsingLocalHost;
    public string Domain;
    public string Account;
    public string Scheme;
    public string Fragments;
    public string UriFormatError;

    public EMAIL_URL_VALIDATION()
    {
    }
}
All this class does is let me collect the various status results in one
object and pass them around, so that whatever other application functions
need can be easily referenced.
Next I create a collection of allowed schemes so that I can pass in the scheme
(the first part of the URL, such as http, ftp, https, etc.) and determine
whether my application allows it.
public static bool IsSchemeAllowed(string scheme)
{
    // Schemes this application accepts
    List<string> colSchemes = new List<string>();
    colSchemes.Add("http");
    colSchemes.Add("https");
    colSchemes.Add("ftp");
    colSchemes.Add("file");
    colSchemes.Add("gopher");
    colSchemes.Add("mailto");
    colSchemes.Add("news");
    colSchemes.Add("nntp");
    return colSchemes.Contains(scheme);
}
We pass in the scheme and get back a boolean that lets us set
IsValidScheme. If it is not a valid scheme, then odds are the URL will not
be valid anyway. The exception is relative URLs such as
"admin/default.aspx". In that case you would use a StringBuilder or string
concatenation to prefix the relative URL with http:// and your domain name.
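Instead of gluing strings together, the Uri class can also resolve a relative URL against a base address. The following is only a sketch of that idea; the RelativeUrlDemo class, the ToAbsolute helper, and the www.example.com base address are assumptions for illustration, so substitute your own domain.

```csharp
using System;

public static class RelativeUrlDemo
{
    // Hypothetical helper: resolves a possibly relative URL against a base address.
    // Returns null when the combination cannot be turned into a valid Uri.
    public static string ToAbsolute(string baseAddress, string url)
    {
        Uri baseUri = new Uri(baseAddress);
        Uri result;
        if (Uri.TryCreate(baseUri, url, out result))
        {
            return result.AbsoluteUri;
        }
        return null;
    }

    public static void Main()
    {
        // A relative URL is resolved against the base address.
        Console.WriteLine(ToAbsolute("http://www.example.com/", "admin/default.aspx"));
    }
}
```

The resulting absolute URL can then go through the same scheme and domain checks as any other URL.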
Next we need a way to check the domain extension to make sure it is a valid
one (testing for .com, .net, .ru, etc.). I grabbed this code from the listed
source, as it was by far the most complete, and it can also be used for validating
email addresses (which was the author's original intention).
public static bool DomainExtensionValid(string domain)
{
    //Used to validate the Domain extensions. Original source:
    //http://www.aspemporium.com/classes_src.aspx?cid=4
    bool valid;
    Regex re;
    string domainvalidatorpattern = "";

    //pattern to validate all known TLD's
    domainvalidatorpattern += "\\.(";
    domainvalidatorpattern += "a[c-gil-oq-uwz]|"; //ac,ad,ae,af,ag,ai,al,am,an,ao,aq,ar,as,at,au,aw,az
    domainvalidatorpattern += "b[a-bd-jm-or-tvwyz]|"; //ba,bb,bd,be,bf,bg,bh,bi,bj,bm,bn,bo,br,bs,bt,bv,bw,by,bz
    domainvalidatorpattern += "c[acdf-ik-orsuvx-z]|"; //ca,cc,cd,cf,cg,ch,ci,ck,cl,cm,cn,co,cr,cs,cu,cv,cx,cy,cz
    domainvalidatorpattern += "d[ejkmoz]|"; //de,dj,dk,dm,do,dz
    domainvalidatorpattern += "e[ceghr-u]|"; //ec,ee,eg,eh,er,es,et,eu
    domainvalidatorpattern += "f[i-kmorx]|"; //fi,fj,fk,fm,fo,fr,fx
    domainvalidatorpattern += "g[abd-ilmnp-uwy]|"; //ga,gb,gd,ge,gf,gg,gh,gi,gl,gm,gn,gp,gq,gr,gs,gt,gu,gw,gy
    domainvalidatorpattern += "h[kmnrtu]|"; //hk,hm,hn,hr,ht,hu
    domainvalidatorpattern += "i[delm-oq-t]|"; //id,ie,il,im,in,io,iq,ir,is,it
    domainvalidatorpattern += "j[emop]|"; //je,jm,jo,jp
    domainvalidatorpattern += "k[eg-imnprwyz]|"; //ke,kg,kh,ki,km,kn,kp,kr,kw,ky,kz
    domainvalidatorpattern += "l[a-cikr-vy]|"; //la,lb,lc,li,lk,lr,ls,lt,lu,lv,ly
    domainvalidatorpattern += "m[acdghk-z]|"; //ma,mc,md,mg,mh,mk,ml,mm,mn,mo,mp,mq,mr,ms,mt,mu,mv,mw,mx,my,mz
    domainvalidatorpattern += "n[ace-giloprtuz]|"; //na,nc,ne,nf,ng,ni,nl,no,np,nr,nt,nu,nz
    domainvalidatorpattern += "om|"; //om
    domainvalidatorpattern += "p[ae-hk-nrtwy]|"; //pa,pe,pf,pg,ph,pk,pl,pm,pn,pr,pt,pw,py
    domainvalidatorpattern += "qa|"; //qa
    domainvalidatorpattern += "r[eouw]|"; //re,ro,ru,rw
    domainvalidatorpattern += "s[a-eg-ort-vyz]|"; //sa,sb,sc,sd,se,sg,sh,si,sj,sk,sl,sm,sn,so,sr,st,su,sv,sy,sz
    domainvalidatorpattern += "t[cdf-hjkm-prtvwz]|"; //tc,td,tf,tg,th,tj,tk,tm,tn,to,tp,tr,tt,tv,tw,tz
    domainvalidatorpattern += "u[agkmsyz]|"; //ua,ug,uk,um,us,uy,uz
    domainvalidatorpattern += "v[aceginu]|"; //va,vc,ve,vg,vi,vn,vu
    domainvalidatorpattern += "w[fs]|"; //wf,ws
    domainvalidatorpattern += "y[etu]|"; //ye,yt,yu
    domainvalidatorpattern += "z[admrw]|"; //za,zd,zm,zr,zw
    domainvalidatorpattern += "com|"; //com
    domainvalidatorpattern += "edu|"; //edu
    domainvalidatorpattern += "net|"; //net
    domainvalidatorpattern += "org|"; //org
    domainvalidatorpattern += "mil|"; //mil
    domainvalidatorpattern += "gov|"; //gov
    domainvalidatorpattern += "biz|"; //biz
    domainvalidatorpattern += "pro|"; //pro
    domainvalidatorpattern += "aero|"; //aero
    domainvalidatorpattern += "coop|"; //coop
    domainvalidatorpattern += "info|"; //info
    domainvalidatorpattern += "name|"; //name
    domainvalidatorpattern += "int|"; //int
    domainvalidatorpattern += "museum"; //museum
    domainvalidatorpattern += ")$";

    re = new Regex(
        domainvalidatorpattern,
        RegexOptions.IgnoreCase | RegexOptions.Singleline
    );

    //if domain matches pattern, it has a valid TLD
    valid = re.IsMatch(domain);

    //return an indication of TLD validity
    return valid;
}
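If even that one pattern feels like too much RegEx, the same kind of check can be done with a plain lookup on the last label of the host name. This is a different technique from the function above, shown only as a sketch: the TldLookupDemo class and the HasKnownTld helper are hypothetical, and the set holds just a handful of TLDs for illustration, so you would have to fill in the full list yourself.

```csharp
using System;
using System.Collections.Generic;

public static class TldLookupDemo
{
    // Illustrative subset only; a real list would need every TLD you accept.
    private static readonly HashSet<string> KnownTlds =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "com", "net", "org", "edu", "gov", "ru", "uk", "de"
        };

    // Hypothetical helper: true when the text after the last dot is a known TLD.
    public static bool HasKnownTld(string host)
    {
        if (string.IsNullOrEmpty(host))
        {
            return false;
        }
        int lastDot = host.LastIndexOf('.');
        // No dot at all (e.g. "localhost") or a trailing dot means no usable TLD.
        if (lastDot < 0 || lastDot == host.Length - 1)
        {
            return false;
        }
        return KnownTlds.Contains(host.Substring(lastDot + 1));
    }

    public static void Main()
    {
        Console.WriteLine(HasKnownTld("www.example.com"));
        Console.WriteLine(HasKnownTld("localhost"));
    }
}
```

The trade-off is a longer list to maintain in exchange for not having to read character classes.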
To me it was more important to actually verify that the URL is online and points
to a valid page. Otherwise, what really is the point of including a link on a page
or creating a redirect to an external site?
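public static bool SimpleTestUrlAvailability(string Url)
{
    bool success = false;
    try
    {
        // Non-HTTP schemes (ftp, file, ...) fail the cast below and land in the catch block.
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(Url);
        // Dispose the response so the underlying connection is released.
        using (HttpWebResponse res = (HttpWebResponse)req.GetResponse())
        {
            if (res.StatusCode == HttpStatusCode.OK)
            {
                success = true;
            }
        }
    }
    catch
    {
        //log it
    }
    return success;
}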
Note that this code is rather simple. We check the status code and use it to set
the result boolean. Note the try...catch: GetResponse throws a WebException
if it cannot connect because the site doesn't listen on the requested scheme's
port. For example, checking an https URL when the site doesn't have https
enabled will throw the exception. So I wrap the call in a try...catch so that I can
loop through several cases.
The following code is what I use to work through those cases: determine whether it
is just the original URL that is not valid, try http instead if https is not found,
and finally determine whether the domain itself is actually online.
public static bool AdvancedTestUrlAvailability(string OriginalUrl, string domain, string schema)
{
    bool success = false;

    // First try the URL exactly as it was given.
    success = SimpleTestUrlAvailability(OriginalUrl);
    HttpContext.Current.Trace.Warn("$$$ Status for finding the original url is : " + success + " Url was:" + OriginalUrl);

    if (schema == "https" && success == false)
    {
        // https failed, so try the same page over http.
        success = SimpleTestUrlAvailability(OriginalUrl.Replace("https://", "http://"));
        HttpContext.Current.Trace.Warn("$$$ Status for finding the http after https failed is : " + success +
            " Url was:" + OriginalUrl.Replace("https://", "http://"));
    }

    if (success == false)
    {
        //Now we check to see if the domain is available....
        success = SimpleTestUrlAvailability(schema + "://" + domain);
        HttpContext.Current.Trace.Warn("$$$ Looking for Main Domain status : " + success + " Url was:" + (schema + "://" + domain));

        if (success == false && schema == "https")
        {
            success = SimpleTestUrlAvailability("http://" + domain);
            HttpContext.Current.Trace.Warn("$$$ Status for finding Domain after the http after https failed is : " +
                success + " Url was:" + ("http://" + domain));
        }
    }
    return success;
}
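To make the order of those fallbacks easier to see, here is a small sketch that only builds the list of URLs in the order the method above would try them, without making any requests. The FallbackOrderDemo class and the BuildCandidates helper are hypothetical names for this illustration, and example.com is a placeholder domain.

```csharp
using System;
using System.Collections.Generic;

public static class FallbackOrderDemo
{
    // Hypothetical helper: returns the URLs in the order they would be tried.
    // Checking stops at the first one that responds with OK.
    public static List<string> BuildCandidates(string originalUrl, string domain, string scheme)
    {
        List<string> candidates = new List<string>();
        bool isHttps = scheme == "https";

        candidates.Add(originalUrl);                       // 1. the URL as given
        if (isHttps)
        {
            // 2. same page over http
            candidates.Add(originalUrl.Replace("https://", "http://"));
        }
        candidates.Add(scheme + "://" + domain);           // 3. the bare domain
        if (isHttps)
        {
            candidates.Add("http://" + domain);            // 4. bare domain over http
        }
        return candidates;
    }

    public static void Main()
    {
        foreach (string candidate in BuildCandidates("https://example.com/page.aspx", "example.com", "https"))
        {
            Console.WriteLine(candidate);
        }
    }
}
```

For an https URL this yields four candidates, and for any other scheme just two: the original URL and the bare domain.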
The next part is the method we actually call. We create a new Uri, then
dissect it and run the various methods listed above to
determine what is valid and what is not.
public static EMAIL_URL_VALIDATION JodysTestUsingUri(string url)
{
    EMAIL_URL_VALIDATION myresults = new EMAIL_URL_VALIDATION();
    try
    {
        if (url != null && url.Length > 1)
        {
            try
            {
                // Throws UriFormatException if the url is malformed or relative.
                Uri newUrl = new Uri(url);
                string domain = newUrl.Authority;
                // Check the extension against Host: Authority can carry a ":port"
                // suffix, which would never match the TLD pattern.
                bool isValidDomain = DomainExtensionValid(newUrl.Host);
                string scheme = newUrl.Scheme;

                myresults.IsValidScheme = IsSchemeAllowed(scheme);
                myresults.IsValidDomain = isValidDomain;
                myresults.Scheme = scheme;
                HttpContext.Current.Trace.Warn("!!! Authority = " + newUrl.Authority +
                    " Scheme is " + newUrl.Scheme + " isValidDomain " + isValidDomain +
                    " Scheme Allowed = " + myresults.IsValidScheme);

                // Only go out to the network when the cheap checks pass.
                if (myresults.IsValidScheme && myresults.IsValidDomain)
                {
                    myresults.IsValidPageSource = AdvancedTestUrlAvailability(url, domain, scheme);
                    HttpContext.Current.Trace.Warn("!!! Page Availability is " + myresults.IsValidPageSource);
                }
            }
            catch (UriFormatException e)
            {
                HttpContext.Current.Trace.Warn("!!! -Requested URI Check = " + url +
                    " Format URI error: " + e.ToString());
                myresults.IsValidDomain = false;
                myresults.UriFormatError = e.ToString();
            }
        }
    }
    catch (Exception ex)
    {
        HttpContext.Current.Trace.Warn("!!! -Requested URI Check = " + url +
            " Unexpected error: " + ex.ToString());
    }
    return myresults;
}
You'll notice that I write the more important details to the trace so I can view the
results. Again, this is an alternative to the pure RegEx methods, and the fact that it
can actually validate that the URL is online is handy. It can be expanded to
handle batches, run as a scheduled task, and be enhanced even more. The code is not
pretty, but the goal here was to show working code that does the basics.