
.net_2.0

My coding blog entries. Typically will either be more complex coding examples or overcoming product issues / troubleshooting resolutions.
Using Uris to determine valid URLs with minimal RegEx

If there is one aspect of programming I understand the least (aside from JavaScript and Ajax), it is Regular Expressions.  For simple stuff they are relatively easy, but I wanted to write a robust URL checker, and it seems everyone has a different way of doing it when it comes to Regular Expressions...

Is there an Alternative?

You betcha... use the System.Uri class to grab all the pertinent parts of the URL, and only use a Regular Expression to determine whether the domain extension is valid. It saves the frustration of dealing with various expressions and, in my opinion, creates more readable code. I have not yet tested how the various methods compare performance-wise, but since I wrote this code for administrative purposes, I was looking more for code that I could understand...

This is sample code so you will need to dress it up for your own use. 

First I create a simple result class so the various results can be passed around the application...

public class EMAIL_URL_VALIDATION
{
    public bool IsValidEmailAddressFormat;
    public bool IsValidEmail;
    public bool IsValidDomain;
    public bool IsValidScheme;

    /// <summary>
    /// Used for checking to see if page actually exists
    /// </summary>
    public bool IsValidPageSource;

    /// <summary>
    /// Used to determine if page requested has moved - if so populate this with the new url.
    /// </summary>
    public string RedirectedPage;

    public string Domain_Error;
    public string Email_Error;
    public bool IsYourDomain;
    public bool UsingLocalHost;
    public string Domain;
    public string Account;
    public string Scheme;
    public string Fragments;
    public string UriFormatError;

    public EMAIL_URL_VALIDATION()
    {
    }
}

All this class does is hold the various status results in one object that can be passed around, so whatever the other application functions need can be easily referenced.

Next I create a collection of the schemes my application allows, so I can pass in a scheme (the first part of the URL, such as http, ftp, https etc.) and determine whether it is permitted.

public static bool IsSchemeAllowed(string scheme)
{
    List<string> colSchemes = new List<string>();
    colSchemes.Add("http");
    colSchemes.Add("https");
    colSchemes.Add("ftp");
    colSchemes.Add("file");
    colSchemes.Add("gopher");
    colSchemes.Add("mailto");
    colSchemes.Add("news");
    colSchemes.Add("nntp");
    return colSchemes.Contains(scheme);
}

We pass in the scheme and get back a boolean that lets us set IsValidScheme.  If it is not a valid scheme then odds are the URL will not be valid anyway.  The exception is relative URLs such as "admin/default.aspx". In that case you would use a StringBuilder or string concatenation to prefix the relative URL with http:// and your domain name...
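
As an alternative to gluing strings together for relative URLs, the Uri class can resolve a relative path against a base address. The following is only a minimal sketch: the class name RelativeUrlSketch, the method ToAbsolute and the example.com addresses are made up for illustration and are not part of the code in this post.

```csharp
using System;

public static class RelativeUrlSketch
{
    // Hypothetical helper: resolves 'url' against 'baseUrl'.
    // If 'url' is already absolute it is returned on its own;
    // null means the pair could not be combined into a valid Uri.
    public static string ToAbsolute(string baseUrl, string url)
    {
        Uri baseUri = new Uri(baseUrl);   // throws UriFormatException if the base itself is bad
        Uri combined;
        if (Uri.TryCreate(baseUri, url, out combined))
        {
            return combined.AbsoluteUri;
        }
        return null;
    }
}
```

Calling RelativeUrlSketch.ToAbsolute("http://www.example.com/", "admin/default.aspx") gives "http://www.example.com/admin/default.aspx", and the result can then go through the same scheme and domain checks as any other URL.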

Next we need a way to check the domain extension to make sure it is valid (testing for .com, .net, .ru etc.). I grabbed this code from the listed source as it was by far the most complete, and it can also be used for validating email addresses (which was the author's original intention)...

 

public static bool DomainExtensionValid(string domain)
{
    //Used to validate the Domain extensions original source
    //http://www.aspemporium.com/classes_src.aspx?cid=4
    //NOTE: the list reflects the TLDs known to the original source; newer TLDs are not covered
    bool valid;
    Regex re;
    string domainvalidatorpattern = "";

    //pattern to validate all known TLD's
    domainvalidatorpattern += "\\.(";
    domainvalidatorpattern += "a[c-gil-oq-uwz]|"; //ac,ad,ae,af,ag,ai,al,am,an,ao,aq,ar,as,at,au,aw,az
    domainvalidatorpattern += "b[a-bd-jm-or-tvwyz]|"; //ba,bb,bd,be,bf,bg,bh,bi,bj,bm,bn,bo,br,bs,bt,bv,bw,by,bz
    domainvalidatorpattern += "c[acdf-ik-orsuvx-z]|"; //ca,cc,cd,cf,cg,ch,ci,ck,cl,cm,cn,co,cr,cs,cu,cv,cx,cy,cz
    domainvalidatorpattern += "d[ejkmoz]|"; //de,dj,dk,dm,do,dz
    domainvalidatorpattern += "e[ceghr-u]|"; //ec,ee,eg,eh,er,es,et,eu
    domainvalidatorpattern += "f[i-kmorx]|"; //fi,fj,fk,fm,fo,fr,fx
    domainvalidatorpattern += "g[abd-ilmnp-uwy]|"; //ga,gb,gd,ge,gf,gg,gh,gi,gl,gm,gn,gp,gq,gr,gs,gt,gu,gw,gy
    domainvalidatorpattern += "h[kmnrtu]|"; //hk,hm,hn,hr,ht,hu
    domainvalidatorpattern += "i[delm-oq-t]|"; //id,ie,il,im,in,io,iq,ir,is,it
    domainvalidatorpattern += "j[emop]|"; //je,jm,jo,jp
    domainvalidatorpattern += "k[eg-imnprwyz]|"; //ke,kg,kh,ki,km,kn,kp,kr,kw,ky,kz
    domainvalidatorpattern += "l[a-cikr-vy]|"; //la,lb,lc,li,lk,lr,ls,lt,lu,lv,ly
    domainvalidatorpattern += "m[acdghk-z]|"; //ma,mc,md,mg,mh,mk,ml,mm,mn,mo,mp,mq,mr,ms,mt,mu,mv,mw,mx,my,mz
    domainvalidatorpattern += "n[ace-giloprtuz]|"; //na,nc,ne,nf,ng,ni,nl,no,np,nr,nt,nu,nz
    domainvalidatorpattern += "om|"; //om
    domainvalidatorpattern += "p[ae-hk-nrtwy]|"; //pa,pe,pf,pg,ph,pk,pl,pm,pn,pr,pt,pw,py
    domainvalidatorpattern += "qa|"; //qa
    domainvalidatorpattern += "r[eouw]|"; //re,ro,ru,rw
    domainvalidatorpattern += "s[a-eg-ort-vyz]|"; //sa,sb,sc,sd,se,sg,sh,si,sj,sk,sl,sm,sn,so,sr,st,su,sv,sy,sz
    domainvalidatorpattern += "t[cdf-hjkm-prtvwz]|"; //tc,td,tf,tg,th,tj,tk,tm,tn,to,tp,tr,tt,tv,tw,tz
    domainvalidatorpattern += "u[agkmsyz]|"; //ua,ug,uk,um,us,uy,uz
    domainvalidatorpattern += "v[aceginu]|"; //va,vc,ve,vg,vi,vn,vu
    domainvalidatorpattern += "w[fs]|"; //wf,ws
    domainvalidatorpattern += "y[etu]|"; //ye,yt,yu
    domainvalidatorpattern += "z[admrw]|"; //za,zd,zm,zr,zw
    domainvalidatorpattern += "com|"; //com
    domainvalidatorpattern += "edu|"; //edu
    domainvalidatorpattern += "net|"; //net
    domainvalidatorpattern += "org|"; //org
    domainvalidatorpattern += "mil|"; //mil
    domainvalidatorpattern += "gov|"; //gov
    domainvalidatorpattern += "biz|"; //biz
    domainvalidatorpattern += "pro|"; //pro
    domainvalidatorpattern += "aero|"; //aero
    domainvalidatorpattern += "coop|"; //coop
    domainvalidatorpattern += "info|"; //info
    domainvalidatorpattern += "name|"; //name
    domainvalidatorpattern += "int|"; //int
    domainvalidatorpattern += "museum"; //museum
    domainvalidatorpattern += ")$";

    re = new Regex(
        domainvalidatorpattern,
        RegexOptions.IgnoreCase | RegexOptions.Singleline
    );

    //if domain matches pattern, it has a valid TLD
    valid = re.IsMatch(domain);

    //return an indication of TLD validity
    return valid;
}
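
Because the pattern above is anchored to the end of the string, it only works when the value passed in really ends with the extension. A rough sketch of pulling the extension out of a URL with the Uri class alone is shown below. The names TldSketch, GetTld and HasKnownTld are hypothetical, and the five-entry list is a placeholder for illustration, not a replacement for the full list above.

```csharp
using System;

public static class TldSketch
{
    // Hypothetical, deliberately short list for illustration only.
    private static readonly string[] KnownTlds = { "com", "net", "org", "edu", "ru" };

    // Returns the last label of the host (e.g. "com"), or null when the URL
    // cannot be parsed, the host is an IP address, or the host has no dot.
    public static string GetTld(string url)
    {
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri)) return null;
        if (uri.HostNameType != UriHostNameType.Dns) return null;

        string host = uri.Host;   // Host never includes the port number
        int lastDot = host.LastIndexOf('.');
        if (lastDot < 0 || lastDot == host.Length - 1) return null;
        return host.Substring(lastDot + 1).ToLowerInvariant();
    }

    public static bool HasKnownTld(string url)
    {
        string tld = GetTld(url);
        return tld != null && Array.IndexOf(KnownTlds, tld) >= 0;
    }
}
```

With this sketch a URL such as "http://www.example.com:8080/page" yields "com", while an IP address or a bare "localhost" yields null, so those cases can be handled separately instead of failing the extension check.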

 

To me it was more important to actually verify that the URL is online and points to a valid page.  Otherwise, what really is the point of including a link on a page or creating a redirection to an external site...

public static bool SimpleTestUrlAvailability(string Url)
{
    bool success = false;
    try
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(Url);
        // Dispose the response so the underlying connection is released
        using (HttpWebResponse res = (HttpWebResponse)req.GetResponse())
        {
            success = (res.StatusCode == HttpStatusCode.OK);
        }
    }
    catch
    {
        //log it (connection failures, error status codes, malformed or non-http urls all end up here)
    }
    return success;
}

Note this code is rather simple. We check the status code and use it as the boolean result.  Note the try...catch: GetResponse throws a WebException if it cannot connect, for instance because the site doesn't listen on the requested scheme's port (a UriFormatException is thrown earlier, by Create, when the URL itself is malformed).  For example, checking an https URL when the site doesn't have https enabled will throw the exception.  So I wrap the request in a try...catch so that I can loop through several cases...

The following code is what I use to loop through the cases: determine whether only the original URL is invalid, try http instead if https is not found, and as the last resort determine whether the domain itself is actually online...
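
If you want to know why a check failed rather than just that it failed, the exceptions can be told apart. This is a sketch only, based on my understanding that a malformed URL raises UriFormatException in Create, an unregistered scheme raises NotSupportedException, and connection or HTTP errors raise WebException from GetResponse. The names ProbeSketch and Describe and the 5 second timeout are assumptions for illustration. On current .NET versions WebRequest is marked obsolete in favor of HttpClient, so expect a compiler warning there.

```csharp
using System;
using System.Net;

public static class ProbeSketch
{
    // Hypothetical helper: returns a short description of what happened.
    public static string Describe(string url)
    {
        try
        {
            HttpWebRequest req = WebRequest.Create(url) as HttpWebRequest;
            if (req == null) return "not an http(s) url";   // e.g. ftp:// or file://
            req.Timeout = 5000;   // milliseconds; an arbitrary choice
            using (HttpWebResponse res = (HttpWebResponse)req.GetResponse())
            {
                return "status " + (int)res.StatusCode;
            }
        }
        catch (UriFormatException)
        {
            return "malformed url";
        }
        catch (NotSupportedException)
        {
            return "unsupported scheme";
        }
        catch (WebException ex)
        {
            // Error status codes such as 404 still carry a response
            HttpWebResponse res = ex.Response as HttpWebResponse;
            if (res != null)
            {
                using (res) { return "status " + (int)res.StatusCode; }
            }
            return "no response: " + ex.Status;   // e.g. ConnectFailure, Timeout
        }
    }
}
```

The returned text could be stored in a field such as Domain_Error so the caller can show something more useful than a plain false.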

public static bool AdvancedTestUrlAvailability(string OriginalUrl, string domain, string schema)
{
    bool success = SimpleTestUrlAvailability(OriginalUrl);
    HttpContext.Current.Trace.Warn("$$$ Status for finding the original url is : " + success + " Url was:" + OriginalUrl);

    if (schema == "https" && !success)
    {
        // https failed, so try the same page over http
        string httpUrl = OriginalUrl.Replace("https://", "http://");
        success = SimpleTestUrlAvailability(httpUrl);
        HttpContext.Current.Trace.Warn("$$$ Status for finding the http after https failed is : " + success + " Url was:" + httpUrl);
    }

    if (!success)
    {
        //Now we check to see if the domain is available....
        string domainUrl = schema + "://" + domain;
        success = SimpleTestUrlAvailability(domainUrl);
        HttpContext.Current.Trace.Warn("$$$ Looking for Main Domain status : " + success + " Url was:" + domainUrl);

        if (!success && schema == "https")
        {
            string httpDomainUrl = "http://" + domain;
            success = SimpleTestUrlAvailability(httpDomainUrl);
            HttpContext.Current.Trace.Warn("$$$ Status for finding Domain after the http after https failed is : " + success + " Url was:" + httpDomainUrl);
        }
    }

    return success;
}
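
The fallback URLs in the method above are built with string replacement. A possible variation, sketched here under my own assumptions, is to let Uri and UriBuilder produce the same list of candidates. FallbackSketch, BuildCandidates and ToHttp are hypothetical names and the example.com address is a placeholder.

```csharp
using System;
using System.Collections.Generic;

public static class FallbackSketch
{
    // Hypothetical helper: lists the URLs to try, in the same order as the
    // method above: original page, http page, site root, http site root.
    public static List<string> BuildCandidates(string url)
    {
        Uri original = new Uri(url);
        bool isHttps = original.Scheme == Uri.UriSchemeHttps;

        List<string> candidates = new List<string>();
        candidates.Add(original.AbsoluteUri);
        if (isHttps) candidates.Add(ToHttp(original).AbsoluteUri);

        // GetLeftPart(Authority) keeps only scheme://host[:port]
        Uri root = new Uri(original.GetLeftPart(UriPartial.Authority));
        candidates.Add(root.AbsoluteUri);
        if (isHttps) candidates.Add(ToHttp(root).AbsoluteUri);

        return candidates;
    }

    private static Uri ToHttp(Uri httpsUri)
    {
        UriBuilder builder = new UriBuilder(httpsUri);
        builder.Scheme = Uri.UriSchemeHttp;
        // Drop the implicit 443 so the http default port is used instead
        if (httpsUri.IsDefaultPort) builder.Port = -1;
        return builder.Uri;
    }
}
```

Each candidate can then be passed to SimpleTestUrlAvailability in turn, stopping at the first one that succeeds.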

 

The next part is the method we actually call....  We create a new Uri, then dissect it and run the various methods listed above to determine what is valid and what is not...

 

public static EMAIL_URL_VALIDATION JodysTestUsingUri(string url)
{
    EMAIL_URL_VALIDATION myresults = new EMAIL_URL_VALIDATION();
    try
    {
        if (url != null && url.Length > 1)
        {
            Uri newUrl = new Uri(url);
            string domain = newUrl.Authority;
            string scheme = newUrl.Scheme;

            // Check the extension against Host: Authority can end with a port (e.g. "example.com:8080")
            bool isValidDomain = DomainExtensionValid(newUrl.Host);

            myresults.IsValidScheme = IsSchemeAllowed(scheme);
            myresults.IsValidDomain = isValidDomain;
            myresults.Scheme = scheme;
            HttpContext.Current.Trace.Warn("!!! Authority = " + newUrl.Authority + " Scheme is " + newUrl.Scheme + " isValidDomain " + isValidDomain + " Scheme Allowed = " + myresults.IsValidScheme);

            if (myresults.IsValidScheme && myresults.IsValidDomain)
            {
                myresults.IsValidPageSource = AdvancedTestUrlAvailability(url, domain, scheme);
                HttpContext.Current.Trace.Warn("!!! Page Availability is " + myresults.IsValidPageSource);
            }
        }
    }
    catch (UriFormatException e)
    {
        HttpContext.Current.Trace.Warn("!!! -Requested URI Check = " + url + " Format URI error: " + e.ToString());
        myresults.IsValidDomain = false;
        myresults.UriFormatError = e.ToString();
    }
    catch (Exception ex)
    {
        HttpContext.Current.Trace.Warn("!!! -Requested URI Check = " + url + " Unexpected error: " + ex.ToString());
    }
    return myresults;
}
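
To see what the Uri class hands back when it dissects a URL, a small throwaway helper like the one below can be useful. UriPartsSketch and Summarize are hypothetical names and the address is only an example.

```csharp
using System;

public static class UriPartsSketch
{
    // Hypothetical helper: shows the pieces the Uri class exposes for a URL.
    public static string Summarize(string url)
    {
        Uri uri = new Uri(url);   // throws UriFormatException for a malformed url
        return "scheme=" + uri.Scheme
            + " host=" + uri.Host
            + " authority=" + uri.Authority
            + " port=" + uri.Port
            + " path=" + uri.AbsolutePath
            + " query=" + uri.Query
            + " fragment=" + uri.Fragment;
    }
}
```

For "http://www.example.com:8080/admin/default.aspx?id=5#top" this returns "scheme=http host=www.example.com authority=www.example.com:8080 port=8080 path=/admin/default.aspx query=?id=5 fragment=#top". Note how Authority carries the port while Host does not.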

 

You'll notice that I write the more important details to trace so I can view the results... Again, this is an alternative to the RegEx methods, and the fact that it can actually verify the URL is online is handy.  It can be expanded to handle batches, run as a scheduled task, and be enhanced even more... The code is not pretty, but the goal here was to show working code that does the basics...

 

Posted: Friday, March 09, 2007 4:56 AM by Jody

Comments

.net_2.0 said:

In Part I of this series, the rudimentary basics of the URI class was covered. Covered were the components

# March 13, 2007 4:39 PM