
.net_2.0

My coding blog entries. Typically will either be more complex coding examples or overcoming product issues / troubleshooting resolutions.
URIs..and System.Net (Part I)

URI handling in System.Net is probably one of the weakest and least documented aspects of .Net 2.0.  While the basics of the Uri class are reasonably documented, the developer still has to address additional issues to provide a truly robust routine for identifying a valid url.  In this series of articles I discuss various ways to analyze URIs for various tasks.  The overall assumption is that the URI can be used by the HttpWebRequest object, and that the HttpWebResponse will subsequently provide the information needed to handle the response.  I will not be working with streams, as the only goal is to identify URIs and response properties.

 NOTE: For the purposes of this series, URIs and URLs are basically the same.  The practical difference is that a url held as a plain string can contain anything and requires string analysis to break down, whereas the Uri class is designed to adhere to the RFC syntax rules, and its properties and methods break the URI down into components.

Why this is important:

1. SPAM prevention:  Most SPAM on websites is detected with Host blacklists, which in some cases are external lists that mix mail, phishing and known web SPAM sites.  This can block legitimate links (a poor example would be someone who uses AngelFire to host their site), or even let SPAM sites get through.  The principle of a SPAM site is to get the web user to click through to a landing page and then redirect them to alternate sites.  If the web application analyzes a URI first, in combination with Host blacklists, a better SPAM detection routine can be developed.

2. Verifying external links: Nothing is worse than someone making a typo in a link and not being able to edit the entry after the fact.  Broken URLs not only cause user frustration but also affect search engine rankings.

3. Verifying local links: If you use URL re-writers, maintain hundreds of static files, or use custom IHttpHandler implementations, odds are there is the occasional redirection entry where the desired destination url is actually a redirection itself.  Perhaps a friendly url name has already been used, but because the path of a URL or URI can be case sensitive, you end up with multiple entries configured for the same destination, which may produce a bogus result.

For instance: http://tech-review.org/BLOGS & http://tech-review.org/blogs can point to separate resources if a custom IHttpHandler does not parse the URL / URI and convert it to a single case, such as by using .ToLower() or .ToUpper(). While this may not be an issue for a site that is based on physical disk directories and filenames, it is an issue for any dynamic url creation by a custom IHttpHandler where the url is generated from a data source such as a back end database, and the stored url is not properly formatted before it is committed to the data source.

4. Advanced detection: 

  • Being able to automatically update links when a link has been permanently moved as a result of a site redesign.   For instance, a technical site references MSDN articles and two years later an article is moved. You could now automatically update the links used by your site, sparing the visitor the 'you'll be redirected in 10 seconds' wait, and you will do better in search results with a solid link.  Search engines typically give a cross link a lesser score when they detect that the link target is redirected (thank the Spammers for this one).
  • Detect if a link requires authentication to view.   Nothing like someone placing a link inline and the user discovering they must be a member of the site to view it.
  • There are numerous other scenarios where analyzing the response can be beneficial.
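
The case-sensitivity problem from item 3 can be sketched roughly as follows. This is only an illustration, not the handler used on this site: the class and method names are made up, and it assumes that lower-casing the whole path is acceptable for your urls (the query string is left untouched and any fragment is dropped).

```csharp
using System;

public static class UrlCaseNormalizer
{
    // Hypothetical helper: returns the url with a lower-cased path so that
    // /BLOGS and /blogs end up as one key in the data source.
    public static string Normalize(string url)
    {
        Uri uri = new Uri(url, UriKind.Absolute);

        // GetLeftPart(UriPartial.Authority) yields scheme + "://" + host (and port);
        // the Uri class already lower-cases the host name itself.
        return uri.GetLeftPart(UriPartial.Authority)
             + uri.AbsolutePath.ToLowerInvariant()
             + uri.Query;
    }

    public static void Main()
    {
        Console.WriteLine(Normalize("http://tech-review.org/BLOGS")); // http://tech-review.org/blogs
        Console.WriteLine(Normalize("http://tech-review.org/blogs")); // http://tech-review.org/blogs
    }
}
```

Normalizing before the url is committed to the data source means the handler only ever has to compare one spelling.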

 

Types of URIs:

1. Relative: This means that the URI contains only the Path and Query.  The Scheme and Authority are not available; reading those properties on a relative Uri throws an InvalidOperationException rather than returning null.  Relative URIs also can not be used to launch a WebRequest, as the WebRequest requires a fully qualified url. Example: "/blogs/default.aspx"

2. Absolute: This is a fully qualified url such as "http://tech-review.org/blogs/default.aspx".   
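
A minimal sketch of the difference between the two kinds, using the example urls above. The class and method names are only for illustration; the Uri members used (IsAbsoluteUri, Scheme, AbsoluteUri and the UriKind constructor) are part of the framework.

```csharp
using System;

public static class UriKindDemo
{
    // Hypothetical helper: resolves a relative url against a base so that
    // the result is fully qualified and usable for a WebRequest.
    public static string Resolve(string baseUrl, string relativeUrl)
    {
        Uri baseUri = new Uri(baseUrl, UriKind.Absolute);
        Uri relativeUri = new Uri(relativeUrl, UriKind.Relative);
        return new Uri(baseUri, relativeUri).AbsoluteUri;
    }

    public static void Main()
    {
        Uri absolute = new Uri("http://tech-review.org/blogs/default.aspx", UriKind.Absolute);
        Uri relative = new Uri("/blogs/default.aspx", UriKind.Relative);

        Console.WriteLine(absolute.IsAbsoluteUri); // True
        Console.WriteLine(relative.IsAbsoluteUri); // False

        try
        {
            // Component properties are not supported on a relative Uri.
            Console.WriteLine(relative.Scheme);
        }
        catch (InvalidOperationException ex)
        {
            Console.WriteLine("No scheme on a relative Uri: " + ex.Message);
        }

        Console.WriteLine(Resolve("http://tech-review.org", "/blogs/default.aspx"));
        // http://tech-review.org/blogs/default.aspx
    }
}
```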

Main Components of the URI.

  • Scheme: This denotes the type of request such as http, ftp, mailto etc... In other words - denotes the protocol that must be used.
  • Authority: This is the actual domain or server source, such as "tech-review.org" or "www.tech-review.org".  In the url string it may be preceded by user credentials, as in "jody@tech-review.org"; the Uri class exposes those separately through its UserInfo property.
  • Path: This will be the actual page path, as such: "/blogs/default.aspx"
  • Query and Fragment: These are identifiers at the end of the path, such as "?prodID=122" (the query) or "#Todays_Blog_Highlight" (the fragment).
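
As a quick sketch, the components listed above map onto Uri properties as shown here. The sample url simply combines the fragments used in the list, and the class and method names are made up for the example.

```csharp
using System;

public static class UriComponentsDemo
{
    // Hypothetical helper: returns scheme, authority, path, query and fragment in that order.
    public static string[] Split(string url)
    {
        Uri uri = new Uri(url, UriKind.Absolute);
        return new string[] { uri.Scheme, uri.Authority, uri.AbsolutePath, uri.Query, uri.Fragment };
    }

    public static void Main()
    {
        string url = "http://tech-review.org/blogs/default.aspx?prodID=122#Todays_Blog_Highlight";

        foreach (string part in Split(url))
        {
            Console.WriteLine(part);
        }
        // http
        // tech-review.org
        // /blogs/default.aspx
        // ?prodID=122
        // #Todays_Blog_Highlight

        // Segments is a different property: it splits the path itself.
        foreach (string segment in new Uri(url).Segments)
        {
            Console.WriteLine(segment); // "/", then "blogs/", then "default.aspx"
        }
    }
}
```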

 

 

All of the above was probably merely a refresher course for you at best.  So, let us get into some really interesting aspects of the WebRequest model.  Almost all of the sample code out there (including MSDN) merely deals with creating a request object and then getting the stream, so that the returned html or stream data can be searched for links, used to build a gif of the page, or the like.  I'll try to share as much insight as I can into what can actually be done with the System.Net namespace, with specific attention to URIs.

 

URIs:

To create a URI:

string UrlAsString = "http://tech-review.org/blogs";

Uri newURI = new Uri(UrlAsString);

Note: URIs are READ ONLY once set.  This means that if you wanted to add "/default.aspx" to the newURI, you can not do this:

newURI = newURI+"/default.aspx";
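
What does work is building a new instance. The sketch below is one simple way to do it, assuming the original url has no query string or fragment (otherwise plain concatenation would put the suffix in the wrong place). AppendPath is a made-up helper name.

```csharp
using System;

public static class UriReadOnlyDemo
{
    // Hypothetical helper: Uri has no property setters, so we take the
    // string form, extend it, and parse the result into a new Uri.
    public static Uri AppendPath(Uri source, string suffix)
    {
        return new Uri(source.AbsoluteUri + suffix);
    }

    public static void Main()
    {
        Uri newURI = new Uri("http://tech-review.org/blogs");
        Uri extended = AppendPath(newURI, "/default.aspx");

        Console.WriteLine(newURI.AbsoluteUri);   // http://tech-review.org/blogs (unchanged)
        Console.WriteLine(extended.AbsoluteUri); // http://tech-review.org/blogs/default.aspx
    }
}
```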
To re-use a URI variable in a method:
public static void MyURIHandler(string UrlAsString)
{
    // Declare and initialize first, then assign the parsed instance.
    Uri newURI = null;
    newURI = new Uri(UrlAsString);
}
Notes: You have to initialize the variable to null; otherwise, once the assignment is moved inside a try block, the compiler rejects any later use of the variable as unassigned.  You will want to handle a Uri declaration this way since you will want to use Try..Catch blocks.  Exceptions have to be handled when creating a new instance of the Uri because, unlike a lot of other C# operations, a url that can not be successfully parsed does not simply come back as null; the constructor throws an exception (a UriFormatException) instead.  This is one of the drawbacks of the Uri class, as significantly more code has to be written to handle the various reasons why a url is incorrect and will not instantiate a Uri object.  More on this later.
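
A minimal sketch of the two usual ways to guard the constructor. The class and method names are only for illustration; Uri.TryCreate is a framework method that reports failure through its return value instead of throwing.

```csharp
using System;

public static class SafeUriParsing
{
    // Hypothetical helper: wraps the constructor so callers get null on failure.
    public static Uri ParseOrNull(string candidate)
    {
        Uri parsed = null;
        try
        {
            parsed = new Uri(candidate, UriKind.Absolute);
        }
        catch (UriFormatException)
        {
            // The string is not a well-formed absolute url.
        }
        catch (ArgumentNullException)
        {
            // The string itself was null.
        }
        return parsed;
    }

    // Same idea without exceptions.
    public static bool TryParse(string candidate, out Uri parsed)
    {
        return Uri.TryCreate(candidate, UriKind.Absolute, out parsed);
    }

    public static void Main()
    {
        Console.WriteLine(ParseOrNull("http://tech-review.org/blogs") != null); // True
        Console.WriteLine(ParseOrNull("not a url") == null);                    // True

        Uri result;
        Console.WriteLine(TryParse("not a url", out result));                   // False
    }
}
```

The try/catch form is still useful when you want to log why a url was rejected; TryCreate only tells you that it was.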
The Uri constructor is overloaded:
Besides Uri(string url), Uri has five additional overloads:
Uri(string url,  UriKind TypeOfUriKind)
UriKind denotes the type of URI and is implemented using UriKind.[Absolute, Relative, or RelativeOrAbsolute].
Example: Uri newUri = new Uri(url, UriKind.RelativeOrAbsolute) ;
Uri(string url, bool dontEscape)
dontEscape, if set to true, denotes that the url string has already been escaped, that is, any illegal characters present in the url have already been replaced with escape sequences.  Note that in .Net 2.0 the compiler flags the dontEscape overloads as obsolete.  For example:
If the url is "http://tech-review.org/blogs/My Blog Entry For Today.aspx", you will want to set the dontEscape boolean value to false:
Example: Uri newUri = new Uri(url, false);
If the url is "http://tech-review.org/blogs/My%20Blog%20Entry%20For%20Today.aspx", then:
Example: Uri newUri = new Uri(url, true);
Uri(Uri baseUri, string relativeUrl)
The major difference here is that instead of passing in a string for the base url, you pass a Uri.  The string that denotes the relative url is then combined with it.  In other words, think of the Uri argument as the base, with the relativeUrl string appended to it:
For example:
     Uri baseUri = new Uri("http://tech-review.org");
     Uri newUri = new Uri(baseUri, "/blogs/default.aspx");
Uri(Uri baseUri, Uri relativeUri)
This performs the same function as the overload that takes the relative url as a string, except the relative url is now a Uri.
Example (Simple):
Uri baseUri = new Uri("http://tech-review.org");
// A relative Uri has to be declared with UriKind.Relative; Uri(string) alone expects an absolute url.
Uri relativeUri = new Uri("/blogs/default.aspx", UriKind.Relative);
Uri newUri = new Uri(baseUri, relativeUri);
Example (Advanced):
Uri baseUri = new Uri("http://tech-review.org", UriKind.Absolute);
Uri relativeUri = new Uri("/blogs/default.aspx", UriKind.Relative);
Uri newUri = new Uri(baseUri, relativeUri);
 
Uri(Uri baseUri, string relativeUrl, bool dontEscape)
 
Controls whether illegal characters in the final URI (the combination of the base URI and the appended relative url string) are escaped.  As with the string overload, pass true only when the url has already been escaped.
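
One thing worth seeing in action with the base + relative overloads is that "appended" really means "resolved against the base", so a trailing slash on the base matters. A small sketch, with a made-up Combine helper:

```csharp
using System;

public static class UriCombineDemo
{
    // Hypothetical helper around the Uri(Uri, string) overload.
    public static string Combine(string baseUrl, string relativeUrl)
    {
        Uri baseUri = new Uri(baseUrl, UriKind.Absolute);
        return new Uri(baseUri, relativeUrl).AbsoluteUri;
    }

    public static void Main()
    {
        // Base ends with a slash: the relative url is placed inside /blogs/.
        Console.WriteLine(Combine("http://tech-review.org/blogs/", "default.aspx"));
        // http://tech-review.org/blogs/default.aspx

        // No trailing slash: "blogs" is treated as a file name and is replaced.
        Console.WriteLine(Combine("http://tech-review.org/blogs", "default.aspx"));
        // http://tech-review.org/default.aspx
    }
}
```

If the base comes from user input, it may be worth checking for the trailing slash before combining.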

 

 Now that we understand the overall semantics of the URI, what is the benefit of using a URI over just parsing or RegEx strings?

When you use the Uri class it is kind of like using Generics to enforce a strongly typed collection: you get code safety in knowing that the url syntax is actually valid according to the W3C and RFC standards currently in place.  When using RegEx and string parsing, you have to specifically code and dissect each part of the url on your own, which can lead to malformed urls or code bloat.  However, the Uri and related classes are not without issues, as we will discover throughout this series.

The Uri class can not verify that a domain extension is valid.  It merely takes whatever sits between the :// after the scheme and the first / that follows.

For example: if you set the url to "http://tech-review.org6w3c/default.aspx", the URI authority will be "tech-review.org6w3c", which obviously is not a valid domain.
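
A rough sketch of that limitation, plus one possible string-parsing workaround. The suffix list here is a tiny made-up sample purely for illustration, and the class and method names are hypothetical; a real check would need a maintained list of domain extensions.

```csharp
using System;

public static class HostSuffixCheck
{
    // Hypothetical allow-list for illustration only; a real one would be far longer.
    private static readonly string[] KnownSuffixes = { "com", "net", "org" };

    public static bool HasKnownSuffix(string url)
    {
        Uri uri = new Uri(url, UriKind.Absolute);
        string host = uri.Host; // the Uri class has already lower-cased this

        int lastDot = host.LastIndexOf('.');
        if (lastDot < 0)
        {
            return false; // no extension at all, e.g. an intranet server name
        }

        string suffix = host.Substring(lastDot + 1);
        return Array.IndexOf(KnownSuffixes, suffix) >= 0;
    }

    public static void Main()
    {
        // The Uri class accepts the bogus extension without complaint.
        Console.WriteLine(new Uri("http://tech-review.org6w3c/default.aspx").Authority);
        // tech-review.org6w3c

        Console.WriteLine(HasKnownSuffix("http://tech-review.org6w3c/default.aspx")); // False
        Console.WriteLine(HasKnownSuffix("http://tech-review.org/default.aspx"));     // True
    }
}
```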

Virtual directories: In IIS we sometimes have to set up virtual directories.  For example: http://mywebserver/tech-review.org/ will return an authority of "mywebserver", missing the obvious .org domain embedded in the path.

In that specific case, the option to exercise is RegEx and string parsing combined with the use of the URI components.
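
A minimal sketch of the component side of that combination, assuming the site name is always the first directory after the server name. FirstPathSegment is a made-up helper; Segments is the framework property that splits the path.

```csharp
using System;

public static class VirtualDirectoryDemo
{
    // Hypothetical helper: returns the first directory of the path, or null if there is none.
    public static string FirstPathSegment(string url)
    {
        Uri uri = new Uri(url, UriKind.Absolute);

        // Segments[0] is always "/", so the first real segment sits at index 1.
        if (uri.Segments.Length < 2)
        {
            return null;
        }
        return uri.Segments[1].TrimEnd('/');
    }

    public static void Main()
    {
        string url = "http://mywebserver/tech-review.org/";

        Console.WriteLine(new Uri(url).Authority); // mywebserver
        Console.WriteLine(FirstPathSegment(url));  // tech-review.org
    }
}
```

The extracted segment is still just a string, so this is where a RegEx or further string parsing would come in to decide whether it looks like a domain.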

This entry is getting rather long, so I will create a series of related blog entries on this subject.  In the next segment, I will cover the UriBuilder, UriFormatException, and an overview of the HttpWebRequest, HttpWebResponse and WebClient.

Posted: Monday, March 12, 2007 4:36 PM by Jody