
.net_2.0

My coding blog entries. Typically will either be more complex coding examples or overcoming product issues / troubleshooting resolutions.
Robots Gets First Specification Upgrade

If you haven't heard the word on the street already: on top of the problems that the latest technologies (like Ajax) create for a site owner who needs to be indexed by Google, Yahoo, MSN, etc., developers now get to answer the question "How do I create a sitemap when I have tens of thousands of dynamically loaded content pages?"

 

Back in the day, the Robots.txt file was the de facto standard for instructing a crawler on what it could and could not crawl. One of the issues with the CSK (Community Starter Kit) was that multiple communities could be hosted, but all of the content was centralized in a back-end database. Luckily, over the last few years bots have still sometimes checked for the Robots.txt file; however, if one could not be found, the bot assumed the entire site was fair game for indexing. That also made it incredibly difficult to determine which pages from the site would pop up as the top ranked. For instance, it was not uncommon to see a page title in the search engine's results with a description reading "Users such and such online, Login, Register, etc."
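
To make the Robots.txt behaviour described above concrete, here is a minimal sketch using Python's standard urllib.robotparser module. The bot name, the example.com URLs, and the rules themselves are made up for illustration; the point is only that a well-behaved crawler skips the disallowed paths, and treats everything as fair game when no rules exist.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical rules that keep the login and register pages out of the index.
RULES = """\
User-agent: *
Disallow: /login
Disallow: /register
"""

with_rules = build_parser(RULES)
no_rules = build_parser("")  # stands in for a site with no usable rules

for path in ("/login", "/articles/1"):
    url = "http://example.com" + path
    print(path,
          "with rules:", with_rules.can_fetch("ExampleBot", url),
          "without rules:", no_rules.can_fetch("ExampleBot", url))
```

Note that this only models a crawler that chooses to honor the file. Nothing here stops a bot that ignores it.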

 

Personally, I worked around those issues by writing a complex bot-check module that would automatically disable or enable links for particular crawlers based on ratings, known bad-bot blacklists, etc. However, it appears the new standard is going to be the PUSH-PULL crawler method. Instead of leaving a search engine crawler to discover your site on its own, you create an XML file that lists all the links, plus metadata such as last-modified date, change frequency, and priority, and send it to the receiving service of the crawler in question. Supposedly Community Server from Telligent will support this feature in its upcoming 3.0 version; however, I really do not see any product on the market capable of actually handling this dynamically and RELIABLY.

 

1. The files would have to be stored on disk. Since bots do not actually authenticate, that means either a custom IHttpHandler will have to be used, or the directory location will have to be set to public. To me this is just as dangerous as the Robots.txt specification was. That, or you'll use SQL Server 2005 to stream query results as an XML document.

 

2. A limit of 10 MB per XML sitemap. XML may be lightweight as a way of providing structure for Web Services, but there are disk costs to consider. Last time I checked, you can append to an XML file, but you cannot actually change data within it without re-creating the file. Where is the technology advantage here? Again, even if you do not have a dynamic site, I'd really hate having to be the one to create these by hand (which is why I steer clear of the TreeView controls etc., as custom binding is just easier). The spec does, however, indicate support for gzip compression of the sitemap.

 

3. Developers will now actually have to write sitemaps targeted at particular crawlers to take advantage of optional parameters, even though Google, MSN, and Yahoo have all agreed to adhere to the base specification and to parse and ignore optional parameters they do not support. Or so they say...
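
For what it's worth, generating the file from dynamic content does not have to be a by-hand job. Below is a minimal sketch in Python, standard library only, that builds a sitemap in memory, checks it against the 10 MB limit from point 2, and gzips it. The page list is a made-up stand-in for whatever query your content database would run, and the function names are my own; only the namespace and the urlset/url/loc/lastmod element names come from the sitemaps.org protocol.

```python
import gzip
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_BYTES = 10 * 1024 * 1024  # the 10 MB uncompressed limit discussed above


def build_sitemap(pages):
    """Build sitemap XML from (url, lastmod) pairs; lastmod may be None."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc  # '&' etc. are escaped on output
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    xml_bytes = ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
    if len(xml_bytes) > MAX_BYTES:
        raise ValueError("sitemap exceeds the size limit; split it into several files")
    return xml_bytes


def compress_sitemap(xml_bytes):
    """Gzip the sitemap, which the spec allows."""
    return gzip.compress(xml_bytes)


# Hypothetical rows, standing in for a query against the content database.
pages = [
    ("http://example.com/article?id=1&page=2", "2006-11-19"),
    ("http://example.com/about", None),
]
xml_bytes = build_sitemap(pages)
print(xml_bytes.decode("utf-8"))
print(len(xml_bytes), "bytes raw,", len(compress_sitemap(xml_bytes)), "bytes gzipped")
```

This regenerates the whole document on every run, which is exactly the re-creation cost complained about in point 2. It sidesteps the editing problem rather than solving it.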

 

In some respects this is a good change, I think. Perhaps. I gave up on Robots.txt files ages ago and have always seemed to place well in the search engines. However, I do wonder what will happen if I do not conform to the sitemap specification, simply do not provide sitemaps, and instead keep coding around what I don't want displayed. Will my sites still get searched?

 

I suppose this is an improvement (hey, at least it is not a text document anymore, and SQL Server 2005 can actually stream query results as an XML document). However, the one improvement I wish they would actually implement is:

 

Security. It simply is not good enough to identify a bot because it states "Hey, I am a crawler". Do you believe my UA of "GoogleBot\1+"? There should be some kind of authentication to confirm that the crawler really is a crawler. With copyright protection, privacy, and the like being all the rage, how do I know my content is not being duplicated on another site, and that the site is not using my content to boost its rankings because it ran one of the spam content-theft bots? Some kind of authentication would be desirable.

 




One Example:   All web sites have a "Bot Registration Form" with standardized form fields. The bot provides a registration name, a password, an associated account ID (which they create for your domain), and a bot URL that links to the company's site for additional verification (much like what you do for, say, a Google Analytics account), and submits the form. The site operator can then approve or deny the account, and the bot would have to authenticate before crawling pages...
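
A rough sketch of how the site's side of that registration idea could look, in Python. None of this is an existing standard: the BotRegistry class, its field names, and the pending/approved/denied states are all assumptions made up to illustrate the flow of register, operator review, then authenticate.

```python
import hashlib
import hmac
import os
from dataclasses import dataclass


def _hash_password(password, salt):
    """Derive a salted hash so the plain password is never stored."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


@dataclass
class BotAccount:
    name: str
    account_id: str
    verify_url: str  # link back to the bot operator's own site
    salt: bytes
    password_hash: bytes
    status: str = "pending"  # becomes "approved" or "denied" after review


class BotRegistry:
    def __init__(self):
        self._accounts = {}

    def register(self, name, password, account_id, verify_url):
        """What the hypothetical registration form would submit."""
        salt = os.urandom(16)
        self._accounts[account_id] = BotAccount(
            name, account_id, verify_url, salt, _hash_password(password, salt))

    def review(self, account_id, approve):
        """The site operator approves or denies the account."""
        self._accounts[account_id].status = "approved" if approve else "denied"

    def authenticate(self, account_id, password):
        """Only approved accounts with the right password may crawl."""
        account = self._accounts.get(account_id)
        if account is None or account.status != "approved":
            return False
        candidate = _hash_password(password, account.salt)
        return hmac.compare_digest(candidate, account.password_hash)


registry = BotRegistry()
registry.register("ExampleBot", "s3cret", "acct-001", "http://bot.example.com/verify")
print(registry.authenticate("acct-001", "s3cret"))  # still pending review
registry.review("acct-001", approve=True)
print(registry.authenticate("acct-001", "s3cret"))  # approved account
```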

 

Second Example: Site operators would create a tracking account, the search engine company would issue a unique ID, and the bot / crawler would have to append it as a query string parameter on each request.
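
Here is a minimal sketch of what checking such an ID might look like on the site's side, again in Python. The parameter name crawler_id, the issued value, and the ISSUED_IDS table are all hypothetical; no search engine offers this today.

```python
import hmac
from urllib.parse import parse_qs, urlsplit

# Hypothetical IDs handed out by each search engine for this site.
ISSUED_IDS = {"examplesearch": "3f9a1c7e"}


def crawler_for_request(url, param="crawler_id"):
    """Return the crawler name whose issued ID appears in the query string, or None."""
    values = parse_qs(urlsplit(url).query).get(param, [])
    if len(values) != 1:
        return None  # parameter missing or ambiguous
    supplied = values[0].encode("utf-8")
    for crawler, issued in ISSUED_IDS.items():
        # Constant-time comparison, so response timing leaks less about the ID.
        if hmac.compare_digest(supplied, issued.encode("utf-8")):
            return crawler
    return None


print(crawler_for_request("http://example.com/page?crawler_id=3f9a1c7e"))
print(crawler_for_request("http://example.com/page?crawler_id=guess"))
print(crawler_for_request("http://example.com/page"))
```

One caveat with this design: an ID carried in the query string ends up in server logs and referrer headers, so on its own it is closer to a shared secret than to real authentication.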

 

Personally, this has just been a pet peeve, considering my sites are crawled by over 5000 different identifiable bot engines, of which 95% are content-stealing bots that spoof the UA of real browsers or real crawlers, while only 5% are actually respected industry search engine crawlers...

 

But as with the new sitemap specification, the whole goal is to push it onto the developer to handle all those issues. And, as someone posted in the ASP.NET forums, you kind of have a gun to your head to adopt it, or else it's a Seinfeld Soup Nazi moment:

 

"NO SEARCH RANKING FOR YOU!"...

 

You can find out more about this new specification at SiteMaps.Org 

 

 

Posted: Sunday, November 19, 2006 1:29 AM by Jody

Comments

MhLove said:

I need to know where I can get help converting this to VB. I am so new to this and I am not having much luck doing it myself.  Is there a version in VB?

# November 20, 2006 6:18 PM

Jody said:

Not exactly sure what you mean by converting this to VB, as this isn't exactly a code article.  Was this comment perhaps intended for another post?  I do not code in VB, so anything I post will be written in C#. If you search Google for "C# to VB", there are some code converters out there that do a fair job of converting from one language to the other...

# November 21, 2006 1:33 PM