Journal/20 Floréal CCXIV from Evan Prodromou

After a couple of days working on this site, I'm getting less and less happy about the hacks I'm doing to WiLiKi. I'm too used to being non-invasive on a piece of software. From my experience with MediaWiki, I know it's important to keep customizations separate from your upstream code.

Anyways, I managed to add an RSS feed just for the Journal part of the site, and this morning I made a sitemap in Google Sitemap format. All for the purpose of making the site more WebSoftwareFriendly.

Lost software

So, I've had various bits of software hosted on this domain going back to before 2000. Some of it has been lost, which I kind of regret. If anyone knows where to find these packages, please let me know!

One more!

I also added a Yahoo! urllist.txt -- just a plaintext list of files on a site. The macro is at http://evan.prodromou.name/software/wiliki-urllist.scm . The sitemap is at http://evan.prodromou.name/software/wiliki-sitemap.scm . I think both need some customization for other URI formats, but they're a good start.
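
For anyone not running WiLiKi, here's a rough sketch in Python of what those two outputs look like. It is not the Scheme macros linked above, just an illustration of the formats. The PAGES list, its dates, and both function names are made up for the example, and the namespace is the sitemaps.org 0.9 one.

```python
from xml.sax.saxutils import escape

# Hypothetical page list: (URL, last-modified date) pairs, invented for illustration.
PAGES = [
    ("http://example.com/", "2006-05-09"),
    ("http://example.com/cities/dallas", "2006-05-08"),
]


def make_urllist(pages):
    """Return urllist.txt content: one URL per line, nothing else."""
    return "".join(url + "\n" for url, _ in pages)


def make_sitemap(pages):
    """Return a minimal sitemap document in the sitemaps.org 0.9 format."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for url, lastmod in pages:
        lines.append("  <url>")
        # escape() protects &, < and > so the output stays well-formed XML.
        lines.append("    <loc>%s</loc>" % escape(url))
        lines.append("    <lastmod>%s</lastmod>" % escape(lastmod))
        lines.append("  </url>")
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_urllist(PAGES))
    print(make_sitemap(PAGES))
```

A real site would build the page list from its own page store instead of a hard-coded constant.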

WebSoftwareFriendly

I'm including this essay in this day's entry, since it's appropriate to what I've been thinking about today. I hope it's helpful.

I don't think there are a lot of people who've been using the Web for very long who appreciate the initials SEO very much. The name evokes dirty tricks, unpleasant or underhanded user experience, and unneighbourly abuses of other Web sites. I also don't think that sneaking around with search engine software is a really good long-term approach; I think the best way to improve search engine ranking is through UserExperienceOptimization.

However, even though I don't think you can artificially boost your search engine traffic over the long term, I do think you can artificially reduce your search engine traffic by making your site run in ways that aren't friendly to Web software.

Billions of people each day have their Web experience mediated by some kind of Web software. Most of this software is very dumb, but it's still absolutely necessary to distill and organize the massive amount of data on the Web. If your site is unfriendly to that software, then it will not be your friend, and it will not carry your words to the multitude.

These are my suggestions for making any Web site more Web agent friendly.

  1. Think of the long tail. If you think to yourself, "People can read my site in IE, and Google seems to be spidering it OK," think again. There are thousands of pieces of software that will be reading, analyzing, and distilling your site for human use -- alternative browsers, browser extensions and plugins, mobile device Web readers, feed aggregators, desktop filters, archivers, and of course various search engines and meta-search-engines. If your site is readable to these bits of software, then users of that software will favour your site. Each tiny segment will add up into a lot of users -- users who will embrace you as "one of them".
  2. Validate. A lot of people think that as long as their Web pages show up correctly in the browser (where "the browser" == "Internet Explorer"), then they're fine. This is just plain wrong. If your documents cannot be parsed by Web agents, they will not be read, their information will not be extracted, and you will have lost a channel to readers.
    Browsers are extremely tolerant of errors in Web resources, since the software has a huge incentive to make whatever effort possible to guess how to show a resource correctly (or even mostly correctly). The user has asked for this object directly; they have dedicated their personal computer's resources to the task; they are sitting impatiently watching the browser, waiting for it to do something with the incoming data.
    Offline Web agents do not have any of these incentives. They go through billions or trillions of Web documents and resources per day; their programmers have an incentive to make the software go through as many documents as possible as quickly as they can. Any extra time spent backtracking through your document and trying to make some sense of it is time better spent downloading and reading 2000 other documents. To reiterate: browsers will take the time to work around your errors; offline Web readers will not. Validate your HTML, CSS, images, RDF and XML to the best of your ability. As Jon Postel said, "Be liberal in what you accept, and conservative in what you send."
  3. Use Web standards. This is kind of a corollary to the above two items. Writers of Web software with limited resources will program the software to read at most a handful of data formats. The more access they have to the specs for those formats, the more likely they'll be to program their software to read them. If your data is in some kind of proprietary format (like Flash or PDF or whatever), the Web agents will skip your data and move on to the next site.
    In addition, your ability to validate a document (outside of a fault-tolerant user-oriented piece of software like a browser) will depend on you having validation tools. Validation tool writers have the same motivations as Web agent programmers -- they'll write validators for standard formats.
    1. Use UTF-8. If your pages are in some weird or unrecognizable character set, your site will be skipped. Make sure you export pages to UTF-8, and that they look OK in that charset.
  4. Be semantic. The Semantic Web is a collection of technologies that help pieces of software extract meaning from Web resources. It is typically dismissed as academic bullshit by anyone working on the Web for money. Those people are wrong; Semantic Web technologies are extremely practical. They meet software halfway -- conveying meaningful knowledge to software in easy-to-digest form. This way, software doesn't have to go through the imperfect, haphazard, time- and resource-intensive process of trying to understand ambiguous, confusing, connotative knowledge. We have our data in forms that humans understand: free text, images, video, spoken words or music. If you can translate the meaning of text and images in a way that software agents can comprehend, then they will scurry off and tell their human master, who will come read the human-readable values.
    1. Use microformats.
    2. Use RDF.
  5. Make friends with HTTP. HTTP is the underlying protocol that all software will use to contact your site. This means that the better your site supports HTTP in all its vagaries, the more likely that software will be to read, analyze, and distill your data.
    Many Web developers who make dynamic Web sites do a first pass at their site's HTTP support from the browser point of view. Can people reach the site with a browser? Great. Later, they have to do an additional pass at their HTTP support, when their performance needs tuning and they need to interact better with browser caches and other caching software (an interesting form of Web agent, by the way). They learn to handle HEAD requests, support If-Modified-Since, return Last-Modified, Expires, and Vary headers, etc.
    Sites that carefully handle the spectrum of HTTP requests properly will be most likely to benefit from special-case code in HTTP-based Web agent software. Stuff that works well for Web caches will usually help a lot of Web agents, but look carefully at other opportunities to boost your HTTP IQ.
  6. Avoid query strings. Query strings with lots of parameters at the end of a URL are unfriendly to Web agents. Agents have no idea which parts of the query string represent the object they're retrieving, and which parts are fluff and cruft to represent the state of the user or the state of the session. Many Web agents will ignore or truncate query strings entirely.
    Learn mod_rewrite, or whatever equivalent there is in your Web server software to translate path-oriented URLs (http://example.com/cities/dallas ) into dynamic software queries (http://example.com/action.jsp?action=city&city=dallas ).
  7. URL is identity. Web software typically identifies an object by its URL -- say, as an identifier in a database. If there are two URLs for the same object, typical Web software will be unaware of that fact. It will add up data, or summarize counts, or note relationships with other objects based on the misconception that the two URLs represent different things.
    As much as possible, use a single URL to identify a single object or service wherever you refer to that service. If you are tolerant of user error or uppercase-lowercase issues, use 301 Moved Permanently redirect responses to indicate the correct, canonical URL. mod_rewrite can be a real help here.
  8. Share your metadata. Wherever possible, share metadata about your data objects in whatever way you can. Look for ways to indicate modification and creation dates, change frequency, sizes, formats, related objects or people, language, purpose.
    Web agents will use this information in two ways. First, they'll often be developed just to find this metadata. Second, the metadata can guide the Web agents and make their work more efficient -- making your site a preferred site to visit.
  9. Make feeds. Make lists of resources that you have, and expose them in as many formats as possible.
    Web agents typically enter your site from the home page (or, worse, some random page) and start reading HTML pages. They read all the <a> tags, put the href values in their queue, and then read some more pages. Understandably, this process is extremely error-prone, and agents usually also cap the depth of links they'll follow, or the total number of links they'll read for a site. The chance that a Web agent will find all your pages, given any random page, is practically nil. A good sitemap page can help, but it's no guarantee.
    By providing a feed, you can give Web agents a more direct route to all of your pages. They'll typically do linking analysis between the pages anyways, but at least the feed gives them a one-step entry to every part of your site.
    1. Have an RSS feed. This is by far the most well-supported feed format on the Web right now; there are hundreds of Web sites and thousands of desktop applications that read and use RSS data. RSS makes sense for practically any Web site style -- it's not just for blogs, wikis, or Bittorrent feeds.
    2. Create a Google Sitemap. Google has an extremely simple sitemap format that it uses to determine how to walk your site. Having a Google sitemap can be the single biggest way to boost your site's visibility. Yahoo and Microsoft now support the sitemap format, too. See http://sitemaps.org/ .
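
To make item 5 (make friends with HTTP) a bit more concrete, here's a minimal sketch in Python of answering HEAD requests and If-Modified-Since. The respond function and its arguments are hypothetical, invented for this example; a real server or framework will have its own request and response objects. It assumes the page's modification time is a timezone-aware UTC datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def respond(method, request_headers, body, last_modified):
    """Return (status, headers, body) for a GET or HEAD request.

    last_modified must be a timezone-aware UTC datetime.
    """
    # HTTP dates have one-second resolution, so drop microseconds before comparing.
    last_modified = last_modified.replace(microsecond=0)
    stamp = format_datetime(last_modified, usegmt=True)

    since = request_headers.get("If-Modified-Since")
    since_dt = None
    if since:
        try:
            since_dt = parsedate_to_datetime(since)
        except (TypeError, ValueError):
            since_dt = None  # unparseable date: behave as if the header were absent
    # Only compare when the client's date carried a timezone (HTTP dates use GMT).
    if since_dt is not None and since_dt.tzinfo is not None:
        if last_modified <= since_dt:
            # The client's copy is still fresh: send no body at all.
            return 304, {"Last-Modified": stamp}, b""

    headers = {"Last-Modified": stamp, "Content-Length": str(len(body))}
    if method == "HEAD":
        # Same headers as GET, but the body is omitted.
        return 200, headers, b""
    return 200, headers, body


if __name__ == "__main__":
    modified = datetime(2006, 5, 9, 12, 0, tzinfo=timezone.utc)
    status, headers, _ = respond("GET", {}, b"<html>...</html>", modified)
    print(status, headers["Last-Modified"])
```

An agent that saved the Last-Modified value from its first visit can send it back as If-Modified-Since and get a cheap 304 instead of the whole page.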
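
Items 6 and 7 (avoid query strings, URL is identity) can be sketched the same way. mod_rewrite would normally do this job; the Python below only illustrates the logic. All three function names are hypothetical. Lowercasing paths and dropping query strings are assumptions about one particular site's policy, not general rules.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def canonical_url(url):
    """Return the one canonical form of a URL for this (hypothetical) site.

    Assumed policy: scheme, host and path are lowercased, and any query
    string or fragment is dropped.
    """
    parts = urlsplit(url)
    path = parts.path.lower() or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def redirect_if_needed(url):
    """Return (301, canonical URL) when url is not canonical, else None."""
    target = canonical_url(url)
    return (301, target) if target != url else None


def internal_query(path):
    """Map a path-oriented URL like /cities/dallas to the dynamic query
    behind it. Returns None for paths this sketch does not know about."""
    segments = [s for s in path.split("/") if s]
    if len(segments) == 2 and segments[0] == "cities":
        return "/action.jsp?" + urlencode({"action": "city", "city": segments[1]})
    return None


if __name__ == "__main__":
    print(redirect_if_needed("http://example.com/Cities/Dallas"))
    print(internal_query("/cities/dallas"))
```

The point is that the outside world only ever sees one URL per object, while the query string stays an internal detail.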