Crawling Commented Styling with Heritrix

This post isn’t something I can take credit for. The purpose is to make two potential solutions discoverable for someone like me, looking for an answer. Credit will be given where it is absolutely due.

As I’ve written about before, I inherited a Wayback/Heritrix server in my role and have had the pleasure of hacking my way through it on occasion. A recent challenge arose when I needed Heritrix to crawl an old Plone site. The Plone template placed html comments around all the styling tags to hide it from older browsers, which couldn’t understand the CSS. The result looks something like this:

<style type="text/css"><!--
/* - base.css - */
@media screen {
/* http://servername.domain.net/portal_css/base.css?original=1 */
/* */
/* */
.documentContent ul {
list-style-image: url(http://servername.domain.net/bullet.gif);
list-style-type: square;
margin: 0.5em 0 0 1.5em;
}
--></style>

Unfortunately, it seems Heritrix happily skips past any URLs within comments by default and does not follow them, regardless of your seeds and other configurations. Because, hey, they’re only comments, right? The end result is that it looks like the site was crawled successfully, but some resources were actually missed. In the above example, the Wayback version of the site was still pointing to http://servername.domain.net/bullet.gif for thelist-style-image, rather than http://wayback.domain.net:3366/wayback/20151002173414im_/http://servername.domain.net/bullet.gif. Therefore, it was not a complete archive of the site and its contents.

In my case, this was an internal site that I had total control over. However, try as I might, I could not figure out how to remove the comments from the old Plone template. Grepping for ‘<style type=”text/css”><!–‘ turned up ‘_SkeletonPage.py’. I tried modifying it and then running buildout to no avail. I am sure people more experienced with Plone could tell you where to change this in a heartbeat, but it’s beyond my knowledge with the application at this point. After coming up short on searches for solutions with Heritrix (thus, this post), I started looking for ways to remove the comment tags with something like Apache’s mod_substitute, since Plone was being reverse-proxied through Apache anyway.

Solution 1: Mod_Substitute/Mod_Filter

Eventually, I stumbled upon this configuration from Chris, regarding mod_substitute and mod_filter. Mod_filter needed to be used for mod_substitute to work properly because of the content being reverse-proxied. A simple modification of Chris’s configuration worked to remove the comment tags beautifully (using CentOS/Httpd for reference):

LoadModule substitute_module modules/mod_substitute.so
LoadModule filter_module modules/mod_filter.so
FilterDeclare replace
FilterProvider replace SUBSTITUTE Content-Type $text/html
FilterChain +replace
FilterTrace replace 1
Substitute "s/css\"><!--/css\">/n"
Substitute "s|--></style>|</style>|n"

(Note: I probably could have made this a little more efficient by using a single regex instead of two separate substitutes. But, meh. This was good enough.)

Chris recommended loading this into a new file: /etc/httpd/conf.d/replace.conf.

Solution 2: Hack Heritrix

While exploring my options with Apache, I also decided to reach out to the archive-crawler community on Yahoo! for help. A user identified as “eleklr” shared a patch that he used often for this kind of scenario. I think this is the better route to go, though I have not had an opportunity to try it out yet. Its biggest strength is that it doesn’t require you to have complete control over the site you are crawling, as is necessary for solution 1.

If you’ve found yourself in my position, rejoice in the fact that it’s not just you and there are solutions out there. Hopefully, one of the two listed above will help you on your way. Please share if you’ve discovered other solutions to this or similar problems.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s