Starting and stopping Heritrix and Wayback

(note: this post references CentOS specifically)

Hopefully, you’re familiar with the Internet Archive’s Wayback Machine, a service that lets you see historical snapshots of web sites. But you may not realize that you can set up your very own Wayback service on a server. Or maybe, like me, you’ve just had this service dropped in your lap with very little experience. If that’s you, this is a quick post for you.

TL;DR


/etc/init.d/heritrix stop/start

/etc/init.d/tomcat stop/start #Assuming your Wayback service uses the default tomcat/5 name

As I mentioned, managing a Wayback server came rather suddenly to me and I found documentation either lacking or over my head. I started encountering a problem where every site would show up as “Not in archive” even when I knew the site was in the index. Since I couldn’t find quick answers and generally didn’t have a lot of time to dedicate, the quick fix was a reboot.

This morning, I finally stuck it out long enough to identify which service name Wayback was running under and which of the two underlying services (Heritrix or Wayback) was the culprit. The short story is that this error is caused by a problem with the Heritrix service. Restarting it (after killing the orphaned process) cleared things up. What follows is a quick summary of the steps I took to identify each service; they are standard steps for any service.

Heritrix

Maybe you’ve seen this word pop up in access logs. It is the web crawler Wayback depends on to archive web sites. Heritrix creates an index, or archive, of crawled sites. Wayback pulls up these archives (in WARC format) when you want to view one. So, when a site that should be in the archive isn’t, Heritrix is the likely culprit.

This was the easy one, but I didn’t jump to it first because Wayback was the one presenting the error. Heritrix generally shows up under the service name “heritrix,” which means you just have to type something like “/etc/init.d/heritrix stop” and “/etc/init.d/heritrix start” to stop or start the service. The problem this morning was that the process was orphaned, so I couldn’t simply stop it. The following command verified that the process was orphaned:


ls /var/run/heritrix.pid

The file didn’t exist, which explained why the service couldn’t be stopped: there was no PID associated with it. Examining /etc/init.d/heritrix showed me that the configuration, which I assumed would contain the port for Heritrix, was located in /etc/sysconfig/heritrix. Opening this file up revealed the port Heritrix was running on, via the “Dcom.sun.management.jmxremote.port” setting. This allowed me to identify the orphaned process using the following command (where “1234” represents the port Heritrix is running on):

sudo netstat -tulpn | grep 1234
tcp        0      0 :::1234        :::*        LISTEN     2804/java

So, a quick “kill 2804” and a “/etc/init.d/heritrix start” got Heritrix up and running again. Within a couple of minutes, I could access the site in the archive without an error.
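The lookup steps above can be sketched as two small shell functions. This is only a sketch: the function names are made up for illustration, and it assumes the sysconfig file contains the port as “jmxremote.port=NNNN” and that netstat prints the same column layout as the output shown above.

```shell
#!/bin/sh
# Sketch only. Function names are hypothetical, and the config and netstat
# formats are assumed to match what is described in this post.

# Print the JMX port found in a sysconfig-style file, i.e. the number that
# follows "com.sun.management.jmxremote.port=" (first match only).
jmx_port_from_config() {
    sed -n 's/.*com\.sun\.management\.jmxremote\.port=\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Read "netstat -tulpn" output on stdin and print the PID of the process
# listening on the given TCP port.
pid_for_port() {
    awk -v port="$1" '
        $1 ~ /^tcp/ && $4 ~ (":" port "$") && $6 == "LISTEN" {
            split($7, owner, "/")   # last column looks like 2804/java
            print owner[1]
            exit
        }'
}

# Possible usage (not run here):
#   port=$(jmx_port_from_config /etc/sysconfig/heritrix)
#   sudo netstat -tulpn | pid_for_port "$port"
```

If /var/run/heritrix.pid is missing and the second command still prints a PID, that PID is the orphaned process to kill before starting the service again.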

Wayback

The Wayback service was a little trickier to identify because, at least in my case, it used the generic “tomcat” service name. I knew the port that the Wayback service was running on because I used the interface often. I was able to grep around long enough to dig up /usr/local/apache-tomcat/webapps/wayback/WEB-INF/wayback. Viewing /etc/init.d/tomcat confirmed that this was inside the $CATALINA_HOME directory. So, that was the connection: the Wayback files lived in the $CATALINA_HOME directory, and the tomcat service pointed to $CATALINA_HOME. “/etc/init.d/tomcat stop/start/restart” was all I needed. But don’t assume your Wayback service is set up the same way, since multiple instances of Tomcat can run on one server.
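If you need to make the same connection on your own server, a rough sketch like the following may help. The function names are hypothetical, and it assumes the init script sets CATALINA_HOME with a plain “CATALINA_HOME=...” or “export CATALINA_HOME=...” line and that Wayback is deployed as a webapp named “wayback”, as it was in my case.

```shell
#!/bin/sh
# Sketch only. Function names are hypothetical; the init script format and
# the "wayback" webapp name are assumptions based on the setup in this post.

# Print the CATALINA_HOME value assigned in an init script (first match),
# with any surrounding quotes removed.
catalina_home_from_init() {
    grep -E '^[[:space:]]*(export[[:space:]]+)?CATALINA_HOME=' "$1" \
        | head -n 1 \
        | cut -d= -f2- \
        | tr -d "\"'"
}

# Succeed if the given Tomcat home contains a deployed "wayback" webapp.
hosts_wayback() {
    [ -d "$1/webapps/wayback/WEB-INF" ]
}

# Possible usage (not run here):
#   home=$(catalina_home_from_init /etc/init.d/tomcat)
#   hosts_wayback "$home" && echo "Wayback runs under the tomcat service"
```

On a server with several Tomcat instances, running the same check against each init script should point to the one that actually hosts Wayback.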

Nothing groundbreaking here, but it would have been much faster if I had found references to service names and dependencies. If you are in the same position as me, I hope this clears things up a bit. I’m not a Wayback/Heritrix expert, but I certainly know a lot more today than I did when I first started out.
