Redirect HTTPS Tableau Traffic to a Valid URL

You’ve heard it before: it’s past time to encrypt ALL the things! Even internal traffic should be encrypted, since you never know what rogue devices or people may be listening on an ethernet port or an unsecured hotspot. That’s why, when I inherited a Tableau server, I decided that encryption should be a priority, especially when you consider the kind of data that can flow in and out of Tableau. And, Tableau makes it surprisingly easy to turn on TLS (or SSL, as they and so many others like me still call it). What they don’t make so easy is redirecting users over to an address that matches your cert. That’s no big deal if your users have always accessed Tableau Server with the right alias, but in our case, users had only ever used an internal address that doesn’t match the cert we applied. The good news is Tableau Server uses Apache for its web server. With a pretty small tweak, you can redirect your users in no time.

Please note: this is not documented or supported by Tableau as far as I can tell. Be sure to test thoroughly before applying to your production environment. I also assume these settings will be overwritten by an update/upgrade, thus needing to be reapplied afterward (update: Having since gone through a few upgrades, I can confirm that these settings need to be reapplied afterward).

As I mentioned, Tableau Server uses Apache for its web server. An interesting choice, since Tableau Server is only supported on Windows. This means a couple of rewrite conditions/rules in httpd.conf will have you off and running. The first thing you need to know is where this file lives. It will be under Tableau’s data folder, whose location depends on which drive Tableau was installed on. Tableau was installed on C: for us, which puts the httpd.conf file in C:\ProgramData\Tableau\Tableau Server\data\tabsvc\config (we will talk about moving your data folder to another drive in a later post). I am not entirely certain what the structure looks like if you installed on a separate drive, so you may need to do some digging.

Once you have located the httpd.conf file, the second thing you need to know is that this file uses *nix-style line endings (line feeds without carriage returns), i.e., if you open it in Notepad, it will all be jumbled together. If you already have a tool like Notepad++ installed on the server, it should do nicely. In my case, I chose to copy the file to my local machine, edit it with Atom, and then push it back to the server. Just be sure to make a backup of the file first.

Ok, so you’ve found httpd.conf, you’ve made a backup, and opened it up in your favorite *nix-friendly text editor. If you scroll down to around line 581, you will start to see several RewriteCond and RewriteRule lines. Our rules don’t have to go here, but it seemed logical since there are already related rules in the vicinity. If you aren’t familiar with mod_rewrite rules, they basically look for certain conditions in an Apache request and rewrite or redirect the URL sent to the server (the R flag issues an external redirect, a 302 by default unless you specify another code, like the 301 used below). Here is what I added after Tableau’s built-in list of rewrite rules:


RewriteCond %{HTTP_HOST} !^tableau\.mycompany\.net [NC]
RewriteCond %{HTTP_HOST} !^localhost [NC]
RewriteRule (.*) https://tableau.mycompany.net$1 [R=301,L]

What does each line mean? The first line looks for requests whose host header doesn’t match the address we want people to use. Replace “tableau.mycompany.net” with your company’s preferred address for Tableau. Of course, make sure the record actually exists in DNS and points to your Tableau server.

The second line is an AND condition (by virtue of the previous line not ending in “[OR]”) and filters out requests using the “localhost” URL. The reason for this is that Tableau Web Data Connectors (WDC) published on the server will always be refreshed using http://localhost/webdataconnectors/yourWDCname.htm. And, as I found out, Tableau won’t follow the redirect when it tries to extract, but it will seemingly ignore the certificate/server name mismatch. Adding this line makes sure we don’t break any scheduled extracts using a WDC. Side note: it seems that in Tableau 10, you can maintain a list of approved WDCs external to Tableau (aw yeah!), which I find preferable and would make this line unnecessary.

Now, the third line. This line takes the requests that haven’t been filtered out by the two previous conditions and rewrites them to use our preferred address. Notice that here I have added the protocol (https://), whereas it is not needed for the conditions since we want to catch both HTTP and HTTPS requests. The $1 back-reference at the end keeps the rest of the URL intact, so that something like http://nyctabprd01.internaldomain.net/#/views/some/content becomes https://tableau.mycompany.net/#/views/some/content, rather than redirecting to Tableau’s landing page.

Once you have updated httpd.conf with the lines above, restart Tableau Server (tabadmin restart). Now, whenever someone tries the old address, they should be redirected to the new one. This all depends on the visitor or other clients following a 301 redirect, which is pretty standard. Still, be thorough in your testing to account for all conditions.
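If you want a quick smoke test from any machine with curl, something like this should do (the hostnames are the same placeholder addresses used above):

#Request the old internal address and inspect the response headers
curl -sSI http://nyctabprd01.internaldomain.net/
#Expect "HTTP/1.1 301 Moved Permanently" and a Location header pointing at
#https://tableau.mycompany.net/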

That’s it! A lot of talking for 3 lines of text.

Compare-Object: No more weird Foreach… -Contains code

Yesterday, I was again faced with the task of using PowerShell to determine whether one array contained any of the values in another array. Specifically, I had an array of AD group Distinguished Names (DN) and needed to determine if users were members of any of these groups (an LDAP filter would probably be easier, but I was already invested in solving this). Typically, I would handle this with some kind of foreach loop: for each user, loop through each of their group memberships and see if the group array contains their group string. This always feels terribly inefficient, so I wanted to find a cleaner way of handling these types of comparisons.
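For context, here is a rough sketch of the kind of loop I wanted to retire ($user and $includeGroups are hypothetical placeholders):

#The old way: walk every group the user belongs to and test membership with -contains
$isMember = $false
foreach ($group in $user.MemberOf) {
  if ($includeGroups -contains $group) {
    $isMember = $true
    break
  }
}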

Looking around online, I realized PowerShell has a Compare-Object cmdlet, which sounded promising. It works by accepting a -ReferenceObject and -DifferenceObject, and comparing which values are the same or different between the two. Now, this cmdlet is almost helpful, but really works better for someone interacting with the shell, rather than a script. The output looks something like this:

[Screenshot: Compare-Object output, with a SideIndicator column showing “orange” as => and “apple” as <=]
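If you want to reproduce that output yourself, here is a minimal reconstruction (the fruit lists are my best guess at what the screenshot showed):

$fruit1 = "apple", "pear", "banana"
$fruit2 = "pear", "banana", "orange"
Compare-Object -ReferenceObject $fruit1 -DifferenceObject $fruit2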

The “SideIndicator” tells us which object/array (the reference, or the difference object) has a different value. In this example, the second array contains “orange,” but the first array does not. Conversely, the first array contains “apple,” but the second does not. Again, handy if you are in the shell, but how do you use this in a script? Well, here is the short version of what I came up with:

compare-object $_.MemberOf $includeGroups -includeequal -excludedifferent

You might first notice that there are no “-ReferenceObject” or “-DifferenceObject” parameter names spelled out above. That is because, as with most PowerShell cmdlets, if you supply values in the positional order the cmdlet defines, you can skip the parameter names. So, in this case, $_.MemberOf is the reference object and $includeGroups is the difference object. The next two switches are very important for this to work. “-includeequal” tells the cmdlet to return the items that match between the two objects and “-excludedifferent” prevents it from returning the objects that are different. This is because, for this comparison, we really only care about the items that match across arrays.

Continuing the fruit example above, here is what we see:

[Screenshot: Compare-Object output showing “pear” and “banana” with the == SideIndicator]

This “==” tells us that “pear” and “banana” exist in both arrays. Since we exclude differences, if there are no matches this cmdlet will return $null. That means we can do something like this:


if ( compare-object $MemberOf $includeGroups -includeequal -excludedifferent ) {

  #Do something

}

Or…

... | Where { compare-object $_.MemberOf $includeGroups -includeequal -excludedifferent }

Of course, format it however you would like and surround with parentheses when using multiple conditions. I feel a little silly that this cmdlet has been around since at least PowerShell version 3, but I am at least satisfied that I no longer need to employ cumbersome foreach loops in these situations.

 

SSH Key Auth Fails when using Git with Sudo

Flashback about two years, and I had never touched git. GitHub was that place where you clicked the Download link to get a zip of the code you wanted. That being the case, I am still learning as I go. The other day, I drove myself crazy over a complete n00b mistake, which I am embarrassed to admit.

While working on an Ubuntu server, I was trying to pull changes from a Bitbucket repo into a subdirectory of /var/www. Every time I ran “git pull origin master,” the following error was displayed:

Permission denied (publickey).
fatal: Could not read from remote repository.

I knew the correct SSH key had been added to my profile, but I ran “ssh -T hg@bitbucket.org” to be sure. This returned the positive and expected:

logged in as jdoe.

So, why did it fail when I tried a git pull? I double-checked my remote, removed and re-added it to be sure. Then, it hit me! Because of the permissions set on the current directory, I was having to use sudo for all my git commands. But, sudo doesn’t know about my SSH keys (at least by default). After realizing this, the fix was easy: correct the permissions on the directory (which I should have done from the beginning instead of working around it with sudo) and add myself to the directory’s group. Sudo was no longer required and my authentication attempt worked beautifully.
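For the curious, the cleanup looked roughly like this (the group name and path are examples, not necessarily what your setup uses):

#Give the web root a group I can join, instead of leaning on sudo
sudo chgrp -R www-data /var/www/mysite
sudo chmod -R g+rwX /var/www/mysite
sudo usermod -aG www-data jdoe
#Log out and back in (or run "newgrp www-data") so the new membership applies,
#then "git pull origin master" runs as my own user and can use my SSH keys.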

Again, I hang my head in shame over this, but thought I should share in case someone else has a temporary lapse in judgement.

(While verifying the cause, I came across a quick guide that will forward your key to sudo, if you really must.)

routeToS3: Storing Messages in S3 via Lambda and the API Gateway

(If you want to cut to the chase, skip the first four paragraphs.)

Over the past year, Amazon Web Services (AWS) has previewed and released several new services that have the potential to drive the cost of IT down. This includes services like EFS and Aurora, but the service I was most excited about was Lambda. Lambda is a service that executes code on-demand so you don’t have to pay for an entire EC2 instance to sit around waiting for events. I recall at my previous position having a server that only existed to execute scheduled tasks. As supported languages expand, Lambda has the potential to completely replace such utility servers.

There are many ways to trigger Lambda functions, including S3 events, SNS messages and schedules. But, until recently, it wasn’t straightforward to trigger a Lambda event from outside your AWS environment. Enter Amazon’s fairly new API Gateway. The API Gateway is a super simple way to set up HTTP endpoints that communicate with AWS resources, including Lambda functions. And, you don’t have to be a seasoned developer to use it. In fact, I had only recently started learning some standard concepts while playing around with the Slim Framework for PHP. While understanding RESTful APIs will help the API Gateway feel more natural, you can get started without knowing everything.

Let me back up a bit and explain why I came across the API Gateway in the first place. SendGrid has become our go-to service for sending email from various applications. I can’t say enough good about SendGrid, but it has some intentional limitations. One of those is that it will store no more than 500 events or 7 days (whichever comes first) at a time. You still get all your stats, but if you need to look up what happened to a specific email two weeks ago (or two minutes ago depending on your volume), you’re out of luck. Fortunately, SendGrid thought this through and made an event webhook available that will POST these events as a JSON object to any URL you give it. “Perfect!” I thought, “We can build something to store it in RDS.” But first, I thought it prudent to explore the Internet for pre-built solutions.

My research brought me to Keen.io, which was the only out-of-the-box solution I found that would readily accept and store SendGrid events. If you are here for the exact same solution that I was looking for, I strongly recommend checking out Keen.io. The interface is a little slow, but the features and price are right. We would have gone this route in a heartbeat, but had some requirements that the terms of service could not satisfy. With that option gone, I was back to the drawing board. After brainstorming many times with my teammates, we finally came up with a simple solution: SendGrid would POST to an HTTP endpoint via the API Gateway, which would in turn fire up a Lambda function, which would take the JSON event and write it to an S3 bucket. The reason for S3 instead of something more structured like RDS or SimpleDB is because we can use Splunk to ingest S3 contents. Your requirements may be different, so be sure to check out other storage options like those I have mentioned already.

[Diagram: SendGrid event logging flow]

The initial plan. The API structure changed, but the flow of events is still accurate.

Now that we have introductions out of the way, let’s jump in and start building this thing. You will need to be familiar with creating Lambda functions and general S3 storage management. Note that I will borrow heavily from the API Gateway Getting Started guide and Lambda with S3 tutorial. Most of my testing took place on my personal AWS account and cost me $.02.

Create an S3 Bucket

The first thing you need to do is create your S3 bucket or folder that will store SendGrid events as files (you can also use an existing bucket). The simple GUI way is to open your AWS console and access the S3 dashboard. From there, click the Create Bucket button. Give your bucket a unique name, choose a region and click Create.
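If you prefer the CLI, creating the bucket is a one-liner (the bucket name and region below are placeholders):

#Create the bucket that will hold the SendGrid event files
aws s3 mb s3://my-bucket --region us-east-1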

Create a Lambda Function

This won’t be an in-depth guide into creating Lambda functions, but we will cover what you need to know in order to get this up and running. At the time of writing, Lambda supports three languages: Java, Node.js, and Python. I will use Node.js in this guide.

The Code

Create a file called index.js and add the following contents:


//Modified from AWS example: http://docs.aws.amazon.com/lambda/latest/dg/with-s3.html

var AWS = require('aws-sdk');

exports.handler = function(event, context) {
    console.log("routeToS3 Lambda function invoked");

    //Restrict this function so that not just anyone can invoke it.
    var validToken = event.validToken;
    //Check supplied token and kill the process if it is incorrect
    var token = event.token;
    if (token != validToken) {
        console.log('routeToS3: The token supplied (' + token + ') is invalid. Aborting.');
        context.fail('{ "result" : "fail", "reason" : "Invalid token provided" }');
    } else {
        uploadBody(event, context);
    }
};

var uploadBody = function(event, context) {
    var bucket = event.bucket;
    var app = event.app;
    var timestamp = Date.now();
    var key = app + '_' + timestamp;
    var body = JSON.stringify(event.body);

    var s3 = new AWS.S3();
    var param = {Bucket: bucket, Key: key, Body: body};
    console.log("routeToS3: Uploading body to S3 - " + bucket);
    s3.upload(param, function(err, data) {
        if (err) {
            console.log(err, err.stack); // an error occurred, log to CloudWatch
            context.fail('{ "result" : "fail", "reason" : "Unable to upload file to S3" }');
        } else {
            console.log('routeToS3: Body uploaded to S3 successfully'); // successful response
            context.succeed('{ "result" : "success" }');
        }
    });
};

This script will become your Lambda function and has a few key elements to take note of. First, it declares a variable named AWS with “require(‘aws-sdk’)”. This pulls in the aws-sdk Node.js module, which is required for writing to S3. With most Node.js modules, you will need to zip up the module files with your Lambda function. However, the AWS SDK is baked in, so you don’t need to worry about uploading any dependency files with the above function.

Next, the function declares a series of variables, starting with “validToken” and “token.” This might be where most seasoned API engineers roll their eyes at me. When possible, it makes sense to handle authentication at the API level and not inside your function. In fact, the API Gateway has this functionality built in. However, the supported method requires a change to the incoming request’s headers. That is not an option with SendGrid’s event webhook, which only gives you control over the URL, not the data. So, I had to cheat a little. We will cover this a little more when we set up the API, but for now it is sufficient to understand that token must match validToken for the function to work. Otherwise, the function will exit with an error.

Moving on to the other important variables:

  • bucket – The bucket or bucket/path combination (e.g.: my-bucket/SendGridEvents)
  • app – The name of the app these events are coming from; will be used as the resulting file’s prefix
  • timestamp – The current timestamp, which will be used to make the file name/key unique
  • key – Constructed from app and timestamp; used as the S3 object key (i.e., the file name)

All of these variables will be passed in via the API Gateway as part of the event variable. That is why they all look something like “bucket = event.bucket”.

When this script is run, the very first thing Lambda will do is call the “exports.handler” function. In our case, exports.handler simply checks the token and, if it is correct, calls the “uploadBody” function. Otherwise, it exits the script and writes an error to CloudWatch via console.log.

Zip up index.js and use it to create a new Lambda function named “routeToS3.” You can do this all through the GUI, but I am more familiar with the CLI method. Not because I am a CLI snob, but because when Lambda first came out, only account admins could access the Lambda GUI.
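For reference, the CLI route looks something like the sketch below. Treat it as a starting point rather than gospel: the role ARN is a placeholder, the role needs permission to write to your bucket and to CloudWatch Logs, and the runtime name should be whichever Node.js runtime Lambda currently offers.

#Package the function and create it
zip routeToS3.zip index.js
aws lambda create-function \
  --function-name routeToS3 \
  --runtime nodejs \
  --handler index.handler \
  --role arn:aws:iam::123456789012:role/routeToS3-execution-role \
  --zip-file fileb://routeToS3.zip
#--runtime may need a newer value (e.g. nodejs4.3) depending on what Lambda offers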

Create your API

The AWS API Gateway enables people to build APIs without typing a line of code. It’s really fast to get something up and running. In fact, when all I meant to do was make sure my permissions were set correctly, I accidentally built the whole thing. I recommend checking out AWS’s guide, but you can also learn a bit by following along here.

To start…

  1. Log into your AWS console and open up the API Gateway service and click the Create API button.
  2. Name your API routeToS3 and click the next Create API button.
  3. With the root resource selected (it should be your only resource at this point), click Actions -> Create Resource.
  4. Name the resource “input” and set the path to “input” as well.
  5. Select /input from Resources menu on the left.
  6. Click Actions -> Create Method.
  7. In the dropdown that appears on the Resources menu, select POST and click the checkmark that appears to the right.
  8. For Integration Type, choose Lambda Function.
  9. Set your Lambda Region (choose the same region as your S3 bucket).
  10. Type or select the name of your Lambda function (routeToS3) in the Lambda Function field.
  11. Click Save
  12. When prompted to Add Permission to Lambda Function, click OK.

Congratulations! You just built an API in about two minutes. Now, in order to make sure the Lambda function gets all the parameters we mentioned earlier (body, bucket, app, etc.), we need to configure query strings, a mapping template, and a stage variable. We won’t be able to create a stage variable just yet, so that will come a little later.

With your POST method selected in the Resources menu, you should see a diagram with boxes titled Method Request, Integration Request, Method Response, and Integration Response:

[Screenshot: POST method execution diagram]

Click on Method Request to setup our query strings. From here, click to expand the URL Query String Parameters section. Any query string we add here will act as what some of us might refer to as GET parameters (e.g.: /?var1=a&var2=b&var3=etc). To setup the strings we will need, follow these steps:

  1. Click the Add query string link.
  2. Name the string token and click the checkmark to the right.
  3. Repeat for app and bucket.

Go back to the method execution overview by clicking POST in the Resources menu or <- Method Execution at the top.

Next, we will add a mapping template:

  1. Click Integration Request.
  2. Expand the Body Mapping Templates section.
  3. Click Add mapping template
  4. Type application/json (even though it is already filled in and doesn’t disappear when you click inside the text box) and click the checkmark to the right.
  5. Click the pencil icon next to Input Passthrough (it’s possible you could see “Mapping template” instead).
  6. Add the following JSON object and click the checkmark

{
  "bucket": "$input.params('bucket')",
  "app": "$input.params('app')",
  "token": "$input.params('token')",
  "validToken": "$stageVariables.validToken",
  "body": $input.json('$')
}

This mapping will take the body of the request and our variables, and pass them along as part of the event object to Lambda. Note that all values, like “$input.params(‘bucket’)” are wrapped in double quotes, except for $input.json(‘$’). That is because we are actually calling a function on the body (‘$’), so wrapping it in quotes will break things.

Now, it’s time to deploy our API, which will make it accessible over HTTP. But, we haven’t tested it yet and that validToken variable is still undefined. Don’t worry, we haven’t forgotten those two critical pieces. But, we have to create a stage first, which is part of the deployment process.

  1. Click the Deploy API button at the top of the screen.
  2. On the screen that appears, choose [New Stage] for the Deployment Stage.
  3. Choose a name for the stage (Stages are like different environments, for example dev or prod).
  4. Enter a Deployment description and click Deploy.

On the screen that follows, you will see a tab labeled Stage Variables. Open this tab and click Add Stage Variable. Name the variable validToken and enter a token of your choosing for the Value. Use something strong.

Go back to the Settings tab and take a look at the options there. You may be interested in throttling your API, especially if this is a development stage. Remember that, although the API Gateway and Lambda are fairly cheap, too much traffic could rack up a bill. Since we aren’t using a client certificate to authenticate the calling app, we have to invoke the Lambda function to verify the provided token. Just something to keep in mind when considering throttling your API.

Now that I’ve distracted you with some prose, click Save Settings at the bottom of the page.

At the top of the screen, you will see an Invoke URL. This is the address to access the stage you just deployed into. All of our magic happens in the /input resource, so whatever that Invoke URL is, add “/input” to the end of it. For example, https://yudfhjky.execute-api.region.amazonaws.com/dev would become https://yudfhjky.execute-api.region.amazonaws.com/dev/input.

With our stage setup, we can now test the method.

  1. Go back to the routeToS3 API and click on the POST method in the Resources menu.
  2. Click Test.
  3. Enter a token, app, and a valid bucket/folder path (e.g.: my-bucket/routeToS3/SendGrid)
  4. Enter a value for validToken (this should be the same as token if you want the test to succeed).
  5. For Request Body, type something like {“message”: “success”}.
  6. Click Test.

You should see the CloudWatch logs that indicate the results of your test. If all is well, you will get a 200 status back and a corresponding file will appear in the bucket you provided. The file contents should be {“message”: “success”} or whatever you set for the request body.
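You can also hit the deployed endpoint from outside AWS. Here is a hedged example using the placeholder URL and values from this post:

#POST a fake event to the live endpoint (URL, token, bucket, and app are placeholders)
curl -X POST \
  "https://yudfhjky.execute-api.region.amazonaws.com/dev/input?bucket=my-bucket/routeToS3/SendGrid&token=1234567890&app=SendGrid" \
  -H "Content-Type: application/json" \
  -d '{"message": "success"}'
#A successful call returns the function's success JSON and a new file lands in the bucket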

If things are working as expected, then it is time to head over to SendGrid and configure your event webhook:

  1. Log into SendGrid.
  2. Click Settings -> Mail Settings.
  3. Find Event Notification.
  4. Click the gray Off button to turn event notifications on.
  5. If needed, click edit to enter the HTTP POST URL.
  6. Enter the URL to your API endpoint, along with all necessary query strings (e.g.: https://yudfhjky.execute-api.region.amazonaws.com/dev/input?bucket=my-bucket/routeToS3/SendGrid&token=1234567890&app=SendGrid).
  7. Click the checkmark.
  8. Check all the events you want to log.
  9. Click the Test Your Integration button.
  10. Wait a couple minutes and then check your bucket to see if SendGrid’s test events arrived.

Tada! You should now be logging SendGrid events to an S3 bucket. Honestly, it’s much simpler than you might think based on the length of this post. Just keep the perspective that all of this is accomplished with three lightweight and low-cost services: the API Gateway to receive the event from SendGrid, Lambda to process that event and upload it to S3, and S3 to store the body of the SendGrid event. I hope you find this as helpful and straightforward as I have.

FTP to Google Drive

Let’s be clear that Google Drive does not provide FTP access to your content. But, that doesn’t mean it isn’t possible. I’ve been playing recently with a wireless security camera that can send images to an FTP server fairly easily. But, I didn’t have a reliable cloud FTP target handy at the right price. Google Drive seemed like an excellent storage solution, but there was no way for the camera to utilize it… Directly.

At some point, I remembered I had a 2006 Mac Mini sitting around. The older versions of OSX make it really simple to get an FTP server up and running, which is the boat I found myself in:

  • Open System Preferences
  • Go to Sharing
  • Enable File Sharing
  • Modify permissions and paths to your liking
  • FTP will now be available on port 21

If you want to use a Mac for this exercise and you have a newer OS installed, you may need to follow these steps.

First half of your work? Done. The next step is pretty simple: Download and install the Google Drive app (tip: limit the folders Google Drive will sync if this will be a single-use computer/server). Google Drive content will be accessible at /Users/username/Google Drive. However, if, like me, your camera or other client doesn’t play nicely with spaces, you’ll need a workaround. A symlink (or a shortcut in Windows) took care of this for me. I ran a command like this to create a space-free symlink:

ln -s ~/Google\ Drive/ ~/googleDrive

The backslash (“\”) escapes the space in a *nix environment. Now, anytime you write to /Users/username/googleDrive, you will actually be writing to your Google Drive folder. That means, if you use this path in your FTP configuration, you are essentially writing to Google Drive using FTP. Sneaky, sneaky. It worked beautifully for me. In fact, it worked a little too well. I didn’t quite nail the security camera’s sensitivity level and woke up to more than 10,400 images synced to Google Drive.
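To sanity-check the whole chain, upload a test file over FTP from another machine and watch it appear in Google Drive. Something like this should work (the IP, credentials, and file name are placeholders):

#Upload a test image into the space-free symlink, which syncs up to Google Drive
curl -T test.jpg ftp://192.168.1.50/googleDrive/ --user username:password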

But, are there downsides? Of course. First and foremost is that, at least in my setup, it means one more device powered up. The Mini isn’t the worst thing to have going, but it also isn’t your only option if you want to be a little more green. You could set up something in AWS or Azure, use a Raspberry Pi, etc., but keep in mind there is no official Google Drive app for Linux yet. The second downside is that Google Drive only runs, and therefore only syncs, when you are logged in. That means my Mac Mini is set up for automatic login, never goes to sleep, and starts up immediately after a power failure.

It’s not a perfect setup, but it worked in a pinch. My next task is setting up a secure proxy to the camera’s web interface. Another need the Mini can easily fill.

Crawling Commented Styling with Heritrix

This post isn’t something I can take credit for. The purpose is to make two potential solutions discoverable for someone like me, looking for an answer. Credit will be given where it is absolutely due.

As I’ve written about before, I inherited a Wayback/Heritrix server in my role and have had the pleasure of hacking my way through it on occasion. A recent challenge arose when I needed Heritrix to crawl an old Plone site. The Plone template placed html comments around all the styling tags to hide it from older browsers, which couldn’t understand the CSS. The result looks something like this:

<style type="text/css"><!--
/* - base.css - */
@media screen {
  /* http://servername.domain.net/portal_css/base.css?original=1 */
  /* */
  /* */
  .documentContent ul {
    list-style-image: url(http://servername.domain.net/bullet.gif);
    list-style-type: square;
    margin: 0.5em 0 0 1.5em;
  }
--></style>

Unfortunately, it seems Heritrix happily skips past any URLs within comments by default and does not follow them, regardless of your seeds and other configurations. Because, hey, they’re only comments, right? The end result is that it looks like the site was crawled successfully, but some resources were actually missed. In the above example, the Wayback version of the site was still pointing to http://servername.domain.net/bullet.gif for the list-style-image, rather than http://wayback.domain.net:3366/wayback/20151002173414im_/http://servername.domain.net/bullet.gif. Therefore, it was not a complete archive of the site and its contents.

In my case, this was an internal site that I had total control over. However, try as I might, I could not figure out how to remove the comments from the old Plone template. Grepping for ‘<style type=”text/css”><!–‘ turned up ‘_SkeletonPage.py’. I tried modifying it and then running buildout to no avail. I am sure people more experienced with Plone could tell you where to change this in a heartbeat, but it’s beyond my knowledge with the application at this point. After coming up short on searches for solutions with Heritrix (thus, this post), I started looking for ways to remove the comment tags with something like Apache’s mod_substitute, since Plone was being reverse-proxied through Apache anyway.

Solution 1: Mod_Substitute/Mod_Filter

Eventually, I stumbled upon this configuration from Chris, regarding mod_substitute and mod_filter. Mod_filter needed to be used for mod_substitute to work properly because of the content being reverse-proxied. A simple modification of Chris’s configuration worked to remove the comment tags beautifully (using CentOS/Httpd for reference):

LoadModule substitute_module modules/mod_substitute.so
LoadModule filter_module modules/mod_filter.so
FilterDeclare replace
FilterProvider replace SUBSTITUTE Content-Type $text/html
FilterChain +replace
FilterTrace replace 1
Substitute "s/css\"><!--/css\">/n"
Substitute "s|--></style>|</style>|n"

(Note: I probably could have made this a little more efficient by using a single regex instead of two separate substitutes. But, meh. This was good enough.)

Chris recommended loading this into a new file: /etc/httpd/conf.d/replace.conf.
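After dropping the file in place, reload Apache and spot-check a proxied page. A rough verification might look like this (the hostname is the placeholder used above; use systemctl on newer CentOS releases):

sudo service httpd reload
curl -s http://servername.domain.net/ | grep 'text/css'
#If the substitution is working, the <style type="text/css"> tag should no longer be followed by "<!--"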

Solution 2: Hack Heritrix

While exploring my options with Apache, I also decided to reach out to the archive-crawler community on Yahoo! for help. A user identified as “eleklr” shared a patch that he used often for this kind of scenario. I think this is the better route to go, though I have not had an opportunity to try it out yet. Its biggest strength is that it doesn’t require you to have complete control over the site you are crawling, as is necessary for solution 1.

If you’ve found yourself in my position, rejoice in the fact that it’s not just you and there are solutions out there. Hopefully, one of the two listed above will help you on your way. Please share if you’ve discovered other solutions to this or similar problems.

Who Isn’t Taking Out the Trash? Use WinDirStat and PowerShell to Find Out.

Using WinDirStat to find unnecessary files on a hard drive is a pretty routine task. A common find is that someone’s recycling bin has large zip or executable files. WinDirStat is helpful for showing this to you, but it only reveals the user’s local SID, such as:

S-1-1-12-1234567890-123456789-123456789-123

It’s not terribly difficult to track down the associated profile using regedit. Still, clicking through a series of plus buttons in a GUI seems inefficient. Here is a simple method I used today to make this process a little quicker. Ok, so it took a bit longer than clicking through the first time, but it will be quicker for me next time:


((get-itemproperty "hklm:\Software\Microsoft\Windows NT\CurrentVersion\ProfileList\*") | where {$_.pschildname -like "S-1-1-12-1234567890-123456789-123456789-123"}).ProfileImagePath

This will return the ProfileImagePath value, which is the file path to the guilty profile. If you want to cut straight to the username, try this:


(((get-itemproperty "hklm:\Software\Microsoft\Windows NT\CurrentVersion\ProfileList\*") | where {$_.pschildname -like "S-1-1-12-1234567890-123456789-123456789-123"}).ProfileImagePath).split("\")[-1]
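If this comes up often enough, you could wrap the lookup in a small helper. The function below is my own sketch (the name is made up), but it runs the same registry query:

#Hypothetical helper: translate a SID reported by WinDirStat into a profile path and username
function Get-ProfileForSid {
  param([string]$Sid)

  $entry = Get-ItemProperty "hklm:\Software\Microsoft\Windows NT\CurrentVersion\ProfileList\*" |
    Where-Object { $_.PSChildName -eq $Sid }

  if (-not $entry) { return }

  [pscustomobject]@{
    Sid         = $Sid
    ProfilePath = $entry.ProfileImagePath
    UserName    = ($entry.ProfileImagePath -split '\\')[-1]
  }
}

Get-ProfileForSid "S-1-1-12-1234567890-123456789-123"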