Transforming hyperlinks when copying websites

Recently a website I infrequently use was badly defaced, and in the course of repairing the damage the owners of the site temporarily took it down. As I found it to be a very useful resource I lamented not having an offline copy and so when the site was restored, I decided to make a copy without further ado.

However, as I swiftly discovered, that was a problem - the site used JavaScript for many internal links, and WebCopy doesn't support JavaScript. Somewhat fortunately, when I looked at how the JavaScript links functioned, I discovered they were all of a predictable nature - a call to a single function with two string arguments. The destination URL was a simple concatenation of these arguments with no extra processing.

Although WebCopy is our most popular product, it actually got started accidentally as an offshoot of Sitemap Creator to make a copy of a long-forgotten website. The reason for bringing up that trivia is that right from the start Sitemap Creator had a feature for transforming page titles to remove the extra text these typically have. This functionality has now been repurposed to allow WebCopy to intercept a URI at the detection stage and transform it into something different.

New options for replacing detected URI strings are now available

What can you use it for?

The initial use case is to transform values from one form into another in a predictable fashion, for example to skip a call to an interim page or to handle very simple JavaScript.

How do you use it?

You can find the configuration settings in the URI Transforms section of the Project Properties dialog.

Each replacement consists of a Pattern, a Replacement and an optional URI. The pattern is a regular expression which is used both to match the source link and to define any result groups. Replacement is another expression which defines how the URI is transformed. Finally, URI can be used to restrict the pattern matching to links belonging to a given URI.
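To make the three parts a little more concrete, here is a rough Python sketch of how such a rule could be modelled. This is not WebCopy's own code: the UriTransform class, its apply method, the sample patterns and the example.com addresses are all hypothetical, and treating the optional URI as a regular expression tested against the page address is purely an assumption. Python also writes group references as \1 where the replacement expressions in this post use $1.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class UriTransform:
    """Hypothetical model of one transform rule; not WebCopy's actual type."""

    pattern: str               # regular expression matched against the link
    replacement: str           # replacement expression (Python \1 syntax)
    uri: Optional[str] = None  # assumption: a regex tested against the page URI

    def apply(self, link: str, page_uri: str) -> str:
        # Skip the rule when a URI filter is set and the page does not match it.
        if self.uri is not None and not re.search(self.uri, page_uri):
            return link
        # Otherwise rewrite the link using the pattern's capture groups.
        return re.sub(self.pattern, self.replacement, link)


# Made-up rule: rewrite old/... links to new/..., but only on pages under /docs/.
rule = UriTransform(r"^old/(.*)$", r"new/\1", uri=r"example\.com/docs/")

print(rule.apply("old/page.html", "http://example.com/docs/index.html"))
print(rule.apply("old/page.html", "http://example.com/blog/index.html"))
```

The first call rewrites the link because the page address matches the filter, while the second leaves it untouched.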

Regular expressions are a vast and complicated topic and it would be nice if WebCopy didn't depend so much on them - they don't make WebCopy very easy to use in many respects. WebCopy does include a basic editor for expressions which can be quite handy for testing patterns and replacements but it could be improved.

The built-in editor can help with testing regular expressions

Usage Scenario: Cutting out the middle man

One use case is for cutting out an interim page. For example, one page may ultimately link to another, but it does this by first calling an interim page with a query string argument describing the destination. The interim page will perform some action (such as logging the "click", showing a timed advert, etc.) and then navigate to the destination. By using a transform, we can manipulate the URL to discard the interim page and just go directly to the destination, remapping the source link appropriately.

The WebCopy demonstration page includes an example of this behaviour. The Middleman Redirect link will navigate to redirecttracker.php?url=uritransformfinal.php. We can use a simple pattern to strip out the bulk of the URL and just keep the query string parameter.

Pattern: redirecttracker\.php\?url=(.*)
Replacement: $1

Breakdown of a pattern for capturing simple redirection links

This pattern matches redirecttracker.php?url= and captures everything after in a group. The replacement then simply outputs the contents of the group, uritransformfinal.php in the above example.
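If you want to try the pattern outside of WebCopy, the same substitution can be approximated in a few lines of Python. This is only a sketch using Python's re module rather than WebCopy's own engine, so the replacement is written as \1 instead of $1; the variable names are mine.

```python
import re

# The pattern from the transform: match the redirect page and capture
# everything after "url=" in the first group.
PATTERN = r"redirecttracker\.php\?url=(.*)"
REPLACEMENT = r"\1"  # Python's spelling of $1

link = "redirecttracker.php?url=uritransformfinal.php"
transformed = re.sub(PATTERN, REPLACEMENT, link)

print(transformed)  # uritransformfinal.php
```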

If you tell WebCopy to crawl the demonstration site without the above transform, it will find uritransformfinal.php, but the source page will still point to the original redirection page. With the above transform in place, WebCopy will never know that redirecttracker.php exists - it will skip directly to the final page.

Usage Scenario: Converting simple JavaScript links

For a more advanced example, the demonstration page also has three hyperlinks with the following href attributes:

javascript:openPage('1', 'index')
javascript:openPage('1', 'second')
javascript:openPage('2', 'index')

Clicking the first link will navigate to 1-index.php, the second to 1-second.php and the third to 2-index.php. While not really best practice for modern websites, this mirrors the behaviour of the original site I wanted to copy.

If you do a normal scan using WebCopy, while it will detect all three links above, it will silently ignore them. To get WebCopy to correctly process these links we need to detect the calls to the openPage function, and construct a replacement URI using the two parameters, plus an extension.

This can be done with the following transform:

Pattern: javascript:openPage\('(.*)',\s?'(.*)'\)
Replacement: $1-$2.php

Breakdown of a pattern for capturing simple JavaScript links

The above expression will first try to match javascript:openPage(' (parentheses are escaped with \ as they are special characters). It will then capture any characters between the first set of single quotes into a capture group. After the closing quote, it will then match a , character. The \s token means to match any white space character, and the ? makes it optional. Next the pattern captures any characters between the second set of single quote characters and matches a closing parenthesis. In fairness, the pattern could be simplified further, but then it would look even more confusing to newcomers so I've tried to keep it more explicit.

The replacement expression basically combines the two groups (the $1 and $2 tokens represent capture groups from the pattern) with a - between them and then adds the .php extension.
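As a quick sanity check, the transform can be approximated in Python and run against the three href values from the demonstration page. Again, this is a sketch with Python's re module and not how WebCopy itself is implemented; the parentheses in the pattern are escaped with \ as described, and the replacement uses \1 and \2 in place of $1 and $2.

```python
import re

# Match a call to openPage and capture both quoted string arguments.
PATTERN = r"javascript:openPage\('(.*)',\s?'(.*)'\)"
# Join the two groups with a hyphen and append the extension.
REPLACEMENT = r"\1-\2.php"

links = [
    "javascript:openPage('1', 'index')",
    "javascript:openPage('1', 'second')",
    "javascript:openPage('2', 'index')",
]

results = [re.sub(PATTERN, REPLACEMENT, link) for link in links]

for original, transformed in zip(links, results):
    print(f"{original} -> {transformed}")
```

The three results are 1-index.php, 1-second.php and 2-index.php, matching the pages the links navigate to when clicked.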

Now when scanning the demonstration website, WebCopy will find those three links and automatically transform them, therefore finding and downloading the linked pages. At the end of the copy, when WebCopy remaps downloaded HTML to ensure links are local to the copy, it will also replace the source links with the transformed name.