Transforming hyperlinks when copying websites
Recently a website I infrequently use was badly defaced, and in the course of repairing the damage the owners of the site temporarily took it down. As I found it to be a very useful resource I lamented not having an offline copy and so when the site was restored, I decided to make a copy without further ado.
Although WebCopy is our most popular product, it actually got started accidentally as an offshoot of Sitemap Creator to make a copy of a long forgotten website. And the reason for bringing up that trivia is that right from the start Sitemap Creator had a feature where it could transform page titles to remove the extra text these typically have. This functionality has now been re-purposed to allow WebCopy to intercept a URI at the detection stage and transform it into something different.
What can you use it for?
How do you use it?
You can find the configuration settings in the URI Transforms section of the Project Properties dialog.
Each transform consists of a Pattern, a Replacement and an optional URI. The pattern is a regular expression which is used both to match the source link and to define any capture groups. Replacement is another expression which defines how the URI is transformed. Finally, URI can be used to restrict the pattern matching to links belonging to a given URI.
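To make the three parts a little more concrete, here is a minimal sketch in Python. It is not how WebCopy is implemented: the UriTransform class, its field names and the apply method are hypothetical, and treating the optional URI as a simple prefix test on the page containing the link is purely an assumption for illustration. Python also writes group references as \g<1> where WebCopy uses $1.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class UriTransform:
    """Hypothetical model of one transform; not WebCopy's real type."""
    pattern: str               # regular expression matched against the link
    replacement: str           # expression that builds the new URI
    uri: Optional[str] = None  # optional scope; None means "apply everywhere"

    def apply(self, link: str, page_uri: str) -> str:
        # Assumption: the scope is a plain prefix test on the page that
        # contains the link. WebCopy's actual rule may differ.
        if self.uri is not None and not page_uri.startswith(self.uri):
            return link
        # Links that do not match the pattern are returned unchanged.
        return re.sub(self.pattern, self.replacement, link)
```

A transform with no URI applies everywhere, while one with a URI leaves links on other pages alone.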
Regular expressions are a vast and complicated topic and it would be nice if WebCopy didn't depend so much on them - they don't make WebCopy very easy to use in many respects. WebCopy does include a basic editor for expressions which can be quite handy for testing patterns and replacements but it could be improved.
Usage Scenario: Cutting out the middle man
One use case is for cutting out an interim page. For example, one page may ultimately link to another, but it does this by first calling an interim page with a query string argument describing the destination. The interim page will perform some action (such as logging the "click", showing a timed advert, etc.) and then navigate to the destination. By using a transform, we can manipulate the URL to discard the interim page and just go directly to the destination, remapping the source link appropriately.
The WebCopy demonstration page includes an example of this behaviour. The Middleman Redirect link will navigate to redirecttracker.php?url=uritransformfinal.php. We can use a simple pattern to strip out the bulk of the URL and just keep the query string parameter.
This pattern matches redirecttracker.php?url= and captures everything after in a group. The replacement then simply outputs the contents of the group, uritransformfinal.php in the above example.
If you tell WebCopy to crawl the demonstration site without the above transform, it will find uritransformfinal.php, but the source page will still point to the original redirection page. With the above transform in place, WebCopy will never know that redirecttracker.php exists - it will skip directly to the final page.
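The effect of such a transform can be approximated outside of WebCopy with a few lines of Python. This is only a sketch: the pattern below is my assumption of one expression that fits the description (match redirecttracker.php?url= and capture the rest), not necessarily the exact one used in the project, and cut_out_middleman is a hypothetical helper name. Python spells the group reference as \g<1> where WebCopy uses $1.

```python
import re

# Assumed pattern: the literal text "redirecttracker.php?url=" followed by a
# single capture group holding the rest of the link. The "." and "?" are
# escaped because they are special characters in a regular expression.
PATTERN = re.compile(r"redirecttracker\.php\?url=(.*)")

# Output only the contents of the first capture group.
REPLACEMENT = r"\g<1>"


def cut_out_middleman(link: str) -> str:
    """Return the final destination for a redirect link, or the link as-is."""
    return PATTERN.sub(REPLACEMENT, link)


print(cut_out_middleman("redirecttracker.php?url=uritransformfinal.php"))
# prints: uritransformfinal.php
```

Links that do not pass through the interim page do not match the pattern and come back unchanged.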
For a more advanced example, the demonstration page also has three hyperlinks with the following
Clicking the first link will navigate to 1-index.php, the second to 1-second.php and the third to 2-index.php. While not really best practice for modern websites, this mirrors the behaviour of the original site I wanted to copy.
If you do a normal scan using WebCopy, while it will detect all three links above, it will silently ignore them. To get WebCopy to correctly process these links we need to detect the calls to the openPage function, and construct a replacement URI using the two parameters, plus an extension.
This can be done with the following transform
The above expression will first try to match the text openPage( (brackets are escaped with \ as they are special characters). It will then capture any characters between the first set of single quotes into a capture group. After the closing quote, it will then match a , character. The \s token means to match any white space character, and the ? makes it optional. Next the pattern captures any characters between the second set of single quote characters and matches a closing bracket. In fairness, the pattern could be simplified further, but then it would look even more confusing to newcomers so I've tried to keep it more explicit.
The replacement expression basically combines the two groups (the $1 and $2 tokens represent capture groups from the pattern) with a - between them and then adds the .php extension.
Now when scanning the demonstration website, WebCopy will find those three links and automatically transform them, therefore finding and downloading the linked pages. At the end of the copy, when WebCopy remaps downloaded HTML to ensure links are local to the copy, it will also replace the source links with the transformed name.
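As a rough illustration of the openPage transform described above, the following Python sketch rebuilds a pattern from that description and applies it to three sample links. Both the pattern and the sample link values are assumptions on my part (the arguments are simply chosen to produce the three page names mentioned earlier), and transform_open_page is a hypothetical helper. Python writes the group references as \g<1> and \g<2> where WebCopy uses $1 and $2.

```python
import re

# Assumed pattern, following the description: the literal text "openPage("
# (bracket escaped), the first quoted argument, a comma, optional white
# space, the second quoted argument, and a closing bracket.
PATTERN = re.compile(r"openPage\('(.*?)',\s?'(.*?)'\)")

# Join the two captured arguments with "-" and append the extension.
REPLACEMENT = r"\g<1>-\g<2>.php"


def transform_open_page(link: str) -> str:
    """Turn an openPage(...) call into a plain relative URI."""
    return PATTERN.sub(REPLACEMENT, link)


# Hypothetical link values; the real markup may differ.
for link in ("openPage('1','index')",
             "openPage('1', 'second')",
             "openPage('2','index')"):
    print(transform_open_page(link))
# prints:
# 1-index.php
# 1-second.php
# 2-index.php
```

The optional \s is what lets the second sample, which has a space after the comma, match in the same way as the other two.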
Getting the build
Currently this functionality is only available in nightly builds, available from the WebCopy download page.