Recently, a website I use infrequently was badly defaced, and while repairing the damage the owners temporarily took it down. As I found it to be a very useful resource, I lamented not having an offline copy, so when the site was restored I decided to make one without further ado.
Although WebCopy is our most popular product, it actually got started accidentally as an offshoot of Sitemap Creator, to make a copy of a long-forgotten website. The reason for bringing up that trivia is that right from the start Sitemap Creator had a feature for transforming page titles to remove the extra text these typically have. This functionality has now been repurposed to allow WebCopy to intercept a URI at the detection stage and transform it into something different.
You can find the configuration settings in the URI Transforms section of the Project Properties dialog.
Each transform consists of a Pattern, a Replacement and an optional URI. The pattern is a regular expression which is used both to match the source link and to define any result groups. Replacement is another expression which defines how the URI is transformed. Finally, URI can be used to restrict the pattern matching to links belonging to a given URI.
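To make the three parts a little more concrete, here is a minimal Python sketch of how such a rule might be modelled. It is not WebCopy's implementation: the UriTransform class, its field names and the apply method are hypothetical, and treating the optional URI as a simple prefix filter is only one possible reading of "links belonging to a given URI". Note also that Python writes group references as \1, where a WebCopy replacement uses $1.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class UriTransform:
    """Hypothetical model of one transform rule; not WebCopy's actual types."""
    pattern: str                # regular expression matched against the link
    replacement: str            # expression that builds the new URI
    uri: Optional[str] = None   # if set, only links under this URI are tested

    def apply(self, link: str) -> str:
        # Assumption: the optional URI acts as a simple prefix filter.
        if self.uri is not None and not link.startswith(self.uri):
            return link
        # Replace the first match only; a link that does not match is returned unchanged.
        return re.sub(self.pattern, self.replacement, link, count=1)


# Example rule using made-up addresses.
rule = UriTransform(
    pattern=r"^http://example\.com/old/(.*)$",
    replacement=r"http://example.com/new/\1",
    uri="http://example.com/",
)

print(rule.apply("http://example.com/old/page.html"))    # rewritten
print(rule.apply("http://other.example/old/page.html"))  # outside the URI, left alone
```

Keeping the scope check separate from the pattern means the expression itself can stay short, which matters given how hard regular expressions are to read.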
Regular expressions are a vast and complicated topic and it would be nice if WebCopy didn't depend so much on them - they don't make WebCopy very easy to use in many respects. WebCopy does include a basic editor for expressions which can be quite handy for testing patterns and replacements but it could be improved.
One use case is for cutting out an interim page. For example, one page may ultimately link to another, but it does this by first calling an interim page with a query string argument describing the destination. The interim page will perform some action (such as logging the "click", showing a timed advert, etc.) and then navigate to the destination. By using a transform, we can manipulate the URL to discard the interim page and just go directly to the destination, remapping the source link appropriately.
The WebCopy demonstration page includes an example of this behaviour. The Middleman Redirect link will navigate to redirecttracker.php?url=uritransformfinal.php. We can use a simple pattern to strip out the bulk of the URL and just keep the query string parameter.
This pattern matches redirecttracker.php?url= and captures everything after it in a group. The replacement then simply outputs the contents of the group, uritransformfinal.php in the above example.
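The same idea can be tried outside WebCopy with an ordinary regular expression. The pattern in this Python sketch is an assumption about what would do the job, not necessarily the exact expression used for the demonstration page, and the strip_redirect helper is purely illustrative. Python spells the group reference \1 rather than $1.

```python
import re

# Assumed pattern: escape the literal "." and "?" and capture everything after "url=".
PATTERN = r"redirecttracker\.php\?url=(.*)"
REPLACEMENT = r"\1"  # output only the captured group


def strip_redirect(link: str) -> str:
    """Return the destination held in the query string, or the link unchanged."""
    return re.sub(PATTERN, REPLACEMENT, link, count=1)


print(strip_redirect("redirecttracker.php?url=uritransformfinal.php"))
# uritransformfinal.php
```

A link that does not contain the redirect page is passed through untouched, so the rule is safe to apply to every detected URI.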
If you tell WebCopy to crawl the demonstration site without the above transform, it will find uritransformfinal.php, but the source page will still point to the original redirection page. With the above transform in place, WebCopy will never know that redirecttracker.php exists - it will skip directly to the final page.
For a more advanced example, the demonstration page also has three hyperlinks which, instead of pointing at plain URLs, call an openPage function with two quoted arguments.
Clicking one of these links navigates to a page named after the two arguments, for example 1-second.php. While not really best practice for modern websites, this mirrors the behaviour of the original site I wanted to copy.
If you do a normal scan using WebCopy, it will detect all three links above but silently ignore them. To get WebCopy to process these links correctly, we need to detect the calls to the openPage function and construct a replacement URI using the two parameters, plus an extension. This can be done with a transform.
The pattern will first try to match the openPage function name and its opening bracket (brackets are escaped with \ as they are special characters). It will then capture any characters between the first set of single quotes into a capture group. After the closing quote, it will then match a , character. The \s token means to match any white space character, and the quantifier after it makes it optional. Next, the pattern captures any characters between the second set of single quote characters and matches a closing bracket. In fairness, the pattern could be simplified further, but then it would look even more confusing to newcomers, so I've tried to keep it explicit.
The replacement expression basically combines the two groups (the $1 and $2 tokens represent capture groups from the pattern) with a - between them and then adds the .php extension.
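As a rough illustration of the pattern and replacement working together, here is a Python sketch. The expression is reconstructed from the description and so should be read as an assumption rather than the exact one used in WebCopy; the input strings are hypothetical and contain only the function call, and Python writes the group tokens as \1 and \2 instead of $1 and $2.

```python
import re

# Reconstructed from the description, so treat it as an assumption:
#   openPage\(   literal function name and escaped opening bracket
#   '([^']*)'    first quoted argument, captured as group 1
#   ,\s?         a comma followed by optional white space
#   '([^']*)'    second quoted argument, captured as group 2
#   \)           escaped closing bracket
PATTERN = r"openPage\('([^']*)',\s?'([^']*)'\)"
REPLACEMENT = r"\1-\2.php"  # the Python spelling of $1-$2.php


def transform(link: str) -> str:
    """Turn an openPage call into a file name; other links are returned unchanged."""
    return re.sub(PATTERN, REPLACEMENT, link, count=1)


print(transform("openPage('1', 'second')"))
# 1-second.php
```

Using [^']* for the arguments instead of a looser "any character" match stops the first group from running past its closing quote.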
Now when scanning the demonstration website, WebCopy will find those three links and automatically transform them, therefore finding and downloading the linked pages. At the end of the copy, when WebCopy remaps downloaded HTML to ensure links are local to the copy, it will also replace the source links with the transformed name.
Currently this functionality is only available in nightly builds, available from the WebCopy download page.
- 2017-05-29 - First published
- 2020-11-23 - Updated formatting