Note: WebCopy 2.0 isn't even in alpha yet. Images and descriptions below are from prototypes testing the feasibility of new functionality and how it might work. These features might never see the light of day, may look different, work differently, and so on.
As soon as the original WebCopy was completed, I knew it had a rather large flaw. The system of rules it uses is quite powerful, but fatally limited. Essentially you have a regular expression to match a URL, and a number that tells WebCopy what to do. Great if you can write regular expressions and only want to do a limited number of things. Utterly useless if you want more control - such as skipping files larger than 1GB, or URIs that are more than 10 levels deep.
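To make the limitation concrete, here's a minimal sketch in Python (purely illustrative; this is not how WebCopy's rules are actually implemented, and the names are invented) of a regex-plus-action rule. The rule only ever sees the URI text and a single action code, so conditions based on file size or crawl depth have nowhere to fit.

```python
import re
from enum import Enum

class Action(Enum):
    # Hypothetical action codes, standing in for WebCopy's numeric rule actions
    EXCLUDE = 0
    INCLUDE = 1

class RegexRule:
    """Old-style rule: a URI pattern plus a single action code."""
    def __init__(self, pattern, action):
        self.pattern = re.compile(pattern)
        self.action = action

    def applies_to(self, uri):
        # The rule only sees the URI string, so there is no way to express
        # "skip files over 1GB" or "skip anything more than 10 levels deep" -
        # that information simply isn't part of the match.
        return self.pattern.search(uri) is not None

rule = RegexRule(r"\.zip$", Action.EXCLUDE)
print(rule.applies_to("http://example.com/downloads/archive.zip"))  # True
```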
Some of the most frustrating support requests we deal with tend to be people trying to copy forums using WebCopy. While it's technically possible, it's not an easy thing. Forums can have many thousands of links, and a lot of them are replicated many times for "new posts" and "replies" and so on. In most cases, copying this is not what you want as it adds zero value to the copy.
You could use regular expressions to filter out URIs whose query strings contain particular keys and values; that much at least is possible (if regular expressions don't scare you away, that is). But you can't manipulate the query string at all. OK, you can clear it completely, but that's only going to be useful in a tiny percentage of cases.
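For example, a pattern along these lines (a hypothetical illustration, not an actual WebCopy rule) will match forum-style URIs whose query string contains a particular key and value, which is enough to exclude them - but a rule like this can only match or clear the query string, never rewrite it.

```python
import re

# Hypothetical example: match URIs whose query string contains action=reply,
# the kind of link a forum generates thousands of times over.
pattern = re.compile(r"[?&]action=reply(&|$)")

uris = [
    "http://forum.example.com/viewtopic.php?t=42",
    "http://forum.example.com/posting.php?action=reply&t=42",
]

for uri in uris:
    matched = bool(pattern.search(uri))
    print(uri, "-> excluded" if matched else "-> kept")
```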
I wanted to add a proper rules system to WebCopy, but doing so means a substantial rewrite of much of the crawling engine, and that essentially makes it a "version next" feature. For the last several months WebCopy updates have slowed down in terms of bug fixes and enhancements relating directly to the crawling engine. It's stable enough for most people, it would seem.
The screenshot above shows a prototype GUI for a new rule engine. The bottom half of the screen shows the conditions of a rule, and the upper half shows the results of the rule being evaluated against a list of URIs - everything highlighted in reddish orange has been matched against the rule, and so if this were a fully functional application, the specified actions would have been carried out.
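As a rough sketch of the idea (my own illustration in Python, not the prototype's actual code; the helper names are assumptions), a rule is a set of conditions evaluated against each URI. URIs for which every condition holds are the ones that would be highlighted, and the ones the rule's actions would apply to.

```python
from urllib.parse import urlparse

# Illustrative condition helpers; names and structure are assumptions,
# not the prototype's actual API.
def host_is(host):
    return lambda uri: urlparse(uri).netloc == host

def has_query(uri):
    return bool(urlparse(uri).query)

class Rule:
    """A self-contained rule: it matches a URI only if every condition holds."""
    def __init__(self, conditions):
        self.conditions = conditions

    def matches(self, uri):
        return all(condition(uri) for condition in self.conditions)

# A rule that picks out forum URIs carrying a query string.
rule = Rule([host_is("forum.example.com"), has_query])

uris = [
    "http://forum.example.com/viewtopic.php?t=42&view=newposts",
    "http://forum.example.com/index.php",
    "http://www.example.com/about.html",
]

for uri in uris:
    # Matched URIs correspond to the highlighted rows; in a working build the
    # rule's actions (exclude, rewrite, etc.) would then be applied to them.
    print(uri, "-> matched" if rule.matches(uri) else "-> not matched")
```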
It is currently my intention that this rule engine will control all aspects of crawling in WebCopy 2.0, be it simple exclusion of URIs, manipulation of query strings, or changes to the downloaded HTML. Of course, the GUI might not look like you are configuring rules of this nature, but behind the scenes that's what it will be doing.
One of the great things about this type of system is that each rule is self-contained, making it so much easier to test, which should make for a more stable product from the start, without daft mistakes.
My final comments are regarding query strings, and the animation below should nicely illustrate how that is being planned so far.
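The gist, as planned so far, is being able to strip or keep individual query string parameters rather than discarding the whole query string. A rough Python sketch of that kind of rewrite (the parameter names are invented for illustration):

```python
from urllib.parse import urlparse, urlencode, parse_qsl, urlunparse

def strip_parameters(uri, unwanted):
    """Remove selected keys from a URI's query string, keeping the rest."""
    parts = urlparse(uri)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in unwanted]
    return urlunparse(parts._replace(query=urlencode(kept)))

# Hypothetical forum URI: drop the session and sort parameters, keep the topic id.
uri = "http://forum.example.com/viewtopic.php?t=42&sid=abc123&sort=newest"
print(strip_parameters(uri, {"sid", "sort"}))
# http://forum.example.com/viewtopic.php?t=42
```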
As I mentioned in the block quote at the start of this post, nothing is final, not the functionality, nor the types of rules that will be available, nor how the GUI will work - but I'll touch upon the GUI specifically in another post.
Update History
- 2014-03-27 - First published
- 2020-11-23 - Updated formatting
Like what you're reading? Perhaps you'd like to buy us a coffee?