Introducing WebCopy 1.8
It's been over two months since the last CI build of WebCopy was made, and during this time I've been working quite hard on some major internal refactoring and adding a long-requested feature. I hope it's worth the wait, I need a break!
WebCopy 1.8 nightly builds are now available for download, and this series of posts will describe some of the changes and new functionality in the software. This first post will cover a grab bag of smaller changes.
Files saved with WebCopy 1.8 are not compatible with older versions. Please ensure you have a backup of WebCopy projects so you can revert to an older version of WebCopy if required.
The most obvious change is that when you open WebCopy the Rules List now has three columns rather than two. Although I wrote back in 2014 about WebCopy 2.0 and a more extensive rule system, the simple fact is WebCopy has 10 years of bug fixes and improvements to the core engine and while there is work to do, I'm not in a hurry to rewrite it. However, the rules system could do with some TLC to avoid users having to become regular expression wizards.
I've added the ability for you to choose which part of the URI is used by the rule. By default this is the path and query string (same as previous versions), but you can also tell it to look at just the path, or just the query, or you could include the domain too. As a result of this change the Use full URL flag has been removed. Any rules with this flag currently set will be automatically upgraded to new settings.
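To illustrate the idea, here is a minimal Python sketch of a rule that is compared against a chosen part of the URI. It is not WebCopy's actual code: the function names and the part names (`path`, `query`, `path_and_query`, `full`) are assumptions made up for this example, and `full` simply stands in for "include the domain too".

```python
import re
from urllib.parse import urlsplit


def rule_target(url, part="path_and_query"):
    """Return the piece of the URL a rule is compared against.

    The part names are hypothetical labels for this sketch only.
    """
    pieces = urlsplit(url)
    if part == "path":
        return pieces.path
    if part == "query":
        return pieces.query
    if part == "path_and_query":
        # Only append the "?" when a query string is actually present.
        return pieces.path + ("?" + pieces.query if pieces.query else "")
    if part == "full":
        # Assumption: including the domain means matching the whole URL.
        return url
    raise ValueError("unknown URI part: " + part)


def rule_matches(pattern, url, part="path_and_query"):
    """True if the regular expression is found in the selected URI part."""
    return re.search(pattern, rule_target(url, part)) is not None
```

With a rule such as `example\.com`, the default setting would never match because the domain isn't part of the text being tested, whereas the `full` setting would.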
I've also added the ability for rules to run against the content type of a given URL. This will make it much simpler to scan a website and only keep images or other resources, rather than having to set up a rule that tries to check URL extensions.
Content type rules currently only work if HEAD checking is enabled (this is the default). If this option is not set, the rules probably won't work. I'll address that in a future 1.8 nightly.
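As a rough sketch of what a content type rule involves, the snippet below compares the value of a `Content-Type` response header (as a HEAD request would return it) against a wildcard pattern such as `image/*`. The function names and the wildcard syntax are assumptions for illustration, not how WebCopy itself expresses these rules.

```python
from fnmatch import fnmatchcase


def media_type(content_type_header):
    """Strip parameters such as '; charset=utf-8' and normalise case."""
    return content_type_header.split(";", 1)[0].strip().lower()


def content_type_matches(content_type_header, pattern):
    """True if the media type matches a shell-style pattern like 'image/*'."""
    return fnmatchcase(media_type(content_type_header), pattern.lower())
```

A rule like this avoids guessing from file extensions, which matters for URLs such as `/download?id=42` that give no hint about what they return.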
A new UI option has been added allowing you to control the scan depth. This option only applies if the domain matches the primary domain being copied.
You can now choose to only include files that are directly linked to the source page being copied, or within a certain "click distance".
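One way to think about "click distance" is as a breadth-first walk of the link graph that stops following links once the limit is reached. The sketch below assumes the links are already known and held in a plain dictionary; the function name and data shape are made up for the example, and a real crawler would discover links as it goes.

```python
from collections import deque


def within_click_distance(links, start, max_distance):
    """Map each page reachable within max_distance clicks to its distance.

    links: dict of page -> list of pages it links to (hypothetical input).
    """
    distances = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if distances[page] == max_distance:
            # Pages at the limit are kept, but their links are not followed.
            continue
        for target in links.get(page, ()):
            if target not in distances:
                distances[target] = distances[page] + 1
                queue.append(target)
    return distances
```

A limit of 1 keeps only the files linked directly from the source page; larger values widen the net one click at a time.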
Cookie name/value pairs can now be set for each project.
The UI where cookies are displayed has also had various usability tweaks.
WebCopy setup now uses InnoSetup version 6. This should make it easier to add local installs, as Setup currently requires power user access to install. This isn't available yet, however, as I'm still looking into resolving the last few hurdles preventing me from enabling the "portable" zips.
Last but not least, some technical changes to make things run a little smoother.
The code WebCopy uses to build site maps is 10 years old at this point, being the same code used by Sitemap Creator. Saying that it isn't very efficient is quite the understatement. In addition, it doesn't support loading portions on demand, so in order for WebCopy to display a sitemap it has to build the entire map upfront, and this results in out of memory exceptions for some users. For WebCopy 1.8, this has been replaced with a simpler system that "walks" the sitemap, creating only the bits needed at that point. Expand a branch and only the immediate children are loaded, which should mean that the initial load time of the tree is substantially faster and memory requirements are much lower.
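The load-on-demand idea can be sketched as a tree node that only works out its immediate children when it is expanded. This is an illustration in Python rather than WebCopy's real implementation; the `LazyNode` class and the flat list of paths it reads from are assumptions for the example.

```python
class LazyNode:
    """A sitemap node whose children are created only when expanded."""

    def __init__(self, path, all_paths):
        self.path = path              # folders end with "/", e.g. "/docs/"
        self._all_paths = all_paths   # flat list of every known path
        self._children = None         # None means "not loaded yet"

    @property
    def loaded(self):
        return self._children is not None

    def expand(self):
        """Build (once) and return only the immediate children."""
        if self._children is None:
            names = set()
            for candidate in self._all_paths:
                if candidate.startswith(self.path) and candidate != self.path:
                    rest = candidate[len(self.path):]
                    # Keep the first segment only; a trailing "/" marks a folder.
                    head, sep, _ = rest.partition("/")
                    names.add(head + sep)
            self._children = [
                LazyNode(self.path + name, self._all_paths)
                for name in sorted(names)
            ]
        return self._children
```

Expanding the root creates nodes for the top level only; deeper branches stay unbuilt until someone opens them, which is where the memory saving comes from.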
The sitemap and diagram extensions have also been updated to use this new code. While both of these still need to scan the entire map, the fact it isn't building this map out of inefficient string operations means these shouldn't cause a crash either, although this doesn't solve all problems with the diagram extension.
I still need to do performance profiling, but I'm quite happy with the functionality thus far and am confident that this change will wipe out a great deal of the out of memory exceptions that users experience. The old code isn't used at all now by WebCopy (Sitemap Creator still uses it at the moment).
On a related note, the way links are identified has changed too, resulting in a nice reduction in the memory required for large projects. Unfortunately, as a result of these changes, older versions of WebCopy won't be able to open project files saved with WebCopy 1.8.