A new beta version of WebCopy has been released, containing a range of features and bug fixes.
If you're finding WebCopy useful, please donate to keep the project alive.
Custom Attributes
If you have tried to copy a responsive website with WebCopy, you may have found that it missed images referenced in ways it doesn't detect by default, such as via custom data attributes.
With WebCopy 1.1.1 or higher, you can now define your own HTML crawler expressions. It sounds complicated, but it's not - in its simplest form, you just enter the name of an attribute, and WebCopy will check for that attribute on any HTML element.
For example, consider an HTML fragment that references both background3.png and background1.png. By default, WebCopy will only find background1.png, as it is on the standard src attribute. Once you add data-original to the custom attributes list, WebCopy will find and process background3.png too.
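To make the idea concrete, here is a minimal Python sketch of attribute-name matching. It is not WebCopy's code; the AttributeCollector class, the find_resources helper and the markup are all assumptions made up for illustration, with only the two file names taken from the example.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collect the values of the named attributes from any element."""

    def __init__(self, attribute_names):
        super().__init__()
        # HTMLParser lower-cases attribute names, so normalise ours as well.
        self.attribute_names = {name.lower() for name in attribute_names}
        self.found = []

    def handle_starttag(self, tag, attrs):
        # The element name is deliberately ignored: any element can match.
        for name, value in attrs:
            if name in self.attribute_names and value:
                self.found.append(value)


def find_resources(html, attribute_names):
    collector = AttributeCollector(attribute_names)
    collector.feed(html)
    collector.close()
    return collector.found


# Hypothetical markup; only the file names come from the post.
FRAGMENT = '<img src="background1.png" data-original="background3.png">'

default_hits = find_resources(FRAGMENT, ["src"])
custom_hits = find_resources(FRAGMENT, ["src", "data-original"])
print(default_hits)  # ['background1.png']
print(custom_hits)   # ['background1.png', 'background3.png']
```

With only src in the list the second image is missed; adding data-original picks it up.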
In addition to this basic form, you can also write more advanced expressions by entering XPath statements. Continuing the example, if you add //img/@data-original as a custom attribute, then WebCopy will only look at attributes named data-original that belong to img elements.
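The difference between the two forms can be sketched with Python's standard library. This is an approximation under stated assumptions: the markup is hypothetical and must be well-formed because ElementTree is an XML parser, and its XPath subset cannot select attribute nodes directly, so //img/@data-original is emulated by selecting img elements that carry the attribute and then reading its value.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup: a div and an img both carry the attribute.
FRAGMENT = """
<div data-original="ignored.png">
  <img src="background1.png" data-original="background3.png" />
</div>
"""

root = ET.fromstring(FRAGMENT)

# Emulates //img/@data-original: only img elements are considered.
img_only = [
    img.get("data-original")
    for img in root.iterfind(".//img[@data-original]")
]

# Emulates the plain attribute-name form: any element may match.
any_element = [
    el.get("data-original")
    for el in root.iter()
    if el.get("data-original")
]

print(img_only)     # ['background3.png']
print(any_element)  # ['ignored.png', 'background3.png']
```

The XPath form is the one to reach for when the same attribute name appears on elements you don't want crawled.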
Hopefully this approach strikes a nice balance between being easy for most users and flexible enough for the power user.
srcset Support
Staying with the responsive theme, WebCopy now supports the srcset attribute. This attribute allows you to specify multiple images for a single img element, and the browser will choose the most appropriate one.
When a single img tag references five different image files in this way, WebCopy will now detect and process all five images.
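As a rough illustration of why srcset needs special handling, the sketch below splits one attribute value into its individual image URLs. The parse_srcset function and the sample value are assumptions, not WebCopy's implementation, and the comma split is deliberately naive: the parsing algorithm in the HTML specification is more involved, because URLs may themselves contain commas.

```python
def parse_srcset(value):
    """Return the URLs in a srcset value, dropping width/density descriptors."""
    urls = []
    for candidate in value.split(","):
        parts = candidate.split()
        if parts:
            # The first token is the URL; anything after it is a descriptor
            # such as "640w" or "2x".
            urls.append(parts[0])
    return urls


# Hypothetical attribute value with three candidates.
SRCSET = "small.png 320w, medium.png 640w, large.png 1024w"

urls = parse_srcset(SRCSET)
print(urls)  # ['small.png', 'medium.png', 'large.png']
```

A crawler would then queue every returned URL, rather than treating the whole attribute value as a single link.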
The WebCopy demo site has been updated to include custom attributes and srcset attributes.
"Multiple Choices" Status Code Support
The 300 status code is often used by Linux-based web servers (such as Apache) as a user-friendly 404 - if you try to access a URL that doesn't exist but almost matches other URLs, the web server will return the list of matches. I don't think I've ever seen an IIS website return 300; it always seems to just return a 404.
Previously WebCopy ignored this status code - it knew the response was a redirect, but couldn't do anything with it as the response didn't include a Location header. Now, WebCopy will download the body containing the list of URLs and crawl each of these.
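The behaviour described above might look something like the following Python sketch. Everything here is an assumption for illustration: the urls_to_crawl function, the LinkCollector class, the shape of the headers dictionary and the sample body are not taken from WebCopy, and a real server's list of matches may be formatted differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from anchor elements."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def urls_to_crawl(status, headers, body, request_url):
    """Decide what to crawl next for a redirect-class response."""
    location = headers.get("Location")
    if location:
        # An ordinary redirect: follow the single target.
        return [urljoin(request_url, location)]
    if status == 300:
        # No Location header: fall back to the links listed in the body.
        collector = LinkCollector()
        collector.feed(body)
        collector.close()
        return [urljoin(request_url, link) for link in collector.links]
    return []


# Hypothetical 300 response body listing two near matches.
BODY = '<ul><li><a href="/about.html">about.html</a></li><li><a href="/about.htm">about.htm</a></li></ul>'

matches = urls_to_crawl(300, {}, BODY, "http://example.com/abuot.html")
print(matches)  # ['http://example.com/about.html', 'http://example.com/about.htm']
```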
I doubt this feature will see much real world use, but you never know!
Performance Improvements and Bug Fixes
Although I don't go out of my way to profile WebCopy for performance (most of the time will be spent downloading files after all), I do keep an eye out for areas that could do with improvement. While adding support for srcset (which is fairly unique in terms of HTML attributes as it lets you specify multiple values in a single attribute), I refactored the crawling code and got a small performance improvement.
No WebCopy update would be complete without a bug fix or 10, and so we have a number of fixes implemented, including (finally) a fix for a bug which could leave a project unreadable by WebCopy. A full list of corrections can be found in the release notes.
And next?
Even though all the tests (old and new) pass, I'm still classing this as a beta build because of the changes to crawling, the reading and writing of multi-value attributes, and the other new features - there's bound to be some edge case I haven't come across yet.
The next update is going to (finally!) tackle WebCopy's woeful support for query strings and vastly improve it, which may then make it possible to copy forums more easily.
However, I stress again that we need your support - if you're using WebCopy and it is useful to you, please donate!
Update History
- 2015-09-12 - First published
- 2020-11-23 - Updated formatting