A new beta version of WebCopy has been released, containing a range of features and bug fixes.
If you're finding WebCopy useful, please donate to keep the project alive.
If you tried to copy a responsive website with WebCopy, it may have missed images that were referenced in ways WebCopy doesn't detect by default, such as in custom data attributes.
With WebCopy 1.1.1 or higher, you can now define your own HTML crawler expressions. It sounds complicated, but it's not - in the simplest fashion, you can just enter the name of an attribute, and WebCopy will check for any match on any HTML element.
For example, consider the HTML fragment below, which references background3.png and background1.png.
By default, WebCopy will only find background1.png, as it is on the standard src attribute. Once you add data-original to the custom attributes list, WebCopy will find and process background3.png too.
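To make the idea concrete, here is a minimal Python sketch of attribute-name matching. It is not WebCopy's implementation; the fragment, the `AttributeCollector` class and the `collect` helper are all hypothetical names made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical fragment (not the markup from this post): the image's real
# source is held in a custom data-original attribute.
FRAGMENT = """
<div>
  <img src="background1.png" data-original="background3.png">
</div>
"""


class AttributeCollector(HTMLParser):
    """Collects the values of the named attributes from any HTML element."""

    def __init__(self, attribute_names):
        super().__init__()
        self.attribute_names = set(attribute_names)
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lower-cased.
        for name, value in attrs:
            if name in self.attribute_names and value:
                self.found.append(value)


def collect(html, attribute_names):
    """Return every value of the given attributes, in document order."""
    parser = AttributeCollector(attribute_names)
    parser.feed(html)
    parser.close()
    return parser.found


# Only the standard attribute: finds background1.png alone.
default_urls = collect(FRAGMENT, ["src"])

# With data-original added as a custom attribute: finds both images.
custom_urls = collect(FRAGMENT, ["src", "data-original"])
```

The point of the sketch is that the check ignores the element name entirely, so any element carrying the attribute is picked up.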
In addition to this basic form, you can also write more advanced expressions by entering XPath statements. Continuing the above example, if you add //img/@data-original as a custom attribute, then WebCopy will only look at attributes named data-original that belong to img elements.
Hopefully this approach strikes a nice balance between something easy for users, and then something for the power user.
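The difference between a bare attribute name and an XPath expression can be sketched as follows. This is an assumption-laden illustration rather than WebCopy's code: Python's standard `xml.etree.ElementTree` only supports a subset of XPath and cannot select attributes directly, so the sketch selects img elements that carry the attribute and then reads it, which has the same effect as //img/@data-original. The fragment is hypothetical and has to be well-formed XML for this parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment: data-original appears on an img
# element and on a span element.
FRAGMENT = """
<div>
  <img src="background1.png" data-original="background3.png" />
  <span data-original="ignored.png">not an image</span>
</div>
"""

root = ET.fromstring(FRAGMENT)

# Same effect as //img/@data-original: only img elements are considered.
img_only = [
    img.get("data-original")
    for img in root.findall(".//img[@data-original]")
]

# Same effect as entering just the attribute name: any element matches.
any_element = [
    element.get("data-original")
    for element in root.iter()
    if element.get("data-original")
]
```

Here `img_only` holds just the value from the img element, while `any_element` also includes the value from the span.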
Staying with the responsive theme, WebCopy now supports the srcset attribute. This attribute allows you to specify multiple images for a single img element, and the browser will choose the most appropriate one.
In the above example, that single img tag references five different image files. WebCopy will now detect and process all of them.
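As a rough sketch of what reading a multi-value attribute involves, the Python below splits a srcset value into its candidate URLs. It is a simplification, not WebCopy's parser: it assumes no URL contains a comma (the full algorithm in the HTML specification handles that case), and the function names and file names are hypothetical.

```python
def parse_srcset(value):
    """Return the URLs in a srcset value, dropping width/density descriptors.

    Simplified: assumes candidates are comma separated and that no URL
    itself contains a comma.
    """
    urls = []
    for candidate in value.split(","):
        parts = candidate.split()
        if parts:
            # First token is the URL; anything after it is a descriptor
            # such as "480w" or "2x".
            urls.append(parts[0])
    return urls


def image_urls(src, srcset):
    """Return every distinct file referenced by one img element."""
    urls = [src] if src else []
    for url in parse_srcset(srcset or ""):
        if url not in urls:
            urls.append(url)
    return urls


# Hypothetical img element: one src plus four srcset candidates.
urls = image_urls(
    "banner.png",
    "banner-480.png 480w, banner-768.png 768w, "
    "banner-1024.png 1024w, banner-1600.png 1600w",
)
```

With these made-up values, the single element yields five files to download: the src fallback plus the four srcset candidates.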
The WebCopy demo site has been updated to include custom attributes and srcset attributes.
The 300 (Multiple Choices) status code is often used by Linux-based web servers (such as Apache) as a user-friendly 404 - if you try to access a specific URL that doesn't exist, but it almost matches other URLs, the web server will return the list of matches. I don't think I've ever seen an IIS website return 300; it always seems to just return 404.
Previously, WebCopy ignored this status code - it knew the response was a redirect, but couldn't do anything with it as the response didn't include a Location header. Now, WebCopy will download the body containing the list of URLs and crawl each of these.
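The behaviour described above could be sketched along these lines in Python. This is only an illustration under stated assumptions, not WebCopy's actual logic: `LinkCollector` and `urls_from_redirect` are hypothetical names, the headers are assumed to be a plain dictionary keyed by "Location", and the list of matches is assumed to be ordinary anchor tags in an HTML body.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects href values from anchor tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def urls_from_redirect(status, headers, body, base_url):
    """Return the URLs a crawler could follow for a 3xx response."""
    if not 300 <= status < 400:
        return []

    # Usual case: the redirect target is named in the Location header.
    location = headers.get("Location")
    if location:
        return [urljoin(base_url, location)]

    # 300 without a Location header: fall back to the links in the body.
    if status == 300:
        collector = LinkCollector()
        collector.feed(body)
        collector.close()
        return [urljoin(base_url, href) for href in collector.hrefs]

    return []


# Hypothetical 300 response body listing two near matches.
BODY = '<ul><li><a href="/page1.html">page1</a></li><li><a href="page2.html">page2</a></li></ul>'
matches = urls_from_redirect(300, {}, BODY, "http://example.com/docs/pag.html")
```

Relative links are resolved against the URL that was requested, so both absolute-path and document-relative matches end up as crawlable URLs.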
I doubt this feature will see much real world use, but you never know!
Although I don't go out of my way to profile WebCopy for performance (most of the time will be spent downloading files after all), I do keep an eye out for areas that could do with improvement. While adding support for srcset (which is fairly unique in terms of HTML attributes as it lets you specify multiple values in a single attribute), I refactored the crawling code and got a small performance improvement.
No WebCopy update would be complete without a bug fix or 10, and so we have a number of fixes implemented, including (finally) a fix for a bug which could leave a project unreadable by WebCopy. A full list of corrections can be found in the release notes.
Even though all the tests (old and new) pass, due to the changes to crawling, multi-value attribute reading and writing, and the other new features, I'm still classing this as a beta build - there's bound to be some edge case I haven't come across yet.
The next update is going to (finally!) tackle WebCopy's woeful support for query strings and vastly improve that, perhaps then making it possible to copy forums more easily.
However, I stress again that we need your support - if you're using WebCopy and it is useful to you, please donate!
- 2015-09-12 - First published
- 2020-11-23 - Updated formatting