A new beta version of WebCopy has been released, containing a range of features and bug fixes.

If you're finding WebCopy useful, please donate to keep the project alive.

Custom Attributes

If you have tried to use WebCopy to copy a responsive website, you may have found that it didn't pick up images referenced in ways it doesn't detect by default, such as via custom data attributes.

With WebCopy 1.1.1 or higher, you can now define your own HTML crawler expressions. It sounds complicated, but it's not - in the simplest form, you just enter the name of an attribute, and WebCopy will look for that attribute on every HTML element.

For example, consider the HTML fragment below, which references background3.png and background1.png.

html
<img data-original="/assets/img/background3.png" src="https://www.cyotek.com/assets/img/background1.png" alt="Background" style="width: 100%;"/>

By default, WebCopy will only find background1.png as it is on the standard src attribute. By simply adding data-original to the custom attributes list, WebCopy will now find and process background3.png too.

In addition to this basic form, you can also do more advanced expressions by entering XPath statements. Continuing the above example, if you add //img/@data-original as a custom attribute, then WebCopy will only look at attributes named data-original that belong to img elements.
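To make the difference between the two forms concrete, here is a minimal Python sketch. It is not WebCopy's implementation and it does not evaluate real XPath; it only mimics the two behaviours described above using the standard library's `html.parser`. The `find_values` helper, the `CustomAttributeFinder` class and the extra `div` with `banner.png` are assumptions made up for this illustration.

```python
from html.parser import HTMLParser


class CustomAttributeFinder(HTMLParser):
    """Collects values of one attribute, optionally limited to one tag."""

    def __init__(self, attribute, tag=None):
        super().__init__()
        self.attribute = attribute
        self.tag = tag  # None means "any element"
        self.values = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />
        if self.tag is not None and tag != self.tag:
            return
        for name, value in attrs:
            if name == self.attribute and value:
                self.values.append(value)


def find_values(html, attribute, tag=None):
    """Hypothetical helper: return every matching attribute value."""
    finder = CustomAttributeFinder(attribute, tag)
    finder.feed(html)
    finder.close()
    return finder.values


# The img is from the example above; the div is an invented extra element.
HTML = (
    '<div data-original="/assets/img/banner.png"></div>'
    '<img data-original="/assets/img/background3.png" '
    'src="https://www.cyotek.com/assets/img/background1.png" alt="Background"/>'
)

# Bare attribute name: matches on any element.
print(find_values(HTML, "data-original"))
# Similar in spirit to //img/@data-original: img elements only.
print(find_values(HTML, "data-original", tag="img"))
```

The first call returns both values, while the second returns only the one that belongs to the `img` element.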

Hopefully this approach strikes a nice balance between something simple for most users and something more capable for power users.

srcset Support

Staying with the responsive theme, WebCopy now supports the srcset attribute. This attribute allows you to specify multiple images for a single img element, and the browser will choose the most appropriate one.

html
<img style="width: 400px; height: 400px;"
     src="https://www.cyotek.com/image-src.png" 
     srcset="image-1x.png 1x, image-2x.png 2x, image-3x.png 3x, image-4x.png 4x"
/>

In the above example, that single img tag references five different image files. WebCopy will now detect and process all five images.
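As a rough illustration of where the five files come from, the Python sketch below splits a srcset value into its candidates and resolves each one against the page address. It is a simplified reading of the attribute, not WebCopy's code: `srcset_urls` is a hypothetical helper, `PAGE_URL` is an assumed page location, and the naive comma split would not cope with URLs that themselves contain commas.

```python
from urllib.parse import urljoin


def srcset_urls(srcset, base_url):
    """Return the absolute URL of each candidate in a srcset value."""
    urls = []
    for candidate in srcset.split(","):
        # Each candidate is "url descriptor", e.g. "image-2x.png 2x"
        parts = candidate.split()
        if parts:  # skip empty candidates such as a trailing comma
            urls.append(urljoin(base_url, parts[0]))
    return urls


PAGE_URL = "https://www.cyotek.com/"  # assumed location of the page
SRC = "https://www.cyotek.com/image-src.png"
SRCSET = "image-1x.png 1x, image-2x.png 2x, image-3x.png 3x, image-4x.png 4x"

# The src image plus the four srcset candidates
images = [SRC] + srcset_urls(SRCSET, PAGE_URL)
for url in images:
    print(url)
```

The resulting list holds the src image followed by the four srcset candidates, each as an absolute URL.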

The WebCopy demo site has been updated to include custom attributes and srcset attributes.

"Multiple Choices" Status Code Support

The 300 status code is often used by Linux-based web servers (such as Apache) as a user-friendly 404 - if you try to access a URL that doesn't exist but almost matches other URLs, the web server returns the list of matches. I don't think I've ever seen an IIS website return 300; it always seems to just return a 404.

Previously WebCopy ignored this status code - it knew it was a redirect, but couldn't do anything with it as the response didn't include a Location header. Now, WebCopy will download the body containing the list of URLs and crawl each of them.
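The general idea can be sketched in a few lines of Python. This is only an assumption about how such handling might look, not a description of WebCopy's internals: `urls_to_crawl` and `LinkCollector` are hypothetical names, the response body is an invented example of the kind of list a server might send, and the headers are passed as a plain dictionary for simplicity.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every anchor in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def urls_to_crawl(status, headers, body, request_url):
    """Hypothetical helper: decide which URLs to follow for a response."""
    location = headers.get("Location")
    if location:
        # An ordinary redirect: follow the single target
        return [urljoin(request_url, location)]
    if status == 300:
        # No Location header: fall back to the links listed in the body
        collector = LinkCollector()
        collector.feed(body)
        collector.close()
        return [urljoin(request_url, href) for href in collector.hrefs]
    return []


# Invented response body for illustration only
BODY = (
    "<h1>Multiple Choices</h1><ul>"
    '<li><a href="/products/webcopy.html">/products/webcopy.html</a></li>'
    '<li><a href="/products/webcopy2.html">/products/webcopy2.html</a></li>'
    "</ul>"
)

print(urls_to_crawl(300, {}, BODY, "https://example.com/products/webcpy.html"))
```

A response that carries a Location header is followed as a normal redirect; only a 300 without one falls back to the links in the body.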

I doubt this feature will see much real world use, but you never know!

Performance Improvements and Bug Fixes

Although I don't go out of my way to profile WebCopy for performance (most of the time will be spent downloading files after all), I do keep an eye out for areas that could do with improvement. While adding support for srcset (which is fairly unique in terms of HTML attributes as it lets you specify multiple values in a single attribute), I refactored the crawling code and got a small performance improvement.

No WebCopy update would be complete without a bug fix or 10, and so we have a number of fixes implemented, including (finally) a fix for a bug which could leave a project unreadable by WebCopy. A full list of corrections can be found in the release notes.

And next?

Even though all the tests (old and new) pass, I'm still classing this as a beta build due to the changes to crawling, multi-value attribute reading and writing, and the other new features - there's bound to be some edge case I haven't come across yet.

The next update is going to (finally!) tackle WebCopy's woeful support for query strings and vastly improve it, which may in turn make it possible to copy forums more easily.

However, I stress again that we need your support - if you're using WebCopy and it is useful to you, please donate!

Update History

  • 2015-09-12 - First published
  • 2020-11-23 - Updated formatting
