A new beta of WebCopy has been released, one I hope will resolve some of the plethora of uninstall feedback I've been receiving recently relating to users having problems logging into web sites using WebCopy.

Form Posting Fixes

When WebCopy posts a form, it downloads the existing form, pulls out all the values and merges these with the specific values the user has configured. That way, you can easily post login forms that include hidden parameters containing anti-forgery tokens or other metadata that may be generated dynamically.

Unfortunately, the original implementation of this only supported input elements, and to add to the tale of woe, the unit tests for this feature only contained matching elements. It was only by chance, while I was testing a new demo added to our demonstration website, that the omission was noticed. WebCopy now correctly detects input, output, select, textarea, object and button elements, while excluding reset, button and image input types.
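To make the merge step concrete, here is a minimal sketch in Python. It is not WebCopy's own code; the names `FormFieldParser` and `build_post_data` are made up for illustration, and it only handles input, select and textarea elements, assuming every option carries a value attribute.

```python
from html.parser import HTMLParser

# Input types that are skipped, matching the exclusions described above.
EXCLUDED_INPUT_TYPES = {"reset", "button", "image"}


class FormFieldParser(HTMLParser):
    """Collects name/value pairs from the form controls in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._select = None    # name of the select currently being read
        self._textarea = None  # name of the textarea currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("name")
        if tag == "input" and name:
            input_type = (attrs.get("type") or "text").lower()
            if input_type in EXCLUDED_INPUT_TYPES:
                return
            # Unchecked checkboxes and radio buttons are not submitted.
            if input_type in ("checkbox", "radio") and "checked" not in attrs:
                return
            self.fields[name] = attrs.get("value") or ""
        elif tag == "select" and name:
            self._select = name
        elif tag == "option" and self._select:
            # The first option is the default; a selected option overrides it.
            if "selected" in attrs or self._select not in self.fields:
                self.fields[self._select] = attrs.get("value") or ""
        elif tag == "textarea" and name:
            self._textarea = name
            self.fields[name] = ""

    def handle_data(self, data):
        if self._textarea:
            self.fields[self._textarea] += data

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None
        elif tag == "textarea":
            self._textarea = None


def build_post_data(form_html, user_values):
    """Merge the values found in the downloaded form with user-configured ones."""
    parser = FormFieldParser()
    parser.feed(form_html)
    parser.close()
    merged = dict(parser.fields)
    merged.update(user_values)  # user-configured values take priority
    return merged
```

With a login form containing a hidden token field, calling `build_post_data(html, {"user": "alice", "pass": "secret"})` would keep the token from the page and overlay the two configured values on top.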

If you come across a website that uses basic form posting and you still can't log into it, please let us know the URL of the login form so we can investigate. Please do not send credentials for any website to Cyotek!

Capture Form Improvements

In addition to this, the Capture Form tool has had a slight overhaul. Previously, it used the HTML at the time the document was rendered. Now, it uses the current state of the document, allowing you to fill in form fields and have those values reflected in the form definition that WebCopy will create. Hopefully, this will spare users from having to guess which parameters may or may not be required, and lead to an easier experience.

Local File Remapping

Previous versions of WebCopy were somewhat inflexible in how local files were remapped, leading to the incredibly confusing situation where you could tell WebCopy to download a PHP website, and the local filenames would still have a .php extension instead of .html (remember that WebCopy cannot download raw source). As it relied on content types registered on the user's computer, it was also possible that it would use the wrong extension, or not be able to supply an extension at all.

WebCopy now includes an embedded MIME database of all registered types, courtesy of mime-db, so that any valid content type should be remapped appropriately. In addition, there's a new default setting where WebCopy will remap all downloaded files unless the content type is application/octet-stream (the type usually used for binary downloads like executable programs). Hopefully this too will address one of the poor initial impressions that WebCopy can give.
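The remapping rule can be sketched roughly as follows. This is an illustration in Python rather than WebCopy's implementation: the `EXTENSIONS` table is a tiny hand-written stand-in for the embedded database, and `remap_filename` and the "index" fallback name are assumptions made for the example.

```python
import posixpath
from urllib.parse import urlsplit

# Tiny hypothetical stand-in for an embedded MIME database.
EXTENSIONS = {
    "text/html": ".html",
    "text/css": ".css",
    "application/json": ".json",
    "image/png": ".png",
}


def remap_filename(url, content_type):
    """Pick a local filename whose extension matches the served content type."""
    name = posixpath.basename(urlsplit(url).path) or "index"  # assumed fallback
    # Drop parameters such as "; charset=utf-8" before looking up the type.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/octet-stream":
        return name  # binary downloads keep their original name
    extension = EXTENSIONS.get(media_type)
    if extension is None:
        return name  # unknown type: leave the name alone
    stem, _ = posixpath.splitext(name)
    return stem + extension


print(remap_filename("http://example.com/about.php", "text/html; charset=utf-8"))  # about.html
print(remap_filename("http://example.com/setup.exe", "application/octet-stream"))  # setup.exe
```

Because the lookup uses a table shipped with the program, the result does not depend on which content types happen to be registered on the machine running the crawl.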

HEAD Checking

Another poor experience users have had with WebCopy involves web sites that don't support the HEAD method. This is supposed to return the pertinent details of a resource, such as its type and size, without having to download the entire resource. This is a great feature for web crawlers, as it means they can quickly check whether they should download something without downloading content they don't need. For this reason, head checking is enabled by default in WebCopy projects.

If a server doesn't support the HEAD method for whatever reason, then WebCopy's default behaviour would be to skip processing that URL. Not a great experience if that was the entry point for the website and nothing was downloaded! WebCopy now tries to be a bit smarter and keeps a map of each domain it has visited during the current crawl. If a HEAD check fails for a given domain, WebCopy will proceed with a normal GET and disable head checking for that particular domain.
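The fallback can be pictured with a short sketch. Again, this is Python written for illustration and not WebCopy's code: `HeadCheckingFetcher`, the injected `send` transport and the `should_download` callback are hypothetical, and treating any status of 400 or above as a failed HEAD check is an assumption.

```python
from urllib.parse import urlsplit


class HeadCheckingFetcher:
    """Per-crawl fetcher that stops sending HEAD to hosts that reject it."""

    def __init__(self, send, should_download):
        # send(method, url) -> (status, headers, body); a hypothetical transport.
        self._send = send
        # should_download(headers) -> bool; decides from the HEAD response.
        self._should_download = should_download
        self._head_supported = {}  # host -> bool, kept for the current crawl only

    def fetch(self, url):
        host = urlsplit(url).netloc.lower()
        if self._head_supported.get(host, True):
            status, headers, _ = self._send("HEAD", url)
            if status >= 400:
                # Assumed failure rule: remember it and fall through to GET.
                self._head_supported[host] = False
            else:
                self._head_supported[host] = True
                if not self._should_download(headers):
                    return None  # HEAD says this resource is not wanted
        return self._send("GET", url)
```

The first URL on a host that rejects HEAD costs one extra request; every later URL on that host goes straight to GET, while other hosts keep the benefit of head checking.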

And more

In addition to the major user experience improvements above, there are quite a number of bug fixes and minor tweaks, details of which can be seen in the release notes.

We would be very grateful for continued feedback on WebCopy. Every little helps!

Update History

  • 2018-06-11 - First published
  • 2020-11-23 - Updated formatting
