A new beta of WebCopy has been released, one I hope will resolve some of the plethora of uninstall feedback I've been receiving recently relating to users having problems logging into web sites using WebCopy.
When WebCopy posts a form, it downloads the existing form, pulls out all the values and merges these with the specific values the user has configured. That way, you can easily post login forms that include hidden parameters containing anti-forgery tokens or other meta data that may be generated dynamically.
Unfortunately, the original implementation of this only considered input elements, and to add to the tale of woe, the unit tests for this feature only contained matching elements. It was only by chance, when I was testing a new demo added to our demonstration website, that the omission was noted. WebCopy now correctly detects button elements, while excluding image input types.
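To illustrate the merge described above, here is a minimal Python sketch. It is not WebCopy's actual implementation; the names FormFieldCollector and build_post_data are made up for this example, and it only looks at input and button elements, skipping image inputs, which is a simplification of what a real browser submits.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collects name/value pairs from input and button elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag not in ("input", "button"):
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        input_type = (attributes.get("type") or "").lower()
        # Controls without a name cannot be posted; image inputs are excluded.
        if not name or (tag == "input" and input_type == "image"):
            return
        self.fields[name] = attributes.get("value") or ""


def build_post_data(form_html, user_values):
    """Merge the values already present in the form with user-configured values."""
    collector = FormFieldCollector()
    collector.feed(form_html)
    merged = dict(collector.fields)
    # User-configured values win over whatever the downloaded form contained.
    merged.update(user_values)
    return merged
```

Given a login form containing a hidden anti-forgery token, calling build_post_data with just the user name and password would return those two values plus the token, ready to be posted.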
If you come across a website that uses basic form posting and you still can't log into it, please let us know the URL of the login form so we can investigate. Please do not send credentials for any website to Cyotek!
In addition to this, the Capture Form tool has had a slight overhaul. Previously, it used the HTML at the time the document was rendered. Now, it uses the current state of the document, allowing you to fill in form fields and have those values reflected in the form definition that WebCopy will create. Hopefully, this will help users avoid having to guess which parameters may or may not need to be filled in, and lead to an easier experience.
Previous versions of WebCopy were somewhat inflexible in how local files were remapped, leading to the incredibly confusing situation where you could tell WebCopy to download a PHP website, and the local filenames would still have a PHP extension instead of HTML (remember that WebCopy cannot download raw source). As it also relied on content types registered on the user's computer, it was also possible that it would use the wrong extension, or not be able to supply an extension at all.
WebCopy now includes an embedded mime database of all registered types, courtesy of mime-db, so that any valid content type should be remapped appropriately. In addition, there's a new default setting where WebCopy will remap all downloaded files unless the content type is application/octet-stream (the type usually used for binary downloads like executable programs). Hopefully this too will address one of the poor initial impressions that WebCopy can give.
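As a rough illustration of the remapping rule, the Python sketch below swaps a file's extension based on its content type and leaves application/octet-stream downloads alone. The lookup table is a tiny hand-written stand-in for the embedded database, and remap_filename is a hypothetical name, so treat this as an assumption about the general approach and not the real code.

```python
import posixpath

# Hypothetical excerpt standing in for the embedded mime database.
EXTENSIONS_BY_CONTENT_TYPE = {
    "text/html": ".html",
    "text/css": ".css",
    "image/png": ".png",
    "application/json": ".json",
}


def remap_filename(filename, content_type):
    """Return the local filename with an extension matching the content type."""
    # Drop parameters such as "; charset=utf-8" before looking up the type.
    media_type = content_type.split(";", 1)[0].strip().lower()
    # Binary downloads keep whatever name the server gave them.
    if media_type == "application/octet-stream":
        return filename
    extension = EXTENSIONS_BY_CONTENT_TYPE.get(media_type)
    if extension is None:
        # Unknown type: leave the filename untouched instead of guessing.
        return filename
    root, _ = posixpath.splitext(filename)
    return root + extension
```

With this table, a page served as text/html from index.php would be saved as index.html, while setup.exe served as application/octet-stream keeps its name.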
Another poor experience users have had with WebCopy involves web sites that don't support the HEAD method. This method is supposed to let a client retrieve the pertinent details of a resource, such as its type and size, without having to download the entire resource. This is a great feature for web crawlers, as it means they can quickly check whether they should download something without downloading content they don't need. For this reason, head checking is enabled by default in WebCopy projects.
If a server doesn't support the HEAD method for whatever reason, then WebCopy's default behaviour would be to skip processing that URL. Not a great experience if that was the entry point for the website and nothing was downloaded! WebCopy now tries to be a bit smarter and keeps a map of each domain it visited during the current crawl and, if a HEAD check fails for a given domain, WebCopy will proceed with a normal request and disable head checking for that particular domain.
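The per-domain fallback could look something like the Python sketch below. Everything here is an assumption made for illustration: the Crawler class, the send_request callback (which takes a method and a URL and returns a status code), and the choice to treat 405 and 501 responses as "HEAD not supported" are not taken from WebCopy itself.

```python
from urllib.parse import urlsplit

# Assumption: these statuses are treated as "the server does not support HEAD".
HEAD_UNSUPPORTED_STATUSES = (405, 501)


class Crawler:
    """Hypothetical crawler that remembers which domains reject HEAD requests."""

    def __init__(self, send_request):
        # send_request(method, url) -> HTTP status code (supplied by the caller).
        self._send_request = send_request
        self._head_supported = {}  # domain -> bool, for the current crawl only

    def fetch(self, url):
        """Return the GET status, or None if the HEAD check says to skip the URL."""
        domain = urlsplit(url).netloc.lower()
        if self._head_supported.get(domain, True):
            status = self._send_request("HEAD", url)
            if status in HEAD_UNSUPPORTED_STATUSES:
                # Stop sending HEAD requests to this domain and fall through.
                self._head_supported[domain] = False
            elif 200 <= status < 300:
                self._head_supported[domain] = True
            else:
                return None
        return self._send_request("GET", url)
```

The first URL on a domain that rejects HEAD costs two requests; every later URL on that domain goes straight to the normal request.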
In addition to the major user experience improvements above, there are quite a number of bug fixes and minor tweaks, details of which can be seen in the release notes.
We would be very grateful for continued feedback on WebCopy; every little helps!
- 2018-06-11 - First published
- 2020-11-23 - Updated formatting