One of the biggest sources of support requests for WebCopy is form posting, and WebCopy's long-standing inability to handle dynamic values. Thankfully, with WebCopy 1.0.10.0 this issue has finally been resolved, as we have introduced a variety of improvements to forms, including value merging and a new tool to capture form data.
How it used to work
So what was the problem? Well, consider a typical login form. The HTML (at the barest minimum) will be similar to the following fragment.
WebCopy handles this type of form very well.
Spammers are one of the scourges of the internet, and as a result forms often include dynamically generated tokens used to validate the form values, as demonstrated in the snippet below.
These tokens change (generally for each session), and therefore WebCopy has been completely unable to submit such forms. Even if you opened the form in your favourite browser, extracted the tokens, and pasted them into WebCopy, it was unlikely to work, as the browser and WebCopy would likely be treated as different sessions and have different values.
Forms often tend to set cookies now as well, so when you download the page containing the form, a validation cookie is saved. Then, when the page is submitted, the cookie is used to help with the validation process. WebCopy only did the POST action, but not the GET, and so no cookies would ever be set.
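To make the GET-then-POST sequence concrete, here is a minimal Python sketch using only the standard library. It is not WebCopy's code; the function names `build_cookie_opener` and `get_then_post` are made up for illustration, and it assumes the form fields are sent URL-encoded.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def build_cookie_opener():
    """Return an opener that stores cookies and resends them on later requests."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def get_then_post(opener, form_page_url, action_url, fields):
    """GET the page containing the form first, then POST the fields."""
    # Step 1: the GET. Any cookies the server sets end up in the opener's jar.
    page = opener.open(form_page_url)
    html = page.read()

    # Step 2: the POST, sent through the same opener so the cookies go with it.
    body = urllib.parse.urlencode(fields).encode("utf-8")
    request = urllib.request.Request(action_url, data=body)
    response = opener.open(request)
    return html, response
```

The important detail is that both requests go through the same opener. Posting with a fresh one would lose the validation cookie, which is essentially the old behaviour described above.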
How it works now
A new property, Merge Values, has been added to form definitions. This is enabled by default for new forms, but disabled for any existing projects (although of course you can turn it on).
When this property is set, WebCopy will automatically get the page containing the form, apply any cookies, then attempt to extract the form data. Any parameters that haven't been explicitly defined will be automatically merged and then the merged data will be posted.
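The merge step can be sketched roughly as follows. This is an illustration of the idea rather than WebCopy's implementation: `FormFieldParser` and `merge_form_values` are hypothetical names, and for brevity the sketch collects every `input` element on the page instead of limiting itself to a single form.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collect name/value pairs from the <input> elements in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            # A missing value attribute is treated as an empty string.
            self.fields[name] = attributes.get("value") or ""


def merge_form_values(page_html, defined_values):
    """Merge values found in the page with the explicitly defined ones."""
    parser = FormFieldParser()
    parser.feed(page_html)
    merged = dict(parser.fields)   # start with whatever the page supplied
    merged.update(defined_values)  # explicitly defined parameters always win
    return merged
```

With a page containing `username`, `password` and a hidden `token` field, calling `merge_form_values(page, {"username": "alice", "password": "secret"})` keeps the two defined values and picks up the token from the page.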
A simple correction, but one that should hopefully reduce the number of support requests, and disappointed users for that matter.
What happens if multiple forms are present?
Some pages may include multiple forms, for example searching, registration, etc. WebCopy tries to be smart about this - if multiple forms are present, it will try to match a single form whose action attribute matches the form URI. If it can't do that, then it tries to match a single form without an action attribute. And if that fails, it won't do anything.
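The matching rules above could be expressed along these lines. Again, this is only a sketch of my reading of those rules: `select_form` is a hypothetical helper, and it assumes that a page with exactly one form uses that form directly, and that relative action values are resolved against the page URI before comparison.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormActionParser(HTMLParser):
    """Record the action attribute (or None) of every <form> in a page."""

    def __init__(self):
        super().__init__()
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.actions.append(dict(attrs).get("action"))


def select_form(page_html, page_url, form_uri):
    """Return the index of the form to use, or None if the choice is unclear."""
    parser = FormActionParser()
    parser.feed(page_html)
    actions = parser.actions

    # Assumption: a lone form needs no matching at all.
    if len(actions) == 1:
        return 0

    # Rule 1: exactly one form whose resolved action equals the form URI.
    matches = [i for i, action in enumerate(actions)
               if action and urljoin(page_url, action) == form_uri]
    if len(matches) == 1:
        return matches[0]

    # Rule 2: exactly one form without an action attribute.
    missing = [i for i, action in enumerate(actions) if not action]
    if len(missing) == 1:
        return missing[0]

    # Otherwise give up rather than guess.
    return None
```

Returning `None` instead of picking the first form mirrors the "it won't do anything" behaviour: posting to the wrong form is worse than not posting at all.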
Note: You can use the Test URI tool to verify that your login form actually logs you in before doing a full copy or analysis.
There's also a third option - I added the ability for a form
definition to include an XPath query to select the FORM
element. However, as one of the other frequent criticisms of
WebCopy seems to be that the UI is too complicated, I have
chosen not to add it to the UI at this point in time. The only
way to specify the value currently is by direct editing of the
project file. I probably will not expose this option unless
users start reporting being unable to post to complicated pages
(in which case I'll try to make the auto detection cleverer) or
until 2.0 which ought to include simple / advanced display
modes.
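For anyone unfamiliar with XPath, the following sketch shows the general idea of picking one form out of several with a query. It says nothing about the project file format or the XPath dialect WebCopy accepts; it uses Python's `xml.etree.ElementTree`, which only supports a subset of XPath and needs well-formed markup, and the page content and `find_form` helper are invented for the example.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed page with two forms.
PAGE = """<html><body>
<form id="search" action="/search"><input name="q" /></form>
<form id="login" action="/login" method="post"><input name="user" /></form>
</body></html>"""


def find_form(page_xml, xpath):
    """Return the first element matching the query, or None if nothing matches."""
    root = ET.fromstring(page_xml)
    return root.find(xpath)


# Select the form by its id attribute instead of relying on auto detection.
form = find_form(PAGE, ".//form[@id='login']")
```

A query like this is unambiguous even when several forms share the same action, which is the situation the auto detection cannot resolve.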
That's nice, but how about helping me create the form
I agree that currently it's not exactly the easiest of tasks to create a form definition, as you have to know the URI and then the different (static) values to submit, which means you have to poke around in a site's HTML. OK, maybe some of the criticism about WebCopy's UI is justified.
For this reason, a rudimentary (and somewhat experimental) Capture Form tool has been added. This tool will display a window containing a web browser displaying the root page for the website you want to copy, and a list of detected forms.
Simply navigate to the login page, select the form to use, and then tick any parameters to include - i.e. user name and password fields, but not fields with odd values. (A future update will probably exclude the hidden fields automatically.)
Note that you don't have to fill in the form and submit the page as WebCopy is simply parsing the HTML that you have navigated to.
Then submit the dialog and you have a new form definition. Hopefully this will become a useful feature for users of the product!
Anything outstanding? (with form support!)
Currently the main outstanding bug is that it isn't possible to submit multi-line form values, because the UI only offers a single-line edit field. I don't really want to start breaking the UI up in 1.x; that is a task planned for 2.0.
I also don't want to keep the new Merge Values property hanging around for long - it's currently really only there so that existing projects behave the same way they always have. Once I'm certain the new code is performing as expected, this setting will be removed and all form posting will use the new behaviour.
It is possible that there will be issues with the new implementation, for example cases where the auto detection fails to identify the right form on pages with multiple forms.
This release also includes quite a few other bug fixes, and it is definitely recommended that all users upgrade to this version.
If you find any problems (or have any suggestions) please let us know!
Update History
- 2015-03-22 - First published
- 2020-11-23 - Updated formatting