One of the biggest sources of support requests for WebCopy is posting forms, and in particular WebCopy's long-standing inability to handle dynamic values. Thankfully, WebCopy 1.0.10.0 finally resolves this issue by introducing a variety of form improvements, including value merging and a new tool to capture form data.

How it used to work

So what was the problem? Well, consider a typical login form. The HTML (at the barest minimum) will be similar to the following fragment.

html
<form>
  Username:  <input type="text" name="username" />
  Password:  <input type="password" name="password" />
  <button type="submit">Login</button>
</form>

WebCopy handles this type of form very well.

Spammers are one of the scourges of the internet, and as a result forms often include dynamically generated tokens used to validate the submitted values, as demonstrated in the snippet below.

html
<form>
  Username:  <input type="text" name="username" />
  Password:  <input type="password" name="password" />
  <input name="__RequestVerificationToken" type="hidden" value="SVPMPZEjIMFD-Ne5cUm7IiMWKUSyHk3aU1mRGRJNvtmjkSfAusVyOgseFgVXdcZZdxlTHpdUJKIqKgwkqSgYUM7T8tDtCOZighFwIhgc_QW_Ccrr2_QaZ0jD9EVgYSZdVQlgaA2">
  <button type="submit">Login</button>
</form>

These tokens change (generally for each session), and therefore WebCopy has been completely unable to submit such forms. Even if you opened the form in your favourite browser, extracted the tokens, and pasted them into WebCopy, it was unlikely to work, as the browser and WebCopy would likely be treated as different sessions with different token values.

Forms often set cookies now as well: when you download the page containing the form, a validation cookie is saved, and when the form is submitted that cookie is used to help with the validation process. WebCopy only performed the POST action and not the preceding GET, so no cookies would ever be set.

How it works now

A new property, Merge Values, has been added to form definitions. This is enabled by default for new forms, but disabled for any existing projects (although of course you can turn it on).

When this property is set, WebCopy will automatically get the page containing the form, apply any cookies, then attempt to extract the form data. Any parameters that haven't been explicitly defined will be automatically merged and then the merged data will be posted.
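The merge step described above can be sketched roughly as follows. This is only an illustration in Python and not WebCopy's actual implementation: the names `InputCollector`, `merge_form_values` and `get_then_post` are hypothetical, and for brevity the sketch collects every input on the page rather than the inputs of one selected form.

```python
import http.cookiejar
import urllib.parse
import urllib.request
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collects name/value pairs from every <input> element on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <input ... /> are also routed here.
        if tag == "input":
            attributes = dict(attrs)
            name = attributes.get("name")
            if name:
                self.fields[name] = attributes.get("value") or ""


def merge_form_values(page_html, explicit_values):
    """Return the page's field values overlaid with the explicit ones."""
    collector = InputCollector()
    collector.feed(page_html)
    merged = dict(collector.fields)
    merged.update(explicit_values)  # explicitly defined values always win
    return merged


def get_then_post(form_page_uri, post_uri, explicit_values):
    """GET the form page (collecting cookies), then POST the merged data."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    with opener.open(form_page_uri) as response:
        page_html = response.read().decode("utf-8", errors="replace")
    merged = merge_form_values(page_html, explicit_values)
    body = urllib.parse.urlencode(merged).encode("ascii")
    return opener.open(post_uri, data=body)
```

Given the token form shown earlier, `merge_form_values` would keep `__RequestVerificationToken` as found on the freshly downloaded page, while `username` and `password` come from the explicitly defined values. Because the same cookie-aware opener is used for both requests, any validation cookie set by the GET is sent back with the POST.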

A simple correction, but one that should hopefully reduce the number of support requests, and disappointed users for that matter.

What happens if multiple forms are present?

Some pages may include multiple forms, for example searching, registration, etc. WebCopy tries to be smart about this - if multiple forms are present, it will try to match a single form whose action attribute matches the form URI. If it can't do that, then it tries to match a single form without an action attribute. And if that fails, it won't do anything.
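As a rough sketch of that matching order (in Python, and not WebCopy's actual code), assume each detected form is represented as a dictionary with an optional `action` key. The function name `select_form`, the dictionary shape, and the handling of a page with exactly one form are all assumptions made for illustration.

```python
from urllib.parse import urljoin


def select_form(forms, page_uri, form_uri):
    """Pick the form to post, or None if no unambiguous match exists."""
    if not forms:
        return None
    if len(forms) == 1:
        # Assumption: a lone form needs no disambiguation.
        return forms[0]

    # First attempt: exactly one form whose action resolves to the form URI.
    by_action = [
        form for form in forms
        if form.get("action") and urljoin(page_uri, form["action"]) == form_uri
    ]
    if len(by_action) == 1:
        return by_action[0]

    # Second attempt: exactly one form with no action attribute at all.
    without_action = [form for form in forms if not form.get("action")]
    if len(without_action) == 1:
        return without_action[0]

    # Ambiguous or no match: do nothing.
    return None
```

Relative actions are resolved against the page URI before comparison, so `/account/login` on `https://example.com/account/login` matches a form URI of `https://example.com/account/login`.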

Note: You can use the Test URI tool to verify that your login form actually logs you in before doing a full copy or analysis.

There's also a third option - I added the ability for a form definition to include an XPath query to select the FORM element. However, as one of the other frequent criticisms of WebCopy seems to be that the UI is too complicated, I have chosen not to add it to the UI at this point in time. The only way to specify the value currently is by direct editing of the project file. I probably will not expose this option unless users start reporting being unable to post to complicated pages (in which case I'll try to make the auto detection cleverer) or until 2.0 which ought to include simple / advanced display modes.
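To give a feel for what an XPath query selecting a FORM element looks like, here is a small Python sketch using the standard library. It is not taken from WebCopy: the page markup and the `id` values `search` and `login` are hypothetical, and the project file syntax for storing the query is deliberately not shown. Note that `xml.etree.ElementTree` only supports a subset of XPath and requires well-formed markup, which real-world HTML often is not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page containing two forms.
PAGE = """
<html>
  <body>
    <form id="search" action="/search">
      <input type="text" name="q" />
    </form>
    <form id="login" action="/account/login">
      <input type="text" name="username" />
      <input type="password" name="password" />
    </form>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# The query picks out one specific form regardless of how many are present.
login_form = root.find(".//form[@id='login']")
field_names = [field.get("name") for field in login_form.findall(".//input")]
print(field_names)  # ['username', 'password']
```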

That's nice, but how about helping me create the form

I agree that it's currently not the easiest of tasks to create a form definition, as you have to know the URI and then the different (static) values to submit, which means you have to poke around in a site's HTML. OK, maybe some of the criticism about WebCopy's UI is justified.

For this reason, a rudimentary (and somewhat experimental) Capture Form tool has been added. This tool will display a window containing a web browser displaying the root page for the website you want to copy, and a list of detected forms.

Simply navigate to the login page, select the form to use, and then tick any parameters to include - i.e. user name and password fields, but not fields with odd values. (A future update will probably exclude the hidden fields automatically.)

Note that you don't have to fill in the form and submit the page as WebCopy is simply parsing the HTML that you have navigated to.

Then submit the dialog and you have a new form definition. Hopefully that will become a useful feature for users of the product!

An example of capturing a form automatically

Anything outstanding? (with form support!)

Currently the main outstanding bug is that it isn't possible to submit multi-line form values, because the UI only offers a single-line edit field. I don't really want to start breaking the UI up in 1.x; that is a task planned for 2.0.

I also don't want to keep the new Merge Values property hanging around for long - it's currently really only there so that existing projects behave the same way they always have. Once I'm certain the new code is performing as expected, this setting will be removed and all form posting will use the new behaviour.

It is possible that there will be issues with the new implementation, for example cases where the auto detection fails to identify the right form on pages with multiple forms.

This release also includes quite a few other bug fixes, and it is definitely recommended that all users upgrade to this version.

If you find any problems (or have any suggestions) please let us know!

Update History

  • 2015-03-22 - First published
  • 2020-11-23 - Updated formatting
