Cyotek News

Cyotek News https://blog.cyotek.com/tag/cyotek-webcopy/atom.xml 2024-03-31T17:19:25Z Introducing WebView2 support, part 1 urn:uuid:67843d0c-abc4-435a-863b-16450d9ba487 2024-03-31T17:19:25Z 2024-03-31T18:01:05Z <blockquote>
Note that WebCopy 1.10 and Sitemap Creator 1.3 onwards <a href="https://blog.cyotek.com/post/removal-of-support-for-windows-vista-windows-8-and-early-versions-of-windows-10">no
longer support Windows Vista, Windows 8 and early versions of
Windows 10</a>.
</blockquote>
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/wc110-ie11.png" class="gallery" title="Cyotek's documentation pages don't work in IE very well." ><img src="https://images.cyotek.com/image/thumbnail/blog/wc110-ie11.png" alt="Cyotek's documentation pages don't work in IE very well." decoding="async" loading="lazy" /></a><figcaption>Cyotek's documentation pages don't work in IE very well.</figcaption></figure>
<a href="https://blog.cyotek.com/post/webcopy-1-7-web-browser-authentication">WebCopy 1.7</a> introduced website authentication using
Internet Explorer. Even at the time I knew it was a stop-gap
solution and long term the plan was to have WebCopy support both
Gecko and Chromium engines for advanced support. Of course, in
Cyotek-land plans never come to fruition.
In the interim period Microsoft introduced the new Chromium
based Edge and then Edge WebView2. WebCopy users can now choose
to use either Internet Explorer or a system installation of
WebView2 for advanced GUI based tasks.
<blockquote>
To clarify, you cannot (yet) use it to scan websites to work
around WebCopy's inability to execute JavaScript whilst
crawling. Its main use case is for manually authenticating
with a website to allow WebCopy to borrow the cookies.
</blockquote>
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/wc110-webview2.png" class="gallery" title="WebCopy now (partially!) supports WebView2" ><img src="https://images.cyotek.com/image/thumbnail/blog/wc110-webview2.png" alt="WebCopy now (partially!) supports WebView2" decoding="async" loading="lazy" /></a><figcaption>WebCopy now (partially!) supports WebView2</figcaption></figure>
Although you almost certainly have a version of Edge installed
that cannot be removed, that doesn't necessarily mean you have WebView2
installed - see the Microsoft <a href="https://developer.microsoft.com/en-us/microsoft-edge/webview2/?form=MA13LH#download" rel="external nofollow noopener">download page</a> for
installation details. Note that WebCopy currently doesn't
support fixed versions, but only the evergreen system component.
<h2 id="choosing-an-embedded-browser-engine">Choosing an Embedded Browser Engine</h2>
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/wc110-options.png" class="gallery" title="The new options" ><img src="https://images.cyotek.com/image/thumbnail/blog/wc110-options.png" alt="The new options" decoding="async" loading="lazy" /></a><figcaption>The new options</figcaption></figure>
The Options dialogue now includes a new Embedded Web
Browser category which allows you to choose between IE and
WebView2.
One of the interesting things about WebView2 is it doesn't
share global state with the Edge browser itself. A new option
has been provided allowing you to configure the location where
the WebView2 profile is stored, along with an option to clean it
up after each use (e.g. for deleting cookies and clearing cached
files).
The dialogue will also check to see if WebView2 is installed
before allowing it to be selected.
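How WebCopy performs this check isn't described here. As a rough sketch only, Microsoft's distribution guidance describes detecting the evergreen runtime by looking for a <code>pv</code> version value under the EdgeUpdate <code>Clients</code> registry key, and the Python below follows that approach. The function names are illustrative assumptions, and the registry lookup only does anything on Windows.

```python
import sys

# Client GUID that Microsoft documents for the evergreen WebView2 Runtime.
WEBVIEW2_CLIENT_ID = "{F3017226-FE2A-4295-8BDF-00C3A9A7E4C5}"


def is_valid_runtime_version(pv):
    """A missing, empty or all-zero version means the runtime is not installed."""
    return bool(pv) and pv != "0.0.0.0"


def find_webview2_version():
    """Return the installed evergreen runtime version, or None if not found."""
    if sys.platform != "win32":
        return None  # the registry only exists on Windows

    import winreg

    # Per-machine (64-bit and 32-bit views) and per-user install locations.
    candidates = [
        (winreg.HKEY_LOCAL_MACHINE, r"SOFTWARE\WOW6432Node\Microsoft\EdgeUpdate\Clients"),
        (winreg.HKEY_LOCAL_MACHINE, r"SOFTWARE\Microsoft\EdgeUpdate\Clients"),
        (winreg.HKEY_CURRENT_USER, r"Software\Microsoft\EdgeUpdate\Clients"),
    ]
    for hive, path in candidates:
        try:
            with winreg.OpenKey(hive, path + "\\" + WEBVIEW2_CLIENT_ID) as key:
                pv, _ = winreg.QueryValueEx(key, "pv")
        except OSError:
            continue  # key or value missing in this location
        if is_valid_runtime_version(pv):
            return pv
    return None
```

A settings dialogue could call <code>find_webview2_version()</code> once and disable the WebView2 choice when it returns <code>None</code>.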
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/wc110-options-no-webview2.png" class="gallery" title="Warning if WebView2 isn't installed" ><img src="https://images.cyotek.com/image/thumbnail/blog/wc110-options-no-webview2.png" alt="Warning if WebView2 isn't installed" decoding="async" loading="lazy" /></a><figcaption>Warning if WebView2 isn't installed</figcaption></figure><h2 id="caveats">Caveats</h2>
This should still be considered experimental functionality. And
by using WebView2, Microsoft probably siphons off half your hard
disk to their servers as per their egregious data collection
practices.
What is also quite not-funny is one of the reasons users
requested this functionality is to log into Google websites. But
I understand that Google in their position of &quot;Our Way Or The
Highway&quot; actively block embedded browsers such as WebView2 from
being used for authentication. But ignoring that, at least it
can be used for modern websites where IE now falters.
The default option remains as Internet Explorer given it is
already present on users' machines and WebView2 may not be.
Finally, although technically WebCopy can still be used on
Windows Vista, WebView2 is only available on Windows 7, Windows
8.1, later versions of Windows 10 and any version of Windows 11
(and even then the Windows 7 and 8.1 versions are no longer
supported).
<h2 id="what-about-using-webview2-to-scan-websites-created-using-javascript">What about using WebView2 to scan websites created using JavaScript?</h2>
The title of this post says part 1, suggesting a part 2. And
this is what part 2 will cover!
WebCopy is a relatively old piece of software - the first
version was released June 15th, 2010. In addition, it
deliberately used old versions of the .NET Framework to allow it
to be used on older versions of Windows. These restrictions have
gradually been removed, but the underlying code is still using
old coding methods. So while modern code (like WebView2) uses
the &quot;async and await&quot; patterns, WebCopy uses older constructs
like <code>AsyncOperationManager</code> and <code>IAsyncResult</code>. A minimum of
refactoring has been done to inject WebView2 for use in GUI
dialogues like External Login and Capture Form, but changing the
crawl engine to be able to make use of this is a bigger task. So
for now, it can be used for manual authentication, and in a
future update you will (should) be able to use it to complement
WebCopy's crawl engine if desired.
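To illustrate the gap between the two styles, here is a minimal Python sketch (not WebCopy's code, which is .NET) that wraps a callback-based operation so it can be awaited. <code>legacy_download</code> is a hypothetical stand-in for an older begin/callback style API, and the adapter is one common way to bridge the two, not necessarily the approach WebCopy takes.

```python
import asyncio
import threading


def legacy_download(url, on_complete):
    """Hypothetical old-style API: does work on a thread, then calls back."""
    def worker():
        on_complete("content of " + url)
    threading.Thread(target=worker).start()


async def download_async(url):
    """Adapter that exposes the callback API as an awaitable."""
    loop = asyncio.get_running_loop()
    future = loop.create_future()

    def on_complete(result):
        # The callback fires on the worker thread, so hand the result
        # back to the event loop thread before completing the future.
        loop.call_soon_threadsafe(future.set_result, result)

    legacy_download(url, on_complete)
    return await future


result = asyncio.run(download_async("https://demo.cyotek.com"))
```

Wrapping one dialogue this way is cheap. Doing the same for an entire crawl engine built around callbacks means touching every call site, which is why that is the larger job.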


All content <a href="https://blog.cyotek.com/copyright-and-trademarks">Copyright (c) by Cyotek Ltd</a> or its respective writers. Permission to reproduce news and web log entries and other RSS feed content in unmodified form without notice is granted provided they are not used to endorse or promote any products or opinions (other than what was expressed by the author) and without taking them out of context. Written permission from the copyright owner must be obtained for everything else. Original URL of this content is https://blog.cyotek.com/post/introducing-webview2-support-part-1 .
 Richard Moss https://www.cyotek.com/ richard.moss@cyotek.com CrowdStrike Falcon False Positives urn:uuid:8cde1d62-c4d3-4aed-9de4-a2f8cf219729 2021-05-19T07:20:59Z 2021-05-19T07:20:59Z For the past several months the CrowdStrike Falcon endpoint
protection platform has been flagging builds of our WebCopy and
Sitemap Creator products as malicious.
A few weeks after this originally started I contacted their
support to try and get a solution. Each time, they would check
the builds, state they were clean and whitelist that one build.
Of course, as soon as our CI server pushed out a new build, they
automatically flagged it as malicious again.
It has now been several months and their support doesn't answer
emails or provide any reason why they keep flagging the software
as malicious. As we are quite certain these are false positives
(firstly, every build is sent to <a href="https://www.virustotal.com/" rel="external nofollow noopener">VirusTotal</a> for analysis by
multiple engines, secondly, each time we originally contacted them
with one of the file hashes they investigated and reported
clean) we have decided to add CrowdStrike detections
<code>Win/malicious_confidence_80% (D)</code> and
<code>Win/malicious_confidence_90% (D)</code> to an ignore list. Therefore,
if one of these is the only detection, the build will be made
available for download.
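The rule described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the actual build script. The two detection names come from this post, while <code>can_publish</code> and its input format are assumptions.

```python
# Detection names taken from the post; treated as known false positives.
IGNORED_DETECTIONS = {
    "Win/malicious_confidence_80% (D)",
    "Win/malicious_confidence_90% (D)",
}


def can_publish(detections):
    """A build is published when every reported detection is on the ignore list."""
    return all(name in IGNORED_DETECTIONS for name in detections)
```

A build with no detections, or with only an ignored CrowdStrike detection, passes. Any other detection alongside it still blocks the build.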
Of course, there are no guarantees and so you should still be
cautious when downloading files from the internet.


Original URL of this content is https://blog.cyotek.com/post/crowdstrike-falcon-false-positives .
 Richard Moss https://www.cyotek.com/ richard.moss@cyotek.com WebCopy 1.8 - JavaScript Support urn:uuid:2899371e-241f-4385-9f92-0abfdcb1ca38 2019-06-29T19:18:35Z 2019-06-29T19:18:35Z One of the long standing requests/complaints is for WebCopy to
support JavaScript enabled websites, e.g. modern SPAs where
JavaScript is used to build the page. Traditionally this is
something I have always put onto the furthest of back burners as
in order to support this natively I'd have to essentially write
half a browser, something that would be a full time job and a
half and not something I'm interested in doing. Other solutions
did exist but I never really looked into them.
It recently occurred to me however, that I'd put into place all
the building blocks I needed to have WebCopy support JavaScript
execution (in a limited fashion, more on this later) using
Internet Explorer. And it was easy, in fact, the hardest part
was sorting out threading issues - despite the fact that WebCopy
currently only crawls on a single thread, it does run on a
different thread to the UI in order not to freeze it, which COM
can have a problem with.
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/webcopy-1.8-browser-option.png" class="gallery" title="That is a big warning message!" ><img src="https://images.cyotek.com/image/thumbnail/blog/webcopy-1.8-browser-option.png" alt="That is a big warning message!" decoding="async" loading="lazy" /></a><figcaption>That is a big warning message!</figcaption></figure>
The end result? A new Use Web Browser option can be found in
the Project Properties dialog. When set, WebCopy will do its
own downloading and remapping of content, but it will use an
embedded Internet Explorer session to do the crawling.
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/webcopy-1.8-browser-off.png" class="gallery" title="The current version of WebCopy can't detect links generated via JavaScript" ><img src="https://images.cyotek.com/image/thumbnail/blog/webcopy-1.8-browser-off.png" alt="The current version of WebCopy can't detect links generated via JavaScript" decoding="async" loading="lazy" /></a><figcaption>The current version of WebCopy can't detect links generated via JavaScript</figcaption></figure>
The screenshot above shows a scan of the <a href="https://demo.cyotek.com">WebCopy demonstration
site</a>. The page <code>dom.php</code> has a few lines of JavaScript to
build a list of links. As seen above, previous versions of
WebCopy are completely oblivious to these extra links.
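To show why a crawler that only reads the downloaded markup misses such links, here is a small Python sketch. The page content is a hypothetical stand-in for <code>dom.php</code>, and the parser is Python's standard <code>html.parser</code>, not WebCopy's own.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags present in the markup itself."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical page: one static link plus a script that adds more at runtime.
PAGE = """
<a href="static.html">Static</a>
<ul id="list"></ul>
<script>
  var list = document.getElementById("list");
  ["one", "two"].forEach(function (name) {
    var item = document.createElement("li");
    var link = document.createElement("a");
    link.href = name + ".html";
    item.appendChild(link);
    list.appendChild(item);
  });
</script>
"""

extractor = LinkExtractor()
extractor.feed(PAGE)
static_links = extractor.links
```

Only <code>static.html</code> is found. The links to <code>one.html</code> and <code>two.html</code> exist only after a browser has executed the script.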
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/webcopy-1.8-browser-on.png" class="gallery" title="A small step for software applications, a giant leap for WebCopy" ><img src="https://images.cyotek.com/image/thumbnail/blog/webcopy-1.8-browser-on.png" alt="A small step for software applications, a giant leap for WebCopy" decoding="async" loading="lazy" /></a><figcaption>A small step for software applications, a giant leap for WebCopy</figcaption></figure>
The image above is the same website scanned using WebCopy 1.8
and the new option enabled - you can see how it has detected
additional links, due to allowing JavaScript to execute. If you
peer hard enough you will also see that it was significantly
slower due to scanning the website using this technique.
<h2 id="listing-the-cons">Listing the cons</h2>
Although I'm pleased to be able to finally offer this
functionality, there are a few caveats.
<blockquote>
This functionality is very new, and very experimental. It is
by no means certain that I have ironed out all the potential
issues. Caveat Emptor!
</blockquote>
<ul>
<li>Crawling may be substantially slower. HTML documents will be
downloaded twice, and the headless web browsing will also add
significant overhead</li>
<li>JavaScript is being executed. This can lead to your sessions
being fingerprinted, tracked, malicious content being
downloaded, any number of things</li>
<li>This functionality currently uses the latest version of Internet
Explorer that is installed on your computer. Not all websites
play nicely with IE</li>
<li>Keeping with the Internet Explorer theme, it will share and
use global cookies</li>
<li>Some options won't apply - for example the user agent. If a
website is particularly unfriendly, it may serve different
content to WebCopy than it does to the hosted Internet
Explorer session</li>
<li>WebCopy will remap only the original document it downloads,
not the JavaScript executed version. I don't plan on changing
this behaviour</li>
<li>This system only supports non-interactive scripts, e.g.
JavaScript that executes when the page loads. I have no
intention of supporting scripts that normally require user
interaction to run, e.g. clicking a button or scrolling a
window</li>
<li>It occurs to me as I write this post that I have no idea what
will happen if the scripts try to open a popup window.
Probably nothing good!</li>
<li>Potentially more issues. Experimental code!</li>
</ul>
<h2 id="i-dont-want-to-use-internet-explorer-cant-i-use-chrome-or-firefox">I don't want to use Internet Explorer, can't I use Chrome or Firefox?</h2>
Neither do I. Microsoft have dropped the ball so many times with
web browsers I'm amazed they are still in the game. Although I
wish they'd just decoupled Edge from the OS and updated it more
frequently rather than giving in to Google and adopting Chromium. But
I've probably stated this before, plus, as usual, I digress.
To get back to the point, I expect future versions of WebCopy
will support both Firefox and Chromium. However, as these
browsers are several times larger than WebCopy, they won't be
included by default. So I also need to have a nice system so
that you can easily add extra browser engines to WebCopy from
within the application and without needing to install anything.
I'm also considering supporting Edge as Microsoft appear to be
adding support for this to .NET, as long as you're on the latest
Windows 10. However, given that it's probably &quot;old&quot; Edge then
this may not happen as adding support for two obsolete
browsers and with one only available to a fraction of users
is going to be a waste of the time I simply don't have to waste.
I'll have more to write about this in future I'm sure!
<h2 id="update-history">Update History</h2>
<ul>
<li>2019-06-29 - First published</li>
<li>2020-11-23 - Updated formatting</li>
</ul>


Original URL of this content is https://blog.cyotek.com/post/webcopy-1-8-javascript-support .
 Richard Moss https://www.cyotek.com/ richard.moss@cyotek.com WebCopy 1.8 - New Project Wizard urn:uuid:d783e81a-a96a-4b27-a9f8-6a601d5c4591 2019-06-29T19:03:12Z 2019-06-29T19:03:12Z In my previous post regarding WebCopy 1.8, I briefly covered a
general grab-bag of some of the new features in this version.
This post is dedicated to another new feature, the New Project
Wizard.
Whilst you can still create a new blank project as with previous
versions of WebCopy, there's also a new GUI that will ask a
series of questions and create a neatly configured project.
This is very much a work in progress, the first couple of pages
in particular need to be made more user friendly rather than
just reproducing parts of the main (and complicated) properties
window.
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/webcopy-1.8-newprojectwizard-filetypes.png" class="gallery" title="Using the New Project Wizard to only copy files of a given type" ><img src="https://images.cyotek.com/image/thumbnail/blog/webcopy-1.8-newprojectwizard-filetypes.png" alt="Using the New Project Wizard to only copy files of a given type" decoding="async" loading="lazy" /></a><figcaption>Using the New Project Wizard to only copy files of a given type</figcaption></figure>
This page is potentially the most useful part of the new Wizard.
It allows you to select the types of files you're interested in,
be it images, audio or several other types and will use this
selection to generate a set of pre-built rules using either content
types or extensions as appropriate.
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/webcopy-1.8-newprojectwizard-rules.png" class="gallery" title="Entering rules is a little easier in this Wizard, with more to come" ><img src="https://images.cyotek.com/image/thumbnail/blog/webcopy-1.8-newprojectwizard-rules.png" alt="Entering rules is a little easier in this Wizard, with more to come" decoding="async" loading="lazy" /></a><figcaption>Entering rules is a little easier in this Wizard, with more to come</figcaption></figure>
The Rules page makes it a little easier to enter a bunch of
exclusion rules.
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/webcopy-1.8-newprojectwizard-summary.png" class="gallery" title="Showing the summary of changes" ><img src="https://images.cyotek.com/image/thumbnail/blog/webcopy-1.8-newprojectwizard-summary.png" alt="Showing the summary of changes" decoding="async" loading="lazy" /></a><figcaption>Showing the summary of changes</figcaption></figure>
The Summary page shows how the choices you made in the previous
pages are going to be used to create a WebCopy project.
There are a few more pages, but they are basic and don't offer
anything really new over the existing UI.
<h2 id="update-history">Update History</h2>
<ul>
<li>2019-06-29 - First published</li>
<li>2020-11-23 - Updated formatting</li>
</ul>


Original URL of this content is https://blog.cyotek.com/post/webcopy-1-8-new-project-wizard .
 Richard Moss https://www.cyotek.com/ richard.moss@cyotek.com Introducing WebCopy 1.8 urn:uuid:9d8ac927-4914-4e7c-a2e6-35229b6c2a7f 2019-06-29T19:02:08Z 2019-06-29T19:02:08Z It's been over two months since the last CI build of WebCopy was
made, and during this time I've been working quite hard on some
major internal refactoring and adding a long requested
feature. I hope it's worth the wait, I need a break!
WebCopy 1.8 nightly builds are now available for download and so
this series of posts will describe some of the changes and new
functionality that have been made to the software. This first
post will cover a grab bag of smaller changes.
<blockquote>
Files saved with WebCopy 1.8 are not compatible with older
versions. Please ensure you have a backup of WebCopy projects
so you can revert to an older version of WebCopy if required.
</blockquote>
<h2 id="rule-improvements">Rule Improvements</h2>
The most obvious change is that when you open WebCopy the Rules
List now has three columns rather than two. Although I wrote
back in 2014 about WebCopy 2.0 and a more <a href="https://blog.cyotek.com/post/webcopy-2-0-rules-and-query-strings">extensive rule
system</a>, the simple fact is WebCopy has 10 years of bug fixes
and improvements to the core engine and while there is work to
do, I'm not in a hurry to rewrite it. However, the rules system
could do with some TLC to avoid users having to become regular
expression wizards.
I've added the ability for you to choose which part of the URI
is used by the rule. By default this is the path and query
string (same as previous versions), but you can also tell it to
look at just the path, or just the query, or you could include
the domain too. As a result of this change the Use full URL
flag has been removed. Any rules with this flag currently set
will be automatically upgraded to new settings.
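As a rough illustration of the idea (not WebCopy's implementation), the Python sketch below picks a portion of the URL and tests a regular expression against it. The part names <code>path</code>, <code>query</code>, <code>path_and_query</code> and <code>full</code> are assumptions made up for this example.

```python
import re
from urllib.parse import urlsplit


def rule_target(url, part):
    """Return the portion of the URL a rule should be tested against."""
    parts = urlsplit(url)
    path_and_query = parts.path + ("?" + parts.query if parts.query else "")
    if part == "path":
        return parts.path
    if part == "query":
        return parts.query
    if part == "path_and_query":
        return path_and_query
    if part == "full":
        # Includes the scheme and domain as well as the path and query.
        return parts.scheme + "://" + parts.netloc + path_and_query
    raise ValueError("unknown part: " + part)


def rule_matches(pattern, url, part="path_and_query"):
    """True when the pattern is found in the selected portion of the URL."""
    return re.search(pattern, rule_target(url, part)) is not None
```

With the demonstration URL <code>https://demo.cyotek.com/features/querystringstest.php?section=alpha</code>, a pattern of <code>section=alpha</code> matches by default but not when the rule is limited to the path.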
I've also added the ability for rules to run against the content
type of a given URL. This will make it much simpler to scan a
website and only keep images or other resources, rather than
having to set up a rule that tries to check URL extensions.
<blockquote>
Content type rules currently only work if HEAD checking is
enabled (this is the default). If this option is not set, the
rules probably won't work - I'll address that in a future 1.8
nightly
</blockquote>
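A content type rule boils down to comparing the <code>Content-Type</code> header returned by the HEAD request against a wanted type. The Python below is a minimal sketch of that comparison. The <code>image/*</code> wildcard syntax is an assumption for illustration and may not match how WebCopy expresses such rules.

```python
def content_type_matches(header_value, wanted):
    """Compare a Content-Type header against a rule such as 'image/*'."""
    # Drop parameters such as '; charset=utf-8' and normalise case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    wanted = wanted.lower()
    if wanted.endswith("/*"):
        # 'image/*' becomes the prefix 'image/'.
        return media_type.startswith(wanted[:-1])
    return media_type == wanted
```

Because the header is only known after a request has been made, this also shows why such rules depend on HEAD checking being enabled.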
<h2 id="scan-depth">Scan Depth</h2>
A new UI option has been added allowing you to control the scan
depth. This option only applies if the domain matches the
primary domain being copied.
<h2 id="link-distance">Link Distance</h2>
You can now choose to only include files that are directly
linked to the source page being copied, or within a certain
&quot;click distance&quot;.
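Click distance can be pictured as a breadth-first walk over the link graph that stops following links once the limit is reached. The Python sketch below shows that idea only. The site data is hypothetical, and it does not attempt to model the separate scan depth option.

```python
from collections import deque


def pages_within_distance(links, start, max_distance):
    """Breadth-first walk returning each reachable page and its click distance."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if distances[page] == max_distance:
            continue  # do not follow links beyond the limit
        for target in links.get(page, []):
            if target not in distances:
                distances[target] = distances[page] + 1
                queue.append(target)
    return distances


# Hypothetical site: index -> a -> b -> c
SITE = {"index": ["a"], "a": ["b"], "b": ["c"]}
```

A limit of 1 keeps only pages linked directly from the start page. Raising it to 2 also pulls in pages linked from those.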
<h2 id="cookies">Cookies</h2>
Cookie name/value pairs can now be set for each project.
The UI where cookies are displayed has also had various
usability tweaks.
<h2 id="setup">Setup</h2>
WebCopy setup is now using InnoSetup version 6. This should make
it easier to add local installs as currently Setup requires
power user access to install. This isn't available yet however
as I am still looking into resolving the last few hurdles preventing
me from allowing the &quot;portable&quot; zips to be enabled.
<h2 id="memory-improvements">Memory Improvements</h2>
Last but not least, some technical changes to make things run a
little smoother.
The code WebCopy uses to build site maps is 10 years old at this
point, being the same code used by Sitemap Creator. Saying that
it isn't very efficient is quite the understatement. In
addition, it doesn't handle the ability to load portions on
demand, so in order for WebCopy to display a sitemap it has to
build the entire map upfront and this results in a wide range of
out of memory exceptions for some users. For WebCopy 1.8, this
has been replaced with a simpler system that &quot;walks&quot; the
sitemap, creating only the bits needed at that point. Expand a
branch and only the immediate children are loaded - this should
mean that initial load time of the tree is substantially faster
and memory requirements are much lower.
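The load-on-demand approach can be sketched as a tree node that defers building its children until it is first expanded. The Python below is an illustration under assumed names (<code>LazyNode</code>, <code>CHILDREN</code>), not the code WebCopy uses.

```python
class LazyNode:
    """Sitemap node that only builds its children when first expanded."""

    def __init__(self, url, load_children):
        self.url = url
        self._load_children = load_children  # callable: url -> list of child URLs
        self._children = None  # None means "not loaded yet"

    @property
    def is_loaded(self):
        return self._children is not None

    def expand(self):
        """Create the immediate children on first use, then reuse them."""
        if self._children is None:
            self._children = [
                LazyNode(child, self._load_children)
                for child in self._load_children(self.url)
            ]
        return self._children


# Hypothetical crawl data keyed by parent URL.
CHILDREN = {"/": ["/a", "/b"], "/a": ["/a/1", "/a/2"]}
root = LazyNode("/", lambda url: CHILDREN.get(url, []))
```

Expanding <code>root</code> creates only the nodes for <code>/a</code> and <code>/b</code>. Their own children stay unbuilt until those branches are expanded in turn.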
The sitemap and diagram extensions have also been updated to use
this new code. While both of these still need to scan the entire
map, the fact it isn't building this map out of inefficient
string operations means these shouldn't cause a crash either,
although this doesn't solve all problems with the diagram
extension.
I still need to do performance profiling, but I'm quite happy
with the functionality thus far and am confident that this
change will wipe out a great deal of the out of memory
exceptions that users experience. The old code isn't used at all
now by WebCopy (Sitemap Creator still uses it at the moment).
In a related note, the way links are identified has changed too,
resulting in a nice reduction in memory required for large
projects. Unfortunately as a result of these changes, older
versions of WebCopy won't be able to open project files saved
with WebCopy 1.8.
<h2 id="update-history">Update History</h2>
<ul>
<li>2019-06-29 - First published</li>
<li>2020-11-23 - Updated formatting</li>
</ul>


Original URL of this content is https://blog.cyotek.com/post/introducing-webcopy-1-8 .
 Richard Moss https://www.cyotek.com/ richard.moss@cyotek.com WebCopy 1.7 - local file name generation urn:uuid:36aa7518-d4b5-4a9d-903e-ecf9fb8132aa 2018-11-17T12:28:42Z 2018-11-17T12:28:42Z As part of WebCopy 1.7's mission to reduce user confusion and
make the product more appealing, a pair of new options for
controlling local file name generation have been introduced, as
well as correcting a potentially confusing bug.
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/webcopy-localfileoptions-1a.png" class="gallery" title="A screenshot of the adjusted Local Files configuration settings" ><img src="https://images.cyotek.com/image/thumbnail/blog/webcopy-localfileoptions-1a.png" alt="A screenshot of the adjusted Local Files configuration settings" decoding="async" loading="lazy" /></a><figcaption>A screenshot of the adjusted Local Files configuration settings</figcaption></figure><h2 id="preserving-the-original-extension">Preserving the original extension</h2>
By default, WebCopy will name local files to match their content
type. For example, if you download the homepage of a website
which is named <code>index.php</code>, WebCopy will save a local file named
<code>index.html</code> - end users would probably be very confused trying to
open a <code>.php</code> file and either the operating system doesn't know
how to handle it, or it executes the PHP runtime.
While this approach works, it does mean the original extension
is lost. For 1.7, we've introduced a new Keep original
extension option, located in the Local Files category.
When set, if WebCopy needs to change the extension, it includes
the original extension as well. Our hypothetical <code>index.php</code>
file would be called <code>index.php.html</code> when saved locally.
This option is currently enabled for all new projects, although
we are currently evaluating this. As with most new options, it
is not set for existing projects and must be explicitly enabled.
We've also fixed a bug where WebCopy would change extensions
when it shouldn't. For example, downloading <code>jpg</code> images would
cause the local files to have a <code>jpeg</code> extension. WebCopy now
only changes extensions if they don't match any registered
extension for the appropriate content type.
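The naming behaviour described in this section can be summarised in a short Python sketch. It is an approximation for illustration. The <code>REGISTERED</code> table and the <code>local_file_name</code> function are hypothetical, and WebCopy's real extension registry will be larger.

```python
import posixpath

# Hypothetical registry of known extensions per content type.
REGISTERED = {
    "text/html": [".html", ".htm"],
    "image/jpeg": [".jpg", ".jpeg", ".jpe"],
}


def local_file_name(name, content_type, keep_original=False):
    """Pick a local file name whose extension suits the content type."""
    known = REGISTERED.get(content_type)
    root, ext = posixpath.splitext(name)
    if not known or ext.lower() in known:
        return name  # extension already acceptable, leave it alone
    if keep_original:
        return name + known[0]  # index.php -> index.php.html
    return root + known[0]  # index.php -> index.html
```

The second check is what stops <code>photo.jpg</code> from being renamed: <code>.jpg</code> is already one of the registered extensions for <code>image/jpeg</code>.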
<h2 id="using-query-strings-to-construct-the-local-file-name">Using query strings to construct the local file name</h2>
The query string <a href="https://demo.cyotek.com/features/querystrings.php">demonstration page</a> includes links to a
page with two different query strings:
<ul>
<li><code>https://demo.cyotek.com/features/querystringstest.php?section=alpha</code></li>
<li><code>https://demo.cyotek.com/features/querystringstest.php?section=beta</code></li>
</ul>
Current versions of WebCopy only consider the page name for
local file generation, therefore when copying the demonstration
website the above examples are copied locally as
<ul>
<li><code>querystringstest.html</code></li>
<li><code>querystringstest-1.html</code></li>
</ul>
This can make it very difficult to identify which page the offline
file originally belonged to.
WebCopy 1.7 introduces another new option, Use query string in
local file names, also found in the Local Files category.
When enabled, WebCopy will consider the query string of the URL
as well as the page name. With the option set, the pages above
would now be saved as
<ul>
<li><code>querystringstest-section-alpha.html</code></li>
<li><code>querystringstest-section-beta.html</code></li>
</ul>
Each key/value pair in the query string will be present in the
filename, separated by dashes. This option is currently not
enabled by default for new projects.
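A minimal Python sketch of this naming scheme follows. It reproduces the file names shown above for the demonstration URLs, but <code>query_file_name</code> is a hypothetical helper. It does not sanitise characters that are invalid in file names, which a real implementation would need to do.

```python
import posixpath
from urllib.parse import parse_qsl, urlsplit


def query_file_name(url, extension=".html"):
    """Build a local file name from the page name plus the query string."""
    parts = urlsplit(url)
    # Page name without its original extension, e.g. 'querystringstest'.
    page = posixpath.splitext(posixpath.basename(parts.path))[0]
    pieces = [page]
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        # Add each key and value in order, skipping empty values.
        pieces.extend(piece for piece in (key, value) if piece)
    return "-".join(pieces) + extension
```

Calling it with <code>https://demo.cyotek.com/features/querystringstest.php?section=alpha</code> gives <code>querystringstest-section-alpha.html</code>.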
<h2 id="are-these-options-helpful">Are these options helpful?</h2>
As always we hope these new options will be useful to end users.
Would they be helpful for you? Do you think we should offer
other options? Please <a href="https://www.cyotek.com/contact">let us know</a>!
<h2 id="update-history">Update History</h2>
<ul>
<li>2018-11-17 - First published</li>
<li>2020-11-23 - Updated formatting</li>
</ul>


Original URL of this content is https://blog.cyotek.com/post/webcopy-1-7-local-file-name-generation .
 Richard Moss https://www.cyotek.com/ richard.moss@cyotek.com WebCopy 1.7 - tls/ssl invalid certificate handling urn:uuid:1938366c-5c89-457e-a421-a7a7fa59cdc6 2018-11-04T19:33:40Z 2018-11-04T19:30:56Z <blockquote>
You should add an option to ignore checking for an SSL
certificate.
</blockquote>
The above quote is the last part of a piece of uninstallation
feedback I received about WebCopy on Friday. This isn't the
first time I've had anonymous feedback about ignoring SSL
errors and each time it has happened it has been frustrating and
even bewildering as the option is already there and has been
since 2013!
As a result a small usability change has been made to 1.7. Now,
when you copy a website, if your project hasn't been configured
to ignore SSL errors and it detects the primary website is using
an invalid certificate it will prompt you on what to do.
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/webcopy-ssl-1a.png" class="gallery" title="An example of WebCopy prompting how to handle an invalid certificate" ><img src="https://images.cyotek.com/image/thumbnail/blog/webcopy-ssl-1a.png" alt="An example of WebCopy prompting how to handle an invalid certificate" decoding="async" loading="lazy" /></a><figcaption>An example of WebCopy prompting how to handle an invalid certificate</figcaption></figure>
In the above dialog, clicking Cancel will abort the crawl.
Clicking Ignore will resume crawling the website, ignoring
any certificate issues. There is also a View button to
display the actual certificate itself.
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/webcopy-ssl-1b.png" class="gallery" title="Viewing an invalid certificate using built in Windows tools" ><img src="https://images.cyotek.com/image/thumbnail/blog/webcopy-ssl-1b.png" alt="Viewing an invalid certificate using built in Windows tools" decoding="async" loading="lazy" /></a><figcaption>Viewing an invalid certificate using built in Windows tools</figcaption></figure>
As it stands, I'm not really happy with the Ignore Certificate
Errors option as it will apply to any URL detected by the
crawl, including 3rd party domains. I've logged #338 to
investigate adding more control so you can ignore errors on
specific domains for example. Of course I'm fully aware this
will add even more complexity... in between a rock and a hard
place.
In 1.7, the SSL options have also been moved - previously they
were under the Advanced category, now they are under the
General category and so hopefully are easier to find.
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/webcopy-ssl-1c.png" class="gallery" title="The slightly reorganised Project Properties dialog" ><img src="https://images.cyotek.com/image/thumbnail/blog/webcopy-ssl-1c.png" alt="The slightly reorganised Project Properties dialog" decoding="async" loading="lazy" /></a><figcaption>The slightly reorganised Project Properties dialog</figcaption></figure>
This new prompt is currently available in <a href="https://www.cyotek.com/cyotek-webcopy/downloads">1.7 nightly
builds</a>. Please <a href="https://www.cyotek.com/contact">let us know</a> if the dialog helps or
hinders, or if you have any other feedback about either this
feature or WebCopy in general.
<h2 id="update-history">Update History</h2>
<ul>
<li>2018-11-04 - First published</li>
<li>2020-11-23 - Updated formatting</li>
</ul>


All content <a href="https://blog.cyotek.com/copyright-and-trademarks">Copyright (c) by Cyotek Ltd</a> or its respective writers. Permission to reproduce news and web log entries and other RSS feed content in unmodified form without notice is granted provided they are not used to endorse or promote any products or opinions (other than what was expressed by the author) and without taking them out of context. Written permission from the copyright owner must be obtained for everything else. Original URL of this content is https://blog.cyotek.com/post/webcopy-1-7-tls-ssl-invalid-certificate-handling .
Richard Moss https://www.cyotek.com/ richard.moss@cyotek.com
WebCopy 1.7 - web browser authentication urn:uuid:9e50a583-41b0-4ac5-8ec9-69188caa94ec 2018-10-31T17:05:59Z 2018-10-31T17:05:59Z
There are five main features WebCopy (and Sitemap Creator) need
based on user feedback and our own observations. In no
particular order, these are making the product easier to use,
supporting multiple downloads at once, being able to pause and
resume a copy, JavaScript support and authentication. The
current plan is to address three of the five in WebCopy 1.7,
starting with authentication.
Since the earliest days of WebCopy, it has supported challenge
authentication (where a web browser prompts you for credentials)
and form based authentication (where you enter credentials into
a web page). Almost all websites use the latter approach and
with WebCopy this can either be tricky to configure or
impossible due to websites using interactive methods such as
authenticators or CAPTCHA codes.
In order to resolve this, WebCopy 1.7 includes a new option
named Log in using web browser, found in the Passwords
property page.
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/webcopy-externallogin-1a.png" class="gallery" title="The new option to enable using a web browser for authentication" ><img src="https://images.cyotek.com/image/thumbnail/blog/webcopy-externallogin-1a.png" alt="The new option to enable using a web browser for authentication" decoding="async" loading="lazy" /></a><figcaption>The new option to enable using a web browser for authentication</figcaption></figure>
When this new option is set, WebCopy will display a browser
window when you copy a website to allow you to authenticate with
the web site. Once authenticated, the cookies associated with
the site are applied to WebCopy's crawler and copying will
commence.
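The idea of borrowing cookies can be sketched in a few lines of Python. This is only an illustration of the general technique using the standard library, not WebCopy's actual code; the cookie values, the domain and the helper names are hypothetical.

```python
from http.cookiejar import Cookie, CookieJar
import urllib.request


def make_cookie(name: str, value: str, domain: str, path: str = "/") -> Cookie:
    """Build a session cookie for the given domain (no expiry, not secure-only)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith("."),
        path=path, path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def build_crawler_opener(browser_cookies):
    """Copy (name, value, domain) tuples captured from a browser into a cookie
    jar and return the jar together with an opener that uses it."""
    jar = CookieJar()
    for name, value, domain in browser_cookies:
        jar.set_cookie(make_cookie(name, value, domain))
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return jar, opener


# Hypothetical cookie captured after logging in manually.
jar, opener = build_crawler_opener([("session", "abc123", "demo.example.com")])
```

Requests made through the returned opener carry the borrowed cookies for matching domains, so the crawler never needs to know how the log in itself was performed.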
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/webcopy-externallogin-1b.png" class="gallery" title="Using an embedded web browser to authenticate with the website to be copied" ><img src="https://images.cyotek.com/image/thumbnail/blog/webcopy-externallogin-1b.png" alt="Using an embedded web browser to authenticate with the website to be copied" decoding="async" loading="lazy" /></a><figcaption>Using an embedded web browser to authenticate with the website to be copied</figcaption></figure>
This new feature is currently available in <a href="https://www.cyotek.com/cyotek-webcopy/downloads">1.7 nightly
builds</a>, please <a href="https://www.cyotek.com/contact">let us know</a> if this feature helps
or hinders!
You can learn more about this feature and any caveats on the
<a href="https://docs.cyotek.com/cyowcopy/current/externalauthentication.html">documentation page</a>.
<h2 id="update-history">Update History</h2>
<ul>
<li>2018-10-31 - First published</li>
<li>2020-11-23 - Updated formatting</li>
</ul>


Original URL of this content is https://blog.cyotek.com/post/webcopy-1-7-web-browser-authentication .
Richard Moss https://www.cyotek.com/ richard.moss@cyotek.com
WebCopy 1.4 beta released urn:uuid:13d1eeeb-31d9-4758-a70d-0da1a6f6efa7 2018-04-15T13:22:38Z 2018-04-15T13:19:52Z
A beta version of WebCopy 1.4 has been released, complete with a
fundamental change to how rules are run, various performance
improvements, UI tweaks and miscellaneous bug fixes.
<h2 id="rule-changes">Rule Changes</h2>
In previous versions of WebCopy, rule processing would stop as
soon as the first rule was matched. This made it impossible to
do standard tasks like exclude all HTML pages from being
downloaded (but still scan them) and only download image
resources, as an example.
Now rule processing will continue and the last match is the
final result. This allows for much better control over
processing. We've added a new Stop Processing flag too, so
that you can halt the processing when a desirable match is
found.
<blockquote>
All projects created with older versions of WebCopy will
automatically have the Stop Processing flag applied to
existing rules so that it behaves in a backwards compatible
manner.
</blockquote>
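The difference between the old and new behaviour is easier to see in code. Below is a simplified Python sketch of "last match wins" evaluation with a stop flag. The Rule class and its fields are hypothetical and far simpler than WebCopy's real rules, which distinguish between crawling and downloading among other things.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    pattern: str          # regular expression tested against the URL
    exclude: bool         # True to skip downloading, False to allow it
    stop: bool = False    # stands in for the Stop Processing flag


def should_download(url: str, rules: list[Rule], default: bool = True) -> bool:
    """Evaluate every rule in order; the last matching rule decides the outcome."""
    result = default
    for rule in rules:
        if re.search(rule.pattern, url):
            result = not rule.exclude
            if rule.stop:
                break  # halt as soon as a stopping rule matches
    return result


rules = [
    Rule(r".*", exclude=True),                   # exclude everything...
    Rule(r"\.(png|jpe?g|gif)$", exclude=False),  # ...but allow images again
]
```

With these two rules an image URL is downloaded while an HTML page is not. Setting stop=True on the first rule reproduces the old first-match behaviour, which is effectively what the automatic upgrade of older projects does.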
As a result of this change, the hacked in Reverse and Do not
allow children to inherit this rule flags are deprecated and will
be removed in the next version of the software.
<h2 id="improved-quick-scan">Improved Quick Scan</h2>
When I was looking at metrics, I was shocked to see a huge number
of calls to the documentation for the Quick Scan dialog. On the
one hand, it was interesting to note that hyperlinks to more
information were being used, but on the other hand Quick Scan
was a hack I put in and the dialog was almost completely
useless. Oh, and the documentation for it wasn't very helpful
anyway.
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/webcopy-quickscan-1d.png" class="gallery" title="The original Quick Scan dialog" ><img src="https://images.cyotek.com/image/thumbnail/blog/webcopy-quickscan-1d.png" alt="The original Quick Scan dialog" decoding="async" loading="lazy" /></a><figcaption>The original Quick Scan dialog</figcaption></figure>
We've made some improvements to the dialog so that it is now
hopefully useful. The main part of the dialog is now taken up
with a diagram showing the results of the quick scan. This
diagram updates in real time as you change options so you can get
a feel for how to configure the crawl.
You can also include or exclude domains and pages via a context
menu - this will set up additional hosts or rules as
appropriate.
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/webcopy-quickscan-1a.png" class="gallery" title="The new and improved Quick Scan dialog" ><img src="https://images.cyotek.com/image/thumbnail/blog/webcopy-quickscan-1a.png" alt="The new and improved Quick Scan dialog" decoding="async" loading="lazy" /></a><figcaption>The new and improved Quick Scan dialog</figcaption></figure>
There's still more work to be done - the diagram control doesn't
support keyboard users at all, and really needs to be able to
zoom, so we'll continue to improve this over future updates.
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/webcopy-quickscan-1b.png" class="gallery" title="The diagram updates with a real-time preview of what will be downloaded" ><img src="https://images.cyotek.com/image/thumbnail/blog/webcopy-quickscan-1b.png" alt="The diagram updates with a real-time preview of what will be downloaded" decoding="async" loading="lazy" /></a><figcaption>The diagram updates with a real-time preview of what will be downloaded</figcaption></figure>
<h2 id="tweaked-editors">Tweaked Editors</h2>
The strange way the list based editors for Rules and Forms worked
has long been an annoyance to me - it felt as though they went
out of their way to make it difficult to add or edit items.
These have now been rewritten to actually make sense, although
visually they look the same as they did previously.
You can also now reorder rules in their list by dragging and
dropping.
<h2 id="rule-checker">Rule Checker</h2>
A new tool for quickly checking rules has been added - this can
be useful if you want to test what rules will match a given URI.
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/webcopy-quickscan-1e.png" class="gallery" title="The new Rule Checker dialog in action" ><img src="https://images.cyotek.com/image/thumbnail/blog/webcopy-quickscan-1e.png" alt="The new Rule Checker dialog in action" decoding="async" loading="lazy" /></a><figcaption>The new Rule Checker dialog in action</figcaption></figure>
You can activate this tool from the new button on the main
window, although I expect this will be removed in a future update
when we try to declutter the UI (it's also available from the
window's menu).
<h2 id="performance">Performance</h2>
Quite a few changes have been made to improve memory usage to
avoid &quot;Out of memory&quot; crashes which can occur just about
anywhere.
<h3 id="url-lists">URL Lists</h3>
Previously WebCopy would load all data into these lists at once.
Apart from the slow performance of filling lists with tens of
thousands of items, it doesn't help with memory usage. Now, the
lists are &quot;virtual&quot;, meaning they actually only contain enough
items to fill what's currently visible and the rest are fetched
when required. You can still sort the lists by any column or
search the lists, it's just a lot more efficient than it was.
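For anyone unfamiliar with virtual lists, the following Python sketch shows the general idea: the list knows how many rows exist, but only asks the backing store for the rows it needs to display. The class, the fetch callback and the generated URLs are all hypothetical and unrelated to WebCopy's actual list controls.

```python
from typing import Callable, Sequence


class VirtualList:
    """Holds no rows itself; rows are requested from the store when displayed."""

    def __init__(self, total: int, fetch: Callable[[int, int], Sequence[str]]):
        self.total = total    # number of rows in the backing store
        self.fetch = fetch    # fetch(start, count) returns rows on demand
        self.fetched = 0      # running count of rows requested, for illustration

    def visible_rows(self, first: int, page_size: int) -> Sequence[str]:
        """Return the rows needed to fill a viewport starting at index first."""
        first = max(0, min(first, self.total))
        count = min(page_size, self.total - first)
        self.fetched += count
        return self.fetch(first, count)


# Hypothetical backing store: 50,000 URLs generated on demand.
def fetch_urls(start: int, count: int) -> list[str]:
    return [f"https://example.com/page{i}" for i in range(start, start + count)]


urls = VirtualList(50_000, fetch_urls)
page = urls.visible_rows(first=100, page_size=25)
```

Only the 25 rows in view are ever materialised, regardless of how many tens of thousands of items the list reports.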
<h3 id="site-maps">Site maps</h3>
We've also begun work on reducing the memory requirements of
site maps, and while we've made some progress (viewing a website
diagram should be less likely to crash), there's some major work
that needs to be done - the site map you see in the application can
have scant resemblance to its internal structure, and this will
take more time to resolve.
<h2 id="bug-fixes">Bug fixes</h2>
Quite a number of additional minor bugs have been fixed; view
the <a href="https://www.cyotek.com/cyotek-webcopy/revision-history">release notes</a> for more information.
<h2 id="documentation">Documentation</h2>
We've continued to iterate on the documentation in an effort to
improve it.
<h2 id="continual-improvement">Continual improvement</h2>
We hope this update is useful. Of course we'll continue to
improve WebCopy - there's still lots more to be done!
<h2 id="update-history">Update History</h2>
<ul>
<li>2018-04-15 - First published</li>
<li>2020-11-23 - Updated formatting</li>
</ul>


Original URL of this content is https://blog.cyotek.com/post/webcopy-1-4-beta-released .
Richard Moss https://www.cyotek.com/ richard.moss@cyotek.com
Transforming hyperlinks when copying websites urn:uuid:36bbfb34-0cc9-414b-9ade-00773500c8e2 2017-05-29T14:03:55Z 2017-05-29T12:57:44Z
Recently a website I infrequently use was badly defaced, and in
the course of repairing the damage the owners of the site
temporarily took it down. As I found it to be a very useful
resource I lamented not having an offline copy and so when the
site was restored, I decided to make a copy without further ado.
However, as I swiftly discovered, that was a problem - the site
used JavaScript for many internal links, and WebCopy doesn't
support JavaScript. Somewhat fortunately, when I looked at how
the JavaScript links functioned, I discovered they were all of a
predictable nature - a call to a single function with two string
arguments. The destination URL was a simple concatenation of
these arguments with no extra processing.
Although WebCopy is our most popular product, it actually got
started accidentally as an offshoot of Sitemap Creator to make a
copy of a long forgotten website. And the reason for bringing up
that trivia is that right from the start Sitemap Creator had a
feature where it could transform page titles to remove the extra
text these typically have. This functionality has now been
re-purposed to allow WebCopy to intercept a URI at the detection
stage and transform it into something different.
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/webcopy-uri-transforms-1b.png" class="gallery" title="New options for replacing detected URI strings are now available" ><img src="https://images.cyotek.com/image/thumbnail/blog/webcopy-uri-transforms-1b.png" alt="New options for replacing detected URI strings are now available" decoding="async" loading="lazy" /></a><figcaption>New options for replacing detected URI strings are now available</figcaption></figure>
<h2 id="what-can-you-use-it-for">What can you use it for?</h2>
The initial use case is to transform values from one form into
another in a predictable fashion, for example to remove calling
an interim page or to handle very simple JavaScript.
<h2 id="how-do-you-use-it">How do you use it?</h2>
You can find the configuration settings in the URI
Transforms section of the Project Properties dialog.
Each transform consists of a Pattern, a Replacement
and an optional URI. The pattern is a regular expression which
is used to both match the source link and define any result
groups. Replacement is another expression which defines how the
URI is transformed. Finally, URI can be used to only perform the
pattern matching on links belonging to a given URI.
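As a rough model of how the three settings fit together, here is a short Python sketch. It is an approximation rather than WebCopy's implementation: Python spells group references differently from the <code>$1</code> form used by WebCopy, so the sketch converts them, and the handling of the optional URI (a simple prefix check against the page the link was found on) is only one possible reading of that setting. All function and parameter names are hypothetical.

```python
import re
from typing import Optional


def to_python_replacement(replacement: str) -> str:
    """Convert $1-style group references into the form Python's re module expects."""
    return re.sub(r"\$(\d+)", r"\\g<\1>", replacement)


def transform_uri(link: str, pattern: str, replacement: str,
                  page_uri: str = "", only_for_uri: Optional[str] = None) -> str:
    """Apply one transform to a detected link.

    If only_for_uri is given, the transform is applied only to links found on
    pages whose URI starts with it (an assumption made for this sketch).
    """
    if only_for_uri and not page_uri.startswith(only_for_uri):
        return link
    return re.sub(pattern, to_python_replacement(replacement), link)


# Hypothetical link that passes through an interim "out.php" page.
result = transform_uri("out.php?to=about.html", r"out\.php\?to=(.*)", "$1")
```

Links that do not match the pattern pass through unchanged, which is why a transform can safely be left in place for an entire crawl.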
Regular expressions are a vast and complicated topic and it
would be nice if WebCopy didn't depend so much on them - they
don't make WebCopy very easy to use in many respects. WebCopy
does include a basic editor for expressions which can be quite
handy for testing patterns and replacements but it could be
improved.
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/webcopy-uri-transforms-1a.png" class="gallery" title="The built in editor can help with testing regular expressions" ><img src="https://images.cyotek.com/image/thumbnail/blog/webcopy-uri-transforms-1a.png" alt="The built in editor can help with testing regular expressions" decoding="async" loading="lazy" /></a><figcaption>The built in editor can help with testing regular expressions</figcaption></figure>
<h2 id="usage-scenario-cutting-out-the-middle-man">Usage Scenario: Cutting out the middle man</h2>
One use case is for cutting out an interim page. For example,
one page may ultimately link to another, but it does this by
first calling an interim page with a query string argument
describing the destination. The interim page will perform some
action (such as logging the &quot;click&quot;, showing a timed advert,
etc.) and then navigate to the destination. By using a
transform, we can manipulate the URL to discard the interim page
and just go directly to the destination, remapping the source
link appropriately.
The WebCopy <a href="https://demo.cyotek.com/javascript/uritransform.php">demonstration page</a> includes an example of this
behaviour. The Middleman Redirect link will navigate to
<code>redirecttracker.php?url=uritransformfinal.php</code>. We can use a
simple pattern to strip out the bulk of the URL and just keep
the query string parameter.
<ul>
<li>Pattern: <code>redirecttracker\.php\?url=(.*)</code></li>
<li>Replacement: <code>$1</code></li>
</ul>
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/webcopy-uri-transforms-1d.svg" class="gallery" title="Breakdown of a pattern for capturing simple redirection links" ><img src="https://images.cyotek.com/image/blog/webcopy-uri-transforms-1d.svg" alt="Breakdown of a pattern for capturing simple redirection links" decoding="async" loading="lazy" /></a><figcaption>Breakdown of a pattern for capturing simple redirection links</figcaption></figure>
This pattern matches <code>redirecttracker.php?url=</code> and captures
everything after in a group. The replacement then simply outputs
the contents of the group, <code>uritransformfinal.php</code> in the above
example.
If you tell WebCopy to crawl the demonstration site without the
above transform, it will find <code>uritransformfinal.php</code>, but the
source page will still point to the original redirection page.
With the above transform in place, WebCopy will never know that
<code>redirecttracker.php</code> exists - it will skip directly to the final
page.
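If you want to try the redirect pattern outside of WebCopy, the same substitution can be run in Python. This only demonstrates the regular expression itself; note that Python writes the <code>$1</code> group reference as <code>\1</code>.

```python
import re

PATTERN = r"redirecttracker\.php\?url=(.*)"
REPLACEMENT = r"\1"  # Python's spelling of the $1 group reference

link = "redirecttracker.php?url=uritransformfinal.php"
transformed = re.sub(PATTERN, REPLACEMENT, link)
print(transformed)  # uritransformfinal.php
```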
<h2 id="usage-scenario-converting-simple-javascript-links">Usage Scenario: Converting simple JavaScript links</h2>
For a more advanced example, the <a href="https://demo.cyotek.com/javascript/uritransform.php">demonstration page</a> also
has three hyperlinks with the following <code>href</code> attributes:
<ul>
<li><code>javascript:openPage('1', 'index')</code></li>
<li><code>javascript:openPage('1', 'second')</code></li>
<li><code>javascript:openPage('2', 'index')</code></li>
</ul>
Clicking the first link will navigate to <code>1-index.php</code>, the
second to <code>1-second.php</code> and the third to <code>2-index.php</code>. While
not really best practice for modern websites, this mirrors the
behaviour of the original site I wanted to copy.
If you do a normal scan using WebCopy, while it will detect all
three links above, it will silently ignore them. To get WebCopy
to correctly process these links we need to detect the calls to
the <code>openPage</code> function, and construct a replacement URI using
the two parameters, plus an extension.
This can be done with the following transform:
<ul>
<li>Pattern: <code>javascript:openPage\('(.*)',\s?'(.*)'\)</code></li>
<li>Replacement: <code>$1-$2.php</code></li>
</ul>
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/webcopy-uri-transforms-1e.svg" class="gallery" title="Breakdown of a pattern for capturing simple JavaScript links" ><img src="https://images.cyotek.com/image/blog/webcopy-uri-transforms-1e.svg" alt="Breakdown of a pattern for capturing simple JavaScript links" decoding="async" loading="lazy" /></a><figcaption>Breakdown of a pattern for capturing simple JavaScript links</figcaption></figure>
The above expression will first try and match
<code>javascript:openPage('</code> (parentheses are escaped with <code>\</code> as they are
special characters). It will then capture any characters between
the first set of single quotes into a capture group. After the
closing quote, it will then match a <code>,</code> character. The <code>\s</code>
token means to match any white space character, and the <code>?</code>
makes it optional. Next the pattern captures any characters
between the second set of single quote characters and matches a
closing parenthesis. In fairness, the pattern could be simplified
further, but then it would look even more confusing to newcomers
so I've tried to keep it more explicit.
The replacement expression basically combines the two groups
(the <code>$1</code> and <code>$2</code> tokens represent capture groups from the
pattern) with a <code>-</code> between them and then adds the <code>.php</code>
extension.
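The same transform can be checked against the three demonstration links in Python. As before this only exercises the regular expression, with the parentheses escaped as described above and <code>$1</code> and <code>$2</code> written in Python's <code>\1</code> and <code>\2</code> form.

```python
import re

PATTERN = r"javascript:openPage\('(.*)',\s?'(.*)'\)"
REPLACEMENT = r"\1-\2.php"

hrefs = [
    "javascript:openPage('1', 'index')",
    "javascript:openPage('1', 'second')",
    "javascript:openPage('2', 'index')",
]
transformed = [re.sub(PATTERN, REPLACEMENT, href) for href in hrefs]
print(transformed)  # ['1-index.php', '1-second.php', '2-index.php']
```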
Now when scanning the demonstration website, WebCopy will find
those three links and automatically transform them, therefore
finding and downloading the linked pages. At the end of the
copy, when WebCopy remaps downloaded HTML to ensure links are
local to the copy, it will also replace the source links with
the transformed name.
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/webcopy-uri-transforms-1c.png" class="gallery" title="Links dialog showing that WebCopy has successfully downloaded pages previously only accessible via JavaScript" ><img src="https://images.cyotek.com/image/thumbnail/blog/webcopy-uri-transforms-1c.png" alt="Links dialog showing that WebCopy has successfully downloaded pages previously only accessible via JavaScript" decoding="async" loading="lazy" /></a><figcaption>Links dialog showing that WebCopy has successfully downloaded pages previously only accessible via JavaScript</figcaption></figure>
<h2 id="getting-the-build">Getting the build</h2>
Currently this functionality is only available in nightly
builds, available from the WebCopy <a href="https://www.cyotek.com/cyotek-webcopy/downloads">download page</a>.
<h2 id="update-history">Update History</h2>
<ul>
<li>2017-05-29 - First published</li>
<li>2020-11-23 - Updated formatting</li>
</ul>


Original URL of this content is https://blog.cyotek.com/post/transforming-hyperlinks-when-copying-websites .
Richard Moss https://www.cyotek.com/ richard.moss@cyotek.com
srcset attribute support, custom attributes, 300 status support and more urn:uuid:dca0bf79-0b66-4167-87a3-3c9a0b2e600b 2015-09-12T12:35:26Z 2015-09-12T12:10:07Z
A new beta version of WebCopy has been released, containing a
range of features and bug fixes.
<blockquote>
If you're finding WebCopy useful, please <a href="https://www.cyotek.com/donate">donate</a> to keep
the project alive
</blockquote>
<h2 id="custom-attributes">Custom Attributes</h2>
If you tried to use WebCopy to copy a responsive website, then
it is possible that WebCopy wouldn't pick up custom images if
they were referenced in ways that <a href="https://docs.cyotek.com/cyowcopy/current/crawlsupport.html">WebCopy won't detect by
default</a>, such as on custom data attributes.
With WebCopy 1.1.1 or higher, you can now define your own HTML
crawler expressions. It sounds complicated, but it's not - in
the simplest fashion, you can just enter the name of an
attribute, and WebCopy will check for any match on any HTML
element.
For example, consider the HTML fragment below, which references
background3.png and background1.png.
<figure class="lang-html highlight"><figcaption>html</figcaption><pre class="code">
&lt;img data-original=&quot;/assets/img/background3.png&quot; src=&quot;https://www.cyotek.com/assets/img/background1.png&quot; alt=&quot;Background&quot; style=&quot;width: 100%;&quot;/&gt;
</pre>
</figure>
By default, WebCopy will only find background1.png as it is
on the standard <code>src</code> attribute. By simply adding
<code>data-original</code> to the custom attributes list, WebCopy will now
find and process background3.png too.
In addition to this basic form, you can also do more advanced
expressions by entering XPath statements. Continuing the above
example, if you add <code>//img/@data-original</code> as a custom
attribute, then WebCopy will only look at attributes named
<code>data-original</code> that belong to <code>img</code> elements.
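To illustrate the difference between the simple and the restricted form, here is a small Python sketch built on the standard library's HTML parser. It does not evaluate XPath and is not how WebCopy scans documents; the optional tag filter merely mimics the effect of an expression such as <code>//img/@data-original</code>. The class and function names, and the extra <code>div</code> element in the sample, are hypothetical.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collect values of chosen attributes, optionally restricted to one tag."""

    def __init__(self, attribute_names, only_tag=None):
        super().__init__()
        self.attribute_names = set(attribute_names)
        self.only_tag = only_tag
        self.found = []

    def handle_starttag(self, tag, attrs):
        if self.only_tag and tag != self.only_tag:
            return
        for name, value in attrs:
            if name in self.attribute_names and value:
                self.found.append(value)


def collect(html, attribute_names, only_tag=None):
    parser = AttributeCollector(attribute_names, only_tag)
    parser.feed(html)
    parser.close()
    return parser.found


HTML = ('<img data-original="/assets/img/background3.png" '
        'src="https://www.cyotek.com/assets/img/background1.png" alt="Background"/>'
        '<div data-original="/assets/img/other.png"></div>')

by_name = collect(HTML, ["data-original"])                  # any element
img_only = collect(HTML, ["data-original"], only_tag="img")  # img elements only
```

Matching by name alone picks up the value on both elements, whereas the restricted form returns only the one belonging to the <code>img</code> element.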
Hopefully this approach strikes a nice balance between something
easy for users, and then something for the power user.
<h2 id="srcset-support">srcset Support</h2>
Staying with the responsive theme, WebCopy now supports the
<code>srcset</code> attribute. This attribute allows you to specify
multiple images for a single <code>img</code> element, and the browser will
choose the most appropriate one.
<figure class="lang-html highlight"><figcaption>html</figcaption><pre class="code">
&lt;img style=&quot;width: 400px; height: 400px;&quot;
 src=&quot;https://www.cyotek.com/image-src.png&quot; 
 srcset=&quot;image-1x.png 1x, image-2x.png 2x, image-3x.png 3x, image-4x.png 4x&quot;
/&gt;
</pre>
</figure>
In the above example, that single <code>img</code> tag references five
different image files. WebCopy will now detect and process all
five images.
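A simplified Python sketch of pulling the URLs out of a <code>srcset</code> value is shown below. It assumes candidates are separated by commas and that the URLs themselves contain no commas, which is a simplification of the real parsing rules, and it is not WebCopy's implementation. The function names are hypothetical.

```python
def parse_srcset(srcset: str) -> list[str]:
    """Return the URLs from a srcset value, dropping descriptors such as 2x."""
    urls = []
    for candidate in srcset.split(","):
        parts = candidate.split()
        if parts:
            urls.append(parts[0])
    return urls


def image_urls(src: str, srcset: str) -> list[str]:
    """Combine src and srcset into one de-duplicated list, preserving order."""
    return list(dict.fromkeys([src, *parse_srcset(srcset)]))


found = image_urls(
    "https://www.cyotek.com/image-src.png",
    "image-1x.png 1x, image-2x.png 2x, image-3x.png 3x, image-4x.png 4x",
)
print(len(found))  # 5
```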
The <a href="http://demo.cyotek.com/">WebCopy demo site</a> has been updated to include custom
attributes and srcset attributes.
<h2 id="multiple-choices-status-code-support">&quot;Multiple Choices&quot; Status Code Support</h2>
The 300 status code is often used by Linux based web servers
(such as Apache) as a user friendly 404 - if you try to access a
specific URL that doesn't exist, but it almost matches other
URLs, the web server will return the list of matches. I don't
think I've ever seen an IIS website return 300, it always seems
to just 404.
Previously WebCopy ignored this status code - it knew it was a
redirect, but couldn't do anything with it as it didn't include
a location header. Now, WebCopy will download the body
containing the list of URLs and crawl each of these.
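The handling described above can be sketched in Python as follows. This is a simplified illustration rather than WebCopy's code: it assumes the 300 response body is HTML containing ordinary anchor links, that the headers are available as a plain dictionary, and all names and URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every anchor element in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def urls_to_crawl(status, headers, body, request_url):
    """Return follow-up URLs for a response.

    A redirect with a Location header yields that single URL; a 300 without
    one falls back to the links listed in the response body.
    """
    location = headers.get("Location")
    if 300 <= status < 400 and location:
        return [urljoin(request_url, location)]
    if status == 300:
        collector = LinkCollector()
        collector.feed(body)
        collector.close()
        return [urljoin(request_url, link) for link in collector.links]
    return []


# Hypothetical body of a "Multiple Choices" response.
BODY = ('<ul><li><a href="/docs/index.html">index.html</a></li>'
        '<li><a href="/docs/index.php">index.php</a></li></ul>')
choices = urls_to_crawl(300, {}, BODY, "https://example.com/docs/indx")
```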
I doubt this feature will see much real world use, but you never
know!
<h2 id="performance-improvements-and-bug-fixes">Performance Improvements and Bug Fixes</h2>
Although I don't go out of my way to profile WebCopy for
performance (most of the time will be spent downloading files
after all), I do keep an eye out for areas that could do with
improvement. While adding support for <code>srcset</code> (which is fairly
unique in terms of HTML attributes as it lets you specify
multiple values in a single attribute), I refactored the
crawling code and got a small performance improvement.
No WebCopy update would be complete without a bug fix or 10, and
so we have a number of fixes implemented, including (finally) a
fix for a bug which could leave a project unreadable by WebCopy.
A full list of corrections can be found in the <a href="https://www.cyotek.com/cyotek-webcopy/revision-history">release
notes</a>.
<h2 id="and-next">And next?</h2>
Even though all the tests (old and new) pass, due to the changes
to crawling, multi value attribute reading and writing and the
other new features, I'm still classing this as a beta build -
there's bound to be some edge case I haven't come across yet.
The next update is going to (finally!) tackle WebCopy's woeful
support for query strings and vastly improve that, perhaps then
making it possible to copy forums more easily.
However, I stress again that we need your support - if you're
using WebCopy and it is useful to you, please <a href="https://www.cyotek.com/donate">donate</a>!
<h2 id="update-history">Update History</h2>
<ul>
<li>2015-09-12 - First published</li>
<li>2020-11-23 - Updated formatting</li>
</ul>


Original URL of this content is https://blog.cyotek.com/post/srcset-attribute-support-custom-attributes-300-status-support-and-more .
Richard Moss https://www.cyotek.com/ richard.moss@cyotek.com
On WebCopy, Continuous Integration, .NET Framework 4.5 and end of Windows XP support urn:uuid:a1e5351f-50b2-4a87-ba2f-9b55ba6ddafe 2015-04-11T08:01:43Z 2015-04-11T08:01:43Z
<blockquote>
This is quite a long post so I'm just going to add an
important bit of news here - WebCopy 1.1 will not be able
to be installed or run on Windows XP.
</blockquote>
WebCopy, like most Cyotek products, is built in C# using
Microsoft .NET Framework 3.5, thus allowing it to run on Windows
XP onwards. Each time the product is built, a batch file is
manually run which goes away and compiles the solution, signs
the files, does some &quot;deployment ready&quot; checks, generates the
documentation and then generates the setup. Tests are not run as
part of this process as generally they are always running in the
IDE via <a href="http://www.ncrunch.net/" rel="external nofollow noopener">NCrunch</a>.
Once a build has been completed and we're thinking about
deploying it, we then fire up a Windows XP virtual machine in
<a href="https://www.virtualbox.org/" rel="external nofollow noopener">VirtualBox</a> and proceed to run a basic smoke test - this
usually involves opening the demo project, downloading the demo
website, then creating a new project, pointing it to a small 3rd
party website, downloading that, then saving the project. Just
the basics in other words.
If this smoke test passes, then we upload the setup to
cyotek.com, copy the <a href="https://www.cyotek.com/cyotek-webcopy/upcoming-changes">Upcoming Changes</a> content into a new
document, then link it all together - and a new release is ready
for end users to download.
It's actually not a laborious process (except for the smoke
test) as for the most part, while not fully automatic, all the
major actions are handled by several batch scripts and
specialist actions. But it would be very nice to have this all
done automatically.
Which is why I feel an intense surge of glee whenever I look at
this display:
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/ci-1a.png" class="gallery" title="Starport Successfull! Wait, wrong universe. Well, it makes me happy anyway." ><img src="https://images.cyotek.com/image/thumbnail/blog/ci-1a.png" alt="Starport Successfull! Wait, wrong universe. Well, it makes me happy anyway." decoding="async" loading="lazy" /></a><figcaption>Starport Successfull! Wait, wrong universe. Well, it makes me happy anyway.</figcaption></figure>
This is <a href="https://www.jetbrains.com/teamcity/" rel="external nofollow noopener">TeamCity</a>, the product we are using for <a href="http://en.wikipedia.org/wiki/Continuous_integration" rel="external nofollow noopener">continuous
integration</a> - basically, whenever changes are checked into
version control, TeamCity goes away and creates the build
without us having to manually trigger it. It also gathers the
build artefacts ready for more processing.
I evaluated both <a href="http://jenkins-ci.org/" rel="external nofollow noopener">Jenkins</a> and TeamCity for this job,
starting with Jenkins. It didn't require any configuration
changes on the server or agent hosts, and produced a build.
(Eventually anyway, it took a lot of effort to get the WebCopy
build into a shape where it would work purely via checkouts.
Un-versioned files and hard coded tool paths do not help!).
But, I didn't warm to Jenkins, mainly because it's so incredibly
ugly, and because out of the box functionality is quite limited
(for .NET shops) unless you want to start installing a multitude
of plugins.
Next I tried TeamCity. I originally looked at this back in 2013
(with version 7) and ended up shelving it then (mostly because
the builds just wouldn't run correctly). While Jenkins worked
out of the box, TeamCity needs a full blown database server of some
sort (SQL Server for us!) and needs tinkering with the firewalls on
both the server and the agent. But in terms of UI user
friendliness it's light years ahead of Jenkins. And as the
screenshot shows, WebCopy is happily being built.
But what does this mean?
<h2 id="more-frequent-builds">More Frequent Builds?</h2>
Firstly, more work needs to be done with the build process - the
batch files are sometimes a bit fragile and don't make for easy
error detection. I'd been looking at <a href="http://fsharp.github.io/FAKE/" rel="external nofollow noopener">FAKE</a> a couple of weeks
back as a possible replacement, but there are other options such
as <a href="https://github.com/nant/nant" rel="external nofollow noopener">NANT</a> and probably a bucketful of others. Once the build
has been converted to use something else, then I'll start adding
the automated unit/integration tests into the process.
And then... I plan on having TeamCity talk to cyotek.com to
automatically publish &quot;nightly&quot; (except they won't be that
frequent!) builds to hopefully get new versions of the software
out quicker.
We've also been spending substantial time recently rewriting
core tests for WebCopy's crawling engine so that we can be much
more confident in its abilities, know immediately if
regressions are introduced, and ensure oversights and bugs
are swiftly rectified. Oh yes, and to make things faster.
<h2 id="moving-to.net-4.5-removal-of-xp-support">Moving to .NET 4.5, removal of XP support</h2>
Some parts of WebCopy are crying out for performance
improvements. Let's start with the basics, such as threading.
Threading is a way of a program doing more than one thing at
once. WebCopy does use threads, but only to keep the UI from
being blocked, and for supporting actions such as update checks
or RSS updates. The core crawling is done on a separate thread,
but essentially all ancillary actions are a big chain, one link
after another.
While there's nothing really stopping us from staying with .NET
3.5 and expanding the threading capacity, newer versions of .NET
make this much easier with lots of built in goodness like
concurrent collections, parallel workers and much easier ways
of making asynchronous calls - the ones in WebCopy currently are
difficult to debug when things go wrong.
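To illustrate the idea of running ancillary actions side by side
rather than as one long chain, here is a minimal sketch. It is
written in Python purely for illustration and is not WebCopy
code; the <code>process_resource</code> function and the worker count are
hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_resource(uri):
    """Hypothetical stand-in for one ancillary action on a resource."""
    return f"processed {uri}"


def process_all(uris, max_workers=4):
    """Run the action for every URI on a pool of worker threads."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every job up front, then collect results as they finish.
        futures = {pool.submit(process_resource, uri): uri for uri in uris}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

The results are keyed by URI, so the order in which the workers
finish does not matter to the caller.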
.NET 4.5 is built into Windows 8 onwards, and is available as a
download for Vista and Windows 7. It is not available for
Windows XP however, so it is highly likely that WebCopy
1.0.10.1 will be the last version available for this OS.
Based on submitted analytics data, the breakdown of WebCopy
users per operating system looks like this:
<table>
<thead>
<tr>
<th>Operating System</th>
<th style="text-align: right;">User %</th>
</tr>
</thead>
<tbody>
<tr>
<td>Windows XP / Server 2003</td>
<td style="text-align: right;">12%</td>
</tr>
<tr>
<td>Windows Vista / Server 2008</td>
<td style="text-align: right;">1%</td>
</tr>
<tr>
<td>Windows 7 / Server 2008 R2</td>
<td style="text-align: right;">68%</td>
</tr>
<tr>
<td>Windows 8 / Server 2012</td>
<td style="text-align: right;">12%</td>
</tr>
<tr>
<td>Windows 8.1 / Server 2012 R2</td>
<td style="text-align: right;">6%</td>
</tr>
<tr>
<td>Windows 10</td>
<td style="text-align: right;">1%</td>
</tr>
</tbody>
</table>
At 12% the XP user base is larger than I would have expected,
but I think it's time to drop support regardless.
<h2 id="update-history">Update History</h2>
<ul>
<li>2015-04-11 - First published</li>
<li>2020-11-23 - Updated formatting</li>
</ul>


All content <a href="https://blog.cyotek.com/copyright-and-trademarks">Copyright (c) by Cyotek Ltd</a> or its respective writers. Permission to reproduce news and web log entries and other RSS feed content in unmodified form without notice is granted provided they are not used to endorse or promote any products or opinions (other than what was expressed by the author) and without taking them out of context. Written permission from the copyright owner must be obtained for everything else. Original URL of this content is https://blog.cyotek.com/post/on-webcopy-continuous-integration-net-framework-4-5-and-end-of-windows-xp-support .
 Richard Moss https://www.cyotek.com/ richard.moss@cyotek.com WebCopy and some welcome form updates urn:uuid:f330a31b-aab3-4bd2-b506-020adceca1ae 2015-03-22T07:28:56Z 2015-03-22T07:21:49Z
One of the biggest sources of support requests for
<a href="https://www.cyotek.com/cyotek-webcopy">WebCopy</a> is posting forms, and
WebCopy's ongoing inability to handle dynamic values.
Thankfully, with WebCopy 1.0.10.0 this issue has finally been
resolved as we have introduced a variety of improvements with
forms, including value merging and a new tool to capture form
data.
<h2 id="how-it-used-to-work">How it used to work</h2>
So what was the problem? Well, consider a typical login form.
The HTML (at the barest minimum) will be similar to the
following fragment.
<figure class="lang-html highlight"><figcaption>html</figcaption><pre class="code">
&lt;form&gt;
 Username: &lt;input type=&quot;text&quot; name=&quot;username&quot; /&gt;
 Password: &lt;input type=&quot;password&quot; name=&quot;password&quot; /&gt;
 &lt;button type=&quot;submit&quot;&gt;Login&lt;/button&gt;
&lt;/form&gt;
</pre>
</figure>
WebCopy handles this type of form very well.
Spammers are one of the scourges of the internet, and as a
result forms often include dynamically generated tokens
used to validate the form values, as demonstrated in the snippet
below.
<figure class="lang-html highlight"><figcaption>html</figcaption><pre class="code">
&lt;form&gt;
 Username: &lt;input type=&quot;text&quot; name=&quot;username&quot; /&gt;
 Password: &lt;input type=&quot;password&quot; name=&quot;password&quot; /&gt;
 &lt;input name=&quot;__RequestVerificationToken&quot; type=&quot;hidden&quot; value=&quot;SVPMPZEjIMFD-Ne5cUm7IiMWKUSyHk3aU1mRGRJNvtmjkSfAusVyOgseFgVXdcZZdxlTHpdUJKIqKgwkqSgYUM7T8tDtCOZighFwIhgc_QW_Ccrr2_QaZ0jD9EVgYSZdVQlgaA2&quot;&gt;
 &lt;button type=&quot;submit&quot;&gt;Login&lt;/button&gt;
&lt;/form&gt;
</pre>
</figure>
These tokens will change (generally for each particular session)
and therefore WebCopy has been completely unable to submit such
forms. Even if you opened the form in your favourite browser,
extracted the tokens and pasted them into WebCopy, it was
unlikely to work, as the browser and WebCopy would likely be
treated as different sessions with different token values.
Forms often tend to set cookies now as well, so when you
download the page containing the form, a validation cookie is
saved. Then, when the page is submitted the cookie is used to
help with the validation process. WebCopy only did the <code>POST</code>
action, but not the <code>GET</code> and so no cookies would ever be set.
<h2 id="how-it-works-now">How it works now</h2>
A new property, Merge Values, has been added to form
definitions. This is enabled by default for new forms, but
disabled for any existing projects (although of course you can
turn it on).
When this property is set, WebCopy will automatically get the
page containing the form, apply any cookies, then attempt to
extract the form data. Any parameters that haven't been
explicitly defined will be automatically merged and then the
merged data will be posted.
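The merge step described above can be sketched roughly as
follows. This is Python for illustration only, not WebCopy's
implementation; the names <code>FormFieldParser</code> and
<code>merge_form_values</code> are hypothetical, and fetching the page
with a cookie-aware HTTP client is assumed to have happened
already.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collects name/value pairs from every <input> element in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            # Inputs without a value attribute are recorded as empty strings.
            self.fields[name] = attributes.get("value") or ""


def merge_form_values(page_html, defined_values):
    """Fill in parameters found in the page that were not explicitly defined."""
    parser = FormFieldParser()
    parser.feed(page_html)
    merged = dict(parser.fields)
    # Explicitly defined values always win over values found in the page.
    merged.update(defined_values)
    return merged
```

With this shape, a hidden token in the page ends up in the
posted data without the user ever having to define it.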
A simple correction, but one that should hopefully reduce the
number of support requests, and disappointed users for that
matter.
<h2 id="what-happens-if-multiple-forms-are-present">What happens if multiple forms are present?</h2>
Some pages may include multiple forms, for example searching,
registration, etc. WebCopy tries to be smart about this - if
multiple forms are present, it will try to match a single form
whose <code>action</code> attribute matches the form URI. If it can't do
that, then it tries to match a single form without an <code>action</code>
attribute. And if that fails, it won't do anything.
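The selection order described above could look something like
the following sketch (Python, illustration only). The function
name <code>select_form</code>, the treatment of an empty <code>action</code> as
"no action", and using the only form on a single-form page are
all assumptions rather than confirmed WebCopy behaviour.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormActionParser(HTMLParser):
    """Records the action attribute (or None) of every <form> in a page."""

    def __init__(self):
        super().__init__()
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.actions.append(dict(attrs).get("action"))


def select_form(page_html, page_uri, form_uri):
    """Return the index of the form to post, or None if no safe choice exists."""
    parser = FormActionParser()
    parser.feed(page_html)
    actions = parser.actions
    if len(actions) == 1:
        return 0  # assumption: a lone form is used as-is
    # First choice: exactly one form whose resolved action is the form URI.
    matching = [i for i, a in enumerate(actions)
                if a and urljoin(page_uri, a) == form_uri]
    if len(matching) == 1:
        return matching[0]
    # Second choice: exactly one form without an action attribute.
    without_action = [i for i, a in enumerate(actions) if not a]
    if len(without_action) == 1:
        return without_action[0]
    return None  # ambiguous or no forms: do nothing
```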
<blockquote>
Note: You can use the Test URI tool to verify that your login
form actually logs you in before doing a full copy or analysis.
</blockquote>
There's also a third option - I added the ability for a form
definition to include an XPath query to select the <code>FORM</code>
element. However, as one of the other frequent criticisms of
WebCopy seems to be that the UI is too complicated, I have
chosen not to add it to the UI at this point in time. The only
way to specify the value currently is by direct editing of the
project file. I probably will not expose this option unless
users start reporting being unable to post to complicated pages
(in which case I'll try to make the auto detection cleverer) or
until 2.0 which ought to include simple / advanced display
modes.
<h2 id="thats-nice-but-how-about-helping-me-create-the-form">That's nice, but how about helping me create the form</h2>
I agree that currently it's not exactly the easiest of tasks to
create a form definition as you have to know the URI and then
the different (static) values to submit, which means you have to
poke around in a site's HTML. Ok, maybe some of the criticism
about WebCopy's UI is justified.
For this reason, a rudimentary (and somewhat experimental)
Capture Form tool has been added. This tool will display a
window containing a web browser displaying the root page for the
website you want to copy, and a list of detected forms.
Simply navigate to the login page, select the form to use, and
then tick any parameters to include, i.e. user name and password
fields, but not fields with odd values. (A future update will
probably exclude the hidden fields automatically.)
<blockquote>
Note that you don't have to fill in the form and submit the
page as WebCopy is simply parsing the HTML that you have
navigated to.
</blockquote>
Then submit the dialog and you have a new form definition.
Hopefully that will become a useful feature for users of the
product!
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/captureformtool.png" class="gallery" title="An example of capturing a form automatically" ><img src="https://images.cyotek.com/image/thumbnail/blog/captureformtool.png" alt="An example of capturing a form automatically" decoding="async" loading="lazy" /></a><figcaption>An example of capturing a form automatically</figcaption></figure>
<h2 id="anything-outstanding-with-form-support">Anything outstanding? (with form support!)</h2>
Currently the main bug left outstanding is that it isn't possible
to submit multi-line form values, because the UI uses a single
edit field. I don't really want to start breaking the UI up in
1.x; that is a planned 2.0 task.
I also don't want to keep the new Merge Values property hanging
around for long - it's currently really only there so that
existing projects behave the same way they always have. Once I'm
certain the new code is performing as expected this setting will
be removed and all form posting will use the new behaviour.
It is possible that there will be issues with the new
implementation, for example cases where the auto detection fails
to identify the right form on pages with multiple forms.
This release also includes quite a few other bug fixes and it is
definitely recommended that all users upgrade to this version.
If you find any problems (or have any suggestions) please let us
know!
<h2 id="update-history">Update History</h2>
<ul>
<li>2015-03-22 - First published</li>
<li>2020-11-23 - Updated formatting</li>
</ul>


Original URL of this content is https://blog.cyotek.com/post/webcopy-and-some-welcome-form-updates .
 Richard Moss https://www.cyotek.com/ richard.moss@cyotek.com WebCopy 1.0.9.0 released - multiple hosts and proxy server support urn:uuid:34e9fe8a-2f76-4bc5-a1e6-156f4b8f0213 2014-06-01T20:36:22Z 2014-06-01T20:36:22Z The latest update to WebCopy has just been released, and
includes two new features which expand the usefulness of the
product.
<blockquote>
These features are considered experimental at this stage -
they haven't been as fully tested as some other features, and
as a result they either might not work properly or have
unintended side effects.
</blockquote>
<h2 id="multiple-hosts">Multiple Hosts</h2>
One of the odder omissions of WebCopy was the fact it
wouldn't crawl other hosts. You could copy sub domains, but what
if you used a CDN with a completely different domain name?
Fortunately that deficit has now been rectified. The
Additional Hosts configuration page lets you specify
additional domains to crawl.
Now, when WebCopy finds an external URI, it will check to see if
the domain is listed as safe to crawl. If it is, it will
promptly download the linked resource, and then attempt to scan
it for further links, and expand from there.
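As a rough illustration of the check described above, the
following Python sketch decides whether a discovered URI belongs
to a host that may be crawled. The function name
<code>is_crawlable</code> and its parameters are hypothetical, and sub
domain handling is deliberately left out.

```python
from urllib.parse import urlsplit


def is_crawlable(uri, root_host, additional_hosts):
    """Return True when the URI's host is the root host or a listed extra host."""
    host = (urlsplit(uri).hostname or "").lower()
    # Host names are case-insensitive, so compare in lower case.
    allowed = {root_host.lower()} | {h.lower() for h in additional_hosts}
    return host in allowed
```

For example, with a root host of <code>www.example.com</code> and an
additional host of <code>cdn.example.net</code>, a link to an image on
the CDN would be followed while a link to any other domain
would still be treated as external.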
As these additional hosts can be jumped into from any level,
some project settings won't apply to the additional hosts - for
example the Crawl Above Root setting. Therefore it is
important to make sure you use rules to control how content is
downloaded.
<h2 id="proxy-server-support">Proxy Server Support</h2>
Previously, WebCopy would use the system defined proxy server
settings. Now you can configure your own independent settings on
a per-project basis. This allows all requests during a crawl to
be sent via the proxy.
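The general idea of overriding the system proxy for one set of
requests can be sketched with Python's standard library. This is
an illustration only and says nothing about how WebCopy stores or
applies its project settings; <code>build_project_opener</code> and the
proxy address used with it are hypothetical.

```python
import urllib.request


def build_project_opener(proxy_url):
    """Create an opener that routes HTTP and HTTPS requests via one proxy."""
    # Passing an explicit mapping replaces the system-detected proxy settings.
    proxy_handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    return urllib.request.build_opener(proxy_handler)
```

Every request made through the returned opener would then go
through the given proxy, independent of the system settings.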
<h2 id="odds-and-ends">Odds and ends</h2>
With these features being new and only tested in a limited
fashion, there could be bugs or side effects - please let us
know if you experience any problems.
As is usual for these updates, there is also a handful of bug
fixes and minor new functionality, mostly around the UI
interactions, but also including a fix where WebCopy would treat
certain URIs as sub domains even though they weren't.
We hope you enjoy this update to the product!
<h2 id="update-history">Update History</h2>
<ul>
<li>2014-06-01 - First published</li>
<li>2020-11-23 - Updated formatting</li>
</ul>


Original URL of this content is https://blog.cyotek.com/post/webcopy-1-0-9-0-released-multiple-hosts-and-proxy-server-support .
 Richard Moss https://www.cyotek.com/ richard.moss@cyotek.com WebCopy 2.0 - Rules and Query Strings urn:uuid:c1572356-7507-44b6-ad29-d125349576ea 2014-03-27T18:40:11Z 2014-03-27T18:40:11Z <blockquote>
Note: WebCopy 2.0 isn't even in alpha yet. Images and
descriptions below are from prototypes testing the
feasibility of new functionality and how it might work. These
features might never see the light of day, look different,
work differently, etc.
</blockquote>
As soon as WebCopy was originally completed, I knew it had a
rather large flaw. The system of rules it uses is quite
powerful, but fatally limited. Essentially you have a regular
expression to match a URL, and a number that tells WebCopy what
to do. Great if you can write regular expressions and want to do
a limited number of things. Utterly useless if you want more
control - such as skipping files which are 1GB in size, or that
are &gt; 10 levels deep.
Some of the most frustrating support requests we deal with tend
to be people trying to copy forums using WebCopy. While it's
technically possible, it's not an easy thing. Forums can have
many thousands of links, and a lot of them are replicated many
times for &quot;new posts&quot; and &quot;replies&quot; and so on. In most cases,
copying this is not what you want as it adds zero value to the
copy.
You could use regular expressions to filter out URIs containing
query strings of particular keys and values, that much at least
is possible (if regular expressions don't scare you away, that
is). But you can't manipulate the query string at all. Ok, you
can clear it completely, but that's going to be useful in a tiny
percentage of cases.
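To make "manipulating the query string" concrete, here is a
small Python sketch of the kind of operation meant: dropping
selected keys while keeping the rest of the URI intact. It is an
illustration only, not a WebCopy feature; the function name and
the forum-style parameter names in the usage note are made up.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_query_keys(uri, keys_to_remove):
    """Return the URI with the named query string keys removed."""
    parts = urlsplit(uri)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in keys_to_remove
    ]
    # Rebuild the URI with only the remaining parameters.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Given a URI with the query <code>t=42&amp;sid=abc123</code>, removing the
hypothetical session key <code>sid</code> would leave only <code>t=42</code>, so
the same topic reached via different sessions collapses to a
single URI.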
I wanted to add a proper rules system to WebCopy, but doing so
means a substantial rewrite of much of the crawling engine, and
that means essentially it's a &quot;version next&quot; feature. For the
last several months WebCopy updates have slowed down in terms of
bug fixes and enhancements that are directly to do with the
crawling engine. It's stable enough for most people it would
seem.
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/webcopy-ruletest.png" class="gallery" title="An example of a prototype rule engine" ><img src="https://images.cyotek.com/image/thumbnail/blog/webcopy-ruletest.png" alt="An example of a prototype rule engine" decoding="async" loading="lazy" /></a><figcaption>An example of a prototype rule engine</figcaption></figure>
The screenshot above shows a prototype GUI for a new rule
engine. The bottom half of the screen shows the conditions of a
rule, and the upper half shows the results of the rule being
evaluated against a list of URIs - everything highlighted in
reddish orange has been matched against the rule, and so if this
were a fully functional application, the specified actions would
have been carried out.
It is currently my intention that this rule engine will control
all aspects of crawling in WebCopy 2.0, be it simple exclusions
of URIs, manipulation of query strings, or changes to the
downloaded HTML. Of course, the GUI might not look like you are
configuring rules of this nature, but behind the scenes that's
what it will be doing.
One of the great things about this type of system is that each
rule is self contained, making it so much easier to test, which
should make for a more stable product from the start without
daft mistakes.
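A minimal sketch of what "self contained" rules might look like
follows, in Python for illustration. None of this reflects the
actual prototype; the <code>Resource</code> and <code>Rule</code> types, the
action names and the two sample rules (based on the 1GB and 10
level examples mentioned earlier) are assumptions.

```python
from dataclasses import dataclass
from typing import Callable
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Resource:
    """Hypothetical description of something the crawler has found."""
    uri: str
    size_bytes: int = 0

    @property
    def depth(self):
        # Number of non-empty path segments in the URI.
        return len([s for s in urlsplit(self.uri).path.split("/") if s])


@dataclass(frozen=True)
class Rule:
    """A condition paired with the action to take when it matches."""
    name: str
    condition: Callable[[Resource], bool]
    action: str


def first_action(rules, resource):
    """Return the action of the first matching rule, or None if none match."""
    for rule in rules:
        if rule.condition(resource):
            return rule.action
    return None


RULES = [
    Rule("skip huge files", lambda r: r.size_bytes >= 1024 ** 3, "skip"),
    Rule("skip deep paths", lambda r: r.depth > 10, "skip"),
]
```

Because each rule is just a condition and an action, it can be
exercised in a unit test with a single made-up <code>Resource</code>, with
no crawl required.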
My final comments are regarding query strings, and the animation
below should describe nicely how that is being planned so far.
<figure class="screenshot" ><a href="https://images.cyotek.com/image/blog/webcopy-ruletest-wizard.gif" class="gallery" title="An Outlook style Rule Wizard GUI that lets you filter on multiple values for a string query string and then re-evaluates the URI list" ><img src="https://images.cyotek.com/image/blog/webcopy-ruletest-wizard.gif" alt="An Outlook style Rule Wizard GUI that lets you filter on multiple values for a string query string and then re-evaluates the URI list" decoding="async" loading="lazy" /></a><figcaption>An Outlook style Rule Wizard GUI that lets you filter on multiple values for a string query string and then re-evaluates the URI list</figcaption></figure>
As I mentioned in the block quote at the start of this post,
nothing is final, not the functionality, nor the types of rules
that will be available, nor how the GUI will work - but I'll
touch upon the GUI specifically in another post.
<h2 id="update-history">Update History</h2>
<ul>
<li>2014-03-27 - First published</li>
<li>2020-11-23 - Updated formatting</li>
</ul>


Original URL of this content is https://blog.cyotek.com/post/webcopy-2-0-rules-and-query-strings .
 Richard Moss https://www.cyotek.com/ richard.moss@cyotek.com