Short:        V1.02 Extract URLs from any file+sort++
Author:       frans@xfilesystem.freeserve.co.uk (francis swift)
Uploader:     frans@xfilesystem.freeserve.co.uk (francis swift)
Type:         comm/www
Replaces:     urlx.lha
Architecture: m68k-amigaos
Url:          www.xfilesystem.freeserve.co.uk

Some quick'n'nasty hacks, but I've included the source for you to look
at, especially as urlx uses btree routines and there aren't that many
simple examples of using btrees.

The btree routines used are by Christopher R. Hertel and are available
in full on the Aminet as BinaryTrees.lzh in dev/c.


V1.02
-----
Some bugfixes/improvements in scanv, plus a new template option in urlx,
for which I've included an example template file for one particular
version of Voyager. Use something like

urlx -p -a -u -t temp_voyager infile Bookmarks.html

to get an html bookmarks file.

V1.01
-----
Added functionality to scanv so it can be used instead of treecat for
the Voyager cache only. This eliminates some of the bogus URLs that
would be thrown up by the previous method (below) using treecat | urlx.
The new method for scanning the Voyager cache (from sh/pdksh) is, e.g.,

scanv -c dh0:Voyager/cache | urlx -p -u - outfile

which uses the new -c flag to cat (output) the contents of each file,
which are then piped through urlx for processing. Of course, treecat is
still necessary for other caches, e.g. AWeb and Netscape.


urlx
----
This program searches a file for URLs (http:// etc.) and prints them
or outputs them to a file. Internally it stores them in a btree to
allow duplicates to be eliminated and optionally to allow the output
to be sorted (there is a stripped-down sketch of the idea at the end
of this section). There are various options:
 -s        selects a simple alphabetic sort for the output
 -u        selects a special URL sort that should provide better grouping
           of similar site names (basically it sorts the groups of the
           first URL element, the host name, backwards; see the sketch
           just after this list)
 -h        selects HTML output format for making quick bookmark files,
           instead of the default straight text output
 -t <file> use a template file for output formatting
 -p        retain parameters after URLs; by default these are ignored
 -a        allow accented characters in URLs (i.e. characters > 127)
 -.<ext>   select just files with extension .<ext>, for example to show
           only .jpg URLs you would use -.jpg, and for .html you would
           use -.htm (which matches both .htm and .html)
 -i        a special file selection option which tries to intelligently
           select only URLs that are likely to be HTML pages, both by
           using the extension and by examining the path
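
To make the -u grouping concrete, here's the idea in miniature: build a
key with the dot-separated groups of the host name reversed, and sort
on that. This is a simplified illustration rather than the real urlx
code (host_key is a made-up name):

#include <stdio.h>
#include <string.h>

/* Build a sort key with the dot-separated groups of a host name
 * reversed, e.g. "www.foo.com" -> "com.foo.www", so that sorting
 * on the key puts similar site names next to each other. */
static void host_key(const char *host, char *key, size_t keylen)
{
    const char *p = host + strlen(host);

    key[0] = '\0';
    while (p > host) {
        const char *dot = p;
        size_t used, len;

        while (dot > host && dot[-1] != '.')
            dot--;
        used = strlen(key);
        len = (size_t)(p - dot);
        if (used + len + 2 > keylen)
            break;                     /* truncate rather than overflow */
        if (used)
            key[used++] = '.';
        memcpy(key + used, dot, len);
        key[used + len] = '\0';
        p = (dot > host) ? dot - 1 : host;
    }
}

int main(void)
{
    char a[64], b[64];

    host_key("www.foo.com", a, sizeof a);
    host_key("ftp.foo.com", b, sizeof b);
    printf("%s\n%s\nstrcmp(a, b) = %d\n", a, b, strcmp(a, b));
    return 0;
}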

Basically there are lots of options, but you'll probably just end up using:

urlx -u infile outfile

which uses the special url sort, or

urlx -u -h infile outfile.html

for making a bookmark file.

In both of the above examples you might want to use -p to retain
parameters (the bits after the question mark, e.g.
http://yes.or.no?wintel=crap).
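
The core of urlx is small enough to sketch, too. The following is not
the real source (that uses Christopher Hertel's btree routines and
handles more than just http://); it's a stripped-down illustration of
the idea: scan for an "http://" prefix, collect the URL up to a
delimiter (stopping at '?' gives the default drop-the-parameters
behaviour), and insert each one into a binary tree so duplicates
vanish and an in-order walk prints the rest sorted:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

/* One node per unique URL; a plain unbalanced tree will do here. */
struct node {
    char *url;
    struct node *left, *right;
};

/* Insert url unless it's already present; duplicates fall out
 * naturally because equal keys are simply ignored. */
static struct node *insert(struct node *t, const char *url)
{
    int cmp;

    if (t == NULL) {
        t = malloc(sizeof *t);
        t->url = malloc(strlen(url) + 1);
        strcpy(t->url, url);
        t->left = t->right = NULL;
        return t;
    }
    cmp = strcmp(url, t->url);
    if (cmp < 0)
        t->left = insert(t->left, url);
    else if (cmp > 0)
        t->right = insert(t->right, url);
    return t;
}

/* In-order walk prints the URLs in sorted order. */
static void walk(const struct node *t)
{
    if (t) {
        walk(t->left);
        puts(t->url);
        walk(t->right);
    }
}

int main(int argc, char **argv)
{
    FILE *in = (argc > 1) ? fopen(argv[1], "r") : stdin;
    char buf[4096], url[1024];
    struct node *tree = NULL;

    if (!in) {
        perror(argv[1]);
        return 1;
    }
    /* Line-based, so a URL split across a buffer boundary would be
     * missed -- good enough for a sketch. */
    while (fgets(buf, sizeof buf, in)) {
        char *p = buf;
        while ((p = strstr(p, "http://")) != NULL) {
            int i = 0;
            while (p[i] && !isspace((unsigned char)p[i]) &&
                   p[i] != '"' && p[i] != '>' && p[i] != '?' &&
                   i < (int)sizeof url - 1) {
                url[i] = p[i];
                i++;
            }
            url[i] = '\0';
            tree = insert(tree, url);
            p += i;                    /* i >= 7, the prefix matched */
        }
    }
    walk(tree);
    return 0;
}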

treecat
-------
This is just a quick hack to let shell (sh/pdksh) users grab URLs from
a complete directory tree. urlx accepts a single dash to mean that input
comes from stdin, so you can use something like

treecat cachedirectorypath | urlx -u - outfilename

to produce a file containing every URL in every file in your cache.
You can use this on any browser cache tree.
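
treecat itself is only a few lines. The idea, sketched here against a
POSIX-style dirent API rather than lifted from the included source,
is simply to recurse through the tree and copy every file to stdout:

#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <sys/stat.h>

/* Recursively copy every file under 'path' to stdout so the lot can
 * be piped into "urlx -u -".  A sketch only; see the included source
 * for the real treecat. */
static void treecat(const char *path)
{
    DIR *d = opendir(path);
    struct dirent *e;

    if (!d)
        return;
    while ((e = readdir(d)) != NULL) {
        char sub[1024];
        struct stat st;

        if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, ".."))
            continue;
        snprintf(sub, sizeof sub, "%s/%s", path, e->d_name);
        if (stat(sub, &st) != 0)
            continue;
        if (S_ISDIR(st.st_mode)) {
            treecat(sub);              /* descend into subdirectory */
        } else {
            FILE *f = fopen(sub, "r");
            int c;

            if (!f)
                continue;
            while ((c = getc(f)) != EOF)
                putchar(c);
            fclose(f);
        }
    }
    closedir(d);
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: treecat <directory>\n");
        return 1;
    }
    treecat(argv[1]);
    return 0;
}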

scanv
-----
This is used specifically to pick out the URLs from the headers of the
files in a Voyager cache. That is just the URL of each file itself; the
contents are not examined by default.
NEW (1.01): -c flag to cat (output) the contents of each file for piping
to urlx.

urlv
----
This is used specifically to grab URLs from a Voyager history file,
usually called URL-History.1.

urla
----
This is used specifically to grab URLs from an AWeb cache index file,
usually called AWCR.

stricmp_test
------------
Just a quick test program to see in which order the compiler (the libc,
really) sorts strings in stricmp calls. Different compilers use
different orders :-(
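
If you want to check your own compiler, the test needs only a couple
of comparisons; something along these lines (note that stricmp is
non-standard -- on systems that spell it strcasecmp, substitute that):

#include <stdio.h>
#include <string.h>  /* stricmp() is declared here on most Amiga compilers */

/* Show how stricmp() orders a pair of strings. */
static void show(const char *a, const char *b)
{
    int r = stricmp(a, b);

    printf("stricmp(\"%s\", \"%s\") = %d (%s)\n", a, b, r,
           r < 0 ? "a first" : r > 0 ? "b first" : "equal");
}

int main(void)
{
    show("abc", "ABC");    /* equal everywhere */
    /* '_' (0x5F) sits between 'Z' (0x5A) and 'a' (0x61) in ASCII, so
     * its order relative to letters depends on whether the libc folds
     * to upper or lower case before comparing: */
    show("ab_", "abc");
    return 0;
}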