comm/www/urlx.lha

Short: V1.02 Extract URLs from any file+sort++
Author: frans at xfilesystem.freeserve.co.uk (Francis Swift)
Uploader: frans at xfilesystem.freeserve.co.uk (Francis Swift)
Type: comm/www
Architecture: m68k-amigaos
Date: 1999-09-06
Replaces: urlx.lha
Download: comm/www/urlx.lha
Readme: comm/www/urlx.readme
Downloads: 719

These are some quick'n'nasty hacks, but I've included the source for you to
look at, especially as urlx uses btree routines and there aren't that many
simple examples of using btrees.

The btree routines used are by Christopher R. Hertel and are available
in full on the Aminet as BinaryTrees.lzh in dev/c.


V1.02
-----
Some bugfixes/improvements in scanv, plus a new template option in urlx,
for which I've included an example template file for one particular
version of Voyager. Use something like

urlx -p -a -u -t temp_voyager infile Bookmarks.html

to get an html bookmarks file.

V1.01
-----
Added functionality to scanv to let it be used instead of treecat for the
Voyager cache only. This eliminates some of the bogus URLs that would be
thrown up by the previous method (below) using treecat|urlx.
The new method for scanning the Voyager cache (from sh/pdksh) is e.g.

scanv -c dh0:Voyager/cache | urlx -p -u - outfile

which uses the new -c flag to cat (output) the contents of each file,
which are then piped through urlx for processing. Of course, treecat is
still necessary for other caches, e.g. AWeb and Netscape.


urlx
----
This program searches a file for URLs (http:// etc.) and prints them
or writes them to a file. Internally it stores them in a btree, which
allows duplicates to be eliminated and, optionally, the output to be
sorted. There are various options:
 -s        selects a simple alphabetic sort for the output
 -u        selects a special URL sort that should provide better grouping
           of similar site names (basically it sorts the first URL element
           in groups backwards)
 -h        selects HTML output format for making quick bookmark files,
           instead of the default straight text output
 -t <file> use a template file for output formatting
 -p        retain parameters after URLs; by default these are ignored
 -a        allow accented characters in URLs (i.e. characters > 127)
 -.<ext>   select just files with extension .<ext>; for example, to show
           only .jpg URLs you would use -.jpg, and for .html you would
           use -.htm (which matches both .htm and .html)
 -i        a special file selection option which tries to intelligently
           select only URLs that are likely to be HTML pages, both by
           using the extension and by examining the path

Basically there are lots of options, but you'll probably just end up using:

urlx -u infile outfile

which uses the special url sort, or

urlx -u -h infile outfile.html

for making a bookmark file.

In both of the above examples you might want to use -p to retain parameters
(the bits after the question marks, e.g. http://yes.or.no?wintel=crap).

treecat
-------
This is just a quick hack to let shell (sh/pdksh) users grab URLs from
a complete directory tree. urlx accepts a single dash as meaning input
is from stdin, so you can use something like

treecat cachedirectorypath | urlx -u - outfilename

to produce a file containing every URL in every file in your cache.
You can use this on any browser cache tree.

scanv
-----
This is used specifically to pick out the URLs from the headers of the files
in a Voyager cache. This is just the URL of the file itself; by default the
contents are not examined.
NEW (1.01): -c flag to cat (output) the contents of a file for piping to urlx.

urlv
----
This is used specifically to grab URLs from a Voyager history file, usually
called URL-History.1.

urla
----
This is used specifically to grab URLs from an AWeb cache index file,
usually called AWCR.

stricmp_test
------------
Just a quick test program to see which order the compiler (the libc, really)
sorts strings into on stricmp calls. Different compilers use different
orders :-(


Contents of comm/www/urlx.lha
 PERMSSN    UID  GID    PACKED    SIZE  RATIO     CRC       STAMP          NAME
---------- ----------- ------- ------- ------ ---------- ------------ -------------
[generic]                 6374   10692  59.6% -lh5- 57c8 Sep  3  1999 urlx/bin/scanv
[generic]                 4847    8476  57.2% -lh5- 2d87 Aug 29  1999 urlx/bin/stricmp_test
[generic]                 5994   10068  59.5% -lh5- 686a Aug 26  1999 urlx/bin/treecat
[generic]                 5455    9144  59.7% -lh5- 9ccc Aug 29  1999 urlx/bin/urla
[generic]                 5914    9820  60.2% -lh5- ea23 Sep  1  1999 urlx/bin/urlv
[generic]                 8778   15076  58.2% -lh5- 302a Sep  3  1999 urlx/bin/urlx
[generic]                  497    1824  27.2% -lh5- 08b0 Aug 29  1999 urlx/Makefile
[generic]                 1370    3697  37.1% -lh5- fad4 Sep  3  1999 urlx/scanv.c
[generic]                  324    1318  24.6% -lh5- 0b80 Aug 25  1999 urlx/stricmp_test.c
[generic]                  206     296  69.6% -lh5- 9791 Sep  3  1999 urlx/temp_voyager
[generic]                  812    1984  40.9% -lh5- 49b3 Aug 26  1999 urlx/treecat.c
[generic]                11348   42918  26.4% -lh5- 7439 Jul 26  1997 urlx/ubi_BinTree.c
[generic]                 9193   35348  26.0% -lh5- c7dd Jul 26  1997 urlx/ubi_BinTree.h
[generic]                 1018    2480  41.0% -lh5- 6a26 Aug 29  1999 urlx/urla.c
[generic]                 1496    3657  40.9% -lh5- 0ee5 Sep  1  1999 urlx/urlv.c
[generic]                 3716   12723  29.2% -lh5- b651 Sep  3  1999 urlx/urlx.c
[generic]                 1789    3949  45.3% -lh5- 8eb3 Sep  3  1999 urlx/urlx.readme
---------- ----------- ------- ------- ------ ---------- ------------ -------------
 Total        17 files   69131  173470  39.9%            Sep  6  1999