Pavuk web/ftp grabber: http://www.idata.sk/~ondrej/pavuk/about.html

Date: Mon, 4 Oct 1999 02:16:51 +0200 (MEST)
From: Szabolcs Szakacsits <szaka@divine.city.tvnet.hu>
To: "L. Cranswick"
Cc: wget@sunsite.auc.dk
Subject: Re: mirroring (fwd)

On Sun, 3 Oct 1999, L. Cranswick wrote:

> > > It sounds like the copy command you are using is not
> > > keeping the time-stamps intact. Thus Wget can
> > >   tar cf - . | (cd /to/this/directory/ ; tar xpf - )
> >
> >   cp -a dir1 dir2
>
> The above tar command will also duplicate directory
> structures (like the DOS xcopy).

So does the above-mentioned cp command. It also takes care of sparse
files, special file types, soft and hard links, timestamps, permissions,
etc. I (and many others) even use it for copying disk partitions when
their sizes differ (i.e. when dd doesn't help). However, I forgot to
mention that I was talking about GNU cp; I know some commercial Unix
vendors still ship an ancient cp. The magic is in the -a option, but
this is really RTFM.

> With this Pavuk software that has been mentioned -
> will it duplicate a web and ftp site with directory
> structure

This is the default, and I prefer its extension of the way the files
are stored, i.e.:

  protocol/host_portnumber/and_here_is_the_same_dir_structure

e.g.:

  http/www.somehost.org_8080/and_here_is_the_same_dir_structure

All in all, it won't mess things up if you get http://host/index.html
and ftp://host/index.html, or http://host/index.html and
http://host:8000/index.html, etc. But you also have the option to
download the whole Internet into a single directory, because the
remote-to-local filename mappings can be stored.

> and internal links like WGET will?

This is also the default. Furthermore, you have options controlling
when and what to convert (some or all of them), or whether to convert
at all.

> Doesn't hurt to know what else is out there.

I have been using wget for many years, but over time I got frustrated
with its limited, sometimes malfunctioning features and with the fact
that it is not being maintained or developed further. One of wget's
most disturbing shortcomings is that it can't properly do a recursive
download to depth/level n. Yes, there is the -l option, but it is not
enough for Real World(TM) usage. Here are the download strategies pavuk
can use:

  -url_strategie $strategie - scheduling strategy for URLs
                              (i.e. the order in which URLs will be downloaded)
    $strategie is one of:
      level  - level order in URL tree
      leveli - level order in URL tree, but inline objects go first
      pre    - pre-order in URL tree
      prei   - pre-order in URL tree, but inline objects go first

wget implements only 'level'. This is the worst of them all: e.g. if
you kill a recursive download, or it hangs/stops/etc., then you usually
won't end up with any complete web pages. Even when it succeeds, you
can't view complete pages while the download is in progress, and you
get partially and unnecessarily downloaded pages, because wget doesn't
distinguish between inline objects and ordinary links.

But wget has some advantages too. Its name is only 4 characters long
against 5 -- but this can be fixed with an alias. It has short options
while pavuk does not -- this could easily be fixed, and I think
wget-"compatible" short-form options would be nice. It takes less space
on disk (at the price of less functionality and no optional GUI):

% ll ={wget,pavuk}
-rwxr-xr-x   1 root     root       110388 Oct 11  1998 /usr/bin/wget*
-rwxr-xr-x   1 root     root       361492 Oct  4 01:30 /usr/local/bin/pavuk*

        Szaka
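
PS. To spell out the timestamp point above -- a minimal sketch, assuming
GNU cp and GNU tar, with placeholder paths and URL: as long as the local
copy keeps the original timestamps, wget's -N (timestamping) option
remains useful on the next mirror run.

  # copy the mirrored tree, preserving timestamps, permissions and links
  cp -a mirror/ /backup/mirror/

  # roughly equivalent tar pipe (does not require GNU cp)
  (cd mirror && tar cf - .) | (cd /backup/mirror && tar xpf -)

  # subsequent run: fetch only files newer than the preserved local copies
  wget -r -N -l5 http://www.somehost.org/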