wget to mass download files

Sometimes I want to download all the files linked on a page. The FlashGot plugin works, but it involves clicking, which can be a pain if you have a lot of pages to download.

Recently I’ve been wanting to download PDFs off a page. It turns out I can do so with wget on the command line:

 <pre class="src src-sh">wget -U firefox -r -l1 -nd -e <span style="color: #eedd82;">robots</span>=off -A <span style="color: #ffa07a;">'*.pdf'</span> <span style="color: #ffa07a;">'url'</span>

However, the page I wanted to download from used the same name for a lot of the PDFs. The above command should not overwrite files (.1, .2, …, gets appended), but the accept list (-A '*.pdf') deletes these appended files because they no longer match the pattern. The issue is brought up here. To get around this, do

 <pre class="src src-sh">wget -U firefox -r -l1 -nd -e <span style="color: #eedd82;">robots</span>=off -A <span style="color: #ffa07a;">'*.pdf,*.pdf.*'</span> <span style="color: #ffa07a;">'url'</span>

Note that it is best to put the url in quotes, since I was having an issue where the same files were being downloaded.

I’ll be using wget more for downloading from now on!

Also, check here for an example with cookies and referrers.
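
Something along these lines should work for that case (a sketch only; cookies.txt and the listing-page URL are placeholders I made up, and the cookie file has to be exported from the browser first):

    wget --load-cookies cookies.txt \
         --referer 'http://example.com/listing-page' \
         -U firefox -r -l1 -nd -e robots=off -A '*.pdf' 'url'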

UPDATE 7/27/2010

Suppose I want to download a list of links that differ only slightly (say, by a number). For example, http://example.com/whatever1 through http://example.com/whatever100. A simple bash script with wget:

 <pre class="src src-sh">wget_script.sh

#! /usr/bin/env bash

## wget “$1.fmatt” -O fmatt.pdf ## wget “$1.bmatt” -O bmatt.pdf for chapnum in seq 1 $2; do wget “$1$chapnum” -O ch$chapnum.pdf ##echo $1$chapnum done

I can now do wget_script.sh "http://example.com/whatever" 100 to download the 100 files.
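
As an aside, for a plain numeric range like this, bash brace expansion can do the same job without a script; the files just keep their URL-derived names (whatever1, whatever2, …) instead of ch1.pdf, ch2.pdf, …:

    wget 'http://example.com/whatever'{1..100}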

7 comments

  1. The url you posted is to an html website. From there, there is a link to the pdf file. The tricky thing is that the page contains frames: one has the pdf file, and the right-side one also has the pdf link. I’m sure you can tweak wget to pick up the pdf link. The file still ends in .pdf.

  2. Thanks for your help. Yeah, that’s the tricky part. I don’t want to have to enter every single pdf link into Wget; that would defeat the whole purpose of using it in the first place.

  3. I don’t think you’ve searched hard enough into wget. I am nearly 100% positive it would work. Did you even try what I posted? If that doesn’t work, continue searching and reading up on wget.

  4. I’m sure it will work too; I just have to learn more about Wget (today is the first time I’ve used the program). I can’t try it right now because I’m at work, so I’ll have to wait until I get home.

  5. But what about a site where you need to log in (with a password) in order to get to the page where the sought files are?
