Downloading, cURL vs. wget

I do more and more downloading using the command line these days, mainly using wget and cURL. This is a good comparison of the two. Both are great at downloading. cURL supports more protocols (outside of http, https, ftp) and is bi-directional. wget can download files recursively (links on a webpage, links that appear on pages linked from that page, and so on).
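
For a one-off download the two are pretty much interchangeable; here is a rough sketch of the differences (the URLs are just placeholders):

 <pre class="src src-sh">## both fetch a file and save it under its remote name
wget http://example.com/file.tar.gz
curl -O http://example.com/file.tar.gz

## cURL is bi-directional: it can also upload, e.g. over FTP
curl -T local.txt ftp://example.com/incoming/ --user name:password

## wget can mirror recursively (here: two levels deep)
wget -r -l2 http://example.com/
</pre>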

Automate downloading books and PDFs on springerlink.com

Having an electronic version of a book is great. I can skim and search through it very easily. Although I do find hardcopies useful at times, I prefer softcopies 99% of the time due to their accessibility and searchability.

Most universities have deals with publishers where students can access the electronic version of a book at the publisher’s website. This saves me a trip to the library when I need a book and solves the “book checked out” issue.

Springer books can be found online. You have to be on your school’s network (or VPN) to access them. The crappy thing is they put up books by chapters, so you have to manually save each one if you want the entire book. I’ve used wget before to easily download the PDF files. However, wget doesn’t seem to work anymore because the files are no longer plain html links. A quick Google query (“springerlink download whole book”) yielded the springer_download python script. It depends on stapler, which in turn depends on pyPdf. To install and use:

 <pre class="src src-sh">git clone git://github.com/milianw/springer_download

git clone http://github.com/hellerbarde/stapler.git git clone http://github.com/mfenniak/pyPdf cd pyPDF sudo ./setup.py –install cd ../stapler/ cp ./stapler.sh ~/Documents/bin/ ## or copy it to /usr/local/bin cd ../springer_download cp springer_download.py ~/Documents/bin ## or copy it to /usr/local/bin ## to download springer_download.py -l http://springerlink.com/content/HASH/STUFF ## output: a concatenated, full pdf file of the book

Very neat!

wget to mass download files

Sometimes I want to download all files on a page. The FlashGot plugin works, but it involves a lot of clicking, which can be a pain if you have many pages to download.

Recently I’ve been wanting to download PDFs off a page. I found out that I can do so with wget on the command line:

 <pre class="src src-sh">wget -U firefox -r -l1 -nd -e <span style="color: #eedd82;">robots</span>=off -A <span style="color: #ffa07a;">'*.pdf'</span> <span style="color: #ffa07a;">'url'</span>

However, the page I wanted to download from uses the same filename for a lot of the PDFs. The above command does not overwrite duplicates (wget appends .1, .2, … to the names). However, the accept list (-A '*.pdf') then deletes these renamed files because they no longer match the pattern. The issue is brought up here. To get around this, do

 <pre class="src src-sh">wget -U firefox -r -l1 -nd -e <span style="color: #eedd82;">robots</span>=off -A <span style="color: #ffa07a;">'*.pdf,*.pdf.*'</span> <span style="color: #ffa07a;">'url'</span>

Note that it is best to put the url in quotes; without them I was having an issue where the same files were being downloaded.
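
For example (made-up URL), an unquoted address with an & in it gets chopped up by the shell before wget ever sees it:

 <pre class="src src-sh">## unquoted: the shell backgrounds the command at '&' and wget gets a truncated url
wget http://example.com/page?type=pdf&sort=name

## quoted: the full url is passed through intact
wget 'http://example.com/page?type=pdf&sort=name'
</pre>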

I’ll be using wget for downloading a lot more from now on!

Also, check here for an example with cookies and referrers.
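
The general shape is something like this (cookies.txt exported from the browser and the URLs are placeholders):

 <pre class="src src-sh">## send a browser-exported cookie file and a referrer header along with the request
wget --load-cookies cookies.txt --referer='http://example.com/listing' -U firefox 'http://example.com/files/some.pdf'
</pre>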

UPDATE 7/27/2010

Suppose I want to download a list of links that differ only by a number. For example, http://example.com/whatever1 through http://example.com/whatever100. A simple bash script with wget:

 <pre class="src src-sh">wget_script.sh

#! /usr/bin/env bash

## wget “$1.fmatt” -O fmatt.pdf ## wget “$1.bmatt” -O bmatt.pdf for chapnum in seq 1 $2; do wget “$1$chapnum” -O ch$chapnum.pdf ##echo $1$chapnum done

I can now run ./wget_script.sh "http://example.com/whatever" 100 (after a chmod +x) to download the 100 files.
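
As a side note, cURL can do this without a script since it supports numeric ranges in the URL; a quick sketch with the same placeholder URL:

 <pre class="src src-sh">## download whatever1 through whatever100, saving each as ch<number>.pdf
curl -o 'ch#1.pdf' 'http://example.com/whatever[1-100]'
</pre>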