How to download multiple PDFs from webpages and prepare them for text analysis

I’m just picking up an old, neglected research project that involves downloading lots of individual article PDFs linked from webpages, stringing these PDFs together into yearly journal volumes, then turning the PDFs into plain text in order to run them through text analysis software. I’ve figured out a process for doing this relatively efficiently, so I’m sharing my process here in case it can be helpful to someone.

Downloading multiple PDFs by hand is a big pain in the neck. But if the PDFs are linked directly from the article citations on the webpage, you may be able to automate the download by following the directions below. (Caveat: if Zotero doesn’t have a site translator for a website, it won’t “see” the citations listed on the site; see page 2 of the document linked here. This is a drag, because you’ll then need to download the PDFs by hand.) Here’s a screenshot from Project Muse that shows citations with links to PDFs that can be downloaded using my method.

[Screenshot: Project Muse article citations with links to downloadable PDFs]

To automate the process, you need Firefox, Zotero, and a Firefox/Zotero plugin called Zotfile. Zotero is free citation management software that will run as a plugin to Firefox. Zotfile is a Firefox/Zotero plugin to manage your attachments (including renaming and moving them, which is what I describe below). I also include the steps I took to string the PDFs together using Adobe Acrobat Pro and then turn these into text files.

A. Export article PDFs from a webpage and save them to your computer hard drive:

  1. You will use Zotero to collect the files from the website and Zotfile to rename the files and export them into a folder on your hard drive.
  2. In Firefox > Tools > Add-ons, select Zotfile preferences:
    • a. Configure PDF export location:
      • under the “General Settings” tab, next to “Custom Location,” select “Choose” and pick the folder where you want the renamed PDFs to end up on your hard drive.
    • b. To make the article filenames human readable and consistent, use Zotfile to rename the files based on the information Zotero collects from the citations. Configure the PDF re-naming scheme this way:
      • select Zotfile’s “Renaming Rules” tab
      • under “Format for all Item Types except Patents” put in appropriate codes for the naming scheme you want to use. Mine was: {%y}_{%v}.{%e}_{%f}
        • translation: year_volume.journalIssue_pages (so following my renaming rules, a renamed filename might look like: 2005_66.1_84-85)
      • Note: you can find all your renaming options on the Zotfile website
  3. In Zotero, make a “Saved Search” with the rule “Date Added is in the last 1 days.” This will make it easy to find the files you are downloading.
  4. Go to the website with the files you want to download
    • a. Click on Zotero’s little yellow folder icon in the browser bar
    • b. “Select all” (or select whichever files you want from the list) and click “OK”
    • c. wait patiently until all citations + PDF files are downloaded
  5. For QA, count the number of articles on the website. Remember this number.
  6. Once the citations and PDFs are collected by Zotero, Zotfile should automatically rename the files and put them in the correct folder on your hard drive. Sometimes it does, but sometimes it doesn’t, and then you need to do it manually.
    • a. check the destination folder to see if Zotfile automatically renamed the files and put them there. If so, go to #7 below. If not, follow these directions:
    • b. in Zotero highlight all the citations you just downloaded,
    • c. control click (or right click) and select: Manage Attachments > Rename Attachments
    • d. A dialog box will appear that says “Do you want to move and rename X attachments?”
      • X should be the number of articles that you downloaded from the website. If it isn’t, there was a problem with the download. Delete the Zotero citations/files and do the download again.
      • If the number of attachments matches the number of articles on the website, click “OK”
  7. Navigate to the folder where you told Zotfile to export your files. Double check that the number of exported files matches the number of articles you counted on the website. It’s a good idea to open the PDFs to verify that they’re OK and didn’t get corrupted during download.
  8. Delete the citations & files from Zotero (unless for some reason you want to keep them in your Zotero library)
  9. Repeat on a new webpage until you are done downloading all the files you need from the web.
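The checks in steps 5 and 7 can also be run as a script. This is only a rough sketch, not part of my original process: the function name check_export_folder and the filename pattern are my own assumptions, and the pattern assumes the {%y}_{%v}.{%e}_{%f} rule with plain numeric pages. It counts the PDFs, flags names that don’t fit the scheme, and flags files that don’t start with the “%PDF-” marker that PDF files normally begin with. It only catches gross failures (such as an error page saved in place of a PDF), so it doesn’t replace opening a few files by eye.

```python
import re
from pathlib import Path

# Assumed pattern for names produced by the {%y}_{%v}.{%e}_{%f} rule,
# for example 2005_66.1_84-85.pdf. Adjust it if your rule differs.
NAME_PATTERN = re.compile(r"^\d{4}_\d+\.\d+_\d+(-\d+)?\.pdf$")


def check_export_folder(folder, expected_count):
    """Return a list of human-readable problems found in the export folder."""
    problems = []
    pdfs = sorted(Path(folder).glob("*.pdf"))

    # Compare against the article count noted from the website.
    if len(pdfs) != expected_count:
        problems.append(f"expected {expected_count} PDFs, found {len(pdfs)}")

    for pdf in pdfs:
        # Flag files that Zotfile did not rename as expected.
        if not NAME_PATTERN.match(pdf.name):
            problems.append(f"unexpected filename: {pdf.name}")
        # A PDF normally begins with the bytes "%PDF-".
        with pdf.open("rb") as handle:
            if handle.read(5) != b"%PDF-":
                problems.append(f"not a PDF header: {pdf.name}")

    return problems
```

To use it, call something like check_export_folder("path/to/export/folder", 12), where 12 is the number of articles you counted on the webpage. An empty list means none of these checks failed.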

B. Combine individual articles into journal volume PDFs:

Using Adobe Acrobat Pro’s “Combine Files into PDF” function, select all the PDFs for one volume, combine them, then save the file.

Note: I usually prefer to use free tools, but I happen to have Adobe Acrobat Pro on my computer. To find a free tool to merge PDFs, do a Google search for “combine pdfs free.”
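Whichever tool does the merging, it helps to know which files belong to which volume and in what order. Sorting by filename alone can go wrong, because issue 10 sorts before issue 2 and page 100 sorts before page 84 when they are compared as text. Here is a small sketch that groups and orders the files numerically. It assumes the year_volume.issue_pages naming scheme from section A with numeric pages; the function name group_by_volume is my own, and the script only builds the ordered lists, it does not merge anything.

```python
import re
from collections import defaultdict

# Assumed filename layout: year_volume.issue_firstpage-lastpage.pdf
NAME_PATTERN = re.compile(r"^(\d{4})_(\d+)\.(\d+)_(\d+)(?:-\d+)?\.pdf$")


def group_by_volume(filenames):
    """Map (year, volume) to its filenames, ordered by issue then first page."""
    volumes = defaultdict(list)
    for name in filenames:
        match = NAME_PATTERN.match(name)
        if match is None:
            continue  # skip names that do not follow the assumed layout
        year, volume, issue, first_page = (int(part) for part in match.groups())
        volumes[(year, volume)].append((issue, first_page, name))

    # Sort numerically so issue 10 follows issue 2 and page 100 follows page 84.
    return {
        key: [name for _, _, name in sorted(entries)]
        for key, entries in sorted(volumes.items())
    }
```

Each list in the result is the order in which to add that volume’s files to your PDF merging tool.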

C. Turn volume PDFs into text:

  1. Open TextWrangler. (If you want to maintain the original PDF’s line breaks go to Preferences > Editor Defaults tab and de-select Soft Wrap.)
  2. Open the PDF in Adobe Acrobat Pro or another PDF tool.
  3. Select all and copy. (If your file is very large, it may take some time for the data to be saved to the clipboard.)
  4. Paste into TextWrangler.
  5. Save as .txt.
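Because a very large copy and paste can be cut short, a quick size check on the saved text files may be worthwhile. The sketch below is an optional extra, not part of the steps above. It assumes the .txt files were saved as UTF-8 in one folder, and the function names text_stats and report are my own. A volume whose counts are far smaller than its neighbors is worth re-doing.

```python
from pathlib import Path


def text_stats(path):
    """Return (line_count, word_count, char_count) for one plain-text file."""
    # errors="replace" keeps the check running if the encoding guess is wrong.
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return len(text.splitlines()), len(text.split()), len(text)


def report(folder):
    """Print one summary line per .txt file in the folder."""
    for txt in sorted(Path(folder).glob("*.txt")):
        lines, words, chars = text_stats(txt)
        print(f"{txt.name}: {lines} lines, {words} words, {chars} characters")
```

Calling report("path/to/text/folder") prints the counts for every volume so you can compare them side by side.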

Now you’re ready to clean your data. I’m still learning how to prepare text data for analysis, so I have no cleaning tips for you yet. Stay tuned!