
Web Scraping in Python

Sometimes the information you want is clearly available on the web, but not in a format that's easy to download. For example, the data may sit in an HTML table where you have to click through a few links to reach each data point.

Last summer I was trying to do some model comparisons using IPCC climate data from the CMIP5 suite of experiments. I needed to choose models that output the right variables. Some websites were down, and I was looking for a workaround. I found one, but it required a lot of clicking around to check each model. Since the sequence of clicks and Command-F searches was the same for each model, I knew there had to be a way to script the process. That's how I came across a Beautiful Soup tutorial from Miguel Grinberg: http://blog.miguelgrinberg.com/post/easy-web-scraping-with-python

Following that tutorial, I came up with this script.

Web Scraping to Check CMIP5 Model Output

The first step, after importing the packages, is to fetch the IPCC overview page, which has a table of all the models used in CMIP5 and the experiments performed.

The HTML of the page will be available in response.text.

In [1]:
from bs4 import BeautifulSoup
import requests

response = requests.get('http://www.ipcc-data.org/sim/gcm_monthly/AR5/Reference-Archive.html')
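Before parsing anything, it's worth a quick check that the request actually succeeded. Something like this (just a suggested habit, not part of the original workflow):

# A status code of 200 means the page loaded successfully.
print(response.status_code)

# Peek at the start of the raw HTML.
print(response.text[:300])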

To make sense of the response, we'll have to look at the HTML ourselves. To do that, go to the IPCC website, right-click on part of the page, and choose "Inspect Element". This opens the developer pane, where you can see all of the HTML. In Safari, you'll first need to enable it under Preferences > Advanced > Show Develop menu in menu bar.

The next thing we’ll do is specify the experiments and variables we’re interested in, then collect links. I’ll set it up so that experiments and variables are both lists.

In [2]:
experiments = ['historicalExt', 'historical']
variables = ['sic', 'ridgice', 'divice', 'eshrice', 'nshrice']
# Later we'll collect the output links in a dictionary keyed by experiment name.

All of the links that we’re looking for are inside the element

<table id="customers" width="75%" border="1" cellpadding="5" 
bgcolor="#f0fFf0" align="center">...</table>

We parse the HTML text into the variable soup, then select the table we want. Luckily, it happens to have an id attribute.

We can get rid of the links that point to descriptions of the experiments, because we want to be able to search for the experiment and get links to specific model output data.

In [3]:
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table', {'id':'customers'})
links = [d for d in table.find_all('a') if "CMIP5-Experiments" not in d.get('href')]
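As an aside: if any anchor in the table lacked an href, d.get('href') would return None and the membership test would raise a TypeError. The links in this table all have hrefs, but a slightly defensive variant would be:

# Skip anchors without an href before filtering out the experiment descriptions.
links = [d for d in table.find_all('a')
         if d.get('href') and "CMIP5-Experiments" not in d.get('href')]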

Either way, the variable links now contains all of the hyperlinks in the table. We can pick out the ones matching our experiments list quite simply:

In [4]:
for x in links:
    if x.text in experiments:
        print(x.get('href'), x.text)
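Rather than just printing them, we could keep the matches around as (experiment, link) pairs. This is essentially what the gatherLinks function at the end of the post will do:

# Pair each matching experiment name with its link for later use.
matches = [(x.text, x.get('href')) for x in links if x.text in experiments]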

Now we need to figure out how to use the links. For exploration, we’ll pick a link:

In [5]:
link = 'http://cera-www.dkrz.de/WDCC/CMIP5/Compact.jsp?acronym=BCB1hi'
response = requests.get(link)
soup = BeautifulSoup(response.text, 'html.parser')

The title is an automatically generated sentence with the model name as the first word. The word "Title" sits in a div of class header3, but the title text itself is just bare text below it; it isn't wrapped in any tag. What we can do is use the next_sibling attribute: we find the div tag immediately before the title text, and the node right after it is what we grab.

In [6]:
for div in soup.find_all('div', {'header3'}):
    if div.text=="Title":
        model = div.next_sibling
        
    if "WDCC Data Access" in div.text:
        cera_link = div.next_sibling
        
model
cera_link
Out[6]:
'\n'
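If the sibling-hopping feels opaque, a toy example (hypothetical HTML, just to illustrate the structure) shows what next_sibling returns when the next node is bare text:

toy = BeautifulSoup('<div class="header3">Title</div>\nbcc-csm1-1 is the model...', 'html.parser')
div = toy.find('div')
print(repr(div.next_sibling))   # '\nbcc-csm1-1 is the model...' -- a bare text node, no tag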

Python is great for string processing. To break this into words, I use the split method, splitting on spaces. I want the first element, which is in position 0, and the first two characters of it are control characters left over from the page, so I omit them.

In [7]:
model = model.split(' ')[0]
model = model[2:]
model
Out[7]:
'bcc-csm1-1'
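Incidentally, if those leading control characters are just whitespace (they look like a carriage return and newline), then split() with no argument is a slightly more robust alternative, since it splits on runs of any whitespace and ignores leading runs entirely. Assuming a hypothetical raw_title held the unprocessed text node from div.next_sibling:

# Hypothetical: raw_title is the bare text node grabbed above.
# split() with no argument needs no manual character-count slicing.
model = raw_title.split()[0]   # 'bcc-csm1-1'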

Next we need to grab the link to the CERA database. This time the next sibling is just a newline character, which isn't very helpful. But we can hop one more sibling over and land on the link we want.

In [10]:
for div in soup.find_all('div', {'header3'}):    
    if "WDCC Data Access" in div.text:
        cera_link = div.next_sibling.next_sibling.get('href')
cera_link  
Out[10]:
'http://cera-www.dkrz.de/WDCC/ui/EntryList.jsp?acronym=BCB1hi'
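As an alternative to chaining next_sibling twice, bs4's find_next_sibling method skips intervening text nodes when you give it a tag name. Assuming the data-access link is an a tag (which the get('href') call suggests), this is equivalent and a bit clearer:

for div in soup.find_all('div', {'header3'}):
    if "WDCC Data Access" in div.text:
        # find_next_sibling('a') hops over the '\n' text node straight to the link.
        cera_link = div.find_next_sibling('a').get('href')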

Now let's wrap these steps into a function:

In [11]:
def getModelInfo(link):
    # link must be a string
    
    response = requests.get(link)
    soup = BeautifulSoup(response.text, 'html.parser')

    for div in soup.find_all('div', {'header3'}):
        if div.text=="Title":
            model = div.next_sibling
            
        if "WDCC Data Access" in div.text:
            cera_link = div.next_sibling.next_sibling.get('href')
  
    model = model.split(' ')[0][2:]
    return(model, cera_link)
    
In [12]:
# Make sure it works:
model, cera_link = getModelInfo(link)
print(model, cera_link)
bcc-csm1-1 http://cera-www.dkrz.de/WDCC/ui/EntryList.jsp?acronym=BCB1hi
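One caveat: if a page happened to be missing the Title or WDCC Data Access sections, getModelInfo would raise an UnboundLocalError. A hedged hardening (not in the original) is to initialize defaults so a bad page surfaces as None instead:

def getModelInfo(link):
    # As above, but with defaults so missing sections return None
    # rather than raising UnboundLocalError.
    model, cera_link = None, None

    response = requests.get(link)
    soup = BeautifulSoup(response.text, 'html.parser')

    for div in soup.find_all('div', {'header3'}):
        if div.text == "Title":
            model = div.next_sibling
        if "WDCC Data Access" in div.text:
            cera_link = div.next_sibling.next_sibling.get('href')

    if model is not None:
        model = model.split(' ')[0][2:]
    return(model, cera_link)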

Awesome! The last thing we need to do is open the CERA link and search for our variables. All we're trying to do is determine which, if any, of a set of variables appear in the model output. For each variable that shows up in the HTML text, we'll record True in var_present.

In [13]:
response = requests.get(cera_link)
soup = BeautifulSoup(response.text, 'html.parser')

By inspecting the HTML, I found that all the information I need is in the table with class "list". I'll load that table, then look at the text in a typical row. A carriage return character, \r, appears right after the variable name, so I can use a bit more string manipulation to pull the variable out.

In [14]:
table = soup.find('table',{'list'})
row = table.find_all('tr')[10]
row.text
Out[14]:
'\n\n\n\n\n\nBCB1hiDADprsn111v1\ncmip5 output1 BCC bcc-csm1-1 historical day atmos day r1i1p1 v1 prsn\r\n\ndataset\ncompletely archived\n\n\n\n'
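Applying the string manipulation to that exact row text, step by step:

first_part = row.text.split('\r')[0]   # everything up through '... r1i1p1 v1 prsn'
first_part.split(' ')[-1]              # 'prsn', the variable name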

A little helper function to process text:

In [15]:
def getVar(row):
    var = row.text.split('\r')[0]
    var = var.split(' ')[-1]
    return(var)

var_present_full = [getVar(row) for row in table.find_all('tr')][1:]
var_present = [var in var_present_full for var in variables]
var_present
Out[15]:
[True, False, False, False, False]
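A bare list of booleans is easy to misread, so if you prefer, zip the flags back up with the variable names:

# Map each variable to whether it appeared in the model output.
dict(zip(variables, var_present))
# {'sic': True, 'ridgice': False, 'divice': False, 'eshrice': False, 'nshrice': False}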

Wrap this together and try it out:

In [19]:
def checkForVariables(cera_link, variables):
    
    response = requests.get(cera_link)
    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find('table',{'list'})
    var_present_full = [getVar(row) for row in table.find_all('tr')][1:]
    var_present = [var in var_present_full for var in variables]
    
    return(var_present)
print(variables)

checkForVariables(cera_link, variables)
['sic', 'ridgice', 'divice', 'eshrice', 'nshrice']
Out[19]:
[True, False, False, False, False]
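One optional nicety before assembling the full script: it fetches two pages per model across every experiment, so pausing briefly between requests is kind to the servers. A minimal sketch, using a hypothetical helper not in the original script:

import time

def politeGet(url, delay=1.0):
    # Hypothetical wrapper: wait briefly before each fetch so repeated
    # requests don't hammer the IPCC/CERA servers.
    time.sleep(delay)
    return requests.get(url)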

Finally, we can take all these pieces and put them together into one script.

In [20]:
# Webscraping script that checks the CERA database to see which climate models
# experiments include a list of variables of interest.
# Author: Daniel Watkins

from bs4 import BeautifulSoup
import requests


# Specify the CMIP5 experiments and variables that you want to investigate
experiments = ['historicalExt','historical']
variables = ['ridgice','divice','eshrice','nshrice']


def getModelInfo(link):
    # Grab the name of the model and the link to the model output
    # from the CERA info page.
    
    response = requests.get(link)
    soup = BeautifulSoup(response.text, 'html.parser')

    for div in soup.find_all('div', {'header3'}):
        if div.text=="Title":
            model = div.next_sibling
            
        if "WDCC Data Access" in div.text:
            cera_link = div.next_sibling.next_sibling.get('href')
  
    model = model.split(' ')[0][2:]
    return(model, cera_link)
    
def getVar(row):
    var = row.text.split('\r')[0]
    var = var.split(' ')[-1]
    return(var)

def checkForVariables(cera_link, variables):
    
    response = requests.get(cera_link)
    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find('table',{'list'})
    var_present_full = [getVar(row) for row in table.find_all('tr')][1:]
    var_present = [var in var_present_full for var in variables]
    
    return(var_present)

def gatherLinks(cmip5_link, experiments):
    
    response = requests.get(cmip5_link)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    table = soup.find('table', {'id':'customers'})
    
    # The first few links are to descriptions of the experiments rather than to 
    # experimental data, so we discard them.
    links = [d for d in table.find_all('a') if "CMIP5-Experiments" not in d.get('href')]
    
    links_experiments = [[x.text, x.get('href')] for x in links if x.text in experiments]
    
    links_dict = {}
    for exp in links_experiments:
        try:
            links_dict[exp[0]].append(exp[1])
        except KeyError:
            links_dict.update({exp[0]:[exp[1]]})
    return(links_dict)


link_dict = gatherLinks('http://www.ipcc-data.org/sim/gcm_monthly/AR5/Reference-Archive.html',
                        experiments)

for dk in link_dict.keys():
    filename = dk + '_info.csv'
    
    # A with block ensures the file is closed even if a request fails.
    with open(filename, 'w') as f:
        # Write header
        f.write('Model,' + ','.join(variables) + '\n')
        
        for url in link_dict[dk]:
            model, cera_link = getModelInfo(url)
            var_present = checkForVariables(cera_link, variables)
            f.write(model + ',' + ','.join([str(var) for var in var_present]) + '\n')
            print('Done with ' + dk + ' ' + model)
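A side note on gatherLinks: the try/except KeyError dance is a standard way to build a dict of lists, but collections.defaultdict does the same job with less ceremony, in case you prefer it:

from collections import defaultdict

# Equivalent to the try/except block in gatherLinks: a missing key
# automatically starts out as an empty list.
links_dict = defaultdict(list)
for name, href in links_experiments:
    links_dict[name].append(href)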