Web Scraping in Python

Sometimes the information you want is clearly available on the web, but it isn’t in an easily obtainable format. For example, the data may be spread across an HTML table where you have to click a few links to get to each data point.

Last summer I was trying to do some model comparisons using IPCC climate data from the CMIP5 suite of experiments. I needed to choose models that output the right variables. Some websites were down, and I was looking for a workaround. I found one, but it required a lot of clicking around to check each model. Since the sequence of clicks and command-F’s was the same for each model, I knew there would be a way to script the process. That’s how I came across a BeautifulSoup tutorial from Miguel Grinberg: http://blog.miguelgrinberg.com/post/easy-web-scraping-with-python

Following that tutorial, I came up with this script.

Web Scraping to Check CMIP5 Model Output

The first step, after loading the packages, is to fetch the IPCC overview page, which has a table of all of the models used in CMIP5 and the experiments performed.

The HTML of the page will be available in response.text.

In [1]:
from bs4 import BeautifulSoup
import requests

response = requests.get('http://www.ipcc-data.org/sim/gcm_monthly/AR5/Reference-Archive.html')

In order to use this, we’ll have to look at the HTML ourselves. To do that, go to the IPCC website, select something, right-click, and choose “Inspect Element”. This opens the developer pane, where you can see all of the HTML. (In Safari, you’ll first need to go to Preferences > Advanced > Show Develop menu in menu bar.)
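If you’d rather stay in Python while poking around, you can also print a short slice of the raw HTML from the response we just downloaded. This is just a quick sanity check, not part of the final script:

# Peek at the first 500 characters of the downloaded page
print(response.text[:500])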

The next thing we’ll do is specify the experiments and variables we’re interested in, then collect links. I’ll set it up so that experiments and variables are both lists.

In [2]:
experiments = ['historicalExt', 'historical']
variables = ['sic', 'ridgice', 'divice', 'eshrice', 'nshrice']
# We'll store the models in a dictionary, with
# the name of the model as the key.

All of the links that we’re looking for are inside the element

<table id="customers" width="75%" border="1" cellpadding="5" 
bgcolor="#f0fFf0" align="center">...</table>

We load the HTML text into the variable soup, then select the table we want. Luckily, it happens to have an id attribute.

We can get rid of the links that point to descriptions of the experiments, because we want to be able to search for the experiment and get links to specific model output data.

In [3]:
soup = BeautifulSoup(response.text)
table = soup.find('table', {'id':'customers'})
links = [d for d in table.find_all('a') if "CMIP5-Experiments" not in d.get('href')]

The variable links now includes all of the hyperlinks in the table. We can pick out the ones whose text matches our experiments list quite simply:

In [4]:
for x in links:
    if x.text in experiments:
        print(x.get('href'), x.text)

Now we need to figure out how to use the links. For exploration, we’ll pick a link:

In [5]:
link = 'http://cera-www.dkrz.de/WDCC/CMIP5/Compact.jsp?acronym=BCB1hi'
response = requests.get(link)
soup = BeautifulSoup(response.text)

The title is an automatically generated sentence with the model name as the first word. The word “Title” is in a div of class header3, but the text we want is just bare text below it; it isn’t inside any tags. What we can do is use the next_sibling attribute: we find the div tag immediately before the title text, and the sibling that follows it is what we grab.

In [6]:
for div in soup.find_all('div', {'header3'}):
    if div.text=="Title":
        model = div.next_sibling
        
    if "WDCC Data Access" in div.text:
        cera_link = div.next_sibling
        
model
cera_link
Out[6]:
'\n'

Python is great for string processing. To break this into words, I use the split method, splitting on spaces. I want the first element, which is in the 0th position, and its first two characters are a leftover newline escape rather than part of the name, so I omit them.

In [7]:
model = model.split(' ')[0]
model = model[2:]
model
Out[7]:
'bcc-csm1-1'

Next we need to grab the link to the CERA database. This time, the “next sibling” is the newline character. That’s not very helpful. But we can move one more step and land on the link we want.

In [10]:
for div in soup.find_all('div', {'header3'}):    
    if "WDCC Data Access" in div.text:
        cera_link = div.next_sibling.next_sibling.get('href')
cera_link  
Out[10]:
'http://cera-www.dkrz.de/WDCC/ui/EntryList.jsp?acronym=BCB1hi'

Now let’s summarize these steps in a function:

In [11]:
def getModelInfo(link):
    # link must be a string
    
    response = requests.get(link)
    soup = BeautifulSoup(response.text)

    for div in soup.find_all('div', {'header3'}):
        if div.text=="Title":
            model = div.next_sibling
            
        if "WDCC Data Access" in div.text:
            cera_link = div.next_sibling.next_sibling.get('href')
  
    model = model.split(' ')[0][2:]
    return(model, cera_link)
    
In [12]:
# Make sure it works:
model, cera_link = getModelInfo(link)
print(model, cera_link)
bcc-csm1-1 http://cera-www.dkrz.de/WDCC/ui/EntryList.jsp?acronym=BCB1hi

Awesome! The last thing we need to do is open up the CERA link, and search for our variables. All that we are trying to do is determine which, if any, of a set of variables appear in the model output. If any of the variables show up in the HTML text, we’ll note the value True in var_present.

In [13]:
response = requests.get(cera_link)
soup = BeautifulSoup(response.text)

By inspecting the HTML, I found that all the information I need is in the table with class “list”. I’ll load that table, then look at the text in a typical row. There’s a special character, \r, that appears right after the variable name, so I can use a bit more string manipulation to pull out the variable.

In [14]:
table = soup.find('table',{'list'})
row = table.find_all('tr')[10]
row.text
Out[14]:
'\n\n\n\n\n\nBCB1hiDADprsn111v1\ncmip5 output1 BCC bcc-csm1-1 historical day atmos day r1i1p1 v1 prsn\r\n\ndataset\ncompletely archived\n\n\n\n'

A little helper function to process text:

In [15]:
def getVar(row):
    var = row.text.split('\r')[0]
    var = var.split(' ')[-1]
    return(var)

var_present_full = [getVar(row) for row in table.find_all('tr')][1:]
var_present = [var in var_present_full for var in variables]
var_present
Out[15]:
[True, False, False, False, False]

Wrap this together and try it out:

In [19]:
def checkForVariables(cera_link, variables):
    
    response = requests.get(cera_link)
    soup = BeautifulSoup(response.text)
    table = soup.find('table',{'list'})
    var_present_full = [getVar(row) for row in table.find_all('tr')][1:]
    var_present = [var in var_present_full for var in variables]
    
    return(var_present)
print(variables)

checkForVariables(cera_link, variables)
['sic', 'ridgice', 'divice', 'eshrice', 'nshrice']
Out[19]:
[True, False, False, False, False]

Finally, we can take all these pieces and put them together into one script.

In [20]:
# Webscraping script that checks the CERA database to see which climate models
# experiments include a list of variables of interest.
# Author: Daniel Watkins

from bs4 import BeautifulSoup
import requests


# Specify the CMIP5 experiments and variables that you want to investigate
experiments = ['historicalExt','historical']
variables = ['ridgice','divice','eshrice','nshrice']


def getModelInfo(link):
    # Grab the name of the model and the link to the model output
    # from the CERA info page.
    
    response = requests.get(link)
    soup = BeautifulSoup(response.text)

    for div in soup.find_all('div', {'header3'}):
        if div.text=="Title":
            model = div.next_sibling
            
        if "WDCC Data Access" in div.text:
            cera_link = div.next_sibling.next_sibling.get('href')
  
    model = model.split(' ')[0][2:]
    return(model, cera_link)
    
def getVar(row):
    var = row.text.split('\r')[0]
    var = var.split(' ')[-1]
    return(var)

def checkForVariables(cera_link, variables):
    
    response = requests.get(cera_link)
    soup = BeautifulSoup(response.text)
    table = soup.find('table',{'list'})
    var_present_full = [getVar(row) for row in table.find_all('tr')][1:]
    var_present = [var in var_present_full for var in variables]
    
    return(var_present)

def gatherLinks(cmip5_link, experiments, variables):
    
    response = requests.get(cmip5_link)
    soup = BeautifulSoup(response.text)
    
    table = soup.find('table', {'id':'customers'})
    
    # The first few links are to descriptions of the experiments rather than to 
    # experimental data, so we discard them.
    links = [d for d in table.find_all('a') if "CMIP5-Experiments" not in d.get('href')]
    
    links_experiments = [[x.text, x.get('href')] for x in links if x.text in experiments]
    
    links_dict = {}
    for exp in links_experiments:
        try:
            links_dict[exp[0]].append(exp[1])
        except KeyError:
            links_dict.update({exp[0]:[exp[1]]})
    return(links_dict)
    




link_dict = gatherLinks('http://www.ipcc-data.org/sim/gcm_monthly/AR5/Reference-Archive.html', 
                        experiments, variables)

for dk in link_dict.keys():
    filename = dk + '_info.csv'
    f = open(filename, 'w')
    
    # Write header
    f.write('Model,' +','.join(variables) + '\n')
    
    for url in link_dict[dk]:
        model, cera_link = getModelInfo(url)
        var_present = checkForVariables(cera_link, variables)
        f.write(model + ',' + ','.join([str(var) for var in var_present]) + '\n')
        print('Done with ' + dk + ' ' + model + '\n')
    f.close()
    


So you want to be a mathematician…

The following is a slightly edited email I sent to a friend who was considering a math major, since he enjoyed his calculus class. This was also before I switched from doing a math PhD to doing an atmospheric science PhD.

Whether you decide on majoring in mathematics depends on a few things.

One of the most difficult things in the world is deciding on a career. Some people find it easy, I guess, but for people who are interested in a lot of fields, it’s really hard. Part of the reason I chose math was to delay that decision – it is often said that a degree in mathematics can take you anywhere.

That claim requires a few caveats, however. Math as a degree can be useful in many fields, but it requires you to plan your other classes carefully, just as you would study biology before going to med school.

If you want to do pure mathematics as a career, you can focus just on your math classes and do any minor. You will then need to get a PhD and are mostly limited to going into academia (becoming a professor). Not a bad option, but the field is becoming more and more crowded.

If you want to do applied mathematics as a career, you have more options. You can get a PhD and go into academia, as in the pure math track, or go work at a national lab or in industry. You can also get a masters degree for a more specific track: biomathematics, computational math, and financial mathematics being the main options.

At the national labs, you do interdisciplinary work. The last project I worked on had elements of biology, forensics, statistics, and computer science. Some of the coolest science happens at the intersections of fields, and that is where I think the future is. You can do especially interesting things if you have expertise in computer science and in another field. So doing math with a minor in CS is an excellent option.

What aspect of calculus is exciting to you? New methods of computation? Blanket statements about properties of functions? Applications to the “real world”?

If you enjoyed calculus and want more, here are some options for what to do next:

  • Applied mathematics: numerical analysis, ordinary differential equations, linear algebra, partial differential equations, scientific computing, complex variables
  • Pure mathematics: real analysis, number theory, abstract algebra
  • Statistics: mathematical statistics, regression, Bayesian statistics
  • Some things I didn’t look into very closely but use a lot of calculus: Mechanical engineering, physics, electrical engineering.

Look at the work being done at places like Oak Ridge, Pacific Northwest, Lawrence Berkeley/Livermore, Los Alamos national labs, and see what catches your interest.

If you were to take just a couple classes to get a flavor of a math degree, take a combined linear algebra and differential equations class, and take intro to mathematical analysis (a proofs class). Of course, talk to friends and professors at your school to see what class would be best for your specific interests.

I think mathematics was a good choice for me, although maybe I should have spent another year as an undergraduate rather than taking 17-18 credit semesters, taken more CS, statistics and physics, and taken at least one biology course. Most of my peers at Drexel are more interested in pure math, and my interests are mainly applied. I plan to take a lot of classes outside of mathematics, mainly in CS (data mining, AI, algorithms, distributed systems, perhaps parallel processing), and do supplemental reading in natural language processing, linguistics and biology. So I think that I’ll still be able to find my niche, although perhaps I should have done a better job researching graduate programs.

Since writing this, I did another internship at Los Alamos and met a bunch of the people who do climate modeling, and learned that atmospheric and oceanic science graduate programs prefer people with math or physics backgrounds over people with meteorology or oceanographic backgrounds. I left the Drexel math PhD program with a master’s degree and started a PhD in atmospheric science at Oregon State. 

Questioning the P-value

 

I came across an article by Regina Nuzzo in Nature today: “Statistical errors: P-values, the ‘gold standard’ of statistical validity, are not as reliable as many scientists assume.”

The gist of the article is as follows: John Ioannidis suggested in 2005 that most published findings are false. That is not a comforting thought, but it is not ungrounded.

The P-value was introduced by Fisher as an informal way to judge whether evidence was worthy of a second look. Other statistical methods, for example the framework developed by Neyman and Pearson, pointedly left out the P-value completely.

Ideas from Fisher, Neyman and Pearson were mixed together by scientists creating manuals of statistical methods. Now, a low P-value is treated as a stamp of approval that a result represents reality. But the more unlikely your hypothesis, the greater the chance that an exciting finding is a false alarm, even if your P-value is minuscule.
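To see why, here is a minimal back-of-the-envelope calculation; the numbers are my own illustrative choices, not the article’s. Suppose only 10% of the hypotheses a field tests are actually true, tests have 80% power, and we call p < 0.05 significant:

prior = 0.10   # assumed probability the hypothesis is true before the experiment
power = 0.80   # assumed chance of a significant result when the effect is real
alpha = 0.05   # chance of a significant result when there is no real effect

true_positives = prior * power
false_positives = (1 - prior) * alpha
print(false_positives / (true_positives + false_positives))  # about 0.36

Under those assumptions, roughly a third of the “significant” findings are false alarms, even though every one of them cleared p < 0.05.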

A small P-value can also disguise the relevance of a result. Nuzzo cites a study that compared marital satisfaction of couples who met online to couples that met more traditionally. Although the P-value was very small (p < 0.002), the effect was tiny – the divorce rate changed from 7.67% to 5.96%, and the study’s happiness metric showed a change from 5.48 to 5.64 on a 7-point scale.

Another issue is termed P-hacking, where authors practice bad science by trying multiple methods until a “significant” P-value is found.
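A toy simulation (my own setup, not from the article) shows how quickly this goes wrong: if a study measures 20 independent outcomes and there is truly no effect in any of them, the chance that at least one comes out “significant” at p < 0.05 is about 1 - 0.95^20, or roughly 64%.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n_outcomes, n_subjects = 1000, 20, 30

false_alarms = 0
for _ in range(n_studies):
    p_values = []
    for _ in range(n_outcomes):
        # Under the null hypothesis, both groups come from the same distribution
        group_a = rng.normal(size=n_subjects)
        group_b = rng.normal(size=n_subjects)
        p_values.append(stats.ttest_ind(group_a, group_b).pvalue)
    if min(p_values) < 0.05:
        false_alarms += 1

print(false_alarms / n_studies)  # close to 0.64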

There are many ways to get around these problems. For example, reporting methods carefully – how the sample size was determined, any data exclusions, and any data transformations – helps; this is already fairly common practice.

Another idea is two-stage analysis – essentially building cross-validation into your experimental design. Researchers following this method would perform a few exploratory studies on small samples to come up with hypotheses, and publish a report stating their intentions (perhaps on their own website, or in a database like the Open Science Framework). Then they would replicate the study themselves and publish both results together.

The article reminds me of a few arguments I’ve heard about the merits of Bayesian statistics versus frequentist statistics. It seems obvious to me that in order to do quality science, prior knowledge is necessary to interpret results.

The ideas in this article are not new. Here is an older, more detailed article: The P-value fallacy

Why some food tastes nice, mathematically

I recently read two of Albert-Laszlo Barabasi’s books (Linked and Bursts) and in browsing his website I came across this paper: “Flavor network and the principles of food pairing,” authored by Ahn, Ahnert, Bagrow, and Barabasi.
Most of the fun is in exploring the graphics. There are some surprising relationships between foods.