BeautifulSoup fill missing information with "NA" in csv - python

I am working on a web scraper that creates a .csv file of all chemicals on the Sigma-Aldrich website. The .csv file has the chemical name followed by variables such as product number, CAS number, molecular weight and chemical formula, one chemical plus its info per row.
The issue I'm having is that not all chemicals have all their fields; many only have product and CAS numbers. This results in my .csv file being offset, so chemical rows end up with info that belongs to another chemical.
To right this wrong, I want to add 'N/A' if the field is empty.
Here is my scraping method:
def scraap(urlLi):
    for url in urlLi:
        content = requests.get(url).content
        soup = BeautifulSoup(content, 'lxml')
        containers = soup.find_all('div', {'class': 'productContainer-inner'})
        for c in containers:
            sub = c.find_all('div', {'class': 'productContainer-inner-content'})
            names = c.find_all('div', {'class': 'searchResultSubstanceBlock clearfix'})
            for n in names:
                hope = n.find("h2").text
                print(hope)
                nombres.append(hope.encode('utf-8'))
            for s in sub:
                info = s.find_all('ul', {'class': 'nonSynonymProperties'})
                proNum = s.find_all('div', {'class': 'product-listing-outer'})
                for p in proNum:
                    ping = p.find_all('div', {'class': 'row clearfix'})
                    for po in ping:
                        pro = p.find_all('li', {'class': 'productNumberValue'})
                        pnPp = []
                        for pri in pro:
                            potus = pri.get_text()
                            pnPp.append(potus.encode('utf-8'))
                        ProductNumber.append(pnPp)
                        print(pnPp)
                for i in info:
                    c = 1
                    for gling in i:
                        print(gling.get_text())
                        if c == 1:
                            formu.append(gling.get_text().encode('utf-8'))
                        elif c == 2:
                            molWei.append(gling.get_text().encode('utf-8'))
                        else:
                            casNum.append(gling.get_text().encode('utf-8'))
                        c += 1
                    c == 1
                print("---")
Here is my writing method:
def pipeUp():
    with open('sigma_pipe_out.csv', mode='wb') as csv_file:
        fieldnames = ['chem_name', 'productNum', 'formula', 'molWei', 'casNum']
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        # writer.writeheader()
        # csv_file.write(' '.join(fieldnames))
        for n, p, f, w, c in zip(nombres, ProductNumber, formu, molWei, casNum):
            # writer.writerow([n, p, f, w, c])
            writer.writerow({'chem_name': n, 'productNum': p, 'formula': f, 'molWei': w, 'casNum': c})
The issue arises in the for i in info: section. The formu, molWei and casNum lists get out of sync.
How can I add "N/A" when the formula or molecular weight information is missing?

I'm assuming get_text() returns an empty string if there's no information on the formula and molecular weight etc. In that case you can just add:
if not molWei:
    molWei = "N/A"
Which updates molWei to be N/A if the string is empty.
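As an extra safeguard on the writing side, note that zip() silently stops at the shortest list, so chemicals at the end can be dropped entirely when one list is shorter. A minimal sketch using itertools.zip_longest instead (assuming the same module-level lists as above); it does not fix per-chemical alignment, which the answer below addresses, but it keeps every row complete:
from itertools import zip_longest

# zip_longest pads missing values with "N/A" instead of truncating,
# so every chemical still produces a complete row
for n, p, f, w, c in zip_longest(nombres, ProductNumber, formu, molWei, casNum,
                                 fillvalue="N/A"):
    writer.writerow({'chem_name': n, 'productNum': p, 'formula': f,
                     'molWei': w, 'casNum': c})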

You cannot rely on a positional counter (if c == 1:) to tell the fields apart; check the text of each entry before appending it to the corresponding list.
replace:
for i in info:
    ....
    ....
print("---")
with:
rowNames = ['formu', 'molWei', 'casNum']
for li in info[0].find_all('li'):
    textVal = li.text.encode('utf-8')
    #print(textVal)
    if b'Formula' in textVal:
        formu.append(textVal)
        rowNames.remove('formu')
    elif b'Molecular' in textVal:
        molWei.append(textVal)
        rowNames.remove('molWei')
    else:
        casNum.append(textVal)
        rowNames.remove('casNum')
# fill 'NA' for any field that was not found, so the lists stay aligned
if len(rowNames) > 0:
    for item in rowNames:
        globals()[item].append('NA')
print("---")

Related

From Python to Excel - Building an excel worksheet

With the help of some very kind people on here I finally got a working script to scrape some data. I now desire to transfer this data from Python to Excel, in a specific format. I have tried multiple approaches, but did not manage to get the desired result.
My script is the following:
import requests
from bs4 import BeautifulSoup

def analyze(i):
    url = f"https://ktarena.com/fr/207-dofus-world-cup/match/{i}/1"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    arena = soup.find("span", attrs=('name')).text
    title = soup.select_one("[class='team'] .name a").text
    point = soup.select(".result .points")[0].text
    image_titles = ', '.join([i['title'] for i in soup.select("[class$='dead'] > img")])
    title_ano = soup.select("[class='team'] .name a")[1].text
    point_ano = soup.select(".result .points")[1].text
    image_titles_ano = ', '.join([i['title'] for i in soup.select("[class='class'] > img")])
    print((title,point,image_titles),(title_ano,point_ano,image_titles_ano),arena)

for i in range(46270, 46394):
    analyze(i)
To summarize, I scrape a couple of things:
Team names (title & title_ano)
Image titles (image_titles & image_titles_ano)
Team points (points & points_ano)
A string of text (arena)
One line of output currently looks like this:
('Thunder', '0 pts', 'roublard, huppermage, ecaflip') ('Tweaps', '60 pts', 'steamer, feca, sacrieur') A10
My goal is to transfer this output to excel, making it look like this:
To clarify, in terms of the variables I have it would be this:
Currently I can manage to transfer my data to excel, but I can't figure out how to format my data this way. Any help would be greatly appreciated :)
First of all, the code that you are using is not actually wholly correct. E.g.:
analyze(46275)
(('Grind', '10 pts', 'roublard, ecaflip'),
('SOLARY', '50 pts', 'enutrof, eniripsa, steamer, eliotrope'), 'A10')
Notice that the first player only has two image titles, and the second one has four. This is incorrect, and happens because your code assumes that img tags with the class ending in "dead" belong to the first player, and the ones that have a class named "class" belong to the second. This happens to be true for your first match (i.e. https://ktarena.com/fr/207-dofus-world-cup/match/46270), but very often it is not true at all. E.g. if I compare my result below with the same method applied to your analyze function, I end up with mismatches in 118 rows out of 248.
Here's a suggested rewrite:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def analyze_new(i):
    # You don't need `/1` at the end of the url
    url = f"https://ktarena.com/fr/207-dofus-world-cup/match/{i}"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    arena = soup.find('span', class_='name').get_text()
    # find all teams, and look for info inside each team
    teams = soup.findAll('div', class_='team')
    my_teams = [tuple()]*2
    for idx, team in enumerate(teams):
        my_teams[idx] = my_teams[idx] + \
            (team.select(".name a")[0].get_text(),)
        my_teams[idx] = my_teams[idx] + \
            (soup.select(".result .points")[idx].get_text(),)
        my_teams[idx] = my_teams[idx] + \
            (', '.join([img['title'] for img in team.findAll('img')[1:]]),)
    # notice, we need `return` instead of `print` to use the data
    return *my_teams, arena

print(analyze_new(46275))
(('Grind', '10 pts', 'roublard, ecaflip, enutrof'),
('SOLARY', '50 pts', 'eniripsa, steamer, eliotrope'), 'A10')
Before writing this data to excel, I would create a pd.DataFrame, which can then be exported very easily:
# capture info per player in a single row
rows = []
for i in range(46270, 46394):
    one, two, arena = analyze_new(i)
    # adding `i` to rows, as "Match" seems like a useful `column` to have!
    # but if not, you can delete `i` here below (N.B. do NOT delete the COMMA!)
    # and cut 'Match' twice below
    rows.append(one+(arena,i))
    rows.append(two+(arena,i))

cols = ['Team', 'Points', 'Images', 'Arena', 'Match']

# create df
df = pd.DataFrame(data=rows, columns=cols)

# split up the images strings in `df.Images` and make new columns for them
# finally, drop the `df.Images` column itself
df = pd.concat([df,
                df.Images.str.split(',', expand=True)\
                    .rename(columns={i: f'Image Title {i+1}'
                                     for i in range(3)})], axis=1)\
    .drop('Images', axis=1)

# Strip " pts" from the strings in `df.Points` and convert the type to an `int`
df['Points'] = df.Points.str.replace(' pts','').astype(int)

# Re-order the columns
df = df.loc[:, ['Match', 'Arena', 'Team', 'Image Title 1', 'Image Title 2',
                'Image Title 3', 'Points']]

print(df.head())
   Match Arena         Team Image Title 1 Image Title 2 Image Title 3  Points
0  46270   A10      Thunder      roublard    huppermage       ecaflip       0
1  46270   A10       Tweaps       steamer          feca      sacrieur      60
2  46271   A10   Shadow Zoo          feca      osamodas       ouginak       0
3  46271   A10  UndisClosed      eniripsa          sram       pandawa      60
4  46272   A10   Laugh Tale      osamodas       ecaflip           iop       0
# Finally, write the `df` to an Excel file
df.to_excel('fname.xlsx')
Result:
If you dislike the default styles added to the header row and index column, you can write it away like so:
df.T.reset_index().T.to_excel('test.xlsx', index=False, header=False)
Result:
Incidentally, I assume you have a particular reason for wanting the function to return the relevant data as *my_teams,arena. If not, it would be better to let the function itself do most of the heavy lifting. E.g. we could write something like this, and return a df directly.
def analyze_dict(i):
    url = f"https://ktarena.com/fr/207-dofus-world-cup/match/{i}"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    d = {'Match': [i]*2,
         'Arena': [soup.find('span', class_='name').get_text()]*2,
         'Team': [],
         'Image Title 1': [],
         'Image Title 2': [],
         'Image Title 3': [],
         'Points': [],
         }
    teams = soup.findAll('div', class_='team')
    for idx, team in enumerate(teams):
        d['Team'].append(team.select(".name a")[0].get_text())
        d['Points'].append(int(soup.select(".result .points")[idx].get_text().split(' ')[0]))
        for img_idx, img in enumerate(team.findAll('img')[1:]):
            d[f'Image Title {img_idx+1}'].append(img['title'])
    return pd.DataFrame(d)

print(analyze_dict(46275))
   Match Arena    Team Image Title 1 Image Title 2 Image Title 3  Points
0  46275   A10   Grind      roublard       ecaflip       enutrof      10
1  46275   A10  SOLARY      eniripsa       steamer     eliotrope      50
Now, we only need to do the following outside the function:
dfs = []
for i in range(46270, 46394):
    dfs.append(analyze_dict(i))

df = pd.concat(dfs, axis=0, ignore_index=True)
print(df.head())
   Match Arena         Team Image Title 1 Image Title 2 Image Title 3  Points
0  46270   A10      Thunder      roublard    huppermage       ecaflip       0
1  46270   A10       Tweaps       steamer          feca      sacrieur      60
2  46271   A10   Shadow Zoo          feca      osamodas       ouginak       0
3  46271   A10  UndisClosed      eniripsa          sram       pandawa      60
4  46272   A10   Laugh Tale      osamodas       ecaflip           iop       0
With hardly any changes from your post, you can use the openpyxl library to write the output to an excel file as shown below:
import requests
from openpyxl import Workbook
from bs4 import BeautifulSoup

def analyze(i):
    url = f"https://ktarena.com/fr/207-dofus-world-cup/match/{i}/1"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    arena = soup.find("span", attrs=('name')).text

    title = soup.select_one("[class='team'] .name a").text
    point = soup.select(".result .points")[0].text
    image_titles = [i['title'] for i in soup.select("[class='team']:nth-of-type(1) [class^='class'] > img")]
    try:
        image_title_one = image_titles[0]
    except IndexError: image_title_one = ""
    try:
        image_title_two = image_titles[1]
    except IndexError: image_title_two = ""
    try:
        image_title_three = image_titles[2]
    except IndexError: image_title_three = ""
    ws.append([arena,title,image_title_one,image_title_two,image_title_three,point])

    title_ano = soup.select("[class='team'] .name a")[1].text
    point_ano = soup.select(".result .points")[1].text
    image_titles_ano = [i['title'] for i in soup.select("[class='team']:nth-of-type(2) [class^='class'] > img")]
    try:
        image_title_ano_one = image_titles_ano[0]
    except IndexError: image_title_ano_one = ""
    try:
        image_title_ano_two = image_titles_ano[1]
    except IndexError: image_title_ano_two = ""
    try:
        image_title_ano_three = image_titles_ano[2]
    except IndexError: image_title_ano_three = ""
    ws.append([arena,title_ano,image_title_ano_one,image_title_ano_two,image_title_ano_three,point_ano])

    print((title,point,image_titles),(title_ano,point_ano,image_titles_ano),arena)

if __name__ == '__main__':
    wb = Workbook()
    wb.remove(wb['Sheet'])
    ws = wb.create_sheet("result")
    ws.append(['Arena','Team','Image Title 1','Image Title 2','Image Title 3','Points'])
    for i in range(46270, 46290):
        analyze(i)
    wb.save("output.xlsx")
I've fixed the selectors to grab the right number of image titles.
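If you'd rather avoid the repeated try/except blocks, one alternative (just a sketch, not part of the code above) is a small helper that pads each title list to exactly three entries:
def pad_titles(titles, size=3, fill=""):
    # always return exactly `size` entries, padding with `fill` when titles are missing
    return (titles + [fill] * size)[:size]

# e.g. inside analyze():
# ws.append([arena, title, *pad_titles(image_titles), point])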

Scraping data beach volleyball on multiple pages

I am trying to scrape all the possible data from this webpage Gstaad 2017
Here is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
from selenium import webdriver
from selenium.webdriver.support.ui import Select

# Starts the driver and goes to our starting webpage
driver = webdriver.Chrome("C:/Users/aldi/Downloads/chromedriver.exe")
driver.get('http://www.bvbinfo.com/Tournament.asp?ID=3294&Process=Matches')

# Imports HTML into python
page = requests.get('http://www.bvbinfo.com/Tournament.asp?ID=3294&Process=Matches')
soup = BeautifulSoup(driver.page_source, 'lxml')
stages = soup.find_all('div')
stages = driver.find_elements_by_class_name('clsTournBracketHeader')[-1].text

# TODO the first row (country quota matches) has no p tag and therefore it is not included in the data
rows = []
paragraphs = []
empty_paragraphs = []
for x in soup.find_all('p'):
    if len(x.get_text(strip=True)) != 0:
        paragraph = x.extract()
        paragraphs.append(paragraph)
    if len(x.get_text(strip=True)) == 0:
        empty_paragraph = x.extract()
        empty_paragraphs.append(empty_paragraph)

# players
home_team_player_1 = ''
home_team_player_2 = ''
away_team_player_1 = ''
away_team_player_2 = ''

for i in range(0, len(paragraphs)):
    # round and stage of the competition
    round_n = paragraphs[i].find('u').text
    paragraph_rows = paragraphs[i].text.split('\n')[1:-1]
    counter = 0
    for j in range(0, len(paragraph_rows)):

        # TODO tournament info, these can vary from tournament to tournament
        tournament_info = soup.find('td', class_='clsTournHeader').text.strip().split()
        tournament_category = [' '.join(tournament_info[0 : 2])][0]
        tournament_prize_money = tournament_info[2]
        # TODO tournament city can also have two elements, not just one
        tournament_city = tournament_info[3]
        tournament_year = tournament_info[-1]
        tournament_days = tournament_info[-2][:-1].split("-")
        tournament_starting_day = tournament_days[0]
        tournament_ending_day = tournament_days[-1]
        tournament_month = tournament_info[-3]
        tournament_stars = [' '.join(tournament_info[5 : 7])][0]

        players = paragraphs[i].find_all('a', {'href': re.compile('.*player.*')})
        home_team_player_1 = players[counter+0].text
        home_team_player_2 = players[counter+1].text
        away_team_player_1 = players[counter+2].text
        away_team_player_2 = players[counter+3].text

        # matches
        match = paragraph_rows[j].split(":")[0].split()[-1].strip()

        # nationalities
        nationalities = ["United", "States"]
        if paragraph_rows[j].split("def.")[0].split("/")[1].split("(")[0].split(" ")[3] in nationalities:
            home_team_country = "United States"
        else:
            home_team_country = paragraph_rows[j].split("def.")[0].split("/")[1].split("(")[0].split(" ")[-2]
        if paragraph_rows[j].split("def.")[1].split("/")[1].split(" ")[3] in nationalities:
            away_team_country = "United States"
        else:
            away_team_country = paragraph_rows[j].split("def.")[1].split("/")[1].split("(")[0].split(" ")[-2]

        # rankings, qualification rounds and match duration
        parentheses = re.findall(r'\(.*?\)', paragraph_rows[j])
        if "," in parentheses[0]:
            home_team_ranking = parentheses[0].split(",")[0]
            home_team_ranking = home_team_ranking[1:-1]
            home_team_qualification_round = parentheses[0].split(",")[1]
            home_team_qualification_round = home_team_qualification_round[1:-1]
        else:
            home_team_ranking = parentheses[0].split(",")[0]
            home_team_ranking = home_team_ranking[1:-1]
            home_team_qualification_round = None
        if "," in parentheses[1]:
            away_team_ranking = parentheses[1].split(",")[0]
            away_team_ranking = away_team_ranking[1:-1]
            away_team_qualification_round = parentheses[1].split(",")[1]
            away_team_qualification_round = away_team_qualification_round[1:-1]
        else:
            away_team_ranking = parentheses[1].split(",")[0]
            away_team_ranking = away_team_ranking[1:-1]
            away_team_qualification_round = None
        match_duration = parentheses[2]
        match_duration = match_duration[1:-1]

        # sets
        sets = re.findall(r'\).*?\(', paragraph_rows[j])
        sets = sets[1][1:-1]
        if len(sets.split(",")) == 2:
            score_set1 = sets.split(",")[0]
            score_set2 = sets.split(",")[1]
            score_set3 = None
        if len(sets.split(",")) == 3:
            score_set1 = sets.split(",")[0]
            score_set2 = sets.split(",")[1]
            score_set3 = sets.split(",")[2]

        row = {"home_team_player_1": home_team_player_1,
               "home_team_player_2": home_team_player_2,
               "away_team_player_1": away_team_player_1,
               "away_team_player_2": away_team_player_2,
               "match": match,
               "home_team_country": home_team_country,
               "away_team_country": away_team_country,
               "home_team_ranking": home_team_ranking,
               "away_team_ranking": away_team_ranking,
               "match_duration": match_duration,
               "home_team_qualification_round": home_team_qualification_round,
               "away_team_qualification_round": away_team_qualification_round,
               "score_set1": score_set1,
               "score_set2": score_set2,
               "score_set3": score_set3,
               "tournament_category": tournament_category,
               "tournament_prize_money": tournament_prize_money,
               "tournament_city": tournament_city,
               "tournament_year": tournament_year,
               "tournament_starting_day": tournament_starting_day,
               "tournament_ending_day": tournament_ending_day,
               "tournament_month": tournament_month,
               "tournament_stars": tournament_stars,
               "round_n": round_n
               }
        counter += 4
        rows.append(row)

data = pd.DataFrame(rows)
data.to_csv("beachvb.csv", index=False)
I am not really experienced in web scraping. I have just started teaching myself, and I find the HTML source code quite messy and poorly structured.
I want to improve my code in two ways:
Include all the missing matches (country quota matches, semifinals, bronze medal, and gold medal) and the respective category for each match (country quota matches, pool, winner's bracket, semifinals, bronze medal, and gold medal)
iterate the code for more years and tournaments from the dropdown menu at the top of the webpage
I have tried to iterate through different years, but my code does not work:
tournament_years = {"FIVB 2015", "FIVB 2016"}
dfs = []
for year in tournament_years:
    # select desired tournament
    box_year = Select(driver.find_element_by_xpath("/html/body/table[3]/tbody/tr/td/table[1]/tbody/tr[1]/td[2]/select"))
    box_year.select_by_visible_text(year)
    box_matches = Select(driver.find_element_by_xpath("/html/body/table[3]/tbody/tr/td/table[1]/tbody/tr[2]/td[2]/select"))
    box_matches.select_by_visible_text("Matches")
The main idea was to create a list of dataframes for each year and each tournament by adding a new loop at the beginning of the code.
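Roughly, this is the structure I have in mind (just a sketch; I'm assuming the two XPaths above still locate the year and page dropdowns, and that re-parsing driver.page_source after each selection is enough to pick up the new tournament's matches):
import time

tournament_years = ["FIVB 2015", "FIVB 2016"]
dfs = []
for year in tournament_years:
    # pick the year and the "Matches" view from the two dropdowns
    box_year = Select(driver.find_element_by_xpath(
        "/html/body/table[3]/tbody/tr/td/table[1]/tbody/tr[1]/td[2]/select"))
    box_year.select_by_visible_text(year)
    box_matches = Select(driver.find_element_by_xpath(
        "/html/body/table[3]/tbody/tr/td/table[1]/tbody/tr[2]/td[2]/select"))
    box_matches.select_by_visible_text("Matches")
    time.sleep(2)  # crude wait for the page to reload after the selection

    # re-parse the freshly loaded page and rerun the per-tournament scraping
    soup = BeautifulSoup(driver.page_source, 'lxml')
    rows = []
    # ... the existing paragraph/match-parsing code goes here, appending to `rows` ...
    dfs.append(pd.DataFrame(rows))

data = pd.concat(dfs, ignore_index=True)
data.to_csv("beachvb.csv", index=False)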
If someone has a better idea and technique to do so, it is really appreciated!

How to convert text table to dataframe

I am trying to scrape the "PRINCIPAL STOCKHOLDERS" table from the linked text file and convert it to a csv file. Right now I am only half successful: I can locate the table and parse it, but somehow I cannot convert the text table to a standard one. My code is attached. Can someone help me with it?
import requests
import pandas as pd

url = r'https://www.sec.gov/Archives/edgar/data/1034239/0000950124-97-003372.txt'

# Different approach, the first approach does not work
filing_url = requests.get(url)
content = filing_url.text
splited_data = content.split('\n')
table_title = 'PRINCIPAL STOCKHOLDERS'
END_TABLE_LINE = '- ------------------------'

def find_no_line_start_table(table_title, splited_data):
    found_no_lines = []
    for index, line in enumerate(splited_data):
        if table_title in line:
            found_no_lines.append(index)
    return found_no_lines

table_start = find_no_line_start_table(table_title, splited_data)
# I need help with locating the table. If I locate the table using the above function,
# it will return two locations and I have to manually choose the correct one.
table_start = table_start[1]

def get_start_data_table(table_start, splited_data):
    for index, row in enumerate(splited_data[table_start:]):
        if '<C>' in row:
            return table_start + index

def get_end_table(start_table_data, splited_data):
    for index, row in enumerate(splited_data[start_table_data:]):
        if END_TABLE_LINE in row:
            return start_table_data + index

def row(l):
    l = l.split()
    number_columns = 8
    if len(l) >= number_columns:
        data_row = [''] * number_columns
        first_column_done = False
        index = 0
        for w in l:
            if not first_column_done:
                data_row[0] = ' '.join([data_row[0], w])
                if ':' in w:
                    first_column_done = True
            else:
                index += 1
                data_row[index] = w
        return data_row

start_line = get_start_data_table(table_start, splited_data)
end_line = get_end_table(start_line, splited_data)
table = splited_data[start_line : end_line]

# I also need help with converting the text table to a CSV file;
# somehow the following function does not recognize my columns.
def take_table(table):
    owner = []
    Num_share = []
    middle = []
    middle_1 = []
    middle_2 = []
    middle_3 = []
    prior_offering = []
    after_offering = []
    for r in table:
        data_row = row(r)
        if data_row:
            col_1, col_2, col_3, col_4, col_5, col_6, col_7, col_8 = data_row
            owner.append(col_1)
            Num_share.append(col_2)
            middle.append(col_3)
            middle_1.append(col_4)
            middle_2.append(col_5)
            middle_3.append(col_6)
            prior_offering.append(col_7)
            after_offering.append(col_8)
    table_data = {'owner': owner, 'Num_share': Num_share, 'middle': middle, 'middle_1': middle_1,
                  'middle_2': middle_2, 'middle_3': middle_3, 'prior_offering': prior_offering,
                  'after_offering': after_offering}
    return table_data

#print(table)
dict_table = take_table(table)
a = pd.DataFrame(dict_table)
a.to_csv('trail.csv')
I think what you need to do is
pd.DataFrame.from_dict(dict_table)
instead of
pd.DataFrame(dict_table)

Scraping JSON arrays nested tags

I am trying to scrape data from a JSON file. I am able to scrape data from some of the tags, but a few nested tags are giving me problems. Following is a sample from the file -
{"orders":[{
"order_id":9000,
"flight_start":"2017-06-15T05:00:00.000Z",
"flight_end":"2017-06-22T05:00:00.000Z",
"spots":[{
"spot_id":7354259,
"spot_length":15}],
"constraints":{
"forbid":[{
"network":"BRVO"},
{"network":"DSE"},
{"network":"ESPN"},
{"network":"DFC"},
{"hours":[2,6],
"days_of_week":["Monday","Tuesday","Thursday","Friday"]},
{"hours":[2,6],
"days_of_week":["Saturday","Sunday"]}],
"allocation":[{
"hours":[6,9],
"impressions":{
"min":0.05,
"max":0.05},
"days_of_week":["Monday","Tuesday","Wednesday","Thursday","Friday"]},{
"hours":[20,0],
"impressions":{"min":0.5,"max":0.5},
"days_of_week":["Monday","Tuesday","Wednesday","Thursday","Friday"]},{
"budget":{
"min":1,
"max":1},
"spot_length":15}]}}]}
I am not able to scrape all the values from the network tag; it only returns the top value from the network tags for each order.
I am using the following code -
import urllib.request
import json

url = 'http://vw-test.elasticbeanstalk.com/test'
json_obj = urllib.request.urlopen(url).read().decode('UTF-8')
data = json.loads(json_obj)

for i in data["orders"]:
    k = i["order_id"]
    j = i["flight_start"]
    l = i["flight_end"]
    m = i['spots']
    for value in m:
        a = value["spot_length"]
        b = value["spot_id"]
    n = i["constraints"]
    c = n["forbid"]
    d = c[0]
    e = d["network"]
    print(e)
If anyone could help me figure this out I'd be very grateful.
The json data in your question isn't complete. Making some assumptions, this could work:
for i in data["orders"]:
k = i["order_id"]
j = i["flight_start"]
l = i["flight_end"]
m = i ['spots']
for value in m:
a = value["spot_length"]
b = value["spot_id"]
n = i["constraints"]
c = n["forbid"]
d = c[0]
networks = [d["network"] for d in c if "network" in d]
print(networks)
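With the sample order above, this prints ['BRVO', 'DSE', 'ESPN', 'DFC'], since only those four entries in the forbid list contain a network key.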

List comes back as empty when retrieving data from website; Python

I am trying to parse data from a website by inserting the data into a list, but the list comes back empty.
url =("http://www.releasechimps.org/resources/publication/whos-there-md- anderson")
http = urllib3.PoolManager()
r = http.request('Get',url)
soup = BeautifulSoup(r.data,"html.parser")
#print(r.data)
loop = re.findall(r'<td>(.*?)</td>',str(r.data))
#print(str(loop))
newLoop = str(loop)
#print(newLoop)
for x in range(1229):
if "\\n\\t\\t\\t\\t" in loop[x]:
loop[x] = loop[x].replace("\\n\\t\\t\\t\\t","")
list0_v2.append(str(loop[x]))
print(loop[x])
print(str(list0_v2))
Edit: Didn't really have anything else going on, so I made your data format into a nice list of dictionaries. There's a weird <td height="26"> on monkey 111, so I had to change the regex slightly.
Hope this helps you, I did it cause I care about the monkeys man.
import html
import re
import urllib.request

list0_v2 = []
final_list = []

url = "http://www.releasechimps.org/resources/publication/whos-there-md-anderson"
data = urllib.request.urlopen(url).read()

loop = re.findall(r'<td.*?>(.*?)</td>', str(data))

for item in loop:
    if "\\n\\t\\t\\t\\t" or "em>" in item:
        item = item.replace("\\n\\t\\t\\t\\t", "").replace("<em>", "")\
            .replace("</em>", "")
    if " " == item:
        continue
    list0_v2.append(item)

n = 1
while len(list0_v2) != 0:
    form = {"n": 0, "name": "", "id": "", "gender": "", "birthdate": "", "notes": ""}
    try:
        if list0_v2[5][-1] == '.':
            numb, name, ids, gender, birthdate, notes = list0_v2[0:6]
            form["notes"] = notes
            del(list0_v2[0:6])
        else:
            raise Exception('foo')
    except:
        numb, name, ids, gender, birthdate = list0_v2[0:5]
        del(list0_v2[0:5])
    form["n"] = int(numb)
    form["name"] = html.unescape(name)
    form["id"] = ids
    form["gender"] = gender
    form["birthdate"] = birthdate
    final_list.append(form)
    n += 1

for li in final_list:
    print("{:3} {:10} {:10} {:3} {:10} {}".format(li["n"], li["name"], li["id"],\
        li["gender"], li["birthdate"], li["notes"]))
