I'm new to Python (3.4) and have parsed an HTML table into headings[], rows[] and cells[].
I eventually want to store each of these in a MySQL table, with the field names being the items in headings[].
There are 4 headings ("data0", "data1", "data2", "data3").
There are 6 rows.
The code to get there is very rudimentary, using BeautifulSoup:
soup = BeautifulSoup(r.text)
table = soup.find("table")
cells = []
rows = table.findAll('tr')
headings = [th.get_text().strip() for th in table.findAll("th")]
for row in rows:
    for cell in row.findAll('td'):
        cells.append(cell.get_text().strip())
I'm used to doing CASE statements or, heaven forbid, a number of IF statements, which I would normally place under the for cell in row.findAll('td') line. But I would be working with a counter, doing something like:
for row = 0 to len(rows)
    for cell = 0 to len(headings)
        select case cell
            case = 0
                (save cell contents to field called headings[0])
            case = 1
                (save cell contents to field called headings[1])
            ...
I'm not too worried about the saving part (yet), but I can't wrap my head around not being able to use counters.
I realize this is way beginner, but I sure would appreciate any insight (and so would my brain).
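The counter-free idiom the question is reaching for is zip, which pairs each heading with the cell in the same position, so each row becomes a dict keyed by the headings. A minimal sketch against a small inline table (the real page isn't shown, so the data here is hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the scraped page
html = """
<table>
  <tr><th>data0</th><th>data1</th><th>data2</th><th>data3</th></tr>
  <tr><td>a</td><td>b</td><td>c</td><td>d</td></tr>
  <tr><td>e</td><td>f</td><td>g</td><td>h</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")
headings = [th.get_text().strip() for th in table.find_all("th")]

records = []
for row in table.find_all("tr"):
    cells = [td.get_text().strip() for td in row.find_all("td")]
    if not cells:  # the header row has no <td>, so skip it
        continue
    # zip pairs heading i with cell i -- no counter or CASE needed
    records.append(dict(zip(headings, cells)))

print(records[0])  # {'data0': 'a', 'data1': 'b', 'data2': 'c', 'data3': 'd'}
```

Each dict in records then maps field names to values, which is also the shape an SQL INSERT with named columns wants.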
Related
This question already has answers here: scraping data from wikipedia table (3 answers)
Closed 2 years ago.
I want to apply some statistics to data tables obtained directly from specific internet pages.
This tutorial https://towardsdatascience.com/web-scraping-html-tables-with-python-c9baba21059 helped me create a data frame from a table at the webpage http://pokemondb.net/pokedex/all. However, I want to do the same for geographic data, such as the population and GDP of several countries.
I found some tables on Wikipedia, but it doesn't work quite right and I don't understand why. Here's my code, which follows the above-mentioned tutorial:
import requests
import lxml.html as lh
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_African_countries_by_population'

#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the contents of the website under doc
doc = lh.fromstring(page.content)
#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')

#Check the length of the first 12 rows
print('Length of first 12 rows')
print([len(T) for T in tr_elements[:12]])

#Create empty list
col = []
i = 0
#For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i += 1
    name = t.text_content()
    print('%d:"%s"' % (i, name))
    col.append((name, []))

#Since our first row is the header, data is stored on the second row onwards
for j in range(1, len(tr_elements)):
    #T is our j'th row
    T = tr_elements[j]
    #If row is not of size 10, the //tr data is not from our table
    if len(T) != 10:
        break
    #i is the index of our column
    i = 0
    #Iterate through each element of the row
    for t in T.iterchildren():
        data = t.text_content()
        if i > 0:
            #Convert any numerical value to integers
            try:
                data = int(data)
            except:
                pass
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i += 1

print('Data gathering: done!')
print('Column lentgh:')
print([len(C) for (title, C) in col])

Dict = {title: column for (title, column) in col}
df = pd.DataFrame(Dict)
print(df.head())
The output is the following:
Length of first 12 rows
[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
1:"Ranks
"
2:"Countries(or dependent territory)
"
3:"Officialfigure(whereavailable)
"
4:"Date oflast figure
"
5:"Source
"
Data gathering: done!
Column lentgh:
[0, 0, 0, 0, 0]
Empty DataFrame
Columns: [Ranks
, Countries(or dependent territory)
, Officialfigure(whereavailable)
, Date oflast figure
, Source
]
Index: []
The column lengths shouldn't be zero. The format is not the same as in the tutorial. Any idea how to make it right? Or maybe another data source that doesn't return this strange output format?
The length of your rows, as you've shown by your print statement in line 16 (which corresponds to the first line of your output), is not 10. It is 5. And your code breaks out of the loop in the very first iteration, instead of populating your col.
Changing this statement:
if len(T)!=10:
    break
to
if len(T)!=5:
    break
should fix the problem.
Instead of using requests, use pandas to read the URL data:
df_list = pd.read_html(url)
Note that pd.read_html returns a list of DataFrames, one per table found on the page, so pick the one you need by index.
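A minimal sketch of the read_html route, run against a small inline table rather than the live Wikipedia URL (the column names and values here are hypothetical):

```python
from io import StringIO
import pandas as pd

# Hypothetical stand-in for one of the page's tables
html = """
<table>
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>Nigeria</td><td>200000000</td></tr>
  <tr><td>Egypt</td><td>100000000</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table>; take the first.
# It also uses the <th> row as the header automatically.
df = pd.read_html(StringIO(html))[0]
print(df)
```

Against the real URL it would be pd.read_html(url), followed by picking the right table out of the returned list.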
On line 52 you are trying to edit a tuple, which is immutable in Python. To correct this, use a list instead: change line 25 to col.append([name,[]]).
In addition, the break exits the for loop on the first row whose length isn't 10, which is why there is no data inside the array.
When doing these sorts of things you also must look at the HTML. The table isn't formatted as nicely as one would hope: for example, it has a bunch of newlines, and it also includes images of the countries' flags. You can see from the North America example how the format differs every time.
It seems like you want an easy way to do this, so I would look into BeautifulSoup4. I have added a way that I would do this with bs4; you'll have to do some editing to make it look better.
import requests
import bs4 as bs
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_African_countries_by_population'
column_names = []
data = []

#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the html in the soup object
soup = bs.BeautifulSoup(page.content, 'html.parser')
#Gets the table html
table = soup.find_all('table')[0]
#gets the table header
thead = table.find_all('th')
#Puts the header into the column names list. We will use this for the dict keys later
for th in thead:
    column_names.append(th.get_text())
#gets all the rows of the table
rows = table.find_all('tr')
#I do not take the first row as it is the header
for row in rows[1:]:
    #Creates a list with each index being a different entry in the row
    #(the odd indices below skip the newline text nodes between cells)
    values = [r for r in row]
    #Gets each value that we care about
    rank = values[1].get_text()
    country = values[3].get_text()
    pop = values[5].get_text()
    date = values[7].get_text()
    source = values[9].get_text()
    temp_list = [rank, country, pop, date, source]
    #Creates a dictionary with keys being the column names and the values being temp_list. Appends this to list data
    data.append(dict(zip(column_names, temp_list)))

print(column_names)
df = pd.DataFrame(data)
I'm quite new to Python and BeautifulSoup, and I've been trying to work this out for several hours...
Firstly, I want to extract all table data from below link with "general election" in the title:
https://en.wikipedia.org/wiki/Carlow%E2%80%93Kilkenny_(D%C3%A1il_constituency)
I do have another dataframe with names of each table (eg. "1961 general election", "1965 general election"), but am hoping to get away with just searching for "general election" on each table to confirm if it's what I need.
I then want to get all the names that are in Bold (which indicates they won) and finally I want another list of "Count 1" (or sometimes 1st Pref) in the original order, which I want to compare to the "Bold" list. I haven't even looked at this piece yet, as I haven't gotten past the first hurdle.
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Carlow%E2%80%93Kilkenny_(D%C3%A1il_constituency)"
res = requests.get(url)
soup = BeautifulSoup(res.content, 'lxml')
my_tables = soup.find_all("table", {"class": "wikitable"})
for table in my_tables:
    rows = table.find_all('tr', text="general election")
    print(rows)
Any help on this would be greatly appreciated...
This page requires some gymnastics, but it can be done:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

req = requests.get('https://en.wikipedia.org/wiki/Carlow%E2%80%93Kilkenny_(D%C3%A1il_constituency)')
soup = bs(req.text, 'lxml')

#first - select all the tables on the page
tables = soup.select('table.wikitable')
for table in tables:
    ttr = table.select('tbody tr')
    #next, filter out any table that doesn't involve general elections
    if "general election" in ttr[0].text:
        #clean up the rows
        s_ttr = ttr[1].text.replace('\n', 'xxx').strip()
        #find and clean up column headings
        columns = [col.strip() for col in s_ttr.split('xxx') if len(col.strip()) > 0]
        rows = []  #initialize a list to house the table rows
        for c in ttr[2:]:
            #from here, start processing each row and loading it into the list
            row = [a.text.strip() if len(a.text.strip()) > 0 else 'NA' for a in c.select('td')]
            if row and row[0] == "NA":
                row = row[1:]
            if len(row) > 0:
                rows.append(row)
        #load the whole thing into a dataframe
        df = pd.DataFrame(rows, columns=columns)
        print(df)
The output should be all the general election tables on the page.
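The bold-winners part of the question is not covered above. On these pages the winners' names are wrapped in b tags inside the data cells, so a CSS selector like td b picks them out. A minimal sketch on hypothetical stand-in markup (the real tables have more columns):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one general-election results table
html = """
<table class="wikitable">
  <tr><th colspan="3">1961 general election</th></tr>
  <tr><th>Party</th><th>Candidate</th><th>Count 1</th></tr>
  <tr><td>A</td><td><b>Jane Doe</b></td><td>7000</td></tr>
  <tr><td>B</td><td>John Smith</td><td>5000</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
winners = []
for table in soup.select("table.wikitable"):
    # keep only tables whose first row mentions a general election
    if "general election" not in table.tr.get_text():
        continue
    # winners are the names wrapped in <b> inside data cells
    winners.extend(b.get_text(strip=True) for b in table.select("td b"))
print(winners)  # ['Jane Doe']
```

The same per-table loop can also collect the Count 1 column in document order, ready to compare against the bold list.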
I have parsed a table and would like to convert two of those variables to a Pandas Dataframe to print to excel.
FYI:
I did ask a similar question, however, it was not answered thoroughly. There was no suggestion on how to create a Pandas DataFrame. This was the whole point of my question.
Caution:
There is a small issue with the data that I parsed: it contains "TEAM" and "SA/G" multiple times in the output.
The 1st variable that I would like in the DataFrame is 'TEAM'.
The 2nd variable that I would like in the DataFrame is 'SA/G'.
Here is my code so far:
# imports
from selenium import webdriver
from bs4 import BeautifulSoup

# make a webdriver object
driver = webdriver.Chrome(r'C:\webdrivers\chromedriver.exe')
# open some page using get method - url -- > parameters
driver.get('http://www.espn.com/nhl/statistics/team/_/stat/scoring/sort/avgGoals')
# driver.page_source
soup = BeautifulSoup(driver.page_source, 'lxml')
# close driver
driver.close()

# find table
table = soup.find('table')
# find_all table rows
t_rows = table.find_all('tr')
# loop through tr to find_all td
for tr in t_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    # print(row)
    # print(row[9])
    # print(row[1], row[9])
    team = row[1]
    sag = row[9]
    # print(team, sag)
    data = [(team, sag)]
    print(data)
Here is the final output that I would like printed to excel using the Pandas DataFrame option:
Team SA/G
Nashville 30.1
Colorado 33.6
Washington 31.0
... ...
Thanks in advance for any help that you may offer. I am still learning and appreciate any feedback that I can get.
Looks like you want to create a DataFrame from a list of tuples, which has been answered here.
I would change your code like this:
import pandas as pd

# Initial empty list
data = []

# loop through tr to find_all td
for tr in t_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    team = row[1]
    sag = row[9]
    # Add tuple containing one row of data
    data.append((team, sag))

# Create df from list of tuples
df = pd.DataFrame(data, columns=['Team', 'SA/G'])
# Remove lines where Team value is "TEAM"
df = df[df["Team"] != "TEAM"]
EDIT: Add line to remove ("TEAM", "SA/G") rows in df
First, inside the for loop, append tuples to a list instead of doing data=[(x,y)]: declare the data variable before the loop as a list (data = list()) and append each tuple to it inside the loop (data.append((x,y))). Then do the following:
import pandas as pd
data=[("t1","sag1"),("t2","sag2"),("t3","sag3")]
df = pd.DataFrame(data,columns=['Team','SA/G'])
print(df)
I am trying to scrape some webpages using bs4. I get the tables from the webpage as follows:
tables_list = soup.find_all("table")
However, I have some complex tables which contain multiple small table sections. I have provided the source code for the table here:
https://ufile.io/tee2c
Is there a way to split such a table into smaller table sections so that I can parse each one into its own dataframe? Thanks.
Here's my code thus far using python html extractor (https://pypi.python.org/pypi/html-table-extractor/1.3.0)
# table_doc => source code provided at https://ufile.io/tee2c
from html_table_extractor.extractor import Extractor
import pandas

extractor = Extractor(str(table_doc))
extractor.parse()
list_of_lists = extractor.return_list()

new_list = []
# some cleanup
for row in list_of_lists:
    stripped_list = [item.strip() for item in row]
    # skip any item with > 200 characters; it's probably some paragraph
    # string and hence not a valid table item
    if len(set(stripped_list)) == 1 or any([x for x in stripped_list if len(x) > 200]):
        continue
    new_list.append(stripped_list)

df = pandas.DataFrame(new_list)
"""need to understand how to get this dataframe in a more structured format; for complex tables (e.g. the one provided at https://ufile.io/tee2c), this df contains content from multiple table 'sections' """
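One way to get the sections separated (since the linked source isn't reproduced here, the markup below is a hypothetical stand-in) is to treat each th row as the start of a new section and build one DataFrame per section:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for a "complex" table whose <th> rows
# mark the start of each sub-section
html = """
<table>
  <tr><th>Section A</th></tr>
  <tr><td>a1</td><td>a2</td></tr>
  <tr><td>a3</td><td>a4</td></tr>
  <tr><th>Section B</th></tr>
  <tr><td>b1</td><td>b2</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
sections = {}  # section title -> list of data rows
current = None
for tr in soup.find_all("tr"):
    if tr.find("th"):  # a header row starts a new section
        current = tr.get_text(strip=True)
        sections[current] = []
    elif current is not None:
        sections[current].append(
            [td.get_text(strip=True) for td in tr.find_all("td")])

# one DataFrame per section
frames = {title: pd.DataFrame(rows) for title, rows in sections.items()}
print(frames["Section A"])
```

Whether th rows really delimit the sections in the actual table is an assumption; the same split can key on a class attribute or colspan instead if that is what marks the section breaks.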
This link contains the table I'm trying to parse.
I'm trying to use BeautifulSoup in Python. I'm very new to BeautifulSoup and HTML. This is my attempt to solve my problem.
soup = BeautifulSoup(open('BBS_student_grads.php'))
data = []
table = soup.find('table')
rows = table.find_all('tr')  # array of rows in table
for x, row in enumerate(rows[1:]):  # skips first row
    cols = row.find_all('td')  # finds all cols in row
    data.append([])  # one new sub-list per row, not per cell
    for y, col in enumerate(cols):  # iterates through cols
        data[x].append(col)  # puts table into a 2d array called data
print(data[0][0])  # prints top left corner
Sample Output
I'm trying to extract all the names in the table, then update the names in the list and then update the table. I'm also using a local copy of this HTML. Temporary fix till I learn how to do more web programming.
Help is much appreciated.
I think you need just the td elements in the tr element with class="searchbox_black".
You can use CSS Selectors to get to the desired td elements:
for cell in soup.select('tr.searchbox_black td'):
    print(cell.text)
It prints:
BB Salsa
Adams State University Alamosa, CO
Sensei: Oneyda Maestas
Raymond Breitstein
...
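For the update step the question also asks about (changing the names and writing the table back), BeautifulSoup lets you assign to a cell's .string and then serialize the whole soup. A sketch on a tiny hypothetical stand-in for the local file:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for BBS_student_grads.php
html = """
<table>
  <tr class="searchbox_black"><td>Raymond Breitstein</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
for cell in soup.select("tr.searchbox_black td"):
    # setting .string replaces the cell's contents in place;
    # upper() stands in for whatever name update is needed
    cell.string = cell.get_text().upper()

updated = str(soup)  # serialized HTML, ready to write back to the file
print("RAYMOND BREITSTEIN" in updated)  # True
```

Writing updated back out with open(path, 'w') then round-trips the local copy.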