How to convert wikipedia tables into pandas dataframes? [duplicate] - python

This question already has answers here:
scraping data from wikipedia table
(3 answers)
Closed 2 years ago.
I want to apply some statistics to data tables obtained directly from specific internet pages.
This tutorial https://towardsdatascience.com/web-scraping-html-tables-with-python-c9baba21059 helped me creating a data frame from a table at the webpage http://pokemondb.net/pokedex/all. However, I want to do the same for geographic data, such as population and gdp of several countries.
I found some tables at wikipedia, but it doesn't work quite well and I don't understand why. Here's my code, that follows the above mentioned tutorial:
import requests
import lxml.html as lh
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_African_countries_by_population'
#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the contents of the website under doc
doc = lh.fromstring(page.content)
#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')
#Check the length of the first 12 rows
print('Length of first 12 rows')
print ([len(T) for T in tr_elements[:12]])
#Create empty list
col=[]
i=0 #For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
i+=1
name=t.text_content()
print ('%d:"%s"'%(i,name))
col.append((name,[]))
#Since out first row is the header, data is stored on the second row onwards
for j in range(1,len(tr_elements)):
#T is our j'th row
T=tr_elements[j]
#If row is not of size 10, the //tr data is not from our table
if len(T)!=10:
break
#i is the index of our column
i=0
#Iterate through each element of the row
for t in T.iterchildren():
data=t.text_content()
#Check if row is empty
if i>0:
#Convert any numerical value to integers
try:
data=int(data)
except:
pass
#Append the data to the empty list of the i'th column
col[i][1].append(data)
#Increment i for the next column
i+=1
print('Data gathering: done!')
print('Column lentgh:')
print([len(C) for (title,C) in col])
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)
print(df.head())
The output is the following:
Length of first 12 rows
[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
1:"Ranks
"
2:"Countries(or dependent territory)
"
3:"Officialfigure(whereavailable)
"
4:"Date oflast figure
"
5:"Source
"
Data gathering: done!
Column lentgh:
[0, 0, 0, 0, 0]
Empty DataFrame
Columns: [Ranks
, Countries(or dependent territory)
, Officialfigure(whereavailable)
, Date oflast figure
, Source
]
Index: []
The length of the columns shouldn't be null. The format is not the same as the one of the tutorial. Any idea of how to make it right? Or maybe another data source that doesn't return this strange output format?

The length of your rows, as you've shown by your print statement in line 16 (which corresponds to the first line of your output), is not 10. It is 5. And your code breaks out of the loop in the very first iteration, instead of populating your col.
changing this statement:
if len(T)!=10:
break
to
if len(T)!=5:
break
should fix the problem.

Instead of using requests, use pandas to read the url data.
‘df = pd.read_html(url)

On Line 52 you are trying to edit a tuple. This is not possible in Python.
To correct this, use a list instead.
Change line 25 to col.append([name,[]])
In addition, when using the break it breaks the for loop, this causes it to have no data inside the array.
When doing these sorts of things you also must look at the html. The table isn't formatting as nice as one would hope. For example, it has a bunch of new lines, and also has the images of the countries flag. You can see this example of North America for how the format is different every time.
It seems like you want an easy way to do this.I would look into BeautifulSoup4. I have added a way that I would do this with bs4. You'll have to do some editing to make it look better
import requests
import bs4 as bs
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_African_countries_by_population'
column_names = []
data = []
#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the html in the soup object
soup = bs.BeautifulSoup(page.content, 'html.parser')
#Gets the table html
table = soup.find_all('table')[0]
#gets the table header
thead = table.find_all('th')
#Puts the header into the column names list. We will use this for the dict keys later
for th in thead:
column_names.append(th.get_text())
#gets all the rows of the table
rows = table.find_all('tr')
#I do not take the first how as it is the header
for row in rows[1:]:
#Creates a list with each index being a different entry in the row.
values = [r for r in row]
#Gets each values that we care about
rank = values[1].get_text()
country = values[3].get_text()
pop = values[5].get_text()
date = values[7].get_text()
source = values[9].get_text()
temp_list = [rank,country,pop,date,source]
#Creates a dictionary with keys being the column names and the values being temp_list. Appends this to list data
data.append(dict(zip(column_names, temp_list)))
print(column_names)
df = pd.DataFrame(data)

Related

How can I webscrape a Wikipedia table with lists of data instead of rows?

I am trying to get data from the Localities table located on the Wikipedia https://en.wikipedia.org/wiki/Districts_of_Warsaw page.
I would like to collect this data and put it into a dataframe with two columns ["Districts"] and ["Neighbourhoods"].
My code so far looks like this:
url = "https://en.wikipedia.org/wiki/Districts_of_Warsaw"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "html")
table = soup.find_all('table')[2]
A=[]
B=[]
for row in table.findAll('tr'):
cells=row.findAll('td')
if len(cells)==2:
A.append(cells[0].find(text=True))
B.append(cells[1].find(text=True))
df=pd.DataFrame(A,columns=['Neighbourhood'])
df['District']=B
print(df)
This gives the following dataframe:
Dataframe
Certainly, scraping the Neighbourhood column is not right since they are contained in lists, but I don't know how it should be done so will be glad for any tips.
In addition to it, I will appreciate any hints why scraping gives me only 10 districts instead of 18.
Are you sure that you are scraping the right table? I understood that you need a second table with 18 districts and listed neighbourhoods.
Also, I'm not sure how you want to have districts and neighbourhoods arranged in a DataFrame, I've set districts as columns and neighbourhoods as rows. You can change it as you want.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://en.wikipedia.org/wiki/Districts_of_Warsaw"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
table = soup.find_all("table")[1]
def process_list(tr):
result = []
for td in tr.findAll("td"):
result.append([x.string for x in td.findAll("li")])
return result
districts = []
neighbourhoods = []
for row in table.findAll("tr"):
if row.find("ul"):
neighbourhoods.extend(process_list(row))
else:
districts.extend([x.string.strip() for x in row.findAll("th")])
# Check and arrange as you wish
for i in range(len(districts)):
print(f'District {districts[i]} has neighbourhoods: {", ".join(neighbourhoods[i])}')
df = pd.DataFrame()
for i in range(len(districts)):
df[districts[i]] = pd.Series(neighbourhoods[i])
Some tips:
Use element.string to get the text from an element
Use string.strip() to remove any leading (spaces at the beginning) and trailing (spaces at the end) characters (space is the default leading character to remove) i.e. to clean the text
You can use the fact that odd rows are the Districts and even rows are the Neighbourhoods to walk the odd rows and use FindNext to grab the neighbourhoods , from row below, whilst iterating the District columns within the odd rows:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
from itertools import zip_longest
soup = bs(requests.get('https://en.wikipedia.org/wiki/Districts_of_Warsaw').content, 'lxml')
table = soup.select_one('h2:contains("Localities") ~ .wikitable') #isolate table of interest
results = []
for row in table.select('tr')[0::2]: #walk the odd rows
for i in row.select('th'): #walk the districts
r = list(zip_longest([i.text.strip()] , [i.text for i in row.findNext('tr').select('li')], fillvalue=i.text.strip())) # zip the current district to the list of neighbourhoods in row below. Fill with District name to get lists of equal length
results.append(r)
results = [i for j in results for i in j] #flatten list of lists
df = pd.DataFrame(results, columns= ['District','Neighbourhood'])
print(df)

Get tables from Wiki that match specific text

I'm quite new to Python and BeautifulSoup, and I've been trying to work this out for several hours...
Firstly, I want to extract all table data from below link with "general election" in the title:
https://en.wikipedia.org/wiki/Carlow%E2%80%93Kilkenny_(D%C3%A1il_constituency)
I do have another dataframe with names of each table (eg. "1961 general election", "1965 general election"), but am hoping to get away with just searching for "general election" on each table to confirm if it's what I need.
I then want to get all the names that are in Bold (which indicates they won) and finally I want another list of "Count 1" (or sometimes 1st Pref) in the original order, which I want to compare to the "Bold" list. I haven't even looked at this piece yet, as I haven't gotten past the first hurdle.
url = "https://en.wikipedia.org/wiki/Carlow%E2%80%93Kilkenny_(D%C3%A1il_constituency)"
res = requests.get(url)
soup = BeautifulSoup(res.content,'lxml')
my_tables = soup.find_all("table", {"class":"wikitable"})
for table in my_tables:
rows = table.find_all('tr', text="general election")
print(rows)
Any help on this would be greatly appreciated...
This page requires some gymnastics, but it can be done:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
req = requests.get('https://en.wikipedia.org/wiki/Carlow%E2%80%93Kilkenny_(D%C3%A1il_constituency)')
soup = bs(req.text,'lxml')
#first - select all the tables on the page
tables = soup.select('table.wikitable')
for table in tables:
ttr = table.select('tbody tr')
#next, filter out any table that doesn't involve general elections
if "general election" in ttr[0].text:
#clean up the rows
s_ttr = ttr[1].text.replace('\n','xxx').strip()
#find and clean up column headings
columns = [col.strip() for col in s_ttr.split('xxx') if len(col.strip())>0 ]
rows = [] #initialize a list to house the table rows
for c in ttr[2:]:
#from here, start processing each row and loading it into the list
row = [a.text.strip() if len(a.text.strip())>0 else 'NA' for a in c.select('td') ]
if (row[0])=="NA":
row=row[1:]
columns = [col.strip() for col in s_ttr.split('xxx') if len(col.strip())>0 ]
if len(row)>0:
rows.append(row)
#load the whole thing into a dataframe
df = pd.DataFrame(rows,columns=columns)
print(df)
The output should be all the general election tables on the page.

How to Create Pandas DataFrame out of Parsed Code using bs4/selenium on Python?

I have parsed a table and would like to convert two of those variables to a Pandas Dataframe to print to excel.
FYI:
I did ask a similar question, however, it was not answered thoroughly. There was no suggestion on how to create a Pandas DataFrame. This was the whole point of my question.
Caution:
There is small issue with the data that I parsed. The data contains "TEAM" and "SA/G" multiple times in the output.
The 1st variable that I would like in the DataFrame is 'TEAM'.
The 2nd variable that I would like in the DataFrame is 'SA/G'.
Here is my code so far:
# imports
from selenium import webdriver
from bs4 import BeautifulSoup
# make a webdriver object
driver = webdriver.Chrome('C:\webdrivers\chromedriver.exe')
# open some page using get method - url -- > parameters
driver.get('http://www.espn.com/nhl/statistics/team/_/stat/scoring/sort/avgGoals')
# driver.page_source
soup = BeautifulSoup(driver.page_source,'lxml')
#close driver
driver.close()
#find table
table = soup.find('table')
#find_all table rows
t_rows = table.find_all('tr')
#loop through tr to find_all td
for tr in t_rows:
td = tr.find_all('td')
row = [i.text for i in td]
# print(row)
# print(row[9])
# print(row[1], row[9])
team = row[1]
sag = row[9]
# print(team, sag)
data = [(team, sag)]
print(data)
Here is the final output that I would like printed to excel using the Pandas DataFrame option:
Team SA/G
Nashville 30.1
Colorado 33.6
Washington 31.0
... ...
Thanks in advance for any help that you may offer. I am still learning and appreciate any feedback that I can get.
Looks like you want to create a DataFrame from a list of tuples, which has been answered here.
I would change your code like this:
# Initial empty list
data = []
#loop through tr to find_all td
for tr in t_rows:
td = tr.find_all('td')
row = [i.text for i in td]
team = row[1]
sag = row[9]
# Add tuple containing one row of data
data.append((team, sag))
# Create df from list of tuples
df = pd.DataFrame(data, columns=['Team', 'SA/G'])
# Remove lines where Team value is "TEAM"
df = df[df["Team"] != "TEAM"]
EDIT: Add line to remove ("TEAM", "SA/G") rows in df
First inside the "for loop" append tuples into a list (instead of doing data=[(x,y)] declare the data variable before the loop as a list data = list() and append the tuples to list in the loop data.append((x,y))) and do the following
import pandas as pd
data=[("t1","sag1"),("t2","sag2"),("t3","sag3")]
df = pd.DataFrame(data,columns=['Team','SA/G'])
print(df)

Python3 Read Html Table With Pandas

Need some help here. Plan to extract all the statistical data of this site https://lotostats.ro/toate-rezultatele-win-for-life-10-20
My issue is that I am not able to read the table. I can't do this nor for the first page.
Can someone pls help?
import requests
import lxml.html as lh
import pandas as pd
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
url='https://lotostats.ro/toate-rezultatele-win-for-life-10-20'
#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the contents of the website under doc
doc = lh.fromstring(page.content)
#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')
#Create empty list
col=[]
i=0
#For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
i+=1
name=t.text_content()
print ('%d:"%s"'%(i,name))
col.append((name,[]))
#Since out first row is the header, data is stored on the second row onwards
for j in range(1,len(tr_elements)):
#T is our j'th row
T=tr_elements[j]
#If row is not of size 10, the //tr data is not from our table
# if len(T)!=10:
# break
#i is the index of our column
i=0
#Iterate through each element of the row
for t in T.iterchildren():
data=t.text_content()
#Check if row is empty
if i>0:
#Convert any numerical value to integers
try:
data=int(data)
except:
pass
#Append the data to the empty list of the i'th column
col[i][1].append(data)
#Increment i for the next column
i+=1
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)
df.head()
print(df)
Data is dynamically added. You can find the source, returning json, in network tab
import requests
r = requests.get('https://lotostats.ro/all-rez/win_for_life_10_20?draw=1&columns%5B0%5D%5Bdata%5D=0&columns%5B0%5D%5Bname%5D=&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=false&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=1&columns%5B1%5D%5Bname%5D=&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=false&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&start=0&length=20&search%5Bvalue%5D=&search%5Bregex%5D=false&_=1564996040879').json()
You can decode that and likely (investigate that) remove timestamp part (or simply replace with random number)
import requests
r = requests.get('https://lotostats.ro/all-rez/win_for_life_10_20?draw=1&columns[0][data]=0&columns[0][name]=&columns[0][searchable]=true&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false&columns[1][data]=1&columns[1][name]=&columns[1][searchable]=true&columns[1][orderable]=false&columns[1][search][value]=&columns[1][search][regex]=false&start=0&length=20&search[value]=&search[regex]=false&_=1').json()
To see the lottery lines:
print(r['data'])
The draw parameter seems to be related to page of draws e.g. 2nd page:
https://lotostats.ro/all-rez/win_for_life_10_20?draw=2&columns[0][data]=0&columns[0][name]=&columns[0][searchable]=true&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false&columns[1][data]=1&columns[1][name]=&columns[1][searchable]=true&columns[1][orderable]=false&columns[1][search][value]=&columns[1][search][regex]=false&start=20&length=20&search[value]=&search[regex]=false&_=1564996040880
You can alter the length to retrieve more results. For example, I can deliberately oversize it to get all results
import requests
r = requests.get('https://lotostats.ro/all-rez/win_for_life_10_20?draw=1&columns[0][data]=0&columns[0][name]=&columns[0][searchable]=true&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false&columns[1][data]=1&columns[1][name]=&columns[1][searchable]=true&columns[1][orderable]=false&columns[1][search][value]=&columns[1][search][regex]=false&start=0&length=100000&search[value]=&search[regex]=false&_=1').json()
print(len(r['data']))
Otherwise, you can set the length param to a set number, do an initial request, and calculate the number of pages from the total (r['recordsFiltered']) records count divided by results per page.
import math
total_results = r['recordsFiltered']
results_per_page = 20
num_pages = math.ceil(total_results/results_per_page)
Then do a loop to get all results (remembering to alter draw param). Obviously, the less requests the better.

Why is my for loop overwriting instead of appending CSV?

I am trying to scrape IB website. So, what I am doing, I have created the urls to iterate over, and I am able to extract the required information, but seems the dataframe keeps being overwritten vs appending.
import pandas as pd
from pandas import DataFrame as df
from bs4 import BeautifulSoup
import csv
import requests
base_url = "https://www.interactivebrokers.com/en/index.phpf=2222&exch=mexi&showcategories=STK&p=&cc=&limit=100"
n = 1
url_list = []
while n <= 2:
url = (base_url + "&page=%d" % n)
url_list.append(url)
n = n+1
def parse_websites(url_list):
for url in url_list:
html_string = requests.get(url)
soup = BeautifulSoup(html_string.text, 'lxml') # Parse the HTML as a string
table = soup.find('div',{'class':'table-responsive no-margin'}) #Grab the first table
df = pd.DataFrame(columns=range(0,4), index = [0]) # I know the size
for row_marker, row in enumerate(table.find_all('tr')):
column_marker = 0
columns = row.find_all('td')
try:
df.loc[row_marker] = [column.get_text() for column in columns]
except ValueError:
# It's a safe way when [column.get_text() for column in columns] is empty list.
continue
print(df)
df.to_csv('path_to_file\\test1.csv')
parse_websites(url_list)
Can you please take a look at my code at advise what I am doing wrong ?
One solution if you want to append the data frames on the file is to write in append mode:
df.to_csv('path_to_file\\test1.csv', mode='a', header=False)
otherwise you should create the data frame outside as mentioned in the comments.
If you define a data structure from within a loop, each iteration of the loop
will redefine the data structure, meaning that the work is being rewritten.
The dataframe should be defined outside of the loop if you do not want it to be overwritten.

Categories