I am trying to read the content of a Wikipedia table into a pandas DataFrame.
In [110]: import pandas as pd
In [111]: df = pd.read_html("https://en.wikipedia.org/wiki/List_of_cities_by_GDP")[0]
However, this dataframe contains gibberish values in certain columns:
0 1 2 \
0 City/Metropolitan area Country Geographical zone[1]
1 Aberdeen United Kingdom Northern Europe
2 Abidjan Côte d'Ivoire (Ivory Coast) Africa
3 Abu Dhabi United Arab Emirates Western Asia
4 Addis Ababa Ethiopia Africa
3 \
0 Official est. Nominal GDP ($BN)
1 7001113000000000000♠11.3 (2008)[5]
2 NaN
3 7002119000000000000♠119 [6]
4 NaN
4 \
0 Brookings Institution[2] 2014 est. PPP-adjuste...
1 NaN
2 NaN
3 7002178300000000000♠178.3
4 NaN
5 \
0 PwC[3] 2008 est. PPP-adjusted GDP ($BN)
1 NaN
2 7001130000000000000♠13
3 NaN
4 7001120000000000000♠12
6 7
0 McKinsey[4] 2010 est. Nominal GDP ($BN) Other est. Nominal GDP ($BN)
1 NaN NaN
2 NaN NaN
3 7001671009999900000♠67.1 NaN
4 NaN NaN
For example, in the above dataframe, in the column for Official est. Nominal GDP, the first entry should be 11.3 (2008), but we see some big number before it. I thought this must be a problem with encoding, so I tried passing ASCII as well as UTF encodings:
In [113]: df = pd.read_html("https://en.wikipedia.org/wiki/List_of_cities_by_GDP", encoding = 'ASCII')[0]
However, even this doesn't solve the problem. Any ideas?
This is because of the invisible (in the browser) "sort key" elements:
<td style="background:#79ff76;">
  <span style="display:none" class="sortkey">7001130000000000000♠</span>
  13
</td>
Maybe there is a better way to clean it up, but here is a working solution based on the idea of finding these "sort key" elements, removing them from the table, and then letting pandas parse the table HTML:
import requests
from bs4 import BeautifulSoup
import pandas as pd

response = requests.get("https://en.wikipedia.org/wiki/List_of_cities_by_GDP")
soup = BeautifulSoup(response.content, "html.parser")

# grab the first wikitable and drop the hidden "sort key" spans before parsing
table = soup.select_one("table.wikitable")
for span in table.select("span.sortkey"):
    span.decompose()

df = pd.read_html(str(table))[0]
print(df)
If you look at the HTML source of that page, you'll see that a lot of cells have a hidden <span> containing a "sortkey". These are the strange numbers you're seeing.
If you look at the documentation for read_html, you'll see this:
Expect to do some cleanup after you call this function. [...] We try to assume as little as possible about the structure of the table and push the idiosyncrasies of the HTML contained in the table to the user.
Put them together and you have your answer: garbage in, garbage out. The table you're reading from has junk data in it, and you'll have to figure out how to handle that yourself.
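If you'd rather keep plain read_html and clean up afterwards, here is a minimal post-processing sketch (it assumes every sort-key prefix ends with the '♠' character visible above):

import pandas as pd

df = pd.read_html("https://en.wikipedia.org/wiki/List_of_cities_by_GDP")[0]
# strip everything up to and including the hidden sort key's '♠' marker from every string cell
df = df.replace(r'^.*♠\s*', '', regex=True)
print(df.head())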
Related
I created a data-capturing template. When imported into Python (as DataFrame), I noticed that some of the records spanned multiple rows.
I need to clean up the spanned record (see expected representation).
The 'Entity' column is the anchor column. Currently, it is not the definitive column, as one can see the row(s) underneath with NaN.
NB: Along the line, I'll be dropping the 'Unnamed:' column.
Essentially, for every row where df.Entity.isnull(), the value(s) must be joined to the row above where df.Entity.notnull().
NB: I can adjust the source; however, I'd like to keep the source template because of ease of capturing and #reproducibility.
[dummy representation]

| Unnamed: 0 | Entity  | Country | Naming     | Type:    | Mode: | Duration | Page | Reg          | Elig            | Unit | Structure             | Notes       |
|------------|---------|---------|------------|----------|-------|----------|------|--------------|-----------------|------|-----------------------|-------------|
| 6.         | Sur...  | UK      | ...Publ    | Pros...  | FT    | Standard | Yes  | Guid... 2021 | General         | All  | Intro & discussion... | Formal      |
| NaN        | NaN     | NaN     | NaN        | NaN      | NaN   | NaN      | NaN  | NaN          | NaN             | NaN  | NaN                   | Assessment  |
| 7.         | War...  | UK      | by Publ... | Retro... | FT    | 1 yr     | Yes  | Reg 38...    | Staff           | All  | Covering Doc...       | NaN         |
| NaN        | NaN     | NaN     | NaN        | NaN      | NaN   | NaN      | NaN  | NaN          | General         | NaN  | 5000 <10000           | 3-8 publ... |
| 8.         | EAng... | UK      | Publ...    | Retro... | PT    | 6 mths   | Yes  | Reg...       | General (Cat B) | All  | Critical Anal...      | Formal as   |
| NaN        | NaN     | NaN     | NaN        | NaN      | NaN   | NaN      | NaN  | NaN          | Staff (Cat A)   | NaN  | 15000                 | *g...       |
| NaN        | NaN     | NaN     | NaN        | NaN      | NaN   | NaN      | NaN  | NaN          | Edu & LLL l...  | NaN  | NaN                   | LLL not...  |
[expected representation] I expect to have

| Unnamed | Entity  | Country | Naming     | Type:    | Mode: | Duration | Page | Reg          | Elig                                         | Unit | Structure                   | Notes                      |
|---------|---------|---------|------------|----------|-------|----------|------|--------------|----------------------------------------------|------|-----------------------------|----------------------------|
| 6.      | Sur...  | UK      | ...Publ    | Pros...  | FT    | Standard | Yes  | Guid... 2021 | General                                      | All  | Intro & discussion...       | Formal Assessment          |
| 7.      | War...  | UK      | by Publ... | Retro... | FT    | 1 yr     | Yes  | Reg 38...    | Staff General                                | All  | Covering Doc... 5000 <10000 | Formal 3-8 publ...         |
| 8.      | EAng... | UK      | Publ...    | Retro... | PT    | 6 mths   | Yes  | Reg...       | General (Cat B) Staff (Cat A) Edu & LLL l... | All  | Critical Anal... 15000      | Formal as *g... LLL not... |
My instinct is to test for isnull() on the [Entity] column. I would prefer not to do an if...then / loop check.
My mind wandered towards stack, groupby, merge/join, pop; I am not sure these approaches are 'right'.
My preference would be some 'vectorisation' as far as possible, taking advantage of pandas' DataFrame operations (see the rough sketch after the list of links below).
I took note of
Merging Two Rows (one with a value, the other NaN) in Pandas
Pandas dataframe merging rows to remove NaN
Concatenate column values in Pandas DataFrame with "NaN" values
Merge DF of different Sizes with NaN values in between
In my case, my anchor column [Entity] has the key value on one row; however, the related values are either on that same row or span multiple rows.
NB: I'm dealing with one DataFrame, not two.
I should also mention that I took note of the SO solution that 'explodes' newlines across multiple rows. That is the opposite of my scenario; however, I take note of it as it might provide hints.
Pandas: How to read a DataFrame from excel-file where multiple rows are sometimes separated by line break (\n)
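For reference, the rough direction I have in mind looks like the sketch below (a sketch only, not a verified solution; it assumes 'Entity' is the anchor column and that joining the non-null strings of a spanned record with a space is acceptable):

import pandas as pd

# each block of rows belonging to one record gets the same label
key = df['Entity'].notna().cumsum()
# collapse every block into a single row, joining the non-null strings column by column
df_clean = (df.groupby(key)
              .agg(lambda s: ' '.join(s.dropna().astype(str)))
              .reset_index(drop=True))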
[UPDATE: Workaround 1]
NB: This workaround is not a solution. It simply provides an alternative!
With leads from a Medium post and an SO post,
I attempted, with success, to read my dataset directly from the table in the Word document. For this, I installed the python-docx library.
## code snippet; #Note: incomplete
from docx import Document as docs
... ...
document = docs("datasets/IDA..._AppendixA.docx")

def read_docx_table(document, tab_id: int = None, nheader: int = 1, start_row: int = 0):
    ... ...
    data = [[cell.text for cell in row.cells] for i, row in enumerate(table.rows)
            if i >= start_row]
    ... ...
    if nheader == 1:  ## first row as column header
        df = df.rename(columns=df.iloc[0]).drop(df.index[0]).reset_index(drop=True)
    ... ...
    return df
... ...

## parse and show dataframe
df_table = read_docx_table(document, tab_id=3, nheader=1, start_row=0)
df_table
The rows no longer spill over multiple rows. The columns that contained newlines now show the '\n' character.
If I use df['col'].str.replace(), I can remove the newlines '\n' or other delimiters, if I so desire.
Replacing part of string in python pandas dataframe
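For instance, a minimal sketch for one such column ('Notes' here is just one of the columns from the table below; any column with embedded newlines would do):

# collapse the embedded newlines that python-docx preserves inside multi-line cells
df_table['Notes'] = df_table['Notes'].str.replace('\n', ' ', regex=False)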
[dataframe representation: importing and parsing using python-docx library] Almost a true representation of the original table in Word

| Unnamed | Entity  | Country | Naming     | Type:    | Mode: | Duration | Page | Reg          | Elig                                           | Unit | Structure                        | Notes                               |
|---------|---------|---------|------------|----------|-------|----------|------|--------------|------------------------------------------------|------|----------------------------------|-------------------------------------|
| 6.      | Sur...  | UK      | ...Publ    | Pros...  | FT    | Standard | Yes  | Guid... 2021 | General                                        | All  | Intro & discussion...            | Formal \n\| Assessment              |
| 7.      | War...  | UK      | by Publ... | Retro... | FT    | 1 yr     | Yes  | Reg 38...    | Staff \nGeneral                                | All  | Covering Doc... \n\| 5000 <10000 | Formal \n\| 3-8 publ...             |
| 8.      | EAng... | UK      | Publ...    | Retro... | PT    | 6 mths   | Yes  | Reg...       | General (Cat B) Staff (Cat A) \nEdu & LLL l... | All  | Critical Anal... \n\| 15000      | Formal as \n\|*g... \n\| LLL not... |
[UPDATE 2]
After my update (Workaround 1), I saw #J_H's comments. Whilst it is not 'data corruption' in the true sense, it is nonetheless an ETL strategy issue. Thanks #J_H. Absolutely, well-thought-through #design is of the essence.
Going forward, I'll either leave the source template practically as-is with minor modifications and use python-docx as I've done, or
I'll modify the source template for easy capture in an Excel- or csv-type repository.
Whichever of the two approaches outlined here (or any other) I take, I'm still keen on 'data cleaning' code that can clean up the df to give the expected df.
I currently have a long list of countries (234 values). For simplicity's sake, picture 1 displays only 10. This is what my df currently looks like:
I want to create a matrix of some sort, where the countries listed in each row are also the col headers. This is what I want my output dataframe to look like:
| Country          | China | India | U.S | Indonesia | Pakistan | ... | Montserrat | Falkland Islands | Niue | Tokelau | Vatican City |
|------------------|-------|-------|-----|-----------|----------|-----|------------|------------------|------|---------|--------------|
| China            |       |       |     |           |          |     |            |                  |      |         |              |
| India            |       |       |     |           |          |     |            |                  |      |         |              |
| U.S.             |       |       |     |           |          |     |            |                  |      |         |              |
| Indonesia        |       |       |     |           |          |     |            |                  |      |         |              |
| Pakistan         |       |       |     |           |          |     |            |                  |      |         |              |
| ...              |       |       |     |           |          |     |            |                  |      |         |              |
| Montserrat       |       |       |     |           |          |     |            |                  |      |         |              |
| Falkland Islands |       |       |     |           |          |     |            |                  |      |         |              |
| Niue             |       |       |     |           |          |     |            |                  |      |         |              |
| Tokelau          |       |       |     |           |          |     |            |                  |      |         |              |
| Vatican City     |       |       |     |           |          |     |            |                  |      |         |              |
So to reiterate the question: how do I take the value in each row of column 1 and copy it to be my dataframe's column headers to create a matrix? This dataframe is also being scraped from a website using requests and BeautifulSoup, so it isn't as if I can create a csv file from a pre-made dataframe. Is what I want to do possible?
Initialize a Pandas DataFrame as follows
countryList = ['China', 'India', 'U.S.']
pd.DataFrame(columns=countryList, index=countryList)
and just append all elements of `countryList` according to your use case. This yields an empty dataframe to insert data into.
      China India U.S.
China   NaN   NaN  NaN
India   NaN   NaN  NaN
U.S.    NaN   NaN  NaN
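To put data into it afterwards, you can index by the country labels, e.g. (the value below is purely for illustration):

df = pd.DataFrame(columns=countryList, index=countryList)
df.loc['China', 'India'] = 1   # row 'China', column 'India'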
Will something like this work?
data = ["US","China","England","Spain",'Brazil']
df = pd.DataFrame({"Country":data})
df[df.Country.values] = ''
df
Output:
Country US China England Spain Brazil
0 US
1 China
2 England
3 Spain
4 Brazil
You can even set the country as the index like:
data = ["US","China","England","Spain",'Brazil']
df = pd.DataFrame({"Country":data})
df[df.Country.values] = ''
df = df.set_index(df.Country)[df.Country.values].rename_axis(index=None)
Output:
US China England Spain Brazil
US
China
England
Spain
Brazil
7Shoe's answer is good, but in case you already have a dataframe:
import pandas as pd
df = pd.DataFrame({'Country':['U.S.','Canada', 'India']})
pd.DataFrame(columns=df.Country, index=df.Country).rename_axis(None)
Output
Country U.S. Canada India
U.S. NaN NaN NaN
Canada NaN NaN NaN
India NaN NaN NaN
I want to scrape the table with postal codes of Toronto from this Wikipedia page.
Regardless of whether I use the pandas method or BeautifulSoup, it just won't work, although others reported that it should. Kindly give me your hints:
Pandas:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
beautifulSoup:
According to others, the output table produced by the code in the image should look like the table in the image 'expected output with beautifulSoup'.
[image: output of beautifulSoup]
[image: expected output with beautifulSoup]
You are asking too much! Pandas is great at parsing rectangular HTML tables, and here it does it perfectly:
it gives 9 columns from M1x to M9x
it gives 20 rows from MyA to MyZ (not using all letters)
Here you would like to have it parse inside the cells. Sorry but you will have to code it yourself. For example you could easily parse the first row with:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
dfA = df.iloc[0].str.extractall(r'(M\d\w)([^(]*)(?:\((.*)\))?')
it gives for dfA:
0 1 2
match
0 0 M1A Not assigned NaN
1 0 M2A Not assigned NaN
2 0 M3A North York Parkwoods
3 0 M4A North York Victoria Village
4 0 M5A Downtown Toronto Regent Park / Harbourfront
5 0 M6A North York Lawrence Manor / Lawrence Heights
6 0 M7A Queen's Park / Ontario Provincial Government NaN
7 0 M8A Not assigned NaN
8 0 M9A Etobicoke Islington Avenue
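If you want the same extraction for every row of the grid, a rough sketch would be the following (the column names are my own labels, and the regex may need tweaks for rows with different formatting):

frames = [df.iloc[i].str.extractall(r'(M\d\w)([^(]*)(?:\((.*)\))?') for i in range(len(df))]
tidy = pd.concat(frames, ignore_index=True)
tidy.columns = ['PostalCode', 'Borough', 'Neighbourhood']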
I have a table I copied from a webpage which, when pasted into LibreOffice Calc or Excel, occupies a single cell, and when pasted into notebook becomes a 3507x1 column. If I import this as a pandas dataframe using pd.read_csv I see the same 3507x1 column, and I'd now like to reshape it into the 501x7 array that it started as.
I thought I could recast it as a numpy array, reshape it as I am familiar with in numpy, and then put it back into a df, but the to_numpy methods of pandas seem to want to work with a Series object (not DataFrame), and attempts to read the file into a Series using e.g.
ser= pd.Series.from_csv('billionaires')
led to tokenizing errors. Is there some simple way to do this? Maybe I should throw in the towel on this direction and read from the html?
A simple copy-paste does not give you any clear column separator, so it's impossible to do it easily.
You only have spaces, but spaces may or may not also appear inside the column values (like in the name or country), so it is impossible to give pd.read_csv a column separator.
However, if I copy-paste the table into a file, I notice some regularity.
If you know regex, you can try using pandas.Series.str.extract. This method extracts capture groups in a regex pattern as columns of a DataFrame. The regex is applied to each element / string of the series.
You can then try to find a regex pattern to capture the various elements of the row to split them into separate columns.
df = pd.read_csv('data.txt', names=["A"]) #no header in the file
ss = df['A']
rdf = ss.str.extract(r'(\d)\s+(.+)(\$[\d\.]+B)\s+([+-]\$[\d\.]+[BM])\s+([+-]\$[\d\.]+B)\s+([\w\s]+)\s+([\w\s]+)')
Here I tried to write a regex for the table in the link; the result on the first rows seems pretty good.
0 1 2 3 4 5 6
0 1 Jeff Bezos $121B +$231M -$3.94B United States Technology
1 3 Bernard Arnault $104B +$127M +$35.7B France Consumer
2 4 Warren Buffett $84.9B +$66.3M +$1.11B United States Diversified
3 5 Mark Zuckerberg $76.7B -$301M +$24.6B United States Technology
4 6 Amancio Ortega $66.5B +$303M +$7.85B Spain Retail
5 7 Larry Ellison $62.3B +$358M +$13.0B United States Technology
6 8 Carlos Slim $57.0B -$331M +$2.20B Mexico Diversified
7 9 Francoise Bettencourt Meyers $56.7B -$1.12B +$10.5B France Consumer
8 0 Larry Page $55.7B +$393M +$4.47B United States Technology
I used pd.read_csv to read the file, since `Series.from_csv` is deprecated.
I found that converting to a numpy array was far easier than I had realized - the numpy asarray method can handle a df (and conveniently enough it works for general objects, not just numbers)
import numpy as np
import pandas as pd

df = pd.read_csv('billionaires', sep='\n')
print(df.shape)   # -> (3507, 1)

n = np.asarray(df)          # works on a DataFrame, not just a Series
m = np.reshape(n, [-1, 7])  # back to 501 rows x 7 columns
df2 = pd.DataFrame(m)
df2.head()
0 1 2 3 4 \
0 0 Name Total net worth $ Last change $ YTD change
1 1 Jeff Bezos $121B +$231M -$3.94B
2 2 Bill Gates $107B -$421M +$16.7B
3 3 Bernard Arnault $104B +$127M +$35.7B
4 4 Warren Buffett $84.9B +$66.3M +$1.11B
5 6
0 Country Industry
1 United States Technology
2 United States Technology
3 France Consumer
4 United States Diversified
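One small follow-up step (a sketch, assuming row 0 of the reshaped frame really holds the header, as in the output above) is to promote that row to column names:

df2.columns = df2.iloc[0]                            # use the first reshaped row as the header
df2 = df2.drop(df2.index[0]).reset_index(drop=True)  # and drop it from the data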
I've written a script in python to parse some data from a webpage and write it to a csv file via pandas. So far what I've written can parse all the tables available on that page, but when writing to a csv file it only writes the last table from that page. Clearly the data are being overwritten because of the loop. How can I fix this flaw so that my scraper will be able to write all the data from the different tables instead of only the last one? Thanks in advance.
import csv
import requests
from bs4 import BeautifulSoup
import pandas as pd
res = requests.get('http://www.espn.com/nba/schedule/_/date/20171001').text
soup = BeautifulSoup(res,"lxml")
for table in soup.find_all("table"):
    df = pd.read_html(str(table))[0]
    df.to_csv("table_item.csv")
    print(df)
Btw, I expect to write the data to a csv file using pandas only. Thanks again.
You can use read_html, which returns a list of DataFrames from the webpage, and then concat them into one df:
dfs = pd.read_html('http://www.espn.com/nba/schedule/_/date/20171001')
df = pd.concat(dfs, ignore_index=True)
#if necessary rename columns
d = {'Unnamed: 1':'a', 'Unnamed: 7':'b'}
df = df.rename(columns=d)
print (df.head())
matchup a time (ET) nat tv away tv home tv \
0 Atlanta ATL Miami MIA NaN NaN NaN NaN
1 LA LAC Toronto TOR NaN NaN NaN NaN
2 Guangzhou Guangzhou Washington WSH NaN NaN NaN NaN
3 Charlotte CHA Boston BOS NaN NaN NaN NaN
4 Orlando ORL Memphis MEM NaN NaN NaN NaN
tickets b
0 2,401 tickets available from $6 NaN
1 284 tickets available from $29 NaN
2 2,792 tickets available from $2 NaN
3 2,908 tickets available from $6 NaN
4 1,508 tickets available from $3 NaN
And lastly, to_csv to write to the file:
df.to_csv("table_item.csv", index=False)
EDIT:
For learning purposes, it is also possible to append each DataFrame to a list and then concat:
res = requests.get('http://www.espn.com/nba/schedule/_/date/20171001').text
soup = BeautifulSoup(res,"lxml")

dfs = []
for table in soup.find_all("table"):
    df = pd.read_html(str(table))[0]
    dfs.append(df)

df = pd.concat(dfs, ignore_index=True)
print(df)
df.to_csv("table_item.csv")