Scraping a table in Wikipedia with Python

I want to scrape the table with the postal codes of Toronto from this Wikipedia page.
Whether I use the pandas method or BeautifulSoup, it just won't work, although others have reported that it should. Kindly give me your hints:
Pandas:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
beautifulSoup:
According to others, the output of the code in the image should look like the table in the image 'expected output with BeautifulSoup'.
[image: output of BeautifulSoup]
[image: expected output with BeautifulSoup]

You are asking too much! Pandas is great at parsing rectangular HTML tables, and here it does so perfectly:
it gives 9 columns, from M1x to M9x
it gives 20 rows, from MyA to MyZ (not all letters are used)
But here you would like it to parse inside the cells. Sorry, you will have to code that yourself. For example, you could easily parse the first row with:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
dfA = df.iloc[0].str.extractall(r'(M\d\w)([^()]*)(?:\((.*)\))?')
it gives for dfA:
0 1 2
match
0 0 M1A Not assigned NaN
1 0 M2A Not assigned NaN
2 0 M3A North York Parkwoods
3 0 M4A North York Victoria Village
4 0 M5A Downtown Toronto Regent Park / Harbourfront
5 0 M6A North York Lawrence Manor / Lawrence Heights
6 0 M7A Queen's Park / Ontario Provincial Government NaN
7 0 M8A Not assigned NaN
8 0 M9A Etobicoke Islington Avenue
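If you want the whole grid rather than just the first row, a minimal sketch along the same lines (assuming the page still renders as the 9x20 grid described above; the column names are my own choice, not part of the page):

import pandas as pd

df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
# apply the same per-cell regex to every row of the grid, then stack the results
parts = [row.str.extractall(r'(M\d\w)([^()]*)(?:\((.*)\))?') for _, row in df.iterrows()]
out = pd.concat(parts, ignore_index=True)
out.columns = ['PostalCode', 'Borough', 'Neighborhood']  # hypothetical names
out = out[out['Borough'] != 'Not assigned']  # drop unassigned codes
print(out.head())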

Related

How can I fill a new pandas column based on a substring of the original data?

There are two dataframes, and they contain similar data.
A dataframe
Index Business Address
1 Oils Moskva, Russia
2 Foods Tokyo, Japan
3 IT California, USA
... etc.
B dataframe
Index Country Country Calling Codes
1 USA +1
2 Egypt +20
3 Russia +7
4 Korea +82
5 Japan +81
... etc.
I want to add a column named 'Country Calling Codes' to dataframe A as well.
To fill it, the 'Country' column of dataframe B should be compared with the 'Address' column. If the string in 'A.Address' includes the string in 'B.Country', then 'B.Country Calling Codes' should be inserted into 'A.Country Calling Codes' for that row.
Result is:
Index Business Address Country Calling Codes
1 Oils Moskva, Russia +7
2 Foods Tokyo, Japan +81
3 IT California, USA +1
I don't know how to deal with this because I don't have much experience with pandas. I would be very grateful if you could help me.
Use Series.str.extract to get the possible strings from the Country column, and then use Series.map with a Series:
d = B.drop_duplicates('Country').set_index('Country')['Country Calling Codes']
s = A['Address'].str.extract(f'({"|".join(d.keys())})', expand=False)
A['Country Calling Codes'] = s.map(d)
print(A)
Index Business Address Country Calling Codes
0 1 Oils Moskva, Russia +7
1 2 Foods Tokyo, Japan +81
2 3 IT California, USA +1
Detail:
print(A['Address'].str.extract(f'({"|".join(d.keys())})', expand=False))
0 Russia
1 Japan
2 USA
Name: Address, dtype: object
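For reference, a minimal self-contained run of the above; the two frames are rebuilt here from the sample data in the question:

import pandas as pd

A = pd.DataFrame({'Index': [1, 2, 3],
                  'Business': ['Oils', 'Foods', 'IT'],
                  'Address': ['Moskva, Russia', 'Tokyo, Japan', 'California, USA']})
B = pd.DataFrame({'Index': [1, 2, 3, 4, 5],
                  'Country': ['USA', 'Egypt', 'Russia', 'Korea', 'Japan'],
                  'Country Calling Codes': ['+1', '+20', '+7', '+82', '+81']})

d = B.drop_duplicates('Country').set_index('Country')['Country Calling Codes']
s = A['Address'].str.extract(f'({"|".join(d.keys())})', expand=False)
A['Country Calling Codes'] = s.map(d)
print(A)  # each row now carries the calling code matched from its address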

How to deal with a copy-pasted table in pandas - reshaping a column vector

I have a table I copied from a webpage. When pasted into LibreCalc or Excel it occupies a single cell, and when pasted into Notepad it becomes a 3507x1 column. If I import it as a pandas dataframe using pd.read_csv I see the same 3507x1 column, and I'd now like to reshape it into the 501x7 array it started as.
I thought I could recast it as a numpy array, reshape it as I am familiar with in numpy, and then put it back into a df, but the to_numpy methods of pandas seem to want to work with a Series object (not a DataFrame), and attempts to read the file into a Series using e.g.
ser = pd.Series.from_csv('billionaires')
led to tokenizing errors. Is there some simple way to do this? Or should I throw in the towel on this approach and read from the HTML?
A simple copy-paste does not give you any clear column separator, so there is no easy way to do it.
You only have spaces, but spaces may also occur inside column values (like in the name or country), so you cannot give pandas.read_csv a column separator.
However, if I copy-paste the table into a file, I notice some regularity.
If you know regex, you can try using pandas.Series.str.extract. This method extracts the capture groups of a regex pattern as columns of a DataFrame. The regex is applied to each element/string of the series.
You can then try to find a regex pattern that captures the various elements of a row, to split them into separate columns.
df = pd.read_csv('data.txt', names=['A'])  # no header in the file
ss = df['A']
rdf = ss.str.extract(r'(\d)\s+(.+)(\$[\d\.]+B)\s+([+-]\$[\d\.]+[BM])\s+([+-]\$[\d\.]+B)\s+([\w\s]+)\s+([\w\s]+)')
Here I tried to write a regex for the table in the link; the result on the first rows seems pretty good.
0 1 2 3 4 5 6
0 1 Jeff Bezos $121B +$231M -$3.94B United States Technology
1 3 Bernard Arnault $104B +$127M +$35.7B France Consumer
2 4 Warren Buffett $84.9B +$66.3M +$1.11B United States Diversified
3 5 Mark Zuckerberg $76.7B -$301M +$24.6B United States Technology
4 6 Amancio Ortega $66.5B +$303M +$7.85B Spain Retail
5 7 Larry Ellison $62.3B +$358M +$13.0B United States Technology
6 8 Carlos Slim $57.0B -$331M +$2.20B Mexico Diversified
7 9 Francoise Bettencourt Meyers $56.7B -$1.12B +$10.5B France Consumer
8 0 Larry Page $55.7B +$393M +$4.47B United States Technology
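One caveat, not in the original answer: the first group (\d) captures only a single digit, which is presumably why rank 10 shows up as 0 in the last row above. Widening it to (\d+) fixes that:

rdf = ss.str.extract(r'(\d+)\s+(.+)(\$[\d\.]+B)\s+([+-]\$[\d\.]+[BM])\s+([+-]\$[\d\.]+B)\s+([\w\s]+)\s+([\w\s]+)')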
I used pd.read_csv to read the file, since Series.from_csv is deprecated.
I found that converting to a numpy array was far easier than I had realized: the numpy asarray method can handle a df (and, conveniently enough, it works for general objects, not just numbers).
import numpy as np
import pandas as pd

df = pd.read_csv('billionaires', sep='\n')
print(df.shape)
-> (3507, 1)
n = np.asarray(df)
m = np.reshape(n,[-1,7])
df2=pd.DataFrame(m)
df2.head()
0 1 2 3 4 \
0 0 Name Total net worth $ Last change $ YTD change
1 1 Jeff Bezos $121B +$231M -$3.94B
2 2 Bill Gates $107B -$421M +$16.7B
3 3 Bernard Arnault $104B +$127M +$35.7B
4 4 Warren Buffett $84.9B +$66.3M +$1.11B
5 6
0 Country Industry
1 United States Technology
2 United States Technology
3 France Consumer
4 United States Diversified
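Since pandas 0.24, DataFrame.to_numpy also works on a whole frame, so the np.asarray step can be dropped. A sketch, assuming the same 3507x1 input file and that the first seven reshaped values are the header row, as in the output above:

import pandas as pd

# read exactly as above: the first physical line is consumed as the csv header
df = pd.read_csv('billionaires', sep='\n')
m = df.to_numpy().reshape(-1, 7)         # 3507 values -> 501 rows x 7 columns
df2 = pd.DataFrame(m[1:], columns=m[0])  # the first reshaped row holds the header labels
print(df2.head())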

Python: remove rows if a dataframe cell value contains fewer than 5 characters

I have a dataframe like the one below, and I am trying to keep only the rows whose value has more than 5 characters. Here is what I tried, but it removes short words such as 'of', 'U.', 'and', 'Arts', etc. from inside the strings instead. I just need to remove the rows whose value is less than 5 characters long.
id schools
1 University of Hawaii
2 Dept in Colorado U.
3 Dept
4 College of Arts and Science
5 Dept
6 Bldg
wrong output from my code:
0 University Hawaii
1 Colorado
2
3 College Science
4
5
Looking for output like this:
id schools
1 University of Hawaii
2 Dept in Colorado U.
4 College of Arts and Science
Code:
l = [1,2,3,4,5,6]
s = ['University of Hawaii', 'Dept in Colorado U.','Dept','College of Arts and Science','Dept','Bldg']
df1 = pd.DataFrame({'id':l, 'schools':s})
df1 = df1['schools'].str.findall(r'\w{5,}').str.join(' ') # not working
df1
Using a regex is a huge (and slow) overkill for this task. You can use simple pandas indexing:
filtered_df = df1[df1['schools'].str.len() > 5] # or >= depending on the required logic
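A quick check on the question's sample data (rebuilt below) confirms the indexing approach:

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6],
                    'schools': ['University of Hawaii', 'Dept in Colorado U.', 'Dept',
                                'College of Arts and Science', 'Dept', 'Bldg']})
print(df1[df1['schools'].str.len() > 5])
#    id                      schools
# 0   1         University of Hawaii
# 1   2          Dept in Colorado U.
# 3   4  College of Arts and Science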
There is a simpler filter for your data.
mask = df1['schools'].str.len() > 5
Then create a new data frame from the filter
df2 = df1[mask].copy()
import pandas as pd

name = ['University of Hawaii', 'Dept in Colorado U.', 'Dept', 'College of Arts and Science', 'Dept', 'Bldg']
labels = ['schools']
df = pd.DataFrame.from_records([[i] for i in name], columns=labels)
df[df['schools'].str.len() > 5]

pandas fill missing country values based on city if it exists

I'm trying to fill in country names in my dataframe where they are null, based on city/country pairs that do exist. For example, in the dataframe below I want to replace the NaN for the city Bangalore with the country India, since that city already appears elsewhere in the dataframe:
df1=
City Country
0 Bangalore India
1 Delhi India
2 London UK
3 California USA
4 Dubai UAE
5 Abu Dhabi UAE
6 Bangalore NaN
I am new to this so any help would be appreciated :).
You can create a series mapping after dropping nulls and duplicates.
Then use fillna with pd.Series.map:
g = df.dropna(subset=['Country']).drop_duplicates('City').set_index('City')['Country']
df['Country'] = df['Country'].fillna(df['City'].map(g))
print(df)
City Country
0 Bangalore India
1 Delhi India
2 London UK
3 California USA
4 Dubai UAE
5 Abu Dhabi UAE
6 Bangalore India
This solution will also work if NaN occurs first within a group.
I believe
df1.groupby('City')['Country'].fillna(method='ffill')
should resolve your issue by forward-filling missing values within each city group.
One of the ways could be -
non_null_cities = df1.dropna().drop_duplicates(['City']).rename(columns={'Country':'C'})
df1 = df1.merge(non_null_cities, on='City', how='left')
df1.loc[df1['Country'].isnull(), 'Country'] = df1['C']
del df1['C']
Hope this will be helpful!
Here is one nasty way to do it:
first use a forward fill, and then a backward fill (for the case where the NaN occurs first in a group).
df = df.groupby('City')[['City','Country']].fillna(method = 'ffill').groupby('City')[['City','Country']].fillna(method = 'bfill')
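For what it's worth, the same ffill-then-bfill idea can be written more compactly with GroupBy.transform; this is my own variant, not from the answers above:

df['Country'] = df.groupby('City')['Country'].transform(lambda s: s.ffill().bfill())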

Improper rendering of numerical values while reading Wikipedia table in pandas

I am trying to read a content of a Wikipedia table in a pandas DataFrame.
In [110]: import pandas as pd
In [111]: df = pd.read_html("https://en.wikipedia.org/wiki/List_of_cities_by_GDP")[0]
However, this dataframe contains gibberish values in certain columns:
0 1 2 \
0 City/Metropolitan area Country Geographical zone[1]
1 Aberdeen United Kingdom Northern Europe
2 Abidjan Côte d'Ivoire (Ivory Coast) Africa
3 Abu Dhabi United Arab Emirates Western Asia
4 Addis Ababa Ethiopia Africa
3 \
0 Official est. Nominal GDP ($BN)
1 7001113000000000000♠11.3 (2008)[5]
2 NaN
3 7002119000000000000♠119 [6]
4 NaN
4 \
0 Brookings Institution[2] 2014 est. PPP-adjuste...
1 NaN
2 NaN
3 7002178300000000000♠178.3
4 NaN
5 \
0 PwC[3] 2008 est. PPP-adjusted GDP ($BN)
1 NaN
2 7001130000000000000♠13
3 NaN
4 7001120000000000000♠12
6 7
0 McKinsey[4] 2010 est. Nominal GDP ($BN) Other est. Nominal GDP ($BN)
1 NaN NaN
2 NaN NaN
3 7001671009999900000♠67.1 NaN
4 NaN NaN
For example, in the above dataframe, in the column for Official est. Nominal GDP, the first entry is 11.3 (2008), but we see some big number before it. I thought this must be a problem with encoding, and I tried passing ASCII as well as UTF encodings:
In [113]: df = pd.read_html("https://en.wikipedia.org/wiki/List_of_cities_by_GDP", encoding = 'ASCII')[0]
However, even this doesn't solve the problem. Any ideas?
This is because of the invisible (in the browser) "sort key" elements:
<td style="background:#79ff76;">
<span style="display:none" class="sortkey">7001130000000000000♠</span>
13
</td>
Maybe there is a better way to clean it up, but here is a working solution based on the idea of finding these "sort key" elements and removing them from the table, then letting pandas parse the table HTML:
import requests
from bs4 import BeautifulSoup
import pandas as pd
response = requests.get("https://en.wikipedia.org/wiki/List_of_cities_by_GDP")
soup = BeautifulSoup(response.content, "html.parser")
table = soup.select_one("table.wikitable")
for span in table.select("span.sortkey"):
    span.decompose()
df = pd.read_html(str(table))[0]
print(df)
If you look at the HTML source of that page, you'll see that a lot of cells have a hidden <span> containing a "sortkey". These are the strange numbers you're seeing.
If you look at the documentation for read_html, you'll see this:
Expect to do some cleanup after you call this function. [...] We try to assume as little as possible about the structure of the table and push the idiosyncrasies of the HTML contained in the table to the user.
Put them together and you have your answer: garbage in, garbage out. The table you're reading from has junk data in it, and you'll have to figure out how to handle that yourself.
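If you'd rather not re-fetch and pre-clean the HTML, another option is to strip the junk from the parsed frame after the fact. This sketch assumes every sort key is a run of characters ending in the ♠ marker, as in the snippet above:

import pandas as pd

df = pd.read_html("https://en.wikipedia.org/wiki/List_of_cities_by_GDP")[0]
# remove everything up to and including the '♠' marker in each cell
df = df.replace(r'^.*♠', '', regex=True)
print(df.head())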
