Scraping of Web Page Tables using Beautiful Soup Python - python

I am trying to scrape a table and its contents from an Apple Wikipedia page. I am using Beautiful Soup to extract the data. I have the following code:
from bs4 import BeautifulSoup
import requests
import pandas as pad
import lxml.html as html
appleurl = "https://en.m.wikipedia.org/wiki/Timeline_of_Apple_Inc._products"
_content = requests.get(appleurl)
soup = BeautifulSoup(_content.content)
_table = soup.findChildren('table')
rows = _table[0].findChildren(['th','tr'])
for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        value = cell.string
        print("The value in this cell is %s" % value)
I get the following output:
The value in this cell is 1976
The value in this cell is April 11
The value in this cell is Apple I
The value in this cell is Apple I
The value in this cell is September 1, 1977
The value in this cell is 1977
The value in this cell is April 1
The value in this cell is Apple II
The value in this cell is Apple II
The value in this cell is June 1, 1979
The value in this cell is 1978
The value in this cell is June 1
The value in this cell is Disk II
The value in this cell is Drives
The value in this cell is May 1, 1984
The value in this cell is 1979
The value in this cell is June 1
The value in this cell is Apple II Plus
The value in this cell is Apple II series
The value in this cell is December 1, 1982
The value in this cell is None
The value in this cell is None
The value in this cell is None
The value in this cell is Bell & Howell Disk II
The value in this cell is None
The value in this cell is Apple SilenType
The value in this cell is Printers
The value in this cell is October 1, 1982
The problem is that the year 1979 has multiple models, and they are not all being extracted. I need all the models for 1979. My code works fine when there is a single row per year. What should I do when a year spans multiple rows, as in the first table of the page I linked?
The values I need are Year, Release Date, and Model. The other two columns can be dropped.
I would really appreciate any help.

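You can simply use pandas for this: pd.read_html() parses the tables on the page into DataFrames and repeats cells that span several rows, so the year appears on every model row.
import pandas as pd
df = pd.read_html('https://en.m.wikipedia.org/wiki/Timeline_of_Apple_Inc._products')[0]
print(pd.concat([df['Year'], df['Release Date'], df['Model']], axis=1, sort=False))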
Output:
Year Release Date Model
0 1976 April 11 Apple I
1 1977 April 1 Apple II
2 1978 June 1 Disk II
3 1979 June 1 Apple II Plus
4 1979 June 1 Apple II EuroPlus
5 1979 June 1 Apple II J-Plus
6 1979 June 1 Bell & Howell
7 1979 June 1 Bell & Howell Disk II
8 1979 June 1 Apple SilenType
Update for all tables.
import pandas as pd
dfs = pd.read_html('https://en.m.wikipedia.org/wiki/Timeline_of_Apple_Inc._products')
for df in dfs:
    print(pd.concat([df['Year'], df['Release Date'], df['Model']], axis=1, sort=False))
If you would like everything in a single DataFrame, use this code.
import pandas as pd
dfs = pd.read_html('https://en.m.wikipedia.org/wiki/Timeline_of_Apple_Inc._products')
# Collect the three columns from every table, then concatenate once
# (DataFrame.append was removed in pandas 2.0).
frames = [df[['Year', 'Release Date', 'Model']] for df in dfs]
dffinal = pd.concat(frames, ignore_index=True)
print(dffinal)
Output:
Year Release Date Model
0 1976 April 11 Apple I
1 1977 April 1 Apple II
2 1978 June 1 Disk II
3 1979 June 1 Apple II Plus
4 1979 June 1 Apple II EuroPlus
5 1979 June 1 Apple II J-Plus
6 1979 June 1 Bell & Howell
7 1979 June 1 Bell & Howell Disk II
8 1979 June 1 Apple SilenType
9 1980 September 1 Apple III
10 1980 September 1 Modem IIB (Novation CAT)
11 1980 September 1 Printer IIA (Centronics 779)
12 1980 September 1 Monitor III
13 1980 September 1 Monitor II (various third party)
14 1980 September 1 Disk III
15 1981 September 1 Apple ProFile
16 1981 December 1 Apple III Revised[1]
17 1982 October 1 Apple Dot Matrix Printer
18 1982 October 1 Apple Daisy Wheel Printer
19 1983 January 1 Apple IIe
20 1983 January 1 Apple Lisa[2]
21 1983 December 1 Apple III Plus
22 1983 December 1 Apple ImageWriter
23 1984 January 1 Apple Lisa 2
24 1984 January 24 Macintosh (128K)
25 1984 January 24 Macintosh External Disk Drive (400K)
26 1984 January 24 Apple Modem 300
27 1984 January 24 Apple Modem 1200
28 1984 April 1 Apple IIc
29 1984 April 1 Apple Scribe Printer
.. ... ... ...
606 2019 March 18 iPad Mini (5th gen)
607 2019 March 19 iMac with Retina 4K display (21.5") (Early 2019)
608 2019 March 19 iMac with Retina 5K display (27") (Early 2019)
609 2019 March 20 AirPods (2nd gen)
610 2019 May 21 MacBook Pro with Touch Bar (4th gen) (13") (Mi...
611 2019 May 21 MacBook Pro with Touch Bar (4th gen) (15") (Mi...
612 2019 May 28 iPod Touch (7th gen)
613 2019 July 9 MacBook Air (13") (2019)
614 2019 July 9 Macbook Pro with Touch Bar (4th gen) (13") (Mi...
615 2019 September 20 Apple Watch Series 5
616 2019 September 20 Apple Watch Hermès Series 5
617 2019 September 20 Apple Watch Nike Series 5
618 2019 September 20 Apple Watch Edition Series 5
619 2019 September 20 iPhone 8 (128 GB)
620 2019 September 20 iPhone 8 Plus (128 GB)
621 2019 September 20 iPhone 11
622 2019 September 20 iPhone 11 Pro
623 2019 September 20 iPhone 11 Pro Max
624 2019 September 25 iPad (2019)
625 2019 October 30 AirPods Pro
626 2019 November 13 MacBook Pro with Touch Bar (16")
627 2019 December 10 Mac Pro (Late 2019)
628 2019 December 10 Pro Display XDR
629 2020 March 18 NaN
630 2020 March 18 iPad Pro (11") (2nd gen)
631 2020 March 18 iPad Pro (12.9") (4th gen)
632 2020 March 18 Magic Keyboard for iPad Pro
633 2020 March 18 MacBook Air (Early 2020)
634 2020 April 24 iPhone SE (2nd gen)
635 2020 May 4 MacBook Pro with Magic Keyboard (Mid 2020)
[636 rows x 3 columns]
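The combining step can also be tried offline. The following is only a sketch: the helper name combine_tables, the WANTED constant, the "Family" column, and the small stand-in tables are assumptions made for illustration, not part of the answer above. It keeps the three requested columns from each table and skips any table that lacks them, which may be useful if a page contains tables with a different layout.

```python
import pandas as pd

# Columns requested in the question.
WANTED = ["Year", "Release Date", "Model"]


def combine_tables(tables, wanted=WANTED):
    """Keep only the wanted columns from each table and stack the results."""
    # Skip tables that do not contain every wanted column.
    usable = [t[wanted] for t in tables if set(wanted).issubset(t.columns)]
    if not usable:
        return pd.DataFrame(columns=wanted)
    return pd.concat(usable, ignore_index=True)


# Stand-in tables (hypothetical data shaped like the page's tables).
tables = [
    pd.DataFrame({
        "Year": [1976, 1977],
        "Release Date": ["April 11", "April 1"],
        "Model": ["Apple I", "Apple II"],
        "Family": ["Apple I", "Apple II"],
    }),
    pd.DataFrame({
        "Year": [1980],
        "Release Date": ["September 1"],
        "Model": ["Apple III"],
        "Family": ["Apple III"],
    }),
    pd.DataFrame({"Note": ["a table without the wanted columns"]}),
]

combined = combine_tables(tables)
print(combined)
```

With real data, the list returned by pd.read_html(...) would take the place of the stand-in tables.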

Related

pd.read_html() not reading date

When I try to parse a wiki page for its tables, the tables are read correctly except for the date of birth column, which comes back as empty. Is there a workaround for this? I've tried using beautiful soup but I get the same result.
The code I've used is as follows:
url = 'https://en.wikipedia.org/wiki/2002_FIFA_World_Cup_squads'
pd.read_html(url)
One possible solution is to alter the page content with BeautifulSoup and then load it into pandas:
import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/2002_FIFA_World_Cup_squads"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# select correct table, here I select the first one:
tbl = soup.select("table")[0]
# remove the (aged XX) part:
for td in tbl.select("td:nth-of-type(3)"):
    td.string = td.contents[-1].split("(")[0]
# wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string
df = pd.read_html(StringIO(str(tbl)))[0]
print(df)
Prints:
No. Pos. Player Date of birth (age) Caps Club
0 1 GK Thomas Sørensen 12 June 1976 14 Sunderland
1 2 MF Stig Tøfting 14 August 1969 36 Bolton Wanderers
2 3 DF René Henriksen 27 August 1969 39 Panathinaikos
3 4 DF Martin Laursen 26 July 1977 15 Milan
4 5 DF Jan Heintze (c) 17 August 1963 83 PSV Eindhoven
5 6 DF Thomas Helveg 24 June 1971 67 Milan
6 7 MF Thomas Gravesen 11 March 1976 22 Everton
7 8 MF Jesper Grønkjær 12 August 1977 25 Chelsea
8 9 FW Jon Dahl Tomasson 29 August 1976 38 Feyenoord
9 10 MF Martin Jørgensen 6 October 1975 32 Udinese
10 11 FW Ebbe Sand 19 July 1972 44 Schalke 04
11 12 DF Niclas Jensen 17 August 1974 8 Manchester City
12 13 DF Steven Lustü 13 April 1971 4 Lyn
13 14 MF Claus Jensen 29 April 1977 13 Charlton Athletic
14 15 MF Jan Michaelsen 28 November 1970 11 Panathinaikos
15 16 GK Peter Kjær 5 November 1965 4 Aberdeen
16 17 MF Christian Poulsen 28 February 1980 3 Copenhagen
17 18 FW Peter Løvenkrands 29 January 1980 4 Rangers
18 19 MF Dennis Rommedahl 22 July 1978 19 PSV Eindhoven
19 20 DF Kasper Bøgelund 8 October 1980 2 PSV Eindhoven
20 21 FW Peter Madsen 26 April 1978 4 Brøndby
21 22 GK Jesper Christiansen 24 April 1978 0 Vejle
22 23 MF Brian Steen Nielsen 28 December 1968 65 Malmö FF
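If the column has already been read as text, the "(aged XX)" suffix can also be stripped inside pandas. This is a minimal sketch that assumes the cell text looks like "12 June 1976 (aged 25)"; the sample values and ages are made up for illustration, and the real page may carry extra hidden text that needs a different pattern.

```python
import pandas as pd

# Hypothetical cell text as it might appear in the "Date of birth (age)" column.
raw = pd.Series([
    "12 June 1976 (aged 25)",
    "14 August 1969 (aged 32)",
    "27 August 1969 (aged 32)",
])

# Drop the trailing "(aged NN)" part, including surrounding whitespace.
cleaned = raw.str.replace(r"\s*\(aged\s+\d+\)\s*$", "", regex=True)

# Parse "day month-name year"; %B expects English month names in the default locale.
dates = pd.to_datetime(cleaned, format="%d %B %Y")

print(cleaned.tolist())
print(dates.dt.year.tolist())
```

Parsing to datetime is optional; the cleaned strings alone match the output printed above.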
Try setting the parse_dates parameter to True in the read_html call.

Deleting rows according to specific value in a row

     Edition                          Reviews             Ratings               BookCategory                      Price  Edition_year
165  Paperback,– Import, 5 Jul 1996   4.5 out of 5 stars  2 customer reviews    Sports                           270.00  1996
166  Hardcover,– 18 Aug 2009          4.5 out of 5 stars  2 customer reviews    Language, Linguistics & Writing   61.00  2009
167  Paperback,– 26 Jul 2018          3.7 out of 5 stars  23 customer reviews   Crime, Thriller & Mystery        184.00  2018
168  Paperback,– Import, 22 Mar 2018  4.2 out of 5 stars  50 customer reviews   Romance                           70.00  2018
169  Paperback,– Abridged, Import     5.0 out of 5 stars  2 customer reviews    Action & Adventure               418.00  port
170  Paperback,– 10 Jan 2018          4.7 out of 5 stars  4 customer reviews    Sports                           395.00  2018
171  Paperback,– Apr 2011             4.0 out of 5 stars  197 customer reviews  Language, Linguistics & Writing  179.00  2011
172  Paperback,– 17 Feb 2009          5.0 out of 5 stars  2 customer reviews    Comics & Mangas                  782.00  2009
173  Paperback,– 22 Aug 2000          3.5 out of 5 stars  4 customer reviews    Language, Linguistics & Writing  475.44  2000
174  Paperback,– 5 Jan 2012           4.0 out of 5 stars  30 customer reviews   Humour                           403.00  2012
Suppose that in this DataFrame I want to delete the rows whose Edition_year value is NOT numeric, i.e. some of the values are strings. I have tried the .drop() method but cannot get the required output.
This is what I tried:
df = df.drop(df[df['Edition_year'].apply(lambda x: str(x).isalpha())].index, inplace = True)
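Note that drop(..., inplace=True) returns None, so assigning its result back to df discards the DataFrame. Instead, you can determine which Edition_year values are numeric using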
numeric_filter = df.Edition_year.astype(str).str.isnumeric()
and then use the filter to select only the desired rows
df = df.loc[numeric_filter]
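A related option, sketched here as an alternative rather than as the method above, is pd.to_numeric with errors="coerce", which turns anything unparseable into NaN. The small frame below is stand-in data modelled on the question's table, and the variable names are assumptions.

```python
import pandas as pd

# Stand-in data: one Edition_year value ("port") is not a year.
df = pd.DataFrame({
    "BookCategory": ["Sports", "Action & Adventure", "Humour"],
    "Edition_year": ["1996", "port", "2012"],
})

# Non-numeric strings become NaN instead of raising an error.
years = pd.to_numeric(df["Edition_year"], errors="coerce")

# Keep the rows that parsed, and store the year as an integer.
clean = df.loc[years.notna()].assign(
    Edition_year=lambda d: d["Edition_year"].astype(int)
)
print(clean)
```

Unlike str.isnumeric(), this approach would also accept values such as "2009.0", which may or may not be desirable depending on the data.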

Pandas Python - How to create new columns with MultiIndex from pivot table

I have created a pivot table with 2 different types of values i) Number of apples from 2017-2020, ii) Number of people from 2017-2020. I want to create additional columns to calculate iii) Apples per person from 2017-2020. How can I do so?
Current code for pivot table:
tdf = df.pivot_table(index="States",
                     columns="Year",
                     values=["Number of Apples", "Number of People"],
                     aggfunc=lambda x: len(x.unique()),
                     margins=True)
tdf
Here is my current pivot table:
Number of Apples Number of People
2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5
West Virginia 8 35 25 12 2 5 5 4
...
I want my pivot table to look like this, where I add additional columns to divide Number of Apples by Number of People.
Number of Apples Number of People Number of Apples per Person
2017 2018 2019 2020 2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5 5 6 5 5
West Virginia 8 35 25 12 2 5 5 4 4 7 5 3
I've tried a few things, such as:
Creating a new column by assigning to a new column name, which does not work with a multi-level column index: tdf["Number of Apples per Person"][2017] = tdf["Number of Apples"][2017] / tdf["Number of People"][2017]
The other assignment method, tdf.assign(tdf["Number of Apples per Person"][2017] = tdf["Enrollment ID"][2017] / tdf["Student ID"][2017]), which raised this error: SyntaxError: expression cannot contain assignment, perhaps you meant "=="?
Appreciate any help! Thanks
What you can do here is stack(), do your thing, and then unstack():
s = df.stack()
s['Number of Apples per Person'] = s['Number of Apples'] / s['Number of People']
df = s.unstack()
Output:
>>> df
Number of Apples Number of People Number of Apples per Person
2017 2018 2019 2020 2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5 5.0 6.0 5.0 5.0
West Virginia 8 35 25 12 2 5 5 4 4.0 7.0 5.0 3.0
One-liner:
df = df.stack().pipe(lambda x: x.assign(**{'Number of Apples per Person': x['Number of Apples'] / x['Number of People']})).unstack()
Given
df
Number of Apples Number of People
2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5
West Virginia 8 35 25 12 2 5 5 4
You can index on the first level to get sub-frames and then divide. The division will be auto-aligned on the columns.
df['Number of Apples'] / df['Number of People']
2017 2018 2019 2020
California 5.0 6.0 5.0 5.0
West Virginia 4.0 7.0 5.0 3.0
Append this back to your DataFrame:
pd.concat([df, pd.concat([df['Number of Apples'] / df['Number of People']], keys=['Result'], axis=1)], axis=1)
Number of Apples Number of People Result
2017 2018 2019 2020 2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5 5.0 6.0 5.0 5.0
West Virginia 8 35 25 12 2 5 5 4 4.0 7.0 5.0 3.0
This is fast since it is completely vectorized.
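For anyone who wants to run this end to end, here is a self-contained sketch that rebuilds the two-row frame shown above and appends the ratio under a new top-level label. The variable names ratio and result are assumptions made for the example.

```python
import pandas as pd

# Rebuild the example frame with two-level columns.
columns = pd.MultiIndex.from_product(
    [["Number of Apples", "Number of People"], [2017, 2018, 2019, 2020]]
)
df = pd.DataFrame(
    [[10, 18, 20, 25, 2, 3, 4, 5],
     [8, 35, 25, 12, 2, 5, 5, 4]],
    index=["California", "West Virginia"],
    columns=columns,
)

# Selecting a top-level label returns a sub-frame keyed by year,
# so the division aligns year by year.
ratio = df["Number of Apples"] / df["Number of People"]

# Wrap the ratio under a new top-level label and append it.
result = pd.concat(
    [df, pd.concat([ratio], keys=["Number of Apples per Person"], axis=1)],
    axis=1,
)
print(result)
```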

I need help plotting a bar graph from a dataframe

I have the following dataframe:
AQI Year City
0 349.407407 2015 'Patna'
1 297.024658 2015 'Delhi'
2 283.007605 2015 'Ahmedabad'
3 297.619178 2016 'Delhi'
4 282.717949 2016 'Ahmedabad'
5 250.528701 2016 'Patna'
6 379.753623 2017 'Ahmedabad'
7 325.652778 2017 'Patna'
8 281.401216 2017 'Gurugram'
9 443.053221 2018 'Ahmedabad'
10 248.367123 2018 'Delhi'
11 233.772603 2018 'Lucknow'
12 412.781250 2019 'Ahmedabad'
13 230.720548 2019 'Delhi'
14 217.626741 2019 'Patna'
15 214.681818 2020 'Ahmedabad'
16 181.672131 2020 'Delhi'
17 162.251366 2020 'Patna'
I would like to group data for each year, i.e. 2015, 2016, 2017 2018...2020 on the x axis, with AQI on the y axis. I am a newbie and please excuse the lack of depth in my question.
You can "pivot" your data to support your desired plotting output. Here we set the rows as Year, columns as City, and values as AQI.
pivot = pd.pivot_table(
    data=df,
    index='Year',
    columns='City',
    values='AQI',
)
Year  Ahmedabad   Delhi       Gurugram    Lucknow     Patna
2015  283.007605  297.024658  NaN         NaN         349.407407
2016  282.717949  297.619178  NaN         NaN         250.528701
2017  379.753623  NaN         281.401216  NaN         325.652778
2018  443.053221  248.367123  NaN         233.772603  NaN
2019  412.781250  230.720548  NaN         NaN         217.626741
2020  214.681818  181.672131  NaN         NaN         162.251366
Then you can plot this pivot table directly:
pivot.plot.bar(xlabel='Year', ylabel='AQI')
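The reshaping step can be checked on its own before plotting. This sketch uses only the 2015 and 2016 rows from the question, with the quotes around the city names removed, which is an assumption about how the data is actually stored.

```python
import pandas as pd

# First six rows of the question's data.
df = pd.DataFrame({
    "AQI": [349.407407, 297.024658, 283.007605,
            297.619178, 282.717949, 250.528701],
    "Year": [2015, 2015, 2015, 2016, 2016, 2016],
    "City": ["Patna", "Delhi", "Ahmedabad", "Delhi", "Ahmedabad", "Patna"],
})

# One row per year, one column per city; each cell holds that city's AQI.
pivot = df.pivot_table(index="Year", columns="City", values="AQI")
print(pivot)
```

Each row of the pivot becomes one group of bars on the x axis and each column becomes one bar within the group, which is why the plot call above produces the grouped layout.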
Old answer
Are you looking for the mean AQI per year? If so, you can do some pandas chaining, assuming your data is in a DataFrame df:
df.groupby('Year')['AQI'].mean().plot.bar(xlabel='Year', ylabel='AQI')

Python pandas str.extract year information from unclean column

I have a DataFrame with over 111K rows. I'm trying to extract year information(19**, 20**) from uncleaned column Date and fill year info into a new Result column, some rows in Date column contains Chinese/English words.
df.Date.str.extract('20\d{2}') | df.Date.str.extract('19\d{2}')
I used str.extract() to match and extract the year but I got the ValueError: pattern contains no capture groups message. How can I get the year information and fill into a new Result column?
Rating Date
7.8 (June 22, 2000)
8.0 01 April, 1997
8.3 01 December, 1988
7.7 01 November, 2005
7.9 UMl Reprint University Illinois 1966 Ed
7.7 出版日期:2008-06
7.3 出版时间:2009.04
7.7 台北 : 橡樹林文化, 2006.
7.0 机械工业出版社; 第1版 (2014年11月13日)
8.1 民国57年(1968)
7.8 民国79 [1990]
8.9 2010-09-13
9.3 01 (2008)
8.8 1998年4月第11次印刷
7.9 2000
7.3 2004
Sample dataframe:
Date
0 2000
1 1998年4月第11次印刷
2 01 November, 2005
3 出版日期:2008-06
4 (June 22, 2000)
You can also do it as a one liner:
df['Year'] = df.Date.str.extract(r'(19\d{2}|20\d{2})')
Output:
Date Year
2000 2000
1998年4月第11次印刷 1998
01 November, 2005 2005
出版日期:2008-06 2008
(June 22, 2000) 2000
The error says the regex must have at least one capturing group, that is, a sequence enclosed in a pair of parentheses.
In the solution I propose, I added a capturing group and two non-capturing ones. As you said the extracted data is then inserted into the Result column.
>>> df['Result'] = df.Date.str.extract(r'((?:19\d{2})|(?:20\d{2}))')
Rating Date Result
0 7.8 (June 22, 2000) 2000
1 8.0 01 April, 1997 1997
2 8.3 01 December, 1988 1988
3 7.7 01 November, 2005 2005
4 7.9 UMl Reprint University Illinois 1966 Ed 1966
5 7.7 出版日期:2008-06 2008
6 7.3 出版时间:2009.04 2009
7 7.7 台北 : 橡樹林文化, 2006. 2006
8 7.0 机械工业出版社; 第1版 (2014年11月13... 2014
9 8.1 民国57年(1968) 1968
10 7.8 民国79 [1990] 1990
11 8.9 2010-09-13 2010
12 9.3 01 (2008) 2008
13 8.8 1998年4月第11次印刷 1998
14 7.9 2000 2000
15 7.3 None NaN
The following should do the job in this case.
Just an example dataset:
>>> df
Date
0 2000
1 1998年4月第11次印刷
2 01 November, 2005
3 出版日期:2008-06
4 (June 22, 2000)
Solution:
>>> df.Date.str.extract(r'(\d{4})', expand=False)
0 2000
1 1998
2 2005
3 2008
4 2000
Or
>>> df['Year'] = df.Date.str.extract(r'(\d{4})', expand=False)
>>> df
Date Year
0 2000 2000
1 1998年4月第11次印刷 1998
2 01 November, 2005 2005
3 出版日期:2008-06 2008
4 (June 22, 2000) 2000
Another option is to use assign, which writes the extracted values to a new Year column.
>>> df = df.assign(Year = df.Date.str.extract(r'(\d{4})', expand=False))
>>> df
Date Year
0 2000 2000
1 1998年4月第11次印刷 1998
2 01 November, 2005 2005
3 出版日期:2008-06 2008
4 (June 22, 2000) 2000
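The two patterns used in these answers can behave differently. A bare \d{4} takes the first four consecutive digits it finds, while the 19xx/20xx pattern only accepts plausible years. The sketch below compares them; the last two rows and the column names Loose and Result are made-up additions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Date": [
    "2000",
    "1998年4月第11次印刷",
    "01 November, 2005",
    "出版日期:2008-06",
    "(June 22, 2000)",
    "No. 9787111, 2014",   # hypothetical row with a long number before the year
    "no year here",        # hypothetical row without any year
]})

# Any four digits: picks up "9787" from the long number.
df["Loose"] = df["Date"].str.extract(r"(\d{4})", expand=False)

# Only years starting with 19 or 20; rows without a match become NaN.
df["Result"] = df["Date"].str.extract(r"((?:19|20)\d{2})", expand=False)

print(df)
```

Neither pattern checks what surrounds the match, so a string such as "12019" would still yield "2019" with the stricter pattern; adding lookarounds like (?<!\d) and (?!\d) is one way to tighten it further.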
