Python pandas str.extract year information from unclean column

Python pandas str.extract year information from unclean column - python

I have a DataFrame with over 111K rows. I'm trying to extract year information(19**, 20**) from uncleaned column Date and fill year info into a new Result column, some rows in Date column contains Chinese/English words.
df.Date.str.extract('20\d{2}') | df.Date.str.extract('19\d{2}')
I used str.extract() to match and extract the year but I got the ValueError: pattern contains no capture groups message. How can I get the year information and fill into a new Result column?
Rating Date
7.8 (June 22, 2000)
8.0 01 April, 1997
8.3 01 December, 1988
7.7 01 November, 2005
7.9 UMl Reprint University Illinois 1966 Ed
7.7 出版日期：2008-06
7.3 出版时间：2009.04
7.7 台北 : 橡樹林文化, 2006.
7.0 机械工业出版社; 第1版 (2014年11月13日)
8.1 民国57年（1968）
7.8 民国79 [1990]
8.9 2010-09-13
9.3 01 (2008)
8.8 1998年4月第11次印刷
7.9 2000
7.3 2004

Sample dataframe:
Date
0 2000
1 1998年4月第11次印刷
2 01 November, 2005
3 出版日期：2008-06
4 (June 22, 2000)
You can also do it as a one liner:
df['Year'] = df.Date.str.extract(r'(19\d{2}|20\d{2})')
Output:
Date Year
2000 2000
1998年4月第11次印刷 1998
01 November, 2005 2005
出版日期：2008-06 2008
(June 22, 2000) 2000

The error says the regex must have at least one capturing group, that is a sequence between a pair of parethesis.
In the solution I propose, I added a capturing group and two non-capturing ones. As you said the extracted data is then inserted into the Result column.
>>> df['Result'] = df.Date.str.extract(r'((?:19\d{2})|(?:20\d{2}))')
Rating Date Result
0 7.8 (June 22, 2000) 2000
1 8.0 01 April, 1997 1997
2 8.3 01 December, 1988 1988
3 7.7 01 November, 2005 2005
4 7.9 UMl Reprint University Illinois 1966 Ed 1966
5 7.7 å‡ºç‰ˆæ—¥æœŸï¼š2008-06 2008
6 7.3 å‡ºç‰ˆæ—¶é—´ï¼š2009.04 2009
7 7.7 å�°åŒ— : æ©¡æ¨¹æž—æ–‡åŒ–, 2006. 2006
8 7.0 æœºæ¢°å·¥ä¸šå‡ºç‰ˆç¤¾; ç¬¬1ç‰ˆ (2014å¹´11æœˆ13... 2014
9 8.1 æ°‘å›½57å¹´ï¼ˆ1968ï¼‰ 1968
10 7.8 æ°‘å›½79 [1990] 1990
11 8.9 2010-09-13 2010
12 9.3 01 (2008) 2008
13 8.8 1998å¹´4æœˆç¬¬11æ¬¡å�°åˆ· 1998
14 7.9 2000 2000
15 7.3 None NaN

Below Should the Job For you in the given case.
Just an example dataset:
>>> df
Date
0 2000
1 1998年4月第11次印刷
2 01 November, 2005
3 出版日期：2008-06
4 (June 22, 2000)
Solution:
>>> df.Date.str.extract(r'(\d{4})', expand=False)
0 2000
1 1998
2 2005
3 2008
4 2000
Or
>>> df['Year'] = df.Date.str.extract(r'(\d{4})', expand=False)
>>> df
Date Year
0 2000 2000
1 1998年4月第11次印刷 1998
2 01 November, 2005 2005
3 出版日期：2008-06 2008
4 (June 22, 2000) 2000
Another trick using assign , assigning values back to the new column Year.
>>> df = df.assign(Year = df.Date.str.extract(r'(\d{4})', expand=False))
>>> df
Date Year
0 2000 2000
1 1998年4月第11次印刷 1998
2 01 November, 2005 2005
3 出版日期：2008-06 2008
4 (June 22, 2000) 2000

Related

Pandas Python - How to create new columns with MultiIndex from pivot table

I have created a pivot table with 2 different types of values i) Number of apples from 2017-2020, ii) Number of people from 2017-2020. I want to create additional columns to calculate iii) Apples per person from 2017-2020. How can I do so?
Current code for pivot table:
tdf = df.pivot_table(index="States",
columns="Year",
values=["Number of Apples","Number of People"],
aggfunc= lambda x: len(x.unique()),
margins=True)
tdf
Here is my current pivot table:
Number of Apples Number of People
2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5
West Virginia 8 35 25 12 2 5 5 4
...
I want my pivot table to look like this, where I add additional columns to divide Number of Apples by Number of People.
Number of Apples Number of People Number of Apples per Person
2017 2018 2019 2020 2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5 5 6 5 5
West Virginia 8 35 25 12 2 5 5 4 4 7 5 3
I've tried a few things, such as:
Creating a new column via assigning new column names, but does not work with multiple column index tdf["Number of Apples per Person"][2017] = tdf["Number of Apples"][2017] / tdf["Number of People"][2017]
Tried the other assignment method tdf.assign(tdf["Number of Apples per Person"][2017] = tdf["Enrollment ID"][2017] / tdf["Student ID"][2017]); got this error SyntaxError: expression cannot contain assignment, perhaps you meant "=="?
Appreciate any help! Thanks

What you can do here is stack(), do your thing, and then unstack():
s = df.stack()
s['Number of Apples per Person'] = s['Number of Apples'] / s['Number of People']
df = s.unstack()
Output:
>>> df
Number of Apples Number of People Number of Apples per Person
2017 2018 2019 2020 2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5 5.0 6.0 5.0 5.0
West Virginia 8 35 25 12 2 5 5 4 4.0 7.0 5.0 3.0
One-liner:
df = df.stack().pipe(lambda x: x.assign(**{'Number of Apples per Person': x['Number of Apples'] / x['Number of People']})).unstack()

Given
df
Number of Apples Number of People
2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5
West Virginia 8 35 25 12 2 5 5 4
You can index on the first level to get sub-frames and then divide. The division will be auto-aligned on the columns.
df['Number of Apples'] / df['Number of People']
2017 2018 2019 2020
California 5.0 6.0 5.0 5.0
West Virginia 4.0 7.0 5.0 3.0
Append this back to your DataFrame:
pd.concat([df, pd.concat([df['Number of Apples'] / df['Number of People']], keys=['Result'], axis=1)], axis=1)
Number of Apples Number of People Result
2017 2018 2019 2020 2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5 5.0 6.0 5.0 5.0
West Virginia 8 35 25 12 2 5 5 4 4.0 7.0 5.0 3.0
This is fast since it is completely vectorized.

I need help plotting a bar graph from a dataframe

I have the following dataframe:
AQI Year City
0 349.407407 2015 'Patna'
1 297.024658 2015 'Delhi'
2 283.007605 2015 'Ahmedabad'
3 297.619178 2016 'Delhi'
4 282.717949 2016 'Ahmedabad'
5 250.528701 2016 'Patna'
6 379.753623 2017 'Ahmedabad'
7 325.652778 2017 'Patna'
8 281.401216 2017 'Gurugram'
9 443.053221 2018 'Ahmedabad'
10 248.367123 2018 'Delhi'
11 233.772603 2018 'Lucknow'
12 412.781250 2019 'Ahmedabad'
13 230.720548 2019 'Delhi'
14 217.626741 2019 'Patna'
15 214.681818 2020 'Ahmedabad'
16 181.672131 2020 'Delhi'
17 162.251366 2020 'Patna'
I would like to group data for each year, i.e. 2015, 2016, 2017 2018...2020 on the x axis, with AQI on the y axis. I am a newbie and please excuse the lack of depth in my question.

You can "pivot" your data to support your desired plotting output. Here we set the rows as Year, columns as City, and values as AQI.
pivot = pd.pivot_table(
data=df,
index='Year',
columns='City',
values='AQI',
)
Year
Ahmedabad
Delhi
Gurugram
Lucknow
Patna
2015
283.007605
297.024658
NaN
NaN
349.407407
2016
282.717949
297.619178
NaN
NaN
250.528701
2017
379.753623
NaN
281.401216
NaN
325.652778
2018
443.053221
248.367123
NaN
233.772603
NaN
2019
412.781250
230.720548
NaN
NaN
217.626741
2020
214.681818
181.672131
NaN
NaN
162.251366
Then you can plot this pivot table directly:
pivot.plot.bar(xlabel='Year', ylabel='AQI')
Old answer
Are you looking for the mean AQI per year? If so, you can do some pandas chaining, assuming your data is in a DataFrame df:
df.groupby('Year').mean().plot.bar(xlabel='Year', ylabel='AQI')

Scraping of Web Page Tables using Beautiful Soup Python

I am trying to do a web scraping of table and its content from an Apple wikipedia page. I am using Beautiful Soup to extract the data. I have the following code:
from bs4 import BeautifulSoup
appleurl="https://en.m.wikipedia.org/wiki/Timeline_of_Apple_Inc._products"
import requests
import pandas as pad
import lxml.html as html
_content = requests.get(appleurl)
soup = BeautifulSoup(_content.content)
_table = soup.findChildren('table')
rows = _table[0].findChildren(['th','tr'])
for row in rows:
cells = row.findChildren('td')
for cell in cells:
value = cell.string
print ("The value in this cell is %s"% value)
I am having the following values:
The value in this cell is 1976
The value in this cell is April 11
The value in this cell is Apple I
The value in this cell is Apple I
The value in this cell is September 1, 1977
The value in this cell is 1977
The value in this cell is April 1
The value in this cell is Apple II
The value in this cell is Apple II
The value in this cell is June 1, 1979
The value in this cell is 1978
The value in this cell is June 1
The value in this cell is Disk II
The value in this cell is Drives
The value in this cell is May 1, 1984
The value in this cell is 1979
The value in this cell is June 1
The value in this cell is Apple II Plus
The value in this cell is Apple II series
The value in this cell is December 1, 1982
The value in this cell is None
The value in this cell is None
The value in this cell is None
The value in this cell is Bell & Howell Disk II
The value in this cell is None
The value in this cell is Apple SilenType
The value in this cell is Printers
The value in this cell is October 1, 1982
The problem is that for the year 1979 the number of models are multiple which is not being extracted in my case. I need all the models for the year 1979. The code I have can extracting perfectly fine if there is a single row for each year. What shall I do if there are multiple rows for a single year as in the first table of the link I provided.
The values I need is Year, Release Date, Model. The other two columns can be eliminated.
I will really appreciate the help.

Yo can simply use pandas to do that.use pad.read_html()
import pandas as pad
df=pad.read_html('https://en.m.wikipedia.org/wiki/Timeline_of_Apple_Inc._products')[0]
print(pd.concat([df['Year'],df['Release Date'],df['Model']], axis=1, sort=False))
Output:
Year Release Date Model
0 1976 April 11 Apple I
1 1977 April 1 Apple II
2 1978 June 1 Disk II
3 1979 June 1 Apple II Plus
4 1979 June 1 Apple II EuroPlus
5 1979 June 1 Apple II J-Plus
6 1979 June 1 Bell & Howell
7 1979 June 1 Bell & Howell Disk II
8 1979 June 1 Apple SilenType
Update for all tables.
import pandas as pad
dfs=pad.read_html('https://en.m.wikipedia.org/wiki/Timeline_of_Apple_Inc._products')
for df in dfs:
print(pd.concat([df['Year'],df['Release Date'],df['Model']], axis=1, sort=False))
If you would like to do it in single dataframe then use this code.
import pandas as pad
dfs=pad.read_html('https://en.m.wikipedia.org/wiki/Timeline_of_Apple_Inc._products')
dffinal=pd.DataFrame()
for df in dfs:
df1=pd.concat([df['Year'],df['Release Date'],df['Model']], axis=1, sort=False)
dffinal = dffinal.append(df1, ignore_index=True)
print(dffinal)
Output:
Year Release Date Model
0 1976 April 11 Apple I
1 1977 April 1 Apple II
2 1978 June 1 Disk II
3 1979 June 1 Apple II Plus
4 1979 June 1 Apple II EuroPlus
5 1979 June 1 Apple II J-Plus
6 1979 June 1 Bell & Howell
7 1979 June 1 Bell & Howell Disk II
8 1979 June 1 Apple SilenType
9 1980 September 1 Apple III
10 1980 September 1 Modem IIB (Novation CAT)
11 1980 September 1 Printer IIA (Centronics 779)
12 1980 September 1 Monitor III
13 1980 September 1 Monitor II (various third party)
14 1980 September 1 Disk III
15 1981 September 1 Apple ProFile
16 1981 December 1 Apple III Revised[1]
17 1982 October 1 Apple Dot Matrix Printer
18 1982 October 1 Apple Daisy Wheel Printer
19 1983 January 1 Apple IIe
20 1983 January 1 Apple Lisa[2]
21 1983 December 1 Apple III Plus
22 1983 December 1 Apple ImageWriter
23 1984 January 1 Apple Lisa 2
24 1984 January 24 Macintosh (128K)
25 1984 January 24 Macintosh External Disk Drive (400K)
26 1984 January 24 Apple Modem 300
27 1984 January 24 Apple Modem 1200
28 1984 April 1 Apple IIc
29 1984 April 1 Apple Scribe Printer
.. ... ... ...
606 2019 March 18 iPad Mini (5th gen)
607 2019 March 19 iMac with Retina 4K display (21.5") (Early 2019)
608 2019 March 19 iMac with Retina 5K display (27") (Early 2019)
609 2019 March 20 AirPods (2nd gen)
610 2019 May 21 MacBook Pro with Touch Bar (4th gen) (13") (Mi...
611 2019 May 21 MacBook Pro with Touch Bar (4th gen) (15") (Mi...
612 2019 May 28 iPod Touch (7th gen)
613 2019 July 9 MacBook Air (13") (2019)
614 2019 July 9 Macbook Pro with Touch Bar (4th gen) (13") (Mi...
615 2019 September 20 Apple Watch Series 5
616 2019 September 20 Apple Watch Hermès Series 5
617 2019 September 20 Apple Watch Nike Series 5
618 2019 September 20 Apple Watch Edition Series 5
619 2019 September 20 iPhone 8 (128 GB)
620 2019 September 20 iPhone 8 Plus (128 GB)
621 2019 September 20 iPhone 11
622 2019 September 20 iPhone 11 Pro
623 2019 September 20 iPhone 11 Pro Max
624 2019 September 25 iPad (2019)
625 2019 October 30 AirPods Pro
626 2019 November 13 MacBook Pro with Touch Bar (16")
627 2019 December 10 Mac Pro (Late 2019)
628 2019 December 10 Pro Display XDR
629 2020 March 18 NaN
630 2020 March 18 iPad Pro (11") (2nd gen)
631 2020 March 18 iPad Pro (12.9") (4th gen)
632 2020 March 18 Magic Keyboard for iPad Pro
633 2020 March 18 MacBook Air (Early 2020)
634 2020 April 24 iPhone SE (2nd gen)
635 2020 May 4 MacBook Pro with Magic Keyboard (Mid 2020)
[636 rows x 3 columns]

How replaces values from a column in a Panel Data with values from a list in Python?

I have a database in panel data form:
Date id variable1 variable2
2015 1 10 200
2016 1 17 300
2017 1 8 400
2018 1 11 500
2015 2 12 150
2016 2 19 350
2017 2 15 250
2018 2 9 450
2015 3 20 100
2016 3 8 220
2017 3 12 310
2018 3 14 350
And I have a list with the labels of the ID
List = ['Argentina', 'Brazil','Chile']
I want to replace values of id with labels from my list.
Thanks in advance
Date id variable1 variable2
2015 Argentina 10 200
2016 Argentina 17 300
2017 Argentina 8 400
2018 Argentina 11 500
2015 Brazil 12 150
2016 Brazil 19 350
2017 Brazil 15 250
2018 Brazil 9 450
2015 Chile 20 100
2016 Chile 8 220
2017 Chile 12 310
2018 Chile 14 350

map is the way to go, with enumerate:
d = {k:v for k,v in enumerate(List, start=1)}
df['id'] = df['id'].map(d)
Output:
Date id variable1 variable2
0 2015 Argentina 10 200
1 2016 Argentina 17 300
2 2017 Argentina 8 400
3 2018 Argentina 11 500
4 2015 Brazil 12 150
5 2016 Brazil 19 350
6 2017 Brazil 15 250
7 2018 Brazil 9 450
8 2015 Chile 20 100
9 2016 Chile 8 220
10 2017 Chile 12 310
11 2018 Chile 14 350

Try
df['id'] = df['id'].map({1: 'Argentina', 2: 'Brazil', 3: 'Chile'})
or
df['id'] = df['id'].map({k+1: v for k, v in enumerate(List)})

Adding columns of different length into pandas dataframe

I have a dataframe detailing money awarded to people over several years:
Name -- Money -- Year
Paul 57.00 2012
Susan 67.00 2012
Gary 54.00 2011
Paul 77.00 2011
Andrea 20.00 2011
Albert 23.00 2011
Hal 26.00 2010
Paul 23.00 2010
From this dataframe, I want to construct a dataframe that details all the money awarded in a single year, for making a boxplot:
2012 -- 2011 -- 2010
57.00 54.00 26.00
67.00 77.00 23.00
20.00
23.00
So you see this results in columns of different length. When I try to do this using pandas, I get the error 'ValueError: Length of values does not match length of index'. I assume this is because I can't add varying length columns to a dataframe.
Can anyone offer some advice on how to proceed? Perhap I'm approaching this incorrectly? Thanks for any help!

I'd do this in a two-step process: first add a column corresponding to the index in each year using cumcount, and then pivot so that the new column is the index, the years become the columns, and the money column becomes the values:
df["yindex"] = df.groupby("Year").cumcount()
new_df = df.pivot(index="yindex", columns="Year", values="Money")
For example:
>>> df = pd.read_csv("money.txt", sep="\s+")
>>> df
Name Money Year
0 Paul 57 2012
1 Susan 67 2012
2 Gary 54 2011
3 Paul 77 2011
4 Andrea 20 2011
5 Albert 23 2011
6 Hal 26 2010
7 Paul 23 2010
>>> df["yindex"] = df.groupby("Year").cumcount()
>>> df
Name Money Year yindex
0 Paul 57 2012 0
1 Susan 67 2012 1
2 Gary 54 2011 0
3 Paul 77 2011 1
4 Andrea 20 2011 2
5 Albert 23 2011 3
6 Hal 26 2010 0
7 Paul 23 2010 1
>>> df.pivot(index="yindex", columns="Year", values="Money")
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 NaN 20 NaN
3 NaN 23 NaN
After which you could get rid of the NaNs if you like, but it depends on whether you want to distinguish between cases like "knowing the value is 0" and "not knowing what the value is":
>>> df.pivot(index="yindex", columns="Year", values="Money").fillna(0)
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 0 20 0
3 0 23 0

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python pandas str.extract year information from unclean column - python

Related

Pandas Python - How to create new columns with MultiIndex from pivot table

I need help plotting a bar graph from a dataframe

Scraping of Web Page Tables using Beautiful Soup Python

How replaces values from a column in a Panel Data with values from a list in Python?

Adding columns of different length into pandas dataframe

Categories

Resources