Extract minimum and maximum year from string in Pandas DataFrame - python

I have a CSV file that I read into a Pandas DataFrame that contains a column with multiple year values separated by a semicolon.
I need to extract the minimum and maximum value from the string and save each in a new column.
I am able to print the minimum and maximum but I can't seem to get the correct values from each row saved into a new column.
Any help is much appreciated.
Sample DataFrame:
import pandas as pd
import numpy as np
raw_data = {'id': ['1473-2262', '2327-9214', '1949-8349', '2375-6314',
                   '0095-6562'],
            'years': ['2000; 2001; 2002; 2003; 2004; 2004; 2004; 2005',
                      '2003; 2004; 2005', '2015', np.nan, '2012; 2014']}
df = pd.DataFrame(raw_data, columns=['id', 'years'])
This is the DataFrame that I need:
id years minyear maxyear
0 1473-2262 2000; 2001; 2002; 2003; 2004; 2004; 2004; 2005 2000.0 2005.0
1 2327-9214 2003; 2004; 2005 2003.0 2005.0
2 1949-8349 2015 2015.0 2015.0
3 2375-6314 NaN NaN NaN
4 0095-6562 2012; 2014 2012.0 2014.0
I can print the minimum and maximum:
x = df['years'].notnull()
for row in df['years'][x].str.split(pat=';'):
    lst = list()
    for item in row:
        lst.append(int(item))
    print('Min=', min(lst), 'Max=', max(lst))
Min= 2000 Max= 2005
Min= 2003 Max= 2005
Min= 2015 Max= 2015
Min= 2012 Max= 2014
Here's how I've tried to capture the values to new columns:
x = df['years'].notnull()
for row in df['years'][x].str.split(pat=';'):
    lst = list()
    for item in row:
        lst.append(int(item))
    df['minyear'] = min(lst)
    df['maxyear'] = max(lst)
Only the values from the last row are saved to the new columns.
id years minyear maxyear
0 1473-2262 2000; 2001; 2002; 2003; 2004; 2004; 2004; 2005 2012 2014
1 2327-9214 2003; 2004; 2005 2012 2014
2 1949-8349 2015 2012 2014
3 2375-6314 NaN 2012 2014
4 0095-6562 2012; 2014 2012 2014

I think you need str.split with expand=True to get a new DataFrame, then cast it to float.
The index values are the same, so you can assign the new columns directly:
df1 = df['years'].str.split('; ', expand=True).astype(float)
df = df.assign(maxyear=df1.max(axis=1),minyear=df1.min(axis=1))
#same as
#df['maxyear'], df['minyear'] = df1.max(axis=1), df1.min(axis=1)
print (df)
id years maxyear minyear
0 1473-2262 2000; 2001; 2002; 2003; 2004; 2004; 2004; 2005 2005.0 2000.0
1 2327-9214 2003; 2004; 2005 2005.0 2003.0
2 1949-8349 2015 2015.0 2015.0
3 2375-6314 NaN NaN NaN
4 0095-6562 2012; 2014 2014.0 2012.0
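For reference, the loop in the question only keeps the last row's values because assigning a scalar such as min(lst) to df['minyear'] broadcasts it to every row, so each iteration overwrites the whole column. A minimal fix in the same loop style (a sketch, using the sample df above) is to collect one value per row and assign the lists once:
mins, maxs = [], []
for val in df['years']:
    if pd.isna(val):
        # keep rows with missing 'years' as NaN
        mins.append(np.nan)
        maxs.append(np.nan)
    else:
        years = [int(y) for y in val.split(';')]
        mins.append(min(years))
        maxs.append(max(years))
df['minyear'] = mins
df['maxyear'] = maxs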

A solution similar to the one proposed by jezrael, but using a conversion to a Series. Warning: This solution does not scale well.
years = df.years.str.split(";").apply(pd.Series).astype(float)
#0 1 2 3 4 5 6 7
#0 2000.0 2001.0 2002.0 2003.0 2004.0 2004.0 2004.0 2005.0
#1 2003.0 2004.0 2005.0 NaN NaN NaN NaN NaN
#2 2015.0 NaN NaN NaN NaN NaN NaN NaN
#3 NaN NaN NaN NaN NaN NaN NaN NaN
#4 2012.0 2014.0 NaN NaN NaN NaN NaN NaN
df['maxyear'], df['minyear'] = years.max(axis=1), years.min(axis=1)

Related

Set "Year" column to individual columns to create a panel

I am trying to reshape the following dataframe into panel data form by moving the "Year" column so that each year becomes an individual column.
Out[34]:
Award Year 0
State
Alabama 2003 89
Alabama 2004 92
Alabama 2005 108
Alabama 2006 81
Alabama 2007 71
... ...
Wyoming 2011 4
Wyoming 2012 2
Wyoming 2013 1
Wyoming 2014 4
Wyoming 2015 3
[648 rows x 2 columns]
I want each year to be an individual column; this is an example:
Out[48]:
State 2003 2004 2005 2006
0 NewYork 10 10 10 10
1 Alabama 15 15 15 15
2 Washington 20 20 20 20
I have read up on stack/unstack but I don't think I want a multilevel index as a result. I have been looking through the documentation for to_frame etc. but I can't see what I am looking for.
If anyone can help that would be great!
Use set_index with append=True then select the column 0 and use unstack to reshape:
df = df.set_index('Award Year', append=True)['0'].unstack()
Result:
Award Year 2003 2004 2005 2006 2007 2011 2012 2013 2014 2015
State
Alabama 89.0 92.0 108.0 81.0 71.0 NaN NaN NaN NaN NaN
Wyoming NaN NaN NaN NaN NaN 4.0 2.0 1.0 4.0 3.0
Pivot Table can help.
df2 = pd.pivot_table(df, values='0', columns='Award Year', index=['State'])
df2
Result:
Award Year 2003 2004 2005 2006 2007 2011 2012 2013 2014 2015
State
Alabama 89.0 92.0 108.0 81.0 71.0 NaN NaN NaN NaN NaN
Wyoming NaN NaN NaN NaN NaN 4.0 2.0 1.0 4.0 3.0
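Note that pivot_table aggregates duplicate (State, year) pairs with the mean by default. If the awards should be summed instead, pass aggfunc explicitly (a sketch under the same assumptions as the call above):
df2 = pd.pivot_table(df, values='0', columns='Award Year', index=['State'], aggfunc='sum')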

get element in column that preceded new column

I'm trying to collect the items from column 'data' that immediately preceded the values I collected in column 'min', and put them in a new column.
Here is the data (importing with pd.read_csv):
time,data
12/15/18 01:10 AM,130352.146180556
12/16/18 01:45 AM,130355.219097222
12/17/18 01:47 AM,130358.223263889
12/18/18 02:15 AM,130361.281701389
12/19/18 03:15 AM,130364.406597222
12/20/18 03:25 AM,130352.427430556
12/21/18 03:27 AM,130355.431597222
12/22/18 05:18 AM,130358.663541667
12/23/18 06:44 AM,130361.842430556
12/24/18 07:19 AM,130364.915243056
12/25/18 07:33 AM,130352.944409722
12/26/18 07:50 AM,130355.979826389
12/27/18 09:13 AM,130359.153472222
12/28/18 11:53 AM,130362.4871875
12/29/18 01:23 PM,130365.673263889
12/30/18 02:17 PM,130353.785763889
12/31/18 02:23 PM,130356.798263889
01/01/19 04:41 PM,130360.085763889
01/02/19 05:01 PM,130363.128125
and my code:
import pandas as pd
import numpy as np
from scipy import signal
from scipy.signal import argrelextrema
import datetime
diff=pd.DataFrame()
df=pd.read_csv('saw_data2.csv')
df['time']=pd.to_datetime(df['time'])
print(df.head())
n=2 # number of points to be checked before and after
# Find local peaks
df['min'] = df.iloc[argrelextrema(df.data.values, np.less_equal, order=n)[0]]['data']
If you plot the data, you'll see it is similar to a sawtooth. For each value I captured in 'min', the element just before it in 'data' is the one I want to put in a new column df['new_col'].
I've tried many things like,
df['new_col']=df.index.get_loc(df['min'].df['data'])
and,
df['new_col']=df['min'].shift() #obviously wrong
IIUC, you can do the shift before selecting the rows with a value in min:
df['new_col'] = df.shift().loc[df['min'].notna(), 'data']
print (df)
time data min new_col
0 12/15/18 01:10 AM 130352.146181 130352.146181 NaN
1 12/16/18 01:45 AM 130355.219097 NaN NaN
2 12/17/18 01:47 AM 130358.223264 NaN NaN
3 12/18/18 02:15 AM 130361.281701 NaN NaN
4 12/19/18 03:15 AM 130364.406597 NaN NaN
5 12/20/18 03:25 AM 130352.427431 130352.427431 130364.406597
6 12/21/18 03:27 AM 130355.431597 NaN NaN
7 12/22/18 05:18 AM 130358.663542 NaN NaN
8 12/23/18 06:44 AM 130361.842431 NaN NaN
9 12/24/18 07:19 AM 130364.915243 NaN NaN
10 12/25/18 07:33 AM 130352.944410 130352.944410 130364.915243
11 12/26/18 07:50 AM 130355.979826 NaN NaN
12 12/27/18 09:13 AM 130359.153472 NaN NaN
13 12/28/18 11:53 AM 130362.487187 NaN NaN
14 12/29/18 01:23 PM 130365.673264 NaN NaN
15 12/30/18 02:17 PM 130353.785764 130353.785764 130365.673264
16 12/31/18 02:23 PM 130356.798264 NaN NaN
17 01/01/19 04:41 PM 130360.085764 NaN NaN
18 01/02/19 05:01 PM 130363.128125 NaN NaN
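The same result can also be reached positionally (a sketch, not from the answer above): find the integer positions where 'min' is set and take 'data' one position earlier:
pos = np.flatnonzero(df['min'].notna())
pos = pos[pos > 0]  # the first row has no predecessor, so it stays NaN
df.loc[df.index[pos], 'new_col'] = df['data'].to_numpy()[pos - 1]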

ValueError: Data overlaps. in python

I have a dataframe df3 that looks like this, with an unknown number of columns, since the AAA_??? columns can be anything from the dataset:
Date ID Calendar_Year Month DayName... AAA_1E AAA_BMITH AAA_4.1 AAA_CH
0 2019-09-17 8661 2019 Sep Sun... NaN NaN NaN NaN
1 2019-09-18 8662 2019 Sep Sun... 1.0 3.0 34.0 1.0
2 2019-09-19 8663 2019 Sep Sun... NaN NaN NaN NaN
3 2019-09-20 8664 2019 Sep Mon... NaN NaN NaN NaN
4 2019-09-20 8664 2019 Sep Mon... 2.0 4.0 32.0 3.0
5 2019-09-20 8664 2019 Sep Sat... NaN NaN NaN NaN
6 2019-09-20 8664 2019 Sep Sat... NaN NaN NaN NaN
7 2019-09-20 8664 2019 Sep Sat... 0.0 4.0 30.0 0.0
and another dataframe dfMeans that holds the means of a third dataframe:
Month Dayname ID ... AAA_BMITH AAA_4.1 AAA_CH
0 Jan Thu 7686.500000 ... 0.000000 28.045455 0.0
1 Jan Fri 7636.272727 ... 0.000000 28.136364 0.0
2 Jan Sat 7637.272727 ... 0.000000 27.045455 0.0
3 Jan Sun 7670.090909 ... 0.000000 27.090909 0.0
4 Jan Mon 7702.909091 ... 0.000000 27.727273 0.0
5 Jan Tue 7734.260870 ... 0.000000 27.956522 0.0
The dataframes will be joined on Month and Dayname.
I want to replace the NaNs in df3 with values from dfMeans using this line:
df3.update(dfMeans, overwrite=False, errors="raise")
but I get this error
raise ValueError("Data overlaps.")
ValueError: Data overlaps.
How can I update the NaNs with values from dfMeans and avoid this error?
Edit:
I have put all dataframes in one dataframe df
Month Dayname ID ... AAA_BMITH AAA_4.1 AAA_CH
0 Jan Thu 7686.500000 ... 0.000000 28.045455 0.0
1 Jan Fri 7636.272727 ... 0.000000 28.136364 0.0
2 Jan Sat 7637.272727 ... 0.000000 27.045455 0.0
3 Jan Sun 7670.090909 ... 0.000000 27.090909 0.0
4 Jan Mon 7702.909091 ... 0.000000 27.727273 0.0
5 Jan Tue 7734.260870 ... 0.000000 27.956522 0.0
How can I fill NaNs with average based on Month and Dayname?
Using fillna:
Data:
Date ID Calendar_Year Month Dayname AAA_1E AAA_BMITH AAA_4.1 AAA_CH
2019-09-17 8661 2019 Jan Sun NaN NaN NaN NaN
2019-09-18 8662 2019 Jan Sun 1.0 3.0 34.0 1.0
2019-09-19 8663 2019 Jan Sun NaN NaN NaN NaN
2019-09-20 8664 2019 Jan Mon NaN NaN NaN NaN
2019-09-20 8664 2019 Jan Mon 2.0 4.0 32.0 3.0
2019-09-20 8664 2019 Jan Sat NaN NaN NaN NaN
2019-09-20 8664 2019 Jan Sat NaN NaN NaN NaN
2019-09-20 8664 2019 Jan Sat 0.0 4.0 30.0 0.0
df.set_index(['Month', 'Dayname'], inplace=True)
df_mean:
Month Dayname ID AAA_BMITH AAA_4.1 AAA_CH
Jan Thu 7686.500000 0.0 28.045455 0.0
Jan Fri 7636.272727 0.0 28.136364 0.0
Jan Sat 7637.272727 0.0 27.045455 0.0
Jan Sun 7670.090909 0.0 27.090909 0.0
Jan Mon 7702.909091 0.0 27.727273 0.0
Jan Tue 7734.260870 0.0 27.956522 0.0
df_mean.set_index(['Month', 'Dayname'], inplace=True)
Update df:
This operation is based on matching index values.
It doesn't work with multiple column names at once, so you'll have to get the columns of interest and iterate through them.
Note that AAA_1E isn't in df_mean.
for col in df.columns:
    if col in df_mean.columns:
        df[col].fillna(df_mean[col], inplace=True)
You can groupby on 'Month' and 'DayName' and use apply to edit the dataframe.
Use fillna to fill the NaN values. fillna accepts a dictionary as its value parameter: the keys are column names and the values are scalars, which are used to substitute the NaN in each column. With loc you can select the proper value from dfMeans.
You can create the dictionary with a dict comprehension, using the intersection between columns of df3 and dfMeans.
All this corresponds to the following statement:
df3filled = df3.groupby(['Month', 'DayName']).apply(lambda x: x.fillna(
    {col: dfMeans.loc[(dfMeans['Month'] == x.name[0]) & (dfMeans['Dayname'] == x.name[1]), col].iloc[0]
     for col in x.columns.intersection(dfMeans.columns)})).reset_index(drop=True)
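If, as in the edit above, the raw values and the rows to fill now live in a single dataframe df with 'Month' and 'Dayname' columns, an alternative sketch is to fill each AAA_ column with its group mean computed from df itself:
# assumes the value columns share the AAA_ prefix and that the fill values
# should be the per-(Month, Dayname) means of the non-missing rows in df
value_cols = [c for c in df.columns if c.startswith('AAA_')]
df[value_cols] = df[value_cols].fillna(
    df.groupby(['Month', 'Dayname'])[value_cols].transform('mean'))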

Dataframe shift moving data into random columns?

I'm using code to shift time series data that looks somewhat similar to this:
Year Player PTSN AVGN
2018 Aaron Donald 280.60 17.538
2018 J.J. Watt 259.80 16.238
2018 Danielle Hunter 237.60 14.850
2017 Aaron Donald 181.0 12.929
2016 Danielle Hunter 204.6 12.788
with the intent of getting it into something like this:
AVGN PTSN AVGN_prev PTSN_prev
Player Year
Aaron Donald 2016 NaN NaN NaN NaN
2017 12.929 181.0 NaN NaN
2018 17.538 280.6 12.929 181.0
Danielle Hunter 2016 12.788 204.6 NaN NaN
2017 8.325 133.2 12.788 204.6
2018 14.850 237.6 8.325 133.2
J.J. Watt 2016 NaN NaN NaN NaN
2017 NaN NaN NaN NaN
2018 16.238 259.8 NaN NaN
I'm using this code to make that happen:
res = df.set_index(['player', 'Year'])
idx = pd.MultiIndex.from_product([df['player'].unique(),
                                  df['Year'].unique()],
                                 names=['Player', 'Year'])
res = res.groupby(['player', 'Year']).apply(sum)
res = res.reindex(idx).sort_index()
res[columns] = res.groupby('Player')[list(res.columns)].shift(1)
with the addition of a groupby.sum() because some players in the dataframe moved from one team to another within the same season and I want to combine those numbers. However, the data I have is actually coming out extremely wrong. The data has too many columns to post, but it seems like the data from the previous year (_prev) is placed into random columns. It doesn't change and will always place it into the same wrong columns. Is this an issue caused by the groupby.sum()? Is it because I'm using a columns variable (containing all the same names as res.columns with '_prev' attached to them) and a list(res.columns)? And regardless of which it is, how do I solve this?
Here are the outputs of columns and res.columns:
columns:
['player_id_prev', 'position_prev', 'player_game_count_prev', 'team_name_prev', 'snap_counts_total_prev', 'snap_counts_pass_rush_prev', 'snap_counts_run_defense_prev', 'snap_counts_coverage_prev', 'grades_defense_prev', 'grades_run_defense_prev', 'grades_tackle_prev', 'grades_pass_rush_defense_prev', 'grades_coverage_defense_prev', 'total_pressures_prev', 'sacks_prev', 'hits_prev', 'hurries_prev', 'batted_passes_prev', 'tackles_prev', 'assists_prev', 'missed_tackles_prev', 'stops_prev', 'forced_fumbles_prev', 'targets_prev', 'receptions_prev', 'yards_prev', 'yards_per_reception_prev', 'yards_after_catch_prev', 'longest_prev', 'touchdowns_prev', 'interceptions_prev', 'pass_break_ups_prev', 'qb_rating_against_prev', 'penalties_prev', 'declined_penalties_prev']
res.columns:
['player_id', 'position', 'player_game_count', 'team_name',
'snap_counts_total', 'snap_counts_pass_rush', 'snap_counts_run_defense',
'snap_counts_coverage', 'grades_defense', 'grades_run_defense',
'grades_tackle', 'grades_pass_rush_defense', 'grades_coverage_defense',
'total_pressures', 'sacks', 'hits', 'hurries', 'batted_passes',
'tackles', 'assists', 'missed_tackles', 'stops', 'forced_fumbles',
'targets', 'receptions', 'yards', 'yards_per_reception',
'yards_after_catch', 'longest', 'touchdowns', 'interceptions',
'pass_break_ups', 'qb_rating_against', 'penalties',
'declined_penalties']
both are length 35 when tested.
I suggest using:
#first aggregate for unique MultiIndex
res = df.groupby(['Player', 'Year']).sum()
#MultiIndex
idx = pd.MultiIndex.from_product(res.index.levels,
                                 names=['Player', 'Year'])
#add new missing years
res = res.reindex(idx).sort_index()
#shift all columns, add suffix and join to original
res = res.join(res.groupby('Player').shift().add_suffix('_prev'))
print (res)
PTSN AVGN PTSN_prev AVGN_prev
Player Year
Aaron Donald 2016 NaN NaN NaN NaN
2017 181.0 12.929 NaN NaN
2018 280.6 17.538 181.0 12.929
Danielle Hunter 2016 204.6 12.788 NaN NaN
2017 NaN NaN 204.6 12.788
2018 237.6 14.850 NaN NaN
J.J. Watt 2016 NaN NaN NaN NaN
2017 NaN NaN NaN NaN
2018 259.8 16.238 NaN NaN

pandas DataFrame .stack(dropna=False) but keeping existing combinations of levels

My data looks like this
import numpy as np
import pandas as pd
# My Data
enroll_year = np.arange(2010, 2015)
grad_year = enroll_year + 4
n_students = [[100, 100, 110, 110, np.nan]]
df = pd.DataFrame(
    n_students,
    columns=pd.MultiIndex.from_arrays(
        [enroll_year, grad_year],
        names=['enroll_year', 'grad_year']))
print(df)
# enroll_year 2010 2011 2012 2013 2014
# grad_year 2014 2015 2016 2017 2018
# 0 100 100 110 110 NaN
What I am trying to do is to stack the data, with one column/index level for the year of enrollment, one for the year of graduation, and one for the number of students, which should look like
# enroll_year grad_year n
# 2010 2014 100.0
# . . .
# . . .
# . . .
# 2014 2018 NaN
The data produced by .stack() is very close, but the missing record is dropped:
df1 = df.stack(['enroll_year', 'grad_year'])
df1.index = df1.index.droplevel(0)
print(df1)
# enroll_year grad_year
# 2010 2014 100.0
# 2011 2015 100.0
# 2012 2016 110.0
# 2013 2017 110.0
# dtype: float64
So I tried .stack(dropna=False), but it expands the index levels to all combinations of enrollment and graduation years:
df2 = df.stack(['enroll_year', 'grad_year'], dropna=False)
df2.index = df2.index.droplevel(0)
print(df2)
# enroll_year grad_year
# 2010 2014 100.0
# 2015 NaN
# 2016 NaN
# 2017 NaN
# 2018 NaN
# 2011 2014 NaN
# 2015 100.0
# 2016 NaN
# 2017 NaN
# 2018 NaN
# 2012 2014 NaN
# 2015 NaN
# 2016 110.0
# 2017 NaN
# 2018 NaN
# 2013 2014 NaN
# 2015 NaN
# 2016 NaN
# 2017 110.0
# 2018 NaN
# 2014 2014 NaN
# 2015 NaN
# 2016 NaN
# 2017 NaN
# 2018 NaN
# dtype: float64
And I need to subset df2 to get my desired data set.
existing_combn = list(zip(
    df.columns.levels[0][df.columns.labels[0]],
    df.columns.levels[1][df.columns.labels[1]]))
df3 = df2.loc[existing_combn]
print(df3)
# enroll_year grad_year
# 2010 2014 100.0
# 2011 2015 100.0
# 2012 2016 110.0
# 2013 2017 110.0
# 2014 2018 NaN
# dtype: float64
Although it only adds a few extra lines to my code, I wonder if there are any better and neater approaches.
Use unstack with pd.DataFrame, then reset_index, drop the unnecessary column, and rename the value column:
pd.DataFrame(df.unstack()).reset_index().drop('level_2',axis=1).rename(columns={0:'n'})
enroll_year grad_year n
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
Or:
df.unstack().reset_index(level=2, drop=True)
enroll_year grad_year
2010 2014 100.0
2011 2015 100.0
2012 2016 110.0
2013 2017 110.0
2014 2018 NaN
dtype: float64
Or:
df.unstack().reset_index(level=2, drop=True).reset_index().rename(columns={0:'n'})
enroll_year grad_year n
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
Explanation:
print(pd.DataFrame(df.unstack()))
0
enroll_year grad_year
2010 2014 0 100.0
2011 2015 0 100.0
2012 2016 0 110.0
2013 2017 0 110.0
2014 2018 0 NaN
print(pd.DataFrame(df.unstack()).reset_index().drop('level_2',axis=1))
enroll_year grad_year 0
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
print(pd.DataFrame(df.unstack()).reset_index().drop('level_2',axis=1).rename(columns={0:'n'}))
enroll_year grad_year n
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
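A shorter alternative not used above (a sketch): because the columns already form a MultiIndex with named levels, melt unpivots each level into its own column and keeps the NaN row:
df_long = df.melt(value_name='n')
# expected columns: enroll_year, grad_year, n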
