Filling NA's with numeric order in pandas - python

I have a pandas data frame as below:
LogdiffT Year Country
0 -0.003094 2002 Australia
1 -0.015327 2001 NaN
2 0.100617 2000 NaN
3 0.067728 1999 NaN
4 0.089962 2010 China
5 -0.041844 2009 NaN
6 -0.031013 2008 NaN
7 0.091948 2007 NaN
8 0.082764 2006 Greece
9 0.103519 2005 NaN
10 -0.048100 2004 NaN
11 -0.014992 2003 NaN
12 0.166187 1966 Japan
If you see all NA's under the country column and following a country name belongs to that country name until a new country name is encountered. Like all 3 NA's after Australia stand for Australis and the 3 NAs' after China stand for China and so on. I want to recode this variable as a numeric variable such that all observations which belong to one country are coded as same. Like all 4 obs for Australia(AUS+3NAs) should have 1, China as 2 and so on.
In SAS I can do order by and use first. and last. and recode. How do we do similar stuff in pandas.
Any ideas?
EDIT:
I tried implementing one of below solution. And here is one interesting thing I am getting and am not sure why?
My data frame is as above. When I run this:
df.Country or df['Country'] I get an error that there is no column called Country when there is;
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-360-361952a0cbf3> in <module>()
2 data_train=data_2yr[features] # Subsetting the features from original data
3 # Recoding Country, Year variable
----> 4 data_train.Country
/Users/lib/python2.7/site-packages/pandas/core/generic.pyc in __getattr__(self, name)
1945 return self[name]
1946 raise AttributeError("'%s' object has no attribute '%s'" %
-> 1947 (type(self).__name__, name))
1948
1949 def __setattr__(self, name, value):
AttributeError: 'DataFrame' object has no attribute 'Country'
Because of this I am not able to implement the solutions proposed. What is going wrong here?

Call df = df.ffill() and then call factorize, factorize returns a tuple of array values and an index which is composed of your Series values, we only want the array values here:
In [476]:
df['Country'] = df['Country'].factorize()[0]
df
Out[476]:
LogdiffT Year Country
0 -0.003094 2002 0
1 -0.015327 2001 0
2 0.100617 2000 0
3 0.067728 1999 0
4 0.089962 2010 1
5 -0.041844 2009 1
6 -0.031013 2008 1
7 0.091948 2007 1
8 0.082764 2006 2
9 0.103519 2005 2
10 -0.048100 2004 2
11 -0.014992 2003 2
12 0.166187 1966 3
output from factorize:
In [480]:
df['Country'].factorize()
Out[480]:
(array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3]),
Index(['Australia', 'China', 'Greece', 'Japan'], dtype='object'))
As suggest by #John Galt you could compact the above into a one-liner:
df['Country'] = df['Country'].ffill().factorize()[0]

Let's say your dataframe is called df and you have a nested dictionary of your country codes as shown below. Then you can use replace
country_code = {'Country': {'Australia':1, 'China':2, 'Greece':3, 'Japan':4}}
df = df.replace(country_code)
# alternatively df['Country'] = df['Country'].replace(country_code['Country'])
df['Country'] = df['Country'].fillna(method='pad') # fills up the nans

Here's one way to do it.
Get unique countries list by dropping NaNs
In [66]: country_dict = {y:x for x,y in enumerate(df['Country'].dropna().unique())}
In [67]: country_dict
Out[67]: {'Australia': 0, 'China': 1, 'Greece': 2, 'Japan': 3}
Replace Country with country_dict
In [68]: dff = df.replace({'Country': country_dict})
In [69]: dff
Out[69]:
LogdiffT Year Country
0 -0.003094 2002 0
1 -0.015327 2001 NaN
2 0.100617 2000 NaN
3 0.067728 1999 NaN
4 0.089962 2010 1
5 -0.041844 2009 NaN
6 -0.031013 2008 NaN
7 0.091948 2007 NaN
8 0.082764 2006 2
9 0.103519 2005 NaN
10 -0.048100 2004 NaN
11 -0.014992 2003 NaN
12 0.166187 1966 3
And, then ffill() with previous values.
In [70]: dff.ffill()
Out[70]:
LogdiffT Year Country
0 -0.003094 2002 0
1 -0.015327 2001 0
2 0.100617 2000 0
3 0.067728 1999 0
4 0.089962 2010 1
5 -0.041844 2009 1
6 -0.031013 2008 1
7 0.091948 2007 1
8 0.082764 2006 2
9 0.103519 2005 2
10 -0.048100 2004 2
11 -0.014992 2003 2
12 0.166187 1966 3

Related

How to interpolate missing years within pd.groupby()

Problem:
I have a dataframe that contains entries with 5 year time intervals. I need to group entries by 'id' columns and interpolate values between the first and last item in the group. I understand that it has to be some combination of groupby(), set_index() and interpolate() but I am unable to make it work for the whole input dataframe.
Sample df:
import pandas as pd
data = {
'id': ['a', 'b', 'a', 'b'],
'year': [2005, 2005, 2010, 2010],
'val': [0, 0, 100, 100],
}
df = pd.DataFrame.from_dict(data)
example input df:
_ id year val
0 a 2005 0
1 a 2010 100
2 b 2005 0
3 b 2010 100
expected output df:
_ id year val type
0 a 2005 0 original
1 a 2006 20 interpolated
2 a 2007 40 interpolated
3 a 2008 60 interpolated
4 a 2009 80 interpolated
5 a 2010 100 original
6 b 2005 0 original
7 b 2006 20 interpolated
8 b 2007 40 interpolated
9 b 2008 60 interpolated
10 b 2009 80 interpolated
11 b 2010 100 original
'type' is not necessary its just for illustration purposes.
Question:
How can I add missing years to the groupby() view and interpolate() their corresponding values?
Thank you!
Using a temporary reshaping with pivot and unstack and reindex+interpolate to add the missing years:
out = (df
.pivot(index='year', columns='id', values='val')
.reindex(range(df['year'].min(), df['year'].max()+1))
.interpolate('index')
.unstack(-1).reset_index(name='val')
)
Output:
id year val
0 a 2005 0.0
1 a 2006 20.0
2 a 2007 40.0
3 a 2008 60.0
4 a 2009 80.0
5 a 2010 100.0
6 b 2005 0.0
7 b 2006 20.0
8 b 2007 40.0
9 b 2008 60.0
10 b 2009 80.0
11 b 2010 100.0
Solution for create years by minimal and maximal years for each group independently:
First create missing values by DataFrame.reindex per groups by minimal and maximal values and then interpolate by Series.interpolate, last identify values from original DataFrame to new column:
df = (df.set_index('year')
.groupby('id')['val']
.apply(lambda x: x.reindex(range(x.index.min(), x.index.max() + 1)).interpolate())
.reset_index()
.merge(df, how='left', indicator=True)
.assign(type = lambda x: np.where(x.pop('_merge').eq('both'),
'original',
'interpolated')))
print (df)
id year val type
0 a 2005 0.0 original
1 a 2006 20.0 interpolated
2 a 2007 40.0 interpolated
3 a 2008 60.0 interpolated
4 a 2009 80.0 interpolated
5 a 2010 100.0 original
6 b 2005 0.0 original
7 b 2006 20.0 interpolated
8 b 2007 40.0 interpolated
9 b 2008 60.0 interpolated
10 b 2009 80.0 interpolated
11 b 2010 100.0 original

Pandas: How to replace column values in panel dataset based on ID and condition

So I have a panel df that looks like this:
ID
year
value
1
2002
8
1
2003
9
1
2004
10
2
2002
11
2
2003
11
2
2004
12
I want to set the value for every ID and for all years to the value in 2004. How do I do this?
The df should then look like this:
ID
year
value
1
2002
10
1
2003
10
1
2004
10
2
2002
12
2
2003
12
2
2004
12
Could not find anything online. So far I have tried to get the value for every ID for year 2004, created a new df from that and then merged it back in. Though, that is super slow.
We can use Series.map for this, first we select the values and create our mapping:
mapping = df[df["year"].eq(2004)].set_index("ID")["value"]
df["value"] = df["ID"].map(mapping)
ID year value
0 1 2002 10
1 1 2003 10
2 1 2004 10
3 2 2002 12
4 2 2003 12
5 2 2004 12
Let's convert the value where corresponding year is not 2004 to NaN then get the max value per ID.
df['value'] = (df.assign(value=df['value'].mask(df['year'].ne(2004)))
.groupby('ID')['value'].transform('max'))
print(df)
ID year value
0 1 2002 10.0
1 1 2003 10.0
2 1 2004 10.0
3 2 2002 12.0
4 2 2003 12.0
5 2 2004 12.0
Another method, for some variety.
# Make everything that isn't 2004 null~
df.loc[df.year.ne(2004), 'value'] = np.nan
# Fill the values by ID~
df['value'] = df.groupby('ID')['value'].bfill()
Output:
ID year value
0 1 2002 10.0
1 1 2003 10.0
2 1 2004 10.0
3 2 2002 12.0
4 2 2003 12.0
5 2 2004 12.0
Yet another method, a bit longer but should be quite intuitive. Basically creating a lookup table for ID->value then performing lookup using pandas.merge.
import pandas as pd
# Original dataframe
df_orig = pd.DataFrame([(1, 2002, 8), (1, 2003, 9), (1, 2004, 10), (2, 2002, 11), (2, 2003, 11), (2, 2004, 12)])
df_orig.columns = ['ID', 'year', 'value']
# Dataframe with 2004 IDs
df_2004 = df_orig[df_orig['year'] == 2004]
df_2004.drop(columns=['year'], inplace=True)
print(df_2004)
# Drop values from df_orig and replace with those from df_2004
df_orig.drop(columns=['value'], inplace=True)
df_final = pd.merge(df_orig, df_2004, on='ID', how='right')
print(df_final)
df_2004:
ID value
2 1 10
5 2 12
df_final:
ID year value
0 1 2002 10
1 1 2003 10
2 1 2004 10
3 2 2002 12
4 2 2003 12
5 2 2004 12

Resampling DataFrame while keeping my other columns untouched

I have a dataset that tells me the frequency of an event for each demographic(for example, first row says there are 13 white men, who are 11 years old, in the county Alameda in the year 2006 who experienced an event). Here is the original DataFrame:
Year County age Race Sex freq
0 2006 Alameda 11 1 0 13
1 2006 Alameda 11 1 1 9
2 2006 Alameda 11 2 0 9
3 2006 Alameda 11 2 1 16
4 2006 Alameda 11 3 0 2
Now, I want to compute the 2 year average of the "freq" column, by the demographic. This is the code I tried and the output:
dc = dc.dropna()
dc['date'] = dc.apply(lambda x: pd.Timestamp('{year}'
.format(year=int(x.Year),
)),
axis=1)
dc.set_index('date', inplace=True)
dc=dc.resample('2A', how='mean')
Date age_range Race Sex freq
2006-12-31 14.507095 1.637789 0.489171 10.451830
2008-12-31 14.543697 1.664187 0.493120 10.285980
2010-12-31 14.516471 1.670205 0.489019 10.349927
2012-12-31 14.512953 1.675056 0.486677 10.109178
2014-12-31 14.568190 1.699817 0.485923 10.134186
It's computing the averages for each column, but how do I do it for just the freq,by the demographic cuts(like the orginal DF) column?
Combining groupby() Grouper() and transform() should give you a series you want. Your sample data set does not have enough data to demonstrate.
dc = pd.read_csv(io.StringIO(""" Year County age Race Sex freq
0 2006 Alameda 11 1 0 13
1 2006 Alameda 11 1 1 9
2 2006 Alameda 11 2 0 9
3 2006 Alameda 11 2 1 16
4 2006 Alameda 11 3 0 2"""), sep="\s+")
dc["date"] = pd.to_datetime(dc["Year"], format="%Y")
dc["freq_2y"] = dc.groupby(["County","age","Race","Sex",
pd.Grouper(key="date", freq="2A", origin="start"),])["freq"].transform("mean")
Year
County
age
Race
Sex
freq
date
freq_2y
0
2006
Alameda
11
1
0
13
2006-01-01 00:00:00
13
1
2006
Alameda
11
1
1
9
2006-01-01 00:00:00
9
2
2006
Alameda
11
2
0
9
2006-01-01 00:00:00
9
3
2006
Alameda
11
2
1
16
2006-01-01 00:00:00
16
4
2006
Alameda
11
3
0
2
2006-01-01 00:00:00
2

Python: In a DataFrame, how do I find the year that strings from one column appear in another column?

I've got a dataframe and want to loop through all strings within column c2 and print that string and the year it appears in column c2 and then also print the first year when it appears in column c1, if it exists in c1. And then tally the difference between the years in another column. There are NaN values in c2.
Example df:
id year c1 c2
0 1999 luke skywalker han solo
1 2000 leia organa r2d2
2 2001 han solo finn
3 2002 r2d2 NaN
4 2004 finn c3po
5 2002 finn NaN
6 2005 c3po NaN
Example printed result:
c2 year in c2 year in c1 delta
han solo 1999 2001 2
r2d2 2000 2002 2
finn 2001 2004 3
c3po 2004 2005 1
I'm using Jupyter notebook with python and pandas. Thanks!
You can do it in steps like this:
df1 = df[df.c2.notnull()].copy()
s = df.groupby('c1')['year'].first()
df1['year in c1'] = df1.c2.map(s)
df1 = df1.rename(columns={'year':'year in c2'})
df1['delta'] = df1['year in c1'] - df1['year in c2']
print(df1[['c2','year in c2','year in c1', 'delta']])
Output:
c2 year in c2 year in c1 delta
0 han solo 1999 2001 2
1 r2d2 2000 2002 2
2 finn 2001 2004 3
4 c3po 2004 2005 1
Here is one way.
df['year_c1'] = df['c2'].map(df.groupby('c1')['year'].agg('first'))\
.fillna(0).astype(int)
df = df.rename(columns={'year': 'year_c2'})
df['delta'] = df['year_c1'] - df['year_c2']
df = df.loc[df['c2'].notnull(), ['id', 'year_c2', 'year_c1', 'delta']]
# id year_c2 year_c1 delta
# 0 0 1999 2001.0 2
# 1 1 2000 2002.0 2
# 2 2 2001 2004.0 3
# 4 4 2004 2005.0 1
Explanation
Map c1 to year, aggregating by "first".
Use this map on c2 to calculate year_c1.
Calculate delta as the difference between year_c2 and year_c1.
Remove rows with null in c2 and order columns.

How to subtract rows of one pandas data frame from another?

The operation that I want to do is similar to merger. For example, with the inner merger we get a data frame that contains rows that are present in the first AND second data frame. With the outer merger we get a data frame that are present EITHER in the first OR in the second data frame.
What I need is a data frame that contains rows that are present in the first data frame AND NOT present in the second one? Is there a fast and elegant way to do it?
Consider Following:
df_one is first DataFrame
df_two is second DataFrame
Present in First DataFrame and Not in Second DataFrame
Solution: by Index
df = df_one[~df_one.index.isin(df_two.index)]
index can be replaced by required column upon which you wish to do exclusion.
In above example, I've used index as a reference between both Data Frames
Additionally, you can also use a more complex query using boolean pandas.Series to solve for above.
How about something like the following?
print df1
Team Year foo
0 Hawks 2001 5
1 Hawks 2004 4
2 Nets 1987 3
3 Nets 1988 6
4 Nets 2001 8
5 Nets 2000 10
6 Heat 2004 6
7 Pacers 2003 12
print df2
Team Year foo
0 Pacers 2003 12
1 Heat 2004 6
2 Nets 1988 6
As long as there is a non-key commonly named column, you can let the added on sufffexes do the work (if there is no non-key common column then you could create one to use temporarily ... df1['common'] = 1 and df2['common'] = 1):
new = df1.merge(df2,on=['Team','Year'],how='left')
print new[new.foo_y.isnull()]
Team Year foo_x foo_y
0 Hawks 2001 5 NaN
1 Hawks 2004 4 NaN
2 Nets 1987 3 NaN
4 Nets 2001 8 NaN
5 Nets 2000 10 NaN
Or you can use isin but you would have to create a single key:
df1['key'] = df1['Team'] + df1['Year'].astype(str)
df2['key'] = df1['Team'] + df2['Year'].astype(str)
print df1[~df1.key.isin(df2.key)]
Team Year foo key
0 Hawks 2001 5 Hawks2001
2 Nets 1987 3 Nets1987
4 Nets 2001 8 Nets2001
5 Nets 2000 10 Nets2000
6 Heat 2004 6 Heat2004
7 Pacers 2003 12 Pacers2003
You could run into errors if your non-index column has cells with NaN.
print df1
Team Year foo
0 Hawks 2001 5
1 Hawks 2004 4
2 Nets 1987 3
3 Nets 1988 6
4 Nets 2001 8
5 Nets 2000 10
6 Heat 2004 6
7 Pacers 2003 12
8 Problem 2112 NaN
print df2
Team Year foo
0 Pacers 2003 12
1 Heat 2004 6
2 Nets 1988 6
3 Problem 2112 NaN
new = df1.merge(df2,on=['Team','Year'],how='left')
print new[new.foo_y.isnull()]
Team Year foo_x foo_y
0 Hawks 2001 5 NaN
1 Hawks 2004 4 NaN
2 Nets 1987 3 NaN
4 Nets 2001 8 NaN
5 Nets 2000 10 NaN
6 Problem 2112 NaN NaN
The problem team in 2112 has no value for foo in either table. So, the left join here will falsely return that row, which matches in both DataFrames, as not being present in the right DataFrame.
Solution:
What I do is to add a unique column to the inner DataFrame and set a value for all rows. Then when you join, you can check to see if that column is NaN for the inner table to find unique records in the outer table.
df2['in_df2']='yes'
print df2
Team Year foo in_df2
0 Pacers 2003 12 yes
1 Heat 2004 6 yes
2 Nets 1988 6 yes
3 Problem 2112 NaN yes
new = df1.merge(df2,on=['Team','Year'],how='left')
print new[new.in_df2.isnull()]
Team Year foo_x foo_y in_df1 in_df2
0 Hawks 2001 5 NaN yes NaN
1 Hawks 2004 4 NaN yes NaN
2 Nets 1987 3 NaN yes NaN
4 Nets 2001 8 NaN yes NaN
5 Nets 2000 10 NaN yes NaN
NB. The problem row is now correctly filtered out, because it has a value for in_df2.
Problem 2112 NaN NaN yes yes
I suggest using parameter 'indicator' in merge. Also if 'on' is None this defaults to the intersection of the columns in both DataFrames.
new = df1.merge(df2,how='left', indicator=True) # adds a new column '_merge'
new = new[(new['_merge']=='left_only')].copy() #rows only in df1 and not df2
new = new.drop(columns='_merge').copy()
Team Year foo
0 Hawks 2001 5
1 Hawks 2004 4
2 Nets 1987 3
4 Nets 2001 8
5 Nets 2000 10
Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html
indicator : boolean or string, default False
If True, adds a column to output DataFrame called “_merge” with information on the source of each row.
Information column is Categorical-type and takes on a value of
“left_only” for observations whose merge key only appears in ‘left’ DataFrame,
“right_only” for observations whose merge key only appears in ‘right’ DataFrame,
and “both” if the observation’s merge key is found in both.

Categories