I have two dataframes that I'm trying to merge.
df1
code scale R1 R2...
0 121 1 80 110
1 121 2 NaN NaN
2 121 3 NaN NaN
3 313 1 60 60
4 313 2 NaN NaN
5 313 3 NaN NaN
...
df2
code scale R1 R2...
0 121 2 30 20
3 313 2 15 10
...
Where the code and scale columns match, I need to copy the values from df2 into df1.
The result should look like this:
df1
code scale R1 R2...
0 121 1 80 110
1 121 2 30 20
2 121 3 NaN NaN
3 313 1 60 60
4 313 2 15 10
5 313 3 NaN NaN
...
The problem is that there can be a lot of columns like R1 and R2 and I can not check each one separately, so I wanted to use something from this instruction, but nothing gives me the desired result. I'm doing something wrong, but I can't understand what. I really need advice.
What do you want to happen if the two dataframes both have values for R1/R2? If you want to keep df1's values, you could do
df1.set_index(['code', 'scale']).fillna(df2.set_index(['code', 'scale'])).reset_index()
To keep df2's values instead, do the fillna the other way round (note that the result then contains only the rows of df2). To combine in some other way, please clarify the question!
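A minimal sketch of how this fillna behaves, assuming the question's data plus one made-up df2 row, (121, 1), that overlaps a non-null row of df1. The overlapping row and the 999 values are illustrative assumptions, not part of the original question.

```python
import numpy as np
import pandas as pd

keys = ["code", "scale"]

df1 = pd.DataFrame({
    "code":  [121, 121, 121, 313, 313, 313],
    "scale": [1, 2, 3, 1, 2, 3],
    "R1":    [80, np.nan, np.nan, 60, np.nan, np.nan],
    "R2":    [110, np.nan, np.nan, 60, np.nan, np.nan],
})
# df2 from the question, plus a hypothetical overlapping row for (121, 1).
df2 = pd.DataFrame({
    "code":  [121, 121, 313],
    "scale": [1, 2, 2],
    "R1":    [999, 30, 15],
    "R2":    [999, 20, 10],
})

left = df1.set_index(keys)
right = df2.set_index(keys)

# Only NaN cells of `left` are filled; the existing 80/110 survive.
filled = left.fillna(right).reset_index()

# Swapping the roles keeps the shape of df2, so df1-only rows are lost.
swapped = right.fillna(left).reset_index()

print(filled)
print(swapped)
```

Because fillna keeps the caller's index, the swapped form is only suitable when the rows of df2 are all you need.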
Try this:
pd.concat([df1,df2],axis=0).sort_values(['code','scale']).drop_duplicates(['code','scale'],keep='last')
Out[21]:
code scale R1 R2
0 121 1 80.0 110.0
0 121 2 30.0 20.0
2 121 3 NaN NaN
3 313 1 60.0 60.0
3 313 2 15.0 10.0
5 313 3 NaN NaN
This is a good situation for combine_first. It replaces the nulls in the calling dataframe from the passed dataframe.
df1.set_index(['code', 'scale']).combine_first(df2.set_index(['code', 'scale'])).reset_index()
code scale R1 R2
0 121 1 80.0 110.0
1 121 2 30.0 20.0
2 121 3 NaN NaN
3 313 1 60.0 60.0
4 313 2 15.0 10.0
5 313 3 NaN NaN
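To illustrate which side wins with combine_first, here is a small sketch. It assumes the question's data plus one hypothetical df2 row, (121, 1), with made-up values of 999 so that the two frames overlap.

```python
import numpy as np
import pandas as pd

keys = ["code", "scale"]

df1 = pd.DataFrame({
    "code":  [121, 121, 121, 313, 313, 313],
    "scale": [1, 2, 3, 1, 2, 3],
    "R1":    [80, np.nan, np.nan, 60, np.nan, np.nan],
    "R2":    [110, np.nan, np.nan, 60, np.nan, np.nan],
})
# Hypothetical overlap: (121, 1) exists in both frames.
df2 = pd.DataFrame({
    "code":  [121, 121, 313],
    "scale": [1, 2, 2],
    "R1":    [999, 30, 15],
    "R2":    [999, 20, 10],
})

left = df1.set_index(keys)
right = df2.set_index(keys)

# The calling frame has priority; the argument only fills its gaps.
prefer_df1 = left.combine_first(right).reset_index()
prefer_df2 = right.combine_first(left).reset_index()

print(prefer_df1)
print(prefer_df2)
```

Unlike fillna, combine_first returns the union of both indexes, so either call order keeps all six key pairs.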
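Other solutions
with fillna
df1.set_index(['code', 'scale']).fillna(df2.set_index(['code', 'scale'])).reset_index()
with add, a bit faster (only equivalent when no cell has a value in both dataframes, because overlapping values are summed)
df1.set_index(['code', 'scale']).add(df2.set_index(['code', 'scale']), fill_value=0).reset_index()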
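A short sketch of where add with fill_value=0 and combine_first diverge. The two tiny frames and their index labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: "x" has a value on both sides, "y" only in b,
# "z" is missing everywhere.
a = pd.DataFrame({"R1": [80.0, np.nan, np.nan]}, index=["x", "y", "z"])
b = pd.DataFrame({"R1": [5.0, 30.0]}, index=["x", "y"])

summed = a.add(b, fill_value=0)   # overlapping cells are added together
combined = a.combine_first(b)     # overlapping cells keep a's value

print(summed)
print(combined)
```

The two results agree for "y" and "z" and differ only for "x", so add is a safe shortcut only when the frames never hold a value in the same cell.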
Related
df = pd.read_csv('test.txt',dtype=str)
print(df)
HE WE
0 aa NaN
1 181 76
2 22 13
3 NaN NaN
I want to overwrite values in this dataframe with the values from the following dataframe, matched on the index:
dff = pd.DataFrame({'HE' : [100,30]},index=[1,2])
print(dff)
HE
1 100
2 30
for i in dff.index:
    # .at is the public scalar accessor; _set_value/_get_value are private
    df.at[i, 'HE'] = dff.at[i, 'HE']
print(df)
HE WE
0 aa NaN
1 100 76
2 30 13
3 NaN NaN
Is there a way to change it all at once without using 'for'?
Use DataFrame.update, which works in place (it returns None, so do not assign the result):
df.update(dff)
print (df)
HE WE
0 aa NaN
1 100 76.0
2 30 13.0
3 NaN NaN
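A small sketch of three details of update that are easy to miss: it returns None, it skips NaN values in the passed frame, and it ignores index labels that the calling frame does not have. The numeric frames below are made up for illustration and are not the question's string data.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data.
df = pd.DataFrame({"HE": [1.0, 181.0, 22.0, np.nan],
                   "WE": [np.nan, 76.0, 13.0, np.nan]})
# Label 2 carries NaN and label 9 does not exist in df.
dff = pd.DataFrame({"HE": [100.0, np.nan, 30.0]}, index=[1, 2, 9])

ret = df.update(dff)  # modifies df in place

print(ret)
print(df)
```

Only row 1 changes here: the NaN at label 2 does not overwrite 22.0, and label 9 is not appended.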
My goal today is to follow each ID that belongs to Category==1 in a given date, one year later. So I have a dataframe like this:
Period ID Amount Category
20130101 1 100 1
20130101 2 150 1
20130101 3 100 1
20130201 1 90 1
20130201 2 140 1
20130201 3 95 1
20130201 5 250 0
. . .
20140101 1 40 1
20140101 2 70 1
20140101 5 160 0
20140201 1 35 1
20140201 2 65 1
20140201 5 150 0
For example, in 20130201 I have 3 IDs that belong to Category 1: 1, 2, 3, but just 2 of them are present in 20140201: 1, 2. So I need to get the value of Amount, only for those IDs, one year later, like this:
Period ID Amount Category Amount_t1
20130101 1 100 1 40
20130101 2 150 1 70
20130101 3 100 1 nan
20130201 1 90 1 35
20130201 2 140 1 65
20130201 3 95 1 nan
20130201 5 250 0 nan
. . .
20140101 1 40 1 nan
20140101 2 70 1 nan
20140101 5 160 0 nan
20140201 1 35 1 nan
20140201 2 65 1 nan
20140201 5 150 0 nan
So, if the ID doesn't appear the next year or belongs to Category 0, I'll get a nan. My first approach was to get the list of unique IDs in each Period and then try to map that to the next year, using some combination of groupby() and isin(), like this:
aux = df[df.Category==1].groupby('Period').ID.unique()
aux.index = aux.index + pd.DateOffset(years=1)
But I didn't know how to keep going. I'm thinking some kind of groupby('ID') might be more efficient too. If it were a simple shift() that would be easy, but I'm not sure about how to get the value offset by a year by group.
You can create lagged features with an exact merge after you manually lag one of the join keys.
import pandas as pd
# Datetime so we can do calendar year subtraction
df['Period'] = pd.to_datetime(df.Period, format='%Y%m%d')
# Create one with the lagged features. Here I'll split the steps out.
df2 = df.copy()
df2['Period'] = df2.Period-pd.offsets.DateOffset(years=1) # 1 year lag
df2 = df2.rename(columns={'Amount': 'Amount_t1'})
# Keep only values you want to merge
df2 = df2[df2.Category.eq(1)]
# Bring lagged features
df.merge(df2, on=['Period', 'ID', 'Category'], how='left')
Period ID Amount Category Amount_t1
0 2013-01-01 1 100 1 40.0
1 2013-01-01 2 150 1 70.0
2 2013-01-01 3 100 1 NaN
3 2013-02-01 1 90 1 35.0
4 2013-02-01 2 140 1 65.0
5 2013-02-01 3 95 1 NaN
6 2013-02-01 5 250 0 NaN
7 2014-01-01 1 40 1 NaN
8 2014-01-01 2 70 1 NaN
9 2014-01-01 5 160 0 NaN
10 2014-02-01 1 35 1 NaN
11 2014-02-01 2 65 1 NaN
12 2014-02-01 5 150 0 NaN
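The steps above can be wrapped in a small function. This is only a sketch: the name add_next_year_amount and its years parameter are hypothetical, and the sample frame contains just the rows shown in the question.

```python
import pandas as pd


def add_next_year_amount(df, years=1):
    """Hypothetical helper: attach the Amount seen `years` later as Amount_t1."""
    out = df.copy()
    out["Period"] = pd.to_datetime(out["Period"].astype(str), format="%Y%m%d")

    # Future rows of Category 1, shifted back so their Period lines up
    # with the earlier observation they should be attached to.
    future = out.loc[out["Category"].eq(1),
                     ["Period", "ID", "Category", "Amount"]].copy()
    future["Period"] = future["Period"] - pd.DateOffset(years=years)
    future = future.rename(columns={"Amount": "Amount_t1"})

    return out.merge(future, on=["Period", "ID", "Category"], how="left")


df = pd.DataFrame({
    "Period":   [20130101, 20130101, 20130101,
                 20130201, 20130201, 20130201, 20130201,
                 20140101, 20140101, 20140101,
                 20140201, 20140201, 20140201],
    "ID":       [1, 2, 3, 1, 2, 3, 5, 1, 2, 5, 1, 2, 5],
    "Amount":   [100, 150, 100, 90, 140, 95, 250, 40, 70, 160, 35, 65, 150],
    "Category": [1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0],
})

result = add_next_year_amount(df)
print(result)
```

Because Category is one of the join keys, an ID only gets a value when it is in Category 1 in both periods.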
I have a dataframe df containing the population p assigned to some buildings b
df
p b
0 150 3
1 345 7
2 177 4
3 267 2
and a dataframe df1 that associates some other buildings b1 to the buildings in df
df1
b1 b
0 17 3
1 9 7
2 13 7
I want to assign to the buildings that have an association in df1 a population divided by the number of buildings in the group. In this way we generate df2, which assigns a population of 150/2=75 to buildings 3 and 17 and a population of 345/3=115 to buildings 7, 9 and 13.
df2
p b
0 75 3
1 75 17
2 115 7
3 115 9
4 115 13
5 177 4
6 267 2
IIUC, you can merge both dataframes on b, stack() the b and b1 columns into a single column, and clean up the result. Then group on p, count the rows in each group with transform, and divide p by that count (this assumes no two buildings in df share the same population):
m=(df.merge(df1,on='b',how='left').set_index('p').stack().reset_index(name='b')
.drop_duplicates().drop(columns='level_1').sort_values('p'))
m.p=m.p/m.groupby('p')['p'].transform('count')
print(m.sort_index())
p b
0 75.0 3.0
1 75.0 17.0
2 115.0 7.0
3 115.0 9.0
5 115.0 13.0
6 177.0 4.0
7 267.0 2.0
Another way uses pd.concat. After concatenating, fill b1 with b and fill the missing p with 0. Then take the mean of p per group of b with transform, and assign the filled b1 as the building column of the final dataframe:
df2 = pd.concat([df, df1], sort=True).sort_values('b')
df2['b1'] = df2.b1.fillna(df2.b)
df2['p'] = df2.p.fillna(0)
df2.groupby('b').p.transform('mean').to_frame().assign(b=df2.b1).reset_index(drop=True)
Out[159]:
p b
0 267.0 2.0
1 75.0 3.0
2 75.0 17.0
3 177.0 4.0
4 115.0 7.0
5 115.0 9.0
6 115.0 13.0
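For comparison, here is a more explicit sketch of the same computation that counts the group size directly instead of relying on stack or mean. It is an alternative formulation, not taken from either answer, and the intermediate names size, share and linked are made up.

```python
import pandas as pd

df = pd.DataFrame({"p": [150, 345, 177, 267], "b": [3, 7, 4, 2]})
df1 = pd.DataFrame({"b1": [17, 9, 13], "b": [3, 7, 7]})

# Group size = the building itself plus every building associated with it.
size = 1 + df["b"].map(df1["b"].value_counts()).fillna(0)
share = df.assign(p=df["p"] / size)

# Associated buildings inherit the share of the building they point to.
linked = df1.merge(share, on="b")[["p", "b1"]].rename(columns={"b1": "b"})

df2 = pd.concat([share, linked], ignore_index=True)
print(df2)
```

The rows come out with the original buildings first and the associated ones after them, so sort afterwards if the order matters.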
Just curious on the behavior of 'where' and why you would use it over 'loc'.
If I create a dataframe:
df = pd.DataFrame({'ID':[1,2,3,4,5,6,7,8,9,10],
'Run Distance':[234,35,77,787,243,5435,775,123,355,123],
'Goals':[12,23,56,7,8,0,4,2,1,34],
'Gender':['m','m','m','f','f','m','f','m','f','m']})
And then apply the 'where' function:
df2 = df.where(df['Goals']>10)
I get the following, which keeps the rows where Goals > 10 but leaves everything else as NaN:
Gender Goals ID Run Distance
0 m 12.0 1.0 234.0
1 m 23.0 2.0 35.0
2 m 56.0 3.0 77.0
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 m 34.0 10.0 123.0
If however I use the 'loc' function:
df2 = df.loc[df['Goals']>10]
It returns the dataframe subsetted without the NaN values:
Gender Goals ID Run Distance
0 m 12 1 234
1 m 23 2 35
2 m 56 3 77
9 m 34 10 123
So essentially I am curious why you would use 'where' over 'loc/iloc' and why it returns NaN values?
Think of loc as a filter - give me only the parts of the df that conform to a condition.
where originally comes from numpy. It runs over an array and checks whether each element fits a condition, so it gives you back the entire array, holding either the original value or NaN. A nice feature of where is that you can also get back something different, e.g. df2 = df.where(df['Goals']>10, other=0), to replace values that don't meet the condition with 0.
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
7 0 0 0 0
8 0 0 0 0
9 10 123 34 m
Also, while where is only for conditional replacement, loc is the standard way of selecting in Pandas, along with iloc. loc uses row and column labels, while iloc uses integer positions. So with loc you could choose to return, say, df.loc[0:1, ['Gender', 'Goals']] (label slices include the end label):
Gender Goals
0 m 12
1 m 23
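A brief sketch of the label versus position difference, using the dataframe from the question. The variable names by_label and by_position are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'Run Distance': [234, 35, 77, 787, 243, 5435, 775, 123, 355, 123],
                   'Goals': [12, 23, 56, 7, 8, 0, 4, 2, 1, 34],
                   'Gender': ['m', 'm', 'm', 'f', 'f', 'm', 'f', 'm', 'f', 'm']})

# loc slices by label and includes the end label: rows 0 and 1.
by_label = df.loc[0:1, ['Gender', 'Goals']]

# iloc slices by position and excludes the end position: row 0 only.
# Columns 3 and 2 are Gender and Goals in this frame.
by_position = df.iloc[0:1, [3, 2]]

print(by_label)
print(by_position)
```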
If you check the docs for DataFrame.where, it replaces the rows that do not satisfy the condition, by default with NaN, but it is possible to specify another value:
df2 = df.where(df['Goals']>10)
print (df2)
ID Run Distance Goals Gender
0 1.0 234.0 12.0 m
1 2.0 35.0 23.0 m
2 3.0 77.0 56.0 m
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 10.0 123.0 34.0 m
df2 = df.where(df['Goals']>10, 100)
print (df2)
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
3 100 100 100 100
4 100 100 100 100
5 100 100 100 100
6 100 100 100 100
7 100 100 100 100
8 100 100 100 100
9 10 123 34 m
Another syntax is called boolean indexing and is used to filter rows: it removes the rows that do not match the condition.
df2 = df.loc[df['Goals']>10]
#alternative
df2 = df[df['Goals']>10]
print (df2)
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
9 10 123 34 m
With loc it is also possible to filter rows by condition and columns by name(s) at the same time:
s = df.loc[df['Goals']>10, 'ID']
print (s)
0 1
1 2
2 3
9 10
Name: ID, dtype: int64
df2 = df.loc[df['Goals']>10, ['ID','Gender']]
print (df2)
ID Gender
0 1 m
1 2 m
2 3 m
9 10 m
loc retrieves only the rows that match the condition.
where returns the whole dataframe, replacing the values in the rows that don't match the condition (with NaN by default).
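The contrast can be checked directly on the question's dataframe. This is a minimal sketch; the names cond, masked, filtered and capped are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'Run Distance': [234, 35, 77, 787, 243, 5435, 775, 123, 355, 123],
                   'Goals': [12, 23, 56, 7, 8, 0, 4, 2, 1, 34],
                   'Gender': ['m', 'm', 'm', 'f', 'f', 'm', 'f', 'm', 'f', 'm']})

cond = df['Goals'] > 10

masked = df.where(cond)    # same shape as df, non-matching rows become NaN
filtered = df.loc[cond]    # only the matching rows remain

# where is handy when the shape must be preserved, for example to
# replace values in a single column while keeping every row.
capped = df['Goals'].where(cond, 0)

print(masked.shape, filtered.shape)
print(capped.tolist())
```

Dropping the NaN rows from the where result leaves the same row labels that loc selects, which shows that the two differ in shape rather than in which rows satisfy the condition.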
I am trying to loop over a pandas dataframe in a way that takes two values from different rows. I tried the .iloc and shift functions but did not manage to get the result I need.
Here's a simple example to explain better what I want to do:
dataframe1:
a b c
0 101 1 aaa
1 211 2 dcd
2 351 3 yyy
3 401 5 lol
4 631 6 zzz
For the above df I want to add a new column ('d') that holds the difference between consecutive values in column 'a', but only where the difference between consecutive values in column 'b' equals 1; otherwise the value should be null. Like the following dataframe2:
a b c d
0 101 1 aaa nan
1 211 2 dcd 110
2 351 3 yyy 140
3 401 5 lol nan
4 631 6 zzz 230
Is there a built-in function that can handle this kind of calculation?
Try like this, using loc and diff():
df.loc[df.b.diff() == 1, 'd'] = df.a.diff()
>>> df
a b c d
0 101 1 aaa NaN
1 211 2 dcd 110.0
2 351 3 yyy 140.0
3 401 5 lol NaN
4 631 6 zzz 230.0
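The same result can also be written as a single assignment with Series.where instead of a loc assignment. This is a sketch of an equivalent formulation on the question's data, not the answerer's code.

```python
import pandas as pd

df = pd.DataFrame({"a": [101, 211, 351, 401, 631],
                   "b": [1, 2, 3, 5, 6],
                   "c": ["aaa", "dcd", "yyy", "lol", "zzz"]})

# Keep the diff of a only where b moved by exactly 1; NaN elsewhere.
df["d"] = df["a"].diff().where(df["b"].diff().eq(1))

print(df)
```

The first row is NaN in both variants because diff has no previous row to compare against.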
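You can also create a group key that starts a new group whenever the diff in b is not 1, then take the diff of a within each group: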
df1.groupby(df1.b.diff().ne(1).cumsum()).a.diff()
Out[361]:
0 NaN
1 110.0
2 140.0
3 NaN
4 230.0
Name: a, dtype: float64
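To see why the group key works, it may help to print the intermediate key. This is a sketch on the question's data; the names key and d are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [101, 211, 351, 401, 631],
                   "b": [1, 2, 3, 5, 6],
                   "c": ["aaa", "dcd", "yyy", "lol", "zzz"]})

# ne(1) marks every row where b did not advance by exactly 1 (including
# the first row, whose diff is NaN); cumsum turns those marks into run ids.
key = df["b"].diff().ne(1).cumsum()

# diff restarts at the first row of each run, which yields the NaN gaps.
d = df.groupby(key)["a"].diff()

print(key.tolist())
print(d.tolist())
```

Rows 0 to 2 form one run and rows 3 to 4 another, so the jump from b=3 to b=5 never contributes a difference.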