I have two pandas DataFrames.
df1 looks like this:
Date A B
2020-03-01 12 15
2020-03-02 13 16
2020-03-03 14 17
while df2, like this:
Date C
2020-03-03 x
2020-03-01 w
2020-03-05 y
I want to merge df2 to df1 such that the values turn into columns. Kinda like a one-hot encoding:
Date A B w x y z
2020-03-01 12 15 1 0 0 0
2020-03-02 13 16 0 0 0 1
2020-03-03 14 17 0 1 0 0
So the first row has a 1 in column w because the row with the same date, "2020-03-01", in df2['C'] is "w". Column z is for those entries in df1 without corresponding dates in df2. (Sorry if I couldn't explain it better. Feel free to clarify.)
As a solution, I thought of merging df1 and df2 first, like this:
Date A B C
2020-03-01 12 15 w
2020-03-02 13 16 -
2020-03-03 14 17 x
Then doing one-hot encoding using:
df1['w'] = (df2['C'] == 'w')*1.0
df1['y'] = (df2['C'] == 'y')*1.0
...
But I'm still thinking of how to code the first part, and the whole solution may not even be efficient. So I'm asking in case you know a more efficient way, like some combination of DataFrame methods. Thank you.
You can do this with get_dummies, using reindex to add the z column:
df1.merge(pd.get_dummies(df2['C'])
.reindex(list('wxyz'), axis=1, fill_value=0)
.assign(Date=df2.Date),
on='Date',
how='left'
).fillna(0)
Output:
Date A B w x y z
0 2020-03-01 12 15 1.0 0.0 0.0 0.0
1 2020-03-02 13 16 0.0 0.0 0.0 0.0
2 2020-03-03 14 17 0.0 1.0 0.0 0.0
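In the output above, z stays 0 for 2020-03-02, whereas the question expects a 1 there. A minimal sketch of one way to get that, assuming each date appears at most once in df2 and that "z" is only a placeholder label for unmatched rows (the sample frames are rebuilt from the question):

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": ["2020-03-01", "2020-03-02", "2020-03-03"],
    "A": [12, 13, 14],
    "B": [15, 16, 17],
})
df2 = pd.DataFrame({
    "Date": ["2020-03-03", "2020-03-01", "2020-03-05"],
    "C": ["x", "w", "y"],
})

# Left-merge so every df1 row is kept; dates missing from df2 get NaN in C.
merged = df1.merge(df2, on="Date", how="left")

# Label unmatched rows "z" (placeholder name taken from the question).
labels = merged["C"].fillna("z")

# One-hot encode, then force the full w/x/y/z column set as integer 0/1.
dummies = (pd.get_dummies(labels)
             .reindex(columns=list("wxyz"), fill_value=0)
             .astype(int))

result = pd.concat([merged.drop(columns="C"), dummies], axis=1)
print(result)
```

If df2 can hold several rows for the same date, the merge would duplicate df1 rows, so the dummies would need to be aggregated per date first.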
You should first build a tmp dataframe by using get_dummies after merging df1 and df2 on Date. Use reindex to make sure all columns are present, filled with 0 where needed:
tmp = pd.get_dummies(df1.merge(df2, 'left', on='Date')['C']).reindex(df2['C'].values,
axis=1, fill_value=0)
It gives:
x w y
0 0 1 0
1 0 0 0
2 1 0 0
We can now compute the z column to be 1 if no 1 is present on the row and concat to df1:
tmp['z'] = 1 - tmp.aggregate('sum', axis=1)
result = pd.concat([df1, tmp], axis=1)
to obtain:
Date A B x w y z
0 2020-03-01 12 15 0 1 0 0
1 2020-03-02 13 16 0 0 0 1
2 2020-03-03 14 17 1 0 0 0
I am curious why a simple concatenation of two dataframes in pandas:
initId.shape # (66441, 1)
initId.isnull().sum() # 0
ypred.shape # (66441, 1)
ypred.isnull().sum() # 0
of the same shape and both without NaN values
foo = pd.concat([initId, ypred], join='outer', axis=1)
foo.shape # (83384, 2)
foo.isnull().sum() # 16943
can result in a lot of NaN values if joined.
How can I fix this problem and prevent NaN values being introduced?
Trying to reproduce it like
aaa = pd.DataFrame([0,1,0,1,0,0], columns=['prediction'])
bbb = pd.DataFrame([0,0,1,0,1,1], columns=['groundTruth'])
pd.concat([aaa, bbb], axis=1)
failed, i.e. it worked just fine and no NaN values were introduced.
I think the problem is different index values: wherever concat cannot align the rows, you get NaN:
aaa = pd.DataFrame([0,1,0,1,0,0], columns=['prediction'], index=[4,5,8,7,10,12])
print(aaa)
prediction
4 0
5 1
8 0
7 1
10 0
12 0
bbb = pd.DataFrame([0,0,1,0,1,1], columns=['groundTruth'])
print(bbb)
groundTruth
0 0
1 0
2 1
3 0
4 1
5 1
print (pd.concat([aaa, bbb], axis=1))
prediction groundTruth
0 NaN 0.0
1 NaN 0.0
2 NaN 1.0
3 NaN 0.0
4 0.0 1.0
5 1.0 1.0
7 1.0 NaN
8 0.0 NaN
10 0.0 NaN
12 0.0 NaN
The solution is reset_index, if the index values are not needed:
aaa.reset_index(drop=True, inplace=True)
bbb.reset_index(drop=True, inplace=True)
print(aaa)
prediction
0 0
1 1
2 0
3 1
4 0
5 0
print(bbb)
groundTruth
0 0
1 0
2 1
3 0
4 1
5 1
print (pd.concat([aaa, bbb], axis=1))
prediction groundTruth
0 0 0
1 1 0
2 0 1
3 1 0
4 0 1
5 0 1
EDIT: If you need the same index as aaa and both DataFrames have the same length, use:
bbb.index = aaa.index
print (pd.concat([aaa, bbb], axis=1))
prediction groundTruth
4 0 0
5 1 0
8 0 1
7 1 0
10 0 1
12 0 1
You can do something like this:
concatenated_dataframes = pd.concat(
[
dataframe_1.reset_index(drop=True),
dataframe_2.reset_index(drop=True),
dataframe_3.reset_index(drop=True)
],
axis=1,
ignore_index=True,
)
concatenated_dataframes_columns = [
list(dataframe_1.columns),
list(dataframe_2.columns),
list(dataframe_3.columns)
]
flatten = lambda nested_lists: [item for sublist in nested_lists for item in sublist]
concatenated_dataframes.columns = flatten(concatenated_dataframes_columns)
This concatenates multiple DataFrames while keeping the column names and avoiding NaN.
As jezrael pointed out, this is due to different index labels. concat matches on index, so if the labels are not the same, this problem will occur. For a straightforward horizontal concatenation, you must "coerce" the index labels to be the same. One way is via the set_axis method, which makes the second dataframe's index the same as the first's.
joined_df = pd.concat([df1, df2.set_axis(df1.index)], axis=1)
Or just reset the index of both frames:
joined_df = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
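Before picking one of these fixes, it can help to confirm that the indexes really differ. A small sketch on toy frames (the names left and right are illustrative only, and the index labels are assumed to be unique):

```python
import pandas as pd

left = pd.DataFrame({"prediction": [0, 1, 0]}, index=[4, 5, 8])
right = pd.DataFrame({"groundTruth": [0, 0, 1]})  # default index 0, 1, 2

# True only when both indexes hold the same labels in the same order.
same_index = left.index.equals(right.index)

# With unique labels, this is the row count an outer concat on axis=1 yields.
union_size = len(left.index.union(right.index))

if not same_index:
    # Positional alignment: discard both indexes before concatenating.
    joined = pd.concat(
        [left.reset_index(drop=True), right.reset_index(drop=True)], axis=1
    )
else:
    joined = pd.concat([left, right], axis=1)

print(same_index, union_size, joined.shape)
```

A union size larger than either frame's length is the same symptom as the (83384, 2) shape in the question.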
Given a simple dataframe:
df = pd.DataFrame({'user': ['x','x','x','x','x','y','y','y','y'],
'Flag': [0,1,0,0,1,0,1,0,0],
'time': [10,34,40,43,44,12,20, 46, 51]})
I want to calculate the timedelta from the last flag == 1 for each user.
I did the diffs:
df.sort_values(['user', 'time']).groupby('user')['time'].diff().fillna(pd.Timedelta(10000000)).dt.total_seconds()/60
But it doesn't seem to solve my issue. I need the time delta from the last 1, and if there wasn't any, then fill with some number N.
Please advise
For example:
user Flag time diff
0 x 0 10 NaN
1 x 1 34 NaN
2 x 0 40 6.0
3 x 0 43 9.0
4 x 1 44 10.0
5 y 0 12 NaN
6 y 1 20 NaN
7 y 0 46 26.0
8 y 0 51 31.0
I am not sure that I understood correctly, but if you want to compute the time delta only between 1's per group of user, you can apply your computation on the sliced dataframe for 1's only and using groupby:
df['delta'] = (df[df['Flag'].eq(1)] # select 1's only
.groupby('user') # group by user
['time'].diff() # compute the diff
.dt.total_seconds()/60 # convert to minutes
)
output:
user Flag time delta
0 x 0 0 days 10:30:00 NaN
1 x 1 0 days 11:34:00 NaN
2 x 0 0 days 11:43:00 NaN
3 y 0 0 days 13:43:00 NaN
4 y 1 0 days 14:40:00 NaN
5 y 0 0 days 15:32:00 NaN
6 y 1 0 days 18:30:00 230.0
7 w 0 0 days 19:30:00 NaN
8 w 0 0 days 20:11:00 NaN
Edit: here is a working solution for the updated question.
IIUC the update, you want to calculate the difference to the last 1 per user, and if the flag is 1, the difference to the last valid value per user if any.
In summary, it creates subgroups for ranges starting with a 1, then uses these groups to calculate the diffs. Finally, it masks the 1s with the diff to their previous value (if existing).
(df.assign(mask=df['Flag'].eq(1),
group=lambda d: d.groupby('user')['mask'].cumsum(),
# diff from last 1
diff=lambda d: d.groupby(['user', 'group'])['time'].apply(lambda g: g -(g.iloc[0] if g.name[1]>0 else float('nan'))),
)
# mask 1s with their own diff
.assign(## diff=lambda d: d['diff'].mask(d['mask'],d.groupby('user')['time'].diff()) ## OLD VERSION
diff= lambda d: d['diff'].mask(d['mask'],
(d[d['mask'].groupby(d['user']).cumsum().eq(0)|d['mask']]
.groupby('user')['time'].diff())
)
)
.drop(['mask', 'group'], axis=1) # cleanup temp columns
)
Output:
user Flag time diff
0 x 0 10 NaN
1 x 1 34 24.0
2 x 0 40 6.0
3 x 0 43 9.0
4 x 1 44 10.0
5 y 0 12 NaN
6 y 1 20 8.0
7 y 0 46 26.0
8 y 0 51 31.0
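For comparison, the table shown in the question (where the first flagged row of each user stays NaN) can also be reached with where, shift and ffill. This is only a sketch of an alternative, not the method used above, and it assumes time is a plain number as in the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "user": ["x", "x", "x", "x", "x", "y", "y", "y", "y"],
    "Flag": [0, 1, 0, 0, 1, 0, 1, 0, 0],
    "time": [10, 34, 40, 43, 44, 12, 20, 46, 51],
})

df = df.sort_values(["user", "time"])

# Keep the time only on flagged rows; everything else becomes NaN.
flag_time = df["time"].where(df["Flag"].eq(1))

# Per user: shift by one row so a flagged row does not see itself,
# then forward-fill to carry the last flagged time down.
last_flag_time = flag_time.groupby(df["user"]).transform(
    lambda s: s.shift().ffill()
)

df["diff"] = df["time"] - last_flag_time
print(df)
```

Rows with no earlier flag stay NaN; chaining .fillna(N) on the diff column would replace them with whatever fill value N is chosen.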
Input
df1
id date v1
a 2020-1-1 1
a 2020-1-2 2
b 2020-1-4 10
b 2020-1-22 30
c 2020-2-4 10
c 2020-2-22 30
df2
id date v1
a 2020-1-3 1
b 2020-1-7 12
b 2020-1-22 13
c 2020-2-10 15
c 2020-2-22 60
Goal
id date v1 v2
a 2020-1-1 1 0
a 2020-1-2 2 0
a 2020-1-3 0 1
b 2020-1-4 10 0
b 2020-1-7 0 12
b 2020-1-22 30 13
c 2020-2-4 10 0
c 2020-2-10 0 15
c 2020-2-22 30 60
The details:
Only two dataframes, for each id, the date is unique.
Concatenate the two dataframes into df based on id; each id should contain all date values from both dataframes.
The merged dataframe contains v1 and v2 columns: when a date is in both df1 and df2, it returns both original values; when a date is in only one of df1 and df2, it returns the original value and 0 for the missing one.
Try
I have searched the merge and concat documentation but could not find the answer.
First convert the date columns to datetimes with to_datetime for correct ordering. Then use DataFrame.merge with an outer join, renaming column v1 in df2 to avoid v1_x and v1_y columns in the output. Finally, replace missing values with DataFrame.fillna and sort the output with DataFrame.sort_values:
df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])
df = (df1.merge(df2.rename(columns={'v1':'v2'}), on=['id','date'], how='outer')
.fillna(0)
.sort_values(['id','date']))
print (df)
id date v1 v2
0 a 2020-01-01 1.0 0.0
1 a 2020-01-02 2.0 0.0
6 a 2020-01-03 0.0 1.0
2 b 2020-01-04 10.0 0.0
7 b 2020-01-07 0.0 12.0
3 b 2020-01-22 30.0 13.0
4 c 2020-02-04 10.0 0.0
8 c 2020-02-10 0.0 15.0
5 c 2020-02-22 30.0 60.0
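The values come back as floats because the outer join introduces NaN before fillna runs. If integers are wanted, as in the goal table, a cast after filling is one option. A self-contained sketch, with the sample frames rebuilt from the question (dates written zero-padded here) and assuming v1 and v2 only ever hold whole numbers:

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": ["a", "a", "b", "b", "c", "c"],
    "date": ["2020-01-01", "2020-01-02", "2020-01-04",
             "2020-01-22", "2020-02-04", "2020-02-22"],
    "v1": [1, 2, 10, 30, 10, 30],
})
df2 = pd.DataFrame({
    "id": ["a", "b", "b", "c", "c"],
    "date": ["2020-01-03", "2020-01-07", "2020-01-22",
             "2020-02-10", "2020-02-22"],
    "v1": [1, 12, 13, 15, 60],
})
for frame in (df1, df2):
    frame["date"] = pd.to_datetime(frame["date"])

df = (
    df1.merge(df2.rename(columns={"v1": "v2"}), on=["id", "date"], how="outer")
       .fillna({"v1": 0, "v2": 0})        # fill only the value columns
       .astype({"v1": int, "v2": int})    # back to integers once NaN is gone
       .sort_values(["id", "date"])
       .reset_index(drop=True)            # optional: tidy 0..n-1 index
)
print(df)
```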