We have one dataframe like
-0.140447131 0.124802527 0.140780106
0.062166349 -0.121484447 -0.140675515
-0.002989106 0.13984927 0.004382326
and the other as
1
1
2
We need to concatenate the two dataframes column-wise, like
-0.140447131 0.124802527 0.140780106 1
0.062166349 -0.121484447 -0.140675515 1
-0.002989106 0.13984927 0.004382326 2
Let's say your first dataframe is like
In [281]: df1
Out[281]:
a b c
0 -0.140447 0.124803 0.140780
1 0.062166 -0.121484 -0.140676
2 -0.002989 0.139849 0.004382
And the second one like:
In [283]: df2
Out[283]:
d
0 1
1 1
2 2
Then you could create a new column in df1 using df2:
In [284]: df1['d_new'] = df2['d']
In [285]: df1
Out[285]:
a b c d_new
0 -0.140447 0.124803 0.140780 1
1 0.062166 -0.121484 -0.140676 1
2 -0.002989 0.139849 0.004382 2
The assumption, however, is that both dataframes share a common index.
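If the indexes do not line up but both frames have the same number of rows in the same order, a minimal sketch is to assign the underlying values instead of the aligned Series:
# bypasses index alignment; assumes equal row counts in matching order
df1['d_new'] = df2['d'].values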
Use pd.concat and set axis=1 (columns):
df_new = pd.concat([df1, df2], axis=1)
>>> df_new
0 1 2 0
0 -0.140447 0.124803 0.140780 1
1 0.062166 -0.121484 -0.140676 1
2 -0.002989 0.139849 0.004382 2
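For reference, a minimal self-contained sketch of this answer, reusing the column names a, b, c, d from the earlier answer (the original data had no headers):
import pandas as pd

df1 = pd.DataFrame({'a': [-0.140447, 0.062166, -0.002989],
                    'b': [0.124803, -0.121484, 0.139849],
                    'c': [0.140780, -0.140676, 0.004382]})
df2 = pd.DataFrame({'d': [1, 1, 2]})

# axis=1 concatenates column-wise, aligning rows on the shared index
df_new = pd.concat([df1, df2], axis=1)
print(df_new)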
I am trying to update df1 with df2:
add new rows from df2 to df1
update existing rows (if row index exist)
df1 = pd.DataFrame([[1,3],[2,4]], index=[1,2], columns=['a','b'])
df2 = pd.DataFrame([[1,0],[2,3]], index=[3,2], columns=['a','b'])
The expected result should be
a b
1 1 3
2 2 3
3 1 0
but
df1.append(df2).drop_duplicates(keep='last') # drop_duplicates has no effect
gives a simple vertical stack
a b
1 1 3
2 2 4
3 1 0
2 2 3
df1.merge(df2, how='outer')
keeps the same values but destroys the row index
a b
0 1 3
1 2 4
2 1 0
3 2 3
df1.join(df2)
df1.loc[df2.index] = df1.values
both raise errors.
Try this:
new_df = df1.append(df2)
new_df = new_df[~new_df.index.duplicated(keep='last')]
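Note that DataFrame.append was removed in pandas 2.0; the same idea can be expressed with pd.concat. A minimal sketch, using the question's data:
import pandas as pd

df1 = pd.DataFrame([[1, 3], [2, 4]], index=[1, 2], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 0], [2, 3]], index=[3, 2], columns=['a', 'b'])

# stack both frames, then keep only the last occurrence of each index label
new_df = pd.concat([df1, df2])
new_df = new_df[~new_df.index.duplicated(keep='last')]
print(new_df.sort_index())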
How can I pivot a dataframe into a square dataframe whose values count the intersections in the value column?
My input dataframe is
field value
a 1
a 2
b 3
b 1
c 2
c 5
Output should be
a b c
a 2 1 1
b 1 2 0
c 1 0 2
Each value in the output dataframe should be the number of values in the value column shared by the corresponding pair of fields.
Use a self-merge on value together with crosstab:
df = df.merge(df, on='value')
df = pd.crosstab(df['field_x'], df['field_y'])
print (df)
field_y a b c
field_x
a 2 1 1
b 1 2 0
c 1 0 2
Then remove the index and columns names with rename_axis:
# pandas 0.24+
df = pd.crosstab(df['field_x'], df['field_y']).rename_axis(index=None, columns=None)
print (df)
a b c
a 2 1 1
b 1 2 0
c 1 0 2
# pandas below 0.24
df = pd.crosstab(df['field_x'], df['field_y']).rename_axis(None).rename_axis(None, axis=1)
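Putting the two steps together, here is a self-contained sketch with the sample data copied from the question:
import pandas as pd

df = pd.DataFrame({'field': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'value': [1, 2, 3, 1, 2, 5]})

# self-merge on 'value' pairs up fields that share a value, then crosstab counts the pairs
m = df.merge(df, on='value')
out = pd.crosstab(m['field_x'], m['field_y']).rename_axis(index=None, columns=None)
print(out)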
Given the dataframe df
df = pd.DataFrame([1,2,3,4])
print(df)
0
0 1
1 2
2 3
3 4
I would like to modify it as
print(df)
0
A 1
A 2
A 3
A 4
In this specific case you can use:
df.index = ['A'] * len(df)
Use set_index
In [797]: df.set_index([['A']*len(df)], inplace=True)
In [798]: df
Out[798]:
0
A 1
A 2
A 3
A 4
Alternatively, you can set the index when you create the df:
df = pd.DataFrame([1,2,3,4],index=['A']*4)
df
Out[325]:
0
A 1
A 2
A 3
A 4
We have two data sets with one variable, col1.
Some levels are missing in the second data set. For example, let
import pandas as pd
df1 = pd.DataFrame({'col1':["A","A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
When we factorize df1
df1["f_col1"]= pd.factorize(df1.col1)[0]
df1
we get
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
But when we do it for df2
df2["f_col1"]= pd.factorize(df2.col1)[0]
df2
we get
col1 f_col1
0 A 0
1 B 1
2 D 2
3 E 3
This is not what I want. I want to keep the same factorization across the data sets, i.e. in df2 we should have something like
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
Thanks.
PS: The two data sets are not always available at the same time, so I cannot concat them. The values should be stored from df1 and used in df2 when it becomes available.
You could concatenate the two DataFrames, then apply pd.factorize once to the entire column:
import pandas as pd
df1 = pd.DataFrame({'col1':["A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
df = pd.concat({'df1':df1, 'df2':df2})
df['f_col1'], uniques = pd.factorize(df['col1'])
print(df)
yields
col1 f_col1
df1 0 A 0
1 B 1
2 C 2
3 D 3
4 E 4
df2 0 A 0
1 B 1
2 D 3
3 E 4
To extract df1 and df2 from df you could use df.loc:
In [116]: df.loc['df1']
Out[116]:
col1 f_col1
0 A 0
1 B 1
2 C 2
3 D 3
4 E 4
In [117]: df.loc['df2']
Out[117]:
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
(But note that since the performance of vectorized operations improves when you apply them once to a large DataFrame instead of multiple times to smaller DataFrames, you might be better off keeping df and ditching df1 and df2...)
Alternatively, if you must generate df1['f_col1'] first, and then compute
df2['f_col1'] later, you could use merge to join df1 and df2 on col1:
import pandas as pd
df1 = pd.DataFrame({'col1':["A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
df1['f_col1'], uniques = pd.factorize(df1['col1'])
df2 = pd.merge(df2, df1, how='left')
print(df2)
yields
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
You could reuse the f_col1 column of df1 and map the values of df2.col1 by setting the index on df1.col1:
In [265]: df2.col1.map(df1.set_index('col1').f_col1)
Out[265]:
0 0
1 1
2 3
3 4
Details
In [266]: df2['f_col1'] = df2.col1.map(df1.set_index('col1').f_col1)
In [267]: df2
Out[267]:
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
In case df1 has duplicate records, drop them using drop_duplicates:
In [290]: df1
Out[290]:
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
In [291]: df2.col1.map(df1.drop_duplicates().set_index('col1').f_col1)
Out[291]:
0 0
1 1
2 3
3 4
Name: col1, dtype: int32
You want to get unique values across both sets of data. Then create a series or a dictionary. This is your factorization that can be used across both data sets. Use map to get the output you are looking for.
import numpy as np

u = np.unique(np.append(df1.col1.values, df2.col1.values))
f = pd.Series(range(len(u)), u)  # this is the factorization
Assign with map
df1['f_col1'] = df1.col1.map(f)
df2['f_col1'] = df2.col1.map(f)
print(df1)
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
print(df2)
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
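Since the PS says df2 may only become available later, a minimal sketch of persisting the mapping built from df1 and reloading it afterwards (the file name and CSV format are assumptions, not part of the question):
import pandas as pd

df1 = pd.DataFrame({'col1': ["A", "A", "B", "C", "D", "E"]})
df1['f_col1'], uniques = pd.factorize(df1['col1'])

# store the factorization so it can be reused in a later run (hypothetical file name)
mapping = pd.Series(range(len(uniques)), index=uniques)
mapping.to_csv('col1_mapping.csv', header=False)

# ... later, when df2 shows up ...
mapping = pd.read_csv('col1_mapping.csv', header=None, index_col=0).squeeze('columns')
df2 = pd.DataFrame({'col1': ["A", "B", "D", "E"]})
df2['f_col1'] = df2['col1'].map(mapping)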
I have a dataframe that looks like
day type col d_1 d_2 d_3 d_4 d_5...
1 A 1 1 0 1 0
1 A 2 1 0 1 0
2 B 1 1 1 0 0
That is, I have one normal column (col) and many columns prefixed by d_
I need to perform a groupby by day and type and I want to compute the sum of the values in every d_ column for every day-type combination. I also need to perform other aggregation functions on the other columns in my data (such as col in the example)
I can use:
agg_df=df.groupby(['day','type']).agg({'d_1': 'sum', 'col': 'mean'})
but this computes the sum only for one d_ column. How can I specify all the possible d_ columns in my data?
In other words, I would like to write something like
agg_df=df.groupby(['day','type']).agg({'d_*': 'sum', 'col': 'mean'})
so that the expected output is:
day type col d_1 d_2 d_3 d_4 d_5...
1 A 1.5 2 0 2 0 ...
2 B 1 1 1 0 0
As you can see, col is aggregated by mean, while the d_ columns are summed.
Thanks for your help!
IIUC you need to subset your groupby dataframe with your d_* columns. You could find those columns with str.contains and pass them to the groupby dataframe:
cols = df.columns[df.columns.str.contains('(d_)+|col')]
agg_df=df.groupby(['day','type'])[cols].sum()
In [150]: df
Out[150]:
day type col d_1 d_2 d_3 d_4
0 1 A 1 1 0 1 0
1 1 A 2 1 0 1 0
2 2 B 1 1 1 0 0
In [155]: agg_df
Out[155]:
col d_1 d_2 d_3 d_4
day type
1 A 3 2 0 2 0
2 B 1 1 1 0 0
Note: I added the col column to the contains pattern as you requested. You could specify whatever regex pattern you want and join alternatives with the | symbol.
You can use filter:
In [23]: df.groupby(['day','type'], as_index=False)[df.filter(regex='d_.*').columns].sum()
Out[23]:
day type d_1 d_2 d_3 d_4
0 1 A 2 0 2 0
1 2 B 1 1 0 0
If you want to apply all the functions in one shot:
dic = {}
dic.update({i:np.sum for i in df.filter(regex='d_.*').columns})
dic.update({'col':np.mean})
In [48]: df.groupby(['day','type'], as_index=False).agg(dic)
#Out[48]:
# day type d_2 d_3 d_1 col d_4
#0 1 A 0 2 2 1.5 0
#1 2 B 1 0 1 1.0 0
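The same idea as a self-contained sketch, building the aggregation dict with a comprehension and string function names (the sample frame below is reconstructed from the question):
import pandas as pd

df = pd.DataFrame({'day': [1, 1, 2], 'type': ['A', 'A', 'B'],
                   'col': [1, 2, 1],
                   'd_1': [1, 1, 1], 'd_2': [0, 0, 1],
                   'd_3': [1, 1, 0], 'd_4': [0, 0, 0]})

# every d_* column gets summed, col gets averaged
agg_map = {c: 'sum' for c in df.columns if c.startswith('d_')}
agg_map['col'] = 'mean'

agg_df = df.groupby(['day', 'type']).agg(agg_map)
print(agg_df)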