What is a systematic way to go from this:
x = {'col0': [1, 1, 2, 2], 'col1': ['a', 'b', 'a', 'b'],
'col2': ['x', 'x', 'x', 'x'], 'col3': [12, 13, 14, 15]}
y = pd.DataFrame(data=x)
y
col0 col1 col2 col3
0 1 a x 12
1 1 b x 13
2 2 a x 14
3 2 b x 15
To this:
y2
col0 col3__a_x col3__b_x
0 1 12 13
1 2 14 15
I was initially thinking something like cast from the reshape2 package from R. However, I'm much less familiar with Pandas/Python than I am with R.
In the dataset I'm working with col1 has 3 different values, col2 is all the same value, ~200,000 rows, and ~80 other columns that would get the suffix added.
You will need pivot_table and a column flatten:
s=pd.pivot_table(y,index='col0',columns=['col1','col2'],values='col3')
s.columns=s.columns.map('_'.join)
s.add_prefix('col3_').reset_index()
Out[1383]:
col0 col3_a_x col3_b_x
0 1 12 13
1 2 14 15
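Since the real dataset has ~80 columns that need the suffix, here is a hedged sketch of the same idea with a list passed to values (col4 is a hypothetical extra column): pivot_table keeps each value-column name as the first level of the resulting MultiIndex, so one flatten pass names everything, matching the question's double-underscore format.
import pandas as pd

# Hypothetical wider frame: col3 and col4 both get the suffix treatment.
y_wide = pd.DataFrame({'col0': [1, 1, 2, 2], 'col1': ['a', 'b', 'a', 'b'],
                       'col2': ['x', 'x', 'x', 'x'],
                       'col3': [12, 13, 14, 15], 'col4': [22, 23, 24, 25]})

s = pd.pivot_table(y_wide, index='col0', columns=['col1', 'col2'],
                   values=['col3', 'col4'])
# Levels are (value, col1, col2); build names like col3__a_x, col4__b_x.
s.columns = s.columns.map(lambda t: f'{t[0]}__{t[1]}_{t[2]}')
s = s.reset_index()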
You can do it using set_index and unstack if you don't have multiple values for the resulting rows and columns; otherwise you'll have to use an aggregation method such as pivot_table or groupby:
df_out = y.set_index(['col0','col1','col2']).unstack([1,2])
df_out.columns = df_out.columns.map('_'.join)
df_out.reset_index()
Output:
col0 col3_a_x col3_b_x
0 1 12 13
1 2 14 15
Or with multiple values using groupby:
df_out = y.groupby(['col0','col1','col2']).mean().unstack([1,2])
df_out.columns = df_out.columns.map('_'.join)
df_out.reset_index()
Using pd.factorize and NumPy slice assignment, we can construct the data frame we need.
import numpy as np

i, r = pd.factorize(y.col0)                       # row codes and unique col0 values
j, c = pd.factorize(y.col1.str.cat(y.col2, '_'))  # column codes and 'a_x'-style labels
b = np.zeros((r.size, c.size), np.int64)
b[i, j] = y.col3.values                           # scatter col3 into the grid
d = pd.DataFrame(
    np.column_stack([r, b]),
    columns=['col0'] + ['col3__' + col for col in c]
)
d
col0 col3__a_x col3__b_x
0 1 12 13
1 2 14 15
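One caveat: np.zeros assumes every (col0, suffix) combination is present in the data, so any missing pair would silently read as 0. A hedged variant initializes with NaN to keep gaps visible:
# NaN-initialize so absent (row, column) pairs stay visibly missing
# (note this upcasts the block to float).
b = np.full((r.size, c.size), np.nan)
b[i, j] = y.col3.values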
I think that @Wen's solution is probably better, as it is pure pandas, but here is another solution if you want to use NumPy:
import numpy as np
d = y.groupby('col0').apply(lambda x: x['col3']).unstack().values
d = d[~np.isnan(d)].reshape(len(d),-1)
new_df = pd.DataFrame(d).reset_index().rename(columns={'index': 'col0', 0: 'col3_a_x', 1:'col3_b_x'})
>>> new_df
col0 col3_a_x col3_b_x
0 0 12.0 13.0
1 1 14.0 15.0
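Note that col0 here holds positional codes (0, 1) rather than the original values (1, 2), because the labels were dropped before reset_index. A hedged variant that keeps them:
# Unstack positions within each group so the col0 labels survive.
u = y.groupby('col0')['col3'].apply(lambda s: s.reset_index(drop=True)).unstack()
u.columns = ['col3_a_x', 'col3_b_x']  # assumes the a/b order seen in the data
new_df = u.reset_index()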
I have the following sample input data:
import pandas as pd
df = pd.DataFrame({'col1': ['x', 'y', 'z'], 'col2': [1, 2, 3], 'col3': ['a', 'a', 'b']})
I would like to sort and group by col3 while interleaving the summaries on top of the corresponding group in col1 and get the following output:
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3
I can of course do the part:
df.sort_values(by=['col3']).groupby(by=['col3']).sum()
col2
col3
a 3
b 3
but I am not sure how to interleave the group labels on top of col1.
Use a custom function that prepends a summary row to each group:
def f(x):
    # One summary row (group label + group sum) stacked on top of the group's rows.
    top = pd.DataFrame({'col1': x.name, 'col2': x['col2'].sum()}, index=[0])
    return pd.concat([top, x])  # DataFrame.append was removed in pandas 2.0

df = (df.sort_values(by=['col3'])
        .groupby(by=['col3'], group_keys=False)
        .apply(f)
        .drop(columns='col3')
        .reset_index(drop=True))
print (df)
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3
A more performant solution uses GroupBy.ngroup for the indices, aggregates the sum, and then joins the values back together with concat, relying on the stable mergesort to keep the summary rows on top:
df = df.sort_values(by=['col3'])
df1 = df.groupby(by=['col3'])['col2'].sum().rename_axis('col1').reset_index()
df2 = df.set_index(df.groupby(by=['col3']).ngroup())
df = pd.concat([df1, df2]).sort_index(kind='mergesort', ignore_index=True).drop(columns='col3')
print (df)
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3
What about:
(df.melt(id_vars='col2')
.rename(columns={'value': 'col1'})
.groupby('col1').sum()
.reset_index()
)
Output (note the rows end up sorted by col1 rather than interleaved per group):
col1 col2
0 a 3
1 b 3
2 x 1
3 y 2
4 z 3
def function1(dd: pd.DataFrame):
    # Write the group summary at a fractional index just above the group's
    # first row; sorting the index afterwards slots it into place.
    df.loc[dd.index.min() - 0.5, ['col1', 'col2']] = [dd.name, dd.col2.sum()]

df.groupby('col3').apply(function1).pipe(lambda dd: df.sort_index(ignore_index=True)).drop('col3', axis=1)
output
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3
Or use the pandasql library:
def function1(dd: pd.DataFrame):
    return dd.sql("select '{}' as col1,{} as col2 union select col1,col2 from self".format(dd.name, dd.col2.sum()))
df.groupby('col3').apply(function1).reset_index(drop=False)
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3
Given the following dataframe,
is it possible to calculate the sum of col2 and the sum of col2 + col3,
in a single aggregating function?
import pandas as pd
df = pd.DataFrame({'col1': ['a', 'a', 'b', 'b'], 'col2': [1, 2, 3, 4], 'col3': [10, 20, 30, 40]})
   col1  col2  col3
0     a     1    10
1     a     2    20
2     b     3    30
3     b     4    40
In R's dplyr I would do it with a single line of summarize,
and I was wondering what might be the equivalent in pandas:
df %>% group_by(col1) %>% summarize(col2_sum = sum(col2), col23_sum = sum(col2 + col3))
Desired result:
  col1  col2_sum  col23_sum
0    a         3         33
1    b         7         77
Let us try assigning the new column first:
out = df.assign(col23 = df.col2+df.col3).groupby('col1',as_index=False).sum()
Out[81]:
col1 col2 col3 col23
0 a 3 30 33
1 b 7 70 77
From my understanding, apply is closer to summarize in R:
out = (df.groupby('col1')
         .apply(lambda x: pd.Series({'col2_sum': x['col2'].sum(),
                                     'col23_sum': (x['col2'] + x['col3']).sum()}))
         .reset_index())
Out[83]:
col1 col2_sum col23_sum
0 a 3 33
1 b 7 77
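A hedged alternative, assuming pandas 0.25 or later, is named aggregation, which names the outputs up front much like summarize does:
# Precompute the combined column, then name each aggregate explicitly.
out = (df.assign(col23=df['col2'] + df['col3'])
         .groupby('col1', as_index=False)
         .agg(col2_sum=('col2', 'sum'), col23_sum=('col23', 'sum')))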
You can do it easily with datar:
>>> from datar.all import f, tibble, group_by, summarize, sum
>>> df = tibble(
... col1=['a', 'a', 'b', 'b'],
... col2=[1, 2, 3, 4],
... col3=[10, 20, 30, 40]
... )
>>> df >> group_by(f.col1) >> summarize(
... col2_sum = sum(f.col2),
... col23_sum = sum(f.col2 + f.col3)
... )
col1 col2_sum col23_sum
<object> <int64> <int64>
0 a 3 33
1 b 7 77
I am the author of the datar package.
How to multiply all the numeric values in the data frame by a constant without having to specify column names explicitly? Example:
In [13]: df = pd.DataFrame({'col1': ['A','B','C'], 'col2':[1,2,3], 'col3': [30, 10,20]})
In [14]: df
Out[14]:
col1 col2 col3
0 A 1 30
1 B 2 10
2 C 3 20
I tried df.multiply but it affects the string values as well by concatenating them several times.
In [15]: df.multiply(3)
Out[15]:
col1 col2 col3
0 AAA 3 90
1 BBB 6 30
2 CCC 9 60
Is there a way to preserve the string values intact while multiplying only the numeric values by a constant?
You can use select_dtypes(), either including the number dtype or excluding all columns of object and datetime64 dtypes:
Demo:
In [162]: df
Out[162]:
col1 col2 col3 date
0 A 1 30 2016-01-01
1 B 2 10 2016-01-02
2 C 3 20 2016-01-03
In [163]: df.dtypes
Out[163]:
col1 object
col2 int64
col3 int64
date datetime64[ns]
dtype: object
In [164]: df.select_dtypes(exclude=['object', 'datetime']) * 3
Out[164]:
col2 col3
0 3 90
1 6 30
2 9 60
Or a much better solution, courtesy of ayhan:
df[df.select_dtypes(include=['number']).columns] *= 3
From docs:
To select all numeric types use the numpy dtype numpy.number
The other answer specifies how to multiply only numeric columns. Here's how to write the result back into the original DataFrame:
import numpy as np

df = pd.DataFrame({'col1': ['A','B','C'], 'col2': [1, 2, 3], 'col3': [30, 10, 20]})
s = df.select_dtypes(include=[np.number]) * 3
df[s.columns] = s
print (df)
col1 col2 col3
0 A 3 90
1 B 6 30
2 C 9 60
One way would be to get the dtypes, match them against object and datetime dtypes and exclude them with a mask, like so -
df.loc[:, ~np.in1d(df.dtypes, ['object', 'datetime'])] *= 3
Sample run -
In [273]: df
Out[273]:
col1 col2 col3
0 A 1 30
1 B 2 10
2 C 3 20
In [274]: df.loc[:, ~np.in1d(df.dtypes, ['object', 'datetime'])] *= 3
In [275]: df
Out[275]:
col1 col2 col3
0 A 3 90
1 B 6 30
2 C 9 60
This should work even over mixed types within columns, but it is likely slow over large dataframes.
def mul(x, y):
    try:
        return pd.to_numeric(x) * y
    except (ValueError, TypeError):
        return x  # leave non-numeric values as-is

df.applymap(lambda x: mul(x, 3))
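The applymap route is flexible but slow, as noted above. A hedged column-wise variant, assuming no datetime columns, leans on pd.to_numeric with errors='coerce' so only genuinely numeric entries get scaled:
# Coerce each column to numbers; NaN marks the non-numeric cells, which
# keep their original values via where().
def scale_numeric(col, k=3):
    num = pd.to_numeric(col, errors='coerce')
    return col.where(num.isna(), num * k)

df.apply(scale_numeric)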
A simple solution using assign() and select_dtypes():
df.assign(**(df.select_dtypes('number') * 3))
I'm trying to replace values in a Pandas data frame, based on certain criteria on multiple columns. For a single column criteria this can be done very elegantly with a dictionary (e.g. Remap values in pandas column with a dict):
import pandas as pd
df = pd.DataFrame({'col1': {0:1, 1:1, 2:2}, 'col2': {0:10, 1:20, 2:20}})
rdict = {1:'a', 2:'b'}
df2 = df.replace({"col1": rdict})
Input df:
col1 col2
0 1 10
1 1 20
2 2 20
Resulting df2:
col1 col2
0 a 10
1 a 20
2 b 20
I'm trying to extend this to criteria over multiple columns (e.g. where col1==1, col2==10 -> replace). For a single criteria this can be done like:
df3=df.copy()
df3.loc[((df['col1']==1)&(df['col2']==10)), 'col1'] = 'c'
Which results in a df3:
col1 col2
0 c 10
1 1 20
2 2 20
My real life problem has a large number of criteria, which would involve a large number of df3.loc[((criteria1)&(criteria2)), column] = value calls, which is far less elegant than the replacement using a dictionary as a "lookup table". Is it possible to extend the elegant solution (df2 = df.replace({"col1": rdict})) to a setup where values in one column are replaced by criteria based on multiple columns?
An example of what I'm trying to achieve (although in my real life case the number of criteria is a lot larger):
df = pd.DataFrame({'col1': {0:1, 1:1, 2:2, 3:2}, 'col2': {0:10, 1:20, 2:10, 3:20}})
df3=df.copy()
df3.loc[((df['col1']==1)&(df['col2']==10)), 'col1'] = 'a'
df3.loc[((df['col1']==1)&(df['col2']==20)), 'col1'] = 'b'
df3.loc[((df['col1']==2)&(df['col2']==10)), 'col1'] = 'c'
df3.loc[((df['col1']==2)&(df['col2']==20)), 'col1'] = 'd'
Input df:
   col1  col2
0     1    10
1     1    20
2     2    10
3     2    20
Resulting df3:
col1 col2
0 a 10
1 b 20
2 c 10
3 d 20
We can use merge.
Suppose your df looks like
df = pd.DataFrame({'col1': {0:1, 1:1, 2:2, 3:2, 4:2, 5:1}, 'col2': {0:10, 1:20, 2:10, 3:20, 4: 20, 5:10}})
col1 col2
0 1 10
1 1 20
2 2 10
3 2 20
4 2 20
5 1 10
And your conditional replacement can be represented as another dataframe:
df_replace
col1 col2 val
0 1 10 a
1 1 20 b
2 2 10 c
3 2 20 d
(As OP (Bart) pointed out, you can save this in a csv file.)
Then you can use
df = df.merge(df_replace, on=["col1", "col2"], how="left")
col1 col2 val
0 1 10 a
1 1 20 b
2 2 10 c
3 2 20 d
4 2 20 d
5 1 10 a
Then you just need to drop col1.
As MaxU pointed out, there could be rows that do not get replaced, resulting in NaN. We can use a line like
df["val"] = df["val"].combine_first(df["col1"])
to fill in values from col1 if the resulting values after merge is NaN.
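A hedged alternative to the merge, assuming df_replace is built as above: index the replacement table by the key columns and map the (col1, col2) pairs through it directly.
# Look up each (col1, col2) pair in the replacement table; unmatched
# pairs come back NaN and fall back to the original col1 value.
repl = df_replace.set_index(['col1', 'col2'])['val']
df['col1'] = pd.Series(
    df.set_index(['col1', 'col2']).index.map(repl),
    index=df.index,
).fillna(df['col1'])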
Demo:
Source DF:
In [120]: df
Out[120]:
col1 col2
0 1 10
1 1 10
2 1 20
3 1 20
4 2 10
5 2 20
6 3 30
Conditions & Replacements DF:
In [121]: cond
Out[121]:
col1 col2 repl
1 1 20 b
2 2 10 c
0 1 10 a
3 2 20 d
Solution:
In [121]: res = df.merge(cond, how='left')
yields:
In [122]: res
Out[122]:
col1 col2 repl
0 1 10 a
1 1 10 a
2 1 20 b
3 1 20 b
4 2 10 c
5 2 20 d
6 3 30 NaN # <-- NOTE
In [123]: res['col1'] = res.pop('repl').fillna(res['col1'])
In [124]: res
Out[124]:
col1 col2
0 a 10
1 a 10
2 b 20
3 b 20
4 c 10
5 d 20
6 3 30
This method is likely to be more efficient than the pandas-based approaches, as it relies on NumPy arrays and dictionary mappings.
import pandas as pd
df = pd.DataFrame({'col1': {0:1, 1:1, 2:2, 3:2}, 'col2': {0:10, 1:20, 2:10, 3:20}})
rdict = {(1, 10): 'a', (1, 20): 'b', (2, 10): 'c', (2, 20): 'd'}
df['col1'] = list(map(rdict.get, [(x[0], x[1]) for x in df[['col1', 'col2']].values]))
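One caveat: dict.get returns None for pairs without a rule. A hedged variant keeps the original col1 value as the fallback:
# Supply the original col1 value as the default for unmapped pairs.
df['col1'] = [rdict.get((a, b), a) for a, b in zip(df['col1'], df['col2'])]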
I have a Pandas dataframe that looks something like:
df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [50, 60, 70, 80]}, index=['A', 'B', 'C', 'D'])
col1 col2
A 1 50
B 2 60
C 3 70
D 4 80
However, I want to automatically rearrange it so that it looks like:
col1 A col1 B col1 C col1 D col2 A col2 B col2 C col2 D
0 1 2 3 4 50 60 70 80
I want to combine the row name with the column name
I want to end up with only one row
df2 = df.unstack()
df2.index = [' '.join(x) for x in df2.index.values]
df2 = pd.DataFrame(df2).T
df2
col1 A col1 B col1 C col1 D col2 A col2 B col2 C col2 D
0       1       2       3       4      50      60      70      80
If you want to have the original x axis labels in front of the column names ("A col1"...), just change .join(x) to .join(x[::-1]):
df2 = df.unstack()
df2.index = [' '.join(x[::-1]) for x in df2.index.values]
df2 = pd.DataFrame(df2).T
df2
A col1 B col1 C col1 D col1 A col2 B col2 C col2 D col2
0      1      2      3      4     50     60     70     80
Here's one way to do it; there could be a simpler way:
In [562]: df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [50, 60, 70, 80]},
index=['A', 'B', 'C', 'D'])
In [563]: pd.DataFrame([df.values.T.ravel()],
columns=[y+x for y in df.columns for x in df.index])
Out[563]:
col1A col1B col1C col1D col2A col2B col2C col2D
0 1 2 3 4 50 60 70 80
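A hedged tweak on the same idea that reproduces the 'col1 A' spacing from the desired output (the f-string is just a readability choice):
# Same ravel trick, with a space between column name and row label.
pd.DataFrame([df.values.T.ravel()],
             columns=[f'{c} {i}' for c in df.columns for i in df.index])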