I'm learning pandas. I have two dataframes:
df1 =
quality1 value
A 1
B 2
C 3
df2 =
quality2 value
D 1
E 10
F 100
I want to multiply them (as I might do with vectors to get a matrix). The answer should be:
df3 =
quality1 quality2 value
A D 1
E 10
F 100
B D 2
E 20
F 200
C D 3
E 30
F 300
How can I achieve this?
It's not the prettiest, but it would work:
>>> df1["dummy"] = 1
>>> df2["dummy"] = 1
>>> dfm = df1.merge(df2, on="dummy")
>>> dfm["value"] = dfm.pop("value_x") * dfm.pop("value_y")
>>> del dfm["dummy"]
>>> dfm
quality1 quality2 value
0 A D 1
1 A E 10
2 A F 100
3 B D 2
4 B E 20
5 B F 200
6 C D 3
7 C E 30
8 C F 300
Until we get native support for a Cartesian join (whistles and looks away..), merging on a dummy column is an easy way to get the same effect. The intermediate frame looks like
>>> dfm
quality1 value_x dummy quality2 value_y
0 A 1 1 D 1
1 A 1 1 E 10
2 A 1 1 F 100
3 B 2 1 D 1
4 B 2 1 E 10
5 B 2 1 F 100
6 C 3 1 D 1
7 C 3 1 E 10
8 C 3 1 F 100
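For readers on a newer pandas: releases from 1.2 onward accept how="cross" in merge, which removes the need for the dummy column. A minimal sketch of the same multiplication, assuming pandas 1.2 or later (the suffixes "_1" and "_2" are just illustrative names):

```python
import pandas as pd

df1 = pd.DataFrame({"quality1": list("ABC"), "value": [1, 2, 3]})
df2 = pd.DataFrame({"quality2": list("DEF"), "value": [1, 10, 100]})

# how="cross" pairs every row of df1 with every row of df2 (pandas >= 1.2).
dfm = df1.merge(df2, how="cross", suffixes=("_1", "_2"))

# Multiply the paired values and drop the two source columns in one step.
dfm["value"] = dfm.pop("value_1") * dfm.pop("value_2")
print(dfm)
```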
You could also use the cartesian function from scikit-learn:
import numpy as np
import pandas as pd
from sklearn.utils.extmath import cartesian
# Your data:
df1 = pd.DataFrame({'quality1':list('ABC'), 'value':[1,2,3]})
df2 = pd.DataFrame({'quality2':list('DEF'), 'value':[1,10,100]})
# Make the matrix of labels:
dfm = pd.DataFrame(cartesian((df1.quality1.values, df2.quality2.values)),
                   columns=['quality1', 'quality2'])
# Multiply values (pd.np has been removed from pandas, so use numpy directly):
dfm['value'] = df1.value.values.repeat(df2.value.size) * np.tile(df2.value.values, df1.value.size)
print(dfm.set_index(['quality1', 'quality2']))
Which yields:
value
quality1 quality2
A D 1
E 10
F 100
B D 2
E 20
F 200
C D 3
E 30
F 300
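If you would rather not depend on scikit-learn, the same indexed table can probably be built with pandas and NumPy alone. This is a sketch of a different technique than the answer above: MultiIndex.from_product generates the label pairs and np.outer computes the vector outer product that the question describes.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"quality1": list("ABC"), "value": [1, 2, 3]})
df2 = pd.DataFrame({"quality2": list("DEF"), "value": [1, 10, 100]})

# All (quality1, quality2) label pairs; the first level varies slowest.
index = pd.MultiIndex.from_product(
    [df1["quality1"], df2["quality2"]], names=["quality1", "quality2"]
)

# Outer product of the two value vectors, flattened row by row so that
# its order matches the order of the label pairs above.
values = np.outer(df1["value"].to_numpy(), df2["value"].to_numpy()).ravel()

dfm = pd.DataFrame({"value": values}, index=index)
print(dfm)
```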
What is a pythonic way of counting the number of level 1 entries per level 0 value in a MultiIndex and storing the result in a new column (named counts)? I can achieve this in the following way, but I would like to understand any simpler approaches:
Code
df = pd.DataFrame({'STNAME':['AL'] * 3 + ['MI'] * 4,
'CTYNAME':list('abcdefg'),
'COL': range(7) }).set_index(['STNAME','CTYNAME'])
print(df)
COL
STNAME CTYNAME
AL a 0
b 1
c 2
MI d 3
e 4
f 5
g 6
df1 = df.groupby(level=0).size().reset_index(name='count')
counts = df.merge(df1,left_on="STNAME",right_on="STNAME")["count"].values
df["counts"] = counts
This is the desired output:
COL counts
STNAME CTYNAME
AL a 0 3
b 1 3
c 2 3
MI d 3 4
e 4 4
f 5 4
g 6 4
You can use groupby.transform with size here instead of merging:
output = df.assign(counts=df.groupby(level=0)['COL'].transform('size'))
print(output)
COL counts
STNAME CTYNAME
AL a 0 3
b 1 3
c 2 3
MI d 3 4
e 4 4
f 5 4
g 6 4
I used the following code to try to change the values in columns 4, 5 and 6 of a dataframe to percentage format, but it raised an error.
df.iloc[:,4:7].apply('{:.2%}'.format)
You can use DataFrame.applymap:
df = pd.DataFrame({
'a':list('abcdef'),
'b':list('aaabbb'),
'c':[4,5,4,5,5,4],
'd':[7,8,9,4,2,3],
'e':[5,3,6,9,2,4],
'f':[7,8,9,4,2,3],
'g':[1,3,5,7,1,0],
'h':[7,8,9,4,2,3],
'i':[1,3,5,7,1,0]
})
df.iloc[:,4:7] = df.iloc[:,4:7].applymap('{:.2%}'.format)
print (df)
a b c d e f g h i
0 a a 4 7 500.00% 700.00% 100.00% 7 1
1 b a 5 8 300.00% 800.00% 300.00% 8 3
2 c a 4 9 600.00% 900.00% 500.00% 9 5
3 d b 5 4 900.00% 400.00% 700.00% 4 7
4 e b 5 2 200.00% 200.00% 100.00% 2 1
5 f b 4 3 400.00% 300.00% 0.00% 3 0
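The original attempt most likely fails because DataFrame.apply hands each whole column (a Series) to the format call, whereas the format spec only makes sense for individual numbers. A minimal sketch of an element-wise alternative that avoids applymap (deprecated in recent pandas releases) is to map over each selected column. The sample frame and its column names are hypothetical, and the values are assumed to be fractions:

```python
import pandas as pd

# Hypothetical sample data: fractions that should display as percentages.
df = pd.DataFrame({"name": ["x", "y"], "p": [0.5, 0.125], "q": [1.0, 0.0325]})

# apply() passes one column at a time; Series.map then formats each element.
formatted = df[["p", "q"]].apply(lambda col: col.map("{:.2%}".format))

# Work on a copy so the numeric originals stay available for calculations.
out = df.copy()
out[["p", "q"]] = formatted
print(out)
```

Note that the formatted columns hold strings, so it is usually best to do this only as a final display step.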
I'm working on a pandas data frame where I want to find the last non-null value in each row, reverse the order of the values up to that point, and output a data frame with the reversed rows and no null values in the first column. Essentially, I want to reverse the column order and shift the non-null values to the left.
IN:
1 2 3 4 5
1 a b c d e
2 a b c
3 a b c d
4 a b c
OUT:
1 2 3 4 5
1 e d c b a
2 c b a
3 d c b a
4 c b a
For each row, create a new Series with the same indexes but with the values reversed:
def reverse(s):
    # Strip the NaN on both ends, but not in the middle
    idx1 = s.first_valid_index()
    idx2 = s.last_valid_index()
    idx = s.loc[idx1:idx2].index
    return pd.Series(s.loc[idx[::-1]].values, index=idx)
df.apply(reverse, axis=1)
Result:
1 2 3 4 5
1 e d c b a
2 c b a NaN NaN
3 d c b a NaN
4 c NaN b a NaN
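To try this end to end, here is a self-contained sketch. The input frame is an assumption reconstructed from the result above: columns labelled 1 to 5, NaN for the empty cells, and a gap in the middle of row 4.

```python
import numpy as np
import pandas as pd

# Assumed input: NaN marks empty cells; row 4 has a gap in the middle.
df = pd.DataFrame(
    [
        ["a", "b", "c", "d", "e"],
        ["a", "b", "c", np.nan, np.nan],
        ["a", "b", "c", "d", np.nan],
        ["a", "b", np.nan, "c", np.nan],
    ],
    index=[1, 2, 3, 4],
    columns=[1, 2, 3, 4, 5],
)

def reverse(s):
    # Keep only the span between the first and last valid values.
    idx = s.loc[s.first_valid_index():s.last_valid_index()].index
    # Reverse the values but keep the labels in their original order,
    # which shifts the reversed values to the left.
    return pd.Series(s.loc[idx[::-1]].values, index=idx)

result = df.apply(reverse, axis=1)
print(result)
```

Rows of different lengths are aligned on the column labels, so the shorter rows are padded with NaN on the right.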
I have written my own user-defined function in Python. The inputs are some parameters and a dataframe. First, some new variables are added to the input dataframe. Then I try to do a groupby on the dataframe and left join the result onto the dataframe.
But the groupby variables don't get added to the dataframe.
def test(df, params):
    df['b']=df['a']*params['some_parameter']
    df['c']=df['b']*df['total']
    aaa=df.groupby(['aa', 'bb']).agg({'c':'sum'})
    df=pd.merge(df,aaa,how='left',on=['aa', 'bb'])
    return
Next try:
def test(df, params):
    df['b']=df['a']*params['some_parameter']
    df['d']=df['c']*df['b']
    aaa=df.groupby(['y','x']).agg({'d':'sum','g':'sum'}).add_suffix('_sum')
    df=df.join(aaa, on=['y','x'])
    return
I then call the function by:
test(df2,params)
I would expect df2 would have 4 new columns, b, d, d_sum and g_sum. But it only has 2 new columns, b and d.
You can use GroupBy.transform instead of groupby followed by a left join with merge. Change:
aaa=df.groupby(['aa', 'bb']).agg({'c':'sum'})
df=pd.merge(df,aaa,how='left',on=['aa', 'bb'])
to:
df['c1'] = df.groupby(['aa', 'bb'])['c'].transform('sum')
All together:
def test(df, params):
    df['b']=df['a']*params['some_parameter']
    df['c']=df['b']*df['total']
    df['new'] = df.groupby(['aa', 'bb'])['c'].transform('sum')
    return df
If you need to aggregate multiple columns, you can use DataFrame.join, which does a left join by default:
df = pd.DataFrame({
'x':list('dddddd'),
'y':list('aaabbb'),
'a':[4,5,4,5,5,4],
'b':[7,8,9,4,2,3],
'c':[1,3,5,7,1,0],
'd':[5,3,6,9,2,4],
'g':[1,3,6,4,4,3],
})
print (df)
x y a b c d g
0 d a 4 7 1 5 1
1 d a 5 8 3 3 3
2 d a 4 9 5 6 6
3 d b 5 4 7 9 4
4 d b 5 2 1 2 4
5 d b 4 3 0 4 3
params = {'some_parameter':100}
def test(df, params):
    df['b']=df['a']*params['some_parameter']
    df['d']=df['c']*df['b']
    aaa=df.groupby(['y','x']).agg({'d':'sum','g':'sum'}).add_suffix('_sum')
    df=df.join(aaa, on=['y','x'])
    return df
df1 = test(df, params)
print (df1)
x y a b c d g d_sum g_sum
0 d a 4 400 1 400 1 3900 10
1 d a 5 500 3 1500 3 3900 10
2 d a 4 400 5 2000 6 3900 10
3 d b 5 500 7 3500 4 4000 11
4 d b 5 500 1 500 4 4000 11
5 d b 4 400 0 0 3 4000 11
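As for why the original version only added some of the columns: assigning df['b'] = ... modifies the object the caller passed in, but df = df.join(...) creates a new dataframe and binds it to the local name only, so the caller never sees it unless it is returned. A small sketch of the difference, with hypothetical function and column names:

```python
import pandas as pd

def rebinds_locally(df):
    df["b"] = df["a"] * 2                     # mutates the caller's object
    totals = df.groupby("key")["b"].sum().rename("b_sum")
    df = df.join(totals, on="key")            # new object, local name only
    # Nothing is returned, so the joined frame is discarded here.

def returns_result(df):
    df["b"] = df["a"] * 2
    totals = df.groupby("key")["b"].sum().rename("b_sum")
    return df.join(totals, on="key")          # hand the new frame back

frame = pd.DataFrame({"key": list("xxy"), "a": [1, 2, 3]})
rebinds_locally(frame)
print(list(frame.columns))                    # 'b' was added, 'b_sum' was not

fresh = pd.DataFrame({"key": list("xxy"), "a": [1, 2, 3]})
joined = returns_result(fresh)
print(list(joined.columns))
```

So the caller has to capture the return value, as in df1 = test(df, params) above.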
New to Pandas, not very sure how the 3D DataFrame works. My dataframe, called 'new' looks like this:
unique cat numerical
a b c d e f
0 0 1 2 3 4 5
1 0 1 2 3 4 5
I want to insert column 'z' so that it ends up like this:
unique cat numerical
a b z c d e f
0 0 1 9 2 3 4 5
1 0 1 9 2 3 4 5
I successfully made a new column after slicing out 'unique' from my dataframe:
Doing this:
new_column = new.loc[:,'unique'].assign(z=pd.Series([9,9]).values)
Gets me this:
a b z
0 0 1 9
1 0 1 9
However I have no idea how to put it back into the dataframe. I tried:
new['unique'] = new_column
But I've since found out that it just tries to replace all the values in all the rows and columns found under 'unique', like this:
new['unique'] = 'a'
Gets:
unique cat numerical
a b c d e f
0 a a 2 3 4 5
1 a a 2 3 4 5
And using .loc gets this instead:
unique cat numerical
a b c d e f
0 NaN NaN 2 3 4 5
1 NaN NaN 2 3 4 5
Here's my full code:
import pandas as pd
import numpy as np
data=[[0,1,2,3,4,5],[0,1,2,3,4,5]]
datatypes=np.array(['unique','unique','cat','cat','numerical','numerical'])
columnnames=np.array(['a','b','c','d','e','f'])
new = pd.DataFrame(data=data, columns=pd.MultiIndex.from_tuples(zip(datatypes,columnnames)))
print('new: ')
print(new)
new_column = new.loc[:,'unique'].assign(z=pd.Series([9,9]).values)
print('\nnew column:')
print(new_column)
new.loc[:,'unique'] = new_column
print('\nattempt 1:')
print(new)
new['unique'] = new_column
print('\nattempt 2:')
print(new)
One way to do this:
# Create your new multiindexed column:
new['unique','z'] = 9
# Re-order your columns in your desired order:
new = new[['unique', 'cat', 'numerical']]
>>> new
unique cat numerical
a b z c d e f
0 0 1 9 2 3 4 5
1 0 1 9 2 3 4 5
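Here is the same approach as a self-contained sketch that can be run from scratch. It builds the columns with MultiIndex.from_arrays instead of zipping tuples, which is only a stylistic assumption. The new column is first appended at the far right, and selecting by the top-level labels then regroups the columns:

```python
import pandas as pd

columns = pd.MultiIndex.from_arrays(
    [
        ["unique", "unique", "cat", "cat", "numerical", "numerical"],
        ["a", "b", "c", "d", "e", "f"],
    ]
)
new = pd.DataFrame([[0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5]], columns=columns)

# A tuple key addresses one column under the 'unique' group.
new[("unique", "z")] = 9

# Selecting the top-level labels in the desired order pulls 'z' back
# next to the other 'unique' columns.
new = new[["unique", "cat", "numerical"]]
print(new)
```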