I have two dataframes,
df1:
hash a b c
ABC 1 2 3
def 5 3 4
Xyz 3 2 -1
df2:
hash v
Xyz 3
def 5
I want to make
df:
hash a b c
ABC 1 2 3 (= as is, because no matching 'ABC' in df2)
def 25 15 20 (= 5*5 3*5 4*5)
Xyz 9 6 -3 (= 3*3 2*3 -1*3)
As shown above, I want to build a dataframe whose values come from multiplying df1 by df2's v column wherever the index (or first-column hash) matches.
Since df2 has only one column (v), all of df1's columns except the first one (the index) should be scaled.
Is there a neat, Pythonic pandas way to achieve this?
df1.set_index(['hash']).mul(df2.set_index(['hash'])) and similar attempts don't seem to work.
One approach:
df1 = df1.set_index("hash")
df2 = df2.set_index("hash")["v"]
res = df1.mul(df2, axis=0).combine_first(df1)
print(res)
Output
a b c
hash
ABC 1.0 2.0 3.0
Xyz 9.0 6.0 -3.0
def 25.0 15.0 20.0
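The same idea can also be written without combine_first, by mapping each hash to its multiplier and defaulting to 1 for unmatched rows; a minimal sketch using the sample data from the question:

```python
import pandas as pd

df1 = pd.DataFrame({"hash": ["ABC", "def", "Xyz"],
                    "a": [1, 5, 3], "b": [2, 3, 2], "c": [3, 4, -1]})
df2 = pd.DataFrame({"hash": ["Xyz", "def"], "v": [3, 5]})

# Look up each row's multiplier; hashes absent from df2 get NaN, which
# fillna(1) turns into a no-op factor.
factor = df1["hash"].map(df2.set_index("hash")["v"]).fillna(1)
df1[["a", "b", "c"]] = df1[["a", "b", "c"]].mul(factor, axis=0)
```

Note the value columns become floats, since the fill value 1 arrives as a float.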
One Method:
# We'll make this for convenience
cols = ['a', 'b', 'c']
# Merge the DataFrames, keeping everything from df1
df = df1.merge(df2, how='left').fillna(1)
# We'll make the v column integers again since it's been filled.
df.v = df.v.astype(int)
# Broadcast the multiplication across axis 0
df[cols] = df[cols].mul(df.v, axis=0)
# Drop the no-longer needed column:
df = df.drop('v', axis=1)
print(df)
Output:
hash a b c
0 ABC 1 2 3
1 def 25 15 20
2 Xyz 9 6 -3
Alternative Method:
# Set indices
df1 = df1.set_index('hash')
df2 = df2.set_index('hash')
# Apply multiplication and fill values
df = (df1.mul(df2.v, axis=0)
         .fillna(df1)
         .astype(int)
         .reset_index())
# Output:
hash a b c
0 ABC 1 2 3
1 Xyz 9 6 -3
2 def 25 15 20
The function you are looking for is actually multiply.
Here's how I have done it:
>>> df
hash a b
0 ABC 1 2
1 DEF 5 3
2 XYZ 3 -1
>>> df2
hash v
0 XYZ 4
1 ABC 8
df = df.merge(df2, on='hash', how='left').fillna(1)
>>> df
hash a b v
0 ABC 1 2 8.0
1 DEF 5 3 1.0
2 XYZ 3 -1 4.0
df[['a','b']] = df[['a','b']].multiply(df['v'], axis='index')
>>> df
hash a b v
0 ABC 8.0 16.0 8.0
1 DEF 5.0 3.0 1.0
2 XYZ 12.0 -4.0 4.0
You can actually drop v at the end if you don't need it.
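For completeness, a runnable sketch of the full sequence above, including dropping the helper column v at the end:

```python
import pandas as pd

df = pd.DataFrame({"hash": ["ABC", "DEF", "XYZ"],
                   "a": [1, 5, 3], "b": [2, 3, -1]})
df2 = pd.DataFrame({"hash": ["XYZ", "ABC"], "v": [4, 8]})

# Left-merge brings in v; unmatched hashes get NaN, filled with the
# identity factor 1.
df = df.merge(df2, on="hash", how="left").fillna(1)
df[["a", "b"]] = df[["a", "b"]].multiply(df["v"], axis="index")
df = df.drop(columns="v")  # drop the helper column once it has done its job
```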
Related
Imagine the following dataframe
Base dataframe
df = pd.DataFrame({'a': [1,2,1,2,1],
                   'b': [1,1,3,3,1]
                   })
Then I pick out the a column and replace a few values based on the b column's values:
df.loc[df['b']== 3]['a'].replace(2,1)
How could I put my a column back into the original df, changing only those specific filtered values?
Wanted result
df = pd.DataFrame({'a': [1,2,1,1,1],
                   'b': [1,1,3,3,1]
                   })
Do it with update:
df.update(df.loc[df['b']== 3,['a']].replace(2,1))
df
Out[354]:
a b
0 1.0 1
1 2.0 1
2 1.0 3
3 1.0 3
4 1.0 1
You can try df.mask
df['a'] = df['a'].mask(df['a'].eq(2) & df['b'].eq(3), 1)
print(df)
a b
0 1 1
1 2 1
2 1 3
3 1 3
4 1 1
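The same conditional replacement can also be done with a plain .loc boolean assignment, which writes into the frame directly and avoids the chained-indexing copy from the question; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 2, 1],
                   'b': [1, 1, 3, 3, 1]})

# Select the rows where a == 2 and b == 3, and overwrite a in place
df.loc[df['a'].eq(2) & df['b'].eq(3), 'a'] = 1
```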
I want to create pandas data frame with multiple lists with different length. Below is my python code.
import random
import pandas as pd
A=[1,2]
B=[1,2,3]
C=[1,2,3,4,5,6]
lenA = len(A)
lenB = len(B)
lenC = len(C)
df = pd.DataFrame(columns=['A', 'B','C'])
for i, v1 in enumerate(A):
    for j, v2 in enumerate(B):
        for k, v3 in enumerate(C):
            if i < random.randint(0, lenA):
                if j < random.randint(0, lenB):
                    if k < random.randint(0, lenC):
                        df = df.append({'A': v1, 'B': v2, 'C': v3}, ignore_index=True)
print(df)
My lists are as below:
A=[1,2]
B=[1,2,3]
C=[1,2,3,4,5,6,7]
In each run I get a different output, which is expected, but it does not cover all list items in each run. In one run I got this output:
A B C
0 1 1 3
1 1 2 1
2 1 2 2
3 2 2 5
In the above output all of list A's items (1, 2) are present, but list B has only the items (1, 2); item 3 is missing. Likewise, list C has only the items (1, 2, 3, 5); items (4, 6, 7) are missing. My expectation is that each item of every list appears in the data frame at least once, and that list C's items appear exactly once. My expected sample output is as below:
A B C
0 1 1 3
1 1 2 1
2 1 2 2
3 2 2 5
4 2 3 4
5 1 1 7
6 2 3 6
Guide me to get my expected output. Thanks in advance.
You can add random values of each list to total length and then use DataFrame.sample:
import numpy as np

A=[1,2]
B=[1,2,3]
C=[1,2,3,4,5,6]
L = [A,B,C]
m = max(len(x) for x in L)
print (m)
6
a = [np.hstack((np.random.choice(x, m - len(x)), x)) for x in L]
df = pd.DataFrame(a, index=['A', 'B', 'C']).T.sample(frac=1)
print (df)
A B C
2 2 2 3
0 2 1 1
3 1 1 4
4 1 2 5
5 2 3 6
1 2 2 2
You can use transpose to achieve the same.
EDIT: Used random to randomize the output as requested.
import pandas as pd
from random import shuffle, choice
A=[1,2]
B=[1,2,3]
C=[1,2,3,4,5,6]
shuffle(A)
shuffle(B)
shuffle(C)
data = [A,B,C]
df = pd.DataFrame(data)
df = df.transpose()
df.columns = ['A', 'B', 'C']
# Note: fillna(..., inplace=True) on the df.loc[:, 'A'] slice operates on a
# copy, so assign the result back instead:
df['A'] = df['A'].fillna(choice(A))
df['B'] = df['B'].fillna(choice(B))
Before the fills, df looks like this; the two fillna lines then replace the NaN values in A and B with a random choice from each list:
A B C
0 1.0 1.0 1.0
1 2.0 2.0 2.0
2 NaN 3.0 3.0
3 NaN 4.0 4.0
4 NaN NaN 5.0
5 NaN NaN 6.0
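If the goal is simply to line unequal-length lists up as columns, itertools.zip_longest is another option; it pads the shorter lists (here with None, which pandas turns into NaN). A sketch:

```python
import pandas as pd
from itertools import zip_longest

A = [1, 2]
B = [1, 2, 3]
C = [1, 2, 3, 4, 5, 6]

# zip_longest pads the shorter lists so every row tuple has the same length
rows = list(zip_longest(A, B, C))
df = pd.DataFrame(rows, columns=["A", "B", "C"])
```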
I've written some functions to help aggregate data. In the end, they give me what I want, but with a crazy multi-indexed series:
fec988a2-6eba-49e0-8327-a89f25143ccf fec988a2-6eba-49e0-8327-a89f25143ccf com.facebook.katana fec988a2-6eba-49e0-8327-a89f25143ccf 1067
com.android.systemui fec988a2-6eba-49e0-8327-a89f25143ccf 935
com.facebook.orca fec988a2-6eba-49e0-8327-a89f25143ccf 893
com.android.chrome fec988a2-6eba-49e0-8327-a89f25143ccf 739
com.whatsapp fec988a2-6eba-49e0-8327-a89f25143ccf 515
I only need the first index, and the one with the app names (and the value of course). How do I get rid of unwanted indices like this?
You can use a double reset_index: the first removes the unnecessary level (only level 2 here, because group_keys=False in the groupby already removes another), and the second, with name='new', converts the Series to a DataFrame and sets the new column name:
df = pd.DataFrame({'application':list('abbddedcc'),
'id':list('aaabbbbbb')})
print (df)
application id
0 a a
1 b a
2 b a
3 d b
4 d b
5 e b
6 d b
7 c b
8 c b
top = 2
df1 = (df.groupby(['id', 'application'])['id']
         .value_counts()
         .groupby(['id'], group_keys=False)
         .nlargest(top)
         .reset_index(level=2, drop=True)
         .reset_index(name='new'))
print (df1)
id application new
0 a b 2
1 a a 1
2 b d 3
3 b c 2
Or remove id from the first groupby; do test whether this gives the same output with your real data:
top = 2
df1 = (df.groupby(['application'])['id']
         .value_counts()
         .groupby(['id'], group_keys=False)
         .nlargest(top)
         .reset_index(name='new'))
print (df1)
application id new
0 b a 2
1 a a 1
2 d b 3
3 c b 2
You can use pd.DataFrame.reset_index() or pd.Series.reset_index() with drop=True argument:
n = 5
df = pd.DataFrame({'idx0': [0] * n, 'idx1': range(n, 0, -1),
                   'idx2': range(0, n), 'idx3': ['a'] * n,
                   'value': [i/2 for i in range(n)]},
                  ).set_index(['idx0', 'idx1', 'idx2', 'idx3'])
df
Out:
idx0 idx1 idx2 idx3 value
0 5 0 a 0.0
4 1 a 0.5
3 2 a 1.0
2 3 a 1.5
1 4 a 2.0
df.reset_index(level=(1, 3), drop=True)
Out:
idx0 idx2 value
0 0 0.0
1 0.5
2 1.0
3 1.5
4 2.0
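In pandas 0.24+ the same result is available via DataFrame.droplevel, which accepts level names or positions and removes whole index levels without touching the data; a sketch with the frame above:

```python
import pandas as pd

n = 5
df = pd.DataFrame({'idx0': [0] * n, 'idx1': range(n, 0, -1),
                   'idx2': range(0, n), 'idx3': ['a'] * n,
                   'value': [i/2 for i in range(n)]},
                  ).set_index(['idx0', 'idx1', 'idx2', 'idx3'])

# Drop the two unwanted levels by name
out = df.droplevel(['idx1', 'idx3'])
```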
I have the following data frame:
import pandas as pd
df = pd.DataFrame({'probe':["a","b","c","d"], 'gene':["foo","bar","qux","woz"], 'cellA.1':[5,0,1,0], 'cellA.2':[12,90,13,0],'cellB.1':[15,3,11,2],'cellB.2':[5,7,11,1] })
df = df[["probe", "gene","cellA.1","cellA.2","cellB.1","cellB.2"]]
Which looks like this:
In [17]: df
Out[17]:
probe gene cellA.1 cellA.2 cellB.1 cellB.2
0 a foo 5 12 15 5
1 b bar 0 90 3 7
2 c qux 1 13 11 11
3 d woz 0 0 2 1
Note that the value columns share a common prefix (e.g. cellA and cellB). In the real data there can be more cell IDs than these two, and the numerical suffix can also go higher (e.g. cellFoo.5).
What I want to do is to get the average so that it looks like this
probe gene cellA cellB
a foo 8.5 10
b bar 45 5
c qux 7 11
d woz 0 1.5
How can I achieve that with Pandas?
One way would be to make a function which takes a column name and turns it into the group you want to put it in:
>>> df = df.set_index(["probe", "gene"])
>>> df.groupby(lambda x: x.split(".")[0], axis=1).mean()
cellA cellB
probe gene
a foo 8.5 10.0
b bar 45.0 5.0
c qux 7.0 11.0
d woz 0.0 1.5
>>> df.groupby(lambda x: x.split(".")[0], axis=1).mean().reset_index()
probe gene cellA cellB
0 a foo 8.5 10.0
1 b bar 45.0 5.0
2 c qux 7.0 11.0
3 d woz 0.0 1.5
Note that we set the index (and reset it afterwards) so we didn't have to special-case the groups we didn't want to touch; also note we had to specify axis=1 because we want to group columnwise, not rowwise.
You can use groupby():
import pandas as pd
df = pd.DataFrame({'probe':["a","b","c","d"], 'gene':["foo","bar","qux","woz"], 'cellA.1':[5,0,1,0], 'cellA.2':[12,90,13,0],'cellB.1':[15,3,11,2],'cellB.2':[5,7,11,1] })
df = df[["probe", "gene","cellA.1","cellA.2","cellB.1","cellB.2"]]
mask = df.columns.str.contains(".", regex=False)
df1 = df.loc[:, ~mask]
df2 = df.loc[:, mask]
pd.concat([df1, df2.groupby(lambda name:name.split(".")[0], axis=1).mean()], axis=1)
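One caveat: axis=1 in groupby is deprecated in recent pandas versions. An equivalent that avoids it is to transpose, group the rows by column prefix, and transpose back; a sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({'probe': ["a", "b", "c", "d"],
                   'gene': ["foo", "bar", "qux", "woz"],
                   'cellA.1': [5, 0, 1, 0], 'cellA.2': [12, 90, 13, 0],
                   'cellB.1': [15, 3, 11, 2], 'cellB.2': [5, 7, 11, 1]})
df = df.set_index(["probe", "gene"])

# After transposing, the old column names become the row index, so a
# plain (axis=0) groupby on the prefix does the column-wise averaging.
out = df.T.groupby(lambda c: c.split(".")[0]).mean().T.reset_index()
```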
You could use a list comprehension.
In [1]: df['cellA'] = [(x+y)/2. for x,y in zip(df['cellA.1'], df['cellA.2'])]
In [2]: df['cellB'] = [(x+y)/2. for x,y in zip(df['cellB.1'], df['cellB.2'])]
In [3]: df = df[['probe', 'gene', 'cellA', 'cellB']]
In [4]: df
Out [4]:
probe gene cellA cellB
a foo 8.5 10.0
b bar 45.0 5.0
c qux 7.0 11.0
d woz 0.0 1.5
How do I add an order-number column to an existing DataFrame?
This is my DataFrame:
import pandas as pd
import math
frame = pd.DataFrame([[1, 4, 2], [8, 9, 2], [10, 2, 1]], columns=['a', 'b', 'c'])
def add_stats(row):
    row['sum'] = sum([row['a'], row['b'], row['c']])
    row['sum_sq'] = sum(math.pow(v, 2) for v in [row['a'], row['b'], row['c']])
    row['max'] = max(row['a'], row['b'], row['c'])
    return row
frame = frame.apply(add_stats, axis=1)
print(frame.head())
The resulting data is:
a b c sum sum_sq max
0 1 4 2 7 21 4
1 8 9 2 19 149 9
2 10 2 1 13 105 10
First, I would like to add three extra columns with order numbers, sorting on sum, sum_sq and max, respectively. Next, these three columns should be combined into one column - the mean of the order numbers - but I do know how to do that part (with apply and axis=1).
I think you're looking for rank where you mention sorting. Given your example, add:
frame['sum_order'] = frame['sum'].rank()
frame['sum_sq_order'] = frame['sum_sq'].rank()
frame['max_order'] = frame['max'].rank()
frame['mean_order'] = frame[['sum_order', 'sum_sq_order', 'max_order']].mean(axis=1)
To get:
a b c sum sum_sq max sum_order sum_sq_order max_order mean_order
0 1 4 2 7 21 4 1 1 1 1.000000
1 8 9 2 19 149 9 3 3 2 2.666667
2 10 2 1 13 105 10 2 2 3 2.333333
The rank method has some options as well, to specify the behavior in case of identical or NA-values for example.
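For instance, ascending controls the ranking direction and method controls how ties are broken; a small sketch:

```python
import pandas as pd

frame = pd.DataFrame({'sum': [7, 19, 13]})

# Largest value gets rank 1 when ascending=False
r_desc = frame['sum'].rank(ascending=False)

# With method='min', tied values share the lowest rank in the tie group
r_min = pd.Series([10, 10, 20]).rank(method='min')
```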