I have the following dataframe with two columns, c1 and c2. I want to add a new column c3 based on the logic below. What I have works but is slow; can anyone suggest a way to vectorize this?
The frame must be grouped by c1 and c2; then, for each group, c3 must be populated from values, where the key is the value of c1 and each "sub group" of c2 supplies subsequent indices. In other words, c3 is values[value_of_c1][idx], where idx is the position of the "sub group" within its c1 group. Example below:
The first group is (1, 'a'): c1 is 1 and the "sub group" "a" has index 0 (first sub group of 1), so c3 for all rows in this group is values[1][0]
The second group is (1, 'b'): c1 is still 1 but the "sub group" is "b", so the index is 1 (second sub group of 1) and c3 for all rows in this group is values[1][1]
The third group is (2, 'y'): c1 is now 2, the "sub group" is "y" and the index is 0 (first sub group of 2), so c3 for all rows in this group is values[2][0]
And so on
values will have the necessary elements to satisfy this logic.
Code
import pandas as pd
df = pd.DataFrame(
    {
        "c1": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
        "c2": ["a", "a", "a", "b", "b", "b", "y", "y", "y", "z", "z", "z"],
    }
)
new_df = pd.DataFrame()
values = {1: ["a1", "a2"], 2: ["b1", "b2"]}
for i, j in df.groupby("c1"):
    # i is the c1 value; idx is the position of the c2 "sub group" within it.
    for idx, (k, l) in enumerate(j.groupby("c2")):
        l["c3"] = values[i][idx]
        # Note: DataFrame.append is removed in pandas 2.0; pd.concat replaces it.
        new_df = new_df.append(l)
Output (works but my code is slow)
c1 c2 c3
0 1 a a1
1 1 a a1
2 1 a a1
3 1 b a2
4 1 b a2
5 1 b a2
6 2 y b1
7 2 y b1
8 2 y b1
9 2 z b2
10 2 z b2
11 2 z b2
If you don't mind using another library, you basically need to label encode within your groups:
from sklearn.preprocessing import LabelEncoder
def le(x):
    # Encode the group's labels as 0, 1, ... in sorted label order,
    # which matches the order of appearance in this data.
    return pd.DataFrame(LabelEncoder().fit_transform(x), index=x.index)

df['idx'] = df.groupby('c1')['c2'].apply(le)
df['c3'] = df.apply(lambda x: values[x['c1']][x['idx']], axis=1)
c1 c2 idx c3
0 1 a 0 a1
1 1 a 0 a1
2 1 a 0 a1
3 1 b 1 a2
4 1 b 1 a2
5 1 b 1 a2
6 2 y 0 b1
7 2 y 0 b1
8 2 y 0 b1
9 2 z 1 b2
10 2 z 1 b2
11 2 z 1 b2
Otherwise it's a matter of using pd.Categorical: same concept as above, except that within each group you convert the column to a category and pull out the codes:
def le(x):
    return pd.DataFrame(pd.Categorical(x).codes, index=x.index)
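For reference, here is a minimal end-to-end sketch of the Categorical variant, restating the df and values from the question so it runs standalone; it uses transform instead of apply to keep index alignment simple:

import pandas as pd

df = pd.DataFrame(
    {
        "c1": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
        "c2": ["a", "a", "a", "b", "b", "b", "y", "y", "y", "z", "z", "z"],
    }
)
values = {1: ["a1", "a2"], 2: ["b1", "b2"]}

# Category codes number the unique c2 labels 0, 1, ... in sorted order,
# which coincides with the order of appearance in this data.
df["idx"] = df.groupby("c1")["c2"].transform(lambda x: pd.Categorical(x).codes)
df["c3"] = df.apply(lambda r: values[r["c1"]][r["idx"]], axis=1)
print(df)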
In [203]: a = pd.DataFrame([[k, value, idx] for k, v in values.items() for idx, value in enumerate(v)], columns=['c1', 'c3', 'gr'])
     ...: b = df.assign(gr=df.groupby(['c1']).transform(lambda x: x.ne(x.shift()).cumsum() - 1))
     ...: print(b)
     ...: b.merge(a).drop(columns='gr')
     ...:
# b
c1 c2 gr
0 1 a 0
1 1 a 0
2 1 a 0
3 1 b 1
4 1 b 1
5 1 b 1
6 2 y 0
7 2 y 0
8 2 y 0
9 2 z 1
10 2 z 1
11 2 z 1
Out[203]:
c1 c2 c3
0 1 a a1
1 1 a a1
2 1 a a1
3 1 b a2
4 1 b a2
5 1 b a2
6 2 y b1
7 2 y b1
8 2 y b1
9 2 z b2
10 2 z b2
11 2 z b2
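A closely related, merge-free variant is to build a flat lookup keyed on (c1, sub-group index). A sketch, restating the df and values from the question; like the transform above, it assumes equal c2 values are contiguous within each c1 block, as in the sample data:

import pandas as pd

df = pd.DataFrame(
    {
        "c1": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
        "c2": ["a", "a", "a", "b", "b", "b", "y", "y", "y", "z", "z", "z"],
    }
)
values = {1: ["a1", "a2"], 2: ["b1", "b2"]}

# Sub-group index within each c1 block: bump the counter whenever c2 changes.
gr = df.groupby("c1")["c2"].transform(lambda x: x.ne(x.shift()).cumsum() - 1)

# Flatten values into {(c1, idx): c3} and do a row-wise lookup.
lookup = {(k, i): v for k, vs in values.items() for i, v in enumerate(vs)}
df["c3"] = [lookup[key] for key in zip(df["c1"], gr)]
print(df)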
Related
I have a data frame where some rows have one ID and one related ID. In the example below, a1 and a2 are related (say to the same person) while b and c don't have any related rows.
import pandas as pd
test = pd.DataFrame(
    [['a1', 1, 'a2'],
     ['a1', 2, 'a2'],
     ['a1', 3, 'a2'],
     ['a2', 4, 'a1'],
     ['a2', 5, 'a1'],
     ['b', 6, None],
     ['c', 7, None]],
    columns=['ID1', 'Value', 'ID2']
)
test
ID1 Value ID2
0 a1 1 a2
1 a1 2 a2
2 a1 3 a2
3 a2 4 a1
4 a2 5 a1
5 b 6 None
6 c 7 None
What I need to achieve is to add a column containing the sum of all values for related rows. In this case, the desired output should be like below. Is there a way to get this, please?
ID1  Value  ID2   Group by ID1 and ID2
a1   1      a2    15
a1   2      a2    15
a1   3      a2    15
a2   4      a1    15
a2   5      a1    15
b    6      None  6
c    7      None  7
Note that I learnt to use groupby to get the sum for ID1 (from this question), but not for ID1 and ID2 together.
test['Group by ID1'] = test.groupby("ID1")["Value"].transform("sum")
test
ID1 Value ID2 Group by ID1
0 a1 1 a2 6
1 a1 2 a2 6
2 a1 3 a2 6
3 a2 4 a1 9
4 a2 5 a1 9
5 b 6 None 6
6 c 7 None 7
Update
I think I can still use a for loop to get this done, as below. But I am wondering if there is a non-loop way. Thanks.
bottle = pd.DataFrame().reindex_like(test)
bottle['ID1'] = test['ID1']
bottle['ID2'] = test['ID2']
for index, row in bottle.iterrows():
    bottle.loc[index, "Value"] = test[test['ID1'] == row['ID1']]['Value'].sum() + \
                                 test[test['ID1'] == row['ID2']]['Value'].sum()
print(bottle)
ID1 Value ID2
0 a1 15.0 a2
1 a1 15.0 a2
2 a1 15.0 a2
3 a2 15.0 a1
4 a2 15.0 a1
5 b 6.0 None
6 c 7.0 None
A possible solution would be to sort the pairs in ID1 and ID2, such that they always appear in the same order.
Swapping the IDs:
s = df['ID1'] > df['ID2']
df.loc[s, ['ID1', 'ID2']] = df.loc[s, ['ID2', 'ID1']].values
print(df)
>>> ID1 Value ID2
0 a1 1 a2
1 a1 2 a2
2 a1 3 a2
3 a1 4 a2
4 a1 5 a2
5 b 6 None
6 c 7 None
Then we can do a simple groupby:
df['RSUM'] = df.groupby(['ID1', 'ID2'], dropna=False)['Value'].transform("sum")
print(df)
>>> ID1 Value ID2 RSUM
0 a1 1 a2 15
1 a1 2 a2 15
2 a1 3 a2 15
3 a1 4 a2 15
4 a1 5 a2 15
5 b 6 None 6
6 c 7 None 7
Note the dropna=False to not discard IDs that have no pairing.
If you do not want to permanently swap the IDs, you can just create a temporary dataframe.
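For example, a minimal sketch of that temporary-key variant, continuing with the df from above (key is a hypothetical helper Series; missing IDs are filled with an empty string so each row's pair can be sorted):

# Normalize each pair into an order-independent key, leaving df unchanged.
ids = df[['ID1', 'ID2']].fillna('')
key = ids.apply(lambda r: tuple(sorted(r)), axis=1)
df['RSUM'] = df.groupby(key)['Value'].transform('sum')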
I have a dataframe:
id value
a1 0
a1 1
a1 2
a1 3
a2 0
a2 1
a3 0
a3 1
a3 2
a3 3
I want to filter the ids and leave only those which have a value of 3 or higher. In this example, id a2 must be removed since it only has values 0 and 1. The desired result is:
id value
a1 0
a1 1
a1 2
a1 3
a3 0
a3 1
a3 2
a3 3
a3 4
a3 5
How to do that in pandas?
Updated.
Group by IDs and find their max values. Find the IDs whose max value is at or above 3:
keep = df.groupby('id')['value'].max() >= 3
Select the rows with the IDs that match:
df[df['id'].isin(keep[keep].index)]
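Alternatively, the two steps can be collapsed into one with groupby().filter, which keeps a whole group when the predicate holds (same criterion as above):

result = df.groupby('id').filter(lambda g: g['value'].max() >= 3)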
Use a boolean mask to keep the rows that match the condition, then replace the bad id (a2) by the next id (a3). Finally, group again by id and apply a cumulative count.
mask = df.groupby('id')['value'] \
         .transform(lambda x: sorted(x.tolist()) == [0, 1, 2, 3])
df1 = df[mask].reindex(df.index).bfill()
df1['value'] = df1.groupby('id').agg('cumcount')
Output:
>>> df1
id value
0 a1 0
1 a1 1
2 a1 2
3 a1 3
4 a3 0
5 a3 1
6 a3 2
7 a3 3
8 a3 4
9 a3 5
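As a side note, the hard-coded [0, 1, 2, 3] check ties the mask to this exact data; a more general mask with the same effect here (a sketch, reusing the first answer's criterion) would be:

mask = df.groupby('id')['value'].transform('max') >= 3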
I want to create a data frame as below.
C1 C2 C3 C4
1 1 1 1
1 2 2 2
1 2 2 3
1 2 3 4
1 2 3 5
2 3 4 6
3 4 5 7
The C4 column should contain unique values. Each C4 value should belong to exactly one C3 value, each C3 value to exactly one C2 value, and each C2 value to exactly one C1 value. The number of distinct values should grow down the chain (C1 < C2 < C3 < C4). The values may be random.
I used below sample Python code.
import pandas as pd
import numpy as np
C1 = [1, 2]
C2 = [1, 2, 3, 4]
C3 = [1, 2, 3, 4, 5]
C4 = [1, 2, 3, 4, 5, 6, 7]
Z = [C1, C2, C3, C4]
n = max(len(x) for x in Z)
a = [np.hstack((np.random.choice(x, n - len(x)), x)) for x in Z]
df = pd.DataFrame(a, index=['C1', 'C2', 'C3', 'C4']).T.sample(frac=1)
print(df)
Below is my output.
C1 C2 C3 C4
1 4 2 2
1 3 4 6
2 1 2 4
1 2 5 1
2 4 5 7
1 2 3 5
2 2 1 3
But I couldn't get output matching my logic: the value 2 in the C3 column belongs to both 1 and 4 of the C2 column, and the value 2 in the C2 column belongs to both 1 and 2 of the C1 column. Please guide me to get output as per my logic. Thanks.
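For illustration only (this is not from the question): one way to honour the containment constraints is to record each child-to-parent relationship explicitly and resolve every unique C4 value up the chain. The maps below are hypothetical, chosen to reproduce the desired output above:

import pandas as pd

# Hypothetical parent -> children maps; each child value appears under
# exactly one parent, which enforces the "belongs to one" rule.
c1_to_c2 = {1: [1, 2], 2: [3], 3: [4]}
c2_to_c3 = {1: [1], 2: [2, 3], 3: [4], 4: [5]}
c3_to_c4 = {1: [1], 2: [2, 3], 3: [4, 5], 4: [6], 5: [7]}

rows = []
for c4 in [1, 2, 3, 4, 5, 6, 7]:  # C4 values stay unique
    c3 = next(k for k, v in c3_to_c4.items() if c4 in v)
    c2 = next(k for k, v in c2_to_c3.items() if c3 in v)
    c1 = next(k for k, v in c1_to_c2.items() if c2 in v)
    rows.append((c1, c2, c3, c4))

df = pd.DataFrame(rows, columns=['C1', 'C2', 'C3', 'C4'])
print(df)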
Basically I just want to flatten the list columns (maybe "flatten" is not the right term).
for example having dataframe:
   A        B        C
0  1   [1, 2]  [1, 10]
1  2  [2, 14]  [2, 18]
I want to get the output of:
A B1 B2 B3 B4
0 1 1 2 1 10
1 2 2 14 2 18
I've tried:
print(pd.DataFrame(df.values.flatten().tolist(), columns=['%sG'%i for i in range(6)], index=df.index))
But with no luck. Hope you get what I mean :)
General solution, which works even if the lists have different lengths:
df1 = pd.DataFrame(df['B'].values.tolist())
df2 = pd.DataFrame(df['C'].values.tolist())
df = pd.concat([df[['A']], df1, df2], axis=1)
df.columns = [df.columns[0]] + [f'B{i+1}' for i in range(len(df.columns)-1)]
print(df)
A B1 B2 B3 B4
0 1 1 2 1 10
1 2 2 14 2 18
If the lists are the same size:
df1 = pd.DataFrame(np.array(df[['B','C']].values.tolist()).reshape(len(df),-1))
df1.columns = [f'B{i+1}' for i in range(len(df1.columns))]
df1.insert(0, 'A', df['A'])
print(df1)
A B1 B2 B3 B4
0 1 1 2 1 10
1 2 2 14 2 18
In more recent versions you can use explode:
>>> x = df.select_dtypes(exclude=list).join(df.select_dtypes(list).apply(pd.Series.explode, axis=1))
>>> x.columns = x.columns + x.columns.to_series().groupby(level=0).cumcount().add(1).astype(str)
>>> x
A1 B1 B2 C1 C2
0 1 1 2 1 10
1 2 2 14 2 18
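The column renaming works because the exploded frame repeats each original column name; a quick standalone sketch of that step (with a hypothetical cols index):

import pandas as pd

cols = pd.Index(['A', 'B', 'B', 'C', 'C'])
# Number repeated names 1, 2, ... within each group of equal labels.
new = cols + cols.to_series().groupby(level=0).cumcount().add(1).astype(str)
print(list(new))  # ['A1', 'B1', 'B2', 'C1', 'C2']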
Related to the question below, I would like to count the number of following rows.
Thanks to the answer there, I could handle the data, but I ran into some trouble and exceptions.
How to count the number of following rows in pandas
A B
1 a0
2 a1
3 b1
4 a0
5 b2
6 a2
7 a2
First, I would like to cut df into sub-frames, starting a new one at each row where B starts with "a":
df1
A B
1 a0
df2
A B
2 a1
3 b1
df3
A B
4 a0
5 b2
df4
A B
6 a2
df5
A B
7 a2
Then I would like to count each sub-frame's rows:
"a" number
a0 1
a1 2
a0 2
a2 1
a2 1
How could be this done?
I would be happy if someone could tell me how to handle this kind of problem.
You can aggregate by a custom Series created with cumsum:
print (df.B.str.startswith("a").cumsum())
0 1
1 2
2 2
3 3
4 3
5 4
6 5
Name: B, dtype: int32
df1 = df.B.groupby(df.B.str.startswith("a").cumsum()).agg(['first', 'size'])
df1.columns =['"A"','number']
df1.index.name = None
print(df1)
"A" number
1 a0 1
2 a1 2
3 a0 2
4 a2 1
5 a2 1
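For reference, a self-contained sketch of the whole pipeline, restating the input frame assumed by the answer:

import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6, 7],
    "B": ["a0", "a1", "b1", "a0", "b2", "a2", "a2"],
})

# Each row whose B starts with "a" opens a new block; the cumulative sum
# of that boolean flag labels every row with its block number.
block = df["B"].str.startswith("a").cumsum()

# 'first' recovers the block's opening "a" value, 'size' counts its rows.
out = df["B"].groupby(block).agg(['first', 'size'])
out.columns = ['"A"', 'number']
out.index.name = None
print(out)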