I want to create a data frame as below.
C1 C2 C3 C4
1 1 1 1
1 2 2 2
1 2 2 3
1 2 3 4
1 2 3 5
2 3 4 6
3 4 5 7
The C4 column should contain unique values. Each C4 value should belong to exactly one value of the C3 column, each C3 value to exactly one value of the C2 column, and each C2 value to exactly one value of the C1 column. The column values should be ordered as C1 < C2 < C3 < C4. The values may be random.
I used the sample Python code below.
import pandas as pd
import numpy as np

C1 = [1, 2]
C2 = [1, 2, 3, 4]
C3 = [1, 2, 3, 4, 5]
C4 = [1, 2, 3, 4, 5, 6, 7]
Z = [C1, C2, C3, C4]
n = max(len(x) for x in Z)
a = [np.hstack((np.random.choice(x, n - len(x)), x)) for x in Z]
df = pd.DataFrame(a, index=['C1', 'C2', 'C3', 'C4']).T.sample(frac=1)
print(df)
Below is my output.
C1 C2 C3 C4
1 4 2 2
1 3 4 6
2 1 2 4
1 2 5 1
2 4 5 7
1 2 3 5
2 2 1 3
But I couldn't get output matching my logic. For example, the value 2 in the C3 column belongs to both 1 and 4 of the C2 column, and the value 2 in the C2 column belongs to both 1 and 2 of the C1 column. Please guide me to get output as per my logic. Thanks.
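One possible way to enforce this hierarchy (a sketch, not a tested answer from the thread) is to pick a single parent for every value first and only then assemble the rows, so each C4 value walks up through exactly one C3, one C2 and one C1 value. The assign_parents helper below is hypothetical, and it uses <= rather than strict < because the sample row 1 1 1 1 allows equal values:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

C1 = [1, 2]
C2 = [1, 2, 3, 4]
C3 = [1, 2, 3, 4, 5]
C4 = [1, 2, 3, 4, 5, 6, 7]

def assign_parents(children, parents):
    # Map each child value to one random parent that is <= the child,
    # so every row keeps the ordering C1 <= C2 <= C3 <= C4.
    return {c: rng.choice([p for p in parents if p <= c]) for c in children}

c2_to_c1 = assign_parents(C2, C1)
c3_to_c2 = assign_parents(C3, C2)
c4_to_c3 = assign_parents(C4, C3)

# One row per unique C4 value, walking each value back up the hierarchy
# so that it belongs to exactly one parent at every level.
rows = []
for v4 in C4:
    v3 = c4_to_c3[v4]
    v2 = c3_to_c2[v3]
    v1 = c2_to_c1[v2]
    rows.append((v1, v2, v3, v4))

df = pd.DataFrame(rows, columns=['C1', 'C2', 'C3', 'C4'])
print(df)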
Related
I have the following dataframe with two columns c1 and c2. I want to add a new column c3 based on the following logic. What I have works but is slow; can anyone suggest a way to vectorize this?
The data must be grouped by c1 and c2; then, for each group, the new column c3 must be populated from values, where the key is the value of c1 and each "sub group" gets the next element, i.e. values[value_of_c1][idx], where idx is the index of the "sub group". Example below:
The first group is (1, 'a'): here c1 is 1 and the "sub group" "a" has index 0 (first sub group of 1), so c3 for all rows in this group is values[1][0].
The second group is (1, 'b'): c1 is still 1 but the "sub group" is "b", so the index is 1 (second sub group of 1) and c3 for all rows in this group is values[1][1].
The third group is (2, 'y'): c1 is now 2, the "sub group" is "y" and the index is 0 (first sub group of 2), so c3 for all rows in this group is values[2][0].
And so on
values will have the necessary elements to satisfy this logic.
Code
import pandas as pd

df = pd.DataFrame(
    {
        "c1": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
        "c2": ["a", "a", "a", "b", "b", "b", "y", "y", "y", "z", "z", "z"],
    }
)
new_df = pd.DataFrame()
values = {1: ["a1", "a2"], 2: ["b1", "b2"]}
for i, j in df.groupby("c1"):
    for idx, (k, l) in enumerate(j.groupby("c2")):
        l["c3"] = values[i][idx]
        new_df = new_df.append(l)
Output (works but my code is slow)
c1 c2 c3
0 1 a a1
1 1 a a1
2 1 a a1
3 1 b a2
4 1 b a2
5 1 b a2
6 2 y b1
7 2 y b1
8 2 y b1
9 2 z b2
10 2 z b2
11 2 z b2
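A side note, not from the original post: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same loop would need to collect the pieces and concatenate them once at the end, roughly like this:

pieces = []
for i, j in df.groupby("c1"):
    for idx, (k, l) in enumerate(j.groupby("c2")):
        l = l.copy()  # work on a copy so the original df is untouched
        l["c3"] = values[i][idx]
        pieces.append(l)
new_df = pd.concat(pieces)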
If you don't mind using another library, you basically need to label encode within your groups:
from sklearn.preprocessing import LabelEncoder

def le(x):
    return pd.DataFrame(LabelEncoder().fit_transform(x), index=x.index)

df['idx'] = df.groupby('c1')['c2'].apply(le)
df['c3'] = df.apply(lambda x: values[x['c1']][x['idx']], axis=1)
c1 c2 idx c3
0 1 a 0 a1
1 1 a 0 a1
2 1 a 0 a1
3 1 b 1 a2
4 1 b 1 a2
5 1 b 1 a2
6 2 y 0 b1
7 2 y 0 b1
8 2 y 0 b1
9 2 z 1 b2
10 2 z 1 b2
11 2 z 1 b2
Otherwise it's a matter of using pd.Categorical: same concept as above, except that within each group you convert the column into a category and pull out the codes:
def le(x):
    return pd.DataFrame(pd.Categorical(x).codes, index=x.index)
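The two lines from above can then presumably be reused unchanged with this le (repeated here only for readability, they are identical to the earlier ones):

df['idx'] = df.groupby('c1')['c2'].apply(le)
df['c3'] = df.apply(lambda x: values[x['c1']][x['idx']], axis=1)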
Another option is to build a small lookup table from values (one row per (c1, sub-group index) pair) and merge it onto a within-group occurrence counter:
In [203]: a = pd.DataFrame([[k, value, idx] for k, v in values.items() for idx, value in enumerate(v)], columns=['c1', 'c3', 'gr'])
     ...: b = df.assign(gr=df.groupby(['c1']).transform(lambda x: x.ne(x.shift()).cumsum() - 1))
     ...: print(b)
     ...: b.merge(a).drop(columns='gr')
     ...:
# b
c1 c2 gr
0 1 a 0
1 1 a 0
2 1 a 0
3 1 b 1
4 1 b 1
5 1 b 1
6 2 y 0
7 2 y 0
8 2 y 0
9 2 z 1
10 2 z 1
11 2 z 1
Out[203]:
c1 c2 c3
0 1 a a1
1 1 a a1
2 1 a a1
3 1 b a2
4 1 b a2
5 1 b a2
6 2 y b1
7 2 y b1
8 2 y b1
9 2 z b2
10 2 z b2
11 2 z b2
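For completeness, a compact pure-pandas sketch of the same sub-group numbering (an illustration, not one of the answers above) could use pd.factorize with sort=True, which reproduces the sorted sub-group order that groupby uses:

idx = df.groupby('c1')['c2'].transform(lambda s: pd.factorize(s, sort=True)[0])
df['c3'] = [values[c1][i] for c1, i in zip(df['c1'], idx)]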
I have a dataframe:
id value
a1 0
a1 1
a1 2
a1 3
a2 0
a2 1
a3 0
a3 1
a3 2
a3 3
I want to filter the ids and keep only those which reach a value of at least 3. In this example id a2 must be removed since it only has values 0 and 1. So the desired result is:
id value
a1 0
a1 1
a1 2
a1 3
a3 0
a3 1
a3 2
a3 3
a3 4
a3 5
How to do that in pandas?
Group by IDs and find their max values. Find the IDs whose max value is at or above 3:
keep = df.groupby('id')['value'].max() >= 3
Select the rows with the IDs that match:
df[df['id'].isin(keep[keep].index)]
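Equivalently, a group-wise transform can express the same filter in one step (a shorthand of the same idea, not part of the answer above):

df[df.groupby('id')['value'].transform('max') >= 3]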
Use a boolean mask to keep the rows that match the condition, then replace the bad id (a2) by the next id (a3) via a backfill. Finally, group again by id and apply a cumulative count.
mask = df.groupby('id')['value'] \
         .transform(lambda x: sorted(x.tolist()) == [0, 1, 2, 3])
df1 = df[mask].reindex(df.index).bfill()
df1['value'] = df1.groupby('id').cumcount()
Output:
>>> df1
id value
0 a1 0
1 a1 1
2 a1 2
3 a1 3
4 a3 0
5 a3 1
6 a3 2
7 a3 3
8 a3 4
9 a3 5
I have a problem with getting the nearest values for some rows in a pandas dataframe and filling another column with values from those rows.
Data sample I have:
id su_id r_value match_v
A A1 0 1
A A2 0 1
A A3 70 2
A A4 120 100
A A5 250 3
A A6 250 100
B B1 0 1
B B2 30 2
The thing is, wherever match_v is equal to 100, I need to replace that 100 with the match_v from the row whose r_value is closest to the r_value of the origin row (the one where match_v is equal to 100), but only within the same group (grouped by id).
Expected output
id su_id r_value match_v
A A1 0 1
A A2 0 1
A A3 70 2
A A4 120 2
A A5 250 3
A A6 250 3
B B1 0 1
B B2 30 2
I have tried creating lead and lag columns with shift and then finding differences, but it doesn't work well and it somehow messed up values that were already good.
I haven't tried anything else because I really don't have any other idea.
Any help or hint is welcome, and if you need any additional info, I'm here.
Thanks in advance.
This is more like a job for merge_asof:
# rows whose match_v is already valid act as the lookup table
s = df.loc[df.match_v != 100]
# for each row, take the valid row with the nearest r_value within the same id
s = pd.merge_asof(df.sort_values('r_value'), s.sort_values('r_value'),
                  on='r_value', by='id', direction='nearest')
df['match_v'] = df['su_id'].map(s.set_index('su_id_x')['match_v_y'])
df
Out[231]:
id su_id r_value match_v
0 A A1 0 1
1 A A2 0 1
2 A A3 70 2
3 A A4 120 2
4 A A5 250 3
5 A A6 250 3
6 B B1 0 1
7 B B2 30 2
Here is another way using numpy broadcasting, built to speed up the calculation:
import numpy as np

l = []
for x, y in df.groupby('id'):
    s1 = y.r_value.values
    # pairwise absolute distances between r_values within the group
    s = abs(s1 - s1[:, None]).astype(float)
    # mask the diagonal and lower triangle so each row only considers earlier rows
    s[np.tril_indices(s.shape[0], 0)] = 999999
    s = s.argmin(0)
    s2 = y.match_v.values
    l.append(s2[s][s2 == 100])
df.loc[df.match_v == 100, 'match_v'] = np.concatenate(l)
df
Out[264]:
id su_id r_value match_v
0 A A1 0 1
1 A A2 0 1
2 A A3 70 2
3 A A4 120 2
4 A A5 250 3
5 A A6 250 3
6 B B1 0 1
7 B B2 30 2
You could define a custom function which does the calculation and substitution, and then use it with groupby and apply.
def mysubstitution(x):
    # i and closer_idx are index labels, so use .loc rather than .iloc here
    for i in x.index[x['match_v'] == 100]:
        diff = (x['r_value'] - x['r_value'].loc[i]).abs()
        exclude = x.index.isin([i])
        closer_idx = diff[~exclude].idxmin()
        x.loc[i, 'match_v'] = x.loc[closer_idx, 'match_v']
    return x
ddf = df.groupby('id').apply(mysubstitution)
ddf is:
id su_id r_value match_v
0 A A1 0 1
1 A A2 0 1
2 A A3 70 2
3 A A4 120 2
4 A A5 250 3
5 A A6 250 3
6 B B1 0 1
7 B B2 30 2
Assuming there is always at least one valid value within the group when a 100 is first encountered.
# m remembers the most recent valid match_v seen for each id
m = dict()
for i in range(len(df)):
    if df.loc[i, "match_v"] == 100:
        df.loc[i, "match_v"] = m[df.loc[i, "id"]]
    else:
        m[df.loc[i, "id"]] = df.loc[i, "match_v"]
Suppose I have a data frame with four columns, part of which looks like this:
C1 C2 C3 C4
60 60 60 60
59 59 58 58
0 0 0 0
12 13 13 11
Now I want to create four columns, each corresponding to the frequency of the value in that position, considering the other three columns as well. The new columns will look like:
F1 F2 F3 F4
4 4 4 4
2 2 2 2
2 2 2 2
1 2 2 1
In cell (1,1) the value is 4 because the value 60 appears in all four columns of that particular row.
In cell (4,1) the value is 1 because 12 appears in no other column of that row.
I need to calculate and add the features F1, F2, F3, F4 to the pandas data frame.
Use apply with axis=1 to process by rows, mapping each value to its frequency from value_counts:
df = df.apply(lambda x: x.map(x.value_counts()), axis=1)
print(df)
C1 C2 C3 C4
0 4 4 4 4
1 2 2 2 2
2 4 4 4 4
3 1 2 2 1
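The row-wise apply can be slow on larger frames. A possible vectorized alternative (a sketch, starting from the original C1-C4 frame and using the F1, F2, F3, F4 column names asked for in the question) compares each row against itself with numpy broadcasting:

import numpy as np

arr = df.to_numpy()
# freq[r, i] = how many entries in row r equal arr[r, i] (including itself)
freq = (arr[:, :, None] == arr[:, None, :]).sum(axis=2)
df[['F1', 'F2', 'F3', 'F4']] = freq
print(df)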
Related to the question below, I would like to count the number of following rows.
Thanks to the answer there, I could handle the data, but I ran into some trouble and exceptions.
How to count the number of following rows in pandas
A B
1 a0
2 a1
3 b1
4 a0
5 b2
6 a2
7 a2
First, I would like to cut df into pieces starting at each row where B starts with "a":
df1
A B
1 a0
df2
A B
2 a1
3 b1
df3
A B
4 a0
5 b2
df4
A B
6 a2
df5
A B
7 a2
Then I would like to count each piece's rows:
"a" number
a0 1
a1 2
a0 2
a2 1
a2 1
How could this be done?
I would be happy if someone could tell me how to handle this kind of problem.
You can aggregate by a custom Series created with cumsum:
print(df.B.str.startswith("a").cumsum())
0 1
1 2
2 2
3 3
4 3
5 4
6 5
Name: B, dtype: int32
df1 = df.B.groupby(df.B.str.startswith("a").cumsum()).agg(['first', 'size'])
df1.columns = ['"A"', 'number']
df1.index.name = None
print(df1)
"A" number
1 a0 1
2 a1 2
3 a0 2
4 a2 1
5 a2 1