How to count the following number of rows in pandas (new) - python

Related to the question below, I would like to count the number of following rows.
Thanks to the answer there, I could handle the data, but I ran into some trouble and an exception.
How to count the number of following rows in pandas
A B
1 a0
2 a1
3 b1
4 a0
5 b2
6 a2
7 a2
First, I would like to split df at each row whose B starts with "a":
df1
A B
1 a0
df2
A B
2 a1
3 b1
df3
A B
4 a0
5 b2
df4
A B
6 a2
df5
A B
7 a2
Then I would like to count each sub-DataFrame's rows:
"a" number
a0 1
a1 2
a0 2
a2 1
a2 1
How can this be done?
I would be happy if someone could tell me how to handle this kind of problem.

You can aggregate by a custom grouping Series created with cumsum:
print (df.B.str.startswith("a").cumsum())
0 1
1 2
2 2
3 3
4 3
5 4
6 5
Name: B, dtype: int32
df1 = df.B.groupby(df.B.str.startswith("a").cumsum()).agg(['first', 'size'])
df1.columns =['"A"','number']
df1.index.name = None
print (df1)
"A" number
1 a0 1
2 a1 2
3 a0 2
4 a2 1
5 a2 1
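Putting the answer together as a self-contained sketch (data and column names taken from the question):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5, 6, 7],
                   "B": ["a0", "a1", "b1", "a0", "b2", "a2", "a2"]})

# Each row whose B starts with "a" opens a new group, so the cumulative
# sum of the startswith mask yields a group id: 1, 2, 2, 3, 3, 4, 5
groups = df.B.str.startswith("a").cumsum()

# First value and row count of each group
df1 = df.B.groupby(groups).agg(["first", "size"])
df1.columns = ['"A"', "number"]
df1.index.name = None
print(df1)
```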

Related

How to filter values in data frame by grouped values in column

I have a dataframe:
id value
a1 0
a1 1
a1 2
a1 3
a2 0
a2 1
a3 0
a3 1
a3 2
a3 3
I want to filter the ids and keep only those that reach a value of at least 3. So in this example id a2 must be removed, since it only has values 0 and 1. So the desired result is:
id value
a1 0
a1 1
a1 2
a1 3
a3 0
a3 1
a3 2
a3 3
a3 4
a3 5
How to do that in pandas?
Group by IDs and find their max values. Find the IDs whose max value is at or above 3:
keep = df.groupby('id')['value'].max() >= 3
Select the rows with the IDs that match:
df[df['id'].isin(keep[keep].index)]
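The two steps above as a runnable sketch:

```python
import pandas as pd

df = pd.DataFrame({"id": ["a1"] * 4 + ["a2"] * 2 + ["a3"] * 4,
                   "value": [0, 1, 2, 3, 0, 1, 0, 1, 2, 3]})

# Boolean Series indexed by id: True where the group's max is >= 3
keep = df.groupby("id")["value"].max() >= 3

# keep[keep].index is the list of ids to retain
result = df[df["id"].isin(keep[keep].index)]
print(result)
```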
Use a boolean mask to keep the rows that match the condition, then replace the bad id (a2) by the next id (a3). Finally, group again by id and apply a cumulative count.
mask = df.groupby('id')['value'] \
    .transform(lambda x: sorted(x.tolist()) == [0, 1, 2, 3])
df1 = df[mask].reindex(df.index).bfill()
df1['value'] = df1.groupby('id').cumcount()
Output:
>>> df1
id value
0 a1 0
1 a1 1
2 a1 2
3 a1 3
4 a3 0
5 a3 1
6 a3 2
7 a3 3
8 a3 4
9 a3 5
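The same steps as a self-contained sketch; it uses groupby(...).cumcount() for the final renumbering, which is what the cumulative count described above amounts to:

```python
import pandas as pd

df = pd.DataFrame({"id": ["a1"] * 4 + ["a2"] * 2 + ["a3"] * 4,
                   "value": [0, 1, 2, 3, 0, 1, 0, 1, 2, 3]})

# True for rows whose id group contains exactly the values 0..3
mask = df.groupby("id")["value"] \
         .transform(lambda x: sorted(x.tolist()) == [0, 1, 2, 3])

# Drop the bad rows, reinstate their index positions as NaN,
# then back-fill so they take on the next id (a2 -> a3)
df1 = df[mask].reindex(df.index).bfill()

# Renumber the values within each id: 0, 1, 2, ...
df1["value"] = df1.groupby("id").cumcount()
print(df1)
```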

Pandas sort by subtotal of each group

Still new to pandas, but is there a way to sort a df by the subtotal of each group?
Area Unit Count
A A1 5
A A2 2
B B1 10
B B2 1
B B3 3
C C1 10
So I want to sort them by the subtotal of each Area, which gives A subtotal = 7, B subtotal = 14, C subtotal = 10.
The sort should be like
Area Unit Count
B B1 10
B B2 1
B B3 3
C C1 10
A A1 5
A A2 2
*Note that rows within each Area keep their original order (e.g. B2 stays before B3 even though B3's count is larger); only the subtotals drive the sort.
Create a helper column 'sorter', which is the group-wise sum of the Count variable, and sort your dataframe with it:
df['sorter'] = df.groupby("Area").Count.transform('sum')
df.sort_values('sorter', ascending=False, kind='mergesort').reset_index(drop=True).drop('sorter', axis=1)
Area Unit Count
0 B B1 10
1 B B2 1
2 B B3 3
3 C C1 10
4 A A1 5
5 A A2 2
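A runnable version; a stable sort kind (mergesort) is used so rows within each Area are guaranteed to keep their original order:

```python
import pandas as pd

df = pd.DataFrame({"Area": ["A", "A", "B", "B", "B", "C"],
                   "Unit": ["A1", "A2", "B1", "B2", "B3", "C1"],
                   "Count": [5, 2, 10, 1, 3, 10]})

# Subtotal per Area, broadcast back to every row: 7, 7, 14, 14, 14, 10
df["sorter"] = df.groupby("Area").Count.transform("sum")

# mergesort is stable, so original row order survives within each Area
out = (df.sort_values("sorter", ascending=False, kind="mergesort")
         .reset_index(drop=True)
         .drop("sorter", axis=1))
print(out)
```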

Replace value with the value of nearest neighbor in Pandas dataframe

I have a problem getting the nearest values for some rows in a pandas dataframe and filling another column with values from those rows.
data sample I have:
id su_id r_value match_v
A A1 0 1
A A2 0 1
A A3 70 2
A A4 120 100
A A5 250 3
A A6 250 100
B B1 0 1
B B2 30 2
The thing is, wherever match_v is equal to 100, I need to replace that 100 with the value from the row whose r_value is closest to the r_value of the originating row (the one where match_v equals 100), but only within the same group (grouped by id).
Expected output
id su_id r_value match_v
A A1 0 1
A A2 0 1
A A3 70 2
A A4 120 2
A A5 250 3
A A6 250 3
B B1 0 1
B B2 30 2
I have tried creating lead and lag columns with shift and then finding the differences, but it doesn't work well and somehow messes up values that were already good.
I haven't tried anything else because I really don't have any other idea.
Any help or hint is welcome, and if you need any additional info, I'm here.
Thanks in advance.
This is more a job for merge_asof:
s=df.loc[df.match_v!=100]
s=pd.merge_asof(df.sort_values('r_value'),s.sort_values('r_value'),on='r_value',by='id',direction='nearest')
df['match_v']=df['su_id'].map(s.set_index('su_id_x')['match_v_y'])
df
Out[231]:
id su_id r_value match_v
0 A A1 0 1
1 A A2 0 1
2 A A3 70 2
3 A A4 120 2
4 A A5 250 3
5 A A6 250 3
6 B B1 0 1
7 B B2 30 2
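The merge_asof approach end to end (note both sides must be sorted by the on key):

```python
import pandas as pd

df = pd.DataFrame({"id": ["A"] * 6 + ["B"] * 2,
                   "su_id": ["A1", "A2", "A3", "A4", "A5", "A6", "B1", "B2"],
                   "r_value": [0, 0, 70, 120, 250, 250, 0, 30],
                   "match_v": [1, 1, 2, 100, 3, 100, 1, 2]})

# Rows that already hold a valid match_v act as the lookup table
s = df.loc[df.match_v != 100]

# For every row, find the nearest r_value among valid rows of the same id
s = pd.merge_asof(df.sort_values("r_value"), s.sort_values("r_value"),
                  on="r_value", by="id", direction="nearest")

# Map the looked-up match_v back onto the original row order via su_id
df["match_v"] = df["su_id"].map(s.set_index("su_id_x")["match_v_y"])
print(df)
```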
Here is another way using NumPy broadcasting, built to speed up the calculation:
import numpy as np

l = []
for x, y in df.groupby('id'):
    s1 = y.r_value.values
    # pairwise absolute differences within the group
    s = abs((s1 - s1[:, None])).astype(float)
    # mask the diagonal and lower triangle so argmin looks at earlier rows only
    s[np.tril_indices(s.shape[0], 0)] = 999999
    s = s.argmin(0)
    s2 = y.match_v.values
    l.append(s2[s][s2 == 100])
df.loc[df.match_v == 100, 'match_v'] = np.concatenate(l)
df
Out[264]:
id su_id r_value match_v
0 A A1 0 1
1 A A2 0 1
2 A A3 70 2
3 A A4 120 2
4 A A5 250 3
5 A A6 250 3
6 B B1 0 1
7 B B2 30 2
You could define a custom function which does the calculation and substitution, and then use it with groupby and apply.
def mysubstitution(x):
    for i in x.index[x['match_v'] == 100]:
        diff = (x['r_value'] - (x['r_value'].iloc[i])).abs()
        exclude = x.index.isin([i])
        closer_idx = diff[~exclude].idxmin()
        x['match_v'].iloc[i] = x['match_v'].iloc[closer_idx]
    return x
ddf = df.groupby('id').apply(mysubstitution)
ddf is:
id su_id r_value match_v
0 A A1 0 1
1 A A2 0 1
2 A A3 70 2
3 A A4 120 2
4 A A5 250 3
5 A A6 250 3
6 B B1 0 1
7 B B2 30 2
Assuming there is always at least one valid value within the group when a 100 is first encountered.
m = dict()
for i in range(len(df)):
    if df.loc[i, "match_v"] == 100:
        df.loc[i, "match_v"] = m[df.loc[i, "id"]]
    else:
        m[df.loc[i, "id"]] = df.loc[i, "match_v"]
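This single pass can be tested directly. Note that it substitutes the most recent valid value seen for that id (a forward fill), which happens to coincide with the nearest r_value in the sample data:

```python
import pandas as pd

df = pd.DataFrame({"id": ["A"] * 6 + ["B"] * 2,
                   "su_id": ["A1", "A2", "A3", "A4", "A5", "A6", "B1", "B2"],
                   "r_value": [0, 0, 70, 120, 250, 250, 0, 30],
                   "match_v": [1, 1, 2, 100, 3, 100, 1, 2]})

m = dict()  # last valid match_v seen per id
for i in range(len(df)):
    if df.loc[i, "match_v"] == 100:
        df.loc[i, "match_v"] = m[df.loc[i, "id"]]
    else:
        m[df.loc[i, "id"]] = df.loc[i, "match_v"]
print(df)
```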

Statistics for Grouped DataFrames with Pandas

I have a DataFrame that can be grouped basically by two columns: Level and Sub_level.
The data looks like this:
Level_1 Sub_level Value
0 Group A A1 100
1 Group A A2 200
2 Group A A1 150
3 Group B B1 100
4 Group B B2 200
5 Group A A1 200
6 Group A A1 300
7 Group A A1 400
8 Group B B2 450
...
I would like to get the frequency/count of each Sub_level relative to its Level_1 group, i.e.
Level_1 Sub_level Pct_of_total
Group A A1 5 / 6 (as there are 6 Group A instances in 'Level_1', and 5 A1:s in 'Sub_level')
A2 1 / 6
Group B B1 1 / 3 (as there are 3 Group B instances in 'Level_1', and 1 B1:s in 'Sub_level')
B2 2 / 3
Of course, the fractions in the new column Pct_of_total should be expressed as percentages.
Any clues?
Thanks,
/N
I think you need groupby + size for the first df, then group by the first level (Level_1) and transform with sum. Last, divide using div:
df1 = df.groupby(['Level_1','Sub_level'])['Value'].size()
print (df1)
Level_1 Sub_level
Group A A1 5
A2 1
Group B B1 1
B2 2
Name: Value, dtype: int64
df2 = df1.groupby(level=0).transform('sum')
print (df2)
Level_1 Sub_level
Group A A1 6
A2 6
Group B B1 3
B2 3
Name: Value, dtype: int64
df3 = df1.div(df2).reset_index(name='Pct_of_total')
print (df3)
Level_1 Sub_level Pct_of_total
0 Group A A1 0.833333
1 Group A A2 0.166667
2 Group B B1 0.333333
3 Group B B2 0.666667
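End to end, with a final mul(100) to express the fractions as percentages, as the question asks:

```python
import pandas as pd

df = pd.DataFrame({"Level_1": ["Group A", "Group A", "Group A", "Group B",
                               "Group B", "Group A", "Group A", "Group A",
                               "Group B"],
                   "Sub_level": ["A1", "A2", "A1", "B1", "B2",
                                 "A1", "A1", "A1", "B2"],
                   "Value": [100, 200, 150, 100, 200, 200, 300, 400, 450]})

# Count rows per (Level_1, Sub_level) pair
df1 = df.groupby(["Level_1", "Sub_level"])["Value"].size()

# Total rows per Level_1, aligned with df1's MultiIndex
df2 = df1.groupby(level=0).transform("sum")

# Fraction of the group, converted to percent
df3 = df1.div(df2).mul(100).reset_index(name="Pct_of_total")
print(df3)
```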

How to select rows which matches certain row

I have a dataframe below
A B
a0 1
b0 1
c0 2
a1 3
b1 4
b2 3
First, I would like to select the rows where df.A starts with "a":
df[df.A.str.startswith("a")]
A B
a0 1
a1 3
Then I would like to split df like below.
sub1
A B
a0 1
b0 1
c0 2
sub2
A B
a1 3
b1 4
b2 3
Then, within each sub-DataFrame, I would like to extract the rows whose column B value matches that of the row whose column A starts with "a":
sub1
A B
a0 1
b0 1
sub2
A B
a1 3
b2 3
Then append them:
result
A B
a0 1
b0 1
a1 3
b2 3
How can I split and append df like this?
I tried the cut method, but it didn't work well.
I think you can use where to mask the non-matching rows as NaN, which are then forward-filled with the B values using ffill.
Note that for ffill to work, the value starting with "a" has to come first in each group:
print (df.B.where(df.A.str.startswith("a")))
0 1.0
1 NaN
2 NaN
3 3.0
4 NaN
5 NaN
Name: B, dtype: float64
print (df.B.where(df.A.str.startswith("a")).ffill())
0 1.0
1 1.0
2 1.0
3 3.0
4 3.0
5 3.0
Name: B, dtype: float64
df = df[df.B == df.B.where(df.A.str.startswith("a")).ffill()]
print (df)
A B
0 a0 1
1 b0 1
3 a1 3
5 b2 3
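The full sketch:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a0", "b0", "c0", "a1", "b1", "b2"],
                   "B": [1, 1, 2, 3, 4, 3]})

# Keep B only where A starts with "a", then forward-fill:
# every row now carries the B value of its group's "a" row
filled = df.B.where(df.A.str.startswith("a")).ffill()

# Keep the rows whose own B matches that filled value
result = df[df.B == filled]
print(result)
```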
