How to count the following number of rows in pandas (new) - python

Related to the question below, I would like to count the number of following rows.
Thanks to the answer there, I could handle the data, but I ran into some trouble and an exception.
How to count the number of following rows in pandas
A B
1 a0
2 a1
3 b1
4 a0
5 b2
6 a2
7 a2
First, I would like to split df at each row whose B starts with "a":
df1
A B
1 a0
df2
A B
2 a1
3 b1
df3
A B
4 a0
5 b2
df4
A B
6 a2
df5
A B
7 a2
Then I would like to count each sub-DataFrame's rows:
"a" number
a0 1
a1 2
a0 2
a2 1
a2 1
How can this be done?
I would be happy if someone could tell me how to handle this kind of problem.

You can aggregate by a custom grouping Series created with cumsum:
print (df.B.str.startswith("a").cumsum())
0 1
1 2
2 2
3 3
4 3
5 4
6 5
Name: B, dtype: int32
df1 = df.B.groupby(df.B.str.startswith("a").cumsum()).agg(['first', 'size'])
df1.columns =['"A"','number']
df1.index.name = None
print (df1)
"A" number
1 a0 1
2 a1 2
3 a0 2
4 a2 1
5 a2 1
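Putting the answer together as a self-contained sketch (data and column names taken from the question):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5, 6, 7],
                   "B": ["a0", "a1", "b1", "a0", "b2", "a2", "a2"]})

# Each row whose B starts with "a" opens a new group, so the cumulative
# sum of the startswith mask yields a group id: 1, 2, 2, 3, 3, 4, 5
groups = df.B.str.startswith("a").cumsum()

# First value and row count of each group
df1 = df.B.groupby(groups).agg(["first", "size"])
df1.columns = ['"A"', "number"]
df1.index.name = None
print(df1)
```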

Related

How to filter values in data frame by grouped values in column

I have a dataframe:
id value
a1 0
a1 1
a1 2
a1 3
a2 0
a2 1
a3 0
a3 1
a3 2
a3 3
I want to filter the ids and keep only those that reach a value of at least 3. So in this example id a2 must be removed, since it only has values 0 and 1. So the desired result is:
id value
a1 0
a1 1
a1 2
a1 3
a3 0
a3 1
a3 2
a3 3
a3 4
a3 5
How to do that in pandas?
Group by IDs and find their max values. Find the IDs whose max value is at or above 3:
keep = df.groupby('id')['value'].max() >= 3
Select the rows with the IDs that match:
df[df['id'].isin(keep[keep].index)]
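The two steps above as a runnable sketch:

```python
import pandas as pd

df = pd.DataFrame({"id": ["a1"] * 4 + ["a2"] * 2 + ["a3"] * 4,
                   "value": [0, 1, 2, 3, 0, 1, 0, 1, 2, 3]})

# Boolean Series indexed by id: True where the group's max is >= 3
keep = df.groupby("id")["value"].max() >= 3

# keep[keep].index is the list of ids to retain
result = df[df["id"].isin(keep[keep].index)]
print(result)
```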
Use a boolean mask to keep the rows that match the condition, then replace the bad id (a2) by the next id (a3). Finally, group again by id and apply a cumulative count.
mask = df.groupby('id')['value'] \
    .transform(lambda x: sorted(x.tolist()) == [0, 1, 2, 3])
df1 = df[mask].reindex(df.index).bfill()
df1['value'] = df1.groupby('id').cumcount()
Output:
>>> df1
id value
0 a1 0
1 a1 1
2 a1 2
3 a1 3
4 a3 0
5 a3 1
6 a3 2
7 a3 3
8 a3 4
9 a3 5
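The same steps as a self-contained sketch; it uses groupby(...).cumcount() for the final renumbering, which is what the cumulative count described above amounts to:

```python
import pandas as pd

df = pd.DataFrame({"id": ["a1"] * 4 + ["a2"] * 2 + ["a3"] * 4,
                   "value": [0, 1, 2, 3, 0, 1, 0, 1, 2, 3]})

# True for rows whose id group contains exactly the values 0..3
mask = df.groupby("id")["value"] \
         .transform(lambda x: sorted(x.tolist()) == [0, 1, 2, 3])

# Drop the bad rows, reinstate their index positions as NaN,
# then back-fill so they take on the next id (a2 -> a3)
df1 = df[mask].reindex(df.index).bfill()

# Renumber the values within each id: 0, 1, 2, ...
df1["value"] = df1.groupby("id").cumcount()
print(df1)
```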

Pandas sort by subtotal of each group

Still new to pandas, but is there a way to sort a df by the subtotal of each group?
Area Unit Count
A A1 5
A A2 2
B B1 10
B B2 1
B B3 3
C C1 10
So I want to sort them by the subtotal of each Area, which gives A subtotal = 7, B subtotal = 14, C subtotal = 10.
The sort should be like
Area Unit Count
B B1 10
B B2 1
B B3 3
C C1 10
A A1 5
A A2 2
*Note that rows within each Area keep their original order (e.g. B2 stays before B3 even though B3's count is larger); only the subtotals drive the sort.
Create a helper column 'sorter', which is the group-wise sum of the Count variable, and sort your dataframe with it:
df['sorter'] = df.groupby("Area").Count.transform('sum')
df.sort_values('sorter', ascending=False, kind='mergesort').reset_index(drop=True).drop('sorter', axis=1)
Area Unit Count
0 B B1 10
1 B B2 1
2 B B3 3
3 C C1 10
4 A A1 5
5 A A2 2
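A runnable version; a stable sort kind (mergesort) is used so rows within each Area are guaranteed to keep their original order:

```python
import pandas as pd

df = pd.DataFrame({"Area": ["A", "A", "B", "B", "B", "C"],
                   "Unit": ["A1", "A2", "B1", "B2", "B3", "C1"],
                   "Count": [5, 2, 10, 1, 3, 10]})

# Subtotal per Area, broadcast back to every row: 7, 7, 14, 14, 14, 10
df["sorter"] = df.groupby("Area").Count.transform("sum")

# mergesort is stable, so original row order survives within each Area
out = (df.sort_values("sorter", ascending=False, kind="mergesort")
         .reset_index(drop=True)
         .drop("sorter", axis=1))
print(out)
```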

Replace value with the value of nearest neighbor in Pandas dataframe

I have a problem getting the nearest values for some rows in a pandas dataframe and filling another column with values from those rows.
data sample I have:
id su_id r_value match_v
A A1 0 1
A A2 0 1
A A3 70 2
A A4 120 100
A A5 250 3
A A6 250 100
B B1 0 1
B B2 30 2
The thing is, wherever match_v is equal to 100, I need to replace that 100 with the value from the row whose r_value is closest to the r_value of the originating row (the one where match_v equals 100), but only within the same group (grouped by id).
Expected output
id su_id r_value match_v
A A1 0 1
A A2 0 1
A A3 70 2
A A4 120 2
A A5 250 3
A A6 250 3
B B1 0 1
B B2 30 2
I have tried creating lead and lag columns with shift and then finding the differences, but it doesn't work well and somehow messes up values that were already good.
I haven't tried anything else because I really don't have any other idea.
Any help or hint is welcome, and if you need any additional info, I'm here.
Thanks in advance.
This is more a job for merge_asof:
s=df.loc[df.match_v!=100]
s=pd.merge_asof(df.sort_values('r_value'),s.sort_values('r_value'),on='r_value',by='id',direction='nearest')
df['match_v']=df['su_id'].map(s.set_index('su_id_x')['match_v_y'])
df
Out[231]:
id su_id r_value match_v
0 A A1 0 1
1 A A2 0 1
2 A A3 70 2
3 A A4 120 2
4 A A5 250 3
5 A A6 250 3
6 B B1 0 1
7 B B2 30 2
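The merge_asof approach end to end (note both sides must be sorted by the on key):

```python
import pandas as pd

df = pd.DataFrame({"id": ["A"] * 6 + ["B"] * 2,
                   "su_id": ["A1", "A2", "A3", "A4", "A5", "A6", "B1", "B2"],
                   "r_value": [0, 0, 70, 120, 250, 250, 0, 30],
                   "match_v": [1, 1, 2, 100, 3, 100, 1, 2]})

# Rows that already hold a valid match_v act as the lookup table
s = df.loc[df.match_v != 100]

# For every row, find the nearest r_value among valid rows of the same id
s = pd.merge_asof(df.sort_values("r_value"), s.sort_values("r_value"),
                  on="r_value", by="id", direction="nearest")

# Map the looked-up match_v back onto the original row order via su_id
df["match_v"] = df["su_id"].map(s.set_index("su_id_x")["match_v_y"])
print(df)
```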
Here is another way using NumPy broadcasting, built to speed up the calculation:
import numpy as np

l = []
for x, y in df.groupby('id'):
    s1 = y.r_value.values
    # pairwise absolute differences within the group
    s = abs((s1 - s1[:, None])).astype(float)
    # mask the diagonal and lower triangle so argmin looks at earlier rows only
    s[np.tril_indices(s.shape[0], 0)] = 999999
    s = s.argmin(0)
    s2 = y.match_v.values
    l.append(s2[s][s2 == 100])
df.loc[df.match_v == 100, 'match_v'] = np.concatenate(l)
df
Out[264]:
id su_id r_value match_v
0 A A1 0 1
1 A A2 0 1
2 A A3 70 2
3 A A4 120 2
4 A A5 250 3
5 A A6 250 3
6 B B1 0 1
7 B B2 30 2
You could define a custom function which does the calculation and substitution, and then use it with groupby and apply.
def mysubstitution(x):
    for i in x.index[x['match_v'] == 100]:
        diff = (x['r_value'] - (x['r_value'].iloc[i])).abs()
        exclude = x.index.isin([i])
        closer_idx = diff[~exclude].idxmin()
        x['match_v'].iloc[i] = x['match_v'].iloc[closer_idx]
    return x
ddf = df.groupby('id').apply(mysubstitution)
ddf is:
id su_id r_value match_v
0 A A1 0 1
1 A A2 0 1
2 A A3 70 2
3 A A4 120 2
4 A A5 250 3
5 A A6 250 3
6 B B1 0 1
7 B B2 30 2
Assuming there is always at least one valid value within the group when a 100 is first encountered.
m = dict()
for i in range(len(df)):
    if df.loc[i, "match_v"] == 100:
        df.loc[i, "match_v"] = m[df.loc[i, "id"]]
    else:
        m[df.loc[i, "id"]] = df.loc[i, "match_v"]
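This single pass can be tested directly. Note that it substitutes the most recent valid value seen for that id (a forward fill), which happens to coincide with the nearest r_value in the sample data:

```python
import pandas as pd

df = pd.DataFrame({"id": ["A"] * 6 + ["B"] * 2,
                   "su_id": ["A1", "A2", "A3", "A4", "A5", "A6", "B1", "B2"],
                   "r_value": [0, 0, 70, 120, 250, 250, 0, 30],
                   "match_v": [1, 1, 2, 100, 3, 100, 1, 2]})

m = dict()  # last valid match_v seen per id
for i in range(len(df)):
    if df.loc[i, "match_v"] == 100:
        df.loc[i, "match_v"] = m[df.loc[i, "id"]]
    else:
        m[df.loc[i, "id"]] = df.loc[i, "match_v"]
print(df)
```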

Statistics for Grouped DataFrames with Pandas

I have a DataFrame that can be grouped basically by two columns: Level and Sub_level.
The data looks like this:
Level_1 Sub_level Value
0 Group A A1 100
1 Group A A2 200
2 Group A A1 150
3 Group B B1 100
4 Group B B2 200
5 Group A A1 200
6 Group A A1 300
7 Group A A1 400
8 Group B B2 450
...
I would like to get the frequency/count of each Sub_level relative to its Level_1 group, i.e.
Level_1 Sub_level Pct_of_total
Group A A1 5 / 6 (as there are 6 Group A instances in 'Level_1', and 5 A1:s in 'Sub_level')
A2 1 / 6
Group B B1 1 / 3 (as there are 3 Group B instances in 'Level_1', and 1 B1:s in 'Sub_level')
B2 2 / 3
Of course, the fractions in the new column Pct_of_total should be expressed as percentages.
Any clues?
Thanks,
/N
I think you need groupby + size for the first df, then group by the first level (Level_1) and transform with sum. Last, divide using div:
df1 = df.groupby(['Level_1','Sub_level'])['Value'].size()
print (df1)
Level_1 Sub_level
Group A A1 5
A2 1
Group B B1 1
B2 2
Name: Value, dtype: int64
df2 = df1.groupby(level=0).transform('sum')
print (df2)
Level_1 Sub_level
Group A A1 6
A2 6
Group B B1 3
B2 3
Name: Value, dtype: int64
df3 = df1.div(df2).reset_index(name='Pct_of_total')
print (df3)
Level_1 Sub_level Pct_of_total
0 Group A A1 0.833333
1 Group A A2 0.166667
2 Group B B1 0.333333
3 Group B B2 0.666667
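End to end, with a final mul(100) to express the fractions as percentages, as the question asks:

```python
import pandas as pd

df = pd.DataFrame({"Level_1": ["Group A", "Group A", "Group A", "Group B",
                               "Group B", "Group A", "Group A", "Group A",
                               "Group B"],
                   "Sub_level": ["A1", "A2", "A1", "B1", "B2",
                                 "A1", "A1", "A1", "B2"],
                   "Value": [100, 200, 150, 100, 200, 200, 300, 400, 450]})

# Count rows per (Level_1, Sub_level) pair
df1 = df.groupby(["Level_1", "Sub_level"])["Value"].size()

# Total rows per Level_1, aligned with df1's MultiIndex
df2 = df1.groupby(level=0).transform("sum")

# Fraction of the group, converted to percent
df3 = df1.div(df2).mul(100).reset_index(name="Pct_of_total")
print(df3)
```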

How to select rows which matches certain row

I have a dataframe below
A B
a0 1
b0 1
c0 2
a1 3
b1 4
b2 3
First, I would like to select the rows where df.A starts with "a":
df[df.A.str.startswith("a")]
A B
a0 1
a1 3
Then I would like to split df like below.
sub1
A B
a0 1
b0 1
c0 2
sub2
A B
a1 3
b1 4
b2 3
Then, within each sub-DataFrame, I would like to extract the rows whose column B value matches that of the row whose column A starts with "a":
sub1
A B
a0 1
b0 1
sub2
A B
a1 3
b2 3
Then append them:
result
A B
a0 1
b0 1
a1 3
b2 3
How can I split and append df like this?
I tried the cut method, but it didn't work well.
I think you can use where to mask the non-matching rows as NaN, which are then forward-filled with the B values using ffill.
Note that for ffill to work, the value starting with "a" has to come first in each group:
print (df.B.where(df.A.str.startswith("a")))
0 1.0
1 NaN
2 NaN
3 3.0
4 NaN
5 NaN
Name: B, dtype: float64
print (df.B.where(df.A.str.startswith("a")).ffill())
0 1.0
1 1.0
2 1.0
3 3.0
4 3.0
5 3.0
Name: B, dtype: float64
df = df[df.B == df.B.where(df.A.str.startswith("a")).ffill()]
print (df)
A B
0 a0 1
1 b0 1
3 a1 3
5 b2 3
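The full sketch:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a0", "b0", "c0", "a1", "b1", "b2"],
                   "B": [1, 1, 2, 3, 4, 3]})

# Keep B only where A starts with "a", then forward-fill:
# every row now carries the B value of its group's "a" row
filled = df.B.where(df.A.str.startswith("a")).ffill()

# Keep the rows whose own B matches that filled value
result = df[df.B == filled]
print(result)
```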
