Pandas ordering based on column value - python

I have a pandas DataFrame like the one below.
Input DataFrame
id ratio
0 1 5.00%
1 2 9.00%
2 3 6.00%
3 2 13.00%
4 1 19.00%
5 4 30.00%
6 3 5.5%
7 2 22.00%
How can I then group this like
id ratio
0 1 5.00%
4 1 19.00%
6 3 5.5%
2 3 6.00%
1 2 9.00%
3 2 13.00%
7 2 22.00%
5 4 30.00%
So essentially: find the lowest ratio, pull together all rows that share that row's id, then find the next lowest ratio among the remaining ids and group those rows, and so on.

First convert your ratio column to numeric values and rank them.
Then we get the lowest rank per id group using GroupBy.transform.
Finally we sort on that group rank and each row's own rank, and drop the helper columns.
df['ratio_num'] = df['ratio'].str[:-1].astype(float).rank()
df['rank'] = df.groupby('id')['ratio_num'].transform('min')
df = df.sort_values(['rank', 'ratio_num']).drop(columns=['rank', 'ratio_num'])
id ratio
0 1 5.00%
4 1 19.00%
6 3 5.5%
2 3 6.00%
1 2 9.00%
3 2 13.00%
7 2 22.00%
5 4 30.00%
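For reference, the approach can be run end to end as a self-contained sketch (assuming the sample data from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 2, 1, 4, 3, 2],
    'ratio': ['5.00%', '9.00%', '6.00%', '13.00%', '19.00%', '30.00%', '5.5%', '22.00%'],
})

# Rank rows by the numeric ratio, find each id's best (lowest) rank,
# then sort groups by that best rank and rows within a group by their own rank.
df['ratio_num'] = df['ratio'].str[:-1].astype(float).rank()
df['group_rank'] = df.groupby('id')['ratio_num'].transform('min')
out = (df.sort_values(['group_rank', 'ratio_num'])
         .drop(columns=['group_rank', 'ratio_num']))
print(out)
```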

With the help of pd.Categorical:
import pandas as pd

d = {'id': [1, 2, 3, 2, 1, 4, 3, 2],
     'ratio': ['5.00%', '9.00%', '6.00%', '13.00%', '19.00%', '30.00%', '5.5%', '22.00%']}
df = pd.DataFrame(d)
df['ratio_'] = df['ratio'].map(lambda x: float(x[:-1]))
order = (df.sort_values(['id', 'ratio_'])
           .groupby('id').head(1)
           .sort_values(['ratio_', 'id'])['id'])
df['id'] = pd.Categorical(df['id'], categories=order, ordered=True)
print(df.sort_values(['id', 'ratio_']).drop('ratio_', axis=1))
Prints:
id ratio
0 1 5.00%
4 1 19.00%
6 3 5.5%
2 3 6.00%
1 2 9.00%
3 2 13.00%
7 2 22.00%
5 4 30.00%

pandas number of items in one column per value in another column

I have two dataframes. Say, for example, frame 1 is the student info:
student_id course
1 a
2 b
3 c
4 a
5 f
6 f
Frame 2 records each interaction the student has with a program:
student_id day number_of_clicks
1 4 60
1 5 34
1 7 87
2 3 33
2 4 29
2 8 213
2 9 46
3 2 103
I am trying to add the information from frame 2 to frame 1, i.e. for each student I would like to know the number of different days they accessed the database, and the sum of all their clicks on those days. E.g.:
student_id course no_days total_clicks
1 a 3 181
2 b 4 321
3 c 1 103
4 a 0 0
5 f 0 0
6 f 0 0
I've tried to do this with groupby, but I couldn't add the information back into frame 1, or figure out how to sum the number of clicks. Any ideas?
First we aggregate your df2 to the desired information using GroupBy.agg. Then we merge that information into df1:
agg = df2.groupby('student_id').agg(
    no_days=('day', 'size'),
    total_clicks=('number_of_clicks', 'sum')
)
df1 = df1.merge(agg, on='student_id', how='left').fillna(0)
student_id course no_days total_clicks
0 1 a 3.0 181.0
1 2 b 4.0 321.0
2 3 c 1.0 103.0
3 4 a 0.0 0.0
4 5 f 0.0 0.0
5 6 f 0.0 0.0
Or if you like one-liners, here's the same method as above in a single statement, in a more SQL-like style:
df1.merge(
    df2.groupby('student_id').agg(
        no_days=('day', 'size'),
        total_clicks=('number_of_clicks', 'sum')
    ),
    on='student_id',
    how='left'
).fillna(0)
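One caveat, not part of the original answer: the left merge leaves NaN for students with no interactions, which upcasts the two new columns to float (hence the 3.0, 0.0 in the output above). If integer counts are preferred, a cast after filling works; a sketch with the sample data:

```python
import pandas as pd

df1 = pd.DataFrame({'student_id': [1, 2, 3, 4, 5, 6],
                    'course': ['a', 'b', 'c', 'a', 'f', 'f']})
df2 = pd.DataFrame({'student_id': [1, 1, 1, 2, 2, 2, 2, 3],
                    'day': [4, 5, 7, 3, 4, 8, 9, 2],
                    'number_of_clicks': [60, 34, 87, 33, 29, 213, 46, 103]})

agg = df2.groupby('student_id').agg(
    no_days=('day', 'size'),
    total_clicks=('number_of_clicks', 'sum'),
)
out = (df1.merge(agg, on='student_id', how='left')
          .fillna(0)
          .astype({'no_days': int, 'total_clicks': int}))  # cast back to integers
print(out)
```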
Use merge, fillna the null values, then aggregate using groupby.agg (this assumes numpy is imported as np; as_index=False already keeps the group keys as columns, so no reset_index is needed):
df = df1.merge(df2, how='left').fillna(0, downcast='infer')\
        .groupby(['student_id', 'course'], as_index=False)\
        .agg({'day': np.count_nonzero, 'number_of_clicks': np.sum})
print(df)
print(df)
student_id course day number_of_clicks
0 1 a 3 181
1 2 b 4 321
2 3 c 1 103
3 4 a 0 0
4 5 f 0 0
5 6 f 0 0

How to replace values in selected rows' column with an array in a dataframe?

I have a df
df = pd.DataFrame(data={'A': [1,2,3,4,5,6,7,8],
                        'B': [10,20,30,40,50,60,70,80]})
A B
0 1 10
1 2 20
2 3 30
3 4 40
4 5 50
5 6 60
6 7 70
7 8 80
from which I selected a few rows.
Then I have a dictionary containing values that should be inserted into column B
wherever the key matches the value in column A of df:
my_dict = {2: 39622884,
           4: 82709546,
           5: 28166511,
           7: 89465652}
When I use the following assignment
df.loc[df['A'].isin(my_dict.keys())]['B'] = list(my_dict.values())
I get the error:
ValueError: Length of values does not match length of index
The desirable output is
A B
0 1 10
1 2 39622884
2 3 30
3 4 82709546
4 5 28166511
5 6 60
6 7 89465652
7 8 80
What is the correct way to implement this procedure?
You can do this with map and fillna:
df['B'] = df['A'].map(my_dict).fillna(df['B'])
Output:
A B
0 1 10.0
1 2 39622884.0
2 3 30.0
3 4 82709546.0
4 5 28166511.0
5 6 60.0
6 7 89465652.0
7 8 80.0
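As an aside, not part of the original answer: if you want to keep the .loc route from the question, the fix is to select rows and column in a single indexer and use an index-aligned right-hand side. Chained indexing (df.loc[mask]['B'] = ...) assigns into a copy, and a plain list of dict values is not aligned to the selected rows, hence the length error. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8],
                   'B': [10, 20, 30, 40, 50, 60, 70, 80]})
my_dict = {2: 39622884, 4: 82709546, 5: 28166511, 7: 89465652}

# One .loc call for both the row mask and the column; the mapped Series
# aligns to the selected rows by index, so lengths always match.
mask = df['A'].isin(my_dict)
df.loc[mask, 'B'] = df['A'].map(my_dict)
print(df)
```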

How can I replace a null value in a group?

I created this dataframe and calculated the price gap I was looking for, but the problem is that some flats have the same price, so I get a price difference of 0. How can I replace the 0 with the difference from the next lower price in the same group?
For example:
neighborhood: a, bed: 1, bath: 1, price: 5
neighborhood: a, bed: 1, bath: 1, price: 5
neighborhood: a, bed: 1, bath: 1, price: 3
neighborhood: a, bed: 1, bath: 1, price: 2
I get price differences of 0, 2, 1, NaN, but I'm looking for 2, 2, 1, NaN (in short, I don't want to compare two flats with the same price).
Thanks in advance and good day.
data = [
    [1,'a',1,1,5], [2,'a',1,1,5], [3,'a',1,1,4], [4,'a',1,1,2],
    [5,'b',1,2,6], [6,'b',1,2,6], [7,'b',1,2,3]
]
df = pd.DataFrame(data, columns = ['id','neighborhoodname', 'beds', 'baths', 'price'])
df['difference_price'] = (df.dropna()
                            .sort_values('price', ascending=False)
                            .groupby(['neighborhoodname', 'beds', 'baths'])['price']
                            .diff(-1))
I think you can first remove duplicates across all the columns used in the groupby plus price, create the new column on that filtered data, and finally merge it back to the original with a left join:
df1 = (df.dropna()
         .sort_values('price', ascending=False)
         .drop_duplicates(['neighborhoodname', 'beds', 'baths', 'price']))
df1['difference_price'] = df1.groupby(['neighborhoodname', 'beds', 'baths'])['price'].diff(-1)
df = df.merge(df1[['neighborhoodname','beds','baths','price', 'difference_price']], how='left')
print (df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
Or you can use a lambda function that back-fills the 0 values within each group, which avoids wrong output for one-row groups (where data would otherwise be pulled in from another group):
df['difference_price'] = (df.sort_values('price',ascending=False)
.groupby(['neighborhoodname','beds','baths'])['price']
.apply(lambda x: x.diff(-1).replace(0, np.nan).bfill()))
print (df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
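A note that is not in the original answer: on newer pandas versions (1.5+/2.x), apply respects group_keys, which by default prepends the group keys to the result's index and breaks the alignment in the assignment above; passing group_keys=False keeps the original row index. A self-contained sketch:

```python
import numpy as np
import pandas as pd

data = [[1, 'a', 1, 1, 5], [2, 'a', 1, 1, 5], [3, 'a', 1, 1, 4],
        [4, 'a', 1, 1, 2], [5, 'b', 1, 2, 6], [6, 'b', 1, 2, 6],
        [7, 'b', 1, 2, 3]]
df = pd.DataFrame(data, columns=['id', 'neighborhoodname', 'beds', 'baths', 'price'])

# Per group: price gap to the next lower price, with 0 gaps (equal prices)
# replaced by the next non-zero gap via back-fill.
df['difference_price'] = (
    df.sort_values('price', ascending=False)
      .groupby(['neighborhoodname', 'beds', 'baths'], group_keys=False)['price']
      .apply(lambda x: x.diff(-1).replace(0, np.nan).bfill())
)
print(df)
```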

How do I sort a whole pandas dataframe by one column, moving the rows grouped in 3s?

I have a dataframe with genes (ensembl IDs and common name), homologs, counts, and totals in orders of three as such:
Index Zebrafish Homolog Human Homolog Total
0 ENSDARG00000019949 ENSG00000149257
1 serpinh1b SERPINH1
2 2 2 4
3 ENSDARG00000052437 ENSG00000268975
4 mia MIA-RAB4B
5 2 0 2
6 ENSDARG00000057992 ENSG00000134363
7 fstb FST
8 0 3 3
9 ENSDARG00000045580 ENSG00000139329
10 lum LUM
11 15 15 30
etc...
I want to sort these rows by the totals in descending order. such that all the rows are kept intact in groups of 3 in the orders shown. The ideal output would be:
Index Zebrafish Homolog Human Homolog Total
0 ENSDARG00000045580 ENSG00000139329
1 lum LUM
2 15 15 30
3 ENSDARG00000019949 ENSG00000149257
4 serpinh1b SERPINH1
5 2 2 4
6 ENSDARG00000057992 ENSG00000134363
7 fstb FST
8 0 3 3
9 ENSDARG00000052437 ENSG00000268975
10 mia MIA-RAB4B
11 2 0 2
etc...
I tried assigning the total to all 3 rows of each clump and then sorting with DataFrame.sort_values(), removing the previous 2 rows of each clump of 3, but it didn't work properly. Is there a way to group the rows into clumps of 3 and then sort them while maintaining that structure? Thank you in advance for any assistance.
Update #1
If I try to use the code:
df['Total'] = df['Total'].bfill().astype(int)
df = df.sort_values(by='Total', ascending=False)
to add values to the total for each group of 3 and then sort, it partially works, but scrambles the rows like this:
Index Zebrafish Homolog Human Homolog Total
0 ENSDARG00000045580 ENSG00000139329 30
1 lum LUM 30
2 15 15 30
4 serpinh1b SERPINH1 4
3 ENSDARG00000019949 ENSG00000149257 4
5 2 2 4
8 0 3 3
7 fstb FST 3
6 ENSDARG00000057992 ENSG00000134363 3
9 ENSDARG00000052437 ENSG00000268975 2
11 2 0 2
10 mia MIA-RAB4B 2
etc...
Even worse, if multiple genes have the same total counts, the rows become interchanged between genes, which gets confusing.
Is this a dead end? Maybe I should just rewrite the code a different way :(
You need to create a second key to keep the records together when sorting; see below:
df['Total'] = df['Total'].bfill()
df['helper'] = np.arange(len(df)) // 3
df = df.sort_values(['Total', 'helper'], ascending=False)
df = df.drop(columns='helper')
It looks like your Total column has missing values, and that actually helps in this case.
Approach 1
df['Total'] = df['Total'].bfill().astype(int)
df['idx'] = np.arange(len(df)) // 3
df = df.sort_values(by=['Total', 'idx'], ascending=False)
df = df.drop(['idx'], axis=1)
Zebrafish_Homolog Human_Homolog Total
9 ENSDARG00000045580 ENSG00000139329 30
10 lum LUM 30
11 15 15 30
0 ENSDARG00000019949 ENSG00000149257 4
1 serpinh1b SERPINH1 4
2 2 2 4
6 ENSDARG00000057992 ENSG00000134363 3
7 fstb FST 3
8 0 3 3
3 ENSDARG00000052437 ENSG00000268975 2
4 mia MIA-RAB4B 2
5 2 0 2
Note how the original index stays the same; if you don't want that, reset it:
df = df.reset_index(drop=True)
Approach 2
A more manual way of sorting.
The approach is to sort the index and then loc into the df. It looks complicated, but it is just subtracting ints from a list. Note that nothing happens to the df until the end, so there should be no speed issue for a larger df.
# Sort by total
df = df.reset_index().sort_values('Total', ascending=False)
# Get the index of the sorted values
uniq_index = df[df['Total'].notnull()]['index'].values
# Create the new index
index = uniq_index.repeat(3)
groups = [-2, -1, 0] * (len(df) // 3)
# Update so everything is in order
new_index = index + groups
# Apply to the dataframe
df = df.loc[new_index]
Zebrafish_Homolog Human_Homolog Total
9 ENSDARG00000045580 ENSG00000139329 NaN
10 lum LUM NaN
11 15 15 30.0
0 ENSDARG00000019949 ENSG00000149257 NaN
1 serpinh1b SERPINH1 NaN
2 2 2 4.0
6 ENSDARG00000057992 ENSG00000134363 NaN
7 fstb FST NaN
8 0 3 3.0
3 ENSDARG00000052437 ENSG00000268975 NaN
4 mia MIA-RAB4B NaN
5 2 0 2.0
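To see the helper-key idea end to end, here is a minimal sketch with two made-up 3-row blocks (toy values standing in for the full gene table):

```python
import numpy as np
import pandas as pd

# Toy version of the table: two 3-row blocks, Total present only on each third row.
df = pd.DataFrame({
    'Zebrafish Homolog': ['ENSDARG00000019949', 'serpinh1b', '2',
                          'ENSDARG00000045580', 'lum', '15'],
    'Human Homolog': ['ENSG00000149257', 'SERPINH1', '2',
                      'ENSG00000139329', 'LUM', '15'],
    'Total': [np.nan, np.nan, 4, np.nan, np.nan, 30],
})

df['Total'] = df['Total'].bfill().astype(int)   # spread each block's total to all 3 rows
df['block'] = np.arange(len(df)) // 3           # one label per 3-row clump
out = (df.sort_values(['Total', 'block'], ascending=False)
         .drop(columns='block'))
print(out)
```

Because every row in a block carries the same (Total, block) pair, the blocks move as units and end up ordered by total, descending.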

Pandas - create total column based on other column

I'm trying to create a total column that sums the numbers from column b, grouped by the values in column a. I can do this by using .groupby(), but that produces a truncated column, whereas I want a column of the same length as the original.
My code:
df = pd.DataFrame({'a':[1,2,2,3,3,3], 'b':[1,2,3,4,5,6]})
df['total'] = df.groupby(['a']).sum().reset_index()['b']
My result:
a b total
0 1 1 1.0
1 2 2 5.0
2 2 3 15.0
3 3 4 NaN
4 3 5 NaN
5 3 6 NaN
My desired result:
a b total
0 1 1 1.0
1 2 2 5.0
2 2 3 5.0
3 3 4 15.0
4 3 5 15.0
5 3 6 15.0
...where each 'a' column has the same total as the other.
Returning the sum from a groupby operation in pandas produces a column only as long as the number of unique items in the grouping column. Use transform to produce a column of the same length ("like-indexed") as the original data frame without performing any merges.
df['total'] = df.groupby('a')['b'].transform('sum')
>>> df
a b total
0 1 1 1
1 2 2 5
2 2 3 5
3 3 4 15
4 3 5 15
5 3 6 15
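The shape difference between an aggregating sum and transform is easy to see side by side (a small sketch, same sample data):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 3, 3, 3], 'b': [1, 2, 3, 4, 5, 6]})

collapsed = df.groupby('a')['b'].sum()              # one value per group
broadcast = df.groupby('a')['b'].transform('sum')   # one value per original row

print(len(collapsed), len(broadcast))  # prints "3 6"
```

Because transform's result is index-aligned with df, it can be assigned straight back as a new column.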
