Using `rank` on a pandas DataFrameGroupBy object - python

I have some simple data in a dataframe consisting of three columns [id, country, volume] where the index is 'id'.
I can perform simple operations like:
df_vol.groupby('country').sum()
and it works as expected. When I attempt to use rank() it does not work as expected and the result is an empty dataframe.
df_vol.groupby('country').rank()
The result is not consistent and in some cases it works. The following also works as expected:
df_vol.rank()
I want to return something like:
vols = []
for _, df in df_vol.groupby('country'):
    vols.append(df['volume'].rank())
pd.concat(vols)
Any ideas why would be much appreciated!

You can select the column with [], so the function is called only for the volume column:
df_vol.groupby('country')['volume'].rank()
Sample:
import pandas as pd

df_vol = pd.DataFrame({'country':['en','us','us','en','en'],
                       'volume':[10,10,30,20,50],
                       'id':[1,1,1,2,2]})
print(df_vol)
country id volume
0 en 1 10
1 us 1 10
2 us 1 30
3 en 2 20
4 en 2 50
df_vol['r'] = df_vol.groupby('country')['volume'].rank()
print (df_vol)
country id volume r
0 en 1 10 1.0
1 us 1 10 1.0
2 us 1 30 2.0
3 en 2 20 2.0
4 en 2 50 3.0
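For reference, a small sketch showing that rank() on the grouped column also accepts the usual rank parameters (method='dense' and ascending=False here are just illustrative choices, not something the question asked about):

# Rank volumes within each country, largest first, with dense ranking
# (ties share a rank and no ranks are skipped).
df_vol['r_desc'] = (df_vol.groupby('country')['volume']
                          .rank(method='dense', ascending=False))
print(df_vol)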

Related

Perform operations on a dataframe from groupings by ID

I have the following dataframe in Python:
ID  maths  value
0   add    12
1   sub    30
0   add    10
2   mult   3
0   sub    10
1   add    11
3   sub    40
2   add    21
My idea is to perform the following operations to get the result I want:
First step: Group the rows of the dataframe by ID. The order of the groups shall be indicated by the order of the original dataframe.
ID  maths  value
0   add    12
0   add    10
0   sub    10
1   sub    30
1   add    11
2   mult   3
2   add    21
3   sub    40
Second step: For each group created, create a value for a new column 'result', where the mathematical operation indicated by the previous row's 'maths' column is applied. If there is no previous row in the group, this column has the value NaN.
ID  maths  value  result
0   add    12     NaN
0   add    10     22
0   sub    10     20
1   sub    30     NaN
1   add    11     19
2   mult   3      NaN
2   add    21     63
3   sub    40     NaN
Third step: Return the resulting dataframe.
I have tried to implement this using the pandas groupby method, but I have problems iterating with conditions for each row and each group, and I don't know how to create the new column 'result' on a groupby object.
grouped_df = testing.groupby('ID')
for key, item in grouped_df:
    print(grouped_df.get_group(key))
I don't know whether to use orderby or groupby or some other method that works for what I want to do. If you can help me with a better idea, I'd appreciate it.
import pandas as pd

ID = list("00011223")
maths = ["add", "add", "sub", "sub", "add", "mult", "add", "sub"]
value = [12, 10, 10, 30, 11, 3, 21, 40]
df = pd.DataFrame(list(zip(ID, maths, value)), columns=["ID", "Maths", "Value"])

# Shift Maths and Value within each ID so every row sees the previous row's
# operation and value; the first row of each group has no previous row, so its
# shifted Maths is filled with "add" (the result is still NaN because Value1 is NaN).
df["Maths"] = df.groupby(["ID"]).pipe(lambda g: g.Maths.shift(1)).fillna("add")
df["Value1"] = df.groupby(["ID"]).pipe(lambda g: g.Value.shift(1))

# Apply the (shifted) operation to the previous and current values.
# Note: Series.append was removed in pandas 2.0; pd.concat is the modern equivalent.
df["result"] = df.groupby(["Maths"]).pipe(
    lambda g: (g.get_group("add")["Value1"] + g.get_group("add")["Value"]).append(
        g.get_group("sub")["Value1"] - g.get_group("sub")["Value"]).append(
        g.get_group("mult")["Value1"] * g.get_group("mult")["Value"])
).sort_index()
Here is the Output:
df
Out[168]:
ID Maths Value Value1 result
0 0 add 12 NaN NaN
1 0 add 10 12.0 22.0
2 0 add 10 10.0 20.0
3 1 add 30 NaN NaN
4 1 sub 11 30.0 19.0
5 2 add 3 NaN NaN
6 2 mult 21 3.0 63.0
7 3 add 40 NaN NaN
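For reference, a simpler sketch of the same idea (not from the original answer): shift the previous value and the previous operation per ID with groupby().shift(), then pick the arithmetic with numpy.select. Column names follow the question's lowercase spelling.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": list("00011223"),
    "maths": ["add", "add", "sub", "sub", "add", "mult", "add", "sub"],
    "value": [12, 10, 10, 30, 11, 3, 21, 40],
})

prev_value = df.groupby("ID")["value"].shift(1)  # previous value within each ID
prev_op = df.groupby("ID")["maths"].shift(1)     # previous operation within each ID

# Apply the previous row's operation; rows with no previous row get NaN.
df["result"] = np.select(
    [prev_op.eq("add"), prev_op.eq("sub"), prev_op.eq("mult")],
    [prev_value + df["value"], prev_value - df["value"], prev_value * df["value"]],
    default=np.nan,
)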

Grouping my data according to one variable while I have some NaN values is not giving the desired result

ANO  WK  GRP  lgSc  TS  THS  GS  GHS  US
419  1   1    2     0   0    2   4    5
199  1   2    2     2   1    0   2    4
263  1   1    1     2   2    5   0    4
I am trying to group by ANO, but the result is either that the operation is not callable or there is no change in the dataframe.
grouped1=df1.groupby('ANO')
I used this code but it did not give the desired result, though I did not receive any error.
This is a weekly study, so the same patient number repeats. I want the data for a single patient number to be grouped together, like this:
ANO  WK  GRP  lgSc  TS  THS  GS  GHS  US
419  1   1    2     0   0    2   4    5
419  2   2    1     2   0    0   0    4
419  3   1    3     2   2    1   0    4
After applying the code I got no transformation.
There is one peculiarity in the data: for some of the variables I have NaN values, because no readings were taken in weeks 3, 5, 7 and 9 for 4 out of the 12 variables.
The idea of pandas' groupby method is to split the data into groups and then apply some transformation or aggregation to those groups. See the pandas user guide.
If I understand you correctly, you simply want to sort the data, but not apply any transformations. You can do that like this:
sorted1 = df1.sort_values(by='ANO')
The NaN values should not be an issue here.
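For completeness, a minimal sketch on a fragment of the sample above (only the first three columns are reproduced; sorting by ANO and then WK is an assumption about the desired weekly order):

import pandas as pd

df1 = pd.DataFrame({
    'ANO': [419, 199, 263],
    'WK':  [1, 1, 1],
    'GRP': [1, 2, 1],
    # the remaining columns (lgSc, TS, THS, GS, GHS, US) are omitted for brevity
})

# Sort by patient number, then by week; rows with NaN in other columns are kept as-is.
sorted1 = df1.sort_values(by=['ANO', 'WK']).reset_index(drop=True)
print(sorted1)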

Trying to group by, then sort a dataframe based on multiple values [duplicate]

Suppose I have pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it by numbering the records within each group after groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there a more effective/elegant approach to do this? And is there a more elegant approach to number records within each group (like the SQL window function row_number())?
Did you try
df.groupby('id').head(2)
Output generated:
id value
id
1 0 1 1
1 1 2
2 3 2 1
4 2 2
3 7 3 1
4 8 4 1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
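As a small sketch of the sorting caveat above: if "top 2" should mean the two largest values per id rather than the first two rows, sort before calling head() (the question's sample happens to be ordered already):

top2 = (df.sort_values(['id', 'value'], ascending=[True, False])
          .groupby('id')
          .head(2)
          .reset_index(drop=True))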
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1 2 3
1 2
2 6 4
5 3
3 7 1
4 8 1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
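For illustration, a sketch of that clean-up, turning the result back into a plain two-column frame:

top2 = (df.groupby('id')['value']
          .nlargest(2)
          .reset_index(level=1, drop=True)  # drop the leftover original row index
          .reset_index())                   # back to columns: id, value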
Sometimes sorting the whole dataset ahead of time is very time consuming.
We can group by first and take the top k rows of each group:
topk = 2  # number of rows to keep per group, as in the question
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk, ['value'])).reset_index(drop=True)
df.groupby('id').apply(lambda x : x.sort_values(by = 'value', ascending = False).head(2).reset_index(drop = True))
Here, sort_values with ascending=False gives results similar to nlargest, and ascending=True gives results similar to nsmallest.
The value passed to head() is the same as the value we pass to nlargest() to set the number of rows to keep for each group.
reset_index is optional and not necessary.
This works for duplicated values. If you have duplicated values among the top-n values and want only unique values, you can do it like this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
id first_name last_name department salary
24 12 Shandler Bing Audit 110000
25 14 Jason Tom Audit 100000
26 16 Celine Anston Audit 100000
27 15 Michale Jackson Audit 70000
If we do not remove duplicates, for the Audit department we get the top 3 salaries as 110k, 100k and 100k.
If we want non-duplicated salaries for each department, we can do this:
(df.groupby('department')['salary']
.apply(lambda ser: ser.drop_duplicates().nlargest(3))
.droplevel(level=1)
.sort_index()
.reset_index()
)
This gives
department salary
0 Audit 110000
1 Audit 100000
2 Audit 70000
3 Management 250000
4 Management 200000
5 Management 150000
6 Sales 220000
7 Sales 200000
8 Sales 150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the top-2 rows for each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# for the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# for specific column rows
df1 = df.loc[msk, 'value']
Both of these are much faster than the groupby().apply() and groupby().nlargest() calls suggested in the other answers here (1, 2, 3). On a sample with 100k rows and 8000 groups, a %timeit test showed that it was 24-150 times faster than those solutions.
Instead of slicing, you can also pass a list/tuple/range to .nth():
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])

Dividing each row by the previous one

I have pandas dataframe:
df = pd.DataFrame()
df['city'] = ['NY','NY','LA','LA']
df['hour'] = ['0','12','0','12']
df['value'] = [12,24,3,9]
city hour value
0 NY 0 12
1 NY 12 24
2 LA 0 3
3 LA 12 9
I want, for each city, to divide each row by the previous one and write the result into a new dataframe. The desired output is:
city ratio
NY 2
LA 3
What's the most pythonic way to do this?
First divide by the shifted values per group:
df['ratio'] = df['value'].div(df.groupby('city')['value'].shift(1))
print (df)
city hour value ratio
0 NY 0 12 NaN
1 NY 12 24 2.0
2 LA 0 3 NaN
3 LA 12 9 3.0
Then remove NaNs and select only city and ratio column:
df = df.dropna(subset=['ratio'])[['city', 'ratio']]
print (df)
city ratio
1 NY 2.0
3 LA 3.0
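Starting from the original df, the two steps can also be chained into a single expression (same logic as above, just written as one pipeline):

df_ratio = (df.assign(ratio=df['value'].div(df.groupby('city')['value'].shift(1)))
              .dropna(subset=['ratio'])[['city', 'ratio']])
print(df_ratio)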
You can use pct_change:
In [20]: df[['city']].assign(ratio=df.groupby('city').value.pct_change().add(1)).dropna()
Out[20]:
city ratio
1 NY 2.0
3 LA 3.0
This'll do it (written with named aggregation, since passing a dict to rename the result was removed in recent pandas):
df.groupby('city')['value'].agg(ratio=lambda x: x.max()/x.min()).reset_index()
# city ratio
#0 LA 3
#1 NY 2
This is one way using a custom function. It assumes you want to ignore the NaN rows in the result of dividing one series by a shifted version of itself.
def divider(x):
    return x['value'] / x['value'].shift(1)

res = df.groupby('city').apply(divider)\
        .dropna().reset_index()\
        .rename(columns={'value': 'ratio'})\
        .loc[:, ['city', 'ratio']]
print(res)
city ratio
0 LA 3.0
1 NY 2.0
One way is:
df.groupby(['city']).apply(lambda x: x['value']/x['value'].shift(1))
For further improvement:
print(df.groupby(['city'])
        .apply(lambda x: (x['value'] / x['value'].shift(1)).fillna(method='bfill'))
        .reset_index()
        .drop_duplicates(subset=['city'])
        .drop('level_1', axis=1))
city value
0 LA 3.0
2 NY 2.0

Selecting random columns for each group of pyspark RDD/dataframe

My dataframe has 10,000 columns, and I have to apply some logic to each group (the key is region and dept). Each group will use at most 30 of the 10k columns; the list of 30 columns comes from the second dataset's column "colList". Each group has 2-3 million rows. My approach is to group by the key and call a function like below. But it fails: 1. the shuffle, 2. the data for a group is more than 2G (this can be solved by repartitioning, but it is costly), 3. it is very slow.
def testfunc(iter):
    # << some complex business logic which can't be done with the Spark API >>
    ...

resRDD = df.rdd.groupBy(region, dept).map(lambda x: testfunc(x))
Input:
region dept week val0 val1 val2 val3 ... val10000
US CS 1 1 2 1 1 ... 2
US CS 2 1.5 2 3 1 ... 2
US CS 3 1 2 2 2.1 2
US ELE 1 1.1 2 2 2.1 2
US ELE 2 2.1 2 2 2.1 2
US ELE 3 1 2 1 2 .... 2
UE CS 1 2 2 1 2 .... 2
Columns to pick for each group: (data set 2)
region dept colList
US CS val0,val10,val100,val2000
US ELE val2,val5,val800,val900
UE CS val21,val54,val806,val9000
My second solution is to create a new dataset from the input data with only the 30 columns and rename the columns to col1 to col30, then use a mapping list for each column and group. Then I can apply groupByKey (presumably), which will be much skinnier than the original input of 10K columns.
region dept week col0 col1 col2 col3 ... col30
US CS 1 1 2 1 1 ... 2
US CS 2 1.5 2 3 1 ... 2
US CS 3 1 2 2 2.1 2
US ELE 1 1.1 2 2 2.1 2
US ELE 2 2.1 2 2 2.1 2
US ELE 3 1 2 1 2 .... 2
UE CS 1 2 2 1 2 .... 2
Can anyone help convert the input from 10K columns to 30 columns? Any other alternative that avoids the group by would also be fine.
You can use the create_map function to convert all the 10k columns into a map per row. Then use a UDF which takes the map, region and dept and reduces the map to the 30 relevant columns, making sure to always use the same names for all 30 columns.
Lastly, you can wrap your complex function to receive the map instead of the original 10K columns. Hopefully this will make it small enough to work properly.
If not, you can get the distinct values of region and dept and, assuming there are few enough of them, loop over one and group by the other.
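A rough sketch of the map-based approach described above, assuming a Spark DataFrame df with columns region, dept, week, val0..val9999, and a small Python dict col_map built from the second dataset (col_map and keep_relevant are hypothetical names, not from the question):

from itertools import chain
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType, DoubleType

value_cols = [c for c in df.columns if c.startswith("val")]

# Pack every value column into a single map<string,double> per row.
packed = df.select(
    "region", "dept", "week",
    F.create_map(*chain.from_iterable(
        (F.lit(c), F.col(c).cast("double")) for c in value_cols
    )).alias("vals"),
)

# col_map is assumed to come from the second dataset, e.g.
# {("US", "CS"): ["val0", "val10", "val100", "val2000"], ...}
col_map = {("US", "CS"): ["val0", "val10", "val100", "val2000"]}

def keep_relevant(region, dept, vals):
    # Keep only the columns relevant to this (region, dept) group.
    wanted = col_map.get((region, dept), [])
    return {k: vals.get(k) for k in wanted}

keep_relevant_udf = F.udf(keep_relevant, MapType(StringType(), DoubleType()))

slim = packed.withColumn("vals", keep_relevant_udf("region", "dept", "vals"))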
