Let's assume the input dataset:
test1 = [[0,7,50], [0,3,51], [0,3,45], [1,5,50],[1,0,50],[2,6,50]]
df_test = pd.DataFrame(test1, columns=['A','B','C'])
that corresponds to:
A B C
0 0 7 50
1 0 3 51
2 0 3 45
3 1 5 50
4 1 0 50
5 2 6 50
I would like to obtain the a dataset grouped by 'A', together with the most common value for 'B' in each group, and the occurrences of that value:
A most_freq freq
0 3 2
1 5 1
2 6 1
I can obtain the first 2 columns with:
grouped = df_test.groupby("A")
out_df = pd.DataFrame(index=grouped.groups.keys())
out_df['most_freq'] = df_test.groupby('A')['B'].apply(lambda x: x.value_counts().idxmax())
but I am having problems the last column.
Also: is there a faster way that doesn't involve 'apply'? This solution doesn't scale well with lager inputs (I also tried dask).
Thanks a lot!
Use SeriesGroupBy.value_counts which sorting by default, so then add DataFrame.drop_duplicates for top values after Series.reset_index:
df = (df_test.groupby('A')['B']
.value_counts()
.rename_axis(['A','most_freq'])
.reset_index(name='freq')
.drop_duplicates('A'))
print (df)
A most_freq freq
0 0 3 2
2 1 0 1
4 2 6 1
Suppose I have pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it with numbering records within group after groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there more effective/elegant approach to do this? And also is there more elegant approach to number records within each group (like SQL window function row_number()).
Did you try
df.groupby('id').head(2)
Output generated:
id value
id
1 0 1 1
1 1 2
2 3 2 1
4 2 2
3 7 3 1
4 8 4 1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1 2 3
1 2
2 6 4
5 3
3 7 1
4 8 1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
Sometimes sorting the whole data ahead is very time consuming.
We can groupby first and doing topk for each group:
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk,['value'])).reset_index(drop=True)
df.groupby('id').apply(lambda x : x.sort_values(by = 'value', ascending = False).head(2).reset_index(drop = True))
Here sort values ascending false gives similar to nlargest and True gives similar to nsmallest.
The value inside the head is the same as the value we give inside nlargest to get the number of values to display for each group.
reset_index is optional and not necessary.
This works for duplicated values
If you have duplicated values in top-n values, and want only unique values, you can do like this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
id first_name last_name department salary
24 12 Shandler Bing Audit 110000
25 14 Jason Tom Audit 100000
26 16 Celine Anston Audit 100000
27 15 Michale Jackson Audit 70000
If we do not remove duplicates, for the audit department we get top 3 salaries as 110k,100k and 100k.
If we want to have not-duplicated salaries per each department, we can do this:
(df.groupby('department')['salary']
.apply(lambda ser: ser.drop_duplicates().nlargest(3))
.droplevel(level=1)
.sort_index()
.reset_index()
)
This gives
department salary
0 Audit 110000
1 Audit 100000
2 Audit 70000
3 Management 250000
4 Management 200000
5 Management 150000
6 Sales 220000
7 Sales 200000
8 Sales 150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the top-2 rows for each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# for the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# for specific column rows
df1 = df.loc[msk, 'value']
Both of these are much faster than groupby().apply() and groupby().nlargest() calls as suggested in the other answers on here(1, 2, 3). On a sample with 100k rows and 8000 groups, a %timeit test showed that it was 24-150 times faster than those solutions.
Also, instead of slicing, you can also pass a list/tuple/range to a .nth() call:
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])
I have been looking for a way to do this but haven't been successful in finding something that will work in python/pandas.
I am looking to loop through rows until a 1 is found again and concatenate the previous rows until the one is found and place that in a third column.
For Example:
df
'A' 'B' 'C'
1 4 4
2 3 43
3 1 431
4 2 4312
1 5 5
2 4 54
1 2 2
2 2 22
3 4 224
df
if df['A'] == 1
df['C'] = df.concat['B']
else
df['C'] = df.concat['B'] + df.concat['B'+1]
If you can't tell, this is my first time trying to write a loop.
Any help generating column C from columns A and B using python code would be well appreciated.
Thank you,
James
This can achieve what you need , create a new key by using cumsum, then we groupby the key we created , using cumsum again
df.groupby(df.A.eq(1).cumsum()).B.apply(lambda x : x.astype(str).cumsum())
Out[838]:
0 4
1 43
2 431
3 4312
4 5
5 54
6 2
7 22
8 224
Name: B, dtype: object
My first data frame has various columns one of which contains ID column and my second data frame has various columns one of which contains a No so I have found the link between the two. However how can I link these together using the number to assign the postcode information from data frame 2 to the correct practice in data frame 1.
Any help would be greatly appreciated!!!
Date frame 1
ID place Items Cost
0 5 10 2001.00
1 12 2 20.98
2 2 4 100.80
3 7 7 199.60
Data frame 2
ID No Dr Postcode
0 1 Dr.K BT94 7HX
1 5 Dr.H BT7 4MC
2 3 Dr.Love BT9 1HE
3 7 Dr.Kerr BT72 4TX
I want to create a new column 'Postcode' in Data frame 1 and assign the postcode to the correct Practice
ID Place Items Cost Postcode
0 5 10 BT7 4MC
1 2 3 BT9 1HE
2 22 8 BT62 4TU
3 7 7 BT72 4TX
How can I do this??
IIUC, I think what you are looking for is 'left_on' and 'right_on' parameters in merge:
df1.merge(df2, left_on='Practice', right_on='Prac No')
Output:
ID_x Practice Items Cost ID_y Prac No Dr Postcode
0 0 5 10 2001.0 1 5 Dr.H BT7 4MC
1 3 7 7 199.6 3 7 Dr.Kerr BT72 4TX
Or another way is to use set_index and map:
df1['Postcode'] = df1['Practice'].map(df2.set_index('Prac No')['Postcode'])
df1
Output:
ID Practice Items Cost Postcode
0 0 5 10 2001.00 BT7 4MC
1 1 12 2 20.98 NaN
2 2 2 4 100.80 NaN
3 3 7 7 199.60 BT72 4TX
I realize this question is similar to join or merge with overwrite in pandas, but the accepted answer does not work for me since I want to use the on='keys' from df.join().
I have a DataFrame df which looks like this:
keys values
0 0 0.088344
1 0 0.088344
2 0 0.088344
3 0 0.088344
4 0 0.088344
5 1 0.560857
6 1 0.560857
7 1 0.560857
8 2 0.978736
9 2 0.978736
10 2 0.978736
11 2 0.978736
12 2 0.978736
13 2 0.978736
14 2 0.978736
Then I have a Series s (which is a result from some df.groupy.apply()) with the same keys:
keys
0 0.183328
1 0.239322
2 0.574962
Name: new_values, dtype: float64
Basically I want to replace the 'values' in the df with the values in the Series, by keys so every keys block gets the same new value. Currently, I do it as follows:
df = df.join(s, on='keys')
df['values'] = df['new_values']
df = df.drop('new_values', axis=1)
The obtained (and desired) result is then:
keys values
0 0 0.183328
1 0 0.183328
2 0 0.183328
3 0 0.183328
4 0 0.183328
5 1 0.239322
6 1 0.239322
7 1 0.239322
8 2 0.574962
9 2 0.574962
10 2 0.574962
11 2 0.574962
12 2 0.574962
13 2 0.574962
14 2 0.574962
That is, I add it as a new column and by using on='keys' it gets the corrects shape. Then I assign values to be new_values and remove the new_values column. This of course works perfectly, the only problem being that I find it extremely ugly.
Is there a better way to do this?
You could try something like:
df = df[df.columns[df.columns!='values']].join(s, on='keys')
Make sure s is named 'values' instead of 'new_values'.
To my knowledge, pandas doesn't have the ability to join with "force overwrite" or "overwrite with warning".