I am working with a pandas dataframe where I have the following two columns: "personID" and "points". I would like to create a third variable ("localMin") which will store the minimum value of the column "points" at each point in the dataframe as compared with all previous values in the "points" column for each personID (see image below).
Does anyone have an idea how to achieve this most efficiently? I have approached this problem using shift() with different period sizes, but of course, shift is sensitive to variations in the sequence and doesn't always produce the output I would expect.
Thank you in advance!
Use groupby.cummin:
df['localMin'] = df.groupby('personID')['points'].cummin()
Example:
df = pd.DataFrame({'personID': list('AAAAAABBBBBB'),
'points': [3,4,2,6,1,2,4,3,1,2,6,1]
})
df['localMin'] = df.groupby('personID')['points'].cummin()
output:
personID points localMin
0 A 3 3
1 A 4 3
2 A 2 2
3 A 6 2
4 A 1 1
5 A 2 1
6 B 4 4
7 B 3 3
8 B 1 1
9 B 2 1
10 B 6 1
11 B 1 1
I need to find month by month way of showing year to date unique values. For example:
month value
1 a
1 b
1 a
2 a
2 a
2 a
3 c
3 b
3 b
4 d
4 e
4 f
Should output:
Month Monthly unique Year to date unique
1 2 2
2 1 2
3 2 3
4 3 6
For monthly unique it is just a matter of group by and unique(), but it won't work for year-to-date this way. Year-to-date may be achieved by using for loop and filtering dataframe month by month since the beginning of the year, but it's slow, non-pythonic way I want to omit.
How to do it in efficient way?
Let us do
s = df.groupby('month').value.agg(['nunique',list])
s['list'] = s['list'].cumsum().map(lambda x : len(set(x)))
s
nunique list
month
1 2 2
2 1 2
3 2 3
4 3 6
BEN_YO's approach is pretty simple and effective for small datasets. However, it can be slow and costly on big dataframe due to cumsum on lists (of strings).
Let's try drop_duplicates first and only work on duplicates:
(df.drop_duplicates(['month','value'])
.assign(year=lambda x: ~x.duplicated(['value']))
.groupby('month')
.agg({'value':'nunique', 'year':'sum'})
.assign(year=lambda x: x.year.cumsum())
)
Output:
value year
month
1 2 2
2 1 2
3 2 3
4 3 6
I have a dataframe like this:
user_id order_id
0 a 1
1 a 2
2 a 3
3 b 4
4 c 5
Now I want to add a column to show whether the user of each order has multiple orders:
user_id order_id repetitive
0 a 1 1
1 a 2 1
2 a 3 1
3 b 4 0
4 c 5 0
Since a has three orders, the tag is 1. I know the method value_counts can calculate the result but it only shows the result after groupby. I want to combine it with the origin dataframe. How can I achieve this?
Use groupby and transform to get your counts while maintaining the same structure.
df['repetitive'] = df.groupby('user_id').transform('count').gt(1).astype(int)
Suppose I have pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it with numbering records within group after groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there more effective/elegant approach to do this? And also is there more elegant approach to number records within each group (like SQL window function row_number()).
Did you try
df.groupby('id').head(2)
Output generated:
id value
id
1 0 1 1
1 1 2
2 3 2 1
4 2 2
3 7 3 1
4 8 4 1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1 2 3
1 2
2 6 4
5 3
3 7 1
4 8 1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
Sometimes sorting the whole data ahead is very time consuming.
We can groupby first and doing topk for each group:
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk,['value'])).reset_index(drop=True)
df.groupby('id').apply(lambda x : x.sort_values(by = 'value', ascending = False).head(2).reset_index(drop = True))
Here sort values ascending false gives similar to nlargest and True gives similar to nsmallest.
The value inside the head is the same as the value we give inside nlargest to get the number of values to display for each group.
reset_index is optional and not necessary.
This works for duplicated values
If you have duplicated values in top-n values, and want only unique values, you can do like this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
id first_name last_name department salary
24 12 Shandler Bing Audit 110000
25 14 Jason Tom Audit 100000
26 16 Celine Anston Audit 100000
27 15 Michale Jackson Audit 70000
If we do not remove duplicates, for the audit department we get top 3 salaries as 110k,100k and 100k.
If we want to have not-duplicated salaries per each department, we can do this:
(df.groupby('department')['salary']
.apply(lambda ser: ser.drop_duplicates().nlargest(3))
.droplevel(level=1)
.sort_index()
.reset_index()
)
This gives
department salary
0 Audit 110000
1 Audit 100000
2 Audit 70000
3 Management 250000
4 Management 200000
5 Management 150000
6 Sales 220000
7 Sales 200000
8 Sales 150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the top-2 rows for each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# for the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# for specific column rows
df1 = df.loc[msk, 'value']
Both of these are much faster than groupby().apply() and groupby().nlargest() calls as suggested in the other answers on here(1, 2, 3). On a sample with 100k rows and 8000 groups, a %timeit test showed that it was 24-150 times faster than those solutions.
Also, instead of slicing, you can also pass a list/tuple/range to a .nth() call:
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])
So I have a Dataframe that is the same thing 348 times, but with a different date as a static column. What I would like to do is add a column that checks against that date and then counts the number of rows that are within 20 miles using a lat/lon column and geopy.
My frame is like this:
What I am looking to do is something like an apply function that takes all of the identifying dates that are equal to the column and then run this:
geopy.distance.vincenty(x, y).miles
X would be the location's lat/lon and y would be the iterative lat/lon. I'd want the count of locations in which the above is < 20. I'd then like to store this count as a column in the initial Dataframe.
I'm ok with Pandas, but this is just outside my comfort zone. Thanks.
I started with this DataFrame (because I did not want to type that much by hand and you did not provide any code for the data):
df
Index Number la ID
0 0 1 [43.3948, -23.9483] 1/1/90
1 1 2 [22.8483, -34.3948] 1/1/90
2 2 3 [44.9584, -14.4938] 1/1/90
3 3 4 [22.39458, -55.34924] 1/1/90
4 4 5 [33.9383, -23.4938] 1/1/90
5 5 6 [22.849, -34.397] 1/1/90
Now I introduced an artificial column which is only there to help us get the cartesian product of the distances
df['join'] = 1
df_c = pd.merge(df, df[['la', 'join','Index']], on='join')
The next step is to apply the vincenty function via .apply and store the result in an extra column
df_c['distance'] = df_c.apply(lambda x: distance.vincenty(x.la_x, x.la_y).miles, 1)
Now we have the cartesian product of the original matrix, which means we have the comparison of each city with itself, too. But we will take that into account in the next step by performing -1. We groupby the Index_x and sum all the distances smaller the 20 miles.
df['num_close_cities'] = df_c.groupby('Index_x').apply(lambda x: sum((x.distance < 20))) -1
df.drop('join', 1)
Index Number la ID num_close_cities
0 0 1 [43.3948, -23.9483] 1/1/90 0
1 1 2 [22.8483, -34.3948] 1/1/90 1
2 2 3 [44.9584, -14.4938] 1/1/90 0
3 3 4 [22.39458, -55.34924] 1/1/90 0
4 4 5 [33.9383, -23.4938] 1/1/90 0
5 5 6 [22.849, -34.397] 1/1/90 1