Pandas - group by ID, assign category - python

I have a pandas dataframe with approx 60,000 records that looks like this:
ID P1 YEAR
0 20184045 MK 2020
1 20184045 GF 2020
2 20184011 EC 2020
3 20184011 MK 2020
4 20184011 EC 2020
5 20180673 GF 2020
Here ID is the record's ID (an 8-digit integer), P1 is a property that can take 10 distinct values (all 2-character strings), and YEAR is between 1995 and 2020. Each ID can have records with between 1 and 5 different year values.
I want to obtain 2 additional dataframes:
one that gives me the number of distinct values of P1 for each ID and year, which would look like this:
ID YEAR NUMBER OF DISTINCT VALUES OF P1 FOR EACH YEAR
0 20184045 2020 n
1 20184045 2019
2 20184045 2018
3 20184045 2017
4 20184011 2020
5 20180673 2020
My second dataframe would count the total number of distinct values of P1 for each ID.
ID NUMBER OF DISTINCT VALUES OF P1 OVERALL
0 123 n1
1 456 n2
2 789 n3
3 987 n4
4 654 n1
5 321 n2
I tried looking up how to iterate over a dataframe with iterrows() and iteritems(), but I have been unable to find how to iterate over 3 columns at the same time while grouping by ID.
I've also looked into itertuples(), which yields namedtuples and seemed more promising, but I've been unable to find a satisfactory solution.

You can do this with two groupbys:
df1 = (df.groupby(['ID', 'YEAR'])['P1']
         .nunique()
         .reset_index(name='Number of Unique P1')
      )
df2 = (df.groupby('ID')['P1']
         .nunique()
         .reset_index(name='Number of Unique P1')
      )
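For reference, on the six sample rows above these give something along these lines (counts worked out by hand from the sample, so treat it as a sketch of the shape rather than verified output):
>>> df1
         ID  YEAR  Number of Unique P1
0  20180673  2020                    1
1  20184011  2020                    2
2  20184045  2020                    2
>>> df2
         ID  Number of Unique P1
0  20180673                    1
1  20184011                    2
2  20184045                    2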

Related

How to group and aggregate data starting from constant and ending on changing date? [duplicate]

I need to aggregate data between a constant date, like the first day of the year, and all the other dates through the year. There are two variants of this problem:
easier - sum:
created_at value
01-01-2012 5
02-01-2012 6
05-01-2012 1
05-01-2012 1
01-02-2012 3
02-02-2012 2
05-02-2012 1
which should output:
Date Month to date sum Year to date sum
01-01-2012 5 5
02-01-2012 11 11
05-01-2012 13 13
01-02-2012 3 14
02-02-2012 5 15
05-02-2012 6 16
and harder - count unique:
created_at value
01-01-2012 a
02-01-2012 b
05-01-2012 c
05-01-2012 c
01-02-2012 a
02-02-2012 a
05-02-2012 d
which should output:
Date Month to date unique Year to date unique
01-01-2012 1 1
02-01-2012 2 2
05-01-2012 3 3
01-02-2012 1 3
02-02-2012 1 3
05-02-2012 2 4
The data is, of course, in a pandas dataframe. The obvious, but very clumsy, way is to create a for loop between the starting date and the moving one. The problem looks like a popular one. Is there some reasonable pandas built-in way for this type of computation? Regarding counting unique values, I also want to avoid stacking lists, as I have a large number of rows and unique values.
I was checking out pandas window functions, but they don't look like a solution.
Try with groupby:
Cumulative sum:
df["created_at"] = pd.to_datetime(df["created_at"], format="%d-%m-%Y")
df["Month to date sum"] = df.groupby(df["created_at"].dt.month)["value"].transform('cumsum')
df["Year to date sum"] = df.groupby(df["created_at"].dt.year)["value"].transform('cumsum')
>>> df
created_at value Month to date sum Year to date sum
0 2012-01-01 5 5 5
1 2012-01-02 6 11 11
2 2012-01-05 1 12 12
3 2012-02-01 3 3 15
4 2012-02-02 2 5 17
5 2012-02-05 1 6 18
Cumulative unique count:
df2["created_at"] = pd.to_datetime(df2["created_at"], format="%d-%m-%Y")
df2["Month to date unique"] = df2.groupby(df2["created_at"].dt.month)["value"].apply(lambda x: (~x.duplicated()).cumsum())
df2["Year to date unique"] = df2.groupby(df2["created_at"].dt.year)["value"].apply(lambda x: (~x.duplicated()).cumsum())
>>> df2
created_at value Month to date unique Year to date unique
0 2012-01-01 a 1 1
1 2012-01-02 b 2 2
2 2012-01-05 c 3 3
3 2012-02-01 a 1 3
4 2012-02-02 a 1 3
5 2012-02-05 d 2 4
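Note that grouping on dt.month alone only works because the sample covers a single year; if your data spans several years, a sketch of the same idea (assuming the same column names) would group on both components, and the same adjustment applies to the month-to-date unique count:
df["Month to date sum"] = (
    df.groupby([df["created_at"].dt.year, df["created_at"].dt.month])["value"]
      .transform("cumsum")
)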

Getting max row from multi-index table

I have a table that looks similar to this:
user_id date count
1       2020  5
        2021  7
2       2017  1
3       2020  2
        2019  1
        2021  3
I'm trying to keep only the row for each user_id that has the greatest count, so it should look something like this:
user_id date count
1       2021  7
2       2017  1
3       2021  3
I've tried using df.groupby(level=0).apply(max), but it removes the date column from the final table, and I'm not sure how to modify it to keep all three original columns.
You can select only the count column after .groupby() and then use .apply() to generate a boolean series indicating whether each entry in a group equals that group's maximum count. Then use .loc with that boolean series to display the whole dataframe.
df.loc[df.groupby(level=0)['count'].apply(lambda x: x == x.max())]
Result:
date count
user_id
1 2021 7
2 2017 1
3 2021 3
Note that if one user_id has multiple entries with the same greatest count, all of those entries will be kept.
If in that case you want to keep only one entry per user_id, you can use the following logic instead:
df1 = df.reset_index()
df1.loc[df1.groupby('user_id')['count'].idxmax()].set_index('user_id')
Result:
date count
user_id
1 2021 7
2 2017 1
3 2021 3
Note that we cannot simply use df.loc[df.groupby(level=0)["count"].idxmax()], because user_id is the row index. That code just returns all the rows, unfiltered, exactly like the original dataframe. This is because the index that idxmax() returns here is the user_id itself (instead of a simple RangeIndex 0, 1, 2, ...). When .loc then locates those user_id labels, it simply returns all entries under the same user_id.
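A minimal sketch of that pitfall on the base df above:
idx = df.groupby(level=0)['count'].idxmax()
# idx holds the user_id labels themselves, e.g. [1, 2, 3]
df.loc[idx]  # .loc on those labels returns every row for each user_id, i.e. the whole frame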
Demo
Let's add more entries to the sample data and see the differences between the 2 solutions:
Our base df (user_id is the row index):
date count
user_id
1 2018 7 <=== max1
1 2020 5
1 2021 7 <=== max2
2 2017 1
3 2020 3 <=== max1
3 2019 1
3 2021 3 <=== max2
1st Solution result:
df.loc[df.groupby(level=0)['count'].apply(lambda x: x == x.max())]
date count
user_id
1 2018 7
1 2021 7
2 2017 1
3 2020 3
3 2021 3
2nd Solution result:
df1 = df.reset_index()
df1.loc[df1.groupby('user_id')['count'].idxmax()].set_index('user_id')
date count
user_id
1 2018 7
2 2017 1
3 2020 3

Pandas Dataframe comparison

I have 2 very large dataframes of 20k+ rows each: df_input and df_output.
df_input is made of test cases;
df_output is filled with the results from those test cases.
I need to select all the case numbers which failed from df_output and then fix those cases in the df_input dataframe. The fix is selecting a new unique date for each case_id.
The new unique date has to be within 7*k days of the prior date, before or after, so I need to use datetimes.
Basically, I want to do this:
select the failing case numbers from the output result
=> output_sheet[output_sheet[output_result =='FAIL']]
=> get the results in some array or vector (how?)
go to input_sheet, do
=> input_df.groupBy(input_carId)
=> replace the failing dates with a new unique date within +-7k days of that old date
but it has to be a unique date for that input_carId. So I think I need to use unique().
I cannot use output_df as input_df; they're 2 very different sheets. I greatly simplified their schema here; they only share 3 columns. Also, there are actually 20,000+ such rows and IDs.
In the end I want the old input_df, but updated with the new dates.
output_df
case_id output_date output_carId ouput_result
1 01/20/21 001 FAIL
2 02/21/21 001 SUCCESS
3 02/08/20 003 FAIL
4 01/07/20 001 FAIL
5 09/05/20 002 SUCCESS
input_df (old)
case_id input_date input_carId
1 01/20/21 001
2 02/21/21 002
3 02/08/20 003
4 01/07/20 001
5 09/05/20 002
expected result =>
input_df (new)
case_id input_date input_carId
1 01/13/21 001
2 02/21/21 002
3 02/22/20 003
4 01/28/20 001
5 09/05/20 002
Notice the dates for the failed cases (rows 1, 3, 4) have changed by a multiple of ±7 days.
Use a custom function to add or subtract multiples of 7 days for the rows with FAIL:
output_df['output_date'] = pd.to_datetime(output_df['output_date'])
input_df['input_date'] = pd.to_datetime(input_df['input_date'])
cases = output_df.loc[output_df['ouput_result'] =='FAIL', 'case_id']
print (cases)
0 1
2 3
3 4
Name: case_id, dtype: int64
def func(dates):
    # count the number of failed rows
    count = len(dates)
    # generate a range from the count of failed rows, multiples of 7 (0 omitted)
    arr = np.arange(1, count + 1) * 7
    # shuffle for randomness
    np.random.shuffle(arr)
    # generate timedeltas to add or subtract
    td = pd.to_timedelta(arr, unit='d')
    less = dates - td
    more = dates + td
    # randomly choose add or subtract
    rand = np.random.randint(2, size=count, dtype=bool)
    # return dates shifted by +- multiples of 7 days
    return np.where(rand, less, more)

# filter by cases
mask = input_df['case_id'].isin(cases)
input_df.loc[mask, 'input_date'] = (input_df[mask].groupby('input_carId')['input_date']
                                                  .transform(func))
print (input_df)
case_id input_date input_carId
0 1 2021-02-03 1
1 2 2021-02-21 2
2 3 2020-02-15 3
3 4 2020-01-14 1
4 5 2020-09-05 2
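Since the question requires the replacement date to stay unique within each input_carId, a quick sanity check of that requirement could be added afterwards (a sketch; the random ±7·k shifts above make collisions unlikely but do not rule them out):
assert not input_df.duplicated(subset=['input_carId', 'input_date']).any()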

A way to iterate through rows and columns (in a pandas data frame) and select rows and columns based on conditions to put into another pandas data frame

I have a data frame with over 1500 rows.
A sample of the table is like so:
Site 2019 2020 2021 ....
ABC 0 1 2
DEF 1 1 2
GHI 2 0 1
JKL 0 0 0
MNO 2 1 1
I want to create a new dataframe which only selects sites and years if they have:
a value in 2019
if 2019 has a value greater than or equal to the value in the next years
if there is a greater value in the next year, then the value of the previous year
if the next year has a value less than the previous year
so the output for the example would be:
Site 2019 2020 2021 ....
DEF 1 1 1
GHI 2
MNO 2 1 1
DEF gets a 1 in 2021 because there is a 1 in 2020.
I tried to use the following to find the rows with values in the 2019 column but
for i.j in df.iterrows():
if when j=2
if i >0
return value
but I get syntax errors
Without looping over the rows you can do:
df1 = df[(df[2019] > 0) & (df.loc[:, 2020:].min(axis=1) <= df.loc[:, 2019])]
cols = df1.columns.tolist()
for i in range(2, len(cols)):
    df1[cols[i]] = df1.loc[:, cols[i - 1: i + 1]].min(axis=1)
df1
Output:
2019 2020 2021
DEF 1 1 1
GHI 2 0 0
MNO 2 1 1
This should work as long as you don't have too many columns. Add another comparison for each pair of years that needs to be compared. This will be a reference to the original df unless you use .copy() to make a deep copy.
new_df = df[(df['2019'] > 0) & (df['2019'] <= df['2020']) & (df['2020'] <= df['2021']) & (df['2021'] <= df['2022'])]
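For instance, to get an independent frame rather than a view of df (the same conditions as above, only with .copy() appended):
new_df = df[(df['2019'] > 0) & (df['2019'] <= df['2020']) & (df['2020'] <= df['2021']) & (df['2021'] <= df['2022'])].copy()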

Pandas Time-Series: Find previous value for each ID based on year and semester

I realize this is a fairly basic question, but I couldn't find what I'm looking for through searching (partly because I'm not sure how to summarize what I want). In any case:
I have a dataframe that has the following columns:
* ID (each one represents a specific college course)
* Year
* Term (0 = fall semester, 1 = spring semester)
* Rating (from 0 to 5)
My goal is to create another column for Previous Rating. This column would be equal to the course's rating the last time the course was held, and would be NaN for the first offering of the course. The goal is to use the course's rating from the last time the course was offered in order to predict the current semester's enrollment. I am struggling to figure out how to find the last offering of each course for a given row.
I'd appreciate any help in performing this operation! I am working in Pandas but could move my data to R if that'd make it easier. Please let me know if I need to clarify my question.
I think there are two critical points: (1) sorting by Year and Term so that the order corresponds to temporal order; and (2) using groupby to collect on IDs before selecting and shifting the Rating. So, from a frame like
>>> df
ID Year Term Rating
0 1 2010 0 2
1 2 2010 0 2
2 1 2010 1 1
3 2 2010 1 0
4 1 2011 0 3
5 2 2011 0 3
6 1 2011 1 4
7 2 2011 1 0
8 2 2012 0 4
9 2 2012 1 4
10 1 2013 0 2
We get
>>> df = df.sort_values(["ID", "Year", "Term"])
>>> df["Previous_Rating"] = df.groupby("ID")["Rating"].shift()
>>> df
ID Year Term Rating Previous_Rating
0 1 2010 0 2 NaN
2 1 2010 1 1 2
4 1 2011 0 3 1
6 1 2011 1 4 3
10 1 2013 0 2 4
1 2 2010 0 2 NaN
3 2 2010 1 0 2
5 2 2011 0 3 0
7 2 2011 1 0 3
8 2 2012 0 4 0
9 2 2012 1 4 4
Note that we didn't actually need to sort by ID -- the groupby would have worked equally well without it -- but this way it's easier to see that the shift has done the right thing. Reading up on the split-apply-combine pattern might be helpful.
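If you want the rows back in their original order afterwards (the frame above is now sorted by ID), one option, assuming the default integer index shown above is still in place, is:
df = df.sort_index()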
Use this function to create the new column...
DataFrame.shift(periods=1, freq=None, axis=0, **kwds)
Shift index by desired number of periods with an optional time freq
Let's say you have a dataframe like this...
ID Rating Term Year
1 1 0 2002
2 2 1 2003
3 3 0 2004
2 4 0 2005
where ID is the course ID and you have multiple entries for each ID based on year and semester. Your goal is to find the row for a given ID in its most recent year and term.
For that you can do this...
df[((df['Year'] == max(df.Year)) & (df['ID'] == 2) & (df['Term'] == 0))]
Here we find the course with the given ID and term in its last offering. If you want the rating, you can do
df[((df['Year'] == max(df.Year)) & (df['ID'] == 2) & (df['Term'] == 0))].Rating
Hope this is the result you were trying to accomplish.
Thanks.
