Grouping dataframe by custom date - python

I have a large dataframe that I'm trying to group by date, in one case by minute and in another by 30 minutes.
import pandas as pd

df = pd.read_csv('2015-09-01.csv', header=None,
                 names=['ID', 'CITY', 'STATE', 'TIMESTAMP', 'TWEET'],
                 usecols=['STATE', 'TIMESTAMP', 'TWEET'],
                 parse_dates=['TIMESTAMP'], low_memory=False)
Method 1
I have used this solution but if I try the following:
df = df2.groupby([df2.TIMESTAMP,pd.TimeGrouper(freq='H')])
It results in this error:
TypeError: axis must be a DatetimeIndex, but got an instance of 'Int64Index'
which is very weird because TIMESTAMP is parsed as dates in read_csv.
Method 2
I tried setting TIMESTAMP into index then doing:
df = df2.groupby([df2.index,pd.TimeGrouper(freq='H')])
However the result isn't right: len(df) is 1350 rather than 24, even though the dataframe covers a single day of data.
Method 3
I used this solution but I'm not sure how to set it to a 30-minute interval:
df = df2.groupby(df2['TIMESTAMP'].map(lambda x: x.hour))
Sample Data
STATE,TIMESTAMP,TWEET
0,TX,2015-09-25 00:00:01,Wish I could have gone to the game
1,USA,2015-09-25 00:00:01,PSA: #HaileyCassidyy and I are not related in...
2,USA,2015-09-25 00:00:02,If you gonna fail don't bring some one down wi...
3,NJ,2015-09-25 00:00:02,#_falastinia hol up hol up I can't listen to t...
4,USA,2015-09-25 00:00:02,"Wind 0.0 mph ---. Barometer 30.235 in, Rising ..."
5,NJ,2015-09-25 00:00:03,WHY ISNT GREYS ANATOMY ON?!
6,MI,2015-09-25 00:00:03,#cody_cole06 you bet it is
7,WA,2015-09-25 00:00:04,"Could be worse, I guess, could be in a collisi..."
8,NY,2015-09-25 00:00:04,I'm totally using this graphic some day... tha...
9,USA,2015-09-25 00:00:04,#MKnightOwl #Andromehda LMAO I honestly didn't..

To group a column by a frequency, you need to pass its name to the key parameter of the Grouper, like this:
df.groupby(pd.Grouper(key='TIMESTAMP', freq='30T'))
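For example, to count tweets per 30-minute bucket, and likewise per minute (a minimal sketch; .size() counts the rows falling in each bucket):

per_30min = df.groupby(pd.Grouper(key='TIMESTAMP', freq='30T')).size()
per_minute = df.groupby(pd.Grouper(key='TIMESTAMP', freq='T')).size()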
Edit:
See the Grouper docs for more - but in general, when you do groupby([a,b]) you are grouping by the unique combinations of a and b.
So in your example you were grouping by all the unique timestamp values (df['TIMESTAMP']) and by a time grouper applied to the index (pd.TimeGrouper defaults to the index if no key is specified); the TypeError arose because your index was not datetimelike.
This is also why you got such a large number of groups after setting the index to 'TIMESTAMP': every unique timestamp still formed its own group.
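Once the index is datetimelike, the index-based grouping from Method 2 also works; a minimal sketch:

df2 = df2.set_index('TIMESTAMP')
hourly = df2.groupby(pd.Grouper(freq='H')).size()  # 24 groups for one full day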

Related

How to best calculate average issue age for a range of historic dates (pandas)

Objective:
I need to show the trend in ageing of issues. e.g. for each date in 2021 show the average age of the issues that were open as at that date.
Starting data (historic issue list), "df":
ref    Created      resolved
a1     20/4/2021    15/5/2021
a2     21/4/2021    15/6/2021
a3     23/4/2021    15/7/2021
Endpoint, "df2":

Date        Avg_age
1/1/2021    x
2/1/2021    y
3/1/2021    z
where x,y,z are the calculated averages of age for all issues open on the Date.
Tried so far:
I got this to work in what feels like a very poor way.
create a date range: pd.date_range(start, finish, freq="D")
I loop through the dates in this range and for each date I filter the "df" dataframe (boolean filtering) to show only issues live on the date in question. Then calc age (date - created) and average for those. Each result appended to a list.
once done I just convert the list into a series for my final result, which I can then graph or whatever.
hist_dates = pd.date_range(start="2021-01-01", end="2021-12-31", freq="D")
result_list = []
for each_date in hist_dates:
    f1 = df.Created < each_date    # filter 1: issue already created
    f2 = df.Resolved >= each_date  # filter 2: issue not yet resolved
    df['Refdate'] = each_date      # make column to allow refdate - created
    df['Age'] = df.Refdate - df.Created
    result_list.append(df[f1 & f2].Age.mean())
Problems:
This works, but it feels sloppy and it doesn't seem fast. The current data-set is small, but I suspect this wouldn't scale well. I'm trying not to solve everything with loops as I understand it is a common mistake for beginners like me.
I'll give you two solutions: the first is step-by-step so you can follow the idea and process; the second replicates the same functionality in a much more condensed way, skipping some intermediate steps.
First, create a new column that holds your issue age, i.e. df['age'] = df.resolved - df.Created (I'm assuming your columns are of datetime type; if not, use pd.to_datetime to convert them).
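For example, to do the conversion with day-first date strings as in the sample (dayfirst=True is an assumption about your date format):

df['Created'] = pd.to_datetime(df['Created'], dayfirst=True)
df['resolved'] = pd.to_datetime(df['resolved'], dayfirst=True)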
You can then use groupby to group your data by creation date. This will internally slice your dataframe into several pieces, one for each distinct value of Created, grouping all values with the same creation date together. This way, you can then use aggregation on a creation date level to get the average issue age like so
# [['Created', 'age']] selects only the columns you are interested in
df[['Created', 'age']].groupby('Created').mean()
With an additional fourth data point [a4, 2021/4/20, 2021/4/30] (to enable some proper aggregation), this would end up giving you the following Series with the average issue age by creation date:
age
Created
2021-04-20 17 days 12:00:00
2021-04-21 55 days 00:00:00
2021-04-23 83 days 00:00:00
A more condensed way of doing this is to define a custom function and apply it to each creation date grouping:
def issue_age(g: pd.DataFrame):
    # each group holds the issues sharing one creation date
    return (g['resolved'] - g['Created']).mean()

df.groupby('Created').apply(issue_age)
This call will give you the same Series as before.
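If you prefer named aggregation over apply, the same numbers come out of this equivalent call (a sketch using the 'age' column from the first solution):

df.groupby('Created').agg(age=('age', 'mean'))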

Is there a better way to group by a category, and then select values based on different column values in Pandas?

I have an issue where I want to group by a date column, sort by a time column, and grab the resulting values in the values column.
The data looks something like this:
time value date
0 12.850000 19.195359 08-22-2019
1 9.733333 13.519543 09-19-2019
2 14.083333 9.191413 08-26-2019
3 16.616667 18.346598 08-19-2019
...
where every date can occur multiple times, recording values at different points during the day.
I wanted to group by date, and extract the minimum and maximum values of those groupings so I did this:
dayMin = df.groupby('date').value.min()
which gives me a Series object that is fairly easy to manipulate. The issue comes up when I want to group by 'date', sort by 'time', then grab the 'value'.
What I did was:
dayOpen = df.groupby('date').apply(lambda df: df[ df.time == df.time.min() ])['value']
which almost worked, resulting in a DataFrame of:
date
08-19-2019 13344 17.573522
08-20-2019 12798 19.496609
08-21-2019 2009 20.033917
08-22-2019 5231 19.393700
08-23-2019 12848 17.784213
08-26-2019 417 9.717627
08-27-2019 6318 7.630234
I figured out how to clean up those nasty indexes to the left, name the column, and even concat with my dayMin Series to achieve my goal.
Ultimately my question is if there is a nicer way to perform these data manipulations that follow the general pattern of: "Group by column A, perform filtering or sorting operation on column B, grab resulting values from column C" for future applications.
Thank you in advance :)
You can sort the data frame before calling groupby:
first_of_day = df.sort_values('time').groupby('date').head(1)
This should work for you:
df.sort_values('time').groupby(['date'])['value'].agg([('Min' , 'min'), ('Max', 'max')])
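Putting the two ideas together, the general pattern "group by column A, sort by column B, grab values from column C" can be sketched as follows (Open, Min and Max are hypothetical output names):

import pandas as pd

# first 'value' of each day once rows are ordered by 'time'
opens = (df.sort_values('time')
           .groupby('date')['value']
           .first()
           .rename('Open'))

# daily extremes of 'value'
extremes = df.groupby('date')['value'].agg([('Min', 'min'), ('Max', 'max')])

summary = pd.concat([opens, extremes], axis=1)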
For the small example above, the result is a dataframe with one row per date and Min and Max columns.

Assigning a value to the same dates fulfilling a condition in a more efficient way in a dataframe

I have the following dataframe called df1 that contains data for a number of regions in the column NUTS_ID:
The index, called Date, has all the days of 2010 for each code in NUTS_ID (all days of the year for AT1, AT2 and so on). I created a list containing the workday dates, and I want to add a column with 0 for non-workdays and 1 for workdays.
For this, I simply used a for loop that checks day by day if it's in the workday list I created:
for day in df1.index:
    if day not in workdays_list:
        df1.loc[day, 'Workday'] = 0  # assigning 0 to non-workdays
    else:
        df1.loc[day, 'Workday'] = 1  # assigning 1 to workdays
This works well enough if the dataset is not big. But with some of the datasets I'm processing this takes a very long time. I would like to ask for ideas in order to do the process faster and more efficient. Thank you in advance for your input.
EDIT: One of the things I have thought is that maybe a groupby could be helpful, but I don't know if that is correct.
You can use np.where with isin to check if your Date (i.e. your index) is in the list you created:
import numpy as np
df1['Workday'] = np.where(df1.index.isin(workdays_list), 1, 0)
I can't reproduce your dataset, but something along those lines should work.
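Since isin already returns a boolean array, the same result also comes from a plain cast, without numpy (a sketch):

df1['Workday'] = df1.index.isin(workdays_list).astype(int)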

groupby agg using a date offset or similar

Sample of dataset below
Trying to create a groupby that will give me the number of months that I specify, e.g. last 12 months, last 36 months, etc.
My groupby that rolls up my whole dataset for each 'client' is below. rolled_ret is just a custom function that geometrically links whatever performance array it gets; we can pretend it is sum().
df_client_perf = df_perf.groupby(df_perf.CLIENT_NAME)['GROUP_PERFORMANCE'].agg(Client_Return = rolled_ret)
If I put .rolling(12) I can take the most recent entry to get the previous 12 months but there is obviously a better way to do this.
Worth saying that the period column is a monthly period datetime type using to_period
thanks in advance
PERIOD,CLIENT_NAME,GROUP_PERFORMANCE
2020-03,client1,0.104
2020-04,client1,0.004
2020-05,client1,0.23
2020-06,client1,0.113
2020-03,client2,0.0023
2020-04,client2,0.03
2020-05,client2,0.15
2020-06,client2,0.143
Let's say, for example, that I wanted to do a groupby to SUM the latest three months of data; my expected output for the above would be:
client1,0.347
client2,0.323
Also, I would like a way to return NaN if the dataset is missing the minimum number of periods, as you can with the rolling function.
Here is my answer.
I've used a DatetimeIndex because the method last does not work with periods. First I sort values based on the PERIOD column, then I set it as the index to keep only the last 3 months (or whatever you provide), and then I do the groupby the same way as you.
df['PERIOD'] = pd.to_datetime(df['PERIOD'])
(df.sort_values(by='PERIOD')
   .set_index('PERIOD')
   .last('3M')
   .groupby('CLIENT_NAME')
   .GROUP_PERFORMANCE
   .sum())
# Result
CLIENT_NAME GROUP_PERFORMANCE
client1 0.347
client2 0.323
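The question also asks for NaN when the minimum number of periods is missing, like rolling's min_periods. One way to sketch that is a small helper passed to agg (sum_if_complete is a hypothetical name):

import numpy as np

def sum_if_complete(s, min_periods=3):
    # hypothetical helper: NaN unless at least min_periods months are present
    return s.sum() if len(s) >= min_periods else np.nan

(df.sort_values(by='PERIOD')
   .set_index('PERIOD')
   .last('3M')
   .groupby('CLIENT_NAME')
   .GROUP_PERFORMANCE
   .agg(sum_if_complete))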

Searching for values from one dataframe in another and returning information in the corresponding row/different column

I have 2 Dataframes:
df_Billed = pd.DataFrame({'Bill_Number': [220119, 220120, 220219, 220219, 220419, 220519, 220619, 221219],
                          'Date': ['1/31/2019', '2/20/2020', '2/28/2019', '6/30/2019', '6/30/2019', '6/30/2019', '6/30/2019', '12/31/2019'],
                          'Amount': [3312.5, 832.0, 10000.0, -3312.5, 8725.0, 1862.5, 3637.5, 1587.5]})

df_Received = pd.DataFrame({'Bill_Number': [220119, 220219, 220419, 220519, 220619],
                            'Date': ['4/16/2019', '5/21/2019', '8/2/2019', '8/2/2019', '8/2/2019'],
                            'Amount': [3312.5, 6687.5, 8725.0, 1862.5, 3637.5]})
I am trying to search for each Bill_Number from df_Billed to see if it is present in df_Received. Ideally, if it is present, I would like to take the difference between the df_Billed and df_Received dates for that bill number (to see how many days it took to get paid). If the bill number is not present in df_Received, I would like to simply return all rows for that bill number from df_Billed.
EX: Since df_Billed Bill_Number 220119 is in df_Received, it would return 75 (which is the number of days it took for the bill to be paid 4/16/2019 - 1/31/2019).
EX: Since df_Billed Bill_Number 221219 is not in df_Received, it would return 12/31/2019 (which is the date it was billed).
You could probably use merge on Bill_Number initially:
df_Billed = df_Billed.merge(df_Received, on='Bill_Number', how='left')
Then use apply and pandas.to_datetime to compute the difference between the dates:
df_Billed['result'] = df_Billed.apply(
    lambda x: x.Date_x if pd.isnull(x.Date_y)
    else abs(pd.to_datetime(x.Date_x) - pd.to_datetime(x.Date_y)).days,
    axis=1)
And finally, I think you want a clean set of columns for the final result, so I'm dropping the merged Date_y and Amount_y columns and renaming Date_x and Amount_x back to Date and Amount below:
df_Billed.drop(['Date_y', 'Amount_y'], axis=1, inplace=True)
df_Billed.rename(columns={'Date_x': 'Date', 'Amount_x': 'Amount'}, inplace=True)
Final dataframe: the Bill_Number, Date, and Amount columns from df_Billed plus the new result column.
