Calculate daily sums using python pandas

I'm trying to calculate daily sums of values using pandas. Here's the test file - http://pastebin.com/uSDfVkTS
This is the code I have come up with so far:
import numpy as np
import datetime as dt
import pandas as pd
f = np.genfromtxt('test', dtype=[('datetime', '|S16'), ('data', '<i4')], delimiter=',')
dates = [dt.datetime.strptime(i, '%Y-%m-%d %H:%M') for i in f['datetime']]
s = pd.Series(f['data'], index = dates)
d = s.resample('D', how='sum')
Using the given test file this produces:
2012-01-02 1128
Freq: D
The first problem is that the calculated sum corresponds to the next day. I've been able to solve that by using the parameter loffset='-1d'.
Now the actual problem is that the data may not start at 00:30 of a day but at any time of day. The data also has gaps filled with NaN values.
That said, is it possible to set a lower threshold on the number of values needed to calculate a daily sum? (e.g. if there are fewer than 40 values in a single day, put NaN instead of a sum)
I believe it is possible to define a custom function to do that and refer to it in the 'how' parameter, but I have no clue how to write the function itself.

You can do it directly in Pandas:
s = pd.read_csv('test', header=None, index_col=0, parse_dates=True)
d = s.groupby(lambda x: x.date()).aggregate(lambda x: sum(x) if len(x) >= 40 else np.nan)
X.2
2012-01-01 1128

A much easier way is to use pd.Grouper:
d = s.groupby(pd.Grouper(freq='1D')).sum()
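Note that this plain daily sum does not apply the 40-value threshold asked about. Here is a sketch of two ways to add it (assuming s is the half-hourly Series from the question):
import numpy as np
import pandas as pd

# min_count makes sum() return NaN for any day with fewer than 40 non-NaN values
d = s.resample('D').sum(min_count=40)

# the same idea spelled out as a custom aggregation
d = s.groupby(pd.Grouper(freq='1D')).agg(
    lambda x: x.sum() if x.count() >= 40 else np.nan
)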

Related

Getting mean values of dates in pandas dataframe

I can't seem to understand the difference between <M8[ns] and datetime formats, and how that difference explains why the operations below do or don't work.
import pandas as pd
import datetime as dt
import numpy as np
my_dates = ['2021-02-03','2021-02-05','2020-12-25', '2021-12-27','2021-12-12']
my_numbers = [100,200,0,400,500]
df = pd.DataFrame({'a':my_dates, 'b':my_numbers})
df['a'] = pd.to_datetime(df['a'])
# ultimate goal is to be able to go. * df.mean() * and be able to see mean DATE
# but this doesn't seem to work so...
df['a'].mean().strftime('%Y-%m-%d') ### ok this works... I can mess around and concat stuff...
# But why won't this work?
df2 = df.select_dtypes('datetime')
df2.mean() # WONT WORK
df2['a'].mean() # WILL WORK?
What I seem to be running into, unless I am missing something, is the difference between 'datetime' and '<M8[ns]' and how that affects getting the mean date.
You can try passing the numeric_only parameter to the mean() method:
out=df.select_dtypes('datetime').mean(numeric_only=False)
output of out:
a 2021-06-03 04:48:00
dtype: datetime64[ns]
Note: it will throw an error if the dtype is string.
The mean function you apply is different in each case.
import pandas as pd
import datetime as dt
import numpy as np
my_dates = ['2021-02-03','2021-02-05','2020-12-25', '2021-12-27','2021-12-12']
my_numbers = [100,200,0,400,500]
df = pd.DataFrame({'a':my_dates, 'b':my_numbers})
df['a']=pd.to_datetime(df['a'])
df.mean()
This mean is the DataFrame.mean() function, which works on numeric data. To see which columns are numeric, run:
df._get_numeric_data()
b
0 100
1 200
2 0
3 400
4 500
But df['a'] is a datetime series.
df['a'].dtype, type(df)
(dtype('<M8[ns]'), pandas.core.frame.DataFrame)
So df['a'].mean() applies the Series mean function, which works on datetime values. That's why df['a'].mean() outputs the mean of the datetime values.
df['a'].mean()
Timestamp('2021-06-03 04:48:00')
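For intuition, a small sketch (not from the original answer) showing that the datetime mean is just the mean of the underlying int64 nanosecond values:
import pandas as pd

s = pd.to_datetime(pd.Series(['2021-02-03', '2021-02-05', '2020-12-25',
                              '2021-12-27', '2021-12-12']))
# datetime64[ns] values are stored as int64 nanoseconds since the epoch,
# so the mean of the dates is the mean of those integers cast back to a Timestamp
manual_mean = pd.Timestamp(int(s.astype('int64').mean()))
print(manual_mean)  # should match s.mean()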
Read more here:
difference-between-data-type-datetime64ns-and-m8ns
DataFrame.mean() ignores datetime series
#28108

How to assign a fixed value to all hours of a day in pandas

I have a half-hourly dataframe with two columns. I would like to take all the values of a day, do some calculation which returns one number, and assign that number to all half-hours of that day. Below is some example code:
dates = pd.date_range("2003-01-01 08:30:00","2003-01-05",freq="30min")
data = np.transpose(np.array([np.random.rand(dates.shape[0]),np.random.rand(dates.shape[0])*100]))
data[0:50,0]=np.nan # my actual dataframe includes nan
df = pd.DataFrame(data = data,index =dates,columns=["DATA1","DATA2"])
print(df)
DATA1 DATA2
2003-01-01 08:30:00 NaN 79.990866
2003-01-01 09:00:00 NaN 5.461791
2003-01-01 09:30:00 NaN 68.892447
2003-01-01 10:00:00 NaN 44.823338
2003-01-01 10:30:00 NaN 57.860309
... ... ...
2003-01-04 22:00:00 0.394574 31.943657
2003-01-04 22:30:00 0.140950 78.275981
Then I would like to apply the following function, which returns one number:
def my_f(data1, data2):
    y = data1[data2 > 20]
    return np.median(y)
This function selects all data in DATA1 based on a condition (DATA2 > 20), then takes the median of the selected values.
How can I create a third column (say, result) and assign this fixed number (y) back to all half-hourly rows of that day?
My guess is I should use something like this:
daily_tmp = df.resample('D').apply(my_f)
df['results'] = daily_tmp.reindex(df.index, method='ffill')
If this approach is correct, how can I pass my_f with two arguments to resample.apply()?
Or is there any other way to do the similar task?
My solution assumes that you have a fairly small dataset. Please let me know if it is not the case.
I would decompose your goal as follows:
(1) group data by day
(2) for each day, compute some complicated function
(3) assign the resulting value to the half-hours.
# specify the day for each datapoint
df['day'] = df.index.map(lambda x: x.strftime('%Y-%m-%d'))
# compute a complicated function for each day and store the result
mapping = {}
for day, data_for_the_day in df.groupby(by='day'):
    # assign to mapping[day] the result of a complicated function
    mapping[day] = np.mean(data_for_the_day[data_for_the_day['DATA2'] > 20]['DATA1'])
# assign the values to half-hours
df['result'] = df.index.map(lambda x: mapping.get(x.strftime('%Y-%m-%d'), np.nan) if x.strftime('%M') == '30' else np.nan)
That's not the neatest solution, but it is straightforward, easy to understand, and works well on small datasets.
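For completeness, here is a more pandas-native sketch of the approach the question itself proposes (one value per day, forward-filled onto the half-hourly index); it assumes the df from the question. Note that groupby().apply() hands your function the whole sub-frame for the day, so both columns are available inside it, which answers the "two arguments" part:
import numpy as np
import pandas as pd

def my_f(day_frame):
    # median of DATA1 where DATA2 > 20, computed on one day's sub-frame
    y = day_frame['DATA1'][day_frame['DATA2'] > 20]
    return np.median(y)

# one value per calendar day
daily = df.groupby(pd.Grouper(freq='D')).apply(my_f)
# broadcast it back onto every half-hourly row of that day
df['result'] = daily.reindex(df.index, method='ffill')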
Here is a fast way to do it.
First, import libraries :
import time
import pandas as pd
import numpy as np
import datetime as dt
Second, the code to achieve it:
%%time
dates = pd.date_range("2003-01-01 08:30:00","2003-01-05",freq="30min")
data = np.transpose(np.array([np.random.rand(dates.shape[0]),np.random.rand(dates.shape[0])*100]))
data[0:50,0]=np.nan # my actual dataframe includes nan
df = pd.DataFrame(data = data,index =dates,columns=["DATA1","DATA2"])
#### Create a unique marker per hour
df['Date'] = df.index
df['Date'] = df['Date'].dt.strftime(date_format='%Y-%m-%d %H')
#### Then Stipulate some conditions
_condition_1 = df.Date == df.Date.shift(-1) # if full hour
_condition_2 = df.DATA2 > 20 # yours
_condition_3 = df.Date == df.Date.shift(1) # if half an hour
#### Now, where conditions 1 and 2 are fulfilled, report the median (the mean of the two half-hour values)
df['result'] = np.where(_condition_1 & _condition_2, (df.DATA1 + df.DATA1.shift(-1)) / 2, 0)
#### Fill the hours with median
df['result'] = np.where(_condition_3,df.result.shift(1),df.result)
#### Drop useless column
df = df.drop(['Date'],axis=1)
df[df.DATA2>20].tail(20)
Third: the output

A script to correct corrupted date values

My dataframe contains numerous incorrect datetime values that were fat-fingered in by the people who entered the data. The errors are mostly of the form 2019-11-12 entered as 0019-11-12, and 2018 entered as 0018. There are so many of them that I want to come up with a script to correct them en masse. I used the following code:
df['A'].loc[df.A.dt.year<100]=df.A.dt.year+2000
Basically, I want to tell Python to detect any year less than 100 and add 2000 to it. However, I am getting the error: "Out of bounds nanosecond timestamp: 19-11-19 00:00:00". Is there any solution to my problem? Thanks
This is because of the limitations of timestamps: see this post about out of bounds nanosecond timestamp.
Therefore, I suggest correcting the column as a string before turning it into a datetime column, as follows:
import pandas as pd
import re
df = pd.DataFrame({"A": ["2019-10-04", "0019-04-02", "0018-06-08", "2018-07-08"]})
# Look for every date starting with 0 followed by another digit, and replace those two characters with 20
r = re.compile(r"^0[0-9]{1}")
df["A"] = df["A"].apply(lambda x: r.sub('20', x))
# then I transform to datetime
df["A"] = pd.to_datetime(df["A"], format='%Y-%m-%d')
df
Here is the result
A
0 2019-10-04
1 2019-04-02
2 2018-06-08
3 2018-07-08
Before applying this, you need to make sure that all dates are meant to be in 20XX (where X is any digit) and that there are no legitimate dates in 19XX or earlier centuries.
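A sketch of the same idea using pandas' vectorized string methods instead of apply (assuming, as above, that every bad year starts with "00"):
import pandas as pd

df = pd.DataFrame({"A": ["2019-10-04", "0019-04-02", "0018-06-08", "2018-07-08"]})
# rewrite a leading "00" in the year as "20", then parse as datetime
df["A"] = df["A"].str.replace(r"^00", "20", regex=True)
df["A"] = pd.to_datetime(df["A"], format="%Y-%m-%d")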
An option would be to export to csv. Then make the changes and import again.
df.to_csv('path/csvfile.csv')
text = open("path/csvfile.csv", "r")
text = ''.join([i for i in text]).replace("0019-", "2019-")
x = open("path/newcsv.csv","w")
x.writelines(text)
x.close()
df_new = pd.read_csv("path/newcsv.csv")

Calculate day time differences in a pandas Dataframe

I have the following dataframe:
data = [
("10/10/2016","A"),
("10/10/2016","B"),
("09/12/2016","B"),
("09/12/2016","A"),
("08/11/2016","A"),
("08/11/2016","C")]
#Create DataFrame base
df = pd.DataFrame(data, columns=("Time","User"))
# Convert time column to correct format for time calculations
df["Time"] = pd.to_datetime(df["Time"], '%m/%d/%Y')
Each row represents when a user makes a specific action. I want to compute how frequently (in terms of days) each user makes that specific action.
Let's say user A transacted for the first time on 08/11/2016, and then transacted again on 09/12/2016, i.e. around 30 days later. Then he transacted again on 10/10/2016, around 29 days after his second transaction. So his average frequency in days would be (29+30)/2.
What is the most efficient way to do that?
Thanks in advance!
Update
I wrote the following function that computes my desired output.
from datetime import timedelta
def averagetime(a):
    numdeltas = len(a) - 1
    sumdeltas = 0
    i = 1
    while i < len(a):
        delta = abs((a[i] - a[i-1]).days)
        sumdeltas += delta
        i += 1
    if numdeltas > 1:
        avg = sumdeltas / numdeltas
    else:
        avg = 'NaN'
    return avg
It works correctly, for example, when I pass the whole "Time" column:
averagetime(df["Time"])
But it gives me an error when I try to apply it after group by.
df.groupby('User')['Time'].apply(averagetime)
Any suggestions on how I can fix the above?
You can use diff, convert the result to days (float) by dividing by np.timedelta64(1,'D'), and then sum the absolute values:
print (averagetime(df["Time"]))
12.0
su = ((df["Time"].diff() / np.timedelta64(1,'D')).abs().sum())
print (su / (len(df) - 1))
12.0
Then apply the same logic per group with groupby; a check for single-row groups is necessary, because otherwise you get:
ZeroDivisionError: float division by zero
print (df.groupby('User')['Time']
         .apply(lambda x: np.nan if len(x) == 1
                else (x.diff() / np.timedelta64(1,'D')).abs().sum() / (len(x) - 1)))
User
A 30.0
B 28.0
C NaN
Name: Time, dtype: float64
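A more compact equivalent (a sketch, not from the original answer): the average gap per user is just the mean of the consecutive differences, and mean() already returns NaN for single-row groups:
avg_days = (df.sort_values('Time')
              .groupby('User')['Time']
              .apply(lambda x: x.diff().dt.days.abs().mean()))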
Building on @jezrael's answer:
If by "how frequently" you mean - how much time passes between each user performing the action then here's an approach:
import pandas as pd
import numpy as np
data = [
("10/10/2016","A"),
("10/10/2016","B"),
("09/12/2016","B"),
("09/12/2016","A"),
("08/11/2016","A"),
("08/11/2016","C"),
]
# Create DataFrame base
df = pd.DataFrame(data, columns=("Time","User"))
# Convert time column to correct format for time calculations
df["Time"] = pd.to_datetime(df["Time"], dayfirst=True)
# Group the DF by min, max and count the number of instances
grouped = (df.groupby("User").agg([np.max, np.min, np.count_nonzero])
           # This step is a bit messy and could be improved,
           # but we need the count as an int
           .assign(counter=lambda x: x["Time"]["count_nonzero"].astype(int))
           # Use apply to calculate the time between first and last, then divide by frequency
           .apply(lambda x: (x["Time"]["amax"] - x["Time"]["amin"]) / x["counter"].astype(int), axis=1)
           )
# Output the DF if using an interactive prompt
grouped
Output:
User
A 20 days
B 30 days
C 0 days

Day delta for dates >292 years apart

I am trying to obtain day deltas for a wide range of pandas dates. However, for time deltas greater than 292 years I obtain negative values. For example:
import pandas as pd
dates = pd.Series(pd.date_range('1700-01-01', periods=4500, freq='m'))
days_delta = (dates-dates.min()).astype('timedelta64[D]')
However, using a DatetimeIndex I can do it and it works as I want it to:
import pandas as pd
import numpy as np
dates = pd.date_range('1700-01-01', periods=4500, freq='m')
days_fun = np.vectorize(lambda x: x.days)
days_delta = days_fun(dates.date - dates.date.min())
The question then is how to obtain the correct days_delta for Series objects?
Read here specifically about timedelta limitations:
Pandas represents Timedeltas in nanosecond resolution using 64 bit integers. As such, the 64 bit integer limits determine the Timedelta limits.
Incidentally, this is the same limitation the docs mention for Timestamps in pandas:
Since pandas represents timestamps in nanosecond resolution, the timespan that can be represented using a 64-bit integer is limited to approximately 584 years
This would suggest that the same recommendations the docs make for circumventing the timestamp limitations can be applied to timedeltas. The solution to the timestamp limitations is found in the docs (here):
If you have data that is outside of the Timestamp bounds, see Timestamp limitations, then you can use a PeriodIndex and/or Series of Periods to do computations.
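Following that suggestion, here is a sketch using Periods (assuming day-level resolution is enough for you): Period arithmetic works on plain integer ordinals rather than nanosecond timedeltas, so the ~292-year limit does not apply.
import pandas as pd

dates = pd.Series(pd.date_range('1700-01-01', periods=4500, freq='M'))
# daily Periods are stored as integer day ordinals
day_ordinals = dates.dt.to_period('D').apply(lambda p: p.ordinal)
days_delta = day_ordinals - day_ordinals.min()   # plain integers, no overflow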
Workaround
If you have continuous dates with small, representable gaps, as in your example, you can sort the series and then use cumsum to get around this problem, like this:
import pandas as pd
dates = pd.Series(pd.date_range('1700-01-01', periods=4500, freq='m'))
dates = dates.sort_values()
dateshift = dates.shift(1)
(dates - dateshift).fillna(pd.Timedelta(0)).dt.days.cumsum().describe()
count 4500.000000
mean 68466.072444
std 39543.094524
min 0.000000
25% 34233.250000
50% 68465.500000
75% 102699.500000
max 136935.000000
dtype: float64
See the min and max are both positive.
Failaround
If the gaps are too big, this workaround will not work. Like here:
dates = pd.Series(pd.to_datetime(['2016-06-06', '1700-01-01', '2200-01-01']))
dates = dates.sort_values()
dateshift = dates.shift(1)
(dates - dateshift).fillna(pd.Timedelta(0)).dt.days.cumsum()
1 0
0 -97931
2 -30883
This is because we calculate the step between each pair of consecutive dates and then add them up. When the dates are sorted, we are guaranteed the smallest possible steps; in this case, however, an individual step is still too big to represent.
Resetting the order
As you see in the Failaround example, the series is no longer ordered by the index. Fix this by calling the .reset_index(inplace=True) method on the series.
