I have written the following code to preprocess a dataset like this:
StartLocation StartTime EndTime
school Mon Jul 25 19:04:30 GMT+01:00 2016 Mon Jul 25 19:04:33 GMT+01:00 2016
... ... ...
It contains a list of locations attended by a user with the start and end time. Each location may occur several times and there is no comprehensive list of locations. From this, I want to aggregate data for each location (frequency, total time, mean time). To do this I have written the following code:
def toEpoch(x):
    try:
        x = datetime.strptime(re.sub(r":(?=[^:]+$)", "", x), '%a %b %d %H:%M:%S %Z%z %Y').strftime('%s')
    except:
        x = datetime.strptime(x, '%a %b %d %H:%M:%S %Z %Y').strftime('%s')
    x = (int(x)/60)
    return x
#Preprocess data
df = pd.read_csv('...')
for index, row in df.iterrows():
    df['StartTime'][index] = toEpoch(df['StartTime'][index])
    df['EndTime'][index] = toEpoch(df['EndTime'][index])
    df['TimeTaken'][index] = int(df['EndTime'][index]) - int(df['StartTime'][index])
total = df.groupby(df['StartLocation'].str.lower()).sum()
av = df.groupby(df['StartLocation'].str.lower()).mean()
count = df.groupby(df['StartLocation'].str.lower()).count()
output = pd.DataFrame({"location": total.index, 'total': total['TimeTaken'], 'mean': av['TimeTaken'], 'count': count['TimeTaken']})
print(output)
This code works correctly but is quite inefficient. How can I optimise it?
EDIT: Based on @Batman's helpful comments, I no longer iterate. However, I still hope to optimise this further if possible. The updated code is:
df = pd.read_csv('...')
df['StartTime'] = df['StartTime'].apply(toEpoch)
df['EndTime'] = df['EndTime'].apply(toEpoch)
df['TimeTaken'] = df['EndTime'] - df['StartTime']
total = df.groupby(df['StartLocation'].str.lower()).sum()
av = df.groupby(df['StartLocation'].str.lower()).mean()
count = df.groupby(df['StartLocation'].str.lower()).count()
output = pd.DataFrame({"location": total.index, 'total': total['TimeTaken'], 'mean': av['TimeTaken'], 'count': count['TimeTaken']})
print(output)
First thing I'd do is stop iterating over the rows.
df['StartTime'] = df['StartTime'].apply(toEpoch)
df['EndTime'] = df['EndTime'].apply(toEpoch)
df['TimeTaken'] = df['EndTime'] - df['StartTime']
Then, do a single groupby operation.
gb = df.groupby('StartLocation')
total = gb.sum()
av = gb.mean()
count = gb.count()
Vectorize the date conversion.
Taking the difference of two series of timestamps gives a series of timedeltas.
Use total_seconds to get the seconds from the timedeltas.
Use groupby with agg.
# convert dates
cols = ['StartTime', 'EndTime']
df[cols] = pd.to_datetime(df[cols].stack()).unstack()
# generate timedelta then total_seconds via the `dt` accessor
df['TimeTaken'] = (df.EndTime - df.StartTime).dt.total_seconds()
# define the lower case version for cleanliness
loc_lower = df.StartLocation.str.lower()
# define `agg` functions for cleanliness
# this tells `groupby` to use 3 functions: sum, mean, and count
# it also tells what column names to use
# (passed as keyword arguments, since a renaming dict on a single column is rejected by pandas 1.0 and later)
funcs = dict(Total='sum', Mean='mean', Count='count')
df.groupby(loc_lower).TimeTaken.agg(**funcs).reset_index()
Explanation of the date conversion:
I define cols for convenience.
df[cols] = is an assignment to those two columns.
pd.to_datetime() is a vectorized date converter, but it converts a pd.Series of date strings, not an arbitrary pd.DataFrame (a DataFrame argument is only accepted when its columns are date components such as year, month, and day).
df[cols].stack() turns the 2-column dataframe into a single series, ready for pd.to_datetime().
pd.to_datetime(df[cols].stack()) converts that series, and unstack() restores the 2 columns, ready to be assigned.
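For reference, here is a minimal end-to-end sketch of the steps above on made-up rows. It assumes every timestamp has the "GMT+HH:MM" form shown in the question, so an explicit format string can be passed (the fallback for strings without an offset in toEpoch is not covered), and it converts the two columns one at a time rather than via stack/unstack. The sample rows and the names fmt and summary are illustrative only.

```python
import pandas as pd

# Made-up rows in the same layout as the question's data.
df = pd.DataFrame({
    "StartLocation": ["school", "School", "home"],
    "StartTime": [
        "Mon Jul 25 19:04:30 GMT+01:00 2016",
        "Mon Jul 25 20:00:00 GMT+01:00 2016",
        "Mon Jul 25 21:00:00 GMT+01:00 2016",
    ],
    "EndTime": [
        "Mon Jul 25 19:04:33 GMT+01:00 2016",
        "Mon Jul 25 20:00:07 GMT+01:00 2016",
        "Mon Jul 25 21:01:00 GMT+01:00 2016",
    ],
})

# Assumption: "GMT" is a literal prefix and the offset always looks like +01:00.
fmt = "%a %b %d %H:%M:%S GMT%z %Y"
start = pd.to_datetime(df["StartTime"], format=fmt)
end = pd.to_datetime(df["EndTime"], format=fmt)

# Timestamp differences are timedeltas; convert them to seconds.
df["TimeTaken"] = (end - start).dt.total_seconds()

# One groupby on the lower-cased location, three named aggregations.
summary = (
    df.groupby(df["StartLocation"].str.lower())["TimeTaken"]
      .agg(total="sum", mean="mean", count="count")
      .reset_index()
)
print(summary)
```

Grouping on the lower-cased series merges "school" and "School" into one row, as in the original code.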
I've been working on a scraping and EDA project on Python3 using Pandas, BeautifulSoup, and a few other libraries and wanted to do some analysis using the time differences between two dates. I want to determine the number of days (or months or even years if that'll make it easier) between the start dates and end dates, and am stuck. I have two columns (air start date, air end date), with dates in the following format: MM-YYYY (so like 01-2021). I basically wanted to make a third column with the time difference between the end and start dates (so I could use it in later analysis).
# split air_dates column into start and end date
dateList = df["air_dates"].str.split("-", n = 1, expand = True)
df['air_start_date'] = dateList[0]
df['air_end_date'] = dateList[1]
df.drop(columns = ['air_dates'], inplace = True)
df.drop(columns = ['rank'], inplace = True)
# changing dates to numerical notation
df['air_start_date'] = pds.to_datetime(df['air_start_date'])
df['air_start_date'] = df['air_start_date'].dt.date.apply(lambda x: x.strftime('%m-%Y') if pds.notnull(x) else npy.NaN)
df['air_end_date'] = pds.Series(df['air_end_date'])
df['air_end_date'] = pds.to_datetime(df['air_end_date'], errors = 'coerce')
df['air_end_date'] = df['air_end_date'].dt.date.apply(lambda x: x.strftime('%m-%Y') if pds.notnull(x) else npy.NaN)
df.isnull().sum()
df.dropna(subset = ['air_end_date'], inplace = True)
def time_diff(time_series):
    return datetime.datetime.strptime(time_series, '%d')
df['time difference'] = df['air_end_date'].apply(time_diff) - df['air_start_date'].apply(time_diff)
The last four lines are my attempt at getting a time difference, but I got an error saying 'ValueError: unconverted data remains: -2021'. Any help would be greatly appreciated, as this has had me stuck for a good while now. Thank you!
As far as I understand, if you have a start date and time and an end date and time, you can use Python's datetime module.
For example:
from datetime import datetime
# variable = datetime(year, month, day, hour, minute, second)
start = datetime(2017,5,8,18,56,40)
end = datetime(2019,6,27,12,30,58)
print( end - start ) # prints the difference between these two datetimes as a timedelta
Hope this answer helps you.
Ok so I figured it out. In my second to last line, I replaced the %d with %m-%Y and now it populates the new column with the number of days between the two dates. I think the format needed to be consistent when running strptime so that's what was causing that error.
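As a rough sketch of the same fix without a per-row strptime call, the MM-YYYY strings can be parsed for the whole column at once by passing the format to pd.to_datetime. The two sample rows below are made up, and each date is taken to be the first day of its month.

```python
import pandas as pd

# Made-up sample in the MM-YYYY layout described in the question.
df = pd.DataFrame({
    "air_start_date": ["01-2021", "04-2009"],
    "air_end_date": ["03-2021", "07-2010"],
})

# Parse whole columns with an explicit format; the day defaults to 1.
start = pd.to_datetime(df["air_start_date"], format="%m-%Y")
end = pd.to_datetime(df["air_end_date"], format="%m-%Y")

# End minus start is a timedelta column; .dt.days extracts whole days.
df["time difference"] = (end - start).dt.days
print(df)
```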
Here's a slightly cleaned-up version: subtract the start date from the end date to get a timedelta, then take the days attribute from that.
Example:
import pandas as pd
df = pd.DataFrame({'air_dates': ["Apr 2009 - Jul 2010", "not a date - also not a date"]})
df['air_start_date'] = df['air_dates'].str.split(" - ", expand=True)[0]
df['air_end_date'] = df['air_dates'].str.split(" - ", expand=True)[1]
df['air_start_date'] = pd.to_datetime(df['air_start_date'], errors="coerce")
df['air_end_date'] = pd.to_datetime(df['air_end_date'], errors="coerce")
df['timediff_days'] = (df['air_end_date']-df['air_start_date']).dt.days
For the dummy example, that gives:
df['timediff_days']
0 456.0
1 NaN
Name: timediff_days, dtype: float64
Regarding the difference in months, you can find some suggestions on how to calculate it here. I'd go with @piRSquared's approach:
df['timediff_months'] = ((df['air_end_date'].dt.year - df['air_start_date'].dt.year) * 12 +
(df['air_end_date'].dt.month - df['air_start_date'].dt.month))
df['timediff_months']
0 15.0
1 NaN
Name: timediff_months, dtype: float64
I have a file which contains a date column. I want to check whether that datetime column falls within a specific range. (E.g., I get 5 files per day, over which I have no control, and I need to pick the file whose readings are close to midnight.)
All rows in a given file differ by a minute (they are all readings, so there is never more than a minute's gap).
Using pandas, I load the date column as follows:
def read_dipsfile(writer):
    atg_path = '/Users/ratha/PycharmProjects/DataLoader/data/dips'
    files = os.listdir(atg_path)
    df = pd.DataFrame()
    dateCol = ['Dip Time']
    for f in files:
        if(f.endswith('.CSV')):
            data = pd.read_csv(os.path.join(atg_path, f), delimiter=',', skiprows=[1], skipinitialspace=True,
                               parse_dates=dateCol)
            if mid_day_check(data['Dip Time']):  # <-- gives error
                df = df.append(data)
def mid_day_check(startTime):
    midnightTime = datetime.datetime.strptime(startTime, '%Y%m%d')
    hourbefore = datetime.datetime.strptime(startTime, '%Y%m%d') + datetime.timedelta(hours=-1)
    if startTime <= midnightTime and startTime>=hourbefore:
        return True
    else:
        return False
In the above code, how can I pass the column to my function?
Currently I get the following error:
midnightTime = datetime.datetime.strptime(startTime, '%Y%m%d')
TypeError: strptime() argument 1 must be str, not Series
How can I check a time range using a pandas date column?
I think you need:
def mid_day_check(startTime):
    #remove times
    midnightTime = startTime.dt.normalize()
    #add timedelta
    hourbefore = midnightTime + pd.Timedelta(hours=-1)
    #test with between and return at least one True by any
    return startTime.between(hourbefore, midnightTime).any()
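One thing to watch: normalize() gives midnight at the start of the same day, so a reading at 23:58 is not inside the hour before that midnight. If "nearly midnight" means the hour leading up to the following midnight, a variant along these lines may be closer to the intent. This is only a sketch under that assumption; the name near_midnight, the window parameter, and the sample series are made up for illustration.

```python
import pandas as pd

def near_midnight(times: pd.Series, window: pd.Timedelta = pd.Timedelta(hours=1)) -> bool:
    """Return True if any timestamp lies within `window` before the following midnight."""
    # Midnight at the start of each reading's day, shifted to the end of that day.
    next_midnight = times.dt.normalize() + pd.Timedelta(days=1)
    # Element-wise range check, then collapse to a single boolean for the file.
    return bool(times.between(next_midnight - window, next_midnight).any())

# Hypothetical readings standing in for data['Dip Time'].
late = pd.Series(pd.to_datetime(["2018-03-01 23:58:00", "2018-03-01 23:59:00"]))
noon = pd.Series(pd.to_datetime(["2018-03-01 12:00:00", "2018-03-01 12:01:00"]))

print(near_midnight(late))
print(near_midnight(noon))
```

With this definition a reading at exactly 00:00:00 counts as the start of a new day and is not matched; widen the check if that reading should count too.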
It seems you are passing a pandas Series to strptime(), which only accepts a single string.
You can use pd.to_datetime() to convert the whole column instead:
pd.to_datetime(data['Dip Time'], format='%b %d, %Y')
Check these links for an explanation.
strptime
conversion from series
I am attempting to find records in my dataframe that are 30 days old or older. I pretty much have everything working, but I need to correct the format of the Age column. Almost everything in the program is stuff I found on Stack Overflow, but I can't figure out how to change the format of the delta that is returned.
import pandas as pd
import datetime as dt
file_name = '/Aging_SRs.xls'
sheet = 'All'
df = pd.read_excel(io=file_name, sheet_name=sheet)
df.rename(columns={'SR Create Date': 'Create_Date', 'SR Number': 'SR'}, inplace=True)
tday = dt.date.today()
tdelta = dt.timedelta(days=30)
aged = tday - tdelta
df = df.loc[df.Create_Date <= aged, :]
# Sets the SR as the index.
df = df.set_index('SR', drop = True)
# Created the Age column.
df.insert(2, 'Age', 0)
# Calculates the days between the Create Date and Today.
df['Age'] = df['Create_Date'].subtract(tday)
The calculation in the last line above gives me the result, but it looks like -197 days +09:39:12 and I need it to just be a positive number 197. I have also tried to search using the python, pandas, and datetime keywords.
df.rename(columns={'Create_Date': 'SR Create Date'}, inplace=True)
writer = pd.ExcelWriter('output_test.xlsx')
df.to_excel(writer)
writer.save()
I can't see your example data, but IIUC and you're just trying to get the absolute value of the number of days of a timedelta, this should work:
df['Age'] = abs(df['Create_Date'].subtract(tday).dt.days)
Explanation:
Given a dataframe with a timedelta column:
>>> df
delta
0 26523 days 01:57:59
1 -1601 days +01:57:59
You can extract just the number of days as an int using dt.days:
>>> df['delta'].dt.days
0 26523
1 -1601
Name: delta, dtype: int64
Then, all you need to do is wrap that in a call to abs to get the absolute value of that int:
>>> abs(df.delta.dt.days)
0 26523
1 1601
Name: delta, dtype: int64
Here is what I worked out for basically the same issue.
# create timestamp for today, normalize to 00:00:00
today = pd.to_datetime('today').normalize()
# match timezone with datetimes in df so subtraction works
today = today.tz_localize(df['posted'].dt.tz)
# create 'age' column for days old
df['age'] = (today - df['posted']).dt.days
This is pretty much the same as the answer above, but without the call to abs().
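A small sketch of how the pieces fit together. Because dt.days rounds toward negative infinity, a delta displayed as "-197 days +09:39:12" has 197 as the absolute value of its days component, while taking abs() of the timedelta first would give 196. Subtracting in the order today minus create date, with both sides normalized to midnight, sidesteps that and yields whole calendar days. The frame and the fixed reference date below are made up so the result is reproducible; in real use the reference would be pd.Timestamp.today().normalize().

```python
import pandas as pd

# Made-up records; the column name follows the question's renamed column.
df = pd.DataFrame({
    "Create_Date": pd.to_datetime(["2019-01-01 14:20:48", "2019-06-20 08:00:00"]),
})

# Fixed stand-in for pd.Timestamp.today().normalize(), so the output is stable.
today = pd.Timestamp("2019-07-17")

# Drop the time of day before subtracting, then take whole days.
df["Age"] = (today - df["Create_Date"].dt.normalize()).dt.days

# Keep only records that are 30 days old or older.
aged = df[df["Age"] >= 30]
print(aged)
```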
I have a list of lists composed of dates in Excel float format (every minute since July 5, 1996) and an integer value associated with each date, like this: [[datetime,integer]...]. I need to create a new list composed of all of the dates (no hours or minutes) and the sum of the values for all of the datetimes within that date. In other words, for each whole-number date d, what is the sum of listolists[x][1] over all x with listolists[x][0] >= d and listolists[x][0] < d + 1 (that is, math.floor(listolists[x][0]) == d)? Thanks
Since you didn't provide any actual data (just the data structure you used, nested lists), I created some dummy data below to demonstrate how you might do a SUMIFS-type of problem in Python.
from datetime import datetime
import numpy as np
import pandas as pd
dates_list = []
# just take one month as an example of how to group by day
year = 2015
month = 12
# generate similar data to what you might have
for day in range(1, 32):
    for hour in range(1, 24):
        for minute in range(1, 60):
            dates_list.append([datetime(year, month, day, hour, minute), np.random.randint(20)])
# unpack these nested list pairs so we have all of the dates in
# one list, and all of the values in the other
# this makes it easier for pandas later
dates, values = zip(*dates_list)
# to eventually group by day, we need to forget about all intra-day data, e.g.
# different hours and minutes. we only care about the data for a given day,
# not the by-minute observations. So, let's set all of the intra-day values to
# some constant for easier rolling-up of these dates.
new_dates = []
for d in dates:
    new_d = d.replace(hour = 0, minute = 0)
    new_dates.append(new_d)
# throw the new dates and values into a pandas.DataFrame object
df = pd.DataFrame({'new_dates': new_dates, 'values': values})
# here's the SUMIFS function you're looking for
grouped = df.groupby('new_dates')['values'].sum()
Let's see the results:
>>> print(grouped.head())
new_dates
2015-12-01 12762
2015-12-02 13292
2015-12-03 12857
2015-12-04 12762
2015-12-05 12561
Name: values, dtype: int64
Edit: If you want these new grouped data back in the nested list format, just do this:
new_list = [[date, value] for date, value in zip(grouped.index, grouped)]
Thanks everyone. This is the simplest code I could come up with that doesn't require pandas:
import math
# truncate every serial date (and value) to a whole number
for row in listolist:
    for k in (0, 1):
        row[k] = math.floor(float(row[k]))
# collect the values seen for each whole-number date
date = {}
for d,v in listolist:
    if d in date:
        date[math.floor(d)].append(v)
    else:
        date[math.floor(d)] = [v]
result = [(d,sum(v)) for d,v in date.items()]
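A slightly more compact pure-Python sketch of the same idea, accumulating the sums directly with a defaultdict. The three sample rows are made up, and the conversion to calendar dates assumes the Windows Excel 1900 date system (serial numbers counted from 1899-12-30).

```python
import math
from collections import defaultdict
from datetime import date, timedelta

# Made-up sample: [excel_serial_datetime, value]
listolists = [[35251.25, 3], [35251.75, 4], [35252.5, 10]]

# Sum the values per whole-number serial date.
totals = defaultdict(int)
for serial, value in listolists:
    totals[math.floor(serial)] += value

result = sorted(totals.items())

# Optional: turn serial days into calendar dates (assumed 1900 date system).
excel_epoch = date(1899, 12, 30)
by_date = [(excel_epoch + timedelta(days=day), total) for day, total in result]
print(by_date)
```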
I have the following dataframe:
data = [
("10/10/2016","A"),
("10/10/2016","B"),
("09/12/2016","B"),
("09/12/2016","A"),
("08/11/2016","A"),
("08/11/2016","C")]
#Create DataFrame base
df = pd.DataFrame(data, columns=("Time","User"))
# Convert time column to correct format for time calculations
df["Time"] = pd.to_datetime(df["Time"], format='%m/%d/%Y')
Each row represents when a user makes a specific action. I want to compute how frequently (in terms of days) each user makes that specific action.
Let's say user A transacted first time on 08/11/2016, and then he transacted again on 09/12/2016, i.e. around 30 days after. Then, he transacted again on 10/10/2016, around 29 days after his second transaction. So, his average frequency in days would be (29+30)/2.
What is the most efficient way to do that?
Thanks in advance!
Update
I wrote the following function that computes my desired output.
from datetime import timedelta
def averagetime(a):
    numdeltas = len(a) - 1
    sumdeltas = 0
    i = 1
    while i < len(a):
        delta = abs((a[i] - a[i-1]).days)
        sumdeltas += delta
        i += 1
    if numdeltas > 1:
        avg = sumdeltas / numdeltas
    else:
        avg = 'NaN'
    return avg
It works correctly, for example, when I pass the whole "Time" column:
averagetime(df["Time"])
But it gives me an error when I try to apply it after group by.
df.groupby('User')['Time'].apply(averagetime)
Any suggestions how I can fix the above?
You can use diff, convert the resulting timedeltas to float days by dividing by np.timedelta64(1,'D'), then take abs and sum:
print (averagetime(df["Time"]))
12.0
su = ((df["Time"].diff() / np.timedelta64(1,'D')).abs().sum())
print (su / (len(df) - 1))
12.0
Then I apply it to groupby, but a condition is necessary for groups with a single row, because otherwise:
ZeroDivisionError: float division by zero
print (df.groupby('User')['Time']
         .apply(lambda x: np.nan if len(x) == 1
                          else (x.diff()/np.timedelta64(1,'D')).abs().sum()/(len(x)-1)))
User
A 30.0
B 28.0
C NaN
Name: Time, dtype: float64
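A possible shorter variant, offered as a sketch rather than a drop-in replacement: sorting by time first makes abs() unnecessary, and taking the mean of the per-group diffs skips the leading NaT, so a single-row group comes out as NaN without a special case. It also avoids the a[i] lookups in the question's function, which go by index label and so stop lining up once groupby hands over a subset of rows. The name avg_gap is made up.

```python
import pandas as pd

data = [
    ("10/10/2016", "A"),
    ("10/10/2016", "B"),
    ("09/12/2016", "B"),
    ("09/12/2016", "A"),
    ("08/11/2016", "A"),
    ("08/11/2016", "C"),
]
df = pd.DataFrame(data, columns=["Time", "User"])
df["Time"] = pd.to_datetime(df["Time"], format="%m/%d/%Y")

# Sort chronologically so consecutive differences within a user are positive,
# then average the gaps in days per user.
avg_gap = (
    df.sort_values("Time")
      .groupby("User")["Time"]
      .apply(lambda s: s.diff().dt.days.mean())
)
print(avg_gap)
```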
Building on @Jezrael's answer:
If by "how frequently" you mean how much time passes between each user performing the action, then here's an approach:
import pandas as pd
import numpy as np
data = [
("10/10/2016","A"),
("10/10/2016","B"),
("09/12/2016","B"),
("09/12/2016","A"),
("08/11/2016","A"),
("08/11/2016","C"),
]
# Create DataFrame base
df = pd.DataFrame(data, columns=("Time","User"))
# Convert time column to correct format for time calculations
df["Time"] = pd.to_datetime(df["Time"], dayfirst=True)
# Group the DF by min, max and count the number of instances
grouped = (df.groupby("User").agg([np.max, np.min, np.count_nonzero])
# This step is a bit messy and could be improved,
# but we need the count as an int
.assign(counter=lambda x: x["Time"]["count_nonzero"].astype(int))
# Use apply to calculate the time between first and last, then divide by frequency
.apply(lambda x: (x["Time"]["amax"] - x["Time"]["amin"]) / x["counter"].astype(int), axis=1)
)
# Output the DF if using an interactive prompt
grouped
Output:
User
A 20 days
B 30 days
C 0 days
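The column labels produced by passing np.max and np.min to agg ("amax", "amin") can differ between pandas versions, so here is a hedged sketch of the same calculation using named aggregations, which fix the labels explicitly. It keeps this answer's dayfirst=True parsing and its definition (span between first and last action divided by the number of actions); the names summary and per_action are made up.

```python
import pandas as pd

data = [
    ("10/10/2016", "A"),
    ("10/10/2016", "B"),
    ("09/12/2016", "B"),
    ("09/12/2016", "A"),
    ("08/11/2016", "A"),
    ("08/11/2016", "C"),
]
df = pd.DataFrame(data, columns=["Time", "User"])
# Same day-first interpretation as in the answer above.
df["Time"] = pd.to_datetime(df["Time"], dayfirst=True)

# First action, last action, and number of actions per user.
summary = df.groupby("User")["Time"].agg(first="min", last="max", n="count")

# Span between first and last action, divided by the number of actions.
per_action = (summary["last"] - summary["first"]) / summary["n"]
print(per_action)
```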