I have data on spells (hospital stays), each with a start and end date, but I want to count the number of days spent in hospital for calendar months. Of course, this number can be zero for months not appearing in a spell. But I cannot just attribute the length of each spell to the starting month, as longer spells run over to the following month (or more).
Basically, it would suffice for me if I could cut spells at turn-of-month datetimes, getting from the data in the first example to the data in the second:
id start end
1 2011-01-01 10:00:00 2011-01-08 16:03:00
2 2011-01-28 03:45:00 2011-02-04 15:22:00
3 2011-03-02 11:04:00 2011-03-05 05:24:00
id start end month stay
1 2011-01-01 10:00:00 2011-01-08 16:03:00 2011-01 7
2 2011-01-28 03:45:00 2011-01-31 23:59:59 2011-01 4
2 2011-02-01 00:00:00 2011-02-04 15:22:00 2011-02 4
3 2011-03-02 11:04:00 2011-03-05 05:24:00 2011-03 3
I read up on the Time Series / Date functionality of pandas, but I do not see a straightforward solution to this. How can one accomplish the slicing?
It's simpler than you think: just subtract the dates. The result is a time span. See "Add column with number of days between dates in DataFrame pandas".
You even get to do this for the entire frame at once:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.subtract.html
Update, now that I understand the problem better.
Add a new column: take the spell's end date; if the start date is in a different month, then set this new date's day to 01 and the time to 00:00.
This is the cut datetime you can use to compute the portion of the stay attributable to each month: cut - start is the part that falls in the first month; end - cut is the part that falls in the second.
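A minimal sketch of that idea, assuming each spell spans at most two calendar months and taking stay as each piece's duration rounded to whole days (the frame below just reconstructs the question's first example):

import pandas as pd

# reconstruct the question's first example
df = pd.DataFrame({
    'id': [1, 2, 3],
    'start': pd.to_datetime(['2011-01-01 10:00:00',
                             '2011-01-28 03:45:00',
                             '2011-03-02 11:04:00']),
    'end': pd.to_datetime(['2011-01-08 16:03:00',
                           '2011-02-04 15:22:00',
                           '2011-03-05 05:24:00']),
})

rows = []
for _, r in df.iterrows():
    # first instant of the end month -- the cut datetime
    cut = r['end'].to_period('M').to_timestamp()
    if cut > r['start']:                      # spell crosses a month boundary
        rows.append((r['id'], r['start'], cut))
        rows.append((r['id'], cut, r['end']))
    else:
        rows.append((r['id'], r['start'], r['end']))

out = pd.DataFrame(rows, columns=['id', 'start', 'end'])
out['start'] = pd.to_datetime(out['start'])
out['end'] = pd.to_datetime(out['end'])
out['month'] = out['start'].dt.strftime('%Y-%m')
out['stay'] = (out['end'] - out['start']).dt.round('D').dt.days
print(out)

The only difference from the table above is that the January piece of spell 2 ends at 2011-02-01 00:00:00 rather than one second earlier; the per-month stays come out the same.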
At the moment I am working on a time series project.
I have daily data points over a 5-year timespan. In between there are some days with 0 values and some days are missing.
For example:
2015-01-10 343
2015-03-10 128
Day 2 of October is missing.
In order to build a good time series model I want to resample the data to monthly frequency:
df.individuals.resample("M").sum()
but I am getting the following output:
2015-01-31 343.000000
2015-02-28 NaN
2015-03-31 64.500000
Somehow the months are completely wrong.
The expected output would look like this:
2015-31-10 Sum of all days
2015-30-11 Sum of all days
2015-31-12 Sum of all days
Pandas is interpreting your date as %Y-%m-%d.
You should explicitly specify your date format before doing the resample.
Try this:
df.index = pd.to_datetime(df.index, format="%Y-%d-%m")
>>> df.resample("M").sum()
2015-10-31 471
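A small self-contained sketch of the same fix, reconstructing the two sample rows as a Series (the values and index are taken from the question):

import pandas as pd

s = pd.Series([343, 128], index=['2015-01-10', '2015-03-10'])
# parse the index as year-day-month, so both rows land in October 2015
s.index = pd.to_datetime(s.index, format="%Y-%d-%m")
print(s.resample("M").sum())
# 2015-10-31    471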
I’m trying to look at some sales data for a small store. I have a timestamp of when the settlement was made, but sometimes it’s done before midnight and sometimes it’s done after midnight.
This is giving me correct data for some days and incorrect data for others, as anything after midnight should count towards the day before. I couldn’t find the right pandas documentation for what I’m looking for.
Is there an if/else solution that creates a new column: loop through the NEW_TIMESTAMP column and set a custom timeframe (if after midnight but before 3pm, set the day before; else set the day)? Every time I write something it either runs forever or crashes Jupyter.
Data:
What I did is create another series which flags when a date should be shifted back by one day, and multiply it by a pd.Timedelta object, so that 0 (False) turns into "0 days" and 1 (True) turns into "1 day". Subtracting the two series then gives the right result.
Let me know how the following code works for you.
import pandas as pd
import numpy as np

# copied from https://stackoverflow.com/questions/50559078/generating-random-dates-within-a-given-range-in-pandas
def random_dates(start, end, n=15):
    start_u = start.value // 10**9
    end_u = end.value // 10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

dates = random_dates(start=pd.to_datetime('2020-01-01'),
                     end=pd.to_datetime('2021-01-01'))
timestamps = pd.Series(dates)

# this takes only the hour component of every datetime
hours = timestamps.dt.hour
# this takes only the date component of every datetime
dates = timestamps.dt.date

# this compares the hours with 15 and is True where the hour is smaller
flag_is_day_before = hours < 15

# shift the flagged dates back by multiplying the True/False flags with a one-day timedelta
new_dates = dates - pd.to_timedelta(1, unit='day') * flag_is_day_before

df = pd.DataFrame(data=dict(timestamps=timestamps, new_dates=new_dates))
print(df)
This outputs
timestamps new_dates
0 2020-07-10 20:11:13 2020-07-10
1 2020-05-04 01:20:07 2020-05-03
2 2020-03-30 09:17:36 2020-03-29
3 2020-06-01 16:16:58 2020-06-01
4 2020-09-22 04:53:33 2020-09-21
5 2020-08-02 20:07:26 2020-08-02
6 2020-03-22 14:06:53 2020-03-21
7 2020-03-14 14:21:12 2020-03-13
8 2020-07-16 20:50:22 2020-07-16
9 2020-09-26 13:26:55 2020-09-25
10 2020-11-08 17:27:22 2020-11-08
11 2020-11-01 13:32:46 2020-10-31
12 2020-03-12 12:26:21 2020-03-11
13 2020-12-28 08:04:29 2020-12-27
14 2020-04-06 02:46:59 2020-04-05
I’m just getting my toes wet in Pandas and have gotten pretty stuck. I want to aggregate events (get the count) in a CSV by hour; each event has a start time and an end time.
For example:
event, start, end
soccer, 2020-01-20 00:34:00, 2020-01-20 02:34:00,
football, 2020-01-20 00:34:00, 2020-01-20 01:34:00
etc
expected output:
00:00:00 - 2 (both began in 0th hour and went to 1st hour)
01:00:00 - 2 (both were live in 1st hour)
02:00:00 - 1 (only soccer occurred in 02 hour)
How would you go about this? I’ve been trying reindexing, resampling, time difference, time indexes — all with no luck.
What you want is effectively a frequency distribution of hours during which events are taking place. First, you need to generate the samples from which to take the distribution by creating a range and then exploding it:
hours = events.apply(lambda row: range(row['start'].hour, row['end'].hour + 1), axis=1).explode()
0 0
0 1
0 2
1 0
1 1
dtype: object
Don't forget to add one to the end hour so the last hour is included (the classic fencepost error).
Then just get value counts for the sample. To get the frequency in order of hours instead of by descending count, pass sort=False.
hours.value_counts(sort=False)
0 2
1 2
2 1
dtype: int64
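Putting it together, here is a hedged end-to-end sketch on a reconstruction of the sample data, assuming (like the snippets above) that each event starts and ends on the same day; the events frame and the final formatting are assumptions, not part of the original question:

import pandas as pd

# reconstruct the sample events with parsed datetimes
events = pd.DataFrame({
    'event': ['soccer', 'football'],
    'start': pd.to_datetime(['2020-01-20 00:34:00', '2020-01-20 00:34:00']),
    'end': pd.to_datetime(['2020-01-20 02:34:00', '2020-01-20 01:34:00']),
})

# one sample per hour during which each event was live, then count them
hours = events.apply(lambda row: range(row['start'].hour, row['end'].hour + 1),
                     axis=1).explode()
counts = hours.value_counts(sort=False)
print(counts.rename(lambda h: f'{h:02d}:00:00'))
# 00:00:00    2
# 01:00:00    2
# 02:00:00    1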
UsageDate CustID1 CustID2 .... CustIDn
0 2018-01-01 00:00:00 1.095
1 2018-01-01 01:00:00 1.129
2 2018-01-01 02:00:00 1.165
3 2018-01-01 04:00:00 1.697
.
.
m 2018-31-01 23:00:00 1.835 (m,n)
The dataframe (df) has m rows and n columns. The index is an hourly time series running from the first hour of the month to the last hour of the month.
The columns are the customers, of which there are almost 100,000.
The value in each cell of the dataframe is an energy consumption reading.
For every customer, I need to calculate:
1) Mean usage for every hour of the day - so basically the average of the 1st hour of every day in the month, the 2nd hour of every day in the month, etc.
2) Summation of usage of every customer
3) Top 3 usage hours - for a customer x, it can be "2018-01-01 01:00:00",
"2018-11-01 05:00:00" "2018-21-01 17:00:00"
4) Bottom 3 usage hours - Similar explanation as above
5) Mean of usage for every customer in the month
My main trouble is how to aggregate the data both per customer and per hour of the day (or per day).
For summation of usage for every customer, I tried:
df_temp = pd.DataFrame(columns=["TotalUsage"])
for col in df.columns:
    df_temp[col, "TotalUsage"] = df[col].apply.sum()
However, this and the many versions of it which I tried are not helping me solve the problem.
Please help me with an approach and how to think about such problems.
Also, since the dataframe is large, it would be helpful if we could talk about computational complexity and how we can decrease the computation time.
This looks like a job for pandas.groupby.
(I didn't test the code because I didn't have a good sample dataset from which to work. If there are errors, let me know.)
For some of your requirements, you'll need to add a column with the hour:
df['hour']=df['UsageDate'].dt.hour
1) Mean by hour.
mean_by_hour=df.groupby('hour').mean()
2) Summation by user.
sum_by_users=df.sum()
3) Top 3 usage hours per customer (bottom 3 is analogous). I don't quite understand your desired output; you might be asking too many different questions in one post. If you want the hour rather than the value, I think you have to iterate through the columns (see the sketch after item 5). Adding an example to the question would help.
4) Same comment.
5) Mean by customer.
mean_by_cust = df.mean()
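For (3) and (4), a hedged sketch of the column-iteration idea on a small synthetic wide frame (the shape, seed, and column names are assumptions): nlargest/nsmallest on each customer column give that customer's top and bottom usage hours.

import pandas as pd
import numpy as np

# hypothetical wide frame: hourly DatetimeIndex, one column per customer
idx = pd.date_range('2018-01-01', '2018-01-31 23:00:00', freq='H')
np.random.seed(0)
wide = pd.DataFrame(np.random.rand(len(idx), 3),
                    index=idx, columns=['CustID1', 'CustID2', 'CustID3'])

# top / bottom 3 usage hours per customer
top3 = {cust: wide[cust].nlargest(3).index.tolist() for cust in wide.columns}
bottom3 = {cust: wide[cust].nsmallest(3).index.tolist() for cust in wide.columns}
print(top3['CustID1'])

This is a single pass over the columns, so it scales linearly with the number of customers.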
I am not sure if this is all the information you are looking for but it will point you in the right direction:
import pandas as pd
import numpy as np
# sample data for 3 days
np.random.seed(1)
data = pd.DataFrame(pd.date_range('2018-01-01', periods=72, freq='H'), columns=['UsageDate'])
data2 = pd.DataFrame(np.random.rand(72, 5), columns=[f'ID_{i}' for i in range(5)])
df = data.join([data2])
print('Sample Data:')
print(df.head())
print()
# mean of every month and hour per year
# groupby year month hour then find the mean of every hour in a given year and month
mean_data = df.groupby([df['UsageDate'].dt.year, df['UsageDate'].dt.month, df['UsageDate'].dt.hour]).mean()
mean_data.index.names = ['UsageDate_year', 'UsageDate_month', 'UsageDate_hour']
print('Mean Data:')
print(mean_data.head())
print()
# use set_index with max and head
top_3_Usage_hours = df.set_index('UsageDate').max(1).sort_values(ascending=False).head(3)
print('Top 3:')
print(top_3_Usage_hours)
print()
# use set_index with min and tail
bottom_3_Usage_hours = df.set_index('UsageDate').min(1).sort_values(ascending=False).tail(3)
print('Bottom 3:')
print(bottom_3_Usage_hours)
out:
Sample Data:
UsageDate ID_0 ID_1 ID_2 ID_3 ID_4
0 2018-01-01 00:00:00 0.417022 0.720324 0.000114 0.302333 0.146756
1 2018-01-01 01:00:00 0.092339 0.186260 0.345561 0.396767 0.538817
2 2018-01-01 02:00:00 0.419195 0.685220 0.204452 0.878117 0.027388
3 2018-01-01 03:00:00 0.670468 0.417305 0.558690 0.140387 0.198101
4 2018-01-01 04:00:00 0.800745 0.968262 0.313424 0.692323 0.876389
Mean Data:
ID_0 ID_1 ID_2 \
UsageDate_year UsageDate_month UsageDate_hour
2018 1 0 0.250716 0.546475 0.202093
1 0.414400 0.264330 0.535928
2 0.335119 0.877191 0.380688
3 0.577429 0.599707 0.524876
4 0.702336 0.654344 0.376141
ID_3 ID_4
UsageDate_year UsageDate_month UsageDate_hour
2018 1 0 0.244185 0.598238
1 0.400003 0.578867
2 0.623516 0.477579
3 0.429835 0.510685
4 0.503908 0.595140
Top 3:
UsageDate
2018-01-01 21:00:00 0.997323
2018-01-03 23:00:00 0.990472
2018-01-01 08:00:00 0.988861
dtype: float64
Bottom 3:
UsageDate
2018-01-01 19:00:00 0.002870
2018-01-03 02:00:00 0.000402
2018-01-01 00:00:00 0.000114
dtype: float64
For the top and bottom 3, if instead you want the hours with the smallest usage summed across all customers:
df.set_index('UsageDate').sum(1).sort_values(ascending=False).tail(3)
I currently have a dataframe I created by scraping Google News headlines. One of my columns is "Time", which refers to the time of publication of an article.
Unfortunately, for recent articles, Google News uses a "relative" date, e.g., 6 hours ago or 1 day ago, instead of Nov 1, 2017.
I really want to convert these relative dates to be consistent with the other entries (so they also say Nov 12, 2017, for example), but I have no idea where to even start on this.
My thought is to maybe make a variable which represents today's date, then do some kind of search through the dataframe for entries which don't match my format, and then subtract those relative times from the current date. I would also have to make some sort of filter for entries which say "hours ago" and just have those equal the current date.
I don't really want a solution but rather a general idea of what to read to try to solve this. Am I supposed to try using NumPy?
Example of some rows:
Publication Time Headline
0 The San Diego Union-Tribune 6 hours ago I am not opposed to new therapeutic modalities...
1 Devon Live 13 hours ago If you're looking for a bargain this Christmas...
15 ABS-CBN News 1 day ago Now, Thirdy has a chance to do something that ...
26 New York Times Nov 2, 2017 Shepherds lead their sheep through the centre ...
You can use to_datetime with to_timedelta first and then use combine_first with floor:
#create dates
dates = pd.to_datetime(df['Time'], errors='coerce')
#create timedeltas from the relative "... ago" strings
times = pd.to_timedelta(df['Time'].str.extract(r'(.*)\s+ago', expand=False))
#combine final datetimes
df['Time'] = (pd.Timestamp.now() - times).combine_first(dates).dt.floor('D')
print (df)
Publication Time \
0 The San Diego Union-Tribune 2017-11-12
1 Devon Live 2017-11-11
2 ABS-CBN News 2017-11-11
3 New York Times 2017-11-02
Headline
0 I am not opposed to new therapeutic modalities
1 If you're looking for a bargain this Christmas
2 Now, Thirdy has a chance to do something that
3 Shepherds lead their sheep through the centre
print (df['Time'])
0 2017-11-12
1 2017-11-11
2 2017-11-11
3 2017-11-02
Name: Time, dtype: datetime64[ns]
Your approach should work. Use Pandas Timedelta to subtract relative dates from the current date.
For example, given your sample data as:
Publication;Time;Headline
The San Diego Union-Tribune;6 hours ago;I am not opposed to new therapeutic modalities
Devon Live;13 hours ago;If you're looking for a bargain this Christmas
ABS-CBN News;1 day ago;Now, Thirdy has a chance to do something that
New York Times;Nov 2, 2017;Shepherds lead their sheep through the centre
Read in the data from the clipboard (although you could just as easily substitute with read_csv() or some other file format):
import pandas as pd
from datetime import datetime
df = pd.read_clipboard(sep=";")
For the dates that are already in date format, Pandas is smart enough to convert them with to_datetime():
absolute_date = pd.to_datetime(df.Time, errors="coerce")
absolute_date
0 NaT
1 NaT
2 NaT
3 2017-11-02
Name: Time, dtype: datetime64[ns]
For the relative dates, once we drop the "ago" part, they're basically in the right format to convert with pd.Timedelta:
relative_date = (datetime.today() -
                 df.Time.str.extract("(.*) ago", expand=False).apply(pd.Timedelta))
relative_date
0 2017-11-11 17:05:54.143548
1 2017-11-11 10:05:54.143548
2 2017-11-10 23:05:54.143548
3 NaT
Name: Time, dtype: datetime64[ns]
Now fill in the respective NaT values from each set, absolute and relative (updated to use combine_first(), via Jezrael's answer):
date = relative_date.combine_first(absolute_date)
date
0 2017-11-11 17:06:29.658925
1 2017-11-11 10:06:29.658925
2 2017-11-10 23:06:29.658925
3 2017-11-02 00:00:00.000000
Name: Time, dtype: datetime64[ns]
Finally, pull out just the date from the datetime:
date.dt.date
0 2017-11-11
1 2017-11-11
2 2017-11-10
3 2017-11-02
Name: Time, dtype: object