I currently have a dataframe that I created by scraping Google News headlines. One of my columns is "Time", which refers to the publication time of an article.
Unfortunately, for recent articles, Google News uses a "relative" date, e.g., 6 hours ago or 1 day ago, instead of Nov 1, 2017.
I really want to convert these relative dates to be consistent with the other entries (so they also say Nov 12, 2017, for example), but I have no idea where to even start on this.
My idea is to make a variable that holds today's date, search the dataframe for entries that don't match my format, and subtract those relative times from the current date. I would also need some sort of filter for entries containing "hours ago" and set those equal to the current date.
I don't really want a solution but rather a general idea of what to read to try to solve this. Am I supposed to try using numpy?
Example of some rows:
Publication Time Headline
0 The San Diego Union-Tribune 6 hours ago I am not opposed to new therapeutic modalities...
1 Devon Live 13 hours ago If you're looking for a bargain this Christmas...
15 ABS-CBN News 1 day ago Now, Thirdy has a chance to do something that ...
26 New York Times Nov 2, 2017 Shepherds lead their sheep through the centre ...
You can use to_datetime with to_timedelta first and then use combine_first with floor:
#create dates
dates = pd.to_datetime(df['Time'], errors='coerce')
#create times
times = pd.to_timedelta(df['Time'].str.extract(r'(.*)\s+ago', expand=False))
#combine final datetimes
df['Time'] = (pd.Timestamp.now() - times).combine_first(dates).dt.floor('D')
print (df)
Publication Time \
0 The San Diego Union-Tribune 2017-11-12
1 Devon Live 2017-11-11
2 ABS-CBN News 2017-11-11
3 New York Times 2017-11-02
Headline
0 I am not opposed to new therapeutic modalities
1 If you're looking for a bargain this Christmas
2 Now, Thirdy has a chance to do something that
3 Shepherds lead their sheep through the centre
print (df['Time'])
0 2017-11-12
1 2017-11-11
2 2017-11-11
3 2017-11-02
Name: Time, dtype: datetime64[ns]
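For anyone who wants to try this without scraping anything, here is a minimal, self-contained sketch of the same idea. The helper name resolve_times and the fixed "now" value are my own assumptions, chosen only so the result is reproducible; in real use you would pass pd.Timestamp.now().

```python
import pandas as pd

def resolve_times(time_col, now):
    """Turn a mix of 'N hours ago' / 'Nov 2, 2017' strings into dates (hypothetical helper)."""
    # Absolute dates parse directly; relative strings become NaT.
    absolute = pd.to_datetime(time_col, format="%b %d, %Y", errors="coerce")
    # Strip the trailing "ago"; rows that do not match become NaN, then NaT.
    offsets = pd.to_timedelta(
        time_col.str.extract(r"(.*)\s+ago", expand=False), errors="coerce"
    )
    # Prefer the relative result, fall back to the absolute one, keep only the day.
    return (now - offsets).combine_first(absolute).dt.floor("D")

times = pd.Series(["6 hours ago", "13 hours ago", "1 day ago", "Nov 2, 2017"])
# Assumed reference time, fixed so the example is deterministic.
resolved = resolve_times(times, now=pd.Timestamp("2017-11-12 10:00"))
print(resolved)
```

Note that "6 hours ago" can fall on the previous calendar day if the scrape runs shortly after midnight, so the result depends on when "now" is taken.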
Your approach should work. Use Pandas Timedelta to subtract relative dates from the current date.
For example, given your sample data as:
Publication;Time;Headline
The San Diego Union-Tribune;6 hours ago;I am not opposed to new therapeutic modalities
Devon Live;13 hours ago;If you're looking for a bargain this Christmas
ABS-CBN News;1 day ago;Now, Thirdy has a chance to do something that
New York Times;Nov 2, 2017;Shepherds lead their sheep through the centre
Read in the data from the clipboard (although you could just as easily substitute with read_csv() or some other file format):
import pandas as pd
from datetime import datetime
df = pd.read_clipboard(sep=";")
For the dates that are already in date format, Pandas is smart enough to convert them with to_datetime():
absolute_date = pd.to_datetime(df.Time, errors="coerce")
absolute_date
0 NaT
1 NaT
2 NaT
3 2017-11-02
Name: Time, dtype: datetime64[ns]
For the relative dates, once we drop the "ago" part, they're basically in the right format to convert with pd.Timedelta:
relative_date = (datetime.today() -
df.Time.str.extract("(.*) ago", expand=False).apply(pd.Timedelta))
relative_date
0 2017-11-11 17:05:54.143548
1 2017-11-11 10:05:54.143548
2 2017-11-10 23:05:54.143548
3 NaT
Name: Time, dtype: datetime64[ns]
Now fill in the respective NaT values from each set, absolute and relative (updated to use combine_first(), via Jezrael's answer):
date = relative_date.combine_first(absolute_date)
date
0 2017-11-11 17:06:29.658925
1 2017-11-11 10:06:29.658925
2 2017-11-10 23:06:29.658925
3 2017-11-02 00:00:00.000000
Name: Time, dtype: datetime64[ns]
Finally, pull out just the date from the datetime:
date.dt.date
0 2017-11-11
1 2017-11-11
2 2017-11-10
3 2017-11-02
Name: Time, dtype: object
I have a data frame with a date column. There are almost 7k rows and 10 of them are NaN. So I wanted to interpolate the date. I checked out the documentation and they used .interpolate(). However, when I tried that, I was not getting the desired result.
Sample rows:
0 November 1, 2019
1 May 1, 2017
2 NaN
3 December 15, 2017
4 March 9, 2018
My approach:
main_df['date_added'].interpolate(method='linear', inplace=True)
When I viewed the rows, they remain NaN.
Is there a way to fill that date? I have 10 such cases in the data frame.
Thank you in advance.
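Datetime values do not support the usual arithmetic operators such as multiplication and division, so they cannot be interpolated linearly as they are. One option is to convert the dates to floats by subtracting a reference timestamp and dividing by a fixed period, interpolate the floats, and convert back. The column has to be a real datetime column first (use pd.to_datetime if it still holds strings):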
first_date=pd.to_datetime('1900-01-01')
periods = pd.Timedelta('1s')
(df['date'].sub(first_date).div(periods)
.interpolate(method='linear')
.mul(periods).add(first_date)
)
Output:
0 2019-11-01
1 2017-05-01
2 2017-08-23
3 2017-12-15
4 2018-03-09
Name: date, dtype: datetime64[ns]
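As a self-contained sketch of the same round trip, starting from the string values in the question: the names epoch, one_second and filled are mine, and the choice of 1970-01-01 as the reference point is arbitrary (any fixed timestamp works).

```python
import pandas as pd

raw = pd.Series(
    ["November 1, 2019", "May 1, 2017", None, "December 15, 2017", "March 9, 2018"]
)
# Parse the strings first; the missing entry becomes NaT.
dates = pd.to_datetime(raw, format="%B %d, %Y")

epoch = pd.Timestamp("1970-01-01")   # arbitrary reference point (assumption)
one_second = pd.Timedelta("1s")

# datetime -> float seconds since the reference point (NaN where the date is missing)
as_seconds = (dates - epoch) / one_second
# interpolate the floats, then convert back to datetimes
filled = epoch + pd.to_timedelta(as_seconds.interpolate(method="linear"), unit="s")
print(filled)
```

Linear interpolation here means "halfway between the neighbouring rows", which only makes sense if row order carries meaning in your data.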
I’m trying to look at some sales data for a small store. I have a time stamp of when the settlement was made, but sometimes it’s done before midnight and sometimes it’s done after midnight.
This is giving me data correct for some days and incorrect for others, as anything after midnight should be for the day before. I couldn’t find the correct pandas documentation for what I’m looking for.
Is there an if else solution to create a new column, loop through the NEW_TIMESTAMP column and set a custom timeframe (if after midnight, but before 3pm: set the day before ; else set the day). Every time I write something it either runs forever, or it crashes jupyter.
Data:
I created another series that flags when a day should be moved back by one day, and multiplied it by a pd.Timedelta object, so that 0 turns into "0 days" and 1 turns into "1 day". Subtracting that series from the dates gives the right result.
Let me know how the following code works for you.
import pandas as pd
import numpy as np
# copied from https://stackoverflow.com/questions/50559078/generating-random-dates-within-a-given-range-in-pandas
def random_dates(start, end, n=15):
    start_u = start.value//10**9
    end_u = end.value//10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
dates = random_dates(start=pd.to_datetime('2020-01-01'),
end=pd.to_datetime('2021-01-01'))
timestamps = pd.Series(dates)
# this takes only the hour component of every datetime
hours = timestamps.dt.hour
# this takes only the date component of every datetime
dates = timestamps.dt.date
# this compares the hours with 15, and returns a boolean if it is smaller
flag_is_day_before = hours < 15
# now you can set the dates by multiplying the 1s and 0s with a day timedelta
new_dates = dates - pd.to_timedelta(1, unit='day') * flag_is_day_before
df = pd.DataFrame(data=dict(timestamps=timestamps, new_dates=new_dates))
print(df)
This outputs
timestamps new_dates
0 2020-07-10 20:11:13 2020-07-10
1 2020-05-04 01:20:07 2020-05-03
2 2020-03-30 09:17:36 2020-03-29
3 2020-06-01 16:16:58 2020-06-01
4 2020-09-22 04:53:33 2020-09-21
5 2020-08-02 20:07:26 2020-08-02
6 2020-03-22 14:06:53 2020-03-21
7 2020-03-14 14:21:12 2020-03-13
8 2020-07-16 20:50:22 2020-07-16
9 2020-09-26 13:26:55 2020-09-25
10 2020-11-08 17:27:22 2020-11-08
11 2020-11-01 13:32:46 2020-10-31
12 2020-03-12 12:26:21 2020-03-11
13 2020-12-28 08:04:29 2020-12-27
14 2020-04-06 02:46:59 2020-04-05
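If you would rather stay in datetime64 throughout (instead of Python date objects), a variant along these lines should also work. This is only a sketch: the function name business_day and the cutoff_hour parameter are made up for illustration, and the three timestamps are taken from the output above.

```python
import pandas as pd

def business_day(timestamps, cutoff_hour=15):
    """Assign settlements made before cutoff_hour to the previous day (hypothetical helper)."""
    # 1 where the settlement happened after midnight but before the cutoff, else 0
    shift_back = (timestamps.dt.hour < cutoff_hour).astype(int)
    # normalize() drops the time of day; then subtract 0 or 1 whole days
    return timestamps.dt.normalize() - pd.to_timedelta(shift_back, unit="D")

settlements = pd.Series(
    pd.to_datetime(["2020-07-10 20:11:13", "2020-05-04 01:20:07", "2020-06-01 16:16:58"])
)
days = business_day(settlements)
print(pd.DataFrame({"timestamps": settlements, "new_dates": days}))
```

Everything is vectorised, so there is no Python-level loop over rows, which is usually what makes the notebook hang on larger data.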
I have two time series, df1
day cnt
2020-03-01 135006282
2020-03-02 145184482
2020-03-03 146361872
2020-03-04 147702306
2020-03-05 148242336
and df2:
day cnt
2017-03-01 149104078
2017-03-02 149781629
2017-03-03 151963252
2017-03-04 147384922
2017-03-05 143466746
The problem is that the sensors I'm measuring are sensitive to the day of the week, so on Sunday, for instance, they will produce less cnt. Now I need to compare the time series over 2 different years, 2017 and 2020, but to do that I have to align (March, in this case) to the matching day of the week, and plot them accordingly. How do I "shift" the data to make the series comparable?
The ISO calendar is a representation of a date as a tuple (year, week number, weekday). In pandas these are available through the dt accessor as year, isocalendar().week (weekofyear in older versions) and weekday. So assuming that the day column actually contains Timestamps (convert it first with to_datetime if it does not), you could do:
df1['Y'] = df1.day.dt.year
df1['W'] = df1.day.dt.isocalendar().week
df1['D'] = df1.day.dt.weekday
Then you could align the dataframes on the W and D columns
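A rough sketch of what that alignment could look like, using only the five rows shown in the question. The helper add_iso_keys and the column suffixes are my own naming; here both W and D come from isocalendar(), so D runs from 1 (Monday) to 7 (Sunday).

```python
import pandas as pd

df1 = pd.DataFrame({
    "day": pd.date_range("2020-03-01", periods=5),
    "cnt": [135006282, 145184482, 146361872, 147702306, 148242336],
})
df2 = pd.DataFrame({
    "day": pd.date_range("2017-03-01", periods=5),
    "cnt": [149104078, 149781629, 151963252, 147384922, 143466746],
})

def add_iso_keys(df):
    """Add ISO week (W) and ISO weekday (D) columns (hypothetical helper)."""
    iso = df["day"].dt.isocalendar()
    return df.assign(W=iso["week"], D=iso["day"])

# Rows pair up when they share the same ISO week and weekday.
aligned = add_iso_keys(df1).merge(
    add_iso_keys(df2), on=["W", "D"], suffixes=("_2020", "_2017")
)
print(aligned)
```

With just these five days per year only the two Sundays line up (2020-03-01 and 2017-03-05, both in ISO week 9); with full months you would get a row for every week/weekday pair present in both years.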
March 2017 started on a Wednesday.
March 2020 started on a Sunday.
So delete the last 3 days of March 2017,
and delete the first Sunday, Monday and Tuesday of March 2020.
This way both series start on a Wednesday and have the same length, so the days are comparable.
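df1['cnt2020'] = df1['cnt']
df2['cnt2017'] = df2['cnt']
# df1 holds 2020: drop its first three days (Sunday, Monday, Tuesday)
df1 = df1.iloc[3:, 2].reset_index(drop=True)
# df2 holds 2017: drop its last three days
df2 = df2.iloc[:-3, 2].reset_index(drop=True)
Since you don't want to plot the date, but want the months to align, make a new dataframe with both columns and an index column. This way you will have 3 columns: index (0-27), 2017 and 2020. The index represents the position of each day in the aligned series.
new_df = pd.concat([df1,df2], axis=1)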
If you also want to plot the days of the week on the x axis, check out this link to see how to get the day of the week from a date, and then change the x tick labels.
Sorry for the written step-by-step; if it all sounds confusing, I can type the whole code later for you.
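To check that the trimming really lines the weekdays up, here is a small sketch that works on the dates alone (no cnt values, since the question only shows five rows per year). The variable names are mine.

```python
import pandas as pd

march_2017 = pd.Series(pd.date_range("2017-03-01", "2017-03-31"))
march_2020 = pd.Series(pd.date_range("2020-03-01", "2020-03-31"))

# 2017 starts on a Wednesday: keep it from the start, drop the last 3 days.
trimmed_2017 = march_2017.iloc[:-3].reset_index(drop=True)
# 2020 starts on a Sunday: drop Sunday, Monday, Tuesday so it also starts on a Wednesday.
trimmed_2020 = march_2020.iloc[3:].reset_index(drop=True)

aligned = pd.DataFrame({
    "day_2017": trimmed_2017,
    "day_2020": trimmed_2020,
    "weekday": trimmed_2017.dt.day_name(),
})
print(aligned.head())
```

Both trimmed series have 28 entries (index 0-27), and row i of one falls on the same weekday as row i of the other.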
I have a Pandas dataframe with dates for the last two property purchases. I have subtracted one from another, labelled that column Sale Date Diff and saved to a csv file. Now, I am trying to convert the data back to datetime, but it's problematic.
Here's the data
Area Sale Date Diff
10 Downtown 16553 days 00:00:00.000000000
167 Downtown 67 days 00:00:00.000000000
555 Upper Sahali 2289 days 00:00:00.000000000
987 Brockluhurst 2912 days 00:00:00.000000000
1400 North Shore 4663 days 00:00:00.000000000
When I first loaded the data from csv, it had a format type 'str'.
The column has some null values, so I tried the following:
gdf['Sale Date Diff'] = pd.to_datetime(gdf['Sale Date Diff'], errors='coerce')
Which converted all my data to pandas.tslib.NaTType and it now looks like this:
0 NaT
1 NaT
2 NaT
3 NaT
4 NaT
What would be a way around this?
I would also want to format the column to only have days, is that possible?
I'm not entirely convinced you're reading your csv correctly, it looks like you are splitting things into columns that shouldn't be split up. However, you don't want to cast to datetime, you want to cast to timedelta:
pd.to_timedelta(df['Sale Date Diff'])
10 16553 days
167 67 days
555 2289 days
987 2912 days
1400 4663 days
Name: Sale Date Diff, dtype: timedelta64[ns]
It would be helpful in the future to remove the errors='coerce' argument from your code, so you can better understand what went wrong. With that change, here is the error you would have seen:
ValueError: ('Unknown string format:', '16553 days 00:00:00.000000000')
This was caused by you trying to cast a string representing a timedelta object, to a Timestamp.
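On the second part of the question (keeping only the number of days), something like the following sketch should do it. The extra None entry stands in for the null values mentioned in the question; the variable names are mine.

```python
import pandas as pd

raw = pd.Series([
    "16553 days 00:00:00.000000000",
    "67 days 00:00:00.000000000",
    "2289 days 00:00:00.000000000",
    "2912 days 00:00:00.000000000",
    "4663 days 00:00:00.000000000",
    None,  # stand-in for the null values mentioned in the question
])

# Strings -> timedelta64; missing or unparseable entries become NaT.
diffs = pd.to_timedelta(raw, errors="coerce")
# Whole days only; rows that were NaT come out as NaN.
days = diffs.dt.days
print(days)
```

Because of the missing value the result is a float column; if you need integers, drop or fill the missing rows first.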
I have data on spells (hospital stays), each with a start and end date, but I want to count the number of days spent in hospital for calendar months. Of course, this number can be zero for months not appearing in a spell. But I cannot just attribute the length of each spell to the starting month, as longer spells run over to the following month (or more).
Basically, it would suffice for me if I could cut spells at turn-of-month datetimes, getting from the data in the first example to the data in the second:
id start end
1 2011-01-01 10:00:00 2011-01-08 16:03:00
2 2011-01-28 03:45:00 2011-02-04 15:22:00
3 2011-03-02 11:04:00 2011-03-05 05:24:00
id start end month stay
1 2011-01-01 10:00:00 2011-01-08 16:03:00 2011-01 7
2 2011-01-28 03:45:00 2011-01-31 23:59:59 2011-01 4
2 2011-02-01 00:00:00 2011-02-04 15:22:00 2011-02 4
3 2011-03-02 11:04:00 2011-03-05 05:24:00 2011-03 3
I read up on the Time Series / Date functionality of pandas, but I do not see a straightforward solution to this. How can one accomplish the slicing?
It's simpler than you think: just subtract the dates. The result is a time span. See Add column with number of days between dates in DataFrame pandas
You even get to do this for the entire frame at once:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.subtract.html
Update, now that I understand the problem better.
Add a new column: take the spell's end date; if the start date is in a different month, then set this new date's day to 01 and the time to 00:00.
This is the cut DateTime you can use to compute the portion of the stay attributable to each month. cut - start is the first month; end - cut is the second.
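Here is one way the cutting could be sketched so that it also handles spells spanning more than two months. It is not exactly the output format from the question: the cut falls at 00:00:00 of the first of the month rather than 23:59:59 of the last day, and I compute a duration as a Timedelta instead of the integer stay column. The function name split_at_month_starts is made up.

```python
import pandas as pd

spells = pd.DataFrame({
    "id": [1, 2, 3],
    "start": pd.to_datetime(
        ["2011-01-01 10:00:00", "2011-01-28 03:45:00", "2011-03-02 11:04:00"]
    ),
    "end": pd.to_datetime(
        ["2011-01-08 16:03:00", "2011-02-04 15:22:00", "2011-03-05 05:24:00"]
    ),
})

def split_at_month_starts(row):
    """Split one spell into pieces that each stay within a calendar month."""
    # All month starts between start and end ("MS" = month start frequency) ...
    candidates = pd.date_range(row.start.normalize(), row.end, freq="MS")
    # ... keeping only those strictly inside the spell.
    cuts = [c for c in candidates if row.start < c < row.end]
    edges = [row.start, *cuts, row.end]
    return [
        {"id": row.id, "start": a, "end": b, "month": a.strftime("%Y-%m")}
        for a, b in zip(edges[:-1], edges[1:])
    ]

pieces = pd.DataFrame(
    [p for row in spells.itertuples(index=False) for p in split_at_month_starts(row)]
)
pieces["duration"] = pieces["end"] - pieces["start"]
per_month = pieces.groupby("month")["duration"].sum()
print(pieces)
print(per_month)
```

Months with no hospital stay simply do not appear in per_month; reindexing against a full list of months would fill those with zero if you need them.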