I have a dataframe - see below. This is just a snippet of the full dataframe; there are more text and date/times in each respective row/ID. As you can see, the text before and after each date/time is random.
ID RESULT
1 Patients Discharged Home : 12/07/2022 11:19 Bob Melciv Appt 12/07/2022 12:19 Medicaid...
2 Stawword Geraldio - 12/17/2022 11:00 Bob Melciv Appt 12/10/2022 12:09 Risk Factors...
I would like to pull all date/times where the format is MM/DD/YYYY HH:MM from the RESULT column and make each of those respective date/times into their own column.
ID DATE_TIME_1 DATE_TIME_2 DATE_TIME_3 .....
1 12/07/2022 11:19 12/07/2022 12:19
2 12/17/2022 11:00 12/10/2022 12:09
Building on #David542's regex, you can use str.extractall:
pattern = r'(\d{2}/\d{2}/\d{4} \d{2}:\d{2})'
out = pd.concat([df['ID'],
                 df['RESULT'].str.extractall(pattern).squeeze()
                             .unstack()
                             .rename(columns=lambda x: f'DATE_TIME_{x+1}')
                             .rename_axis(columns=None)],
                axis=1)
print(out)
# Output
ID DATE_TIME_1 DATE_TIME_2
0 1 12/07/2022 11:19 12/07/2022 12:19
1 2 12/17/2022 11:00 12/10/2022 12:09
A slightly modified version to convert extracted date/time to pd.DatetimeIndex:
pattern = r'(\d{2}/\d{2}/\d{4} \d{2}:\d{2})'
out = pd.concat([df['ID'],
                 df['RESULT'].str.extractall(pattern).squeeze()
                             .apply(pd.to_datetime)
                             .unstack()
                             .rename(columns=lambda x: f'DATE_TIME_{x+1}')
                             .rename_axis(columns=None)],
                axis=1)
print(out)
# Output
ID DATE_TIME_1 DATE_TIME_2
0 1 2022-12-07 11:19:00 2022-12-07 12:19:00
1 2 2022-12-17 11:00:00 2022-12-10 12:09:00
Step by step:
# 1. Date extraction (and squeeze DataFrame with 1 column to Series)
>>> out = df['RESULT'].str.extractall(pattern).squeeze()
match
0 0 12/07/2022 11:19
1 12/07/2022 12:19
1 0 12/17/2022 11:00
1 12/10/2022 12:09
Name: 0, dtype: object
# 2. Move second index level as column (and add the prefix DATE_TIME_N)
>>> out = out.unstack().rename(columns=lambda x: f'DATE_TIME_{x+1}')
match DATE_TIME_1 DATE_TIME_2
0 12/07/2022 11:19 12/07/2022 12:19
1 12/17/2022 11:00 12/10/2022 12:09
# 3. Remove the 'match' title on column axis
>>> out = out.rename_axis(columns=None)
DATE_TIME_1 DATE_TIME_2
0 12/07/2022 11:19 12/07/2022 12:19
1 12/17/2022 11:00 12/10/2022 12:09
Finally, concatenate the original ID with this new dataframe along the column axis.
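Following the same pattern as the steps above, that last step might look like this (a minimal sketch; out is the result of step 3):
# 4. Concatenate the original ID along the column axis
>>> out = pd.concat([df['ID'], out], axis=1)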
How about:
\d{2}/\d{2}/\d{4} \d{2}:\d{2}
Of course this doesn't cover nonsensical dates such as 55/55/1023, but it should get you 99% of the way there.
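If nonsensical dates do matter, one option (my addition, not part of the original answer) is to round-trip the matches through pd.to_datetime with errors='coerce' and keep only the ones that parse as real calendar dates:
import pandas as pd

matches = pd.Series(['12/07/2022 11:19', '55/55/1023 11:19'])
parsed = pd.to_datetime(matches, format='%m/%d/%Y %H:%M', errors='coerce')
valid = matches[parsed.notna()]  # '55/55/1023 11:19' becomes NaT and is dropped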
I am trying to calculate a rolling mean over 1 year in the pandas dataframe below. 'mean_1year' is calculated from a 1-year window based on month and year.
For example, the month and year of the first row are '05' and '2016', so its 'mean_1year' is the average 'price' from '2016-04' back to '2015-04', i.e. (1300+1400+1500)/3 = 1400. While calculating this average, a filter also has to be applied on the "type" column: since the "type" of the first row is "A", the rows are first filtered on type=="A" and the average is then computed from '2016-04' back to '2015-04'.
type year month price mean_1year
A 2016 05 1200 1400
A 2016 04 1300
A 2016 01 1400
A 2015 12 1500
Any suggestions would be appreciated. Thanks!
First you need a datetime index in ascending order so you can apply a rolling time period calculation.
df['date'] = pd.to_datetime(df['year'].astype('str')+'-'+df['month'].astype('str'))
df = df.set_index('date')
df = df.sort_index()
Then you group by type and apply the rolling mean.
df['mean_1year'] = df.groupby('type')['price'].rolling('365D').mean().reset_index(0, drop=True)
The result is:
type year month price mean_1year
date
2015-12-01 A 2015 12 1500 1500.0
2016-01-01 A 2016 1 1400 1450.0
2016-04-01 A 2016 4 1300 1400.0
2016-05-01 A 2016 5 1200 1350.0
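Note that rolling('365D') includes the current row in each window, which is why the 2016-05 row shows 1350.0 rather than the 1400 from the question. If your pandas version supports it, passing closed='left' should exclude the current row (a sketch under that assumption, not part of the original answer):
df['mean_1year'] = (df.groupby('type')['price']
                      .rolling('365D', closed='left').mean()
                      .reset_index(0, drop=True))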
"Ordinary" rolling can't be applied, because it:
includes rows starting from the current row, whereas you want
to exclude it,
the range of the window expands into the future,
whereas you want to expand it back.
So I used a different approach, based on loc with suitable date slices.
As a test DataFrame I used:
type year month price
0 A 2016 5 1200
1 A 2016 4 1300
2 A 2016 1 1400
3 A 2015 12 1500
4 B 2016 5 1200
5 B 2016 4 1300
And the code is as follows:
Compute date offsets of 12 months and 1 day:
yearOffs = pd.offsets.DateOffset(months=12)
dayOffs = pd.offsets.DateOffset(days=1)
They will be needed in loc later.
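To illustrate what these offsets do (a quick check, not part of the original answer):
>>> pd.Timestamp('2016-05-01') - yearOffs
Timestamp('2015-05-01 00:00:00')
>>> pd.Timestamp('2016-05-01') - dayOffs
Timestamp('2016-04-30 00:00:00')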
Set the index to a datetime, derived from year and
month columns:
df.set_index(pd.to_datetime(df.year.astype(str)
                            + df.month.astype(str), format='%Y%m'),
             inplace=True)
Define the function to compute means within the current
group:
def myMeans(grp):
    wrk = grp.sort_index()
    return wrk.apply(lambda row: wrk.loc[row.name - yearOffs
                                         : row.name - dayOffs, 'price'].mean(),
                     axis=1)
Compute the means:
means = df.groupby('type').apply(myMeans).swaplevel()
So far the result is:
type
2015-12-01 A NaN
2016-01-01 A 1500.0
2016-04-01 A 1450.0
2016-05-01 A 1400.0
2016-04-01 B NaN
2016-05-01 B 1300.0
dtype: float64
but df has a single-level index with non-unique values. So to add means to df and drop the now-unnecessary index, the last step is:
df = df.set_index('type', append=True).assign(mean_1year=means)\
.reset_index(level=1).reset_index(drop=True)
The final result is:
type year month price mean_1year
0 A 2016 5 1200 1400.0
1 A 2016 4 1300 1450.0
2 A 2016 1 1400 1500.0
3 A 2015 12 1500 NaN
4 B 2016 5 1200 1300.0
5 B 2016 4 1300 NaN
For the "earliest" rows in each group the result is NaN,
as there are no source (earlier) rows to compute the means
for them (so there is apparently something wrong in the other solution).
I have this dataframe and am trying to find the difference in minutes between Date1 and Date2 IF the first two characters are the same, then create a column for that.
For example, in the first row 22 == 22, so find the difference between 20:27:45 and 20:52:03.
Date1 Date2 ID City
0 22 20:27:45 22 20:52:03 76 Denver
1 02 20:16:28 02 20:49:02 45 Austin
2 15 19:35:09 15 20:52:44 233 Chicago
3 30 19:47:53 30 20:18:01 35 Detroit
4 09 19:01:52 09 19:45:26 342 New York City
This is what I've tried so far:
(pd.to_datetime(data['Date1'].str[3:]).dt.minute - pd.to_datetime(data['Date2'].str[3:]).dt.minute)
This works fine but I want to add that condition in here.
I tried creating a function:
def f(data):
    if data['Date1'][:3] == data['Date2'][:3]:
        return pd.to_datetime(data['Date1'][3:]).dt.minute - pd.to_datetime(data['Date2'][3:]).dt.minute
Getting Error:
AttributeError: ("'Timestamp' object has no attribute 'dt'", 'occurred at index 0')
I know it's nonsensical to be applying pd.to_datetime this way inside a row-wise function, but how can I convert these values into timestamps and find the difference in minutes?
Assuming your date columns are currently strings, you can parse the whole 'day hour:minute:second' string, then do an apply based on the day attribute of the timestamp.
I changed the day in one of the values to demonstrate what happens when the days aren't equal.
def diff_func(x):
    date_1 = pd.to_datetime(x.Date1, format='%d %H:%M:%S')
    date_2 = pd.to_datetime(x.Date2, format='%d %H:%M:%S')
    if date_1.day == date_2.day:
        return (date_2 - date_1).seconds / 60
    else:
        return None

df['minute_difference'] = df.apply(diff_func, axis=1)
Date1 Date2 minute_difference
0 22 20:27:45 22 20:52:03 24.300000
1 03 20:16:28 02 20:49:02 NaN
2 15 19:35:09 15 20:52:44 77.583333
3 30 19:47:53 30 20:18:01 30.133333
4 09 19:01:52 09 19:45:26 43.566667
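One caveat worth noting (my addition, not from the original answer): Timedelta.seconds wraps at one day and ignores the sign, so for gaps that might exceed a day or run backwards, total_seconds() is safer:
# inside diff_func, a more robust variant of the return line:
return (date_2 - date_1).total_seconds() / 60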
You can use Series.str.slice to create the day columns, then pd.to_datetime to create datetime objects. Finally, use np.where to conditionally fill the new column called Difference:
import numpy as np

df['Date1_day'] = df['Date1'].str.slice(start=0, stop=3)
df['Date2_day'] = df['Date2'].str.slice(start=0, stop=3)
df['Date1'] = pd.to_datetime(df['Date1'].str.slice(start=3))
df['Date2'] = pd.to_datetime(df['Date2'].str.slice(start=3))
df['Difference'] = np.where(df['Date1_day'] == df['Date2_day'],
                            df['Date2'] - df['Date1'],
                            np.NaN)
df.drop(['Date1_day', 'Date2_day'], axis=1, inplace=True)
print(df)
Date1 Date2 ID City Difference
0 2019-04-11 20:27:45 2019-04-11 20:52:03 76 Denver 00:24:18
1 2019-04-11 20:16:28 2019-04-11 20:49:02 45 Austin 00:32:34
2 2019-04-11 19:35:09 2019-04-11 20:52:44 233 Chicago 01:17:35
3 2019-04-11 19:47:53 2019-04-11 20:18:01 35 Detroit 00:30:08
4 2019-04-11 19:01:52 2019-04-11 19:45:26 342 New York City 00:43:34
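If you'd rather have the difference as a number of minutes than a Timedelta (my addition, assuming the df built above), you can divide by a one-minute Timedelta:
df['Difference'] = pd.to_timedelta(df['Difference']) / pd.Timedelta(minutes=1)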
I have a dataset containing monthly observations of a time series.
What I want to do is transform the datetime to year/quarter format and then extract the first value DATE[0] as the previous quarter. For example, 2006-10-31 belongs to Q4 of 2006, but I want to change it to 2006Q3.
For the extraction of the subsequent values I will just use the last value from each quarter.
So, for 2006Q4 I will keep BBGN, SSD, and QQ4567 values only from DATE[2]. Similarly, for 2007Q1 I will keep only DATE[5] values, and so forth.
Original dataset:
DATE BBGN SSD QQ4567
0 2006-10-31 00:00:00 1.210 22.022 9726.550
1 2006-11-30 00:00:00 1.270 22.060 9891.008
2 2006-12-31 00:00:00 1.300 22.080 10055.466
3 2007-01-31 00:00:00 1.330 22.099 10219.924
4 2007-02-28 00:00:00 1.393 22.110 10350.406
5 2007-03-31 00:00:00 1.440 22.125 10480.888
After processing the DATE
DATE BBGN SSD QQ4567
0 2006Q3 1.210 22.022 9726.550
2 2006Q4 1.300 22.080 10055.466
5 2007Q1 1.440 22.125 10480.888
The steps I have taken so far are:
Turn the values from the yyyy-mm-dd hh format to yyyyQQ format
DF['DATE'] = pd.to_datetime(DF['DATE']).dt.to_period('Q')
and I get this
DATE BBGN SSD QQ4567
0 2006Q4 1.210 22.022 9726.550
1 2006Q4 1.270 22.060 9891.008
2 2006Q4 1.300 22.080 10055.466
3 2007Q1 1.330 22.099 10219.924
4 2007Q1 1.393 22.110 10350.406
5 2007Q1 1.440 22.125 10480.888
The next step is to extract the last values from each quarter. But because I always want to keep the first row, I will exclude DATE[0] from the function.
quarterDF = DF.iloc[1:,].drop_duplicates(subset='DATE', keep='last')
Now, my question is how can I change the value in DATE[0] to always be the previous quarter. So, from 2006Q4 to be 2006Q3. Also, how this will work if DATE[0] is 2007Q1, can I change it to 2006Q4?
My suggestion would be to create a new DATE column with a date 3 months in the past, like this:
import pandas as pd
df = pd.DataFrame()
df['Date'] = pd.to_datetime(['2006-10-31', '2007-01-31'])
one_quarter = pd.tseries.offsets.DateOffset(months=3)
df['Last_quarter'] = df.Date - one_quarter
This will give you
Date Last_quarter
0 2006-10-31 2006-07-31
1 2007-01-31 2006-10-31
Then you can do the same process as described above on Last_quarter:
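That is, something like the following (reusing the to_period('Q') conversion from the question):
df['Last_quarter'] = df['Last_quarter'].dt.to_period('Q')
Alternatively, since quarterly Periods support integer arithmetic, subtracting 1 from df['Date'].dt.to_period('Q') should also give the previous quarter directly.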
Here is a pivot_table approach
# Roll each date back to the previous quarter end and save it in a column
df['Q'] = df['DATE'] - pd.tseries.offsets.QuarterEnd()
#0 2006-09-30
#1 2006-09-30
#2 2006-09-30
#3 2006-12-31
#4 2006-12-31
#5 2006-12-31
#Name: Q, dtype: datetime64[ns]
# Drop the date columns and pivot, keeping the last row per quarter
ndf = (df.drop(columns=['DATE', 'Q'])
         .pivot_table(index=pd.to_datetime(df['Q']).dt.to_period('Q'),
                      aggfunc='last'))
BBGN QQ4567 SSD
Q
2006Q3 1.30 10055.466 22.080
2006Q4 1.44 10480.888 22.125
My data has trips with datetime info, user id for each trip and trip type (single, round, pseudo).
Here's a data sample (pandas dataframe), named All_Data:
HoraDTRetirada idpass type
2016-02-17 15:36:00 39579449489 'single'
2016-02-18 19:13:00 39579449489 'single'
2016-02-26 09:20:00 72986744521 'pseudo'
2016-02-27 12:11:00 72986744521 'round'
2016-02-27 14:55:00 11533148958 'pseudo'
2016-02-28 12:27:00 72986744521 'round'
2016-02-28 16:32:00 72986744521 'round'
I would like to count the number of times each category repeats in a "week of year" by user.
For example, if the event happens on a monday and the next event happens on a thursday for a same user, that makes two events on the same week; however, if one event happens on a saturday and the next event happens on the following monday, they happened in different weeks.
The output I am looking for would be in a form like this:
idpass weekofyear type frequency
39579449489 1 'single' 2
72986744521 2 'round' 3
72986744521 2 'pseudo' 1
11533148958 2 'pseudo' 1
Edit: this older question approaches a similar problem, but I don't know how to do it with pandas.
import pandas as pd
data = {"HoraDTRetirada": ["2016-02-17 15:36:00", "2016-02-18 19:13:00", "2016-12-31 09:20:00",
                           "2016-02-28 12:11:00", "2016-02-28 14:55:00", "2016-02-29 12:27:00",
                           "2016-02-29 16:32:00"],
        "idpass": ["39579449489", "39579449489", "72986744521", "72986744521", "11533148958",
                   "72986744521", "72986744521"],
        "type": ["single", "single", "pseudo", "round", "pseudo", "round", "round"]}
df = pd.DataFrame.from_dict(data)
print(df)
df["HoraDTRetirada"] = pd.to_datetime(df['HoraDTRetirada'])
df["week"] = df['HoraDTRetirada'].dt.strftime('%U')
k = df.groupby(["idpass", "week", "type"], as_index=False).count()
print(k)
Output:
HoraDTRetirada idpass type
0 2016-02-17 15:36:00 39579449489 single
1 2016-02-18 19:13:00 39579449489 single
2 2016-12-31 09:20:00 72986744521 pseudo
3 2016-02-28 12:11:00 72986744521 round
4 2016-02-28 14:55:00 11533148958 pseudo
5 2016-02-29 12:27:00 72986744521 round
6 2016-02-29 16:32:00 72986744521 round
idpass week type HoraDTRetirada
0 11533148958 09 pseudo 1
1 39579449489 07 single 2
2 72986744521 09 round 3
3 72986744521 52 pseudo 1
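Note that '%U' numbers weeks with Sunday as the first day; if you prefer ISO weeks (Monday start), newer pandas versions offer dt.isocalendar() (an assumption about your pandas version):
df['week'] = df['HoraDTRetirada'].dt.isocalendar().week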
This is how I got what I was looking for:
Step 1 from the suggested answers was skipped because the timestamps were already in pandas datetime form.
Step 2: create column for week of year:
df['week'] = df['HoraDTRetirada'].dt.strftime('%U')
Step 3: group by user id, type and week, and count values with size():
df.groupby(['idpass','type','week']).size()
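To get the exact column layout from the question, the size() result can be turned back into columns (a small addition on my part):
freq = df.groupby(['idpass', 'type', 'week']).size().reset_index(name='frequency')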
My suggestion would be to do this:
Make sure your timestamp is a pandas datetime and add a frequency column:
df['HoraDTRetirada'] = pd.to_datetime(df['HoraDTRetirada'])
df['freq'] = 1
Group it and count
res = df.groupby(['idpass', 'type', pd.Grouper(key='HoraDTRetirada', freq='1W')]).count().reset_index()
Convert time to week of a year
res['HoraDTRetirada'] = res['HoraDTRetirada'].apply(lambda x: x.week)
The final result then has one row per idpass/type/week combination, with the count in the freq column.
EDIT:
You are right; in your case we should do step 3 before step 2. If you do that, remember that the groupby changes, so step 2 will now be:
df['HoraDTRetirada'] = df['HoraDTRetirada'].apply(lambda x: x.week)
and step 3:
res = df.groupby(['idpass', 'type', 'HoraDTRetirada']).count().reset_index()
It's a bit different because the "Hora" variable is no longer a time, but just an int representing a week.
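Putting the corrected order together, a compact sketch (my reading of the edit, assuming the df from the question):
df['HoraDTRetirada'] = pd.to_datetime(df['HoraDTRetirada'])
df['freq'] = 1
df['HoraDTRetirada'] = df['HoraDTRetirada'].apply(lambda x: x.week)  # now an int week number
res = df.groupby(['idpass', 'type', 'HoraDTRetirada']).count().reset_index()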
I have a dataframe that looks like:
id email domain created_at company
0 1 son#mail.com old.com 2017-01-21 18:19:00 company_a
1 2 boy#mail.com new.com 2017-01-22 01:19:00 company_b
2 3 girl#mail.com nadda.com 2017-01-22 01:19:00 no_company
I need to summarize the data by year, month, and whether the company column has a value other than "no_company":
Desired output:
year month company count
2017 1 has_company 2
no_company 1
The following works great but gives me the count for each value in the company column:
new_df = test_df['created_at'].groupby([test_df.created_at.dt.year,
                                        test_df.created_at.dt.month,
                                        test_df.company]).agg('count')
print(new_df)
result:
year month company
2017 1 company_a 1
company_b 1
no_company 1
Map the company column to a new has_company/no_company series, then groupby:
c = df.company.map(lambda x: x if x == 'no_company' else 'has_company')
y = df.created_at.dt.year.rename('year')
m = df.created_at.dt.month.rename('month')
df.groupby([y, m, c]).size()
year month company
2017 1 has_company 2
no_company 1
dtype: int64
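If you prefer the count as a named column (closer to the desired output), a possible follow-up:
out = df.groupby([y, m, c]).size().rename('count').reset_index()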