I have a dataframe like this:
I want to convert the 'start_year', 'start_month', 'start_day' columns to one date
and the 'end_year', 'end_month', 'end_day' columns to another date.
Is there a way to do that?
Thank you.
Given a dataframe like this:
year month day
0 2019.0 12.0 29.0
1 2020.0 9.0 15.0
2 2018.0 3.0 1.0
You can convert them to date strings using a type cast and str.zfill:
df.apply(lambda x: f'{int(x["year"])}-{str(int(x["month"])).zfill(2)}-{str(int(x["day"])).zfill(2)}', axis=1)
OUTPUT:
0 2019-12-29
1 2020-09-15
2 2018-03-01
dtype: object
Here's an approach: simulate some data (yours was posted as an image), then
use apply on each row to build the dates with datetime.datetime().
import datetime as dt
import numpy as np
import pandas as pd
df = pd.DataFrame(
    {
        "start_year": np.random.choice(range(2018, 2022), 10),
        "start_month": np.random.choice(range(1, 13), 10),
        "start_day": np.random.choice(range(1, 28), 10),
        "end_year": np.random.choice(range(2018, 2022), 10),
        "end_month": np.random.choice(range(1, 13), 10),
        "end_day": np.random.choice(range(1, 28), 10),
    }
)
# Series.append was removed in pandas 2.0, so each row is extended with pd.concat
df = df.apply(
    lambda r: pd.concat([r, pd.Series({
        f"{startend}_date": dt.datetime(*(int(r[f"{startend}_{part}"])
                                          for part in ["year", "month", "day"]))
        for startend in ["start", "end"]})]),
    axis=1)
df
   start_year  start_month  start_day  end_year  end_month  end_day           start_date             end_date
0        2018            9          6      2020          1        3  2018-09-06 00:00:00  2020-01-03 00:00:00
1        2018           11          6      2020          7        2  2018-11-06 00:00:00  2020-07-02 00:00:00
2        2021            8         13      2020         11        2  2021-08-13 00:00:00  2020-11-02 00:00:00
3        2021            3         15      2021          3        6  2021-03-15 00:00:00  2021-03-06 00:00:00
4        2019            4         13      2021         11        5  2019-04-13 00:00:00  2021-11-05 00:00:00
5        2021            2          5      2018          8       17  2021-02-05 00:00:00  2018-08-17 00:00:00
6        2020            4         19      2020          9       18  2020-04-19 00:00:00  2020-09-18 00:00:00
7        2020            3         27      2020         10       20  2020-03-27 00:00:00  2020-10-20 00:00:00
8        2019           12         23      2018          5       11  2019-12-23 00:00:00  2018-05-11 00:00:00
9        2021            7         18      2018          5       10  2021-07-18 00:00:00  2018-05-10 00:00:00
An interesting feature of the pandas to_datetime function is that instead of
a sequence of strings you can pass it a whole DataFrame.
In that case the DataFrame must have columns named year, month and day.
They can also be of float type, as in your source DataFrame sample.
So a quite elegant solution is to:
take a part of the source DataFrame (3 columns with the respective year,
month and day),
rename its columns to year, month and day,
use it as the argument to to_datetime,
save the result as a new column.
To do it, start from defining a lambda function, to be used as the rename
function below:
colNames = lambda x: x.split('_')[1]
Then just call:
df['Start'] = pd.to_datetime(df.loc[:, 'start_year' : 'start_day']
.rename(columns=colNames))
df['End'] = pd.to_datetime(df.loc[:, 'end_year' : 'end_day']
.rename(columns=colNames))
For a sample of your source DataFrame, the result is:
start_year start_month start_day evidence_method_dating end_year end_month end_day Start End
0 2019.0 12.0 9.0 Historical Observations 2019.0 12.0 9.0 2019-12-09 2019-12-09
1 2019.0 2.0 18.0 Historical Observations 2019.0 7.0 28.0 2019-02-18 2019-07-28
2 2018.0 7.0 3.0 Seismicity 2019.0 8.0 20.0 2018-07-03 2019-08-20
Maybe the next part should be to remove columns with parts of both "start"
and "end" dates. Your choice.
Edit
To avoid saving the lambda (anonymous) function under a variable, define
this function as a regular (named) function:
def colNames(x):
    return x.split('_')[1]
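The same idea can be wrapped in a small helper. This is only a sketch: the helper name assemble_date and the sample values are made up for illustration, the columns are assumed to hold integer values, and one deliberately impossible date (30 February) is included to show what errors="coerce" does.

```python
import pandas as pd

# Hypothetical sample; the last end date (2019-02-30) is intentionally invalid.
df = pd.DataFrame({
    "start_year": [2019, 2019, 2018],
    "start_month": [12, 2, 7],
    "start_day": [9, 18, 3],
    "end_year": [2019, 2019, 2019],
    "end_month": [12, 7, 2],
    "end_day": [9, 28, 30],
})

def assemble_date(frame, prefix):
    """Build a datetime Series from <prefix>_year/_month/_day columns."""
    parts = ["year", "month", "day"]
    cols = [f"{prefix}_{p}" for p in parts]
    # to_datetime expects the columns to be named year, month and day.
    renamed = frame[cols].rename(columns=dict(zip(cols, parts)))
    # errors="coerce" turns impossible dates into NaT instead of raising.
    return pd.to_datetime(renamed, errors="coerce")

df["Start"] = assemble_date(df, "start")
df["End"] = assemble_date(df, "end")
print(df[["Start", "End"]])
```

Selecting the columns by name rather than by a label slice means other columns sitting between the date parts do not matter.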
Related
I have a column below as
date
2019-05-11
2019-11-11
2020-03-01
2021-02-18
How can I create a new column that is the same format but by quarter?
Expected output
date | quarter
2019-05-11 2019-04-01
2019-11-11 2019-10-01
2020-03-01 2020-01-01
2021-02-18 2021-01-01
Thanks
You can use pandas.PeriodIndex :
df['date'] = pd.to_datetime(df['date'])
df['quarter'] = pd.PeriodIndex(df['date'].dt.to_period('Q'), freq='Q').to_timestamp()
# Output :
print(df)
date quarter
0 2019-05-11 2019-04-01
1 2019-11-11 2019-10-01
2 2020-03-01 2020-01-01
3 2021-02-18 2021-01-01
Steps:
Convert your dates to datetime if they are not already of that type
Convert the dates to quarterly periods with dt.to_period or with PeriodIndex
Convert the quarterly periods to timestamps with to_timestamp to get the starting date of each quarter
Source Code
import pandas as pd
df = pd.DataFrame({"Dates": pd.date_range("01-01-2022", periods=30, freq="24d")})
df["Quarters"] = df["Dates"].dt.to_period("Q").dt.to_timestamp()
print(df.sample(10))
OUTPUT
Dates Quarters
19 2023-04-02 2023-04-01
29 2023-11-28 2023-10-01
26 2023-09-17 2023-07-01
1 2022-01-25 2022-01-01
25 2023-08-24 2023-07-01
22 2023-06-13 2023-04-01
6 2022-05-25 2022-04-01
18 2023-03-09 2023-01-01
12 2022-10-16 2022-10-01
15 2022-12-27 2022-10-01
In this case, a quarter will always be in the same year and will start at day 1. All there is to calculate is the month.
Since a quarter is 3 months (12 / 4), quarters start in months 1, 4, 7 and 10.
You can use the integer division (//) to achieve this.
n = month
quarter = ( (n-1) // 3 ) * 3 + 1
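As a rough illustration of that arithmetic, the sketch below applies it to the dates from the question. The function name quarter_start_month is hypothetical, and rebuilding the date through pd.to_datetime on a year/month/day frame is just one way to turn the month back into a timestamp.

```python
import pandas as pd

def quarter_start_month(month: int) -> int:
    """Return the first month (1, 4, 7 or 10) of the quarter containing `month`."""
    return ((month - 1) // 3) * 3 + 1

dates = pd.to_datetime(
    pd.Series(["2019-05-11", "2019-11-11", "2020-03-01", "2021-02-18"])
)

# Same year, first day of the month, and the month computed by integer division.
quarter_start = pd.to_datetime(
    pd.DataFrame({
        "year": dates.dt.year,
        "month": dates.dt.month.map(quarter_start_month),
        "day": 1,
    })
)
print(quarter_start)
```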
I am using a calendar data set for price prediction for different houses with a date feature that includes 365 days of the year. I would like to minimize the data set by taking the average month price of each listing in a new column.
input data:
listing_id date price months
1 2020-01-08 75.0 Jan
1 2020-01-09 100.0 Jan
1 2020-02-08 350.0 Feb
2 2020-01-08 465.0 Jan
2 2020-02-08 250.0 Feb
2 2020-02-09 250.0 Feb
Output data:
listing_id date Avg_price months
1 2020-01-08 90.0 Jan
1 2020-02-08 100.0 Feb
2 2020-01-08 50.0 Jan
2 2020-02-08 150.0 Feb
You can get the average price for each month using groupby:
g = df.groupby("months")["price"].mean()
You can then create new columns:
for month, avg in g.items():  # Series.iteritems() was removed in pandas 2.0
    df["average_{}".format(month)] = avg
Example with dummy data:
import pandas as pd
df = pd.DataFrame({'months':['Jan', 'Feb', 'Feb', 'Mar', 'Mar', 'Mar'],
'price':[1, 2, 3, 4, 5, 6]})
Result:
months price average_Feb average_Jan average_Mar
0 Jan 1 2.5 1.0 5.0
1 Feb 2 2.5 1.0 5.0
2 Feb 3 2.5 1.0 5.0
3 Mar 4 2.5 1.0 5.0
4 Mar 5 2.5 1.0 5.0
5 Mar 6 2.5 1.0 5.0
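If the goal is one row per listing and month rather than one column per month, grouping on both keys is another option. This is a sketch using the input rows from the question; keeping the first date of each group as the representative date is an assumption, and the averages are simply whatever the sample input yields.

```python
import pandas as pd

df = pd.DataFrame({
    "listing_id": [1, 1, 1, 2, 2, 2],
    "date": pd.to_datetime([
        "2020-01-08", "2020-01-09", "2020-02-08",
        "2020-01-08", "2020-02-08", "2020-02-09",
    ]),
    "price": [75.0, 100.0, 350.0, 465.0, 250.0, 250.0],
    "months": ["Jan", "Jan", "Feb", "Jan", "Feb", "Feb"],
})

# One row per (listing, month); sort=False keeps the groups in first-seen order.
monthly = (
    df.groupby(["listing_id", "months"], sort=False, as_index=False)
      .agg(date=("date", "first"), Avg_price=("price", "mean"))
)
print(monthly)
```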
I upvoted Dan's answer.
It may help to see another way to do this.
Additionally, if you ever have data that spans multiple years you may want a month_year column instead.
Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot_table.html
Example:
df = pd.DataFrame({'price':[i for i in range(121)]},
index=pd.date_range(start='12/1/2017',end='3/31/2018'))
df = df.reset_index()
df['month_year'] = (df['index'].dt.month_name() + " " +
                    df['index'].dt.year.astype(str))
df.pivot_table(values='price',columns='month_year')
Result:
In [39]: df.pivot_table(values='price',columns='month_year')
Out[39]:
month_year December 2017 February 2018 January 2018 March 2018
price 15.0 75.5 46.0 105.0
Say I have a pd.Series of daily S&P 500 values, and I would like to filter this series to get the first business day and the associated value of each week.
So, for instance, my filtered series would contain the 5 September 2017 (Tuesday - no value for the Monday), then 11 September 2017 (Monday).
Source series:
2017-09-01 2476.55
2017-09-05 2457.85
2017-09-06 2465.54
2017-09-07 2465.10
2017-09-08 2461.43
2017-09-11 2488.11
2017-09-12 2496.48
Filtered series
2017-09-01 2476.55
2017-09-05 2457.85
2017-09-11 2488.11
My solution currently consists of:
mask = SP500.apply(lambda row: SP500[row.name - datetime.timedelta(days=row.name.weekday()):].index[0], axis=1).unique()
filtered = SP500.loc[mask]
This however feels suboptimal/non-pythonic. Any better/faster/cleaner solutions?
Using resample on pd.Series.index.to_series
s[s.index.to_series().resample('W').first()]
2017-09-01 2476.55
2017-09-05 2457.85
2017-09-11 2488.11
dtype: float64
# DatetimeIndex.week and the positional axis argument of drop were removed in pandas 2.0
df.sort_index().assign(week=lambda d: d.index.isocalendar().week).drop_duplicates('week', keep='first').drop(columns='week')
Out[774]:
price
2017-09-01 2476.55
2017-09-05 2457.85
2017-09-11 2488.11
I'm not sure that the solution you give works, since the .apply method for series can't access the index, and doesn't have an axis argument. What you gave would work on a DataFrame, but this is simpler if you have a dataframe:
#Make some fake data
from datetime import date, timedelta
import pandas as pd
x = pd.DataFrame(pd.date_range(date(2017, 10, 9), date(2017, 10, 23)), columns = ['date'])
x['value'] = x.index
print(x)
date value
0 2017-10-09 0
1 2017-10-10 1
2 2017-10-11 2
3 2017-10-12 3
4 2017-10-13 4
5 2017-10-14 5
6 2017-10-15 6
7 2017-10-16 7
8 2017-10-17 8
9 2017-10-18 9
10 2017-10-19 10
11 2017-10-20 11
12 2017-10-21 12
13 2017-10-22 13
14 2017-10-23 14
#filter
filtered = x.groupby(x['date'].apply(lambda d: d-timedelta(d.weekday())), as_index = False).first()
print(filtered)
date value
0 2017-10-09 0
1 2017-10-16 7
2 2017-10-23 14
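Another way to express "first available day of each week" is to group on a weekly period and keep the first row of each group. The sketch below uses the series from the question and assumes the default weekly period, which runs Monday to Sunday; the variable names are only illustrative.

```python
import pandas as pd

s = pd.Series(
    [2476.55, 2457.85, 2465.54, 2465.10, 2461.43, 2488.11, 2496.48],
    index=pd.to_datetime([
        "2017-09-01", "2017-09-05", "2017-09-06", "2017-09-07",
        "2017-09-08", "2017-09-11", "2017-09-12",
    ]),
)

# Sort first so that head(1) really picks the earliest date in each week.
s = s.sort_index()
first_per_week = s.groupby(s.index.to_period("W")).head(1)
print(first_per_week)
```

Because the groups are built from (year, week) periods rather than week numbers alone, the same week number in different years is not merged.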
I have a dataframe that looks like this:
from to datetime other
-------------------------------------------------
11 1 2016-11-06 22:00:00 -
11 1 2016-11-06 20:00:00 -
11 1 2016-11-06 15:45:00 -
11 12 2016-11-06 15:00:00 -
11 1 2016-11-06 12:00:00 -
11 18 2016-11-05 10:00:00 -
11 12 2016-11-05 10:00:00 -
12 1 2016-10-05 10:00:59 -
12 3 2016-09-06 10:00:34 -
I want to group by the "from" and then "to" columns, sort "datetime" in descending order, and finally calculate the time difference, within each group, between the current time and the next time. For example, in this case,
I would like to have a dataframe like the following:
from to timediff in minutes others
11 1 120
11 1 255
11 1 225
11 1 0 (preferably subtract this date from the epoch)
11 12 300
11 12 0
11 18 0
12 1 25
12 3 0
I can't get my head around figuring this out!! Is there a way out for this?
Any help will be much much appreciated!!
Thank you so much in advance!
df.assign(
timediff=df.sort_values(
'datetime', ascending=False
).groupby(['from', 'to']).datetime.diff(-1).dt.seconds.div(60).fillna(0))
I think you need:
groupby with apply, sort_values with diff, then convert the Timedelta to minutes by taking seconds and floor-dividing by 60
fillna and sort_index, then remove level 2 of the index
df = (df.groupby(['from','to']).datetime
        .apply(lambda x: x.sort_values().diff().dt.seconds // 60)
        .fillna(0)
        .sort_index()
        .reset_index(level=2, drop=True)
        .reset_index(name='timediff in minutes'))
print (df)
from to timediff in minutes
0 11 1 120.0
1 11 1 255.0
2 11 1 225.0
3 11 1 0.0
4 11 12 300.0
5 11 12 0.0
6 11 18 0.0
7 12 3 0.0
8 12 3 0.0
df = df.join(df.groupby(['from','to'])
.datetime
.apply(lambda x: x.sort_values().diff().dt.seconds // 60)
.fillna(0)
.reset_index(level=[0,1], drop=True)
.rename('timediff in minutes'))
print (df)
from to datetime other timediff in minutes
0 11 1 2016-11-06 22:00:00 - 120.0
1 11 1 2016-11-06 20:00:00 - 255.0
2 11 1 2016-11-06 15:45:00 - 225.0
3 11 12 2016-11-06 15:00:00 - 300.0
4 11 1 2016-11-06 12:00:00 - 0.0
5 11 18 2016-11-05 10:00:00 - 0.0
6 11 12 2016-11-05 10:00:00 - 0.0
7 12 3 2016-10-05 10:00:59 - 0.0
8 12 3 2016-09-06 10:00:34 - 0.0
Almost as above, but without apply:
result = df.sort_values(['from','to','datetime'])\
.groupby(['from','to'])['datetime']\
.diff().dt.seconds.fillna(0)
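One detail worth knowing about the snippets above: dt.seconds returns only the seconds component of a Timedelta and ignores whole days, so a gap of 1 day 5 hours comes out as 300 minutes. If the full duration is wanted instead, dt.total_seconds() can be swapped in. A sketch on the data from the question (the "other" column is omitted):

```python
import pandas as pd

df = pd.DataFrame({
    "from": [11, 11, 11, 11, 11, 11, 11, 12, 12],
    "to": [1, 1, 1, 12, 1, 18, 12, 1, 3],
    "datetime": pd.to_datetime([
        "2016-11-06 22:00:00", "2016-11-06 20:00:00", "2016-11-06 15:45:00",
        "2016-11-06 15:00:00", "2016-11-06 12:00:00", "2016-11-05 10:00:00",
        "2016-11-05 10:00:00", "2016-10-05 10:00:59", "2016-09-06 10:00:34",
    ]),
})

# Newest first, then subtract the next (older) timestamp within each group.
ordered = df.sort_values("datetime", ascending=False)
gaps = ordered.groupby(["from", "to"])["datetime"].diff(-1)

# total_seconds() keeps whole days; the result aligns back to df by index.
df["timediff in minutes"] = gaps.dt.total_seconds().div(60).fillna(0)
print(df)
```

With this variant the (11, 12) pair yields 1740 minutes (29 hours) rather than 300; which of the two is right depends on what the difference is supposed to mean.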
I have this dataframe df:
U,Datetime
01,2015-01-01 20:00:00
01,2015-02-01 20:05:00
01,2015-04-01 21:00:00
01,2015-05-01 22:00:00
01,2015-07-01 22:05:00
02,2015-08-01 20:00:00
02,2015-09-01 21:00:00
02,2014-01-01 23:00:00
02,2014-02-01 22:05:00
02,2015-01-01 20:00:00
02,2014-03-01 21:00:00
03,2015-10-01 20:00:00
03,2015-11-01 21:00:00
03,2015-12-01 23:00:00
03,2015-01-01 22:05:00
03,2015-02-01 20:00:00
03,2015-05-01 21:00:00
03,2014-01-01 20:00:00
03,2014-02-01 21:00:00
made of U and a Datetime object. What I would like to do is keep the U values that have at least three consecutive occurrences in months/year. So far I have grouped by U, year and month as:
m = df.groupby(['U',df.index.year,df.index.month]).size()
obtaining:
U
1 2015 1 1
2 1
4 1
5 1
7 1
2 2014 1 1
2 1
3 1
2015 1 1
8 1
9 1
3 2014 1 1
2 1
2015 1 1
2 1
5 1
10 1
11 1
12 1
The third column is related to the occurrences in different months/year. In this case only U values of 02 and 03 contain at least three consecutive values in months/year. Now I can't figure out how to select those users and get them in a list, for instance, or keep them in the original dataframe df and discard the others. I also tried:
g = m.groupby(level=[0,1]).diff()
But I can't get any useful information.
Finally I came up with a solution.
To give you an idea of how the custom function works: it subtracts each month value from the one that follows it. For consecutive months the result is 1, and this has to happen twice in a row. For example, with the months [5, 6, 7], 7 - 6 = 1 and 6 - 5 = 1; the value 1 appears twice, so the condition is fulfilled.
In [80]:
df.reset_index(inplace=True)
In [281]:
df['month'] = df.Datetime.dt.month
df['year'] = df.Datetime.dt.year
df
Out[281]:
Datetime U month year
0 2015-01-01 20:00:00 1 1 2015
1 2015-02-01 20:05:00 1 2 2015
2 2015-04-01 21:00:00 1 4 2015
3 2015-05-01 22:00:00 1 5 2015
4 2015-07-01 22:05:00 1 7 2015
5 2015-08-01 20:00:00 2 8 2015
6 2015-09-01 21:00:00 2 9 2015
7 2014-01-01 23:00:00 2 1 2014
8 2014-02-01 22:05:00 2 2 2014
9 2015-01-01 20:00:00 2 1 2015
10 2014-03-01 21:00:00 2 3 2014
11 2015-10-01 20:00:00 3 10 2015
12 2015-11-01 21:00:00 3 11 2015
13 2015-12-01 23:00:00 3 12 2015
14 2015-01-01 22:05:00 3 1 2015
15 2015-02-01 20:00:00 3 2 2015
16 2015-05-01 21:00:00 3 5 2015
17 2014-01-01 20:00:00 3 1 2014
18 2014-02-01 21:00:00 3 2 2014
In [284]:
g = df.groupby([df['U'] , df.year])
In [86]:
res = g.filter(lambda x : is_at_least_three_consec(x['month'].diff().values.tolist()))
res
Out[86]:
Datetime U month year
7 2014-01-01 23:00:00 2 1 2014
8 2014-02-01 22:05:00 2 2 2014
10 2014-03-01 21:00:00 2 3 2014
11 2015-10-01 20:00:00 3 10 2015
12 2015-11-01 21:00:00 3 11 2015
13 2015-12-01 23:00:00 3 12 2015
14 2015-01-01 22:05:00 3 1 2015
15 2015-02-01 20:00:00 3 2 2015
16 2015-05-01 21:00:00 3 5 2015
If you want to see the result of the custom function for each group:
In [84]:
res = g['month'].agg(lambda x : is_at_least_three_consec(x.diff().values.tolist()))
res
Out[84]:
U year
1 2015 False
2 2014 True
2015 False
3 2014 False
2015 True
Name: month, dtype: bool
This is how the custom function is implemented:
In [53]:
def is_at_least_three_consec(month_diff):
    consec_count = 0
    #print(month_diff)
    for index , val in enumerate(month_diff):
        if index != 0 and val == 1:
            consec_count += 1
            if consec_count == 2:
                return True
        else:
            consec_count = 0
    return False
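A variant of the same check, sketched under different assumptions: it groups by U only, sorts the months first, and maps each date to a month ordinal (year * 12 + month) so that a December to January step also counts as consecutive. The function name has_consecutive_months and the min_run parameter are hypothetical, and Datetime is assumed to be a regular column.

```python
import pandas as pd

df = pd.DataFrame({
    "U": [1] * 5 + [2] * 6 + [3] * 8,
    "Datetime": pd.to_datetime([
        "2015-01-01 20:00:00", "2015-02-01 20:05:00", "2015-04-01 21:00:00",
        "2015-05-01 22:00:00", "2015-07-01 22:05:00",
        "2015-08-01 20:00:00", "2015-09-01 21:00:00", "2014-01-01 23:00:00",
        "2014-02-01 22:05:00", "2015-01-01 20:00:00", "2014-03-01 21:00:00",
        "2015-10-01 20:00:00", "2015-11-01 21:00:00", "2015-12-01 23:00:00",
        "2015-01-01 22:05:00", "2015-02-01 20:00:00", "2015-05-01 21:00:00",
        "2014-01-01 20:00:00", "2014-02-01 21:00:00",
    ]),
})

def has_consecutive_months(dates, min_run=3):
    """True if `dates` cover at least `min_run` consecutive calendar months."""
    # Distinct month ordinals, sorted, so row order and duplicates do not matter.
    ordinals = sorted(set(dates.dt.year * 12 + dates.dt.month))
    best = run = 1 if ordinals else 0
    for prev, cur in zip(ordinals, ordinals[1:]):
        run = run + 1 if cur - prev == 1 else 1
        best = max(best, run)
    return best >= min_run

keep = df.groupby("U")["Datetime"].apply(has_consecutive_months)
users = keep[keep].index.tolist()        # U values that pass the check
filtered = df[df["U"].isin(users)]       # original rows of those users only
print(users)
```

Unlike the per-year filter above, this keeps every row of a qualifying user, not just the rows of the year in which the run occurs.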