Pandas subtracting number of days from date - python

I am trying to create a new column "Starting_time" by subtracting 60 days from "Harvest_date", but I get the same date in every row. Can someone point out what I did wrong, please?
Harvest_date
20.12.21
12.01.21
10.03.21
import pandas as pd
from datetime import timedelta

df1 = pd.read_csv(r'C:\Flower_weight.csv')

def subtract_days_from_date(date, days):
    subtracted_date = pd.to_datetime(date) - timedelta(days=days)
    subtracted_date = subtracted_date.strftime("%Y-%m-%d")
    return subtracted_date

df1['Harvest_date'] = pd.to_datetime(df1.Harvest_date)
df1.style.format({"Harvest_date": lambda t: t.strftime("%Y-%m-%d")})

for harvest_date in df1['Harvest_date']:
    df1["Starting_date"] = subtract_days_from_date(harvest_date, 60)
print(df1["Starting_date"])
Starting_date
2021-10-05
2021-10-05
2021-10-05

I am not sure the loop is necessary here. Perhaps try the following (note that .dt.strftime returns a new series, so assign it back):
df1['Starting_date'] = df1['Harvest_date'].apply(lambda x: pd.to_datetime(x) - timedelta(days=60))
df1['Starting_date'] = df1['Starting_date'].dt.strftime("%Y-%m-%d")
df1['Starting_date']

You're overwriting the whole series on each iteration of the loop:
for harvest_date in df1['Harvest_date']:
    df1["Starting_date"] = subtract_days_from_date(harvest_date, 60)
You can do away with the loop by vectorizing the subtract_days_from_date function.
You could also reference an index with enumerate.
np.vectorize
import numpy as np
subtract_days_from_date = np.vectorize(subtract_days_from_date)
df1["Starting_date"] = subtract_days_from_date(df1["Harvest_date"], 60)
enumerate
for idx, harvest_date in enumerate(df1['Harvest_date']):
    # .loc writes back to the frame; chained df1.iloc[idx]["Starting_date"] = ... would assign to a copy
    df1.loc[idx, "Starting_date"] = subtract_days_from_date(harvest_date, 60)
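For completeness, the whole transformation can also be done without any helper function. A minimal sketch, using the sample dates from the question and assuming they are day.month.year:

```python
import pandas as pd

df1 = pd.DataFrame({'Harvest_date': ['20.12.21', '12.01.21', '10.03.21']})
df1['Harvest_date'] = pd.to_datetime(df1['Harvest_date'], format='%d.%m.%y')
# Subtracting a Timedelta is vectorized over the whole column
df1['Starting_date'] = (df1['Harvest_date'] - pd.Timedelta(days=60)).dt.strftime('%Y-%m-%d')
print(df1)
```

Each row now gets its own starting date, since the subtraction runs over the entire column at once instead of repeatedly overwriting it.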

Related

Sorting datetime; pandas

I have a big Excel file with a datetime column stored as strings. The column looks like this:
ingezameldop
2022-10-10 15:51:18
2022-10-10 15:56:19
I have found two ways of trying to do this; however, neither works.
First (nice way):
import pandas as pd
from datetime import datetime
from datetime import date
dagStart = datetime.strptime(str(date.today())+' 06:00:00', '%Y-%m-%d %H:%M:%S')
dagEind = datetime.strptime(str(date.today())+' 23:00:00', '%Y-%m-%d %H:%M:%S')
data = pd.read_excel('inzamelbestand.xlsx', index_col=9)
data = data.loc[pd.to_datetime(data['ingezameldop']).dt.time.between(dagStart.time(), dagEind.time())]
data.to_excel("oefenexcel.xlsx")
However, this returns an Excel file identical to the original one. I can't seem to fix this.
Second way (sketchy):
import pandas as pd
from datetime import datetime
from datetime import date
df = pd.read_excel('inzamelbestand.xlsx', index_col=9)
# filter out today's date
dag = str(date.today())
dag1 = dag[8]+dag[9]
vgl = df['ingezameldop']
vgl2 = vgl.str[8]+vgl.str[9]
df = df.loc[vgl2 == dag1]
# filter from 6 a.m. onwards
# characters 11-12 = hour
df.to_excel("oefenexcel.xlsx")
This one works for filtering out the exact day, but not for the hours. I extract the 11th and 12th characters the same way, but comparison operators like >= don't work usefully on strings, so I can't filter for times > 6.
You can replace this line of code
data = data.loc[pd.to_datetime(data['ingezameldop']).dt.time.between(dagStart.time(), dagEind.time())]
with a per-element comparison of the form
(dagStart.hour, dagStart.minute) <= (t.hour, t.minute) < (dagEind.hour, dagEind.minute)
to get boolean values that are only true for records within the time range (here t stands for a single timestamp from the column).
dagStart, dagEind and the values in data['ingezameldop'] must be in datetime format.
To apply it to each element of the column, wrap it in a function (named filter_time here, to avoid shadowing the built-in filter) and use apply as follows:
def filter_time(ingezameldop, dagStart, dagEind):
    return (dagStart.hour, dagStart.minute) <= (ingezameldop.hour, ingezameldop.minute) < (dagEind.hour, dagEind.minute)
then apply it to the column in this way:
data['filter'] = data['ingezameldop'].apply(filter_time, dagStart=dagStart, dagEind=dagEind)
That applies the function to each individual series element, which must be in datetime format.
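A runnable sketch of the whole filter, using a small hypothetical frame standing in for the data read from 'inzamelbestand.xlsx'. (One likely reason the original attempt returned an identical file is that every record already fell inside 06:00-23:00.)

```python
import pandas as pd

# Hypothetical stand-in for the data read from 'inzamelbestand.xlsx'
data = pd.DataFrame({'ingezameldop': ['2022-10-10 05:30:00',
                                      '2022-10-10 15:51:18',
                                      '2022-10-10 23:30:00']})
data['ingezameldop'] = pd.to_datetime(data['ingezameldop'])

# Keep only rows whose clock time falls between 06:00 and 23:00
mask = data['ingezameldop'].dt.time.between(pd.Timestamp('06:00:00').time(),
                                            pd.Timestamp('23:00:00').time())
filtered = data[mask]
print(filtered)
```

Only the 15:51:18 row survives; the 05:30 and 23:30 rows fall outside the window.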

Python: slice yearly data between February and June with pandas

I have a dataset with 10 years of data from 2000 to 2010, starting at 2000-01-01, with the data resampled to daily. I also have a weekly counter so that when I slice, I can ask for week 5 to week 21 only (roughly February 1 to May 30).
I am a little stuck on how to slice that window out of every year. Does it involve a loop, or is there a timeseries function in pandas that knows to slice a specific period in every year? Below is the code I have so far; I had a for loop that was supposed to slice(5, 21), but that didn't work.
Any suggestions on how I might get this to work?
import pandas as pd
from datetime import datetime, timedelta
initial_datetime = pd.to_datetime("2000-01-01")
# Read the file
df = pd.read_csv("D:/tseries.csv")
# Convert seconds to datetime
df["Time"] = df["Time"].map(lambda dt: initial_datetime+timedelta(seconds=dt))
df = df.set_index(pd.DatetimeIndex(df["Time"]))
resampling_period = "24H"
df = df.resample(resampling_period).mean().interpolate()
df["Week"] = df.index.map(lambda dt: dt.week)
print(df)
You can slice using loc:
df.loc[df.Week.isin(range(5,22))]
If you want separate calculations per year (f.e. mean), you can use groupby:
subset = df.loc[df.Week.isin(range(5,22))]
subset.groupby(subset.index.year).mean()
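A self-contained sketch of the isin-plus-groupby approach, with synthetic daily data standing in for the resampled series from the question (and isocalendar().week in place of the deprecated .week attribute):

```python
import pandas as pd

# Synthetic stand-in for the resampled daily series in the question
idx = pd.date_range("2000-01-01", "2002-12-31", freq="D")
df = pd.DataFrame({"value": range(len(idx))}, index=idx)
df["Week"] = df.index.isocalendar().week

# Keep only weeks 5-21 of every year, then aggregate per year
subset = df.loc[df.Week.isin(range(5, 22))]
yearly_mean = subset.groupby(subset.index.year)["value"].mean()
print(yearly_mean)
```

The boolean mask picks out the February-May window in every year at once, so no explicit loop over years is needed.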

Am I doing something wrong with the loops?

I am using Python to do some data cleaning, and I've used the datetime module to split date and time and tried to create another column with just the time.
My script runs, but it just fills the column with the last value of the data frame.
Here is the code:
import datetime

i = 0
for index, row in df.iterrows():
    date = datetime.datetime.strptime(df.iloc[i, 0], "%Y-%m-%dT%H:%M:%SZ")
    df['minutes'] = date.minute
    i = i + 1
This is the dataframe: (screenshot of the dataframe omitted)
df['minutes'] = date.minute reassigns the entire 'minutes' column with the scalar value date.minute from the last iteration.
You don't need a loop here, as in 99% of pandas use cases.
You can use vectorized assignment; just replace 'source_column_name' with the name of the column holding the source data:
df['minutes'] = pd.to_datetime(df['source_column_name'], format='%Y-%m-%dT%H:%M:%SZ').dt.minute
It is also likely that you won't need to specify format, as pd.to_datetime is fairly smart.
Quick example:
df = pd.DataFrame({'a': ['2020.1.13', '2019.1.13']})
df['year'] = pd.to_datetime(df['a']).dt.year
print(df)
outputs
a year
0 2020.1.13 2020
1 2019.1.13 2019
It seems like you're trying to get the time column from the datetime, which is in string format. That's what I understood from your post.
Could you give this a shot?
from datetime import datetime
import pandas as pd

def get_time(date_cell):
    dt = datetime.strptime(date_cell, "%Y-%m-%dT%H:%M:%SZ")
    return datetime.strftime(dt, "%H:%M:%SZ")

df['time'] = df['date_time'].apply(get_time)
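The same extraction can also be done without apply, using the vectorized .dt accessor. A sketch, assuming a column named 'date_time' as in the answer above:

```python
import pandas as pd

# Hypothetical data; the column name 'date_time' is an assumption
df = pd.DataFrame({'date_time': ['2021-01-13T10:15:30Z', '2021-01-13T11:45:05Z']})
# Parse the whole column at once, then format just the time portion
df['time'] = pd.to_datetime(df['date_time'], format='%Y-%m-%dT%H:%M:%SZ').dt.strftime('%H:%M:%S')
print(df)
```

This avoids the per-row strptime call, which matters on larger frames.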

Find date and value of a column maximum using pandas groupby

I'm trying to find the dates when the Wipro close price was at its maximum per year. (What date and what price?) Here's an example of some code I've tried:
import pandas as pd
import numpy as np
from nsepy import get_history
import datetime as dt
start = dt.datetime(2015, 1, 1)
end = dt.datetime.today()
wipro=get_history(symbol='WIPRO', start = start, end = end)
wipro.index = pd.to_datetime(wipro.index)
# This should get me my grouped results
wipro_agg = wipro.groupby(wipro.index.year).Close.idxmax()
Solving this problem requires two steps: first, get the max price each year; then, find the exact date of that maximum.
# Find max price each year
# (note: double brackets keep the result as a DataFrame,
#  and a DatetimeIndex has .year directly, not .dt.year)
wipro_max_yr = wipro.groupby(wipro.index.year)[['Close']].max()
# Now do an inner join on Close to recover the exact dates
# (reset_index keeps the date column available in the merge)
wipro_max_dates = wipro_max_yr.merge(wipro.reset_index(), how='inner', on='Close')
You can simply call "max" the same way you called "idxmax":
In [25]: df_ids = pd.DataFrame(wipro.groupby(wipro.index.year).Close.idxmax())
In [26]: df_ids['price'] = wipro.groupby(wipro.index.year).Close.max()
In [27]: df_ids.rename({'Close': 'date'}, axis= 1).set_index('date')
Out[27]:
price
date
2015-03-03 672.45
2016-04-20 601.25
2017-06-06 560.55
2018-12-19 340.70
2019-02-26 387.65
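The idxmax result can also be fed straight into loc to pull the full rows. A minimal sketch with synthetic prices standing in for the nsepy history:

```python
import pandas as pd

# Synthetic stand-in for the 'wipro' price history from nsepy
idx = pd.to_datetime(['2015-03-03', '2015-06-01', '2016-04-20', '2016-07-01'])
wipro = pd.DataFrame({'Close': [672.45, 500.0, 601.25, 550.0]}, index=idx)

# idxmax returns the date label of each yearly maximum;
# loc then pulls the full rows, so date and price come out together
max_dates = wipro.groupby(wipro.index.year)['Close'].idxmax()
print(wipro.loc[max_dates])
```

This sidesteps the merge entirely: the index labels returned by idxmax are exactly the dates being sought.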

Pandas; transform column with MM:SS,decimals into number of seconds

Hey: I spent several hours trying to do a quite simple thing but couldn't figure it out.
I have a dataframe with a column, df['Time'] which contains time, starting from 0, up to 20 minutes,like this:
1:10,10
1:16,32
3:03,04
The first part is minutes, the second is seconds, and the third is hundredths of a second (only two digits).
Is there a way to automatically transform that column into seconds with Pandas, and without making that column the time index of the series?
I already tried the following, but it won't work:
pd.to_datetime(df['Time']).convert('s') # AttributeError: 'Series' object has no attribute 'convert'
If the only way is to parse the time manually, just point that out and I will prepare a proper / detailed answer to this question; don't waste your time =)
Thank you!
Code:
import pandas as pd
import numpy as np
import datetime
df = pd.DataFrame({'Time':['1:10,10', '1:16,32', '3:03,04']})
df['time'] = df.Time.apply(lambda x: datetime.datetime.strptime(x,'%M:%S,%f'))
df['timedelta'] = df.time - datetime.datetime.strptime('00:00,0','%M:%S,%f')
df['secs'] = df['timedelta'].apply(lambda x: x / np.timedelta64(1, 's'))
print(df)
Output:
Time time timedelta secs
0 1:10,10 1900-01-01 00:01:10.100000 00:01:10.100000 70.10
1 1:16,32 1900-01-01 00:01:16.320000 00:01:16.320000 76.32
2 3:03,04 1900-01-01 00:03:03.040000 00:03:03.040000 183.04
If you also have negative time deltas:
import pandas as pd
import numpy as np
import datetime
import re

regex = re.compile(r"(?P<minus>-)?((?P<minutes>\d+):)?(?P<seconds>\d+)(,(?P<centiseconds>\d{2}))?")

def parse_time(time_str):
    parts = regex.match(time_str)
    if not parts:
        return None
    parts = parts.groupdict()
    time_params = {}
    for (name, param) in parts.items():  # iteritems() was Python 2 only
        if param and (name != 'minus'):
            time_params[name] = int(param)
    # timedelta has no 'centiseconds' argument, so convert to milliseconds
    if 'centiseconds' in time_params:
        time_params['milliseconds'] = time_params.pop('centiseconds') * 10
    return (-1 if parts['minus'] else 1) * datetime.timedelta(**time_params)

df = pd.DataFrame({'Time': ['-1:10,10', '1:16,32', '3:03,04']})
df['timedelta'] = df.Time.apply(parse_time)
df['secs'] = df['timedelta'].apply(lambda x: x / np.timedelta64(1, 's'))
print(df)
Output:
Time timedelta secs
0 -1:10,10 -00:01:10.100000 -70.10
1 1:16,32 00:01:16.320000 76.32
2 3:03,04 00:03:03.040000 183.04
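If negative values never occur, a shorter route (a sketch relying on pandas' own timedelta string parser) converts the column directly, with no datetime round-trip:

```python
import pandas as pd

df = pd.DataFrame({'Time': ['1:10,10', '1:16,32', '3:03,04']})
# Rewrite M:SS,cc as 0:M:SS.cc so pd.to_timedelta can parse it directly
td = pd.to_timedelta('0:' + df['Time'].str.replace(',', '.', regex=False))
df['secs'] = td.dt.total_seconds()
print(df)
```

The prepended '0:' supplies the hours field that the hh:mm:ss parser expects, and total_seconds() does the final conversion.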
