I am trying to get the weeks between two dates and split them into rows by week. Here is the error message I got:
can only concatenate str (not "datetime.timedelta") to str
Can anyone help with this one? Thanks!
import datetime
import pandas as pd

df = pd.read_csv(r'C:\Users\xx.csv')
print(df)

# Convert dataframe columns to dates
df['Start Date'] = pd.to_datetime(df['start_date'])
df['End Date'] = pd.to_datetime(df['end_date'])

df_out = pd.DataFrame()
week = 7

# Iterate over dataframe rows
for index, row in df.iterrows():
    date = row["start_date"]
    date_end = row["end_date"]
    dealtype = row["deal_type"]
    ppg = row["PPG"]
    # Get the weeks for the row
    while date < date_end:
        date_next = date + datetime.timedelta(week - 1)
        df_out = df_out.append([[dealtype, ppg, date, date_next]])
        date = date_next + datetime.timedelta(1)

# Remove extra index and assign columns as original dataframe
df_out = df_out.reset_index(drop=True)
df_out.columns = df.columns
df.to_csv(r'C:\Users\Output.csv', index=None)
In your loop, date comes from row["start_date"], which is still a string, while datetime.timedelta(week - 1) is a datetime.timedelta object, so the two cannot be added with +. Both of these objects can be converted to a string by using str().
If you really want string concatenation, simply wrap each operand with str():

date_next = str(date) + str(datetime.timedelta(week - 1))
You converted the start_date and end_date columns to datetime, but you stored the converted values under the new names Start Date and End Date. Then, in the loop, you fetch row["start_date"], which is still a string. If you want to REPLACE the start_date column, then don't give it a new name. Spelling matters.
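Putting that fix together, here is a minimal sketch of the intended weekly split. The input frame is made up, since the original CSV isn't shown; also, DataFrame.append was removed in pandas 2.0, so plain lists are collected instead:

```python
import pandas as pd

# hypothetical stand-in for the question's CSV
df = pd.DataFrame({
    "deal_type": ["Promo"],
    "PPG": ["X1"],
    "start_date": ["2023-01-01"],
    "end_date": ["2023-01-20"],
})

# overwrite the original columns instead of adding 'Start Date' / 'End Date'
df["start_date"] = pd.to_datetime(df["start_date"])
df["end_date"] = pd.to_datetime(df["end_date"])

rows = []
for _, row in df.iterrows():
    date = row["start_date"]
    # emit one row per 7-day window
    while date < row["end_date"]:
        date_next = date + pd.Timedelta(days=6)
        rows.append([row["deal_type"], row["PPG"], date, date_next])
        date = date_next + pd.Timedelta(days=1)

df_out = pd.DataFrame(rows, columns=["deal_type", "PPG", "start_date", "end_date"])
print(df_out)
```

The last partial window runs past end_date, matching the original loop's behavior.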
Related
I have a data frame with a column Campaign which consists of the campaign name in (start date - end date) format. I need to create 3 new columns by extracting the start and end dates:
start_date, end_date, days_between_start_and_end_date.
The issue is that the Campaign column value is not in a fixed format; for the below value my code block works well.
1. Season1 hero (18.02. -24.03.2021)
What I am doing in my code snippet is extracting the start and end dates from the Campaign column. As you can see, the start date doesn't have a year, so I add the year by checking the month value.
import pandas as pd
import re
import datetime

# read csv file
df = pd.read_csv("report.csv")

# extract start and end dates from the 'Campaign' column
dates = df['Campaign'].str.extract(r'(\d+\.\d+)\.\s*-\s*(\d+\.\d+\.\d+)')
df['start_date'] = dates[0]
df['end_date'] = dates[1]

# convert start and end dates to datetime format
df['start_date'] = pd.to_datetime(df['start_date'], format='%d.%m')
df['end_date'] = pd.to_datetime(df['end_date'], format='%d.%m.%Y')

# Add year to start date
for index, row in df.iterrows():
    if pd.isna(row["start_date"]) or pd.isna(row["end_date"]):
        continue
    start_month = row["start_date"].month
    end_month = row["end_date"].month
    year = row["end_date"].year
    if start_month > end_month:
        year = year - 1
    dates_str = str(row["start_date"].strftime("%d.%m")) + "." + str(year)
    df.at[index, "start_date"] = pd.to_datetime(dates_str, format="%d.%m.%Y")
    dates_str = str(row["end_date"].strftime("%d.%m")) + "." + str(row["end_date"].year)
    df.at[index, "end_date"] = pd.to_datetime(dates_str, format="%d.%m.%Y")
But I have multiple different column values where my regex fails and I receive NaN values, for example:
1. Sales is on (30.12.21-12.01.2022)
2. Sn 2 Fol CAMPAIGN A (24.03-30.03.2023)
3. M SALE (19.04 - 04.05.2022)
4. NEW SALE (29.12.2022-11.01.2023)
5. Year End (18.12. - 12.01.2023)
6. XMAS 1 - THE TRIBE CELEBRATES XMAS (18.11.-08.12.2021) (gifting communities)
In each of the above examples, the date format is different.
expected output

start date    end date
2021-12-30    2022-01-12
2023-03-24    2023-03-30
2022-04-19    2022-05-04
2022-12-29    2023-01-11
2022-12-18    2023-01-12
2021-11-18    2021-12-08
Can someone please help me here?
Since the datetimes in the data don't have a fixed format (some are dd.mm.yy, some are dd.mm.YYYY), it might be better to apply a custom parser function that uses try-except. We could certainly do two conversions using pd.to_datetime and choose values using np.where etc., but it might not save any time given we'd need to do a lot of string manipulation beforehand.
To append the missing years for some rows: since pandas string methods are not optimized and we'd need a few of them (str.count(), str.cat(), etc.), it's probably better to use Python string methods in a loop instead.
Also, iterrows() is notoriously slow, so a plain Python loop is much faster.
pd.to_datetime converts each element into datetime.datetime objects anyways, so we can use datetime.strptime from the built-in module to perform the conversions.
from datetime import datetime

def datetime_parser(date, end_date=None):
    # remove space around dates
    date = date.strip()
    # if the start date doesn't have a year, append it from the end date
    dmy = date.split('.')
    if end_date and len(dmy) == 2:
        # no year at all, e.g. '24.03' -> borrow the end date's year
        date = f"{date}.{end_date.rsplit('.', 1)[1]}"
    elif end_date and not dmy[-1]:
        # trailing dot, e.g. '18.12.' -> pick the year by comparing months
        edmy = end_date.split('.')
        if int(dmy[1]) > int(edmy[1]):
            date = f"{date}{int(edmy[-1])-1}"
        else:
            date = f"{date}{edmy[-1]}"
    try:
        # try 'dd.mm.yy' format (e.g. 30.12.21) first; '%Y' would also match a
        # two-digit year and silently produce year 21, so '%y' must come first
        return datetime.strptime(date, '%d.%m.%y')
    except ValueError:
        # fall back to 'dd.mm.YYYY' format (e.g. 29.12.2022)
        return datetime.strptime(date, '%d.%m.%Y')
# extract dates into 2 columns (tentatively start and end dates)
splits = df['Campaign'].str.extract(r"\((.*?)-(.*?)\)").values.tolist()
# parse the dates
df[['start_date', 'end_date']] = [[datetime_parser(start, end), datetime_parser(end)] for start, end in splits]
# find difference
df['days_between_start_and_end_date'] = df['end_date'] - df['start_date']
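To sanity-check the parser, the whole pipeline can be run on a few of the question's sample campaign names. The DataFrame here is built inline for illustration, and the two strptime formats are tried two-digit-year first, since '%d.%m.%Y' would also accept a two-digit year:

```python
from datetime import datetime
import pandas as pd

def datetime_parser(date, end_date=None):
    date = date.strip()
    dmy = date.split('.')
    if end_date and len(dmy) == 2:
        # no year at all, e.g. '24.03' -> borrow the end date's year
        date = f"{date}.{end_date.strip().rsplit('.', 1)[1]}"
    elif end_date and not dmy[-1]:
        # trailing dot, e.g. '18.12.' -> pick the year by comparing months
        edmy = end_date.strip().split('.')
        if int(dmy[1]) > int(edmy[1]):
            date = f"{date}{int(edmy[-1]) - 1}"
        else:
            date = f"{date}{edmy[-1]}"
    try:
        # '%y' first: '%Y' would also accept a two-digit year
        return datetime.strptime(date, '%d.%m.%y')
    except ValueError:
        return datetime.strptime(date, '%d.%m.%Y')

df = pd.DataFrame({'Campaign': [
    'Sales is on (30.12.21-12.01.2022)',
    'Sn 2 Fol CAMPAIGN A (24.03-30.03.2023)',
    'Year End (18.12. - 12.01.2023)',
]})
splits = df['Campaign'].str.extract(r"\((.*?)-(.*?)\)").values.tolist()
df[['start_date', 'end_date']] = [
    [datetime_parser(s, e), datetime_parser(e)] for s, e in splits]
print(df[['start_date', 'end_date']])
```

The third row gets 2022 for its start year because its start month (12) is greater than its end month (1).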
I would do a basic regex with extract and then perform slicing:

ser = df["Campaign"].str.extract(r"\((.*)\)", expand=False)
end_date = ser.str.strip().str[-10:]
# or ser.str.strip().str.rsplit("-").str[-1]
start_date = ser.str.strip().str.split(r"\s*-\s*", regex=True).str[0]

NB: you can assign the Series start_date and end_date to create your two new columns.
Output:

1.0    30.12.21     # <- start_date
2.0    24.03
3.0    19.04
4.0    29.12.2022
Name: Campaign, dtype: object

1.0    12.01.2022   # <- end_date
2.0    30.03.2023
3.0    04.05.2022
4.0    11.01.2023
Name: Campaign, dtype: object
I have a dataframe with data per second; the original format of that data is '%H:%M:%S'. However, when I used pd.to_datetime, a default date was automatically added to that column.
I would like to change that default date to the values I obtain from the csv file as year, month and day, which I formatted as '%Y-%m-%d'.
I do not know how to get the right date into the datetime column I set as index. Note: the date must be the same for every row because it is daily data.
from datetime import datetime
import pandas as pd

df = pd.read_csv(url, header=None, index_col=0)
year = int(df.iloc[2][1])
month = int(df.iloc[2][2])
day = int(df.iloc[2][3])

df.index.name = None
df.drop(index=df.iloc[:7, :].index.tolist(), inplace=True)
df.drop(columns=df.columns[-1], inplace=True)
df.columns = ['Name Column 1', 'Name Column 2']

d = pd.to_datetime(datetime(year, month, day).date(), format='%Y-%m-%d')
df.index = pd.to_datetime(df.index, format='%H:%M:%S')
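One way to finish this is to keep only the time-of-day from the parsed index and add it to the desired date. A sketch on placeholder data (the index values and year/month/day below are made up):

```python
import pandas as pd
from datetime import datetime

# stand-in for the time-only index parsed from the CSV
df = pd.DataFrame({'Name Column 1': [1, 2], 'Name Column 2': [3, 4]})
df.index = pd.to_datetime(['00:00:01', '00:00:02'], format='%H:%M:%S')

# year/month/day as read from the file header (placeholder values)
year, month, day = 2021, 5, 17

# keep the time-of-day, swap in the real date:
# subtracting the normalized index leaves pure time offsets
base = pd.Timestamp(datetime(year, month, day))
df.index = base + (df.index - df.index.normalize())
print(df.index)
```

normalize() drops the default 1900-01-01 date pandas attached, so only the hours/minutes/seconds are re-added to the real date.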
I am trying to set new columns (day of year & hour).
My datetime consists of a date and an hour. I tried to split it up by using

data['dayofyear'] = data['Date'].dt.dayofyear

and

df['Various', 'Day'] = df.index.dayofyear
df['Various', 'Hour'] = df.index.hour

but it always returns an error; I'm not sure how I can split this up and get it into a new column.
I think the problem is that there is no DatetimeIndex, so use to_datetime first and then assign to new column names:
df.index = pd.to_datetime(df.index)
df['Day'] = df.index.dayofyear
df['Hour'] = df.index.hour
Or use DataFrame.assign:
df.index = pd.to_datetime(df.index)
df = df.assign(Day = df.index.dayofyear, Hour = df.index.hour)
I imported a worksheet from Google Sheets which happens to have a timestamp in string format in the ['Timestamp'] column. To filter by date and select some rows, I've created a variable that holds today's date (diaHoy) and another with the day before (diaAyer).
Then I'm trying to apply a mask which compares diaHoy and diaAyer with each timestamp element, but I can't because diaHoy and diaAyer are datetime objects while each timestamp cell is a string. I've tried applying strptime to the ['Timestamp'] column, but I can't because it's a list.
Sample data:

df = pd.DataFrame({'16/10/2019 14:56:36': ['A', 'B'], '21/10/2019 14:56:36': ['C', 'D'], '21/10/2019 14:56:36': ['E', 'F']})
diaHoy = 2019/10/21
diaAyer = 2019/10/20
from datetime import datetime, timedelta
import pandas as pd

diaHoy = datetime.today().date()
diaAyer = diaHoy + timedelta(days=-1)

wks1 = gc.open_by_url("CODE_URL").sheet1
df1 = wks1.get_all_values()
df1.pop(0)

mask1 = (df1 > diaAyer) & (df1 <= diaHoy)
pegado1 = df1.loc[mask1]
I expect the mask to filter out rows by the dates in the first column, comparing them with diaHoy and diaAyer.
Filter: between 21/10/2019 and 20/10/2019
Expected result:

df = pd.DataFrame({'21/10/2019 14:56:36': ['C', 'D'], '21/10/2019 14:56:36': ['E', 'F']})
You can convert the timestamp-string column labels to datetime objects:
import pandas as pd
df2 = pd.DataFrame({pd.to_datetime(key):df[key] for key in df})
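Extending that idea, the converted column labels can then be filtered by date. The sample below adjusts the question's data so the keys are unique (duplicate dict keys would collapse), and dayfirst=True is assumed for the dd/mm/yyyy strings:

```python
import pandas as pd

# sample frame keyed by timestamp strings, as in the question (keys made unique)
df = pd.DataFrame({'16/10/2019 14:56:36': ['A', 'B'],
                   '20/10/2019 09:00:00': ['C', 'D'],
                   '21/10/2019 14:56:36': ['E', 'F']})

# convert the string labels to Timestamps
df2 = pd.DataFrame({pd.to_datetime(key, dayfirst=True): df[key] for key in df})
df2.columns = pd.DatetimeIndex(df2.columns)

# keep only the columns whose date falls on diaHoy
diaHoy = pd.Timestamp('2019-10-21')
filtered = df2.loc[:, df2.columns.normalize() == diaHoy]
print(filtered)
```

normalize() strips the time component of each label, so the comparison matches any timestamp on that day regardless of the hour.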
I have a file which contains a date column, and I want to check that the datetime column is in a specific range. E.g., I get 5 files per day (which I don't control), and I need to pick the file whose readings fall near midnight.
All rows in that particular file differ by a minute (they are consecutive readings, never more than a minute apart).
Using pandas, I load the date column as follows:
def read_dipsfile(writer):
    atg_path = '/Users/ratha/PycharmProjects/DataLoader/data/dips'
    files = os.listdir(atg_path)
    df = pd.DataFrame()
    dateCol = ['Dip Time']
    for f in files:
        if f.endswith('.CSV'):
            data = pd.read_csv(os.path.join(atg_path, f), delimiter=',', skiprows=[1],
                               skipinitialspace=True, parse_dates=dateCol)
            if mid_day_check(data['Dip Time']):  # <-- gives error
                df = df.append(data)

def mid_day_check(startTime):
    midnightTime = datetime.datetime.strptime(startTime, '%Y%m%d')
    hourbefore = datetime.datetime.strptime(startTime, '%Y%m%d') + datetime.timedelta(hours=-1)
    if startTime <= midnightTime and startTime >= hourbefore:
        return True
    else:
        return False
In the above code, how can I pass the column to my function? Currently I get the following error:

midnightTime = datetime.datetime.strptime(startTime, '%Y%m%d')
TypeError: strptime() argument 1 must be str, not Series

How can I check a time range using a pandas date column?
I think you need:
def mid_day_check(startTime):
    # remove times
    midnightTime = startTime.dt.normalize()
    # add timedelta
    hourbefore = midnightTime + pd.Timedelta(hours=-1)
    # test with between and return True if at least one value matches
    return startTime.between(hourbefore, midnightTime).any()
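To see what the check does, here is the same function exercised on two made-up minute-by-minute series; only the one that reaches midnight returns True:

```python
import pandas as pd

def mid_day_check(startTime):
    # midnight at the start of each reading's own day
    midnightTime = startTime.dt.normalize()
    # one hour before that midnight
    hourbefore = midnightTime + pd.Timedelta(hours=-1)
    # True if any reading falls in the [hourbefore, midnight] window
    return startTime.between(hourbefore, midnightTime).any()

# readings one minute apart that reach midnight
night = pd.Series(pd.date_range('2021-01-01 23:30', periods=31, freq='min'))
# readings in the middle of the day
noon = pd.Series(pd.date_range('2021-01-01 12:00', periods=31, freq='min'))
print(mid_day_check(night), mid_day_check(noon))
```

Note that only the reading at exactly 00:00 satisfies the test, since each value is compared against the midnight of its own day; that is enough to flag a file whose readings cross midnight.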
It seems you are trying to pass a pandas Series to strptime(), which is invalid.
You can use pd.to_datetime() to achieve the same:

pd.to_datetime(data['Dip Time'], format='%b %d, %Y')

Check these links for an explanation:
strptime
conversion from series
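For illustration, converting a whole Series in one call (the sample values are made up; the real 'Dip Time' column may use a different format string):

```python
import pandas as pd

# hypothetical 'Dip Time' values in '%b %d, %Y' format
data = pd.DataFrame({'Dip Time': ['Jan 01, 2021', 'Feb 15, 2021']})

# pd.to_datetime handles the whole Series at once, unlike strptime
data['Dip Time'] = pd.to_datetime(data['Dip Time'], format='%b %d, %Y')
print(data['Dip Time'].dtype)
```

Once the column is datetime64, vectorized comparisons and the .dt accessor become available, which is what the accepted answer's mid_day_check relies on.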