Dealing with 'OutOfBoundsDatetime: Out of bounds nanosecond timestamp:' - python

I am working on a DataFrame looking at baseball games Date and their Attendance so I can create a Calendar Heatmap.
Date Attendance
1 Apr 7 44723.0
2 Apr 8 42719.0
3 Apr 9 36139.0
4 Apr 10 41253.0
5 Apr 11 20480.0
I've tried different solutions that I've come across...
- df['Date'] = df['Date'].astype('datetime64[ns]')
- df['Date'] = pd.to_datetime(df['Date'])
but I'll get the error of
'Out of bounds nanosecond timestamp: 1-04-07 00:00:00'.
From looking at my data, I don't even have a date that matches that timestamp. I also looked at other posts on this site, and one potential problem is that my dates are NOT zero-padded. Could that be the cause?

You can convert to datetime if you supply a format. For example:
df
Out[33]:
Date Attendance
1 Apr 7 44723.0
2 Apr 8 42719.0
3 Apr 9 36139.0
4 Apr 10 41253.0
5 Apr 11 20480.0
pd.to_datetime(df["Date"], format="%b %d")
Out[35]:
1 1900-04-07
2 1900-04-08
3 1900-04-09
4 1900-04-10
5 1900-04-11
Name: Date, dtype: datetime64[ns]
If you're unhappy with the base year 1900, you can add a date offset, for example
df["datetime"] = pd.to_datetime(df["Date"], format="%b %d")
df["datetime"] += pd.tseries.offsets.DateOffset(years=100)
df["datetime"]
1 2000-04-07
2 2000-04-08
3 2000-04-09
4 2000-04-10
5 2000-04-11
Name: datetime, dtype: datetime64[ns]
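Since these are regular-season games, another option is to shift the parsed dates to the actual season year instead of an arbitrary offset. A minimal sketch, assuming (hypothetically) that the games are from the 2018 season:
import pandas as pd

df = pd.DataFrame({"Date": ["Apr 7", "Apr 8", "Apr 9", "Apr 10", "Apr 11"],
                   "Attendance": [44723.0, 42719.0, 36139.0, 41253.0, 20480.0]})

# parse month and day; pandas fills in the placeholder year 1900
df["datetime"] = pd.to_datetime(df["Date"], format="%b %d")
# shift every timestamp from the placeholder year to the assumed season year
df["datetime"] += pd.DateOffset(years=2018 - 1900)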

Related

I am working with a dataset where a few values in the Date column are like ' 2 Months Ago' or ' 28 Days Ago'. How can I change these dates to Month-Year?

I am thinking of trying an if condition, but is there a library or method I don't know about that could solve this?
For values containing month(s), extract the number of months and subtract it from the current month period; for values with days, convert them to timedeltas and subtract them from the current date using Series.rsub (subtract from the right side), then assign the results back to the DataFrame:
print (df)
col
0 28 days ago
1 4 months ago
2 11 months ago
3 Oct, 2021
now = pd.Timestamp('now')
per = now.to_period('m')
date = now.floor('d')
s = df['col'].str.extract(r'(\d+)\s*month', expand=False).astype(float)
s1 = df['col'].str.extract(r'(\d+)\s*day', expand=False).astype(float)
mask, mask1 = s.notna(), s1.notna()
df.loc[mask, 'col'] = s[mask].astype(int).rsub(per).dt.strftime('%b, %Y')
df.loc[mask1, 'col'] = pd.to_timedelta(s1[mask1], unit='d').rsub(date).dt.strftime('%b, %Y')
print (df)
col
0 Sep, 2022
1 Jun, 2022
2 Nov, 2021
3 Oct, 2021
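To see what the intermediate pieces hold, here is a quick sketch with now pinned to a fixed (made-up) date so the output is reproducible:
import pandas as pd

df = pd.DataFrame({"col": ["28 days ago", "4 months ago", "11 months ago", "Oct, 2021"]})

now = pd.Timestamp("2022-10-26")   # pinned instead of pd.Timestamp('now')
per = now.to_period("M")           # Period('2022-10', 'M'), month granularity
date = now.floor("d")              # midnight of the pinned day

# number of months mentioned in each row (NaN where no 'month' appears)
s = df["col"].str.extract(r"(\d+)\s*month", expand=False).astype(float)
print(s)   # NaN, 4.0, 11.0, NaN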
Assuming this input:
col
0 4 months ago
1 Oct, 2021
2 9 months ago
You can use:
# try to get a date:
s = pd.to_datetime(df['col'], errors='coerce')

# extract the month offset
offset = (df['col']
          .str.extract(r'(\d+) months? ago', expand=False)
          .fillna(0).astype(int)
          )

# if the date is NaT, replace it with today - n months
df['date'] = s.fillna(pd.Timestamp('today').normalize()
                      - offset*pd.DateOffset(months=1))
If you want a Mon, Year format:
df['date2'] = df['col'].where(offset.eq(0),
                              (pd.Timestamp('today').normalize()
                               - offset*pd.DateOffset(months=1)
                               ).dt.strftime('%b, %Y')
                              )
output:
col date date2
0 4 months ago 2022-06-28 Jun, 2022
1 Oct, 2021 2021-10-01 Oct, 2021
2 9 months ago 2022-01-28 Jan, 2022
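If you prefer something more explicit, a plain row-wise helper also works; a rough sketch (the function name parse_relative is my own, and the sample values come from the question):
import pandas as pd

def parse_relative(value, today=None):
    # turn '4 Months Ago' / '28 Days Ago' / 'Oct, 2021' into a 'Mon, YYYY' string
    if today is None:
        today = pd.Timestamp("today").normalize()
    value = value.strip().lower()
    if "month" in value:
        return (today - pd.DateOffset(months=int(value.split()[0]))).strftime("%b, %Y")
    if "day" in value:
        return (today - pd.Timedelta(days=int(value.split()[0]))).strftime("%b, %Y")
    return value.title()           # already a 'Mon, YYYY' string

df = pd.DataFrame({"col": [" 2 Months Ago", " 28 Days Ago", "Oct, 2021"]})
df["col"] = df["col"].apply(parse_relative)
print(df)
This is not vectorized, so it will be slower on large frames, but it is easy to extend with new patterns.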

Parse object index with date, time, and time zone

Python Q. How to parse an object index in a data frame into its date, time, and time zone?
The format is "YYYY-MM-DD HH:MM:SS-HH:MM",
where the trailing "HH:MM" is the timezone offset.
Example:
Midnight Jan 1st, 2020 in Mountain Time:
2020-01-01 00:00:00-07:00
I'm trying to convert this into seven columns in the data frame:
YYYY, MM, DD, HH, MM, SS, TZ
Use pd.to_datetime to parse a string column into a datetime array
datetimes = pd.to_datetime(column)
once you have this, you can access elements of the datetime object with the .dt datetime accessor:
final = pd.DataFrame({
    "year": datetimes.dt.year,
    "month": datetimes.dt.month,
    "day": datetimes.dt.day,
    "hour": datetimes.dt.hour,
    "minute": datetimes.dt.minute,
    "second": datetimes.dt.second,
    "timezone": datetimes.dt.tz,
})
See the pandas user guide section on date/time functionality for more info
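A minimal runnable sketch of the above, with a made-up two-row sample in the question's "YYYY-MM-DD HH:MM:SS-HH:MM" layout:
import pandas as pd

column = pd.Series(["2020-01-01 00:00:00-07:00", "2020-01-02 08:30:15-07:00"])

datetimes = pd.to_datetime(column)   # tz-aware datetime64 series
print(datetimes.dt.year)             # per-row components via the .dt accessor
print(datetimes.dt.hour)
print(datetimes.dt.tz)               # the single fixed UTC offset of the series
Note that .dt.tz is one value for the whole series, not a per-row column.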
df
Date
0 2022-05-01 01:10:04+07:00
1 2022-05-02 05:09:10+07:00
2 2022-05-02 11:22:05+07:00
3 2022-05-02 10:00:30+07:00
df['Date'] = pd.to_datetime(df['Date'])
df['tz']= df['Date'].dt.tz
df['year']= df['Date'].dt.year
df['month']= df['Date'].dt.month
df['month_n']= df['Date'].dt.month_name()
df['day']= df['Date'].dt.day
df['day_n']= df['Date'].dt.day_name()
df['h']= df['Date'].dt.hour
df['mn']= df['Date'].dt.minute
df['s']= df['Date'].dt.second
df['T']= df['Date'].dt.time
df['D']= df['Date'].dt.date
Date tz year month month_n day day_n h mn s T D
0 2022-05-01 01:10:04+07:00 pytz.FixedOffset(420) 2022 5 May 1 Sunday 1 10 4 01:10:04 2022-05-01
1 2022-05-02 05:09:10+07:00 pytz.FixedOffset(420) 2022 5 May 2 Monday 5 9 10 05:09:10 2022-05-02
2 2022-05-02 11:22:05+07:00 pytz.FixedOffset(420) 2022 5 May 2 Monday 11 22 5 11:22:05 2022-05-02
3 2022-05-02 10:00:30+07:00 pytz.FixedOffset(420) 2022 5 May 2 Monday 10 0 30 10:00:30 2022-05-02

Aggregate columns with same date (sum) in csv

My code is returning the following data in a CSV:
Quantity Date of purchase
1 17 May 2022 at 5:40:20PM BST
1 2 Apr 2022 at 7:41:29PM BST
1 2 Apr 2022 at 6:42:05PM BST
1 29 Mar 2022 at 12:34:56PM BST
1 29 Mar 2022 at 10:52:54AM BST
1 29 Mar 2022 at 12:04:52AM BST
1 28 Mar 2022 at 4:49:34PM BST
1 28 Mar 2022 at 11:13:37AM BST
1 27 Mar 2022 at 8:53:05PM BST
1 27 Mar 2022 at 5:10:21PM BST
I am trying to get only the dates and then add up the quantity values that share the same date. Below is the code I have for that:
data = read_csv("products_sold_history_data.csv")
data['Date of purchase'] = pandas.to_datetime(data['Date of purchase'] , format='%d-%m-%Y').dt.date
but it's giving me an error. Can anyone please help? How can I take only the dates from the 'Date of purchase' column and then add up the quantity values for the same date?
The date format in your data is not the format that you specified: format='%d-%m-%Y'.
You could specify it explicitly, or let pandas infer the format for you by not providing the format:
pandas.to_datetime(data['Date of purchase']).dt.date
If you want to specify the format explicitly, you should provide the format that matches your data:
pandas.to_datetime(data['Date of purchase'], format='%d %b %Y at %I:%M:%S%p %Z')
Here is one way to do it, where the date is created as an on-the-fly field rather than becoming a permanent column of the DataFrame.
Also, IIUC you're not concerned with the time part; only the date is needed for summing up.
Extract the date part using a regex, create a temporary field dte with assign, and then group by it to sum up the quantity:
df.assign(dte=pd.to_datetime(
          df['purchase'].str.extract(r'(.*)(at)')[0].str.strip())
          ).groupby('dte')['qty'].sum().reset_index()
dte qty
0 2022-02-06 3
1 2022-02-07 3
2 2022-02-08 2
3 2022-02-09 2
4 2022-02-10 2
5 2022-02-11 3
6 2022-02-14 1
7 2022-02-15 1
8 2022-02-19 1
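Applied to the column names from the question rather than the sample above, the same idea looks roughly like this (a sketch, not tested against the real file):
import pandas as pd

df = pd.read_csv("products_sold_history_data.csv")

result = (df.assign(dte=pd.to_datetime(
                    df["Date of purchase"].str.extract(r"(.*)\s+at\s")[0].str.strip()))
            .groupby("dte", as_index=False)["Quantity"].sum())
print(result)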

Python: Pandas dataframe get the year to which the week number belongs and not the year of the date

I have a csv-file: https://data.rivm.nl/covid-19/COVID-19_aantallen_gemeente_per_dag.csv
I want to use it to provide insight into the corona deaths per week.
df = pd.read_csv("covid.csv", error_bad_lines=False, sep=";")
df = df.loc[df['Deceased'] > 0]
df["Date_of_publication"] = pd.to_datetime(df["Date_of_publication"])
df["Week"] = df["Date_of_publication"].dt.isocalendar().week
df["Year"] = df["Date_of_publication"].dt.year
df = df[["Week", "Year", "Municipality_name", "Deceased"]]
df = df.groupby(by=["Week", "Year", "Municipality_name"]).agg({"Deceased" : "sum"})
df = df.sort_values(by=["Year", "Week"])
print(df)
Everything seems to be working fine except for the first 3 days of 2021. The first 3 days of 2021 are part of the last week (53) of 2020: http://week-number.net/calendar-with-week-numbers-2021.html.
When I print the dataframe this is the result:
53 2021 Winterswijk 1
Woudenberg 1
Zaanstad 1
Zeist 2
Zutphen 1
So basically what I'm looking for is a way where this line returns the year of the week number and not the year of the date:
df["Year"] = df["Date_of_publication"].dt.year
You can use dt.isocalendar().year to set up df["Year"]:
df["Year"] = df["Date_of_publication"].dt.isocalendar().year
You will get year 2020 for date of 2021-01-01 but will get back to year 2021 for date of 2021-01-04 by this.
This is just similar to how you used dt.isocalendar().week for setting up df["Week"]. Since they are both basing on the same tuple (year, week, day) returned by dt.isocalendar(), they would always be in sync.
Demo
date_s = pd.Series(pd.date_range(start='2021-01-01', periods=5, freq='1D'))
date_s
0   2021-01-01
1   2021-01-02
2   2021-01-03
3   2021-01-04
4   2021-01-05
dtype: datetime64[ns]
date_s.dt.isocalendar()
year week day
0 2020 53 5
1 2020 53 6
2 2020 53 7
3 2021 1 1
4 2021 1 2
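Put together with the code from the question (and assuming the same column names), the grouping then keys on the ISO year; a sketch:
import pandas as pd

df = pd.read_csv("covid.csv", sep=";")
df = df.loc[df["Deceased"] > 0]
df["Date_of_publication"] = pd.to_datetime(df["Date_of_publication"])

iso = df["Date_of_publication"].dt.isocalendar()   # columns: year, week, day
df["Year"], df["Week"] = iso["year"], iso["week"]

out = (df.groupby(["Year", "Week", "Municipality_name"])["Deceased"]
         .sum()
         .sort_index())
print(out)
With the ISO year, 2021-01-01 through 2021-01-03 land in week 53 of 2020 instead of a week 53 of 2021.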
You can simply subtract the two dates and then divide the days attribute of the timedelta object by 7.
For example, this is the current week we are on now.
time_delta = (dt.datetime.today() - dt.datetime(2021, 1, 1))
The output is a datetime timedelta object
datetime.timedelta(days=75, seconds=84904, microseconds=144959)
For your problem, you'd do something like this
time_delta = int((df["Date_of_publication"] - df["Year"]).days / 7)
The output would be a number that is the current week since date_of_publication

Converting date using to_datetime

I am still quite new to Python, so please excuse my basic question.
After resetting the index of a grouped pandas DataFrame, I get the following:
year month pl
0 2010 1 27.4376
1 2010 2 29.2314
2 2010 3 33.5714
3 2010 4 37.2986
4 2010 5 36.6971
5 2010 6 35.9329
I would like to merge year and month to one column in pandas datetime format.
I am trying:
C3['date']=pandas.to_datetime(C3.year + C3.month, format='%Y-%m')
But it gives me a date like this:
year month pl date
0 2010 1 27.4376 1970-01-01 00:00:00.000002011
What is the correct way? Thank you.
You need to convert to str if necessary, then zfill the month col and pass this with a valid format to to_datetime:
In [303]:
df['date'] = pd.to_datetime(df['year'].astype(str) + df['month'].astype(str).str.zfill(2), format='%Y%m')
df
Out[303]:
year month pl date
0 2010 1 27.4376 2010-01-01
1 2010 2 29.2314 2010-02-01
2 2010 3 33.5714 2010-03-01
3 2010 4 37.2986 2010-04-01
4 2010 5 36.6971 2010-05-01
5 2010 6 35.9329 2010-06-01
If the conversion is unnecessary then the following should work:
df['date'] = pd.to_datetime(df['year'] + df['month'].str.zfill(2), format='%Y%m')
Your attempt failed as it treated the value as epoch time:
In [305]:
pd.to_datetime(20101, format='%Y-%m')
Out[305]:
Timestamp('1970-01-01 00:00:00.000020101')
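As an aside, when the year and month columns are numeric, to_datetime can also assemble the date directly from a DataFrame of components, which avoids the string concatenation and zero-padding altogether; a small sketch:
import pandas as pd

C3 = pd.DataFrame({"year": [2010, 2010, 2010],
                   "month": [1, 2, 3],
                   "pl": [27.4376, 29.2314, 33.5714]})

# to_datetime accepts a frame with 'year', 'month', 'day' columns;
# fixing the day to 1 gives the first of each month
C3["date"] = pd.to_datetime(C3[["year", "month"]].assign(day=1))
print(C3)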
