A script to correct corrupted date values - python

My dataframe contains numerous incorrect datetime values that were fat-fingered in by the people who entered the data. The errors are mostly of the form 2019-11-12 entered as 0019-12-12, or 2018 entered as 0018. There are so many of them that I came up with a script to correct them en masse. I used the following code:
df['A'].loc[df.A.dt.year<100]=df.A.dt.year+2000
Basically, I want to tell Python to detect any year less than 100 and add 2000 to it. However, I am getting the error: "Out of bounds nanosecond timestamp: 19-11-19 00:00:00". Is there any solution to my problem? Thanks

This is because of the limitations of pandas timestamps: see this post about out of bounds nanosecond timestamps.
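A quick way to see those limits for yourself (a small sketch; the printed bounds are approximate):
import pandas as pd
# nanosecond-resolution timestamps only cover roughly the years 1677-2262,
# so a year such as 0019 cannot be represented at all
print(pd.Timestamp.min)  # approximately 1677-09-21
print(pd.Timestamp.max)  # approximately 2262-04-11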
Therefore, I suggest correcting the column as a string before turning it into a datetime column, as follows:
import pandas as pd
import re
df = pd.DataFrame({"A": ["2019-10-04", "0019-04-02", "0018-06-08", "2018-07-08"]})
# look for every date starting with a zero followed by another digit and replace that prefix with 20
r = re.compile(r"^0[0-9]")
df["A"] = df["A"].apply(lambda x: r.sub('20', x))
# then I transform to datetime
df["A"] = pd.to_datetime(df["A"], format='%Y-%m-%d')
df
Here is the result
A
0 2019-10-04
1 2019-04-02
2 2018-06-08
3 2018-07-08
Before applying this, make sure your data can only contain dates in 20XX (where X is any digit) and no legitimate dates in 19XX or earlier, since those would be rewritten as well.
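If you want to be stricter about what gets rewritten, a variant of the same idea (a sketch, assuming 0018 and 0019 are the only corrupted prefixes) uses a vectorized str.replace that only touches those two years:
import pandas as pd
df = pd.DataFrame({"A": ["2019-10-04", "0019-04-02", "0018-06-08", "2018-07-08"]})
# only years written as 0018 or 0019 are rewritten; everything else is left alone
df["A"] = df["A"].str.replace(r"^00(1[89])", r"20\1", regex=True)
df["A"] = pd.to_datetime(df["A"], format="%Y-%m-%d")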

Another option would be to export to CSV, make the changes there, and import it again.
df.to_csv('path/csvfile.csv')
# read the exported file, fix the bad year prefix, then write a corrected copy
with open("path/csvfile.csv", "r") as f:
    text = f.read().replace("0019-", "2019-")  # chain more .replace() calls for other prefixes such as "0018-"
with open("path/newcsv.csv", "w") as f:
    f.write(text)
df_new = pd.read_csv("path/newcsv.csv")
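If you would rather skip the temporary files, the same replacements can be made in memory (a sketch, assuming the date column is named A as in the question and that 0019/0018 are the only bad prefixes):
import pandas as pd
df_new = pd.read_csv("path/csvfile.csv")
# map each bad prefix to its correction and apply them to the date column
fixes = {"0019-": "2019-", "0018-": "2018-"}
for bad, good in fixes.items():
    df_new["A"] = df_new["A"].astype(str).str.replace(bad, good, regex=False)
df_new["A"] = pd.to_datetime(df_new["A"], format="%Y-%m-%d")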

Related

Create date from one year with string and int error - PYTHON

I have the following problem. I want to create one date from another. To do this, I extract the year from the date in the database and then build the chosen date (day = 30, month = 9) with that extracted year.
The code is the following:
bbdd20Q3['year']=(pd.DatetimeIndex(bbdd20Q3['datedaymonthyear']).year)
y=(bbdd20Q3['year'])
m=int(9)
d=int(30)
bbdd20Q3['mydate']=dt.datetime(y,m,d)
But the error message is this:
"cannot convert the series to <class 'int'>"
I think dt means datetime, so the line dt.datetime(y,m,d) creates a datetime object.
Should bbdd20Q3['mydate'] hold an int?
If so, try to think of another way to store the date (eight digits, maybe).
Hope I helped :)
I assume that you did import datetime as dt. Then, by doing:
bbdd20Q3['year']=(pd.DatetimeIndex(bbdd20Q3['datedaymonthyear']).year)
y=(bbdd20Q3['year'])
m=int(9)
d=int(30)
bbdd20Q3['mydate']=dt.datetime(y,m,d)
you are passing a Series as the first argument to datetime.datetime, when it expects an int (or something that can be converted to an int). You should create one datetime.datetime for each element of the Series, not a single datetime.datetime. Consider the following example:
import datetime
import pandas as pd
df = pd.DataFrame({"year":[2001,2002,2003]})
df["day"] = df["year"].apply(lambda x:datetime.datetime(x,9,30))
print(df)
Output:
year day
0 2001 2001-09-30
1 2002 2002-09-30
2 2003 2003-09-30
Here's sample code with the required logic:
import pandas as pd
df = pd.DataFrame.from_dict({'date': ['2019-12-14', '2020-12-15']})
print(df.dtypes)
# convert the date from string format to a datetime object;
# if the date column (Series) is already a datetime object, this is not required
df['date'] = pd.to_datetime(df['date'])
print(f'after conversion \n {df.dtypes}')
# logic to create the new date column
df['new_date'] = pd.to_datetime({'year':df['date'].dt.year,'month':9,'day':30})
#eollon I see that you are also new to Stack Overflow. It would be better if you could add simple sample code which others can try out independently.
(Keeping the comment here since I don't have permission to comment :) )

Date change halfway through csv from YYYY-MM-DD to DD/MM/YY and after switch datetime no longer works

I have a csv of daily temperature data with 3 columns: dates, daily maximum temperatures, and daily minimum temperatures. I attached it here so you can see what I mean.
I am trying to break this data set into smaller datasets of 30-year periods. For the first few years of Old.csv the dates are entered as YYYY-MM-DD, but then they switch to DD/MM/YY in 1900. After this format switch, my code to split the years no longer works. Here is what I'm using:
df2 = pd.read_csv("Old.csv")
test = df2[
(pd.to_datetime(df2['Date']) >
pd.to_datetime('1897-01-01')) &
(pd.to_datetime(df2['Date']) <
pd.to_datetime('1899-12-31'))
]
and it works... BUT when I switch to 1900 and beyond it stops. So this one doesn't work:
test = df2[
(pd.to_datetime(df2['Date']) >
pd.to_datetime('1900-01-01')) &
(pd.to_datetime(df2['Date']) <
pd.to_datetime('1905-12-31'))
]
The above code gives me an empty data set, despite working pre-1900. I'm assuming this is some sort of formatting issue, but I thought that using .to_datetime would fix that. I also tried this:
df2['Date']=pd.to_datetime(df2['Date'])
to reformat the entire list before I ran the code above, but it still didn't work. The other interesting thing is that I have a separate csv with dates consistently entered as MM/DD/YY, and that one works with the code above. Could it be an issue with the turn of the century? Does anyone know how to fix this?
You're dealing with date/time data in different formats; for this you could use a more flexible parser, for instance dateutil.parser.
Example:
>>> from dateutil.parser import parse
>>> df
Date
0 1897-01-01
1 1899-12-31
2 01/01/00
>>> df.Date.apply(parse)
0 1897-01-01
1 1899-12-31
2 2000-01-01
Name: Date, dtype: datetime64[ns]
and use your function on the parsed data.
As remarked in the comment above, it's still not clear whether year "00" refers to year 1900 or 2000, but maybe you can infer that from the context of the csv file.
To treat all years in the DD/MM/YY format as 1900s dates, you could define your own parse function:
>>> def my_parse(d):
... if d[-3]=='/':
... d = d[:-3]+'/19'+d[-2:]
... return parse(d)
>>> df.Date.apply(my_parse)
0 1897-01-01
1 1899-12-31
2 1900-01-01
Python is reading 00 as 2000 instead of 1900. So I tried this to edit 00 to read as 1900:
df2.Date.dt.year.replace(2000, 1900, inplace=True)
But python returned an error that said dates are not directly editable. So I then changed them to a string and edited that way using:
df2['Date'] = df2['Date'].str.replace(r'00', '1900')
This works but now I need to find a way to loop through 1896-1968 without having to type that line out every time.
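A rough way around typing one replace per year (a sketch, assuming every trailing two-digit year in the file really belongs to the 1900s, which holds here since the DD/MM/YY rows start in 1900): capture the trailing two digits and prefix them with 19 in a single regex.
import pandas as pd
s = pd.Series(["1897-01-01", "01/01/00", "15/06/05"])
# any date ending in /YY gets its year expanded to /19YY in one pass
fixed = s.str.replace(r"/(\d{2})$", r"/19\1", regex=True)
print(fixed.tolist())  # ['1897-01-01', '01/01/1900', '15/06/1905']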

Pandas.to_datetime giving an error when given 15-Jan-0001 is there a way around this?

I've got a dataset which goes back to 15-Jan-0001 (yes, that is 1 CE!). It was originally 0 CE, but since that year doesn't exist I cut those 12 months out of the data.
I am trying to get pandas to convert the date-time strings in my data to an internal datetime object with df.datetime = pd.to_datetime(df.datetime).
I tried:
import pandas as pd
df = pd.read_csv(file)
df.datetime = pd.to_datetime(df.datetime)
and got:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-01-15 00:00:00
the first two lines of the csv file are:
datenum,year,month,day,datetime,data_mean_global,data_mean_nh,data_mean_sh
381,1,1,15,15-Jan-0001 00:00:00,277.876675965034,278.555895908363,277.197456021705
One way is to convert these problematic values to NaT:
df.datetime = pd.to_datetime(df.datetime, errors='coerce')
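If you need to keep those early dates instead of dropping them, a rough sketch (assuming the 15-Jan-0001 00:00:00 format shown above) is to parse them into plain Python datetime objects, which support years 1-9999, and keep the column as object dtype:
import datetime
import pandas as pd
df = pd.DataFrame({"datetime": ["15-Jan-0001 00:00:00", "15-Feb-0001 00:00:00"]})
# an object-dtype column of datetime.datetime values; the .dt accessor won't be available
df["datetime"] = pd.Series(
    [datetime.datetime.strptime(s, "%d-%b-%Y %H:%M:%S") for s in df["datetime"]],
    index=df.index, dtype="object",
)
print(df["datetime"].iloc[0])  # 0001-01-15 00:00:00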

Parse timestamp having hour beyond 23 in python

I am learning Python and came across an issue where I am trying to read a timestamp from a CSV file in the format below,
43:32.0
Here 43 is in the hours position, and I want to convert it to datetime format in pandas.
I tried this code,
df['time'] = df['time'].astype(str).str[:-2]
df['time'] = pd.to_datetime(df['time'], errors='coerce')
But this converts all values to NaT.
I need the output to be in the format mm/dd/yyyy hh:mm:ss.
I'm going to assume that this is a Date for 11-29-17 (today's date)?
I believe you need to add an extra 0: at the beginning of the string. Basic example:
import pandas as pd
# creating a dataframe of your string
df1 = pd.DataFrame({'A':['43:32.0']})
# adding '0:' to the front
df1['A'] = '0:' + df1['A'].astype(str)
# making new column to show the output
df1['B'] = pd.to_datetime(df1['A'], errors='coerce')
#output
A B
0 0:43:32.0 2017-11-29 00:43:32
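Note that with the 0: prefix, 43 ends up being parsed as minutes. If 43 really is an hour count, i.e. an elapsed duration rather than a clock time, a rough sketch using pd.to_timedelta (which has no 0-23 hour limit) could look like this, assuming 43:32.0 means 43 hours and 32.0 minutes:
import pandas as pd
df1 = pd.DataFrame({'A': ['43:32.0']})
# split into hours and minutes, then build a timedelta, which allows hours beyond 23
parts = df1['A'].str.split(':', expand=True).astype(float)
df1['elapsed'] = pd.to_timedelta(parts[0], unit='h') + pd.to_timedelta(parts[1], unit='m')
print(df1['elapsed'].iloc[0])  # 1 days 19:32:00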

How to convert date format when reading from Excel - Python

I am reading from an Excel sheet. The header is a date in Month-Year format and I want to keep it that way. But when pandas reads it, it changes the format to "2014-01-01 00:00:00". I wrote the following piece to fix it, but it doesn't work.
import pandas as pd
import numpy as np
import datetime
from datetime import date
import time
file_loc = "path.xlsx"
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], parse_cols = 37)
df.columns=pd.to_datetime(df.columns, format='%b-%y')
Which didn't do anything. On another try, I did the following:
df.columns = datetime.datetime.strptime(df.columns, '%Y-%m-%d %H:%M:%S').strftime('%b-%y')
Which returns the "must be str, not datetime.datetime" error. I don't know how to make it read the row cell by cell as strings!
Here is a sample data:
NaT 11/14/2015 00:00:00 12/15/2015 00:00:00 1/15/2016 00:00:00
A 5 1 6
B 6 3 3
My main problem with this is that it does not recognize it as the header, e.g., df['11/14/2015 00:00:00'] returns a KeyError.
Any help is appreciated.
UPDATE: Here is a photo to illustrate what I keep getting! Box 6 is the implementation of apply, and box 7 is what my data looks like.
import datetime
df = pd.DataFrame({'data': ["11/14/2015 00:00:00", "11/14/2015 00:10:00", "11/14/2015 00:20:00"]})
df["data"].apply(lambda x: datetime.datetime.strptime(x, '%m/%d/%Y %H:%M:%S').strftime('%b-%y'))
EDIT
If you'd like to work with df.columns you could use map function:
df.columns = list(map(lambda x: datetime.datetime.strptime(x, '%m/%d/%Y %H:%M:%S').strftime('%b-%y'), df.columns))
You need list() if you are using Python 3.x, because map returns an iterator by default.
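Since the "must be str, not datetime.datetime" error suggests the headers were already read in as datetime objects, a rough variant (a sketch, not the asker's exact data) is to format those objects directly with strftime instead of re-parsing strings:
import datetime
import pandas as pd
# headers read from Excel often arrive as datetime objects already
cols = [datetime.datetime(2015, 11, 14), datetime.datetime(2015, 12, 15), datetime.datetime(2016, 1, 15)]
df = pd.DataFrame([[5, 1, 6], [6, 3, 3]], columns=cols)
# format datetime headers directly; leave any non-datetime header alone
df.columns = [c.strftime('%b-%y') if isinstance(c, datetime.datetime) else c for c in df.columns]
print(df.columns.tolist())  # ['Nov-15', 'Dec-15', 'Jan-16']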
The problem might be that the data in Excel isn't stored in the string format you think it is. Perhaps it is stored as a number and just displayed as a date string in Excel.
Excel stores dates as serial numbers (days since an epoch), not as text.
Check what values you actually see in the DataFrame.
What does this show?
from pprint import pprint
pprint(df)
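In the same spirit, a small sketch to check whether the header labels are strings or datetime objects, and what dtypes the data columns have:
# print the Python type of each column label and the dtype of each column
print([type(c) for c in df.columns])
print(df.dtypes)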
