Count of inconsistent date formats in pandas - python

I have a column of type object that contains 500 rows of dates. I converted the column to datetime and I am trying to get a count of the incorrect values, in order to fix them.
Sample of the column; you can see examples of the wrong values in rows 3 and 5:
0 2018-06-14
1 2018-11-12
2 2018-10-09
3 2018-24-08
4 2018-11-12
5 11-02-2018
6 2018-12-31
I can fix the dates if I use this code:
dirtyData['date'] = pd.to_datetime(dirtyData['date'], dayfirst=True)
But I would like to check that the format in every row is %Y-%m-%d and get the count of the inconsistent formats first, then change the values.
Is it possible to achieve this?

The below code will work. However, as Michael Gardner mentioned, it won't distinguish between days and months if the day is 12 or less.
import datetime
import pandas as pd

# replicate the date series from the question
date_list = ["2018-06-14", "2018-11-12", "2018-10-09", "2018-24-08",
             "2018-11-12", "11-02-2018", "2018-12-31"]
series1 = pd.Series(date_list)
print(series1)

count = 0
for item in series1:
    try:
        datetime.datetime.strptime(item, "%Y-%m-%d")  # checks if the date format is Year-Month-Day
    except ValueError:  # a ValueError means the format did not match, so count it
        count += 1
print(count)
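A vectorized alternative (a sketch, assuming the same series1 as above): pd.to_datetime with errors='coerce' turns every row that does not match the format into NaT, which can then be counted.
import pandas as pd

series1 = pd.Series(["2018-06-14", "2018-11-12", "2018-10-09", "2018-24-08",
                     "2018-11-12", "11-02-2018", "2018-12-31"])

# rows that do not match %Y-%m-%d become NaT under errors='coerce'
parsed = pd.to_datetime(series1, format="%Y-%m-%d", errors="coerce")
print(parsed.isna().sum())  # 2 (rows 3 and 5)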

Pandas: convert a one-element Series to a datetime object

I have a data frame with a lot of columns and rows; the index column contains datetime objects.
date_time column1 column2
10-10-2010 00:00:00 1 10
10-10-2010 00:00:03 1 10
10-10-2010 00:00:06 1 10
Now I want to calculate the difference in time between the first and last datetime object. Therefore:
start = df["date_time"].head(1)
stop = df["date_time"].tail(1)
However, I now want to extract this datetime value so that I can use the .total_seconds() method to calculate the number of seconds between the two datetime objects, something like:
delta_t_seconds = (start - stop).total_seconds()
This however doesn't give the desired result, since start and stop are still Series with only one element each.
Please help.
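A minimal sketch of the usual fix, assuming a df like the sample above: .iloc pulls scalar Timestamps out of the Series, and subtracting the scalars yields a Timedelta that has .total_seconds().
import pandas as pd

df = pd.DataFrame({"date_time": pd.to_datetime(
    ["10-10-2010 00:00:00", "10-10-2010 00:00:03", "10-10-2010 00:00:06"],
    dayfirst=True)})

start = df["date_time"].iloc[0]   # scalar Timestamp, not a one-element Series
stop = df["date_time"].iloc[-1]
delta_t_seconds = (stop - start).total_seconds()
print(delta_t_seconds)  # 6.0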

Drop certain character in Object before converting to Datetime column in Pandas

My dataframe has a column which measures time difference in the format HH:MM:SS.000.
The DataFrame is read from an Excel file, and the column which stores the time difference has dtype object. However, some entries have a negative time difference; the negative sign doesn't matter to me and needs to be removed, since it interferes with a condition I'm filtering on.
Note: I only have the negative time differences there because of the issue I'm currently having.
I've tried the following functions, but I get errors, as some of the time-difference data is just 00:00:00, some is 00:00:02.65, and some is 00:00:02.111.
Firstly, how would I ensure that all data in this column follows 00:00:00.000? And then how would I remove the '-' from some of the data?
Here's a sample of the time diff column. I can't transform this column into datetime as some of the entries don't have 3 digits after the decimal. Is there a way to iterate through the column and pad a value with a 0 if its length isn't equal to 12 digits?
00:00:02.97
00:00:03:145
00:00:00
00:00:12:56
28 days 03:05:23.439
It looks like you need to clean your input before you can parse it to timedelta, e.g. with the following function:
import pandas as pd

def clean_td_string(s):
    if s.count(':') > 2:
        return '.'.join(s.rsplit(':', 1))
    return s
Applied to a DataFrame column, this looks like:
df = pd.DataFrame({'Time Diff': ['00:00:02.97', '00:00:03:145', '00:00:00', '00:00:12:56', '28 days 03:05:23.439']})
df['Time Diff'] = pd.to_timedelta(df['Time Diff'].apply(clean_td_string))
# df['Time Diff']
# 0 0 days 00:00:02.970000
# 1 0 days 00:00:03.145000
# 2 0 days 00:00:00
# 3 0 days 00:00:12.560000
# 4 28 days 03:05:23.439000
# Name: Time Diff, dtype: timedelta64[ns]
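The question also asks about dropping the leading '-'; the answer above doesn't cover that, but a hedged sketch of one way, applied before clean_td_string and to_timedelta: strip it with the .str accessor.
import pandas as pd

df = pd.DataFrame({'Time Diff': ['-00:00:02.97', '00:00:03:145']})
# remove a leading minus sign before cleaning and parsing
df['Time Diff'] = df['Time Diff'].str.lstrip('-')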

Convert row to format matching index

I have a df of shape (3000, 125).
The first row of my df represents bond tickers
The 2nd row represents the date they were sold
My index is a historical time series, and the values within the df represent the daily stock prices
e.g.
             AAPL        GOOGLE      IBM
             16/02/2018  15/03/2022  22/08/2020
2019/jan/02  5           4           3
2019/jan/03  4           4           4
2019/jan/04  4           4           5
2019/jan/05  3           5           2
2012/Mar/03  10          20          22
I would like to run a loop over the values; however, to do so, the index and df.iloc[0] (the first row) need to be in the same format.
I was able to convert the index to datetime format using the following code w/o issue:
dftest2.index = pd.to_datetime(dftest2.index, format='%Y%m%d')
Problem statement is that I'd like to convert the first row of the df to match the index format. The first row is in string format in the form '%d/%m/%Y'; however, in order for it to match the index, it needs to be in '%Y%m%d'.
I've used the following code in order for it to match the date format of the index:
dftest2.iloc[0] = pd.to_datetime(dftest2.iloc[0]).dt.strftime('%Y-%m-%d')
And running the below code also produces the following error:
dftest2.iloc[0] = pd.to_datetime(dftest2.iloc[0]).datetime.strptime('%Y-%m-%d')
AttributeError: 'Series' object has no attribute 'datetime'
I'm stuck on how to convert this row into a datetime format matching the index. Previous attempts to convert to datetime have resulted in the row being converted into int format with nonsensical numbers, 187745300000 etc.
How do I convert the row to match the index? The error I am getting now when running the loop is:
TypeError: '>' not supported between instances of 'numpy.ndarray' and 'str'
I've looked all over stackoverflow for possible variations of my problem but w/o success.
IIUC, you just want to turn the first row into a datetime object to do some further operations?
If so, this worked for me (regex=False keeps the replacements literal):
test_ = pd.to_datetime(df.iloc[0].str.replace("*", "", regex=False).str.replace(".", "", regex=False))
print(test_)
AAPL 2017-04-01
Google 2021-02-03
IBM 2020-03-03
Name: 0, dtype: datetime64[ns]
If you apply the .strftime method, you will end up with an object dtype again.
Hope that helps.
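Sticking to the formats stated in the question, a minimal self-contained sketch (the data below is hypothetical, mirroring the question's layout): parse the first row with '%d/%m/%Y', parse the remaining index labels with '%Y/%b/%d', and the comparison stops raising the TypeError.
import pandas as pd

# hypothetical frame mirroring the question: row 'sold' holds sale dates as strings
df = pd.DataFrame(
    {"AAPL": ["16/02/2018", 5, 4], "GOOGLE": ["15/03/2022", 4, 4]},
    index=["sold", "2019/jan/02", "2019/jan/03"],
)

sold = pd.to_datetime(df.loc["sold"], format="%d/%m/%Y")  # first row -> Timestamps

prices = df.drop("sold")
prices.index = pd.to_datetime(prices.index, format="%Y/%b/%d")  # '%b' matches 'jan'

# index and row are now both datetime64, so comparisons work:
print(prices.index > sold["AAPL"])  # [ True  True]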

Pandas Dataframe Time column has float values

I am doing a cleanup of my database. In one of the tables, the time column has values like 0.013391204. I am unable to convert this to time [mm:ss] format. Is there a function to convert this to the required [mm:ss] format?
The head for the column
0 20:00
1 0.013391204
2 0.013333333
3 0.012708333
4 0.012280093
Use the below reproducible data:
import pandas as pd
df = pd.DataFrame({"time": ["20:00", "0.013391204", "0.013333333", "0.012708333", "0.012280093"]})
I expect the output to be like the first row of the column values shown above.
What is the correct time interpretation for, say, the entry 0.013391204? Is it 48 seconds?
Because, if we use the datetime module, we can convert the float into a time format:
Updating the answer to add the new information:
import datetime

datetime.timedelta(days=0.013391204)
str(datetime.timedelta(days=0.013391204))
# Output: '0:19:17.000026'
Hope this helps :))
First convert the values with to_numeric and errors='coerce' to replace non-floats with missing values, then fill those back with the original values prefixed by '00:' for hours, and finally convert with to_timedelta and unit='d':
df = pd.DataFrame({"time": ["20:00", "0.013391204", "0.013333333",
                            "0.012708333", "0.012280093"]})

s = pd.to_numeric(df['time'], errors='coerce').fillna(df['time'].radd('00:'))
df['new'] = pd.to_timedelta(s, unit='d')
print(df)
time new
0 20:00 00:20:00
1 0.013391204 00:19:17.000025
2 0.013333333 00:19:11.999971
3 0.012708333 00:18:17.999971
4 0.012280093 00:17:41.000035
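Since the question asks for an [mm:ss] rendering specifically, a hedged follow-up sketch, assuming the df['new'] column from above: turn each timedelta into total seconds and format zero-padded minute and second parts.
# hypothetical formatting step: render the timedeltas as mm:ss strings
total = df['new'].dt.total_seconds().round().astype(int)
df['mmss'] = ((total // 60).astype(str).str.zfill(2)
              + ':' + (total % 60).astype(str).str.zfill(2))
# e.g. 0 days 00:19:17.000025 -> '19:17'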

Pandas: How can I add two timestamp values?

I am trying to add two or more timestamp values and I expect to see the output in minutes/seconds. How can I add two timestamps? I basically want to do '1995-07-01 00:00:01' + '1995-07-01 00:05:06' and see if the total time >= 60 minutes.
I tried this code: df['timestamp'][0] + df['timestamp'][1]. I referred to this post, but my timestamps come from a dataframe.
Head of my dataframe column looks like this:
0 1995-07-01 00:00:01
1 1995-07-01 00:00:06
2 1995-07-01 00:00:09
3 1995-07-01 00:00:09
4 1995-07-01 00:00:09
Name: timestamp, dtype: datetime64[ns]
I am getting this error:
TypeError: unsupported operand type(s) for +: 'Timestamp' and 'Timestamp'
The problem is that adding Timestamps makes no sense. What if they were on different days? What you want is the sum of Timedeltas. We can create Timedeltas by subtracting a common date from the whole series; let's subtract the minimum date and then sum up the Timedeltas. Let s be your series of Timestamps:
s.sub(s.dt.date.min()).sum().total_seconds()
34.0
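A self-contained sketch of that idea, rebuilding s from the sample column above (34.0 is the sum of the seconds 1+6+9+9+9):
import pandas as pd

s = pd.to_datetime(pd.Series([
    '1995-07-01 00:00:01', '1995-07-01 00:00:06', '1995-07-01 00:00:09',
    '1995-07-01 00:00:09', '1995-07-01 00:00:09',
]))
total = s.sub(s.dt.date.min()).sum().total_seconds()
print(total)             # 34.0
print(total >= 60 * 60)  # False: well under 60 minutes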
# Adding two timestamps is not supported and not logical.
# Probably, you really want to add the times rather than the timestamps themselves.
# This is how to extract the time from each timestamp and sum them up:
import datetime
import pandas as pd

t = ['1995-07-01 00:00:01', '1995-07-01 00:00:06', '1995-07-01 00:00:09',
     '1995-07-01 00:00:09', '1995-07-01 00:00:09']
tSum = datetime.timedelta()
df = pd.DataFrame(t, columns=['timestamp'])

for i in range(len(df)):
    df['timestamp'][i] = datetime.datetime.strptime(df['timestamp'][i],
                                                    "%Y-%m-%d %H:%M:%S").time()
    dt = df['timestamp'][i]
    (hr, mi, sec) = (dt.hour, dt.minute, dt.second)
    delta = datetime.timedelta(hours=int(hr), minutes=int(mi), seconds=int(sec))
    tSum += delta

if tSum.seconds >= 60 * 60:
    print("more than 1 hour")
else:
    print("less than 1 hour")
