Formatting inconsistent date data with pandas - Python

I'm wondering how I might approach the problem of inconsistent date formats with pandas. Initially I used a regular expression to extract a date from a large data set of URLs. That worked great; however, the extracted dates come in inconsistent formats:
dates
20140609
20140624
20140404
3/18/14
3/10/14
3/14/2014
20140807
20140806
2014-07-18
As you can see, the date formatting in this dataset is inconsistent. Is there a way to normalize it so that all the dates share the same format?
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 122270 entries, 0 to 122269
Data columns (total 4 columns):
id 119534 non-null float64
x1 122270 non-null int64
url 122270 non-null object
date 122025 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 4.7+ MB

Use to_datetime; it seems robust enough to handle your inconsistent formatting:
In [77]:
df['dates'] = pd.to_datetime(df['dates'])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9 entries, 0 to 8
Data columns (total 1 columns):
dates 9 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 144.0 bytes
In [78]:
df
Out[78]:
dates
0 2014-06-09
1 2014-06-24
2 2014-04-04
3 2014-03-18
4 2014-03-10
5 2014-03-14
6 2014-08-07
7 2014-08-06
8 2014-07-18
For your sample dataset to_datetime works fine. If it didn't work for you, that will be because you have some formats it can't convert. You can either pass errors='coerce' (the older coerce=True parameter has since been removed), which sets any value that cannot be converted to NaT, or errors='raise' to report any problems.
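A minimal sketch of the two modes, assuming pandas >= 2.0 (where heterogeneous strings also need format='mixed'); the 'not a date' value is made up for illustration:
import pandas as pd

# Mixed formats from the question, plus one deliberately bad value
s = pd.Series(['20140609', '3/18/14', '2014-07-18', 'not a date'])

# errors='coerce' turns anything unparseable into NaT instead of raising
print(pd.to_datetime(s, errors='coerce', format='mixed'))

# errors='raise' (the default) would instead fail on 'not a date':
# pd.to_datetime(s, errors='raise', format='mixed')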

Related

Pandas: convert datetime64[ns] columns to datetime64[ns, UTC] for multiple columns at once

I have a dataframe called query_df, and some of its columns have the datetime64[ns] datatype.
I want to convert all datetime64[ns] columns to datetime64[ns, UTC] at once.
This is what I've done so far by retrieving columns that are datetime[ns]:
dt_columns = [col for col in query_df.columns if query_df[col].dtype == 'datetime64[ns]']
To convert it, I can use pd.to_datetime(query_df["column_name"], utc=True).
Using dt_columns, I want to convert all columns in dt_columns.
How can I do it all at once?
Attempt:
query_df[dt_columns] = pd.to_datetime(query_df[dt_columns], utc=True)
Error:
ValueError: to assemble mappings requires at least that [year, month,
day] be specified: [day,month,year] is missing
You can use apply to achieve this. Try:
df[dt_columns] = df[dt_columns].apply(pd.to_datetime, utc=True)
The first part of the process is already done by you, i.e. collecting the names of the columns whose datatype is to be converted:
dt_columns = [col for col in query_df.columns if query_df[col].dtype == 'datetime64[ns]']
Now all you have to do is convert those columns to datetime at once, using pandas' apply():
query_df[dt_columns] = query_df[dt_columns].apply(pd.to_datetime, utc=True)
This will convert the required columns to the data type you specify.
EDIT:
Without using apply:
Step 1: Create an empty dictionary that will map column names (the columns to be changed) to their target datatype:
convert_dict = {}
Step 2: Iterate over the column names you extracted, storing each as a key with the target dtype as its value:
for col in dt_columns:
    convert_dict[col] = 'datetime64[ns, UTC]'
Step 3: Convert the datatypes by passing the dictionary into astype():
query_df = query_df.astype(convert_dict)
By doing this, the dtype stored under each key is applied to the column matching that key.
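Putting the three steps together, here is a minimal self-contained sketch (the frame and column names are made up; if your pandas version refuses the naive-to-aware astype, the apply(pd.to_datetime, utc=True) route above is the portable choice):
import pandas as pd

# Hypothetical frame: two naive datetime columns plus one numeric column
query_df = pd.DataFrame({
    'created': pd.to_datetime(['2021-01-01', '2021-06-15']),
    'updated': pd.to_datetime(['2021-02-01', '2021-07-15']),
    'n': [1, 2],
})

# Collect the naive datetime columns, as in the question
dt_columns = [col for col in query_df.columns
              if query_df[col].dtype == 'datetime64[ns]']

# Build the column -> dtype mapping and convert in one astype() call
convert_dict = {col: 'datetime64[ns, UTC]' for col in dt_columns}
query_df = query_df.astype(convert_dict)

print(query_df.dtypes)  # created/updated should now be datetime64[ns, UTC]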
Your attempt query_df[dt_columns] = pd.to_datetime(query_df[dt_columns], utc=True) fails because passing a whole DataFrame makes to_datetime try to assemble a single datetime column from components, so it looks for columns named year, month, and day. Below is the example from the help of to_datetime():
Assembling a datetime from multiple columns of a DataFrame. The keys can be
common abbreviations like ['year', 'month', 'day', 'minute', 'second',
'ms', 'us', 'ns']) or plurals of the same
>>> df = pd.DataFrame({'year': [2015, 2016],
... 'month': [2, 3],
... 'day': [4, 5]})
>>> pd.to_datetime(df)
0 2015-02-04
1 2016-03-05
dtype: datetime64[ns]
Below is a code snippet that gives you a solution with a little example. Bear in mind that, depending on your data format or your application, UTC might not give you the right date.
import pandas as pd
query_df = pd.DataFrame({"ts1":[1622098447.2419431, 1622098447], "ts2":[1622098427.370945,1622098427], "a":[1,2], "b":[0.0,0.1]})
query_df.info()
# convert to datetime in nanoseconds
query_df[["ts1","ts2"]] = query_df[["ts1","ts2"]].astype("datetime64[ns]")
query_df.info()
# convert to datetime with UTC
query_df[["ts1","ts2"]] = query_df[["ts1","ts2"]].astype("datetime64[ns, UTC]")
query_df.info()
which outputs:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ts1 2 non-null float64
1 ts2 2 non-null float64
2 a 2 non-null int64
3 b 2 non-null float64
dtypes: float64(3), int64(1)
memory usage: 192.0 bytes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ts1 2 non-null datetime64[ns]
1 ts2 2 non-null datetime64[ns]
2 a 2 non-null int64
3 b 2 non-null float64
dtypes: datetime64[ns](2), float64(1), int64(1)
memory usage: 192.0 bytes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ts1 2 non-null datetime64[ns, UTC]
1 ts2 2 non-null datetime64[ns, UTC]
2 a 2 non-null int64
3 b 2 non-null float64
dtypes: datetime64[ns, UTC](2), float64(1), int64(1)
memory usage: 192.0 bytes

How to remove rows in pandas of type datetime64[ns] by date?

I'm pretty new to this and have started using Python for my project.
I have a dataset whose first column has the datetime64[ns] type:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5889 entries, 0 to 5888
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 5889 non-null datetime64[ns]
1 title 5889 non-null object
2 stock 5889 non-null object
dtypes: datetime64[ns](1), object(2)
memory usage: 138.1+ KB
and
type(BA['date'])
gives
pandas.core.series.Series
The date has the format 2020-06-10.
I need to delete all rows before a specific date, for example 2015-09-09.
What I tried:
convert to string. Failed
Create conditions using:
.df.year <= %y & .df.month <= %m
<= ('%y-%m-%d')
create data with datetime() method
create variable with datetime64 format
just copy with .loc() and .copy()
All of this failed with all kinds of errors: it's not an int, it's not a datetime, datetime is immutable, not this, not that.
I can't believe how counterintuitive this pandas format can be; for the first time, writing a CSV parser in C++ feels easier than using a ready-made library in Python.
Thank you for understanding
Toy Example
df = pd.DataFrame({'date':['2021-1-1', '2020-12-6', '2019-02-01', '2020-02-01']})
df.date = pd.to_datetime(df.date)
df
Input df
date
0 2021-01-01
1 2020-12-06
2 2019-02-01
3 2020-02-01
Delete rows before 2020.01.01: we select the rows whose dates fall after 2020.01.01 and ignore the older ones.
df.loc[df.date>'2020.01.01']
Output
date
0 2021-01-01
1 2020-12-06
3 2020-02-01
If we want the index to be reset:
df = df.loc[df.date>'2020.01.01'].reset_index(drop=True)
df
Output
date
0 2021-01-01
1 2020-12-06
2 2020-02-01
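If comparing against a plain string feels fragile, the same filter works with an explicit Timestamp (a small variation, not from the original answer):
# keep everything on or after the cutoff date
cutoff = pd.Timestamp('2020-01-01')
df = df.loc[df.date >= cutoff].reset_index(drop=True)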

Pandas dataframe adding zero-padding before the datetime

I'm using pandas, and I have a DataFrame df like the following:
time id
-------------
5:13:40 1
16:20:59 2
...
For the first row, the time 5:13:40 has no leading zero padding, and I want to convert it to 05:13:40. So my expected df would be:
time id
-------------
05:13:40 1
16:20:59 2
...
The type of time is <class 'datetime.timedelta'>. Could anyone give me some hints on how to handle this problem? Thanks so much!
Use pd.to_timedelta:
df['time'] = pd.to_timedelta(df['time'])
Before:
print(df)
time id
1 5:13:40 1.0
2 16:20:59 2.0
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 1 to 2
Data columns (total 2 columns):
time 2 non-null object
id 2 non-null float64
dtypes: float64(1), object(1)
memory usage: 48.0+ bytes
After:
print(df)
time id
1 05:13:40 1.0
2 16:20:59 2.0
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 1 to 2
Data columns (total 2 columns):
time 2 non-null timedelta64[ns]
id 2 non-null float64
dtypes: float64(1), timedelta64[ns](1)
memory usage: 48.0 bytes

Wrong Data Type while Reading a large Text File

I'm trying to read the following file using pandas. The code that I'm using is the following:
df = pd.read_csv("household_power_consumption.txt", header=0, delimiter=';', nrows=5)
df.info() gives the correct output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 9 columns):
Date 5 non-null object
Time 5 non-null object
Global_active_power 5 non-null float64
Global_reactive_power 5 non-null float64
Voltage 5 non-null float64
Global_intensity 5 non-null float64
Sub_metering_1 5 non-null float64
Sub_metering_2 5 non-null float64
Sub_metering_3 5 non-null float64
dtypes: float64(7), object(2)
memory usage: 440.0+ bytes
But when I try to read the entire data set using the same code, minus nrows:
df_all = pd.read_csv("household_power_consumption.txt", header=0, delimiter=';')
the column types become object:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2075259 entries, 2006-12-16 17:24:00 to 2010-11-26 21:02:00
Data columns (total 7 columns):
Global_active_power object
Global_reactive_power object
Voltage object
Global_intensity object
Sub_metering_1 object
Sub_metering_2 object
Sub_metering_3 float64
dtypes: float64(1), object(6)
memory usage: 126.7+ MB
Can anyone please tell me why this is happening? And how to resolve it?
Thanks!
My guess would be that when you read the full data set, the additional rows contain values that can't be parsed as the expected type (for example, non-numeric strings in otherwise numeric columns), so those columns fall back to object. You can specify the data types explicitly using the dtype argument in read_csv; see the pandas read_csv documentation.
Alternatively, you could try to force the data types after loading the data, e.g. like so:
df["Global_active_power"] = df["Global_active_power"].astype(float)

Pandas - Behaviour of DateTime x-axis when using secondary y-axis not as expected (what am I doing wrong?)

I'm trying to plot two dataframes over each other, both with a DatetimeIndex, using a secondary y-axis. First, how I load the data:
import pandas as pd
df1 = pd.read_csv('SmartIce_20140927_all_voltage.csv', encoding='latin1', parse_dates=['DateTime'], index_col='DateTime')
df2 = pd.read_csv('SmartIce_20140927_temperature.csv', encoding='latin1', parse_dates=['UTC_Time'], index_col='UTC_Time')
And check the data output:
In [7]: df1.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10302 entries, 2014-09-27 16:58:54 to 2014-09-29 11:56:20
Data columns (total 5 columns):
DLPIO20_AIN0 10302 non-null float64
DLPIO20_AIN1 10302 non-null float64
DLPIO20_AIN2 10302 non-null float64
DLPIO20_AIN3 10302 non-null float64
DLPIO20_AIN4 10302 non-null float64
dtypes: float64(5)
In [8]: df1.head()
Out[8]:
DLPIO20_AIN0 DLPIO20_AIN1 DLPIO20_AIN2 DLPIO20_AIN3 \
DateTime
2014-09-27 16:58:54 0.004883 3.642578 3.696289 4.980469
2014-09-27 16:59:09 0.004883 3.637695 3.637695 4.985352
In [12]: df2.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2580 entries, 2014-09-27 16:53:00 to 2014-09-29 11:52:00
Data columns (total 3 columns):
Sample 2580 non-null int64
Temp 2580 non-null float64
DateTime 2580 non-null object
dtypes: float64(1), int64(1), object(1)
In [14]: df2.head()
Out[14]:
Sample Temp DateTime
UTC_Time
2014-09-27 16:53:00 1 -15.44 9/27/2014 14:23
2014-09-27 16:54:00 2 -14.61 9/27/2014 14:24
Now when I try to plot:
df1.DLPIO20_AIN4.plot()
df2.Temp.plot(secondary_y=True, style='g')
I get two images (I can't attach them because I need ten reputation). Image one has a time axis showing just hours (formatted, for example, as 18:00:00 on a diagonal). Image two, which I wasn't expecting, has a time axis formatted as hours with the day underneath (which I prefer). I was expecting one plot laid over the other. I've played around with various things, but I'm not sure what I should be doing to fix this or how to proceed. I believe the DatetimeIndexes are identical, or at least I understand that I have set them up that way.
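Not from the original post, but a common way to force the overlay is to capture the Matplotlib Axes returned by the first plot and hand it to the second, so both series land on the same figure (df1 and df2 as loaded above):
import matplotlib.pyplot as plt

# Plot the first series and keep the Axes it draws on
ax = df1.DLPIO20_AIN4.plot()

# Overlay the second series on that same Axes, on a secondary y-axis
df2.Temp.plot(ax=ax, secondary_y=True, style='g')
plt.show()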
