converting object to datetime without time - python

I have dataframe that looks like below:
Date Region Data
0 200201 A 8.8
1 200201 B 14.3
...
1545 202005 C 7.3
1546 202005 D 131
I wanted to convert the Date column(data type: object) to DateTime index without time. yyyymm or yyyymmdd or yyyy-mm-dd all of these don't matter as long as I can erase the time part.
I've searched stackoverflow and tried these codes
# (1)
df["Date"] = pd.to_datetime(df["Date"], format = "%Y%m", errors = "coerce", uts = False)
# (2)
df["Date"] = pd.to_datetime(df["Date"], format = "%Y%m")
df["Date"] = df["Date"].dt.normalize()
# (3)
df["Date"] = pd.to_datetime(df["Date"], format = "%Y%m")
df["Date"] = df["Date"].dt.date
For (1) and (2), I get ["Date"] with time like yyyy-mm-dd 00:00:00.
For (3), I do get ["Date"] as yyyymm but the dtype is object.
I can't use date range because same date is repeated for some time.
Will there be any way to convert yyyymm[object] to yyyymmdd[datetime] in python?
Thanks in advance.

It could be a display configuration issue on how your DataFrames are showing in your editor. The simplest way to get the data in the right format is:
df['Date'] = pd.to_datetime(df['Date'], format = '%Y%m')
Below are the results from repl.it with your DataFrame and this code. The date is properly formatted without the time component, and it has the proper dtype.
Date Region Data
0 2002-01-01 A 8.8
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 1 non-null datetime64[ns]
1 Region 1 non-null object
2 Data 1 non-null float64
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 152.0+ bytes
You can also try a more convoluted way of going from datetime to date string and back to datetime.
df['Date'] = pd.to_datetime(df['Date'], format = '%Y%m').dt.date
df['Date'] = df['Date'].astype('datetime64[ns]')
The final display and dtypes are the same.
Date Region Data
0 2002-01-01 A 8.8
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 1 non-null datetime64[ns]
1 Region 1 non-null object
2 Data 1 non-null float64
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 152.0+ bytes

The Date column in the question has the format YYYYMM (but no days). The function pd.to_datetime() implicitly sets the day to 1.
The function pd.Period() converts dates in the format YYYYMM to pandas periods. Note that df['Date'] can be strings or 6-digit integers.
df['Date'].apply(lambda x: pd.Period(x, freq='M'))
0 2002-01
1 2002-01
2 2020-05
3 2020-05
Name: Date, dtype: period[M]

Related

(Only) some dates in date index get interpreted wrong after import from csv

I want to analyse a dataframe in python. I loaded a csv which consists of two columns, one date/time and one mean value.
I loaded the data like this:
df = pd.read_csv('L_strom_30974_gerundet.csv', sep=';', names=['Timestamp', 'Mean'])
df['Timestamp'] = pd.to_datetime(df.Timestamp,format= '%d.%m.%y %H:%M', infer_datetime_format=True)
df.set_index('Timestamp', inplace=True)
df.index = pd.DatetimeIndex(df.index).to_period('15T')
df = df.sort_index()
The problem is, that some dates seem to get interpreted wrong by python. The csv only ranges from 01.01.2009 00:00 to 04.10.2010 23:45 (original format). But when I load the file into python it also shows dates from November and December 2010 in the plot and df.info:
PeriodIndex: 61628 entries, 2009-01-01 00:00 to 2010-12-09 23:45
Freq: 15T
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Mean 61628 non-null float64
dtypes: float64(1)
I searched in the csv for values from this time, but couldn't find any. Also, the number of entries in the df.info matches the rows of my csv, so I reckon that some dates must have been interpreted wrong.
The tail of my dataframe after the import looks like this:
Mean
Timestamp
2010-12-09 22:45 186
2010-12-09 23:00 206
2010-12-09 23:15 168
2010-12-09 23:30 150
2010-12-09 23:45 132
I searched for similar problems, but could not find an explanation as to why most of the data is interpreted correctly, but some incorrectly. Any idea?
The assumed need for infer_datetime_format=True gives away that you are not passing the correct format. Have a look at the strftime documentation. You are using:
format='%d.%m.%y %H:%M'
# %y = Year without century as a zero-padded decimal number: 09, 10
But the format required is:
format='%d.%m.%Y %H:%M'
# %Y = Year with century as a decimal number: 2009, 2010
Apparently, infer_datetime_format isn't able to infer each string correctly, taking days as months and vice versa. Indeed, let's reproduce the error:
Create csv:
import pandas as pd
import numpy as np
data = {'Timestamp': pd.date_range('01-01-2009', '10-04-2010', freq='H'),
'Mean': np.random.randint(0,10,15385)}
df_orig = pd.DataFrame(data)
df_orig['Timestamp'] = df_orig['Timestamp'].dt.strftime('%d.%m.%Y %H:%M')
df_orig.to_csv('test.csv', sep=';', index=None, header=None)
# csv like:
01.01.2009 00:00;7
01.01.2009 01:00;6
01.01.2009 02:00;0
01.01.2009 03:00;2
01.01.2009 04:00;3
Load csv incorrectly:
df = pd.read_csv('test.csv', sep=';', names=['Timestamp', 'Mean'])
df['Timestamp'] = pd.to_datetime(df.Timestamp,format= '%d.%m.%y %H:%M',
infer_datetime_format=True)
df.set_index('Timestamp', inplace=True)
df.index = pd.DatetimeIndex(df.index).to_period('15T')
df = df.sort_index()
df.info() # note the incorrect `PeriodIndex`, ending with `2010-12-09 23:00`
<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 15385 entries, 2009-01-01 00:00 to 2010-12-09 23:00
Freq: 15T
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Mean 15385 non-null int64
dtypes: int64(1)
memory usage: 240.4 KB
Load csv correctly:
df = pd.read_csv('test.csv', sep=';', names=['Timestamp', 'Mean'])
df['Timestamp'] = pd.to_datetime(df.Timestamp,format= '%d.%m.%Y %H:%M')
df.set_index('Timestamp', inplace=True)
df.index = pd.DatetimeIndex(df.index).to_period('15T')
df.info()
<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 15385 entries, 2009-01-01 00:00 to 2010-10-04 00:00
Freq: 15T
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Mean 15385 non-null int64
dtypes: int64(1)
memory usage: 240.4 KB

Including minutes column in CSV breaks date parsing

Problem
I have a CSV file with components of the date and time in separate columns. When I use pandas.read_csv, I can use the parse_date kwarg to combine the components into a single datetime column if I don't include the minutes column.
Example
Consider the following example:
from io import StringIO
import pandas
data = """\
GAUGE,YEAR,MONTH,DAY,HOUR,MINUTE,PRECIP
1,2008,03,27,19,30,0.02
1,2008,03,27,19,45,0.06
1,2008,03,27,20,0,0.01
1,2008,03,27,20,30,0.01
1,2008,03,27,21,0,0.12
1,2008,03,27,21,15,0.02
1,2008,03,27,23,15,0.02
1,2008,03,27,23,30,0.01
1,2008,03,30,04,0,0.05
1,2008,03,30,04,15,0.24
"""
df_has_dt = pandas.read_csv(
StringIO(data),
parse_dates={'datetime': ['YEAR', 'MONTH', 'DAY', 'HOUR']},
)
df_no_dt = pandas.read_csv(
StringIO(data),
parse_dates={'datetime': ['YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTE']},
)
If I look at the .info() method of each dataframe, I get:
The first:
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 10 non-null datetime64[ns] # <--- good, but doesn't have minutes of course
1 GAUGE 10 non-null int64
2 MINUTE 10 non-null int64
3 PRECIP 10 non-null float64
The second:
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 10 non-null object # <--- bad!
1 GAUGE 10 non-null int64
2 PRECIP 10 non-null float64
Indeed, df_no_dt.head() shows a very strange "datetime" column:
datetime GAUGE PRECIP
2008 03 27 19 30 1 0.02
2008 03 27 19 45 1 0.06
2008 03 27 20 0 1 0.01
2008 03 27 20 30 1 0.01
2008 03 27 21 0 1 0.12
Question:
What's causing this and how should I efficiently get the minutes of the time into the datetime column?
I'm not sure why just adding on the minutes column for datetime parsing isn't working. But you can specify a function to parse them like so:
from io import StringIO
import pandas
data = """\
GAUGE,YEAR,MONTH,DAY,HOUR,MINUTE,PRECIP
1,2008,03,27,19,30,0.02
1,2008,03,27,19,45,0.06
1,2008,03,27,20,0,0.01
1,2008,03,27,20,30,0.01
1,2008,03,27,21,0,0.12
1,2008,03,27,21,15,0.02
1,2008,03,27,23,15,0.02
1,2008,03,27,23,30,0.01
1,2008,03,30,04,0,0.05
1,2008,03,30,04,15,0.24
"""
DT_COLS = ['YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTE']
def dt_parser(*args):
return pandas.to_datetime(pandas.DataFrame(zip(*args), columns=DT_COLS))
df = pandas.read_csv(
StringIO(data),
parse_dates={'datetime': DT_COLS},
date_parser=dt_parser,
)
Which outputs:
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 10 non-null datetime64[ns]
1 GAUGE 10 non-null int64
2 PRECIP 10 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 368.0 bytes
How dt_parser works. It relies on an feature of pd.to_datetime that can recognise common names for date/time types. These aren't fully documented, but at the time of writing they can be found in the pandas source. They all are:
"year", "years",
"month", "months",
"day", "days",
"hour", "hours",
"minute", "minutes",
"second", "seconds",
"ms", "millisecond", "milliseconds",
"us", "microsecond", "microseconds",
"ns", "nanosecond", "nanoseconds",

Pandas convert datetime64 [ns] columns to datetime64 [ns, UTC] for mutliple column at once

I have a dataframe called query_df and some of the columns are in datetime[ns] datatype.
I want to convert all datetime[ns] to datetime[ns, UTC] all at once.
This is what I've done so far by retrieving columns that are datetime[ns]:
dt_columns = [col for col in query_df.columns if query_df[col].dtype == 'datetime64[ns]']
To convert it, I can use pd.to_datetime(query_df["column_name"], utc=True).
Using dt_columns, I want to convert all columns in dt_columns.
How can I do it all at once?
Attempt:
query_df[dt_columns] = pd.to_datetime(query_df[dt_columns], utc=True)
Error:
ValueError: to assemble mappings requires at least that [year, month,
day] be specified: [day,month,year] is missing
You have to use lambda function to achieve this. Try doing this
df[dt_columns] = df[dt_columns].apply(pd.to_datetime, utc=True)
First part of the process is already done by you i.e. grouping the names of the columns whose datatype is to be converted , by using :
dt_columns = [col for col in query_df.columns if query_df[col].dtype == 'datetime64[ns]']
Now , all you have to do ,is to convert all the columns to datetime all at once using pandas apply() functionality :
query_df[dt_columns] = query_df[dt_columns].apply(pd.to_datetime)
This will convert the required columns to the data type you specify.
EDIT:
Without using the lambda function
step 1: Create a dictionary with column names (columns to be changed) and their datatype :
convert_dict = {}
Step 2: Iterate over column names which you extracted and store in the dictionary as key with their respective value as datetime :
for col in dt_columns:
convert_dict[col] = datetime
Step 3: Now convert the datatypes by passing the dictionary into the astype() function like this :
query_df = query_df.astype(convert_dict)
By doing this, all the values of keys will be applied to the columns matching the keys.
Your attempt query_df[dt_columns] = pd.to_datetime(query_df[dt_columns], utc=True) is interpreting dt_columns as year, month, day. Below the example in the help of to_datetime():
Assembling a datetime from multiple columns of a DataFrame. The keys can be
common abbreviations like ['year', 'month', 'day', 'minute', 'second',
'ms', 'us', 'ns']) or plurals of the same
>>> df = pd.DataFrame({'year': [2015, 2016],
... 'month': [2, 3],
... 'day': [4, 5]})
>>> pd.to_datetime(df)
0 2015-02-04
1 2016-03-05
dtype: datetime64[ns]
Below a code snippet that gives you a solution with a little example. Bear in mind that depending in your data format or your application the UTC might not give your the right date.
import pandas as pd
query_df = pd.DataFrame({"ts1":[1622098447.2419431, 1622098447], "ts2":[1622098427.370945,1622098427], "a":[1,2], "b":[0.0,0.1]})
query_df.info()
# convert to datetime in nano seconds
query_df[["ts1","ts2"]] = query_df[["ts1","ts2"]].astype("datetime64[ns]")
query_df.info()
#convert to datetime with UTC
query_df[["ts1","ts2"]] = query_df[["ts1","ts2"]].astype("datetime64[ns, UTC]")
query_df.info()
which outputs:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ts1 2 non-null float64
1 ts2 2 non-null float64
2 a 2 non-null int64
3 b 2 non-null float64
dtypes: float64(3), int64(1)
memory usage: 192.0 bytes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ts1 2 non-null datetime64[ns]
1 ts2 2 non-null datetime64[ns]
2 a 2 non-null int64
3 b 2 non-null float64
dtypes: datetime64[ns](2), float64(1), int64(1)
memory usage: 192.0 bytes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ts1 2 non-null datetime64[ns, UTC]
1 ts2 2 non-null datetime64[ns, UTC]
2 a 2 non-null int64
3 b 2 non-null float64
dtypes: datetime64[ns, UTC](2), float64(1), int64(1)
memory usage: 192.0 byte

How to remove rows in pandas of type datetime64[ns] by date?

I'm pretty newbie, started to use python for my project.
I have dataset, first column has datetime64[ns] type
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5889 entries, 0 to 5888
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 5889 non-null datetime64[ns]
1 title 5889 non-null object
2 stock 5889 non-null object
dtypes: datetime64[ns](1), object(2)
memory usage: 138.1+ KB
and
type(BA['date'])
gives
pandas.core.series.Series
date has format 2020-06-10
I need to delete all instances before specific date, for example 2015-09-09
What I tried:
convert to string. Failed
Create conditions using:
.df.year <= %y & .df.month <= %m
<= ('%y-%m-%d')
create data with datetime() method
create variable with datetime64 format
just copy with .loc() and .copy()
All of this failed, I had all kinds of error, like it's not int, its not datetime, datetime mutable, not this, not that, not a holy cow
How can this pandas format can be more counterintuitive, I can't believe, for first time I feel like write a parser CSV in C++ seems easier than use prepared library in python
Thank you for understanding
Toy Example
df = pd.DataFrame({'date':['2021-1-1', '2020-12-6', '2019-02-01', '2020-02-01']})
df.date = pd.to_datetime(df.date)
df
Input df
date
0 2021-01-01
1 2020-12-06
2 2019-02-01
3 2020-02-01
Delete rows before 2020.01.01.
We are selecting the rows which have dates after 2020.01.01 and ignoring old dates.
df.loc[df.date>'2020.01.01']
Output
date
0 2021-01-01
1 2020-12-06
3 2020-02-01
If we want the index to be reset
df = df.loc[df.date>'2020.01.01']
df
Output
date
0 2021-01-01
1 2020-12-06
2 2020-02-01

Pandas, format date from dd/mm/yyyy to MMM dd/yy

I would like to change the date from dd/mm/yyyy to MMM dd/yy.
ie 15/04/2021 to APR 15/21
My Date column format is object
I am doing this:
df['Date'] = pd.to_datetime(df['Date'], format='%MMM %DD/%YY')
but I am getting ValueError: time data '15/04/2021' does not match format '%MMM %dd/%yy' (match)
Any help would be appreciated.
>>> import pandas as pd
>>> df = pd.DataFrame({'Date':['15/04/2021', '12/05/2021', '4/6/2021']})
>>> df
Date
0 15/04/2021
1 12/05/2021
2 4/6/2021
>>> df['Date'] = pd.to_datetime(df['Date'])
>>> df
Date
0 2021-04-15
1 2021-12-05
2 2021-04-06
>>> df['date_formated'] = df['Date'].dt.strftime('%b %d/%y').str.upper()
>>> df
Date date_formated
0 2021-04-15 APR 15/21
1 2021-12-05 DEC 05/21
2 2021-04-06 APR 06/21
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 3 non-null datetime64[ns]
1 date_formated 3 non-null object
dtypes: datetime64[ns](1), object(1)
memory usage: 176.0+ bytes
You need to convert your original Date column to datetime format (pandas will read dates as a strings by default). After doing so, just change the display format.
Try this:
# Convert to datetime
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%y')
# Solution for title: change display to 'MMM dd/y'
df['Date'] = df['Date'].dt.strftime('%b %d/%y').astype(str).str.upper()
# Solution for comment: change display to 'MMM dd y'
df['Date'] = df['date'].dt.strftime('%b %d %y').astype(str).str.upper()
Try -
df.loc['Date'] = pd.to_datetime(df.Date).dt.strftime('%b %d/%y').str.upper()

Categories