Problem
I have a CSV file with the components of the date and time in separate columns. When I use pandas.read_csv, the parse_dates kwarg combines the components into a single datetime column, but only if I don't include the minutes column.
Example
Consider the following example:
from io import StringIO
import pandas
data = """\
GAUGE,YEAR,MONTH,DAY,HOUR,MINUTE,PRECIP
1,2008,03,27,19,30,0.02
1,2008,03,27,19,45,0.06
1,2008,03,27,20,0,0.01
1,2008,03,27,20,30,0.01
1,2008,03,27,21,0,0.12
1,2008,03,27,21,15,0.02
1,2008,03,27,23,15,0.02
1,2008,03,27,23,30,0.01
1,2008,03,30,04,0,0.05
1,2008,03,30,04,15,0.24
"""
df_has_dt = pandas.read_csv(
StringIO(data),
parse_dates={'datetime': ['YEAR', 'MONTH', 'DAY', 'HOUR']},
)
df_no_dt = pandas.read_csv(
StringIO(data),
parse_dates={'datetime': ['YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTE']},
)
If I look at the .info() method of each dataframe, I get:
The first:
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 10 non-null datetime64[ns] # <--- good, but doesn't have minutes of course
1 GAUGE 10 non-null int64
2 MINUTE 10 non-null int64
3 PRECIP 10 non-null float64
The second:
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 10 non-null object # <--- bad!
1 GAUGE 10 non-null int64
2 PRECIP 10 non-null float64
Indeed, df_no_dt.head() shows a very strange "datetime" column:
datetime GAUGE PRECIP
2008 03 27 19 30 1 0.02
2008 03 27 19 45 1 0.06
2008 03 27 20 0 1 0.01
2008 03 27 20 30 1 0.01
2008 03 27 21 0 1 0.12
Question:
What's causing this and how should I efficiently get the minutes of the time into the datetime column?
I'm not sure why just adding on the minutes column for datetime parsing isn't working. But you can specify a function to parse them like so:
from io import StringIO
import pandas
data = """\
GAUGE,YEAR,MONTH,DAY,HOUR,MINUTE,PRECIP
1,2008,03,27,19,30,0.02
1,2008,03,27,19,45,0.06
1,2008,03,27,20,0,0.01
1,2008,03,27,20,30,0.01
1,2008,03,27,21,0,0.12
1,2008,03,27,21,15,0.02
1,2008,03,27,23,15,0.02
1,2008,03,27,23,30,0.01
1,2008,03,30,04,0,0.05
1,2008,03,30,04,15,0.24
"""
DT_COLS = ['YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTE']
def dt_parser(*args):
    # args holds one array per date column; zip(*args) turns them into rows
    return pandas.to_datetime(pandas.DataFrame(zip(*args), columns=DT_COLS))
df = pandas.read_csv(
StringIO(data),
parse_dates={'datetime': DT_COLS},
date_parser=dt_parser,
)
Which outputs:
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 10 non-null datetime64[ns]
1 GAUGE 10 non-null int64
2 PRECIP 10 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 368.0 bytes
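As a variant on the parser above, the columns can also be combined after loading instead of inside read_csv. This is only a sketch: it uses a shortened copy of the sample data and relies on pandas.to_datetime accepting a DataFrame whose column names match date/time components case-insensitively, which the uppercase DT_COLS in the answer already depends on. It may be worth considering on newer pandas releases, where date_parser is deprecated.

```python
from io import StringIO

import pandas as pd

# Shortened copy of the sample data from the question
data = """\
GAUGE,YEAR,MONTH,DAY,HOUR,MINUTE,PRECIP
1,2008,03,27,19,30,0.02
1,2008,03,27,20,0,0.01
1,2008,03,30,04,15,0.24
"""

DT_COLS = ["YEAR", "MONTH", "DAY", "HOUR", "MINUTE"]

# Load the components as plain integer columns first
df = pd.read_csv(StringIO(data))

# Assemble one datetime per row from the component columns
df["datetime"] = pd.to_datetime(df[DT_COLS])

# Drop the components once they have been combined
df = df.drop(columns=DT_COLS)
print(df)
```

Because the components are read as integers, the unpadded minute value 0 is not a problem here.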
How dt_parser works: it relies on a feature of pandas.to_datetime that recognises common names for date/time components when it is given a DataFrame. These aren't fully documented, but at the time of writing they can be found in the pandas source. They are:
"year", "years",
"month", "months",
"day", "days",
"hour", "hours",
"minute", "minutes",
"second", "seconds",
"ms", "millisecond", "milliseconds",
"us", "microsecond", "microseconds",
"ns", "nanosecond", "nanoseconds",
Related
I'd like to create DataFrame from a csv with one datetime-typed column.
According to the article, this code should create the DataFrame I need:
df = pd.read_csv('data/data_3.csv', parse_dates=['date'])
df.info()
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 3 non-null datetime64[ns]
1 product 3 non-null object
2 price 3 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 200.0+ bytes
But when I follow exactly the same steps, I get an object-typed date column:
df = pd.read_csv(path, parse_dates=['published_at'])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 100000 non-null object
1 salary_from 48041 non-null float64
2 salary_to 53029 non-null float64
3 salary_currency 64733 non-null object
4 area_name 100000 non-null object
5 published_at 100000 non-null object
dtypes: float64(2), object(4)
memory usage: 4.6+ MB
I have tried a couple of different ways to parse the datetime column and still can't get a DataFrame with a datetime dtype. So how do I parse a column as a datetime type (not object)?
When loading the csv, have you tried:
df = pd.read_csv(path, parse_dates=['published_at'], infer_datetime_format = True)
And/or when converting to datetime:
pd.to_datetime(df.published_at, utc=True)
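One common reason for a date column staying as object is that its strings carry different UTC offsets. Whether that applies to published_at is an assumption, since the data isn't shown. The sketch below uses made-up timestamps to show what utc=True does in that situation.

```python
import pandas as pd

# Hypothetical values: the same wall-clock time with two different UTC offsets
published_at = pd.Series([
    "2021-05-01T10:00:00+03:00",
    "2021-05-01T10:00:00+07:00",
])

# utc=True converts every value to UTC, so the result has a single tz-aware dtype
converted = pd.to_datetime(published_at, utc=True)
print(converted.dtype)
print(converted)
```

If a timezone-naive column is needed afterwards, converted.dt.tz_localize(None) drops the timezone while keeping the UTC clock times.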
I'm having a lot of trouble getting a column of a pandas dataframe to read as datetime64[ns] dtype after it has been saved in JSON format. I've tried pretty much everything I've seen online: pd.to_datetime(coerce, format), astype(datetime64[ns]), date_format = 'iso', etc.
This is strange and very frustrating, as all my other dataframes with date columns that were saved as JSON files are read correctly, with the dtype as datetime64[ns].
I would really appreciate some help.
Here are the last few lines of my code where I create the data frame and what it returns:
player = pd.DataFrame(full, index = list(range(len(full))), columns = ['Name', 'Handedness', 'Height', 'Bday'])
player.Height = player.Height.str[:-2]
player.Height = pd.to_numeric(player.Height)
player.Bday = pd.to_datetime(player.Bday, format = '%d/%m/%Y')
player = player.reset_index(drop = True)
player.to_json(f'../../Datasets/Singles_players/Player_Traits/{Event}_players.json', date_format = 'iso')
print(player.info())
print(player.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 25 non-null object
1 Handedness 25 non-null object
2 Height 25 non-null float64
3 Bday 25 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 928.0+ bytes
None
Name Handedness Height Bday
0 KENTO MOMOTA Left 175.0 1994-09-01
1 VIKTOR AXELSEN Right 194.0 1994-01-04
2 ANDERS ANTONSEN Right 183.0 1997-04-27
3 CHOU TIEN CHEN Right 180.0 1990-01-08
4 ANTHONY SINISUKA GINTING Right 171.0 1996-05-11
All good BUT here is what happens when I read the file:
player = pd.read_json('../Datasets/Singles_Players/Player_Traits/MS_players.json')
print(player.info())
print(player.head())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25 entries, 0 to 24
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 25 non-null object
1 Handedness 25 non-null object
2 Height 25 non-null int64
3 Bday 25 non-null object
dtypes: int64(1), object(3)
memory usage: 1000.0+ bytes
None
Name Handedness Height Bday
0 KENTO MOMOTA Left 175 1994-09-01T00:00:00.000Z
1 VIKTOR AXELSEN Right 194 1994-01-04T00:00:00.000Z
2 ANDERS ANTONSEN Right 183 1997-04-27T00:00:00.000Z
3 CHOU TIEN CHEN Right 180 1990-01-08T00:00:00.000Z
4 ANTHONY SINISUKA GINTING Right 171 1996-05-11T00:00:00.000Z
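A possible explanation for the output above: by default, pd.read_json only converts columns whose labels look date-like (for example labels ending in _at or _time, or the label date), and Bday does not match any of those patterns. The sketch below uses a two-row stand-in for the player frame and an in-memory buffer instead of the real file, and passes the column name through convert_dates.

```python
from io import StringIO

import pandas as pd

# Two-row stand-in for the player frame in the question
player = pd.DataFrame({
    "Name": ["KENTO MOMOTA", "VIKTOR AXELSEN"],
    "Bday": pd.to_datetime(["01/09/1994", "04/01/1994"], format="%d/%m/%Y"),
})

# Serialise to JSON with ISO-formatted dates, as in the question
payload = player.to_json(date_format="iso")

# Name the column explicitly so read_json parses it as a date
restored = pd.read_json(StringIO(payload), convert_dates=["Bday"])
print(restored.dtypes)
```

Depending on the pandas version, the restored column may be timezone-naive or timezone-aware (UTC), because older versions write a trailing Z for naive datetimes.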
I have a dataframe called query_df and some of the columns are in datetime[ns] datatype.
I want to convert all datetime[ns] to datetime[ns, UTC] all at once.
This is what I've done so far by retrieving columns that are datetime[ns]:
dt_columns = [col for col in query_df.columns if query_df[col].dtype == 'datetime64[ns]']
To convert it, I can use pd.to_datetime(query_df["column_name"], utc=True).
Using dt_columns, I want to convert all columns in dt_columns.
How can I do it all at once?
Attempt:
query_df[dt_columns] = pd.to_datetime(query_df[dt_columns], utc=True)
Error:
ValueError: to assemble mappings requires at least that [year, month,
day] be specified: [day,month,year] is missing
You can do this with apply(); no lambda function is needed. Try this:
query_df[dt_columns] = query_df[dt_columns].apply(pd.to_datetime, utc=True)
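For a self-contained check of the apply() approach, here is a small sketch. The frame and its column names are invented for illustration, and select_dtypes is used in place of the list comprehension from the question to pick out the timezone-naive datetime columns.

```python
import pandas as pd

# Invented frame: two timezone-naive datetime columns and one integer column
query_df = pd.DataFrame({
    "created": pd.to_datetime(["2021-05-27 06:54:07", "2021-05-28 00:00:00"]),
    "updated": pd.to_datetime(["2021-06-01 12:00:00", "2021-06-02 13:30:00"]),
    "value": [1, 2],
})

# Collect the names of the timezone-naive datetime columns
dt_columns = query_df.select_dtypes(include="datetime").columns.tolist()

# apply() calls pd.to_datetime once per column, forwarding utc=True each time
query_df[dt_columns] = query_df[dt_columns].apply(pd.to_datetime, utc=True)
print(query_df.dtypes)
```

The key point is that apply() hands pd.to_datetime one Series at a time, so it never sees the whole DataFrame and never tries to assemble year, month and day columns.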
You have already done the first part of the process, collecting the names of the columns whose dtype is to be converted:
dt_columns = [col for col in query_df.columns if query_df[col].dtype == 'datetime64[ns]']
Now all you have to do is convert all of those columns at once using pandas' apply():
query_df[dt_columns] = query_df[dt_columns].apply(pd.to_datetime, utc=True)
This converts the selected columns to datetime64[ns, UTC].
EDIT:
Without using apply():
Step 1: Create a dictionary that will map column names (the columns to be changed) to their target dtype:
convert_dict = {}
Step 2: Iterate over the column names you extracted and store each one as a key, with the target dtype as its value:
for col in dt_columns:
    convert_dict[col] = 'datetime64[ns]'
Step 3: Now convert the dtypes by passing the dictionary to astype():
query_df = query_df.astype(convert_dict)
astype() applies each value in the dictionary to the column named by its key. Note that 'datetime64[ns]' carries no timezone, so this step on its own does not make the columns UTC-aware.
Your attempt query_df[dt_columns] = pd.to_datetime(query_df[dt_columns], utc=True) fails because to_datetime(), when given a DataFrame, tries to assemble a single datetime from columns named year, month and day. Below is the example from the help of to_datetime():
Assembling a datetime from multiple columns of a DataFrame. The keys can be
common abbreviations like ['year', 'month', 'day', 'minute', 'second',
'ms', 'us', 'ns']) or plurals of the same
>>> df = pd.DataFrame({'year': [2015, 2016],
... 'month': [2, 3],
... 'day': [4, 5]})
>>> pd.to_datetime(df)
0 2015-02-04
1 2016-03-05
dtype: datetime64[ns]
Below is a code snippet that gives you a solution with a small example. Bear in mind that, depending on your data format or your application, UTC might not give you the right date.
import pandas as pd
query_df = pd.DataFrame({"ts1":[1622098447.2419431, 1622098447], "ts2":[1622098427.370945,1622098427], "a":[1,2], "b":[0.0,0.1]})
query_df.info()
# convert to datetime in nano seconds
query_df[["ts1","ts2"]] = query_df[["ts1","ts2"]].astype("datetime64[ns]")
query_df.info()
#convert to datetime with UTC
query_df[["ts1","ts2"]] = query_df[["ts1","ts2"]].astype("datetime64[ns, UTC]")
query_df.info()
which outputs:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ts1 2 non-null float64
1 ts2 2 non-null float64
2 a 2 non-null int64
3 b 2 non-null float64
dtypes: float64(3), int64(1)
memory usage: 192.0 bytes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ts1 2 non-null datetime64[ns]
1 ts2 2 non-null datetime64[ns]
2 a 2 non-null int64
3 b 2 non-null float64
dtypes: datetime64[ns](2), float64(1), int64(1)
memory usage: 192.0 bytes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ts1 2 non-null datetime64[ns, UTC]
1 ts2 2 non-null datetime64[ns, UTC]
2 a 2 non-null int64
3 b 2 non-null float64
dtypes: datetime64[ns, UTC](2), float64(1), int64(1)
memory usage: 192.0 bytes
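One caveat about the example values: they look like Unix timestamps in seconds, although the answer does not say so. If that assumption holds, the unit should be stated explicitly, because a numeric value without a unit is read as nanoseconds since the epoch. A minimal sketch, with values copied from the example above:

```python
import pandas as pd

# Values copied from the example above; treating them as epoch seconds is an assumption
query_df = pd.DataFrame({
    "ts1": [1622098447.2419431, 1622098447.0],
    "ts2": [1622098427.370945, 1622098427.0],
    "a": [1, 2],
})

ts_columns = ["ts1", "ts2"]

# unit="s" reads the numbers as seconds since the epoch; utc=True makes the result tz-aware
query_df[ts_columns] = query_df[ts_columns].apply(pd.to_datetime, unit="s", utc=True)
print(query_df)
```

With unit="s" the timestamps fall in May 2021 rather than within a couple of seconds of 1970-01-01.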
I have dataframe that looks like below:
Date Region Data
0 200201 A 8.8
1 200201 B 14.3
...
1545 202005 C 7.3
1546 202005 D 131
I want to convert the Date column (data type: object) to a DatetimeIndex without the time. yyyymm, yyyymmdd or yyyy-mm-dd would all be fine, as long as I can drop the time part.
I've searched Stack Overflow and tried the following code:
# (1)
df["Date"] = pd.to_datetime(df["Date"], format = "%Y%m", errors = "coerce", utc = False)
# (2)
df["Date"] = pd.to_datetime(df["Date"], format = "%Y%m")
df["Date"] = df["Date"].dt.normalize()
# (3)
df["Date"] = pd.to_datetime(df["Date"], format = "%Y%m")
df["Date"] = df["Date"].dt.date
For (1) and (2), I get ["Date"] with time like yyyy-mm-dd 00:00:00.
For (3), I do get ["Date"] as yyyymm but the dtype is object.
I can't use a date range because the same date is repeated several times.
Is there any way to convert yyyymm[object] to yyyymmdd[datetime] in Python?
Thanks in advance.
It could be a display configuration issue on how your DataFrames are showing in your editor. The simplest way to get the data in the right format is:
df['Date'] = pd.to_datetime(df['Date'], format = '%Y%m')
Below are the results from repl.it with your DataFrame and this code. The date is properly formatted without the time component, and it has the proper dtype.
Date Region Data
0 2002-01-01 A 8.8
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 1 non-null datetime64[ns]
1 Region 1 non-null object
2 Data 1 non-null float64
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 152.0+ bytes
You can also try a more convoluted way, going from datetime to Python date objects and back to datetime.
df['Date'] = pd.to_datetime(df['Date'], format = '%Y%m').dt.date
df['Date'] = df['Date'].astype('datetime64[ns]')
The final display and dtypes are the same.
Date Region Data
0 2002-01-01 A 8.8
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 1 non-null datetime64[ns]
1 Region 1 non-null object
2 Data 1 non-null float64
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 152.0+ bytes
The Date column in the question has the format YYYYMM (but no days). The function pd.to_datetime() implicitly sets the day to 1.
The function pd.Period() converts dates in the format YYYYMM to pandas periods. Note that df['Date'] can be strings or 6-digit integers.
df['Date'].apply(lambda x: pd.Period(x, freq='M'))
0 2002-01
1 2002-01
2 2020-05
3 2020-05
Name: Date, dtype: period[M]
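The same monthly periods can also be produced without a Python-level lambda. The sketch below assumes the Date column holds YYYYMM strings, as in the question, and parses them with to_datetime before converting to periods. The sample rows are abbreviated from the question.

```python
import pandas as pd

# Abbreviated rows from the question; Date holds YYYYMM strings
df = pd.DataFrame({
    "Date": ["200201", "200201", "202005", "202005"],
    "Region": ["A", "B", "C", "D"],
    "Data": [8.8, 14.3, 7.3, 131.0],
})

# Parse YYYYMM (the day defaults to 1), then keep only the year-month part
df["Date"] = pd.to_datetime(df["Date"], format="%Y%m").dt.to_period("M")
print(df)
print(df["Date"].dtype)
```

A period[M] column displays as yyyy-mm with no time part, and repeated months are not a problem.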
I'm using Pandas dataframe. And I have a dataFrame df as the following:
time id
-------------
5:13:40 1
16:20:59 2
...
For the first row, the time 5:13:40 has no zero padding before, and I want to convert it to 05:13:40. So my expected df would be like:
time id
-------------
05:13:40 1
16:20:59 2
...
The type of time is <class 'datetime.timedelta'>. Could anyone give me some hints on how to handle this? Thanks so much!
Use pd.to_timedelta:
df['time'] = pd.to_timedelta(df['time'])
Before:
print(df)
time id
1 5:13:40 1.0
2 16:20:59 2.0
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 1 to 2
Data columns (total 2 columns):
time 2 non-null object
id 2 non-null float64
dtypes: float64(1), object(1)
memory usage: 48.0+ bytes
After:
print(df)
time id
1 05:13:40 1.0
2 16:20:59 2.0
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 1 to 2
Data columns (total 2 columns):
time 2 non-null timedelta64[ns]
id 2 non-null float64
dtypes: float64(1), timedelta64[ns](1)
memory usage: 48.0 bytes
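How a timedelta64 column is displayed can vary between pandas versions (some show a leading "0 days"). If a zero-padded HH:MM:SS string is what is actually needed, one option is to format the values explicitly. This is a sketch with a hypothetical helper named format_hms, and it assumes durations shorter than 100 hours.

```python
import pandas as pd

df = pd.DataFrame({"time": ["5:13:40", "16:20:59"], "id": [1, 2]})

# Convert the strings to timedeltas first
td = pd.to_timedelta(df["time"])


def format_hms(delta):
    """Format a Timedelta as zero-padded HH:MM:SS (hypothetical helper)."""
    total = int(delta.total_seconds())
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Keep the padded text in a separate column; its dtype is object, not timedelta
df["time_str"] = td.map(format_hms)
print(df)
```

The timedelta column stays available for arithmetic, while time_str is only for display.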