How to parse a date column as datetimes, not objects in Pandas? - python

I'd like to create DataFrame from a csv with one datetime-typed column.
Follow the article, the code should create needed DateFrame:
df = pd.read_csv('data/data_3.csv', parse_dates=['date'])
df.info()
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 3 non-null datetime64[ns]
1 product 3 non-null object
2 price 3 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 200.0+ bytes
But when I do exacly the same steps, I get object-typed date column:
df = pd.read_csv(path, parse_dates=['published_at'])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 100000 non-null object
1 salary_from 48041 non-null float64
2 salary_to 53029 non-null float64
3 salary_currency 64733 non-null object
4 area_name 100000 non-null object
5 published_at 100000 non-null object
dtypes: float64(2), object(4)
memory usage: 4.6+ MB
I have tried a couple of various ways to parse datetime column and still can't get a DateFrame with datetime dtype. So how to parse a column with datetime type (not object)?

When loading the csv, have you tried:
df = pd.read_csv(path, parse_dates=['published_at'], infer_datetime_format = True)
And/or when converting to datetime:
pd.to_datetime(df.published_at, utc=True)

Related

Pandas convert datetime64 [ns] columns to datetime64 [ns, UTC] for mutliple column at once

I have a dataframe called query_df and some of the columns are in datetime[ns] datatype.
I want to convert all datetime[ns] to datetime[ns, UTC] all at once.
This is what I've done so far by retrieving columns that are datetime[ns]:
dt_columns = [col for col in query_df.columns if query_df[col].dtype == 'datetime64[ns]']
To convert it, I can use pd.to_datetime(query_df["column_name"], utc=True).
Using dt_columns, I want to convert all columns in dt_columns.
How can I do it all at once?
Attempt:
query_df[dt_columns] = pd.to_datetime(query_df[dt_columns], utc=True)
Error:
ValueError: to assemble mappings requires at least that [year, month,
day] be specified: [day,month,year] is missing
You have to use lambda function to achieve this. Try doing this
df[dt_columns] = df[dt_columns].apply(pd.to_datetime, utc=True)
First part of the process is already done by you i.e. grouping the names of the columns whose datatype is to be converted , by using :
dt_columns = [col for col in query_df.columns if query_df[col].dtype == 'datetime64[ns]']
Now , all you have to do ,is to convert all the columns to datetime all at once using pandas apply() functionality :
query_df[dt_columns] = query_df[dt_columns].apply(pd.to_datetime)
This will convert the required columns to the data type you specify.
EDIT:
Without using the lambda function
step 1: Create a dictionary with column names (columns to be changed) and their datatype :
convert_dict = {}
Step 2: Iterate over column names which you extracted and store in the dictionary as key with their respective value as datetime :
for col in dt_columns:
convert_dict[col] = datetime
Step 3: Now convert the datatypes by passing the dictionary into the astype() function like this :
query_df = query_df.astype(convert_dict)
By doing this, all the values of keys will be applied to the columns matching the keys.
Your attempt query_df[dt_columns] = pd.to_datetime(query_df[dt_columns], utc=True) is interpreting dt_columns as year, month, day. Below the example in the help of to_datetime():
Assembling a datetime from multiple columns of a DataFrame. The keys can be
common abbreviations like ['year', 'month', 'day', 'minute', 'second',
'ms', 'us', 'ns']) or plurals of the same
>>> df = pd.DataFrame({'year': [2015, 2016],
... 'month': [2, 3],
... 'day': [4, 5]})
>>> pd.to_datetime(df)
0 2015-02-04
1 2016-03-05
dtype: datetime64[ns]
Below a code snippet that gives you a solution with a little example. Bear in mind that depending in your data format or your application the UTC might not give your the right date.
import pandas as pd
query_df = pd.DataFrame({"ts1":[1622098447.2419431, 1622098447], "ts2":[1622098427.370945,1622098427], "a":[1,2], "b":[0.0,0.1]})
query_df.info()
# convert to datetime in nano seconds
query_df[["ts1","ts2"]] = query_df[["ts1","ts2"]].astype("datetime64[ns]")
query_df.info()
#convert to datetime with UTC
query_df[["ts1","ts2"]] = query_df[["ts1","ts2"]].astype("datetime64[ns, UTC]")
query_df.info()
which outputs:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ts1 2 non-null float64
1 ts2 2 non-null float64
2 a 2 non-null int64
3 b 2 non-null float64
dtypes: float64(3), int64(1)
memory usage: 192.0 bytes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ts1 2 non-null datetime64[ns]
1 ts2 2 non-null datetime64[ns]
2 a 2 non-null int64
3 b 2 non-null float64
dtypes: datetime64[ns](2), float64(1), int64(1)
memory usage: 192.0 bytes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ts1 2 non-null datetime64[ns, UTC]
1 ts2 2 non-null datetime64[ns, UTC]
2 a 2 non-null int64
3 b 2 non-null float64
dtypes: datetime64[ns, UTC](2), float64(1), int64(1)
memory usage: 192.0 byte

Select colums in an indexed dataframe

I've imported a csv file with pd.read_csv, used parse_dates and index_col.
This results in de following dataframe
DatetimeIndex: 195972 entries, 2018-02-01 to 2019-10-25
Data columns (total 19 columns):
account_manager 195972 non-null object
article_des 195896 non-null object
article_n 195972 non-null object
article_o 195972 non-null object
budget_code 195972 non-null object
budget_naam 195972 non-null object
country 195972 non-null object
currency 195972 non-null object
customer 195972 non-null object
industrie 195972 non-null object
klantnaam 195972 non-null object
month 195972 non-null int64
revenue 195972 non-null float64
revenue_local 195972 non-null float64
sap_code 195972 non-null object
volume 195972 non-null float64
week 195972 non-null int64
weight 195972 non-null float64
year 195972 non-null int64
dtypes: float64(4), int64(3), object(12)
memory usage: 20.9+ MB
None
I've tried every possible way to select only one column (weight) from this dataframe in a new dataframe. None of them work. What;s the trick to select columns in an indexed dataframe?
If I import the csv without an index_col I can make any selection I want.
df['weight'] will return a series while df[['weight']] will return a DataFrame.
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
print(df)
# a b
# 0 1 2
# 1 3 4
print(type(df['a']))
# <class 'pandas.core.series.Series'>
print(type(df[['a']]))
# <class 'pandas.core.frame.DataFrame'>
To select multiple columns, you have to pass a list of column names.
print(df[['a', 'b']])
# a b
# 0 1 2
# 1 3 4
So to select a single column as a DataFrame, pass a list of one element.
print(df[['a']])
# a
# 0 1
# 1 3

.dropna() increases memory usage

First I import the whole file and get a memory consumption of 1002.0+ KB
df = pd.read_csv(
filepath_or_buffer="./dataset/chicago.csv"
)
print(df.info())
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 32063 entries, 0 to 32062
# Data columns (total 4 columns):
# Name 32062 non-null object
# Position Title 32062 non-null object
# Department 32062 non-null object
# Employee Annual Salary 32062 non-null object
# dtypes: object(4)
# memory usage: 1002.0+ KB
# None
then I drop NaN, run the script again and get a memory consumption of 1.2+ MB
df = pd.read_csv(
filepath_or_buffer="./dataset/chicago.csv"
).dropna(how="all")
# <class 'pandas.core.frame.DataFrame'>
# Int64Index: 32062 entries, 0 to 32061
# Data columns (total 4 columns):
# Name 32062 non-null object
# Position Title 32062 non-null object
# Department 32062 non-null object
# Employee Annual Salary 32062 non-null object
# dtypes: object(4)
# memory usage: 1.2+ MB
# None
since I'm dropping one row I would expect that memory consumption goes down or at least remain the same no this.
Does any body know why is this happening? or how to fix it? or if this is a bug?
EDIT: chicago.csv
The change comes from the fact that your index changed from a RangeIndex to an Int64Index, which takes more memory.
You can "fix" this by resetting the index after the dropna(), but this will have the side effect of changing the row index (which you may not care about).
Here is an illustrative example:
First create a sample DataFrame:
df = pd.DataFrame({"a": range(10000)})
df.loc[1000, "a"] = None
Print the info:
print(df.info())
#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 10000 entries, 0 to 9999
#Data columns (total 1 columns):
#a 9999 non-null float64
#dtypes: float64(1)
#memory usage: 78.2 KB
Drop the na values:
print(df.dropna().info())
#<class 'pandas.core.frame.DataFrame'>
#Int64Index: 9999 entries, 0 to 9999
#Data columns (total 1 columns):
#a 9999 non-null float64
#dtypes: float64(1)
#memory usage: 156.2 KB
Reset (and drop) the index:
df.dropna().reset_index(drop=True).info()
#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 9999 entries, 0 to 9998
#Data columns (total 1 columns):
#a 9999 non-null float64
#dtypes: float64(1)
#memory usage: 78.2 KB
This is not a bug. It is working as intended, you are loading the file in so it is taking that amount of memory as before but more because you are then searching through the dataframe and removing the rows that have NaN which adds memory usage.

Pandas dataframe adding zero-padding before the datetime

I'm using Pandas dataframe. And I have a dataFrame df as the following:
time id
-------------
5:13:40 1
16:20:59 2
...
For the first row, the time 5:13:40 has no zero padding before, and I want to convert it to 05:13:40. So my expected df would be like:
time id
-------------
05:13:40 1
16:20:59 2
...
The type of time is <class 'datetime.timedelta'>.Could anyone give me some hints to handle this problem? Thanks so much!
Use pd.to_timedelta:
df['time'] = pd.to_timedelta(df['time'])
Before:
print(df)
time id
1 5:13:40 1.0
2 16:20:59 2.0
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 1 to 2
Data columns (total 2 columns):
time 2 non-null object
id 2 non-null float64
dtypes: float64(1), object(1)
memory usage: 48.0+ bytes
After:
print(df)
time id
1 05:13:40 1.0
2 16:20:59 2.0
df.info()
d<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 1 to 2
Data columns (total 2 columns):
time 2 non-null timedelta64[ns]
id 2 non-null float64
dtypes: float64(1), timedelta64[ns](1)
memory usage: 48.0 bytes

Wrong Data Type while Reading a large Text File

I'm trying to read the following file using pandas. The code that I'm using is the following:
df = pd.read_csv("household_power_consumption.txt", header=0, delimiter=';', nrows=5)
The df.info() is giving the correct output.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 9 columns):
Date 5 non-null object
Time 5 non-null object
Global_active_power 5 non-null float64
Global_reactive_power 5 non-null float64
Voltage 5 non-null float64
Global_intensity 5 non-null float64
Sub_metering_1 5 non-null float64
Sub_metering_2 5 non-null float64
Sub_metering_3 5 non-null float64
dtypes: float64(7), object(2)
memory usage: 440.0+ bytes
But when I'm trying to read the entire data set using the same code except nrows:
df_all = pd.read_csv("household_power_consumption.txt", header=0, delimiter=';') the column types are becoming object.
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2075259 entries, 2006-12-16 17:24:00 to 2010-11-26 21:02:00
Data columns (total 7 columns):
Global_active_power object
Global_reactive_power object
Voltage object
Global_intensity object
Sub_metering_1 object
Sub_metering_2 object
Sub_metering_3 float64
dtypes: float64(1), object(6)
memory usage: 126.7+ MB
Can anyone please tell me why this is happening? And how to resolve it?
Thanks!
My guess would be that when you read the full data set in there are values in the additional rows that are being interpreted as different data types, for example floats interpreted as integers. You can specify the data types explicitly using the dtype argument in read_csv - see docs here.
Alternatively you could try to force the data types after loading the data; e.g. like so:
df["Global_active_power"] = df["Global_active_power"].astype(float)

Categories