Currency/Date dataframe merge failing - python

Having issues with merging two dataframes (xrate and df) based on currency_str and created_date_time:
display(xrate.info())
Int64Index: 1611 entries, 6 to 112
Data columns (total 3 columns):
Date 1611 non-null datetime64[ns]
PX_LAST 1611 non-null object
Currency 1611 non-null object
display(xrate.head(3))
Date PX_LAST Currency
2018-05-30 1 CAD
2018-05-29 1 CAD
2018-05-28 1 CAD
I created a new date column to merge on:
#df['formatted_created_date_time'] = df['created_date_time'].dt.strftime('%d%m%Y')
df['formatted_created_date_time'] = df['created_date_time'].dt.strftime('%d-%m-%Y')
#convert to date
#df['formatted_created_date_time'] = pd.to_datetime(df['formatted_created_date_time'], format='%d%m%Y')
df['formatted_created_date_time'] = pd.to_datetime(df['formatted_created_date_time'], format='%d-%m-%Y')
display(df.info())
RangeIndex: 3488 entries, 0 to 3487
Data columns (total 43 columns):
created_date_time 3488 non-null datetime64[ns]
rfq_create_date_time 3488 non-null datetime64[ns]
currency_str 3488 non-null object
display(df.head(3))
Now the two dataframes are merged:
result = pd.merge(df, xrate, left_on=['currency_str', 'formatted_created_date_time'], right_on=['Currency', 'Date'], how='left')
display(result.info())
RangeIndex: 3488 entries, 0 to 3487
Data columns (total 43 columns):
created_date_time 3488 non-null datetime64[ns]
rfq_create_date_time 3488 non-null datetime64[ns]
.
.
formatted_created_date_time 3488 non-null datetime64[ns]
The match has failed:
display(result.head(3))
Any ideas on this one?

It should be working fine.
But another solution is to merge by strings:
df['formatted_created_date_time'] = df['created_date_time'].dt.strftime('%d-%m-%Y')
xrate['Date'] = xrate['Date'].dt.strftime('%d-%m-%Y')
result = pd.merge(df, xrate, left_on=['currency_str', 'formatted_created_date_time'],
right_on=['Currency', 'Date'], how='left')
Your solution can be simplified by using floor or date:
df['formatted_created_date_time'] = df['created_date_time'].dt.floor('d')
xrate['Date'] = xrate['Date'].dt.floor('d')
result = pd.merge(df, xrate, left_on=['currency_str', 'formatted_created_date_time'],
right_on=['Currency', 'Date'], how='left')
Or with dt.date:
df['formatted_created_date_time'] = df['created_date_time'].dt.date
xrate['Date'] = xrate['Date'].dt.date
result = pd.merge(df, xrate, left_on=['currency_str', 'formatted_created_date_time'],
right_on=['Currency', 'Date'], how='left')
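One extra check worth doing before the merge (a hedged aside, not from the original post): xrate.info() above shows PX_LAST as object, and mismatched key dtypes silently produce all-NaN matches in a left join, so verifying both is cheap insurance:
# Both merge keys must share a dtype for datetime equality to hold,
# and PX_LAST should be numeric if the rate will be used in arithmetic.
print(df['formatted_created_date_time'].dtype, xrate['Date'].dtype)
xrate['PX_LAST'] = pd.to_numeric(xrate['PX_LAST'], errors='coerce')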

Related

How to parse a date column as datetimes, not objects in Pandas?

I'd like to create a DataFrame from a CSV with one datetime-typed column.
Following the article, this code should create the needed DataFrame:
df = pd.read_csv('data/data_3.csv', parse_dates=['date'])
df.info()
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 3 non-null datetime64[ns]
1 product 3 non-null object
2 price 3 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 200.0+ bytes
But when I do exactly the same steps, I get an object-typed date column:
df = pd.read_csv(path, parse_dates=['published_at'])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 100000 non-null object
1 salary_from 48041 non-null float64
2 salary_to 53029 non-null float64
3 salary_currency 64733 non-null object
4 area_name 100000 non-null object
5 published_at 100000 non-null object
dtypes: float64(2), object(4)
memory usage: 4.6+ MB
I have tried a couple of different ways to parse the datetime column and still can't get a DataFrame with datetime dtype. So how do I parse a column as datetime type (not object)?
When loading the CSV, have you tried:
df = pd.read_csv(path, parse_dates=['published_at'], infer_datetime_format = True)
And/or when converting to datetime:
pd.to_datetime(df.published_at, utc=True)
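If the column still comes back as object, a common cause is mixed timezone offsets or a few malformed rows. A minimal sketch of a more forgiving conversion (errors='coerce' turns unparseable values into NaT rather than failing):
# Coerce unparseable values to NaT so the column can become datetime64
df['published_at'] = pd.to_datetime(df['published_at'], utc=True, errors='coerce')
df.info()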

Including minutes column in CSV breaks date parsing

Problem
I have a CSV file with components of the date and time in separate columns. When I use pandas.read_csv, I can use the parse_dates kwarg to combine the components into a single datetime column if I don't include the minutes column.
Example
Consider the following example:
from io import StringIO
import pandas
data = """\
GAUGE,YEAR,MONTH,DAY,HOUR,MINUTE,PRECIP
1,2008,03,27,19,30,0.02
1,2008,03,27,19,45,0.06
1,2008,03,27,20,0,0.01
1,2008,03,27,20,30,0.01
1,2008,03,27,21,0,0.12
1,2008,03,27,21,15,0.02
1,2008,03,27,23,15,0.02
1,2008,03,27,23,30,0.01
1,2008,03,30,04,0,0.05
1,2008,03,30,04,15,0.24
"""
df_has_dt = pandas.read_csv(
    StringIO(data),
    parse_dates={'datetime': ['YEAR', 'MONTH', 'DAY', 'HOUR']},
)
df_no_dt = pandas.read_csv(
    StringIO(data),
    parse_dates={'datetime': ['YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTE']},
)
If I look at the .info() method of each dataframe, I get:
The first:
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 10 non-null datetime64[ns] # <--- good, but doesn't have minutes of course
1 GAUGE 10 non-null int64
2 MINUTE 10 non-null int64
3 PRECIP 10 non-null float64
The second:
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 10 non-null object # <--- bad!
1 GAUGE 10 non-null int64
2 PRECIP 10 non-null float64
Indeed, df_no_dt.head() shows a very strange "datetime" column:
datetime GAUGE PRECIP
2008 03 27 19 30 1 0.02
2008 03 27 19 45 1 0.06
2008 03 27 20 0 1 0.01
2008 03 27 20 30 1 0.01
2008 03 27 21 0 1 0.12
Question:
What's causing this and how should I efficiently get the minutes of the time into the datetime column?
I'm not sure why just adding the minutes column to the datetime parsing isn't working, but you can specify a function to parse the components like so:
from io import StringIO
import pandas
data = """\
GAUGE,YEAR,MONTH,DAY,HOUR,MINUTE,PRECIP
1,2008,03,27,19,30,0.02
1,2008,03,27,19,45,0.06
1,2008,03,27,20,0,0.01
1,2008,03,27,20,30,0.01
1,2008,03,27,21,0,0.12
1,2008,03,27,21,15,0.02
1,2008,03,27,23,15,0.02
1,2008,03,27,23,30,0.01
1,2008,03,30,04,0,0.05
1,2008,03,30,04,15,0.24
"""
DT_COLS = ['YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTE']
def dt_parser(*args):
    return pandas.to_datetime(pandas.DataFrame(zip(*args), columns=DT_COLS))

df = pandas.read_csv(
    StringIO(data),
    parse_dates={'datetime': DT_COLS},
    date_parser=dt_parser,
)
Which outputs:
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 10 non-null datetime64[ns]
1 GAUGE 10 non-null int64
2 PRECIP 10 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 368.0 bytes
How dt_parser works: it relies on a feature of pd.to_datetime that recognises common names for date/time components. These aren't fully documented, but at the time of writing they can be found in the pandas source. They are:
"year", "years",
"month", "months",
"day", "days",
"hour", "hours",
"minute", "minutes",
"second", "seconds",
"ms", "millisecond", "milliseconds",
"us", "microsecond", "microseconds",
"ns", "nanosecond", "nanoseconds",

Pandas: convert datetime64[ns] columns to datetime64[ns, UTC] for multiple columns at once

I have a dataframe called query_df and some of the columns are of datetime64[ns] dtype.
I want to convert all datetime64[ns] columns to datetime64[ns, UTC] at once.
This is what I've done so far by retrieving columns that are datetime[ns]:
dt_columns = [col for col in query_df.columns if query_df[col].dtype == 'datetime64[ns]']
To convert it, I can use pd.to_datetime(query_df["column_name"], utc=True).
Using dt_columns, I want to convert all columns in dt_columns.
How can I do it all at once?
Attempt:
query_df[dt_columns] = pd.to_datetime(query_df[dt_columns], utc=True)
Error:
ValueError: to assemble mappings requires at least that [year, month,
day] be specified: [day,month,year] is missing
You can use apply to achieve this. Try doing this:
df[dt_columns] = df[dt_columns].apply(pd.to_datetime, utc=True)
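A quick sanity check after the conversion (a minimal sketch, assuming dt_columns is the list built above):
# Every column in dt_columns should now report dtype datetime64[ns, UTC]
print(df[dt_columns].dtypes)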
The first part of the process is already done by you, i.e. collecting the names of the columns whose datatype is to be converted, by using:
dt_columns = [col for col in query_df.columns if query_df[col].dtype == 'datetime64[ns]']
Now, all you have to do is convert all the columns to datetime at once using pandas' apply() functionality:
query_df[dt_columns] = query_df[dt_columns].apply(pd.to_datetime, utc=True)
This will convert the required columns to the data type you specify.
EDIT:
Without using apply:
Step 1: Create an empty dictionary, and import datetime for use as the target type:
from datetime import datetime

convert_dict = {}
Step 2: Iterate over the column names you extracted, storing each as a key with datetime as its value:
for col in dt_columns:
    convert_dict[col] = datetime
Step 3: Now convert the datatypes by passing the dictionary to the astype() function:
query_df = query_df.astype(convert_dict)
By doing this, each dtype value is applied to the column whose name matches its key.
Your attempt query_df[dt_columns] = pd.to_datetime(query_df[dt_columns], utc=True) interprets dt_columns as year/month/day components to assemble into a single datetime. Below is the example from the help of to_datetime():
Assembling a datetime from multiple columns of a DataFrame. The keys can be
common abbreviations like ['year', 'month', 'day', 'minute', 'second',
'ms', 'us', 'ns']) or plurals of the same
>>> df = pd.DataFrame({'year': [2015, 2016],
... 'month': [2, 3],
... 'day': [4, 5]})
>>> pd.to_datetime(df)
0 2015-02-04
1 2016-03-05
dtype: datetime64[ns]
Below is a code snippet with a small example that gives you a solution. Bear in mind that depending on your data format or your application, UTC might not give you the right date.
import pandas as pd
query_df = pd.DataFrame({"ts1":[1622098447.2419431, 1622098447], "ts2":[1622098427.370945,1622098427], "a":[1,2], "b":[0.0,0.1]})
query_df.info()
# convert to datetime in nanoseconds
query_df[["ts1","ts2"]] = query_df[["ts1","ts2"]].astype("datetime64[ns]")
query_df.info()
#convert to datetime with UTC
query_df[["ts1","ts2"]] = query_df[["ts1","ts2"]].astype("datetime64[ns, UTC]")
query_df.info()
which outputs:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ts1 2 non-null float64
1 ts2 2 non-null float64
2 a 2 non-null int64
3 b 2 non-null float64
dtypes: float64(3), int64(1)
memory usage: 192.0 bytes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ts1 2 non-null datetime64[ns]
1 ts2 2 non-null datetime64[ns]
2 a 2 non-null int64
3 b 2 non-null float64
dtypes: datetime64[ns](2), float64(1), int64(1)
memory usage: 192.0 bytes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ts1 2 non-null datetime64[ns, UTC]
1 ts2 2 non-null datetime64[ns, UTC]
2 a 2 non-null int64
3 b 2 non-null float64
dtypes: datetime64[ns, UTC](2), float64(1), int64(1)
memory usage: 192.0 bytes
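As an alternative to the second astype step above (a sketch, assuming the naive timestamps genuinely represent UTC wall times), tz_localize attaches a timezone to already-parsed datetime64[ns] columns:
# Attach UTC to tz-naive datetime columns; same intent as
# astype("datetime64[ns, UTC]") but explicit about the localization
query_df[["ts1", "ts2"]] = query_df[["ts1", "ts2"]].apply(lambda s: s.dt.tz_localize('UTC'))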

Reset index after groupby operation

by_month = df_omsk_last_year.groupby(df_omsk_last_year.index.month, as_index=False).agg({'T': ['mean', 'min', 'max']})
by_month = by_month.reset_index()
by_month = by_month.rename(columns={'mean':'mean__'})
by_month.info()
by_month['mean__']
I get a KeyError, of course.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 4 columns):
(index, ) 12 non-null int64
(T, mean__) 12 non-null float64
(T, min) 12 non-null float64
(T, max) 12 non-null float64
dtypes: float64(3), int64(1)
memory usage: 464.0 bytes
What should I do? I have tried a lot of approaches.
The index is datetime, T is float.
The problem is a MultiIndex in the columns, with T as the first level. You can prevent it by specifying the column to process right after the groupby:
df_omsk_last_year = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'T': [7, 8, 9, 4, 2, 3],
}, index=pd.date_range('2015-01-01', periods=6, freq='10D'))

by_month = (df_omsk_last_year.groupby(df_omsk_last_year.index.month.rename('month'))['T']
            .agg(['mean', 'min', 'max'])
            .rename(columns={'mean': 'mean__'})
            .reset_index())
print(by_month)
month mean__ min max
0 1 7.0 4 9
1 2 2.5 2 3
Or by named aggregations:
by_month = (df_omsk_last_year.groupby(df_omsk_last_year.index.month)
            .agg(mean__=('T', 'mean'),
                 min__=('T', 'min'),
                 max__=('T', 'max'))
            .reset_index())
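Another option (a sketch, not from the answer above): keep your original groupby/agg and flatten the resulting MultiIndex columns afterwards, so labels like 'T_mean__' become plain strings:
# Join the column levels into flat names, skipping empty levels;
# ('T', 'mean__') becomes 'T_mean__' and ('index', '') becomes 'index'.
by_month.columns = ['_'.join(str(part) for part in col if part != '')
                    for col in by_month.columns]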

Using set_index within a custom function

I would like to convert the date observations from a column into the index for my dataframe. I am able to do this with the code below:
Sample data:
test = pd.DataFrame({'Values':[1,2,3], 'Date':["1/1/2016 17:49","1/2/2016 7:10","1/3/2016 15:19"]})
Indexing code:
test['Date Index'] = pd.to_datetime(test['Date'])
test = test.set_index('Date Index')
test['Index'] = test.index.date
However when I try to include this code in a function, I am able to create the 'Date Index' column but set_index does not seem to work as expected.
def date_index(df):
    df['Date Index'] = pd.to_datetime(df['Date'])
    df = df.set_index('Date Index')
    df['Index'] = df.index.date
If I inspect the output when not using a function, info() returns:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3 entries, 2016-01-01 17:49:00 to 2016-01-03 15:19:00
Data columns (total 3 columns):
Date 3 non-null object
Values 3 non-null int64
Index 3 non-null object
dtypes: int64(1), object(2)
memory usage: 96.0+ bytes
If I inspect the output after calling the function, info() returns:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
Date 3 non-null object
Values 3 non-null int64
dtypes: int64(1), object(1)
memory usage: 120.0+ bytes
I would like the DatetimeIndex.
How can set_index be used within a function? Am I using it incorrectly?
IIUC return df is missing. Inside the function, df = df.set_index('Date Index') rebinds the local name df to a new DataFrame, so the caller's DataFrame never gets the new index unless you return it:
df1 = pd.DataFrame({'Values':[1,2,3], 'Exam Completed Date':["1/1/2016 17:49","1/2/2016 7:10","1/3/2016 15:19"]})
def date_index(df):
    df['Exam Completed Date Index'] = pd.to_datetime(df['Exam Completed Date'])
    df = df.set_index('Exam Completed Date Index')
    df['Index'] = df.index.date
    return df
print (date_index(df1))
Exam Completed Date Values Index
Exam Completed Date Index
2016-01-01 17:49:00 1/1/2016 17:49 1 2016-01-01
2016-01-02 07:10:00 1/2/2016 7:10 2 2016-01-02
2016-01-03 15:19:00 1/3/2016 15:19 3 2016-01-03
print (date_index(df1).info())
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3 entries, 2016-01-01 17:49:00 to 2016-01-03 15:19:00
Data columns (total 3 columns):
Exam Completed Date 3 non-null object
Values 3 non-null int64
Index 3 non-null object
dtypes: int64(1), object(2)
memory usage: 96.0+ bytes
None
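Alternatively (a sketch of the in-place variant, not from the answer above), mutate the caller's frame directly so nothing needs to be returned:
def date_index_inplace(df):
    # set_index(..., inplace=True) mutates the frame the caller passed in,
    # so the DatetimeIndex survives without a return
    df['Exam Completed Date Index'] = pd.to_datetime(df['Exam Completed Date'])
    df.set_index('Exam Completed Date Index', inplace=True)
    df['Index'] = df.index.date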
