Compare two Pandas Series/DataFrames that are virtually equal - python

For a unittest I have to compare two pandas DataFrames (with one column, so they can also be cast to Series without losing information). The problem is that the index of one is of datetime type, the other date. For our purposes the information in the two is equal, since the time component of the datetime is not used.
To check if the two objects are equal for a unittest I could:
Extract the index of one of them and cast to date/datetime
Extract just the values of the one column, compare those and start and end dates
Am I missing any elegant way to compare the two?
Code example:
from datetime import date, datetime, timedelta
import pandas as pd
days_in_training = 40
start_date = date(2016, 12, 1)
dates = [start_date + timedelta(days=i) for i in range(days_in_training)]
actual = pd.DataFrame({'col1': range(days_in_training)}, index=dates)
start_datetime = datetime(2016, 12, 1)
datetimes = [start_datetime + timedelta(days=i) for i in range(days_in_training)]
expected = pd.DataFrame({'col1': range(days_in_training)}, index=datetimes)
assert(all(actual == expected))
Gives:
ValueError: Can only compare identically-labeled DataFrame objects

For future reference, through this blogpost (https://penandpants.com/2014/10/07/testing-with-numpy-and-pandas/) I found the function pandas.util.testing.assert_frame_equal() (https://github.com/pandas-dev/pandas/blob/29de89c1d961bea7aa030422b56b061c09255b96/pandas/util/testing.py#L621)
This function has some flexibility in what it tests for. In addition it prints a summary why the DataFrames might not be considered equal, the line assert(all(actual == expected)) only returns True or False, which makes debugging harder.

Related

Sorting datetime; pandas

I have a big excel file with a datetime format column which are in strings. The column looks like this:
ingezameldop
2022-10-10 15:51:18
2022-10-10 15:56:19
I have found two ways of trying to do this, however they do not work.
First (nice way):
import pandas as pd
from datetime import datetime
from datetime import date
dagStart = datetime.strptime(str(date.today())+' 06:00:00', '%Y-%m-%d %H:%M:%S')
dagEind = datetime.strptime(str(date.today())+' 23:00:00', '%Y-%m-%d %H:%M:%S')
data = pd.read_excel('inzamelbestand.xlsx', index_col=9)
data = data.loc[pd.to_datetime(data['ingezameldop']).dt.time.between(dagStart.time(), dagEind.time())]
data.to_excel("oefenexcel.xlsx")
However, this returns me with an excel file identical to the original one. I cant seem to fix this.
Second way (sketchy):
import pandas as pd
from datetime import datetime
from datetime import date
df = pd.read_excel('inzamelbestand.xlsx', index_col=9)
# uitfilteren dag van vandaag
dag = str(date.today())
dag1 = dag[8]+dag[9]
vgl = df['ingezameldop']
vgl2 = vgl.str[8]+vgl.str[9]
df = df.loc[vgl2 == dag1]
# uitfilteren vanaf 6 uur 's ochtends
# str11 str12 = uur
df.to_excel("oefenexcel.xlsx")
This one works for filtering out the exact day. But when I want to filter out the hours it does not. Because I use the same way (getting the 11nd and 12th character from the string) but I cant use logic operators (>=) on strings, so I cant filter out for times >6
You can modify this line of code
data = data.loc[pd.to_datetime(data['ingezameldop']).dt.time.between(dagStart.time(), dagEind.time())]
as
(dagStart.hour, dagStart.minute) <= (data['ingezameldop'].hour, data['ingezameldop'].minute) < (dagEind.hour, dagEind.minute)
to get boolean values that are only true for records within the date range.
dagStart, dagEind and data['ingezameldop'] must be in datetime format.
In order to apply it on individual element of the column, wrap it in a function and use apply as follows
def filter(ingezameldop, dagStart, dagEind):
return (dagStart.hour, dagStart.minute) <= (data['ingezameldop'].hour, data['ingezameldop'].minute) < (dagEind.hour, dagEind.minute)
then apply the filter on the column in this way
data['filter'] = data['ingezameldop'].apply(filter, dagStart=dagStart, dagEind=dagEind)
That will apply the function on individual series element which must be in datetime format

how to transform for loop to lambda function

I have written this function:
def time_to_unix(df,dateToday):
'''this function creates the timestamp column for the dataframe. it also gets today's date (ex: 2022-8-8 0:0:0)
and then it adds the seconds that were originally in the timestamp column.
input: dataframe, dateToday(type: pandas.core.series.Series)
output: list of times
'''
dateTime = dateToday[0]
times = []
for i in range(0,len(df['timestamp'])):
dateAndTime = dateTime + timedelta(seconds = float(df['timestamp'][i]))
unix = pd.to_datetime([dateAndTime]).astype(int) / 10**9
times.append(unix[0])
return times
so it takes a dataframe and it gets today's date and then its taking the value of the timestamp in the dataframe( which is in seconds like 10,20,.... ) then it applies the function and returns times in unix time
however, because I have approx 2million row in my dataframe, its taking me a lot of time to run this code.
how can I use lambda function or something else in order to speed up my code and the process.
something along the line of:
df['unix'] = df.apply(lambda row : something in here), axis = 1)
What I think you'll find is that most of the time is spent in the creation and manipulation of the datetime / timestamp objects in the dataframe (see here for more info). I also try to avoid using lambdas like this on large dataframes as they go row by row which should be avoided. What I've done when dealing with datetimes / timestamps / timezone changes in the past is to build a dictionary of the possible datetime combinations and then use map to apply them. Something like this:
import datetime as dt
import pandas as pd
#Make a time key column out of your date and timestamp fields
df['time_key'] = df['date'].astype(str) + '#' + df['timestamp']
#Build a dictionary from the unique time keys in the dataframe
time_dict = dict()
for time_key in df['time_key'].unique():
time_split = time_key.split('#')
#Create the Unix time stamp based on the values in the key; store it in the dictionary so it can be mapped later
time_dict[time_key] = (pd.to_datetime(time_split[0]) + dt.timedelta(seconds=float(time_split[1]))).astype(int) / 10**9
#Now map the time_key to the unix column in the dataframe from the dictionary
df['unix'] = df['time_key'].map(time_dict)
Note if all the datetime combinations are unique in the dataframe, this likely won't help.
I'm not exactly sure what type dateTime[0] has. But you could try a more vectorized approach:
import pandas as pd
df["unix"] = (
(pd.Timestamp(dateTime[0]) + pd.to_timedelta(df["timestamp"], unit="seconds"))
.astype("int").div(10**9)
)
or
df["unix"] = (
(dateTime[0] + pd.to_timedelta(df["timestamp"], unit="seconds"))
.astype("int").div(10**9)
)

How to compare date difference in python pandas

I'm appending a column to my pandas dataframe which is the time difference between two dates.
df['time_diff'] = datetime.dt(2018,1,1) - df['IN_TIME']
the type if the new column in <m8[ns]. I'm trying to filter the rows whose 'time_diff' is greater than 30 days but I can't compare <m8[ns] with a number. How can I do this comparison?
Here's one way. Note you don't need to use the datetime module for these calculations as Pandas has some intuitive functionality for these operations.
df['time_diff'] = pd.to_datetime('2018-01-01') - df['IN_TIME']
df = df[df['time_diff'].dt.days > 30]
This solution assumes df['IN_TIME'] is a datetime series; if it is not, you can convert via df['IN_TIME'] = pd.to_datetime(df['IN_TIME']).

Using pandas to sort data from a dataset

I have a dataset which includes a column for dates. The format for this column is dd.mm.yyyy.
I tried using the recommended methods for sorting the dates to restrict the range to 'December' and '2014.' However, none of the methods seem to be functioning properly. I am considering trying to rearrange it so that it is in the format of yyyy.mm.dd. I'm not sure how to go about doing this. Can someone help?
Code such as
(df['date']>'1-12-2014')&(df['date']<='31-12-2014') don't seem to work.
The problem is that your dates are strings that pandas isn't recognizing as dates. You want to convert them to datetime objects first. There are a couple ways to do this:
df['date'] = df['date'].apply(lambda d: pd.strptime(d, '%d.%m.%Y'))
or
df['date'] = pd.to_datetime(df['date'], format = '%d.%m.%Y')
In both cases, the key is using a format string that matches your data. Then, you can filter how you want:
from datetime import date
df[(df['date'] >= date(2014, 12, 1))&(df['date'] <= date(2014, 12, 31))]

Pandas: select all dates with specific month and day

I have a dataframe full of dates and I would like to select all dates where the month==12 and the day==25 and add replace the zero in the xmas column with a 1.
Anyway to do this? the second line of my code errors out.
df = DataFrame({'date':[datetime(2013,1,1).date() + timedelta(days=i) for i in range(0,365*2)], 'xmas':np.zeros(365*2)})
df[df['date'].month==12 and df['date'].day==25] = 1
Pandas Series with datetime now behaves differently. See .dt accessor.
This is how it should be done now:
df.loc[(df['date'].dt.day==25) & (cust_df['date'].dt.month==12), 'xmas'] = 1
Basically what you tried won't work as you need to use the & to compare arrays, additionally you need to use parentheses due to operator precedence. On top of this you should use loc to perform the indexing:
df.loc[(df['date'].month==12) & (df['date'].day==25), 'xmas'] = 1
An update was needed in reply to this question. As of today, there's a slight difference in how you extract months from datetime objects in a pd.Series.
So from the very start, incase you have a raw date column, first convert it to datetime objects by using a simple function:
import datetime as dt
def read_as_datetime(str_date):
# replace %Y-%m-%d with your own date format
return dt.datetime.strptime(str_date,'%Y-%m-%d')
then apply this function to your dates column and save results in a new column namely datetime:
df['datetime'] = df.dates.apply(read_as_datetime)
finally in order to extract dates by day and month, use the same piece of code that #Shayan RC explained, with this slight change; notice the dt.datetime after calling the datetime column:
df.loc[(df['datetime'].dt.datetime.month==12) &(df['datetime'].dt.datetime.day==25),'xmas'] =1

Categories