I have a couple of million DateTime objects in pandas. I could not find anything in the documentation for exploratory data analysis (EDA).
It looks like every single row has the same time in either data frame:
DF1
Timestamp('2018-02-20 00:00:00')
or
DF2
Timestamp('2018-01-01 05:00:00')
Is there a way to use pandas to go through each column and check whether there is a difference in the hours/minutes/seconds?
Everything I have found is about calculating differences between times.
I have tried a couple of basic techniques but all I get back are simple descriptive numbers.
min(data['date'])
data['date'].nunique()
I have tried:
print(data['TIMESTAMP_UTC'])
which does show some dates that have different hours, but I need a way to manage this information:
0 2018-01-16 05:00:00
1 2018-05-04 04:00:00
2 2018-10-22 04:00:00
3 2018-01-02 05:00:00
4 2018-01-03 05:00:00
5 2018-01-04 05:00:00
6 2018-01-05 05:00:00
......
Ideally, I am looking for something that could spit out a .value_counts() of the dates that deviate from everything else.
You can convert the column from str to datetime, then use the .dt accessor to inspect it.
To convert your column values into datetime:
df['TIMESTAMP_UTC'] = pd.to_datetime(df['TIMESTAMP_UTC'])
or, with .apply() and an explicit format (note %m, since the month in your data is numeric):
from datetime import datetime
df['TIMESTAMP_UTC'] = df['TIMESTAMP_UTC'].apply(lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S"))
then you can use the power of datetime to compare or extract information, for instance the hour:
df['TIMESTAMP_UTC'].dt.hour
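To get back to the original ask (a .value_counts() of timestamps whose time-of-day deviates from the rest), here is a minimal sketch; the sample data and column name are made up for illustration:

```python
import pandas as pd

# Hypothetical sample: most rows at 05:00, a few deviating
data = pd.DataFrame({'TIMESTAMP_UTC': pd.to_datetime([
    '2018-01-02 05:00:00', '2018-01-03 05:00:00',
    '2018-01-04 05:00:00', '2018-05-04 04:00:00',
    '2018-10-22 04:00:00',
])})

# Count how often each time-of-day occurs
counts = data['TIMESTAMP_UTC'].dt.time.value_counts()

# Rows whose time differs from the most common one
modal_time = counts.idxmax()
deviants = data[data['TIMESTAMP_UTC'].dt.time != modal_time]
print(counts)
print(deviants)
```

With a couple of million rows this stays vectorized, so it avoids any Python-level loop over timestamps.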
Related
I am working on some code that will rearrange a time series. Currently I have a standard time series with three columns, with the header being [Date, Time, Value]. I want to reformat the dataframe to be indexed by the date, with the times as the column headers (i.e. 0:00, 1:00, ..., 23:00), and the dataframe filled in with the values.
Here is the DataFrame I currently have.
Essentially I'd like to move the index to a single day and show the hours through the columns.
Thanks,
Use pivot:
df = df.pivot(index='Date', columns='Time', values='Total')
Output (first 10 columns and with random values for Total):
>>> df.pivot(index='Date', columns='Time', values='Total').iloc[0:10]
time 00:00:00 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00 06:00:00 07:00:00 08:00:00 09:00:00
date
2019-01-01 0.732494 0.087657 0.930405 0.958965 0.531928 0.891228 0.664634 0.432684 0.009653 0.604878
2019-01-02 0.471386 0.575126 0.509707 0.715290 0.337983 0.618632 0.413530 0.849033 0.725556 0.186876
You could try this: split the time part to get only the hour and prefix it with hr.
df = pd.DataFrame([['2019-01-01', '00:00:00',-127.57],['2019-01-01', '01:00:00',-137.57],['2019-01-02', '00:00:00',-147.57],], columns=['Date', 'Time', 'Totals'])
df['hours'] = df['Time'].apply(lambda x: 'hr'+ str(int(x.split(':')[0])))
print(pd.pivot_table(df, values ='Totals', index=['Date'], columns = 'hours'))
Output
hours hr0 hr1
Date
2019-01-01 -127.57 -137.57
2019-01-02 -147.57 NaN
I have a column that contains only a time. After reading the CSV file I converted that column to the datetime datatype, as it was object when I read it in a Jupyter notebook. When I try to filter I get an error like the one below:
TypeError: Index must be DatetimeIndex
code
newdata = newdata['APPOINTMENT_TIME'].between_time('14:30:00', '20:00:00')
sample_data
APPOINTMENT_TIME Id
13:30:00 1
15:10:00 2
18:50:00 3
14:10:00 4
14:00:00 5
Here I am trying to display the rows whose APPOINTMENT_TIME is between 14:30:00 and 20:00:00.
datatype info
Could anyone help? Thanks in advance.
between_time is a special method that works with datetime objects as the index, which is not your case. It would be useful if you had data like 2021-12-21 13:30:00.
In your case, you can just use the between method on the strings, relying on the fact that times in your HH:MM:SS format sort naturally:
filtered_data = newdata[newdata['APPOINTMENT_TIME'].between('14:30:00', '20:00:00')]
Output:
APPOINTMENT_TIME Id
1 15:10:00 2
2 18:50:00 3
N.B. With this string comparison you can't use a range that starts before midnight and ends after it.
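If you do need a window that wraps past midnight (say 22:00 to 02:00), a sketch of a workaround with the same string comparison, expressed as the union of two non-wrapping conditions (sample data made up):

```python
import pandas as pd

newdata = pd.DataFrame({
    'APPOINTMENT_TIME': ['13:30:00', '23:10:00', '01:50:00', '14:10:00'],
    'Id': [1, 2, 3, 4],
})

t = newdata['APPOINTMENT_TIME']
# A window crossing midnight is "after the start OR before the end"
wrapped = newdata[(t >= '22:00:00') | (t <= '02:00:00')]
print(wrapped)
```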
Let's say I have an idx = pd.DatetimeIndex with one-minute frequency. I also have a list of bad dates (each of type pd.Timestamp, without the time information) that I want to remove from the original idx. How do I do that in pandas?
Use normalize to remove the time part from your index so you can do a simple ~ + isin selection, i.e. find the dates not in that bad list. If you need to be extra safe, you can also ensure your list of dates doesn't have a time part with [x.normalize() for x in bad_dates].
Sample Data
import pandas as pd
df = pd.DataFrame(range(9), index=pd.date_range('2010-01-01', freq='11H', periods=9))
bad_dates = [pd.Timestamp('2010-01-02'), pd.Timestamp('2010-01-03')]
Code
df[~df.index.normalize().isin(bad_dates)]
# 0
#2010-01-01 00:00:00 0
#2010-01-01 11:00:00 1
#2010-01-01 22:00:00 2
#2010-01-04 05:00:00 7
#2010-01-04 16:00:00 8
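Since the question mentions a bare pd.DatetimeIndex rather than a DataFrame, the same idea works directly on the index; a minimal sketch:

```python
import pandas as pd

idx = pd.date_range('2010-01-01', freq='11h', periods=9)
bad_dates = [pd.Timestamp('2010-01-02'), pd.Timestamp('2010-01-03')]

# Keep only the entries whose date is not in the bad list
clean_idx = idx[~idx.normalize().isin(bad_dates)]
print(clean_idx)
```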
The problem
Suppose I have a time series dataframe df (a pandas dataframe) and some days I want to slice from it, contained in another dataframe called sample_days:
>>> df
foo bar
2020-01-01 00:00:00 0.360049 0.897839
2020-01-01 01:00:00 0.285667 0.409544
2020-01-01 02:00:00 0.323871 0.240926
2020-01-01 03:00:00 0.921623 0.766624
2020-01-01 04:00:00 0.087618 0.142409
... ... ...
2020-12-31 19:00:00 0.145111 0.993822
2020-12-31 20:00:00 0.331223 0.021287
2020-12-31 21:00:00 0.531099 0.859035
2020-12-31 22:00:00 0.759594 0.790265
2020-12-31 23:00:00 0.103651 0.074029
[8784 rows x 2 columns]
>>> sample_days
month day
0 3 16
1 7 26
2 8 15
3 9 26
4 11 25
I want to slice df with the days specified in sample_days. I can do this with for loops (see below). However, is there a way to avoid for loops, since that would be more efficient? The result should be a dataframe called sample like the following:
>>> sample
foo bar
2020-03-16 00:00:00 0.707276 0.592614
2020-03-16 01:00:00 0.136679 0.357872
2020-03-16 02:00:00 0.612331 0.290126
2020-03-16 03:00:00 0.276389 0.576996
2020-03-16 04:00:00 0.612977 0.781527
... ... ...
2020-11-25 19:00:00 0.904266 0.825501
2020-11-25 20:00:00 0.269589 0.050304
2020-11-25 21:00:00 0.271814 0.418235
2020-11-25 22:00:00 0.595005 0.973198
2020-11-25 23:00:00 0.151149 0.024057
[120 rows x 2 columns]
which is just the df sliced across the correct days.
My (slow) solution
I've managed to do this using for loops and pd.concat:
sample = pd.concat([df.loc[df.index.month.isin([sample_day.month]) &
df.index.day.isin([sample_day.day])]
for sample_day in sample_days.itertuples()])
which is based on concatenating multiple days as sliced by the method indicated here. This gives the desired result but is rather slow. For example, using this method to get the first day of each month takes 0.2 seconds on average, whereas just calling df.loc[df.index.day == 1] (presumably avoiding python for loops under-the-hood) is around 300 times faster. However, this is a slice on just the day -- I am slicing on month and day.
Apologies if this has been answered somewhere else -- I've searched for quite a while but perhaps was not using the correct keywords.
You can do a string comparison of the month and day at the same time.
You need the space to differentiate between 11 2 and 1 12, for example; otherwise both would be regarded as the same.
df.loc[(df.index.month.astype(str) +' '+ df.index.day.astype(str)).isin(sample_days['month'].astype(str)+' '+sample_days['day'].astype(str))]
After getting a bit of inspiration from #Ben Pap's solution (thanks!), I've found a solution that is both fast and avoids any "hacks" like changing datetime to strings. It combines the month and day into a single MultiIndex, as below (you can make this a single line, but I've expanded it into multiple to make the idea clear).
full_index = pd.MultiIndex.from_arrays([df.index.month, df.index.day],
names=['month', 'day'])
sample_index = pd.MultiIndex.from_frame(sample_days)
sample = df.loc[full_index.isin(sample_index)]
If I run this code along with my original for loop and #Ben Pap's answer, and sample 100 days from one year time series for 2020 (8784 hours with the leap day), I get the following solution times:
Original for loop: 0.16s
#Ben Pap's solution, combining month and day into single string: 0.019s
Above solution using MultiIndex: 0.006s
so I think using a MultiIndex is the way to go.
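For completeness, a self-contained sketch of the MultiIndex approach on toy data (the values and the two sample days here are made up):

```python
import pandas as pd
import numpy as np

# Hourly data for 2020 (a leap year: 8784 hours) and two (month, day) pairs to sample
df = pd.DataFrame({'foo': np.arange(8784.0)},
                  index=pd.date_range('2020-01-01', '2020-12-31 23:00', freq='h'))
sample_days = pd.DataFrame({'month': [3, 7], 'day': [16, 26]})

# Pair up each timestamp's (month, day) and keep the rows matching sample_days
full_index = pd.MultiIndex.from_arrays([df.index.month, df.index.day],
                                       names=['month', 'day'])
sample_index = pd.MultiIndex.from_frame(sample_days)
sample = df.loc[full_index.isin(sample_index)]
print(sample.shape)
```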
I have been using the between_time method of TimeSeries in pandas, which returns all values between the specified times, regardless of their date.
But I need to select both date and time, because my timeseries structure
contains multiple dates.
One way of solving this, though quite inflexible, is just iterate over the values and remove those which are not relevant.
Is there a more elegant way of doing this ?
You can select the dates that are of interest first, and then use between_time. For example, suppose you have a time series of 72 hours:
import pandas as pd
from numpy.random import randn
rng = pd.date_range('1/1/2013', periods=72, freq='H')
ts = pd.Series(randn(len(rng)), index=rng)
To select the values between 20:00 and 22:00 on the 2nd and 3rd of January you can simply do:
ts['2013-01-02':'2013-01-03'].between_time('20:00', '22:00')
Giving you something like this:
2013-01-02 20:00:00 0.144399
2013-01-02 21:00:00 0.886806
2013-01-02 22:00:00 0.126844
2013-01-03 20:00:00 -0.464741
2013-01-03 21:00:00 1.856746
2013-01-03 22:00:00 -0.286726
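If the dates of interest are not consecutive, one option (a sketch, with made-up data) is to filter on the normalized dates first and then apply between_time:

```python
import pandas as pd
import numpy as np

rng = pd.date_range('2013-01-01', periods=72, freq='h')
ts = pd.Series(np.arange(72.0), index=rng)

# Keep only Jan 1 and Jan 3, then restrict to the 20:00-22:00 window
wanted = [pd.Timestamp('2013-01-01'), pd.Timestamp('2013-01-03')]
subset = ts[ts.index.normalize().isin(wanted)].between_time('20:00', '22:00')
print(subset)
```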