How to merge 3 Pandas Dataframes based on Timestamp? - python

I have three dataframes in Pandas, say df1, df2 and df3. The first column of each dataframe is the Timestamp, in DateTime format (e.g. 2017-01-01 12:30:00). Here is an example of the first column of each:
df1 TimeStamp
2016-01-01 12:00:00
2016-01-01 12:10:00
.....
df2 TimeStamp
2016-01-01 12:00:00
2016-01-01 12:10:00
.....
df3 TimeStamp
2016-01-01 12:00:00
2016-01-01 12:30:00
.....
As you can see, the first two are at 10-minute intervals, while the third is at 30-minute intervals. What I would like to do is merge all three dataframes together, such that where there is no exact match because data is unavailable (e.g. 12:10:00 is missing from the third dataframe), the preceding measurement (12:00:00) is used for merging purposes. (But of course, the date should be the same.) Note that the dataframes all have different sizes, but I would like to merge them on Timestamp for analytical purposes. Thank you!
DESIRED RESULT:
df_final TimeStamp .. Columns of df1 Columns of df2 Columns of df3
2016-01-01 12:00:00
2016-01-01 12:10:00
2016-01-01 12:20:00
.....
MORE DETAILS BASED ON ANSWER SUGGESTED
Firstly, as my dataframes (all 3) did not have the TimeStamps as index but as a column, I set the index of each to the TimeStamp column:
df1.index = df1.TimeStamp
df2.index = df2.TimeStamp
df3.index = df3.TimeStamp
On using this
u_index = df3.index.union(df2.index.union(df1.index))
I strangely get a weird output which is not at the regular 10-minute intervals needed.
Index(['2016-01-01 00:00:00.000', '2016-01-01 00:00:00.000',
'2016-01-01 00:00:00.000', '2016-01-01 00:00:00.000',
...
'2017-12-31 23:50:00.000', '2017-12-31 23:50:00.000',
'2017-12-31 23:50:00.000', '2017-12-31 23:50:00.000',
dtype='object', name='TimeStamp', length=3199372)
Accordingly, the final df1_n dataframe is at 30-minute intervals and not 10 minutes (as the union of indices was not done properly). I think something is going wrong here; once Step 2 as suggested (u_index) works properly, merging the dataframes will be easy.
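The dtype='object' in the output above is the likely culprit: the TimeStamp values are still strings, so the union is a string union rather than a chronological one. Converting the indexes with pd.to_datetime before taking the union yields a proper DatetimeIndex. A minimal sketch with made-up frames standing in for df1/df3:

```python
import pandas as pd

# Hypothetical frames; the point is that string indexes union as plain
# objects, while DatetimeIndexes union in chronological order.
df1 = pd.DataFrame({'val_1': [11, 12]},
                   index=['2016-01-01 12:00:00', '2016-01-01 12:10:00'])
df3 = pd.DataFrame({'val_3': [31]}, index=['2016-01-01 12:30:00'])

# Convert the string indexes to real datetimes first
df1.index = pd.to_datetime(df1.index)
df3.index = pd.to_datetime(df3.index)

u_index = df1.index.union(df3.index)
print(u_index.dtype)  # datetime64[ns], not object
```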

So I'm not 100% sure if what you asked for is how to complete the missing values after merging the three dataframes with the next valid observation.
If so, this is the quickest way I found to do it (not the most elegant...):
1. Create a new index which is the union of the three indexes (which will result in timestamps at 10-minute intervals in your case).
2. Reindex all three dfs against the new index while filling in missing values separately.
3. Merge the columns of the three dfs (which will be easy since, after step 2, they share the same index).
Taking a portion of the data:
df1
Out[48]:
val_1
TimeStamp
2016-01-01 12:00:00 11
2016-01-01 12:10:00 12
df2
Out[49]:
val_2
TimeStamp
2016-01-01 12:00:00 21
2016-01-01 12:10:00 22
df3
Out[50]:
val_3
TimeStamp
2016-01-01 12:00:00 31
2016-01-01 12:30:00 32
step NO.1
u_index = df3.index.union(df2.index.union(df1.index))
u_index
Out[38]: Index(['2016-01-01 12:00:00', '2016-01-01 12:10:00', '2016-01-01 12:30:00'], dtype='object', name='TimeStamp')
step NO.2
df3_n = df3.reindex(index=u_index,method='bfill')
df2_n = df2.reindex(index=u_index,method='bfill')
df1_n = df1.reindex(index=u_index,method='bfill')
step NO.3
df1_n.merge(df2_n,on='TimeStamp').merge(df3_n,on='TimeStamp')
Out[47]:
val_1 val_2 val_3
TimeStamp
2016-01-01 12:00:00 11.0 21.0 31
2016-01-01 12:10:00 12.0 22.0 32
2016-01-01 12:30:00 NaN NaN 32
You might need to adjust the last row, since it has no following row to fill values from. But that's pretty much it.
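Since the question actually asks for the preceding measurement rather than the next one, the same three steps work with method='ffill' instead of 'bfill'; pd.concat can then stitch the reindexed frames together since they share an index. A self-contained sketch with made-up values:

```python
import pandas as pd

# Two frames at 10-minute intervals, one at 30-minute intervals
idx10 = pd.date_range('2016-01-01 12:00', periods=4, freq='10min')
idx30 = pd.date_range('2016-01-01 12:00', periods=2, freq='30min')
df1 = pd.DataFrame({'val_1': [11, 12, 13, 14]}, index=idx10)
df2 = pd.DataFrame({'val_2': [21, 22, 23, 24]}, index=idx10)
df3 = pd.DataFrame({'val_3': [31, 32]}, index=idx30)

# Step 1: union of the three indexes
u_index = df1.index.union(df2.index).union(df3.index)

# Steps 2+3: forward-fill each frame onto the union, then concat on
# the shared index (each 10-minute slot takes df3's preceding reading)
merged = pd.concat(
    [d.reindex(u_index, method='ffill') for d in (df1, df2, df3)],
    axis=1)
print(merged)
```

At 12:10 and 12:20, val_3 carries the 12:00 reading forward, which matches the "preceding measurement" requirement in the question.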

Related

Time Series Data Reformat

I am working on some code that will rearrange a time series. Currently I have a standard time series with three columns, with the header being [Date, Time, Value]. I want to reformat the dataframe to index by the date and use the time as the column header (i.e. 0:00, 1:00, ..., 23:00). The dataframe will be filled in with the values.
Here is the dataframe I currently have.
Essentially I'd like to move the index to a single day and show the hours through the columns.
Thanks,
Use pivot:
df = df.pivot(index='Date', columns='Time', values='Total')
Output (first 10 columns and with random values for Total):
>>> df.pivot(index='Date', columns='Time', values='Total').iloc[0:10]
time 00:00:00 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00 06:00:00 07:00:00 08:00:00 09:00:00
date
2019-01-01 0.732494 0.087657 0.930405 0.958965 0.531928 0.891228 0.664634 0.432684 0.009653 0.604878
2019-01-02 0.471386 0.575126 0.509707 0.715290 0.337983 0.618632 0.413530 0.849033 0.725556 0.186876
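For a self-contained illustration, here is the same pivot on a tiny hypothetical frame with the question's [Date, Time, Value] layout (using the column name 'Total' as in the answer's snippet):

```python
import pandas as pd

# Hypothetical long-format frame: one row per (date, hour)
df = pd.DataFrame({'Date': ['2019-01-01', '2019-01-01',
                            '2019-01-02', '2019-01-02'],
                   'Time': ['00:00', '01:00', '00:00', '01:00'],
                   'Total': [1.0, 2.0, 3.0, 4.0]})

# Dates become the index, times become the columns
wide = df.pivot(index='Date', columns='Time', values='Total')
print(wide)
```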
You could try this.
Split the time part to get only the hour, and prefix 'hr' to it.
df = pd.DataFrame([['2019-01-01', '00:00:00',-127.57],['2019-01-01', '01:00:00',-137.57],['2019-01-02', '00:00:00',-147.57],], columns=['Date', 'Time', 'Totals'])
df['hours'] = df['Time'].apply(lambda x: 'hr'+ str(int(x.split(':')[0])))
print(pd.pivot_table(df, values ='Totals', index=['Date'], columns = 'hours'))
Output
hours hr0 hr1
Date
2019-01-01 -127.57 -137.57
2019-01-02 -147.57 NaN

Pandas dataframe: Sum up rows by date and keep only one row per day without timestamp

I have such a dataframe:
ds y
2018-07-25 22:00:00 1
2018-07-25 23:00:00 2
2018-07-26 00:00:00 3
2018-07-26 01:00:00 4
2018-07-26 02:00:00 5
What I want to get is a new dataframe which looks like this
ds y
2018-07-25 3
2018-07-26 12
I want to get a new dataframe df1 where all the entries of one day are summed up in y and I only want to keep one column of this day without a timestamp.
What I did so far is this:
df1 = df.groupby(df.index.date).transform(lambda x: x[:24].sum())
24 because I have 24 entries every day (for every hour). I get the correct sum for every day but I also get 24 rows for every day together with the existing timestamps. How can I achieve what I want?
If you need to sum all values per day, then filtering the first 24 rows is not necessary:
df1 = df.groupby(df.index.date)['y'].sum().reset_index()
Try out:
df.groupby([df.index.year, df.index.month, df.index.day])['y'].sum()
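The first answer can be checked end to end on the question's sample data (a quick sketch; the column rename is just to reproduce the desired ds/y layout):

```python
import pandas as pd

# The question's sample: hourly values spanning two days
idx = pd.to_datetime(['2018-07-25 22:00:00', '2018-07-25 23:00:00',
                      '2018-07-26 00:00:00', '2018-07-26 01:00:00',
                      '2018-07-26 02:00:00'])
df = pd.DataFrame({'y': [1, 2, 3, 4, 5]}, index=idx)

# One row per calendar date, y summed over that date
df1 = df.groupby(df.index.date)['y'].sum().reset_index()
df1.columns = ['ds', 'y']
print(df1)
#            ds   y
# 0  2018-07-25   3
# 1  2018-07-26  12
```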

Filter out unique values of a column within a set time interval

I have a dataframe with dates and tick-data like below
Date Bid
0 20160601 00:00:00.020 160.225
1 20160601 00:00:00.136 160.226
2 20160601 00:00:00.192 160.225
3 20160601 00:00:00.327 160.230
4 20160601 00:00:01.606 160.231
5 20160601 00:00:01.613 160.230
I want to filter out unique values in the 'Bid' column at set intervals
E.g: 2016-06-01 00:00:00 - 00:15:00, 2016-06-01 00:15:00 - 00:30:00...
The result will be a new dataframe (keeping the filtered values with its datetime).
Here's the code I have so far:
#Convert Date column to index with seconds as base
df['Date'] = pd.DatetimeIndex(df['Date'])
df['Date'] = df['Date'].astype('datetime64[s]')
df.set_index('Date', inplace=True)
#Create new DataFrame with filtered values
ts = pd.DataFrame(df.loc['2016-06-01'].between_time('00:00', '00:30')['Bid'].unique())
With the method above I lose the dates (datetime) of the filtered values in the process of creating a new DataFrame, plus I have to manually input each date and time interval, which is unrealistic.
Output:
0
0 160.225
1 160.226
2 160.230
3 160.231
4 160.232
5 160.228
6 160.227
Ideally I'm looking for an operation where I can set the time interval as a timedelta and have an operation done on the whole file (about 8Gb) at once, creating a new DataFrame with Date and Bid columns of the unique values within the set interval. Like this
Date Bid
0 20160601 00:00:00.020 160.225
1 20160601 00:00:00.136 160.226
2 20160601 00:00:00.327 160.230
3 20160601 00:00:01.606 160.231
...
805 20160601 00:15:00.606 159.127
PS. I also tried using the pd.rolling() and pd.resample() methods with apply(lambda x: ...) (e.g. df['Bid'].unique()), but I was never able to make it work; maybe someone better at it could attempt it.
Just to clarify: this is not a rolling calculation. You mentioned attempting to solve this using rolling, but from your clarification it seems you want to split the time series into discrete, non-overlapping 15-minute windows.
Setup
df = pd.DataFrame({
'Date': [
'2016-06-01 00:00:00.020', '2016-06-01 00:00:00.136',
'2016-06-01 00:15:00.636', '2016-06-01 00:15:02.836',
],
'Bid': [150, 150, 200, 200]
})
print(df)
Date Bid
0 2016-06-01 00:00:00.020 150
1 2016-06-01 00:00:00.136 150 # Should be dropped
2 2016-06-01 00:15:00.636 200
3 2016-06-01 00:15:02.836 200 # Should be dropped
First, verify that your Date column is datetime:
df.Date = pd.to_datetime(df.Date)
Now use dt.floor to round each value down to the nearest 15 minutes, and use this new column to drop_duplicates per 15-minute window while still keeping the precision of your dates.
df.assign(flag=df.Date.dt.floor('15T')).drop_duplicates(['flag', 'Bid']).drop(columns='flag')
Date Bid
0 2016-06-01 00:00:00.020 150
2 2016-06-01 00:15:00.636 200
From my original answer, but I still believe it holds value: if you'd like to access the unique values per group, you can make use of pd.Grouper and unique, and I believe learning to leverage pd.Grouper is a powerful tool to have with pandas:
df.groupby(pd.Grouper(key='Date', freq='15T')).Bid.unique()
Date
2016-06-01 00:00:00 [150]
2016-06-01 00:15:00 [200]
Freq: 15T, Name: Bid, dtype: object
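Both approaches from this answer can be run as one self-contained sketch on the setup data (using the '15min' frequency alias, which is equivalent to '15T' on current pandas):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2016-06-01 00:00:00.020',
                            '2016-06-01 00:00:00.136',
                            '2016-06-01 00:15:00.636',
                            '2016-06-01 00:15:02.836']),
    'Bid': [150, 150, 200, 200]})

# Keep the first occurrence of each Bid per 15-minute window,
# preserving the full-precision timestamps
dedup = (df.assign(flag=df['Date'].dt.floor('15min'))
           .drop_duplicates(['flag', 'Bid'])
           .drop(columns='flag'))

# Or collect the array of unique Bids per window
uniq = df.groupby(pd.Grouper(key='Date', freq='15min'))['Bid'].unique()
print(dedup)
print(uniq)
```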

Pandas: How to group the non-continuous date column?

I have a column in a dataframe which contains non-continuous dates. I need to group these dates by a frequency of 2 days. Data sample (after normalization):
2015-04-18 00:00:00
2015-04-20 00:00:00
2015-04-20 00:00:00
2015-04-21 00:00:00
2015-04-27 00:00:00
2015-04-30 00:00:00
2015-05-07 00:00:00
2015-05-08 00:00:00
I tried following but as the dates are not continuous I am not getting the desired result.
df.groupby(pd.Grouper(key = 'l_date', freq='2D'))
Is there a way to achieve the desired grouping using pandas, or should I write separate logic?
Once you have a dataframe sorted by l_date, you can create a continuous dummy date column (dum_date) and group by a 2D frequency on it.
df = df.sort_values(by='l_date')
df['dum_date'] = pd.date_range(pd.Timestamp.today(), periods=df.shape[0]).tolist()
df.groupby(pd.Grouper(key = 'dum_date', freq='2D'))
OR
If you are fine with groupings other than dates, then a generalized way to group n consecutive rows could be:
n = 2 # n = 2 for your use case
df = df.sort_values(by='l_date')
df['grouping'] = [(i//n + 1) for i in range(df.shape[0])]
df.groupby(pd.Grouper(key = 'grouping'))
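The second approach can be exercised on the question's sample dates (a quick sketch; the min/max aggregation at the end is just to show which dates land in each pair):

```python
import pandas as pd

# The eight sample dates from the question
df = pd.DataFrame({'l_date': pd.to_datetime(
    ['2015-04-18', '2015-04-20', '2015-04-20', '2015-04-21',
     '2015-04-27', '2015-04-30', '2015-05-07', '2015-05-08'])})

n = 2  # group every n consecutive rows
df = df.sort_values(by='l_date')
df['grouping'] = [i // n + 1 for i in range(df.shape[0])]

# Four groups of two rows each, regardless of gaps between the dates
print(df.groupby('grouping')['l_date'].agg(['min', 'max']))
```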

adding column with per-row computed time difference from group start?

(newbie to python and pandas)
I have a data set of 15 to 20 million rows; each row is a time-indexed observation of a time a 'user' was seen, and I need to analyze the visits-per-day pattern of each user, normalized to their first visit. So I'm hoping to plot with an X axis of "days after first visit" and a Y axis of "visits by this user on this day", i.e. I need a series indexed by a timedelta with values of visits in the period ending with that delta [0:1, 3:5, 4:2, 6:8,]. But I'm stuck very early...
I start with something like this:
rng = pd.to_datetime(['2000-01-01 08:00', '2000-01-02 08:00',
'2000-01-01 08:15', '2000-01-02 18:00',
'2000-01-02 17:00', '2000-03-01 08:00',
'2000-03-01 08:20','2000-01-02 18:00'])
uid = pd.Series(['u1','u2','u1','u2','u1','u2','u2','u3'])
misc = pd.Series(['','x1','A123','1.23','','','','u3'])
df = pd.DataFrame({'uid':uid,'misc':misc,'ts':rng})
df=df.set_index(df.ts)
grouped = df.groupby('uid')
firstseen = grouped.first()
The ts values are unique to each uid but can be duplicated across uids (two uids can be seen at the same time, but any one uid is seen only once at any one timestamp).
The first step is (I think) to add a new column to the DataFrame showing, for each observation, the timedelta back to the first observation for that user. But I'm stuck getting that column into the DataFrame. The simplest thing I tried gives me an obscure-to-a-newbie error message:
df['sinceseen'] = df.ts - firstseen.ts[df.uid]
...
ValueError: cannot reindex from a duplicate axis
So I tried a brute-force method:
def f(row):
return row.ts - firstseen.ts[row.uid]
df['sinceseen'] = Series([{idx:f(row)} for idx, row in df.iterrows()], dtype=timedelta)
In this attempt, df gets a sinceseen column, but it's all NaN and type(df.sinceseen[0]) shows float; though if I just print the Series (in IPython) it generates a nice list of timedeltas.
I'm working back and forth through "Python for Data Analysis" and it seems like apply() should work, but
def fg(ugroup):
ugroup['sinceseen'] = ugroup.index - ugroup.index.min()
return ugroup
df = df.groupby('uid').apply(fg)
gives me a TypeError on the "ugroup.index - ugroup.index.min()" even though each of the two operands is a Timestamp.
So I'm flailing; can someone point me at the "pandas" way to get to the data structure I need?
Does this help you get started?
>>> df = DataFrame({'uid':uid,'misc':misc,'ts':rng})
>>> df = df.sort_values(["uid", "ts"])
>>> df["since_seen"] = df.groupby("uid")["ts"].apply(lambda x: x - x.iloc[0])
>>> df
misc ts uid since_seen
0 2000-01-01 08:00:00 u1 0 days, 00:00:00
2 A123 2000-01-01 08:15:00 u1 0 days, 00:15:00
4 2000-01-02 17:00:00 u1 1 days, 09:00:00
1 x1 2000-01-02 08:00:00 u2 0 days, 00:00:00
3 1.23 2000-01-02 18:00:00 u2 0 days, 10:00:00
5 2000-03-01 08:00:00 u2 59 days, 00:00:00
6 2000-03-01 08:20:00 u2 59 days, 00:20:00
7 u3 2000-01-02 18:00:00 u3 0 days, 00:00:00
[8 rows x 4 columns]
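On current pandas the same column can be built without sorting or apply: transform('min') broadcasts each user's first timestamp back to every row, so a plain subtraction does the rest. A sketch using the question's data:

```python
import pandas as pd

rng = pd.to_datetime(['2000-01-01 08:00', '2000-01-02 08:00',
                      '2000-01-01 08:15', '2000-01-02 18:00',
                      '2000-01-02 17:00', '2000-03-01 08:00',
                      '2000-03-01 08:20', '2000-01-02 18:00'])
df = pd.DataFrame({'uid': ['u1', 'u2', 'u1', 'u2',
                           'u1', 'u2', 'u2', 'u3'],
                   'ts': rng})

# Each row's timedelta back to that uid's earliest timestamp,
# preserving the original row order
df['since_seen'] = df['ts'] - df.groupby('uid')['ts'].transform('min')
print(df)
```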
