Pandas resample dataframe - python

I have a resampling (downsampling) problem that should be straightforward, but I can't get it to work!
Here is a simplified example:
df:
Time A
0 0.01591 0.108929
1 0.27973 0.411764
2 0.55044 0.064253
3 0.81386 0.317394
4 1.07983 0.722707
5 1.35051 1.154193
6 1.61495 1.151492
7 1.88035 0.123389
8 2.15462 0.093583
9 2.41534 0.260944
10 2.67992 1.007564
11 2.95148 0.325353
12 3.21364 0.555593
13 3.47980 0.740621
15 4.01519 1.619669
16 4.28679 0.477371
17 4.55482 0.432049
18 4.81570 0.194224
19 5.07992 0.331936
The Time column is in seconds. I would like to make the Time column the index and downsample the dataframe to 1s. Help please?

You can use reindex and choose a fill method:
In [37]: df.set_index('Time').reindex(range(0,6), method='bfill')
Out[37]:
A
0 0.108929
1 0.722707
2 0.093583
3 0.555593
4 1.619669
5 0.331936
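For reference, a self-contained sketch of this approach (the frame is rebuilt from a few rows of the question's data; 'bfill' is one choice among reindex's fill methods, alongside 'ffill' and 'nearest'):
import pandas as pd

df = pd.DataFrame({
    'Time': [0.01591, 0.27973, 0.55044, 0.81386, 1.07983, 1.35051],
    'A':    [0.108929, 0.411764, 0.064253, 0.317394, 0.722707, 1.154193],
})

# Reindex onto whole seconds; 'bfill' fills each target from the
# next observation at or after it
out = df.set_index('Time').reindex(range(0, 2), method='bfill')
print(out)
#           A
# 0  0.108929
# 1  0.722707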

First convert your index to datetime format:
df.index = pd.to_datetime(df.Time, unit='s')
Then resample by second and take the mean (swap in .sum() etc. for a different aggregation):
df.resample('s').mean()
Time A
Time
1970-01-01 00:00:00 0.414985 0.225585
1970-01-01 00:00:01 1.481410 0.787945
1970-01-01 00:00:02 2.550340 0.421861
1970-01-01 00:00:03 3.346720 0.648107
1970-01-01 00:00:04 4.418125 0.680828
1970-01-01 00:00:05 5.079920 0.331936
The year/date can be changed if important.
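For reference, a self-contained sketch in current syntax (the base date is arbitrary; origin is optional and shown only because the answer notes the year/date can be changed):
import pandas as pd

df = pd.DataFrame({
    'Time': [0.01591, 0.27973, 0.55044, 0.81386, 1.07983, 1.35051],
    'A':    [0.108929, 0.411764, 0.064253, 0.317394, 0.722707, 1.154193],
})

# Interpret the float seconds as timestamps anchored at a chosen date
df.index = pd.to_datetime(df['Time'], unit='s', origin=pd.Timestamp('2010-01-01'))

# Mean of all rows falling in each 1-second bin
print(df.resample('1s').mean())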

Related

How to replace a timedelta with NaN in a pandas Series?

I would like to calculate the mean of a timedelta Series, excluding the 00:00:00 values.
This is my Series:
1 00:28:00
3 01:57:00
5 00:00:00
7 01:27:00
9 00:00:00
11 01:30:00
I want to replace rows 5 and 9 with NaN and then apply .mean() to the Series; mean() doesn't include NaN values, so I would get the desired value.
How can I do that?
I am trying:
`df["time_column"].replace('0 days 00:00:00', np.NaN).mean()`
but no values are replaced
One idea is to use a zero Timedelta object, since the column holds timedeltas rather than strings (which is why the string replace matched nothing):
out = df["time_column"].replace(pd.Timedelta(0), np.NaN).mean()
print (out)
0 days 01:20:30
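A minimal end-to-end sketch (values taken from the question; replacing the zero Timedelta produces NaT, which mean() skips):
import numpy as np
import pandas as pd

s = pd.Series(pd.to_timedelta(['00:28:00', '01:57:00', '00:00:00',
                               '01:27:00', '00:00:00', '01:30:00']))

# The zero entries become NaT after the replace, so mean() ignores them
print(s.replace(pd.Timedelta(0), np.nan).mean())  # 0 days 01:20:30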

Prepare Data Frames to be compared. Index manipulation, datetime and beyond

Ok, this is a question in two steps.
Step one: I have a pandas DataFrame like this:
date time value
0 20100201 0 12
1 20100201 6 22
2 20100201 12 45
3 20100201 18 13
4 20100202 0 54
5 20100202 6 12
6 20100202 12 18
7 20100202 18 17
8 20100203 6 12
...
As you can see, for instance between rows 7 and 8 there is data missing (in this case, the value for the 0 time). Sometimes, several hours or even a full day could be missing.
I would like to convert this DataFrame to the format like this:
value
2010-02-01 00:00:00 12
2010-02-01 06:00:00 22
2010-02-01 12:00:00 45
2010-02-01 18:00:00 13
2010-02-02 00:00:00 54
2010-02-02 06:00:00 12
2010-02-02 12:00:00 18
2010-02-02 18:00:00 17
...
I want this because I have another DataFrame (let's call it the "reliable DataFrame") in this format, which I am sure has no missing values.
EDIT 2016/07/28: While studying the problem it turned out there was also duplicated data in the DataFrame. See the solution, which also addresses this problem.
Step two: With the previous step done I want to compare row by row the index in the "reliable DataFrame" with the index in the DataFrame with missing values.
I want to add a row with the value NaN where there are missing entries in the first DataFrame. The final check would be to be sure that both DataFrames have the same dimension.
I know this is a long question, but I am stuck. I have tried to handle the dates with dateutil.parser.parse and to use set_index to set a new index, but I get lots of errors in the code. I am afraid this is clearly above my pandas level.
Thank you in advance.
Step 1 Answer
df['DateTime'] = pd.to_datetime(df['date'].astype(str) + ' ' + df['time'].astype(str) + ':00:00')
df.set_index('DateTime', inplace=True)
df.drop(['date', 'time'], axis=1, inplace=True)
If there are duplicates these can be removed by:
df = df.reset_index().drop_duplicates(subset='DateTime',keep='last').set_index('DateTime')
Step 2
df_join = df.join(df1, how='outer', lsuffix='x', sort=True)
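Putting both steps together, a runnable sketch (df1 plays the role of the "reliable DataFrame" and is built here from a complete 6-hourly date_range purely for illustration; the datetime is assembled via to_timedelta, a slight variation on the string concatenation above):
import pandas as pd

# Step 1 input: date and time-of-day stored as integers
df = pd.DataFrame({'date': [20100201, 20100201, 20100202, 20100202],
                   'time': [0, 6, 0, 6],
                   'value': [12, 22, 54, 12]})

# Combine date and hour into a DatetimeIndex, dropping any duplicates
df['DateTime'] = (pd.to_datetime(df['date'].astype(str))
                  + pd.to_timedelta(df['time'], unit='h'))
df = (df.drop(columns=['date', 'time'])
        .drop_duplicates(subset='DateTime', keep='last')
        .set_index('DateTime'))

# Step 2: outer join aligns on the index and inserts NaN where df has gaps
df1 = pd.DataFrame({'value': range(8)},
                   index=pd.date_range('2010-02-01', periods=8, freq='6h'))
df_join = df.join(df1, how='outer', lsuffix='x', sort=True)
print(df_join)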

Time format when using pandas.to_csv()

I have output from a pandas DataFrame as follows.
id value exit enter time_diff
0 1 a 2012-11-27 10:41:20 2012-11-27 10:39:00 00:02:20
1 2 a 2012-12-07 06:00:10 2012-12-07 06:00:09 00:00:01
2 2 c 2012-12-27 06:05:17 2012-12-27 06:00:17 00:05:00
3 3 a 2012-12-27 06:00:13 2012-12-27 06:00:13 00:00:00
Why doesn't the following work?
df.to_csv('diff.csv', date_format='%H:%M:%S')
For the first row, the CSV contains the following for time_diff:
140000000000
time_diff is a timedelta, stored as an integer number of nanoseconds; it is not a date, so date_format does not apply to it. I would recommend either pickle or HDF5 if you need to round-trip.
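Since date_format only applies to datetime columns, one workaround (a sketch, not the only option) is to render the timedeltas as HH:MM:SS strings yourself before writing:
import pandas as pd

df = pd.DataFrame({'time_diff': pd.to_timedelta(['00:02:20', '00:00:01',
                                                 '00:05:00', '00:00:00'])})

# Build HH:MM:SS strings from the timedelta components
c = df['time_diff'].dt.components
df['time_diff'] = (c['hours'].map('{:02d}'.format) + ':'
                   + c['minutes'].map('{:02d}'.format) + ':'
                   + c['seconds'].map('{:02d}'.format))

df.to_csv('diff.csv')  # time_diff now appears as 00:02:20 etc.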

How to access last element of a multi-index dataframe

I have a dataframe with IDs and timestamps as a multi-index. The index is sorted by IDs and timestamps, and I want to pick the latest timestamp for each ID. For example:
IDs timestamp value
0 2010-10-30 1
2010-11-30 2
1 2000-01-01 300
2007-01-01 33
2010-01-01 400
2 2000-01-01 11
So basically the result I want is
IDs timestamp value
0 2010-11-30 2
1 2010-01-01 400
2 2000-01-01 11
What is the command to do that in pandas?
Given this setup:
import pandas as pd
import numpy as np
import io
content = io.StringIO("""\
IDs timestamp value
0 2010-10-30 1
0 2010-11-30 2
1 2000-01-01 300
1 2007-01-01 33
1 2010-01-01 400
2 2000-01-01 11""")
df = pd.read_csv(content, header=0, sep=r'\s+', parse_dates=[1])
df.set_index(['IDs', 'timestamp'], inplace=True)
using reset_index followed by groupby
df.reset_index(['timestamp'], inplace=True)
print(df.groupby(level=0).last())
yields
timestamp value
IDs
0 2010-11-30 00:00:00 2
1 2010-01-01 00:00:00 400
2 2000-01-01 00:00:00 11
This does not feel like the best solution, however. There should be a way to do this without calling reset_index...
As you point out in the comments, last ignores NaN values. To not skip NaN values, you could use groupby/agg like this:
df.reset_index(['timestamp'], inplace=True)
grouped = df.groupby(level=0)
print(grouped.agg(lambda x: x.iloc[-1]))
One can also use
df.groupby("IDs").tail(1)
This will take the last row of each label in level "IDs" and will not ignore NaN values.
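Applied to the original MultiIndex frame from the setup (before the reset_index calls), this keeps both index levels:
print(df.groupby("IDs").tail(1))
#                 value
# IDs timestamp
# 0   2010-11-30      2
# 1   2010-01-01    400
# 2   2000-01-01     11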

Dividing a series containing datetime by a series containing an integer in Pandas

I have a series s1 which is of type datetime and holds the span between a start time and an end time; typical values are 7 days, 4 hours, 5 mins, etc. I have a series s2 which contains integers for the number of events that happened in that time range.
I want to calculate the event frequency by:
event_freq = s1 / s2
I get the error:
cannot operate on a series with out a rhs of a series/ndarray of type datetime64[ns] or a timedelta
What's the best way to fix this?
Thanks in advance!
EXAMPLE of s1 is:
some_id
1 2012-09-02 09:18:40
3 2012-04-02 09:36:39
4 2012-02-02 09:58:02
5 2013-02-09 14:31:52
6 2012-01-09 12:59:20
EXAMPLE of s2 is:
some_id
1 3
3 1
4 1
5 2
6 1
8 1
10 3
12 2
This might be a bug, but what works is to operate on the underlying NumPy array, like so:
import pandas as pd
from pandas import Series
startdate = Series(pd.date_range('2013-01-01', '2013-01-03'))
enddate = Series(pd.date_range('2013-03-01', '2013-03-03'))
s1 = enddate - startdate
s2 = Series([2, 3, 4])
event_freq = Series(s1.values / s2)
Here are the Series:
>>> s1
0 59 days, 00:00:00
1 59 days, 00:00:00
2 59 days, 00:00:00
dtype: timedelta64[ns]
>>> s2
0 2
1 3
2 4
dtype: int64
>>> event_freq
0 29 days, 12:00:00
1 19 days, 16:00:00
2 14 days, 18:00:00
dtype: timedelta64[ns]
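In current pandas the detour through .values is no longer needed; a timedelta64 Series divides by an integer Series directly (a quick check, assuming a reasonably recent version):
import pandas as pd

s1 = pd.Series(pd.to_timedelta(['59 days', '59 days', '59 days']))
s2 = pd.Series([2, 3, 4])

print(s1 / s2)
# 0   29 days 12:00:00
# 1   19 days 16:00:00
# 2   14 days 18:00:00
# dtype: timedelta64[ns]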
