pandas transform timeseries into multiple column DataFrame - python

I have a time series of intraday data that looks like the one below:
ts = pd.Series(np.random.randn(60), index=pd.date_range('1/1/2000', periods=60, freq='2h'))
I am hoping to transform the data into a DataFrame, with one column per date and one row per time of day.
I have tried this:
key = lambda x: x.date()
grouped = ts.groupby(key)
But how do I transform the groups into a DataFrame with one column per date? Or is there a better way?

import pandas as pd
import numpy as np
index = pd.date_range('1/1/2000', periods=60, freq='3h')  # 3-hourly data: 8 samples per day
ts = pd.Series(np.random.randn(60), index=index)
key = lambda x: x.time()  # group by time of day
groups = ts.groupby(key)
print(pd.DataFrame({k: g for k, g in groups}).resample('D').mean().T)
out:
2000-01-01 2000-01-02 2000-01-03 2000-01-04 2000-01-05 2000-01-06 \
00:00:00 0.109959 -0.124291 -0.137365 0.054729 -1.305821 -1.928468
03:00:00 1.336467 0.874296 0.153490 -2.410259 0.906950 1.860385
06:00:00 -1.172638 -0.410272 -0.800962 0.568965 -0.270307 -2.046119
09:00:00 -0.707423 1.614732 0.779645 -0.571251 0.839890 0.435928
12:00:00 0.865577 -0.076702 -0.966020 0.589074 0.326276 -2.265566
15:00:00 1.845865 -1.421269 -0.141785 0.433011 -0.063286 0.129706
18:00:00 -0.054569 0.277901 0.383375 -0.546495 -0.644141 -0.207479
21:00:00 1.056536 0.031187 -1.667686 -0.270580 -0.678205 0.750386
2000-01-07 2000-01-08
00:00:00 -0.657398 -0.630487
03:00:00 2.205280 -0.371830
06:00:00 -0.073235 0.208831
09:00:00 1.720097 -0.312353
12:00:00 -0.774391 NaN
15:00:00 0.607250 NaN
18:00:00 1.379823 NaN
21:00:00 0.959811 NaN
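For reference, a sketch of a more direct route to the same layout, pivoting on date and time of day (regenerating the 3-hourly sample series; each (time, date) pair is unique, so a plain pivot works):
import pandas as pd
import numpy as np
ts = pd.Series(np.random.randn(60),
               index=pd.date_range('1/1/2000', periods=60, freq='3h'))
frame = pd.DataFrame({'date': ts.index.date,
                      'time': ts.index.time,
                      'value': ts.to_numpy()})
print(frame.pivot(index='time', columns='date', values='value'))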

Related

python masking each day in dataframe

I have to compute a daily sum on a dataframe, but only if at least 70% of the daily data is not NaN; if not, that day must not be taken into account. Is there a way to create such a mask? My dataframe covers more than 17 years of hourly data.
My data looks something like this:
clear skies all skies Lab
2015-02-26 13:00:00 597.5259 376.1830 307.62
2015-02-26 14:00:00 461.2014 244.0453 199.94
2015-02-26 15:00:00 283.9003 166.5772 107.84
2015-02-26 16:00:00 93.5099 50.7761 23.27
2015-02-26 17:00:00 1.1559 0.2784 0.91
... ... ...
2015-12-05 07:00:00 95.0285 29.1006 45.23
2015-12-05 08:00:00 241.8822 120.1049 113.41
2015-12-05 09:00:00 363.8040 196.0568 244.78
2015-12-05 10:00:00 438.2264 274.3733 461.28
2015-12-05 11:00:00 456.3396 330.6650 447.15
If I groupby and aggregate, there is no way to know whether a day was missing data; such days will have lower sums and will therefore drag down my monthly means.
As said in the comments, use groupby to group the data by date and then apply an appropriate selection. This example sums all days (assuming regular data, 24 points per day) that have fewer than 50% NaN entries:
import pandas as pd
import numpy as np
# create a date range
date_rng = pd.date_range(start='1/1/2018', end='1/1/2021', freq='H')
# create random data
df = pd.DataFrame({"data":np.random.randint(0,100,size=(len(date_rng)))}, index = date_rng)
# set some values to nan
df["data"][df["data"] > 50] = np.nan
# looks like this
df.head(20)
# sum everything where less than 50% are nan
df.groupby(df.index.date).sum()[df.isna().groupby(df.index.date).sum() < 12]
Example output:
data
2018-01-01 NaN
2018-01-02 NaN
2018-01-03 487.0
2018-01-04 NaN
2018-01-05 421.0
... ...
2020-12-28 NaN
2020-12-29 NaN
2020-12-30 NaN
2020-12-31 392.0
2021-01-01 0.0
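If the sampling is not a fixed 24 points per day, a sketch of a hypothetical helper that applies the question's 70% coverage rule directly, comparing non-NaN counts against total counts per day:
import pandas as pd
def daily_sum_with_coverage(s: pd.Series, min_coverage: float = 0.7) -> pd.Series:
    """Daily sums, masked to NaN on days with too much missing data."""
    grouped = s.groupby(s.index.date)
    coverage = grouped.count() / grouped.size()  # count() skips NaN, size() does not
    return grouped.sum().where(coverage >= min_coverage)
# e.g. daily_sum_with_coverage(df["data"])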
An alternative solution, which you may find useful and flexible, based on the convtools library:
# pip install convtools
from convtools import conversion as c

total_number = c.ReduceFuncs.Count()
total_not_none = c.ReduceFuncs.Count(where=c.item("amount").is_not(None))
total_sum = c.ReduceFuncs.Sum(c.item("amount"))

input_data = []  # e.g. iterable of dicts
converter = (
    c.group_by(
        c.item("key1"),
        c.item("key2"),
    )
    .aggregate(
        {
            "key1": c.item("key1"),
            "key2": c.item("key2"),
            "sum_if_70": c.if_(
                total_not_none / total_number < 0.7,
                None,
                total_sum,
            ),
        }
    )
    # install black and set debug=True to see the generated ad-hoc code
    .gen_converter(debug=False)
)
result = converter(input_data)
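For example, with hypothetical rows shaped like the fields referenced above, a day with only 2 of 3 non-None values (about 67%, below 0.7) gets a masked sum:
rows = [
    {"key1": "2015-02-26", "key2": "Lab", "amount": 307.62},
    {"key1": "2015-02-26", "key2": "Lab", "amount": None},
    {"key1": "2015-02-26", "key2": "Lab", "amount": 199.94},
]
print(converter(rows))
# expected: [{'key1': '2015-02-26', 'key2': 'Lab', 'sum_if_70': None}]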

Filtering out specific hours each day in pandas

Given a dataset where each row represents an hourly sample, i.e. each day has 24 entries with an index set like the following:
...
2020-10-22T20:00:00
2020-10-22T21:00:00
2020-10-22T22:00:00
...
2020-10-22T20:00:00
2020-10-22T20:00:00
2020-10-22T20:00:00
...
Now I want to filter the data so that for each day only the hours between 9am and 3pm are left.
The only way I know of would be to iterate over the dataset and filter each row by a condition; however, knowing pandas, there is usually some trick for this kind of filtering that does not involve explicit iteration.
You can use the aptly named pd.DataFrame.between_time method. This will only work if your dataframe has a DatetimeIndex.
Data Creation
date_index = pd.date_range("2020-10-22T20:00:00", "2020-11-22T20:00:00", freq="H")
values = np.random.rand(len(date_index), 1)  # one random value per timestamp
df = pd.DataFrame(values, index=date_index, columns=["value"])
print(df.head())
value
2020-10-22 20:00:00 0.637542
2020-10-22 21:00:00 0.590626
2020-10-22 22:00:00 0.474802
2020-10-22 23:00:00 0.058775
2020-10-23 00:00:00 0.904070
Method
subset = df.between_time("9:00am", "3:00pm")
print(subset.head(10))
value
2020-10-23 09:00:00 0.210816
2020-10-23 10:00:00 0.086677
2020-10-23 11:00:00 0.141275
2020-10-23 12:00:00 0.065100
2020-10-23 13:00:00 0.892314
2020-10-23 14:00:00 0.214991
2020-10-23 15:00:00 0.106937
2020-10-24 09:00:00 0.900106
2020-10-24 10:00:00 0.545249
2020-10-24 11:00:00 0.793243
import pandas as pd
# sample data (strings)
data = [f'2020-10-{d:02d}T{h:02d}:00:00' for h in range(24) for d in range(1, 21)]
# series of DT values
ds = pd.to_datetime(pd.Series(data), format='%Y-%m-%dT%H:%M:%S')
# filter by hours
ds_filter = ds[(ds.dt.hour >= 9) & (ds.dt.hour <= 15)]
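The same hour-based test also works directly against a DatetimeIndex, e.g. on the df built in the between_time answer above:
subset = df[(df.index.hour >= 9) & (df.index.hour <= 15)]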

Merge two pandas dataframes with timeseries index

I have two pandas dataframes that I would like to merge/join together
For example:
# required packages
import os
import pandas as pd
import numpy as np
import datetime as dt
# create sample time series
dates1 = pd.date_range('1/1/2000', periods=4, freq='10min')
dates2 = dates1
column_names = ['A', 'B', 'C']
df1 = pd.DataFrame(np.random.randn(4, 3), index=dates1, columns=column_names)
df2 = pd.DataFrame(np.random.randn(4, 3), index=dates2, columns=column_names)
df3 = df1.merge(df2, how='outer', left_index=True, right_index=True, suffixes=('_x', '_y'))
From here I would like to merge the two datasets in the following manner (note the order of the columns):
A_x A_y B_x B_y C_x C_y
2000-01-01 00:00:00 2000-01-01 00:00:00 -0.572616 -0.867554 -0.382594 1.866238 -0.756318 0.564087
2000-01-01 00:10:00 2000-01-01 00:10:00 -0.814776 -0.458378 1.011491 0.196498 -0.523433 -0.296989
2000-01-01 00:20:00 2000-01-01 00:20:00 -0.617766 0.081141 1.405145 -1.183592 0.400720 -0.872507
2000-01-01 00:30:00 2000-01-01 00:30:00 1.083721 0.137422 -1.013840 -1.610531 -1.258841 0.142301
I would like to preserve both dataframe indexes by either creating a multi-index dataframe or creating a column for the second index. Would it be easier to use merge_ordered instead of merge or join?
Any help is appreciated.
I think you want to concat rather than merge:
In [11]: pd.concat([df1, df2], keys=["df1", "df2"], axis=1)
Out[11]:
df1 df2
A B C A B C
2000-01-01 00:00:00 1.621737 0.093015 -0.698715 0.319212 1.021829 1.707847
2000-01-01 00:10:00 0.780523 -1.169127 -1.097695 -0.444000 0.170283 1.652005
2000-01-01 00:20:00 1.560046 -0.196604 -1.260149 0.725005 -1.290074 0.606269
2000-01-01 00:30:00 -1.074419 -2.488055 -0.548531 -1.046327 0.895894 0.423743
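A follow-up note: the keys= argument yields MultiIndex columns, so each original frame can be recovered by its key, e.g.:
result = pd.concat([df1, df2], keys=["df1", "df2"], axis=1)
print(result["df2"])  # just the columns that came from df2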
Using concat
pd.concat([df1.reset_index().add_suffix('_x'),
           df2.reset_index().add_suffix('_y')],
          axis=1).set_index(['index_x', 'index_y'])
A_x B_x C_x A_y B_y C_y
index_x index_y
2000-01-01 00:00:00 2000-01-01 00:00:00 -1.437311 -1.414127 0.344057 -0.533669 -0.260106 -1.316879
2000-01-01 00:10:00 2000-01-01 00:10:00 0.662025 1.860933 -0.485169 -0.825603 -0.973267 -0.760737
2000-01-01 00:20:00 2000-01-01 00:20:00 -0.300213 0.047812 -2.279631 -0.739694 -1.872261 2.281126
2000-01-01 00:30:00 2000-01-01 00:30:00 1.499468 0.633967 -1.067881 0.174793 1.197813 -0.879132
merge will indeed merge both indices.
You can create the extra column in df2 before you merge :
df2["index_2"]=df2.index
Which will create a column in the final result that will be the value of the index in df2.
Please note that the only case this column will be different from the index is when the element does not appear in df2, in which case it will be null, so I'm not sure I understand your final goal in this.
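If the interleaved column order from the question (A_x, A_y, B_x, B_y, ...) is the goal, a sketch building on a plain index-aligned join: sorting the columns lexicographically interleaves the suffix pairs.
merged = df1.join(df2, lsuffix='_x', rsuffix='_y')
merged = merged.sort_index(axis=1)  # A_x, A_y, B_x, B_y, C_x, C_y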

Copy certain rows from pandas dataframe to a new one (Time condition)

I have a dataframe which looks like this:
pressure mean pressure std
2016-03-01 00:00:00 615.686441 0.138287
2016-03-01 01:00:00 615.555000 0.067460
2016-03-01 02:00:00 615.220000 0.262840
2016-03-01 03:00:00 614.993333 0.138841
2016-03-01 04:00:00 615.075000 0.072778
2016-03-01 05:00:00 615.513333 0.162049
................
The first column is the index column.
I want to create a new dataframe with only the rows for 3am and 3pm, so it will look like this:
pressure mean pressure std
2016-03-01 03:00:00 614.993333 0.138841
2016-03-01 15:00:00 616.613333 0.129493
2016-03-02 03:00:00 615.600000 0.068889
..................
Any ideas? Thank you!
I couldn't load your data using pd.read_clipboard(), so I'm going to recreate some data:
df = pd.DataFrame(index=pd.date_range('2016-03-01', freq='H', periods=72),
                  data=np.random.random(size=(72, 2)),
                  columns=['pressure', 'mean'])
Now your dataframe should have a DatetimeIndex. If not, you can use df.index = pd.to_datetime(df.index).
Then it's really easy using boolean indexing:
df.loc[(df.index.hour == 3) | (df.index.hour == 15)]  # .loc, since .ix has been removed from pandas
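The same selection can be written with Index.isin, which reads a bit more cleanly:
df.loc[df.index.hour.isin([3, 15])]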

Pandas resample numpy array

So I have a dataframe of the following form: the index is a date, and I have a column that consists of np.arrays with a shape of 180x360. What I want to do is calculate the weekly mean of the data set. Example of the dataframe:
vika geop
1990-01-01 06:00:00 [[50995.954225, 50995.954225, 50995.954225, 50...
1990-01-02 06:00:00 [[51083.0576138, 51083.0576138, 51083.0576138,...
1990-01-03 06:00:00 [[51045.6321168, 51045.6321168, 51045.6321168,...
1990-01-04 06:00:00 [[50499.8436192, 50499.8436192, 50499.8436192,...
1990-01-05 06:00:00 [[49823.5114237, 49823.5114237, 49823.5114237,...
1990-01-06 06:00:00 [[50050.5148846, 50050.5148846, 50050.5148846,...
1990-01-07 06:00:00 [[50954.5188533, 50954.5188533, 50954.5188533,...
1990-01-08 06:00:00 [[50995.954225, 50995.954225, 50995.954225, 50...
1990-01-09 06:00:00 [[50628.1596088, 50628.1596088, 50628.1596088,...
What I've tried so far is the simple
df = df.resample('W-MON')
But I get this error:
pandas.core.groupby.DataError: No numeric types to aggregate
I've tried changing the datatype of the column to list, but it still does not work. Any idea how to do this with resample, or with any other method?
You can use Panel to represent 3d data:
import pandas as pd
import numpy as np
index = pd.date_range("2012/01/01", "2012/02/01")
p = pd.Panel(np.random.rand(len(index), 3, 4), items=index)
p.resample("W-MON")
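Note that pd.Panel has since been removed from pandas, so here is a Panel-free sketch of the weekly mean, stacking each week's 2-D arrays with numpy (the name geop is taken from the question):
import pandas as pd
import numpy as np
# sample data: one 180x360 array per day, as in the question
index = pd.date_range('1990-01-01 06:00', periods=9, freq='D')
geop = pd.Series([np.random.rand(180, 360) for _ in index], index=index)
# stack each week's arrays along a new axis and average them
weekly = pd.Series({week: np.stack(list(group)).mean(axis=0)
                    for week, group in geop.groupby(pd.Grouper(freq='W-MON'))})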
