I have hourly OHLC data that I am trying to regroup so that the 9 pm to 5 am window ends up in one row, and then the same for every day.
I've tried several ways suggested here, but without success.
index_21_09 = eur.index.indexer_between_time('21:00','05:00')
df = eur.iloc[index_21_09]
With this I filter the data from 21:00 to 05:00, but it stays spread across several rows; I need it in one row.
Then I tried this:
df_day_max = df.groupby(pd.Grouper(freq='8h')).max()
df_day_min = df.groupby(pd.Grouper(freq = '8h')).min()
df_group = (pd.concat([df_day_max['High'], df_day_min['Low ']], axis=1).dropna())
But I get groups running from 16:00 to 00:00! How, when I previously filtered the data to 21:00-05:00?
df_resample = df.resample('8H').ohlc()
With this I get the same result, only with NaN values.
Any help with this? Thanks.
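A possible sketch of the grouping step (assuming columns named Open/High/Low/Close and pandas >= 1.1 for the offset argument): the default 8-hour bins are anchored at midnight (00:00, 08:00, 16:00), which is why groups covering 16:00-00:00 appear; anchoring the bins at 21:00 puts each 21:00-05:00 session in a single row.

import pandas as pd
import datetime

# 8-hour bins anchored at 21:00 instead of midnight:
bins = eur.resample('8H', offset='21h').agg(
    {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'})

# Keep only the bins starting at 21:00; the other two daily bins cover
# 05:00-13:00 and 13:00-21:00 and are discarded:
df_group = bins[bins.index.time == datetime.time(21, 0)].dropna()

Note that with this alignment the 05:00 bar itself falls into the following bin; on pandas versions older than 1.1 the same alignment can be expressed with base=21 instead of offset='21h'.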
I have a long time series with a 15-minute time step. I want to obtain a time series with a 3-hour time step from the existing series. I have tried different methods, including the resample method, but resample does not work for me. I decided to run a loop to obtain these values, using the following piece of code, but I am not sure why it is not working as I expect. I cannot use resample.mean() since I don't want to miss any actual peak values, e.g. the crest of a flood wave; I want to keep the original data as it is.
station_number = []
timestamp = []
water_level = []
discharge = []

for i in df3.index:
    station_number.append(df3['Station ID'][i])
    timestamp.append(df3['Timestamp'][i])
    water_level.append(df3['Water Level (m)'][i])
    discharge.append(df3['Discharge (m^3/s)'][i])
    i = i + 12
    pass
df5 = pd.DataFrame(station_number, columns=['Station ID'],)
df5['Timestamp']= timestamp
df5['Water Level (m)']= water_level
df5['Discharge (m^3/s)']= discharge
df5
Running this code returns me the same dataframe. My logic is that the value of i advances by 12 steps and picks up the corresponding values from the dataset. Please advise if I am doing something wrong.
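For what it's worth, a minimal sketch of the row-skipping the loop seems to aim for (assuming every 12th record is wanted, since 12 x 15 min = 3 h):

# Reassigning i inside `for i in df3.index` has no effect: the loop variable
# is overwritten by the next index value on every iteration. Positional
# slicing keeps every 12th original record unchanged:
df5 = df3.iloc[::12].reset_index(drop=True)

# If the goal is instead to keep the peak within each 3-hour window (e.g. a
# flood wave crest), resampling with max preserves it (assumes 'Timestamp'
# holds datetimes):
# df5 = df3.set_index('Timestamp').resample('3H').max()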
I have a question about pandas and working with time series.
I get my data as JSON from an API; the data includes traffic counts at different locations, measured every 5 minutes. The simplified data looks like this:
[
{
"name": "123", // location id
"date": "2020-01-01T00:45:00Z", // date and time
"intensity": 7 // number of vehicles counted
},
...
]
There is a month's worth of data, read using pandas and concatenated into one big dataframe:
dfs = []
# in loop:
df = pd.read_json(de.path)
df['date'] = pd.to_datetime(df['date'])
dfs.append(df)
data = pd.concat(dfs)
I average the counts for equivalent time slots over different days:
data = data.set_index('date')
data = data.groupby(data.index.time).aggregate("mean")
The final step is where I have a problem. I have tried using the pandas resample function, but that requires a DatetimeIndex, which is lost in the previous step - if I print the index out, I get this:
print(data.index)
>>> Index([00:00:00, 00:05:00, 00:10:00, 00:15:00, 00:20:00, 00:25:00, 00:30:00,
00:35:00, 00:40:00, 00:45:00,
...
23:10:00, 23:15:00, 23:20:00, 23:25:00, 23:30:00, 23:35:00, 23:40:00,
23:45:00, 23:50:00, 23:55:00],
dtype='object', length=288)
I tried converting the index to PeriodIndex, but failed.
Is there a common way of doing this? I feel I have probably missed something simple.
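One way to get a resamplable index back (a sketch, continuing from the averaged data frame above) is to turn the time-of-day labels into timestamps on a dummy date, which restores a DatetimeIndex:

# After the groupby the index holds datetime.time objects; converting them to
# full timestamps (on the dummy date 1900-01-01) restores a DatetimeIndex:
data.index = pd.to_datetime(data.index.astype(str), format='%H:%M:%S')

# resample is now available again, e.g. 15-minute averages of the 5-minute means:
data_15min = data.resample('15min').mean()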
I solved the problem by using the groupby function like this:
for i, g in df.groupby(np.arange(len(df)) // 12):
    counts_by_name[name].append((i + 1, sum(g['intensity'])))
Does anyone know of a nicer solution that would allow actual resampling?
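For comparison, a resample-based sketch of the hourly totals the arange workaround computes (assuming df still carries its 'date' column with parsed timestamps):

# Summing the 5-minute counts into hourly totals directly on the timestamps,
# instead of counting off 12 rows at a time:
hourly_counts = df.set_index('date')['intensity'].resample('1H').sum()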
I created a DataFrame with share prices from 30 companies with the following code:
df_2 = pd.read_csv('tickerdata.csv')
df_2["Date"] = pd.to_datetime(df_2["Date"])
df_2 = df_2.set_index('Date')
Now, I need to calculate the daily log returns across all 30 columns for the different companies using a for loop. I know how to do it for a dataset with one column:
df_1 = np.log(p) - np.log(p).shift(1)
df_1["DJIA"] = 100 * df_1["DJIA"]
df_1 = df_1.dropna(axis=0, how='any')
However, now I need to do it for 30 columns using a for loop. I have no idea how to do that, and after some intensive searching on Google I still don't know how.
Thank you
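A possible sketch, assuming df_2 is indexed by Date and all 30 columns hold numeric prices; np.log and shift operate column-wise, so the whole frame can be handled at once, with a loop version shown for completeness:

import numpy as np

# Vectorised across all 30 columns at once:
log_returns = 100 * (np.log(df_2) - np.log(df_2).shift(1))
log_returns = log_returns.dropna(axis=0, how='any')

# Equivalent with an explicit for loop over the columns:
# log_returns = pd.DataFrame(index=df_2.index)
# for col in df_2.columns:
#     log_returns[col] = 100 * (np.log(df_2[col]) - np.log(df_2[col]).shift(1))
# log_returns = log_returns.dropna(axis=0, how='any')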
I am trying to divide a huge log data set containing records with StartTime, EndTime and other fields.
I am using np.where to compare a pandas DataFrame column and then divide the data into hourly (or possibly half-hour or quarter-hour) chunks, depending on the hr and timeWindow values.
Below I am trying to divide an entire day's logs into 1-hour chunks, but it does not give me the expected output.
I am out of ideas as to where exactly my mistake is!
# Holding very first time in the log data and stripping off
# second, minutes and microseconds.
today = datetime.strptime(log_start_time, "%Y-%m-%d %H:%M:%S.%f").replace(second = 0, minute = 0, microsecond = 0)
today_ts = int(time.mktime(today.timetuple())*1e9)
hr = 1
timeWindow = int(hr*60*60*1e9) #hour*minute*second*restdigits
parts = [df.loc[np.where((df["StartTime"] >= (today_ts + i * timeWindow)) &
                         (df["StartTime"] < (today_ts + (i + 1) * timeWindow)))]
           .dropna(axis=0, how='any')
         for i in range(0, rngCounter)]
If I check the first log entry inside each of my parts, it is something like this:
00:00:00.
00:43:23.
01:12:59.
01:53:55.
02:23:52.
....
Whereas I expect the output to be like this:
00:00:00
01:00:01
02:00:00
03:00:00
04:00:01
....
Though I have implemented it an alternative way, that is a workaround and I lost a few features by not having it like this.
Can someone please figure out what exactly is wrong with this approach?
Note: I am using a Python notebook with pandas and numpy.
Thanks to @raganjosh, I got my solution to the problem by using pandas Grouper.
Below is my implementation. I have used a dynamic value for 'hr'.
timeWindow = str(hr) + 'H'

# Dividing the log into "n" parts, depending on the timeWindow initialisation.
# Copy StartTime and use the copy as a DatetimeIndex.
df["ST"] = df['StartTime']
df = df.set_index(['ST'])
df.index = pd.to_datetime(df.index)

# parts holds the grouped chunks per time window: each element is a
# (group key, group values) pair.
parts = list(df.groupby(pd.Grouper(freq=timeWindow))[['StartTime', 'ProcessTime', 'EndTime']])
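For example, the resulting chunks can be consumed like this (a small usage sketch):

# Each element of parts is a (window start timestamp, DataFrame) pair:
for window_start, chunk in parts:
    print(window_start, len(chunk))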
I have a csv file which I am trying to manipulate and plot. It is tabular data of an entire year of statistics per company. I would like to plot the earnings of (say) Google each week for a decade, so I know I have to splice together several years of data in the form of arrays. However, I am not sure how to organize this data in terms of weeks.
(1) How do I search the columns to find only 'Google', and (2) how can I plot this by week? I think I would have to sum over days 1-7.
fname = "file.csv"
import pandas as pd
data = pd.read_csv(fname)
data.columns
#OUTPUT ..., 'Date', 'DayWeek',..., 'Companies', ..., 'Earnings'
01/05/2008 7 Yahoo 5678.89
01/06/2008 1 Google 3486.84
01/07/2008 2 Google 2379.23
01/08/2008 3 Ask 3578.22
01/09/2008 4 Google 2341.10
01/10/2008 5 DuckDuckGo 8410.00
Something like this:
data['week'] = pd.DatetimeIndex(data['Date']).to_period('W-SAT').to_timestamp(how='end')
data[data['Companies']=='Google'].groupby('week')['Earnings'].sum()
I suspect there's a more elegant way to get the week variable than what I'm doing.
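One possibly simpler variant (a sketch, assuming 'Date' is parsed to datetimes) groups directly with a weekly Grouper instead of building a separate 'week' column:

data['Date'] = pd.to_datetime(data['Date'])
weekly_goog = (data[data['Companies'] == 'Google']
               .groupby(pd.Grouper(key='Date', freq='W-SAT'))['Earnings']
               .sum())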
This should also work, but we have to make the index the date:
data.index = pd.DatetimeIndex(data['Date'])
goog_totals = data[data['Companies'] == 'Google']['Earnings'].resample('W').sum().dropna()
At that point, you can just plot with goog_totals.plot().