numpy.where does not work properly with pandas dataframe - python

I am trying to divide a huge log dataset that contains StartTime, EndTime and other fields.
I am using np.where to compare the pandas dataframe column and then split the data into hourly (possibly half-hour or quarter-hour) chunks, depending on the hr and timeWindow values.
Below, I am trying to divide an entire day of logs into 1-hour chunks, but it does not give me the expected output.
I am out of ideas as to where exactly my mistake is.
# Take the very first time in the log data and strip off
# the minutes, seconds and microseconds.
today = datetime.strptime(log_start_time, "%Y-%m-%d %H:%M:%S.%f").replace(minute=0, second=0, microsecond=0)
today_ts = int(time.mktime(today.timetuple()) * 1e9)
hr = 1
timeWindow = int(hr * 60 * 60 * 1e9)  # hours * minutes * seconds * nanosecond scale
parts = [df.loc[np.where((df["StartTime"] >= (today_ts + i * timeWindow)) &
                         (df["StartTime"] < (today_ts + (i + 1) * timeWindow)))].dropna(axis=0, how='any')
         for i in range(0, rngCounter)]
If I check the first log entry inside each chunk of my parts data, it is something like below:
00:00:00.
00:43:23.
01:12:59.
01:53:55.
02:23:52.
....
Whereas I expect the output to be like below:
00:00:00
01:00:01
02:00:00
03:00:00
04:00:01
....
I have implemented it in an alternative way, but that is a workaround and I lost a few features by not doing it like this.
Can someone please figure out what exactly is wrong with this approach?
Note: I am using python notebook with pandas, numpy.

Thanks to @raganjosh.
I got the solution to my problem by using pandas' Grouper.
Below is my implementation.
I have used a dynamic value for 'hr'.
timeWindow = str(hr) + 'H'
# Dividing the log into "n" parts, depending on the timeWindow initialisation.
df["ST"] = df['StartTime']
df = df.set_index(['ST'])
# Using the copied column as the index and converting it to datetime.
df.index = pd.to_datetime(df.index)
# parts contains the chunks of data grouped per timeWindow:
# each list item is a tuple where [0] = key of the group, [1] = values.
parts = list(df.groupby(pd.TimeGrouper(freq=timeWindow))['StartTime', "ProcessTime", "EndTime"])
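In newer pandas versions pd.TimeGrouper has been removed in favour of pd.Grouper, which can also group on a column directly instead of the index. A minimal sketch of the same idea (an assumption on my part, not the exact code above; it presumes StartTime can be parsed by pd.to_datetime and that hr is defined as before):
import pandas as pd
df['StartTime'] = pd.to_datetime(df['StartTime'])
timeWindow = str(hr) + 'H'
# Each element of parts is a (window start, chunk DataFrame) tuple.
parts = list(df.groupby(pd.Grouper(key='StartTime', freq=timeWindow))[['StartTime', 'ProcessTime', 'EndTime']])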

Related

Obtaining Data from a dataframe at desired timestep

I have a long time series with a 15-minute time step and I want to obtain a time series at a 3-hour time step from it. I have tried different methods, including the resample method, but resample does not work for me. I decided to run a loop to obtain these values, using the following piece of code, but I am not sure why it is not working as I expect. I cannot use resample().mean() since I don't want to miss any actual peak values, e.g. that of a flood wave; I want to keep the original data as it is.
station_number = []
timestamp = []
water_level = []
discharge = []
for i in df3.index:
    station_number.append(df3['Station ID'][i])
    timestamp.append(df3['Timestamp'][i])
    water_level.append(df3['Water Level (m)'][i])
    discharge.append(df3['Discharge (m^3/s)'][i])
    i = i + 12
    pass
df5 = pd.DataFrame(station_number, columns=['Station ID'],)
df5['Timestamp']= timestamp
df5['Water Level (m)']= water_level
df5['Discharge (m^3/s)']= discharge
df5
Running this code returns me the same dataframe. My logic is that the value of i updates by 12 steps and picks up the corresponding values from the dataset. Please guide me if I am doing something wrong.
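Note that reassigning i inside a for loop over df3.index has no effect on the next iteration, so every row gets appended. A minimal sketch of what the loop appears to be aiming for (an assumption: the rows are sorted in time and evenly spaced every 15 minutes) is to keep every 12th row with positional slicing:
# 12 x 15 min = 3 h: keep every 12th original row, preserving the original values.
df5 = df3.iloc[::12].reset_index(drop=True)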

Optimize file reading with numpy

I have a .dat file produced by an FPGA. The file contains 3 columns: the first is the input channel (it can be 1 or 2), the second is the timestamp at which an event occurred, the third is the local time at which the same event occurred. The third column is necessary because sometimes the FPGA has to reset its clock counter, so it doesn't count in a continuous way. An example of what I am saying is represented in the next figure.
An example of some lines from the .dat file is the following:
1 80.80051152 2022-02-24T18:28:49.602000
2 80.91821978 2022-02-24T18:28:49.716000
1 80.94284154 2022-02-24T18:28:49.732000
2 0.01856876 2022-02-24T18:29:15.068000
2 0.04225772 2022-02-24T18:29:15.100000
2 0.11766780 2022-02-24T18:29:15.178000
The time column is given by the FPGA (in tens of nanoseconds); the date column is written by the Python script that listens to the data from the FPGA: when it has to write a timestamp, it also saves the local time as a date.
I am interested in getting two arrays (one for each channel) where, for each event, I have the time at which that event occurs relative to the starting time of the acquisition. An example of how the data given before should look at the end is the following:
8.091821978000000115e+01
1.062702197800000050e+02
1.062939087400000062e+02
1.063693188200000179e+02
These data refer to the second channel only. A double check can be made by observing the third column in the previous data.
I tried to achieve this with a function (too messy for me) where I check each time whether the difference in time between two consecutive events differs by more than 1 second from the difference in local time; if that's the case, I evaluate the time interval through the local time column and correct the timestamp by the right amount of time:
ch, time, date = np.genfromtxt("events220302_1d.dat", unpack=True,
                               dtype=(int, float, 'datetime64[ms]'))
mask1 = ch == 1
mask2 = ch == 2
time1 = time[mask1]
time2 = time[mask2]
date1 = date[mask1]
date2 = date[mask2]
corr1 = np.zeros(len(time1))
for idx, val in enumerate(time1):
    if idx < len(time1) - 1:
        if check_dif(time1[idx], time1[idx+1], date1[idx], date1[idx+1]) == 0:
            corr1[idx+1] = val + (date1[idx+1] - date1[idx]) / np.timedelta64(1, 's') - time1[idx+1]
time1 = time1 + corr1.cumsum()
where check_dif is a function that returns 0 if the difference in time between consecutive events is inconsistent with the difference in date between those same events, as described above.
Is there any more elegant or even faster way to get what I want with maybe some fancy NumPy coding?
A simple initial way to optimize your code is to make the code if-less, thus getting rid of both the if statements. To do so, instead of returning 0 in check_dif, you can return 1 when "the difference in time between consecutive events is inconsistent with the difference in date between the two same events as I said before", otherwise 0.
Your for loop will then look something like this:
for idx in range(len(time1) - 1):
    is_dif = check_dif(time1[idx], time1[idx+1], date1[idx], date1[idx+1])
    # Correction value: if is_dif == 0, no correction; otherwise a correction takes place.
    corr1[idx+1] = is_dif * (time1[idx] + (date1[idx+1] - date1[idx]) / np.timedelta64(1, 's') - time1[idx+1])
A more NumPy-like way to do things could be through vectorization. I don't know if you have a benchmark on the speed or how big the file is, but I think in your case the previous change should be good enough.
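For reference, a fully vectorized sketch of the same correction (this assumes check_dif simply flags pairs whose FPGA time step and wall-clock step disagree by more than 1 second, which is my reading of the description, not code from the question):
dt_clock = np.diff(time1)                           # FPGA time deltas, in seconds
dt_wall = np.diff(date1) / np.timedelta64(1, 's')   # wall-clock deltas, in seconds
reset = np.abs(dt_wall - dt_clock) > 1.0            # assumed reset/jump detection
corr1 = np.zeros(len(time1))
# Same per-pair correction as in the loop, applied only where a reset was detected.
corr1[1:] = np.where(reset, time1[:-1] + dt_wall - time1[1:], 0.0)
time1 = time1 + corr1.cumsum()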

Pandas average per timeslot over different days, then sum within hour

I have a question about pandas and working with time series.
I get my data as JSON from an API; the data includes traffic counts at different locations, measured every 5 minutes. The simplified data looks like this:
[
{
"name": "123", // location id
"date": "2020-01-01T00:45:00Z", // date and time
"intensity": 7 // number of vehicles counted
},
...
]
There is a month's worth of data, read using pandas and concatenated into one big dataframe:
# in loop:
dfs = []
df = pd.read_json(de.path)
df['date'] = pd.to_datetime(df['date'])
dfs.append(df)
data = pd.concat(dfs)
I average the counts for equivalent time slots over different days:
data = data.set_index('date')
data = data.groupby(data.index.time).aggregate("mean")
The final step is where I have a problem. I have tried using the pandas resample function, but that requires a DatetimeIndex, which is lost in the previous step. If I print the index out, I get this:
print(data.index)
>>> Index([00:00:00, 00:05:00, 00:10:00, 00:15:00, 00:20:00, 00:25:00, 00:30:00,
00:35:00, 00:40:00, 00:45:00,
...
23:10:00, 23:15:00, 23:20:00, 23:25:00, 23:30:00, 23:35:00, 23:40:00,
23:45:00, 23:50:00, 23:55:00],
dtype='object', length=288)
I tried converting the index to PeriodIndex, but failed.
Is there a common way of doing this? I feel I have probably missed something simple.
I solved the problem by using the groupby function like this:
for i, g in df.groupby(np.arange(len(df)) // 12):
    counts_by_name[name].append((i+1, sum(g['intensity'])))
Does anyone know of a nicer solution that would allow actual resampling?
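One way to get real resampling back (a sketch, assuming the goal is to sum the per-time-slot means within each hour) is to rebuild a DatetimeIndex by attaching the time-of-day values left in the index to an arbitrary dummy date:
import pandas as pd
# The groupby above leaves datetime.time objects in the index; attach them to a
# dummy date so pandas gets a proper DatetimeIndex again.
data.index = pd.to_datetime([t.strftime('%H:%M:%S') for t in data.index], format='%H:%M:%S')
hourly = data.resample('1H').sum()   # sum the 5-minute averages within each hour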

PYTHON: Filtering a dataset and truncating a date

I am fairly new to python, so any help would be greatly appreciated. I have a dataset that I need to filter down to specific events. For example, I have a column with dates and I need to know what dates are in the current month and have happened within the past week. The column is called POS_START_DATE with dates formatted like 2019-01-27T00:00:00-0500. I need to truncate that date and compare it to the previous week. No luck so far.
Here is my code so far:
## import data package
import datetime
## assign date variables
today = datetime.date.today()
six_day = datetime.timedelta(days = 6)
## Create week parameter
week = today + six_day
## Statement to extract recent job movements
if fields.POS_START_DATE < week and fields.POS_START_DATE > today:
    out1 += in1
Here is a sample of the table:
Sample Table
I am looking for the same table filtered down to only rows that happened within one week. The bottom of the sample table (not shown) will have dates in this month. I'd like the final output to only show those rows, and any other rows in the current month of November.
I am not quite sure I understand what your expected output is, but this will help you create an extra column which will be used as a flag for the cases that fulfil the condition you state in your if-statement:
import numpy as np
fields['flag_1'] = np.where(((fields['POS_START_DATE'] < week) & (fields['POS_START_DATE'] > today)),1,0)
This will generate an extra column in your dataframe with a 1 for the cases that meet the criteria you stated. Finally you can perform this calculation to get the total of cases that actually met the criteria:
total_cases = fields['flag_1'].sum()
Edit:
If you need to filter the data down to only the cases that meet the criteria, you can either use pandas filtering with the original condition from your if-statement (without creating the extra flag field) like this:
df_filtered = fields[(fields['POS_START_DATE'] < week) & (fields['POS_START_DATE'] > today)]
Or, if you created the flag, then much simpler:
df_filtered = fields[fields['flag_1'] == 1]
Both should work to generate a new dataframe, with only the cases that match your criteria.
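Note that for either comparison to work, POS_START_DATE (strings like 2019-01-27T00:00:00-0500) has to be converted to an actual datetime first, and today/week need to be comparable types. A minimal sketch of that conversion (an assumption, since the original dtypes are not shown):
import pandas as pd
# Parse the ISO strings (with UTC offset), drop the timezone and keep only the date part.
fields['POS_START_DATE'] = pd.to_datetime(fields['POS_START_DATE'], utc=True).dt.tz_localize(None).dt.normalize()
today = pd.Timestamp.today().normalize()
week = today + pd.Timedelta(days=6)
df_filtered = fields[(fields['POS_START_DATE'] > today) & (fields['POS_START_DATE'] < week)]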

How to filter this data in pandas data frame or numpy array?

I'm trying to plot performance metrics of various assets in a back test.
I have imported 'test_predictions.json' into a pandas data frame. It is a list of dictionaries and contains results from various assets (listed one after the other).
Here is a sample of the data:
trading_pair return timestamp prediction
[u'Poloniex_ETH_BTC' 0.003013302628677 1450753200L -0.157053292753482]
[u'Poloniex_ETH_BTC' 0.006013302628677 1450753206L -0.187053292753482]
...
[u'Poloniex_FCT_BTC' 0.006013302628677 1450753100L 0.257053292753482]
Each backtest starts and ends at different times.
Here is the data for the assets of interest:
'''
#These are the assets I would like to analyse
Poloniex_DOGE_BTC 2015-10-21 02:00:00 1445392800
Poloniex_DOGE_BTC 2016-01-12 05:00:00 1452574800
Poloniex_XRP_BTC 2015-10-28 06:00:00 1446012000
Poloniex_XRP_BTC 2016-01-12 05:00:00 1452574800
Poloniex_XMR_BTC 2015-10-21 14:00:00 1445436000
Poloniex_XMR_BTC 2016-01-12 06:00:00 1452578400
Poloniex_VRC_BTC 2015-10-25 07:00:00 1445756400
Poloniex_VRC_BTC 2016-01-12 00:00:00 1452556800
'''
So I'm trying to make a new array that contains the data for these assets. Each asset must be sliced appropriately so they all start from the latest start time and end at the earliest end time (otherwise there will be incomplete data).
#each array should start and end:
#start 2015-10-28 06:00:00
#end 2016-01-12 00:00:00
So the question is:
How can I search for an asset, e.g. Poloniex_DOGE_BTC, and then acquire the indices for the start and end times specified above?
I will be plotting the data via numpy, so maybe it's better to turn it into a numpy array with df.values and then conduct the search? Then I could use np.hstack((df_index_asset1, df_index_asset2)) so it's in the right form to plot. So the problem is: using either pandas or numpy, how do I retrieve the data for the specified assets which fall into the master start and end times?
On a side note, here is the code I wrote to get the start and end dates; it's not the most efficient, so improving that would be a bonus.
EDIT:
From Kartik's answer I tried to obtain just the data for the asset name 'Poloniex_DOGE_BTC' using the following code:
import pandas as pd
import numpy as np
preds = 'test_predictions.json'
df = pd.read_json(preds)
asset = 'Poloniex_DOGE_BTC'
grouped = df.groupby(asset)
print grouped
But it throws this error.
EDIT2: I have changed the link to the data so it is test_predictions.json
EDIT3: this worked a treat:
preds = 'test_predictions.json'
df = pd.read_json(preds)
asset = 'Poloniex_DOGE_BTC'
grouped = df.groupby('market_trading_pair')
print grouped.get_group(asset)
#each array should start and end:
#start 2015-10-28 06:00:00 1446012000
#end 2016-01-12 00:00:00 1452556800
Now how can we truncate the data so that it starts and ends at the above timestamps?
Firstly, why like this?
data = pd.read_json(preds).values
df = pd.DataFrame(data)
You can just write that as:
df = pd.read_json(preds)
And if you want a NumPy array from df then you can execute data = df.values later.
And it should put the data in a DataFrame. (Unless I am much mistaken, because I have never used read_json() before.)
The second thing is getting the data for each asset out. For that, I am assuming you need to process all assets. To do that, you can simply do:
# To convert it to datetime.
# This is not important, and you can skip it if you want, because epoch times in
# seconds will perfectly work with the rest of the method.
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
# This will give you a group for each asset on which you can apply some function.
# We will apply min and max to get the desired output.
grouped = df.groupby('trading_pair') # Where 'trading_pair' is the name of the column that has the asset names
start_times = grouped['timestamp'].min()
end_times = grouped['timestamp'].max()
Now start_times and end_times will be Series. The index of this series will be your asset names, and the value will be the minimum and maximum times respectively.
I think this is the answer you are looking for, from my understanding of your question. Please let me know if that is not the case.
EDIT
If you are specifically looking for a few (one or two or ten) assets, you can modify the above code like so:
asset = ['list', 'of', 'required', 'assets'] # Even one element is fine.
req_df = df[df['trading_pair'].isin(asset)]
grouped = req_df.groupby('trading_pair') # Where 'trading_pair' is the name of the column that has the asset names
start_times = grouped['timestamp'].min()
end_times = grouped['timestamp'].max()
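To then truncate each asset to the common window asked about above (a sketch reusing start_times and end_times; the shared window runs from the latest start to the earliest end across the selected assets):
window_start = start_times.max()   # latest start among the selected assets
window_end = end_times.min()       # earliest end among the selected assets
in_window = (req_df['timestamp'] >= window_start) & (req_df['timestamp'] <= window_end)
truncated = req_df[in_window]
# One DataFrame per asset, all covering the same time span.
parts = {name: g for name, g in truncated.groupby('trading_pair')}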
As an aside, plotting datetimes from Pandas is very convenient as well. I use it all the time to produce most of the plots I create. And all of my data is timestamped.
