I want to adapt this SO topic to a three-hourly interval.
I have a database of events at minute resolution. I need to group them into three-hour bins and extract the count for each bin.
The output would ideally look something like a table like the following:
3hourly count
0 10
3 3
6 5
9 2
...
You haven't provided much detail, but you can use the 'TimeGrouper':
df.groupby(pd.TimeGrouper(key='your_time_column', freq='3H')).count()
The key parameter is optional if your time is the index.
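Note that pd.TimeGrouper was deprecated and later removed from pandas; pd.Grouper (or resample) does the same job in current versions. A minimal sketch, assuming a DataFrame with a datetime column named your_time_column holding minute-resolution events:
import pandas as pd

# hypothetical minute-resolution events
df = pd.DataFrame({
    'your_time_column': pd.date_range('2020-01-01', periods=500, freq='T'),
    'event': 1,
})

# group into 3-hour bins and count events per bin
counts = df.groupby(pd.Grouper(key='your_time_column', freq='3H')).count()

# equivalent, when the datetime column is the index
counts_alt = df.set_index('your_time_column').resample('3H').count()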
Related
So I have a data frame similar to the one below, where Stamp is a datetime index;
for context, it represents orders received, and my goal is to match orders that may be the same but have come in as two separate orders.
Stamp   Price   indicator   EX   qty
1234    10      1           d    12
2345    30      -1          d    13
I want to group entries that have the same date time stamp, given that those entries have the same EX and Indicator.
I think I know how to do this with just the stamp; however, I'm unsure how to add the EX and indicator conditions to the groupby.
Beginner here so any help is greatly appreciated!
Try this:
df.groupby(["Stamp", "EX", "indicator"])
And if you then want to get the sum of quantities and prices you can do this:
df.groupby(["Stamp", "EX", "indicator"]).sum()
You can group by more than one column: df.groupby(['Stamp', 'EX'])
Then you can check the length of each group to see if there are multiple rows that share both columns:
df.groupby(['Stamp', 'EX']).apply(len)
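A small self-contained sketch of both suggestions, using made-up rows shaped like the question's table (the column names are assumed from the post):
import pandas as pd

# made-up orders; two rows deliberately share Stamp, EX and indicator
df = pd.DataFrame({
    'Stamp': pd.to_datetime(['2021-01-04 09:00:00',
                             '2021-01-04 09:00:00',
                             '2021-01-04 09:05:00']),
    'Price': [10, 10, 30],
    'indicator': [1, 1, -1],
    'EX': ['d', 'd', 'd'],
    'qty': [12, 5, 13],
})

groups = df.groupby(['Stamp', 'EX', 'indicator'])

# rows per (Stamp, EX, indicator) combination; size() is the idiomatic
# equivalent of apply(len)
print(groups.size())

# summed quantity and price for entries sharing all three keys
print(groups[['qty', 'Price']].sum())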
I have a dataframe similar to the one shown below and was wondering how I can loop through and calculate fitting parameters every set number of days. For example, I would like to be able to input 30 days and get new constants for the first 30 days, then the first 60 days, and so on until the end of the date range.
ID date amount delta_t
1 2020/1/1 10.2 0
1 2020/1/2 11.2 1
2 2020/1/1 12.3 0
2 2020/1/2 13.3 1
I would like to have the parameters stored in another dataframe which is what I am currently doing for the entire dataset but that is over the whole time period rather than n day blocks. Then using the constants for each set period I will calculate the graph points and plot them.
Right now I am using groupby to group the wells by ID then using the apply method to calculate the constants for each ID. This works for the entire dataframe but the constants will change if I am only using 30 day periods.
I don't know if there is a way in the apply method to do this more easily and output the constants either to a new column or a separate dataframe that is one row per ID. Any input is greatly appreciated.
def parameters(x):
    variables, _ = curve_fit(expo, x['delta_t'], x['amount'])
    return pd.Series({'param1': variables[0], 'param2': variables[1], 'param3': variables[2]})

param_series = df_filt.groupby('ID').apply(parameters)
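One way to get constants per n-day block is to fit on expanding slices of delta_t (the first 30 days, then the first 60 days, ...). A rough sketch, assuming delta_t is measured in days and that expo is a three-parameter model (neither is shown in the question), with the window size as an input:
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

# hypothetical stand-in for the question's expo model (three parameters)
def expo(t, a, b, c):
    return a * np.exp(-b * t) + c

def parameters(x):
    variables, _ = curve_fit(expo, x['delta_t'], x['amount'])
    return pd.Series({'param1': variables[0],
                      'param2': variables[1],
                      'param3': variables[2]})

def expanding_fits(df, window=30):
    # fit each ID on the first `window` days, then the first 2*window days, ...
    results = []
    last_day = int(df['delta_t'].max())
    for end in range(window, last_day + window, window):
        subset = df[df['delta_t'] < end]
        fits = subset.groupby('ID').apply(parameters)
        fits['window_end'] = end  # which expanding window produced these constants
        results.append(fits.reset_index())
    return pd.concat(results, ignore_index=True)

# param_table = expanding_fits(df_filt, window=30)
Each row of param_table would then be one ID within one window, which keeps the constants in a separate dataframe as described.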
I have a CSV dataset of roughly 250k rows that is stored as a dataframe using pandas.
Each row is a record of a client coming in. A client could come in multiple times, which would result in multiple records being created. Some clients have only ever come in once, other clients come in dozens of times.
The CSV dataset has many columns that I use for other purposes, but the ones that this specific problem uses include:
CLIENT_ID | DATE_ARRIVED
0001 1/01/2010
0002 1/02/2010
0001 2/01/2010
0001 2/22/2010
0002 4/01/2010
....
I am trying to create a new column that assigns a number denoting which occurrence of that CLIENT_ID the row is. Then, for occurrences > 1, it should take the difference in days from the prior occurrence's date.
Important note:
The dataset is not ordered, so the script has to determine which record is the first based on the earliest date. If the client came in multiple times on the same day, it should use the earliest time within that date.
I tried to create a set using the CLIENT_ID, then looping through each element in the set to get the count. This gives me the total count, but I can't figure out how to get it to create a new column with those incrementally increasing counts.
I haven't gotten far enough to the DATE_ARRIVED differences based on # occurrence.
Nothing viable, hoping to get some ideas! If there is an easier way to determine differences between two dates next to each other for a client, I'm also open to ideas! I have a way of doing this manually through Excel, which involves:
ordering the dataset by ID and date,
checking each to see if the ID before was equal (and if it is, increment by 1)
creating a new column that takes the difference of the above only if the previous number was >1
... but I have no idea how to do this in Python.
The output should look something like:
CLIENT_ID | DATE_ARRIVED | OCCURRENCE | DAYS_SINCE_LAST
0001 1/01/2010 1 N/A
0002 1/02/2010 1 N/A
0001 2/01/2010 2 31
0001 2/22/2010 3 21
0002 4/01/2010 2 90
Using groupby with transform count + diff
df['OCCURRENCE']=df.groupby('CLIENT_ID').CLIENT_ID.transform('count')
df['DAYS_SINCE_LAST']=df.groupby('CLIENT_ID')['DATE_ARRIVED'].diff().dt.days
df
Out[45]:
CLIENT_ID DATE_ARRIVED OCCURRENCE DAYS_SINCE_LAST
0 1 2010-01-01 2 NaN
1 2 2010-01-02 1 NaN
2 1 2010-02-01 2 31.0
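If the incremental occurrence numbers from the desired output are what you need (1, 2, 3, ... per client rather than the total count), sorting first and then using cumcount is one option; a sketch assuming DATE_ARRIVED still needs to be parsed and ordered:
import pandas as pd

# parse the dates and sort so each client's earliest visit comes first
df['DATE_ARRIVED'] = pd.to_datetime(df['DATE_ARRIVED'])
df = df.sort_values(['CLIENT_ID', 'DATE_ARRIVED'])

# running visit number per client: 1, 2, 3, ...
df['OCCURRENCE'] = df.groupby('CLIENT_ID').cumcount() + 1

# days since the client's previous visit (NaN for the first visit)
df['DAYS_SINCE_LAST'] = df.groupby('CLIENT_ID')['DATE_ARRIVED'].diff().dt.days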
I have multiple timeseries that are outputs of various algorithms. These algorithms can have various parameters and they produce timeseries as a result:
timestamp1=1;
value1=5;
timestamp2=2;
value2=8;
timestamp3=3;
value3=4;
timestamp4=4;
value4=12;
resultsOfAlgorithms = [
    {
        'algorithm': 'minmax',
        'param-a': '12',
        'param-b': '200',
        'result-of-algorithm': [[timestamp1, value1], [timestamp2, value2]]
    },
    {
        'algorithm': 'minmax',
        'param-a': '12',
        'param-b': '30',
        'result-of-algorithm': [[timestamp1, value1], [timestamp3, value3]]
    },
    {
        'algorithm': 'minmax',
        'param-a': '12',
        'param-b': '30',
        'result-of-algorithm': [[timestamp2, value2], [timestamp4, value4]]
    },
    {
        'algorithm': 'delta',
        'param-a': '12',
        'param-b': '50',
        'result-of-algorithm': [[timestamp2, value2], [timestamp4, value4]]
    }
]
I would like to be able to filter the timeseries by algorithm and parameters and plot filtered timeseries to see how given parameters affect the output. To do that I need to know all the occurring values for given parameter and then to be able to select timeseries with desired parameters. E.g. I would like to plot all results of minmax algorithm with param-b==30. There are 2 results that were produced with minmax algorithm and param-b==30. Thus I would like to have a plot with 2 timeseries in it.
Is this possible with pandas or is this out of pandas functionality? How could this be implemented?
Edit:
Searching the internet more, I think I am looking for a way to use hierarchical indexing. Also, the timeseries should stay separated: each result is an individual time-series and should not be merged with the other results. I need to filter the results of algorithms by the parameters used, and the result of the filter should still be a list of timeseries.
Edit 2:
There are multiple sub-problems:
Find all existing values for each parameter (the user does not know all the values, since parameters can be auto-generated by the system)
The user selects some of the values for filtering
One way this could be provided by user is a dictionary (but more-user friendly ideas are welcome):
filter = {
    'param-b': [30, 50],
    'algorithm': 'minmax'
}
Timeseries from resultsOfAlgorithms[1:3] (the 2nd and 3rd results) are given as the result of filtering, since those results were produced by the minmax algorithm with param-b equal to 30. Thus in this case:
[
    [[timestamp1, value1], [timestamp3, value3]],
    [[timestamp2, value2], [timestamp4, value4]]
]
The result of filtering will return multiple time series, which I want to plot and compare.
The user wants to try various filters to see how they affect the results
I am doing all this in Jupyter notebook. And I would like to allow user to try various filters with the least hassle possible.
Timestamps between results are not necessarily shared. E.g. all timeseries might occur between 1 pm and 3 pm and have roughly the same number of values, but neither the timestamps nor the number of values are identical.
So there are two options here: one is to clean up the dict first and then convert it easily to a dataframe; the second is to convert it to a dataframe and then clean up the column that will have nested lists in it. For the first solution, you can just restructure the dict like this:
import pandas as pd
from collections import defaultdict

data = defaultdict(list)
for roa in resultsOfAlgorithms:
    for i in range(len(roa['result-of-algorithm'])):
        data['algorithm'].append(roa['algorithm'])
        data['param-a'].append(roa['param-a'])
        data['param-b'].append(roa['param-b'])
        data['time'].append(roa['result-of-algorithm'][i][0])
        data['value'].append(roa['result-of-algorithm'][i][1])

df = pd.DataFrame(data)
In [31]: df
Out[31]:
algorithm param-a param-b time value
0 minmax 12 200 1 5
1 minmax 12 200 2 8
2 minmax 12 30 1 5
3 minmax 12 30 3 4
4 minmax 12 30 2 8
5 minmax 12 30 4 12
6 delta 12 50 2 8
7 delta 12 50 4 12
And from here you can do whatever analysis you need with it, whether it's plotting or making the time column the index or grouping and aggregating, and so on. You can compare this to making a dataframe first in this link:
Splitting a List inside a Pandas DataFrame
Where they basically did the same thing, with splitting a column of lists into multiple rows. I think fixing the dictionary will be easier though, depending on how representative your fairly simple example is of the real data.
Edit: If you wanted to turn this into a multi-index, you can add one more line:
df_mi = df.set_index(['algorithm', 'param-a', 'param-b'])
In [25]: df_mi
Out[25]:
time value
algorithm param-a param-b
minmax 12 200 1 5
200 2 8
30 1 5
30 3 4
30 2 8
30 4 12
delta 12 50 2 8
50 4 12
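To then answer the filtering part of the question (e.g. all minmax results with param-b == 30, plotted as separate series), boolean masking on the flat dataframe is one option. This sketch assumes a run column was added while flattening (e.g. data['run'].append(j) for the j-th dict), since the two minmax / param-b == 30 results share identical parameter values and would otherwise be merged; note also that the parameters are stored as strings in the original dicts:
import matplotlib.pyplot as plt

# all values that occur for a given parameter, useful for building the filter
print(df['param-b'].unique())

# keep only the minmax results with param-b == '30'
wanted = df[(df['algorithm'] == 'minmax') & (df['param-b'] == '30')]

# one line per original result, kept separate via the hypothetical 'run' column
fig, ax = plt.subplots()
for run_id, ts in wanted.groupby('run'):
    ax.plot(ts['time'], ts['value'], label='run {}'.format(run_id))
ax.legend()
plt.show()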
I have a dataset where I calculate service times based on request and response times. I would like to add a calculation of the number of requests in the last second, to show the obvious relationship that as we get more requests per second the system slows. Here is the data that I have, for example:
serviceTimes.head()
Out[71]:
Id Req_Time Rsp_Time ServiceTime
0 3_1 2015-02-13 14:07:08.729000 2015-02-13 14:07:08.821000 00:00:00.092000
1 3_2 2015-02-13 14:07:08.929000 2015-02-13 14:07:08.929000 00:00:00
2 3_12 2015-02-13 14:11:53.908000 2015-02-13 14:11:53.981000 00:00:00.073000
3 3_14 2015-02-13 14:11:54.111000 2015-02-13 14:11:54.250000 00:00:00.139000
4 3_15 2015-02-13 14:11:54.111000 2015-02-13 14:11:54.282000 00:00:00.171000
For this I would like a rolling data set of something like:
0 14:07:08 2
1 14:11:53 1
2 14:11:54 2
I've tried rolling_sum and rolling_count, but unless I am using them wrong or not understanding the period function, it is not working for me.
For your problem, it looks like you want to summarize your data set using a split-apply-combine approach. See here for the documentation that will help you get your code working, but basically you'll want to do the following:
Create a new column (say, 'Req_Time_Sec') that contains Req_Time truncated to second resolution (e.g. 14:07:08.729000 becomes 14:07:08).
Use groups = serviceTimes.groupby('Req_Time_Sec') to separate your data set into sub-groups based on which second each request occurs in.
Finally, create a new data set by calculating the length of each sub-group (which represents the number of requests in that second) and aggregating the results into a single DataFrame (something like new_df = groups.aggregate(len)).
The above is all untested pseudo-code, but the code, along with the link to the documentation, should help you get where you want to go.
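A minimal sketch of those three steps, assuming Req_Time has already been converted to a datetime dtype (e.g. with pd.to_datetime):
# truncate each request time to whole-second resolution
serviceTimes['Req_Time_Sec'] = serviceTimes['Req_Time'].dt.floor('S')

# one sub-group per second in which at least one request arrived
groups = serviceTimes.groupby('Req_Time_Sec')

# number of requests observed in each second; size() plays the role of len here
requests_per_second = groups.size()
print(requests_per_second)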
You first need to transform the timestamp into a string, which you can then group by to show the count and average service times:
serviceTimes['timestamp'] = [t.strftime('%y-%m-%d %H:%M') for t in serviceTimes.Req_Time]
serviceTimes.groupby('timestamp')['ServiceTime'].agg(['mean', 'count'])
Alternatively, create a data frame of the request time in the appropriate string format, e.g. 15-02-13 14:07, then count the occurrences of each time stamp using value_counts(). You can also plot the results quite easily.
df = pd.DataFrame([t.strftime('%y-%m-%d %H:%M') for t in serviceTimes.Req_Time],
                  columns=['timestamp'])
response = df.timestamp.value_counts()
response.plot(rot=90)