Looking for a better iteration approach for slicing a dataframe - Python

First post: I apologize in advance for sloppy wording (and possibly poor searching if this question has been answered ad nauseam elsewhere - maybe I don't know the right search terms yet).
I have data in 10-minute chunks and I want to perform calculations on a column ('input') grouped by minute (i.e. 10 separate 60-second blocks - not a rolling 60 second period) and then store all ten calculations in a single list called output.
The 'seconds' column records the second from 1 to 600 in the 10-minute period. If no data was entered for a given second, there is no row for that number of seconds. So, some minutes have 60 rows of data, some have as few as one or two.
Note: the calculation (my_function) is not basic so I can't use groupby and np.sum(), np.mean(), etc. - or at least I can't figure out how to use groupby.
I have code that gets the job done but it looks ugly to me so I am sure there is a better way (probably several).
output = []
seconds_slicer = 0
for i in np.linspace(1, 10, 10):
    seconds_slicer += 60
    minute_slice = df[(df['seconds'] > (seconds_slicer - 60)) &
                      (df['seconds'] <= seconds_slicer)]
    calc = my_function(minute_slice['input'])
    output.append(calc)
If there is a cleaner way to do this, please let me know. Thanks!
Edit: Adding sample data and function details:
seconds input
1 1 0.000054
2 2 -0.000012
3 3 0.000000
4 4 0.000000
5 5 0.000045
def realized_volatility(series_log_return):
    return np.sqrt(np.sum(series_log_return**2))

For this answer, we're going to repurpose the approach from "Bin pandas dataframe by every X rows".
We'll create a dataframe with gaps in the 'seconds' column, which is how I understand your data based on the description given:
import numpy as np
import pandas as pd

secs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 14, 15, 17, 19]
data = [np.random.randint(-25, 54) / 100000 for _ in range(15)]
df = pd.DataFrame(data=zip(secs, data), columns=['seconds', 'input'])
df
seconds input
0 1 0.00017
1 2 -0.00020
2 3 0.00033
3 4 0.00052
4 5 0.00040
5 6 -0.00015
6 7 0.00001
7 8 -0.00010
8 9 0.00037
9 11 0.00050
10 12 0.00000
11 14 -0.00009
12 15 -0.00024
13 17 0.00047
14 19 -0.00002
I didn't create 600 rows, but that's okay, we'll say we want to bin every 5 seconds instead of every 60. Now, because we're just trying to use equal time measures for grouping, we can just use floor division to see which bin each time interval would end up in. (In your case, you'd divide by 60 instead)
# drop the extra 'seconds' column because we don't care about the root sum of squares of the seconds themselves
grouped = df.groupby(df['seconds'] // 5).apply(realized_volatility).drop('seconds', axis=1)
grouped
input
seconds
0 0.000441
1 0.000372
2 0.000711
3 0.000505
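One boundary detail worth flagging: with plain floor division, second 60 lands in the second bin, whereas your original loop puts seconds 1-60 in the first minute. A minimal sketch of one way to keep your original convention for the real 600-second data (this variant is my suggestion, not part of the answer above); selecting the 'input' column up front also avoids the extra drop:

# Shift the seconds down by one so 1-60 map to minute 0, 61-120 to minute 1, etc.
grouped = df.groupby((df['seconds'] - 1) // 60)['input'].apply(realized_volatility)
output = grouped.tolist()  # same structure as the original output list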

Related

Selecting values in one column until the sum of another column reaches a value

I have two columns, Demand and square footage, and I want to maximize demand until the capacity (30 sqft) is reached.
Demand  square footage
10      10
2       5
5       10
12      5
7       10
13      20
I know this is basically a solver question, but I have too many values and this is my first time creating something like this.
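No answer is shown for this one, but since it is a how-to, here is a minimal sketch of one common framing: treat it as a 0/1 knapsack and brute-force the six sample rows. The column names and the 30 sqft capacity come from the question; the brute force itself is only an illustration, and a real solver would be needed at scale.

import pandas as pd
from itertools import combinations

# Sample data from the question.
df = pd.DataFrame({'Demand': [10, 2, 5, 12, 7, 13],
                   'square footage': [10, 5, 10, 5, 10, 20]})
capacity = 30  # sqft

# Try every subset of rows and keep the one with the highest total demand
# whose total square footage stays within capacity.
best_rows, best_demand = (), 0
for r in range(1, len(df) + 1):
    for combo in combinations(df.index, r):
        subset = df.loc[list(combo)]
        if (subset['square footage'].sum() <= capacity
                and subset['Demand'].sum() > best_demand):
            best_rows, best_demand = combo, subset['Demand'].sum()

print(best_rows, best_demand)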

Special case of counting empty cells "before" an occupied cell in Pandas

Pandas question here.
I have a specific dataset in which we sample subjective ratings several times per second. The information is sorted as below. What I need is a way to count the number of blank cells before every "second" (i.e. each "1" in the seconds column, which occurs at regular intervals), so I can feed that value into a greatest common factor equation and create something like a linear extrapolation based on milliseconds. In the example below that number would be 2, and I would feed it into the GCF formula. The end goal is to make a more accurate/usable timestamp. Sampling rates may vary by dataset.
index  rating  seconds
1      26
2      28
3      30      1
4      33
5      40
6      45      1
7      50
8      48
9      49      1
If you just want to count the number of NaNs before the first 1:
df['seconds'].isna().cummin().sum()
If the blanks are another value instead (e.g. an empty string):
df['seconds'].eq('').cummin().sum()
Output: 2
Or, if you have a range Index:
df['seconds'].first_valid_index()
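A quick way to sanity-check this, a minimal sketch that rebuilds the sample above with NaN for the blank cells (an assumption; the empty-string variant is covered by the second snippet):

import numpy as np
import pandas as pd

# Reconstruction of the sample data, blanks represented as NaN.
df = pd.DataFrame({
    'rating':  [26, 28, 30, 33, 40, 45, 50, 48, 49],
    'seconds': [np.nan, np.nan, 1, np.nan, np.nan, 1, np.nan, np.nan, 1],
})

print(df['seconds'].isna().cummin().sum())   # 2
print(df['seconds'].first_valid_index())     # 2 (with a default RangeIndex)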

Average for similar looking data in a column using Pandas

I'm working on a large dataset with more than 60K rows.
I have a continuous measurement of current in one column. Each code is measured for about a second, during which the equipment records 14/15/16/17 readings depending on its speed; the measurement then moves to the next code and again records 14/15/16/17 readings, and so forth.
Every time the measurement moves from one code to another, there is a jump of more than 0.15 in the current.
The top 48 rows of the data are as follows:
Index  Curr(mA)
0      1.362476
1      1.341721
2      1.362477
3      1.362477
4      1.355560
5      1.348642
6      1.327886
7      1.341721
8      1.334804
9      1.334804
10     1.348641
11     1.362474
12     1.348644
13     1.355558
14     1.334805
15     1.362477
16     1.556172
17     1.542336
18     1.549252
19     1.528503
20     1.549254
21     1.528501
22     1.556173
23     1.556172
24     1.542334
25     1.556172
26     1.542336
27     1.542334
28     1.556170
29     1.535415
30     1.542334
31     1.729109
32     1.749863
33     1.749861
34     1.749861
35     1.736024
36     1.770619
37     1.742946
38     1.763699
39     1.749861
40     1.749861
41     1.763703
42     1.756781
43     1.742946
44     1.736026
45     1.756781
46     1.964308
47     1.957395
I want to write a script that averages each run of 14/15/16/17 similar readings, one value per code measurement. I have been thinking of doing this with pandas.
I want the data to look like:
Index  Curr(mA)
0      1.34907
1      1.54556
2      1.74986
Need some help to get this done. Please help
First get the indexes of every row where there's a jump. Use Pandas' DataFrame.diff() to get the difference between the value in each row and the previous row, then check to see if it's greater than 0.15 with >. Use that to filter the dataframe index, and save the resulting indices (in the case of your sample data, three) in a variable.
indices = df.index[df['Curr(mA)'].diff() > 0.15]
The next steps depend on whether there are more columns in the source dataframe that you want in the output, or if it's really just Curr(mA) and the index. In the latter case, you can use np.split() to cut the dataframe into a list of dataframes based on the indices you just pulled. Then you can average them in a list comprehension.
[df['Curr(mA)'].mean() for df in np.split(df, indices)]
> [1.3490729374999997, 1.5455638666666667, 1.7498627333333332, 1.9608515]
To get it closer to your desired output above (the same values as a pandas object rather than a plain list), convert the list to a pd.Series and reset_index(drop=True).
pd.Series(
[df['Curr(mA)'].mean() for df in np.split(df, indices)]
).reset_index(drop=True)
0    1.349073
1    1.545564
2    1.749863
3    1.960851
dtype: float64
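For the other case mentioned above, where the source dataframe has more columns you want to keep, one possible sketch is to group on a cumulative count of the jumps instead of using np.split (this cumsum labelling is my suggestion, not part of the original answer):

# Label each run of similar readings; the label increments at every jump.
group_id = (df['Curr(mA)'].diff() > 0.15).cumsum()

# Average every numeric column per run; other aggregations could be added per column.
averaged = df.groupby(group_id).mean().reset_index(drop=True)
print(averaged)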

How to resample/reindex/groupby a time series based on a column's data?

So I've got a pandas data frame that contains two values of water use at a 1-second resolution. The values are "hotIn" and "hotOut". The hotIn can record down to a tenth of a gallon at a one-second resolution, while the hotOut records whole-number pulses representing a gallon, i.e. when a pulse occurs, one gallon has passed through the meter. The pulses occur roughly every 14-15 seconds.
Data looks roughly like this:
Index hotIn(gpm) hotOut(pulse=1gal)
2019-03-23T00:00:00 4 0
2019-03-23T00:00:01 5 0
2019-03-23T00:00:02 4 0
2019-03-23T00:00:03 4 0
2019-03-23T00:00:04 3 0
2019-03-23T00:00:05 4 1
2019-03-23T00:00:06 4 0
2019-03-23T00:00:07 5 0
2019-03-23T00:00:08 3 0
2019-03-23T00:00:09 3 0
2019-03-23T00:00:10 4 0
2019-03-23T00:00:11 4 0
2019-03-23T00:00:12 5 0
2019-03-23T00:00:13 5 1
What I'm trying to do is resample or reindex the data frame based on the occurrence of pulses and sum the hotIn between the new timestamps.
For example, sum the hotIn between 00:00:00 - 00:00:05 and 00:00:06 - 00:00:13.
Results would ideally look like this:
Index hotIn sum(gpm) hotOut(pulse=1gal)
2019-03-23T00:00:05 24 1
2019-03-23T00:00:13 32 1
I've explored using a two-step for/elif loop that just checks whether hotOut == 1; it works, but it's painfully slow on large datasets. I'm positive the timestamp functionality of Pandas will be superior if this is possible.
I also can't simply resample on a set frequency, because the interval between pulses changes over time, so a general resample rule would not work. I've also run into problems with mismatched data frame lengths when pulling out the timestamps associated with pulses and applying them to the main frame as a new index.
IIUC, you can do:
s = df['hotOut(pulse=1gal)'].shift().ne(0).cumsum()
(df.groupby(s)
.agg({'Index':'last', 'hotIn(gpm)':'sum'})
.reset_index(drop=True)
)
Output:
Index hotIn(gpm)
0 2019-03-23T00:00:05 24
1 2019-03-23T00:00:13 33
You don't want to group on the Index. You want to group whenever 'hotOut(pulse=1gal)' changes.
s = df['hotOut(pulse=1gal)'].cumsum().shift().bfill()
(df.reset_index()
.groupby(s, as_index=False)
.agg({'Index': 'last', 'hotIn(gpm)': 'sum', 'hotOut(pulse=1gal)': 'last'})
.set_index('Index'))
hotIn(gpm) hotOut(pulse=1gal)
Index
2019-03-23T00:00:05 24 1
2019-03-23T00:00:13 33 1
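For reference, a minimal sketch that rebuilds the sample data and reproduces the second result end to end; keeping 'Index' as a plain column with a default RangeIndex is an assumption about the original frame's layout:

import pandas as pd

# Rebuild the sample data from the question.
df = pd.DataFrame({
    'Index': pd.date_range('2019-03-23T00:00:00', periods=14, freq='s'),
    'hotIn(gpm)':         [4, 5, 4, 4, 3, 4, 4, 5, 3, 3, 4, 4, 5, 5],
    'hotOut(pulse=1gal)': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
})

# A new group starts on the row after each pulse.
s = df['hotOut(pulse=1gal)'].cumsum().shift().bfill()

out = (df.groupby(s, as_index=False)
         .agg({'Index': 'last', 'hotIn(gpm)': 'sum', 'hotOut(pulse=1gal)': 'last'})
         .set_index('Index'))
print(out)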

Divide 2 columns and create new column with results

I have a data frame with columns:
User_id PQ_played PQ_offered
1 5 15
2 12 75
3 25 50
I need to divide PQ_played by PQ_offered to calculate the % of games played. This is what I've tried so far:
new_df['%_PQ_played'] = df.groupby('User_id').((df['PQ_played']/df['PQ_offered'])*100),as_index=True
I know that I am terribly wrong.
It's much simpler than you think.
df['%_PQ_played'] = df['PQ_played'] / df['PQ_offered'] * 100
PQ_offered PQ_played %_PQ_played
User_id
1 15 5 33.333333
2 75 12 16.000000
3 50 25 50.000000
You can use a lambda function:
df.groupby('User_id').apply(lambda x: (x['PQ_played']/x['PQ_offered'])*100)\
.reset_index(1, drop = True).reset_index().rename(columns = {0 : '%_PQ_played'})
You get
User_id %_PQ_played
0 1 33.333333
1 2 16.000000
2 3 50.000000
I totally agree with @mVChr and think you are overcomplicating what you need to do. If you are simply trying to add an additional column, then his response is spot on. If you truly need groupby, it is worth noting that it is typically used for aggregation, e.g., sum(), count(), etc. If, for example, you had several records with non-unique values in the User_id column, then you could create the additional column using
df['%_PQ_played'] = df['PQ_played'] / df['PQ_offered'] * 100
and then perform an aggregation. Let's say you wanted to know the average number of games played of the games offered for each user, you could do something like
new_df = df.groupby('User_id', as_index=False)['%_PQ_played'].mean()
This would yield (numbers are arbitrary)
User_id %_PQ_played
0 1 52.777778
1 2 29.250000
2 3 65.000000
