Average for similar looking data in a column using Pandas - python

I'm working on a large dataset with more than 60K rows.
I have a continuous measurement of current in one column. Each code is measured for about a second, during which the equipment takes 14/15/16/17 readings depending on its speed; the measurement then moves on to the next code, takes another 14/15/16/17 readings, and so forth.
Every time the measurement moves from one code to another, there is a jump of more than 0.15 in the current reading.
The first 48 rows of the data look like this:
Index    Curr(mA)
0        1.362476
1        1.341721
2        1.362477
3        1.362477
4        1.355560
5        1.348642
6        1.327886
7        1.341721
8        1.334804
9        1.334804
10       1.348641
11       1.362474
12       1.348644
13       1.355558
14       1.334805
15       1.362477
16       1.556172
17       1.542336
18       1.549252
19       1.528503
20       1.549254
21       1.528501
22       1.556173
23       1.556172
24       1.542334
25       1.556172
26       1.542336
27       1.542334
28       1.556170
29       1.535415
30       1.542334
31       1.729109
32       1.749863
33       1.749861
34       1.749861
35       1.736024
36       1.770619
37       1.742946
38       1.763699
39       1.749861
40       1.749861
41       1.763703
42       1.756781
43       1.742946
44       1.736026
45       1.756781
46       1.964308
47       1.957395
I want to write a script that averages each block of 14/15/16/17 similar readings, producing one value per code measurement in a separate column. I have been thinking of doing this with pandas.
I want the data to look like
Index    Curr(mA)
0        1.34907
1        1.54556
2        1.74986
I need some help to get this done. Please help.

First get the index of every row where there's a jump. Use pandas' DataFrame.diff() to get the difference between each row's value and the previous row's, then check whether it is greater than 0.15 with >. Use that to filter the dataframe index and save the resulting indices (three, in the case of your sample data) in a variable.
indices = df.index[df['Curr(mA)'].diff() > 0.15]
The next steps depend on whether there are more columns in the source dataframe that you want in the output (see the groupby sketch at the end of this answer for that case), or whether it's really just Curr(mA) and the index. In the latter case, you can use np.split() to cut the dataframe into a list of dataframes at the indices you just pulled, and then average each piece in a list comprehension.
[chunk['Curr(mA)'].mean() for chunk in np.split(df, indices)]
> [1.3490729374999997, 1.5455638666666667, 1.7498627333333332, 1.9608515]
To get it to match your desired output above (the same values, but as a one-column pandas object rather than a list), convert the list to a pd.Series and reset_index():
pd.Series(
    [chunk['Curr(mA)'].mean() for chunk in np.split(df, indices)]
).reset_index(drop=True)
0    1.349073
1    1.545564
2    1.749863
3    1.960851
dtype: float64
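For the other case, where you do want to keep additional columns from the source dataframe, a groupby on a cumulative jump counter is one possible sketch (group_id and averages are illustrative names, not part of the original answer):
import pandas as pd

# Each jump of more than 0.15 starts a new measurement block.
group_id = (df['Curr(mA)'].diff() > 0.15).cumsum().rename('block')

# Average every numeric column within each block; blocks come out numbered 0, 1, 2, ...
averages = df.groupby(group_id).mean(numeric_only=True).reset_index(drop=True)
This keeps the averaging in pandas and avoids splitting the frame into separate objects.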

Related

looking for better iteration approach for slicing a dataframe

First post: I apologize in advance for sloppy wording (and possibly poor searching, if this question has been answered ad nauseam elsewhere - maybe I don't know the right search terms yet).
I have data in 10-minute chunks and I want to perform calculations on a column ('input') grouped by minute (i.e. 10 separate 60-second blocks - not a rolling 60 second period) and then store all ten calculations in a single list called output.
The 'seconds' column records the second from 1 to 600 in the 10-minute period. If no data was entered for a given second, there is no row for that number of seconds. So, some minutes have 60 rows of data, some have as few as one or two.
Note: the calculation (my_function) is not basic so I can't use groupby and np.sum(), np.mean(), etc. - or at least I can't figure out how to use groupby.
I have code that gets the job done but it looks ugly to me so I am sure there is a better way (probably several).
output = []
seconds_slicer = 0
for i in np.linspace(1, 10, 10):
    seconds_slicer += 60
    minute_slice = df[(df['seconds'] > (seconds_slicer - 60)) &
                      (df['seconds'] <= seconds_slicer)]
    calc = my_function(minute_slice['input'])
    output.append(calc)
If there is a cleaner way to do this, please let me know. Thanks!
Edit: Adding sample data and function details:
seconds input
1 1 0.000054
2 2 -0.000012
3 3 0.000000
4 4 0.000000
5 5 0.000045
def realized_volatility(series_log_return):
    return np.sqrt(np.sum(series_log_return**2))
For this answer, we're going to repurpose Bin pandas dataframe by every X rows.
We'll create a dataframe with missing entries in the 'seconds' column, which is how I understand your data from the description given:
import numpy as np
import pandas as pd

secs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 14, 15, 17, 19]
data = [np.random.randint(-25, 54) / 100000 for _ in range(15)]
df = pd.DataFrame(data=list(zip(secs, data)), columns=['seconds', 'input'])
df
seconds input
0 1 0.00017
1 2 -0.00020
2 3 0.00033
3 4 0.00052
4 5 0.00040
5 6 -0.00015
6 7 0.00001
7 8 -0.00010
8 9 0.00037
9 11 0.00050
10 12 0.00000
11 14 -0.00009
12 15 -0.00024
13 17 0.00047
14 19 -0.00002
I didn't create 600 rows, but that's okay; we'll say we want to bin every 5 seconds instead of every 60. Because we're just grouping by equal time intervals, we can use floor division to see which bin each timestamp ends up in. (In your case, you'd divide by 60 instead.)
# Drop the extra 'seconds' column: we don't care about the root sum of squares of the seconds themselves.
grouped = df.groupby(df['seconds'] // 5).apply(realized_volatility).drop('seconds', axis=1)
grouped
input
seconds
0 0.000441
1 0.000372
2 0.000711
3 0.000505
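As a possible simplification (a sketch under the same assumptions, not part of the original answer): selecting the 'input' column before apply avoids having to drop the 'seconds' column afterwards, and .tolist() gives the flat output list the question asked for. For the real data you would divide by 60:
# One value per 5-second bin (use // 60 for per-minute bins on the real data).
output = df.groupby(df['seconds'] // 5)['input'].apply(realized_volatility).tolist()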

python sort a list of strings based on substrings using pandas

I have an excel sheet with 4 columns, Filename, SNR, Dynamic Range, Level.
Filename                                                                       SNR  Dynamic Range  Level
1___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS8_HPOF.xlsx       5             11      8
19___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS32_HPOF.xlsx    15             31     23
10___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS16_HPOF.xlsx    10             21     24
28___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS48_HPOF.xlsx    20             41     23
37___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS8_HP4.xlsx      25             51     12
I need to reorder the table by the first column, the xls filename, such that the FS<number> part of the name (FS8, FS16, FS32, FS48 - bolded in the original post) goes from least to greatest, i.e.
Filename                                                                       SNR  Dynamic Range  Level
1___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS8_HPOF.xlsx       5             11      8
37___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS8_HP4.xlsx      25             51     12
10___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS16_HPOF.xlsx    10             21     24
19___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS32_HPOF.xlsx    15             31     23
28___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS48_HPOF.xlsx    20             41     23
I don't want to change the actual excel file. I was hoping to use pandas because I am doing some other manipulation later on.
I tried this
df.sort_values(by='Xls Filename', key=lambda col: col.str.contains('_FS'),ascending=True)
but it didn't work.
Thank you in advance!
Extract the pattern, find the sort index using argsort and then sort with the sort index:
# extract the number to sort by into a Series
fs = df.Filename.str.extract(r'FS(\d+)_\w+\.xlsx$', expand=False)
# find the sort index using `argsort` and reorder data frame with the sort index
df.loc[fs.astype(int).argsort()]
# Filename ... Level
#0 1___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS8_HPOF.xlsx ... 8
#4 37___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS8_HP4.xlsx ... 12
#2 10___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS16_HPOF.xlsx ... 24
#1 19___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS32_HPOF.xlsx ... 23
#3 28___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS48_HPOF.xlsx ... 23
Here the regex r'FS(\d+)_\w+\.xlsx$' captures the digits that immediately follow FS and precede _\w+\.xlsx (a raw string keeps the backslashes intact for the regex engine).
In case some filenames don't match the pattern, convert to float instead of int to accommodate the resulting NaNs:
df.loc[fs.astype(float).values.argsort()]
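On newer pandas (the key argument to sort_values was added in 1.1), the same idea can be written as a one-liner. This is only a sketch, assuming the column really is named 'Filename' and that the FS token appears once per name:
df.sort_values(
    by='Filename',
    key=lambda col: col.str.extract(r'FS(\d+)', expand=False).astype(float),
)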

Sort index order with duplicate index numbers

I have two dataframes
df1:
Type Number
24 variation 2.5
25 variation 2.6
26 variation 4
27 variation 4
dfn:
Type Number
24 variable
26 variable
I'm trying to append these two dataframes and sort them by index:
dfn = dfn.append(df1).sort_index()
The end result should be
Type Number
24 variable
24 variation 2.5
25 variation 2.6
26 variable
26 variation 4
27 variation 4
However, I am getting results like:
Type Number
24 variable
24 variation 2.5
25 variation 2.6
26 variation 4
26 variable
27 variation 4
I want the row with variable type above the variation type, which works fine with the first index (24) but not for the next index (26) and so on. How can I get the desired results?
Please try: append, reset the index, sort values by multiple columns, and drop the reset-index column, as follows:
df1.append(dfn).reset_index().sort_values(['index', 'Type', 'Number']).drop('index', axis=1)
Let us try
dfn = df1.append(dfn).sort_index()
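Note that DataFrame.append was removed in pandas 2.0; the same idea can be sketched with pd.concat, and a stable sort keeps the 'variable' rows (concatenated first) above the 'variation' rows for each duplicate index value (assuming the frames are named df1 and dfn as above):
import pandas as pd

# Concatenate with the 'variable' rows first, then sort the index with a stable sort
# so rows sharing an index value keep their concatenation order.
result = pd.concat([dfn, df1]).sort_index(kind='mergesort')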

Creating Multiple DataFrames from single DataFrame based on different values of single column

I have 3 days of time-series data with multiple columns, all in one single DataFrame. I want 3 different DataFrames split on the column "Dates", i.e. df["Dates"].
For Example:
Available Dataframe is: df
Expected Output: Based on Three different Dates
First DataFrame: df_23
Second DataFrame: df_24
Third DataFrame: df_25
I want to use these all three DataFrames separately for analysis.
I tried the code below, but I am not able to use those three dataframes (rather, I don't know how to use them). Can anybody help me make my code work better? Thank you.
The code just prints the DataFrame as three DataFrames, and even that is not as expected.
Unsure whether you're saving each variable to a CSV or keeping it in memory for further use; you could put each unique value's rows into a dict and access them by key:
print(df)
Cal Dates
0 85 23
1 75 23
2 74 23
3 97 23
4 54 24
5 10 24
6 77 24
7 95 24
8 58 25
9 53 25
10 44 25
11 94 25
d = {}
for frame, data in df.groupby('Dates'):
    d[f'df{frame}'] = data
print(d['df23'])
Cal Dates
0 85 23
1 75 23
2 74 23
3 97 23
Edit, per the updated request:
for k, v in d.items():
    i = v['Cal'].loc[v['Cal'] > 70].count()
    print(f"{v['Dates'].unique()[0]} --> {i} times")
23 --> 4 times
24 --> 2 times
25 --> 1 times
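The same dictionary can also be built with a dict comprehension, and the per-date count of Cal values above 70 can be taken straight from the original frame; both lines below are just equivalent sketches of the loops above:
# One-line equivalent of the dict-building loop.
d = {f'df{key}': group for key, group in df.groupby('Dates')}

# Count of Cal values above 70 for each date, without building the dict first.
df.groupby('Dates')['Cal'].apply(lambda s: (s > 70).sum())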

Index contains multiple values - dataframe pivots

I currently have data in the following format in a dataframe:
metric__name sample sample_date
0 ga:visitBounceRate 100 2012-11-13
1 ga:uniquePageviews 20 2012-11-13
2 ga:newVisits 19 2012-11-13
3 ga:visits 20 2012-11-13
4 ga:percentNewVisits 95 2012-11-13
5 ga:pageviewsPerVisit 1 2012-11-13
6 ga:pageviews 20 2012-11-13
7 ga:visitBounceRate 72 2012-11-14
8 ga:uniquePageviews 63 2012-11-14
9 ga:newVisits 39 2012-11-14
That being said, I am trying to break out the metric__name column into something like this.
ga:visitBounceRate ga:uniquePageviews ga:newVisits etc...
sample_date
2012-11-13 100 20 19 etc...
I am doing the following to get my desired result.
df.pivot(index='sample_date', columns='metric__name', values='sample')
All I keep getting is "Index contains multiple values", which indeed it does, but why doesn't pivot realize that the repeated dates belong together and map them onto one row, as in my desired output?
Use pivot_table (which doesn't throw this exception):
In [11]: df.pivot_table('sample', 'sample_date', 'metric__name')
Out[11]:
metric__name ga:newVisits ga:pageviews ga:pageviewsPerVisit ga:percentNewVisits ga:uniquePageviews ga:visitBounceRate ga:visits
sample_date
2012-11-13 19 20 1 95 20 100 20
2012-11-14 39 NaN NaN NaN 63 72 NaN
It accepts an aggregation function (the default is mean):
aggfunc : function, default numpy.mean, or list of functions
If list of functions passed, the resulting pivot table will have hierarchical columns
whose top level are the function names (inferred from the function objects themselves)
Regarding the difference between the two: I think pivot just does reshaping (and throws an error if there is a problem), whereas pivot_table offers more advanced functionality, aka "spreadsheet-style pivot tables".
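For readability, the same call can be spelled with explicit keyword arguments, and a groupby/unstack gives an equivalent reshape; both lines are sketches assuming the column names shown above:
# Explicit-keyword form of the pivot_table call.
df.pivot_table(values='sample', index='sample_date', columns='metric__name', aggfunc='mean')

# Equivalent result via groupby + unstack.
df.groupby(['sample_date', 'metric__name'])['sample'].mean().unstack()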
