I want to get some values out of my table; however, due to the structure of the table, I am not sure whether that is possible.
A sample of the table is below; I have added some spacing to make it easier to read.
As you can see there are many pages, with 3 rows/page and 3 channels/row.
There are also devices; each device consists of 3 segments, so there are 9 devices/page.
Finally, there is a result per segment from a certain analysis: passed=1 or passed=0.
ID channel row page segment device passed
1 1 1 1 1 1 1
2 1 1 1 2 1 0
3 1 1 1 3 1 1
4 2 1 1 1 2 0
5 2 1 1 2 2 0
6 2 1 1 3 2 0
7 3 1 1 1 3 1
8 3 1 1 2 3 0
9 3 1 1 3 3 1
10 1 2 1 1 4 1
11 1 2 1 2 4 1
12 1 2 1 3 4 1
13 2 2 1 1 5 0
14 2 2 1 2 5 1
15 2 2 1 3 5 1
16 3 2 1 1 6 1
17 3 2 1 2 6 1
18 3 2 1 3 6 1
19 1 3 1 1 7 1
20 1 3 1 2 7 0
21 1 3 1 3 7 1
22 2 3 1 1 8 1
23 2 3 1 2 8 0
24 2 3 1 3 8 1
25 3 3 1 1 9 1
26 3 3 1 2 9 0
27 3 3 1 3 9 1
NEXT PAGE..................................
28 1 1 2 1 1 0
29 1 1 2 2 1 0
30 1 1 2 3 1 1
31 2 1 2 1 2 1
32 2 1 2 2 2 1
33 2 1 2 3 2 0
34 3 1 2 1 3 1
35 3 1 2 2 3 0
36 3 1 2 3 3 1
37 1 2 2 1 4 1
38 1 2 2 2 4 0
39 1 2 2 3 4 0
40 2 2 2 1 5 0
41 2 2 2 2 5 1
42 2 2 2 3 5 0
43 3 2 2 1 6 1
44 3 2 2 2 6 1
45 3 2 2 3 6 1
46 1 3 2 1 7 0
47 1 3 2 2 7 1
48 1 3 2 3 7 1
49 2 3 2 1 8 1
50 2 3 2 2 8 1
51 2 3 2 3 8 1
52 3 3 2 1 9 0
53 3 3 2 2 9 1
54 3 3 2 3 9 0
etc.......................................
The calculation that I am looking for is the % of passed devices (passed=1) grouped by device. The difficult part for me is that a device counts as passed only if all segments within the device are passed.
So a device passes only if seg1=1 and seg2=1 and seg3=1.
An expected output would be something like this (not real data):
Device passed
1 27.45
2 56.78
3 78.9
4 11.23
5 etc
6 etc
7 etc
8 etc
9 etc
I know that I have to use something like the following; however, it is not working:
SELECT device, count(passed) as devicesOK
FROM myTable
WHERE passed=1
group by device
having devicesOK=3
To get the percentage passed, divide the number of rows that passed by the total row count per device and multiply by 100:
SELECT device, ROUND(SUM(passed) / COUNT(*) * 100, 2) AS passed_percent
FROM myTable
GROUP BY device
COUNT(passed) counts the number of non-NULL values of passed. Since none of your values are NULL, it counts every row.
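Note that SUM(passed)/COUNT(*) per device measures the share of passed segments. Since a device only counts as passed on a page when all three of its segments pass, one option is to collapse each (device, page) group to a single flag first and take the percentage over pages. A minimal sketch of that idea (the per_page alias and page_passed column are illustrative names; MIN(passed) is 1 exactly when every segment of the device passed on that page):
SELECT device,
       ROUND(SUM(page_passed) / COUNT(*) * 100, 2) AS passed_percent
FROM (
    -- one row per device and page: 1 only if all three segments passed
    SELECT device, page, MIN(passed) AS page_passed
    FROM myTable
    GROUP BY device, page
) AS per_page
GROUP BY device;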
Related
So I have a series that I want to cumsum, but it should start over every time I hit a 0, something like this:
    orig  wanted result
0      0              0
1      1              1
2      1              2
3      1              3
4      1              4
5      1              5
6      1              6
7      0              0
8      1              1
9      1              2
10     1              3
11     0              0
12     1              1
13     1              2
14     1              3
15     1              4
16     1              5
17     1              6
Any ideas? (pandas, pure Python, other)
Use df['orig'].eq(0).cumsum() to generate groups starting on each 0, then cumcount to get the increasing values:
df['result'] = df.groupby(df['orig'].eq(0).cumsum()).cumcount()
output:
orig wanted result result
0 0 0 0
1 1 1 1
2 1 2 2
3 1 3 3
4 1 4 4
5 1 5 5
6 1 6 6
7 0 0 0
8 1 1 1
9 1 2 2
10 1 3 3
11 0 0 0
12 1 1 1
13 1 2 2
14 1 3 3
15 1 4 4
16 1 5 5
17 1 6 6
Intermediate:
df['orig'].eq(0).cumsum()
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 2
8 2
9 2
10 2
11 3
12 3
13 3
14 3
15 3
16 3
17 3
Name: orig, dtype: int64
Another way to spell out the grouping step:
import pandas as pd

# flag the zeros; their cumulative sum yields a group id that increments at every reset
condition = df['orig'].eq(0)
df['reset'] = condition.cumsum()
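Note that cumcount only matches a resetting cumsum because orig contains nothing but 0s and 1s. For arbitrary values, summing within each group generalizes the idea; a minimal self-contained sketch (the sample values are made up for illustration):
import pandas as pd

df = pd.DataFrame({'orig': [0, 1, 1, 0, 2, 3, 0, 5]})

# groups restart at every 0; cumsum within each group resets the running total
groups = df['orig'].eq(0).cumsum()
df['result'] = df.groupby(groups)['orig'].cumsum()
print(df)  # result: 0 1 2 0 2 5 0 5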
I have the following dataframe:
df = pd.DataFrame({'group_nr':[0,0,1,1,1,2,2,3,3,0,0,1,1,2,2,2,3,3]})
print(df)
group_nr
0 0
1 0
2 1
3 1
4 1
5 2
6 2
7 3
8 3
9 0
10 0
11 1
12 1
13 2
14 2
15 2
16 3
17 3
and would like to change from repeating group numbers to incremental group numbers:
group_nr incremental_group_nr
0 0 0
1 0 0
2 1 1
3 1 1
4 1 1
5 2 2
6 2 2
7 3 3
8 3 3
9 0 4
10 0 4
11 1 5
12 1 5
13 2 6
14 2 6
15 2 6
16 3 7
17 3 7
I can't find a way of doing this without looping through the rows. Does anyone have an idea of how to implement this nicely?
You can check whether each value differs from the previous one, and take a cumsum of the resulting boolean series to generate the groups:
df['incremental_group_nr'] = df.group_nr.ne(df.group_nr.shift()).cumsum().sub(1)
print(df)
group_nr incremental_group_nr
0 0 0
1 0 0
2 1 1
3 1 1
4 1 1
5 2 2
6 2 2
7 3 3
8 3 3
9 0 4
10 0 4
11 1 5
12 1 5
13 2 6
14 2 6
15 2 6
16 3 7
17 3 7
Compare with the shifted values via Series.shift and not-equal via Series.ne, then take the cumulative sum and subtract 1:
df['incremental_group_nr'] = df['group_nr'].ne(df['group_nr'].shift()).cumsum() - 1
print(df)
group_nr incremental_group_nr
0 0 0
1 0 0
2 1 1
3 1 1
4 1 1
5 2 2
6 2 2
7 3 3
8 3 3
9 0 4
10 0 4
11 1 5
12 1 5
13 2 6
14 2 6
15 2 6
16 3 7
17 3 7
Another idea is to backfill the first missing value created by the shift using bfill:
df['incremental_group_nr'] = df['group_nr'].ne(df['group_nr'].shift().bfill()).cumsum()
print(df)
group_nr incremental_group_nr
0 0 0
1 0 0
2 1 1
3 1 1
4 1 1
5 2 2
6 2 2
7 3 3
8 3 3
9 0 4
10 0 4
11 1 5
12 1 5
13 2 6
14 2 6
15 2 6
16 3 7
17 3 7
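The boundary-plus-cumsum idea is not pandas-specific; here is a minimal NumPy sketch of the same run labelling (array contents copied from the question):
import numpy as np

group_nr = np.array([0, 0, 1, 1, 1, 2, 2, 3, 3, 0, 0, 1, 1, 2, 2, 2, 3, 3])

# mark positions where the value changes from its predecessor,
# then count the changes cumulatively to label each run
change = np.empty(len(group_nr), dtype=bool)
change[0] = False
change[1:] = group_nr[1:] != group_nr[:-1]
incremental = change.cumsum()
print(incremental)  # [0 0 1 1 1 2 2 3 3 4 4 5 5 6 6 6 7 7]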
I want to create a DataFrame that has the columns feature1, month and feature_segment. I have over 3,000 unique values in feature1 and 3 feature_segments, and I now have to map each feature to each month and each feature_segment.
For example, for feature1 = 1 the mapping should create a data frame like this:
feature1 month feature_Segment
1 1 1
1 1 2
1 1 3
1 2 1
1 2 2
1 2 3
1 3 1
1 3 2
1 3 3
1 4 1
1 4 2
1 4 3
1 5 1
1 5 2
1 5 3
1 6 1
1 6 2
1 6 3
1 7 1
1 7 2
1 7 3
1 8 1
1 8 2
1 8 3
1 9 1
1 9 2
1 9 3
1 10 1
1 10 2
1 10 3
1 11 1
1 11 2
1 11 3
1 12 1
1 12 2
1 12 3
Now is there any way to create this data frame without using a for loop?
All the df columns are in lists.
Use itertools.product:
from itertools import product
feature = [1]
feature_Segment = [1,2,3]
month = range(1, 13)
df = pd.DataFrame(product(feature, month, feature_Segment),
                  columns=['feature1','month','feature_Segment'])
print (df.head(10))
feature1 month feature_Segment
0 1 1 1
1 1 1 2
2 1 1 3
3 1 2 1
4 1 2 2
5 1 2 3
6 1 3 1
7 1 3 2
8 1 3 3
9 1 4 1
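An equivalent construction goes through pd.MultiIndex.from_product, which builds the cross product directly and then flattens it into columns; a sketch using the same lists (for the real data, feature would hold the 3,000+ unique feature1 values):
import pandas as pd

feature = [1]
month = range(1, 13)
feature_Segment = [1, 2, 3]

# the full cross product as a MultiIndex, flattened into ordinary columns
df = pd.MultiIndex.from_product(
    [feature, month, feature_Segment],
    names=['feature1', 'month', 'feature_Segment'],
).to_frame(index=False)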
I have a dataframe now:
class1 class2 value value2
0 1 0 1 4
1 2 1 2 3
2 2 0 3 5
3 3 1 4 6
I want to repeat each row according to the difference between value and value2 and insert an incrementing column for those repeats. The dataframe should end up like this:
class1 class2 value value2 value3
0 1 0 1 4 1
1 1 0 1 4 2
2 1 0 1 4 3
3 1 0 1 4 4
4 2 1 2 3 2
5 2 1 2 3 3
6 2 0 3 5 3
7 2 0 3 5 4
8 2 0 3 5 5
9 3 1 4 6 4
10 3 1 4 6 5
11 3 1 4 6 6
I tried it like this:
def func(x):
    copy = x.copy()
    num = x.value2 + 1 - x.value
    return pd.concat([copy] * num.values[0])

df = df.groupby(['class1', 'class2']).apply(lambda x: func(x))
But this creates an ordering problem that leaves me unsure how to add the column value3, and I'd like an elegant way of doing it.
Can anyone help me? Thanks in advance.
Compute the difference and call Index.repeat:
idx = df.index.repeat(df.value2 - df.value + 1)
Now, either use reindex:
df = df.reindex(idx).reset_index(drop=True)
Or loc:
df = df.loc[idx].reset_index(drop=True)
And you get
df
class1 class2 value value2
0 1 0 1 4
1 1 0 1 4
2 1 0 1 4
3 1 0 1 4
4 2 1 2 3
5 2 1 2 3
6 2 0 3 5
7 2 0 3 5
8 2 0 3 5
9 3 1 4 6
10 3 1 4 6
11 3 1 4 6
For the second part of your question, you'll need groupby.cumcount:
s = idx.to_series()
df['value3'] = df['value'] + s.groupby(idx).cumcount().values
df
class1 class2 value value2 value3
0 1 0 1 4 1
1 1 0 1 4 2
2 1 0 1 4 3
3 1 0 1 4 4
4 2 1 2 3 2
5 2 1 2 3 3
6 2 0 3 5 3
7 2 0 3 5 4
8 2 0 3 5 5
9 3 1 4 6 4
10 3 1 4 6 5
11 3 1 4 6 6
Here's a sequence of things that would get you the desired output:
df.join(df
        .apply(lambda x: pd.Series(range(x.value, x.value2+1)), axis=1)
        .stack().astype(int)
        .reset_index(level=1, drop=1)
        .to_frame('value3')).reset_index(drop=1)
Out[]:
class1 class2 value value2 value3
0 1 0 1 4 1
1 1 0 1 4 2
2 1 0 1 4 3
3 1 0 1 4 4
4 2 1 2 3 2
5 2 1 2 3 3
6 2 0 3 5 3
7 2 0 3 5 4
8 2 0 3 5 5
9 3 1 4 6 4
10 3 1 4 6 5
11 3 1 4 6 6
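For reference, here is the repeat-plus-cumcount approach assembled end to end as a self-contained sketch (data copied from the question):
import pandas as pd

df = pd.DataFrame({'class1': [1, 2, 2, 3],
                   'class2': [0, 1, 0, 1],
                   'value':  [1, 2, 3, 4],
                   'value2': [4, 3, 5, 6]})

# repeat each row (value2 - value + 1) times, then offset value by the
# position of each repeat within its original row
idx = df.index.repeat(df['value2'] - df['value'] + 1)
out = df.loc[idx].reset_index(drop=True)
out['value3'] = out['value'] + idx.to_series().groupby(idx).cumcount().values
print(out)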
I have a pandas DataFrame that looks like the following:
Time Measurement
0 0 1
1 1 2
2 2 3
3 3 4
4 4 5
5 0 2
6 1 3
7 2 4
8 3 5
9 4 6
10 0 3
11 1 4
12 2 5
13 3 6
14 4 7
15 0 1
16 1 2
17 2 3
18 3 4
19 4 5
20 0 2
21 1 3
22 2 4
23 3 5
24 4 6
25 0 3
26 1 4
27 2 5
28 3 6
29 4 7
which can be generated with the following code:
import pandas
time=[0,1,2,3,4]
repeat_1_conc_1=[1,2,3,4,5]
repeat_1_conc_2=[2,3,4,5,6]
repeat_1_conc_3=[3,4,5,6,7]
d1=pandas.DataFrame([time,repeat_1_conc_1]).transpose()
d2=pandas.DataFrame([time,repeat_1_conc_2]).transpose()
d3=pandas.DataFrame([time,repeat_1_conc_3]).transpose()
repeat_2_conc_1=[1,2,3,4,5]
repeat_2_conc_2=[2,3,4,5,6]
repeat_2_conc_3=[3,4,5,6,7]
d4=pandas.DataFrame([time,repeat_2_conc_1]).transpose()
d5=pandas.DataFrame([time,repeat_2_conc_2]).transpose()
d6=pandas.DataFrame([time,repeat_2_conc_3]).transpose()
df= pandas.concat([d1,d2,d3,d4,d5,d6]).reset_index()
df.drop('index',axis=1,inplace=True)
df.columns=['Time','Measurement']
print(df)
If you look at the code, you'll see that I have two experimental repeats in the same DataFrame, which should be separated at df.iloc[:15]. Additionally, within each experiment I have 3 sub-experiments that can be thought of like the starting conditions of a dose response, i.e. the first sub-experiment starts with 1, the second with 2 and the third with 3. These should be separated at index intervals of len(time), which is 0-4, i.e. 5 elements per sub-experiment. Could somebody please tell me the best way to separate this data into individual time-course measurements for each experiment? I'm not exactly sure what the best data structure would be, but I just need to be able to easily access the data for each sub-experiment of each experimental repeat. Perhaps something like:
repeat1=
Time Measurement
0 0 1
1 1 2
2 2 3
3 3 4
4 4 5
5 0 2
6 1 3
7 2 4
8 3 5
9 4 6
10 0 3
11 1 4
12 2 5
13 3 6
14 4 7
repeat2=
Time Measurement
15 0 1
16 1 2
17 2 3
18 3 4
19 4 5
20 0 2
21 1 3
22 2 4
23 3 5
24 4 6
25 0 3
26 1 4
27 2 5
28 3 6
29 4 7
IIUC, you may set a MultiIndex so that you can index your DF by experiment and sub-experiment easily:
In [261]: dfi = df.set_index([df.index//15+1, df.index//5 - df.index//15*3 + 1])
In [262]: dfi
Out[262]:
Time Measurement
1 1 0 1
1 1 2
1 2 3
1 3 4
1 4 5
2 0 2
2 1 3
2 2 4
2 3 5
2 4 6
3 0 3
3 1 4
3 2 5
3 3 6
3 4 7
2 1 0 1
1 1 2
1 2 3
1 3 4
1 4 5
2 0 2
2 1 3
2 2 4
2 3 5
2 4 6
3 0 3
3 1 4
3 2 5
3 3 6
3 4 7
selecting subexperiments
In [263]: dfi.loc[1,1]
Out[263]:
Time Measurement
1 1 0 1
1 1 2
1 2 3
1 3 4
1 4 5
In [264]: dfi.loc[2,2]
Out[264]:
Time Measurement
2 2 0 2
2 1 3
2 2 4
2 3 5
2 4 6
select second experiment with all subexperiments:
In [266]: dfi.loc[2,:]
Out[266]:
Time Measurement
1 0 1
1 1 2
1 2 3
1 3 4
1 4 5
2 0 2
2 1 3
2 2 4
2 3 5
2 4 6
3 0 3
3 1 4
3 2 5
3 3 6
3 4 7
alternatively you can create your own slicing function:
def my_slice(rep=1, subexp=1):
    rep -= 1
    subexp -= 1
    return df.loc[rep*15 + subexp*5 : rep*15 + subexp*5 + 4, :]
demo:
In [174]: my_slice(1,1)
Out[174]:
Time Measurement
0 0 1
1 1 2
2 2 3
3 3 4
4 4 5
In [175]: my_slice(2,1)
Out[175]:
Time Measurement
15 0 1
16 1 2
17 2 3
18 3 4
19 4 5
In [176]: my_slice(2,2)
Out[176]:
Time Measurement
20 0 2
21 1 3
22 2 4
23 3 5
24 4 6
PS: a slightly more convenient way to concatenate your DFs:
df = pandas.concat([d1,d2,d3,d4,d5,d6], ignore_index=True)
so you don't need the subsequent .reset_index() and drop().
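The same split can also be expressed with groupby, which hands you each (repeat, sub-experiment) chunk directly; a sketch reusing the question's df (note that df.index % 15 // 5 is equivalent to the df.index//5 - df.index//15*3 used above):
# iterate over every (repeat, sub-experiment) block; each chunk is a 5-row DataFrame
pieces = {}
for (rep, sub), chunk in df.groupby([df.index // 15 + 1, df.index % 15 // 5 + 1]):
    pieces[(rep, sub)] = chunk

print(pieces[(2, 2)])  # second repeat, second sub-experiment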