Separating elements of a pandas DataFrame in Python

I have a pandas DataFrame that looks like the following:
Time Measurement
0 0 1
1 1 2
2 2 3
3 3 4
4 4 5
5 0 2
6 1 3
7 2 4
8 3 5
9 4 6
10 0 3
11 1 4
12 2 5
13 3 6
14 4 7
15 0 1
16 1 2
17 2 3
18 3 4
19 4 5
20 0 2
21 1 3
22 2 4
23 3 5
24 4 6
25 0 3
26 1 4
27 2 5
28 3 6
29 4 7
which can be generated with the following code:
import pandas
time=[0,1,2,3,4]
repeat_1_conc_1=[1,2,3,4,5]
repeat_1_conc_2=[2,3,4,5,6]
repeat_1_conc_3=[3,4,5,6,7]
d1=pandas.DataFrame([time,repeat_1_conc_1]).transpose()
d2=pandas.DataFrame([time,repeat_1_conc_2]).transpose()
d3=pandas.DataFrame([time,repeat_1_conc_3]).transpose()
repeat_2_conc_1=[1,2,3,4,5]
repeat_2_conc_2=[2,3,4,5,6]
repeat_2_conc_3=[3,4,5,6,7]
d4=pandas.DataFrame([time,repeat_2_conc_1]).transpose()
d5=pandas.DataFrame([time,repeat_2_conc_2]).transpose()
d6=pandas.DataFrame([time,repeat_2_conc_3]).transpose()
df= pandas.concat([d1,d2,d3,d4,d5,d6]).reset_index()
df.drop('index',axis=1,inplace=True)
df.columns=['Time','Measurement']
print(df)
If you look at the code, you'll see that I have two experimental repeats in the same DataFrame, which should be separated at df.iloc[:15]. Additionally, within each repeat I have 3 sub-experiments that can be thought of as the starting conditions of a dose response, i.e. the first sub-experiment starts with 1, the second with 2 and the third with 3. These should be separated at index intervals of len(time), i.e. every 5 elements (time runs 0-4) within each experimental repeat. Could somebody please tell me the best way to separate this data into individual time-course measurements for each experiment? I'm not exactly sure what the best data structure would be, but I just need to be able to access the data for each sub-experiment of each experimental repeat easily. Perhaps something like:
repeat1=
Time Measurement
0 0 1
1 1 2
2 2 3
3 3 4
4 4 5
5 0 2
6 1 3
7 2 4
8 3 5
9 4 6
10 0 3
11 1 4
12 2 5
13 3 6
14 4 7
repeat2=
Time Measurement
15 0 1
16 1 2
17 2 3
18 3 4
19 4 5
20 0 2
21 1 3
22 2 4
23 3 5
24 4 6
25 0 3
26 1 4
27 2 5
28 3 6
29 4 7

IIUC, you can set a MultiIndex so that you can index your DataFrame and access experiments and sub-experiments easily:
In [261]: dfi = df.set_index([df.index//15+1, df.index//5 - df.index//15*3 + 1])
In [262]: dfi
Out[262]:
Time Measurement
1 1 0 1
1 1 2
1 2 3
1 3 4
1 4 5
2 0 2
2 1 3
2 2 4
2 3 5
2 4 6
3 0 3
3 1 4
3 2 5
3 3 6
3 4 7
2 1 0 1
1 1 2
1 2 3
1 3 4
1 4 5
2 0 2
2 1 3
2 2 4
2 3 5
2 4 6
3 0 3
3 1 4
3 2 5
3 3 6
3 4 7
Selecting sub-experiments:
In [263]: dfi.loc[1,1]
Out[263]:
Time Measurement
1 1 0 1
1 1 2
1 2 3
1 3 4
1 4 5
In [264]: dfi.loc[2,2]
Out[264]:
Time Measurement
2 2 0 2
2 1 3
2 2 4
2 3 5
2 4 6
Selecting the second experiment with all sub-experiments:
In [266]: dfi.loc[2,:]
Out[266]:
Time Measurement
1 0 1
1 1 2
1 2 3
1 3 4
1 4 5
2 0 2
2 1 3
2 2 4
2 3 5
2 4 6
3 0 3
3 1 4
3 2 5
3 3 6
3 4 7
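As a small optional addition (not part of the original answer), you can also name the index levels so the selections read more clearly; a minimal sketch, assuming the dfi built above:
dfi.index.names = ['repeat', 'subexp']   # label the two MultiIndex levels
dfi.loc[(2, 3), :]                       # repeat 2, sub-experiment 3
dfi.xs(2, level='subexp')                # sub-experiment 2 across both repeats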
Alternatively, you can create your own slicing function (using .iloc, since .ix is deprecated):
def my_slice(rep=1, subexp=1):
    # convert the 1-based repeat / sub-experiment numbers to 0-based offsets
    rep -= 1
    subexp -= 1
    start = rep * 15 + subexp * 5
    return df.iloc[start:start + 5, :]   # 5 rows per sub-experiment
Demo:
In [174]: my_slice(1,1)
Out[174]:
Time Measurement
0 0 1
1 1 2
2 2 3
3 3 4
4 4 5
In [175]: my_slice(2,1)
Out[175]:
Time Measurement
15 0 1
16 1 2
17 2 3
18 3 4
19 4 5
In [176]: my_slice(2,2)
Out[176]:
Time Measurement
20 0 2
21 1 3
22 2 4
23 3 5
24 4 6
P.S. A slightly more convenient way to concatenate your DataFrames:
df = pandas.concat([d1, d2, d3, d4, d5, d6], ignore_index=True)
so you don't need the subsequent .reset_index() and .drop() calls.
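If you prefer separate objects, a further hedged alternative is to collect the pieces into a dict keyed by (repeat, sub-experiment); the key arithmetic below is an assumption based on the 15-row repeats and 5-row sub-experiments described in the question:
import numpy as np

idx = np.arange(len(df))
repeat_key = idx // 15 + 1          # 1 or 2
subexp_key = (idx % 15) // 5 + 1    # 1, 2 or 3
pieces = {key: sub for key, sub in df.groupby([repeat_key, subexp_key])}
pieces[(1, 2)]                      # repeat 1, sub-experiment 2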

Related

pandas restart cumsum every time the value is zero

So I have a Series that I want to cumsum, but starting over every time I hit a 0, something like this:
    orig  wanted result
0      0              0
1      1              1
2      1              2
3      1              3
4      1              4
5      1              5
6      1              6
7      0              0
8      1              1
9      1              2
10     1              3
11     0              0
12     1              1
13     1              2
14     1              3
15     1              4
16     1              5
17     1              6
Any ideas? (pandas, pure Python, other)
Use df['orig'].eq(0).cumsum() to generate groups starting on each 0, then cumcount to get the increasing values:
df['result'] = df.groupby(df['orig'].eq(0).cumsum()).cumcount()
output:
orig wanted result result
0 0 0 0
1 1 1 1
2 1 2 2
3 1 3 3
4 1 4 4
5 1 5 5
6 1 6 6
7 0 0 0
8 1 1 1
9 1 2 2
10 1 3 3
11 0 0 0
12 1 1 1
13 1 2 2
14 1 3 3
15 1 4 4
16 1 5 5
17 1 6 6
Intermediate:
df['orig'].eq(0).cumsum()
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 2
8 2
9 2
10 2
11 3
12 3
13 3
14 3
15 3
16 3
17 3
Name: orig, dtype: int64
import pandas as pd

# same idea: flag the zeros; the cumulative sum is a group id that restarts at each 0
condition = df['orig'].eq(0)
df['reset'] = condition.cumsum()
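For completeness, a minimal self-contained sketch of the same approach, assuming the orig values from the question:
import pandas as pd

df = pd.DataFrame({'orig': [0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]})
groups = df['orig'].eq(0).cumsum()            # group id restarts at every 0
df['result'] = df.groupby(groups).cumcount()  # 0, 1, 2, ... within each group
print(df)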

Assign 1 value to random sample of group where the sample size is equal to the value of another column

I want to randomly assign the value 1 to the IsShade column (the output) such that, within each group, 1 is assigned exactly D times (the Shading column, e.g. 2, 5 or 3 times) across the E rows of that group (the Total column, e.g. 6, 8 or 5 rows).
The dataset has about 1 million rows; a sample of the input is shown below.
Input:
In[1]:
Sr Series Parallel Shading Total Cell
0 0 3 2 2 6 1
1 1 3 2 2 6 2
2 2 3 2 2 6 3
3 3 3 2 2 6 4
4 4 3 2 2 6 5
5 5 3 2 2 6 6
6 6 4 2 5 8 1
7 7 4 2 5 8 2
8 8 4 2 5 8 3
9 9 4 2 5 8 4
10 10 4 2 5 8 5
11 11 4 2 5 8 6
12 12 4 2 5 8 7
13 13 4 2 5 8 8
14 14 5 1 3 5 1
15 15 5 1 3 5 2
16 16 5 1 3 5 3
17 17 5 1 3 5 4
18 18 5 1 3 5 5
Any help or Python code to achieve this would be appreciated. Thank you.
Example Expected Output:
Out[1]:
Sr Series Parallel Shading Total Cell IsShade
0 0 3 2 2 6 1 0
1 1 3 2 2 6 2 0
2 2 3 2 2 6 3 1
3 3 3 2 2 6 4 0
4 4 3 2 2 6 5 0
5 5 3 2 2 6 6 1
6 6 4 2 5 8 1 1
7 7 4 2 5 8 2 0
8 8 4 2 5 8 3 1
9 9 4 2 5 8 4 1
10 10 4 2 5 8 5 0
11 11 4 2 5 8 6 0
12 12 4 2 5 8 7 1
13 13 4 2 5 8 8 1
14 14 5 1 3 5 1 0
15 15 5 1 3 5 2 1
16 16 5 1 3 5 3 0
17 17 5 1 3 5 4 1
18 18 5 1 3 5 5 1
You can group the rows and use .sample within a .groupby to randomly select as many rows per group as the integer in the Shading column. The sampled rows come back as True (the rest become NaN and are filled with False), and .astype(int) then converts True to 1 and False to 0:
s = df['Series'].ne(df['Series'].shift()).cumsum()  # s is a unique identifier per group
df['IsShade'] = (df.groupby(s, group_keys=False)
                   .apply(lambda x: x['Shading'].sample(x['Shading'].iloc[0])) > 0)
df['IsShade'] = df['IsShade'].fillna(False).astype(int)
df
Out[1]:
Sr Series Parallel Shading Total Cell IsShade
0 0 3 2 2 6 1 0
1 1 3 2 2 6 2 0
2 2 3 2 2 6 3 0
3 3 3 2 2 6 4 0
4 4 3 2 2 6 5 1
5 5 3 2 2 6 6 1
6 6 4 2 5 8 1 1
7 7 4 2 5 8 2 1
8 8 4 2 5 8 3 0
9 9 4 2 5 8 4 0
10 10 4 2 5 8 5 1
11 11 4 2 5 8 6 1
12 12 4 2 5 8 7 1
13 13 4 2 5 8 8 0
14 14 5 1 3 5 1 1
15 15 5 1 3 5 2 0
16 16 5 1 3 5 3 0
17 17 5 1 3 5 4 1
18 18 5 1 3 5 5 1
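For a DataFrame with around a million rows, drawing index positions with NumPy per group may be faster than calling .sample on each group; a hedged sketch (the shade_flags helper is introduced here for illustration and is not from the original answer):
import numpy as np
import pandas as pd

def shade_flags(g):
    # hypothetical helper: set 1 on `Shading` randomly chosen rows of the group, 0 elsewhere
    flags = np.zeros(len(g), dtype=int)
    k = int(g['Shading'].iloc[0])
    flags[np.random.choice(len(g), size=k, replace=False)] = 1
    return pd.Series(flags, index=g.index)

s = df['Series'].ne(df['Series'].shift()).cumsum()  # same group id as above
df['IsShade'] = df.groupby(s, group_keys=False).apply(shade_flags)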

Change repeating groups in a column to incremental groups

I have the following dataframe:
df = pd.DataFrame({'group_nr':[0,0,1,1,1,2,2,3,3,0,0,1,1,2,2,2,3,3]})
print(df)
group_nr
0 0
1 0
2 1
3 1
4 1
5 2
6 2
7 3
8 3
9 0
10 0
11 1
12 1
13 2
14 2
15 2
16 3
17 3
and would like to change from repeating group numbers to incremental group numbers:
group_nr incremental_group_nr
0 0 0
1 0 0
2 1 1
3 1 1
4 1 1
5 2 2
6 2 2
7 3 3
8 3 3
9 0 4
10 0 4
11 1 5
12 1 5
13 2 6
14 2 6
15 2 6
16 3 7
17 3 7
I can't find a way of doing this without looping through the rows. Does someone have an idea how to implement this nicely?
You can check whether each value differs from the previous one (using shift), and take a cumsum of the boolean Series to generate the groups:
df['incremental_group_nr'] = df.group_nr.ne(df.group_nr.shift()).cumsum().sub(1)
print(df)
group_nr incremental_group_nr
0 0 0
1 0 0
2 1 1
3 1 1
4 1 1
5 2 2
6 2 2
7 3 3
8 3 3
9 0 4
10 0 4
11 1 5
12 1 5
13 2 6
14 2 6
15 2 6
16 3 7
17 3 7
Compare with the shifted values using Series.shift and not-equal via Series.ne, then take the cumulative sum and subtract 1:
df['incremental_group_nr'] = df['group_nr'].ne(df['group_nr'].shift()).cumsum() - 1
print(df)
group_nr incremental_group_nr
0 0 0
1 0 0
2 1 1
3 1 1
4 1 1
5 2 2
6 2 2
7 3 3
8 3 3
9 0 4
10 0 4
11 1 5
12 1 5
13 2 6
14 2 6
15 2 6
16 3 7
17 3 7
Another idea is to backfill the first missing value created by the shift with bfill:
df['incremental_group_nr'] = df['group_nr'].ne(df['group_nr'].shift().bfill()).cumsum()
print(df)
group_nr incremental_group_nr
0 0 0
1 0 0
2 1 1
3 1 1
4 1 1
5 2 2
6 2 2
7 3 3
8 3 3
9 0 4
10 0 4
11 1 5
12 1 5
13 2 6
14 2 6
15 2 6
16 3 7
17 3 7
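As a side note (an addition, not from the original answers), pd.factorize can also turn the cumsum groups into 0-based consecutive labels, which avoids the explicit subtraction:
import pandas as pd

groups = df['group_nr'].ne(df['group_nr'].shift()).cumsum()
df['incremental_group_nr'] = pd.factorize(groups)[0]  # 0-based codes in order of appearance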

With python dataframes add column of counts of rows that meet condition to each row that meets it

Say I have a python DataFrame with the following structure:
pd.DataFrame([[1,2,3,4],[1,2,3,4],[1,3,5,6],[1,4,6,7],[1,4,6,7],[1,4,6,7]])
Out[262]:
0 1 2 3
0 1 2 3 4
1 1 2 3 4
2 1 3 5 6
3 1 4 6 7
4 1 4 6 7
5 1 4 6 7
How can I add a column called 'ct' that, for each row, counts how many rows in the DataFrame have the same values in columns 1-3, so that the completed DataFrame looks like this:
0 1 2 3 ct
0 1 2 3 4 2
1 1 2 3 4 2
2 1 3 5 6 1
3 1 4 6 7 3
4 1 4 6 7 3
5 1 4 6 7 3
You can use groupby + transform + size:
df['ct'] = df.groupby([1,2,3])[1].transform('size')
#alternatively
#df['ct'] = df.groupby([1,2,3])[1].transform(len)
print (df)
0 1 2 3 ct
0 1 2 3 4 2
1 1 2 3 4 2
2 1 3 5 6 1
3 1 4 6 7 3
4 1 4 6 7 3
5 1 4 6 7 3
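Another hedged option, if you prefer an explicit lookup table of counts, is to compute the group sizes once and merge them back; a minimal sketch using the integer column labels from the question:
sizes = df.groupby([1, 2, 3]).size().reset_index(name='ct')  # one row per unique (1, 2, 3) combination
df = df.merge(sizes, on=[1, 2, 3], how='left')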

SQL average ok values per group

I want to get some values out of my table; however, due to the structure of the table, I am not sure whether that is possible.
A sample of the table is below; I have added some spacing to make it easier to read.
As you can see, there are many pages, with 3 rows per page and 3 channels per row.
There are also devices; each device consists of 3 segments, and there are 9 devices per page.
Finally, there is a result per segment from a certain analysis: passed=1 or passed=0.
ID channel row page segment device passed
1 1 1 1 1 1 1
2 1 1 1 2 1 0
3 1 1 1 3 1 1
4 2 1 1 1 2 0
5 2 1 1 2 2 0
6 2 1 1 3 2 0
7 3 1 1 1 3 1
8 3 1 1 2 3 0
9 3 1 1 3 3 1
10 1 2 1 1 4 1
11 1 2 1 2 4 1
12 1 2 1 3 4 1
13 2 2 1 1 5 0
14 2 2 1 2 5 1
15 2 2 1 3 5 1
16 3 2 1 1 6 1
17 3 2 1 2 6 1
18 3 2 1 3 6 1
19 1 3 1 1 7 1
20 1 3 1 2 7 0
21 1 3 1 3 7 1
22 2 3 1 1 8 1
23 2 3 1 2 8 0
24 2 3 1 3 8 1
25 3 3 1 1 9 1
26 3 3 1 2 9 0
27 3 3 1 3 9 1
NEXT PAGE..................................
28 1 1 2 1 1 0
29 1 1 2 2 1 0
30 1 1 2 3 1 1
31 2 1 2 1 2 1
32 2 1 2 2 2 1
33 2 1 2 3 2 0
34 3 1 2 1 3 1
35 3 1 2 2 3 0
36 3 1 2 3 3 1
37 1 2 2 1 4 1
38 1 2 2 2 4 0
39 1 2 2 3 4 0
40 2 2 2 1 5 0
41 2 2 2 2 5 1
42 2 2 2 3 5 0
43 3 2 2 1 6 1
44 3 2 2 2 6 1
45 3 2 2 3 6 1
46 1 3 2 1 7 0
47 1 3 2 2 7 1
48 1 3 2 3 7 1
49 2 3 2 1 8 1
50 2 3 2 2 8 1
51 2 3 2 3 8 1
52 3 3 2 1 9 0
53 3 3 2 2 9 1
54 3 3 2 3 9 0
etc.......................................
The calculation I am looking for is the percentage of passed devices (passed=1), grouped by device. The difficult part for me is that a device passes only if all of its segments pass.
So a device passes if seg1=1 and seg2=1 and seg3=1.
An expected output would be something like (not real data):
Device passed
1 27.45
2 56.78
3 78.9
4 11.23
5 etc
6 etc
7 etc
8 etc
9 etc
I know that I need to use something like the following, however it is not working:
SELECT device, count(passed) as devicesOK
FROM myTable
WHERE passed=1
group by device
having devicesOK=3
To get the percentage passed, divide the number of passed segments by the total count per device and multiply by 100.
SELECT device, ROUND(SUM(passed) * 100 / COUNT(*), 2) AS passed_percent
FROM myTable
GROUP BY device
HAVING SUM(passed) = 3
COUNT(passed) counts the number of non-NULL values of passed. Since none of your values are NULL, it counts them all.
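If the goal is instead the percentage of pages on which each device fully passed (all three of its segments), one hedged approach is to first collapse each (device, page) pair to a pass/fail flag and then average it per device; a sketch assuming MySQL-style syntax and the column names above:
SELECT device,
       ROUND(AVG(page_passed) * 100, 2) AS passed_percent
FROM (
    SELECT device, page, MIN(passed) AS page_passed   -- 1 only if every segment on the page passed
    FROM myTable
    GROUP BY device, page
) AS per_page
GROUP BY device;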
