how to use drop_duplicates() with a condition? - python

I have a dataframe consisting of two columns. Column A consists of strings, column B consists of numbers. Column A has duplicates that I want to remove. However, I only want to retain those duplicates that have the highest number in column B. This is an example of how my dataframe looks like:
columnA | columnB
---------------------
a | 1
a | 2
b | 2
b | 1
What I want is this:
columnA | columnB
---------------------
a | 2
b | 2
using drop_duplicates()

You can sort your dataframe in descending order based on 'columnB', and use drop_duplicates() on your columnA keeping the first occurence:
df.sort_values(by='columnB',ascending=False).drop_duplicates('columnA',keep='first')
columnA columnB
13 d 555
27 h 6
16 f 6
6 c 3
1 a 2
2 b 2
15 e 1
Sample data (slightly enhanced than your sample):
df.to_dict()
{'columnA': {0: 'a',
1: 'a',
2: 'b',
3: 'b',
4: 'c',
5: 'c',
6: 'c',
7: 'd',
8: 'd',
9: 'd',
10: 'd',
11: 'd',
12: 'd',
13: 'd',
14: 'e',
15: 'e',
16: 'f',
17: 'f',
18: 'f',
19: 'f',
20: 'f',
21: 'f',
22: 'h',
23: 'h',
24: 'h',
25: 'h',
26: 'h',
27: 'h'},
'columnB': {0: 1,
1: 2,
2: 2,
3: 1,
4: 1,
5: 2,
6: 3,
7: 33,
8: 223,
9: 3,
10: 2,
11: 1,
12: 3,
13: 555,
14: 1,
15: 1,
16: 6,
17: 5,
18: 4,
19: 3,
20: 2,
21: 1,
22: 1,
23: 2,
24: 3,
25: 4,
26: 5,
27: 6}}

Just group by 'A' and take the max 'B'
df.groupby('A').max()

Grouping your dataframe by column a, taking only the max of column b and creating a new dataframe by this method can also help as it retains the original dataframe as it is:
df.groupby('columnA')['columB'].max()

Related

Dataframe Groupby ID to JSON

I'm trying to convert the following dataframe into a JSON file:
id email surveyname question answer
1 lol#gmail s 1 apple
1 lol#gmail s 3 apple/juice
1 lol#gmail s 2 apple-pie
1 lol#gmail s 4 apple-pie
1 lol#gmail s 5 apple|pie|yes
1 lol#gmail s 6 apple
1 lol#gmail s 8 apple
1 lol#gmail s 7 apple
1 lol#gmail s 9 apple
1 lol#gmail s 12 apple
1 lol#gmail s 11 apple
1 lol#gmail s 10 apple_sauce
2 ll#gmail s 1 orange
2 ll#gmail s 3 juice
.
.
To:
{
"df":[
{
"id":"1",
"email:"lol#gmail"
"surveyname":"s",
"1":"apple",
"2":"apple-pie",
"3":"apple/juice",
"4":"apple-pie",
"5":"apple|pie|yes",
"6":"apple",
"7":"apple",
"8":"apple",
"9":"apple",
"10":"apple_sauce",
"11":"apple",
"12":"apple"
},
{
"id": "vid",
"email:"llgmail"
"surveyname: "s"
"1":"orange",
"2":"", # empty
"3":"juice",
.
.
.
}
]
}
It should map all the ids in the df and skip the numbers if they're empty.
Below is a sample for the df I used above. If the whole df for id = 2 needs to be constructed, please let me know and I can edit that in. However, some entries don't have completed values inside the actual df.
d = {'id': {0: 1,
1: 1,
2: 1,
3: 1,
4: 1,
5: 1,
6: 1,
7: 1,
8: 1,
9: 1,
10: 1,
11: 1,
12: 2,
13: 2},
'email': {0: 'lol#gmail',
1: 'lol#gmail',
2: 'lol#gmail',
3: 'lol#gmail',
4: 'lol#gmail',
5: 'lol#gmail',
6: 'lol#gmail',
7: 'lol#gmail',
8: 'lol#gmail',
9: 'lol#gmail',
10: 'lol#gmail',
11: 'lol#gmail',
12: 'll#gmail',
13: 'll#gmail'},
'surveyname': {0: 's',
1: 's',
2: 's',
3: 's',
4: 's',
5: 's',
6: 's',
7: 's',
8: 's',
9: 's',
10: 's',
11: 's',
12: 's',
13: 's'},
'question': {0: 1,
1: 3,
2: 2,
3: 4,
4: 5,
5: 6,
6: 8,
7: 7,
8: 9,
9: 12,
10: 11,
11: 10,
12: 1,
13: 3},
'answer': {0: 'apple',
1: 'apple/juice',
2: 'apple-pie',
3: 'apple-pie',
4: 'apple|pie|yes',
5: 'apple',
6: 'apple',
7: 'apple',
8: 'apple',
9: 'apple',
10: 'apple',
11: 'apple_sauce',
12: 'orange',
13: 'juice'}}
df = pd.DataFrame.from_dict(d)
You can pivot the dataframe before exporting to JSON:
(
df.pivot_table(
index=["id", "email", "surveyname"],
columns="question",
values="answer",
aggfunc="first",
)
.reindex(columns=np.arange(1, 13))
.fillna("")
.reset_index()
.to_json("data.json", orient="records")
)

pandas dataframe How to sequence rows based on conditions

I am trying to sequence an already sequenced dataset based on conditions. The dataframe looks like the following (the intended output is my desired output):
id
activity
duration_sec
cycle
intended_cycle
1
1
start
0.7
1
1
2
1
a
0.3
1
1
3
1
b
0.4
1
1
4
1
c
0.5
1
1
5
1
c
0.5
1
2
6
1
d
0.4
1
2
7
1
e
0.6
1
3
8
1
stop
2
1
3
9
1
start
0.1
2
4
10
1
b
0.3
2
4
11
1
stop
0.2
2
4
12
1
f
0.3
3
5
13
2
stop
40
4
6
14
2
start
3
5
7
15
2
a
0.7
5
7
16
2
a
3
5
7
17
2
b
0.2
5
7
18
2
stop
0.2
5
7
19
2
start
0.1
6
8
20
2
f
0.4
6
8
21
2
g
0.2
6
8
22
2
h
0.5
6
8
23
2
h
6
6
8
24
2
stop
9
6
8
25
2
start
0.2
7
9
26
2
e
0.3
7
9
27
2
f
0.4
7
10
28
2
stop
0.7
7
10
The alphabets are representative of activity names. I'm hoping to sequence as per the intended cycle column. This is based on the condition that within the current sequence:
the first activity duration>0.5s after the "start", would be considered as 1 sequence (or a sub-sequence, if you will).
the duplicate activities occurring consecutively will be considered as one if either value is >0.5s
the next activity that is >0.5s, after the first>0.5s would be considered as the next sequence
if there is no more activity >0.5s after, the sequence will be considered till the cycle column value changes
if there is no activity within the current cycle that is >0.5s, the activities will be considered independent (i.e. row 25-28)
I'd really appreciate any help. Big thanks to there being a community for this!
sample code for df:
data = {'Unnamed: 0': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 12, 12: 13, 13: 14, 14: 15, 15: 16, 16: 17, 17: 18, 18: 19, 19: 20, 20: 21, 21: 22, 22: 23, 23: 24, 24: 25, 25: 26, 26: 27, 27: 28},
'id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 2, 13: 2, 14: 2, 15: 2, 16: 2, 17: 2, 18: 2, 19: 2, 20: 2, 21: 2, 22: 2, 23: 2, 24: 2, 25: 2, 26: 2, 27: 2},
'activity': {0: 'start', 1: 'a', 2: 'b', 3: 'c', 4: 'c', 5: 'd', 6: 'e', 7: 'stop', 8: 'start', 9: 'b', 10: 'stop', 11: 'f', 12: 'stop', 13: 'start', 14: 'a', 15: 'a', 16: 'b', 17: 'stop', 18: 'start', 19: 'f', 20: 'g', 21: 'h', 22: 'h', 23: 'stop', 24: 'start', 25: 'e', 26: 'f', 27: 'stop'},
'duration_sec': {0: 0.7, 1: 0.3, 2: 0.4, 3: 0.5, 4: 0.5, 5: 0.4, 6: 0.6, 7: 2.0, 8: 0.1, 9: 0.3, 10: 0.2, 11: 0.3, 12: 40.0, 13: 3.0, 14: 0.7, 15: 3.0, 16: 0.2, 17: 0.2, 18: 0.1, 19: 0.4, 20: 0.2, 21: 0.5, 22: 6.0, 23: 9.0, 24: 0.2, 25: 0.3, 26: 0.4, 27: 0.7}, 'cycle': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 2, 9: 2, 10: 2, 11: 3, 12: 4, 13: 5, 14: 5, 15: 5, 16: 5, 17: 5, 18: 6, 19: 6, 20: 6, 21: 6, 22: 6, 23: 6, 24: 7, 25: 7, 26: 7, 27: 7}, 'intended_cycle': {0: 1, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4, 9: 4, 10: 4, 11: 5, 12: 6, 13: 7, 14: 7, 15: 7, 16: 7, 17: 7, 18: 8, 19: 8, 20: 8, 21: 8, 22: 8, 23: 8, 24: 9, 25: 9, 26: 10, 27: 10}}
df = pd.DataFrame.from_dict(data)

pd.groupby() intermediary sums

on Python 3.6, pandas 1.1.2
I am trying to make intermediary sums. I could certainly use the sum(level) but this is not elegant nor optimal and I am wondering if there would be a better way.
For example:
df = pd.DataFrame.from_dict({'level_0': {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b', 6: 'b', 7: 'b', 8: 'c', 9: 'c', 10: 'c', 11: 'c', 12: 'c', 13: 'c', 14: 'c', 15: 'c'}, 'level_1': {0: 'aa', 1: 'aa', 2: 'bb', 3: 'aa', 4: 'aa', 5: 'aa', 6: 'cc', 7: 'cc', 8: 'bb', 9: 'bb', 10: 'cc', 11: 'cc', 12: 'cc', 13: 'dd', 14: 'dd', 15: 'dd'}, 'level_2': {0: 'aaa', 1: 'aab', 2: 'bba', 3: 'aaa', 4: 'aab', 5: 'aac', 6: 'cca', 7: 'ccb', 8: 'bba', 9: 'bbb', 10: 'cca', 11: 'ccb', 12: 'ccc', 13: 'dda', 14: 'ddb', 15: 'ddc'}, 'value': {0: 5, 1: 2, 2: 3, 3: 5, 4: 9, 5: 2, 6: 2, 7: 9, 8: 1, 9: 9, 10: 9, 11: 5, 12: 5, 13: 5, 14: 5, 15: 3}}).groupby(by=['level_0', 'level_1', 'level_2']).sum()
Gives me:
value
level_0 level_1 level_2
a aa aaa 5
aab 2
bb bba 3
b aa aaa 5
aab 9
aac 2
cc cca 2
ccb 9
c bb bba 1
bbb 9
cc cca 9
ccb 5
ccc 5
dd dda 5
ddb 5
ddc 3
Now, I would like to be able to get the subtotal for each level_0 and level_1, such as below:
There you are:
import pandas as pd
df = pd.DataFrame.from_dict({'level_0': {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b', 6: 'b', 7: 'b', 8: 'c', 9: 'c', 10: 'c', 11: 'c', 12: 'c', 13: 'c', 14: 'c', 15: 'c'}, 'level_1': {0: 'aa', 1: 'aa', 2: 'bb', 3: 'aa', 4: 'aa', 5: 'aa', 6: 'cc', 7: 'cc', 8: 'bb', 9: 'bb', 10: 'cc', 11: 'cc', 12: 'cc', 13: 'dd', 14: 'dd', 15: 'dd'}, 'level_2': {0: 'aaa', 1: 'aab', 2: 'bba', 3: 'aaa', 4: 'aab', 5: 'aac', 6: 'cca', 7: 'ccb', 8: 'bba', 9: 'bbb', 10: 'cca', 11: 'ccb', 12: 'ccc', 13: 'dda', 14: 'ddb', 15: 'ddc'},
'value': {0: 5, 1: 2, 2: 3, 3: 5, 4: 9, 5: 2, 6: 2, 7: 9, 8: 1, 9: 9, 10: 9, 11: 5, 12: 5, 13: 5, 14: 5, 15: 3}})
gb1 = df.groupby(by=['level_0', 'level_1', 'level_2']).sum().reset_index()
gb2 = df.groupby(by=['level_0', 'level_1']).sum().reset_index()
gb3 = df.groupby(by=['level_0']).sum().reset_index()
gb2['level_2'] = ''
gb3['level_1'] = ''
gb3['level_2'] = ''
gb_all = pd.concat((gb1, gb2, gb3), axis=0)
gb_all.sort_values(['level_0', 'level_1', 'level_2'], inplace=True)
gb_all.reset_index(inplace=True, drop=True)
print(gb_all)
Output:
level_0 level_1 level_2 value
0 a 10
1 a aa 7
2 a aa aaa 5
3 a aa aab 2
4 a bb 3
5 a bb bba 3
6 b 27
7 b aa 16
8 b aa aaa 5
9 b aa aab 9
10 b aa aac 2
11 b cc 11
12 b cc cca 2
13 b cc ccb 9
14 c 42
15 c bb 10
16 c bb bba 1
17 c bb bbb 9
18 c cc 19
19 c cc cca 9
20 c cc ccb 5
21 c cc ccc 5
22 c dd 13
23 c dd dda 5
24 c dd ddb 5
25 c dd ddc 3

How to select rows with certain value between 2 columns from another DataFrame in pandas?

For example, I have 2 Frames, the first one is the one I want to select rows from, the second one contains the creteria for selection.
df1 = pd.DataFrame({'chr': {0: 7, 1: 7, 2: 7, 3: 7, 4: 7, 5: 7, 6: 7},
0: {0: 55241686,
1: 55242415,
2: 55248986,
3: 55259412,
4: 55260459,
5: 55266410,
6: 55268009},
1: {0: 55241736,
1: 55242513,
2: 55249171,
3: 55259567,
4: 55260534,
5: 55266556,
6: 55268064}})
df1
df2 = pd.DataFrame({'chr': {0: 7,
1: 7,
2: 7,
3: 7,
4: 7,
5: 7,
6: 7,
7: 7,
8: 7,
9: 7,
10: 7,
11: 7,
12: 7,
13: 7,
14: 7,
15: 7,
16: 7,
17: 7,
18: 7,
19: 7},
's': {0: 55241646,
1: 55241658,
2: 55241690,
3: 55241718,
4: 55241721,
5: 55241722,
6: 55241727,
7: 55241732,
8: 55242454,
9: 55242457,
10: 55242488,
11: 55242511,
12: 55248991,
13: 55248995,
14: 55248995,
15: 55249000,
16: 55249022,
17: 55249036,
18: 55249053,
19: 55249057},
'e': {0: 55241646,
1: 55241658,
2: 55241690,
3: 55241718,
4: 55241721,
5: 55241722,
6: 55241727,
7: 55241732,
8: 55242454,
9: 55242457,
10: 55242488,
11: 55242511,
12: 55248991,
13: 55248995,
14: 55248995,
15: 55249000,
16: 55249022,
17: 55249036,
18: 55249053,
19: 55249057},
'ref': {0: 'T',
1: 'T',
2: 'A',
3: 'G',
4: 'C',
5: 'G',
6: 'G',
7: 'A',
8: 'G',
9: 'G',
10: 'C',
11: 'G',
12: 'C',
13: 'G',
14: 'G',
15: 'G',
16: 'G',
17: 'G',
18: 'C',
19: 'C'},
'alt': {0: 'C',
1: 'G',
2: 'C',
3: 'A',
4: 'T',
5: 'A',
6: 'A',
7: 'G',
8: 'A',
9: 'A',
10: 'T',
11: 'A',
12: 'G',
13: 'A',
14: 'C',
15: 'A',
16: 'C',
17: 'A',
18: 'G',
19: 'T'}})
df2 here only shows a small part.
df2
what I want to achieve is
for each row in df1, if this row(row_df1) match with certain row in df2 (row_df2) (match means, row_df1['chr']==row_df2['chr'] & row_df1[0] >= row_df2['s'] & row_df11 <= row_df2['e']
in brief,
if the value is fall into certain intervals constructed by df2['s'] and df2['e'], return it.
I believe best case scenario for you is to merge both dataframes first using a common column. In your case "chr". For example as I understand you want all 'chr' from df1 which exist df2, so in that case you just do:
merged_df = df1.merge(df2, on='chr', how='left')
In merge you can use "indicator=True" which will create a new column called "_merge" for you which will indicate the source of each row.
Now when you have your data merged on you can make simple condition statements to get all the needed columns like:
merged_df.loc[(merged_df[0] >= merged_df['s']) & (merged_df[1] >= merged_df ['e'])]
Or you could add a new column as a result, using apply and etc.

Apply function to a MultiIndex dataframe with pandas/python

I have the following DataFrame that I wish to apply some date range calculations to. I want to select rows in the date frame where the the date difference between samples for unique persons (from sample_date) is less than 8 weeks and keep the row with the oldest date (i.e. the first sample).
Here is an example dataset. The actual dataset can exceed 200,000 records.
labno name sex dob id location sample_date
1 John A M 12/07/1969 12345 A 12/05/2112
2 John B M 10/01/1964 54321 B 6/12/2010
3 James M 30/08/1958 87878 A 30/04/2012
4 James M 30/08/1958 45454 B 29/04/2012
5 Peter M 12/05/1935 33322 C 15/07/2011
6 John A M 12/07/1969 12345 A 14/05/2012
7 Peter M 12/05/1935 33322 A 23/03/2011
8 Jack M 5/12/1921 65655 B 15/08/2011
9 Jill F 6/08/1986 65459 A 16/02/2012
10 Julie F 4/03/1992 41211 C 15/09/2011
11 Angela F 1/10/1977 12345 A 23/10/2006
12 Mark A M 1/06/1955 56465 C 4/04/2011
13 Mark A M 1/06/1955 45456 C 3/04/2011
14 Mark B M 9/12/1984 55544 A 13/09/2012
15 Mark B M 9/12/1984 55544 A 1/01/2012
Unique persons are those with the same name and dob. For example John A, James, Mark A, and Mark B are unique persons. Mark A however has different id values.
I normally use R for the procedure and generate a list of dataframes based on the name/dob combination and sort each dataframe by sample_date. I then would use a list apply function to determine if the difference in date between the fist and last index within each dataframe to return the oldest if it was less than 8 weeks from the most recent date. It takes forever.
I would welcome a few pointers as to how I might attempt this with python/pandas. I started by making a MultiIndex with name/dob/id. The structure looks like what I want. What I need to do is try applying some of the functions I use in R to select out the rows I need. I have tried selecting with df.xs() but I am not getting very far.
Here is a dictionary of the data that can be loaded easily into pandas (albeit with different column order).
{'dob': {0: '12/07/1969', 1: '10/01/1964', 2: '30/08/1958', 3:
'30/08/1958', 4: '12/05/1935', 5: '12/07/1969', 6: '12/05/1935',
7: '5/12/1921', 8: '6/08/1986', 9: '4/03/1992', 10: '1/10/1977',
11: '1/06/1955', 12: '1/06/1955', 13: '9/12/1984', 14:
'9/12/1984'}, 'id': {0: 12345, 1: 54321, 2: 87878, 3: 45454,
4: 33322, 5: 12345, 6: 33322, 7: 65655, 8: 65459, 9: 41211,
10: 12345, 11: 56465, 12: 45456, 13: 55544, 14: 55544},
'labno': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7:
8, 8: 9, 9: 10, 10: 11, 11: 12, 12: 13, 13: 14, 14: 15},
'location': {0: 'A', 1: 'B', 2: 'A', 3: 'B', 4: 'C', 5: 'A',
6: 'A', 7: 'B', 8: 'A', 9: 'C', 10: 'A', 11: 'C', 12: 'C',
13: 'A', 14: 'A'}, 'name': {0: 'John A', 1: 'John B', 2:
'James', 3: 'James', 4: 'Peter', 5: 'John A', 6: 'Peter', 7:
'Jack', 8: 'Jill', 9: 'Julie', 10: 'Angela', 11: 'Mark A',
12: 'Mark A', 13: 'Mark B', 14: 'Mark B'}, 'sample_date': {0:
'12/05/2112', 1: '6/12/2010', 2: '30/04/2012', 3: '29/04/2012',
4: '15/07/2011', 5: '14/05/2012', 6: '23/03/2011', 7:
'15/08/2011', 8: '16/02/2012', 9: '15/09/2011', 10:
'23/10/2006', 11: '4/04/2011', 12: '3/04/2011', 13:
'13/09/2012', 14: '1/01/2012'}, 'sex': {0: 'M', 1: 'M', 2: 'M',
3: 'M', 4: 'M', 5: 'M', 6: 'M', 7: 'M', 8: 'F', 9: 'F',
10: 'F', 11: 'M', 12: 'M', 13: 'M', 14: 'M'}}
I think what you might be looking for is
def differ(df):
delta = df.sample_date.diff().abs() # only care about magnitude
cond = delta.notnull() & (delta < np.timedelta64(8, 'W'))
return df[cond].max()
delta = df.groupby(['dob', 'name']).apply(differ)
Depending on whether or not you want to keep people who don't have more than 1 sample you can call delta.dropna(how='all') to remove them.
Note that I think you'll need numpy >= 1.7 for the timedelta64 comparison to work correctly, as there are a whole host of problems with timedelta64/datetime64 for numpy < 1.7.

Categories