Dataframe Groupby ID to JSON - python

I'm trying to convert the following dataframe into a JSON file:
id  email      surveyname  question  answer
1   lol#gmail  s           1         apple
1   lol#gmail  s           3         apple/juice
1   lol#gmail  s           2         apple-pie
1   lol#gmail  s           4         apple-pie
1   lol#gmail  s           5         apple|pie|yes
1   lol#gmail  s           6         apple
1   lol#gmail  s           8         apple
1   lol#gmail  s           7         apple
1   lol#gmail  s           9         apple
1   lol#gmail  s           12        apple
1   lol#gmail  s           11        apple
1   lol#gmail  s           10        apple_sauce
2   ll#gmail   s           1         orange
2   ll#gmail   s           3         juice
...
To:
{
  "df": [
    {
      "id": "1",
      "email": "lol#gmail",
      "surveyname": "s",
      "1": "apple",
      "2": "apple-pie",
      "3": "apple/juice",
      "4": "apple-pie",
      "5": "apple|pie|yes",
      "6": "apple",
      "7": "apple",
      "8": "apple",
      "9": "apple",
      "10": "apple_sauce",
      "11": "apple",
      "12": "apple"
    },
    {
      "id": "2",
      "email": "ll#gmail",
      "surveyname": "s",
      "1": "orange",
      "2": "",  # empty
      "3": "juice",
      ...
    }
  ]
}
It should map all the ids in the df, leaving a question's entry empty when there is no answer for it.
Below is a sample for the df I used above. If the whole df for id = 2 needs to be constructed, please let me know and I can edit that in. However, some entries don't have completed values inside the actual df.
d = {'id': {0: 1,
1: 1,
2: 1,
3: 1,
4: 1,
5: 1,
6: 1,
7: 1,
8: 1,
9: 1,
10: 1,
11: 1,
12: 2,
13: 2},
'email': {0: 'lol#gmail',
1: 'lol#gmail',
2: 'lol#gmail',
3: 'lol#gmail',
4: 'lol#gmail',
5: 'lol#gmail',
6: 'lol#gmail',
7: 'lol#gmail',
8: 'lol#gmail',
9: 'lol#gmail',
10: 'lol#gmail',
11: 'lol#gmail',
12: 'll#gmail',
13: 'll#gmail'},
'surveyname': {0: 's',
1: 's',
2: 's',
3: 's',
4: 's',
5: 's',
6: 's',
7: 's',
8: 's',
9: 's',
10: 's',
11: 's',
12: 's',
13: 's'},
'question': {0: 1,
1: 3,
2: 2,
3: 4,
4: 5,
5: 6,
6: 8,
7: 7,
8: 9,
9: 12,
10: 11,
11: 10,
12: 1,
13: 3},
'answer': {0: 'apple',
1: 'apple/juice',
2: 'apple-pie',
3: 'apple-pie',
4: 'apple|pie|yes',
5: 'apple',
6: 'apple',
7: 'apple',
8: 'apple',
9: 'apple',
10: 'apple',
11: 'apple_sauce',
12: 'orange',
13: 'juice'}}
import pandas as pd

df = pd.DataFrame.from_dict(d)

You can pivot the dataframe before exporting to JSON:
import numpy as np

(
    df.pivot_table(
        index=["id", "email", "surveyname"],
        columns="question",
        values="answer",
        aggfunc="first",
    )
    .reindex(columns=np.arange(1, 13))  # make sure questions 1-12 all appear
    .fillna("")                         # unanswered questions become ""
    .reset_index()
    .to_json("data.json", orient="records")
)
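Note that orient="records" writes a bare list of objects. If you need the top-level "df" key from the desired output, one option (a minimal sketch using the standard json module; the file name is just an example) is to build the records in memory and wrap them yourself:

import json

records = (
    df.pivot_table(
        index=["id", "email", "surveyname"],
        columns="question",
        values="answer",
        aggfunc="first",
    )
    .reindex(columns=range(1, 13))  # questions 1-12, even if unanswered
    .fillna("")
    .reset_index()
    .rename(columns=str)  # JSON object keys must be strings
    .astype(str)          # the desired output shows every value as a string
    .to_dict(orient="records")
)

with open("data.json", "w") as f:
    json.dump({"df": records}, f, indent=2)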

Related

How do I convert the following dataframe into a particular format?

I have timeseries data in the following format:
df.columns= ['Timestamp','Parameter','Value']
Here, the 'Parameter' column repeats the same value for every 5-minute timestamp, and once the parameter changes, the timestamps start repeating for the next parameter. A small sample of the data looks like this:
{'Timestamp': {0: '2019-01-01 06:00:00+00:00',
1: '2019-01-01 06:05:00+00:00',
2: '2019-01-01 06:10:00+00:00',
3: '2019-01-01 06:15:00+00:00',
4: '2019-01-01 06:20:00+00:00',
5: '2019-01-01 06:25:00+00:00',
6: '2019-01-01 06:30:00+00:00',
7: '2019-01-01 06:00:00+00:00',
8: '2019-01-01 06:05:00+00:00',
9: '2019-01-01 06:10:00+00:00',
10: '2019-01-01 06:15:00+00:00',
11: '2019-01-01 06:20:00+00:00',
12: '2019-01-01 06:25:00+00:00',
13: '2019-01-01 06:30:00+00:00',
14: '2019-01-01 06:00:00+00:00',
15: '2019-01-01 06:05:00+00:00',
16: '2019-01-01 06:10:00+00:00',
17: '2019-01-01 06:15:00+00:00',
18: '2019-01-01 06:20:00+00:00',
19: '2019-01-01 06:25:00+00:00',
20: '2019-01-01 06:30:00+00:00',
21: '2019-01-01 06:00:00+00:00',
22: '2019-01-01 06:05:00+00:00',
23: '2019-01-01 06:10:00+00:00',
24: '2019-01-01 06:15:00+00:00',
25: '2019-01-01 06:20:00+00:00',
26: '2019-01-01 06:25:00+00:00',
27: '2019-01-01 06:30:00+00:00'},
'Parameter': {0: 'A',
1: 'A',
2: 'A',
3: 'A',
4: 'A',
5: 'A',
6: 'A',
7: 'B',
8: 'B',
9: 'B',
10: 'B',
11: 'B',
12: 'B',
13: 'B',
14: 'C',
15: 'C',
16: 'C',
17: 'C',
18: 'C',
19: 'C',
20: 'C',
21: 'D',
22: 'D',
23: 'D',
24: 'D',
25: 'D',
26: 'D',
27: 'D'},
'Value': {0: 1,
1: 2,
2: 3,
3: 4,
4: 5,
5: 6,
6: 7,
7: 3,
8: 4,
9: 1,
10: 2,
11: 5,
12: 8,
13: 9,
14: 2,
15: 4,
16: 5,
17: 5,
18: 3,
19: 4,
20: 9,
21: 1,
22: 4,
23: 7,
24: 2,
25: 2,
26: 3,
27: 1}}
I want to change the data into the following format to apply my algorithm:
df.columns=['Timestamp','A','B','C','D']
I have manually prepared a dataframe to show how it should look:
{'timestamp': {0: '2019-01-01 06:00:00+00:00',
1: '2019-01-01 06:05:00+00:00',
2: '2019-01-01 06:10:00+00:00',
3: '2019-01-01 06:15:00+00:00',
4: '2019-01-01 06:20:00+00:00',
5: '2019-01-01 06:25:00+00:00',
6: '2019-01-01 06:30:00+00:00'},
'A': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7},
'B': {0: 3, 1: 4, 2: 1, 3: 2, 4: 5, 5: 8, 6: 9},
'C': {0: 2, 1: 4, 2: 5, 3: 5, 4: 3, 5: 4, 6: 9},
'D': {0: 1, 1: 4, 2: 7, 3: 2, 4: 2, 5: 3, 6: 1}}
How do I solve this problem?
Use (keyword arguments are required for pivot since pandas 2.0):
df.pivot(index='Timestamp', columns='Parameter', values='Value')
Output:
Parameter A B C D
Timestamp
2019-01-01 06:00:00+00:00 1 3 2 1
2019-01-01 06:05:00+00:00 2 4 4 4
2019-01-01 06:10:00+00:00 3 1 5 7
2019-01-01 06:15:00+00:00 4 2 5 2
2019-01-01 06:20:00+00:00 5 5 3 2
2019-01-01 06:25:00+00:00 6 8 4 3
2019-01-01 06:30:00+00:00 7 9 9 1
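The pivot leaves Timestamp in the index and keeps 'Parameter' as the name of the column axis. To get the flat df.columns = ['Timestamp', 'A', 'B', 'C', 'D'] layout you described, a small follow-up sketch (assuming df holds the sample data above):

out = (
    df.pivot(index='Timestamp', columns='Parameter', values='Value')
      .rename_axis(columns=None)  # drop the leftover 'Parameter' axis name
      .reset_index()              # Timestamp back to a regular column
)
print(out)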

Adding missing rows and setting column value to zero based on current dataframe

dic= {'distinct_id': {0: 1,
1: 2,
2: 3,
3: 4,
4: 5},
'first_name': {0: 'Joe',
1: 'Barry',
2: 'David',
3: 'Marcus',
4: 'Anthony'},
'activity': {0: 'Jump',
1: 'Jump',
2: 'Run',
3: 'Run',
4: 'Climb'},
'tasks_completed': {0: 3, 1: 3, 2: 3, 3: 3, 4: 1},
'tasks_available': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3}}
tasks = pd.DataFrame(dic)
I'm trying to make every id/name pair have a row for every unique activity. For example, I want "Joe" to have rows where the activity column is "Run" and "Climb", but with a 0 in the tasks_completed column (those rows not being present already means he hasn't done those activities' tasks). I have tried using df.iterrows(), making lists of the unique ids and activity names and checking whether both are present, but it didn't work. Any help is very appreciated!
This is what I am hoping to have:
tasks_new = {'distinct_id': {0: 1,
1: 2,
2: 3,
3: 4,
4: 5,
5: 1,
6: 1,
7: 2,
8: 2,
9: 3,
10: 3,
11: 4,
12: 4,
13: 5,
14: 5},
'first_name': {0: 'Joe',
1: 'Barry',
2: 'David',
3: 'Marcus',
4: 'Anthony',
5: 'Joe',
6: 'Joe',
7: 'Barry',
8: 'Barry',
9: 'David',
10: 'David',
11: 'Marcus',
12: 'Marcus',
13: 'Anthony',
14: 'Anthony'},
'activity': {0: 'Jump',
1: 'Jump',
2: 'Run',
3: 'Run',
4: 'Climb',
5: 'Run',
6: 'Climb',
7: 'Run',
8: 'Climb',
9: 'Jump',
10: 'Climb',
11: 'Climb',
12: 'Jump',
13: 'Run',
14: 'Jump'},
'tasks_completed': {0: 3,
1: 3,
2: 3,
3: 3,
4: 1,
5: 0,
6: 0,
7: 0,
8: 0,
9: 0,
10: 0,
11: 0,
12: 0,
13: 0,
14: 0},
'tasks_available': {0: 3,
1: 3,
2: 3,
3: 3,
4: 3,
5: 3,
6: 3,
7: 3,
8: 3,
9: 3,
10: 3,
11: 3,
12: 3,
13: 3,
14: 3}}
pd.DataFrame(tasks_new)
idx_cols = ['distinct_id', 'first_name', 'activity']
# unstack moves 'activity' to columns, filling absent person/activity
# pairs with 0; stack then brings it back as rows
tasks.set_index(idx_cols).unstack(fill_value=0).stack().reset_index()
distinct_id first_name activity tasks_completed tasks_available
0 1 Joe Climb 0 0
1 1 Joe Jump 3 3
2 1 Joe Run 0 0
3 2 Barry Climb 0 0
4 2 Barry Jump 3 3
5 2 Barry Run 0 0
6 3 David Climb 0 0
7 3 David Jump 0 0
8 3 David Run 3 3
9 4 Marcus Climb 0 0
10 4 Marcus Jump 0 0
11 4 Marcus Run 3 3
12 5 Anthony Climb 1 3
13 5 Anthony Jump 0 0
14 5 Anthony Run 0 0
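Note that the output above also fills tasks_available with 0 for the added rows, whereas your expected dict keeps it at 3. If that matters, one variant (a sketch; the constant 3 is taken from your sample, where every activity has 3 available tasks) fills the two columns separately:

out = (
    tasks.set_index(idx_cols)
         .unstack()            # missing person/activity pairs become NaN
         .stack(dropna=False)  # keep those NaN rows
         .reset_index()
)
out['tasks_completed'] = out['tasks_completed'].fillna(0).astype(int)
out['tasks_available'] = out['tasks_available'].fillna(3).astype(int)  # 3 tasks per activity in the sample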

pd.groupby() intermediary sums

on Python 3.6, pandas 1.1.2
I am trying to compute intermediate subtotal sums. I could certainly use sum(level=...), but this is neither elegant nor optimal, and I am wondering if there is a better way.
For example:
df = pd.DataFrame.from_dict({'level_0': {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b', 6: 'b', 7: 'b', 8: 'c', 9: 'c', 10: 'c', 11: 'c', 12: 'c', 13: 'c', 14: 'c', 15: 'c'}, 'level_1': {0: 'aa', 1: 'aa', 2: 'bb', 3: 'aa', 4: 'aa', 5: 'aa', 6: 'cc', 7: 'cc', 8: 'bb', 9: 'bb', 10: 'cc', 11: 'cc', 12: 'cc', 13: 'dd', 14: 'dd', 15: 'dd'}, 'level_2': {0: 'aaa', 1: 'aab', 2: 'bba', 3: 'aaa', 4: 'aab', 5: 'aac', 6: 'cca', 7: 'ccb', 8: 'bba', 9: 'bbb', 10: 'cca', 11: 'ccb', 12: 'ccc', 13: 'dda', 14: 'ddb', 15: 'ddc'}, 'value': {0: 5, 1: 2, 2: 3, 3: 5, 4: 9, 5: 2, 6: 2, 7: 9, 8: 1, 9: 9, 10: 9, 11: 5, 12: 5, 13: 5, 14: 5, 15: 3}}).groupby(by=['level_0', 'level_1', 'level_2']).sum()
Gives me:
value
level_0 level_1 level_2
a aa aaa 5
aab 2
bb bba 3
b aa aaa 5
aab 9
aac 2
cc cca 2
ccb 9
c bb bba 1
bbb 9
cc cca 9
ccb 5
ccc 5
dd dda 5
ddb 5
ddc 3
Now, I would like to be able to get the subtotal for each level_0 and level_1, such as below:
There you are:
import pandas as pd
df = pd.DataFrame.from_dict({'level_0': {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b', 6: 'b', 7: 'b', 8: 'c', 9: 'c', 10: 'c', 11: 'c', 12: 'c', 13: 'c', 14: 'c', 15: 'c'}, 'level_1': {0: 'aa', 1: 'aa', 2: 'bb', 3: 'aa', 4: 'aa', 5: 'aa', 6: 'cc', 7: 'cc', 8: 'bb', 9: 'bb', 10: 'cc', 11: 'cc', 12: 'cc', 13: 'dd', 14: 'dd', 15: 'dd'}, 'level_2': {0: 'aaa', 1: 'aab', 2: 'bba', 3: 'aaa', 4: 'aab', 5: 'aac', 6: 'cca', 7: 'ccb', 8: 'bba', 9: 'bbb', 10: 'cca', 11: 'ccb', 12: 'ccc', 13: 'dda', 14: 'ddb', 15: 'ddc'},
'value': {0: 5, 1: 2, 2: 3, 3: 5, 4: 9, 5: 2, 6: 2, 7: 9, 8: 1, 9: 9, 10: 9, 11: 5, 12: 5, 13: 5, 14: 5, 15: 3}})
# detail rows, plus subtotals per (level_0, level_1) and totals per level_0
gb1 = df.groupby(by=['level_0', 'level_1', 'level_2']).sum().reset_index()
gb2 = df.groupby(by=['level_0', 'level_1']).sum().reset_index()
gb3 = df.groupby(by=['level_0']).sum().reset_index()
# blank out the unused levels so subtotals sort ahead of their detail rows
gb2['level_2'] = ''
gb3['level_1'] = ''
gb3['level_2'] = ''
gb_all = pd.concat((gb1, gb2, gb3), axis=0)
gb_all.sort_values(['level_0', 'level_1', 'level_2'], inplace=True)
gb_all.reset_index(inplace=True, drop=True)
print(gb_all)
Output:
level_0 level_1 level_2 value
0 a 10
1 a aa 7
2 a aa aaa 5
3 a aa aab 2
4 a bb 3
5 a bb bba 3
6 b 27
7 b aa 16
8 b aa aaa 5
9 b aa aab 9
10 b aa aac 2
11 b cc 11
12 b cc cca 2
13 b cc ccb 9
14 c 42
15 c bb 10
16 c bb bba 1
17 c bb bbb 9
18 c cc 19
19 c cc cca 9
20 c cc ccb 5
21 c cc ccc 5
22 c dd 13
23 c dd dda 5
24 c dd ddb 5
25 c dd ddc 3
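For deeper hierarchies the same idea generalizes; a possible sketch (my own generalization of the answer above, not part of it) loops over level prefixes instead of spelling out one groupby per depth:

levels = ['level_0', 'level_1', 'level_2']
parts = []
for depth in range(1, len(levels) + 1):
    g = df.groupby(levels[:depth]).sum(numeric_only=True).reset_index()
    for missing in levels[depth:]:
        g[missing] = ''  # blank deeper levels so subtotals sort before details
    parts.append(g)
gb_all = pd.concat(parts).sort_values(levels).reset_index(drop=True)
print(gb_all)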

How to select rows with certain value between 2 columns from another DataFrame in pandas?

For example, I have two DataFrames: the first is the one I want to select rows from, and the second contains the criteria for selection.
df1 = pd.DataFrame({'chr': {0: 7, 1: 7, 2: 7, 3: 7, 4: 7, 5: 7, 6: 7},
0: {0: 55241686,
1: 55242415,
2: 55248986,
3: 55259412,
4: 55260459,
5: 55266410,
6: 55268009},
1: {0: 55241736,
1: 55242513,
2: 55249171,
3: 55259567,
4: 55260534,
5: 55266556,
6: 55268064}})
df1
df2 = pd.DataFrame({'chr': {0: 7,
1: 7,
2: 7,
3: 7,
4: 7,
5: 7,
6: 7,
7: 7,
8: 7,
9: 7,
10: 7,
11: 7,
12: 7,
13: 7,
14: 7,
15: 7,
16: 7,
17: 7,
18: 7,
19: 7},
's': {0: 55241646,
1: 55241658,
2: 55241690,
3: 55241718,
4: 55241721,
5: 55241722,
6: 55241727,
7: 55241732,
8: 55242454,
9: 55242457,
10: 55242488,
11: 55242511,
12: 55248991,
13: 55248995,
14: 55248995,
15: 55249000,
16: 55249022,
17: 55249036,
18: 55249053,
19: 55249057},
'e': {0: 55241646,
1: 55241658,
2: 55241690,
3: 55241718,
4: 55241721,
5: 55241722,
6: 55241727,
7: 55241732,
8: 55242454,
9: 55242457,
10: 55242488,
11: 55242511,
12: 55248991,
13: 55248995,
14: 55248995,
15: 55249000,
16: 55249022,
17: 55249036,
18: 55249053,
19: 55249057},
'ref': {0: 'T',
1: 'T',
2: 'A',
3: 'G',
4: 'C',
5: 'G',
6: 'G',
7: 'A',
8: 'G',
9: 'G',
10: 'C',
11: 'G',
12: 'C',
13: 'G',
14: 'G',
15: 'G',
16: 'G',
17: 'G',
18: 'C',
19: 'C'},
'alt': {0: 'C',
1: 'G',
2: 'C',
3: 'A',
4: 'T',
5: 'A',
6: 'A',
7: 'G',
8: 'A',
9: 'A',
10: 'T',
11: 'A',
12: 'G',
13: 'A',
14: 'C',
15: 'A',
16: 'C',
17: 'A',
18: 'G',
19: 'T'}})
df2 here only shows a small part.
df2
What I want to achieve is: for each row in df1, check whether it matches some row in df2, where a match means
row_df1['chr'] == row_df2['chr'] & row_df1[0] >= row_df2['s'] & row_df1[1] <= row_df2['e']
In brief: if the value falls into one of the intervals constructed by df2['s'] and df2['e'], return it.
I believe the best approach here is to merge both dataframes first on a common column, in your case 'chr'. As I understand it, you want all 'chr' from df1 which exist in df2, so you can just do:
merged_df = df1.merge(df2, on='chr', how='left')
In merge you can pass indicator=True, which creates a new column called '_merge' indicating the source of each row.
Once the data is merged, a simple condition selects the needed rows:
merged_df.loc[(merged_df[0] >= merged_df['s']) & (merged_df[1] <= merged_df['e'])]
Or you could add a new column with the result, using apply, etc.
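Putting the pieces together, a minimal end-to-end sketch. It implements the match criterion exactly as stated in the question; note that with the sample data, where s equals e, you may actually want the reversed containment shown in the comment (the df2 position falling inside the df1 interval):

import pandas as pd

# pair every df1 interval with every df2 position sharing the same 'chr'
merged = df1.merge(df2, on='chr', how='inner')

# criterion as stated in the question
mask = (merged[0] >= merged['s']) & (merged[1] <= merged['e'])
# reversed containment (df2 position inside the df1 interval) would be:
# mask = (merged['s'] >= merged[0]) & (merged['e'] <= merged[1])

selected = merged.loc[mask, ['chr', 0, 1]].drop_duplicates()
print(selected)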

Apply function to a MultiIndex dataframe with pandas/python

I have the following DataFrame that I wish to apply some date-range calculations to. I want to select rows in the dataframe where the date difference between samples for unique persons (from sample_date) is less than 8 weeks, and keep the row with the oldest date (i.e. the first sample).
Here is an example dataset. The actual dataset can exceed 200,000 records.
labno  name    sex  dob         id     location  sample_date
1      John A  M    12/07/1969  12345  A         12/05/2112
2      John B  M    10/01/1964  54321  B         6/12/2010
3      James   M    30/08/1958  87878  A         30/04/2012
4      James   M    30/08/1958  45454  B         29/04/2012
5      Peter   M    12/05/1935  33322  C         15/07/2011
6      John A  M    12/07/1969  12345  A         14/05/2012
7      Peter   M    12/05/1935  33322  A         23/03/2011
8      Jack    M    5/12/1921   65655  B         15/08/2011
9      Jill    F    6/08/1986   65459  A         16/02/2012
10     Julie   F    4/03/1992   41211  C         15/09/2011
11     Angela  F    1/10/1977   12345  A         23/10/2006
12     Mark A  M    1/06/1955   56465  C         4/04/2011
13     Mark A  M    1/06/1955   45456  C         3/04/2011
14     Mark B  M    9/12/1984   55544  A         13/09/2012
15     Mark B  M    9/12/1984   55544  A         1/01/2012
Unique persons are those with the same name and dob. For example John A, James, Mark A, and Mark B are unique persons. Mark A however has different id values.
I normally use R for this procedure: I generate a list of dataframes based on the name/dob combination and sort each dataframe by sample_date. I then use a list-apply function to check the difference between the first and last samples within each dataframe, returning the oldest row if it is less than 8 weeks from the most recent date. It takes forever.
I would welcome a few pointers as to how I might attempt this with python/pandas. I started by making a MultiIndex with name/dob/id. The structure looks like what I want. What I need to do is try applying some of the functions I use in R to select out the rows I need. I have tried selecting with df.xs() but I am not getting very far.
Here is a dictionary of the data that can be loaded easily into pandas (albeit with different column order).
{'dob': {0: '12/07/1969', 1: '10/01/1964', 2: '30/08/1958', 3: '30/08/1958', 4: '12/05/1935', 5: '12/07/1969', 6: '12/05/1935', 7: '5/12/1921', 8: '6/08/1986', 9: '4/03/1992', 10: '1/10/1977', 11: '1/06/1955', 12: '1/06/1955', 13: '9/12/1984', 14: '9/12/1984'},
 'id': {0: 12345, 1: 54321, 2: 87878, 3: 45454, 4: 33322, 5: 12345, 6: 33322, 7: 65655, 8: 65459, 9: 41211, 10: 12345, 11: 56465, 12: 45456, 13: 55544, 14: 55544},
 'labno': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 12, 12: 13, 13: 14, 14: 15},
 'location': {0: 'A', 1: 'B', 2: 'A', 3: 'B', 4: 'C', 5: 'A', 6: 'A', 7: 'B', 8: 'A', 9: 'C', 10: 'A', 11: 'C', 12: 'C', 13: 'A', 14: 'A'},
 'name': {0: 'John A', 1: 'John B', 2: 'James', 3: 'James', 4: 'Peter', 5: 'John A', 6: 'Peter', 7: 'Jack', 8: 'Jill', 9: 'Julie', 10: 'Angela', 11: 'Mark A', 12: 'Mark A', 13: 'Mark B', 14: 'Mark B'},
 'sample_date': {0: '12/05/2112', 1: '6/12/2010', 2: '30/04/2012', 3: '29/04/2012', 4: '15/07/2011', 5: '14/05/2012', 6: '23/03/2011', 7: '15/08/2011', 8: '16/02/2012', 9: '15/09/2011', 10: '23/10/2006', 11: '4/04/2011', 12: '3/04/2011', 13: '13/09/2012', 14: '1/01/2012'},
 'sex': {0: 'M', 1: 'M', 2: 'M', 3: 'M', 4: 'M', 5: 'M', 6: 'M', 7: 'M', 8: 'F', 9: 'F', 10: 'F', 11: 'M', 12: 'M', 13: 'M', 14: 'M'}}
I think what you might be looking for is
import numpy as np

# note: sample_date must already be parsed as datetimes
# (e.g. pd.to_datetime(df['sample_date'], dayfirst=True))
def differ(df):
    delta = df.sample_date.diff().abs()  # only care about magnitude
    cond = delta.notnull() & (delta < np.timedelta64(8, 'W'))
    return df[cond].max()

delta = df.groupby(['dob', 'name']).apply(differ)
Depending on whether or not you want to keep people who don't have more than 1 sample you can call delta.dropna(how='all') to remove them.
Note that I think you'll need numpy >= 1.7 for the timedelta64 comparison to work correctly, as there are a whole host of problems with timedelta64/datetime64 for numpy < 1.7.
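If you specifically want to keep the earliest sample whenever a person has samples less than 8 weeks apart, one possible variant (my own sketch, not the answer above; it assumes sample_date has been parsed with dayfirst=True) is:

import pandas as pd

df['sample_date'] = pd.to_datetime(df['sample_date'], dayfirst=True)
df = df.sort_values('sample_date')

def first_if_close(group):
    gap = group['sample_date'].diff()  # gaps between consecutive samples
    if (gap.dropna() < pd.Timedelta(weeks=8)).any():
        return group.iloc[:1]          # keep the oldest sample
    return group.iloc[:0]              # no close pair: drop this person

result = df.groupby(['name', 'dob'], group_keys=False).apply(first_if_close)
print(result)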
