Apply function to a MultiIndex dataframe with pandas/python - python

I have the following DataFrame that I wish to apply some date range calculations to. I want to select rows in the date frame where the the date difference between samples for unique persons (from sample_date) is less than 8 weeks and keep the row with the oldest date (i.e. the first sample).
Here is an example dataset. The actual dataset can exceed 200,000 records.
labno name sex dob id location sample_date
1 John A M 12/07/1969 12345 A 12/05/2112
2 John B M 10/01/1964 54321 B 6/12/2010
3 James M 30/08/1958 87878 A 30/04/2012
4 James M 30/08/1958 45454 B 29/04/2012
5 Peter M 12/05/1935 33322 C 15/07/2011
6 John A M 12/07/1969 12345 A 14/05/2012
7 Peter M 12/05/1935 33322 A 23/03/2011
8 Jack M 5/12/1921 65655 B 15/08/2011
9 Jill F 6/08/1986 65459 A 16/02/2012
10 Julie F 4/03/1992 41211 C 15/09/2011
11 Angela F 1/10/1977 12345 A 23/10/2006
12 Mark A M 1/06/1955 56465 C 4/04/2011
13 Mark A M 1/06/1955 45456 C 3/04/2011
14 Mark B M 9/12/1984 55544 A 13/09/2012
15 Mark B M 9/12/1984 55544 A 1/01/2012
Unique persons are those with the same name and dob. For example John A, James, Mark A, and Mark B are unique persons. Mark A however has different id values.
I normally use R for the procedure and generate a list of dataframes based on the name/dob combination and sort each dataframe by sample_date. I then would use a list apply function to determine if the difference in date between the fist and last index within each dataframe to return the oldest if it was less than 8 weeks from the most recent date. It takes forever.
I would welcome a few pointers as to how I might attempt this with python/pandas. I started by making a MultiIndex with name/dob/id. The structure looks like what I want. What I need to do is try applying some of the functions I use in R to select out the rows I need. I have tried selecting with df.xs() but I am not getting very far.
Here is a dictionary of the data that can be loaded easily into pandas (albeit with different column order).
{'dob': {0: '12/07/1969', 1: '10/01/1964', 2: '30/08/1958', 3:
'30/08/1958', 4: '12/05/1935', 5: '12/07/1969', 6: '12/05/1935',
7: '5/12/1921', 8: '6/08/1986', 9: '4/03/1992', 10: '1/10/1977',
11: '1/06/1955', 12: '1/06/1955', 13: '9/12/1984', 14:
'9/12/1984'}, 'id': {0: 12345, 1: 54321, 2: 87878, 3: 45454,
4: 33322, 5: 12345, 6: 33322, 7: 65655, 8: 65459, 9: 41211,
10: 12345, 11: 56465, 12: 45456, 13: 55544, 14: 55544},
'labno': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7:
8, 8: 9, 9: 10, 10: 11, 11: 12, 12: 13, 13: 14, 14: 15},
'location': {0: 'A', 1: 'B', 2: 'A', 3: 'B', 4: 'C', 5: 'A',
6: 'A', 7: 'B', 8: 'A', 9: 'C', 10: 'A', 11: 'C', 12: 'C',
13: 'A', 14: 'A'}, 'name': {0: 'John A', 1: 'John B', 2:
'James', 3: 'James', 4: 'Peter', 5: 'John A', 6: 'Peter', 7:
'Jack', 8: 'Jill', 9: 'Julie', 10: 'Angela', 11: 'Mark A',
12: 'Mark A', 13: 'Mark B', 14: 'Mark B'}, 'sample_date': {0:
'12/05/2112', 1: '6/12/2010', 2: '30/04/2012', 3: '29/04/2012',
4: '15/07/2011', 5: '14/05/2012', 6: '23/03/2011', 7:
'15/08/2011', 8: '16/02/2012', 9: '15/09/2011', 10:
'23/10/2006', 11: '4/04/2011', 12: '3/04/2011', 13:
'13/09/2012', 14: '1/01/2012'}, 'sex': {0: 'M', 1: 'M', 2: 'M',
3: 'M', 4: 'M', 5: 'M', 6: 'M', 7: 'M', 8: 'F', 9: 'F',
10: 'F', 11: 'M', 12: 'M', 13: 'M', 14: 'M'}}

I think what you might be looking for is
def differ(df):
delta = df.sample_date.diff().abs() # only care about magnitude
cond = delta.notnull() & (delta < np.timedelta64(8, 'W'))
return df[cond].max()
delta = df.groupby(['dob', 'name']).apply(differ)
Depending on whether or not you want to keep people who don't have more than 1 sample you can call delta.dropna(how='all') to remove them.
Note that I think you'll need numpy >= 1.7 for the timedelta64 comparison to work correctly, as there are a whole host of problems with timedelta64/datetime64 for numpy < 1.7.

Related

Dataframe Groupby ID to JSON

I'm trying to convert the following dataframe into a JSON file:
id email surveyname question answer
1 lol#gmail s 1 apple
1 lol#gmail s 3 apple/juice
1 lol#gmail s 2 apple-pie
1 lol#gmail s 4 apple-pie
1 lol#gmail s 5 apple|pie|yes
1 lol#gmail s 6 apple
1 lol#gmail s 8 apple
1 lol#gmail s 7 apple
1 lol#gmail s 9 apple
1 lol#gmail s 12 apple
1 lol#gmail s 11 apple
1 lol#gmail s 10 apple_sauce
2 ll#gmail s 1 orange
2 ll#gmail s 3 juice
.
.
To:
{
"df":[
{
"id":"1",
"email:"lol#gmail"
"surveyname":"s",
"1":"apple",
"2":"apple-pie",
"3":"apple/juice",
"4":"apple-pie",
"5":"apple|pie|yes",
"6":"apple",
"7":"apple",
"8":"apple",
"9":"apple",
"10":"apple_sauce",
"11":"apple",
"12":"apple"
},
{
"id": "vid",
"email:"llgmail"
"surveyname: "s"
"1":"orange",
"2":"", # empty
"3":"juice",
.
.
.
}
]
}
It should map all the ids in the df and skip the numbers if they're empty.
Below is a sample for the df I used above. If the whole df for id = 2 needs to be constructed, please let me know and I can edit that in. However, some entries don't have completed values inside the actual df.
d = {'id': {0: 1,
1: 1,
2: 1,
3: 1,
4: 1,
5: 1,
6: 1,
7: 1,
8: 1,
9: 1,
10: 1,
11: 1,
12: 2,
13: 2},
'email': {0: 'lol#gmail',
1: 'lol#gmail',
2: 'lol#gmail',
3: 'lol#gmail',
4: 'lol#gmail',
5: 'lol#gmail',
6: 'lol#gmail',
7: 'lol#gmail',
8: 'lol#gmail',
9: 'lol#gmail',
10: 'lol#gmail',
11: 'lol#gmail',
12: 'll#gmail',
13: 'll#gmail'},
'surveyname': {0: 's',
1: 's',
2: 's',
3: 's',
4: 's',
5: 's',
6: 's',
7: 's',
8: 's',
9: 's',
10: 's',
11: 's',
12: 's',
13: 's'},
'question': {0: 1,
1: 3,
2: 2,
3: 4,
4: 5,
5: 6,
6: 8,
7: 7,
8: 9,
9: 12,
10: 11,
11: 10,
12: 1,
13: 3},
'answer': {0: 'apple',
1: 'apple/juice',
2: 'apple-pie',
3: 'apple-pie',
4: 'apple|pie|yes',
5: 'apple',
6: 'apple',
7: 'apple',
8: 'apple',
9: 'apple',
10: 'apple',
11: 'apple_sauce',
12: 'orange',
13: 'juice'}}
df = pd.DataFrame.from_dict(d)
You can pivot the dataframe before exporting to JSON:
(
df.pivot_table(
index=["id", "email", "surveyname"],
columns="question",
values="answer",
aggfunc="first",
)
.reindex(columns=np.arange(1, 13))
.fillna("")
.reset_index()
.to_json("data.json", orient="records")
)

how to use drop_duplicates() with a condition?

I have a dataframe consisting of two columns. Column A consists of strings, column B consists of numbers. Column A has duplicates that I want to remove. However, I only want to retain those duplicates that have the highest number in column B. This is an example of how my dataframe looks like:
columnA | columnB
---------------------
a | 1
a | 2
b | 2
b | 1
What I want is this:
columnA | columnB
---------------------
a | 2
b | 2
using drop_duplicates()
You can sort your dataframe in descending order based on 'columnB', and use drop_duplicates() on your columnA keeping the first occurence:
df.sort_values(by='columnB',ascending=False).drop_duplicates('columnA',keep='first')
columnA columnB
13 d 555
27 h 6
16 f 6
6 c 3
1 a 2
2 b 2
15 e 1
Sample data (slightly enhanced than your sample):
df.to_dict()
{'columnA': {0: 'a',
1: 'a',
2: 'b',
3: 'b',
4: 'c',
5: 'c',
6: 'c',
7: 'd',
8: 'd',
9: 'd',
10: 'd',
11: 'd',
12: 'd',
13: 'd',
14: 'e',
15: 'e',
16: 'f',
17: 'f',
18: 'f',
19: 'f',
20: 'f',
21: 'f',
22: 'h',
23: 'h',
24: 'h',
25: 'h',
26: 'h',
27: 'h'},
'columnB': {0: 1,
1: 2,
2: 2,
3: 1,
4: 1,
5: 2,
6: 3,
7: 33,
8: 223,
9: 3,
10: 2,
11: 1,
12: 3,
13: 555,
14: 1,
15: 1,
16: 6,
17: 5,
18: 4,
19: 3,
20: 2,
21: 1,
22: 1,
23: 2,
24: 3,
25: 4,
26: 5,
27: 6}}
Just group by 'A' and take the max 'B'
df.groupby('A').max()
Grouping your dataframe by column a, taking only the max of column b and creating a new dataframe by this method can also help as it retains the original dataframe as it is:
df.groupby('columnA')['columB'].max()

pandas group by and create a new column based on group by result [duplicate]

This question already has answers here:
Pandas conditional creation of a series/dataframe column
(13 answers)
Closed 2 years ago.
I have a DataFrame which i need to group by it and add a column based on group by result.
i can able to do groupby but i need a new column named "CLASS" where if the result of groupby "FIRST" column has "3" means it should have PASS else FAIL.
Attached pic for more clarity.
df = pd.DataFrame({'Name': {0: 'Ram',
1: 'Ram',
2: 'Ram',
3: 'Vignesh',
4: 'Vignesh',
5: 'Vignesh',
6: 'Paul',
7: 'Paul',
8: 'Paul',
9: 'Stephen',
10: 'Stephen',
11: 'Stephen',
12: 'Jones',
13: 'Jones',
14: 'Jones'},
'Section': {0: 'A',
1: 'A',
2: 'A',
3: 'B',
4: 'B',
5: 'B',
6: 'C',
7: 'C',
8: 'C',
9: 'D',
10: 'D',
11: 'D',
12: 'E',
13: 'E',
14: 'E'},
'School': {0: 'Don Bosco',
1: 'Don Bosco',
2: 'Don Bosco',
3: 'Don Bosco',
4: 'Don Bosco',
5: 'Don Bosco',
6: 'Don Bosco',
7: 'Don Bosco',
8: 'Don Bosco',
9: 'Don Bosco',
10: 'Don Bosco',
11: 'Don Bosco',
12: 'Don Bosco',
13: 'Don Bosco',
14: 'Don Bosco'},
'Rank': {0: 'First',
1: 'Second',
2: 'First',
3: 'Second',
4: 'Second',
5: 'First',
6: 'First',
7: 'First',
8: 'First',
9: 'Second',
10: 'Second',
11: 'Second',
12: 'First',
13: 'First',
14: 'First'}})
newdf = df.groupby(['Name', 'Section','School','Rank']).size().unstack(fill_value=0)
Actual DataFrame
Actual output: What i tried.
Expected output with class column based on above condition.
You can use numpy.where:
import numpy as np
newdf['Class'] = np.where(newdf.First.eq(3), 'PASS', 'FAIL')
An easy option would be this:
import numpy as np
newdf['Class'] = np.where(newdf['First'] >= 3, 'PASS', 'FAIL')

How to smooth a pandas / matplotlib lineplot?

I have the following df, weekly spend in a number of shops:
shop1 shop2 shop3 shop4 shop5 shop6 shop7 \
date_week
2 4328.85 5058.17 3028.68 2513.28 4204.10 1898.26 2209.75
3 5472.00 5085.59 3874.51 1951.60 2984.71 1416.40 1199.42
4 4665.53 4264.05 2781.70 2958.25 4593.46 2365.88 2079.73
5 5769.36 3460.79 3072.47 1866.19 3803.12 2166.84 1716.71
6 6267.00 4033.58 4053.70 2215.04 3991.31 2382.02 1974.92
7 5436.83 4402.83 3225.98 1761.87 4202.22 2430.71 3091.33
8 4850.43 4900.68 3176.00 3280.95 3483.53 4115.09 2594.01
9 6782.88 3800.03 3865.65 2221.43 4116.28 2638.28 2321.55
10 6248.18 4096.60 5186.52 3224.96 3614.24 2541.00 2708.36
11 4505.18 2889.33 2937.74 2418.34 5565.57 1570.55 1371.54
12 3115.26 1216.82 1759.49 2559.81 1403.61 1550.77 478.34
13 4561.82 827.16 4661.51 3197.90 1515.63 1688.57 247.25
shop8 shop9
date_week
2 3578.81 3134.39
3 4625.10 2676.20
4 3417.16 3870.00
5 3980.78 3439.60
6 3899.42 4192.41
7 4190.60 3989.00
8 4786.40 3484.51
9 6433.02 3474.66
10 4414.19 3809.20
11 3590.10 3414.50
12 4297.57 2094.00
13 3963.27 871.25
If I plot these in a line plot or "spaghetti plot" It works fine.
The goal is the look at trend in weekly sales over the last three months in 9 stores.
But looks a bit messy:
newgraph.plot()
I had a look at similar questions such as this one which uses df.interpolate() but it looks like I need to have missing values in there first. this answer seems to require a time series.
Is there another method to smoothen out the lines?
It doesn't matter if the values are not exactly accurate anymore, some interpolation is fine. All I am interested in is the trend over the last number of weeks. I have also tried logy=True in the plot() method to calm the lines a bit, but it didn't help.
My df, for pd.DataFrame.fromt_dict():
{'shop1': {2: 4328.849999999999,
3: 5472.0,
4: 4665.530000000001,
5: 5769.36,
6: 6267.0,
7: 5436.83,
8: 4850.43,
9: 6782.879999999999,
10: 6248.18,
11: 4505.18,
12: 3115.26,
13: 4561.82},
'shop2': {2: 5058.169999999993,
3: 5085.589999999996,
4: 4264.049999999997,
5: 3460.7899999999977,
6: 4033.579999999998,
7: 4402.829999999999,
8: 4900.679999999997,
9: 3800.0299999999997,
10: 4096.5999999999985,
11: 2889.3300000000004,
12: 1216.8200000000002,
13: 827.16},
'shop3': {2: 3028.679999999997,
3: 3874.5099999999984,
4: 2781.6999999999994,
5: 3072.4699999999984,
6: 4053.6999999999966,
7: 3225.9799999999987,
8: 3175.9999999999973,
9: 3865.6499999999974,
10: 5186.519999999996,
11: 2937.74,
12: 1759.49,
13: 4661.509999999998},
'shop4': {2: 2513.2799999999997,
3: 1951.6000000000001,
4: 2958.25,
5: 1866.1900000000003,
6: 2215.04,
7: 1761.8700000000001,
8: 3280.9499999999994,
9: 2221.43,
10: 3224.9600000000005,
11: 2418.3399999999997,
12: 2559.8099999999995,
13: 3197.9},
'shop5': {2: 4204.0999999999985,
3: 2984.71,
4: 4593.459999999999,
5: 3803.12,
6: 3991.31,
7: 4202.219999999999,
8: 3483.529999999999,
9: 4116.279999999999,
10: 3614.24,
11: 5565.569999999997,
12: 1403.6100000000001,
13: 1515.63},
'shop6': {2: 1898.260000000001,
3: 1416.4000000000005,
4: 2365.8799999999997,
5: 2166.84,
6: 2382.019999999999,
7: 2430.71,
8: 4115.0899999999965,
9: 2638.2800000000007,
10: 2541.0,
11: 1570.5500000000004,
12: 1550.7700000000002,
13: 1688.5700000000004},
'shop7': {2: 2209.75,
3: 1199.42,
4: 2079.7300000000005,
5: 1716.7100000000005,
6: 1974.9200000000005,
7: 3091.329999999999,
8: 2594.0099999999993,
9: 2321.5499999999997,
10: 2708.3599999999983,
11: 1371.5400000000004,
12: 478.34,
13: 247.25000000000003},
'shop8': {2: 3578.8100000000004,
3: 4625.1,
4: 3417.1599999999994,
5: 3980.7799999999997,
6: 3899.4200000000005,
7: 4190.600000000001,
8: 4786.4,
9: 6433.019999999998,
10: 4414.1900000000005,
11: 3590.1,
12: 4297.57,
13: 3963.27},
'shop9': {2: 3134.3900000000003,
3: 2676.2,
4: 3870.0,
5: 3439.6,
6: 4192.41,
7: 3989.0,
8: 3484.51,
9: 3474.66,
10: 3809.2,
11: 3414.5,
12: 2094.0,
13: 871.25}}
You could show the trend by plotting a regression line for the last few weeks, perhaps separately from the actual data, as the plot is already so crowded. I would use seaborn, because it has the convenient regplot() function:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
df.plot(figsize=[12, 10], style='--')
plt.xlim(2, 18)
last4 = df[len(df)-4:]
plt.gca().set_prop_cycle(None)
for shop in df.columns:
sns.regplot(last4.index + 4, shop, data=last4, ci=None, scatter=False)
plt.ylabel(None)
plt.xticks(list(df.index)+[14, 17], labels=list(df.index)+[10, 13]);

How to select rows with certain value between 2 columns from another DataFrame in pandas?

For example, I have 2 Frames, the first one is the one I want to select rows from, the second one contains the creteria for selection.
df1 = pd.DataFrame({'chr': {0: 7, 1: 7, 2: 7, 3: 7, 4: 7, 5: 7, 6: 7},
0: {0: 55241686,
1: 55242415,
2: 55248986,
3: 55259412,
4: 55260459,
5: 55266410,
6: 55268009},
1: {0: 55241736,
1: 55242513,
2: 55249171,
3: 55259567,
4: 55260534,
5: 55266556,
6: 55268064}})
df1
df2 = pd.DataFrame({'chr': {0: 7,
1: 7,
2: 7,
3: 7,
4: 7,
5: 7,
6: 7,
7: 7,
8: 7,
9: 7,
10: 7,
11: 7,
12: 7,
13: 7,
14: 7,
15: 7,
16: 7,
17: 7,
18: 7,
19: 7},
's': {0: 55241646,
1: 55241658,
2: 55241690,
3: 55241718,
4: 55241721,
5: 55241722,
6: 55241727,
7: 55241732,
8: 55242454,
9: 55242457,
10: 55242488,
11: 55242511,
12: 55248991,
13: 55248995,
14: 55248995,
15: 55249000,
16: 55249022,
17: 55249036,
18: 55249053,
19: 55249057},
'e': {0: 55241646,
1: 55241658,
2: 55241690,
3: 55241718,
4: 55241721,
5: 55241722,
6: 55241727,
7: 55241732,
8: 55242454,
9: 55242457,
10: 55242488,
11: 55242511,
12: 55248991,
13: 55248995,
14: 55248995,
15: 55249000,
16: 55249022,
17: 55249036,
18: 55249053,
19: 55249057},
'ref': {0: 'T',
1: 'T',
2: 'A',
3: 'G',
4: 'C',
5: 'G',
6: 'G',
7: 'A',
8: 'G',
9: 'G',
10: 'C',
11: 'G',
12: 'C',
13: 'G',
14: 'G',
15: 'G',
16: 'G',
17: 'G',
18: 'C',
19: 'C'},
'alt': {0: 'C',
1: 'G',
2: 'C',
3: 'A',
4: 'T',
5: 'A',
6: 'A',
7: 'G',
8: 'A',
9: 'A',
10: 'T',
11: 'A',
12: 'G',
13: 'A',
14: 'C',
15: 'A',
16: 'C',
17: 'A',
18: 'G',
19: 'T'}})
df2 here only shows a small part.
df2
what I want to achieve is
for each row in df1, if this row(row_df1) match with certain row in df2 (row_df2) (match means, row_df1['chr']==row_df2['chr'] & row_df1[0] >= row_df2['s'] & row_df11 <= row_df2['e']
in brief,
if the value is fall into certain intervals constructed by df2['s'] and df2['e'], return it.
I believe best case scenario for you is to merge both dataframes first using a common column. In your case "chr". For example as I understand you want all 'chr' from df1 which exist df2, so in that case you just do:
merged_df = df1.merge(df2, on='chr', how='left')
In merge you can use "indicator=True" which will create a new column called "_merge" for you which will indicate the source of each row.
Now when you have your data merged on you can make simple condition statements to get all the needed columns like:
merged_df.loc[(merged_df[0] >= merged_df['s']) & (merged_df[1] >= merged_df ['e'])]
Or you could add a new column as a result, using apply and etc.

Categories