I'm following a DataCamp course on "efficient data manipulation" in pandas. In their videos, by way of example, they demonstrate the naive approach of looping over the DataFrame to calculate the z-score.
I have found this particular course strange, with what seem to be errors in the code, and I'm wondering if it was written for an older version of Python, but it is more likely just me not getting it.
The DataFrame is basically something like this:
df = pd.DataFrame({'total_bill': {0: 16.99, 1: 10.34, 2: 21.01, 3: 23.68, 4: 24.59, 5: 25.29, 6: 8.77, 7: 26.88, 8: 15.04, 9: 14.78}, 'tip': {0: 1.01, 1: 1.66, 2: 3.5, 3: 3.31, 4: 3.61, 5: 4.71, 6: 2.0, 7: 3.12, 8: 1.96, 9: 3.23}, 'sex': {0: 'Female', 1: 'Male', 2: 'Male', 3: 'Male', 4: 'Female', 5: 'Male', 6: 'Male', 7: 'Male', 8: 'Male', 9: 'Male'}, 'smoker': {0: 'No', 1: 'No', 2: 'No', 3: 'No', 4: 'No', 5: 'No', 6: 'No', 7: 'No', 8: 'No', 9: 'No'}, 'day': {0: 'Sun', 1: 'Sun', 2: 'Sun', 3: 'Sun', 4: 'Sun', 5: 'Sun', 6: 'Sun', 7: 'Sun', 8: 'Sun', 9: 'Sun'}, 'time': {0: 'Dinner', 1: 'Dinner', 2: 'Dinner', 3: 'Dinner', 4: 'Dinner', 5: 'Dinner', 6: 'Dinner', 7: 'Dinner', 8: 'Dinner', 9: 'Dinner'}, 'size': {0: 2, 1: 3, 2: 3, 3: 2, 4: 4, 5: 4, 6: 2, 7: 4, 8: 2, 9: 2}})
So the code on the slides is as follows:
mean_female = df.groupby("sex").mean()["total_bill"]["Female"]
mean_male = df.groupby("sex").mean()["total_bill"]["Male"]
std_female = df.groupby("sex").std()["total_bill"]["Female"]
std_male = df.groupby("sex").std()["total_bill"]["Male"]
Followed by this...
for i in range(len(df)):
    if df.iloc[i,2] == "Female":
        df.iloc[i][0] = (df.iloc[i,0] - mean_female) / std_female
    else:
        df.iloc[i][0] = (df.iloc[i,0] - mean_male) / std_male
When I run the code (which is from DataCamp, not mine) I get the usual SettingWithCopyWarning, but (more importantly) NOTHING happens to the DataFrame.
I assume the objective is to have something like this:
zscore = lambda x: (x - x.mean()) / x.std()
dfsex = df.groupby('sex')
dfzscore = dfsex["total_bill"].transform(zscore)
dfzscore
I'm a little confused, so any help figuring this out is much appreciated.
Cheers!
.iloc[i,0] should be used instead of .iloc[i][0]. After fixing this bug, the DataFrame is updated correctly. Evidence:
df
Out[58]:
total_bill tip sex smoker day time size
0 -0.707107 1.01 Female No Sun Dinner 2
1 -1.138059 1.66 Male No Sun Dinner 3
2 0.402209 3.50 Male No Sun Dinner 3
3 0.787637 3.31 Male No Sun Dinner 2
4 0.707107 3.61 Female No Sun Dinner 4
5 1.020048 4.71 Male No Sun Dinner 4
6 -1.364696 2.00 Male No Sun Dinner 2
7 1.249573 3.12 Male No Sun Dinner 4
8 -0.459590 1.96 Male No Sun Dinner 2
9 -0.497122 3.23 Male No Sun Dinner 2
Explanation: let's take a close look at df.iloc[i][0]. The first step, df.iloc[i], builds a new Series from row i; because the row spans mixed dtypes, that Series is a copy of the data. The second step, [0] = ..., then assigns into that temporary copy, so df itself is never updated.
In short, every index must go inside a single .iloc[] call (or, arguably better in this case, .iat[]) for the assignment to reach the original DataFrame.
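For reference, this is the corrected loop with single-call indexing (a minimal sketch; .iat is the fast scalar accessor, but .iloc[i, 0] works the same way here):
for i in range(len(df)):
    # one .iat[row, col] call reads/writes df directly, no intermediate copy
    if df.iat[i, 2] == "Female":
        df.iat[i, 0] = (df.iat[i, 0] - mean_female) / std_female
    else:
        df.iat[i, 0] = (df.iat[i, 0] - mean_male) / std_male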
Use:
import numpy as np

df = df.assign(total_bill=lambda x: np.where(x['sex'] == 'Female',
                                             (x['total_bill'] - mean_female) / std_female,
                                             (x['total_bill'] - mean_male) / std_male))
Note that assign returns a new DataFrame, so the result must be assigned back.
Instead of:
for i in range(len(df)):
    if df.iloc[i,2] == "Female":
        df.iloc[i][0] = (df.iloc[i,0] - mean_female) / std_female
    else:
        df.iloc[i][0] = (df.iloc[i,0] - mean_male) / std_male
It works on whole Series at once and is much faster than the for loop.
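If you also want to drop the precomputed means and stds, the groupby-transform idea from the question gets you there in two lines (a sketch, assuming the goal is a per-sex z-score of total_bill):
g = df.groupby('sex')['total_bill']
df['total_bill'] = (df['total_bill'] - g.transform('mean')) / g.transform('std')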
I am trying to convert a date column to a proper date format. I have tested some of the possibilities I have read about in the forum, but I still don't know how to tackle this issue:
After importing:
df = pd.read_excel(r'/path/df_datetime.xlsb', sheet_name="12FEB22", engine='pyxlsb')
I get the following df:
{'Unnamed: 0': {0: 'Administrative ID',
1: '000002191',
2: '000002382',
3: '000002434',
4: '000002728',
5: '000002826',
6: '000003265',
7: '000004106',
8: '000004333'},
'Unnamed: 1': {0: 'Service',
1: 'generic',
2: 'generic',
3: 'generic',
4: 'generic',
5: 'generic',
6: 'generic',
7: 'generic',
8: 'generic'},
'Unnamed: 2': {0: 'Movement type',
1: 'New',
2: 'New',
3: 'New',
4: 'Modify',
5: 'New',
6: 'New',
7: 'New',
8: 'New'},
'Unnamed: 3': {0: 'Date',
1: 37503,
2: 37475,
3: 37453,
4: 44186,
5: 37711,
6: 37658,
7: 37770,
8: 37820},
'Unnamed: 4': {0: 'Contract Term',
1: '12',
2: '12',
3: '12',
4: '12',
5: '12',
6: '12',
7: '12',
8: '12'}}
However, even though I have tried to convert the 'Date' column (or 'Unnamed: 3', because the original dataset has no header row, so I have to set the header afterwards) during import, it has been unsuccessful.
Is there anything I can do?
Thanks!
Try this:
from xlrd import xldate_as_datetime

def trans_date(x):
    # Excel stores dates as serial day numbers; datemode 0 = the 1900-based system
    if isinstance(x, int):
        return xldate_as_datetime(x, 0).date()
    else:
        return x

print(df['Unnamed: 3'].apply(trans_date))
>>>
0 Date
1 2002-09-04
2 2002-08-07
3 2002-07-16
4 2020-12-21
5 2003-03-31
6 2003-02-06
7 2003-05-29
8 2003-07-18
Name: Unnamed: 3, dtype: object
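If you'd rather not depend on xlrd, a pandas-only alternative (a sketch, assuming the workbook uses the default 1900 date system) is to treat the column as Excel day serials; the 1899-12-30 origin absorbs Excel's fictitious 1900 leap day:
import pandas as pd

serials = pd.to_numeric(df['Unnamed: 3'], errors='coerce')  # the 'Date' header row becomes NaN
df['Unnamed: 3'] = pd.to_datetime(serials, unit='D', origin='1899-12-30')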
I have a DataFrame like the one below:
A B C D
0 A1 Egypt 10 Yes
1 A1 Morocco 5 No
2 A2 Algeria 4 Yes
3 A3 Egypt 45 No
4 A3 Egypt 17 Yes
5 A3 Tunisia 4 Yes
6 A3 Algeria 32 No
7 A4 Tunisia 7 No
8 A5 Egypt 6 No
9 A5 Morocco 1 No
I want to get the count of Yes and No from column D with respect to column B. The expected output needs to be in lists like those below, which can help plot a multi-variable chart.
Expected output:
yes = [1,2,0,1]
no = [1,2,2,1]
country = ['Algeria', 'Egypt', 'Morocco','Tunisia']
I am not sure how to achieve this from the above dataframe. Any help will be appreciated.
Here is the minimum reproducible dataframe sample:
import pandas as pd
df = pd.DataFrame({'A': {0: 'A1',
1: 'A1',
2: 'A2',
3: 'A3',
4: 'A3',
5: 'A3',
6: 'A3',
7: 'A4',
8: 'A5',
9: 'A5'},
'B': {0: 'Egypt',
1: 'Morocco',
2: 'Algeria',
3: 'Egypt',
4: 'Egypt',
5: 'Tunisia',
6: 'Algeria',
7: 'Tunisia',
8: 'Egypt',
9: 'Morocco'},
'C ': {0: 10, 1: 5, 2: 4, 3: 45, 4: 17, 5: 4, 6: 32, 7: 7, 8: 6, 9: 1},
'D': {0: 'Yes',
1: 'No',
2: 'Yes',
3: 'No',
4: 'Yes',
5: 'Yes',
6: 'No',
7: 'No',
8: 'No',
9: 'No'}}
)
Use crosstab:
df1 = pd.crosstab(df.B, df.D)
print (df1)
D No Yes
B
Algeria 1 1
Egypt 2 2
Morocco 2 0
Tunisia 1 1
Then, for plotting, use DataFrame.plot.bar:
df1.plot.bar()
If need lists:
yes = df1['Yes'].tolist()
no = df1['No'].tolist()
country = df1.index.tolist()
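If you specifically need the lists for a manually built grouped-bar chart, here is a small matplotlib sketch (matplotlib itself is an assumption; the answer above only requires pandas):
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(len(country))
plt.bar(x - 0.2, yes, width=0.4, label='Yes')  # one bar group per country
plt.bar(x + 0.2, no, width=0.4, label='No')
plt.xticks(x, country)
plt.legend()
plt.show()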
Create new boolean columns marking "Yes" and "No"; then group by "B" and sum the newly created columns:
country, yes, no = (
    df.assign(Yes=df['D']=='Yes', No=df['D']=='No')
      .groupby('B')[['Yes', 'No']]
      .sum()
      .reset_index()
      .T
      .to_numpy()
      .tolist()
)
Output:
['Algeria', 'Egypt', 'Morocco', 'Tunisia']
[1, 2, 0, 1]
[1, 2, 2, 1]
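Another route to the same table (a short sketch) is a grouped value_counts, unstacked so Yes/No become columns:
counts = df.groupby('B')['D'].value_counts().unstack(fill_value=0)
yes = counts['Yes'].tolist()
no = counts['No'].tolist()
country = counts.index.tolist()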
I have a dataset like this:
{'SYMBOL': {0: 'BAF180', 1: 'ACTL6A', 2: 'DMAP1', 3: 'C1orf149', 4: 'YEATS4'}, 'Gene Name(s)': {0: ';PB1;BAF180;MGC156155;MGC156156;PBRM1;', 1: ';ACTL6A;ACTL6;BAF53A;MGC5382;', 2: ';DMAP1;DKFZp686L09142;DNMAP1;DNMTAP1;FLJ11543;KIAA1425;EAF2;SWC4;', 3: ';FLJ11730;CDABP0189;C1orf149;NY-SAR-91;RP3-423B22.2;Eaf6;', 4: ';YEATS4;4930573H17Rik;B230215M10Rik;GAS41;NUBI-1;YAF9;'}, 'Description': {0: 'polybromo 1', 1: 'BAF complex 53 kDa subunit|BAF53|BRG1-associated factor|actin-related protein|hArpN beta; actin-like 6A', 2: 'DNA methyltransferase 1 associated protein 1; DNMT1 associated protein 1', 3: 'hypothetical protein LOC64769|sarcoma antigen NY-SAR-91; chromosome 1 open reading frame 149', 4: 'NuMA binding protein 1|glioma-amplified sequence-41; YEATS domain containing 4'}, 'G.O. PROCESS': {0: 'Transcription', 1: 'Transcription', 2: 'Transcription', 3: 'Transcription', 4: 'Transcription'}, 'TurboSEQUESTScore': {0: 70.29, 1: 80.29, 2: 34.18, 3: 30.32, 4: 40.18}, 'Coverage %': {0: 6.7, 1: 28.0, 2: 10.7, 3: 24.2, 4: 21.1}, 'KD': {0: 183572.3, 1: 47430.4, 2: 52959.9, 3: 21501.9, 4: 26482.7}, 'Genebank Accession no': {0: 30794372, 1: 4757718, 2: 13123776, 3: 29164895, 4: 5729838}, 'MS/MS Peptide no.': {0: '9 (9 0 0 0 0)', 1: '9 (9 0 0 0 0)', 2: '4 (3 0 0 1 0)', 3: '3 (3 0 0 0 0)', 4: '4 (4 0 0 0 0)'}}
I want to detect and remove outliers in the column TurboSEQUESTScore, using 3 standard deviations as the threshold for outliers. How can I go about it? This is what I have tried.
The name of the DataFrame is rename_df:
import numpy as np
from scipy import stats

z_scores = stats.zscore(rename_df['TurboSEQUESTScore'])
abs_z_scores = np.abs(z_scores)
filtered_entries = (abs_z_scores < 3).all(axis=None)
I can't seem to solve this properly.
You were approaching it correctly; you just needed to pass the boolean mask abs_z_scores < 3 to your DataFrame, i.e., rename_df[abs_z_scores < 3], to get the desired DataFrame, and then store it in any variable of your choice.
This will do the job in one line and is more readable:
import numpy as np
from scipy import stats
filtered_rename_df = rename_df[(np.abs(stats.zscore(rename_df["TurboSEQUESTScore"])) < 3)]
You'll get a new dataframe named filtered_rename_df with the filtered entries after removing outliers using z-score < 3.
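The same filter also works without scipy (a sketch; note that pandas' std() defaults to ddof=1 while stats.zscore uses ddof=0, so rows sitting exactly on the threshold can differ slightly):
col = rename_df['TurboSEQUESTScore']
mask = ((col - col.mean()) / col.std()).abs() < 3  # per-row |z-score| < 3
filtered_rename_df = rename_df[mask]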
dic= {'distinct_id': {0: 1,
1: 2,
2: 3,
3: 4,
4: 5},
'first_name': {0: 'Joe',
1: 'Barry',
2: 'David',
3: 'Marcus',
4: 'Anthony'},
'activity': {0: 'Jump',
1: 'Jump',
2: 'Run',
3: 'Run',
4: 'Climb'},
'tasks_completed': {0: 3, 1: 3, 2: 3, 3: 3, 4: 1},
'tasks_available': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3}}
tasks = pd.DataFrame(dic)
I'm trying to make every id/name pair have a row for every unique activity. For example, I want "Joe" to have rows where the activity column is "Run" and "Climb", but with a 0 in the tasks_completed column (those rows not being present already means he hasn't done those activity tasks). I have tried using df.iterrows(), making a list of the unique ids and activity names, and checking whether both are present, but it didn't work. Any help is very appreciated!
This is what I am hoping to have:
tasks_new = {'distinct_id': {0: 1,
1: 2,
2: 3,
3: 4,
4: 5,
5: 1,
6: 1,
7: 2,
8: 2,
9: 3,
10: 3,
11: 4,
12: 4,
13: 5,
14: 5},
'email': {0: 'Joe',
1: 'Barry',
2: 'David',
3: 'Marcus',
4: 'Anthony',
5: 'Joe',
6: 'Joe',
7: 'Barry',
8: 'Barry',
9: 'David',
10: 'David',
11: 'Marcus',
12: 'Marcus',
13: 'Anthony',
14: 'Anthony'},
'activity': {0: 'Jump',
1: 'Jump',
2: 'Run',
3: 'Run',
4: 'Climb',
5: 'Run',
6: 'Climb',
7: 'Run',
8: 'Climb',
9: 'Jump',
10: 'Climb',
11: 'Climb',
12: 'Jump',
13: 'Run',
14: 'Jump'},
'tasks_completed': {0: 3,
1: 3,
2: 3,
3: 3,
4: 1,
5: 0,
6: 0,
7: 0,
8: 0,
9: 0,
10: 0,
11: 0,
12: 0,
13: 0,
14: 0},
'tasks_available': {0: 3,
1: 3,
2: 3,
3: 3,
4: 3,
5: 3,
6: 3,
7: 3,
8: 3,
9: 3,
10: 3,
11: 3,
12: 3,
13: 3,
14: 3}}
pd.DataFrame(tasks_new)
Set the identifying columns as the index, unstack the activity level with fill_value=0 to create the missing rows, then stack it back:
idx_cols = ['distinct_id', 'first_name', 'activity']
tasks.set_index(idx_cols).unstack(fill_value=0).stack().reset_index()
distinct_id first_name activity tasks_completed tasks_available
0 1 Joe Climb 0 0
1 1 Joe Jump 3 3
2 1 Joe Run 0 0
3 2 Barry Climb 0 0
4 2 Barry Jump 3 3
5 2 Barry Run 0 0
6 3 David Climb 0 0
7 3 David Jump 0 0
8 3 David Run 3 3
9 4 Marcus Climb 0 0
10 4 Marcus Jump 0 0
11 4 Marcus Run 3 3
12 5 Anthony Climb 1 3
13 5 Anthony Jump 0 0
14 5 Anthony Run 0 0
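An equivalent way to think about it (a sketch, assuming pandas >= 1.2 for how='cross') is to build the full person x activity grid explicitly and left-join the observed rows onto it:
people = tasks[['distinct_id', 'first_name']].drop_duplicates()
grid = people.merge(tasks[['activity']].drop_duplicates(), how='cross')
out = (grid.merge(tasks, on=['distinct_id', 'first_name', 'activity'], how='left')
           .fillna(0)
           .astype({'tasks_completed': int, 'tasks_available': int}))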
I have the following DataFrame that I wish to apply some date-range calculations to. I want to select rows where the date difference between samples for unique persons (from sample_date) is less than 8 weeks, and keep the row with the oldest date (i.e. the first sample).
Here is an example dataset. The actual dataset can exceed 200,000 records.
labno name sex dob id location sample_date
1 John A M 12/07/1969 12345 A 12/05/2112
2 John B M 10/01/1964 54321 B 6/12/2010
3 James M 30/08/1958 87878 A 30/04/2012
4 James M 30/08/1958 45454 B 29/04/2012
5 Peter M 12/05/1935 33322 C 15/07/2011
6 John A M 12/07/1969 12345 A 14/05/2012
7 Peter M 12/05/1935 33322 A 23/03/2011
8 Jack M 5/12/1921 65655 B 15/08/2011
9 Jill F 6/08/1986 65459 A 16/02/2012
10 Julie F 4/03/1992 41211 C 15/09/2011
11 Angela F 1/10/1977 12345 A 23/10/2006
12 Mark A M 1/06/1955 56465 C 4/04/2011
13 Mark A M 1/06/1955 45456 C 3/04/2011
14 Mark B M 9/12/1984 55544 A 13/09/2012
15 Mark B M 9/12/1984 55544 A 1/01/2012
Unique persons are those with the same name and dob. For example John A, James, Mark A, and Mark B are unique persons. Mark A however has different id values.
I normally use R for this procedure: I generate a list of data frames based on the name/dob combination and sort each data frame by sample_date. I then use a list-apply function that checks the difference in date between the first and last index within each data frame, and returns the oldest row if it is less than 8 weeks from the most recent date. It takes forever.
I would welcome a few pointers as to how I might attempt this with python/pandas. I started by making a MultiIndex with name/dob/id. The structure looks like what I want; what I need now is to apply something like the functions I use in R to select the rows I need. I have tried selecting with df.xs(), but I am not getting very far.
Here is a dictionary of the data that can be loaded easily into pandas (albeit with different column order).
{'dob': {0: '12/07/1969', 1: '10/01/1964', 2: '30/08/1958', 3: '30/08/1958', 4: '12/05/1935', 5: '12/07/1969', 6: '12/05/1935', 7: '5/12/1921', 8: '6/08/1986', 9: '4/03/1992', 10: '1/10/1977', 11: '1/06/1955', 12: '1/06/1955', 13: '9/12/1984', 14: '9/12/1984'},
 'id': {0: 12345, 1: 54321, 2: 87878, 3: 45454, 4: 33322, 5: 12345, 6: 33322, 7: 65655, 8: 65459, 9: 41211, 10: 12345, 11: 56465, 12: 45456, 13: 55544, 14: 55544},
 'labno': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 12, 12: 13, 13: 14, 14: 15},
 'location': {0: 'A', 1: 'B', 2: 'A', 3: 'B', 4: 'C', 5: 'A', 6: 'A', 7: 'B', 8: 'A', 9: 'C', 10: 'A', 11: 'C', 12: 'C', 13: 'A', 14: 'A'},
 'name': {0: 'John A', 1: 'John B', 2: 'James', 3: 'James', 4: 'Peter', 5: 'John A', 6: 'Peter', 7: 'Jack', 8: 'Jill', 9: 'Julie', 10: 'Angela', 11: 'Mark A', 12: 'Mark A', 13: 'Mark B', 14: 'Mark B'},
 'sample_date': {0: '12/05/2112', 1: '6/12/2010', 2: '30/04/2012', 3: '29/04/2012', 4: '15/07/2011', 5: '14/05/2012', 6: '23/03/2011', 7: '15/08/2011', 8: '16/02/2012', 9: '15/09/2011', 10: '23/10/2006', 11: '4/04/2011', 12: '3/04/2011', 13: '13/09/2012', 14: '1/01/2012'},
 'sex': {0: 'M', 1: 'M', 2: 'M', 3: 'M', 4: 'M', 5: 'M', 6: 'M', 7: 'M', 8: 'F', 9: 'F', 10: 'F', 11: 'M', 12: 'M', 13: 'M', 14: 'M'}}
I think what you might be looking for is
def differ(df):
    delta = df.sample_date.diff().abs()  # only care about magnitude
    cond = delta.notnull() & (delta < np.timedelta64(8, 'W'))
    return df[cond].max()
delta = df.groupby(['dob', 'name']).apply(differ)
Depending on whether or not you want to keep people who don't have more than 1 sample, you can call delta.dropna(how='all') to remove them.
Note that I think you'll need numpy >= 1.7 for the timedelta64 comparison to work correctly, as there are a whole host of problems with timedelta64/datetime64 for numpy < 1.7.
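One prerequisite worth making explicit (a sketch; the question's sample_date values are day-first strings): parse the dates and sort before grouping, or diff() won't yield meaningful timedeltas:
df['sample_date'] = pd.to_datetime(df['sample_date'], dayfirst=True)
df = df.sort_values('sample_date')  # keeps diffs chronological within each group
delta = df.groupby(['dob', 'name']).apply(differ)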