I have two reports: one with training status and one master roster. The training report has 15 columns; the master roster has 9. I have created a small sample below. My terminology might not be correct since I'm new to Python.
Training Report (I add the Training column with some conditional logic from the Training Code column. Note that a name can be repeated if they have completed multiple trainings, such as Name2.)
import pandas as pd

df = pd.DataFrame({'Name': ['Name1', 'Name2', 'Name2', 'Name3'],
                   'Office': ['A', 'B', 'B', 'A'],
                   'Position': ['Director', 'Manager', 'Manager', 'Analyst'],
                   'Training Code': ['C3', 'C1-L', 'C2', 'C1-B'],
                   'Training': ['ADV', 'BEG', 'INT', 'BEG']})
Output
Name Office Position Training Code Training
0 Name1 A Director C3 ADV
1 Name2 B Manager C1-L BEG
2 Name2 B Manager C2 INT
3 Name3 A Analyst C1-B BEG
Master Roster (I add the Required column based on the condition of the Status column. This is a unique list of names of everyone on the roster.)
df4 = pd.DataFrame({'Name': ['Name1', 'Name2', 'Name3', 'Name4'],
                    'Office': ['A', 'B', 'A', 'C'],
                    'Position': ['Director', 'Manager', 'Analyst', 'Supervisor'],
                    'Symbol': ['OS', 'BP', 'OD', 'EO'],
                    'Status': [1, 3, 8, 2],
                    'Required': ['Required', 'Required', 'Recommended', 'Required']})
Output
Name Office Position Symbol Status Required
0 Name1 A Director OS 1 Required
1 Name2 B Manager BP 3 Required
2 Name3 A Analyst OD 8 Recommended
3 Name4 C Supervisor EO 2 Required
I need to merge the master roster and training data so it looks like below.
df3 = pd.DataFrame({'Name': ['Name1', 'Name2', 'Name3', 'Name4'],
                    'Office': ['A', 'B', 'A', 'C'],
                    'Position': ['Director', 'Manager', 'Analyst', 'Supervisor'],
                    'Symbol': ['OS', 'BP', 'OD', 'EO'],
                    'Status': [1, 3, 8, 2],
                    'Required': ['Required', 'Required', 'Recommended', 'Required'],
                    'ADV': [1, 0, 0, 0],
                    'INT': [0, 1, 0, 0],
                    'BEG': [0, 1, 1, 0]})
DESIRED OUTPUT (Unique list of names and information about each name - the master roster, merged with a pivoted version of the training report.)
Name Office Position Symbol Status Required ADV INT BEG
0 Name1 A Director OS 1 Required 1 0 0
1 Name2 B Manager BP 3 Required 0 1 1
2 Name3 A Analyst OD 8 Recommended 0 0 1
3 Name4 C Supervisor EO 2 Required 0 0 0
I need to use the master roster to get all the names and the other fields in that report. Then I need to merge that report with a pivoted training report, in which the Training column is broken apart into multiple columns of counts.
My first step was to try to pivot the training report data (not using all the columns) and then merge it with the master roster.
pvt = df.pivot_table(index=['Name', 'Office', 'Position'],
                     columns='Training',
                     fill_value=0,
                     aggfunc='count')
However, I'm not sure if that is the best way, and the pivot output doesn't seem to be merge friendly (I could be wrong). In SQL I would just LEFT JOIN the pivoted training report to the master roster on the Name column.
Any guidance would be greatly appreciated on the easiest and best way to accomplish merging those 2 reports to get my final desired outcome. Please let me know if I need to clarify anything further!
----- UPDATE 2 -------
I was able to merge and then pivot the data set, but it's not quite how I want it to look. The merge looks good, and I only bring in the columns I need.
result = pd.merge(df4,
                  df[['Name', 'Training']],
                  on='Name',
                  how='left')
I then replace the 'NaN' values in the Training column with 'NONE'.
result.update(result[['Training']].fillna('NONE'))
Merge Output
Name Office Position Symbol Status Required Training
0 Name1 A Director OS 1 Required ADV
1 Name2 B Manager BP 3 Required BEG
2 Name2 B Manager BP 3 Required INT
3 Name3 A Analyst OD 8 Recommended BEG
4 Name4 C Supervisor EO 2 Required NONE
However, when I try to pivot the result dataframe, I get 'Empty DataFrame' now.
cols = ['Name', 'Office', 'Position', 'Symbol', 'Status', 'Required']
pvt2 = result.pivot_table(index=cols,
                          columns='Training',
                          fill_value=0,
                          aggfunc='count')
-------- FINAL UPDATE ---------
I got it to work! Yay!
result = pd.merge(df4,
                  df[['Name', 'Training']],
                  on='Name',
                  how='left')
result.update(result[['Training']].fillna('NONE'))

cols = ['Name', 'Office', 'Position', 'Symbol', 'Status', 'Required']
pvt2 = result.pivot_table(index=cols,
                          columns=['Training'],
                          fill_value=0,
                          aggfunc=len)
All I had to do was change aggfunc='count' to aggfunc=len. (My understanding is that with every non-Training column in the index there were no value columns left for 'count' to aggregate, which is why it returned an empty DataFrame, whereas len is applied to each group as a whole and simply counts its rows.) I hope that ends up helping someone else! If anyone has improvements on this, I'm definitely open to those as well.
There might be a better way, but this solution works for me! Again, I'm happy to accept feedback or improvements!
import pandas as pd

# Create DataFrame for training
df = pd.DataFrame({'Name': ['Name1', 'Name2', 'Name2', 'Name3', 'Name1'],
                   'Office': ['A', 'B', 'B', 'A', 'A'],
                   'Position': ['Director', 'Manager', 'Manager', 'Analyst', 'Director'],
                   'Training Code': ['C3', 'C1-L', 'C2', 'C1-B', 'C3'],
                   'Training': ['ADV', 'BEG', 'INT', 'BEG', 'ADV']})

# Create DataFrame for master roster
df4 = pd.DataFrame({'Name': ['Name1', 'Name2', 'Name3', 'Name4'],
                    'Office': ['A', 'B', 'A', 'C'],
                    'Position': ['Director', 'Manager', 'Analyst', 'Supervisor'],
                    'Symbol': ['OS', 'BP', 'OD', 'EO'],
                    'Status': [1, 3, 8, 2],
                    'Required': ['Required', 'Required', 'Recommended', 'Required']})

# Left join the training DataFrame to the master roster DataFrame using the
# 'Name' column as the join key.
result = pd.merge(df4,
                  df[['Name', 'Training']],
                  on='Name',
                  how='left')

# Substitute any NaN values with 'NONE' so the pivot doesn't drop rows with NaN.
result.update(result[['Training']].fillna('NONE'))

# Store all the column headers of the master roster in the 'cols' list.
cols = list(df4.columns)

# Pivot the combined 'result' DataFrame using all the columns from the master
# roster DataFrame as the index. The 'Training' column is broken apart into one
# column per value, and 'aggfunc=len' counts the instances of each value.
pvt2 = result.pivot_table(index=cols,
                          columns='Training',
                          fill_value=0,
                          aggfunc=len)
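One possibly shorter alternative (a sketch against the same df/df4 frames above, not something I have benchmarked) is pd.crosstab plus a merge; it also sidesteps the extra 'NONE' column that the pivot produces, though the training columns may come out in alphabetical order:

# Count each Training value per Name in one step.
counts = pd.crosstab(df['Name'], df['Training'])

# Left-join the counts onto the roster; names with no training get NaN.
merged = df4.merge(counts, left_on='Name', right_index=True, how='left')

# Replace the NaN counts for untrained names (e.g. Name4) with 0.
merged[counts.columns] = merged[counts.columns].fillna(0).astype(int)
print(merged)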
I have a dataset, and I want to get to know our customers by looking at the typical shared characteristics (e.g. "Married customers in their 40s like wine"). This would correspond to the itemset {Married, 40s, Wine}.
How can I create a new dataframe called customer_data_onehot such that rows correspond to customers (as in the original data set) and columns correspond to the categories of each of the ten categorical attributes in the data? The new dataframe should only contain boolean values (True/False or 0/1) such that the value in row i and column j is True (or 1) if and only if the attribute value corresponding to column j holds for the customer corresponding to row i. Display the dataframe.
I have this hint: "For example, for the attribute "Education" there are 5 possible categories: 'Graduation', 'PhD', 'Master', 'Basic', '2n Cycle'. Therefore, the new dataframe must contain one column for each of those attribute values." But I don't understand how I can achieve this.
Can someone guide me here to achieve the correct solution?
I have this code, which imports the CSV file and selects 90% of the data from the original dataset.
import pandas as pd

pre_process = pd.read_csv('customer_data.csv')
pre_process = pre_process.sample(frac=0.9, random_state=413808)
pre_process.to_csv('customer_data_2.csv', index=False)
Use get_dummies:
Set up an MRE (minimal reproducible example):
data = {'Customer': ['A', 'B', 'C'],
        'Marital_Status': ['Together', 'Married', 'Single'],
        'Age_Group': ['40s', '60s', '20s']}
df = pd.DataFrame(data)
print(df)
# Output
Customer Marital_Status Age_Group
0 A Together 40s
1 B Married 60s
2 C Single 20s
out = pd.get_dummies(df.set_index('Customer')).reset_index()
print(out)
# Output
Customer Marital_Status_Married Marital_Status_Single Marital_Status_Together Age_Group_20s Age_Group_40s Age_Group_60s
0 A 0 0 1 0 1 0
1 B 1 0 0 0 0 1
2 C 0 1 0 1 0 0
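Applied to the question's data, a sketch might look like this (the column list below is an assumption; substitute the dataset's actual ten categorical columns):

pre_process = pd.read_csv('customer_data_2.csv')

# Hypothetical list - replace with the real data's ten categorical columns.
categorical_cols = ['Education', 'Marital_Status']

customer_data_onehot = pd.get_dummies(pre_process, columns=categorical_cols, dtype=int)
print(customer_data_onehot.head())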
I am kind of stuck with a silly issue; can someone please help me by pointing out my mistake? I have 5 categorical variables and have created their dummies in individual data frames.
seasons = pd.get_dummies(bike['season'], drop_first=True) #3
weathers = pd.get_dummies(bike['weather'], drop_first=True) #3
days = pd.get_dummies(bike['weekday'], drop_first=True)# 6
months = pd.get_dummies(bike['month'], drop_first=True) # 11
years = pd.get_dummies(bike['yr'], drop_first=True) #1
#will add 24 new columns.
Now, when I try to concat them into my main df:
bike = pd.concat([bike, seasons], axis=1)
bike = pd.concat([bike, weathers], axis=1)
bike = pd.concat([bike, months], axis=1)
bike = pd.concat([bike, days], axis=1)
bike = pd.concat([bike, years], axis=1)
bike.info()
I am getting a KeyError: 0 error on bike.info().
Now, upon investigating, I found it is coming only if I try to concat the year df, which originally indicates one of 2 years (2018: 0, 2019: 1). After the dummy is created, this is how it looks.
2019
0 0
1 0
2 0
3 0
4 0
Please suggest. Thanks!
First of all, do you know why you are using drop_first=True? Just make sure this is what you want (removing the first level and keeping only k-1 categorical levels).
If you want to keep all the original data that was not processed by the get_dummies method, you do not need to use the concat function; it's enough to do bike_with_dummies = pd.get_dummies(bike, columns=['season','weather','weekday','month','yr'], drop_first=True). See Example 1. If you also want to keep the original columns that were turned into dummies, I would recommend the code in Example 2.
Example 1
You have for example this simple DataFrame (taken from pandas doc)
df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': [1, 2, 3]})
When you run
pd.get_dummies(df, columns=['C'], drop_first=True)
it will keep the original columns ("A" and "B") and will convert selected columns ("C" here) to dummies. Output will look like
A B C_2 C_3
0 a b 0 0
1 b a 1 0
2 a c 0 1
Example 2
If you want to keep the original columns as well ("C" from the example above), I would recommend the following:
cols_to_dummies = ["C"] # columns that should be turned into dummies
df_with_dummies = pd.get_dummies(df, columns=cols_to_dummies, drop_first=True)
df_with_dummies_and_original = pd.concat([df[cols_to_dummies], df_with_dummies], axis=1)
The output will look like (note that "C" is included now)
C A B C_2 C_3
0 1 a b 0 0
1 2 b a 1 0
2 3 a c 0 1
So in your case you could run this
cols_to_dummies = ['season','weather','weekday','month','yr']
bike_with_dummies = pd.get_dummies(bike, columns=cols_to_dummies, drop_first=True)
bike_with_dummies_and_original = pd.concat([bike[cols_to_dummies], bike_with_dummies], axis=1)
This approach has the advantage that you can easily change cols_to_dummies to update the list of columns that should be turned into dummies, without adding any extra lines of code.
Final comments: if you prefer better naming, you can use the prefix and prefix_sep parameters or do the renaming yourself at the end.
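For example, a sketch of those parameters (the prefix names here are made up; pd.get_dummies accepts a dict of per-column prefixes):

bike_with_dummies = pd.get_dummies(
    bike,
    columns=['season', 'weather', 'weekday', 'month', 'yr'],
    drop_first=True,
    # made-up prefixes - pick whatever reads best for your report
    prefix={'season': 'season', 'weather': 'weather', 'weekday': 'day',
            'month': 'month', 'yr': 'yr'},
    prefix_sep='_')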
If this does not help you, please provide example DataFrame (content of bike dataframe).
I have a dataframe and a dict, as shown below:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'subject_id': [1, 2, 3, 4, 5],
    'age': [42, 56, 75, 48, 39],
    'date_visit': ['1/1/2020', '3/3/2200', '13/11/2100', '24/05/2198', '30/03/2071'],
    'a11fever': ['Yes', 'No', 'Yes', 'Yes', 'No'],
    'a12diagage': [36, np.nan, np.nan, 40, np.nan],
    'a12diagyr': [np.nan, np.nan, 2091, np.nan, np.nan],
    'a12diagyrago': [6, np.nan, 9, np.nan, np.nan],
    'a20cough': ['Yes', 'No', 'No', 'Yes', 'No'],
    'a21cough': [np.nan, 'Yes', np.nan, np.nan, np.nan],
    'a22agetold': [37, np.nan, np.nan, 46, np.nan],
    'a22yrsago': [np.nan, 6, np.nan, 2, np.nan],
    'a22yrtold': [np.nan, 2194, np.nan, np.nan, np.nan]
})
df['date_visit'] = pd.to_datetime(df['date_visit'])
disease_dict = {'a11fever' : 'fever', 'a20cough' : 'cough','a21cough':'cough'}
This dataframe contains info about patients' medical conditions and dates of diagnosis.
But as you can see, the date of diagnosis is not directly available; we have to derive it from columns that contain keywords like age, yr, ago, and diag, which appear within the next 5-6 columns after the condition column (e.g. a11fever). Look in the next 5 columns after each condition column and you will find the info required for deriving the date. Similarly for other conditions like cough.
I expect my output to be as shown below.
I was trying something like below but it didn't help
df = df[(df['a11fever'] =='Yes') | (df['a20cough'] =='Yes') | (df['a21cough'] =='Yes')]
# we filter by 'Yes' above because we only need dates for people who had a medical condition (fever, cough)
df.fillna(0,inplace=True)
df['diag_date'] = df["date_visit"] - pd.DateOffset(years=df.filter('age'|'yr'|'ago')) # doesn't help throws error. need to use regex here to select non-na values any of other columns
pd.wide_to_long(df, stubnames=['condition', 'diag_date'], i='subject_id', j='grp').sort_index(level=0)
df.melt('subject_id', value_name='valuestring').sort_values('subject_id')
Please note that I know the column names of the diseases beforehand (refer to the dict). What I don't know is the actual column name from which I can get the required info to derive the date, but I know that it contains keywords like age, ago, yr, diag.
diag_date is obtained by subtracting the derived duration from the date_visit column.
Rule screenshot
For example: subject_id 1 visited the hospital on 1/1/2020 for fever and was diagnosed at age 36 (a12diagage), or equivalently 6 years ago (a12diagyrago). We know his current age and date_visit, so we can subtract using either column, which gives us 1/1/2014.
As you can see, I am not able to figure out how to select a column based on a regex and subtract it.
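To make the target arithmetic concrete, here is the calculation for subject_id 1 in isolation (a sketch with the sample values):

visit = pd.Timestamp('2020-01-01')            # date_visit for subject_id 1
years_ago = 6                                 # a12diagyrago, or age 42 - a12diagage 36
diag_date = visit - pd.DateOffset(years=years_ago)
print(diag_date)                              # 2014-01-01 00:00:00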
Use:
# boolean mask of the condition columns (True where the value is 'Yes')
mask = df[list(disease_dict.keys())].eq('Yes')
#assign mask back
df[list(disease_dict.keys())] = mask
#rename columns names by dict
df = df.rename(columns=disease_dict).max(axis=1, level=0)
#filter out False rows
df = df[mask.any(axis=1)]
# move identifier columns into the index so only the year/age and condition columns remain
df = df.set_index(['subject_id','age','date_visit'])
# normalize column names - strip the aDD prefixes (a12, a22, ...)
s = df.columns.to_series()
df.columns = s.str.extract('(yrago|yrsago)', expand=False).fillna(s.str.extract('(age|yr)', expand=False)).fillna(s)
# replace True in the condition columns with the column names
ill = set(disease_dict.values())
df.loc[:, ill] = np.where(df[ill].values, np.array(list(ill)), None)
# rename all condition columns to 'condition'
df = df.rename(columns = dict.fromkeys(ill, 'condition'))
# create a MultiIndex so each 'condition' column starts a new group
cols = np.cumsum(df.columns == 'condition')
df.columns = [df.columns, cols]
#reshape by stack and convert MultiIndex to columns
df = df.stack().rename(columns={'age':'age_ill'}).reset_index().drop('level_3', axis=1)
#subtract ages
df['age_ill'] = df['age'].sub(df['age_ill'])
# 'yrago' has priority, so fill its missing values from 'yrsago', then from age_ill
df['yrago'] = df['yrago'].fillna(df['yrsago']).fillna(df['age_ill']).fillna(0).astype(int)
df = df.drop(['yrsago','age_ill'], axis=1)
#subtract years
df['diag_date1'] = df.apply(lambda x: x["date_visit"] - pd.DateOffset(years=x['yrago']), axis=1)
#replace years
mask1 = df['yr'].notna()
df.loc[mask1, 'diag_date'] = df[mask1].apply(lambda x: x["date_visit"].replace(year=int(x['yr'])), axis=1)
# 'yr' has priority, so fill the remaining missing diag_date values from diag_date1
df['diag_date'] = df['diag_date'].fillna(df['diag_date1'])
df = df.drop(['diag_date1','age','date_visit','yr','yrago'], axis=1)
print (df)
subject_id condition diag_date
0 1 fever 2014-01-01
1 1 cough 2015-01-01
2 2 cough 2194-03-03
3 3 fever 2091-11-13
4 4 fever 2190-05-24
5 4 cough 2196-05-24
I have code like this:
frame[frame['value_text'].str.match('Type 2') | frame['value_text'].str.match('Type II diabetes')].groupby(['value_text','gender'])['value_text'].count()
which returns a series like
value_text gender count
type 2 M 4
type 2 without... M 4
F 3
what I want is
value_text gender count
type 2 M 4
F 0
type 2 without... M 4
F 3
I want to include counts for all genders even when there is no record in the dataframe. How can I do this?
Categorical Data was introduced in pandas specifically for this purpose.
In effect, groupby operations with categorical data automatically calculate the Cartesian product.
You should see additional benefits compared to other functional methods: lower memory usage and data validation.
import pandas as pd

df = pd.DataFrame({'value_text': ['type2', 'type2 without', 'type2'],
                   'gender': ['M', 'F', 'M'],
                   'value': [1, 2, 3]})
df['gender'] = df['gender'].astype('category')

res = df.groupby(['value_text', 'gender']).count()\
        .fillna(0).astype(int)\
        .reset_index()
print(res)
value_text gender value
0 type2 F 0
1 type2 M 2
2 type2 without F 1
3 type2 without M 0
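One caveat: newer pandas versions warn about the default of the observed parameter for categorical groupers, so it may be safer to spell it out explicitly (a sketch against the same df):

res = (df.groupby(['value_text', 'gender'], observed=False)  # keep unobserved categories
         .count()
         .fillna(0)
         .astype(int)
         .reset_index())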
Try appending .unstack().fillna(0).stack() to your current line, like so:
frame[frame['value_text'].str.match('Type 2') |
frame['value_text'].str.match('Type II diabetes')]\
.groupby(['value_text','gender'])['value_text'].count()\
.unstack().fillna(0).stack()
Remember: whenever you want to force a specific list to index/shape your data, pivot, crosstab, stack, and unstack are not reliable, since they depend heavily on the input data. For example, if 'M' never appears in any input row, you will not see 'M' no matter how you pivot/unstack your result. This kind of problem is where reindex() shines.
Assume your pre-processed frame is saved as df:
mdx1 = pd.MultiIndex.from_product([df.index.levels[0], ['M', 'F']])
df.reindex(mdx1).fillna(0, downcast='infer')
On the other hand, if you just want all possible level-1 values to be shown in all level-0, do the following:
mdx1 = pd.MultiIndex.from_product(df.index.levels)
df.reindex(mdx1).fillna(0, downcast='infer')
This can be easily extended to dataframes with more than 2-level indexes.
Update: using the Categorical data type might fix the problems that pivot-like functions have.
The simplest way to do this is with pd.crosstab and then stack:
# save your filtered dataframe as an intermediate result, for convenience
type2 = frame[frame.value_text.str.match('Type 2|Type II diabetes')]
pd.crosstab(type2.value_text, type2.gender).stack()
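One hedge worth adding: if a gender never appears anywhere in the filtered data, crosstab cannot produce its column, so you can force the full set with reindex (echoing the reindex() answer above):

counts = pd.crosstab(type2.value_text, type2.gender)
counts = counts.reindex(columns=['M', 'F'], fill_value=0)  # force both genders to exist
counts.stack()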
I have a DataFrame with a multiindex in the columns and would like to use dictionaries to append new rows.
Let's say that each row in the DataFrame is a city. The columns contain "distance" and "vehicle", and each cell holds the percentage of the population that chooses that vehicle for that distance.
I'm constructing an index like this:
index_tuples = []
for distance in ["near", "far"]:
    for vehicle in ["bike", "car"]:
        index_tuples.append((distance, vehicle))

index = pd.MultiIndex.from_tuples(index_tuples, names=["distance", "vehicle"])
Then I'm creating a dataframe:
dataframe = pd.DataFrame(index=["city"], columns = index)
The structure of the dataframe looks good, although pandas has added NaNs as default values?
Now I would like to set up a dictionary for the new city and add it:
my_home_city = {"near":{"bike":1, "car":0},"far":{"bike":0, "car":1}}
dataframe["my_home_city"] = my_home_city
But this fails:
ValueError: Length of values does not match length of index
Here is the complete error message (pastebin)
UPDATE:
Thank you for all the good answers. I'm afraid I've oversimplified the problem in my example. Actually my index is nested with 3 levels (and it could become more).
So I've accepted the universal answer of converting my dictionary into a list of tuples. This might not be as clean as the other approaches but works for any multiindex setup.
A MultiIndex is a list of tuples; we just need to modify your dict, then we can directly assign the value:
d = {(x, y): my_home_city[x][y] for x in my_home_city for y in my_home_city[x]}
df.loc['my_home_city', :] = d
df
Out[994]:
distance near far
vehicle bike car bike car
city NaN NaN NaN NaN
my_home_city 1 0 0 1
More Info
d
Out[995]:
{('far', 'bike'): 0,
('far', 'car'): 1,
('near', 'bike'): 1,
('near', 'car'): 0}
df.columns.values
Out[996]: array([('near', 'bike'), ('near', 'car'), ('far', 'bike'), ('far', 'car')], dtype=object)
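Since the update mentions a 3-level index, the same flattening idea generalizes to arbitrary nesting; here is a sketch with a made-up recursive helper:

def flatten(nested, prefix=()):
    # Turn {'a': {'b': {'c': 1}}} into {('a', 'b', 'c'): 1}, at any depth.
    flat = {}
    for key, value in nested.items():
        if isinstance(value, dict):
            flat.update(flatten(value, prefix + (key,)))
        else:
            flat[prefix + (key,)] = value
    return flat

df.loc['my_home_city', :] = pd.Series(flatten(my_home_city))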
You can append to your dataframe like this:
my_home_city = {"near":{"bike":1, "car":0},"far":{"bike":0, "car":1}}
dataframe.append(pd.DataFrame.from_dict(my_home_city).unstack().rename('my_home_city'))
Output:
distance near far
vehicle bike car bike car
city NaN NaN NaN NaN
my_home_city 1 0 0 1
The trick is to create the dataframe row with from_dict, then unstack to get the structure of your original dataframe with MultiIndex columns, then rename to set the index label, and append.
Or if you don't want to create the empty dataframe first you can use this method to create the dataframe with the new data.
pd.DataFrame.from_dict(my_home_city).unstack().rename('my_home_city').to_frame().T
Output:
far near
bike car bike car
my_home_city 0 1 1 0
Explained:
pd.DataFrame.from_dict(my_home_city)
far near
bike 0 1
car 1 0
Now, let's unstack to create multiindex and get to that new dataframe into the structure of the original dataframe.
pd.DataFrame.from_dict(my_home_city).unstack()
far bike 0
car 1
near bike 1
car 0
dtype: int64
We use rename to give that series a name which becomes the index label of that dataframe row when appended to the original dataframe.
far bike 0
car 1
near bike 1
car 0
Name: my_home_city, dtype: int64
Now, if you converted that series to a frame and transposed it, it would look very much like a new row. However, there is no need to do this: pandas does intrinsic data alignment, so appending this series to the dataframe will auto-align it and add the new record.
dataframe.append(pd.DataFrame.from_dict(my_home_city).unstack().rename('my_home_city'))
distance near far
vehicle bike car bike car
city NaN NaN NaN NaN
my_home_city 1 0 0 1
I don't think you even need to initialise an empty dataframe. With your d, I can get your desired output with unstack and a transpose:
pd.DataFrame(d).unstack().to_frame().T
far near
bike car bike car
0 0 1 1 0
Initialize your empty dataframe using MultiIndex.from_product.
distances = ['near', 'far']
vehicles = ['bike', 'car']
df = pd.DataFrame([], columns=pd.MultiIndex.from_product([distances, vehicles]),
                  index=pd.Index([], name='city'))
Your dictionary results in a square matrix (distance by vehicle), so unstack it (which will result in a Series), then convert it into a dataframe row by calling to_frame with the relevant city name and transposing.
>>> df.append(pd.DataFrame(my_home_city).unstack().to_frame('my_home_city').T)
far near
bike car bike car
city
my_home_city 0 1 1 0
Just to add to all of the answers, this is just another (maybe not too different) simple example, presented in a more reproducible way:
import itertools as it
from IPython.display import display  # just for displaying output
import numpy as np
import pandas as pd

col_1, col_2 = ['A', 'B'], ['C', 'D']
arr_size = len(col_2)
col = pd.MultiIndex.from_product([col_1, col_2])
tmp_df = pd.DataFrame(columns=col)
display(tmp_df)

for s in range(3):  # number of rows to add to tmp_df
    tmp_dict = {x: [np.random.random_sample(1)[0] for i in range(arr_size)]
                for x in range(arr_size)}
    tmp_ser = pd.Series(it.chain.from_iterable([tmp_dict[x] for x in tmp_dict]),
                        index=col)
    # display(tmp_dict, tmp_ser)
    tmp_df = tmp_df.append(tmp_ser[tmp_df.columns], ignore_index=True)

display(tmp_df)
Some things to note about the above:
The number of items to add should always match len(col_1)*len(col_2), that is, the product of the lengths of the lists your MultiIndex is made from.
list(it.chain.from_iterable([[2, 3], [4, 5]])) simply gives [2, 3, 4, 5].
Try this workaround: append to a dict, then convert it to a pandas DataFrame, and at the very last step select the desired columns to create a MultiIndex with set_index().
# 'predictor_types' maps a group name to a list of column names in the 'ames'
# dataframe (both come from my own dataset).
d = dict()
for g in predictor_types:
    for col in predictor_types[g]:
        tot = len(ames) - ames[col].count()  # number of missing values in this column
        if tot:
            d.setdefault('type', []).append(g)
            d.setdefault('predictor', []).append(col)
            d.setdefault('missing', []).append(tot)

pd.DataFrame(d).set_index(['type', 'predictor']).style.bar(color='DodgerBlue')