Create multiple columns based on values in single column in Pandas DataFrame - python

I have a column in a dataframe (df['Values']) with 1000 rows of repetitive codes A30, A31, A32, A33, A34. I want to create five separate columns with headings colA30, colA31, colA32, colA33, colA34 in the same dataframe (df), with a value of 0 or 1 in each new column depending on which code appears in df['Values'].
For example: df
Values colA30 colA31 colA32 colA33 colA34
A32 0 0 1 0 0
A30 1 0 0 0 0
A31 0 1 0 0 0
A34 0 0 0 0 1
A33 0 0 0 1 0
So if a row in df['Values'] is A32, then colA32 should be 1 and all the other columns should be 0, and so on for the rest of the codes in df['Values'].
I did it in the following way. But is there any way to do it in one shot, as I have multiple columns with several codes for which indicator columns need to be created?
df['A30']=df['Values'].map(lambda x : 1 if x=='A30' else 0)
df['A31']=df['Values'].map(lambda x : 1 if x=='A31' else 0)
df['A32']=df['Values'].map(lambda x : 1 if x=='A32' else 0)
df['A33']=df['Values'].map(lambda x : 1 if x=='A33' else 0)
df['A34']=df['Values'].map(lambda x : 1 if x=='A34' else 0)

You can do this in several ways:
Pandas has a function, pd.get_dummies(), that converts a categorical column into binary indicator columns. Apply it to your categorical column and then concatenate the resulting dataframe with the original one. See the pandas documentation for pd.get_dummies() for details.
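For instance, a minimal sketch, assuming the column is named 'Values' as in the question:
import pandas as pd

df = pd.DataFrame({'Values': ['A32', 'A30', 'A31', 'A34', 'A33']})

# One indicator column per code, named colA30 ... colA34
dummies = pd.get_dummies(df['Values'], prefix='col', prefix_sep='').astype(int)
df = pd.concat([df, dummies], axis=1)
print(df)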
Another way would be to use scikit-learn and its OneHotEncoder. It does the same thing as above, but through a different interface: you create an OneHotEncoder instance and fit it to your categorical data.
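A minimal sketch of that route (the sparse_output parameter assumes scikit-learn 1.2+; older versions use sparse=False instead):
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

df = pd.DataFrame({'Values': ['A32', 'A30', 'A31', 'A34', 'A33']})

enc = OneHotEncoder(sparse_output=False, dtype=int)
onehot = enc.fit_transform(df[['Values']])

# enc.categories_ holds the codes seen during fit, e.g. A30 ... A34
df[['col' + c for c in enc.categories_[0]]] = onehot
print(df)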
For your case I'd use pd.get_dummies(); it's simpler to use.


Pandas save counts of multiple columns in single dataframe

I have a dataframe with 3 columns which looks like this:
Model IsJapanese IsGerman
BenzC 0 1
BensGla 0 1
HondaAccord 1 0
HondaOdyssey 1 0
ToyotaCamry 1 0
I want to create a new dataframe with TotalJapanese and TotalGerman as two columns in the same dataframe.
I am able to achieve this by creating 2 different dataframes, but I am wondering how to get both counts in a single dataframe.
Please suggest, thank you!
Edit: adding another, similar dataframe to this (sorry, not sure whether that's allowed, but trying).
Second dataset: I am trying to save multiple counts in a single dataframe, based on repetition of the data.
Here is my sample dataset
Store Address IsLA IsGA
Albertsons Cross St 1 0
Safeway LeoSt 0 1
Albertsons Main St 0 1
RiteAid Culver St 1 0
My aim is to prepare a new dataset with multiple counts per store.
The result should look like this:
Store TotalStores TotalLA TotalGA
Albertsons 2 1 1
Safeway 1 0 1
RiteAid 1 1 0
Is it possible to achieve this in a single dataframe?
Thanks!
One way would be to compute the sum of Japanese cars and German cars, and manually create a dataframe from them:
j, g = sum(df['IsJapanese']), sum(df['IsGerman'])
total_df = pd.DataFrame({'TotalJapanese': j,
                         'TotalGerman': g}, index=['Totals'])
print(total_df)
TotalJapanese TotalGerman
Totals 3 2
Another way would be to transpose (T) your dataframe, sum(axis=1), and transpose back:
total_df_v2 = pd.DataFrame(df.set_index('Model').T.sum(axis=1)).T
print(total_df_v2)
   IsJapanese  IsGerman
0           3         2
To answer your second question, you can use DataFrameGroupBy.agg on your 'Store' column, with count on Address and sum on the other two columns. Then you can rename() the columns as needed:
resulting_df = df.groupby('Store').agg({'Address': 'count',
                                        'IsLA': 'sum',
                                        'IsGA': 'sum'}) \
                 .rename({'Address': 'TotalStores',
                          'IsLA': 'TotalLA',
                          'IsGA': 'TotalGA'}, axis=1)
Prints:
            TotalStores  TotalLA  TotalGA
Store
Albertsons            2        1        1
RiteAid               1        1        0
Safeway               1        0        1
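As a side note, on pandas 0.25+ named aggregation can do the count and the renaming in one step; a sketch assuming the same df:
resulting_df = df.groupby('Store').agg(
    TotalStores=('Address', 'count'),
    TotalLA=('IsLA', 'sum'),
    TotalGA=('IsGA', 'sum'),
)
print(resulting_df)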

In a DataFrame, how could we get a list of indexes with 0's in specific columns?

We have a large dataset that needs to be modified based on specific criteria.
Here is a sample of the data:
Input
BL.DB BL.KB MI.RO MI.RA MI.XZ MAY.BE
0 0 1 1 1 0 1
1 0 0 1 0 0 1
SampleData1 = pd.DataFrame([[0,1,1,1,0],[0,0,1,0,0]],
                           columns=['BL.DB', 'BL.KB', 'MI.RO', 'MI.RA', 'MI.XZ'])
The fields of this data are all formatted 'family.member', and a family may have any number of members. We need to remove all rows of the dataframe which have all 0's for any family.
Simply put, we want to only keep rows of the data that contain at least one member of every family.
We have no reproducible code for this problem because we are unsure of where to start.
We thought about using iterrows(), but the documentation says:
#You should **never modify** something you are iterating over.
#This is not guaranteed to work in all cases. Depending on the
#data types, the iterator returns a copy and not a view, and writing
#to it will have no effect.
Other questions on S.O. do not quite solve our problem.
Here is what we want the SampleData to look like after we run it:
Expected output
BL.DB BL.KB MI.RO MI.RA MI.XZ MAY.BE
0 0 1 1 1 0 1
SampleData1 = pd.DataFrame([[0,1,1,1,0]],
                           columns=['BL.DB', 'BL.KB', 'MI.RO', 'MI.RA', 'MI.XZ'])
Also, could you please explain why we should not modify data we iterate over, given that we do that all the time with plain for loops, and what the correct way to modify a DataFrame is?
Thanks for the help in advance!
Start by copying df and reformatting its columns into a MultiIndex:
df2 = df.copy()
df2.columns = df.columns.str.split(r'\.', expand=True)
The result is:
BL MI
DB KB RO RA XZ
0 0 1 1 1 0
1 0 0 1 0 0
To generate "family totals", i.e. sums of elements in rows over the top
(0) level of column index, run:
df2.groupby(level=[0], axis=1).sum()
The result is:
BL MI
0 1 2
1 0 1
But actually we want to count zeroes in each row of the above table, so extend the above code to:
(df2.groupby(level=[0], axis=1).sum() == 0).astype(int).sum(axis=1)
The result is:
0 0
1 1
dtype: int64
meaning:
row with index 0 has no "family zeroes",
row with index 1 has one such zero (for one family).
And to print what we are looking for, run:
df[(df2.groupby(level=[0], axis=1).sum() == 0)
   .astype(int).sum(axis=1) == 0]
i.e. print rows from df with indices for which the count of "family zeroes" in df2 is zero.
It's possible to group along axis=1. For each row, check that all families (grouped on the part of the column name before '.') have at least one 1, then slice with this Boolean Series to retain those rows.
m = df.groupby(df.columns.str.split('.').str[0], axis=1).any().all(axis=1)
df[m]
# BL.DB BL.KB MI.RO MI.RA MI.XZ MAY.BE
#0 0 1 1 1 0 1
As an illustration, here's what grouping along axis=1 looks like; it partitions the DataFrame by columns.
for idx, gp in df.groupby(df.columns.str.split('.').str[0], axis=1):
    print(idx, gp, '\n')
#BL BL.DB BL.KB
#0 0 1
#1 0 0
#MAY MAY.BE
#0 1
#1 1
#MI MI.RO MI.RA MI.XZ
#0 1 1 0
#1 1 0 0
Now it's rather straightforward to find the rows where all of these groups have at least one non-zero column, using any and all along axis=1 as above.
You basically want to group on families and retain rows where there is one or more member for all families in the row.
One way to do this is to transpose the original dataframe and then split the index on the period, taking the first element, which is the family identifier. The columns are the index values of the original dataframe.
We can then group on the families (level=0) and sum the number of members present in each family for every record (df2.groupby(level=0).sum()). Now we retain the index values where every family has at least one member (.gt(0).all()). We use these values as a mask and apply it as a boolean index on the original dataframe to get the relevant rows.
df2 = SampleData1.T
df2.index = [idx.split('.')[0] for idx in df2.index]
# >>> df2
# 0 1
# BL 0 0
# BL 1 0
# MI 1 1
# MI 1 0
# MI 0 0
# >>> df2.groupby(level=0).sum()
# 0 1
# BL 1 0
# MI 2 1
mask = df2.groupby(level=0).sum().gt(0).all()
SampleData1[mask]
#    BL.DB  BL.KB  MI.RO  MI.RA  MI.XZ
# 0      0      1      1      1      0

What's the best way to transform Array values in one column to columns of the original DataFrame?

I have a table where one of the columns is an array of binary features; a feature name appears in the array when that feature is present.
I'd like to train a logistic model on these rows, but I can't get the data into the required format, where each feature value is its own column with a 1 or 0 value.
Example:
id feature values
1 ['HasPaws', 'DoesBark', 'CanFetch']
2 ['HasPaws', 'CanClimb', 'DoesMeow']
I'd like to get it to the format of
id HasPaws DoesBark CanFetch CanClimb DoesMeow
1 1 1 1 0 0
2 1 0 0 1 0
It seems like there would be some functionality built in to accomplish this, but I can't think of what this transformation is called in order to do a better search on my own.
You can first convert the lists to columns and then use the get_dummies() method:
In [12]: df
Out[12]:
id feature_values
0 1 [HasPaws, DoesBark, CanFetch]
1 2 [HasPaws, CanClimb, DoesMeow]
In [13]: (pd.get_dummies(df.set_index('id').feature_values.apply(pd.Series),
...: prefix='', prefix_sep='')
...: .reset_index()
...: )
Out[13]:
id HasPaws CanClimb DoesBark CanFetch DoesMeow
0 1 1 0 1 1 0
1 2 1 1 0 0 1
Another option is to loop through the feature values column and construct a Series from each cell, with the values in the list as the index. Pandas will then expand these Series into a data frame with the index values as headers:
pd.concat([df['id'],
           df['feature values'].apply(lambda lst: pd.Series([1]*len(lst), index=lst))
              .fillna(0)],
          axis=1)
method 1
pd.concat([df['id'], df['feature values'].apply(pd.value_counts)], axis=1).fillna(0)
method 2
df.set_index('id').squeeze().apply(pd.value_counts).reset_index().fillna(0)
method 3
pd.concat([pd.Series(1, f, name=i) for _, (i, f) in df.iterrows()],
          axis=1).T.fillna(0).rename_axis('id').reset_index()
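As a side note, on pandas 0.25+ the same reshaping can be done with explode() and pd.crosstab(); a sketch assuming the column names from the question:
import pandas as pd

df = pd.DataFrame({'id': [1, 2],
                   'feature values': [['HasPaws', 'DoesBark', 'CanFetch'],
                                      ['HasPaws', 'CanClimb', 'DoesMeow']]})

# One row per (id, feature) pair, then cross-tabulate into 0/1 indicators
long_df = df.explode('feature values')
print(pd.crosstab(long_df['id'], long_df['feature values']))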

How to encode two Pandas dataframes according to the same dummy vectors?

I'm trying to encode categorical values to dummy vectors.
pandas.get_dummies does a perfect job, but the dummy vectors depend on the values present in the DataFrame. How can I encode a second DataFrame according to the same dummy vectors as the first one?
import pandas as pd
df=pd.DataFrame({'cat1':['A','N','K','P'],'cat2':['C','S','T','B']})
b=pd.get_dummies(df['cat1'],prefix='cat1').astype('int')
print(b)
cat1_A cat1_K cat1_N cat1_P
0 1 0 0 0
1 0 0 1 0
2 0 1 0 0
3 0 0 0 1
df_test=df=pd.DataFrame({'cat1':['A','N',],'cat2':['T','B']})
c=pd.get_dummies(df['cat1'],prefix='cat1').astype('int')
print(c)
cat1_A cat1_N
0 1 0
1 0 1
How can I get this output?
cat1_A cat1_K cat1_N cat1_P
0 1 0 0 0
1 0 0 1 0
I was thinking of manually computing the unique values for each column and then creating a dictionary to map the second DataFrame, but I'm sure there is already a function for that...
Thanks!
I always use category_encoders because it has a great choice of encoders. It also works very nicely with pandas, is pip-installable, and follows the sklearn API, which means you can quickly test different types of encoders with the fit and transform methods or in a Pipeline.
If you wish to encode just the first column, like in your example, we can do so.
import pandas as pd
import category_encoders as ce
df = pd.DataFrame({'cat1':['A','N','K','P'], 'cat2':['C','S','T','B']})
enc_ohe = ce.one_hot.OneHotEncoder(cols=['cat1'])
# cols=None, all string columns encoded
df_trans = enc_ohe.fit_transform(df)
print(df_trans)
cat1_0 cat1_1 cat1_2 cat1_3 cat2
0 0 1 0 0 C
1 0 0 0 1 S
2 1 0 0 0 T
3 0 0 1 0 B
The default is for the column names to carry a numerical encoding instead of the original letters, which is helpful when you have long strings as categories. This can be changed by passing the use_cat_names=True kwarg, as mentioned by Arthur.
Now we can use the transform method to encode your second DataFrame.
df_test = pd.DataFrame({'cat1':['A','N',],'cat2':['T','B']})
df_test_trans = enc_ohe.transform(df_test)
print(df_test_trans)
cat1_1 cat1_3 cat2
0 1 0 T
1 0 1 B
As noted in the comment in the code above, not setting cols defaults to encoding all string columns.
I had the same problem before. This is what I did, which is not necessarily the best way to do it, but it works for me.
df=pd.DataFrame({'cat1':['A','N'],'cat2':['C','S']})
df['cat1'] = df['cat1'].astype('category', categories=['A','N','K','P'])
# then run the get_dummies
b=pd.get_dummies(df['cat1'],prefix='cat1').astype('int')
This uses astype with the 'categories' values passed in as a parameter.
To apply the same categories to all DataFrames, you'd better store the category values in variables like
cat1_categories = ['A','N','K','P']
cat2_categories = ['C','S','T','B']
Then use astype like
df_test=df=pd.DataFrame({'cat1':['A','N',],'cat2':['T','B']})
df['cat1'] = df['cat1'].astype('category', categories=cat1_categories)
c=pd.get_dummies(df['cat1'],prefix='cat1').astype('int')
print(c)
cat1_A cat1_N cat1_K cat1_P
0 1 0 0 0
1 0 1 0 0
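A note for newer pandas versions: since 0.25, astype() no longer accepts a categories keyword, so on current pandas the same idea is expressed with pd.CategoricalDtype. A sketch:
import pandas as pd

cat1_type = pd.CategoricalDtype(categories=['A', 'N', 'K', 'P'])

df_test = pd.DataFrame({'cat1': ['A', 'N'], 'cat2': ['T', 'B']})
df_test['cat1'] = df_test['cat1'].astype(cat1_type)

# get_dummies now emits one column per category, whether present in the data or not
c = pd.get_dummies(df_test['cat1'], prefix='cat1').astype('int')
print(c)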

How to process column names and create new columns

This is my pandas DataFrame with original column names.
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt
1 3 0 0
2 1 1 5
Firstly I want to extract all unique variations of cm, e.g. in this case cm1 and cm2.
After this I want to create a new column per each unique cm. In this example there should be 2 new columns.
Finally in each new column I should store the total count of non-zero original column values, i.e.
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt cm1 cm2
1 3 0 0 2 0
2 1 1 5 2 1
I implemented the first step as follows:
cols = pd.DataFrame(list(df.columns))
ind = [c for c in df.columns if 'cm' in c]
df.loc[:, ind].columns
How should I proceed with steps 2 and 3, so that the solution is automatic? (I don't want to manually define the column names cm1 and cm2, because the original data set might have many cm variations.)
You can use:
print(df)
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt
0 1 3 0 0
1 2 1 1 5
First you can filter the columns containing the string cm, so columns without cm are removed.
df1 = df.filter(regex='cm')
Now you can rename the columns to the new values like cm1, cm2, etc.
print([cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm'])
['cm1', 'cm1', 'cm2']
df1.columns = [cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm']
print(df1)
cm1 cm1 cm2
0 1 3 0
1 2 1 1
Now you can count the non-zero values: convert df1 to a boolean DataFrame and sum (True is counted as 1 and False as 0). You need counts per unique column name, so group by the columns and sum the values.
df1 = df1.astype(bool)
print(df1)
cm1 cm1 cm2
0 True True False
1 True True True
print(df1.groupby(df1.columns, axis=1).sum())
cm1 cm2
0 2 0
1 2 1
You need the unique columns, which are added to the original df:
print(df1.columns.unique())
['cm1' 'cm2']
Last, you can add the new columns df[['cm1','cm2']] from the groupby result:
df[df1.columns.unique()] = df1.groupby(df1.columns, axis=1).sum()
print(df)
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt cm1 cm2
0 1 3 0 0 2 0
1 2 1 1 5 2 1
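Putting the steps together, a compact sketch of the same logic (transposing instead of using groupby(axis=1), which is deprecated in recent pandas):
# Keep the cm columns and flag non-zero cells as True
df1 = df.filter(regex='cm').astype(bool)

# Rename each column to its cmN token (e.g. old_dt_cm1_tt -> cm1)
df1.columns = [next(p for p in c.split('_') if p.startswith('cm'))
               for c in df1.columns]

# Count True values per cmN group for every row, then attach to df
df = df.join(df1.T.groupby(level=0).sum().T)
print(df)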
Once you know which columns have cm in them, you can map them (with a dict) to the desired new columns with an adapted version of this answer:
col_map = {c: 'cm' + c[c.index('cm') + len('cm')] for c in ind}
#                                      ^ if you are hard-coding this you might as well use 2
This takes 'cm' plus the single character directly following it, so the mapping in this case would be:
{'old_dm_cm1': 'cm1', 'old_dt_cm1_tt': 'cm1', 'old_rr_cm2_epf': 'cm2'}
Then add the new columns to the DataFrame by iterating over the dict:
for col, new_col in col_map.items():
    if new_col not in df:
        df[new_col] = [int(a != 0) for a in df[col]]
    else:
        df[new_col] += [int(a != 0) for a in df[col]]
Note that int(a != 0) simply gives 0 if the value is 0 and 1 otherwise. The only issue is that dicts are inherently unordered (before Python 3.7), so it may be preferable to add the new columns in order according to the values (like the answer here):
import operator

for col, new_col in sorted(col_map.items(), key=operator.itemgetter(1)):
    if new_col in df:
        df[new_col] += [int(a != 0) for a in df[col]]
    else:
        df[new_col] = [int(a != 0) for a in df[col]]
to ensure the new columns are inserted in order.
