How can one idiomatically run a function like get_dummies, which expects a single column and returns several, on multiple DataFrame columns?
With pandas 0.19+, you can do that in a single line:
pd.get_dummies(data=df, columns=['A', 'B'])
The columns argument specifies which columns to one-hot encode.
>>> df
A B C
0 a c 1
1 b c 2
2 a b 3
>>> pd.get_dummies(data=df, columns=['A', 'B'])
C A_a A_b B_b B_c
0 1 1.0 0.0 0.0 1.0
1 2 0.0 1.0 0.0 1.0
2 3 1.0 0.0 1.0 0.0
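get_dummies also accepts prefix (and prefix_sep) to control the names of the generated columns; a small sketch with the same df, where the prefix values are just illustrative:
# one prefix per encoded column: A_a/A_b become colA_a/colA_b, etc.
pd.get_dummies(df, columns=['A', 'B'], prefix=['colA', 'colB'])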
Since pandas version 0.15.0, pd.get_dummies can handle a DataFrame directly (before that, it could only handle a single Series; see below for the workaround):
In [1]: df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
   ...:                    'C': [1, 2, 3]})
In [2]: df
Out[2]:
A B C
0 a c 1
1 b c 2
2 a b 3
In [3]: pd.get_dummies(df)
Out[3]:
C A_a A_b B_b B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
Workaround for pandas < 0.15.0
You can do it for each column separately and then concat the results:
In [111]: df
Out[111]:
A B
0 a x
1 a y
2 b z
3 b x
4 c x
5 a y
6 b y
7 c z
In [112]: pd.concat([pd.get_dummies(df[col]) for col in df], axis=1, keys=df.columns)
Out[112]:
A B
a b c x y z
0 1 0 0 1 0 0
1 1 0 0 0 1 0
2 0 1 0 0 0 1
3 0 1 0 1 0 0
4 0 0 1 1 0 0
5 1 0 0 0 1 0
6 0 1 0 0 1 0
7 0 0 1 0 0 1
If you don't want the MultiIndex columns, remove the keys=... argument from the concat call.
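For example, with the same df, dropping keys= gives a flat column index (a minimal sketch):
# without keys=, the per-column dummies are concatenated under a
# single-level column index: a b c x y z
pd.concat([pd.get_dummies(df[col]) for col in df], axis=1)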
Somebody may have something more clever, but here are two approaches. Assume you have a DataFrame named df with columns 'Name' and 'Year' that you want dummies for.
First, simply iterating over the columns isn't too bad:
In [93]: for column in ['Name', 'Year']:
    ...:     dummies = pd.get_dummies(df[column])
    ...:     df[dummies.columns] = dummies
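Note that this loop keeps the original 'Name' and 'Year' columns next to the new dummy columns; if only the dummies are wanted, drop the originals afterwards (a small follow-up sketch):
# the loop added the dummy columns in place, so the source columns
# can simply be dropped afterwards
df = df.drop(columns=['Name', 'Year'])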
Another idea would be to use the patsy package, which is designed to construct data matrices from R-type formulas.
In [94]: patsy.dmatrix(' ~ C(Name) + C(Year)', df, return_type="dataframe")
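As a side note (not part of the original answer): patsy uses treatment (reference) coding by default, so the result contains an Intercept column and drops one reference level per factor. A minimal runnable sketch with made-up data:
import pandas as pd
import patsy

df = pd.DataFrame({'Name': ['ann', 'bob', 'ann'], 'Year': [2010, 2011, 2010]})
# treatment coding: Intercept plus one column per non-reference level,
# with names like C(Name)[T.bob]
mat = patsy.dmatrix('~ C(Name) + C(Year)', df, return_type='dataframe')
print(mat.columns.tolist())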
Unless I don't understand the question, it is supported natively in get_dummies by passing the columns argument.
The simple trick I am currently using is a for-loop. First, separate the categorical data from the DataFrame using select_dtypes(include="object"); then, in a for loop, apply get_dummies to each column iteratively, as shown in the code below:
train_cate = train_data.select_dtypes(include="object").copy()
test_cate = test_data.select_dtypes(include="object").copy()
# vectorize categorical data (.copy() avoids SettingWithCopyWarning
# when assigning the new columns below)
for col in train_cate:
    cate1 = pd.get_dummies(train_cate[col])
    train_cate[cate1.columns] = cate1
    cate2 = pd.get_dummies(test_cate[col])
    test_cate[cate2.columns] = cate2
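As a side note, since pd.get_dummies accepts a whole DataFrame, the loop can usually be replaced with one call; note this returns only the dummy columns, without the original string columns (a sketch, assuming the same train_data/test_data):
# encodes every object column at once instead of looping
train_cate = pd.get_dummies(train_data.select_dtypes(include="object"))
test_cate = pd.get_dummies(test_data.select_dtypes(include="object"))
Beware that train and test may end up with different dummy columns if their categories differ.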
Related
I have a dataframe where I want to convert each row into a diagonal dataframe and bind all the resulting dataframes into 1 large dataframe.
Input:
a b c
2021-11-06 1 2 3
2021-11-07 4 5 6
Desired output:
a b c
Date
2021-11-06 a 1 0 0
b 0 2 0
c 0 0 3
2021-11-07 a 4 0 0
b 0 5 0
c 0 0 6
I tried using apply on each row of the original dataframe.
import numpy as np
import pandas as pd

data = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'],
                    index=pd.date_range('2021-11-06', '2021-11-07'))
def convert_dataframe(ser):
    df_ser = pd.DataFrame(0.0, index=ser.index, columns=ser.index)
    np.fill_diagonal(df_ser.values, ser)
    return df_ser
data.apply(lambda x: convert_dataframe(x), axis=1)
However, the output is not the multi-index dataframe that I expected.
The output is instead a single index dataframe where each row is a reference to the diagonal dataframe returned.
Any help is much appreciated. Thanks in advance.
Use MultiIndex.droplevel to remove the first level of the MultiIndex, and call the function in GroupBy.apply after DataFrame.stack:
def convert_dataframe(ser):
    ser = ser.droplevel(0)
    df_ser = pd.DataFrame(0, index=ser.index, columns=ser.index)
    np.fill_diagonal(df_ser.values, ser)
    return df_ser

data = data.stack().groupby(level=0).apply(convert_dataframe)
print(data)
a b c
2021-11-06 a 1 0 0
b 0 2 0
c 0 0 3
2021-11-07 a 4 0 0
b 0 5 0
c 0 0 6
Why can't I chain the get_dummies() function?
import pandas as pd
df = (pd
.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
.drop(columns=['sepal_length'])
.get_dummies()
)
This works fine:
df = (pd
.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
.drop(columns=['sepal_length'])
)
df = pd.get_dummies(df)
DataFrame.pipe can be helpful for chaining methods or function calls that are not natively attached to the DataFrame, like pd.get_dummies:
df = df.drop(columns=['sepal_length']).pipe(pd.get_dummies)
Or with lambda:
df = (
df.drop(columns=['sepal_length'])
.pipe(lambda current_df: pd.get_dummies(current_df))
)
Sample DataFrame:
df = pd.DataFrame({'sepal_length': 1, 'a': list('ABACC'), 'b': list('ACCAB')})
df:
sepal_length a b
0 1 A A
1 1 B C
2 1 A C
3 1 C A
4 1 C B
Sample Output:
df = df.drop(columns=['sepal_length']).pipe(pd.get_dummies)
df:
a_A a_B a_C b_A b_B b_C
0 1 0 0 1 0 0
1 0 1 0 0 0 1
2 1 0 0 0 0 1
3 0 0 1 1 0 0
4 0 0 1 0 1 0
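DataFrame.pipe also forwards extra positional and keyword arguments to the callable, so get_dummies options such as columns= can be supplied inline (same result as above for this sample):
# pipe(func, *args, **kwargs) passes the extra arguments through,
# so get_dummies options can be given directly in the chain
df = df.drop(columns=['sepal_length']).pipe(pd.get_dummies, columns=['a', 'b'])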
You can't chain pd.get_dummies() since it is not a pd.DataFrame method. However, assuming that:
You have a single column left after you drop your columns in the previous step of the chain.
Your column has a string dtype.
... you can use pd.Series.str.get_dummies(), which is a Series-level method.
### Dummy Dataframe
# A B
# 0 1 x
# 1 2 y
# 2 3 z
pd.read_csv(path).drop(columns=['A'])['B'].str.get_dummies()
x y z
0 1 0 0
1 0 1 0
2 0 0 1
NOTE: Make sure that before you call get_dummies(), the object is a Series. In this case, I fetch column ['B'] to do that, which kinda makes the previous pd.DataFrame.drop() call unnecessary :)
But this is only for example's sake.
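One extra point worth knowing: Series.str.get_dummies can also split multi-label strings on a separator before encoding, something pd.get_dummies cannot do directly (a small self-contained sketch):
import pandas as pd

s = pd.Series(['x|y', 'y', 'x|z'])
# each string is split on '|' first, then one indicator column is
# created per distinct label
print(s.str.get_dummies(sep='|'))
#    x  y  z
# 0  1  1  0
# 1  0  1  0
# 2  1  0  1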
Let's say I have a (pandas) dataframe like this:
Index A ID B C
1 a 1 0 0
2 b 2 0 0
3 c 2 a a
4 d 3 0 0
I want to copy the data of the third row to the second row, because their IDs are matching, but the data is not filled. However, I want to leave column 'A' intact. Looking for a result like this:
Index A ID B C
1 a 1 0 0
2 b 2 a a
3 c 2 a a
4 d 3 0 0
What would you suggest as solution?
You can try replacing '0' with NaN, then using ffill()+bfill() within groupby()+apply():
df[['B','C']]=df[['B','C']].replace('0',float('NaN'))
df[['B','C']]=df.groupby('ID')[['B','C']].apply(lambda x:x.ffill().bfill()).fillna('0')
output of df:
Index A ID B C
0 1 a 1 0 0
1 2 b 2 a a
2 3 c 2 a a
3 4 d 3 0 0
Note: you can also use the transform() method in place of apply().
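A hedged sketch of that variant, assuming the same replace step as above; transform returns a result aligned to the original index, so it assigns back without reshaping:
# transform keeps the original index, so no reset/reshape is needed
df[['B', 'C']] = (
    df.groupby('ID')[['B', 'C']]
      .transform(lambda x: x.ffill().bfill())
      .fillna('0')
)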
You can use combine_first:
s = df.loc[df[["B","C"]].ne("0").all(1)].set_index("ID")[["B", "C"]]
print (s.combine_first(df.set_index("ID")).reset_index())
ID A B C Index
0 1 a 0 0 1.0
1 2 b a a 2.0
2 2 c a a 3.0
3 3 d 0 0 4.0
import pandas as pd
data = { 'A': ['a', 'b', 'c', 'd'], 'ID': [1, 2, 2, 3], 'B': [0, 0, 'a', 0], 'C': [0, 0, 'a', 0]}
df = pd.DataFrame(data)
df.index += 1
index_to_be_replaced = 2
index_to_use_to_replace = 3
columns_to_replace = ['ID', 'B', 'C']
columns_not_to_replace = ['A']
x = df[columns_not_to_replace].loc[index_to_be_replaced]
y = df[columns_to_replace].loc[index_to_use_to_replace]
df.loc[index_to_be_replaced] = pd.concat([x, y])
print(df)
Does it solve your problem? I would also look into other pandas functions, such as join and merge.
❯ python3 b.py
A ID B C
1 a 1 0 0
2 b 2 a a
3 c 2 a a
4 d 3 0 0
I have a DataFrame with integer indexes that are missing some values (i.e. not equally spaced), and I want to create a new DataFrame with equally spaced index values and forward-filled column values. Below is a simple example:
have
import pandas as pd
df = pd.DataFrame(['A', 'B', 'C'], index=[0, 2, 4])
0
0 A
2 B
4 C
want to use above and create:
0
0 A
1 A
2 B
3 B
4 C
Use reindex with method='ffill':
import numpy as np

df = df.reindex(np.arange(0, df.index.max() + 1), method='ffill')
Or:
df = df.reindex(np.arange(df.index.min(), df.index.max() + 1), method='ffill')
print(df)
0
0 A
1 A
2 B
3 B
4 C
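One caveat (based on pandas' documented behaviour, not shown in the answers): reindex with a method= argument requires a monotonic index, so sort first if the index may be unsorted:
# method-based filling needs a sorted (monotonic) index
df = df.sort_index().reindex(np.arange(df.index.min(), df.index.max() + 1), method='ffill')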
Using reindex and ffill:
df = df.reindex(range(df.index[0],df.index[-1]+1)).ffill()
print(df)
0
0 A
1 A
2 B
3 B
4 C
You can do this:
In [319]: df.reindex(list(range(df.index.min(),df.index.max()+1))).ffill()
Out[319]:
0
0 A
1 A
2 B
3 B
4 C
My setup is as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame({'user_id':[1, 1, 1, 2, 3, 3], 'action':['b', 'b', 'c', 'a', 'c', 'd']})
df
action user_id
0 b 1
1 b 1
2 c 1
3 a 2
4 c 3
5 d 3
What is the best way to generate a dataframe from this where there is one row for each unique user_id, one column for each unique action and the column values are the count of each action per user_id?
I've tried
df.groupby(['user_id', 'action']).size().unstack('action')
action a b c d
user_id
1 NaN 2 1 NaN
2 1 NaN NaN NaN
3 NaN NaN 1 1
which comes close, but this seems to make user_id the index which is not what I want (I think). Maybe there's a better way involving pivot, pivot_table or even get_dummies?
You could use pd.crosstab:
In [37]: pd.crosstab(index=[df['user_id']], columns=[df['action']])
Out[37]:
action a b c d
user_id
1 0 2 1 0
2 1 0 0 0
3 0 0 1 1
Having user_id as the index seems appropriate to me, but if you'd like to drop the user_id, you could use reset_index:
In [39]: pd.crosstab(index=[df['user_id']], columns=[df['action']]).reset_index(drop=True)
Out[39]:
action a b c d
0 0 2 1 0
1 1 0 0 0
2 0 0 1 1
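For completeness, the original groupby attempt gives the same table once the missing counts are filled; unstack accepts a fill_value, and reset_index turns user_id back into an ordinary column (a sketch):
# fill_value=0 replaces the NaNs from the original attempt
out = (
    df.groupby(['user_id', 'action'])
      .size()
      .unstack('action', fill_value=0)
      .reset_index()
)
print(out)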