I have a dataframe where I want to convert each row into a diagonal dataframe and bind all the resulting dataframes into 1 large dataframe.
Input:
            a  b  c
2021-11-06  1  2  3
2021-11-07  4  5  6
Desired output:
              a  b  c
Date
2021-11-06 a  1  0  0
           b  0  2  0
           c  0  0  3
2021-11-07 a  4  0  0
           b  0  5  0
           c  0  0  6
I tried using apply on each row of the original dataframe.
import numpy as np
import pandas as pd

data = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'],
                    index=pd.date_range('2021-11-06', '2021-11-07'))

def convert_dataframe(ser):
    df_ser = pd.DataFrame(0.0, index=ser.index, columns=ser.index)
    np.fill_diagonal(df_ser.values, ser)
    return df_ser

data.apply(lambda x: convert_dataframe(x), axis=1)
However, the output is not the MultiIndex dataframe that I expected.
Instead, apply returns a single-index dataframe where each row holds a reference to the diagonal dataframe built for it.
Any help is much appreciated. Thanks in advance.
Use MultiIndex.droplevel to remove the first level of the MultiIndex, and call the function in GroupBy.apply after DataFrame.stack:
def convert_dataframe(ser):
    ser = ser.droplevel(0)
    df_ser = pd.DataFrame(0, index=ser.index, columns=ser.index)
    np.fill_diagonal(df_ser.values, ser)
    return df_ser

data = data.stack().groupby(level=0).apply(convert_dataframe)
print(data)
              a  b  c
2021-11-06 a  1  0  0
           b  0  2  0
           c  0  0  3
2021-11-07 a  4  0  0
           b  0  5  0
           c  0  0  6
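For reference, the same frame can also be assembled without groupby.apply; this is just an alternative sketch using np.diag and pd.concat, with the dates as the outer index level:

import numpy as np
import pandas as pd

data = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'],
                    index=pd.date_range('2021-11-06', '2021-11-07'))

# Build one diagonal frame per row; the dict keys become the outer
# level of the resulting MultiIndex.
out = pd.concat(
    {idx: pd.DataFrame(np.diag(row), index=data.columns, columns=data.columns)
     for idx, row in data.iterrows()}
)
print(out)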
I have a dataframe where one of the columns has its items separated with commas. It looks like:
Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e
My goal is to create a matrix whose header holds all the unique values from column Data, i.e. [a, b, c, d, e], and whose rows flag whether each value appears in that row.
The matrix should look like this:
Data       a  b  c  d  e
a,b,c      1  1  1  0  0
a,c,d      1  0  1  1  0
d,e        0  0  0  1  1
a,e        1  0  0  0  1
a,b,c,d,e  1  1  1  1  1
To separate column Data, what I did is:
df['Data'].str.split(',', expand=True)
Then I don't know how to proceed to allocate the flags to each of the columns.
Maybe you can try this without pivot.
Create the dataframe.
import pandas as pd
import io
s = '''Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e'''
df = pd.read_csv(io.StringIO(s), sep=r"\s+")
We can use pandas.Series.str.split with the expand argument set to True, and apply value_counts to each row with axis=1.
Finally, fill the NaNs with zero via fillna and convert the values to integers with astype(int).
df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)
#
a b c d e
0 1 1 1 0 0
1 1 0 1 1 0
2 0 0 0 1 1
3 1 0 0 0 1
4 1 1 1 1 1
And then merge it with the original column.
new = df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)
pd.concat([df, new], axis = 1)
#
Data a b c d e
0 a,b,c 1 1 1 0 0
1 a,c,d 1 0 1 1 0
2 d,e 0 0 0 1 1
3 a,e 1 0 0 0 1
4 a,b,c,d,e 1 1 1 1 1
Use the Series.str.get_dummies() method to return the required matrix of 'a', 'b', ... 'e' columns.
df["Data"].str.get_dummies(sep=',')
If you split the strings into lists, then explode them, it makes pivot possible.
(df.assign(data_list=df.Data.str.split(','))
   .explode('data_list')
   .pivot_table(index='Data',
                columns='data_list',
                aggfunc=lambda x: 1,
                fill_value=0))
Output
data_list a b c d e
Data
a,b,c 1 1 1 0 0
a,b,c,d,e 1 1 1 1 1
a,c,d 1 0 1 1 0
a,e 1 0 0 0 1
d,e 0 0 0 1 1
You could apply a custom count function for each key (note that str.count counts substring occurrences, so this only yields clean 0/1 flags because each one-letter key appears at most once per row):
for k in ["a", "b", "c", "d", "e"]:
    df[k] = df.apply(lambda row: row["Data"].count(k), axis=1)
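A vectorized variant of the same idea, with the same caveat, swaps the row-wise apply for Series.str.count:

for k in ["a", "b", "c", "d", "e"]:
    # Count occurrences of k in each row in one vectorized pass.
    df[k] = df["Data"].str.count(k)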
Why can't I chain the get_dummies() function?
import pandas as pd
df = (pd
      .read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
      .drop(columns=['sepal_length'])
      .get_dummies()
)
This works fine:
df = (pd
      .read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
      .drop(columns=['sepal_length'])
)
df = pd.get_dummies(df)
DataFrame.pipe can be helpful for chaining methods or function calls that are not natively attached to the DataFrame, like pd.get_dummies:
df = df.drop(columns=['sepal_length']).pipe(pd.get_dummies)
Or with lambda:
df = (
    df.drop(columns=['sepal_length'])
      .pipe(lambda current_df: pd.get_dummies(current_df))
)
Sample DataFrame:
df = pd.DataFrame({'sepal_length': 1, 'a': list('ABACC'), 'b': list('ACCAB')})
df:
sepal_length a b
0 1 A A
1 1 B C
2 1 A C
3 1 C A
4 1 C B
Sample Output:
df = df.drop(columns=['sepal_length']).pipe(pd.get_dummies)
df:
a_A a_B a_C b_A b_B b_C
0 1 0 0 1 0 0
1 0 1 0 0 0 1
2 1 0 0 0 0 1
3 0 0 1 1 0 0
4 0 0 1 0 1 0
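Since pipe forwards extra positional and keyword arguments to the piped function, get_dummies options can ride along too. A small sketch, re-using the sample df from above and encoding only column 'a':

df = pd.DataFrame({'sepal_length': 1, 'a': list('ABACC'), 'b': list('ACCAB')})
# Only 'a' is one-hot encoded; 'b' passes through unchanged.
out = df.drop(columns=['sepal_length']).pipe(pd.get_dummies, columns=['a'])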
You can't chain the pd.get_dummies() method since it is not a pd.DataFrame method. However, assuming that:
you have a single column left after you drop your columns in the previous step in the chain, and
that column has a string dtype,
... you can use pd.Series.str.get_dummies(), which is a Series-level method.
### Dummy Dataframe
# A B
# 0 1 x
# 1 2 y
# 2 3 z
pd.read_csv(path).drop(columns=['A'])['B'].str.get_dummies()
x y z
0 1 0 0
1 0 1 0
2 0 0 1
NOTE: Make sure that before you call get_dummies(), the object is a Series. In this case, I select column ['B'] to do that, which admittedly makes the preceding pd.DataFrame.drop() unnecessary :)
But this is only for example's sake.
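Applied to the iris chain from the question, assuming species is the only string column left, a sketch might look like:

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
# One-hot the species labels, then rejoin them to the remaining columns.
dummies = iris['species'].str.get_dummies()
out = iris.drop(columns=['sepal_length', 'species']).join(dummies)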
How can one idiomatically run a function like get_dummies, which expects a single column and returns several, on multiple DataFrame columns?
With pandas 0.19, you can do that in a single line:
pd.get_dummies(data=df, columns=['A', 'B'])
The columns argument specifies which columns to one-hot encode.
>>> df
A B C
0 a c 1
1 b c 2
2 a b 3
>>> pd.get_dummies(data=df, columns=['A', 'B'])
C A_a A_b B_b B_c
0 1 1.0 0.0 0.0 1.0
1 2 0.0 1.0 0.0 1.0
2 3 1.0 0.0 1.0 0.0
Since pandas version 0.15.0, pd.get_dummies can handle a DataFrame directly (before that, it could only handle a single Series; see below for the workaround):
In [1]: df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
   ...:                    'C': [1, 2, 3]})
In [2]: df
Out[2]:
A B C
0 a c 1
1 b c 2
2 a b 3
In [3]: pd.get_dummies(df)
Out[3]:
C A_a A_b B_b B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
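One caveat for newer installs: since pandas 2.0, get_dummies returns boolean columns by default, so pass dtype=int if you want the 0/1 integers shown above:

pd.get_dummies(df, dtype=int)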
Workaround for pandas < 0.15.0
You can do it for each column separately and then concat the results:
In [111]: df
Out[111]:
A B
0 a x
1 a y
2 b z
3 b x
4 c x
5 a y
6 b y
7 c z
In [112]: pd.concat([pd.get_dummies(df[col]) for col in df], axis=1, keys=df.columns)
Out[112]:
   A        B
   a  b  c  x  y  z
0  1  0  0  1  0  0
1  1  0  0  0  1  0
2  0  1  0  0  0  1
3  0  1  0  1  0  0
4  0  0  1  1  0  0
5  1  0  0  0  1  0
6  0  1  0  0  1  0
7  0  0  1  0  0  1
If you don't want the multi-index column, then remove the keys=.. from the concat function call.
Somebody may have something more clever, but here are two approaches. Assuming you have a dataframe named df with columns 'Name' and 'Year' you want dummies for.
First, simply iterating over the columns isn't too bad:
In [93]: for column in ['Name', 'Year']:
    ...:     dummies = pd.get_dummies(df[column])
    ...:     df[dummies.columns] = dummies
Another idea would be to use the patsy package, which is designed to construct data matrices from R-type formulas.
In [94]: patsy.dmatrix(' ~ C(Name) + C(Year)', df, return_type="dataframe")
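A self-contained sketch of the patsy route (patsy is a separate install, and its default treatment coding drops one reference level per factor, so this is not quite a full set of dummies):

import pandas as pd
import patsy

# Hypothetical df with the Name and Year columns assumed above.
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice'], 'Year': [2020, 2021, 2020]})

# C(...) marks a column as categorical; dmatrix returns an Intercept column
# plus indicator columns for the non-reference levels.
dm = patsy.dmatrix('~ C(Name) + C(Year)', df, return_type="dataframe")
print(dm)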
Unless I don't understand the question, it is supported natively in get_dummies by passing the columns argument.
The simple trick I am currently using is a for-loop.
First, separate the categorical data from the DataFrame using select_dtypes(include="object");
then, with a for-loop, apply get_dummies to each column iteratively,
as shown in the code below:
train_cate = train_data.select_dtypes(include="object")
test_cate = test_data.select_dtypes(include="object")

# vectorize categorical data
for col in train_cate:
    cate1 = pd.get_dummies(train_cate[col])
    train_cate[cate1.columns] = cate1
    cate2 = pd.get_dummies(test_cate[col])
    test_cate[cate2.columns] = cate2
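Since pd.get_dummies encodes all object columns by default, the loop can collapse to one call per frame; a sketch (note that get_dummies prefixes the dummy columns with the source column name, unlike the loop above):

train_encoded = pd.get_dummies(train_data)
test_encoded = pd.get_dummies(test_data)

# Align test to the train columns; categories unseen in test become 0.
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)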
How do I pivot a dataframe into a square dataframe whose values are the number of intersections in the value column?
My input dataframe is:
field value
a 1
a 2
b 3
b 1
c 2
c 5
Output should be
   a  b  c
a  2  1  1
b  1  2  0
c  1  0  2
Each value in the output dataframe should be the number of values that the two fields share in the value column.
Use a self-merge on value (effectively a cross join per value) with crosstab:
df = df.merge(df, on='value')
df = pd.crosstab(df['field_x'], df['field_y'])
print(df)

field_y  a  b  c
field_x
a        2  1  1
b        1  2  0
c        1  0  2
Then remove the index and column names with rename_axis:
# pandas 0.24+
df = pd.crosstab(df['field_x'], df['field_y']).rename_axis(index=None, columns=None)
print(df)

   a  b  c
a  2  1  1
b  1  2  0
c  1  0  2

# pandas below 0.24
df = pd.crosstab(df['field_x'], df['field_y']).rename_axis(None).rename_axis(None, axis=1)
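End to end, with the sample input above, the whole pipeline looks like this sketch:

import pandas as pd

df = pd.DataFrame({'field': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'value': [1, 2, 3, 1, 2, 5]})

# The self-merge pairs every two fields that share a value;
# crosstab then counts how often each pair co-occurs.
pairs = df.merge(df, on='value')
out = pd.crosstab(pairs['field_x'], pairs['field_y']).rename_axis(index=None, columns=None)
print(out)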
I have a column, 'col2', that contains a list of strings. The current code I have is too slow; there are about 2,000 unique strings (the letters in the example below) and 4,000 rows, so the result ends up as 2,000 columns by 4,000 rows.
In [268]: df.head()
Out[268]:
col1 col2
0 6 A,B
1 15 C,G,A
2 25 B
Is there a fast way to get this into a get-dummies format, where each string has its own column and that column holds a 0 or 1 depending on whether the row contains that string in col2?
def get_list(df):
    d = []
    for row in df.col2:
        row_list = row.split(',')
        for string in row_list:
            if string not in d:
                d.append(string)
    return d

df_list = get_list(df)

def make_cols(df, lst):
    for string in lst:
        df[string] = 0
    return df

df = make_cols(df, df_list)

for idx in range(0, len(df['col2'])):
    row_list = df['col2'].iloc[idx].split(',')
    for string in row_list:
        df[string].iloc[idx] += 1
Out[113]:
col1 col2 A B C G
0 6 A,B 1 1 0 0
1 15 C,G,A 1 0 1 1
2 25 B 0 1 0 0
This is my current code for it, but it's too slow.
Thanks for any help!
You can use:
>>> df['col2'].str.get_dummies(sep=',')
A B C G
0 1 1 0 0
1 1 0 1 1
2 0 1 0 0
To join the DataFrames:
>>> pd.concat([df, df['col2'].str.get_dummies(sep=',')], axis=1)
col1 col2 A B C G
0 6 A,B 1 1 0 0
1 15 C,G,A 1 0 1 1
2 25 B 0 1 0 0
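If str.get_dummies is still slow at ~2,000 unique strings, scikit-learn's MultiLabelBinarizer is another option worth benchmarking; a sketch, assuming scikit-learn is available:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({'col1': [6, 15, 25], 'col2': ['A,B', 'C,G,A', 'B']})

# Binarize the split lists in one pass; classes_ gives the column order.
mlb = MultiLabelBinarizer()
flags = pd.DataFrame(mlb.fit_transform(df['col2'].str.split(',')),
                     columns=mlb.classes_, index=df.index)
out = pd.concat([df, flags], axis=1)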