How to reshape a dataframe when rows share the same index? - python

If I have a dataframe like
df= pd.DataFrame(['a','b','c','d'],index=[0,0,1,1])
0
0 a
0 b
1 c
1 d
How can I reshape the dataframe based on the index so that it looks like the one below, i.e.
df= pd.DataFrame([['a','b'],['c','d']],index=[0,1])
0 1
0 a b
1 c d

Let's use set_index, groupby, cumcount, and unstack:
df.set_index(df.groupby(level=0).cumcount(), append=True)[0].unstack()
Output:
0 1
0 a b
1 c d
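The one-liner above can be unpacked into separate steps. This is a minimal sketch using the dataframe from the question; the names pos and wide are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(['a', 'b', 'c', 'd'], index=[0, 0, 1, 1])

# Position of each row within its index group: 0, 1, 0, 1
pos = df.groupby(level=0).cumcount()

# Add the position as a second index level, then move that level into columns
wide = df.set_index(pos, append=True)[0].unstack()
print(wide)
```

cumcount supplies the column position that the duplicated index alone does not carry, which is why every answer in this thread relies on it in some form.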

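You can use pivot with cumcount. Recent pandas versions expect the DataFrame plus column labels rather than arrays, so the index and the counter are first added as columns (idx and col are just illustrative names):
a = df.groupby(level=0).cumcount()
df = df.assign(idx=df.index, col=a).pivot(index='idx', columns='col', values=0)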

A couple of ways:
1.
In [490]: df.groupby(df.index)[0].agg(lambda x: list(x)).apply(pd.Series)
Out[490]:
0 1
0 a b
1 c d
2.
In [447]: df.groupby(df.index).apply(lambda x: pd.Series(x.values.tolist()).str[0])
Out[447]:
0 1
0 a b
1 c d
3.
In [455]: df.assign(i=df.index, c=df.groupby(level=0).cumcount()).pivot(index='i', columns='c', values=0)
Out[455]:
c 0 1
i
0 a b
1 c d
To remove the axis names:
In [457]: (df.assign(i=df.index, c=df.groupby(level=0).cumcount()).pivot(index='i', columns='c', values=0)
   .....:    .rename_axis(index=None, columns=None))
Out[457]:
0 1
0 a b
1 c d

Related

Create an ordering dataframe depending on the ordering of items in a smaller dataframe

I have a dataframe that looks something like this:
i j
0 a b
1 a c
2 b c
I would like to convert it to another dataframe that looks like this:
a b c
0 1 -1 0
1 1 0 -1
2 0 1 -1
For each row of the first dataframe, the new dataframe should hold 1 in the column named by the item in the first column, -1 in the column named by the item in the second column, and 0 in every other column.
The second dataframe will have as many rows as the first and as many columns as the number of unique entries in the first dataframe. Thank you.
Couldn't really get a start on this.
example
data = {'i': {0: 'a', 1: 'a', 2: 'b'}, 'j': {0: 'b', 1: 'c', 2: 'c'}}
df = pd.DataFrame(data)
df
i j
0 a b
1 a c
2 b c
First, create dummy variables (dtype=int keeps them as 0/1 integers rather than booleans):
df1 = pd.get_dummies(df, dtype=int)
df1
i_a i_b j_b j_c
0 1 0 1 0
1 1 0 0 1
2 0 1 0 1
Second, turn the columns of df1 into a MultiIndex:
df1.columns = df1.columns.map(lambda x: tuple(x.split('_')))
df1
i j
a b b c
0 1 0 1 0
1 1 0 0 1
2 0 1 0 1
Third, negate the values under j:
df1.loc[:, 'j'] = df1.loc[:, 'j'].mul(-1).to_numpy()
df1
i j
a b b c
0 1 0 -1 0
1 1 0 0 -1
2 0 1 0 -1
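Finally, sum i and j over the second column level (sum(level=...) was removed in pandas 2.0, so group on the transposed frame instead):
df1.T.groupby(level=1).sum().T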
a b c
0 1 -1 0
1 1 0 -1
2 0 1 -1
The same approach works with any list of columns in place of i and j.
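Alternatively, build each output row with apply (original_df is the input dataframe):
columns = ['a', 'b', 'c']
def get_order(input_row):
    # start every item at 0, then mark the first item 1 and the second item -1
    output_row = dict.fromkeys(columns, 0)
    output_row[input_row['i']] = 1
    output_row[input_row['j']] = -1
    return pd.Series(output_row)
ordering_df = original_df.apply(get_order, axis=1)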
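As a cross-check on the answers above, here is a minimal vectorised sketch, assuming the two columns are named i and j as in the question. It swaps in a different technique: one indicator frame per column, both aligned to the same item columns, then subtracted. The names items, first, second and result are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'i': ['a', 'a', 'b'], 'j': ['b', 'c', 'c']})

# All items that appear in either column, in sorted order
items = sorted(set(df['i']) | set(df['j']))

# 0/1 indicators per column, aligned to the full item list
first = pd.get_dummies(df['i'], dtype=int).reindex(columns=items, fill_value=0)
second = pd.get_dummies(df['j'], dtype=int).reindex(columns=items, fill_value=0)

# +1 for the item in i, -1 for the item in j, 0 elsewhere
result = first - second
print(result)
```

Reindexing both frames to the same columns before subtracting avoids NaN values for items that occur in only one of the two columns.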

How to create dataframe based on matrix?

I have two dataframes, "df1" and "df2", and one matrix, "res":
df1 =  a      df2 =  a
       b             c
       c             e
       d
df1 has 4 records and df2 has 3 records, so res is a 4*3 matrix:
res =
                 df2(index)
                 0    1    2
             0   100  0    0
df1(index)   1   0    0    0
             2   0    100  0
             3   0    0    0
Based on this matrix, I want the following output as a dataframe:
df1 df2 score
a a 100
a c 0
a e 0
b a 0
b c 0
b e 0
c a 0
c c 100
c e 0
d a 0
d c 0
d e 0
Set the index and column labels of res from df1 and df2:
res.index = df1[:len(res.index)]
res.columns = df2[:len(res.columns)]
And then reshape by DataFrame.melt:
df = res.rename_axis(index='df1', columns='df2').melt(ignore_index=False)
Or DataFrame.stack:
df = res.rename_axis(index='df1', columns='df2').stack().reset_index(name='value')
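To make the stack variant concrete, here is a minimal sketch under two assumptions that are not stated in the question: res is a pandas DataFrame, and df1 and df2 are plain lists of labels. The value column is named score to match the desired output.

```python
import pandas as pd

# Assumed inputs: labels as plain lists, scores as a 4x3 DataFrame
df1 = ['a', 'b', 'c', 'd']
df2 = ['a', 'c', 'e']
res = pd.DataFrame([[100, 0, 0],
                    [0, 0, 0],
                    [0, 100, 0],
                    [0, 0, 0]])

# Label rows and columns, then name both axes so they become column names
res.index = df1
res.columns = df2
out = (res.rename_axis(index='df1', columns='df2')
          .stack()
          .reset_index(name='score'))
print(out)
```

stack walks the matrix row by row, so the result follows the row order requested in the question; melt would instead list the pairs column by column.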

Convert dataframe to pivot table with booleans(0, 1) with Pandas [duplicate]

How can one idiomatically run a function like get_dummies, which expects a single column and returns several, on multiple DataFrame columns?
With pandas 0.19, you can do that in a single line:
pd.get_dummies(data=df, columns=['A', 'B'])
The columns argument specifies which columns to one-hot encode.
>>> df
A B C
0 a c 1
1 b c 2
2 a b 3
>>> pd.get_dummies(data=df, columns=['A', 'B'])
C A_a A_b B_b B_c
0 1 1.0 0.0 0.0 1.0
1 2 0.0 1.0 0.0 1.0
2 3 1.0 0.0 1.0 0.0
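A self-contained version of the call above might look like the following sketch. Passing dtype=int is an addition here, chosen because the default dtype of the dummy columns has differed between pandas versions; the name encoded is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'], 'C': [1, 2, 3]})

# Encode only A and B; C is passed through unchanged.
# dtype=int requests 0/1 integers instead of the version-dependent default.
encoded = pd.get_dummies(df, columns=['A', 'B'], dtype=int)
print(encoded)
```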
Since pandas version 0.15.0, pd.get_dummies can handle a DataFrame directly (before that, it could only handle a single Series, and see below for the workaround):
In [1]: df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
   ...:                    'C': [1, 2, 3]})
In [2]: df
Out[2]:
A B C
0 a c 1
1 b c 2
2 a b 3
In [3]: pd.get_dummies(df)
Out[3]:
C A_a A_b B_b B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
Workaround for pandas < 0.15.0
You can do it for each column separately and then concat the results:
In [111]: df
Out[111]:
A B
0 a x
1 a y
2 b z
3 b x
4 c x
5 a y
6 b y
7 c z
In [112]: pd.concat([pd.get_dummies(df[col]) for col in df], axis=1, keys=df.columns)
Out[112]:
A B
a b c x y z
0 1 0 0 1 0 0
1 1 0 0 0 1 0
2 0 1 0 0 0 1
3 0 1 0 1 0 0
4 0 0 1 1 0 0
5 1 0 0 0 1 0
6 0 1 0 0 1 0
7 0 0 1 0 0 1
If you don't want the multi-index column, then remove the keys=.. from the concat function call.
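The effect of keys on the column index can be seen in a small sketch; the three-row dataframe and the names nested and flat are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b'], 'B': ['x', 'y', 'z']})

# One dummy frame per source column
parts = [pd.get_dummies(df[col], dtype=int) for col in df]

# With keys: two-level columns, the source column name on top
nested = pd.concat(parts, axis=1, keys=df.columns)

# Without keys: a flat column index holding only the category labels
flat = pd.concat(parts, axis=1)

print(nested)
print(flat)
```

The flat form is only unambiguous when no category label occurs in more than one source column.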
Somebody may have something more clever, but here are two approaches. Assuming you have a dataframe named df with columns 'Name' and 'Year' you want dummies for.
First, simply iterating over the columns isn't too bad:
In [93]: for column in ['Name', 'Year']:
    ...:     dummies = pd.get_dummies(df[column])
    ...:     df[dummies.columns] = dummies
Another idea would be to use the patsy package, which is designed to construct data matrices from R-type formulas.
In [94]: patsy.dmatrix(' ~ C(Name) + C(Year)', df, return_type="dataframe")
Unless I don't understand the question, it is supported natively in get_dummies by passing the columns argument.
The simple trick I am currently using is a for loop.
First, separate the categorical data from the DataFrame with select_dtypes(include="object"),
then apply get_dummies to each column in a loop,
as shown in the code below:
train_cate = train_data.select_dtypes(include="object")
test_cate = test_data.select_dtypes(include="object")
# vectorize categorical data
for col in train_cate:
    cate1 = pd.get_dummies(train_cate[col])
    train_cate[cate1.columns] = cate1
    cate2 = pd.get_dummies(test_cate[col])
    test_cate[cate2.columns] = cate2

map DataFrame index and forward fill nan values

I have a DataFrame whose integer index is missing some values (i.e. it is not equally spaced). I want to create a new DataFrame with an equally spaced index and forward-filled column values. Below is a simple example.
Have:
import pandas as pd
df = pd.DataFrame(['A', 'B', 'C'], index=[0, 2, 4])
0
0 A
2 B
4 C
I want to use the above to create:
0
0 A
1 A
2 B
3 B
4 C
Use reindex with method='ffill' (np is numpy, imported with import numpy as np):
df = df.reindex(np.arange(0, df.index.max()+1), method='ffill')
Or:
df = df.reindex(np.arange(df.index.min(), df.index.max() + 1), method='ffill')
print (df)
0
0 A
1 A
2 B
3 B
4 C
Using reindex and ffill:
df = df.reindex(range(df.index[0],df.index[-1]+1)).ffill()
print(df)
0
0 A
1 A
2 B
3 B
4 C
You can do this:
In [319]: df.reindex(list(range(df.index.min(),df.index.max()+1))).ffill()
Out[319]:
0
0 A
1 A
2 B
3 B
4 C
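The answers above differ mainly in where the fill happens. A minimal sketch of both variants on the question's data, with full_index, filled_a and filled_b as illustrative names:

```python
import pandas as pd

df = pd.DataFrame(['A', 'B', 'C'], index=[0, 2, 4])

# Every integer from the smallest to the largest existing label
full_index = range(df.index.min(), df.index.max() + 1)

# Variant 1: fill while reindexing (only the newly created rows are filled)
filled_a = df.reindex(full_index, method='ffill')

# Variant 2: reindex first, then forward fill every NaN in the result
filled_b = df.reindex(full_index).ffill()

print(filled_a)
```

Both give the same result here. They can differ when the original data already contains NaN: the second variant fills those values as well, while the first leaves them in place.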

Duplicating each row in a dataframe with counts

For each row in a dataframe, I wish to create duplicates of it with an additional column to identify each duplicate.
E.g. the original dataframe is
A | A
B | B
I wish to make duplicates of each row with an additional column to identify each copy, resulting in:
A | A | 1
A | A | 2
B | B | 1
B | B | 2
You can use df.reindex followed by a groupby on df.index.
df = df.reindex(df.index.repeat(2))
df['count'] = df.groupby(level=0).cumcount() + 1
df = df.reset_index(drop=True)
df
a b count
0 A A 1
1 A A 2
2 B B 1
3 B B 2
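Similarly, using reindex and assign with np.tile (the counter 0..1 is tiled once per original row, so it stays correct for any number of rows):
df = df.reindex(df.index.repeat(2))\
    .assign(count=np.tile(np.arange(2), len(df)) + 1)\
    .reset_index(drop=True)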
df
a b count
0 A A 1
1 A A 2
2 B B 1
3 B B 2
Use Index.repeat with loc, and groupby with cumcount for the counter:
df = pd.DataFrame({'a': ['A', 'B'], 'b': ['A', 'B']})
print (df)
a b
0 A A
1 B B
df = df.loc[df.index.repeat(2)]
df['new'] = df.groupby(level=0).cumcount() + 1
df = df.reset_index(drop=True)
print (df)
a b new
0 A A 1
1 A A 2
2 B B 1
3 B B 2
Or:
df = df.loc[df.index.repeat(2)]
df['new'] = np.tile(np.arange(2), len(df) // 2) + 1
df = df.reset_index(drop=True)
print (df)
a b new
0 A A 1
1 A A 2
2 B B 1
3 B B 2
Setup
Borrowed from @jezrael
df = pd.DataFrame({'a': ['A', 'B'], 'b': ['A', 'B']})
a b
0 A A
1 B B
Solution 1
Create a pd.MultiIndex with pd.MultiIndex.from_product
Then use pd.DataFrame.reindex
idx = pd.MultiIndex.from_product(
[df.index, [1, 2]],
names=[df.index.name, 'New']
)
df.reindex(idx, level=0).reset_index('New')
New a b
0 1 A A
0 2 A A
1 1 B B
1 2 B B
Solution 2
This uses the same loc and reindex concept used by @cᴏʟᴅsᴘᴇᴇᴅ and @jezrael, but simplifies the final answer by using list and int multiplication rather than np.tile.
df.loc[df.index.repeat(2)].assign(New=[1, 2] * len(df))
a b New
0 A A 1
0 A A 2
1 B B 1
1 B B 2
Use pd.concat() to repeat, and then groupby with cumcount() to count:
In [24]: df = pd.DataFrame({'col1': ['A', 'B'], 'col2': ['A', 'B']})
In [25]: df
Out[25]:
col1 col2
0 A A
1 B B
In [26]: df_repeat = pd.concat([df]*3).sort_index()
In [27]: df_repeat
Out[27]:
col1 col2
0 A A
0 A A
0 A A
1 B B
1 B B
1 B B
In [28]: df_repeat["count"] = df_repeat.groupby(level=0).cumcount() + 1
In [29]: df_repeat # df_repeat.reset_index(drop=True); if index reset required.
Out[29]:
col1 col2 count
0 A A 1
0 A A 2
0 A A 3
1 B B 1
1 B B 2
1 B B 3
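The approaches in this thread can be generalised to any number of copies. The helper below is a hypothetical sketch (repeat_rows and its parameters are not from any answer) that combines Index.repeat with np.tile and assumes the dataframe has a unique index.

```python
import numpy as np
import pandas as pd

def repeat_rows(df, k, name='count'):
    """Repeat every row k times and number the copies 1..k (assumes a unique index)."""
    out = df.loc[df.index.repeat(k)].copy()
    # 1..k repeated once per original row, matching the row order of `out`
    out[name] = np.tile(np.arange(1, k + 1), len(df))
    return out.reset_index(drop=True)

df = pd.DataFrame({'a': ['A', 'B', 'C'], 'b': ['A', 'B', 'C']})
print(repeat_rows(df, 2))
```

Tiling the counter once per original row, rather than once per copy, keeps the numbering correct when the number of rows differs from the number of copies.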
