is there a way to convert the multiindex columns to normal value columns? I have a multiindexed table like that:
level_0
level_1
Value
0
0
0
A
1
0
1
B
2
1
0
C
3
1
1
D
I want to convert level_0 and level_1 to normal columns:
ID
col0
col1
Value
0
0
0
A
1
0
1
B
2
1
0
C
3
1
1
D
Any suggestion?
Thank you!
You can use reset_index followed by rename.
# Setup
my_index = pd.MultiIndex.from_arrays([(0, 1, 2, 3),
(0, 0, 1, 1),
(0, 1, 0, 1)],
names=[None, 'level_0', 'level_1'])
df = pd.DataFrame({'Value': ['A', 'B', 'C', 'D']}, index=my_index)
>>> # level=['level_0', 'level_1'] works, too
>>> df = df.reset_index(level=[1, 2])
>>> df
level_0 level_1 Value
0 0 0 A
1 0 1 B
2 1 0 C
3 1 1 D
To rename the columns, you can do
>>> df.rename(columns={'level_0': 'col0', 'level_1': 'col1'})
col0 col1 Value
0 0 0 A
1 0 1 B
2 1 0 C
3 1 1 D
Related
Let's say I have a (pandas) dataframe like this:
Index A ID B C
1 a 1 0 0
2 b 2 0 0
3 c 2 a a
4 d 3 0 0
I want to copy the data of the third row to the second row, because their IDs are matching, but the data is not filled. However, I want to leave column 'A' intact. Looking for a result like this:
Index A ID B C
1 a 1 0 0
2 b 2 a a
3 c 2 a a
4 d 3 0 0
What would you suggest as solution?
You can try replacing '0' with NaN then ffill()+bfill() using groupby()+apply():
df[['B','C']]=df[['B','C']].replace('0',float('NaN'))
df[['B','C']]=df.groupby('ID')[['B','C']].apply(lambda x:x.ffill().bfill()).fillna('0')
output of df:
Index A ID B C
0 1 a 1 0 0
1 2 b 2 a a
2 3 c 2 a a
3 4 d 3 0 0
Note: you can also use transform() method in place of apply() method
You can use combine_first:
s = df.loc[df[["B","C"]].ne("0").all(1)].set_index("ID")[["B", "C"]]
print (s.combine_first(df.set_index("ID")).reset_index())
ID A B C Index
0 1 a 0 0 1.0
1 2 b a a 2.0
2 2 c a a 3.0
3 3 d 0 0 4.0
import pandas as pd
data = { 'A': ['a', 'b', 'c', 'd'], 'ID': [1, 2, 2, 3], 'B': [0, 0, 'a', 0], 'C': [0, 0, 'a', 0]}
df = pd.DataFrame(data)
df.index += 1
index_to_be_replaced = 2
index_to_use_to_replace = 3
columns_to_replace = ['ID', 'B', 'C']
columns_not_to_replace = ['A']
x = df[columns_not_to_replace].loc[index_to_be_replaced]
y = df[columns_to_replace].loc[index_to_use_to_replace]
df.loc[index_to_be_replaced] = pd.concat([x, y])
print(df)
Does it solve your problem? I would check on other pandas functions, as well. Like join, merge.
❯ python3 b.py
A ID B C
1 a 1 0 0
2 b 2 a a
3 c 2 a a
4 d 3 0 0
I want to ask an extension of this question, which talks about adding a label to missing classes to make sure the dummies are encoded as blanks correctly.
Is there a way to do this automatically across multiple sets of data and have the labels automatically synched between the two? (I.e. for Test & Training sets). I.e. the same columns but different classes of data represented in each?
E.g.:
Suppose I have the following two dataframes:
df1 = pd.DataFrame.from_items([('col1', list('abc')), ('col2', list('123'))])
df2 = pd.DataFrame.from_items([('col1', list('bcd')), ('col2', list('234'))])
df1
col1 col2
1 a 1
2 b 2
3 c 3
df2
col1 col2
1 b 2
2 c 3
3 d 4
I want to have:
df1
col1_a col1_b col1_c col1_d col2_1 col2_2 col2_3 col2_4
1 1 0 0 0 1 0 0 0
2 0 1 0 0 0 1 0 0
3 0 0 1 0 0 0 1 0
df2
col1_a col1_b col1_c col1_d col2_1 col2_2 col2_3 col2_4
1 0 1 0 0 0 1 0 0
2 0 0 1 0 0 0 1 0
3 0 0 0 1 0 0 0 1
WITHOUT having to specify in advance that
col1_labels = ['a', 'b', 'c', 'd'], col2_labels = ['1', '2', '3', '4']
And can I do this systematically for many columns all at once? I'm imagining a fuction that when passed in two or more dataframes (assuming columns are the same for all):
reads which columns in the pandas dataframe are categories
figures out what that overall labels are
and then provides the category labels to each column
Does that seem right? Is there a better way?
I think you need reindex by union of all columns if same categorical columns names in both Dataframes:
print (df1)
df1
1 a
2 b
3 c
print (df2)
df1
1 b
2 c
3 d
df1 = pd.get_dummies(df1)
df2 = pd.get_dummies(df2)
union = df1.columns | df2.columns
df1 = df1.reindex(columns=union, fill_value=0)
df2 = df2.reindex(columns=union, fill_value=0)
print (df1)
df1_a df1_b df1_c df1_d
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
print (df2)
df1_a df1_b df1_c df1_d
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
For each row in a dataframe, I wish to create duplicates of it with an additional column to identify each duplicate.
E.g Original dataframe is
A | A
B | B
I wish to make make duplicate of each row with an additional column to identify it. Resulting in:
A | A | 1
A | A | 2
B | B | 1
B | B | 2
You can use df.reindex followed by a groupby on df.index.
df = df.reindex(df.index.repeat(2))
df['count'] = df.groupby(level=0).cumcount() + 1
df = df.reset_index(drop=True)
df
a b count
0 A A 1
1 A A 2
2 B B 1
3 B B 2
Similarly, using reindex and assign with np.tile:
df = df.reindex(df.index.repeat(2))\
.assign(count=np.tile(df.index, 2) + 1)\
.reset_index(drop=True)
df
a b count
0 A A 1
1 A A 2
2 B B 1
3 B B 2
Use Index.repeat with loc, for count groupby with cumcount:
df = pd.DataFrame({'a': ['A', 'B'], 'b': ['A', 'B']})
print (df)
a b
0 A A
1 B B
df = df.loc[df.index.repeat(2)]
df['new'] = df.groupby(level=0).cumcount() + 1
df = df.reset_index(drop=True)
print (df)
a b new
0 A A 1
1 A A 2
2 B B 1
3 B B 2
Or:
df = df.loc[df.index.repeat(2)]
df['new'] = np.tile(range(int(len(df.index)/2)), 2) + 1
df = df.reset_index(drop=True)
print (df)
a b new
0 A A 1
1 A A 2
2 B B 1
3 B B 2
Setup
Borrowed from #jezrael
df = pd.DataFrame({'a': ['A', 'B'], 'b': ['A', 'B']})
a b
0 A A
1 B B
Solution 1
Create a pd.MultiIndex with pd.MultiIndex.from_product
Then use pd.DataFrame.reindex
idx = pd.MultiIndex.from_product(
[df.index, [1, 2]],
names=[df.index.name, 'New']
)
df.reindex(idx, level=0).reset_index('New')
New a b
0 1 A A
0 2 A A
1 1 B B
1 2 B B
Solution 2
This uses the same loc and reindex concept used by #cᴏʟᴅsᴘᴇᴇᴅ and #jezrael, but simplifies the final answer by using list and int multiplication rather than np.tile.
df.loc[df.index.repeat(2)].assign(New=[1, 2] * len(df))
a b New
0 A A 1
0 A A 2
1 B B 1
1 B B 2
Use pd.concat() to repeat, and then groupby with cumcount() to count:
In [24]: df = pd.DataFrame({'col1': ['A', 'B'], 'col2': ['A', 'B']})
In [25]: df
Out[25]:
col1 col2
0 A A
1 B B
In [26]: df_repeat = pd.concat([df]*3).sort_index()
In [27]: df_repeat
Out[27]:
col1 col2
0 A A
0 A A
0 A A
1 B B
1 B B
1 B B
In [28]: df_repeat["count"] = df_repeat.groupby(level=0).cumcount() + 1
In [29]: df_repeat # df_repeat.reset_index(drop=True); if index reset required.
Out[29]:
col1 col2 count
0 A A 1
0 A A 2
0 A A 3
1 B B 1
1 B B 2
1 B B 3
If I have a dataframe like
df= pd.DataFrame(['a','b','c','d'],index=[0,0,1,1])
0
0 a
0 b
1 c
1 d
How can I reshape the dataframe based on index like below i.e
df= pd.DataFrame([['a','b'],['c','d']],index=[0,1])
0 1
0 a b
1 c d
Let's use set_index, groupby, cumcount, and unstack:
df.set_index(df.groupby(level=0).cumcount(), append=True)[0].unstack()
Output:
0 1
0 a b
1 c d
You can use pivot with cumcount :
a = df.groupby(level=0).cumcount()
df = pd.pivot(index=df.index, columns=a, values=df[0])
Couple of ways
1.
In [490]: df.groupby(df.index)[0].agg(lambda x: list(x)).apply(pd.Series)
Out[490]:
0 1
0 a b
1 c d
2.
In [447]: df.groupby(df.index).apply(lambda x: pd.Series(x.values.tolist()).str[0])
Out[447]:
0 1
0 a b
1 c d
3.
In [455]: df.assign(i=df.index, c=df.groupby(level=0).cumcount()).pivot('i', 'c', 0)
Out[455]:
c 0 1
i
0 a b
1 c d
to remove names
In [457]: (df.assign(i=df.index, c=df.groupby(level=0).cumcount()).pivot('i', 'c', 0)
.rename_axis(None).rename_axis(None, 1))
Out[457]:
0 1
0 a b
1 c d
I have
df = pd.DataFrame.from_dict({'col1':['A','B', 'B', 'A']})
col1
0 A
1 B
2 B
3 A
other_dict = {'A':1, 'B':0}
I want to append a column to df, so that it looks like this:
col1 col2
0 A 1
1 B 0
2 B 0
3 A 1
You can also use map:
In [3]:
df['col2'] = df['col1'].map(other_dict)
df
Out[3]:
col1 col2
0 A 1
1 B 0
2 B 0
3 A 1
One option is to use an apply:
In [11]: df["col1"].apply(other_dict.get)
Out[11]:
0 1
1 0
2 0
3 1
Name: col1, dtype: int64
then assign it to the column:
df["col2"] = df["col1"].apply(other_dict.get)
Another which may be more efficient (if you have larger groups) is to use a transform:
In [21]: g = df.groupby("col1")
In [22]: g["col1"].transform(lambda x: other_dict[x.name])
Out[22]:
0 1
1 0
2 0
3 1
Name: col1, dtype: object
It's also worth linking to the categorical section of the docs.