Create unique MultiIndex from non-unique index in Python Pandas

I have a pandas DataFrame with a non-unique index:
import pandas as pd

index = [1, 1, 1, 1, 2, 2, 2, 3]
df = pd.DataFrame(data={'col1': [1, 3, 7, 6, 2, 4, 3, 4]}, index=index)
df
Out[12]:
col1
1 1
1 3
1 7
1 6
2 2
2 4
2 3
3 4
I'd like to turn this into a unique MultiIndex while preserving order, like this:
       col1
  Ind2
1 0       1
  1       3
  2       7
  3       6
2 0       2
  1       4
  2       3
3 0       4
I would imagine pandas has a function for something like this, but I haven't found anything.

You can do a groupby.cumcount on the index, and then append it as a new level to the index using set_index:
df = df.set_index(df.groupby(level=0).cumcount(), append=True)
The resulting output:
       col1
1 0       1
  1       3
  2       7
  3       6
2 0       2
  1       4
  2       3
3 0       4
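For completeness, a minimal runnable version of this answer, assuming you also want the second level named Ind2 as in the question's mock-up (the rename_axis call is optional):

import pandas as pd

index = [1, 1, 1, 1, 2, 2, 2, 3]
df = pd.DataFrame({'col1': [1, 3, 7, 6, 2, 4, 3, 4]}, index=index)

# Number the rows within each run of identical index values,
# then append that counter as a second index level
counter = df.groupby(level=0).cumcount()
df = df.set_index(counter, append=True).rename_axis([None, 'Ind2'])
print(df)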

Related

pop rows from dataframe based on conditions

From the dataframe
import pandas as pd
df1 = pd.DataFrame({'A':[1,1,1,1,2,2,2,2],'B':[1,2,3,4,5,6,7,8]})
print(df1)
A B
0 1 1
1 1 2
2 1 3
3 1 4
4 2 5
5 2 6
6 2 7
7 2 8
I want to pop 2 rows where 'A' == 2, preferably in a single statement like
df2 = df1.somepopfunction(...)
to generate the following result:
print(df1)
A B
0 1 1
1 1 2
2 1 3
3 1 4
4 2 7
5 2 8
print(df2)
A B
0 2 5
1 2 6
The pandas pop function sounds promising, but it only pops complete columns.
What statement can replace the pseudocode
df2 = df1.somepopfunction(...)
to generate the desired results?
A pop function for removing rows does not exist in pandas; you need to filter first and then remove the filtered rows from df1:
df2 = df1[df1.A.eq(2)].head(2)
print (df2)
A B
4 2 5
5 2 6
df1 = df1.drop(df2.index)
print (df1)
A B
0 1 1
1 1 2
2 1 3
3 1 4
6 2 7
7 2 8
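If you want the single-statement feel of the pseudocode, you can wrap the filter-then-drop pattern in a small helper. This pop_rows function is a hypothetical convenience wrapper, not a pandas API:

import pandas as pd

def pop_rows(df, mask, n=None):
    # Hypothetical helper: remove and return the rows where mask is True,
    # mutating df in place the way pop does for columns.
    popped = df[mask]
    if n is not None:
        popped = popped.head(n)   # pop only the first n matching rows
    df.drop(popped.index, inplace=True)
    return popped.reset_index(drop=True)

df1 = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 2],
                    'B': [1, 2, 3, 4, 5, 6, 7, 8]})
df2 = pop_rows(df1, df1.A.eq(2), n=2)   # df2 holds the two popped rows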

pandas get original dataframe after vertical concatenation

Let us take a sample dataframe
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape((5, 2)))
df
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
and concatenate the two columns into a single column
temp = pd.concat([df[0], df[1]]).to_frame()
temp
0
0 0
1 2
2 4
3 6
4 8
0 1
1 3
2 5
3 7
4 9
What would be the most efficient way to get the original dataframe, i.e. df, from temp?
The following way, using groupby, works. But is there a more efficient way (without groupby-apply or pivot) to do the whole round trip: concatenate, do some operation, then revert back to the original dataframe?
pd.DataFrame(temp.groupby(level=0)[0]
             .apply(list)
             .to_numpy().tolist())
I think we can do a pivot after assigning a counter column with cumcount (note values=0, the integer, since that is the column's label in temp):
check = temp.assign(c=temp.groupby(level=0).cumcount()).pivot(columns='c', values=0)
Out[66]:
c 0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
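Why the integer 0: the two concatenated Series carry different names (0 and 1), so the combined Series ends up unnamed and to_frame() falls back to the default column label 0. A quick check, as a sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape((5, 2)))
temp = pd.concat([df[0], df[1]]).to_frame()
print(temp.columns.tolist())   # [0] -- the integer 0, not the string '0'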
You can use groupby + cumcount to create a sequential counter per level=0 group, then append it to the index of the dataframe and use unstack to reshape:
temp.set_index(temp.groupby(level=0).cumcount(), append=True)[0].unstack()
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
You can try this:
In [1267]: temp['g'] = temp.groupby(level=0)[0].cumcount()
In [1273]: temp.pivot(columns='g', values=0)
Out[1279]:
g 0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
OR:
In [1281]: temp['g'] = (temp.index == 0).cumsum() - 1
In [1282]: temp.pivot(columns='g', values=0)
Out[1282]:
g 0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
df = pd.DataFrame(np.arange(10).reshape((5,2)))
temp = pd.concat([df[0], df[1]]).to_frame()
duplicated_index = temp.index.duplicated()
pd.concat([temp[~duplicated_index], temp[duplicated_index]], axis=1)
This works for this specific case (as pointed out in the comments, it will fail if there is more than one set of duplicate index values), so I don't think it's a better solution.
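Since the concatenation simply stacks the two columns end to end, one more option, sketched here under the assumption that the stacked blocks are equal length and still in their original order, is to bypass pandas reshaping and use a plain NumPy reshape:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape((5, 2)))
temp = pd.concat([df[0], df[1]]).to_frame()

# temp holds all of column 0's values followed by all of column 1's,
# so reshaping to (2, 5) and transposing restores the original layout
restored = pd.DataFrame(temp.to_numpy().reshape(2, -1).T)
print(restored.equals(df))   # True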

How to take minimum column value in pandas data frame if values in another column repeat?

If I have a pandas data frame like this:
Col A Col B Col C
1 4 3
1 4 5
2 3 7
2 4 6
1 6 6
1 6 4
When values in Column B repeat (are consecutive), I want to keep the row with the minimum value in Column C, so that I get a pandas data frame like this:
Col A Col B Col C
1 4 3
2 3 7
2 4 6
1 6 4
It's okay if values in Column B repeat; they just can't be consecutive.
IIUC sort_values + drop_duplicates
Yourdf=df.sort_values(['ColC']).drop_duplicates(['ColA','ColB']).sort_index()
ColA ColB ColC
0 1 4 3
2 2 3 7
3 2 4 6
5 1 6 4
All the other answers seem to overlook values in Column B repeat (are consecutive), so here's my approach:
# Label each consecutive run of equal Col B values with a block id
B_blocks = df['Col B'].ne(df['Col B'].shift()).cumsum()
# Within each run, keep the row holding the minimum Col C
min_idx = df.groupby(B_blocks)['Col C'].idxmin()
df.loc[min_idx]
Output:
Col A Col B Col C
0 1 4 3
2 2 3 7
3 2 4 6
5 1 6 4
You can also use DataFrame.sort_values + GroupBy.first:
g=df['Col_B'].ne(df['Col_B'].shift()).cumsum()
new_df=df.sort_values('Col_C').groupby(g).first().reset_index(drop=True)
print(new_df)
Col_A Col_B Col_C
0 1 4 3
1 2 3 7
2 2 4 6
3 1 6 4
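To make the consecutive-run trick concrete, here is a minimal runnable sketch on the question's data; the printed block labels show how the shift/ne/cumsum chain segments Col B:

import pandas as pd

df = pd.DataFrame({'Col A': [1, 1, 2, 2, 1, 1],
                   'Col B': [4, 4, 3, 4, 6, 6],
                   'Col C': [3, 5, 7, 6, 6, 4]})

# A new block starts whenever Col B differs from the previous row,
# so the cumulative sum labels each consecutive run
blocks = df['Col B'].ne(df['Col B'].shift()).cumsum()
print(blocks.tolist())   # [1, 1, 2, 3, 4, 4]

# Keep the row with the minimum Col C within each run
print(df.loc[df.groupby(blocks)['Col C'].idxmin()])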

Expand a list returned by a function to multiple columns (Pandas)

I have a function that I'm trying to call on each row of a dataframe. It returns 20 different numeric values, and I would like each of those to end up in a separate column of the original dataframe.
For example, this is not the actual function, but if this works, the actual one will too:
def doStuff(x):
    return [x] * 5
So this will just return the same number five times. So if I have the dataframe
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})
A
0 1
1 2
2 3
After calling
df = np.vectorize(doStuff)(df['A'])
It should end up looking like
A 1 2 3 4 5
0 1 1 1 1 1 1
1 2 2 2 2 2 2
2 3 3 3 3 3 3
I believe you need df.apply, twice.
In [1254]: df['A'].apply(np.vectorize(doStuff)).apply(pd.Series)
Out[1254]:
0 1 2 3 4
0 1 1 1 1 1
1 2 2 2 2 2
2 3 3 3 3 3
You may concatenate this with the original using pd.concat(..., axis=1):
In [1258]: pd.concat([df, df['A'].apply(np.vectorize(doStuff)).apply(pd.Series)], axis=1)
Out[1258]:
A 0 1 2 3 4
0 1 1 1 1 1 1
1 2 2 2 2 2 2
2 3 3 3 3 3 3
From pandas 0.23 you can use the result_type argument:
df = pd.DataFrame({'A': [1, 2, 3]})
def doStuff(x):
    return [x] * 5
# doStuff expects a scalar, so unpack the row's value before calling it
df.apply(lambda row: doStuff(row['A']), axis=1, result_type='expand')
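A sketch tying it together: the generated columns are relabeled 1 through 5 as in the desired output and attached to the original frame:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})

def doStuff(x):
    return [x] * 5

expanded = df.apply(lambda row: doStuff(row['A']), axis=1, result_type='expand')
expanded.columns = expanded.columns + 1      # relabel 0..4 as 1..5
print(pd.concat([df, expanded], axis=1))     # columns A, 1, 2, 3, 4, 5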

return rows with unique pairs across columns

I'm trying to find rows that have unique pairs of values across 2 columns, so this dataframe:
A B
1 0
2 0
3 0
0 1
2 1
3 1
0 2
1 2
3 2
0 3
1 3
2 3
will be reduced to only the rows that don't match another row when flipped. For instance, 1 and 3 is a combination I only want returned once, so if the same pair exists with the columns flipped (3 and 1), it can be removed. The table I'm looking to get is:
A B
0 2
0 3
1 0
1 2
1 3
2 3
Here only one occurrence of each pair remains, regardless of which column each value appears in.
I think you can use apply sorted + drop_duplicates:
df = df.apply(sorted, axis=1).drop_duplicates()
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
Faster solution with numpy.sort:
df = pd.DataFrame(np.sort(df.values, axis=1),
                  index=df.index, columns=df.columns).drop_duplicates()
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
Solution without sorting, using DataFrame.min and DataFrame.max:
a = df.min(axis=1)
b = df.max(axis=1)
df['A'] = a
df['B'] = b
df = df.drop_duplicates()
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
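One caveat on this variant, as a sketch: assigning back to A and B overwrites the original orientation of each pair, so if the original values matter later, work on a copy:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 0, 2, 3, 0, 1, 3, 0, 1, 2],
                   'B': [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]})

out = df.copy()                  # leave the original df untouched
out['A'] = df.min(axis=1)
out['B'] = df.max(axis=1)
print(out.drop_duplicates())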
Loading the data:
import numpy as np
import pandas as pd
# The values are space-separated, so split on whitespace rather than "\t"
a = np.array("1 2 3 0 2 3 0 1 3 0 1 2".split(), dtype=np.double)
b = np.array("0 0 0 1 1 1 2 2 2 3 3 3".split(), dtype=np.double)
df = pd.DataFrame(dict(A=a,B=b))
In case you don't need to sort the entire DF:
df["trans"] = df.apply(
lambda row: (min(row['A'], row['B']), max(row['A'], row['B'])), axis=1
)
df.drop_duplicates("trans")
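A vectorized sketch of the same keep-the-original-values idea, using np.sort only to build the deduplication key (which orientation of a pair survives depends on which occurrence comes first):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 0, 2, 3, 0, 1, 3, 0, 1, 2],
                   'B': [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]})

# Build an order-insensitive key per row without modifying the data
key = pd.Series(list(map(tuple, np.sort(df.values, axis=1))), index=df.index)
print(df[~key.duplicated()])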
