Remove a level from a MultiIndex - python

I need to remove a level (either by position or name) from a DataFrame's index and create a new DataFrame with the new index. The problem is that I end up having a non-unique index.
I had a look at Remove a level from a pandas MultiIndex, but the problem is that using unique(), as the answer there suggests, reduces the index to an array that doesn't retain the level names.
Other than using unique(), and then creating a new Index by stitching the label names onto the array, is there a more elegant solution?
import numpy as np
import pandas as pd

index = [np.array(['foo', 'foo', 'qux']), np.array(['a', 'b', 'a'])]
data = np.random.randn(3, 2)
columns = ["X", "Y"]
df = pd.DataFrame(data, index=index, columns=columns)
df.index.names = ["Level0", "Level1"]
print(df)
                       X         Y
Level0 Level1
foo    a       -0.591649  0.831599
       b        0.049961 -1.524291
qux    a       -0.100124 -1.059195
index2 = df.reset_index(level=1, drop=True).index
df2 = pd.DataFrame(index=index2)
idx = pd.IndexSlice
print(df2.loc[idx['foo'], :])
Empty DataFrame
Columns: []
Index: [foo, foo]

If I understand you correctly, you want the first-level index without duplicated values: an Index object, obtained without using unique and without explicitly rebuilding the index.
For your example data frame, you can combine get_level_values and drop_duplicates:
print(df.index.get_level_values(0).drop_duplicates())
Index(['foo', 'qux'], dtype='object', name='Level0')
Edit
For a more general solution either returning an Index or MultiIndex depending on the number of levels, you may use droplevel and drop_duplicates in conjunction:
print(df.index.droplevel(-1).drop_duplicates())
Index(['foo', 'qux'], dtype='object', name='Level0')
Here is the example from the linked SO post with 3 levels, which are reduced to a 2-level MultiIndex with unique values:
tuples = [(0, 100, 1000), (0, 100, 1001), (0, 100, 1002), (1, 101, 1001)]
index_3levels = pd.MultiIndex.from_tuples(tuples, names=["l1", "l2", "l3"])
print(index_3levels)
MultiIndex(levels=[[0, 1], [100, 101], [1000, 1001, 1002]],
           labels=[[0, 0, 0, 1], [0, 0, 0, 1], [0, 1, 2, 1]],
           names=['l1', 'l2', 'l3'])
index2level = index_3levels.droplevel(-1).drop_duplicates()
print(index2level)
MultiIndex(levels=[[0, 1], [100, 101]],
           labels=[[0, 1], [0, 1]],
           names=['l1', 'l2'])
# show unique values of new index
print(index2level.values)
[(0, 100) (1, 101)]
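Putting the pieces together, here is a minimal end-to-end sketch using the same example data as the question:

```python
import numpy as np
import pandas as pd

# Build the example MultiIndex DataFrame from the question
index = [np.array(['foo', 'foo', 'qux']), np.array(['a', 'b', 'a'])]
df = pd.DataFrame(np.random.randn(3, 2), index=index, columns=['X', 'Y'])
df.index.names = ['Level0', 'Level1']

# Drop the last level, then deduplicate; the level name is preserved
new_index = df.index.droplevel(-1).drop_duplicates()
print(new_index)  # Index(['foo', 'qux'], dtype='object', name='Level0')

# The deduplicated index can be used directly to build another DataFrame
df2 = pd.DataFrame(index=new_index)
print(df2.loc[['foo']])
```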

Related

pandas largest value per group with multi columns / why does it only work when flattening?

For a pandas dataframe of:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 1], 'anomaly_score': [5, 10, 8, 100],
    'match_level_0': [np.nan, 1, 1, 1],
    'match_level_1': [np.nan, np.nan, 1, 1], 'match_level_2': [np.nan, 1, 1, 1],
})
display(df)
df = df.groupby(['id', 'match_level_0']).agg(['mean', 'sum'])
I want to calculate the largest rows per group.
df.columns = ['__'.join(col).strip() for col in df.columns.values]
df.groupby(['id'])['anomaly_score__mean'].nlargest(2)
This works, but it requires flattening the MultiIndex columns first.
Instead I want to use directly:
df.groupby(['id'])[('anomaly_score', 'mean')].nlargest(2)
But this fails with a KeyError.
Interestingly, it works just fine when not grouping:
df[('anomaly_score', 'mean')].nlargest(2)
For me, selecting the Series first and then grouping by the first level of the MultiIndex works; it does seem like a bug that your groupby-then-select version does not:
print(df[('anomaly_score', 'mean')].groupby(level=0).nlargest(2))
id match_level_0
1 1.0 55
2 1.0 8
Name: (anomaly_score, mean), dtype: int64
print(df[('anomaly_score', 'mean')].groupby(level='id').nlargest(2))
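A self-contained sketch of the workaround (select the Series under the MultiIndex column first, then group by the index level), using the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 1],
    'anomaly_score': [5, 10, 8, 100],
    'match_level_0': [np.nan, 1, 1, 1],
})
agg = df.groupby(['id', 'match_level_0']).agg(['mean', 'sum'])

# Select the Series under the MultiIndex column first, then group by
# the 'id' level of the row index
top2 = agg[('anomaly_score', 'mean')].groupby(level='id').nlargest(2)
print(top2)
```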

Joining a pandas table with multi-index

I have two tables that I want to join: the main table has index SourceID; the sub-table is multi-indexed, as it comes from a pivot table, with indexes (SourceID, sourceid).
How can I join a table with a single index to one with multi-index (or change the multi-indexed table to singular)?
The sub-table is created as follows:
d = {'SourceID': [1, 1, 2, 2, 3, 3, 3],
     'Year': [0, 1, 0, 1, 1, 2, 3],
     'Sales': [100, 200, 300, 400, 500, 600, 700],
     'Profit': [10, 20, 30, 40, 50, 60, 70]}
df = pd.DataFrame(data=d)
df_sub = (
    df
    .pivot_table(
        index=['SourceID'],
        columns=['Year'],
        values=['Sales', 'Profit'],
        fill_value=0,
        aggfunc='mean'
    )
    # .add_prefix('sales_')
    .reset_index()
)
L = [(a, f'{a.lower()}{b}') for a, b in df_sub.columns]
df_sub.columns = pd.MultiIndex.from_tuples(L)
df_sub = df_sub.reset_index()
I'm then trying to join it with the main table df_main
df_all = df_sub.join(df_main.set_index('SourceID'), on='SourceID.sourceid')
but this fails due to the multi-index. The index in the sub-table could be single as long as I don't lose the multi-index on the other fields.
It is possible, but then MultiIndex values are converted to tuples:
df_all = df_sub.join(df.set_index('SourceID'), on=[('SourceID','sourceid')])
print (df_all)
If you want a MultiIndex in the output, it is necessary to convert the df columns to a MultiIndex too, e.g. with MultiIndex.from_product:
df1 = df.copy()
df1.columns = pd.MultiIndex.from_product([['orig'], df1.columns])
df_all = df_sub.join(df1.set_index([('orig','SourceID')]), on=[('SourceID','sourceid')])
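A runnable sketch of that second approach, lifting the plain table's columns into a one-level-deep MultiIndex before joining (the 'orig' label is just a placeholder name; the data is a shortened version of the question's):

```python
import pandas as pd

d = {'SourceID': [1, 1, 2], 'Year': [0, 1, 0],
     'Sales': [100, 200, 300], 'Profit': [10, 20, 30]}
df = pd.DataFrame(d)

df_sub = df.pivot_table(index=['SourceID'], columns=['Year'],
                        values=['Sales', 'Profit'], fill_value=0,
                        aggfunc='mean').reset_index()
df_sub.columns = pd.MultiIndex.from_tuples(
    [(a, f'{a.lower()}{b}') for a, b in df_sub.columns])

# Give df a matching MultiIndex on its columns, then join on the tuple column
df1 = df.copy()
df1.columns = pd.MultiIndex.from_product([['orig'], df1.columns])
df_all = df_sub.join(df1.set_index([('orig', 'SourceID')]),
                     on=[('SourceID', 'sourceid')])
print(df_all)
```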

Python Numpy Array Append with Blank Columns

I have 2 numpy arrays, and I am using the top row of each as column headers. The arrays share the same columns except for two: arr2 has a different C column as well as an additional column.
How can I combine all of these columns into a single np array?
arr1 = [ ['A', 'B', 'C1'], [1, 1, 0], [0, 1, 1] ]
arr2 = [ ['A', 'B', 'C2', 'C3'], [0, 1, 0, 1], [0, 0, 1, 0] ]
a1 = np.array(arr1)
a2 = np.array(arr2)
b = np.append(a1, a2, axis=0)
print(b)
# Desired Result
# A B C1 C2 C3
# 1 1 0 - -
# 0 1 1 - -
# 0 1 - 0 1
# 0 0 - 1 0
NumPy arrays aren't great for handling data with named columns, which might contain different types. Instead, I would use pandas for this. For example:
import pandas as pd
arr1 = [[1, 1, 0], [0, 1, 1] ]
arr2 = [[0, 1, 0, 1], [0, 0, 1, 0] ]
df1 = pd.DataFrame(arr1, columns=['A', 'B', 'C1'])
df2 = pd.DataFrame(arr2, columns=['A', 'B', 'C2', 'C3'])
df = pd.concat([df1, df2], sort=False)
df.to_csv('mydata.csv', index=False)
This results in a 'dataframe', a spreadsheet-like data structure that Jupyter Notebooks render as a formatted table.
You might notice there's an extra new column; this is the "index", which you can think of as row labels. You don't need it if you don't want it in your CSV, but if you carry on doing things in the dataframe, you might want to do df = df.reset_index() to relabel the rows in a more useful way.
If you want the dataframe back as a NumPy array, you can do df.values and away you go. It doesn't have the column names though.
Last thing: if you really want to stay in NumPy-land, then check out structured arrays, which give you another way to name the columns, essentially, in an array. Honestly, since pandas came along, I hardly ever see these in the wild.
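For completeness, here is a minimal structured-array sketch of the same data, pure NumPy with no pandas (using float fields so np.nan can mark the columns a row doesn't have):

```python
import numpy as np

# Named, typed columns via a structured dtype
dtype = [('A', 'f8'), ('B', 'f8'), ('C1', 'f8'), ('C2', 'f8'), ('C3', 'f8')]
rows = [
    (1, 1, 0, np.nan, np.nan),
    (0, 1, 1, np.nan, np.nan),
    (0, 1, np.nan, 0, 1),
    (0, 0, np.nan, 1, 0),
]
arr = np.array(rows, dtype=dtype)
print(arr['A'])         # column access by name
print(arr.dtype.names)  # ('A', 'B', 'C1', 'C2', 'C3')
```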

Pandas convert from Int64Index to RangeIndex

I concatenated three dataframes. How can I print df.index in RangeIndex, instead of Int64Index?
My Input:
df = pd.concat([df1, df2, df3])
print(df.index)
My Output:
Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,
                9,
            ...
            73809, 73810, 73811, 73812, 73813, 73814, 73815, 73816, 73817,
            73818],
           dtype='int64', length=495673)
Desired Output:
RangeIndex(start=X, stop=X, step=X)
You can use reset_index to get the desired index. For example:
df = pd.concat([df1,df2,df3])
df.index
Int64Index([0, 1, 2, 0, 1, 2, 0, 1, 2], dtype='int64')
After resetting indices:
df.reset_index(inplace=True)
df.index
RangeIndex(start=0, stop=9, step=1)
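Note that reset_index keeps the old index as a new column by default; pass drop=True to discard it. A minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'a': [3, 4]})
df = pd.concat([df1, df2])
print(df.index)  # non-unique integer index: [0, 1, 0, 1]

# drop=True: the old index is discarded instead of becoming a column
df = df.reset_index(drop=True)
print(df.index)  # RangeIndex(start=0, stop=4, step=1)
```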
It is also good practice to pass the axis keyword to concat explicitly.
You can use the built-in ignore_index option:
df = pd.concat([df1, df2, df3], ignore_index=True)
print(df.index)
From the docs:
ignore_index : boolean, default False
If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, …, n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.

Multilevel index won't go away

I have a dataframe, which consists of summary statistics of another dataframe:
df = sample[['Place','Lifeexp']]
df = df.groupby('Place').agg(['count','mean', 'max','min']).reset_index()
df = df.sort_values([('Lifeexp', 'count')], ascending=False)
When looking at the structure, the dataframe has a multi index, which makes plot creations difficult:
df.columns
MultiIndex(levels=[['Lifeexp', 'Place'], ['count', 'mean', 'max', 'min', '']],
           labels=[[1, 0, 0, 0, 0], [4, 0, 1, 2, 3]])
I tried the solutions from different questions here (e.g. this), but somehow don't get the desired result. I want df to have Place, count, mean, max, min as column names and to drop Lifeexp, so that I can create simple plots, e.g. df.plot.bar(x="Place", y="count").
I think the solution is simpler: select the column right after groupby to prevent a MultiIndex in the columns:
df = df.groupby('Place')['Lifeexp'].agg(['count','mean', 'max','min']).reset_index()
df = df.sort_values('count', ascending=False)
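A self-contained version of that fix, with made-up sample data (the Place/Lifeexp names mirror the question; the values are hypothetical):

```python
import pandas as pd

sample = pd.DataFrame({
    'Place': ['A', 'A', 'B'],
    'Lifeexp': [70.0, 80.0, 75.0],
})

# Selecting the single column before .agg keeps the result's columns flat
df = (sample.groupby('Place')['Lifeexp']
            .agg(['count', 'mean', 'max', 'min'])
            .reset_index())
df = df.sort_values('count', ascending=False)
print(df.columns.tolist())  # ['Place', 'count', 'mean', 'max', 'min']
```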
