MultiIndex Pandas does not group first index level - python

I am trying to create a Pandas Dataframe with two levels of index in the rows.
info = pd.DataFrame([['A', 1, 3],
                     ['A', 2, 4],
                     ['A', 3, 6],
                     ['B', 1, 9],
                     ['B', 2, 10],
                     ['B', 4, 6]], columns=pd.Index(['C', 'D', 'V']))
info_new = info.set_index(['C', 'D'], drop=False)
EDIT: I want the following output:
V
C D
A 1 3
2 4
3 6
B 1 9
2 10
4 6
According to every instruction I found, this should work.
I am still getting
V
C D
A 1 3
A 2 4
A 3 6
B 1 9
B 2 10
B 4 6
So apparently, the multiindex does not work here.
I checked each column with non-unique values with .is_unique, the answer is False.
I checked the columns with unique values, the answer is True.
I also tried to assign a dtype=str, this didn't change anything.

Thank you for the info_new.index.is_lexsorted() comment.
I solved it by specifying dtype=str in the .csv import and then sorting the index:
info_new.sortlevel(inplace=True)
(Note: sortlevel is deprecated in modern pandas; use info_new.sort_index(inplace=True) instead.)
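A minimal, self-contained sketch of the fix with the toy data from above (on real CSV data the key steps are consistent dtypes plus sorting the index):

```python
import pandas as pd

info = pd.DataFrame([['A', 1, 3], ['A', 2, 4], ['A', 3, 6],
                     ['B', 1, 9], ['B', 2, 10], ['B', 4, 6]],
                    columns=['C', 'D', 'V'])

# Build the MultiIndex and sort it; a sorted (lexsorted) index is what
# lets pandas display the outer level grouped instead of repeated.
info_new = info.set_index(['C', 'D']).sort_index()

print(info_new)  # the outer labels 'A'/'B' now print once per group
```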

Related

Subset of Pandas MultiIndex works for whole index but not for specific level?

I've run into strange behaviour with a pd.MultiIndex and am trying to understand what's going on. Not looking for a solution so much as an explanation.
Suppose I have a MultiIndexed dataframe:
index0 = pd.Index(['a', 'b', 'c'], name='let')
index1 = pd.Index(['foo', 'bar', 'baz'], name='word')
x = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], index=[index0, index1])
display(x)
0 1 2
let word
a foo 1 2 3
b bar 4 5 6
c baz 7 8 9
If I then take a subset of that dataframe with df.loc:
sub = ['a', 'c']
y = x.loc[sub]
display(y)
0 1 2
let word
a foo 1 2 3
c baz 7 8 9
So far, so good. Now, looking at the index of the new dataframe:
display(y.index)
MultiIndex([('a', 'foo'),
('c', 'baz')],
names=['let', 'word'])
That makes sense too. But if I look at a specific level of the subset dataframe's index...
display(y.index.levels[1])
Index(['bar', 'baz', 'foo'], dtype='object', name='word')
Suddenly I have the values of the original full dataframe, not the selected subset!
Why does this happen?
You need to call remove_unused_levels here; the levels are stored like categorical data, so they keep all the original values even after subsetting:
y.index.levels[0]
Index(['a', 'b', 'c'], dtype='object', name='let')
# after add
y.index=y.index.remove_unused_levels()
y.index.levels[0]
Index(['a', 'c'], dtype='object', name='let')
I think you are confusing levels and get_level_values:
y.index.get_level_values(1)
# Index(['foo', 'baz'], dtype='object', name='word')
y.index.levels is, as Ben mentioned in his answer, just all the possible values (from before the subsetting). Let's see another example:
df = pd.DataFrame([[0]] * 6,
                  index=pd.MultiIndex.from_product([[0, 1], [0, 1, 2]]))
So df would look like:
0
0 0 0
1 0
2 0
1 0 0
1 0
2 0
Now what do you think we would get with df.index.levels[1]? The answer is:
Int64Index([0, 1, 2], dtype='int64')
which consists of all the possible values in the level. Whereas, df.index.get_level_values(1) gives:
Int64Index([0, 1, 2, 0, 1, 2], dtype='int64')

Indexing with pandas df.loc with duplicated columns - sometimes it works, sometimes it doesn't

I want to add a row to a pandas dataframe using df.loc[rowname] = s (where s is a Series).
However, I constantly get the Cannot reindex from a duplicate axis ValueError.
I presume that this is due to having duplicate column names in df as well as duplicate index names in s (the index of s is identical to df.columns).
However, when I try to reproduce this error on a small example, I don't get this error. What could the reason for this behavior be?
a = pd.DataFrame(columns=['a', 'b', 'a'], data=[[1, 2, 7], [5, 4, 5], ['', '', '']])
b=pd.DataFrame(columns=a.columns)
b.loc['mean'] = a.replace('',np.nan).mean(skipna=True)
print(b)
a b a
mean 3.0 3.0 6.0
I think duplicated column names should be avoided, because they lead to strange errors.
It seems there are non-matching values between the index of the Series and the columns of the DataFrame:
a = pd.DataFrame(columns=['a', 'b', 'a'], data=[[1, 2, 7], [5, 4, 5], ['', '', '']])
a.loc['mean'] = pd.Series([2,5,4], index=list('abb'))
print(a)
ValueError: cannot reindex from a duplicate axis
One possible solution is to deduplicate the column names by renaming them:
s = a.columns.to_series()
a.columns = s.add(s.groupby(s).cumcount().astype(str).replace('0',''))
print(a)
a b a1
0 1 2 7
1 5 4 5
2
Or drop duplicated columns:
a = a.loc[:, ~a.columns.duplicated()]
print(a)
a b
0 1 2
1 5 4
2
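For reuse, the rename trick above can be wrapped in a small helper (a sketch; the function name dedup_columns is my own):

```python
import pandas as pd

def dedup_columns(df):
    """Return a copy of df with duplicated column names suffixed 1, 2, ..."""
    s = df.columns.to_series()
    # cumcount numbers repeats within each group of identical names;
    # the first occurrence gets an empty suffix instead of '0'.
    suffix = s.groupby(s).cumcount().astype(str).replace('0', '')
    out = df.copy()
    out.columns = s.add(suffix)
    return out

a = pd.DataFrame(columns=['a', 'b', 'a'], data=[[1, 2, 7], [5, 4, 5]])
b = dedup_columns(a)
# b.columns is now Index(['a', 'b', 'a1']); the original a is untouched.
```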

How to join a dataframe and dictionary on two rows

I have a dictionary and a dataframe. The dictionary contains a mapping of one letter to one number and the dataframe has a row containing these specific letters and another row containing these specific numbers, adjacent to each other (not that it necessarily matters).
I want to update the row containing the numbers by matching each letter in the row of the dataframe with the letter in the dictionary and then replacing the corresponding number (number in the same column as the letter) with the value of that letter from the dictionary.
df = pd.DataFrame(np.array([[4, 5, 6], ['a', 'b', 'c'], [7, 8, 9]]))
dict = {'a':2, 'b':3, 'c':5}
Let's say dict is the dictionary and df is the dataframe I want the result to be df2.
df2 = pd.DataFrame(np.array([[2, 3, 5], ['a', 'b', 'c'], [7, 8, 9]]))
df
0 1 2
0 4 5 6
1 a b c
2 7 8 9
dict
{'a': 2, 'b': 3, 'c': 5}
df2
0 1 2
0 2 3 5
1 a b c
2 7 8 9
I do not know how to use merge or join to fix this, my initial thoughts are to make the dictionary a dataframe object but I am not sure where to go from there.
It's a little weird, but:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[4, 5, 6], ['a', 'b', 'c'], [7, 8, 9]]))
d = {'a': 2, 'b': 3, 'c': 5}
df.iloc[0] = df.iloc[1].map(lambda x: d[x] if x in d else x)
df
# 0 1 2
# 0 2 3 5
# 1 a b c
# 2 7 8 9
I couldn't bring myself to redefine dict to be a particular dictionary. :D
After receiving a much-deserved smackdown regarding the speed of apply, I present to you the theoretically faster approach below:
df.iloc[0] = df.iloc[1].map(d).where(df.iloc[1].isin(d.keys()), df.iloc[0])
This gives you the dictionary value (df.iloc[1].map(d)) wherever the value in row 1 is a key of d (the .where(df.iloc[1].isin(d.keys()), ...) part), and otherwise falls back to the value in row 0 (the df.iloc[0] argument).
Hope this helps!
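A variant of the same idea (my own sketch, not from the answers above): Series.map returns NaN for values missing from the dict, so fillna can supply the row-0 fallback in one chain:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[4, 5, 6], ['a', 'b', 'c'], [7, 8, 9]]))
d = {'a': 2, 'b': 3, 'c': 5}

# map() yields NaN where a row-1 value is not a key of d;
# fillna() then falls back to the current row-0 value.
df.iloc[0] = df.iloc[1].map(d).fillna(df.iloc[0])
```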

Sort_values based on column index

I have seen lots of advice about sorting based on a pandas column name but I am trying to sort based on the column index.
I have included some code to demonstrate what I am trying to do.
import pandas as pd
df = pd.DataFrame({
'col1' : ['A', 'A', 'B', 'D', 'C', 'D'],
'col2' : [2, 1, 9, 8, 7, 4],
'col3': [0, 1, 9, 4, 2, 3],
})
df2 = df.sort_values(by=['col2'])
I want to sort a number of dataframes that all have different names for the second column. It is not practical to sort based on by=['col2'], but I always want to sort on the second column (i.e. column index 1). Is this possible?
Select the column name by position and pass it to the by parameter:
print (df.columns[1])
col2
df2 = df.sort_values(by=df.columns[1])
print (df2)
col1 col2 col3
1 A 1 1
0 A 2 0
5 D 4 3
4 C 7 2
3 D 8 4
2 B 9 9
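Since the dataframes all name their second column differently, the same positional lookup works in a loop (the frames and column names below are made up for illustration):

```python
import pandas as pd

frames = [
    pd.DataFrame({'col1': ['A', 'B', 'C'], 'col2': [2, 1, 3]}),
    pd.DataFrame({'name': ['x', 'y', 'z'], 'score': [9, 7, 8]}),
]

# Sort each dataframe on whatever its second column happens to be called.
sorted_frames = [f.sort_values(by=f.columns[1]) for f in frames]
```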

how to re-arrange multiple columns into one column with same index

I'm using Python pandas, and I want to gather multiple columns that share the same index into a single column. Where possible, I also want to delete the zero values.
I have this data frame
index A B C
a 8 0 1
b 2 3 0
c 0 4 0
d 3 2 7
I'd like my output to look like this
index data value
a A 8
b A 2
d A 3
b B 3
c B 4
d B 2
a C 1
d C 7
===
I solved this task as below. My original data has 2 indexes, and the zeros in the dataframe were NaN values.
At first, I tried to apply the melt function while removing NaN values, following this (How to melt a dataframe in Pandas with the option for removing NA values), but I couldn't, because my original data has several columns ('value_vars'). So I re-organized the dataframe in 2 steps:
Firstly, I made the multiple columns into one column with the melt function,
then removed the NaN values in each row with the dropna function.
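The two steps described above (melt, then drop the empty values) can be sketched like this, treating the zeros as missing:

```python
import pandas as pd

df = pd.DataFrame({'A': [8, 2, 0, 3],
                   'B': [0, 3, 4, 2],
                   'C': [1, 0, 0, 7]},
                  index=pd.Index(['a', 'b', 'c', 'd'], name='index'))

# Step 1: melt the columns into one 'data'/'value' pair, keeping the index.
molten = df.reset_index().melt(id_vars='index', var_name='data', value_name='value')

# Step 2: treat zeros as missing and drop those rows.
result = molten[molten['value'] != 0].reset_index(drop=True)
```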
This looks a little like the melt function in pandas, with the only difference being the index.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html
Here is some code you can run to test:
import pandas as pd
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},'B': {0: 1, 1: 3, 2: 5},'C': {0: 2, 1: 4, 2: 6}})
pd.melt(df)
With a little manipulation, you could solve for the indexing issue.
This is not particularly pythonic, but if you have a limited number of columns, you could make do with:
molten = pd.melt(df)
a = molten.merge(df, left_on='value', right_on = 'A')
b = molten.merge(df, left_on='value', right_on = 'B')
c = molten.merge(df, left_on='value', right_on = 'C')
merge = pd.concat([a,b,c])
try this:
a = [['a', 8, 0, 1], ['b', 2, 3, 0] ... ]
cols = ['A', 'B', 'C']
result = [[[a[i][0], cols[j], a[i][j + 1]] for i in range(len(a))] for j in range(len(cols))]
output:
[[['a', 'A', 8], ['b', 'A', 2]], [['a', 'B', 0], ['b', 'B', 3]] ... ]
