Adding a DataFrame to a level in Pandas - python

I have a MultiIndex DataFrame and would like to take a level and put a new DataFrame in its place. So if I had a DataFrame with levels like this:
a 1
  2
b 3
  4
Would I be able to swap out ['b', 3] with a DataFrame like this
10
11
Resulting in this:
a 1
  2
b 10
  11
  4

I took a crack at this and couldn't quite get what you wanted. Instead of 'inserting' a df, you can delete the row(s) in question, then concat/join/merge as needed. However, this does not leave the data in the order you requested.
import pandas as pd

arrays = [['a', 'a', 'b', 'b'], [1, 2, 3, 4], ['str1', 'str2', 'str3', 'str4']]
df1 = pd.DataFrame(list(zip(*arrays)),
                   columns=['let', 'num', 'string']).set_index(['let', 'num'])
df2 = pd.DataFrame(list(zip(*[['b', 'b'], [10, 11], ['str5', 'str6']])),
                   columns=['let', 'num', 'string']).set_index(['let', 'num'])
df3 = pd.concat([df1.drop(3, level='num'), df2])
Output:
        string
let num
a   1     str1
    2     str2
b   4     str4
    10    str5
    11    str6
I tried a number of other methods to insert data into the middle of the df, and was met with error after error. I'll keep messing with it; it's a good problem-oriented multiindex tutorial.
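For what it's worth, a positional splice does preserve the order you asked for. A minimal sketch (not necessarily the idiomatic way), reusing df1/df2 from above and assuming the row to replace is ('b', 3):

```python
import pandas as pd

arrays = [['a', 'a', 'b', 'b'], [1, 2, 3, 4], ['str1', 'str2', 'str3', 'str4']]
df1 = pd.DataFrame(list(zip(*arrays)),
                   columns=['let', 'num', 'string']).set_index(['let', 'num'])
df2 = pd.DataFrame(list(zip(*[['b', 'b'], [10, 11], ['str5', 'str6']])),
                   columns=['let', 'num', 'string']).set_index(['let', 'num'])

# find the integer position of the row to replace, then splice df2 in
pos = df1.index.get_loc(('b', 3))
df3 = pd.concat([df1.iloc[:pos], df2, df1.iloc[pos + 1:]])
print(df3)
```

Because the splice is positional rather than label-based, the result keeps (a, 1), (a, 2), (b, 10), (b, 11), (b, 4) in exactly that order.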


Make list after groupby in pandas using apply() function

I have this dataframe:
  c1  c2
0  B   1
1  A   2
2  B   5
3  A   3
4  A   7
My goal is to collect the values of column c2 for each letter in column c1, separated by (:). The output should look like this:
  c1   list
0  A  2:3:7
1  B    1:5
What's the most pythonic way to do this?
At the moment I'm able to group by column c1, and I'm trying to use the apply() function, but I don't know how to map the values and build this joined list in the new column.
Try this:
df = df.groupby("c1")["c2"].apply(lambda x: ":".join([str(i) for i in x])).reset_index()
You can use groupby
>>> import pandas as pd
>>> df = pd.DataFrame({'c1': ['B', 'A', 'B', 'A', 'A'], 'c2': [1, 2, 5, 3, 7]})
>>>
>>> df.c2 = df.c2.astype(str)
>>> new_df = df.groupby("c1")['c2'].apply(":".join).reset_index()
>>> new_df
  c1     c2
0  A  2:3:7
1  B    1:5
I think you can just do a string join:
import pandas as pd

df = pd.DataFrame({"c1": list("BABAA"), "c2": [1, 2, 5, 3, 7]})
df['c2'] = df['c2'].astype(str)
df.groupby('c1').agg({'c2': ':'.join})
You might get more mileage from:
df.groupby('c1').agg({'c2':list})
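For reference, a runnable comparison of the two aggregations discussed above, using the data from the question:

```python
import pandas as pd

df = pd.DataFrame({"c1": list("BABAA"), "c2": [1, 2, 5, 3, 7]})

# colon-joined strings, as the question asks for
joined = df.groupby("c1")["c2"].apply(lambda s: ":".join(map(str, s))).reset_index()
print(joined)

# actual Python lists, often easier to work with downstream
as_lists = df.groupby("c1")["c2"].agg(list).reset_index()
print(as_lists)
```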

Subset of Pandas MultiIndex works for whole index but not for specific level?

I've run into strange behaviour with a pd.MultiIndex and am trying to understand what's going on. Not looking for a solution so much as an explanation.
Suppose I have a MultiIndexed dataframe:
index0 = pd.Index(['a', 'b', 'c'], name='let')
index1 = pd.Index(['foo', 'bar', 'baz'], name='word')
x = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], index=[index0, index1])
display(x)
          0  1  2
let word
a   foo   1  2  3
b   bar   4  5  6
c   baz   7  8  9
If I then take a subset of that dataframe with df.loc:
sub = ['a', 'c']
y = x.loc[sub]
display(y)
          0  1  2
let word
a   foo   1  2  3
c   baz   7  8  9
So far, so good. Now, looking at the index of the new dataframe:
display(y.index)
MultiIndex([('a', 'foo'),
            ('c', 'baz')],
           names=['let', 'word'])
That makes sense too. But if I look at a specific level of the subset dataframe's index...
display(y.index.levels[1])
Index(['bar', 'baz', 'foo'], dtype='object', name='word')
Suddenly I have the values of the original full dataframe, not the selected subset!
Why does this happen?
You need to call remove_unused_levels here: like categorical data, the MultiIndex keeps all the original values in its levels even after slicing.
y.index.levels[0]
Index(['a', 'b', 'c'], dtype='object', name='let')

# after adding remove_unused_levels
y.index = y.index.remove_unused_levels()
y.index.levels[0]
Index(['a', 'c'], dtype='object', name='let')
I think you are confusing levels and get_level_values:
y.index.get_level_values(1)
# Index(['foo', 'baz'], dtype='object', name='word')
y.index.levels is, as Ben mentioned in his answer, just all the possible values (from before the subsetting). Let's see another example:
df = pd.DataFrame([[0]] * 6,
                  index=pd.MultiIndex.from_product([[0, 1], [0, 1, 2]]))
So df would look like:
     0
0 0  0
  1  0
  2  0
1 0  0
  1  0
  2  0
Now what do you think we would get with df.index.levels[1]? The answer is:
Int64Index([0, 1, 2], dtype='int64')
which consists of all the possible values in the level. Whereas, df.index.get_level_values(1) gives:
Int64Index([0, 1, 2, 0, 1, 2], dtype='int64')
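Putting the two answers together, a small sketch contrasting the three behaviours on the dataframe from the question:

```python
import pandas as pd

index0 = pd.Index(['a', 'b', 'c'], name='let')
index1 = pd.Index(['foo', 'bar', 'baz'], name='word')
x = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], index=[index0, index1])
y = x.loc[['a', 'c']]

# .levels keeps every value the original index could hold
print(list(y.index.levels[1]))            # ['bar', 'baz', 'foo']

# .get_level_values reflects only the rows actually present
print(list(y.index.get_level_values(1)))  # ['foo', 'baz']

# .remove_unused_levels prunes the stale entries from .levels
print(list(y.index.remove_unused_levels().levels[1]))
```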

How to drop rows based on column value if column is not set as index in pandas?

I have a list and a dataframe which look like this:
list = ['a', 'b']
df = pd.DataFrame({'A':['a', 'b', 'c', 'd'], 'B':[9, 9, 8, 4]})
I would like to do something like this:
df1 = df.drop([x for x in list])
I am getting the following error message:
"KeyError: "['a' 'b'] not found in axis""
I know I can do the following:
df = pd.DataFrame({'A':['a', 'b', 'c', 'd'], 'B':[9, 9, 8, 4]}).set_index('A')
df1 = df.drop([x for x in list])
How can I drop the list values without having to set column 'A' as index? My dataframe has multiple columns.
Input:
   A  B
0  a  9
1  b  9
2  c  8
3  d  4
Code:
for i in list:
    ind = df[df['A'] == i].index.tolist()
    df = df.drop(ind)
df
Output:
   A  B
2  c  8
3  d  4
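The loop above can also be collapsed into a single boolean mask with isin, which avoids mutating df inside the loop (the list is renamed here to avoid shadowing the built-in):

```python
import pandas as pd

to_drop = ['a', 'b']  # renamed from `list` to avoid shadowing the built-in
df = pd.DataFrame({'A': ['a', 'b', 'c', 'd'], 'B': [9, 9, 8, 4]})

# keep only the rows whose 'A' value is not in to_drop
df1 = df[~df['A'].isin(to_drop)]
print(df1)
```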
You need to specify the correct axis, namely axis=1.
According to the docs:
axis : {0 or ‘index’, 1 or ‘columns’}, default 0. Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
You also need to make sure you are using the correct column names (i.e. uppercase rather than lowercase). And you should avoid shadowing Python built-in names, so don't use list as a variable name.
This should work:
myList = ['A', 'B']
df1 = df.drop(myList, axis=1)

Indexing with pandas df.loc with duplicated columns - sometimes it works, sometimes it doesn't

I want to add a row to a pandas dataframe with using df.loc[rowname] = s (where s is a series).
However, I constantly get the Cannot reindex from a duplicate axis ValueError.
I presume that this is due to having duplicate column names in df as well as duplicate index names in s (the index of s is identical to df.columns).
However, when I try to reproduce this error on a small example, I don't get this error. What could the reason for this behavior be?
import numpy as np
import pandas as pd

a = pd.DataFrame(columns=['a', 'b', 'a'], data=[[1, 2, 7], [5, 4, 5], ['', '', '']])
b = pd.DataFrame(columns=a.columns)
b.loc['mean'] = a.replace('', np.nan).mean(skipna=True)
print(b)

        a    b    a
mean  3.0  3.0  6.0
I think duplicated column names should be avoided, because they lead to strange errors.
It seems the error appears when the index of the Series does not match the columns of the DataFrame:
a = pd.DataFrame(columns=['a', 'b', 'a'], data=[[1, 2, 7], [5, 4, 5], ['', '', '']])
a.loc['mean'] = pd.Series([2, 5, 4], index=list('abb'))

ValueError: cannot reindex from a duplicate axis
One possible solution is to deduplicate the column names by renaming them:
s = a.columns.to_series()
a.columns = s.add(s.groupby(s).cumcount().astype(str).replace('0', ''))
print(a)

   a  b a1
0  1  2  7
1  5  4  5
2
Or drop duplicated columns:
a = a.loc[:, ~a.columns.duplicated()]
print(a)
   a  b
0  1  2
1  5  4
2
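To close the loop, a sketch showing that once the columns have been deduplicated as in the answer above, the df.loc[rowname] = s assignment from the question no longer raises (the to_numeric coercion is an assumption here, to keep the mean numeric despite the '' strings):

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(columns=['a', 'b', 'a'], data=[[1, 2, 7], [5, 4, 5], ['', '', '']])

# deduplicate the column names by suffixing repeats
s = a.columns.to_series()
a.columns = s.add(s.groupby(s).cumcount().astype(str).replace('0', ''))

# this previously raised "cannot reindex from a duplicate axis"
a.loc['mean'] = a.replace('', np.nan).apply(pd.to_numeric, errors='coerce').mean()
print(a)
```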

how to re-arrange multiple columns into one column with same index

I'm using python pandas and I want to collapse multiple columns that share one index into a single column. Where possible, I also want to drop the zero values.
I have this data frame
index  A  B  C
a      8  0  1
b      2  3  0
c      0  4  0
d      3  2  7
I'd like my output to look like this
index  data  value
a      A     8
b      A     2
d      A     3
b      B     3
c      B     4
d      B     2
a      C     1
d      C     7
===
I solved this task as below. My original data has 2 indexes, and the 0s in the dataframe were actually NaN values.
At first I tried to apply the melt function while removing the NaN values, following this (How to melt a dataframe in Pandas with the option for removing NA values), but I couldn't, because my original data has several columns ('value_vars'). So I re-organized the dataframe in 2 steps:
First, I turned the multiple columns into one column with the melt function; then I removed the NaN values in each row with the dropna function.
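A minimal sketch of those two steps on the example from this question (here the 0s are simply filtered out after the melt; the column names are assumptions):

```python
import pandas as pd

df = pd.DataFrame({'index': ['a', 'b', 'c', 'd'],
                   'A': [8, 2, 0, 3],
                   'B': [0, 3, 4, 2],
                   'C': [1, 0, 0, 7]})

# step 1: melt the value columns into a single 'data'/'value' pair
melted = df.melt(id_vars='index', var_name='data', value_name='value')

# step 2: drop the zero (originally NaN) entries
melted = melted[melted['value'] != 0].reset_index(drop=True)
print(melted)
```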
This looks a little like the melt function in pandas, with the only difference being the index.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html
Here is some code you can run to test:
import pandas as pd

df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                   'B': {0: 1, 1: 3, 2: 5},
                   'C': {0: 2, 1: 4, 2: 6}})
pd.melt(df)
With a little manipulation, you could work around the indexing issue.
This is not particularly pythonic, but if you have a limited number of columns, you could make do with:
molten = pd.melt(df)
a = molten.merge(df, left_on='value', right_on='A')
b = molten.merge(df, left_on='value', right_on='B')
c = molten.merge(df, left_on='value', right_on='C')
merge = pd.concat([a, b, c])
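Since the index is the sticking point, df.set_index(...).stack() is another way to carry it through the reshape. A sketch on the question's data (the column names are assumptions):

```python
import pandas as pd

df = pd.DataFrame({'index': ['a', 'b', 'c', 'd'],
                   'A': [8, 2, 0, 3],
                   'B': [0, 3, 4, 2],
                   'C': [1, 0, 0, 7]})

# stack keeps the original index as the first level of the result
stacked = (df.set_index('index')
             .stack()
             .rename_axis(['index', 'data'])
             .reset_index(name='value'))
stacked = stacked[stacked['value'] != 0]
print(stacked)
```

Unlike melt, stack orders the result row by row rather than column by column.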
Try this:
array = [['a', 8, 0, 1], ['b', 2, 3, 0] ... ]
cols = ['A', 'B', 'C']
result = [[[array[i][0], cols[j], array[i][j + 1]]
           for i in range(len(array))] for j in range(len(cols))]
Output:
[[['a', 'A', 8], ['b', 'A', 2]], [['a', 'B', 0], ['b', 'B', 3]] ... ]
