How can I make all subindexes on a multiindex have the same values - python

I have a Dataframe with multiindex that looks like this:
a 1
  2
  3
b 2
  3
So the outer level has values a and b, and the inner values are 1, 2, 3 for a and 2, 3 for b.
I want to make sure that the indexes on the inner level are the same for all indexes on the outer level (in this case, create a new row for b with inner index 1). The values in the columns would all be null for these new rows.
Is there an easy way to do it?

You can re-index with a MultiIndex made from your original dataframe indices:
df.reindex(pd.MultiIndex.from_product(df.index.levels))
Example:
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_arrays([['a','a','a','b','b'], [1,2,3,2,3]])
df = pd.DataFrame(np.random.random(5), index=idx)
>>> df
            0
a 1  0.354691
  2  0.322138
  3  0.195380
b 2  0.731177
  3  0.912628
>>> df.reindex(pd.MultiIndex.from_product(df.index.levels))
            0
a 1  0.354691
  2  0.322138
  3  0.195380
b 1       NaN
  2  0.731177
  3  0.912628
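An equivalent idiom (my addition, not part of the original answer) round-trips through unstack/stack: unstacking the inner level creates the missing combinations as NaN, and stacking back with dropna=False keeps those NaN rows. Note that stack's dropna argument is deprecated in recent pandas, so the reindex approach above is the more robust choice:
>>> df.unstack().stack(dropna=False)
            0
a 1  0.354691
  2  0.322138
  3  0.195380
b 1       NaN
  2  0.731177
  3  0.912628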

Filter out rows that are the same in one column but have multiple values in another columns respectively in dataframe

Say I have a dataframe with three columns like:
index  A    B      C
1      foo  One    1
2      foo  Two    2
3      foo  Three  3
4      bar  One    2
5      bar  One    1
6      num  Two    3
7      num  Three  3
In this case, how can I keep only the rows whose value in column B is associated with more than one value in column C, using Python pandas?
The rows I need are 1, 2, 4, 5, 6, because "One" in column B has two corresponding values (1 and 2) in column C, and "Two" in column B has two corresponding values as well. Eventually I want to group them by column A if possible.
You can group by column B, then filter on the number of distinct values in column C (via value_counts):
out = df.groupby('B').filter(lambda group: len(group['C'].value_counts()) > 1)
print(out)
   index    A    B  C
0      1  foo  One  1
1      2  foo  Two  2
3      4  bar  One  2
4      5  bar  One  1
5      6  num  Two  3
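A slightly shorter spelling of the same filter (an equivalent rewrite, not the original answer's code) counts the distinct values directly with nunique:
out = df.groupby('B').filter(lambda group: group['C'].nunique() > 1)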
Not an optimised solution, but it will get your work done:
import pandas as pd

# create dataframe
df = pd.DataFrame([['foo','One',1],['foo','Two',2],['foo','Three',3],['bar','One',2],
                   ['bar','One',1],['num','Two',3],['num','Three',3]],
                  index=range(1,8), columns=['A','B','C'])

# get the unique values present in column B
values = list(df['B'].unique())
frames = []
# iterate through the unique values and, for each one, check the corresponding values in C
for val in values:
    unique_values = list(df[df['B'] == val]['C'].unique())
    # if there is more than one unique value in column C, the group satisfies
    # the condition and can be added to the result
    if len(unique_values) > 1:
        frames.append(df[df['B'] == val])
# DataFrame.append was removed in pandas 2.0, so collect the pieces and concat once
result = pd.concat(frames)
print(result)
The result is the rows 1, 2, 4, 5, 6.
Always show your work in the question.
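For larger frames, a vectorised boolean mask (my addition, not from either answer) avoids the Python-level filter callback; a groupby transform broadcasts the per-group distinct count back to every row:
out = df[df.groupby('B')['C'].transform('nunique') > 1]
This selects the same rows 1, 2, 4, 5, 6.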

pd.DataFrame on dataframe

What does pd.DataFrame do to a dataframe? Please see the code below.
In [1]: import pandas as pd
In [2]: a = pd.DataFrame(dict(a=[1,2,3], b=[4,5,6]))
In [3]: b = pd.DataFrame(a)
In [4]: a['c'] = [7,8,9]
In [5]: a
Out[5]:
a b c
0 1 4 7
1 2 5 8
2 3 6 9
In [6]: b
Out[6]:
a b c
0 1 4 7
1 2 5 8
2 3 6 9
In [7]: a.drop(columns='c', inplace=True)
In [8]: a
Out[8]:
a b
0 1 4
1 2 5
2 3 6
In [9]: b
Out[9]:
a b c
0 1 4 7
1 2 5 8
2 3 6 9
In In[3], the function pd.DataFrame is applied to the dataframe a. It turns out that the ids of a and b are different. However, when a column is added to a, the same column is added to b; but when we drop a column from a, the column is not dropped from b. So what does pd.DataFrame do? Are a and b the same object or different? What should we do to a so that the column is also dropped from b? Or, how do we prevent a column from being added to b when we add a column to a?
I would avoid constructions like that at all cost. Better would be to make the dataframe directly, as such:
df=pd.DataFrame({'a': [0,1,2], 'b': [3,4,5], 'c':[6,7,8]})
The above result is a dataframe, with indices and column names.
You can add a column to df, like this:
df['d'] = [8,9,10]
And remove a column from the dataframe, like this:
df.drop(columns='c',inplace=True)
I would not create a dataframe from another dataframe, but build it up with 'append' instead. Append works for dictionaries and dataframes. An example of a dictionary-based append:
df = pd.DataFrame(columns=['Col1','Col2','Col3','Col4'])  # create empty df with column names
append_dict = {'Col1': value_1, 'Col2': value_2, 'Col3': value_3, 'Col4': value_4}  # value_1 ... value_4 are placeholders
df = df.append(append_dict, ignore_index=True)  # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
The values can be changed in a loop, so each append can depend on the previous values. For a dataframe append, you can check the pandas documentation (just replace the append_dict argument with the dataframe that you want to append).
Is this what you want?
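If the goal is simply a b that is unaffected by later changes to a, the usual answer (my addition, not part of the answer above) is an explicit copy, which gives b its own underlying data:
import pandas as pd

a = pd.DataFrame(dict(a=[1,2,3], b=[4,5,6]))
b = a.copy()            # deep copy by default: b no longer shares data with a
a['c'] = [7,8,9]        # modifying a ...
print(list(b.columns))  # ... leaves b unchanged: ['a', 'b']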

Concatenate dataframes alternating rows with Pandas

I have two dataframes df1 and df2 that are defined like so:
df1                df2
Out[69]:           Out[70]:
   A  B               A  B
0  2  a            0  5  q
1  1  s            1  6  w
2  3  d            2  3  e
3  4  f            3  1  r
My goal is to concatenate the dataframes by alternating the rows so that the resulting dataframe is like this:
dff
Out[71]:
   A  B
0  2  a    <--- belongs to df1
0  5  q    <--- belongs to df2
1  1  s    <--- belongs to df1
1  6  w    <--- belongs to df2
2  3  d    <--- belongs to df1
2  3  e    <--- belongs to df2
3  4  f    <--- belongs to df1
3  1  r    <--- belongs to df2
As you can see the first row of dff corresponds to the first row of df1 and the second row of dff is the first row of df2. The pattern repeats until the end.
I tried to reach my goal by using the following lines of code:
import pandas as pd
df1 = pd.DataFrame({'A':[2,1,3,4], 'B':['a','s','d','f']})
df2 = pd.DataFrame({'A':[5,6,3,1], 'B':['q','w','e','r']})
dfff = pd.DataFrame()
for i in range(0,4):
    dfx = pd.concat([df1.iloc[i].T, df2.iloc[i].T])
    dfff = pd.concat([dfff, dfx])
However, this approach doesn't work: df1.iloc[i] and df2.iloc[i] are Series, so they are automatically reshaped into columns instead of rows, and I cannot revert the process (even by using .T).
Question: Can you please suggest a nice and elegant way to reach my goal?
Optional: Can you also provide an explanation of how to convert a column back to a row?
I'm unable to comment on the accepted answer, but note that the sort operation is unstable by default, so you must choose a stable sorting algorithm:
pd.concat([df1, df2]).sort_index(kind='stable')
IIUC
In [64]: pd.concat([df1, df2]).sort_index()
Out[64]:
   A  B
0  2  a
0  5  q
1  1  s
1  6  w
2  3  d
2  3  e
3  4  f
3  1  r
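On the optional question: df1.iloc[i] is a Series, which pandas displays vertically and concat treats as a column. To turn it back into a one-row frame (a small sketch, not from the original answers), convert it to a frame and transpose:
row = df1.iloc[0]           # a Series: one row of df1, displayed as a column
row_df = row.to_frame().T   # back to a one-row DataFrame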

Pandas: Create dataframe column based on other dataframe

If I have 2 dataframes like these two:
import pandas as pd
df1 = pd.DataFrame({'Type':list('AABAC')})
df2 = pd.DataFrame({'Type':list('ABCDEF'), 'Value':[1,2,3,4,5,6]})
  Type
0    A
1    A
2    B
3    A
4    C

  Type  Value
0    A      1
1    B      2
2    C      3
3    D      4
4    E      5
5    F      6
I would like to add a column in df1 based on the values in df2. df2 only contains unique values, whereas df1 has multiple entries of each value.
So the resulting df1 should look like this:
  Type  Value
0    A      1
1    A      1
2    B      2
3    A      1
4    C      3
My actual dataframe df1 is quite long, so I need something that is efficient (I tried it in a loop but this takes forever).
As requested I am posting a solution that uses map without the need to create a temporary dict:
In[3]:
df1['Value'] = df1['Type'].map(df2.set_index('Type')['Value'])
df1
Out[3]:
  Type  Value
0    A      1
1    A      1
2    B      2
3    A      1
4    C      3
This relies on a couple of things: the key values being looked up should exist in df2 (keys that are missing simply map to NaN), and df2 must not have duplicate entries in Type, otherwise the lookup raises InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
You could create dict from your df2 with to_dict method and then map result to Type column for df1:
replace_dict = dict(df2.to_dict('split')['data'])
In [50]: replace_dict
Out[50]: {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6}
df1['Value'] = df1['Type'].map(replace_dict)
In [52]: df1
Out[52]:
  Type  Value
0    A      1
1    A      1
2    B      2
3    A      1
4    C      3
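The same mapping can also be built more directly (an alternative spelling, not the original answer's code):
replace_dict = dict(zip(df2['Type'], df2['Value']))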
Another way to do this is by using the label-based indexer loc. First make the Type column the index using .set_index, then look up the rows with df1's Type column, and restore a fresh index with .reset_index:
df2.set_index('Type').loc[df1['Type'],:].reset_index()
Either use this as your new df1 or extract the Value column:
df1['Value'] = df2.set_index('Type').loc[df1['Type'],:].reset_index()['Value']
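For completeness (my addition, not one of the original answers), the same lookup can be written as a left merge, which keeps every row of df1 and pulls in the matching Value from df2:
df1 = df1.merge(df2, on='Type', how='left')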

How to not sort the index in pandas

I have 2 data frames with one column each. The index of the first is [C, B, F, A, Z], not sorted in any way. The index of the second is [C, B, Z], also unsorted.
I use pd.concat([df1, df2], axis=1) and get a data frame with 2 columns, with NaN in the second column where there is no appropriate value for the index.
The problem I have is that the index automatically becomes sorted in alphabetical order.
I have tried pd.concat([df1, df2], axis=1, names=my_list) where my_list = ['C','B','F','A','Z'], but that didn't make any changes.
How can I specify index to be not sorted?
This seems to be by design; the only thing I'd suggest is to call reindex on the concatenated df and pass it the index of df:
In [56]:
df = pd.DataFrame(index=['C','B','F','A','Z'], data={'a':np.arange(5)})
df
Out[56]:
   a
C  0
B  1
F  2
A  3
Z  4
In [58]:
df1 = pd.DataFrame(index=['C','B','Z'], data={'b':np.random.randn(3)})
df1
Out[58]:
          b
C -0.146799
B -0.227027
Z -0.429725
In [67]:
pd.concat([df,df1],axis=1).reindex(df.index)
Out[67]:
   a         b
C  0 -0.146799
B  1 -0.227027
F  2       NaN
A  3       NaN
Z  4 -0.429725
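A left join gives the same result in one step, since join aligns on the index and keeps the calling frame's row order (an equivalent rewrite, not part of the original answer):
df.join(df1)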
