How to not sort the index in pandas - python

I have 2 data frames with one column each. Index of the first is [C,B,F,A,Z] not sorted in any way. Index of the second is [C,B,Z], also unsorted.
I use pd.concat([df1,df2],axis=1) and get a data frame with 2 columns and NaN in the second column where there is no appropriate value for the index.
The problem I have is that index automatically becomes sorted in alphabetical order.
I have tried pd.concat([df1, df2], axis=1, names=my_list) where my_list = ['C','B','F','A','Z'], but that didn't change anything.
How can I keep the index unsorted?

This seems to be by design; the only thing I'd suggest is to call reindex on the concatenated DataFrame and pass in the index of df:
In [56]:
import numpy as np
import pandas as pd

df = pd.DataFrame(index=['C','B','F','A','Z'], data={'a':np.arange(5)})
df
Out[56]:
a
C 0
B 1
F 2
A 3
Z 4
In [58]:
df1 = pd.DataFrame(index=['C','B','Z'], data={'b':np.random.randn(3)})
df1
Out[58]:
b
C -0.146799
B -0.227027
Z -0.429725
In [67]:
pd.concat([df,df1],axis=1).reindex(df.index)
Out[67]:
a b
C 0 -0.146799
B 1 -0.227027
F 2 NaN
A 3 NaN
Z 4 -0.429725
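As an aside (not part of the answer above): because df's index here contains every label of df1's index, a plain left join also keeps df's original, unsorted index order, so the same result can be had without an explicit reindex. A minimal sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame(index=['C', 'B', 'F', 'A', 'Z'], data={'a': np.arange(5)})
df1 = pd.DataFrame(index=['C', 'B', 'Z'], data={'b': np.random.randn(3)})

# join() defaults to a left join on the index, so df's original
# (unsorted) index order is preserved; missing labels become NaN.
print(df.join(df1))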

Related

pd.DataFrame on dataframe

What does pd.DataFrame do when applied to a dataframe? Please see the code below.
In [1]: import pandas as pd
In [2]: a = pd.DataFrame(dict(a=[1,2,3], b=[4,5,6]))
In [3]: b = pd.DataFrame(a)
In [4]: a['c'] = [7,8,9]
In [5]: a
Out[5]:
a b c
0 1 4 7
1 2 5 8
2 3 6 9
In [6]: b
Out[6]:
a b c
0 1 4 7
1 2 5 8
2 3 6 9
In [7]: a.drop(columns='c', inplace=True)
In [8]: a
Out[8]:
a b
0 1 4
1 2 5
2 3 6
In [9]: b
Out[9]:
a b c
0 1 4 7
1 2 5 8
2 3 6 9
In In[3], the function pd.DataFrame is applied to the dataframe a. It turns out that the ids of a and b are different. However, when a column is added to a, the same column is added to b, yet when we drop a column from a, the column is not dropped from b. So what does pd.DataFrame do? Are a and b the same object or different objects? What should we do to a so that the column is also dropped from b? Or, how do we prevent a column from being added to b when we add a column to a?
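(An aside, not part of the answer below: the question's last point, preventing changes to a from showing up in b, comes down to making an explicit copy so the two frames no longer share data. A minimal sketch:)

import pandas as pd

a = pd.DataFrame(dict(a=[1, 2, 3], b=[4, 5, 6]))
b = a.copy()                 # explicit copy: b no longer shares data with a

a['c'] = [7, 8, 9]           # only a gains the new column
print(b.columns.tolist())    # ['a', 'b']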
I would avoid constructing one dataframe from another like that. It is better to build the dataframe directly:
df=pd.DataFrame({'a': [0,1,2], 'b': [3,4,5], 'c':[6,7,8]})
The result is a dataframe with an index and column names.
You can add a column to df, like this:
df['d'] = [8,9,10]
And remove a column from the dataframe, like this:
df.drop(columns='c',inplace=True)
I would not create one dataframe from another like that, but use 'append' instead. Append works for dictionaries and dataframes. An example of a dictionary-based append:
df = pd.DataFrame(columns=['Col1','Col2','Col3','Col4']) # create empty df with column names.
append_dict = {'Col1':value_1, 'Col2':value_2, 'Col3':value_3,'Col4':value_4}
df = df.append(append_dict, ignore_index=True)
The values can be changed in a loop, so each appended row can depend on the previous ones. For a dataframe-based append, see the pandas documentation (just replace the append_dict argument with the dataframe that you would like to append).
Is this what you want?
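A minimal, runnable version of the dict-based append sketched above (the value_1 ... value_4 placeholders are filled with made-up numbers purely for illustration; note that DataFrame.append has been deprecated in recent pandas releases, so an equivalent pd.concat line is shown alongside it):

import pandas as pd

df = pd.DataFrame(columns=['Col1', 'Col2', 'Col3', 'Col4'])  # empty df with column names

for i in range(3):
    # made-up values standing in for value_1 ... value_4
    append_dict = {'Col1': i, 'Col2': 10 * i, 'Col3': 100 * i, 'Col4': 1000 * i}
    # df = df.append(append_dict, ignore_index=True)                         # older pandas
    df = pd.concat([df, pd.DataFrame([append_dict])], ignore_index=True)     # current pandas

print(df)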

Drop a column which is a subset of any other column in a dataframe

I have a pandas dataframe as below. How can I drop any column which is a subset of any of the remaining columns? I would like to do this without using fillna.
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, 3, 3], [np.nan, 2, np.nan, 4]], columns=['A', 'B', 'C', 'D'])
df
A B C D
0 1.0 1 3.0 3
1 NaN 2 NaN 4
I can identify here that column A is a subset of B and column C is a subset of D with something like this:
if all(df['A'][df['A'].notnull()].isin(df['B']))
I could run a loop over all columns and drop the subset columns. But is there a more efficient way to accomplish this, so that I have the following result:
df
B D
0 1 3
1 2 4
Thanks.
It still requires iteration, but you can use this list comprehension (with an if statement similar to the one you provided) to get columns to keep:
keep_cols = [x for x in df if not any(df.drop(x, axis=1).apply(lambda y: df[x].dropna().isin(y).all()))]
# ['B', 'D']
And then use the result with filter:
df.filter(items=keep_cols)
# B D
# 0 1 3
# 1 2 4
Even though it still uses apply at its core, this should be fast enough, and it seems safer and more efficient than dropping columns within a loop.
If you're keen on a one-line solution, assigning the list to a variable is of course an optional step:
df.filter(items=[x for x in df if not any(df.drop(x, axis=1).apply(lambda y: df[x].dropna().isin(y).all()))])
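For reference, a self-contained run of the list comprehension above against the example frame (imports added; the expected output is shown in comments):

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, 3, 3], [np.nan, 2, np.nan, 4]], columns=['A', 'B', 'C', 'D'])

# Keep a column only if its non-null values are not fully contained in any other column.
keep_cols = [x for x in df
             if not any(df.drop(x, axis=1).apply(lambda y: df[x].dropna().isin(y).all()))]

print(df.filter(items=keep_cols))
#    B  D
# 0  1  3
# 1  2  4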

Pandas: Create dataframe column based on other dataframe

If I have 2 dataframes like these two:
import pandas as pd
df1 = pd.DataFrame({'Type':list('AABAC')})
df2 = pd.DataFrame({'Type':list('ABCDEF'), 'Value':[1,2,3,4,5,6]})
Type
0 A
1 A
2 B
3 A
4 C
Type Value
0 A 1
1 B 2
2 C 3
3 D 4
4 E 5
5 F 6
I would like to add a column in df1 based on the values in df2. df2 only contains unique values, whereas df1 has multiple entries of each value.
So the resulting df1 should look like this:
Type Value
0 A 1
1 A 1
2 B 2
3 A 1
4 C 3
My actual dataframe df1 is quite long, so I need something that is efficient (I tried it in a loop but this takes forever).
As requested I am posting a solution that uses map without the need to create a temporary dict:
In[3]:
df1['Value'] = df1['Type'].map(df2.set_index('Type')['Value'])
df1
Out[3]:
Type Value
0 A 1
1 A 1
2 B 2
3 A 1
4 C 3
This relies on a couple of things: the Type values in df2 must be unique, otherwise setting the index and mapping raises InvalidIndexError: Reindexing only valid with uniquely valued Index objects. Any Type in df1 that does not appear in df2 simply maps to NaN.
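A quick self-contained run of the map approach, with an extra 'Z' row added to df1 (an addition just for illustration) to show that a missing key maps to NaN:

import pandas as pd

df1 = pd.DataFrame({'Type': list('AABAC') + ['Z']})   # 'Z' has no match in df2
df2 = pd.DataFrame({'Type': list('ABCDEF'), 'Value': [1, 2, 3, 4, 5, 6]})

# Map df1['Type'] through a Series indexed by df2['Type'];
# keys absent from df2 ('Z' here) come back as NaN.
df1['Value'] = df1['Type'].map(df2.set_index('Type')['Value'])
print(df1)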
You could create a dict from your df2 with the to_dict method and then map the result onto the Type column of df1:
replace_dict = dict(df2.to_dict('split')['data'])
In [50]: replace_dict
Out[50]: {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6}
df1['Value'] = df1['Type'].map(replace_dict)
In [52]: df1
Out[52]:
Type Value
0 A 1
1 A 1
2 B 2
3 A 1
4 C 3
Another way to do this is by using the label-based indexer loc. First make the Type column the index with .set_index, then select rows using df1's Type column, and reset the index back to the default with .reset_index:
df2.set_index('Type').loc[df1['Type'],:].reset_index()
Either use this as your new df1 or extract the Value column:
df1['Value'] = df2.set_index('Type').loc[df1['Type'],:].reset_index()['Value']
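For completeness, a left merge gives the same column; this is a standard pandas join mentioned here as an additional option rather than one of the approaches above:

import pandas as pd

df1 = pd.DataFrame({'Type': list('AABAC')})
df2 = pd.DataFrame({'Type': list('ABCDEF'), 'Value': [1, 2, 3, 4, 5, 6]})

# A left merge keeps every row of df1 (duplicates included) and pulls in Value by Type.
print(df1.merge(df2, on='Type', how='left'))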

Finding highest values in each row in a data frame for python

I'd like to find the highest values in each row and return the column headers for those values in Python. For example, I'd like to find the top two in each row:
df =
A B C D
5 9 8 2
4 1 2 3
I'd like my output to look like this:
df =
B C
A D
You can use a dictionary comprehension to generate the largest n values in each row of the dataframe. I transposed the dataframe and then applied nlargest to each of the columns. I used .index.tolist() to extract the desired top n column labels. Finally, I transposed this result to get the dataframe back into the desired shape.
top_n = 2
>>> pd.DataFrame({n: df.T[col].nlargest(top_n).index.tolist()
...               for n, col in enumerate(df.T)}).T
0 1
0 B C
1 A D
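For reference, the same approach as a self-contained script with the example frame spelled out:

import pandas as pd

df = pd.DataFrame({'A': [5, 4], 'B': [9, 1], 'C': [8, 2], 'D': [2, 3]})
top_n = 2

# For each row (a column of df.T), keep the labels of its top_n values.
result = pd.DataFrame({n: df.T[col].nlargest(top_n).index.tolist()
                       for n, col in enumerate(df.T)}).T
print(result)
#    0  1
# 0  B  C
# 1  A  D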
I decided to go with an alternative way: Apply the pd.Series.nlargest() function to each row.
Path to Solution
>>> df.apply(pd.Series.nlargest, axis=1, n=2)
A B C D
0 NaN 9.0 8.0 NaN
1 4.0 NaN NaN 3.0
This gives us the highest values for each row, but it keeps the original columns, resulting in ugly NaN values wherever a column is not part of the top n for that row. What we actually want is the index of the nlargest() result.
>>> df.apply(lambda s, n: s.nlargest(n).index, axis=1, n=2)
0 Index(['B', 'C'], dtype='object')
1 Index(['A', 'D'], dtype='object')
dtype: object
Almost there. The only thing left is to convert the Index objects into Series.
Solution
df.apply(lambda s, n: pd.Series(s.nlargest(n).index), axis=1, n=2)
0 1
0 B C
1 A D
Note that I'm not using the Index.to_series() function since I do not want to preserve the original index.
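If the frame is large, a vectorized sketch with numpy's argsort avoids the per-row apply entirely; this is a different technique from the answers above and assumes purely numeric data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [5, 4], 'B': [9, 1], 'C': [8, 2], 'D': [2, 3]})
n = 2

# argsort the negated values so the largest come first, then look up the column labels.
order = np.argsort(-df.values, axis=1)[:, :n]
print(pd.DataFrame(df.columns.values[order], index=df.index))
#    0  1
# 0  B  C
# 1  A  D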

set multiple Pandas DataFrame columns to values in a single column or multiple scalar values at the same time

I'm trying to set multiple new columns to one column and, separately, multiple new columns to multiple scalar values. Can't do either. Any way to do it other than setting each one individually?
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['A', 'B'], data=np.arange(6).reshape(3, 2))
df.loc[:,['C','D']]=df['A']
df.loc[:,['C','D']]=[0,1]
for c in ['C', 'D']:
    df[c] = df['A']
df['C'] = 0
df['D'] = 1
Maybe this is what you are looking for.
df=pd.DataFrame(columns=['A','B'],data=np.arange(6).reshape(3,2))
df['C'], df['D'] = df['A'], df['A']
df['E'], df['F'] = 0, 1
# Result
A B C D E F
0 0 1 0 0 0 1
1 2 3 2 2 0 1
2 4 5 4 4 0 1
The assign method will create multiple new columns in one step. You can pass a dict of column names and values to return a new DataFrame with the new columns appended at the end.
Using your examples:
df = df.assign(**{'C': df['A'], 'D': df['A']})
and
df = df.assign(**{'C': 0, 'D':1})
See this answer for additional detail: https://stackoverflow.com/a/46587717/4843561
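As a quick usage check, here is the assign approach run end to end (E and F are used for the scalar columns so both steps fit in one frame, an adjustment made only for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['A', 'B'], data=np.arange(6).reshape(3, 2))

# assign returns a new DataFrame; df is only updated because we rebind the name.
df = df.assign(**{'C': df['A'], 'D': df['A']})
df = df.assign(**{'E': 0, 'F': 1})
print(df)
#    A  B  C  D  E  F
# 0  0  1  0  0  0  1
# 1  2  3  2  2  0  1
# 2  4  5  4  4  0  1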
