pandas: sort each column individually - python

My dataframe looks something like this, only much larger.
d = {'Col_1' : pd.Series(['A', 'B']),
'Col_2' : pd.Series(['B', 'A', 'C']),
'Col_3' : pd.Series(['B', 'A']),
'Col_4' : pd.Series(['C', 'A', 'B', 'D']),
'Col_5' : pd.Series(['A', 'C']),}
df = pd.DataFrame(d)
Col_1 Col_2 Col_3 Col_4 Col_5
A B B C A
B A A A C
NaN C NaN B NaN
NaN NaN NaN D NaN
First, I'm trying to sort each column individually. I've tried playing around with something like df.sort([lambda x: x in df.columns], axis=1, ascending=True, inplace=True), but have only ended up with errors. How do I sort each column individually to end up with something like:
Col_1 Col_2 Col_3 Col_4 Col_5
A A A A A
B B B B C
NaN C NaN C NaN
NaN NaN NaN D NaN
Second, I'm looking to concatenate the row values within each column:
df = pd.concat([df,pd.DataFrame(df.sum(axis=0),columns=['Concatenation']).T])
I can combine everything with the line above after replacing np.nan with '', but the result comes out smashed together ('AB') and would require an additional step to clean it into something like 'A:B'.

pandas.Series.order has been deprecated since pandas 0.17. Instead, use sort_values as follows:
for col in df:
    df[col] = df[col].sort_values(ignore_index=True)
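For the second part of the question (getting an 'A:B'-style row rather than a smashed 'AB'), one sketch building on a sorted frame like the one above is to join the non-null values of each column and append the result as an extra row:
# sort each column independently; NaN sinks to the bottom
df = df.apply(lambda col: col.sort_values(ignore_index=True))
# join the non-null values of each column with ':'
concat_row = df.apply(lambda col: ':'.join(col.dropna()))
concat_row.name = 'Concatenation'
# append the joined values as an extra row
df = pd.concat([df, concat_row.to_frame().T])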

Here is one way:
>>> pandas.concat([df[col].order().reset_index(drop=True) for col in df], axis=1, ignore_index=True)
   0  1  2  3  4
0 A A A A A
1 B B B B C
2 NaN C NaN C NaN
3 NaN NaN NaN D NaN
[4 rows x 5 columns]
However, what you're doing is somewhat strange. DataFrames aren't just collections of unrelated columns. In a DataFrame, each row represents a record, so the value in one column is semantically linked to the values in other columns in that same row. By sorting the columns independently, you're discarding this information, so the rows are now meaningless. That's why the reset_index is needed in my example. Also, because of this, there's no way to do this in-place, which your example suggests you want.

I don't know if this is any better, but here are a couple of other ways to do it (these are Python 2 snippets: dict.iteritems no longer exists in Python 3, and sorted there cannot compare NaN floats with strings).
pd.DataFrame({key: sorted(value.values(), reverse=True)
              for key, value in df.to_dict().iteritems()})
pd.DataFrame({key: sorted(values, reverse=True)
              for key, values in df.transpose().iterrows()})
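For what it's worth, a rough Python 3 adaptation of the same idea (items instead of iteritems, sorting ascending and letting shorter columns pad out with NaN) could be:
# wrap each sorted list in a Series so columns of unequal length are NaN-padded
pd.DataFrame({key: pd.Series(sorted(values.dropna()))
              for key, values in df.items()})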

Related

How to retain null/nan in one of the groupby columns while performing df.groupby

Let's say I have a dataframe that looks like this:
group_cols = ['Group1', 'Group2', 'Group3']
df = pd.DataFrame([['A', 'B', 'C', 54.34],
['A', 'B', np.nan, 61.34],
['B', 'A', 'C', 514.5],
['B', 'A', 'A', 765.4],
['A', 'B', 'D', 765.4]],
columns=(group_cols+['Value']))
Group1 Group2 Group3 Value
A B C 54.34
A B nan 61.34
B A C 514.5
B A A 765.4
A B D 765.4
When I do a group by on these 3 columns, the row containing nan gets deleted/dropped. Ideally, I would want the combination (A, B and nan in this case) to be retained, so a separate row should appear in my output; however, it gets dropped.
df2 = df.groupby(['Group1', 'Group2', 'Group3'],as_index=False).sum()
Group1 Group2 Group3 Value
A B C 54.34
A B D 765.4
B A A 765.4
B A C 514.5
As a workaround, I can fillna with some value and then do the group by so that the row shows up, however that does not feel like an ideal solution. How can I retain the nan row?
Here is one way: fillna before the groupby, since groupby will automatically remove the NaN keys:
df.fillna('NaN',inplace=True)
df2 = df.groupby(['Group1', 'Group2', 'Group3'],as_index=False).sum()
df2
Group1 Group2 Group3 Value
0 A B C 54.34
1 A B D 765.40
2 A B NaN 61.34
3 B A A 765.40
4 B A C 514.50
From the docs: http://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
> NA and NaT group handling: If there are any NaN or NaT values in the grouping key, these will be automatically excluded. In other words, there will never be an “NA group” or “NaT group”. This was not the case in older versions of pandas, but users were generally discarding the NA group anyway (and supporting it was an implementation headache).
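For reference, pandas 1.1 and later also add a dropna parameter to groupby, so on a recent enough version the NaN key can be kept without the fillna workaround:
# keep NaN grouping keys (requires pandas >= 1.1)
df2 = df.groupby(['Group1', 'Group2', 'Group3'], as_index=False, dropna=False).sum()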

How to modify the values in a dataframe based on the values from another dataframe in an efficient way?

I have 2 dataframes like so:
import pandas as pd
data1 = {'Col1':['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
'Col2':[3.409836, 2.930693, 2.75, 3.140845, 2.971429, 2.592593, 2.6, 3.1875, 2.857143, 0.714286]}
df1 = pd.DataFrame(data1, columns=['Col1', 'Col2'])
data2 = {'Col1':['B', 'F', 'I'],
'Col2':[23.45, 32.57, 19.85]}
df2 = pd.DataFrame(data2, columns=['Col1', 'Col2'])
I want to modify the values of Col2 in df1 with the values from df2. This is my code to do that:
for i in range(len(df2)):
    for j in range(len(df1)):
        if df2['Col1'][i] == df1['Col1'][j]:
            df1['Col2'][j] = df2['Col2'][i]
The code works. But the problem is that this code would be slow for large dataframes, as it has complexity O(len(df1)*len(df2)). How do I merge the 2 dataframes in a faster, more efficient way?
I tried merging the dataframes using outer join, but it does not produce the correct result - it keeps both values:
pd.merge(df1, df2, how='outer')
An inner join produces a blank dataframe, a left join produces the same dataframe as df1, and a right join produces the same dataframe as df2.
If working with only one column, use map:
df1['Col2'] = df1['Col1'].map(df2.set_index('Col1')['Col2']).fillna(df1['Col2'])
print (df1)
Col1 Col2
0 A 3.409836
1 B 23.450000
2 C 2.750000
3 D 3.140845
4 E 2.971429
5 F 32.570000
6 G 2.600000
7 H 3.187500
8 I 19.850000
9 J 0.714286
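The Series passed to map acts as a lookup table keyed by its index, which is what makes the replacement line up on Col1:
lookup = df2.set_index('Col1')['Col2']
print (lookup)
Col1
B    23.45
F    32.57
I    19.85
Name: Col2, dtype: float64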
If multiple columns are possible, use merge with a left join and the specified column Col1:
cols = df1.columns.difference(['Col1'])
orig_cols = [f'{x}_' for x in cols]
df = pd.merge(df1, df2, how='left', on='Col1', suffixes=('_',''))
print (df)
Col1 Col2_ Col2
0 A 3.409836 NaN
1 B 2.930693 23.45
2 C 2.750000 NaN
3 D 3.140845 NaN
4 E 2.971429 NaN
5 F 2.592593 32.57
6 G 2.600000 NaN
7 H 3.187500 NaN
8 I 2.857143 19.85
9 J 0.714286 NaN
Then replace the missing values of the added columns with the original columns and finally remove them:
df[cols] = df[cols].fillna(df[orig_cols].rename(columns=lambda x: x.strip('_')))
df = df.drop(orig_cols, axis=1)
print (df)
Col1 Col2
0 A 3.409836
1 B 23.450000
2 C 2.750000
3 D 3.140845
4 E 2.971429
5 F 32.570000
6 G 2.600000
7 H 3.187500
8 I 19.850000
9 J 0.714286
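Another option along the same lines (a sketch, not benchmarked) is DataFrame.update, which aligns on the index and overwrites only with non-NA values from the other frame:
df1 = df1.set_index('Col1')
df1.update(df2.set_index('Col1'))   # in-place; only non-NA values from df2 are used
df1 = df1.reset_index()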
Try this code (note: it assumes df2's value column has been renamed to Col3 before merging, so that the outer merge keeps both value columns):
df3 = pd.merge(df1, df2, how='outer')
df4 = df3[df3.Col3.isnull()]
df5 = df3[df3.Col3.notnull()]
df5.Col2 = df5.Col3
df6 = df4.append(df5)
df6 = df6.drop('Col3', axis=1)
df6 is the output you are looking for.

Selecting a subset using dropna() to select multiple columns

I have the following DataFrame:
df = pd.DataFrame([[1,2,3,3],[10,20,2,],[10,2,5,],[1,3],[2]],columns = ['a','b','c','d'])
From this DataFrame, I want to drop the rows where all values in the subset ['b', 'c', 'd'] are NA, which means the last row should be dropped.
The following code works:
df.dropna(subset=['b', 'c', 'd'], how = 'all')
However, considering that I will be working with larger data frames, I would like to select the same subset using the range ['b':'d']. How do I select this subset?
IIUC, use loc, retrieve those columns, and pass that to dropna.
c = df.loc[0, 'b':'d'].index # retrieve only the 0th row for efficiency
df = df.dropna(subset=c, how='all')
print(df)
a b c d
0 1 2.0 3.0 3.0
1 10 20.0 2.0 NaN
2 10 2.0 5.0 NaN
3 1 3.0 NaN NaN
Similar to @ayhan's idea - using df.columns.slice_indexer:
In [25]: cols = df.columns[df.columns.slice_indexer('b','d')]
In [26]: cols
Out[26]: Index(['b', 'c', 'd'], dtype='object')
In [27]: df.dropna(subset=cols, how='all')
Out[27]:
a b c d
0 1 2.0 3.0 3.0
1 10 20.0 2.0 NaN
2 10 2.0 5.0 NaN
3 1 3.0 NaN NaN
You could also slice the column list numerically:
c = df.columns[1:4]
df = df.dropna(subset=c, how='all')
If using numbers is impractical (i.e. too many to count), there is a somewhat cumbersome work-around:
start, stop = df.columns.get_loc('b'), df.columns.get_loc('d')
c = df.columns[start:stop+1]
df = df.dropna(subset=c, how='all')
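These can also be collapsed into a single expression by slicing the columns with loc directly inside the dropna call, e.g. a minimal sketch:
df = df.dropna(subset=df.loc[:, 'b':'d'].columns, how='all')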

Pandas add keys while concatenating dataframes at column level

As per the pandas 0.19.2 documentation, I can provide a keys argument to create a resulting multi-index DataFrame. An example (from the pandas docs) is:
result = pd.concat(frames, keys=['x', 'y', 'z'])
How would I concatenate the dataframes so that I can provide the keys at the column level instead of the index level?
What I basically need is something like this, where df1 and df2 are the frames to be concatenated.
This is supported by the keys parameter of pd.concat when specifying axis=1:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.random((4, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.random((4, 3)), columns=list('BDF'), index=[2, 3, 6, 7])
df = pd.concat([df1, df2], keys=['X', 'Y'], axis=1)
The resulting output:
X Y
A B C D B D F
0 0.654406 0.495906 0.601100 0.309276 NaN NaN NaN
1 0.020527 0.814065 0.907590 0.924307 NaN NaN NaN
2 0.239598 0.089270 0.033585 0.870829 0.882028 0.626650 0.622856
3 0.983942 0.103573 0.370121 0.070442 0.986487 0.848203 0.089874
6 NaN NaN NaN NaN 0.664507 0.319789 0.868133
7 NaN NaN NaN NaN 0.341145 0.308469 0.884074
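With the keys in place, each original frame can be pulled back out by selecting its top-level column label:
df['X']   # the columns that came from df1
df['Y']   # the columns that came from df2 (NaN where the indexes did not overlap)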

How to divide two dataframes with different lengths and duplicated indexes in Python

Here is my code and the expected output. The division of the dataframes does not work as I expect; what is wrong here?
import pandas as pd
data1 = {'name':['A', 'C', 'D'], 'cond_a':['B','B','B'], 'value':[10,12,14]}
data2 = {'name':['A', 'C', 'D','D','A'], 'cond_a':['G','G','G','G','G'], 'value':[5,6,7,3,2]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df1.set_index('name', inplace=True)
df2.set_index('name', inplace=True)
df2['new_col'] = df2['value'] / df1['value']
expected output:
cond_a value new_col
name
A G 5 5/10
C G 6 6/12
D G 7 7/14
D G 3 3/14
A G 2 2/10
As long as df1 has a unique index, you can reindex it on df2 when performing the division:
df2['new_col'] = df2['value'] / df1['value'].reindex(df2.index)
The resulting output:
cond_a value new_col
name
A G 5 0.500000
C G 6 0.500000
D G 7 0.500000
D G 3 0.214286
A G 2 0.200000
What doesn't work in your case is not DataFrame division, which you can easily check:
df2['value'] / df1['value']
Out[]:
name
A 0.500000
A 0.200000
C 0.500000
D 0.500000
D 0.214286
Name: value, dtype: float64
The problem is that in the process of this division pandas loses track of the original row order of the index name. Then, when you try to assign the result back to df2, there are duplicates in the index name and pandas doesn't know how to align them, because it is an ambiguous situation. In general, having duplicates in your index is not a good idea. Get rid of the duplicates and your code will work.
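To see why the reindex in the accepted approach avoids this, it helps to inspect the intermediate Series: it repeats df1's values in df2's row order, so the subsequent division lines up element-wise:
print(df1['value'].reindex(df2.index))
name
A    10
C    12
D    14
D    14
A    10
Name: value, dtype: int64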
