Sort dataframe by multiple columns while ignoring case - python

I want to sort a dataframe by multiple columns like this:
df.sort_values( by=[ 'A', 'B', 'C', 'D', 'E' ], inplace=True )
However, I found out that Python sorts the uppercase values first and then the lowercase ones.
I tried this:
df.sort_values( by=[ 'A', 'B', 'C', 'D', 'E' ], inplace=True, key=lambda x: x.str.lower() )
but I get this error:
TypeError: sort_values() got an unexpected keyword argument 'key'
If I could, I would convert all columns to lowercase, but I want to keep them as they are.
Any hints?

According to the docs for DataFrame.sort_values, the key parameter was added in pandas 1.1.0, so you need to upgrade pandas to at least that version for this to work:
key - callable, optional
Apply the key function to the values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect a Series and return a Series with the same shape as the input. It will be applied to each column in by independently.
New in version 1.1.0.
Sample:
df = pd.DataFrame({
'A':list('MmMJJj'),
'B':list('aYAbCc')
})
df.sort_values(by=[ 'A', 'B'], inplace=True, key=lambda x: x.str.lower())
print (df)
A B
3 J b
4 J C
5 j c
0 M a
2 M A
1 m Y
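If upgrading pandas is not an option, one possible workaround is to sort a lowercased copy and reuse its index order. This is only a sketch, not part of the answer above; it assumes every sort column holds strings and that the index is unique.

```python
import pandas as pd

df = pd.DataFrame({
    'A': list('MmMJJj'),
    'B': list('aYAbCc'),
})

# Lowercase every column in a temporary copy; the original df is untouched.
lowered = df.apply(lambda col: col.str.lower())

# Sort the copy, then reorder the original rows by the resulting index.
order = lowered.sort_values(by=['A', 'B']).index
result = df.loc[order]
print(result)
```

The original casing is preserved because only the temporary copy is lowercased.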

Related

How to use a 'for-loop' and column addition to produce a column in a dataframe?

I'm quite new to python and so would really appreciate some help.
I've simplified my dataframes as I'm working on large ones.
My question is what would the code be to produce a new column in df1 so that it looks like 'Merged' in df2 - ie it is made up of:
the 'Letter' column value
a 'for loop' that either includes an underscore and 'Number' value if it exists, or skips this step if there isn't a value (such as the final row)
an underscore and 'Capital' column value
data1 = {'Letter': ['a', 'b', 'c'],
'Number': ['1', '2', ''],
'Capital': ['A', 'B', 'C']}
df1 = pd.DataFrame (data1, columns = ['Letter', 'Number', 'Capital'])
print(df1)
data2 = {'Letter': ['a', 'b', 'c'],
'Number': ['1', '2', ''],
'Capital': ['A', 'B', 'C'],
'Merged': ['a_1_A', 'b_2_B', 'c_C']}
df2 = pd.DataFrame (data2, columns = ['Letter', 'Number', 'Capital', 'Merged'])
print(df2)
Sorry, I can't figure out how to run this code but hopefully that makes sense. I understand how to add columns (below) but can't figure out how to incorporate a for loop. My best guess is:
df1["merged"] = (df1["Letter"] +
for value in data1:
if data1["Number"] != "":
"_" + data1["Number"]
else:
+ "_" + df1["Capital"])
You can define your logic in a separate function and apply it to each row.
To skip the empty fields, pass the row through the built-in filter() function.
def func(row):
    # join only the non-empty values of the row with underscores
    row['merged'] = '_'.join(filter(None, row))
    return row
df1 = df1.apply(func, axis=1)
df1
Result:
Letter Number Capital merged
0 a 1 A a_1_A
1 b 2 B b_2_B
2 c C c_C
Or you can just use a lambda function:
df1['merged'] = df1.apply(lambda row: '_'.join(filter(None, row)), axis=1)
df1
Result:
Letter Number Capital merged
0 a 1 A a_1_A
1 b 2 B b_2_B
2 c C c_C
(Almost always, there is more than one way in Pandas to achieve the same result - which can be both confusing and amazing!)
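As one more variant, the same column can probably be built without a row-wise apply by concatenating the columns directly. The sketch below assumes all three columns hold strings and that a missing 'Number' is stored as an empty string, as in the question; number_part is a name made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Letter': ['a', 'b', 'c'],
    'Number': ['1', '2', ''],
    'Capital': ['A', 'B', 'C'],
})

# '_' + Number where Number is non-empty, otherwise an empty string.
number_part = ('_' + df1['Number']).where(df1['Number'] != '', '')

# Plain string concatenation works element-wise on Series.
df1['Merged'] = df1['Letter'] + number_part + '_' + df1['Capital']
print(df1)
```

Unlike joining the whole row, this names the columns explicitly, so extra columns added later are not pulled into the merged value.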

How to drop rows based on column value if column is not set as index in pandas?

I have a list and a dataframe which look like this:
list = ['a', 'b']
df = pd.DataFrame({'A':['a', 'b', 'c', 'd'], 'B':[9, 9, 8, 4]})
I would like to do something like this:
df1 = df.drop([x for x in list])
I am getting the following error message:
"KeyError: "['a' 'b'] not found in axis""
I know I can do the following:
df = pd.DataFrame({'A':['a', 'b', 'c', 'd'], 'B':[9, 9, 8, 4]}).set_index('A')
df1 = df.drop([x for x in list])
How can I drop the list values without having to set column 'A' as index? My dataframe has multiple columns.
Input:
A B
0 a 9
1 b 9
2 c 8
3 d 4
Code:
for i in list:
    ind = df[df['A']==i].index.tolist()
    df = df.drop(ind)
df
Output:
A B
2 c 8
3 d 4
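The loop above can also be written as a single boolean mask with Series.isin, which avoids both the loop and the need to set an index. A minimal sketch, using to_drop as a hypothetical variable name so the built-in list is not shadowed:

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c', 'd'], 'B': [9, 9, 8, 4]})
to_drop = ['a', 'b']  # hypothetical name, avoids shadowing the built-in list

# Keep only the rows whose 'A' value is NOT in to_drop.
df1 = df[~df['A'].isin(to_drop)]
print(df1)
```

The original index labels (2 and 3 here) are kept; call reset_index(drop=True) afterwards if a fresh index is wanted.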
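df.drop looks up labels on the given axis, and by default (axis=0) that is the index, which is why 'a' and 'b' are not found. If you actually want to drop columns, you need to specify axis=1.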
According to the docs:
axis{0 or ‘index’, 1 or ‘columns’}, default 0. Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
You also need to make sure you are using the correct column names (i.e. uppercase rather than lowercase). You should also avoid shadowing Python built-in names, so don't call your variable list.
This drops the columns 'A' and 'B':
myList = ['A', 'B']
df1 = df.drop([x for x in myList], axis=1)

How to combine values in a column by group?

I have this dataframe as input read from excel:
Name Data
0 A green
1 A blue
2 A yellow
3 A green
4 B green
5 B green
6 B red
7 C violet
8 C red
9 C white
Desired output:
Name Data
0 A blue;green;yellow
1 B green;red
2 C red;violet;white
I tried the following, both gave errors
pivot_df = df.pivot(index = df.columns[0], columns = df.columns[1]) ## Gives --> ValueError: Index contains duplicate entries, cannot reshape
pivot_table_df = df.pivot_table(index = df.columns[0], columns = df.columns[1]) ## gives --> pandas.core.base.DataError: No numeric types to aggregate
A simple way to do it (note that a set is unordered, so the values within each group are not sorted) is:
df.groupby(['Name'])['Data'].apply(set).apply(';'.join).reset_index()
Name Data
0 A yellow;green;blue
1 B red;green
2 C red;violet;white
# Convert the values to strings; you can also use .astype(str).
df["Data"] = df["Data"].map(str)
# Group by name. as_index=False keeps "Name" as a regular column instead of the index. ["Data"].apply(set) would also work, but aggregate scales better if other columns are added later on.
df = df.groupby(["Name"], as_index=False).aggregate({"Data": set})
# df["Data"] now holds one set per group. Sort each set and join it into a single string with the delimiter ";". .map is not vectorized, so this part is probably up for improvement.
df["Data"] = df["Data"].map(lambda x: ";".join(sorted(x)))
Since the 'Data' column can contain numbers (as stated in the comments), it's better to set the dtype to str first, because .join is a str method, and it's more efficient than calling map inside the lambda function (e.g. map(str, set(x))).
Use .groupby on 'Name', select 'Data', and .apply a function:
lambda x: ';'.join(sorted(set(x)))
Each group can contain repeated values, so use set, which keeps only unique values.
Use sorted if you want the result in order; otherwise replace sorted(set(x)) with set(x).
import pandas as pd
# test data
data = {'Name': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'A', 'B', 'C'],
'Data': ['green', 'blue', 'yellow', 'green', 'green', 'green', 'red', 'violet', 'red', 'white', 3, 3, 3]}
# create dataframe
df = pd.DataFrame(data)
# convert the Data column to str type
df.Data = df.Data.astype('str')
# groupby name and apply the function
dfg = df.groupby('Name', as_index=False)['Data'].apply(lambda x: ';'.join(sorted(set(x))))
# display(dfg)
Name Data
0 A 3;blue;green;yellow
1 B 3;green;red
2 C 3;red;violet;white
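For comparison, the deduplication and ordering can also be done with DataFrame methods before grouping, so the aggregation itself is just a join. This is a sketch under the assumption that 'Name' and 'Data' are the only columns; it uses the data from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Data': ['green', 'blue', 'yellow', 'green', 'green', 'green',
             'red', 'violet', 'red', 'white'],
})

out = (
    df.astype({'Data': str})          # make sure every value is a string
      .drop_duplicates()              # one row per (Name, Data) pair
      .sort_values(['Name', 'Data'])  # order the values within each group
      .groupby('Name')['Data']
      .agg(';'.join)                  # concatenate each group's values
      .reset_index()
)
print(out)
```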

Apply a specific function for two columns in Pandas DataFrame

I have a Pandas DataFrame with two columns, each row contains a list of elements. I'm trying to find set difference between two columns for each row using pandas.apply method.
My df for example
A B
0 ['a','b','c'] ['a']
1 ['e', 'f', 'g'] ['f', 'g']
So it should look like this:
df.apply(set_diff_func, axis=1)
What I'm trying to achieve:
0 ['b','c']
1 ['e']
I can do it using iterrows, but I once read that it's better to use apply when possible.
How about
df.apply(lambda row: list(set(row['A']) - set(row['B'])), axis = 1)
or
(df['A'].apply(set) - df['B'].apply(set)).apply(list)
Here's the function you need; you can choose the columns with the col1 and col2 arguments by passing them through the args option of apply:
def set_diff_func(row, col1, col2):
    # elements of row[col1] that are not in row[col2]; set order is not guaranteed
    return list(set(row[col1]).difference(set(row[col2])))
This should return the required result:
>>> dataset = pd.DataFrame(
[{'A':['a','b','c'], 'B':['a']},
{'A':['e', 'f', 'g'] , 'B':['f', 'g']}])
>>> dataset.apply(set_diff_func, axis=1, args=('A', 'B'))
0 [c, b]
1 [e]
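Because sets are unordered, the answers above may return the remaining elements in any order (for example [c, b]). If the original order matters, a list comprehension is one option. The helper name ordered_diff is made up for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [['a', 'b', 'c'], ['e', 'f', 'g']],
    'B': [['a'], ['f', 'g']],
})

def ordered_diff(a, b):
    """Return the elements of a that are not in b, in their original order."""
    exclude = set(b)  # set lookup keeps the membership test fast
    return [x for x in a if x not in exclude]

# Iterate over both columns in parallel instead of using a row-wise apply.
diff = pd.Series(
    [ordered_diff(a, b) for a, b in zip(df['A'], df['B'])],
    index=df.index,
)
print(diff)
```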

Finding the Location of the Duplicate for Duplicated Columns in Pandas

I know I can find duplicate columns using:
df.T.duplicated()
What I'd like to know is which column a duplicate column is a duplicate of. For example, both C and D are duplicates of A below:
df = pd.DataFrame([[1,0,1,1], [2,0,2,2]], columns=['A', 'B', 'C', 'D'])
A B C D
0 1 0 1 1
1 2 0 2 2
I'd like something like:
duplicate_index = pd.Series([None, None, 'A', 'A'], ['A', 'B', 'C', 'D'])
I don't know if duplicated has an option to give information about the first row with the same data. My idea is to use groupby and transform, such as:
arr_first = (df.T.reset_index().groupby([col for col in df.T.columns])['index']
.transform(lambda x: x.iloc[0]).values)
With your example, arr_first is then equal to array(['A', 'B', 'A', 'A'], dtype=object), and because it has the same order as df.columns, you can get the expected output with np.where (pd.np was deprecated in pandas 1.0 and removed in 2.0, so import numpy directly):
import numpy as np
duplicate_index = pd.Series(np.where(arr_first != df.columns, arr_first, None), df.columns)
and the result for duplicate_index is
A None
B None
C A
D A
dtype: object
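A more explicit, if less compact, way to get the same result is to compare each column with the columns before it using Series.equals. This is a sketch that assumes unique column names; for wide frames the pairwise comparison may be slow.

```python
import pandas as pd

df = pd.DataFrame([[1, 0, 1, 1], [2, 0, 2, 2]], columns=['A', 'B', 'C', 'D'])

first_seen = {}
for col in df.columns:
    match = None
    # Only look at columns that come before the current one.
    for earlier in df.columns:
        if earlier == col:
            break
        if df[earlier].equals(df[col]):
            match = earlier  # first earlier column with identical values
            break
    first_seen[col] = match

duplicate_index = pd.Series(first_seen, dtype=object)
print(duplicate_index)
```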
Another more direct way to test whether two numeric columns are duplicates of each other is to check the correlation matrix, which tests all pairs of columns. Here is the code:
import pandas as pd
df = pd.DataFrame([[1,0,1,1], [2,0,2,2]], columns=['A', 'B', 'C', 'D'])
# compute the correlation matrix
cm = df.corr()
cm
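This shows the correlation of every column with every other column (including itself). If a column is identical to another column, the value is 1.0. Note that any perfectly linearly related column (for example 2 * A) also gives 1.0, and a constant column such as B gives NaN, so treat this as a screen for candidates rather than proof of duplication.
To find all columns that are duplicates of A: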
cm['A']
A 1.0
B NaN
C 1.0
D 1.0
If you have categorical (string) columns rather than numeric ones, you could make a cross correlation table.
Hope this helps!
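One caveat with the correlation approach: a correlation of 1.0 shows a perfect linear relationship, not equality. The sketch below uses a made-up column E (twice A) to illustrate this, and confirms the candidates with an exact comparison.

```python
import pandas as pd

# 'C' is a true duplicate of 'A'; 'E' is 2 * A (hypothetical example column).
df = pd.DataFrame({'A': [1, 2, 3], 'C': [1, 2, 3], 'E': [2, 4, 6]})

corr_with_a = df.corr()['A']

# Columns whose correlation with 'A' is (numerically) 1.0 are only candidates.
candidates = [c for c in df.columns
              if c != 'A' and abs(corr_with_a[c] - 1.0) < 1e-9]

# Confirm each candidate with an exact element-wise comparison.
duplicates = [c for c in candidates if df[c].equals(df['A'])]

print(candidates)
print(duplicates)
```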
