I have the following Pandas DataFrame, which I use to compare the performance of different classifiers over multiple iterations. After each iteration, I save the ranking of each classifier to a DataFrame that holds the cumulative sum of rankings over all iterations (the index of the DataFrame gives the ranking from 0-3, i.e., there are 4 classifiers in total and 0 is the best).
The DataFrame looks as follows:
rankings = {'Classifier1': ['1', '2', '1', '0'],
            'Classifier2': ['2', '1', '1', '0'],
            'Classifier3': ['0', '1', '1', '2'],
            'Classifier4': ['1', '0', '1', '2']}
df = pd.DataFrame(data=rankings)
which formats as:
   Classifier1  Classifier2  Classifier3  Classifier4
0            1            2            0            1
1            2            1            1            0
2            1            1            1            1
3            0            0            2            2
I would like to create the following boxplot (as in this paper) of the different classifiers using Seaborn or an alternative method:
First, we need to convert your data to numeric values rather than strings. Then we melt the DataFrame into long format, and finally we draw a boxplot with a swarmplot on top:
import seaborn as sns

df = df.apply(pd.to_numeric).melt(var_name='Classifier', value_name='AUC Rank')
ax = sns.boxplot(data=df, x='Classifier', y='AUC Rank')
ax = sns.swarmplot(data=df, x='Classifier', y='AUC Rank', color='black')
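To see the intermediate long format that the melt step produces (the shape Seaborn's boxplot expects), here is the reshaping on its own, without the plotting calls; the variable name long_df is mine:

```python
import pandas as pd

rankings = {'Classifier1': ['1', '2', '1', '0'],
            'Classifier2': ['2', '1', '1', '0'],
            'Classifier3': ['0', '1', '1', '2'],
            'Classifier4': ['1', '0', '1', '2']}
df = pd.DataFrame(data=rankings)

# Strings -> numbers, then wide -> long: one (Classifier, AUC Rank) pair per row
long_df = df.apply(pd.to_numeric).melt(var_name='Classifier', value_name='AUC Rank')
print(long_df.head())
```

Each of the 4 classifiers contributes 4 rows, so the long frame has 16 rows and two columns.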
I want to print, for two dataframes, the rows where there is a mismatch in a given column, here the "second_column".
"first_column" is a key value that identifies the same product in both dataframes.
import pandas as pd
data1 = {
    'first_column': ['id1', 'id2', 'id3'],
    'second_column': ['1', '2', '2'],
    'third_column': ['1', '2', '2'],
    'fourth_column': ['1', '2', '2']
}
df1 = pd.DataFrame(data1)
print(df1)
test = df1['second_column'].nunique()
data2 = {
    'first_column': ['id1', 'id2', 'id3'],
    'second_column': ['3', '4', '2'],
    'third_column': ['1', '2', '2'],
    'fourth_column': ['1', '2', '2']
}
df2 = pd.DataFrame(data2)
print(df2)
expected output:
If I understand correctly (IIUC):
(By the way, your screenshots don't match your DataFrame definition.)
df1.loc[~df1['second_column'].isin(df2['second_column'])]
  first_column second_column third_column fourth_column
0          id1             1            1             1
df2.loc[~df2['second_column'].isin(df1['second_column'])]
  first_column second_column third_column fourth_column
0          id1             3            1             1
1          id2             4            2             2
The compare method can do what you want:
different_rows = df1.compare(df2, align_axis=1).index
df1.loc[different_rows]
One important caveat with this method: compare requires both DataFrames to have identical labels and shape, so if one frame has extra rows it will raise an error rather than report them as differences.
Or, if you want to find differences in one column only, you can first join on the index and then check whether the joined columns match:
joined_df = df1.join(df2['second_column'], rsuffix='_df2')
diff = joined_df['second_column']!=joined_df['second_column_df2']
print(joined_df.loc[diff, df1.columns])
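Putting the isin and join approaches together with the question's data gives a quick end-to-end check (a sketch; the variable names only_in_df1 and mismatched are mine):

```python
import pandas as pd

df1 = pd.DataFrame({'first_column': ['id1', 'id2', 'id3'],
                    'second_column': ['1', '2', '2'],
                    'third_column': ['1', '2', '2'],
                    'fourth_column': ['1', '2', '2']})
df2 = pd.DataFrame({'first_column': ['id1', 'id2', 'id3'],
                    'second_column': ['3', '4', '2'],
                    'third_column': ['1', '2', '2'],
                    'fourth_column': ['1', '2', '2']})

# Rows of df1 whose second_column value never appears anywhere in df2
only_in_df1 = df1.loc[~df1['second_column'].isin(df2['second_column'])]

# Row-by-row mismatch on the shared index (the join approach)
joined = df1.join(df2['second_column'], rsuffix='_df2')
diff = joined['second_column'] != joined['second_column_df2']
mismatched = joined.loc[diff, df1.columns]
```

Note the two approaches answer slightly different questions: isin ignores row position entirely, while the join compares the two frames row by row on the key.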
I have the following dataframe with multiple dictionaries in a list in the Rules column.
SetID SetName Rules
0 Standard_1 [{'RulesID': '10', 'RuleName': 'name_abc'}, {'RulesID': '11', 'RuleName': 'name_xyz'}]
1 Standard_2 [{'RulesID': '12', 'RuleName': 'name_arg'}]
The desired output is:
SetID SetName RulesID RuleName
0 Standard_1 10 name_abc
0 Standard_1 11 name_xyz
1 Standard_2 12 name_arg
It might be possible that there are more than two dictionaries inside of the list.
I am thinking about a pop, explode or pivot function to build the dataframe but I have no clue how to start.
Any advice would be very appreciated!
EDIT: To build the dataframe you can use the following DataFrame constructor:
# initialize list of lists
data = [[0, 'Standard_1', [{'RulesID': '10', 'RuleName': 'name_abc'},
                           {'RulesID': '11', 'RuleName': 'name_xyz'}]],
        [1, 'Standard_2', [{'RulesID': '12', 'RuleName': 'name_arg'}]]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['SetID', 'SetName', 'Rules'])
You can use explode:
tmp = df.explode('Rules').reset_index(drop=True)
df = pd.concat([tmp, pd.json_normalize(tmp['Rules'])], axis=1).drop('Rules', axis=1)
Output:
>>> df
SetID SetName RulesID RuleName
0 0 Standard_1 10 name_abc
1 0 Standard_1 11 name_xyz
2 1 Standard_2 12 name_arg
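The whole answer runs end to end with the constructor from the question's edit (the variable name out is mine):

```python
import pandas as pd

data = [[0, 'Standard_1', [{'RulesID': '10', 'RuleName': 'name_abc'},
                           {'RulesID': '11', 'RuleName': 'name_xyz'}]],
        [1, 'Standard_2', [{'RulesID': '12', 'RuleName': 'name_arg'}]]]
df = pd.DataFrame(data, columns=['SetID', 'SetName', 'Rules'])

# One row per dict in the list, then expand each dict into its own columns
tmp = df.explode('Rules').reset_index(drop=True)
out = pd.concat([tmp, pd.json_normalize(tmp['Rules'])], axis=1).drop('Rules', axis=1)
```

This works for any number of dictionaries per list, since explode simply emits one row per element.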
One-liner version of the above (note the lambda's argument must be used inside the pipe, rather than the tmp variable from the multi-line version):
df.explode('Rules').reset_index(drop=True).pipe(lambda tmp: pd.concat([tmp, pd.json_normalize(tmp['Rules'])], axis=1)).drop('Rules', axis=1)
Given I have two DataFrames:
import pandas as pd
df1 = pd.DataFrame([['2017', '1'],
['2018', '2'],
['2019', '3'],
['2020', '4'],
['2021', '5'],
['2022', '6'],
], columns=['datetime', 'values'])
df2 = pd.DataFrame([['2018', '0'],
['2019', '0'],
['2020', '0'],
], columns=['datetime', 'values'])
print(df1)
print(df2)
(Assume the values in the 'datetime' column are of datetime type, not strings.)
How can I replace the values in df1 with the values of df2 where the datetime exists in both, without using loops?
You can use combine_first after temporarily setting the index to whatever you want to use as matching columns:
(df2.set_index('datetime')
.combine_first(df1.set_index('datetime'))
.reset_index()
)
output:
datetime values
0 2017 1
1 2018 0
2 2019 0
3 2020 0
4 2021 5
5 2022 6
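A quick, runnable check of the combine_first answer (string years here for brevity; the same pattern works with real datetimes):

```python
import pandas as pd

df1 = pd.DataFrame([['2017', '1'], ['2018', '2'], ['2019', '3'],
                    ['2020', '4'], ['2021', '5'], ['2022', '6']],
                   columns=['datetime', 'values'])
df2 = pd.DataFrame([['2018', '0'], ['2019', '0'], ['2020', '0']],
                   columns=['datetime', 'values'])

# df2 takes precedence wherever its index (datetime) also exists in df1;
# df1 fills in everything df2 is missing
out = (df2.set_index('datetime')
          .combine_first(df1.set_index('datetime'))
          .reset_index())
```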
Assuming I have a dataframe as follows:
df = pd.DataFrame({'ids': ['1', '1', '1', '1', '2', '2', '2', '3', '3'],
                   'values': ['5', '8', '7', '12', '2', '1', '3', '15', '4']},
                  dtype='int32')
ids  values
  1       5
  1       8
  1       7
  1      12
  2       2
  2       1
  2       3
  3      15
  3       4
What I would like to do is to loop over the values column and check which values are greater than 6 and the corresponding id from the ids column must be appended into an empty list.
Even if an id (say 3) has multiple values and out of those multiple values (4 and 15), only one value is greater than 6, I would like the corresponding id to be appended into the list.
Example:
Assuming we run a loop over the above mentioned dataframe df, I would like the output as follows:
more = [1, 3]
less = [2]
with more =[] and less = [] being pre-initialized empty lists
What I have so far:
I tried implementing the same, but surely I am making some mistake. The code I have:
less = []
more = []
for value in df['values']:
    for id in df['ids']:
        if (value > 6):
            more.append(id)
        else:
            less.append(id)
Use groupby and boolean indexing to create your lists. This will be much faster than looping:
g = df.groupby('ids')['values'].max()
mask = g.gt(6)
more = g[mask].index.tolist()
less = g[~mask].index.tolist()
print(more)
print(less)
[1, 3]
[2]
You can use boolean indexing to scrape out all the ids with a value greater than 6 and build a set of the unique ones:
setA = set(df[df['values'] > 6]['ids'])
This creates a set of all ids in the dataframe:
setB = set(df['ids'])
Now,
more = list(setA)
and for less, take the set difference:
less = list(setB.difference(setA))
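Both answers can be checked side by side on the question's data. One caveat with the set-based version: sets are unordered, so the resulting lists come back in arbitrary order and are sorted here before comparing (a sketch; variable names are mine):

```python
import pandas as pd

df = pd.DataFrame({'ids': ['1', '1', '1', '1', '2', '2', '2', '3', '3'],
                   'values': ['5', '8', '7', '12', '2', '1', '3', '15', '4']},
                  dtype='int32')

# groupby approach: an id goes to `more` if its maximum value exceeds 6
g = df.groupby('ids')['values'].max()
mask = g.gt(6)
more_g, less_g = g[mask].index.tolist(), g[~mask].index.tolist()

# set approach: ids with at least one value > 6, then the complement
set_a = set(df[df['values'] > 6]['ids'])
more_s = sorted(set_a)
less_s = sorted(set(df['ids']) - set_a)
```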
That's it!
The data I have to work with is a bit messy: it has header names inside of its data. How can I choose a row from an existing pandas dataframe and make it (rename it to) a column header?
I want to do something like:
header = df[df['old_header_name1'] == 'new_header_name1']
df.columns = header
In [21]: df = pd.DataFrame([(1,2,3), ('foo','bar','baz'), (4,5,6)])
In [22]: df
Out[22]:
0 1 2
0 1 2 3
1 foo bar baz
2 4 5 6
Set the column labels to equal the values in the 2nd row (index location 1):
In [23]: df.columns = df.iloc[1]
If the index has unique labels, you can drop the 2nd row using:
In [24]: df.drop(df.index[1])
Out[24]:
1 foo bar baz
0 1 2 3
2 4 5 6
If the index is not unique, you could use:
In [133]: df.iloc[pd.RangeIndex(len(df)).drop(1)]
Out[133]:
1 foo bar baz
0 1 2 3
2 4 5 6
Using df.drop(df.index[1]) removes all rows with the same label as the second row. Because non-unique indexes can lead to stumbling blocks (or potential bugs) like this, it's often better to take care that the index is unique (even though Pandas does not require it).
This works (pandas v0.19.2):
df.rename(columns=df.iloc[0])
It would be easier to recreate the data frame. Note, however, that df.values returns an object array for mixed-type frames, so the new columns come back as object dtype; use infer_objects() afterwards if you need the numeric types restored.
headers = df.iloc[0]
new_df = pd.DataFrame(df.values[1:], columns=headers)
To rename the header without reassigning df:
df.rename(columns=df.iloc[0], inplace=True)
To drop the row without reassigning df:
df.drop(df.index[0], inplace=True)
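The two in-place steps combine into one short runnable sketch (the example frame here is mine, with the header sitting in row 0):

```python
import pandas as pd

# Messy input: the real column names are stored in the first data row
df = pd.DataFrame([('foo', 'bar', 'baz'), (1, 2, 3), (4, 5, 6)])

# Promote row 0 to the header, then drop it from the data
df.rename(columns=df.iloc[0], inplace=True)
df.drop(df.index[0], inplace=True)
```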
You can specify the row index in the read_csv or read_html functions via the header parameter, which represents the "Row number(s) to use as the column names, and the start of the data". This has the advantage of automatically dropping all the preceding rows, which supposedly are junk.
import pandas as pd
from io import StringIO
In[1]
csv = '''junk1, junk2, junk3, junk4, junk5
junk1, junk2, junk3, junk4, junk5
pears, apples, lemons, plums, other
40, 50, 61, 72, 85
'''
df = pd.read_csv(StringIO(csv), header=2, skipinitialspace=True)
print(df)
Out[1]
pears apples lemons plums other
0 40 50 61 72 85
Keeping it Python simple
Pandas DataFrames have a columns attribute, so why not use it with standard Python? It is much clearer what you are doing:
table = [['name', 'Rf', 'Rg', 'Rf,skin', 'CRI'],
         ['testsala.cxf', '86', '95', '92', '87'],
         ['testsala.cxf: 727037 lm', '86', '95', '92', '87'],
         ['630.cxf', '18', '8', '11', '18'],
         ['Huawei stk-lx1.cxf', '86', '96', '88', '83'],
         ['dedo uv no filtro.cxf', '52', '93', '48', '58']]
import pandas as pd
data = pd.DataFrame(table[1:], columns=table[0])
or, in case the header is not the first row but, for instance, the row at index 10:
columns = table.pop(10)  # pop removes the header row from the table and returns it
data = pd.DataFrame(table, columns=columns)
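As a quick check of the first-row case (the pop variant works the same way once the header row's index is known; the small table here is mine):

```python
import pandas as pd

table = [['name', 'Rf', 'Rg'],
         ['a.cxf', '86', '95'],
         ['b.cxf', '18', '8']]

# First row becomes the header, the remaining rows become the data
data = pd.DataFrame(table[1:], columns=table[0])
```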