How to split multiple dictionaries in row into new rows using Pandas - python

I have the following dataframe with multiple dictionaries in a list in the Rules column.
SetID SetName Rules
0 Standard_1 [{'RulesID': '10', 'RuleName': 'name_abc'}, {'RulesID': '11', 'RuleName': 'name_xyz'}]
1 Standard_2 [{'RulesID': '12', 'RuleName': 'name_arg'}]
The desired output is:
SetID SetName RulesID RuleName
0 Standard_1 10 name_abc
0 Standard_1 11 name_xyz
1 Standard_2 12 name_arg
It might be possible that there are more than two dictionaries inside of the list.
I am thinking about using a pop, explode or pivot function to build the dataframe, but I have no clue how to start.
Any advice would be much appreciated!
EDIT: To build the dataframe you can use the following dataframe constructor:
import pandas as pd
# initialize list of lists
data = [[0, 'Standard_1', [{'RulesID': '10', 'RuleName': 'name_abc'}, {'RulesID': '11', 'RuleName': 'name_xyz'}]], [1, 'Standard_2', [{'RulesID': '12', 'RuleName': 'name_arg'}]]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['SetID', 'SetName', 'Rules'])

You can use explode:
tmp = df.explode('Rules').reset_index(drop=True)
df = pd.concat([tmp, pd.json_normalize(tmp['Rules'])], axis=1).drop('Rules', axis=1)
Output:
>>> df
SetID SetName RulesID RuleName
0 0 Standard_1 10 name_abc
1 0 Standard_1 11 name_xyz
2 1 Standard_2 12 name_arg
One-liner version of the above:
df.explode('Rules').reset_index(drop=True).pipe(lambda x: pd.concat([x, pd.json_normalize(x['Rules'])], axis=1)).drop('Rules', axis=1)
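As an alternative sketch (not part of the answer above), `pd.json_normalize` can flatten the nested list directly through its `record_path` and `meta` parameters, assuming the frame is first converted to a list of records. The variable names `records` and `flat` are only illustrative.

```python
import pandas as pd

data = [
    [0, 'Standard_1', [{'RulesID': '10', 'RuleName': 'name_abc'},
                       {'RulesID': '11', 'RuleName': 'name_xyz'}]],
    [1, 'Standard_2', [{'RulesID': '12', 'RuleName': 'name_arg'}]],
]
df = pd.DataFrame(data, columns=['SetID', 'SetName', 'Rules'])

# One dict per row; the 'Rules' value stays a list of dicts
records = df.to_dict('records')

# Each dict inside 'Rules' becomes a row; SetID and SetName are repeated as metadata
flat = pd.json_normalize(records, record_path='Rules', meta=['SetID', 'SetName'])

# Reorder the columns to match the desired output
flat = flat[['SetID', 'SetName', 'RulesID', 'RuleName']]
print(flat)
```

The metadata columns may come back with object dtype, so cast them (for example with `astype`) if the original dtypes matter.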

Related

column comparison of two dataframe, return df with mismatches python

I want to print, for each of two dataframes, the rows where there is a mismatch in a given column, here "second_column".
"first_column" is a key that identifies the same product in both dataframes.
import pandas as pd
data1 = {
    'first_column': ['id1', 'id2', 'id3'],
    'second_column': ['1', '2', '2'],
    'third_column': ['1', '2', '2'],
    'fourth_column': ['1', '2', '2']
}
df1 = pd.DataFrame(data1)
print(df1)
test = df1['second_column'].nunique()
data2 = {
    'first_column': ['id1', 'id2', 'id3'],
    'second_column': ['3', '4', '2'],
    'third_column': ['1', '2', '2'],
    'fourth_column': ['1', '2', '2']
}
df2 = pd.DataFrame(data2)
print(df2)
expected output:
If I understand correctly (by the way, your screenshots don't match your DataFrame definition):
df1.loc[~df1['second_column'].isin(df2['second_column'])]
first_column second_column third_column fourth_column
0 1 1 1 1
df2.loc[~df2['second_column'].isin(df1['second_column'])]
first_column second_column third_column fourth_column
0 1 3 1 1
1 2 4 2 2
the compare method can do what you want.
different_rows = df1.compare(df2, align_axis=1).index
df1.loc[different_rows]
One important caveat with this method: compare only works when both DataFrames have identical row and column labels. If one of them has extra rows, it raises a ValueError instead of returning a difference.
Or, if you want to find differences in one column only, you can first join on the index and then check whether the joined values match:
joined_df = df1.join(df2['second_column'], rsuffix='_df2')
diff = joined_df['second_column']!=joined_df['second_column_df2']
print(joined_df.loc[diff, df1.columns])
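Since the question states that "first_column" is the key, one possible variant is to align the two frames on that key with `merge` rather than on the index. This is only a sketch under that assumption; the suffixes and the names `merged`, `mismatch` and `mismatched_ids` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'first_column': ['id1', 'id2', 'id3'],
    'second_column': ['1', '2', '2'],
})
df2 = pd.DataFrame({
    'first_column': ['id1', 'id2', 'id3'],
    'second_column': ['3', '4', '2'],
})

# Align rows by the key column; the overlapping column gets a suffix per frame
merged = df1.merge(df2, on='first_column', suffixes=('_df1', '_df2'))

# Keep only the keys whose second_column differs between the two frames
mismatch = merged[merged['second_column_df1'] != merged['second_column_df2']]
mismatched_ids = mismatch['first_column'].tolist()

# Rows of each original frame that belong to a mismatching key
print(df1[df1['first_column'].isin(mismatched_ids)])
print(df2[df2['first_column'].isin(mismatched_ids)])
```

Merging on the key keeps the comparison correct even when the rows appear in a different order in the two frames.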

How to use apply for two pandas column including lists to return index in a list in one column using the element in another column?

I have a pandas DataFrame with columns "a" and "b". Each value in column "a" is a list, and each value in column "b" is a list with at most one element, which may appear in the corresponding list in column "a". Using apply, I want to create a new column "c" holding the position of that element of "b" within "a" (c = (index of b in a) + 1).
Column "b" is always a list with one element or no element at all. Column "a" can have any length, but if it is empty, column "b" is empty as well. The element of "b" is expected to be in "a", and I just want the position of its first occurrence in "a".
a b c
['1', '2', '5'] ['2'] 2
['2','3','4'] ['4'] 3
['2','3','4'] [] 0
[] [] 0
...
I wrote a for loop which works fine but it is pretty slow:
for i in range(0,len(df)):
    if len(df['a'][i])!=0:
        df['c'][i]=df['a'][i].index(*df['b'][i])+1
    else:
        df['c'][i]=0
But I want to use apply to make it faster. The following does not work; any thoughts or suggestions would be greatly appreciated.
df['c']=df['a'].apply(df['a'].index(*df['b']))
First of all, here is a basic method using .apply().
import pandas as pd
import numpy as np
list_a = [['1', '2', '5'], ['2', '3', '4'], ['2', '3', '4'], []]
list_b = [['2'], ['4'], [], []]
df_1 = pd.DataFrame(data=zip(list_a, list_b), columns=['a', 'b'])
# np.nan is used instead of np.NaN, which was removed in NumPy 2.0
df_1['a'] = df_1['a'].map(lambda x: x if x else np.nan)
df_1['b'] = df_1['b'].map(lambda x: x[0] if x else np.nan)
#df_1['b'] = df_1['b'].map(lambda x: next(iter(x), np.nan))
def calc_c(curr_row: pd.Series) -> int:
    # Rows where either value is missing get 0
    if curr_row['a'] is np.nan or curr_row['b'] is np.nan:
        return 0
    else:
        return curr_row['a'].index(curr_row['b'])
df_1['c'] = df_1[['a', 'b']].apply(func=calc_c, axis=1)
df_1 result:
a b c
-- --------------- --- ---
0 ['1', '2', '5'] 2 1
1 ['2', '3', '4'] 4 2
2 ['2', '3', '4'] nan 0
3 nan nan 0
I replaced the empty lists with NaN; I find that far more idiomatic and practical.
This is obviously not an ideal solution, I will try to find something else. Obviously, the more information we have about your program and the DataFrame, the better.
By reading in the data so that the columns contain actual lists, I am able to write an apply function that computes the values for c:
import io, ast
#a b
#['1','2','5'] ['2']
#['2','3','4'] ['4']
#['2','3','4'] []
#[] []
csvfile=io.StringIO("""a b
['1','2','5'] ['2']
['2','3','4'] ['4']
['2','3','4'] []
[] []""")
df = pd.read_csv(csvfile, sep=' ', converters={'a' : ast.literal_eval, 'b' : ast.literal_eval })
def a_b_index(hm):
    if hm.b != []:
        return hm.a.index(hm.b[0])
    else:
        return 0
df['c'] = df.apply(a_b_index, axis=1)
df.c
# a b c
#0 [1, 2, 5] [2] 1
#1 [2, 3, 4] [4] 2
#2 [2, 3, 4] [] 0
#3 [] [] 0
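Both answers above return the 0-based index, while the question asks for the index plus one. A minimal sketch that returns the 1-based position is shown below. It uses a plain list comprehension over the two columns, which is a swapped-in technique rather than `apply`, and the helper name `position_of_b_in_a` is hypothetical. It also assumes that a missing or absent element should map to 0.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [['1', '2', '5'], ['2', '3', '4'], ['2', '3', '4'], []],
    'b': [['2'], ['4'], [], []],
})

def position_of_b_in_a(a, b):
    """Return the 1-based position of b's single element in a, or 0 if b is empty or its element is not in a."""
    if not b or b[0] not in a:
        return 0
    # list.index returns the first occurrence; add 1 for a 1-based position
    return a.index(b[0]) + 1

# Iterate over both columns in parallel instead of indexing row by row
df['c'] = [position_of_b_in_a(a, b) for a, b in zip(df['a'], df['b'])]
print(df)
```

Iterating with `zip` avoids the chained `df['c'][i] = ...` assignments of the original loop, which is where most of its overhead tends to come from.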

How to append a list after looping over a dataframe column?

Assuming I have a dataframe as follows:
df = pd.DataFrame({ 'ids' : ['1', '1', '1', '1', '2', '2', '2', '3', '3'],
'values' : ['5', '8', '7', '12', '2', '1', '3', '15', '4']
}, dtype='int32')
ids values
1 5
1 7
1 8
1 12
2 1
2 3
2 2
3 4
3 15
What I would like to do is to loop over the values column and check which values are greater than 6 and the corresponding id from the ids column must be appended into an empty list.
Even if an id (say 3) has multiple values and out of those multiple values (4 and 15), only one value is greater than 6, I would like the corresponding id to be appended into the list.
Example:
Assuming we run a loop over the above mentioned dataframe df, I would like the output as follows:
more = [1, 3]
less = [2]
with more =[] and less = [] being pre-initialized empty lists
What I have so far:
I tried implementing this, but I am surely making a mistake somewhere. The code I have:
less = []
more = []
for value in df['values']:
    for id in df['ids']:
        if (value > 6):
            more.append(id)
        else:
            less.append(id)
Use groupby and boolean indexing to create your lists. This will be much faster than looping:
g = df.groupby('ids')['values'].max()
mask = g.gt(6)
more = g[mask].index.tolist()
less = g[~mask].index.tolist()
print(more)
print(less)
[1, 3]
[2]
You can use boolean indexing to select all ids that have at least one value greater than 6 and collect them into a set of unique ids:
setA = set(df[df['values'] > 6]['ids'])
This will create a set of all ids in the dataframe:
setB = set(df['ids'])
Now,
more = list(setA)
and for less, take the set difference:
less = list(setB.difference(setA))
That's it!
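Note that Python sets do not guarantee any ordering, so the lists built from them may come out in arbitrary order. If a stable order matters, one possible sketch (assuming integer ids and values as in the question) is to group the boolean condition by id and use `any`, which yields results sorted by id:

```python
import pandas as pd

df = pd.DataFrame({
    'ids': [1, 1, 1, 1, 2, 2, 2, 3, 3],
    'values': [5, 8, 7, 12, 2, 1, 3, 15, 4],
})

# True for an id if at least one of its values is greater than 6
has_large = (df['values'] > 6).groupby(df['ids']).any()

more = has_large[has_large].index.tolist()
less = has_large[~has_large].index.tolist()
print(more)
print(less)
```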

Creating boxplot from Pandas DataFrame using Seaborn

I have the following Pandas DataFrame which I use to compare the performance of different classifiers over multiple iterations. After each iteration, I save the ranking of that specific classifier to a DataFrame which holds the cumulative sum of rankings over all iterations (the index of the DataFrame gives the ranking from 0-3, i.e., 4 classifiers in total, and 0 is the best).
The DataFrame looks as follows:
rankings = {'Classifier1': ['1', '2', '1', '0'],
'Classifier2': ['2', '1', '1', '0'],
'Classifier3': ['0', '1', '1', '2'],
'Classifier4': ['1', '0', '1', '2']}
df = pd.DataFrame(data = rankings)
which formats as:
Classifier1 Classifier2 Classifier3 Classifier4
0 1 2 0 1
1 2 1 1 0
2 1 1 1 1
3 0 0 2 2
I would like to create the following boxplot (as in this paper) of the different classifiers by using Seaborn or an alternative method:
Firstly, we need to convert your data into numeric values rather than strings. Then, we melt the dataframe to get it into long format, and finally we apply a boxplot with a swarmplot on top.
import seaborn as sns
# Convert the string ranks to numbers, then reshape to long format for plotting
df = df.apply(pd.to_numeric).melt(var_name='Classifier', value_name='AUC Rank')
ax = sns.boxplot(data=df, x='Classifier', y='AUC Rank')
ax = sns.swarmplot(data=df, x='Classifier', y='AUC Rank', color='black')
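To make the reshaping step concrete, here is a small pandas-only sketch of what the conversion and `melt` produce before anything is plotted. The name `long_df` is illustrative.

```python
import pandas as pd

rankings = {'Classifier1': ['1', '2', '1', '0'],
            'Classifier2': ['2', '1', '1', '0'],
            'Classifier3': ['0', '1', '1', '2'],
            'Classifier4': ['1', '0', '1', '2']}
df = pd.DataFrame(data=rankings)

# Strings -> numbers, then wide -> long: one row per (classifier, value) pair
long_df = df.apply(pd.to_numeric).melt(var_name='Classifier', value_name='AUC Rank')

print(long_df.shape)
print(long_df.head())
```

Seaborn's `boxplot` and `swarmplot` can then draw one box per distinct value of the `Classifier` column, which is why the long format is convenient here.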

Convert row to column header for Pandas DataFrame

The data I have to work with is a bit messy. It has header names inside of its data. How can I choose a row from an existing pandas dataframe and make it (rename it to) a column header?
I want to do something like:
header = df[df['old_header_name1'] == 'new_header_name1']
df.columns = header
In [21]: df = pd.DataFrame([(1,2,3), ('foo','bar','baz'), (4,5,6)])
In [22]: df
Out[22]:
0 1 2
0 1 2 3
1 foo bar baz
2 4 5 6
Set the column labels to equal the values in the 2nd row (index location 1):
In [23]: df.columns = df.iloc[1]
If the index has unique labels, you can drop the 2nd row using:
In [24]: df.drop(df.index[1])
Out[24]:
1 foo bar baz
0 1 2 3
2 4 5 6
If the index is not unique, you could use:
In [133]: df.iloc[pd.RangeIndex(len(df)).drop(1)]
Out[133]:
1 foo bar baz
0 1 2 3
2 4 5 6
Using df.drop(df.index[1]) removes all rows with the same label as the second row. Because non-unique indexes can lead to stumbling blocks (or potential bugs) like this, it's often better to take care that the index is unique (even though Pandas does not require it).
This works (pandas v'0.19.2'):
df.rename(columns=df.iloc[0])
It would be easier to recreate the data frame.
This would also interpret the columns types from scratch.
headers = df.iloc[0]
new_df = pd.DataFrame(df.values[1:], columns=headers)
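A short sketch of the rebuild approach on made-up data follows. One assumption worth flagging: when the frame is rebuilt from `df.values`, the columns may come out with object dtype, and `infer_objects()` is one way to ask pandas to re-infer them afterwards.

```python
import pandas as pd

# Illustrative frame whose first row holds the real header names
df = pd.DataFrame([('foo', 'bar', 'baz'), (1, 2, 3), (4, 5, 6)])

headers = df.iloc[0]
new_df = pd.DataFrame(df.values[1:], columns=headers)

# Ask pandas to re-infer better dtypes for object columns
new_df = new_df.infer_objects()

print(new_df)
print(new_df.dtypes)
```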
To rename the header without reassigning df:
df.rename(columns=df.iloc[0], inplace = True)
To drop the row without reassigning df:
df.drop(df.index[0], inplace = True)
You can specify the row index in the read_csv or read_html functions via the header parameter, which represents the row number(s) to use as the column names and the start of the data. This has the advantage of automatically dropping all the preceding rows, which are presumably junk.
import pandas as pd
from io import StringIO
In[1]
csv = '''junk1, junk2, junk3, junk4, junk5
junk1, junk2, junk3, junk4, junk5
pears, apples, lemons, plums, other
40, 50, 61, 72, 85
'''
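# skipinitialspace=True strips the blank after each comma so the column names have no leading space
df = pd.read_csv(StringIO(csv), header=2, skipinitialspace=True)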
print(df)
Out[1]
pears apples lemons plums other
0 40 50 61 72 85
Keeping it simple with plain Python
Pandas DataFrames accept a columns argument, so why not use it with standard Python? It is much clearer what you are doing:
table = [['name', 'Rf', 'Rg', 'Rf,skin', 'CRI'],
['testsala.cxf', '86', '95', '92', '87'],
['testsala.cxf: 727037 lm', '86', '95', '92', '87'],
['630.cxf', '18', '8', '11', '18'],
['Huawei stk-lx1.cxf', '86', '96', '88', '83'],
['dedo uv no filtro.cxf', '52', '93', '48', '58']]
import pandas as pd
data = pd.DataFrame(table[1:],columns=table[0])
Or, if the header is not the first row but, for instance, the row at index 10 of a longer table:
columns = table.pop(10)
data = pd.DataFrame(table,columns=columns)
