How to compare two columns in a grouped pandas dataframe?

I am unable to compare two columns inside a grouped pandas dataframe.
I used the groupby method to group the rows by two columns.
I need the list of fields whose predicted value does not match the actual value.
file_name | page_no | field_name | value | predicted_value | actual_value
----------|---------|------------|-------|-----------------|-------------
A         | 1       | a          | 1     | zx              | zx
A         | 2       | b          | 0     | xt              | xi
B         | 1       | a          | 1     | qw              | qw
B         | 2       | b          | 0     | xr              | xe
Desired output:
b
because b is the only field whose predicted and actual values do not match.
The following is my code:
groups = df1.groupby(['file_name', 'page_no'])
a = pd.DataFrame(columns=['file_name', 'page_no', 'value'])
for name, group in groups:
    lst = []
    if (group[group['predicted_value']] != group[group['actual_value']]):
        lst = lst.append(group[group['field_name']])
    print(lst)
Here I'm trying to store them in a list, but I am getting a KeyError.
The error is as follows:
KeyError: "None of [Index(['A', '1234'])] are in the [columns]"

Here is a solution that tests the columns outside the groups:
df1 = df[df['predicted_value'] != df['actual_value']]
s = df.loc[df['predicted_value'] != df['actual_value'], 'field_name']
L = s.tolist()

Does this solve your problem?
# Create a new dataframe retrieving only non-matching values
df1 = df[df['predicted_value'] != df['actual_value']]
# Store the 'field_name' column as a list
lst = list(df1['field_name'])
print(lst)
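Applied to the sample data from the question (rebuilt inline for illustration), the vectorized comparison from the answers yields the desired output without any groupby:

```python
import pandas as pd

# Sample data from the question, rebuilt inline
df = pd.DataFrame({
    'file_name': ['A', 'A', 'B', 'B'],
    'page_no': [1, 2, 1, 2],
    'field_name': ['a', 'b', 'a', 'b'],
    'predicted_value': ['zx', 'xt', 'qw', 'xr'],
    'actual_value': ['zx', 'xi', 'qw', 'xe'],
})

# Boolean mask of mismatching rows, then the unique field names
mismatched = df.loc[df['predicted_value'] != df['actual_value'], 'field_name']
print(mismatched.unique().tolist())  # ['b']
```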

Related

How to save values in pandas dataframe after editing some values

I have a dataframe which looks like this (it contains dummy data):
I want to remove the text which occurs after the "_________" identifier in each of the cells. I have written the following code (logic: add a new column containing NaN and save the edited values into that column):
import pandas as pd
import numpy as np

df = pd.read_excel(r'Desktop\Trial.xlsx')
df["Body2"] = np.nan
substring = "____________"
for index, row in df.iterrows():
    if substring in row["Body"]:
        split_string = row["Body"].split(substring, 1)
        row["Body2"] = split_string[0]
print(df)
But the Body2 column still displays NaN and not the edited values.
Any help would be much appreciated!
for index, row in df.iterrows():
    if substring in row["Body"]:
        split_string = row["Body"].split(substring, 1)
        # row["Body2"] = split_string[0]  # instead use the line below
        df.at[index, 'Body2'] = split_string[0]
Make use of df.at to modify the value in the original dataframe.
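A minimal sketch of why the original loop has no effect (toy data and a shorter separator, made up for readability): iterrows() hands back a per-row copy, so only the df.at writes reach the dataframe.

```python
import pandas as pd

df = pd.DataFrame({"Body": ["a____b", "cd"]})
df["Body2"] = None

# Writing into `row` modifies a temporary Series, not df
for index, row in df.iterrows():
    row["Body2"] = "changed"
print(df["Body2"].tolist())  # [None, None]

# df.at writes directly into the dataframe
for index, row in df.iterrows():
    df.at[index, "Body2"] = row["Body"].split("____", 1)[0]
print(df["Body2"].tolist())  # ['a', 'cd']
```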
Instead of iterating through the rows, do the operation on all rows at once. You can use expand=True to split the values into multiple columns, which I think is what you want.
substring = "____________"
df = pd.DataFrame({'Body': ['a____________b', 'c____________d', 'e____________f', 'gh']})
df[['Body1', 'Body2']] = df['Body'].str.split(substring, expand=True)
print(df)
# Body Body1 Body2
# 0 a____________b a b
# 1 c____________d c d
# 2 e____________f e f
# 3 gh gh None

Pandas groupby: Nested loop fails with key error

I have a CSV file with the following test content:
Name;N_D;mu;set
A;10;20;0
B;20;30;0
C;30;40;0
x;5;15;1
y;15;25;1
z;25;35;1
I'm reading the file with pandas, grouping the data, and then iterating through the groups. Within each group, I want to iterate through the rows of the data set:
import pandas as pd

df = pd.read_csv("samples_test.csv", delimiter=";", header=0)
groups = df.groupby("set")
for name, group in groups:
    somestuff = [group["N_D"], group["mu"], name]
    for i, txt in enumerate(group["Name"]):
        print(txt, group["Name"][i])
The code fails on the line print(txt, group["Name"][i]) at the first element of the second group with a KeyError. I don't understand why...
Your code fails because each group keeps the row labels of the original dataframe, so after the first group they no longer match the counter produced by enumerate, and the lookup by key fails. (Note: also use .loc[] or .iloc[] and avoid chained indexing like group["Name"][i].)
groups = df.groupby("set")
for name, group in groups:
    somestuff = [group["N_D"], group["mu"], name]
    for i, txt in enumerate(group["Name"]):
        print(i, group["Name"])

0 0    A
1    B
2    C
Name: Name, dtype: object
1 0    A
1    B
2    C
...
Your code should be changed as below, using .iloc[] and get_loc to get the positional column index:
groups = df.groupby("set")
for name, group in groups:
    somestuff = [group["N_D"], group["mu"], name]
    for i, txt in enumerate(group["Name"]):
        print(txt, group.iloc[i, group.columns.get_loc('Name')])
A A
B B
C C
x x
y y
z z
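An alternative sketch that sidesteps index alignment entirely: itertuples() yields each row's values together, so no positional lookup is needed at all (same toy data as the CSV, built inline here):

```python
import pandas as pd

# Same data as samples_test.csv, built inline
df = pd.DataFrame({
    "Name": ["A", "B", "C", "x", "y", "z"],
    "N_D": [10, 20, 30, 5, 15, 25],
    "mu": [20, 30, 40, 15, 25, 35],
    "set": [0, 0, 0, 1, 1, 1],
})

names = []
for name, group in df.groupby("set"):
    # Each row arrives as a namedtuple; no index bookkeeping required
    for row in group.itertuples(index=False):
        names.append(row.Name)
print(names)  # ['A', 'B', 'C', 'x', 'y', 'z']
```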

How to compare columns of two dataframes and have consequences when they match in Python Pandas

I am trying to have Python Pandas compare two dataframes with each other. In dataframe 1, I have two columns (AC-Cat and Origin). I am trying to compare the AC-Cat column with the contents of dataframe 2. If a match is found between one of the columns of dataframe 2 and the dataframe 1 value being examined, I want pandas to copy the header of the dataframe 2 column in which the match was found into a new column in dataframe 1.
DF1:
f = {'AC-Cat': pd.Series(['B737', 'A320', 'MD11']),
     'Origin': pd.Series(['AJD', 'JFK', 'LRO'])}
Flight_df = pd.DataFrame(f)
DF2:
w = {'CAT-C': pd.Series(['DC85', 'IL76', 'MD11', 'TU22', 'TU95']),
     'CAT-D': pd.Series(['A320', 'A321', 'AN12', 'B736', 'B737'])}
WCat_df = pd.DataFrame(w)
I imported pandas as pd and numpy as np and tried to define a function to compare these columns.
def get_wake_cat(AC_cat):
    try:
        Wcat = [WCat_df.columns.values[0]][WCat_df.iloc[:,1]==AC_cat].values[0]
    except:
        Wcat = np.NAN
    return Wcat

Flight_df.loc[:,'CAT'] = Flight_df.loc[:,'AC-Cat'].apply(lambda CT: get_wake_cat(CT))
However, the function does not produce the desired output. For example, take the B737 AC-Cat value: I want pandas to find this value in DF2, in the column CAT-D, and copy that header into the new column of DF1. This does not happen. Can someone help me find out why my code is not giving the desired results?
Not pretty, but I think I got it working. Part of the error was that the function did not have access to WCat_df. I also split the indexing into two steps:
def get_wake_cat(AC_cat, WCat_df):
    try:
        d = WCat_df[WCat_df.columns.values][WCat_df.iloc[:] == AC_cat]
        Wcat = d.columns[(d == AC_cat).any()][0]
    except:
        Wcat = np.NAN
    return Wcat
Then you need to change your next line to:
Flight_df.loc[:,'CAT'] = Flight_df.loc[:,'AC-Cat'].apply(lambda CT: get_wake_cat(CT, WCat_df))
AC-Cat Origin CAT
0 B737 AJD CAT-D
1 A320 JFK CAT-D
2 MD11 LRO CAT-C
Hope that solves the problem.
This will give you two new columns with the name(s) of the match(es) found:
Flight_df['CAT1'] = Flight_df['AC-Cat'].map(lambda x: 'CAT-C' if x in list(WCat_df['CAT-C']) else '')
Flight_df['CAT2'] = Flight_df['AC-Cat'].map(lambda x: 'CAT-D' if x in list(WCat_df['CAT-D']) else '')
Flight_df.loc[Flight_df['CAT1'] == '', 'CAT1'] = Flight_df['CAT2']
Flight_df.loc[Flight_df['CAT1'] == Flight_df['CAT2'], 'CAT2'] = ''
IIUC, you can do a stack and merge:
final = (Flight_df.merge(WCat_df.stack().reset_index(1, name='AC-Cat'),
                         on='AC-Cat', how='left')
                  .rename(columns={'level_1': 'New'}))
print(final)
Or with melt:
final = Flight_df.merge(WCat_df.melt(var_name='New', value_name='AC-Cat'),
                        on='AC-Cat', how='left')
AC-Cat Origin New
0 B737 AJD CAT-D
1 A320 JFK CAT-D
2 MD11 LRO CAT-C
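For reference, a self-contained run of the melt approach on the question's data:

```python
import pandas as pd

Flight_df = pd.DataFrame({'AC-Cat': ['B737', 'A320', 'MD11'],
                          'Origin': ['AJD', 'JFK', 'LRO']})
WCat_df = pd.DataFrame({'CAT-C': ['DC85', 'IL76', 'MD11', 'TU22', 'TU95'],
                        'CAT-D': ['A320', 'A321', 'AN12', 'B736', 'B737']})

# melt turns every (column, value) pair into a long-format row,
# so a plain left merge recovers the matching column name
lookup = WCat_df.melt(var_name='New', value_name='AC-Cat')
final = Flight_df.merge(lookup, on='AC-Cat', how='left')
print(final['New'].tolist())  # ['CAT-D', 'CAT-D', 'CAT-C']
```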

How to prevent multi value dictionary object from splitting each word into individual letter strings?

I have a dictionary object that looks like this:
my_dict = {123456789123: ('a', 'category'),
           123456789456: ('bc', 'subcategory'),
           123456789678: ('c_d', 'subcategory')}
The code below extracts an integer from each column header in a df, compares it to the keys in the dictionary, and creates a new dataframe, using the second tuple value as the column of the new df and the first as the value inside it.
Code:
names = df.columns.values
new_df = pd.DataFrame()
for name in names:
    if ('.value.' in name) and df[name][0]:
        last_number = int(name[-13:])
        print(last_number)
        key, value = my_dict[last_number]
        try:
            new_df[value][0] = list(new_df[value][0]) + [key]
        except:
            new_df[value] = [key]
new_df:
category subcategory
0 a [b, c, c_d]
I am not sure what is causing it in my code, but how do I prevent bc from being split up?
edit:
example df from above:
data.value.123456789123 data.value.123456789456 data.value.123456789678
TRUE TRUE TRUE
new_df should look like this:
category subcategory
0 a [bc, c_d]
list(new_df[value][0]) breaks a string into a list of its characters; that's why you get the individual letters.
list(new_df[value][0]) must be [new_df[value][0]]. Or, better, list(new_df[value][0]) + [key] must be [new_df[value][0], key].
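A quick illustration of the difference:

```python
# list() iterates a string character by character,
# which is what splits 'bc' into letters
chars = list("bc")
print(chars)  # ['b', 'c']

# Wrapping the string in brackets keeps it whole
whole = ["bc"] + ["c_d"]
print(whole)  # ['bc', 'c_d']
```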
Using the DataFrame constructor and groupby:
df = pd.DataFrame(list(my_dict.values()))
df.groupby(1)[0].apply(list).to_frame(0).T
1 category subcategory
0 [a] [bc, c_d]

Given a pandas dataframe, is there an easy way to print out a command to generate it?

After running some commands I have a pandas dataframe, eg.:
>>> print df
B A
1 2 1
2 3 2
3 4 3
4 5 4
I would like to print this out so that it produces simple code that would recreate it, eg.:
DataFrame([[2,1],[3,2],[4,3],[5,4]],columns=['B','A'],index=[1,2,3,4])
I tried pulling out each of the three pieces (data, columns and rows):
[[e for e in row] for row in df.iterrows()]
[c for c in df.columns]
[r for r in df.index]
but the first line fails because e is not a value but a Series.
Is there a pre-built command to do this, and if not, how do I do it? Thanks.
You can get the values of the data frame in array format by calling df.values:
df = pd.DataFrame([[2,1],[3,2],[4,3],[5,4]], columns=['B','A'], index=[1,2,3,4])
arrays = df.values
cols = df.columns
index = df.index
df2 = pd.DataFrame(arrays, columns=cols, index=index)
Based on #Woody Pride's approach, here is the full solution I am using. It handles hierarchical indices and index names.
from types import MethodType
from pandas import DataFrame, MultiIndex

def _gencmd(df, pandas_as='pd'):
    """
    With this addition to DataFrame's methods, you can use:
        df.command()
    to get the command required to regenerate the dataframe df.
    """
    if pandas_as:
        pandas_as += '.'
    index_cmd = df.index.__class__.__name__
    if type(df.index) == MultiIndex:
        index_cmd += '.from_tuples({0}, names={1})'.format([i for i in df.index], df.index.names)
    else:
        index_cmd += "({0}, name='{1}')".format([i for i in df.index], df.index.name)
    return 'DataFrame({0}, index={1}{2}, columns={3})'.format([[xx for xx in x] for x in df.values],
                                                              pandas_as,
                                                              index_cmd,
                                                              [c for c in df.columns])

DataFrame.command = MethodType(_gencmd, None, DataFrame)  # Python 2 method-binding syntax
I have only tested it on a few cases so far and would love a more general solution.
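For the common flat-index case, a minimal Python 3 sketch (the helper name df_command is made up for this illustration) can build the command directly from df.values.tolist() and round-trip it with eval:

```python
import pandas as pd

def df_command(df):
    """Return Python source that recreates df (flat index and columns only)."""
    return "pd.DataFrame({0}, columns={1}, index={2})".format(
        df.values.tolist(), list(df.columns), list(df.index))

df = pd.DataFrame([[2, 1], [3, 2], [4, 3], [5, 4]],
                  columns=['B', 'A'], index=[1, 2, 3, 4])
cmd = df_command(df)
print(cmd)
df2 = eval(cmd)  # round-trip check
```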
