Pandas DataFrame - Creating a new column from a comparison - python

I'm trying to create a column called 'city_code' with values from the 'code' column. But in order to do this I need to check whether the 'ds_city' and 'city' values are equal.
Here is a table sample:
https://i.imgur.com/093GJF1.png
I've tried this:
def find_code(data):
    if data['ds_city'] == data['city']:
        return data['code']
    else:
        return 'UNKNOWN'

df['code_city'] = df.apply(find_code, axis=1)
But since there are duplicates in the 'ds_city' column, this is the result:
https://i.imgur.com/geHyVUA.png
Here is a image of the expected result:
https://i.imgur.com/HqxMJ5z.png
How can I work around this?

You can use pandas merge:
df = pd.merge(df, df[['code', 'city']], how='left',
              left_on='ds_city', right_on='city',
              suffixes=('', '_right')).drop(columns='city_right')
# output:
# code city ds_city code_right
# 0 1500107 ABAETETUBA ABAETETUBA 1500107
# 1 2900207 ABARE ABAETETUBA 1500107
# 2 2100055 ACAILANDIA ABAETETUBA 1500107
# 3 2300309 ACOPIARA ABAETETUBA 1500107
# 4 5200134 ACREUNA ABARE 2900207
Here's the documentation for pandas.merge. It takes the input dataframe and left-joins its own code and city columns where ds_city equals city.
The above code fills code_right with NaN when the city is not found. To fill those with 'UNKNOWN' instead:
df['code_right'] = df['code_right'].fillna('UNKNOWN')
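If you want the merged column to carry the name used in the question rather than the suffix from the merge, a rename afterwards works (a small extra step, not part of the original answer):
df = df.rename(columns={'code_right': 'code_city'})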

This is a case for np.where:
import numpy as np

df['code_city'] = np.where(df['ds_city'] == df['city'], df['code'], 'UNKNOWN')
Note that this compares the two columns row by row, so, like the apply in the question, it only fills in a code when ds_city and city match on the same row.

You could try this out:
# Begin with a column of only 'UNKNOWN' values.
data['code_city'] = "UNKNOWN"

# Iterate through the cities in the ds_city column.
for i, lookup_city in enumerate(data['ds_city']):
    # Find the first row whose 'city' column contains the corresponding city name
    # (list.index raises ValueError if the city is not present).
    row = data['city'].tolist().index(lookup_city)
    # Reassign the current row's code_city to the code from the row found above
    # (using .loc avoids chained-assignment warnings).
    data.loc[i, 'code_city'] = data['code'][row]
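An equivalent lookup without the explicit Python loop is a dictionary built from the 'city' and 'code' columns plus map (a sketch, assuming the values in 'city' are unique; any ds_city with no match becomes 'UNKNOWN'):
# Build a city -> code mapping from the reference columns.
city_to_code = data.set_index('city')['code'].to_dict()
# Look up each ds_city; cities missing from the mapping yield NaN, which we fill.
data['code_city'] = data['ds_city'].map(city_to_code).fillna('UNKNOWN')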

Related

divide the row into two rows after several columns

I have a CSV file and I am trying to split a row into multiple rows if it contains more than 4 data columns.
Example: (input screenshot omitted)
Expected Output: (screenshot omitted)
Is there a way to do that in pandas or Python? Sorry if this is a simple question.

When there are two columns with the same name in a CSV file, pandas automatically appends a numeric suffix to the duplicate column names. For example:
This CSV file :
Will become this :
df = pd.read_csv("Book1.csv")
df
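Since the screenshots are not included here, a minimal sketch of that renaming behaviour (column names and values made up) looks like this:
import io
import pandas as pd

# A header that repeats the column names x and y.
csv_text = "id,x,y,x,y\n1_1,31.5,22.6,31.4,22.5\n"
df_demo = pd.read_csv(io.StringIO(csv_text))
print(df_demo.columns.tolist())
# ['id', 'x', 'y', 'x.1', 'y.1']  -- the duplicates get a numeric suffix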
Now, to solve your question, let's consider the above dataframe as the input dataframe.
Try this:
cols = df.columns.tolist()
cols.remove('id')
start = 0
end = 4
new_df = []
final_cols = ['id', 'x1', 'y1', 'x2', 'y2']

while start < len(cols):
    if end > len(cols):
        end = len(cols)
    temp = cols[start:end]
    start = end
    end = end + 4
    temp_df = df.loc[:, ['id'] + temp]
    temp_df.columns = final_cols[:1 + len(temp)]
    if len(temp) < 4:
        temp_df[final_cols[1 + len(temp):]] = None
    print(temp_df)
    new_df.append(temp_df)

pd.concat(new_df).reset_index(drop=True)
Result:
You can first set the video column as the index, then concat each remaining group of 4 columns into a new dataframe. At the end, reset the index to get the video column back.
df.set_index('video', inplace=True)

dfs = []
for i in range(len(df.columns)//4):
    d = df.iloc[:, range(i*4, i*4+4)]
    dfs.append(d.set_axis(['x_center', 'y_center']*2, axis=1))

df_ = pd.concat(dfs).reset_index()
The same logic can be written as a list comprehension; note the comma after the first colon inside iloc, which is easy to miss and causes a positional indexing error if left out:
df_ = pd.concat([df.iloc[:, range(i*4, i*4+4)].set_axis(['x_center', 'y_center']*2, axis=1)
                 for i in range(len(df.columns)//4)]).reset_index()
print(df_)
video x_center y_center x_center y_center
0 1_1 31.510973 22.610222 31.383655 22.488293
1 1_1 31.856295 22.830109 32.016905 22.948702
2 1_1 32.011684 22.990689 31.933356 23.004779

Count occurrence of column values in other dataframe column

I have two dataframes and I want to count the occurrence of "classifier" in "fullname". My problem is that my script counts a word like "carrepair" only for one classifier and I would like to have a count for both classifiers. I would also like to add one random coordinate that matches the classifier.
First dataframe:
Second dataframe:
Result so far:
Desired Result:
My script so far:
import pandas as pd

fl = pd.read_excel(r'fullname.xlsx')
clas = pd.read_excel(r'classifier.xlsx')

fl.fullname = fl.fullname.str.lower()
clas.classifier = clas.classifier.str.lower()

pat = '({})'.format('|'.join(clas['classifier'].unique()))
fl['fullname'] = fl['fullname'].str.extract(pat, expand=False)
clas['count_of_classifier'] = clas['classifier'].map(fl['fullname'].value_counts())
print(clas)
Thanks!
You could try this:
import pandas as pd

fl = pd.read_excel(r'fullname.xlsx')
clas = pd.read_excel(r'classifier.xlsx')

fl.fullname = fl.fullname.str.lower()
clas.classifier = clas.classifier.str.lower()

# Add a new column to 'fl' containing either 'repair' or 'car'
for value in clas["classifier"].values:
    fl.loc[fl["fullname"].str.contains(value, case=False), value] = value

# Count values and create a new dataframe
new_clas = pd.DataFrame(
    {
        "classifier": [col for col in clas["classifier"].values],
        "count": [fl[col].count() for col in clas["classifier"].values],
    }
)

# Merge 'fl' and 'new_clas'
new_clas = pd.merge(
    left=new_clas, right=fl, how="left", left_on="classifier", right_on="fullname"
).reset_index(drop=True)

# Keep only the expected columns
new_clas = new_clas.reindex(columns=["classifier", "count", "coordinate"])
print(new_clas)
# Outputs
classifier count coordinate
repair 3 52.520008, 13.404954
car 3 54.520008, 15.404954

Return a matching value from 2 dataframes (1 dataframe with single value in cell, 1 with a list in a cell) into 1 dataframe

I have 2 dataframes:
df1
ID Type
2456-AA Coolant
2457-AA Elec
df2
ID Task
[2456-AA, 5656-BB] Check AC
[2456-AA, 2457-AA] Check Equip.
I'm trying return the matched ID's 'Type' from df1 to df2. With the result looking something like this:
df2
ID Task Type
[2456-AA, 5656-BB] Check AC [Coolant]
[2456-AA, 2457-AA] Check Equip. [Coolant , Elec]
I tried the following for loop. I understand it isn't the fastest, but I'm struggling to work out a faster alternative:
def type_identifier(type):
    df = df1.copy()
    device_type = []
    for value in df1.ID:
        for x in type:
            if x == value:
                device_type.append(df1.Type.tolist())
            else:
                None
    return device_type

df2['test'] = df2['ID'].apply(lambda x: type_identifier(x))
Could somebody help me out, and also refer me to material that could help me better approach problems like these?
Thank you,
Use pandas' to_dict to convert df1 to a dictionary, so we can efficiently translate ID to Type.
Then apply a lambda that converts each ID in df2 to the right types, and assign the result to the test column as you wished.
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'ID': ['2456-AA', '2457-AA'],
                    'Type': ['Coolant', 'Elec']})
df2 = pd.DataFrame({'ID': [['2456-AA', '5656-BB'], ['2456-AA', '2457-AA']],
                    'Task': ['Check AC', 'Check Equip.']})

# Use to_dict to convert df1 IDs to types
id_to_type = df1.set_index('ID').to_dict()['Type']
# {'2456-AA': 'Coolant', '2457-AA': 'Elec'}
print(id_to_type)

# Apply a lambda that converts each ID list in df2 to the corresponding types
df2['test'] = df2['ID'].apply(lambda x: [id_to_type[t] for t in x if t in id_to_type])
print(df2)
print(df2)
Output:
ID Task test
0 [2456-AA, 5656-BB] Check AC [Coolant]
1 [2456-AA, 2457-AA] Check Equip. [Coolant, Elec]

Comparing multiple columns of same CSV file and returning result to another CSV file using Python

I have a CSV file with 7 columns
This is my CSV file:
Application,Expected Value,ADER,UGOM,PRD
APP,CVD2,CVD2,CVD2,CVD2
APP1,"VCF7,hg6","VCF7,hg6","VCF8,hg6","VCF7,hg6"
APP1,"VDF9,pova8","VDF9,pova8","VDF10,pova10","VDF9,pova11"
APP2,gf8,gf8,gf8,gf8
APP3,pf8,pf8,gf8,pf8
APP4,vd3,mn7","vd3,mn7","vd3,mn7","vd3,mn7"
Here I want to compare the Expected Value column with the columns after it (that is, ADER, UGOM, PRD).
Here is my code in Python:
import pandas as pd
# assuming id columns are identical and contain the same values
df1 = pd.read_csv('file1.csv', index_col='Expected Value')
df2 = pd.read_csv('file1.csv', index_col='ADER')
df3 = pd.DataFrame(columns=['status'], index=df1.index)
df3['status'] = (df1['Expected Value'] == df2['ADER']).replace([True, False], ['Matching', 'Not Matching'])
df3.to_csv('output.csv')
This is not creating any output.csv file, nor is it generating any output. Can anyone help?
I edited the code based on a comment by @Vlado:
import pandas as pd
# assuming id columns are identical and contain the same values
df1 = pd.read_csv('first.csv')
df3 = pd.DataFrame(columns=['Application','Expected Value','ADER','status of AdER'], index=df1.index)
df3['Application'] = df1['Application']
df3['Expected Value'] = df1['Expected Value']
df3['ADER'] = df1['ADER']
df3['status'] = (df1['Expected Value'] == df1['ADER'])
df3['status'].replace([True, False], ['Matching', 'Not Matching'])
df3.to_csv('output.csv')
Now it works for one column, ADER, but my headers after Expected Value are dynamic and may change. Sometimes there may be one column after Expected Value, sometimes N columns, and the header names may also change. Can someone help with how to handle that?
The piece of code given below generates the desired output. It compares the Expected Value column with each of the columns after it.
import pandas as pd

df = pd.read_csv("input.csv")
expected_value_index = df.columns.get_loc("Expected Value")

for col_index in range(expected_value_index + 1, len(df.columns)):
    column = df.columns[expected_value_index] + " & " + df.columns[col_index]
    df[column] = df.loc[:, "Expected Value"] == df.iloc[:, col_index]
    df[column].replace([True, False], ["Matching", "No Matching"], inplace=True)

df.to_csv("output.csv", index=None)
I haven't tried replicating your code as of yet but here are a few suggestions:
You do not need to read the df two times.
df1 = pd.read_csv('FinalResult1.csv')
is sufficient.
Then, you can proceed with
df1['status'] = (df1['exp'] == df1['ader'])
df1['status'] = df1['status'].replace([True, False], ['Matching', 'Not Matching'])
Alternatively, you could do this row by row by using the pandas apply method.
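For example, a row-by-row apply might look like this (a sketch, assuming the two columns are named 'Expected Value' and 'ADER' as above):
df1['status'] = df1.apply(
    lambda row: 'Matching' if row['Expected Value'] == row['ADER'] else 'Not Matching',
    axis=1)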
If that doesn't work, a reasonable first step would be to print your dataframe to see what is happening.
Try this code:
import numpy as np

k = list(df1.columns).index('Expected Value') + 1
# get the integer index for the column after 'Expected Value'

df3 = df1.iloc[:, :k]
# copy the first k columns

df3 = pd.concat([df3, (df1.iloc[:, k:] == np.repeat(
    df1['Expected Value'].to_frame().values, df1.shape[1] - k, axis=1))], axis=1)
# slice df1 with iloc, which works just like slicing lists in Python
# np.repeat repeats 'Expected Value' across as many columns as needed (df1.shape[1] - k = 3)
# .to_frame: slicing a column from a df returns a 1D Series, so we turn it back into a 2D df
# .values returns the underlying numpy array without any index/column names...
# ...without .values, pandas would try to find 3 columns named 'Expected Value' in df1
# the comparison is then concatenated onto the previous df3

print(df3)
Output
Application FileName ConfigVariable Expected Value ADER UGOM PRD
0 APP1 FileName1 ConfigVariable1 CVD2 True True True
1 APP1 FileName2 ConfigVariable2 VCF7,hg6 True False True
2 APP1 FileName3 ConfigVariable3 VDF9,pova8 True False False
3 APP2 FileName4 ConfigVariable4 gf8 True True True
4 APP3 FileName5 ConfigVariable5 pf8 True False True
5 APP4 FileName6 ConfigVariable vd3,mn7 True True True
Of course, you can do a loop if for some reason you need a special calculation on some column:
for colname in df1.columns[k:]:
    df3[colname] = df1[colname] == df1['Expected Value']

How to compare columns of two dataframes and have consequences when they match in Python Pandas

I am trying to have Python Pandas compare two dataframes with each other. In dataframe 1, I have two columns (AC-Cat and Origin). I am trying to compare the AC-Cat column with the contents of Dataframe 2. If a match is found between the value of Dataframe 1 being examined and one of the columns of Dataframe 2, I want Pandas to copy the header of the Dataframe 2 column in which the match is found to a new column in Dataframe 1.
DF1:
f = {'AC-Cat': pd.Series(['B737', 'A320', 'MD11']),
     'Origin': pd.Series(['AJD', 'JFK', 'LRO'])}
Flight_df = pd.DataFrame(f)
DF2:
w = {'CAT-C': pd.Series(['DC85', 'IL76', 'MD11', 'TU22', 'TU95']),
     'CAT-D': pd.Series(['A320', 'A321', 'AN12', 'B736', 'B737'])}
WCat_df = pd.DataFrame(w)
I imported pandas as pd and numpy as np and tried to define a function to compare these columns.
def get_wake_cat(AC_cat):
    try:
        Wcat = [WCat_df.columns.values[0]][WCat_df.iloc[:, 1] == AC_cat].values[0]
    except:
        Wcat = np.NAN
    return Wcat

Flight_df.loc[:, 'CAT'] = Flight_df.loc[:, 'AC-Cat'].apply(lambda CT: get_wake_cat(CT))
However, the function does not result in the desired outputs. For example: Take the B737 AC-Cat value. I want Python Pandas to then find this value in DF2 in the column CAT-D and copy this header to the new column of DF 1. This does not happen. Can someone help me find out why my code is not giving the desired results?
Not pretty, but I think I got it working. Part of the error was that the function did not have access to WCat_df. I also changed the indexing into two steps:
def get_wake_cat(AC_cat, WCat_df):
    try:
        d = WCat_df[WCat_df.columns.values][WCat_df.iloc[:] == AC_cat]
        Wcat = d.columns[(d == AC_cat).any()][0]
    except:
        Wcat = np.NAN
    return Wcat
Then you need to change your next line to:
Flight_df.loc[:,'CAT'] = Flight_df.loc[:,'AC-Cat'].apply(lambda CT: get_wake_cat(CT,WCat_df ))
AC-Cat Origin CAT
0 B737 AJD CAT-D
1 A320 JFK CAT-D
2 MD11 LRO CAT-C
Hope that solves the problem
This will give you 2 new columns with the name\s of the match\s found:
Flight_df['CAT1'] = Flight_df['AC-Cat'].map(lambda x: 'CAT-C' if x in list(WCat_df['CAT-C']) else '')
Flight_df['CAT2'] = Flight_df['AC-Cat'].map(lambda x: 'CAT-D' if x in list(WCat_df['CAT-D']) else '')
Flight_df.loc[Flight_df['CAT1'] == '', 'CAT1'] = Flight_df['CAT2']
Flight_df.loc[Flight_df['CAT1'] == Flight_df['CAT2'], 'CAT2'] = ''
IIUC, you can do a stack and merge:
final = (Flight_df.merge(WCat_df.stack().reset_index(1, name='AC-Cat'), on='AC-Cat', how='left')
         .rename(columns={'level_1': 'New'}))
print(final)
print(final)
Or with melt:
final = Flight_df.merge(WCat_df.melt(var_name='New', value_name='AC-Cat'),
                        on='AC-Cat', how='left')
AC-Cat Origin New
0 B737 AJD CAT-D
1 A320 JFK CAT-D
2 MD11 LRO CAT-C
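For reference, the reshaped lookup table that melt builds from the sample WCat_df above looks like this (only the first rows of each category shown):
print(WCat_df.melt(var_name='New', value_name='AC-Cat'))
#      New AC-Cat
# 0  CAT-C   DC85
# 1  CAT-C   IL76
# ...
# 5  CAT-D   A320
# 6  CAT-D   A321
# ...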
