Change values in dataframe column based on another dataframe - python

I've seen a large number of similar questions but nothing quite answers what I am looking to do.
I have two dataframes:
Conn_df, which contains names and company details entered manually (e.g. Conn_df["Name", "Company_name", "Company_Address"])
Cleanse_df, which contains cleaned-up company names (e.g. Cleanse_df["Original_Company_Name", "Cleanse_Company_Name"])
The data for both is held in csv files that are imported into the script.
I want to replace the company details in Conn_df.Company_Name using the values in Cleanse_df: wherever Conn_df.Company_Name equals Cleanse_df.Original_Company_Name, it should be replaced by Cleanse_df.Cleanse_Company_Name.
I have tried:
Conn_df["Company"] = Conn_df["Company"].replace(Conn_df["Company"], Cleanse_df["Cleansed"]) but got
replace() takes no keyword arguments
I also tried:
Conn_df["Company"] = Conn_df["Company"].map(Cleanse_df.set_index("Original")["Cleansed"]) but got
Reindexing only valid with uniquely valued Index objects
Any suggestions on how to get the values replaced? I would note that both dataframes run to many tens of thousands of rows, so creating a manual list is not possible.

I think you want something along the lines of this:
import pandas as pd

conn_df = pd.DataFrame({'Name': ['Mac', 'K', 'Hutt'],
                        'Company_name': ['McD', 'KFC', 'PH'],
                        'Company_adress': ['street1', 'street2', 'street4']})
cleanse_df = pd.DataFrame({'Original_Company_Name': ['McD'],
                           'Cleanse_Company_Name': ['MacDonalds']})

# Align the key column names so the merge can match on 'Company_name'
cleanse_df = cleanse_df.rename(columns={'Original_Company_Name': 'Company_name'})
merged_df = conn_df.merge(cleanse_df, on='Company_name', how='left')

# Where no cleaned name was found, fall back to the original name
merged_df['Cleanse_Company_Name'] = merged_df['Cleanse_Company_Name'].fillna(merged_df['Company_name'])

final_df = (merged_df[['Name', 'Company_adress', 'Cleanse_Company_Name']]
            .rename(columns={'Cleanse_Company_Name': 'Company_name'}))
This would return:
   Name Company_adress Company_name
0   Mac        street1   MacDonalds
1     K        street2          KFC
2  Hutt        street4           PH
You merge the two dataframes and keep the replaced new value; if there is no cleaned value to replace a name, the name simply stays the same, which is what the fillna step does.
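Alternatively, the map() approach from the question also works once the cleanse table is unique on the original name; the "Reindexing only valid with uniquely valued Index objects" error is the tell-tale sign of duplicate Original_Company_Name rows. A minimal, self-contained sketch using the question's column names (not part of the original answer):
import pandas as pd

conn_df = pd.DataFrame({'Name': ['Mac', 'K'],
                        'Company_name': ['McD', 'KFC']})
cleanse_df = pd.DataFrame({'Original_Company_Name': ['McD', 'McD'],          # note the duplicate
                           'Cleanse_Company_Name': ['MacDonalds', 'MacDonalds']})

# Drop duplicate originals so the lookup index is unique, then map and
# fall back to the existing name where no cleaned value exists.
lookup = (cleanse_df.drop_duplicates(subset='Original_Company_Name')
                    .set_index('Original_Company_Name')['Cleanse_Company_Name'])

conn_df['Company_name'] = (conn_df['Company_name'].map(lookup)
                           .fillna(conn_df['Company_name']))
print(conn_df)
#   Name Company_name
# 0  Mac   MacDonalds
# 1    K          KFC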

Related

Identify if records exist in another dataframe, within the first dataframe

I have two csv files, OrderOne (approx 105k records) & OrderTwo (approx 115k records)
I want to add a column in OrderTwo which states "TRUE" if that record is found in OrderOne, and "FALSE" if not.
The new column should be appended and the file output.
There is no shared key, so I'm creating one. It will be the concatenation of columns within the orders, which are in different formats from different suppliers. For simplicity in this example, it will be 'Forename' + 'Surname'.
I am reading the two data tables in, one of which I only need a few columns from. I'm converting names to upper case & stripping out white space to ensure they match correctly.
I've read the outputs from these files and they look correct. So far, so good.
import pandas as pd

orderoneData = pd.read_csv('orderone.csv', usecols=['Customer Reference', 'Forename', 'Surname'], index_col=False)
orderoneData.set_index('Customer Reference', inplace=True)
orderoneData["FNSN"] = orderoneData['Forename'].str.strip() + orderoneData["Surname"].str.strip()
orderoneData["FNSN"] = orderoneData["FNSN"].str.upper()

ordertwoData = pd.read_csv('ordertwo.csv')
ordertwoData.set_index('Supplier Reference', inplace=True)
ordertwoData["FNSN"] = ordertwoData['Forename'].str.strip() + ordertwoData["Surname"].str.strip()
ordertwoData["FNSN"] = ordertwoData["FNSN"].str.upper()
Next I'm merging; I'm using OrderTwo as the left (because that's the file I want the new column added to). I intend to change the values of the indicator to Boolean ('both' = True, otherwise False) but I haven't got that far yet.
d = ordertwoData.merge(orderoneData['FNSN'],
                       on=['FNSN'],
                       how='left',
                       indicator=True)
d.reset_index(drop=True, inplace=True)
At this point, I have far too many records (approx 179k; I'm expecting the same as OrderTwo, which is 115k). My understanding was that a left join should have the same number of records as the left table, which in my case is ordertwoData.
#I thought I might have used the wrong merge criteria and it was creating duplicates, so I thought I would just remove them
d1 = d.drop_duplicates()
print(d1)
d1.to_csv("d.csv")
Dropping duplicates leaves me with too few records, so I'm confused how I get the right result.
Any help much appreciated!
As #Clegane identified, the issue here was not the code but the input data containing duplicate records. By including the original reference in the merge then dropping duplicates on OrderTwo['Supplier Reference'] I got the expected answer. Thanks!
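A minimal sketch of that fix with small stand-in frames (the column names follow the post above; the 'Found' column and output filename are just illustrative choices): deduplicate the lookup side on the join key so the left merge cannot multiply rows, then turn the merge indicator into the TRUE/FALSE column.
import pandas as pd

# Stand-ins for the frames built from the two CSV files above.
orderoneData = pd.DataFrame({'FNSN': ['JOEBLOGGS', 'JOEBLOGGS', 'KATESMITH']})   # note the duplicate
ordertwoData = pd.DataFrame({'Supplier Reference': ['S1', 'S2'],
                             'FNSN': ['JOEBLOGGS', 'ANNEJONES']})

# Deduplicate the right-hand side on the join key; otherwise each duplicate
# FNSN in OrderOne adds an extra row to the result (the ~115k -> ~179k jump).
lookup = orderoneData[['FNSN']].drop_duplicates()

d = ordertwoData.merge(lookup, on='FNSN', how='left', indicator=True)

# '_merge' is 'both' when the record also exists in OrderOne.
d['Found'] = d['_merge'].eq('both')
d = d.drop(columns='_merge')

print(d)
#   Supplier Reference       FNSN  Found
# 0                 S1  JOEBLOGGS   True
# 1                 S2  ANNEJONES  False
d.to_csv('ordertwo_with_flag.csv', index=False)   # illustrative output filename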

Fixing broken naming after merging a groupby pivot_table dataframe

I have a problem with the column names of a dataframe that results from merging it with an aggregated version of itself created by groupby.
Generally, the code that creates the mess looks like this:
volume_aggrao = volume.groupby(by=['room_name', 'material', 'RAO']).sum()['quantity']
volume_aggrao_concat = pd.pivot_table(pd.DataFrame(volume_aggrao), index=['room_name', 'material'], columns=['RAO'], values=['quantity'])
volume = volume.merge(volume_aggrao_concat, how='left', on=['room_name', 'material'])
Now to what it does: the goal of the pivot_table is to show the sum of the 'quantity' variable over each category of 'RAO'.
That part is fine until you look at how the result is stored internally:
"('room_name', '')","('material', '')","('quantity', 'moi')","('quantity', 'nao')","('quantity', 'onrao')","('quantity', 'prom')","('quantity', 'sao')"
1,aluminum,NaN,13.0,NaN,NaN,NaN
1,concrete,151.0,NaN,NaN,NaN,NaN
1,plastic,56.0,NaN,NaN,NaN,NaN
1,steel_mark_1,NaN,30.0,2.0,NaN,1.0
1,steel_mark_2,52.0,NaN,88.0,NaN,NaN
2,aluminum,123.0,NaN,84.0,NaN,NaN
2,concrete,155.0,NaN,NaN,30.0,NaN
2,plastic,170.0,NaN,NaN,NaN,NaN
2,steel_mark_1,107.0,NaN,105.0,47.0,NaN
2,steel_mark_2,81.0,41.0,NaN,NaN,NaN
3,aluminum,NaN,NaN,90.0,NaN,79.0
3,concrete,NaN,82.0,NaN,NaN,NaN
3,plastic,1.0,NaN,25.0,NaN,NaN
3,steel_mark_1,116.0,10.0,NaN,136.0,NaN
3,steel_mark_2,NaN,92.0,34.0,NaN,NaN
4,aluminum,50.0,74.0,NaN,NaN,88.0
4,concrete,96.0,NaN,27.0,NaN,NaN
4,plastic,63.0,135.0,NaN,NaN,NaN
4,steel_mark_1,97.0,NaN,28.0,87.0,NaN
4,steel_mark_2,57.0,22.0,7.0,NaN,NaN
Nevertheless, I was still able to merge it, with the resulting columns named automatically from those tuples.
I cannot seem to refer to these '(quantity, smth)' columns, and hence I could not even rename them directly. So I decided to fully reset the column names with volume.columns = ["id", "room_name", "material", "alpha_UA", "beta_UA", "alpha_F", "beta_F", "gamma_EP", "quantity", "files_id", "all_UA", "RAO", "moi", "nao", "onrao", "prom", "sao"], which is indeed bulky, but it worked. Except it does not work when one or more of the categorical values of "RAO" is missing: for example, if there is no "nao" in "RAO", then no such column is created and the code has nothing to rename.
I tried fixing it with volume.rename(lambda x: x.lstrip("(\'quantity\',").strip("\'() \'") if "(" in x else x, axis=1), but it seems to do nothing with them.
I want to know if there is a way to rename these columns.
Data
Here's some example data from the 'volume' dataframe that you can use to replicate the process, with the desired output embedded in it for comparison:
"id","room_name","RAO","moi","nao","onrao","prom","sao"
"1","3","onrao","1","","25","",""
"2","4","nao","57","22","7","",""
"4","2","moi","170","","","",""
"6","4","moi","97","","28","87",""
"7","4","moi","97","","28","87",""
"11","1","nao","","13","","",""
"12","4","onrao","97","","28","87",""
"13","2","moi","107","","105","47",""
"18","2","moi","123","","84","",""
"19","2","moi","155","","","30",""
"22","2","moi","170","","","",""
"23","4","sao","50","74","","","88"
"24","4","nao","50","74","","","88"
So, after a cup of coffee and a cold shower, I was able to investigate a bit further and found out that the strange names are actually tuples and not strings! Knowing that, I decided to iterate over the columns to convert them to strings and then apply the filter. A bit bulky once again, but here is a solution:
names = []
for name in volume.columns:
    names.append(str(name).lstrip("('quantity',").strip("'() '"))
volume.columns = names
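A less bulky alternative, sketched here with a small stand-in for 'volume' (assuming pandas >= 1.0 and the column names used above): pass a scalar to values= so the pivoted columns stay single-level plain strings; then nothing needs renaming afterwards, and categories missing from the data simply produce no column.
import pandas as pd

# Small stand-in for the 'volume' frame used above.
volume = pd.DataFrame({'room_name': [1, 1, 2],
                       'material': ['aluminum', 'aluminum', 'concrete'],
                       'RAO': ['nao', 'nao', 'moi'],
                       'quantity': [6, 7, 155]})

# values='quantity' (a scalar, not a list) keeps the pivoted columns single-level,
# so they come out as plain strings like 'moi' and 'nao'.
volume_aggrao_concat = (pd.pivot_table(volume,
                                       index=['room_name', 'material'],
                                       columns='RAO',
                                       values='quantity',
                                       aggfunc='sum')
                          .reset_index())

volume = volume.merge(volume_aggrao_concat, how='left',
                      on=['room_name', 'material'])
print(volume.columns.tolist())
# ['room_name', 'material', 'RAO', 'quantity', 'moi', 'nao']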

How to modify multiple values in one column, but skip others in pandas python

Going on two months in python and I am focusing hard on Pandas right now. In my current position I use VBA on data frames, so learning this to slowly replace it and further my career.
As of now I believe my true problem is the lack of understanding a key concept(s). Any help would be greatly appreciated.
That said here is my problem:
Where could I go to learn more on how to do stuff like this for more precise filtering? I'm very close, but there is one key aspect I need.
Goal(s)
Main goal: I need to skip certain values in my ID column.
The below code takes out the dashes "-" and keeps only the first 9 digits. Yet I need to skip certain IDs because they are unique.
After that I'll start to work on comparing multiple sheets.
The main data frame's IDs are formatted as 000-000-000-000.
The other data frames that I will compare it to have the ID with no dashes "-" as 000000000, i.e. with three fewer 000's, totaling nine digits.
The unique IDs that I need skipped are the same in both data frames, but are formatted completely differently, ranging from 000-000-000_#12 and 000-000-000_35 to 000-000-000_z.
My code that I will use on each ID except the unique ones:
dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
but I want to use an if statement like (This does not work)
lst = ["000-000-000_#69B", "000-000-000_a", "etc.. random IDs", ]
if ~dfSS["ID"].isin(lst ).any()
dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
else:
pass
For more clarification my input DataFrame is this:
ID Street # Street Name
0 004-330-002-000 2272 Narnia
1 021-521-410-000_128 2311 Narnia
2 001-243-313-000 2235 Narnia
3 002-730-032-000 2149 Narnia
4 000-000-000_a 1234 Narnia
And I am looking to do this as the output:
ID Street # Street Name
0 004330002 2272 Narnia
1 021-521-410-000_128 2311 Narnia
2 001243313000 2235 Narnia
3 002730032000 2149 Narnia
4 000-000-000_a 1234 Narnia
Notes:
dfSS is my DataFrame variable name, i.e. the Excel file I am using. "ID" is my column heading; I will make this an index after the fact.
My Data frame on this job is small with # of (rows, columns) as (2500, 125)
I do not get an error message, so I am guessing maybe I need a loop of some kind. Starting to test for loops with this as well. No luck there... yet.
Here is where I have been to research this:
Comparison of a Dataframe column values with a list
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
if statement with ~isin() in pandas
recordlinkage module-I didn't think this was going to work
Regular expression operations - Having a hard time fully understanding this at the moment
There are a number of ways to do this. The first way here doesn't involve writing a function.
# Create a placeholder column with all transformed IDs
dfSS["ID_trans"] = dfSS["ID"].str.replace("-", "").str[:9]
dfSS.loc[~dfSS["ID"].isin(lst), "ID"] = dfSS.loc[~dfSS["ID"].isin(lst), "ID_trans"] # conditional indexing
The second way is to write a function that conditionally converts the IDs, and it's not as fast as the first method.
def transform_ID(ID_val):
    if ID_val not in lst:
        return ID_val.replace("-", "")[:9]
    return ID_val  # leave the unique IDs unchanged

dfSS['ID_trans'] = dfSS['ID'].apply(transform_ID)
This is based on #xyzxyzjayne's answer, but I have two issues I cannot figure out.
First issue
I get this warning (see Edit):
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Documentation for this warning
You'll see in the code below I tried to put in .loc but I can't seem to find out how to eliminate this warning by using .loc correctly. Still learning it. NO, I will not just ignore it even though it works. This is a learning opportunity I say.
Second issue
I do not understand this part of the code. I know the left side is supposed to be rows and the right side is columns. That said, why does this work? ID is a column, not a row, when this code is run:
df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]
The area I don't understand yet is the left side of the comma (,) in this part:
df.loc[~df["ID "].isin(uniqueID), "ID "]
That said, here is the final result. Basically, as I said, it's XYZ's help that got me here, but I'm adding more .locs and playing with the documentation until I can eliminate the warning.
# uniqueID is a whole list of IDs I had to enter manually (1000+ entries)
# that is used in the code below; these IDs get skipped. Example: "032-234-987_#4256"
uniqueID = ["032-234-987_#4256", ...]
# get only the columns I need, to make the DataFrame smaller
df = df[['ID ', 'Street #', 'Street Name', 'Debris Finish', 'Number of Vehicles',
'Number of Vehicles Removed', 'County']]
#Place holder will make our new column with this filter
df.loc[:, "Place Holder"] = df.loc[:,"ID "].str.replace("-", "").str[:9]
#the next code is the filter that goes through the list and skips them. Work in progress to fully understand.
df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]
#Makes the ID our index
df = df.set_index("ID ")
#just here to add the date to our file name. Must import time for this to work
todaysDate = time.strftime("%m-%d-%y")
#make it an excel file
df.to_excel("ID TEXT " + todaysDate + ".xlsx")
Will edit this once I get rid of the warning and figure out the left side, so I can explain it for everyone who needs/sees this post.
Edit: SettingWithCopyWarning:
Fixed this chained-indexing problem by making a copy of the original dataframe before filtering and making everything use .loc, as XYZ helped me with. Before you start to filter, use DataFrame.copy(), where DataFrame is the name of your own dataframe.
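To tie the two issues together, here is a minimal, self-contained sketch (the column name 'ID ' keeps the trailing space used above; the data is made up): the boolean Series on the left of the comma selects which rows to touch, the label on the right selects which column, and working on an explicit .copy() of a sliced frame is what avoids the SettingWithCopyWarning.
import pandas as pd

# Tiny example frame standing in for the Excel data.
df = pd.DataFrame({'ID ': ['004-330-002-000', '021-521-410-000_128'],
                   'Street Name': ['Narnia', 'Narnia']})
uniqueID = ['021-521-410-000_128']   # IDs to skip

# If df were a slice of a bigger frame, take an explicit copy first so the
# .loc assignment below targets an independent frame, not a view.
df = df.copy()

# Boolean mask: True for rows whose ID is NOT in the skip list.
mask = ~df['ID '].isin(uniqueID)

# df.loc[row_selector, column_selector]:
#   left of the comma  -> the mask, i.e. which rows to touch
#   right of the comma -> 'ID ', i.e. which column to touch
df.loc[mask, 'ID '] = df.loc[mask, 'ID '].str.replace('-', '', regex=False).str[:9]

print(df)
#                     ID  Street Name
# 0            004330002       Narnia
# 1  021-521-410-000_128       Narnia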

How to divide a pandas dataframe into different dataframes based on unique values from one column and iterate over that?

I have a dataframe with three columns
The first column has 3 unique values. I used the below code to create separate dataframes; however, I am unable to iterate over those dataframes and am not sure how to do so.
df = pd.read_excel("input.xlsx")
unique_groups = list(df.iloc[:, 0].unique())   # let's assume the unique values are 0, 1, 2
mtlist = []
for index, value in enumerate(unique_groups):
    globals()['df%s' % index] = df[df.iloc[:, 0] == value]
    mtlist.append('df%s' % index)
print(mtlist)
O/P
['df0', 'df1', 'df2']
for example lets say I want to find out the length of the first unique dataframe
if I manually type the name of the DF I get the correct output
len(df0)
O/P
35
But I am trying to automate the code, so technically I want to find the length and iterate over that dataframe normally, as I would by typing the name.
What I'm looking for is
if I try the below code
len('df%s' % 0)
I want to get the actual length of the dataframe instead of the length of the string.
Could someone please guide me how to do this?
I have also tried to create a dictionary using the below code, but I can't figure out how to iterate over the dictionary when the DF has more than two columns, where the key would be the unique group and the value contains the remaining columns of the same row.
df = pd.read_excel("input.xlsx")
unique_groups = list(df["Assignment Group"].unique())
length_of_unique_groups = len(unique_groups)
mtlist = []
df_dict = {name: df.loc[df['Assignment Group'] == name] for name in unique_groups}
Can someone please provide a better solution?
UPDATE
SAMPLE DATA
Assignment_group Description Document
Group A Text to be updated on the ticket 1 doc1.pdf
Group B Text to be updated on the ticket 2 doc2.pdf
Group A Text to be updated on the ticket 3 doc3.pdf
Group B Text to be updated on the ticket 4 doc4.pdf
Group A Text to be updated on the ticket 5 doc5.pdf
Group B Text to be updated on the ticket 6 doc6.pdf
Group C Text to be updated on the ticket 7 doc7.pdf
Group C Text to be updated on the ticket 8 doc8.pdf
Lets assume there are 100 rows of data
I'm trying to automate ServiceNow ticket creation with the above data.
So my end goal is that GROUP A tickets should go to one group; however, for each description a unique task has to be created. We can club 10 tasks together and submit them as one request, so if I divide the df into different dfs based on Assignment_group, it would be easier to iterate over (that's the only idea I could think of).
For example, let's say we have REQUEST001; within that request there will be multiple sub-tasks such as STASK001, STASK002 ... STASK010.
Hope this helps.
Your problem is easily solved by groupby, one of the most useful tools in pandas:
length_of_unique_groups = df.groupby('Assignment Group').size()
You can do all kinds of operations (sum, count, std, etc.) on your remaining columns, like getting the mean value of price for each group if that were a column.
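If the goal is to iterate over each sub-dataframe rather than aggregate, here is a minimal sketch using the 'Assignment Group' column name from the code above, with small made-up data (no globals() or eval needed):
import pandas as pd

# Small stand-in for the frame read from input.xlsx.
df = pd.DataFrame({'Assignment Group': ['Group A', 'Group B', 'Group A', 'Group C'],
                   'Description': ['ticket 1', 'ticket 2', 'ticket 3', 'ticket 4'],
                   'Document': ['doc1.pdf', 'doc2.pdf', 'doc3.pdf', 'doc4.pdf']})

# `name` is the group key and `group` is the corresponding sub-dataframe.
for name, group in df.groupby('Assignment Group'):
    print(name, len(group))                    # length of each sub-dataframe

    # Submit the group's rows in chunks of up to 10 (one request per chunk).
    for start in range(0, len(group), 10):
        chunk = group.iloc[start:start + 10]
        # ... create one request with a sub-task per row in `chunk` ...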
I think you want to try something like len(eval('df%s' % 0))

Merge misaligned pandas dataframes

I have around 100 csv files. Each of them is read into its own pandas dataframe, then merged later on, and finally written into a database.
Each csv file contains 1000 rows and 816 columns.
Here is the problem:
Each of the csv files contains the 816 columns, but not all of the columns contain data. As a result, some of the csv files are misaligned: the data has been shifted left, but the column has not been deleted.
Here's a made-up example:
CSV file A (which is correct):
Name Age City
Joe 18 London
Kate 19 Berlin
Math 20 Paris
CSV file B (with misalignment):
Name Age City
Joe 18 London
Kate Berlin
Math 20 Paris
I would like to merge A and B, but my current solution results in a misalignment.
I'm not sure whether this is easier to deal with in SQL or Python, but I hoped some of you could come up with a good solution.
The current solution to merge the dataframes is as follows:
def merge_pandas(csvpaths):
    frames = []
    for path in csvpaths:
        frame = pd.read_csv(sMainPath + path, header=0, index_col=None)
        frames.append(frame)
    return pd.concat(frames)
Thanks in advance.
A generic solution for these types of problems is most likely overkill. We note that the only possible mistake is when a value is written into a column to the left of where it belongs.
If your problem is more complex than the two-column example you gave, you should keep an array with the expected type of each column for convenience:
types = ['string', 'int']
Next, I would set up a marker to identify flaws:
df['error'] = 0
df.loc[df.City.isnull(), 'error'] = 1
The script can detect the error with certainty
In your simple scenario, whenever there is an error, we can simply check the value in the first column.
If it's a number, ignore and move on (keep NaN on second value)
If it's a string, move it to the right
In your trivial example, that would be
import numpy as np

def checkRow(row):
    try:
        row['Age'] = int(row['Age'])
    except ValueError:
        # The Age slot actually holds the city, so shift it right.
        row['City'] = row['Age']
        row['Age'] = np.nan
    return row

df = df.apply(checkRow, axis=1)
In case you have more than two columns, use your types variable to do iterated checks to find out where the NaN belongs.
The script cannot know the error with certainty
For example, when two adjacent columns both hold string values, the script cannot tell which column a value belongs to. In that case, use a second marker to save these rows and fix them manually. You could of course do advanced checks (it should be a city name, so check whether the value is a city name), but this is probably overkill and doing it manually is faster.
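For the wider case, here is a sketch of that iterated check (an illustration under stated assumptions, not the answer's exact method): a types list gives the expected type of each column in order, and a value that fails its column's type check is shifted one position to the right when the slot next to it is empty.
import numpy as np
import pandas as pd

# Expected type of each column, in column order (an assumption for this sketch).
types = ['string', 'int', 'string']            # Name, Age, City

def fits(value, expected):
    """Crude check: does `value` look like the expected column type?"""
    if pd.isna(value):
        return True                            # empty cells are not evidence of a shift
    if expected == 'int':
        try:
            int(value)
            return True
        except (TypeError, ValueError):
            return False
    return isinstance(value, str)

def realign(row):
    """Shift a value one column to the right when it does not match its column's type."""
    values = list(row)
    for i in range(len(values) - 2, -1, -1):   # right to left, so shifts do not collide
        if not fits(values[i], types[i]) and pd.isna(values[i + 1]):
            values[i + 1] = values[i]
            values[i] = np.nan
    return pd.Series(values, index=row.index)

df = pd.DataFrame({'Name': ['Joe', 'Kate'],
                   'Age':  ['18', 'Berlin'],   # Kate's city slipped into the Age column
                   'City': ['London', np.nan]})
df = df.apply(realign, axis=1)
print(df)
#    Name  Age    City
# 0   Joe   18  London
# 1  Kate  NaN  Berlin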
