I have two csv files, OrderOne (approx 105k records) & OrderTwo (approx 115k records)
I want to add a column in OrderTwo which states "TRUE" if that record is found in OrderOne, and "FALSE" if not.
The new column should be appended and the file output.
There is no shared key, so I'm creating one. It will be the concatenation of columns within the orders, which are in different formats from different suppliers. For simplicity in this example, it will be 'Forename' + 'Surname'.
I am reading the two data tables in, one of which I only need a few columns from. I'm converting names to upper case & stripping out white space to ensure they match correctly.
I've read the outputs from these files and they look correct. So far, so good.
import pandas as pd
orderoneData = pd.read_csv('orderone.csv', usecols=['Customer Reference', 'Forename', 'Surname'], index_col=False)
orderoneData.set_index('Customer Reference', inplace=True)
orderoneData["FNSN"] = orderoneData['Forename'].str.strip() + orderoneData["Surname"].str.strip()
orderoneData["FNSN"] = orderoneData["FNSN"].str.upper()
ordertwoData = pd.read_csv('ordertwo.csv')
ordertwoData.set_index('Supplier Reference', inplace=True)
ordertwoData["FNSN"] = ordertwoData['Forename'].str.strip() + ordertwoData["Surname"].str.strip()
ordertwoData["FNSN"] = ordertwoData["FNSN"].str.upper()
Next I'm merging; I'm using OrderTwo as the left (because that's the file I want the new column added to). I intend to change the values of the indicator to Boolean ('both' = True, otherwise False) but I haven't got that far yet.
d = (
    ordertwoData.merge(orderoneData['FNSN'],
                       on=['FNSN'],
                       how='left',
                       indicator=True,
                       )
)
d.reset_index(drop=True, inplace=True)
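(The indicator-to-Boolean step I have in mind would be something like this, assuming the merged frame d from above; the new column name is just a placeholder:)
d["Found in OrderOne"] = d["_merge"].eq("both")  # '_merge' is 'both' when the FNSN key was found in OrderOne
d = d.drop(columns="_merge")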
At this point, I have far too many records (approx 179k; I'm expecting the same as OrderTwo, which is 115k). My understanding was that a left join should have the same number of records as the left table, which in my case is ordertwoData.
#I thought I might have used the wrong merge criteria and it was creating duplicates, so I thought I would just remove them
d1 = d.drop_duplicates()
print(d1)
d1.to_csv("d.csv")
Dropping duplicates leaves me with too few records, so I'm confused about how to get the right result.
Any help much appreciated!
As #Clegane identified, the issue here was not the code but the input data containing duplicate records: a left join produces one row per matching pair, so duplicate FNSN keys in OrderOne multiply the rows coming from OrderTwo. By including the original reference in the merge and then dropping duplicates on OrderTwo's 'Supplier Reference', I got the expected answer. Thanks!
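For anyone hitting the same thing, a rough sketch of that fix using the same frames and names as above (keeping 'Supplier Reference' as a column so it survives the merge and can be used to deduplicate):
# duplicate FNSN keys in OrderOne multiply rows in a left join,
# so keep 'Supplier Reference' and collapse back to one row per OrderTwo record
d = ordertwoData.reset_index().merge(orderoneData[['FNSN']], on='FNSN', how='left', indicator=True)
d = d.drop_duplicates(subset='Supplier Reference')
# then convert '_merge' to the TRUE/FALSE column as sketched earlier and write the file out with to_csv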
I have a problem with the naming of columns in a dataframe that results from merging it with an aggregation of itself created by groupby.
Generally, the code that creates the mess looks like this:
volume_aggrao = volume.groupby(by=['room_name', 'material', 'RAO']).sum()['quantity']
volume_aggrao_concat = pd.pivot_table(pd.DataFrame(volume_aggrao), index=['room_name', 'material'], columns=['RAO'], values=['quantity'])
volume = volume.merge(volume_aggrao_concat, how='left', on=['room_name', 'material'])
Now to what it does: the goal of the pivot_table is to show the sum of 'quantity' over each category of 'RAO', and it looks like this:
And it is fine until you look at how it is represented on the inside:
"('room_name', '')","('material', '')","('quantity', 'moi')","('quantity', 'nao')","('quantity', 'onrao')","('quantity', 'prom')","('quantity', 'sao')"
1,aluminum,NaN,13.0,NaN,NaN,NaN
1,concrete,151.0,NaN,NaN,NaN,NaN
1,plastic,56.0,NaN,NaN,NaN,NaN
1,steel_mark_1,NaN,30.0,2.0,NaN,1.0
1,steel_mark_2,52.0,NaN,88.0,NaN,NaN
2,aluminum,123.0,NaN,84.0,NaN,NaN
2,concrete,155.0,NaN,NaN,30.0,NaN
2,plastic,170.0,NaN,NaN,NaN,NaN
2,steel_mark_1,107.0,NaN,105.0,47.0,NaN
2,steel_mark_2,81.0,41.0,NaN,NaN,NaN
3,aluminum,NaN,NaN,90.0,NaN,79.0
3,concrete,NaN,82.0,NaN,NaN,NaN
3,plastic,1.0,NaN,25.0,NaN,NaN
3,steel_mark_1,116.0,10.0,NaN,136.0,NaN
3,steel_mark_2,NaN,92.0,34.0,NaN,NaN
4,aluminum,50.0,74.0,NaN,NaN,88.0
4,concrete,96.0,NaN,27.0,NaN,NaN
4,plastic,63.0,135.0,NaN,NaN,NaN
4,steel_mark_1,97.0,NaN,28.0,87.0,NaN
4,steel_mark_2,57.0,22.0,7.0,NaN,NaN
Nevertheless, I was still able to merge it, with the resulting columns being named automatically like this:
I cannot seem to call these '(quantity, smth)' columns, and hence could not even rename them directly. So I decided to fully reset the column names with volume.columns = ["id", "room_name", "material", "alpha_UA", "beta_UA", "alpha_F", "beta_F", "gamma_EP", "quantity", "files_id", "all_UA", "RAO", "moi", "nao", "onrao", "prom", "sao"], which is indeed bulky, but it worked. Except it does not work when one or more of the categorical values of "RAO" is missing: for example, if there is no "nao" in "RAO", then no such column is created, the hard-coded list no longer matches, and the code has nothing to rename.
I tried fixing it with volume.rename(lambda x: x.lstrip("(\'quantity\',").strip("\'() \'") if "(" in x else x, axis=1), but it seems to do nothing with them.
I want to know if there is a way to rename these columns.
Data
Here's some example data for the 'volume' dataframe that you may use to replicate the process, with the desired output embedded in it to compare:
"id","room_name","RAO","moi","nao","onrao","prom","sao"
"1","3","onrao","1","","25","",""
"2","4","nao","57","22","7","",""
"4","2","moi","170","","","",""
"6","4","moi","97","","28","87",""
"7","4","moi","97","","28","87",""
"11","1","nao","","13","","",""
"12","4","onrao","97","","28","87",""
"13","2","moi","107","","105","47",""
"18","2","moi","123","","84","",""
"19","2","moi","155","","","30",""
"22","2","moi","170","","","",""
"23","4","sao","50","74","","","88"
"24","4","nao","50","74","","","88"
So, after a cup of coffee and a cold shower, I was able to investigate a bit further and found out that the strange namings are actually tuples and not strings! Knowing that, I decided to iterate over the columns and replace each tuple with a plain string name. A bit bulky once again, but here is a solution:
names = []
for name in volume.columns:
    names.append((name[1] or name[0]) if isinstance(name, tuple) else name)  # ('quantity', 'moi') -> 'moi'
volume.columns = names
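As a side note, a common alternative (just a sketch, using the same volume and volume_aggrao_concat as above) is to flatten the pivot's columns before merging, so the tuple names never appear in volume at all:
# drop the top 'quantity' level so the columns are just the RAO labels,
# and move room_name/material out of the index so the merge keys are plain columns
flat = volume_aggrao_concat.droplevel(0, axis=1).reset_index()
volume = volume.merge(flat, how='left', on=['room_name', 'material'])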
Going on two months in Python and I am focusing hard on pandas right now. In my current position I use VBA for this kind of data work, so I'm learning pandas to slowly replace it and further my career.
As of now I believe my true problem is a lack of understanding of a key concept (or concepts). Any help would be greatly appreciated.
That said here is my problem:
Where could I go to learn more about how to do this kind of more precise filtering? I'm very close, but there is one key aspect I'm missing.
Goal(s)
Main goal: I need to skip certain values in my ID column.
The code below takes out the dashes "-" and keeps only the first nine digits. However, I need to skip certain IDs because they are unique.
After that I'll start to work on comparing multiple sheets.
The main data frame's IDs are formatted as 000-000-000-000.
The other data frames that I will compare it to have the IDs with no dashes "-", as 000000000, i.e. one fewer group of 000, totaling nine digits.
The unique IDs that I need skipped are the same in both data frames, but are formatted completely differently, e.g. 000-000-000_#12, 000-000-000_35, or 000-000-000_z.
My code that I will use on each ID except the unique ones:
dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
but I want to use an if statement, something like this (which does not work):
lst = ["000-000-000_#69B", "000-000-000_a", "etc.. random IDs", ]
if ~dfSS["ID"].isin(lst).any():
    dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
else:
    pass
For more clarification my input DataFrame is this:
                    ID  Street #  Street Name
0      004-330-002-000      2272       Narnia
1  021-521-410-000_128      2311       Narnia
2      001-243-313-000      2235       Narnia
3      002-730-032-000      2149       Narnia
4        000-000-000_a      1234       Narnia
And this is the output I am looking for:
                    ID  Street #  Street Name
0            004330002      2272       Narnia
1  021-521-410-000_128      2311       Narnia
2         001243313000      2235       Narnia
3         002730032000      2149       Narnia
4        000-000-000_a      1234       Narnia
Notes:
dfSS is my DataFrame variable name, i.e. the Excel file I am using. "ID" is my column heading; I will make it an index after the fact.
My DataFrame on this job is small, with (rows, columns) = (2500, 125).
I do not get an error message, so I am guessing maybe I need a loop of some kind. I'm starting to test for loops with this as well... no luck there yet.
Here is where I have been to research this:
Comparison of a Dataframe column values with a list
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
if statement with ~isin() in pandas
recordlinkage module - I didn't think this was going to work
Regular expression operations - Having a hard time fully understanding this at the moment
There are a number of ways to do this. Note that a plain Python if can't decide per row: dfSS["ID"].isin(lst).any() collapses the whole column to a single True/False, so the if picks one branch for every row at once. The first way here doesn't involve writing a function; it selects the rows to change with a boolean mask.
# Create a placeholder column with all transformed IDs
dfSS["ID_trans"] = dfSS["ID"].str.replace("-", "").str[:9]
dfSS.loc[~dfSS["ID"].isin(lst), "ID"] = dfSS.loc[~dfSS["ID"].isin(lst), "ID_trans"] # conditional indexing
The second way is to write a function that conditionally converts the IDs, and it's not as fast as the first method.
def transform_ID(ID_val):
    if ID_val not in lst:
        return ID_val.replace("-", "")[:9]
    return ID_val  # leave the unique IDs unchanged

dfSS['ID_trans'] = dfSS['ID'].apply(transform_ID)
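A third option, sketched with the same dfSS and lst names assumed from above, keeps everything vectorised with Series.where, which picks between the original and transformed values row by row:
# keep the original ID where it appears in the skip-list, otherwise use the stripped/truncated form
keep = dfSS["ID"].isin(lst)
dfSS["ID"] = dfSS["ID"].where(keep, dfSS["ID"].str.replace("-", "", regex=False).str[:9])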
This is based on #xyzxyzjayne's answer, but I have two issues I cannot figure out.
First issue
is that I get this warning (see Edit):
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Documentation for this warning
You'll see in the code below that I tried to put in .loc, but I can't seem to figure out how to eliminate this warning by using .loc correctly. Still learning it. No, I will not just ignore it even though the code works; this is a learning opportunity, I say.
Second issue
is that I do not understand this part of the code. I know the left side is supposed to be rows and the right side is columns. That said, why does this work? "ID " is a column, not a row, when this code is run:
df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]
The area I don't understand yet is the left side of the comma (,) in this part:
df.loc[~df["ID "].isin(uniqueID), "ID "]
That said, here is the final result. Basically, as I said, it's XYZ's help that got me here, but I'm adding more .locs and playing with the documentation until I can eliminate the warning.
# uniqueID is a whole list of 1000+ IDs I had to enter manually; these IDs get skipped
uniqueID = ["032-234-987_#4256"]  # ...plus the rest of the manually entered entries
# gets the columns i need to make the DateFrame smaller
df = df[['ID ', 'Street #', 'Street Name', 'Debris Finish', 'Number of Vehicles',
'Number of Vehicles Removed', 'County']]
#Place holder will make our new column with this filter
df.loc[:, "Place Holder"] = df.loc[:,"ID "].str.replace("-", "").str[:9]
#the next code is the filter that goes through the list and skips them. Work in progress to fully understand.
df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]
#Makes the ID our index
df = df.set_index("ID ")
#just here to add the date to our file name. Must import time for this to work
todaysDate = time.strftime("%m-%d-%y")
#make it an excel file
df.to_excel("ID TEXT " + todaysDate + ".xlsx")
Will edit this once I get rid of the warning and figure out the left side, so I can explain it for everyone who needs/sees this post.
Edit: SettingWithCopyWarning:
Fixed this chained indexing problem by making a copy of the original DataFrame before filtering and making everything .loc, as XYZ helped me with. Before you start to filter, use DataFrame.copy(), where DataFrame is the name of your own dataframe.
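In other words, a minimal sketch of the change, using the same column names as the code above:
# .copy() makes df an independent DataFrame rather than a view of the original,
# so the later .loc assignments no longer trigger SettingWithCopyWarning
df = df[['ID ', 'Street #', 'Street Name', 'Debris Finish', 'Number of Vehicles',
         'Number of Vehicles Removed', 'County']].copy()
df.loc[:, "Place Holder"] = df.loc[:, "ID "].str.replace("-", "").str[:9]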