Unable to merge DataFrames: getting ValueError - python

I'm currently trying to merge 2 dataframes based on a "key" that is created as a combination of alphanumeric and special characters. I'm getting a "ValueError: can not merge DataFrame with instance of type <class 'str'>".
I'm aware this topic is covered in the thread below; however, the issue seemed very simple in that case, where one of the datasets was not a DataFrame.
Python pandas: "can not merge DataFrame with instance of type <class 'str'>"
I have, on the other hand, checked that both the datasets I want to merge are of type "DataFrame". So could it be that the "key" I'm using to combine the two dataframes has special characters in it? For example, one such key could look like "1078-ORD-XHKG-HKDOct-23ATM". I do have similar keys in both dataframes for the merging to proceed, yet I get a ValueError.
The code used to merge is simply-
df_new = pd.merge('df_1', 'df_2', on = 'key', how = 'left')
I have tried the other ways of merging suggested in other threads as well. It still doesn't help. Can someone please help?

Can you try using this:
pd.merge(df_1, df_2, how='left', left_on=['key df_1'], right_on=['key df_2'])
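For what it's worth, the quoted error usually comes from the quotes themselves: in the question's snippet the DataFrames are passed as the strings 'df_1' and 'df_2', so pandas receives a str instead of a DataFrame. A minimal sketch (with made-up columns) of that fix:

```python
import pandas as pd

df_1 = pd.DataFrame({'key': ['1078-ORD-XHKG-HKDOct-23ATM'], 'a': [1]})
df_2 = pd.DataFrame({'key': ['1078-ORD-XHKG-HKDOct-23ATM'], 'b': [2]})

# pd.merge('df_1', 'df_2', on='key', how='left') raises the ValueError,
# because the strings 'df_1'/'df_2' are not DataFrames.

# Pass the objects themselves; special characters in the key values
# are not a problem for the merge.
df_new = pd.merge(df_1, df_2, on='key', how='left')
```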

Related

How can I get the difference between two dataframes based on one column?

I have a dataframe (allPop) and a geodataframe (allTracts). I'm merging them on the column GEOID, which they both share:
newTracts = allTracts.merge(allPop, on='GEOID')
My problem is that I'm losing data on this merge, which conceptually shouldn't be happening. Each of the records in allPop should match one of the records from allTracts, but newTracts has a couple hundred fewer records than allPop. I'd like to be able to look at the records not included in the merge to try to diagnose the problem. Is there a way to do this? Or else, can I find the difference between allPop and allTracts based on their 'GEOID' columns? I've seen how to do this when both dataframes have all of the same column names/types, but can I do it based on only one column? I'm not sure what the output would look like, but lists of the GEOIDs that aren't being merged from both dataframes would be good. Or else the dataframes themselves without the records that were merged. Thanks!
You can use the isin method in Pandas.
badPop = allPop[~allPop['GEOID'].isin(allTracts['GEOID'])]
You can also use the indicator option of the merge method along with how='outer' to find the offending rows.
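A minimal sketch of the indicator approach, with made-up GEOID values:

```python
import pandas as pd

allTracts = pd.DataFrame({'GEOID': ['01', '02', '03'], 'tract': ['t1', 't2', 't3']})
allPop = pd.DataFrame({'GEOID': ['01', '02', '04'], 'pop': [100, 200, 300]})

# indicator=True adds a '_merge' column saying where each row came from:
# 'both', 'left_only' (allTracts only) or 'right_only' (allPop only)
merged = allTracts.merge(allPop, on='GEOID', how='outer', indicator=True)

# GEOIDs present in allPop but missing from allTracts
missing = merged.loc[merged['_merge'] == 'right_only', 'GEOID'].tolist()
```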

multiplication of two columns in dataframe SettingWithCopyWarning

I've a large dataframe. I'm trying to do a simple multiplication between two columns and put the result in a new column. When I do that I'm getting this error message:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
My code looks like this:
DF['mult'] = DF['price'] * DF['rate']
I tried loc but it didn't work. Does anyone have a solution?
You should use df.assign() in this case:
df2 = DF.assign(mult=DF['price']*DF['rate'])
You get back a new dataframe with a 'mult' column added.
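The warning typically shows up when DF is itself a slice of another dataframe; a small sketch (column names invented) of both the trigger and the assign() fix:

```python
import pandas as pd

df = pd.DataFrame({'price': [10.0, 20.0], 'rate': [2.0, 3.0], 'kind': ['a', 'b']})

sliced = df[df['kind'] == 'a']  # a slice of df
# sliced['mult'] = sliced['price'] * sliced['rate']  # may raise SettingWithCopyWarning

# assign() returns a brand-new frame, so nothing is written onto a slice
df2 = sliced.assign(mult=sliced['price'] * sliced['rate'])
```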

pd.merge throwing error while executing through .bat file

The Python script does not run when executed from a .bat file, but runs seamlessly in the editor.
The error is related to a datatype difference in the pd.merge call, although the datatype given to both columns is the same in both dataframes.
df2a["supply"] = df2a["supply"].astype(str)
df2["supply_typ"] = df2["supply_typ"].astype(str)
df2a["supply_typ"] = df2a["supply_typ"].astype(str)
df = pd.merge(df2, df2a, how=join,
              on=['entity_id','pare','grome','buame','tame','prd','gsn',
                  'supply','supply_typ'],
              suffixes=['gs2','gs2x'])
While running the bat file I am getting the following error in pd.merge:
You are trying to merge on float64 and object columns. If you wish to proceed you should use pd.concat
Not a direct answer, but contains code that cannot be formatted in a comment, and should be enough to solve the problem.
When pandas says that you are trying to merge on float64 and object columns, it is certainly right. It may not be evident because pandas relies on numpy, and that a numpy object column can store any data.
I ended with a simple function to diagnose all those data type problem:
def show_types(df):
    for i, c in enumerate(df.columns):
        print(df[c].dtype, type(df.iat[0, i]))
It shows both the pandas dtype of each column of a dataframe and the actual type of the first element of that column. It can help to see the difference between columns containing str elements and others containing datetime.datetime ones, while the dtype is just object.
Use that on both of your dataframes, and the problem should become evident...
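A self-contained sketch of what show_types reveals, with toy data standing in for the question's columns:

```python
import pandas as pd

def show_types(df):
    for i, c in enumerate(df.columns):
        print(df[c].dtype, type(df.iat[0, i]))

df2 = pd.DataFrame({'supply': [1.0, 2.0]})       # float64 column
df2a = pd.DataFrame({'supply': ['1.0', '2.0']})  # object column holding str

show_types(df2)   # float64 <class 'numpy.float64'>
show_types(df2a)  # object <class 'str'>

# Casting both sides to the same type (as the question attempts) fixes the merge
df2['supply'] = df2['supply'].astype(str)
merged = pd.merge(df2, df2a, how='inner', on=['supply'])
```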

Can't access part of Pandas dataframe by multiindex

I'm new with Pandas so this is a basic question. I created a DataFrame by concatenating two previous DataFrames. I used
todo_pd = pd.concat([rabia_pd, capitan_pd], keys=['Rabia','Capitan'])
thinking that in the future I could separate them easily and save each one to a different location. Right now I'm unable to do this separation using the keys I defined with the concat function.
I've tried simple things like
half_dataframe = todo_pd['Rabia']
but it throws me an error saying that there is a problem with the key.
I've also tried with other options I've found in SO, like using the
_get_values('Rabia'), or the .index._get_level_values('Rabia') features, but they all throw me different errors saying either that it does not recognize a string as a way to access the information, or that it requires a positional argument: 'level'.
The whole Dataframe contains about 22 columns, and I just want to retrieve from the "big dataframe" the part indexed as 'Rabia' and the part index as 'Capitan'.
I'm sure it has a simple solution that I'm not getting for my lack of practice with Pandas.
Thanks a lot,
Use DataFrame.xs:
df1 = todo_pd.xs('Rabia')
df2 = todo_pd.xs('Capitan')
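A tiny sketch with stand-in data confirming that xs recovers each original frame:

```python
import pandas as pd

rabia_pd = pd.DataFrame({'col': [1, 2]})
capitan_pd = pd.DataFrame({'col': [3, 4]})

# keys=... builds a MultiIndex whose outer level labels each source frame
todo_pd = pd.concat([rabia_pd, capitan_pd], keys=['Rabia', 'Capitan'])

# xs() selects all rows under one label of the outer index level
df1 = todo_pd.xs('Rabia')
df2 = todo_pd.xs('Capitan')
```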

In pandas dataframe handling object data type

I'm tearing my hair out a bit with this one. I've imported two CSVs into pandas dataframes; both have a column called SiteReference. I want to use pd.merge to join the dataframes using SiteReference as a key.
The initial merge failed as pd.read_csv took different interpretations of the SiteReference values: in one instance 380500145.0, in the other 380500145, both stored as objects. I ran a regex to clean the columns and then pd.to_numeric; this resulted in one value of 380500145.0 and another of 3.805001e+10. They should both be 380500145. I then attempted:
df['SiteReference'] = df['SiteReference'].astype(int).astype('str')
But got back:
ValueError: cannot convert float NaN to integer
How can I control how pandas is dealing with these, preferably on import?
Perhaps the best solution is to prevent pd.read_csv from inferring the type of this field:
df=pd.read_csv('data.csv',sep=',',dtype={'SiteReference':str})
Following the discussion in the comments, if you want to format floats as integer strings, you can use this:
df['SiteReference'] = df['SiteReference'].map('{:,.0f}'.format)
This should handle null values gracefully.
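A small sketch of both pieces, using an in-memory CSV so it is self-contained:

```python
import pandas as pd
from io import StringIO

csv_data = StringIO("SiteReference\n380500145.0\n380500145\n")

# dtype=str on import keeps pandas from guessing float vs int per file
df = pd.read_csv(csv_data, dtype={'SiteReference': str})

# Formatting floats as integer strings; NaN simply formats as 'nan'
s = pd.Series([380500145.0, float('nan')]).map('{:,.0f}'.format)
```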
