Pandas question - merging on int64 and object columns?

I have a couple of Pandas dataframes that I am trying to merge together without any luck.
The first dataframe (let's call it dataframe A) looks a little like this:
offer | type | country
------|------|--------
123 | A | UK
456 | B | UK
789 | A | ROI
It's created by reading in an .xlsx file using the following code:
file_name = "My file.xlsx"
df_a = pd.read_excel(file_name, sheet_name="Offers Plan", usecols=["offer", "type", "country"], dtype={"offer": str})
The offer column is read in as a string because otherwise the values end up in the format 123.0. The offers need to be in the format 123 because they are used later in some embedded SQL that looks them up in a database table. In that table the offers are stored as 123, so the SQL returns no results when searching for 123.0.
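For illustration, a minimal sketch of why the dtype matters (made-up values, not the question's actual file; Excel stores numbers as floats, so pandas often reads them back as float64):

import pandas as pd

s_float = pd.Series([123, 456, 789], dtype=float)
print(s_float.astype(str).tolist())  # ['123.0', '456.0', '789.0'] - breaks the SQL lookup

s_str = pd.Series([123, 456, 789], dtype=str)
print(s_str.tolist())  # ['123', '456', '789'] - matches the database table's format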
The second dataframe (dataframe B) looks a little like this:
offer
-----
123
456
789
123
456
123
What I want to do is merge the two dataframes together so the results look like this:
offer | type | country
------|------|--------
123 | A | UK
456 | B | UK
789 | A | ROI
123 | A | UK
456 | B | UK
123 | A | UK
I've tried using the following code, but I get an error message saying "ValueError: You are trying to merge on int64 and object columns":
df_merged = pd.merge(df_b, df_a, how='left', on="offer")
Does anyone know how I can merge the dataframes correctly please?

IIUC you can just change the df_a column to an int
df_a['offer'] = df_a['offer'].astype(int)
This will change it from a float/str to an int. If this gives you an error about converting from a str/float to an int, check that you don't have any NaN/nulls in your data; if you do, you will need to remove them for the conversion to succeed.
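Note that the question needs the offers as strings for the embedded SQL later on, so converting the key in the other direction may be safer. A minimal sketch of that alternative (assuming df_b's offer column is the int64 side of the error):

# Align the dtypes by converting df_b's key to str instead, so df_a keeps
# the '123' string form needed for the SQL step
df_b['offer'] = df_b['offer'].astype(str)
df_merged = pd.merge(df_b, df_a, how='left', on='offer')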

Related

Merging 2 datasets in Python

I have 2 different datasets and I want to merge them on the column "country", keeping the common country names and dropping the ones that differ. I have done an inner merge, but the resulting dataset is not what I want:
inner_merged = pd.merge(TFC_DATA, CO2_DATA, how="inner", on="country")
TFC_DATA (in the original dataset there is a column called year, but I've dropped it):
| Country | TFP |
|---------|-----|
| Angola | 0.8633379340171814 |
| Angola | 0.9345720410346984 |
| Angola | 1.0301895141601562 |
| Angola | 1.0850582122802734 |
| ... | ... |
CO2_DATA:
| Country | year | GDP | co2 |
|---------|------|-----|-----|
| Afghanistan | 2005 | 25397688320.0 | 1 |
| Afghanistan | 2006 | 28704401408.0 | 2 |
| Afghanistan | 2007 | 34507530240.0 | 2 |
| Afghanistan | 2008 | 1.0850582122802734 | 3 |
| Afghanistan | 2009 | 1.040212631225586 | 1 |
| ... | ... | ... | ... |
What I want is this output:
| Country | Year | gdp | co2 | TFP |
|---------|------|-----|-----|-----|
| Angola | 2005 | 51967275008.0 | 19.006 | 0.8633379340171814 |
| Angola | 2006 | 66748907520.0 | 19.006 | 0.9345720410346984 |
| Angola | 2007 | 87085293568.0 | 19.006 | 1.0301895141601562 |
| ... | ... | ... | ... | ... |
What I have instead:
| Country | Year | gdp | co2 | Year | TFP |
|---------|------|-----|-----|------|-----|
| Angola | 2005 | 51967275008.0 | 19.006 | 2005 | 0.8633379340171814 |
| Angola | 2005 | 51967275008.0 | 19.006 | 2006 | 0.9345720410346984 |
| Angola | 2005 | 51967275008.0 | 19.006 | 2007 | 1.0301895141601562 |
| Angola | 2005 | 51967275008.0 | 19.006 | 2008 | 1.0850582122802734 |
| Angola | 2005 | 51967275008.0 | 19.006 | 2009 | 1.040212631225586 |
| Angola | 2005 | 51967275008.0 | 19.006 | 2010 | 1.0594196319580078 |
| Angola | 2005 | 51967275008.0 | 19.006 | 2011 | 1.036203384399414 |
| Angola | 2005 | 51967275008.0 | 19.006 | 2012 | 1.076979637145996 |
| Angola | 2005 | 51967275008.0 | 19.006 | 2013 | 1.0862818956375122 |
| Angola | 2005 | 51967275008.0 | 19.006 | 2014 | 1.096832513809204 |
| Angola | 2005 | 51967275008.0 | 19.006 | 2015 | 1.0682281255722046 |
| Angola | 2005 | 51967275008.0 | 19.006 | 2016 | 1.0160540342330933 |
| Angola | 2005 | 51967275008.0 | 19.006 | 2017 | 1.0 |
I expected each country's data to merge into one dataset, but instead each row from the first dataset is repeated against every row of the second until it is exhausted, and then the next row does the same.
Regarding "TFC_DATA (in the original dataset there is a column called year but I've dropped it)": based on your expected output, you should not drop the Year column from TFC_DATA. Only then can you use pandas.merge (as shown below); otherwise, you'll end up with duplicated values.
pd.merge(CO2_DATA, TFC_DATA, left_on=["country", "year"], right_on=["country", "Year"])
or:
pd.merge(CO2_DATA, TFC_DATA.rename(columns={"Year": "year"}), on=["country", "year"])
The pd.merge() function performs an inner join by default, meaning it only includes rows that have matching values in the specified columns. To use a different join type, one option is a left outer join, which includes all rows from the left dataset (TFC_DATA) and only the matching rows from the right dataset (CO2_DATA). Specify a left outer join with the how="left" parameter of pd.merge().
merged_data = pd.merge(TFC_DATA, CO2_DATA, how="left", on="country")
EDIT (after #abokey's comment):
# First, create a new column in the TFC_DATA dataset with the year value
# (this assumes TFC_DATA has a DatetimeIndex)
TFC_DATA["year"] = TFC_DATA.index.year

# Group TFC_DATA by "country" and "year", and compute the mean TFP value for each group
TFC_DATA_agg = TFC_DATA.groupby(["country", "year"]).mean()

# Reset the index to make "country" and "year" columns in the resulting dataset
TFC_DATA_agg = TFC_DATA_agg.reset_index()

# Perform the inner merge, using "country" and "year" as the merge keys
merged_data = pd.merge(CO2_DATA, TFC_DATA_agg, how="inner", on=["country", "year"])
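Putting it together, a minimal self-contained sketch of the aggregate-then-merge approach (made-up numbers; assumes TFC_DATA carries a DatetimeIndex as in the edit above):

import pandas as pd

CO2_DATA = pd.DataFrame({
    "country": ["Angola", "Angola"],
    "year": [2005, 2006],
    "GDP": [51967275008.0, 66748907520.0],
    "co2": [19.006, 19.006],
})
TFC_DATA = pd.DataFrame(
    {"country": ["Angola", "Angola"], "TFP": [0.86, 0.93]},
    index=pd.to_datetime(["2005-06-01", "2006-06-01"]),
)

TFC_DATA["year"] = TFC_DATA.index.year
TFC_DATA_agg = TFC_DATA.groupby(["country", "year"]).mean().reset_index()
merged_data = pd.merge(CO2_DATA, TFC_DATA_agg, how="inner", on=["country", "year"])
print(merged_data)  # one row per country/year, no duplication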

How to convert a data-frame with status and the date the status changed into a data-frame with status_from, status_to and date?

I have a data-frame like this:
date | status
-----------|-------
2020/01/01 | A
2020/02/01 | B
2020/03/01 | C
I would like to convert it into something like this:
status_from | status_to | date
------------|-----------|-----
A | B | 2020/02/01
B | C | 2020/03/01
Assume I do not know the status names, and there are way too many statuses to build a data-frame manually. I need something dynamic that will work with any data-frame with a similar structure.
Thank you.
This might help: convert the data frame to a list, build a new list according to your criteria, and convert the second list back to a data frame.
However, I am not sure what you want to fill under 'status_to' and 'date' in the last row.
# df - the given dataframe
list1 = df.values.tolist()
list2 = []
for i in range(len(list1) - 1):
    # [status_from, status_to, date of the change]
    list2.append([list1[i][1], list1[i + 1][1], list1[i + 1][0]])
data_frame = pd.DataFrame(list2, columns=['status_from', 'status_to', 'date'])
print(data_frame)
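A vectorized alternative, as a sketch (assuming the columns are named date and status exactly as in the question):

# shift(1) pairs each status with the previous row's status; the first
# row has no predecessor, so it is dropped
out = pd.DataFrame({
    'status_from': df['status'].shift(1),
    'status_to': df['status'],
    'date': df['date'],
}).iloc[1:].reset_index(drop=True)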

Create a new column based off values in two others

I'm trying to merge two columns. Merging is not working out, and I've been reading and asking questions for two days trying to find a solution, so I'm going to go a different route.
Since I have to change the column name after I merge anyway, why not just create a new column and fill it based on the other two?
So I have columns A, B and C now.
C is a blank column.
Column A has values for most rows, but not all. Where column A doesn't have a value, I want to use column B's value instead, putting one of the two values in column C.
Please keep in mind that when column A doesn't have a value, a "-" was put in its place (hence why I'm having a horrendous time trying to merge these columns).
I have converted the "-" to NaN, but then the .fillna function doesn't work and I'm not sure why.
I'm thinking I have to write a for loop and an if statement to accomplish this, although I feel like there is a function that would build a new column based on the other two columns' values.
| A | B |
|----|----|
| 34 | 35 |
| 37 | - |
| - | 32 |
| 94 | 92 |
| - | 91 |
| 47 | - |
Desired result:
| C |
|----|
| 34 |
| 37 |
| 32 |
| 94 |
| 91 |
| 47 |
Does this answer your question?
df['A'] = df.apply(lambda x: x['B'] if x['A'] == '-' else x['A'], axis=1)
# NaN never compares equal to itself, so test with pd.isna rather than == np.NaN
df['A'] = df.apply(lambda x: x['B'] if pd.isna(x['A']) else x['A'], axis=1)
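A simpler sketch without apply, building column C directly (assuming the '-' placeholder described in the question):

import numpy as np

# Treat '-' as missing, then fall back to column B wherever A is missing
df['C'] = df['A'].replace('-', np.nan).fillna(df['B'])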

Python data subset - select value ... from DF1 ... where value exists in DF2

I am trying to create a dataframe partly by checking whether values exist in another dataframe. Here is the SQL version of what I am trying to do:
SELECT *
FROM DF1
WHERE
Patient_alive='still_alive'
AND Patient_ID in (SELECT Pat_ID from DF2)
Here is the code I am struggling with; the last line is what I can't figure out. I have two versions of pseudocode concerning PT_ID:
DF3 = DF1[
    (DF1['Patient_alive'].str.contains('still_alive', case=False)) &
    # (DF1['PT_ID'].isin(DF2))
    (DF1['PT_ID'].contains(DF2, case=False))
]
Update 1:
Input data of df1:
Patient_ID | Patient_Alive | Patient_Name
-----------|---------------|-------------
12345 | StillAlive | Knowles, Archibald
23456 | NotAlive | Hauzer, Bruno
911235 | StillAlive | Samarkand, Samsonite VII
Input Data of df2:
PT_ID
-----
12345
22222
55555
99999
DF3 desired output:
Patient_ID | Patient_Alive | Patient_Name
-----------|---------------|-------------
12345 | StillAlive | Knowles, Archibald
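The commented-out isin idea is close; it just needs the key column rather than the whole frame. A sketch using the column names from the sample data (note the question mixes Patient_alive/'still_alive' with the Patient_Alive/'StillAlive' shown above):

DF3 = DF1[
    DF1['Patient_Alive'].str.contains('StillAlive', case=False)
    & DF1['Patient_ID'].isin(DF2['PT_ID'])
]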

Python 3.x pandas how to compare duplicates and drop the rows with the higher values in csv?

Hi, I'm new to Python and currently using version 3.x. I have a very large set of data in a CSV that needs to be filtered. I searched online, and many people recommended loading it into a pandas DataFrame (done).
My columns can be defined as: "ID", "Name", "Time", "Token", "Text"
I need to check under "Token" for any duplicates - which can be done via
df = df[df.Token.duplicated(keep=False)]
(Please correct me if I am wrong)
But the problem is, I need to keep the original row while dropping the other duplicates. For this, I was told to compare it with "Time": the row with the smallest "Time" value is the original (keep it) and the rest of the duplicates should be dropped.
For example:
ID | Name | Time | Token | Text
---|------|------|-------|-----
1 | John | 333 | Hello | xxxx
2 | Mary | 233 | Hiiii | xxxx
3 | Jame | 222 | Hello | xxxx
4 | Kenn | 555 | Hello | xxxx
Desired output:
2 | Mary | 233 | Hiiii | xxxx
3 | Jame | 222 | Hello | xxxx
What I have done:
## compare and keep the smaller value
def dups(df):
    return df[df["Time"] < df["Time"]]

df = df[df.Token.duplicated()].apply(dups)
This is roughly where I am stuck! Can anyone help? It's my first time coding in Python; any help will be greatly appreciated.
Use sort_values + drop_duplicates:
df = df.sort_values('Time')\
.drop_duplicates('Token', keep='first').sort_index()
df
ID Name Time Token Text
1 2 Mary 233 Hiiii xxxx
2 3 Jame 222 Hello xxxx
The final sort_index call restores order to your original dataframe. If you want to retrieve a monotonically increasing index beyond this point, call reset_index.
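An equivalent sketch using groupby, keeping the row with the smallest Time within each Token group:

# idxmin returns the index of the minimum 'Time' per 'Token'
df_out = df.loc[df.groupby('Token')['Time'].idxmin()].sort_index()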
