I have two different datasets and I want to merge them on the column "country", keeping only the country names common to both and dropping the rest. I have done it with an inner merge, but the result is not what I want.
inner_merged = pd.merge(TFC_DATA,CO2_DATA,how="inner",on="country")
TFC_DATA (in the original dataset there exists a column called year, but I've dropped it):
| Country | TFP |
|---------|-----|
| Angola | 0.8633379340171814 |
| Angola | 0.9345720410346984 |
| Angola | 1.0301895141601562 |
| Angola | 1.0850582122802734 |
.
.
.
CO2_DATA:
| Country | year | GDP | co2 |
|-------------|------|---------------|-----|
| Afghanistan | 2005 | 25397688320.0 | 1 |
| Afghanistan | 2006 | 28704401408.0 | 2 |
| Afghanistan | 2007 | 34507530240.0 | 2 |
| Afghanistan | 2008 | 1.0850582122802734 | 3 |
| Afghanistan | 2009 | 1.040212631225586 | 1 |
.
.
.
What I want is
Output
|Country|Year|gdp|co2|TFP
Angola|2005|51967275008.0|19.006|0.8633379340171814
Angola|2006|66748907520.0|19.006|0.9345720410346984
Angola|2007|87085293568.0|19.006|1.0301895141601562
.
.
.
What I have instead
Output
Country|Year|gdp|co2|Year|TFP
Angola|2005|51967275008.0|19.006|2005|0.8633379340171814
Angola|2005|51967275008.0|19.006|2006|0.9345720410346984
Angola|2005|51967275008.0|19.006|2007|1.0301895141601562
Angola|2005|51967275008.0|19.006|2008|1.0850582122802734
Angola|2005|51967275008.0|19.006|2009|1.040212631225586
Angola|2005|51967275008.0|19.006|2010|1.0594196319580078
Angola|2005|51967275008.0|19.006|2011|1.036203384399414
Angola|2005|51967275008.0|19.006|2012|1.076979637145996
Angola|2005|51967275008.0|19.006|2013|1.0862818956375122
Angola|2005|51967275008.0|19.006|2014|1.096832513809204
Angola|2005|51967275008.0|19.006|2015|1.0682281255722046
Angola|2005|51967275008.0|19.006|2016|1.0160540342330933
Angola|2005|51967275008.0|19.006|2017|1.0
I expected each country's data to merge into one dataset, but instead each row of the first dataset is repeated against every row of the second dataset for the same country, and then the same happens for the second dataset.
> TFC_DATA (in the original dataset there exists a column called year but I've dropped it)
Well, based on your expected output, you should not drop the Year column from the dataframe TFC_DATA. Only then can you use pandas.merge as shown below; otherwise you'll get the duplicated values you're seeing.
pd.merge(CO2_DATA, TFC_DATA, left_on=["country", "year"], right_on=["country", "Year"])
or:
pd.merge(CO2_DATA, TFC_DATA.rename(columns={"Year": "year"}), on=["country", "year"])
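To illustrate, here is a minimal sketch of that two-key merge on small invented frames (the numbers below are made up, not the question's real values):

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets
CO2_DATA = pd.DataFrame({
    "country": ["Angola", "Angola", "Afghanistan"],
    "year": [2005, 2006, 2005],
    "GDP": [51967275008.0, 66748907520.0, 25397688320.0],
    "co2": [19.006, 19.006, 1.0],
})
TFC_DATA = pd.DataFrame({
    "country": ["Angola", "Angola"],
    "Year": [2005, 2006],
    "TFP": [0.8633379340171814, 0.9345720410346984],
})

# Merge on both keys so each (country, year) pair matches exactly once
merged = pd.merge(
    CO2_DATA,
    TFC_DATA.rename(columns={"Year": "year"}),
    on=["country", "year"],
)
print(merged)  # one row per country/year, no duplicated Year columns
```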
By default, pd.merge() performs an inner join, which means it only includes rows that have matching values in the specified key columns.
If you want a different join type, one option is a left outer join, which includes all rows from the left dataset (TFC_DATA) and only the matching rows from the right dataset (CO2_DATA).
Specify a left outer join using the how="left" parameter in the pd.merge() function:
merged_data = pd.merge(TFC_DATA, CO2_DATA, how="left", on="country")
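And a quick, made-up illustration of how the two join types differ (toy frames, not the question's data):

```python
import pandas as pd

left = pd.DataFrame({"country": ["Angola", "Benin"], "TFP": [1.0, 0.9]})
right = pd.DataFrame({"country": ["Angola"], "co2": [19.006]})

# Inner join: only countries present in both frames survive
print(pd.merge(left, right, how="inner", on="country"))  # Angola only

# Left join: every row of `left` is kept; missing co2 becomes NaN
print(pd.merge(left, right, how="left", on="country"))   # Angola and Benin (co2 NaN)
```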
EDIT (after #abokey's comment):
First, create a new column in the TFC_DATA dataset with the year value
TFC_DATA["year"] = TFC_DATA.index.year
Group the TFC_DATA dataset by "country" and "year", and compute the mean TFP value for each group
TFC_DATA_agg = TFC_DATA.groupby(["country", "year"]).mean()
Reset the index to make "country" and "year" columns in the resulting dataset
TFC_DATA_agg = TFC_DATA_agg.reset_index()
Perform the inner merge, using "country" and "year" as the merge keys
merged_data = pd.merge(CO2_DATA, TFC_DATA_agg, how="inner", on=["country", "year"])
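Put together as one runnable sketch (assuming TFC_DATA has a DatetimeIndex, which is what .index.year requires; the sample numbers below are invented):

```python
import pandas as pd

# Hypothetical TFC_DATA indexed by date, with several TFP readings per year
TFC_DATA = pd.DataFrame(
    {"country": ["Angola", "Angola", "Angola"], "TFP": [0.86, 0.93, 1.03]},
    index=pd.to_datetime(["2005-01-01", "2005-07-01", "2006-01-01"]),
)
# Hypothetical CO2_DATA already carrying a plain year column
CO2_DATA = pd.DataFrame({
    "country": ["Angola", "Angola"],
    "year": [2005, 2006],
    "GDP": [51967275008.0, 66748907520.0],
    "co2": [19.006, 19.006],
})

TFC_DATA["year"] = TFC_DATA.index.year
# One mean TFP per country/year, with the keys back as columns
TFC_DATA_agg = TFC_DATA.groupby(["country", "year"])["TFP"].mean().reset_index()

merged_data = pd.merge(CO2_DATA, TFC_DATA_agg, how="inner", on=["country", "year"])
print(merged_data)  # one row per (country, year)
```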
Related
I have a couple of Pandas dataframes that I am trying to merge together without any luck.
The first dataframe (let's call it dataframe A) looks a little like this:
offer | type | country
------|------|--------
123 | A | UK
456 | B | UK
789 | A | ROI
It's created by reading in an .xlsx file using the following code:
file_name = "My file.xlsx"
df_a = pd.read_excel(file_name, sheet_name = "Offers Plan", usecols= ["offer","type","country"], dtype={'offer': str})
The offer column is being read in as a string because otherwise the values end up in the format 123.0. The offers need to be in the format 123 because they're used in some embedded SQL later on in the code that looks for them in a certain database table. In that table the offers are stored as 123, so the SQL returns no results when looking for 123.0.
The second dataframe (dataframe B) looks a little like this:
offer
-----
123
456
789
123
456
123
What I want to do is merge the two dataframes together so the results look like this:
offer | type | country
------|------|--------
123 | A | UK
456 | B | UK
789 | A | ROI
123 | A | UK
456 | B | UK
123 | A | UK
I've tried using the following code, but I get an error message saying "ValueError: You are trying to merge on int64 and object columns":
df_merged = pd.merge(df_b, df_a, how='left', on='offer')
Does anyone know how I can merge the dataframes correctly please?
IIUC, you can just change the df_a column to an int:
df_a['offer'] = df_a['offer'].astype(int)
This will change it from a float/str to an int. If this gives you an error about converting from a str/float to an int, check to make sure that you don't have any NaN/Nulls in your data. If you do, you will need to remove them for a successful conversion.
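A minimal sketch of that fix on made-up frames shaped like the ones in the question:

```python
import pandas as pd

# Hypothetical frames mirroring the question's shapes
df_a = pd.DataFrame({"offer": ["123", "456", "789"],   # read in as str via dtype={'offer': str}
                     "type": ["A", "B", "A"],
                     "country": ["UK", "UK", "ROI"]})
df_b = pd.DataFrame({"offer": [123, 456, 789, 123, 456, 123]})  # int64

# Align the key dtypes before merging
df_a["offer"] = df_a["offer"].astype(int)

df_merged = pd.merge(df_b, df_a, how="left", on="offer")
print(df_merged)
```

If the offers have to stay strings for the embedded SQL later on, converting the other frame instead (df_b['offer'] = df_b['offer'].astype(str)) aligns the dtypes without touching df_a.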
I would like to update the NA values of a Pandas DataFrame column with the values in a groupby object.
Let's illustrate with an example:
We have the following DataFrame columns:
|--------|-------|-----|-------------|
| row_id | Month | Day | Temperature |
|--------|-------|-----|-------------|
| 1 | 1 | 1 | 14.3 |
| 2 | 1 | 1 | 14.8 |
| 3 | 1 | 2 | 13.1 |
|--------|-------|-----|-------------|
We're simply measuring temperature multiple times a day for many months. Now, let's assume that for some of our records, the temperature reading failed and we have a NA.
|--------|-------|-----|-------------|
| row_id | Month | Day | Temperature |
|--------|-------|-----|-------------|
| 1 | 1 | 1 | 14.3 |
| 2 | 1 | 1 | 14.8 |
| 3 | 1 | 2 | 13.1 |
| 4 | 1 | 2 | NA |
| 5 | 1 | 3 | 14.8 |
| 6 | 1 | 4 | NA |
|--------|-------|-----|-------------|
We could just use pandas' .fillna(), however we want to be a little more sophisticated. Since there are multiple readings per day (there could be hundreds per day), we'd like to take the daily average and use that as our fill value.
We can get the daily averages with a simple groupby:
avg_temp_by_month_day = df.groupby(['month', 'day'])['temperature'].mean()
Which gives us the means for each day by month. The question is, how best to fill the NA values with the groupby values?
We could use an apply(),
df['temperature'] = df.apply(
    lambda row: avg_temp_by_month_day.loc[row['month'], row['day']] if pd.isna(row['temperature']) else row['temperature'],
    axis=1
)
however this is really slow (1M+ records).
Is there a vectorized approach, perhaps using np.where(), or maybe creating another Series and merging?
What's a more efficient way to perform this operation?
Thank you!
I'm not sure if this is the fastest, however instead of taking ~1 hour with apply, it takes ~20 sec for 1M+ records. The code below has been updated to work on one or many columns.
local_avg_cols = ['temperature'] # can work with multiple columns
# Create groupby's to get local averages
local_averages = df.groupby(['month', 'day'])[local_avg_cols].mean()
# Convert to DataFrame and prepare for merge
local_averages = pd.DataFrame(local_averages, columns=local_avg_cols).reset_index()
# Merge into original dataframe
df = df.merge(local_averages, on=['month', 'day'], how='left', suffixes=('', '_avg'))
# Now overwrite na values with values from new '_avg' col
# Now overwrite na values with values from new '_avg' col
for col in local_avg_cols:
    df[col] = df[col].mask(df[col].isna(), df[col + '_avg'])
# Drop new avg cols
df = df.drop(columns=[col + '_avg' for col in local_avg_cols])
If anyone finds a more efficient way to do this, (efficient in processing time, or in just readability), I'll unmark this answer and mark yours. Thank you!
I'm guessing that what slows down your process are two things. First, you don't need to convert your groupby result to a DataFrame. Second, you don't need the for loop.
import pandas as pd
from numpy import nan
# Populating the dataset
df = {"Month": [1] * 6,
"Day": [1, 1, 2, 2, 3, 4],
"Temperature": [14.3, 14.8, 13.1, nan, 14.8, nan]}
# Creating the dataframe
df = pd.DataFrame(df, columns=df.keys())
local_averages = df.groupby(['Month', 'Day'])['Temperature'].mean()
df = df.merge(local_averages, on=['Month', 'Day'], how='left', suffixes=('', '_avg'))
# Filling the missing values of the Temperature column with what is available in Temperature_avg
df["Temperature"] = df["Temperature"].fillna(df["Temperature_avg"])
df.drop(columns="Temperature_avg", inplace=True)
Groupby is a resource-heavy process, so make the most out of it when you use it. Furthermore, as you already know, loops are not a good idea when it comes to dataframes. Additionally, if you have large data you may want to avoid creating extra variables from it; I might put the groupby directly into the merge call if my data had 1M rows and many columns.
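Another common pattern for this kind of group-wise fill, sketched here as an alternative (it is not what either answer above uses), is groupby().transform('mean'), which returns a value aligned to every row and avoids the merge entirely:

```python
import pandas as pd
from numpy import nan

df = pd.DataFrame({"Month": [1] * 6,
                   "Day": [1, 1, 2, 2, 3, 4],
                   "Temperature": [14.3, 14.8, 13.1, nan, 14.8, nan]})

# transform('mean') yields a Series the same length as df,
# holding each row's group mean (NaN readings are ignored when averaging)
group_means = df.groupby(["Month", "Day"])["Temperature"].transform("mean")
df["Temperature"] = df["Temperature"].fillna(group_means)
print(df)  # Day 2's NaN becomes 13.1; Day 4 stays NaN (no reading at all that day)
```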
I have a list of people with fields unique_id, sex, born_at (birthday) and I’m trying to group by sex and age bins, and count the rows in each segment.
Can’t figure out why I keep getting NaN or 0 as the output for each segment.
Here’s the latest approach I've taken...
Data sample:
|---------------------|------------------|------------------|
| unique_id | sex | born_at |
|---------------------|------------------|------------------|
| 1 | M | 1963-08-04 |
|---------------------|------------------|------------------|
| 2 | F | 1972-03-22 |
|---------------------|------------------|------------------|
| 3 | M | 1982-02-10 |
|---------------------|------------------|------------------|
| 4 | M | 1989-05-02 |
|---------------------|------------------|------------------|
| 5 | F | 1974-01-09 |
|---------------------|------------------|------------------|
Code:
df['num_people'] = 1
breakpoints = [18, 25, 35, 45, 55, 65]
df[['sex', 'born_at', 'num_people']].groupby(['sex', pd.cut(df.born_at.dt.year, bins=breakpoints)]).agg('count')
I’ve tried summing as the agg type, removing NaNs from the data series, pivot_table using the same pd.cut function but no luck. Guessing there’s also probably a better way to do this that doesn’t involve creating a column of 1s.
Desired output would be something like this...
The extra born_at column isn't necessary in the output and I'd also like the age bins to be 18 to 24, 25 to 34, etc. instead of 18 to 25, 25 to 35, etc. but I'm not sure how to specify that either.
I think you missed the calculation of the current age. The ranges you define for splitting the birth years only make sense when you apply them to ages rather than years (otherwise all grouped cells will be NaN or zero, because the lowest year in your sample is 1963 and the right-most bin edge is 65). So first of all you want to calculate the age:
datetime.now().year - df.born_at.dt.year
This can then be used, together with the sex column, to group the data:
df.groupby(['sex', pd.cut(datetime.now().year - df.born_at.dt.year, bins=breakpoints)]).agg('count')
In order to get rid of the NaN cells you simply do a fillna(0), like this:
df.groupby(['sex', pd.cut(datetime.now().year - df.born_at.dt.year, bins=breakpoints)]).agg('count').fillna(0).rename(columns={'born_at': 'count'})
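A runnable sketch of the whole thing on the sample rows, using right=False in pd.cut so the bins are left-closed and read as 18-24, 25-34, etc., which matches the desired output (the choice of right=False is my assumption, not part of the original answer):

```python
from datetime import datetime
import pandas as pd

df = pd.DataFrame({
    "unique_id": [1, 2, 3, 4, 5],
    "sex": ["M", "F", "M", "M", "F"],
    "born_at": pd.to_datetime(
        ["1963-08-04", "1972-03-22", "1982-02-10", "1989-05-02", "1974-01-09"]),
})

breakpoints = [18, 25, 35, 45, 55, 65]
age = datetime.now().year - df.born_at.dt.year

# right=False turns the bins into [18, 25), [25, 35), ... i.e. 18-24, 25-34, ...
age_bin = pd.cut(age, bins=breakpoints, right=False)

# size() counts rows per (sex, age bin); observed=False keeps empty bins visible
counts = df.groupby(["sex", age_bin], observed=False).size().rename("num_people")
print(counts)
```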
Hi, I'm new to Python and currently using Python 3.x. I have a very large set of data in a CSV that needs to be filtered. I searched online and many recommended loading it into a pandas DataFrame (done).
My columns can be defined as: "ID", "Name", "Time", "Token", "Text"
I need to check under "Token" for any duplicates, which can be done via
df = df[df.Token.duplicated(keep=False)]
(Please correct me if I am wrong)
But the problem is, I need to keep the original row while dropping the other duplicates. For this, I was told to compare the "Time" values: the row with the smallest "Time" is the original and should be kept, while the rest of the duplicates are dropped.
For example:
ID Name Time Token Text
1 | John | 333 | Hello | xxxx
2 | Mary | 233 | Hiiii | xxxx
3 | Jame | 222 | Hello | xxxx
4 | Kenn | 555 | Hello | xxxx
Desired output:
2 | Mary | 233 | Hiiii | xxxx
3 | Jame | 222 | Hello | xxxx
What I have done:
##compare and keep the smaller value
def dups(df):
    return df[df["Time"] < df["Time"]]

df = df[df.Token.duplicated()].apply(dups)
This is roughly where I am stuck! Can anyone help? It's my first time coding in Python; any help will be greatly appreciated.
Use sort_values + drop_duplicates:
df = df.sort_values('Time')\
.drop_duplicates('Token', keep='first').sort_index()
df
ID Name Time Token Text
1 2 Mary 233 Hiiii xxxx
2 3 Jame 222 Hello xxxx
The final sort_index call restores order to your original dataframe. If you want to retrieve a monotonically increasing index beyond this point, call reset_index.
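For completeness, a self-contained version of that answer on the question's sample rows (Text values abbreviated to 'xxxx'):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Name": ["John", "Mary", "Jame", "Kenn"],
    "Time": [333, 233, 222, 555],
    "Token": ["Hello", "Hiiii", "Hello", "Hello"],
    "Text": ["xxxx"] * 4,
})

# Sort so the smallest Time comes first within each Token, keep that first
# occurrence, then restore the original row order
result = df.sort_values("Time").drop_duplicates("Token", keep="first").sort_index()
print(result)  # keeps Mary (233, Hiiii) and Jame (222, Hello)
```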
I have a DataFrame at daily level:
day | type| rev |impressions| yearmonth
2015-10-01| a | 1999| 1000 |201510
2015-10-02| a | 300 | 6777 |201510
2015-11-07| b | 2000| 4999 |201511
yearmonth is a column I added to the DataFrame. The task is to group by yearmonth (and maybe type), then sum up all the columns (or select a value) and use the result as the new DataFrame.
On grouping the above DataFrame, we should get one row per month:
yearmonth| type| rev |impressions
201510 | a | 2299| 7777
201511 | b | 2000| 4999
Let us say df is the DataFrame. I tried doing:
test = df.groupby('yearmonth')
I checked the methods available on test (by typing test. and tab-completing) but I did not see anything that lets me select columns and also aggregate them (I guess we can use agg for the sum).
Any inputs?
Add the as_index parameter, like this:
test = df.groupby('yearmonth', as_index=False)
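A short sketch of the full operation (as_index=False only keeps the group keys as columns; you still need an aggregation such as sum() to collapse the rows):

```python
import pandas as pd

df = pd.DataFrame({
    "day": pd.to_datetime(["2015-10-01", "2015-10-02", "2015-11-07"]),
    "type": ["a", "a", "b"],
    "rev": [1999, 300, 2000],
    "impressions": [1000, 6777, 4999],
    "yearmonth": [201510, 201510, 201511],
})

# Group by yearmonth (and type), keep the keys as columns, sum the numeric columns
test = df.groupby(["yearmonth", "type"], as_index=False)[["rev", "impressions"]].sum()
print(test)  # 201510/a -> 2299, 7777 ; 201511/b -> 2000, 4999
```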