Merging 3 datasets together - python

I have 3 datasets (CSV) that I need to merge together in order to make a motion chart.
They all have countries as the columns and years as the rows. The other two datasets look the same, except one is population and the other is income. I've tried looking around to see how to get the dataset out the way I'd like, but I can't seem to find anything.
I tried using pd.concat, but it just stacks the dataframes one after the other rather than putting them in separate columns:
# merge all 3 datasets in preparation for making a motion chart using pd.concat
mc_data = pd.concat([df2, pop_data3, income_data3], sort=True)
Any sort of help would be appreciated.
EDIT: I have used the code as suggested; however, I get a heap of NaN values that shouldn't be there:
mc_data = pd.concat([df2, pop_data3, income_data3], axis=1, keys=['df2', 'pop_data3', 'income_data3'])
EDIT 2: When I run .info and .index on them I get these results. Could it be to do with the data types, or the column entries?

You can do it with concat (the keys argument will create the hierarchical columns index):
pd.concat([df2, pop_data3, income_data3], axis=1, keys=['df2', 'pop_data3', 'income_data3'])
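On the NaN follow-up in the edit: pd.concat(..., axis=1) aligns rows by index, so if the three frames index their years differently (for example int 1990 in one and the string '1990' in another, or a plain RangeIndex), every non-matching row fills with NaN. A minimal sketch with toy frames; the index-dtype fix is an assumption about the cause, not taken from the question's data:
import pandas as pd

# Three toy frames indexed by year, with countries as columns
df2 = pd.DataFrame({'AUS': [1.0, 1.1]}, index=[1990, 1991])
pop_data3 = pd.DataFrame({'AUS': [17e6, 17.3e6]}, index=[1990, 1991])
income_data3 = pd.DataFrame({'AUS': [18e3, 18.5e3]}, index=[1990, 1991])

# Make sure all indexes share one dtype before concatenating;
# mismatched index dtypes are a common source of NaN blocks here
for df in (df2, pop_data3, income_data3):
    df.index = df.index.astype(int)

mc_data = pd.concat([df2, pop_data3, income_data3], axis=1,
                    keys=['df2', 'pop_data3', 'income_data3'])
print(mc_data)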


Unable to merge datasets

I have scraped data from two different pharma websites, so I have two datasets in hand.
Both datasets have a Name column in common. What I am trying to achieve is combining these two datasets: my final objective is to get all the tables from the first dataset and the product descriptions from the second dataset wherever the name is the same in both tables.
I tried using information from GeeksforGeeks (https://www.geeksforgeeks.org/different-types-of-joins-in-pandas/) and the pandas merging guide (https://pandas.pydata.org/docs/user_guide/merging.html), but I am not getting the expected result.
Also, I tried it using a for loop, but to no avail:
new_df['Product_description'] = ''
for i in range(len(new_df['Name'])):
    for j in range(len(match_data['Name'])):
        if type(new_df['Name'][i]) != float:
            if new_df['Name'][i] == match_data['Name'][j].split(' ')[0].strip():
                new_df['Product_description'][i] = match_data['Product_Description'][j]
I also tried a merge, but it's giving me 106 results, which come from the older dataset, and I need 251 results as in new_df.
I want something like this, but matched from the match_df dataframe.
Can anyone suggest what I am doing wrong here?
Result with left join (screenshot)
Also, below are the values I am getting after finding the unique values, sorted (screenshot).
If you want to keep the size of the first dataframe constant, you need to use a left join. If there are mismatched values they will be set to null, but this keeps the size constant.
Also remember that when how='left', the first parameter of the merge method is the dataframe whose size you want to keep constant.
If you want to keep the new_df length, I would suggest using the how='left' argument in
pd.merge(new_df, match_data, on="Name", how="left")
so it will do a left join on new_df.
Based on the screenshots you shared, I would double-check that there are names in common in both dataframes' Name columns.
Did you try these?
desc_df1 = pd.merge(new_df, match_data, on='Name', how='inner')
desc_df1 = pd.merge(new_df, match_data, on='Name', how='left')
After trying these options, let us know, because I could not understand the problem from your data preview. Can you sort Name.value_counts() ascending and check whether there are any duplicates in both dataframes? If so, that is why you got this problem.
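As a hedged alternative to the nested loop above: since the loop compares each new_df name to the first word of each match_data name, the same rule can be vectorized by deriving that key and left-merging on it. The frame and column names follow the question; the toy data is invented for illustration:
import pandas as pd

# Hypothetical stand-ins for the two scraped frames
new_df = pd.DataFrame({'Name': ['Aspirin', 'Ibuprofen', None]})
match_data = pd.DataFrame({
    'Name': ['Aspirin 100mg Tablets', 'Ibuprofen 200mg Caps'],
    'Product_Description': ['pain relief', 'anti-inflammatory'],
})

# Reproduce the loop's matching rule: the first word of match_data's Name
match_data['match_key'] = match_data['Name'].str.split(' ').str[0].str.strip()

# Drop duplicate keys so the left join cannot multiply new_df's rows
keys = match_data.drop_duplicates('match_key')[['match_key', 'Product_Description']]

# how='left' keeps every row of new_df (all 251 in the question's data)
result = new_df.merge(keys, left_on='Name', right_on='match_key', how='left')
result = result.drop(columns='match_key')
print(result)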

How to merge two data frames in Pandas without losing values

I have two data frames that I imported as spreadsheets into Pandas and cleaned up. They have a similar key value called 'PurchaseOrders' that I am using to match product numbers to a shipment number. When I attempt to merge them, I only end up with a df of 34 rows, but I have over 400 pairs of matching product to shipment numbers.
This is the closest I've gotten, but I have also tried using join():
ShipSheet = pd.merge(new_df, orders, how ='inner')
ShipSheet.shape
Here is my orders df:
(screenshot: orders df)
And here is my new_df that I want to add to my orders df using the 'PurchaseOrders' key:
(screenshot: new_df)
In the end, I want them to look like this:
(screenshot: end goal df)
I am not sure if I'm using the merge function improperly, but my end product should have around 300+ rows. I will note that the new_df dataframe's 'PurchaseOrders' values had to be delimited from a single column and split into rows, so I guess this could have something to do with it.
Use the merge method on the dataframe and specify the key:
merged_inner = pd.merge(left=df_left, right=df_right, left_on='PurchaseOrders', right_on='PurchaseOrders')
Use the concat method in pandas and specify the axis:
final_df = pd.concat([new_df, order], axis=1)
Be careful when you specify the axis: axis=0 stacks the second dataframe under the first one, while axis=1 places the second dataframe to the right of the first.
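One hedged guess about why the inner merge finds only 34 of 400+ matches: the asker mentions 'PurchaseOrders' was split out of a delimited column, so stray whitespace, case, or dtype differences could be breaking the equality the join relies on. A sketch of normalizing the key on both sides before merging (the cleaning steps are assumptions; the toy data is invented):
import pandas as pd

# Hypothetical frames; in the question these come from spreadsheets
orders = pd.DataFrame({'PurchaseOrders': ['PO-1001', 'PO-1002'], 'Shipment': ['S1', 'S2']})
new_df = pd.DataFrame({'PurchaseOrders': [' PO-1001', 'po-1002 '], 'Product': ['A', 'B']})

# Normalize the key on both sides: cast to string, strip whitespace, unify case
for df in (orders, new_df):
    df['PurchaseOrders'] = df['PurchaseOrders'].astype(str).str.strip().str.upper()

ShipSheet = pd.merge(new_df, orders, on='PurchaseOrders', how='inner')
print(ShipSheet.shape)  # both rows now match instead of being dropped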

Pandas Dataframes: Combining Columns from Two Global Datasets when the rows hold different Countries

My problem is that these two CSV files have different countries at different rows, so I can't just append the column in question to the other data frame.
https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv
https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv
I'm trying to think of some way to use a for loop, checking every row and adding the recovered cases to the correct row where the country name is the same in both data frames, but I don't know how to put that idea into code. Help?
You can do this a couple of ways:
Option 1: use pd.concat with set_index:
pd.concat([df_confirmed.set_index(['Province/State', 'Country/Region']),
           df_recovered.set_index(['Province/State', 'Country/Region'])],
          axis=1, keys=['Confirmed', 'Recovered'])
Option 2: use pd.DataFrame.merge with a left join or outer join via the how parameter:
df_confirmed.merge(df_recovered, on=['Province/State', 'Country/Region'], how='left',
                   suffixes=('_confirmed', '_recovered'))
Using pd.read_csv with the GitHub raw format:
df_recovered = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')
df_confirmed = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
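Putting the pieces together, a runnable sketch of Option 2 on the loaded frames; the URLs come from the question and the merge keys match the answer above, but the how='left' choice is one option among several:
import pandas as pd

base = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
        'csse_covid_19_data/csse_covid_19_time_series/')
df_recovered = pd.read_csv(base + 'time_series_covid19_recovered_global.csv')
df_confirmed = pd.read_csv(base + 'time_series_covid19_confirmed_global.csv')

# Align rows by place rather than by position, so differing row orders don't matter
combined = df_confirmed.merge(
    df_recovered,
    on=['Province/State', 'Country/Region'],
    how='left',
    suffixes=('_confirmed', '_recovered'),
)
print(combined.shape)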

Pandas: merge_asof-like solutions for merging two multi-indexed DataFrames?

I have two dataframes, df1 and df2 say, which are both multi-indexed.
At the first index level, both dataframes share the same keys (i.e. df1.index.get_level_values(0) and df2.index.get_level_values(0) contain the same elements). Those keys are unordered strings, such as ['foo','bar','baz'].
At the second index level, both dataframes have timestamps which are ordered, but unequally spaced.
My question is as follows: I would like to merge df1 and df2 in such a way that, for each key at the first index level, the values of df2 are inserted into df1 without changing the order of df1.
I tried using pd.merge, pd.merge_asof and pd.MultiIndex.searchsorted. From the descriptions of those methods, it seems like one of them should do the trick for me, but I cannot figure out how. Ideally, I would like to find a solution that avoids looping over the keys in index.get_level_values(0), since my dataframes can get large.
A few failed attempts for illustration:
df_merged = pd.merge(left=df1.reset_index(), right=df2.reset_index(),
                     left_on=[['some_keys', 'timestamps_df1']],
                     right_on=[['some_keys', 'timestamps_df2']],
                     suffixes=('', '_2'))  # after sorting
# FAILED

df2.index.searchsorted(df1, side='right')  # after sorting
# FAILED
Any help is greatly appreciated!
Based on your description, here is the solution using merge_asof:
df_merged = pd.merge_asof(left=df1.reset_index(), right=df2.reset_index(),
                          left_on=['timestamps_df1'], right_on=['timestamps_df2'],
                          by='some_keys', suffixes=('', '_2'))
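One caveat worth adding: pd.merge_asof requires both inputs to be sorted by the on column, even when by= is supplied, so the frames may need a sort before the call and the original order restored afterwards. A self-contained toy sketch (the names key and ts are illustrative, not from the question):
import pandas as pd

# Toy frames standing in for the flattened df1 and df2
left = pd.DataFrame({'key': ['foo', 'foo', 'bar'],
                     'ts': pd.to_datetime(['2021-01-01', '2021-01-03', '2021-01-02']),
                     'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['foo', 'bar'],
                      'ts': pd.to_datetime(['2021-01-01', '2021-01-02']),
                      'y': [10, 30]})

# merge_asof needs both sides globally sorted by the 'on' column
left_sorted = left.reset_index().sort_values('ts')   # keep the original order
right_sorted = right.sort_values('ts')

merged = pd.merge_asof(left_sorted, right_sorted, on='ts', by='key')

# Restore the left frame's original row order, as the question requires
merged = merged.sort_values('index').drop(columns='index')
print(merged)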

How can I drop a column from multiple dataframes stored in a list?

Apologies as I'm new to all this.
I'm playing around with pandas at the moment. I want to drop one particular column across two dataframes stored within a list. This is what I've written.
combine = [train, test]
for dataset in combine:
    dataset = dataset.drop('Id', axis=1)
However, this doesn't work. If I do this explicitly, such as train = train.drop('Id', axis=1), this works fine.
I appreciate in this case it's two lines either way, but is there some way I can use the list of dataframes to drop the column from both?
The reason your solution didn't work is that dataset is a name that points to an item in the list combine. You had the right idea to reassign it with dataset = dataset.drop('Id', axis=1), but all that did was rebind the name dataset; it never placed a new dataframe in the list combine.
Option 1
Create a new list
combine = [d.drop('Id', axis=1) for d in combine]
Option 2
Or alter each dataframe in place with inplace=True
for d in combine:
    d.drop('Id', axis=1, inplace=True)
Or maybe
combine = [df1, df2]
for i in range(len(combine)):
    combine[i] = combine[i].drop('Id', axis=1)
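A small follow-up on Option 1 and the last variant: after rebuilding the list, the original names train and test still point to the old frames, so unpack the list back into them if you want those names updated. A minimal sketch with invented data:
import pandas as pd

train = pd.DataFrame({'Id': [1, 2], 'a': [3, 4]})
test = pd.DataFrame({'Id': [5], 'a': [6]})

combine = [d.drop('Id', axis=1) for d in [train, test]]
train, test = combine  # rebind the original names to the new frames
print(train.columns.tolist())  # ['a']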
