Unable to merge datasets - python

I have scraped data from two different pharma websites, so I have two datasets in hand.
Both datasets have a name column in common. What I am trying to achieve is to combine these two datasets: my final objective is to keep all the tables from the first dataset and pull in the product description from the second dataset wherever the name is the same in both tables.
I tried the approaches from GeeksforGeeks (https://www.geeksforgeeks.org/different-types-of-joins-in-pandas/) and the pandas merging user guide (https://pandas.pydata.org/docs/user_guide/merging.html), but I am not getting the expected result.
I also tried a nested for loop, but to no avail:
new_df['Product_description'] = ''
for i in range(len(new_df['Name'])):
    for j in range(len(match_data['Name'])):
        # skip NaN entries in new_df['Name'] (NaN is a float)
        if type(new_df['Name'][i]) != float:
            # compare against the first word of the name in match_data
            if new_df['Name'][i] == match_data['Name'][j].split(' ')[0].strip():
                new_df['Product_description'][i] = match_data['Product_Description'][j]
I also tried:
but it's giving me 106 results, which is the count from the older dataset, and I need 251 results, as in new_df.
I want something like this, but matched from the match_data data frame.
Can anyone suggest what I am doing wrong here?
Result with left join
Also, below are the values I am getting after sorting the unique values.

If you want to keep the size of the first dataframe constant, you need to use a left join. Any rows without a match will get NaN in the merged columns, but the row count stays constant.
Also remember that when 'how' is 'left', the first argument to the merge method is the dataframe whose size is kept constant.
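For illustration, a minimal sketch with toy frames (the names and descriptions below are invented, not the asker's real data):

import pandas as pd

new_df = pd.DataFrame({'Name': ['Aspirin', 'Ibuprofen', 'Paracetamol']})
match_data = pd.DataFrame({'Name': ['Aspirin', 'Paracetamol'],
                           'Product_Description': ['pain relief', 'fever reducer']})

merged = new_df.merge(match_data, on='Name', how='left')
print(len(merged) == len(new_df))  # True: the left join keeps new_df's 3 rows
print(merged)  # Ibuprofen has NaN in Product_Description (no match)

Note this holds as long as match_data has no duplicate names; duplicate keys on the right side would multiply rows.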

If you want to keep new_df's length, I would suggest using the how='left' argument:
pd.merge(new_df, match_data, on="Name", how="left")
That will do a left join on new_df.
Based on the screenshots you shared, I would double-check that there are names in common in both dataframes' "Name" columns, for example as below.
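A quick check (assuming the column is literally named 'Name' in both frames):

common = set(new_df['Name'].dropna()) & set(match_data['Name'].dropna())
print(len(common))          # how many names appear in both frames
print(sorted(common)[:10])  # peek at a few of the shared names

If len(common) is 0, the merge has nothing to match on and every merged column will be NaN.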

Did you try these?
desc_df1 = pd.merge(new_df, match_data, on='Name', how='inner')
desc_df1 = pd.merge(new_df, match_data, on='Name', how='left')
After trying these options, let us know, because I could not understand the problem from your data preview. Can you sort Name.value_counts() ascending and check whether there are any duplicates in both dataframes? If so, that is why you got this problem.
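For example (assuming the column is named 'Name' in both frames):

print(new_df['Name'].value_counts().sort_values().tail())
print(match_data['Name'].value_counts().sort_values().tail())
# any count > 1 is a duplicate key, and duplicate keys multiply rows in a merge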

Related

Pandas, when merging two dataframes and values for some columns don't carry over

I'm trying to combine two dataframes in pandas using a left merge on common columns, but when I do that, the data I merged doesn't carry over and instead gives NaN values. All of the columns are objects and match that way, so I'm not quite sure what's going on.
This is my first dataframe header, which is the output from a program.
This is my second dataframe header. The second df is a 'key' document to match the first output with its correct id/tastant/etc., and they share the same date/subject/procedure/etc.
And this is my code that's trying to merge them on the common columns:
combined = first.merge(second, on=['trial', 'experiment', 'subject', 'date', 'procedure'], how='left')
with output where the id, ts and tastant columns should match correctly with the first dataframe but don't.
Check your dtypes and make sure they match between the two dataframes. Pandas makes assumptions about data types when it imports; it could be treating numbers as int in one dataframe and object in another.
For the string columns, check for extra whitespace. Stray spaces can appear in datasets, and since you can't see them but pandas can, they result in no match. You can use df['column'].str.strip().
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.strip.html
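A quick sketch of both checks, using the key columns from the merge above:

keys = ['trial', 'experiment', 'subject', 'date', 'procedure']
print(first[keys].dtypes)
print(second[keys].dtypes)  # dtypes must match pairwise for the keys to compare equal

# strip stray whitespace from the string-typed key columns before merging
for col in keys:
    if first[col].dtype == object and second[col].dtype == object:
        first[col] = first[col].str.strip()
        second[col] = second[col].str.strip()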

Replicating Excel VLOOKUP in Python

So I have two tables, Table 1 and Table 2; Table 2 is sorted by date, from recent to old. In Excel, when I do a lookup from Table 1 into Table 2, it only picks the first value from Table 2 and does not keep searching for the same value after the first.
So I tried replicating it in Python with the merge function, but found that it repeats the value as many times as it appears in the second table.
pd.merge(Table1, Table2, left_on='Country', right_on='Country', how='left', indicator='indicator_column')
TABLE1
TABLE2
Merge result
Expected result (Excel VLOOKUP)
Is there any way this could be achieved with the merge function or any other python function?
Typing this blind, as you included your data as images, not text.
# The index is a very important element in a DataFrame;
# we will see that in a bit.
result = table1.set_index('Country')
# For each country, only keep the first row (Table 2 is sorted
# recent-to-old, so the row kept is the most recent one).
tmp = table2.drop_duplicates(subset='Country').set_index('Country')
# When you assign one or more columns of a DataFrame to another
# DataFrame, the assignment is aligned on the index of the two frames.
# This is the equivalent of VLOOKUP. Plain column assignment is used
# here because .loc cannot create several new columns at once.
result[['Age', 'Date']] = tmp[['Age', 'Date']]
result.reset_index(inplace=True)
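A toy run to make the alignment concrete (these frames are invented here, not the asker's data):

import pandas as pd

table1 = pd.DataFrame({'Country': ['LIB', 'ARG'], 'Name': ['Mike', 'Pete']})
table2 = pd.DataFrame({'Country': ['LIB', 'ARG', 'LIB'],
                       'Age': [40, 27, 61],
                       'Date': ['7/9/2020', '7/8/2020', '7/4/2020']})
# After the steps above, LIB gets Age 40 and Date 7/9/2020 (only the first
# LIB row survives drop_duplicates), so no row multiplication occurs.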
Edit: Since you want a straight-up VLOOKUP, just use join. It appears to find the very first one.
table1.join(table2, rsuffix='r', lsuffix='l')
The docs seem to indicate it performs similarly to a VLOOKUP: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html
I'd recommend approaching this more like a SQL join than a VLOOKUP. VLOOKUP finds the first matching row, from top to bottom, which can be completely arbitrary depending on how you sort your table/array in Excel. "True" database systems and their related functions are more deliberate than this, for good reason.
In order to join only one row from the right table onto each row of the left table, you'll need some kind of aggregation or selection; in your case, that'd be either MAX or MIN.
The question is, which column is more important: the date or the age?
import pandas as pd

df1 = pd.DataFrame({
    'Country': ['GERM', 'LIB', 'ARG', 'BNG', 'LITH', 'GHAN'],
    'Name': ['Dave', 'Mike', 'Pete', 'Shirval', 'Kwasi', 'Delali']
})
df2 = pd.DataFrame({
    'Country': ['GERM', 'LIB', 'ARG', 'BNG', 'LITH', 'GHAN', 'LIB', 'ARG', 'BNG'],
    'Age': [35, 40, 27, 87, 90, 30, 61, 18, 45],
    'Date': ['7/10/2020', '7/9/2020', '7/8/2020', '7/7/2020', '7/6/2020',
             '7/5/2020', '7/4/2020', '7/3/2020', '7/2/2020']
})

# Collapse df2 to one row per country (max Age, max Date), then left-join.
# Note: Date is a string here, so 'max' is lexicographic; parse with
# pd.to_datetime first if you want the chronological maximum.
df1.set_index('Country').join(
    df2.groupby('Country').agg({'Age': 'max', 'Date': 'max'}),
    how='left')

Pandas Dataframes: Combining Columns from Two Global Datasets when the rows hold different Countries

My problem is that these two CSV files have different countries at different rows, so I can't just append the column in question to the other data frame.
https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv
https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv
I'm trying to think of some way to use a for loop, checking every row, and add the recovered cases to the correct row where the country name is the same in both data frames, but I don't know how to put that idea into code. Help?
You can do this a couple of ways:
Option 1: use pd.concat with set_index:
pd.concat([df_confirmed.set_index(['Province/State', 'Country/Region']),
           df_recovered.set_index(['Province/State', 'Country/Region'])],
          axis=1, keys=['Confirmed', 'Recovered'])
Option 2: use pd.DataFrame.merge with a left join or outer join via the how parameter:
df_confirmed.merge(df_recovered, on=['Province/State', 'Country/Region'], how='left',
                   suffixes=('_confirmed', '_recovered'))
Using pd.read_csv with the GitHub raw URLs:
df_recovered = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')
df_confirmed = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')

Merging 3 datasets together

I have 3 datasets (CSV) that I need to merge together in order to make a motion chart.
They all have countries as the columns and the year as rows.
The other two datasets look the same, except one is population and the other is income. I've tried looking around to see what I can find to get the dataset out like I'd like, but can't seem to find anything.
I tried using pd.concat, but it just lists it all one after the other, not in separate columns:
# merge all 3 data sets in preparation for making a motion chart using pd.concat
mc_data = pd.concat([df2, pop_data3, income_data3], sort=True)
Any sort of help would be appreciated
EDIT: I have used the code as suggested; however, I get a heap of NaN values that shouldn't be there:
mc_data = pd.concat([df2, pop_data3, income_data3], axis=1, keys=['df2', 'pop_data3', 'income_data3'])
EDIT 2: When I run .info and .index on them, I get these results. Could it be to do with the data types, or the column entries?
From this answer:
You can do it with concat (the keys argument will create the hierarchical columns index):
pd.concat([df2, pop_data3, income_data3], axis=1, keys=['df2', 'pop_data3', 'income_data3'])
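On the NaN values from the edit: with axis=1, concat aligns rows on the index, so if the year index differs between the frames (for example, int years in one file and string years in another), the misaligned rows come out as NaN. A sketch of the usual fix, assuming the index of each frame holds the year:

# normalize the index dtype so the three frames align row-by-row
for df in (df2, pop_data3, income_data3):
    df.index = df.index.astype(int)

mc_data = pd.concat([df2, pop_data3, income_data3], axis=1,
                    keys=['df2', 'pop_data3', 'income_data3'])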

How do I iterate through two dataframes of different sizes?

Specifically, I want to iterate through two dataframes, one large and one small.
Ultimately, I would like to compare values within a certain column.
I tried creating a nested for loop, the outer loop iterating through the large dataframe and the inner loop through the small one, but I am having difficulties.
I'm looking for a way to identify the rows where the "name" and "value" in my large dataframe match my small dataframe.
Background info: I am using the pandas library.
Large dataframe:
Small dataframe:
Name    Value
SF      12.84
TH     -49.45
If the goal is to iterate through one, or especially more, DataFrames, then explicit for loops are usually the wrong move. In this case, because you're trying to
identify the rows where the "name" and "value" in the large dataframe match the small dataframe,
the operation you're looking for is either pd.merge or pd.DataFrame.join, which do the comparisons "under the hood" and return the matching information. So, say you have the two DataFrames and they're called large and small. Then
import pandas as pd

new_large = pd.merge(left=large,
                     right=small,
                     how='left',
                     on=('Name', 'Value'),
                     indicator=True)
# turn the indicator into a 0/1 flag: 1 where the row matched small
new_large['_merge'] = new_large['_merge'].apply(lambda x: 1 if x == 'both' else 0)
By doing a left join between large and small (how='left'), pd.merge returns every row of large, whether or not it has a match in small on the ('Name', 'Value') tuple. Then, most of the heavy lifting is done by the indicator keyword which, quoting the pd.merge version 0.25.0 docs:
If True, adds a column to output DataFrame called "_merge" with
information on the source of each row.
Information column is Categorical-type and takes on a value of "left_only"
for observations whose merge key only appears in 'left' DataFrame,
"right_only" for observations whose merge key only appears in 'right'
DataFrame, and "both" if the observation's merge key is found in both.
So, new_large is the original large DataFrame with a new column called _merge, whose entries are 'left_only' for rows of large with no match in small on the full ('Name', 'Value') key and 'both' for rows that matched on Name as well as Value. The last step changes 'both' and 'left_only' to 1 and 0, as you specified.
Now, the left join returned what it did because both of the Name values in the small DataFrame were present in the large DataFrame, so the left join of large and small returned the whole large DataFrame. When this is not the case, there will be NaN values in the merged columns, and you may have to employ a few more tricks to get the nice Boolean (integer) column to show what matched and what didn't. HTH.
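For instance, once the 0/1 flag above is in place, the unmatched rows can be pulled out directly (same names as in the snippet above):

# rows of large that found no (Name, Value) match in small
unmatched = new_large[new_large['_merge'] == 0]
print(len(unmatched))

The indicator column itself is never NaN, so this works even when many rows fail to match.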