Combining/Grouping the dataset using one column while keeping the other data - python

I have merged many dataframes using the merge and append commands, and I am facing a data-redundancy problem in the final dataset. I have tried using groupby() on a unique attribute, but the end result still contains redundant data.
I tried:
removedRedun = data.groupby("Name", group_keys=False).apply(lambda x: x)  # identity apply: returns every row unchanged
[Screenshots: actual dataset and expected result.]
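For the record, groupby(...).apply(lambda x: x) is an identity operation, so it cannot remove anything. If the redundancy means fully duplicated rows, drop_duplicates is usually the tool; a minimal sketch, with toy data standing in for the merged dataset:

import pandas as pd

# toy stand-in for the merged dataset
data = pd.DataFrame({"Name": ["A", "A", "B"], "Value": [1, 1, 2]})

# drop rows that are exact duplicates; pass subset="Name" to keep one row per Name
removedRedun = data.drop_duplicates()
print(removedRedun)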


Unable to merge datasets

I have scraped data from two different pharma websites, so I have two datasets in hand.
Both datasets have a Name column in common. What I am trying to achieve is combining these two datasets: my final objective is to get all the tables from the first dataset and the product descriptions from the second dataset wherever the name is the same in both tables.
I tried using information from GeeksforGeeks (https://www.geeksforgeeks.org/different-types-of-joins-in-pandas/) and the pandas merging user guide (https://pandas.pydata.org/docs/user_guide/merging.html), but I am not getting the expected result.
Also, I tried it using a for loop, but to no avail:
new_df['Product_description'] = ''
for i in range(len(new_df['Name'])):
    for j in range(len(match_data['Name'])):
        # skip NaN names (floats), then match on the first word of match_data's Name
        if type(new_df['Name'][i]) != float:
            if new_df['Name'][i] == match_data['Name'][j].split(' ')[0].strip():
                new_df['Product_description'][i] = match_data['Product_Description'][j]
I also tried:
but it's giving me 106 results, which matches the older dataset, and I need 251 results, as in new_df.
I want something like this, but matched from the match_data dataframe.
Can anyone suggest what I am doing wrong here?
[Screenshot: result with a left join.]
Also, below are the values I am getting after finding the unique values, sorted.
If you want to keep the size of the first dataframe constant, you need to use a left join. Where values don't match, the result is set to null, but the size stays constant.
Also remember that, when how is 'left', the first parameter of the merge method is the dataframe whose size you want to keep constant.
If you want to keep new_df's length, I would suggest using the how='left' argument:
pd.merge(new_df, match_data, on="Name", how="left")
This will do a left join on new_df.
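A minimal illustration of why the left join keeps new_df's length, with toy data invented for the example:

import pandas as pd

new_df = pd.DataFrame({"Name": ["A", "B", "C"]})                 # 3 rows
match_data = pd.DataFrame({"Name": ["A", "C"],
                           "Product_Description": ["x", "y"]})   # 2 rows

merged = pd.merge(new_df, match_data, on="Name", how="left")
print(len(merged))  # 3: unmatched names simply get NaN descriptions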
Based on the screenshots you shared, I would double-check that there are names in common in both dataframes' "Name" columns.
Did you try these?
desc_df1 = pd.merge(new_df, match_data, on='Name', how='inner')
desc_df1 = pd.merge(new_df, match_data, on='Name', how='left')
After trying these options, let us know, because I could not understand the problem from your data preview. Can you sort Name.value_counts() ascending and check whether there are duplicates in both dataframes? If so, that is why you are getting this problem.
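One way to run that duplicate check, assuming new_df and match_data are the frames from the question:

# names with a count above 1 are duplicates and will multiply rows in a merge
print(new_df['Name'].value_counts().sort_values(ascending=True))
print(match_data['Name'].value_counts().sort_values(ascending=True))

# or a quick boolean check
print(new_df['Name'].duplicated().any(), match_data['Name'].duplicated().any())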

Convert a list of Rows to a pyspark dataframe

I have a pyspark dataframe (df) with very many rows and two columns (col1, col2), and I want to get a dataframe with only the top 30 rows. I've seen that I can display them with:
df.show(30)
But this isn't what I want to do; I need an actual dataframe with only these 30 rows. I've also seen that I can run:
df.head(30) or df.take(30)
But these don't output a dataframe, but rather a list of the form [Row(col1=100, col2=200), Row(col1=300, col2=500), ...].
What I want to know is how I can convert this list back into a dataframe. I have tried DataFrame and createDataFrame, but both throw errors. Specifically, I have tried:
df_list=df.head(30)
df_Columns = ["col1","col2"]
new_df=DataFrame(df_list, df_Columns)
But this throws errors too. I am at a loss as to how to do this seemingly simple task, so I am looking for a straightforward way to convert the df.head(30) list back into a pyspark dataframe. Any guidance or insights are greatly appreciated.
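Two approaches that should do this, sketched under the assumption that df is the dataframe from the question and a SparkSession is available:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# option 1: limit() returns a new dataframe directly, no list round-trip needed
top30 = df.limit(30)

# option 2: rebuild a dataframe from the Row list returned by head()/take()
rows = df.head(30)
new_df = spark.createDataFrame(rows)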

pandas: return mutated column into original dataframe

I've attempted to search the forum for this question, but I believe I may not be asking it correctly, so here it goes.
I have a large dataset with many columns. Originally, I needed to sum all columns for each row, by multiple groups, based on a name pattern of variables. I was able to do so via:
cols = data.filter(regex=r'_name$').columns
data['sum'] = data.groupby(['id','group'],as_index=False)[cols].sum().assign(sum = lambda x: x.sum(axis=1))
By running this code, I receive a modified dataframe grouped by my two factor variables (group and id), with all the columns and the final sum column I need. However, now I want to return the final sum column back into the original dataframe; the above code returns the entire modified dataframe into my sum column. I know this is achievable in R by simply adding a .$sum at the end of a piped chain. Any ideas on how to get this in pandas?
My hoped-for output is just the addition of the final "sum" variable from the above lines of code into my original dataframe.
Edit: to clarify, the code above returns this entire dataframe:
[Screenshot of the grouped dataframe.] All I want returned is the column in yellow.
Is this what you need?
data['sum'] = data.groupby(['id','group'])[cols].transform('sum').sum(axis = 1)
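A small runnable demo of why transform works here: it returns a result aligned to the original index, so the per-group sums land on every row (toy data and column names invented for the example):

import pandas as pd

data = pd.DataFrame({
    "id":     [1, 1, 2],
    "group":  ["a", "a", "b"],
    "x_name": [1, 2, 3],
    "y_name": [10, 20, 30],
})
cols = data.filter(regex=r'_name$').columns

# transform('sum') keeps the original row count, unlike .sum()
data['sum'] = data.groupby(['id', 'group'])[cols].transform('sum').sum(axis=1)
print(data)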

How to assign the same array of columns to multiple dataframes in Pandas?

I have 9 datasets. Any two given datasets share about 60-80% of their columns. I want to concatenate these datasets into one. Due to memory limitations, I can't load all of them into dataframes at once and use pandas' concat function (though I can load each individual dataset into a dataframe), so I am looking at an alternative solution.
I have created an ordered list of all columns which exist across these datasets, and I want to apply this column list to each of the 9 individual datasets. That way they will all have the same columns, in the same order. Once that is done, I will concatenate the flat files in the terminal, which will essentially append the datasets together, hopefully solving my issue and creating one single dataset from the 9.
The problem I am having is applying the ordered list to the 9 datasets. I keep getting a KeyError, "[list of columns] not in index", whenever I try to change the columns of a single dataset.
This is what I have been trying:
df = df[clist]
I have also tried
df = df.reindex(columns=clist)
but this doesn't create the extra columns in the dataframe, it just puts the existing ones in the order that clist is in.
I expect the result to create 9 datasets which lineup on the same axis for an appends or concat operation outside pandas.
I just solved it.
The reindex function does work; I was applying it outside of the list of dataframes I had created.
I loaded the first 10 rows of each of the 9 datasets into a list:
li = []
for filename in all_files:
    df = pd.read_csv(filename, nrows=10)  # pd.read_csv, not pd.read; read only the first 10 rows
    li.append(df)
And from that list I used reindex as such:
for i in range(0, 9):
    li[i] = li[i].reindex(columns=clist)
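For what it's worth, reindex does create the missing columns, filling them with NaN; a minimal sketch with an invented column list:

import pandas as pd

clist = ["a", "b", "c"]                   # ordered master column list
df = pd.DataFrame({"b": [1], "a": [2]})   # note: no column "c"

df = df.reindex(columns=clist)
print(df)  # columns a, b, c in that order; "c" is filled with NaN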

Pandas dataframe left-merge with different dataframe sizes

I have a toy stock predictor, and from time to time I save results using dataframes. After the first result set I would like to append to my first dataframe. Here is what I do:
1. Create the first dataframe using the predicted results.
2. Sort descending by predicted performance.
3. Save to CSV, without the index.
4. With new data, read the result CSV back in and try a left merge; the goal is to append the new predicted performance to the correct stock ticker:
df=pd.merge(df, df_new[['ticker', 'avgrd_app']], on='ticker', how='left')
Those two dataframes have different numbers of columns. In the end it only appends the dataframes to one another:
avgrd,avgrd_app,prediction1,prediction2,ticker
-0.533520756811,,110.64654541,110.37853241,KIO
-0.533520756811,,110.64654541,110.37853241,MMM
-0.604610694122,,110.64654541,110.37853241,SRI
[...]
,-0.212600450514,,,G5DN
,0.96378750992,,,G5N
,2.92757501984,,,DAL3
,2.27297945023,,,WHF4
So - how can I merge correctly?
From the sample result, it works as expected: the new data doesn't have numbers for all the tickers, so some of the predictions are missing. What exactly do you want to achieve? If you only need stocks that have all the predictions, use an inner join.
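A sketch of that inner-join variant, assuming df and df_new are the frames from the question:

# keep only tickers present in both frames, so no prediction columns are NaN
df = pd.merge(df, df_new[['ticker', 'avgrd_app']], on='ticker', how='inner')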
