Merging rows in python

I have a big .csv file, and I needed to group products with the same name based on ordered quantity, which I have done through groupby(). However, I need to use all 7 of my columns in my file, but after joining the rows, only the qty_ordered and name_hash columns are left; the rest of my columns disappear. Is there any way to keep all 7 of my columns in my dataframe while joining rows based on name_hash and qty_ordered?
This is my code:
import pandas as pd
data = pd.read_csv('in/tables/sales-order-item.csv')
# Join all name_hash values that share the same qty_ordered
data = data.groupby('qty_ordered')['name_hash'].apply(' '.join).reset_index()
The output of this is the name_hash and qty_ordered columns combined, but I need to keep the rest of my columns as well. Any ideas on how to do this?

Let's say you have 20 rows in your original dataframe; grouped by qty_ordered, it may have only 10 unique values of qty_ordered.
What will your other column values be for those 10 unique values of qty_ordered? You have that figured out for the column name_hash. If you know what to do with the other columns similarly, you could use agg.
For example, if you want the same ' '.join aggregation for every column, you could do
data.groupby('qty_ordered', as_index=False).agg(' '.join)
which applies ' '.join to all columns (note this works only if every column contains strings).
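If the remaining columns need different rules, pass a per-column dict to agg instead. A minimal sketch, run on the freshly read dataframe; the 'first' rule for the other columns is an assumption, swap in whatever suits each column:

# Join the hashes per group; keep the first value of every other column
# ('first' is an assumed placeholder rule, adjust per column as needed)
agg_rules = {'name_hash': ' '.join}
other_cols = [c for c in data.columns if c not in ('qty_ordered', 'name_hash')]
agg_rules.update({c: 'first' for c in other_cols})
data = data.groupby('qty_ordered', as_index=False).agg(agg_rules)

This keeps all 7 columns: qty_ordered as the group key, the joined name_hash, and one representative value for each remaining column.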

Related

How to get rows from one dataframe based on another dataframe

I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT).
The column 'ID' in dataframe DT is a subset of the column 'ID' in dataframe MR; the two frames are only similar (not equal) in this ID column, and the rest of the columns differ, as does the number of rows.
How can I get the rows of MR whose 'ID' value appears in DT['ID']? Note that values in 'ID' can appear several times in the same column.
(DT has 1538 rows and MR has 2060 rows.)
I tried some lines proposed in https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but got bizarre results, as I don't fully understand the methods proposed there (and the goal is a little different).
Thanks!
Take a look at pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or if you want to just get a new dataframe of combined records for the same ID you need to use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
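To see how the two options differ, here is a tiny example on invented data (the values are made up, not from the question):

import pandas as pd

MR = pd.DataFrame({'ID': [1, 2, 2, 3, 4], 'X': list('abcde')})
DT = pd.DataFrame({'ID': [2, 4], 'Y': [10, 20]})

# isin keeps only MR's columns, repeated IDs included: rows with ID 2, 2, 4
print(MR.loc[MR.ID.isin(DT.ID)])

# merge adds DT's columns where the IDs match: columns ID, X, Y
print(pd.merge(MR, DT, on='ID'))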

Averaging data of dataframe columns based on redundancy of another column

I want to average the data of one column in a pandas dataframe if the rows share the same 'id', which is stored in another column of the same dataframe. To make it simple, I have:
and I want:
where it is clear that the elements of the 'nx' and 'ny' columns have been averaged whenever their value of 'nodes' was the same. The column 'maille', on the other hand, has to remain untouched.
I'm trying with groupby but so far couldn't manage to keep the column 'maille' as it is.
Any idea?
Use GroupBy.transform, specifying the column names to aggregate in a list, and assign back:
cols = ['nx','ny']
df[cols] = df.groupby('nodes')[cols].transform('mean')
print (df)
Another idea with DataFrame.update:
df.update(df.groupby('nodes')[cols].transform('mean'))
print (df)
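To make the behaviour concrete, a small worked example; the column names come from the question, the values are invented:

import pandas as pd

df = pd.DataFrame({'nodes': [1, 1, 2],
                   'maille': ['a', 'b', 'c'],
                   'nx': [1.0, 3.0, 5.0],
                   'ny': [2.0, 4.0, 6.0]})
cols = ['nx', 'ny']
df[cols] = df.groupby('nodes')[cols].transform('mean')
print(df)
# The two nodes=1 rows now share nx=2.0 and ny=3.0; 'maille' is untouched.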

Expand column values from a grouped-by DataFrame into proper columns

After a GroupBy operation I have the following DataFrame:
The user_id is grouped with their respective aisle_id as I want. Now, I want to turn the aisle_id values into columns, having the user_id as an index, and all the aisle_id as columns. Then, in the values I want to have the number of times the user_id and aisle_id matched in the previous dataset. For example, if the user_id 1 has bought from the aisle_id 12 on 3 occasions, the value in DF[1,12] would be 3.
With Pandas pivot tables I can get the template of the user_id as index, and the aisle_id as columns, but I can't seem to find the way to create the values specified above.
Considering your first dataframe is df, I think you could try this:
df2 = pd.DataFrame(index=df['user_id'].unique(), columns=df['aisle_id'].unique())
for i in df2.index:
    for j in df2.columns:
        # Count the rows of df matching this (user_id, aisle_id) pair
        df2.at[i, j] = len(df.loc[(df['user_id'] == i) & (df['aisle_id'] == j)])
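The double loop can get slow on a large frame; assuming df holds one row per purchase, pd.crosstab builds the same count matrix in one call (this alternative is my addition, not part of the original answer):

# Count occurrences of each (user_id, aisle_id) pair
df2 = pd.crosstab(df['user_id'], df['aisle_id'])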

Aggregate Python DF based on column

I have a big dataframe (approximately 35 columns), where one column, concat_strs, is a concatenation of 8 columns in the dataframe. This is used to detect duplicates. What I want to do is to aggregate rows where concat_strs has the same value, on the columns val, abs_val, price, abs_price (using sum).
I have done the following:
agg_attributes = {'val': 'sum', 'abs_val': 'sum', 'price': 'sum', 'abs_price': 'sum'}
final_df = df.groupby('concat_strs', as_index=False).aggregate(agg_attributes)
But, when I look at final_df, I notice 2 issues:
Other columns are removed, so I have only 5 columns. I have tried to do final_df.reindex(columns=df.columns), but then all of the other columns are NaN
The number of rows in the final_df remains the same as in the df (ca. 300k rows). However, it should be reduced (checked manually)
The question is - what is done wrong and is there any improvement suggestion?
You group by concat_strs, so only concat_strs and the columns in agg_attributes are kept; in a groupby aggregation, pandas does not know what to do with the other columns.
You can include all columns in the aggregation with 'first' to keep the first value of each (if duplicated), or 'last', etc., depending on what you need.
Also, I'm not convinced this way of deduplicating is a good operation; can you simply drop all the duplicates instead?
You don't need concat_strs either, as groupby accepts a list of columns to group on.
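A sketch of the 'first' suggestion, extending the asker's aggregation dict; which rule suits each remaining column is an assumption you will need to settle:

agg_attributes = {'val': 'sum', 'abs_val': 'sum', 'price': 'sum', 'abs_price': 'sum'}
# Keep the first value of every column not explicitly aggregated
other_cols = [c for c in df.columns if c not in agg_attributes and c != 'concat_strs']
agg_attributes.update({c: 'first' for c in other_cols})
final_df = df.groupby('concat_strs', as_index=False).agg(agg_attributes)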
Not sure if I understood the question correctly, but you can try this:
final_df = df.groupby(['concat_strs']).sum()

Get unique values of multiple columns as a new dataframe in pandas

Having pandas data frame df with at least columns C1,C2,C3 how would you get all the unique C1,C2,C3 values as a new DataFrame?
In other words, similar to:
SELECT C1,C2,C3
FROM T
GROUP BY C1,C2,C3
I tried
print(df.groupby(by=['C1','C2','C3']))
but I'm getting
<pandas.core.groupby.DataFrameGroupBy object at 0x000000000769A9E8>
I believe you need drop_duplicates if you want all the unique triples:
df = df.drop_duplicates(subset=['C1','C2','C3'])
If you want to use groupby, add first:
df = df.groupby(by=['C1','C2','C3'], as_index=False).first()
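On a tiny invented frame, both approaches yield the same unique triples:

import pandas as pd

df = pd.DataFrame({'C1': [1, 1, 2], 'C2': ['x', 'x', 'y'], 'C3': [0.1, 0.1, 0.2]})
print(df.drop_duplicates(subset=['C1', 'C2', 'C3']))
print(df.groupby(by=['C1', 'C2', 'C3'], as_index=False).first())
# Both print the two distinct (C1, C2, C3) rows.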
