The name of my dataframe is df.
I want to combine the rows having the same Borough and PostalCode, with the Neighborhood values separated by commas, but I'm not able to get it. Can anyone please help me with it?
You have to first group by the first two columns and then apply a transform to join the results:
df['Neighborhood'] = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].transform(lambda x: ','.join(x))
df = df.drop_duplicates()
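For reference, a minimal self-contained sketch of that approach; the sample data below is invented to mirror the question's columns (the CR1/Purley row is only there to show that singleton groups pass through unchanged):
import pandas as pd

df = pd.DataFrame({
    'PostalCode': ['CR0', 'CR0', 'CR1'],
    'Borough': ['Croydon', 'Croydon', 'Croydon'],
    'Neighborhood': ['Addington', 'Addiscombe', 'Purley'],
})

# Join the neighborhoods within each (PostalCode, Borough) group...
df['Neighborhood'] = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].transform(lambda x: ','.join(x))
# ...then drop the now-identical duplicate rows
df = df.drop_duplicates()
print(df)
#   PostalCode  Borough          Neighborhood
# 0        CR0  Croydon  Addington,Addiscombe
# 2        CR1  Croydon                Purley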
You can use this:
df = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].agg(','.join)
output sample for the two rows:
CR0 Croydon Addington,Addiscombe
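Note that agg here returns a Series indexed by (PostalCode, Borough); if you want a regular DataFrame with those as columns again, add a reset_index():
df = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].agg(','.join).reset_index()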
I have the following dataframe:
And I was wondering how to get:
As you can see, the blue rows are subrows, and the idea is to group them together depending on the name:
I tried:
DFTest= pd.read_excel("XXXXXXXXXXX/Test.xlsx")
DFTest.groupby(['Name'], as_index=False).sum().reset_index(drop=True)
But this deletes the blank rows (0, 1, 2, 5, 6, 7).
How would I group subrows together and keep the blank rows as they are?
This does the job:
import numpy as np
import pandas as pd

grouped_df = df.groupby("Name", as_index=False)
df_sum = grouped_df.agg(np.sum)
new_df = pd.concat([df[df["Numb2"].isna()], df_sum])
First I get the sum of all the values in the Numb2 column, and then concatenate this new dataframe with the rows that have a NaN value in the Numb2 column.
The row order of this dataframe won't be the same as the one in the image you shared, but I don't think that will be a problem.
If it is a problem, use the code below to get the dataframe sorted:
new_df = new_df.sort_values(by="Name")
I hope this helped you!
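For anyone trying to reproduce this, here is the whole flow on invented data; only the Name and Numb2 column names come from the answer, the values are made up, and .sum() is used directly, which is equivalent to agg(np.sum):
import numpy as np
import pandas as pd

# Invented data: the "blank" header rows carry NaN in Numb2
df = pd.DataFrame({
    'Name':  ['A', 'A', 'A', 'B', 'B', 'B'],
    'Numb2': [np.nan, 3, 4, np.nan, 5, 7],
})

df_sum = df.groupby('Name', as_index=False).sum()  # A -> 7.0, B -> 12.0
new_df = pd.concat([df[df['Numb2'].isna()], df_sum])
# mergesort is stable, so each blank header row stays above its group's sum
new_df = new_df.sort_values(by='Name', kind='mergesort')
print(new_df)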
I have a DataFrame which contains categories, and I want to split the DataFrame by category, collecting the unique category values as df_name:
df_name = df['category'].unique()
print(df_name)
result:
['df1' 'df2']
After splitting the DataFrame using a loop, I get 2 smaller DataFrames, df1 and df2.
Next, I want to alter the DataFrames: I want to remove the category column from df1 and df2 using df_name, but I got an error. After trying for a while, I think the problem is that df_name is a list of strings rather than of DataFrames.
How do I convert df_name from
['df1' 'df2']
to
[df1 df2]
?
Why use a loop to filter? You can just use df[df['column'] == 'df1'] to select the rows whose value in a column is 'df1'.
Then, if you want to remove the column, you can use del df['category'].
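If you really do want one DataFrame per category, a dictionary keyed by the category value is usually easier than trying to create variables from strings; a sketch, assuming the column is literally named category:
# One sub-DataFrame per category value, with the category column removed
dfs = {name: group.drop(columns='category')
       for name, group in df.groupby('category')}

dfs['df1']  # the rows whose category was 'df1'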
I have a big .csv file, and I needed to group products with the same name based on ordered quantity, which I have done through groupby(). However, I need to use all 7 of the columns in my file, but after joining the rows, only the qty_ordered and name_hash columns are left; the rest of my columns disappear. Is there any way to keep all 7 columns in my dataframe while joining rows based on name_hash and qty_ordered?
This is my code:
import pandas as pd
data = pd.read_csv('in/tables/sales-order-item.csv')
data = data.groupby('qty_ordered')['name_hash'].apply(' '.join).reset_index()
The output of this is the name_hash and qty_ordered columns combined, but I need to keep the rest of my columns as well. Any ideas on how to do this?
Let's say you have 20 rows in your original dataframe; when grouped by qty_ordered, it may have only 10 unique values of qty_ordered.
What should the other column values be for those 10 unique values of qty_ordered? You have that figured out for the column name_hash. If you know what to do with the other columns similarly, you can use agg.
For example, if you want the same ' '.join aggregation applied to every column, you could do
data.groupby('qty_ordered', as_index=False).agg(' '.join)
which will do the same operation ' '.join for all columns.
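If the other columns need different treatment, pass agg a dict instead; in this sketch only name_hash and qty_ordered come from the question, the other column names are invented placeholders:
data = data.groupby('qty_ordered', as_index=False).agg({
    'name_hash': ' '.join,  # concatenate the names, as in the question
    'price': 'sum',         # invented column: total per group
    'sku': 'first',         # invented column: keep the first value
})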
Our objective right now is to drop the duplicate player rows, but keep the row with the highest count in the G column (Games played). What code can we use to achieve this? I've attached a link to the image of our Pandas output here.
You probably want to first sort the dataframe by column G.
df = df.sort_values(by='G', ascending=False)
You can then use drop_duplicates to drop all duplicates except for the first occurrence.
df = df.drop_duplicates(['Player'], keep='first')
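A quick check on invented data:
import pandas as pd

df = pd.DataFrame({'Player': ['Smith', 'Smith', 'Jones'],
                   'G': [10, 82, 55]})

df = df.sort_values(by='G', ascending=False)
print(df.drop_duplicates(['Player'], keep='first'))
#   Player   G
# 1  Smith  82
# 2  Jones  55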
There are two ways that I can think of:
df.groupby('Player', as_index=False)['G'].max()
and
df.sort_values('G').drop_duplicates(['Player'], keep='last')
The first method groups rows by Player and keeps the maximum of G for each player. The second uses Pandas' drop_duplicates method to achieve the same: after sorting by G in ascending order, keeping the last duplicate keeps the row with the highest G.
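Note that the groupby version returns only the Player and G columns. If you need every column of the winning row, one alternative (not mentioned in the answers above) is idxmax, which gives the index label of each group's maximum:
# Keep the full row holding the maximum G for each player
df = df.loc[df.groupby('Player')['G'].idxmax()]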
Try this, assuming your dataframe object is df1:
series = df1.groupby('Player')['G'].max()  # this returns a Series
pd.DataFrame(series)
Let me know if this works for you.
Having a pandas DataFrame df with at least the columns C1, C2, C3, how would you get all the unique C1, C2, C3 combinations as a new DataFrame?
In other words, similar to:
SELECT C1,C2,C3
FROM T
GROUP BY C1,C2,C3
I tried
print(df.groupby(by=['C1', 'C2', 'C3']))
but I'm getting
<pandas.core.groupby.DataFrameGroupBy object at 0x000000000769A9E8>
I believe you need drop_duplicates if you want all unique triples:
df = df.drop_duplicates(subset=['C1','C2','C3'])
If you want to use groupby, add first:
df = df.groupby(by=['C1','C2','C3'], as_index=False).first()
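Both give one row per unique (C1, C2, C3) triple. A quick check with invented data, including an extra column to show that drop_duplicates keeps whole rows:
import pandas as pd

df = pd.DataFrame({'C1': [1, 1, 2], 'C2': ['a', 'a', 'b'],
                   'C3': [True, True, False], 'other': [10, 20, 30]})

print(df.drop_duplicates(subset=['C1', 'C2', 'C3']))
#    C1 C2     C3  other
# 0   1  a   True     10
# 2   2  b  False     30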