Edit DataFrame to merge sub-rows only - Python

I have the following dataframe:
And I was wondering how to get:
As you can see, the blue rows are sub-rows, and the idea is to group them together depending on the name:
I tried:
DFTest= pd.read_excel("XXXXXXXXXXX/Test.xlsx")
DFTest.groupby(['Name'], as_index=False).sum().reset_index(drop=True)
But this deletes the blank rows (0, 1, 2, 5, 6, 7).
How would I group the sub-rows together and keep the blank rows as they are?

This does the job:
import numpy as np
import pandas as pd

grouped_df = df.groupby("Name", as_index=False)
df_sum = grouped_df.agg(np.sum)  # or simply grouped_df.sum()
new_df = pd.concat([df[df["Numb2"].isna()], df_sum])
First I take the sum of the Numb2 values per name, then concatenate this new dataframe with the rows that have a NaN value in the Numb2 column.
The row order won't be the same as in the image you shared, but I don't think that will be a problem. If it is, use the line below to sort the dataframe:
new_df = new_df.sort_values(by="Name")
I hope this helped you!
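Since the original spreadsheet isn't visible here, the sketch below recreates the idea with invented data (only the Name and Numb2 column names come from the question); summing just the non-blank rows avoids producing zero-sum duplicates for the blank-row names:
import numpy as np
import pandas as pd

# Invented sample: rows with NaN in Numb2 play the role of the blank rows
df = pd.DataFrame({
    "Name": ["A", "A1", "A1", "B", "B1", "B1"],
    "Numb2": [np.nan, 1, 2, np.nan, 3, 4],
})

blank_rows = df[df["Numb2"].isna()]    # keep these as they are
sub_rows = df[df["Numb2"].notna()]     # merge these per name

df_sum = sub_rows.groupby("Name", as_index=False).sum()
new_df = pd.concat([blank_rows, df_sum]).sort_values(by="Name").reset_index(drop=True)
# Result: A (blank), A1 with 3, B (blank), B1 with 7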

Related

How to get rows from one dataframe based on another dataframe

I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT)
The column 'A' (the ID column) in dataframe DT is a subset of the column 'A' in dataframe MR; the two frames are only similar (not equal) in this ID column, and the rest of the columns differ, as does the number of rows.
How can I get the rows of MR whose 'ID' values are equal to values in DT['ID'], knowing that a value can appear several times in the same 'ID' column?
(DT has 1538 rows and MR has 2060 rows.)
I tried some of the lines proposed at https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but got bizarre results, as I don't fully understand the methods proposed there (and the goal is a little different).
Thanks!
Take a look at pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or, if you want to get a new dataframe of combined records for the same ID, you need to use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
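To make the difference concrete, here is a small self-contained sketch; the data is invented, and only the ID column and the MR/DT names come from the question:
import pandas as pd

MR = pd.DataFrame({"ID": [1, 2, 2, 3, 4], "x": list("abcde")})
DT = pd.DataFrame({"ID": [2, 3], "y": [10, 20]})

# isin(): keep MR rows whose ID appears anywhere in DT.ID (duplicates preserved)
filtered = MR.loc[MR.ID.isin(DT.ID), :]   # rows with ID 2, 2, 3

# merge(): attach DT's columns to the matching MR rows
combined = pd.merge(MR, DT, on="ID")      # same rows, now with the y column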

How can I "merge" rows having the same values using groupby in a Pandas dataframe?

link to table
The name of my dataframe is df.
I want to combine the rows that have the same Borough and the same PostalCode, with the Neighbourhood values separated by commas, but I'm not able to get it. Can anyone please help me with it?
You first have to group by the first two columns and then apply a transform that joins the result:
df['Neighbourhood'] = df.groupby(['PostalCode', 'Borough'])['Neighbourhood'].transform(lambda x: ','.join(x))
df = df.drop_duplicates()
You can use this:
df = df.groupby(['PostalCode','Borough'])['Neighbourhood'].agg(','.join)  # add .reset_index() if you want a DataFrame back instead of a Series
output sample for the two rows:
CR0 Croydon Addington,Addiscombe
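As a runnable illustration, here is a sketch built from invented rows modelled on the sample output above; the reset_index() at the end turns the grouped Series back into a regular dataframe:
import pandas as pd

# Invented data modelled on the CR0 / Croydon sample above
df = pd.DataFrame({
    "PostalCode": ["CR0", "CR0", "M4B"],
    "Borough": ["Croydon", "Croydon", "East York"],
    "Neighbourhood": ["Addington", "Addiscombe", "Parkview Hill"],
})

merged = (df.groupby(["PostalCode", "Borough"])["Neighbourhood"]
            .agg(",".join)
            .reset_index())
# CR0 / Croydon now holds "Addington,Addiscombe"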

Subsetting a dataframe in pandas according to column name values

I have a dataframe in pandas that I need to split up. It is much larger than this, but here is an example:
ID A B
a 0 0
b 1 1
c 2 2
and I have a list: keep_list = ['ID','A']
and another list: recode_list = ['ID','B']
I'd like to split the dataframe up by the column headers into two dataframes: one dataframe with the columns and values whose headers match keep_list, and one with the headers and data that match recode_list. Everything I have tried so far has failed because it compares the values, not the column names, to the list.
Thank you so much in advance for your help!
Assuming your DataFrame's name is df:
you can simply do
df[keep_list] and df[recode_list] to get what you want.
You can also do this with Index.intersection, which ignores any names in the list that are missing from the columns:
df1 = df[df.columns.intersection(keep_list)]
df2 = df[df.columns.intersection(recode_list)]
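Putting both answers together with the example data from the question (the frame construction itself is an assumption, since only the printed table was shown):
import pandas as pd

df = pd.DataFrame({"ID": ["a", "b", "c"], "A": [0, 1, 2], "B": [0, 1, 2]})
keep_list = ["ID", "A"]
recode_list = ["ID", "B"]

df1 = df[keep_list]                               # plain label selection; raises KeyError on missing names
df2 = df[df.columns.intersection(recode_list)]    # silently ignores names not in the columns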

How to drop duplicated rows in data frame based on certain criteria?

Our objective right now is to drop the duplicate player rows, but keep the row with the highest count in the G column (Games played). What code can we use to achieve this? I've attached a link to the image of our Pandas output here.
You probably want to first sort the dataframe by column G.
df = df.sort_values(by='G', ascending=False)
You can then use drop_duplicates to drop all duplicates except for the first occurrence.
df = df.drop_duplicates(['Player'], keep='first')  # reassign: drop_duplicates is not in-place by default
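A quick self-contained check of this approach (the Player/G data is invented; only the column names come from the question):
import pandas as pd

df = pd.DataFrame({"Player": ["Smith", "Smith", "Jones"],
                   "G": [50, 82, 70]})

df = df.sort_values(by="G", ascending=False)
df = df.drop_duplicates(["Player"], keep="first")   # keeps Smith's 82-game row and Jones's 70-game row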
There are two ways that I can think of:
df.groupby('Player', as_index=False)['G'].max()
and
df.sort_values('G').drop_duplicates(['Player'] , keep = 'last')
The first method groups the rows by Player and keeps the maximum of G for each player (note that it returns only the Player and G columns). The second one uses the drop_duplicates method of Pandas to achieve the same while keeping all columns.
Try this. Assuming your dataframe object is df1:
series = df1.groupby('Player')['G'].max()  # this returns a Series
result = pd.DataFrame(series)
Let me know whether this works for you.
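One caveat worth spelling out, using the same kind of invented data: the groupby variants return only the Player and G columns, while the sort-and-drop approach keeps every column of the winning row:
import pandas as pd

df1 = pd.DataFrame({"Player": ["Smith", "Smith", "Jones"],
                    "G": [50, 82, 70],
                    "Team": ["NYK", "BOS", "LAL"]})   # Team is an invented extra column

df1.groupby("Player", as_index=False)["G"].max()               # Player and G only
df1.sort_values("G").drop_duplicates(["Player"], keep="last")  # full rows, Team included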

How do I drop all columns that include '_id' - Python

I have a dataframe with 247 columns. Many of the column names contain "_id". How do I drop all the columns that contain "_id"?
This is pretty straightforward as well. Select the columns that contain "_id", invert the mask, use .loc to restrict the columns, and you're done.
df = df.loc[:, ~df.columns.str.contains("_id")]
Try this:
df.drop(list(df.filter(like='_id')), axis=1, inplace=True)
What this code does is filter all the columns that have _id anywhere in their name and then drop them from the dataframe.
Let me know if you didn't understand or need any further help.
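Both approaches can be checked on a small invented frame; the column names below are made up for illustration:
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2], "name": ["a", "b"],
                   "session_id": [3, 4], "score": [0.5, 0.9]})

kept_mask = df.loc[:, ~df.columns.str.contains("_id")]      # boolean-mask approach
kept_drop = df.drop(columns=df.filter(like="_id").columns)  # filter/drop approach
# both keep only ['name', 'score']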
