how to merge duplicate rows using two columns in pandas - python

I am trying to merge two rows from the dataframe below, but at the same time I want to replace the None and NaN fields with values from the rows that have them.
I started with
new_df = df.groupby(['source', 'code'], axis=0)
but the result wasn't what I was looking for. In the dataframe below, I would like row 2 and row 5 to merge into a single row filled with the non-empty values.
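One way to get that (a minimal sketch with made-up values, since the original dataframe isn't shown): GroupBy.first() returns the first non-null value in each column per group, so duplicate ('source', 'code') rows collapse into a single filled-in row.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'source': ['A', 'A', 'B'],
    'code':   [1, 1, 2],
    'val1':   [np.nan, 10.0, 5.0],
    'val2':   ['x', None, 'y'],
})

# first() skips nulls, so the two ('A', 1) rows merge into one row
new_df = df.groupby(['source', 'code'], as_index=False).first()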

Related

Compare two dataframes, and update the value to NaN if it is NaN in the first dataframe

I have two pandas dataframes, and both dataframes have equal number of rows and columns.
Only the first column in both dataframes is guaranteed to have non-NaN values. The other columns may hold either NaN or some integer.
I need to compare the two dataframes using the first column, and set a value in the second dataframe to NaN wherever the corresponding cell in the first dataframe is NaN, irrespective of the value already in the second dataframe.
I tried a brute-force method by converting the dataframes into dictionaries and comparing each value. Is there a simpler way to achieve this with Pandas functions?
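A minimal sketch of one Pandas-native way (df1, df2, and the key column 'id' are made up here): align both frames on the first column, then use mask to copy df1's NaN pattern onto df2.

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'id': ['a', 'b', 'c'], 'x': [1.0, np.nan, 3.0], 'y': [np.nan, 5.0, 6.0]})
df2 = pd.DataFrame({'id': ['a', 'b', 'c'], 'x': [7, 8, 9], 'y': [10, 11, 12]})

# align on the key column, then blank df2 wherever df1 is NaN
a = df1.set_index('id')
b = df2.set_index('id')
df2_updated = b.mask(a.isna()).reset_index()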

Pandas merge two data frame only to first occurrence

I have two dataframes and I am able to merge them with pd.merge(df1, df2, on='column_name'). But I only want to merge on the first occurrence in df1. Any pointers or solutions? It's a many-to-one merge, and I only want the first occurrence merged. Thanks in advance!
Since you are merging two dataframes of different lengths, the merged dataframe will need NaN values in the cells that have no corresponding row in df2. So let's try this: merge left, which duplicates the df2 values for repeated column_name rows in df1, then have a mask ready to filter those repeated rows and assign NaN to the df2 columns for them.
import numpy as np

# rows whose key has already appeared earlier in df1
mask = df1['column_name'].duplicated()
new_df = df1.merge(df2, how='left', on='column_name')
# blank out the df2 columns on every repeated occurrence
new_df.loc[mask, df2.columns[df2.columns != 'column_name']] = np.nan
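Note that mask carries df1's index while new_df gets a fresh RangeIndex from the merge, so the .loc assignment lines up only when df1 also has a default RangeIndex; passing mask.to_numpy() instead sidesteps the index alignment entirely.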

Sum multiple dataframe columns based on a condition

I have a pandas dataframe with 30 columns.
I would like to add a new column and set it to the sum of only the values equal to 1 from the last 10 columns (positions 20:30).
How can I do that?
Thanks
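A minimal sketch, assuming df is the 30-column frame: take the last ten columns by position, keep only the cells equal to 1, and sum across each row.

last10 = df.iloc[:, 20:30]
df['new_col'] = last10.where(last10 == 1).sum(axis=1)
# every kept value is 1, so this equals (last10 == 1).sum(axis=1)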

How to add a pandas Series to a DataFrame ignoring indices?

I have a DataFrame with random, unsorted row indices, which is a result of removing some 'noise' from the original DataFrame.
row_index  col1  col2
        2     1     2
       19     3     4
      432     4     1
I would like to add a pd.Series to this DataFrame. The Series has its index sorted from 0 to n, where n is the number of rows, and that matches the number of rows in the DataFrame.
Having tried multiple ways of adding the Series to my DataFrame, I realized that the data from the Series gets mixed up, because (I believe) pandas is matching records by their indices.
Is there a way I can add the Series to the DataFrame, ignoring the indices, so that my data doesn't get mixed up?
Convert the Series into a DataFrame whose index matches the target, then concatenate:

s_df = pd.DataFrame(s.values, index=df1.index, columns=['new_col'])
result = pd.concat([df1, s_df], axis=1)

Here df1 is the DataFrame you want to add to, and s is the Series you converted to a DataFrame. Using s.values drops the Series' 0-to-n index, so the rows line up by position instead of by label. Alternatively, assign the underlying array directly, which also ignores indices:

df['new_col'] = other_df['column'].values

How to drop rows from a dataframe as per null values in a specific column?

Say I have a dataframe that has three columns a, b, c, and all of them can have null values, but I only want to drop rows where column b has null/NaN. How can I do that with a pandas dataframe?
This should do the trick:
df = df.dropna(subset=['b'])  # axis defaults to 0, i.e. drop rows
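For illustration, with a small made-up frame:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [4, 5, np.nan], 'c': [np.nan, 8, 9]})
df = df.dropna(subset=['b'])  # drops the last row, since its 'b' value is NaN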
