I have a single DataFrame in which every row is duplicated except for two values; in each case the duplicate row has a blank where the other row has a value. I want to 'collapse' these rows and fill in the blanks.
In the example below, I want to collapse the top DataFrame so that it mirrors the bottom one.
You can use groupby + first; first skips over NaN values by default:
collapsed_df = df.groupby("feature_id").first().reset_index()
If the empty cells are not actual NaN values, you'll probably want to replace them with NaN first:
df = df.replace('', np.nan)
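As a rough illustration, here is a toy frame (the feature_id and value column names are made up, not from the question) that mimics the layout described above:

import numpy as np
import pandas as pd

# made-up data: each feature_id appears twice, with the blanks complementing each other
df = pd.DataFrame({
    "feature_id": ["A", "A", "B", "B"],
    "val1": [1.0, "", 3.0, ""],
    "val2": ["", 2.0, "", 4.0],
})

df = df.replace('', np.nan)                        # turn blanks into real NaN
collapsed_df = df.groupby("feature_id").first().reset_index()
print(collapsed_df)                                # one row per feature_id, blanks filled in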
I'm facing a strange issue in which I'm trying to replace all NaN values in a dataframe with values taken from another one (same length) that has the relevant values.
Here's a glimpse of the target dataframe, data_with_null, in which I want to replace the values.
Here's the dataframe I want to take the data from: predicted_paticipant_groups
I've tried:
data_with_null.participant_groups.fillna(predicted_paticipant_groups.participant_groups, inplace=True)
but it just fills all NaN values with the first one (Infra).
Is it because the indexes of data_with_null are all zeros?
Reset the index and try again.
data_with_null.reset_index(drop=True, inplace=True)
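A minimal sketch of why this works, using toy frames shaped like the ones in the question (the data itself is made up): with the duplicated all-zero index, fillna aligns every row to index 0 and so fills everything with "Infra"; after resetting the index, alignment is positional and each NaN gets its own value.

import numpy as np
import pandas as pd

# toy frames: the target has a duplicated all-zero index, the source has a clean RangeIndex
data_with_null = pd.DataFrame({"participant_groups": [np.nan, "Dev", np.nan]},
                              index=[0, 0, 0])
predicted_paticipant_groups = pd.DataFrame(
    {"participant_groups": ["Infra", "Dev", "QA"]})

data_with_null.reset_index(drop=True, inplace=True)
data_with_null["participant_groups"] = data_with_null["participant_groups"].fillna(
    predicted_paticipant_groups["participant_groups"])
print(data_with_null)   # Infra, Dev, QA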
I have a dataframe which I query and I want to get only unique values out of a certain column.
I tried to do that executing this code:
database = pd.read_csv(db_file, sep='\t')
query = database.loc[database[db_specification[0]].isin(elements)].drop_duplicates(subset=db_specification[1])
db_specification is just a list containing two columns that I query.
Some of the values are NaN, and I don't want them to be considered duplicates of each other. How can I achieve that?
You can start by separating out the rows whose key is NaN, and then drop duplicates on the rest of the dataframe:
# rows where the dedup column is NaN are kept untouched
mask = data[db_specification[1]].isna()
# deduplicate only the non-NaN rows, then put the two parts back together
data = pd.concat([data[mask], data[~mask].drop_duplicates(subset=db_specification[1])])
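For example, with a small made-up frame (the column name is assumed), both NaN rows survive while the real duplicate is removed; a plain drop_duplicates would also collapse the two NaN rows into one:

import numpy as np
import pandas as pd

# made-up example; 'col' stands in for db_specification[1]
data = pd.DataFrame({"col": ["x", "x", np.nan, np.nan], "other": [1, 2, 3, 4]})

mask = data["col"].isna()
result = pd.concat([data[mask], data[~mask].drop_duplicates(subset="col")])
print(result)   # both NaN rows kept, only one 'x' row remains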
The original file has multiple columns, but there are lots of blanks, and I want to rearrange it so that there is one tidy column with the info. Starting with 910 rows and 51 columns (the newFile df), I want to end up with 910+x rows and 3 columns (the final df); at the moment final has 910 rows.
newFile sample
for i in range(0, len(newFile)):
    for j in range(0, 48):
        if pd.notnull(newFile.iloc[i, 3+j]):
            final = final.append(newFile.iloc[[i], [0, 1, 3+j]], ignore_index=True)
This piece of code goes through newFile and, if column 3+j is not null, copies columns 0, 1, and 3+j into a new row. I tried append(), but it adds not only rows but also a bunch of columns full of NaNs again (like the original file).
Any suggestions?!
Your problem is that you are appending to a DataFrame while keeping the column names, so each appended row only fills some of the columns and the remaining columns are padded with NaN for that row.
On top of that, the double for loop makes the code quite inefficient.
Here is my solution using melt()
import numpy as np
import pandas as pd

# creating an example df
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 51)),
                  columns=list('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXY'))

# reshaping df to long format, keeping the first two columns as id variables
df = df.melt(id_vars=df.columns[0:2])

# dropping the rows where the value is null
df.dropna(subset=['value'], inplace=True)

# stop here if you want to keep the information about which column each value
# came from; otherwise drop it
df.drop(columns=['variable'], inplace=True)
print(df)
I am trying to make a subset of a dataframe
combo.iloc[:,orig_start_col:orig_start_col+2]
equal to the values another subset already has
combo.iloc[:,sm_col:sm_col+2]
where the columns vary in a loop. The problem is that all I am getting is NaNs, even though the values in the second subset are not NaN.
I tried doing this for just the first column and it worked; however, doing it for just the second column of the subset returns all NaNs, and doing it for the whole subset returns NaN values for everything.
My code is:
for node_col in ('leg2_node', 'leg4_node'):
    combo = orig_combos.merge(all, how='inner', left_on='leg6_node', right_on=node_col)
    combo.reset_index(drop=True, inplace=True)
    orig_start_col = combo.columns.get_loc('leg6_alpha_x')
    sm_col = combo.columns.get_loc(node_col + '_y')
    combo.iloc[:, orig_start_col+1:orig_start_col+2] = combo.iloc[:, sm_col+1:sm_col+2]
Since the sm_col:sm_col+2 subset has values in every row, I would expect those values to end up in the orig_start_col:orig_start_col+2 subset, but instead everything comes out as NaN.
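No accepted answer is quoted here, but a likely cause is that assigning one DataFrame slice to another aligns on column labels, and because the source and target column names differ, pandas fills the target with NaN. A sketch of the usual workaround, with made-up column names standing in for combo, is to assign the underlying array so the assignment is purely positional:

import pandas as pd

# toy stand-in for 'combo'; all column names here are made up
combo = pd.DataFrame({
    "leg6_alpha_x": [float("nan")] * 2, "leg6_node_x": [float("nan")] * 2,
    "leg2_node_y": [10.0, 20.0], "leg2_alpha_y": [30.0, 40.0],
})
orig_start_col = combo.columns.get_loc("leg6_alpha_x")   # 0
sm_col = combo.columns.get_loc("leg2_node_y")            # 2

# .to_numpy() strips the labels, so no column alignment (and no NaN) occurs
combo.iloc[:, orig_start_col:orig_start_col+2] = combo.iloc[:, sm_col:sm_col+2].to_numpy()
print(combo)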
When taking nlargest on a pandas DataFrame, is there a way to ignore columns with NaN values? Say I want to pick the 5 column headings with the 5 largest values; if a column has NaN values, it should be ignored. If the number of columns with finite values is smaller than 5, then pick all of the column headings that have finite values (fewer than 5).
nlargest takes the top n rows sorted in descending order by the columns passed to the method. If NaN values make it to the top, they will be included. If you want to ignore rows that have NaN values in the sort columns, do this:
# assume a variable 'columns' exist that defines what columns to sort
# by. You'll have to assign this yourself. Also assign 'n' yourself.
df = df.dropna(subset=columns)
df = df.nlargest(n, columns=columns)
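A quick made-up example of how those two lines behave together (the column names and data are assumptions, not from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({"name": list("abcde"),
                   "score": [5.0, np.nan, 8.0, 1.0, 7.0]})
columns, n = ["score"], 3

df = df.dropna(subset=columns)            # removes row 'b' (NaN score)
print(df.nlargest(n, columns=columns))    # rows c (8.0), e (7.0), a (5.0)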