I have a single DataFrame in which every row is duplicated except for two values; in each case the duplicate row has a blank where the other row has a value. I want to 'collapse' these rows and fill in the blanks.
In the example below, I want to collapse the top DataFrame so that it mirrors the bottom one.
You can use groupby + first; first skips over NaN values by default:
collapsed_df = df.groupby("feature_id").first().reset_index()
If the empty cells are not actual NaN values, you'll probably want to replace them with NaN first:
df = df.replace('', np.nan)
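As a rough illustration, here is a toy frame (the feature_id and value column names are made up, not from the question) that mimics the layout described above:

import numpy as np
import pandas as pd

# made-up data: each feature_id appears twice, with the blanks complementing each other
df = pd.DataFrame({
    "feature_id": ["A", "A", "B", "B"],
    "val1": [1.0, "", 3.0, ""],
    "val2": ["", 2.0, "", 4.0],
})

df = df.replace('', np.nan)                        # turn blanks into real NaN
collapsed_df = df.groupby("feature_id").first().reset_index()
print(collapsed_df)                                # one row per feature_id, blanks filled in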
I'm facing a strange issue in which I'm trying to replace all NaN values in a dataframe with values taken from another one (same length) that has the relevant values.
Here's a glimpse of the target dataframe, data_with_null, in which I want to replace the values.
Here's the dataframe I want to take the data from: predicted_paticipant_groups
I've tried:
data_with_null.participant_groups.fillna(predicted_paticipant_groups.participant_groups, inplace=True)
but it just fills all NaN values with the first one (Infra).
Is it because the indexes of data_with_null are all zeros?
Reset the index and try again.
data_with_null.reset_index(drop=True, inplace=True)
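A minimal sketch of why this works, using toy frames shaped like the ones in the question (the data itself is made up): with the duplicated all-zero index, fillna aligns every row to index 0 and so fills everything with "Infra"; after resetting the index, alignment is positional and each NaN gets its own value.

import numpy as np
import pandas as pd

# toy frames: the target has a duplicated all-zero index, the source has a clean RangeIndex
data_with_null = pd.DataFrame({"participant_groups": [np.nan, "Dev", np.nan]},
                              index=[0, 0, 0])
predicted_paticipant_groups = pd.DataFrame(
    {"participant_groups": ["Infra", "Dev", "QA"]})

data_with_null.reset_index(drop=True, inplace=True)
data_with_null["participant_groups"] = data_with_null["participant_groups"].fillna(
    predicted_paticipant_groups["participant_groups"])
print(data_with_null)   # Infra, Dev, QA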
I have a dataframe which I query and I want to get only unique values out of a certain column.
I tried to do that executing this code:
database = pd.read_csv(db_file, sep='\t')
query = database.loc[database[db_specification[0]].isin(elements)].drop_duplicates(subset=db_specification[1])
db_specification is just a list containing two columns that I query.
Some of the values are NaN, and I don't want them to be considered duplicates of each other. How can I achieve that?
You can start by separating out the rows whose key is NaN, and then drop duplicates on the rest of the dataframe:
# rows where the dedup column is NaN are kept untouched
mask = data[db_specification[1]].isna()
# deduplicate only the non-NaN rows, then put the two parts back together
data = pd.concat([data[mask], data[~mask].drop_duplicates(subset=db_specification[1])])
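For example, with a small made-up frame (the column name is assumed), both NaN rows survive while the real duplicate is removed; a plain drop_duplicates would also collapse the two NaN rows into one:

import numpy as np
import pandas as pd

# made-up example; 'col' stands in for db_specification[1]
data = pd.DataFrame({"col": ["x", "x", np.nan, np.nan], "other": [1, 2, 3, 4]})

mask = data["col"].isna()
result = pd.concat([data[mask], data[~mask].drop_duplicates(subset="col")])
print(result)   # both NaN rows kept, only one 'x' row remains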
The original file has multiple columns, but there are lots of blanks, and I want to rearrange it so that there is one tidy column with the info. Starting with 910 rows and 51 columns (the newFile df), I want to end up with 910+x rows and 3 columns (the final df); at the moment final has 910 rows.
newFile sample
for i in range(0, len(newFile)):
    for j in range(0, 48):
        if pd.notnull(newFile.iloc[i, 3+j]):
            final = final.append(newFile.iloc[[i], [0, 1, 3+j]], ignore_index=True)
This piece of code goes through newFile and, if column 3+j is not null, copies columns 0, 1, and 3+j into a new row. I tried append(), but it adds not only rows but also a bunch of columns full of NaNs again (like the original file).
Any suggestions?!
Your problem is that you are appending to a DataFrame while keeping the column names, so each appended row only fills some of the columns and the remaining columns are padded with NaN for that row.
On top of that, the double for loop makes the code quite inefficient.
Here is my solution using melt()
import numpy as np
import pandas as pd

# creating an example df
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 51)),
                  columns=list('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXY'))

# reshaping df to long format, keeping the first two columns as id variables
df = df.melt(id_vars=df.columns[0:2])

# dropping the rows where the value is null
df.dropna(subset=['value'], inplace=True)

# stop here if you want to keep the information about which column each value
# came from; otherwise drop it
df.drop(columns=['variable'], inplace=True)
print(df)
I am trying to make a subset of a dataframe
combo.iloc[:,orig_start_col:orig_start_col+2]
equal to the values another subset already has
combo.iloc[:,sm_col:sm_col+2]
where the columns vary in a loop. The problem is that all I am getting is NaNs, even though the values in the second subset are not NaN.
I tried doing this for just the first column and it worked; however, doing it for just the second column of the subset returns all NaNs, and doing it for the whole subset returns NaN values for everything.
My code is:
for node_col in ('leg2_node', 'leg4_node'):
    combo = orig_combos.merge(all, how='inner', left_on='leg6_node', right_on=node_col)
    combo.reset_index(drop=True, inplace=True)
    orig_start_col = combo.columns.get_loc('leg6_alpha_x')
    sm_col = combo.columns.get_loc(node_col + '_y')
    combo.iloc[:, orig_start_col+1:orig_start_col+2] = combo.iloc[:, sm_col+1:sm_col+2]
Since the sm_col:sm_col+2 subset has values in every row, I would expect those values to end up in the orig_start_col:orig_start_col+2 subset, but instead everything comes out as NaN.
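No accepted answer is quoted here, but a likely cause is that assigning one DataFrame slice to another aligns on column labels, and because the source and target column names differ, pandas fills the target with NaN. A sketch of the usual workaround, with made-up column names standing in for combo, is to assign the underlying array so the assignment is purely positional:

import pandas as pd

# toy stand-in for 'combo'; all column names here are made up
combo = pd.DataFrame({
    "leg6_alpha_x": [float("nan")] * 2, "leg6_node_x": [float("nan")] * 2,
    "leg2_node_y": [10.0, 20.0], "leg2_alpha_y": [30.0, 40.0],
})
orig_start_col = combo.columns.get_loc("leg6_alpha_x")   # 0
sm_col = combo.columns.get_loc("leg2_node_y")            # 2

# .to_numpy() strips the labels, so no column alignment (and no NaN) occurs
combo.iloc[:, orig_start_col:orig_start_col+2] = combo.iloc[:, sm_col:sm_col+2].to_numpy()
print(combo)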
When taking nlargest on a pandas DataFrame, is there a way to ignore columns with NaN values? Say I want to pick the 5 column headings with the 5 largest values; if a column has NaN values, it should be ignored. If the number of columns with finite values is smaller than 5, then pick all of the column headings that have finite values (fewer than 5).
nlargest takes the top n rows sorted in descending order by the columns passed to the method. If NaN values make it to the top, they will be included. If you want to ignore rows that have NaN values in the sort columns, do this:
# assume a variable 'columns' exist that defines what columns to sort
# by. You'll have to assign this yourself. Also assign 'n' yourself.
df = df.dropna(subset=columns)
df = df.nlargest(n, columns=columns)
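A quick made-up example of how those two lines behave together (the column names and data are assumptions, not from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({"name": list("abcde"),
                   "score": [5.0, np.nan, 8.0, 1.0, 7.0]})
columns, n = ["score"], 3

df = df.dropna(subset=columns)            # removes row 'b' (NaN score)
print(df.nlargest(n, columns=columns))    # rows c (8.0), e (7.0), a (5.0)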