I would like to drop duplicates from a dataframe in a balanced way. Currently df.drop_duplicates() has the parameter keep, where you can choose to keep the first or the last occurrence. Instead of that, I would like to keep the occurrences in a distributed way.
I.e. I have this dataframe, with two columns, text and category, which looks balanced but has duplicates:
Applying drop_duplicates() and plotting again looks like this:
df = df.drop_duplicates(subset='text')
df['Category'].value_counts().plot(kind='bar')
The expected result would be the dataframe without the duplicates, with all the columns. But instead of keeping the first or the last occurrence (which will produce an unbalanced dataframe, since most of the occurrences could fall in the first or last category), keep the dataframe as balanced as possible (depending on the number of duplicates there could be some odd categories, i.e. it does not need to be 100% balanced).
To maintain balance after dropping duplicates, you need to down-sample every category that has more rows than the smallest one (neg_hp). However, sadness might be a better minimum point. You can do that with:
df = df.drop_duplicates(subset='text')
max_sample = df['Category'].eq('sadness').sum()  # number of rows in the sadness category
Then sample max_sample rows from each of the other categories, e.g.:
df[df['Category'].eq('confident')].sample(max_sample)
That way all your categories will have the same value_counts
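Putting the answer together, a minimal sketch (using the text / Category column names from the question; the exact category labels do not matter) that downsamples every category to the size of the smallest one after deduplication:

df = df.drop_duplicates(subset='text')

# size of the smallest category after removing duplicate texts
min_count = df['Category'].value_counts().min()

# take min_count rows from every category so the result stays roughly balanced
balanced = (
    df.groupby('Category', group_keys=False)
      .apply(lambda g: g.sample(min_count, random_state=0))
)

balanced['Category'].value_counts().plot(kind='bar')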
Although there are several related questions answered for pandas, I cannot solve this issue. I have a large dataframe (~49,000 rows) and want to drop rows that meet two conditions at the same time (~120 rows):
For one column: an exact string
For another column: a NaN value
My code is ignoring the conditions and no row is removed.
to_remove = ['string1', 'string2']
df.drop(df[df['Column 1'].isin(to_remove) & (df['Column 2'].isna())].index, inplace=True)
What am I doing wrong? Thanks for any hint!
Instead of calling drop and passing the index, you can create a mask for the condition on which you want to keep the rows, then take only those rows. Also, the logic error may be there: you are combining two different conditions with AND, so a row is dropped only when both conditions hold for that same row.
df[~(df['Column 1'].isin(to_remove) & df['Column 2'].isna())]
Also, if you need to check both conditions in the same column, then you probably want to combine them with or, i.e. |.
If needed, you can reset_index at the end.
Also, as a side note, your to_remove list appears to contain the same string twice; I'm assuming that's a typo in the question.
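For completeness, a minimal sketch (column names taken from the question) of the keep-mask approach, with the index reset at the end:

to_remove = ['string1', 'string2']
mask = df['Column 1'].isin(to_remove) & df['Column 2'].isna()
df = df[~mask].reset_index(drop=True)  # keep only the rows that do not match both conditions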
I'm trying to write a small piece of code to drop duplicate rows based on a column's unique values. What I'm trying to accomplish is getting all the unique values from user_id and, for each of them, dropping duplicates with drop_duplicates while keeping the last occurrence. The column I want to deduplicate on is date_time.
code:
for i in recommender_train_df['user_id'].unique():
    recommender_train_df.loc[recommender_train_df['user_id'] == i].drop_duplicates(subset='date_time', keep="last", inplace=True)
The problem with this code is that it literally does nothing; I tried and tried, and nothing happens.
Quick note: I have 100k different (unique) user_id values, so I need a solution that works as fast as possible for this problem.
The problem is that df.loc here returns a copy of the original dataframe, so drop_duplicates with inplace=True modifies that copy and your modification doesn't affect the original dataframe. See "What rules does Pandas use to generate a view vs a copy?" on Stack Overflow for more detail.
If you want to drop duplicates within only part of the dataframe, you can get the indices of the duplicated rows and drop those from the original:
for i in recommender_train_df['user_id'].unique():
    mask = recommender_train_df.loc[recommender_train_df['user_id'] == i].duplicated(subset='date_time', keep="last")
    indices = mask[mask].index
    recommender_train_df.drop(indices, inplace=True)
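Given the 100k unique user_id values, a loop-free alternative sketch (same column names as in the question) should be much faster: deduplicating on both columns at once is equivalent to deduplicating date_time within each user_id.

# keep the last row for every (user_id, date_time) pair, no Python-level loop
recommender_train_df = recommender_train_df.drop_duplicates(
    subset=['user_id', 'date_time'], keep='last'
)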
I'm working with tables containing a number of image-derived features, with a row of features for every timepoint of an image series, and multiple images. An image identifier is contained in one column. To condense this (primarily to be used in a parallel coordinates plot), I want to reduce/aggregate the columns to a single value. The reduce operation has to be chosen depending on the feature in the column (for example mean, max, or something custom). DataFrame.agg seemed like the natural thing to do, but it gives me a table with multiple rows. Right now I'm doing something like this:
result_df = pd.DataFrame()
for col in df.columns:
    if col in ReduceThisColumnByMean:
        result_df[col] = [df[col].mean()]
    elif col in ReduceThisColumnByMax:
        result_df[col] = [df[col].max()]
This seems like a detour to me, and it might not scale well (not a big concern, as the number of reduce operations will most probably not grow beyond a few). Is there a more pandas-esque way to aggregate multiple columns with specific operations into a single row?
You can select the columns by list, take the mean and max, join the results together with concat, and finally convert the Series to a one-row DataFrame with Series.to_frame and transpose:
result_df = pd.concat([df[ReduceThisColumnByMean].mean(),
df[ReduceThisColumnByMax].max()]).to_frame().T
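A minimal runnable sketch of that approach (the column lists and the data are made up for illustration):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
ReduceThisColumnByMean = ['a', 'b']   # aggregate these columns by their mean
ReduceThisColumnByMax = ['c']         # aggregate this column by its max

result_df = pd.concat([df[ReduceThisColumnByMean].mean(),
                       df[ReduceThisColumnByMax].max()]).to_frame().T
# result_df is a single row with columns a, b and c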
pandas.DataFrame.groupby(['date','some_category']).agg([np.sum, np.size]) produces a count that is repeated for each sum column. Is it possible to output just a single count column when passing a list of aggregate functions?
a = df_all.groupby(['date','some_category']).sum()
b = df_all.groupby(['date','some_category']).size()
pd.concat([a,b], axis=1)
produces basically what I want but seems awkward.
df.pivot_table(index=['date', 'some_category'], aggfunc=['sum', 'size']) is what I was looking for. This produces a single size column (though I am not sure why it is labeled '0'), rather than a repeated (identical) size for each summed column. Thanks all, I learned some useful things along the way.
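A minimal sketch of that call (the value column and the data are made up for illustration); renaming the 0 label afterwards gives the size column a readable name:

import pandas as pd

df_all = pd.DataFrame({
    'date': ['2020-01-01', '2020-01-01', '2020-01-02'],
    'some_category': ['a', 'a', 'b'],
    'value': [1.0, 2.0, 3.0],
})

out = df_all.pivot_table(index=['date', 'some_category'], aggfunc=['sum', 'size'])
out = out.rename(columns={0: 'count'})   # the size column otherwise shows up as 0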
I am at my wit's end as I am writing this. This is probably an incredibly small issue, but I've not been able to get around it. Here's what is going on:
I have a dataframe df with 80 columns
Performing value_counts().count() over df iteratively, I am able to print the column names and the number of unique values in each column.
Here's the problem: what I also want to do is sum up the count() of unique values across all the columns. Essentially I need just one number. So basically, if column1 had 10 uniques, column2 had 5, and column3 had 3, I am expecting the sum() to be 18.
About #2, here's what works (simple for loop) -
def counting_unique_values_in_df(df):
    for evry_colm in df:
        print(evry_colm, "-", df[evry_colm].value_counts().count())
That works; it prints in this format: the column - unique values.
Now, alongside that, I'd like to print the sum of the unique values. Whatever I tried either prints the unique count of the last column (which is incidentally 2) or prints something random. I know it's something to do with the for loop, but I can't seem to figure out what.
I also know that in order to get what I want, which is essentially sum(df[evry_colm].value_counts().count()), I will need to convert df[evry_colm].value_counts().count() to a series, or even a dataframe, but I am stuck with that too!
Thanks in advance for your help.
You could use nunique, which returns a series across all your columns, which you can then sum:
df.nunique().sum()
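If you also want the per-column breakdown that the original loop printed, a minimal sketch:

per_col = df.nunique()            # Series: one unique count per column
print(per_col.to_string())        # column - unique values, like the loop above
print('total:', per_col.sum())    # the single number you are after, e.g. 18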
My first instinct was to do it Series by Series with a list comprehension:
sum([df[col].nunique() for col in list(df)])
but this is slower and less Pandorable!