How can I drop the exact duplicates of a row? Say I have a data frame that looks like this:
A B C
1 2 3
3 2 2
1 2 3
My data frame is a lot larger than this, but is there a way to have Python look at every row and, if the values in a row are exactly the same as in another row, drop that row? I want this to take the whole data frame into account; I don't want to specify the columns to get unique values for.
You can use the DataFrame.drop_duplicates() method:
In [23]: df
Out[23]:
A B C
0 1 2 3
1 3 2 2
2 1 2 3
In [24]: df.drop_duplicates()
Out[24]:
A B C
0 1 2 3
1 3 2 2
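As a minimal sketch of the options on drop_duplicates(), the snippet below rebuilds the three-row frame from the question and shows the keep and ignore_index parameters. The variable names (deduped, deduped_last, only_unique) are illustrative and not part of the original answer.

```python
import pandas as pd

# The example frame from the question.
df = pd.DataFrame({"A": [1, 3, 1], "B": [2, 2, 2], "C": [3, 2, 3]})

# Default: compare whole rows and keep the first copy of each duplicate set.
deduped = df.drop_duplicates()

# keep="last" keeps the last copy instead; ignore_index=True renumbers the rows from 0.
deduped_last = df.drop_duplicates(keep="last", ignore_index=True)

# keep=False drops every row that has an exact duplicate elsewhere in the frame.
only_unique = df.drop_duplicates(keep=False)

print(deduped)
print(deduped_last)
print(only_unique)
```

Because no subset is passed, all columns are compared, which is what the question asks for.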
You can get a de-duplicated dataframe with the inverse of .duplicated:
df[~df.duplicated(['A','B','C'])]
Returns:
>>> df[~df.duplicated(['A','B','C'])]
A B C
0 1 2 3
1 3 2 2
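Listing the columns is optional here: called without arguments, duplicated() compares entire rows. A short sketch on the same example frame (the names mask and all_copies are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 3, 1], "B": [2, 2, 2], "C": [3, 2, 3]})

# With no subset, duplicated() flags rows that repeat an earlier row in every column.
mask = df.duplicated()
deduped = df[~mask]

# keep=False flags every member of a duplicate set, which helps when inspecting
# the duplicates before dropping them.
all_copies = df[df.duplicated(keep=False)]

print(mask.tolist())
print(deduped)
print(all_copies)
```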
I am trying to do some basic dimensional reduction. I have a CSV file that looks something like this:
A B C A B B A C
1 1 2 2 1 3 1 1
1 2 3 0 0 1 1 2
0 2 1 3 0 1 2 2
I want to import it as a pandas DataFrame, but without the headers being renamed to A.1, A.2, etc. Instead I want to sum the duplicate columns and keep the column names. Ideally my new DataFrame should look like this:
A B C
4 5 3
2 3 5
5 3 3
Is it possible to do this easily or would you recommend a different way? I can also use bash, R, or anything that can do the trick with a file that is 1 million lines and 1000 columns.
Thank you!
Try splitting the column names on "." and grouping by the first part:
df.groupby(df.columns.str.split('.').str[0], axis=1).sum()
Output:
A B C
0 4 5 3
1 2 3 5
2 5 3 3
Just load the dataframe normally and group by the first letter of the column name, and sum the values:
df.groupby(lambda colname: colname[0], axis=1).sum()
which gives
A B C
0 4 5 3
1 2 3 5
2 5 3 3
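Note that groupby(..., axis=1) is deprecated as of pandas 2.1. A hedged sketch of one alternative is to transpose, group the rows, and transpose back. The inline CSV text below is a stand-in for the real file, and the transpose makes an extra copy, so this may need chunking for a file of a million lines.

```python
import io

import pandas as pd

# Stand-in for the real CSV file; read_csv renames repeated headers to A.1, B.1, ...
csv_text = (
    "A,B,C,A,B,B,A,C\n"
    "1,1,2,2,1,3,1,1\n"
    "1,2,3,0,0,1,1,2\n"
    "0,2,1,3,0,1,2,2\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# Strip the ".1", ".2" suffixes to recover the original header of each column.
keys = df.columns.str.split(".").str[0].to_numpy()

# Group the transposed frame by the original header, sum, and transpose back.
summed = df.T.groupby(keys).sum().T

print(summed)
```

Splitting on "." assumes the original headers contain no dots themselves.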
I would like to achieve the result below in Python using Pandas.
I tried groupby and sum on the id and Group columns using the below:
df.groupby(['id','Group'])['Total'].sum()
I got the first two columns, but I'm not sure how to get the third column (Overall_Total).
How can I do it?
Initial data (before grouping)
id Group Time
1 a 2
1 a 2
1 a 1
1 b 1
1 b 1
1 c 1
2 e 2
2 a 4
2 e 1
2 a 5
3 c 1
3 e 4
3 a 3
3 e 4
3 a 2
3 h 4
Assuming df is your initial dataframe with the columns id, Group and Time, try this:
df_group = df.groupby(['id','Group'])[['Time']].sum().rename(columns={'Time':'Total'})
df_group['All_total'] = df_group.groupby(['id'])['Total'].transform('sum')
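A self-contained sketch of the same idea on the data from the question is shown below. The column name Overall_Total follows the question's wording, and as_index=False is an illustrative choice that keeps id and Group as ordinary columns.

```python
import pandas as pd

# The initial data from the question.
df = pd.DataFrame({
    "id": [1] * 6 + [2] * 4 + [3] * 6,
    "Group": list("aaabbc") + list("eaea") + list("ceaeah"),
    "Time": [2, 2, 1, 1, 1, 1, 2, 4, 1, 5, 1, 4, 3, 4, 2, 4],
})

# Sum Time per (id, Group) pair.
df_group = (
    df.groupby(["id", "Group"], as_index=False)["Time"]
    .sum()
    .rename(columns={"Time": "Total"})
)

# transform('sum') broadcasts each id's total back onto every row of that id.
df_group["Overall_Total"] = df_group.groupby("id")["Total"].transform("sum")

print(df_group)
```

transform is used rather than a second aggregation because it returns a result aligned with the rows of df_group, so it can be assigned directly as a new column.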
I have df like this
A B
1 1
1 2
1 3
2 2
2 1
3 2
3 3
3 4
I would like to extract the rows of each group in A whose column B is not ascending, like
A B
2 2
2 1
I tried
df.groupby("A").filter()...
But I got stuck on the extraction.
If you have a solution, please let me know.
One way is to use pandas.Series.is_monotonic_increasing (the older is_monotonic alias was removed in pandas 2.0):
df[df.groupby('A')['B'].transform(lambda x: not x.is_monotonic_increasing)]
Output:
A B
3 2 2
4 2 1
Use GroupBy.transform with Series.diff, check with Series.lt and Series.any whether the group has at least one negative difference, and filter by boolean indexing:
df1 = df[df.groupby('A')['B'].transform(lambda x: x.diff().lt(0).any())]
print (df1)
A B
3 2 2
4 2 1
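Since the question started from groupby(...).filter(), here is a hedged sketch of that route as well. filter keeps or drops whole groups based on a boolean returned for each group. The name not_ascending is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 3, 3, 3],
    "B": [1, 2, 3, 2, 1, 2, 3, 4],
})

# Keep only the groups of A in which B is not monotonically increasing.
not_ascending = df.groupby("A").filter(
    lambda g: not g["B"].is_monotonic_increasing
)

print(not_ascending)
```

is_monotonic_increasing is non-strict, so a group with repeated equal values in B still counts as ascending here.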
Is there an elegant way to reassign group values to increasing ones?
I have a table which is already in order:
X = pandas.DataFrame([['a',2],['b',4],['ba',4],['c',8]],columns=['value','group'])
X
Out[18]:
value group
0 a 2
1 b 4
2 ba 4
3 c 8
But I would like to remap the group values so that they increase one by one. The end result would look like:
value group
0 a 1
1 b 2
2 ba 2
3 c 3
Use category codes or factorize:
X.group.astype('category').cat.codes+1 # pd.factorize(X.group)[0]+1
Out[105]:
0 1
1 2
2 2
3 3
dtype: int8
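The two approaches agree here only because the table is already sorted: category codes follow the sorted order of the values, while factorize numbers values in order of first appearance. A small sketch of the difference, plus a dense rank as a third option (the unsorted series is a made-up example):

```python
import pandas as pd

X = pd.DataFrame(
    [["a", 2], ["b", 4], ["ba", 4], ["c", 8]], columns=["value", "group"]
)

# On sorted data all three give the same 1-based labels.
by_codes = (X["group"].astype("category").cat.codes + 1).tolist()
by_factorize = (pd.factorize(X["group"])[0] + 1).tolist()
by_rank = X["group"].rank(method="dense").astype(int).tolist()

# On unsorted data (hypothetical example) the results differ.
unsorted = pd.Series([4, 2, 4, 8])
codes_unsorted = (unsorted.astype("category").cat.codes + 1).tolist()
factorize_unsorted = (pd.factorize(unsorted)[0] + 1).tolist()

print(by_codes, by_factorize, by_rank)
print(codes_unsorted, factorize_unsorted)
```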
I want to apply an operation on multiple groups of a data frame and then fill all values of each group with the result. Let's take mean and np.cumsum as an example, with the following dataframe:
df=pd.DataFrame({"a":[1,3,2,4],"b":[1,1,2,2]})
which looks like this
a b
0 1 1
1 3 1
2 2 2
3 4 2
Now I want to group the dataframe by b, then take the mean of a in each group, then apply np.cumsum to the means, and then replace all values of a by the (group dependent) result.
For the first three steps, I would start like this
df.groupby("b").mean().apply(np.cumsum)
which gives
a
b
1 2
2 5
But what I want to get is
a b
0 2 1
1 2 1
2 5 2
3 5 2
Any ideas how this can be solved in a nice way?
You can use map by Series:
df1 = df.groupby("b").mean().cumsum()
print (df1)
a
b
1 2
2 5
df['a'] = df['b'].map(df1['a'])
print (df)
a b
0 2 1
1 2 1
2 5 2
3 5 2
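Put together as one runnable sketch, the steps are: compute the per-group mean, accumulate it, then map the result back through column b. Selecting column a before aggregating yields a Series directly. The name cum_means is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 3, 2, 4], "b": [1, 1, 2, 2]})

# Mean of a per group of b, then a running sum over the groups.
cum_means = df.groupby("b")["a"].mean().cumsum()

# Look up each row's b value in cum_means to fill column a.
df["a"] = df["b"].map(cum_means)

print(df)
```

Because mean returns floats, column a ends up with a float dtype.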