Pandas: remove rows of dataframe with unique index value - python

I am trying to remove the rows of my dataframe (df) whose index value is unique. This is my df:
A B
1 3.803 4.797
1 3.276 3.878
2 5.181 6.342
3 6.948 9.186
3 8.762 10.136
4 10.672 12.257
4 8.266 13.252
5 13.032 14.656
6 15.021 17.681
6 16.426 15.07
I would like to remove the rows with index 2 and 5 to get a new dataframe (df_new) as follows:
A B
1 3.803 4.797
1 3.276 3.878
3 6.948 9.186
3 8.762 10.136
4 10.672 12.257
4 8.266 13.252
6 15.021 17.681
6 16.426 15.07
Is there some handy function in pandas to do that?
Thank you

Use get_duplicates:
In [36]:
df.loc[df.index.get_duplicates()]
Out[36]:
A B
1 3.803 4.797
1 3.276 3.878
3 6.948 9.186
3 8.762 10.136
4 10.672 12.257
4 8.266 13.252
6 15.021 17.681
6 16.426 15.070
get_duplicates returns a list of the duplicated index values:
In [37]:
df.index.get_duplicates()
Out[37]:
[1, 3, 4, 6]
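Note that Index.get_duplicates was deprecated and later removed from pandas; on current versions the same filtering can be done with Index.duplicated(keep=False), which marks every occurrence of a repeated label. A minimal sketch using the question's data:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [3.803, 3.276, 5.181, 6.948, 8.762, 10.672, 8.266, 13.032, 15.021, 16.426],
     "B": [4.797, 3.878, 6.342, 9.186, 10.136, 12.257, 13.252, 14.656, 17.681, 15.07]},
    index=[1, 1, 2, 3, 3, 4, 4, 5, 6, 6],
)

# keep=False flags every row whose index label appears more than once,
# so the boolean mask drops the rows with a unique index (2 and 5)
df_new = df[df.index.duplicated(keep=False)]
print(df_new.index.tolist())  # [1, 1, 3, 3, 4, 4, 6, 6]
```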

Related

How to explode aggregated pandas column

I have a df that looks like this:
df
time score
83623 4
83624 3
83625 3
83629 2
83633 1
I want to expand df.time so that it increments by 1 between consecutive rows, with the df.score value duplicated for each added row. See example below:
time score
83623 4
83624 3
83625 3
83626 3
83627 3
83628 3
83629 2
83630 2
83631 2
83632 2
83633 1
From your sample, I assume df.time is integer. You may try it this way:
df_final = df.set_index('time').reindex(range(df.time.min(), df.time.max()+1),
method='pad').reset_index()
Out[89]:
time score
0 83623 4
1 83624 3
2 83625 3
3 83626 3
4 83627 3
5 83628 3
6 83629 2
7 83630 2
8 83631 2
9 83632 2
10 83633 1
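The forward-fill behaviour of method='pad' (an alias for 'ffill') can be checked in isolation: each missing time takes the score of the last known time before it. A self-contained sketch of the same answer:

```python
import pandas as pd

df = pd.DataFrame({"time": [83623, 83624, 83625, 83629, 83633],
                   "score": [4, 3, 3, 2, 1]})

full_range = range(df.time.min(), df.time.max() + 1)  # 83623..83633 inclusive
df_final = (df.set_index("time")
              .reindex(full_range, method="pad")  # forward-fill the gaps
              .reset_index())
print(df_final.score.tolist())  # [4, 3, 3, 3, 3, 3, 2, 2, 2, 2, 1]
```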

How to subtract a row value from the nth rows' values of another dataframe

I have this two df's
df1:
lon lat
0 -60.7 -2.8333333333333335
1 -55.983333333333334 -2.4833333333333334
2 -51.06666666666667 -0.05
3 -66.96666666666667 -0.11666666666666667
4 -48.483333333333334 -1.3833333333333333
5 -54.71666666666667 -2.4333333333333336
6 -44.233333333333334 -2.6
7 -59.983333333333334 -3.15
df2:
lon lat
0 -24.109 -2.0035
1 -17.891 -1.70911
2 -14.5822 -1.7470700000000001
3 -12.8138 -1.72322
4 -14.0688 -1.5028700000000002
5 -13.8406 -1.44416
6 -12.1292 -0.671266
7 -13.8406 -0.8824270000000001
8 -15.12 -18.223
I want to subtract all values of df2['lat'] from each value of df1['lat'].
Something like this :
results0=df1.loc[0,'lat']-df2.loc[:,'lat']
results1=df1.loc[1,'lat']-df2.loc[:,'lat']
#etc etc....
So I tried this:
for i,j in zip(range(len(df1)), range(len(df2))):
exec(f"result{i}=df1.loc[{i},'lat']-df2.loc[{j},'lat']")
But it only gave me a single value for each result, instead of one value per row of df2 for each result.
I will appreciate any possible solution. Thanks!
You can create list of Series:
L = [df1.loc[i,'lat']-df2['lat'] for i in df1.index]
Or you can use numpy for new DataFrame:
arr = df1['lat'].to_numpy() - df2['lat'].to_numpy()[:, None]
df3 = pd.DataFrame(arr, index=df2.index, columns=df1.index)
print (df3)
0 1 2 3 4 5 \
0 -0.829833 -0.479833 1.953500 1.886833 0.620167 -0.429833
1 -1.124223 -0.774223 1.659110 1.592443 0.325777 -0.724223
2 -1.086263 -0.736263 1.697070 1.630403 0.363737 -0.686263
3 -1.110113 -0.760113 1.673220 1.606553 0.339887 -0.710113
4 -1.330463 -0.980463 1.452870 1.386203 0.119537 -0.930463
5 -1.389173 -1.039173 1.394160 1.327493 0.060827 -0.989173
6 -2.162067 -1.812067 0.621266 0.554599 -0.712067 -1.762067
7 -1.950906 -1.600906 0.832427 0.765760 -0.500906 -1.550906
8 15.389667 15.739667 18.173000 18.106333 16.839667 15.789667
6 7
0 -0.596500 -1.146500
1 -0.890890 -1.440890
2 -0.852930 -1.402930
3 -0.876780 -1.426780
4 -1.097130 -1.647130
5 -1.155840 -1.705840
6 -1.928734 -2.478734
7 -1.717573 -2.267573
8 15.623000 15.073000
Alternatively, since df1 has one row fewer than df2, you can align the two positionally:
df1['lat'] = df1['lat'] - df2.loc[:df1.shape[0]-1, 'lat']
output:
0 -0.829833
1 -0.774223
2 1.697070
3 1.606553
4 0.119537
5 -0.989173
6 -1.928734
7 -2.267573
Name: lat, dtype: float64
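The broadcasting step in the numpy answer can be verified on a tiny stand-in example: subtracting a column vector of shape (m, 1) from a 1-D array of length n yields an (m, n) array of all pairwise differences.

```python
import numpy as np
import pandas as pd

a = np.array([10.0, 20.0])     # plays the role of df1['lat'] (n=2)
b = np.array([1.0, 2.0, 3.0])  # plays the role of df2['lat'] (m=3)

# b[:, None] has shape (3, 1); broadcasting against shape (2,) gives (3, 2),
# where arr[i, j] = a[j] - b[i]
arr = a - b[:, None]
df3 = pd.DataFrame(arr)  # rows follow df2, columns follow df1
print(arr.tolist())  # [[9.0, 19.0], [8.0, 18.0], [7.0, 17.0]]
```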

How to find the maximum value of a column with pandas?

I have a table with 40 columns and 1500 rows. I want to find the maximum value among the 30th-32nd columns (3 columns). How can it be done? I want to return the maximum value among these 3 columns and the index of the dataframe.
print(Max_kVA_df.iloc[:, 30:33].max().max())
Note the leading : in iloc -- iloc[30:33] on its own would slice rows 30-32, not columns.
Hi, you can refer to this example:
import pandas as pd
df=pd.DataFrame({'col1':[1,2,3,4,5],
'col2':[4,5,6,7,8],
'col3':[2,3,4,5,7]
})
print(df)
# Select the range of columns you want; in your case change 0:3 to 30:33 (33 is excluded)
ser=df.iloc[:,0:3].max()
print(ser.max())
Output
8
Select values by position and use np.max.
Sample, for the maximum over the first 5 rows:
np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(10, 3)), columns=list('ABC'))
print (df)
A B C
0 2 2 6
1 1 3 9
2 6 1 0
3 1 9 0
4 0 9 3
print (df.iloc[0:5])
A B C
0 2 2 6
1 1 3 9
2 6 1 0
3 1 9 0
4 0 9 3
print (np.max(df.iloc[0:5].max()))
9
Or use iloc to select the columns by position:
print(df.iloc[:, [30, 31, 32]].max().max())
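Putting the pieces together for the original question: .max().max() gives the overall value, and idxmax on the row-wise maxima recovers the row index. A sketch on a small stand-in frame (the column names here are made up; substitute iloc[:, 30:33] for the real data):

```python
import pandas as pd

df = pd.DataFrame({"c0": [1, 2, 3],
                   "c1": [4, 5, 6],
                   "c2": [7, 9, 8],
                   "c3": [0, 1, 2],
                   "c4": [3, 4, 5]})

sub = df.iloc[:, 2:5]            # stand-in for df.iloc[:, 30:33]
value = sub.max().max()          # overall maximum across the three columns
row = sub.max(axis=1).idxmax()   # index label of the row holding that maximum
print(value, row)  # 9 1
```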

Pandas indexing behavior after grouping: do I see an "extra row"?

This might be a very simple question, but I am trying to understand how grouping and indexing work in pandas.
Let's say I have a DataFrame with the following data:
df = pd.DataFrame(data={
'p_id': [1, 1, 1, 2, 3, 3, 3, 4, 4],
'rating': [5, 3, 2, 2, 5, 1, 3, 4, 5]
})
Now, the index would be assigned automatically, so the DataFrame looks like:
p_id rating
0 1 5
1 1 3
2 1 2
3 2 2
4 3 5
5 3 1
6 3 3
7 4 4
8 4 5
When I try to group it by p_id, I get:
>> df[['p_id', 'rating']].groupby('p_id').count()
rating
p_id
1 3
2 1
3 3
4 2
I noticed that p_id now becomes an index for the grouped DataFrame, but the first row looks weird to me -- why does it have p_id in it with an empty rating?
I know how to fix it, kind of, if I do this:
>> df[['p_id', 'rating']].groupby('p_id', as_index=False).count()
p_id rating
0 1 3
1 2 1
2 3 3
3 4 2
Now I don't have this weird first row, but I have both an index and p_id.
So my question is, where is this extra row coming from when I don't use as_index=False, and is there a way to group the DataFrame and keep p_id as the index while not having to deal with this extra row? If there are any docs I can read on this, that would also be greatly appreciated.
It's just an index name...
Demo:
In [46]: df
Out[46]:
p_id rating
0 1 5
1 1 3
2 1 2
3 2 2
4 3 5
5 3 1
6 3 3
7 4 4
8 4 5
In [47]: df.index.name = 'AAA'
Pay attention to the index name: AAA
In [48]: df
Out[48]:
p_id rating
AAA
0 1 5
1 1 3
2 1 2
3 2 2
4 3 5
5 3 1
6 3 3
7 4 4
8 4 5
You can get rid of it using rename_axis() method:
In [42]: df[['p_id', 'rating']].groupby('p_id').count().rename_axis(None)
Out[42]:
rating
1 3
2 1
3 3
4 2
There is no "extra row"; it's simply how pandas renders a DataFrame whose index has a name: rating is the column, and p_id has gone from being a column to being the (row) index.
Another reason they are staggered (the row with the column names above the row with the index name) is that the index can be a MultiIndex with multiple names, if you grouped by multiple columns.
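The MultiIndex case mentioned above can be seen directly: grouping by two columns turns both into index levels, whose names appear on the staggered line. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"p_id": [1, 1, 2, 2],
                   "day": ["a", "b", "a", "b"],
                   "rating": [5, 3, 2, 4]})

# both grouping columns become index levels of the result
g = df.groupby(["p_id", "day"]).count()
print(g.index.names)  # ['p_id', 'day']
print(g)
```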

Python - Get group names from aggregated results in pandas

I have a dataframe like this:
minute values
0 1 3
1 2 4
2 1 1
3 4 6
4 3 7
5 2 2
When I apply
df.groupby('minute').sum().sort('values', ascending=False)
This gives:
values
minute
3 7
2 6
4 6
1 4
I want to get the first two values of the minute column in an array, like [3, 2]. How can I access the values in the minute column?
If what you want is the values from the minute column in the grouped dataframe (which is the index as well), you can use DataFrame.index to access them. Example -
grouped = df.groupby('minute').sum().sort('values', ascending=False)
grouped.index[:2]
If you really want it as a list, you can use .tolist() to convert it to a list. Example -
grouped.index[:2].tolist()
Demo -
In [3]: df
Out[3]:
minute values
0 1 3
1 2 4
2 1 1
3 4 6
4 3 7
5 2 2
In [4]: grouped = df.groupby('minute').sum().sort('values', ascending=False)
In [5]: grouped.index[:2]
Out[5]: Int64Index([3, 2], dtype='int64', name='minute')
In [6]: grouped.index[:2].tolist()
Out[6]: [3, 2]
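Note that DataFrame.sort, used above, was later removed from pandas; on current versions the same result comes from sort_values, or in one step from Series.nlargest. A minimal sketch with made-up data (sums per minute chosen to have no ties):

```python
import pandas as pd

df = pd.DataFrame({"minute": [1, 2, 1, 4, 3, 2],
                   "values": [3, 4, 1, 5, 8, 2]})

# sort_values replaces the removed DataFrame.sort
grouped = df.groupby("minute").sum().sort_values("values", ascending=False)
top2 = grouped.index[:2].tolist()
print(top2)  # [3, 2]

# nlargest gets the top rows of the summed Series in one step
top2_alt = df.groupby("minute")["values"].sum().nlargest(2).index.tolist()
```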
