Issue with removing duplicates in pandas dataframe - python

Edit: This has been solved thanks to fsl. The duplicates were removed correctly; the issue was the index, which needed to be reset.
I have this dataframe:
Ubicacion lat lon
0 a 19.28034 -99.17121
1 b 19.28333 -99.17535
2 c 19.28028 -99.16887
3 a 19.28034 -99.17121
4 b 19.28333 -99.17535
5 c 19.28028 -99.16887
6 b 19.28333 -99.17535
7 d 19.29259 -99.17757
8 d 19.29259 -99.17757
9 d 19.29259 -99.17757
And I want to remove all duplicate rows, so I use:
ubicaciones_finales = ubicaciones_finales.drop_duplicates(keep="first")
And I get this:
Ubicacion lat lon
0 a 19.28034 -99.17121
1 b 19.28333 -99.17535
2 c 19.28028 -99.16887
7 d 19.29259 -99.17757
Everything seems fine, except that the rows go 0, 1, 2 and then 7. So when I run:
for k, row in ubicaciones_finales.iterrows():
    print(k)
I get:
0
1
2
7
How do I solve this? By the way, I already checked the pandas documentation:
df.drop_duplicates()
brand style rating
0 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0
And it's the same: it goes from 0 to 2 without 1. Thank you for your time.

IIUC, go with reset_index or simply pass ignore_index=True:
df = df.drop_duplicates(keep='first').reset_index(drop=True)
# or
df = df.drop_duplicates(keep='first', ignore_index=True)
Output:
Ubicacion lat lon
0 a 19.28034 -99.17121
1 b 19.28333 -99.17535
2 c 19.28028 -99.16887
3 d 19.29259 -99.17757
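For reference, here is a minimal, self-contained sketch of the fix. The data mirrors the question; note that ignore_index requires pandas >= 1.0:
import pandas as pd

# Rebuild the question's data (values copied from the example above)
ubicaciones_finales = pd.DataFrame({
    "Ubicacion": list("abcabcbddd"),
    "lat": [19.28034, 19.28333, 19.28028, 19.28034, 19.28333,
            19.28028, 19.28333, 19.29259, 19.29259, 19.29259],
    "lon": [-99.17121, -99.17535, -99.16887, -99.17121, -99.17535,
            -99.16887, -99.17535, -99.17757, -99.17757, -99.17757],
})

# ignore_index=True renumbers the surviving rows 0..n-1 in one step
deduped = ubicaciones_finales.drop_duplicates(keep="first", ignore_index=True)

for k, _row in deduped.iterrows():
    print(k)  # now prints 0, 1, 2, 3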

Related

How to subtract a row value from the nth rows' values of another dataframe

I have these two df's:
df1:
lon lat
0 -60.7 -2.8333333333333335
1 -55.983333333333334 -2.4833333333333334
2 -51.06666666666667 -0.05
3 -66.96666666666667 -0.11666666666666667
4 -48.483333333333334 -1.3833333333333333
5 -54.71666666666667 -2.4333333333333336
6 -44.233333333333334 -2.6
7 -59.983333333333334 -3.15
df2:
lon lat
0 -24.109 -2.0035
1 -17.891 -1.70911
2 -14.5822 -1.7470700000000001
3 -12.8138 -1.72322
4 -14.0688 -1.5028700000000002
5 -13.8406 -1.44416
6 -12.1292 -0.671266
7 -13.8406 -0.8824270000000001
8 -15.12 -18.223
I want to subtract all values of df2['lat'] from each value of df1['lat'].
Something like this :
results0=df1.loc[0,'lat']-df2.loc[:,'lat']
results1=df1.loc[1,'lat']-df2.loc[:,'lat']
#etc etc....
So I tried this:
for i,j in zip(range(len(df1)), range(len(df2))):
    exec(f"result{i}=df1.loc[{i},'lat']-df2.loc[{j},'lat']")
But it only gave me a single value for each result, instead of a full column of values for each result.
I will appreciate any possible solution. Thanks!
You can create a list of Series:
L = [df1.loc[i,'lat']-df2['lat'] for i in df1.index]
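If you want names like the asker's result0, result1, ..., a dict comprehension is a safer alternative to exec. A sketch, reusing df1 and df2 from the question:
results = {f"result{i}": df1.loc[i, 'lat'] - df2['lat'] for i in df1.index}
print(results['result0'])   # df1.loc[0,'lat'] minus every value in df2['lat']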
Or you can use numpy broadcasting for a new DataFrame:
arr = df1['lat'].to_numpy() - df2['lat'].to_numpy()[:, None]
df3 = pd.DataFrame(arr, index=df2.index, columns=df1.index)
print (df3)
0 1 2 3 4 5 \
0 -0.829833 -0.479833 1.953500 1.886833 0.620167 -0.429833
1 -1.124223 -0.774223 1.659110 1.592443 0.325777 -0.724223
2 -1.086263 -0.736263 1.697070 1.630403 0.363737 -0.686263
3 -1.110113 -0.760113 1.673220 1.606553 0.339887 -0.710113
4 -1.330463 -0.980463 1.452870 1.386203 0.119537 -0.930463
5 -1.389173 -1.039173 1.394160 1.327493 0.060827 -0.989173
6 -2.162067 -1.812067 0.621266 0.554599 -0.712067 -1.762067
7 -1.950906 -1.600906 0.832427 0.765760 -0.500906 -1.550906
8 15.389667 15.739667 18.173000 18.106333 16.839667 15.789667
6 7
0 -0.596500 -1.146500
1 -0.890890 -1.440890
2 -0.852930 -1.402930
3 -0.876780 -1.426780
4 -1.097130 -1.647130
5 -1.155840 -1.705840
6 -1.928734 -2.478734
7 -1.717573 -2.267573
8 15.623000 15.073000
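To see how the broadcasting lines up, here is a minimal sketch with tiny illustrative data; column i of df3 holds df1.loc[i,'lat'] minus the whole df2['lat'] column:
import pandas as pd

df1 = pd.DataFrame({'lat': [1.0, 2.0]})
df2 = pd.DataFrame({'lat': [10.0, 20.0, 30.0]})

# shape (len(df1),) minus shape (len(df2), 1) broadcasts to (len(df2), len(df1))
arr = df1['lat'].to_numpy() - df2['lat'].to_numpy()[:, None]
df3 = pd.DataFrame(arr, index=df2.index, columns=df1.index)

print(df3[0].tolist())   # [-9.0, -19.0, -29.0], i.e. df1.loc[0,'lat'] - df2['lat']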
Since df1 has one fewer row than df2, you can also subtract with index alignment:
df1['lat'] = df1['lat'] - df2.loc[:df1.shape[0]-1, 'lat']
output:
0 -0.829833
1 -0.774223
2 1.697070
3 1.606553
4 0.119537
5 -0.989173
6 -1.928734
7 -2.267573
Name: lat, dtype: float64

create pandas pivottable with a long multiindex

I have a dataframe df with the shape (4573,64) that I'm trying to pivot. The last column is an 'id' with two possible string values 'old' and 'new'. I would like to set the first 63 columns as index and then have the 'id' column across the top with values being the count of 'old' or 'new' for each index row.
I've created a list object named cols out of the column labels that I want as the index.
I tried the following:
df.pivot(index=cols, columns='id')['id']
This gives the error: 'all arrays must be same length'.
I also tried the following to see if I can get a sum, but no luck either:
pd.pivot_table(df,index=cols,values=['id'],aggfunc=np.sum)
Any ideas greatly appreciated.
I found a thread online about a possible bug in pandas 0.23.0 where pandas.pivot_table() will not accept a multiindex as long as it contains NaN's (link to GitHub in comments). My workaround was to do
df.fillna('empty', inplace=True)
then the solution below:
df1 = pd.pivot_table(df, index=cols, columns='id', aggfunc='size', fill_value=0)
as proposed by jezrael works as intended, hence the accepted answer.
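For completeness, a minimal sketch of that workaround; the data and column names here are illustrative, not the asker's 64 columns:
import numpy as np
import pandas as pd

df = pd.DataFrame({'B': ['x', 'x', np.nan],
                   'C': [1, 1, 9],
                   'id': ['old', 'new', 'old']})
cols = df.columns[:-1].tolist()     # every column except 'id'

# pivot_table silently drops rows whose index keys contain NaN, hence the fill
df.fillna('empty', inplace=True)
df1 = pd.pivot_table(df, index=cols, columns='id', aggfunc='size', fill_value=0)
print(df1)
# id       new  old
# B     C
# empty 9    0    1
# x     1    1    1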
I believe you need to convert the column names to a list and then aggregate size with unstack:
df = pd.DataFrame({'B':[4,4,4,5,5,4],
                   'C':[1,1,9,4,2,3],
                   'D':[1,1,5,7,1,0],
                   'E':[0,0,6,9,2,4],
                   'id':list('aaabbb')})
print (df)
B C D E id
0 4 1 1 0 a
1 4 1 1 0 a
2 4 9 5 6 a
3 5 4 7 9 b
4 5 2 1 2 b
5 4 3 0 4 b
cols = df.columns[:-1].tolist()   # every column except 'id', which goes across the top
df1 = df.groupby(cols + ['id']).size().unstack(fill_value=0)
print (df1)
id a b
B C D E
4 1 1 0 2 0
3 0 4 0 1
9 5 6 1 0
5 2 1 2 0 1
4 7 9 0 1
Solution with pivot_table:
df1 = pd.pivot_table(df, index=cols,columns='id',aggfunc='size', fill_value=0)
print (df1)
id a b
B C D E
4 1 1 0 2 0
3 0 4 0 1
9 5 6 1 0
5 2 1 2 0 1
4 7 9 0 1
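pd.crosstab is yet another way to get the same counts, if you prefer passing the key columns explicitly. A sketch using the same df and cols as above:
df2 = pd.crosstab(index=[df[c] for c in cols], columns=df['id'])
print(df2)   # same table of 'a'/'b' counts per (B, C, D, E) combination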

Modifying DataFrames in loop

Given this data frame:
import pandas as pd
df=pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]})
df
A B C
0 1 4 7
1 2 5 8
2 3 6 9
I'd like to create 3 new data frames; one from each column.
I can do this one at a time like this:
a=pd.DataFrame(df[['A']])
a
A
0 1
1 2
2 3
But instead of doing this for each column, I'd like to do it in a loop.
Here's what I've tried:
a=b=c=df.copy()
dfs=[a,b,c]
fields=['A','B','C']
for d,f in zip(dfs,fields):
    d=pd.DataFrame(d[[f]])
...but when I then print each one, I get the whole original data frame as opposed to just the column of interest.
a
A B C
0 1 4 7
1 2 5 8
2 3 6 9
Update:
My actual data frame will have some columns that I do not need and the columns will not be in any sort of order, so I need to be able to get the columns by name.
Thanks in advance!
A simple list comprehension should be enough.
In [68]: df_list = [df[[x]] for x in df.columns]
Printing out the list, this is what you get:
In [69]: for d in df_list:
...: print(d)
...: print('-' * 5)
...:
A
0 1
1 2
2 3
-----
B
0 4
1 5
2 6
-----
C
0 7
1 8
2 9
-----
Each element in df_list is its own data frame, corresponding to one column of the original. Furthermore, you don't even need fields, use df.columns instead.
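Since the update says the columns must be picked by name and may be a subset, a dict comprehension keyed by column name may fit even better. A sketch, with an illustrative fields list:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
fields = ['A', 'C']                     # only the columns you actually need
frames = {f: df[[f]] for f in fields}   # one single-column DataFrame per name
print(frames['C'])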
Or you can try this instead, creating variables via locals(). This method returns each result as a separate DataFrame rather than a list. However, I think saving the DataFrames in a list is better:
dfs=['a','b','c']
fields=['A','B','C']
variables = locals()
for d,f in zip(dfs,fields):
    variables["{0}".format(d)] = df[[f]]
a
Out[743]:
A
0 1
1 2
2 3
b
Out[744]:
B
0 4
1 5
2 6
c
Out[745]:
C
0 7
1 8
2 9
You should use iloc for positional access, since the columns here are labeled 'A', 'B', 'C' rather than 0, 1, 2:
a = df.iloc[:, [0]]
and then loop through like:
dfs = [df.iloc[:, [i]] for i in range(df.columns.size)]

Delete a row if it doesn't contain a specified integer value (Pandas)

I have a Pandas dataset that I want to clean up prior to applying my ML algorithm. I am wondering if it is possible to remove a row when an element of one of its columns does not match a set of values. For example, if I have the dataframe:
a b
0 1 6
1 4 7
2 2 4
3 3 7
...
And I desire the values of a to be one of [1,3] and of b to be one of [6,7], such that my final dataset is:
a b
0 1 6
1 3 7
...
Currently, my implementation is not working, as some of my data rows have erroneous strings attached to the value. For example, instead of a value of 1 I'll have something like 1abc. That is why I would like to remove anything that is not an integer of that value.
My workaround is also a bit archaic, as I am removing entries for column a that do not have 1 or 3 via:
dataset = dataset[(dataset.commute != 1)]
dataset = dataset[(dataset.commute != 3)]
You can use boolean indexing with double isin and &:
df1 = df[(df['a'].isin([1,3])) & (df['b'].isin([6,7]))]
print (df1)
a b
0 1 6
3 3 7
Or use numpy.in1d:
df1 = df[(np.in1d(df['a'], [1,3])) & (np.in1d(df['b'], [6,7])) ]
print (df1)
a b
0 1 6
3 3 7
But if you need to remove all rows with non-numeric values, use to_numeric with errors='coerce', which returns NaN for bad values, and then filter with notnull:
df = pd.DataFrame({'a':['1abc','2','3'],
                   'b':['4','5','dsws7']})
print (df)
a b
0 1abc 4
1 2 5
2 3 dsws7
mask = (pd.to_numeric(df['a'], errors='coerce').notnull() &
        pd.to_numeric(df['b'], errors='coerce').notnull())
df1 = df[mask].astype(int)
print (df1)
a b
1 2 5
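Both ideas can also be combined for the question's exact use case: coerce the messy strings to numbers first, then keep only the wanted values. A sketch with illustrative data:
import pandas as pd

df = pd.DataFrame({'a': ['1abc', '1', '3', '2'],
                   'b': ['6', '7', '7', '6']})
num = df.apply(pd.to_numeric, errors='coerce')   # '1abc' becomes NaN
df1 = df[num['a'].isin([1, 3]) & num['b'].isin([6, 7])]
print(df1)
#    a  b
# 1  1  7
# 2  3  7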
If you need to check whether some value is NaN or None:
df = pd.DataFrame({'a':['1abc',None,'3'],
                   'b':['4','5',np.nan]})
print (df)
a b
0 1abc 4
1 None 5
2 3 NaN
print (df[df.isnull().any(axis=1)])
a b
1 None 5
2 3 NaN
You can use pandas isin()
df = df[df.a.isin([1,3]) & df.b.isin([6,7])]
a b
0 1 6
3 3 7

Outer join in python Pandas

I have two data sets as following
A B
IDs IDs
1 1
2 2
3 5
4 7
How, in Pandas or NumPy, can we apply a join that gives me all the data from B which is not present in A?
Something like the following:
B
Ids
5
7
I know it can be done with a for loop, but I don't want that, since my real data is in the millions, and I am really not sure how to use Pandas/NumPy here. Something like the following:
pd.merge(A, B, on='ids', how='right')
Thanks
You can use NumPy's setdiff1d, like so -
np.setdiff1d(B['IDs'],A['IDs'])
Also, np.in1d could be used for the same effect, like so -
B[~np.in1d(B['IDs'],A['IDs'])]
Please note that np.setdiff1d would give us a sorted NumPy array as output.
Sample run -
>>> A = pd.DataFrame([1,2,3,4],columns=['IDs'])
>>> B = pd.DataFrame([1,7,5,2],columns=['IDs'])
>>> np.setdiff1d(B['IDs'],A['IDs'])
array([5, 7])
>>> B[~np.in1d(B['IDs'],A['IDs'])]
IDs
1 7
2 5
You can use merge with the parameter indicator and then boolean indexing. Lastly, you can drop the _merge column:
A = pd.DataFrame({'IDs':[1,2,3,4],
                  'B':[4,5,6,7],
                  'C':[1,8,9,4]})
print (A)
B C IDs
0 4 1 1
1 5 8 2
2 6 9 3
3 7 4 4
B = pd.DataFrame({'IDs':[1,2,5,7],
                  'A':[1,8,3,7],
                  'D':[1,8,9,4]})
print (B)
A D IDs
0 1 1 1
1 8 8 2
2 3 9 5
3 7 4 7
df = (pd.merge(A, B, on='IDs', how='outer', indicator=True))
df = df[df._merge == 'right_only']
df = df.drop('_merge', axis=1)
print (df)
B C IDs A D
4 NaN NaN 5.0 3.0 9.0
5 NaN NaN 7.0 7.0 4.0
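The same anti-join can also be spelled more compactly with query, if you prefer a single chain. A sketch with minimal data:
import pandas as pd

A = pd.DataFrame({'IDs': [1, 2, 3, 4]})
B = pd.DataFrame({'IDs': [1, 2, 5, 7]})
df = (A.merge(B, on='IDs', how='outer', indicator=True)
       .query('_merge == "right_only"')
       .drop(columns='_merge'))
print(df)
#    IDs
# 4    5
# 5    7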
You could convert the data series to sets and take the difference:
import pandas as pd
df=pd.DataFrame({'A' : [1,2,3,4], 'B' : [1,2,5,7]})
A=set(df['A'])
B=set(df['B'])
C=pd.DataFrame({'C' : list(B-A)}) # Take difference and convert back to DataFrame
The variable "C" then yields
C
0 5
1 7
You can simply use pandas' .isin() method:
df = pd.DataFrame({'A' : [1,2,3,4], 'B' : [1,2,5,7]})
df[~df['B'].isin(df['A'])]
If these are separate DataFrames:
a = pd.DataFrame({'IDs' : [1,2,3,4]})
b = pd.DataFrame({'IDs' : [1,2,5,7]})
b[~b['IDs'].isin(a['IDs'])]
Output:
IDs
2 5
3 7
