Can I create a dataframe which has a unique index or unique columns, similar to creating a unique key in MySQL, so that it will return an error if I try to add a duplicate index?
Or is my only option to create an if-statement and check for the value in the dataframe before appending it?
EDIT:
It seems my question was a bit unclear. By unique columns I mean that a column cannot contain duplicate values.
With
df.append(new_row, verify_integrity=True)
we can check for all columns, but how can we check for only one or two columns?
You can use df.append(..., verify_integrity=True) to maintain a unique row index:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12).reshape(3,4), columns=list('ABCD'))
dup_row = pd.DataFrame([[10,20,30,40]], columns=list('ABCD'), index=[1])
new_row = pd.DataFrame([[10,20,30,40]], columns=list('ABCD'), index=[9])
This successfully appends a new row (with index 9):
df.append(new_row, verify_integrity=True)
# A B C D
# 0 0 1 2 3
# 1 4 5 6 7
# 2 8 9 10 11
# 9 10 20 30 40
This raises ValueError because 1 is already in the index:
df.append(dup_row, verify_integrity=True)
# ValueError: Indexes have overlapping values: [1]
While the above works to ensure a unique row index, I'm not aware of a similar method for ensuring a unique column index. In theory you could transpose the DataFrame, append with verify_integrity=True and then transpose again, but generally I would not recommend this, since transposing can alter dtypes: when the column dtypes are not all the same, the transposed DataFrame ends up with columns of object dtype, and conversion to and from object arrays can be bad for performance.
If you need both unique row- and column- Indexes, then perhaps a better alternative is to stack your DataFrame so that all the unique column index levels become row index levels. Then you can use append with verify_integrity=True on the reshaped DataFrame.
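A minimal sketch of the stacking idea, reusing df from above (the value 99 and the (1, 'B') label are just illustrative). Note that .append was removed in pandas 2.0; on newer versions, pd.concat([...], verify_integrity=True) performs the same check.
# stack so the column labels become an inner level of the row index; the result is a Series
stacked = df.stack()
# a (row, column) pair that already exists is now rejected
dup_cell = pd.Series([99], index=pd.MultiIndex.from_tuples([(1, 'B')]))
stacked.append(dup_cell, verify_integrity=True)
# ValueError: Indexes have overlapping values: ...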
OP's follow-up question:
With df.append(new_row, verify_integrity=True), we can check for all columns, but how can we check for only one or two columns?
To check uniqueness of just one column, say the column name is value, one can try
df['value'].duplicated().any()
This checks whether any value in this column is duplicated. If it is, the column is not unique.
Given two columns, say C1 and C2, to check whether there are duplicated rows, we can still use DataFrame.duplicated.
df[["C1", "C2"]].duplicated()
It checks row-wise uniqueness. You can again use any to check whether any of the returned values is True.
Given two columns, say C1 and C2, to check whether each column contains duplicated values, we can use apply.
df[["C1", "C2"]].apply(lambda x: x.duplicated().any())
This will apply the function to each column.
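A small frame tying the three checks together (the column names C1 and C2 are just placeholders):
import pandas as pd
df = pd.DataFrame({"C1": [1, 1, 2], "C2": [3, 4, 4]})
df["C1"].duplicated().any()          # True:  C1 has a repeated value
df[["C1", "C2"]].duplicated().any()  # False: every (C1, C2) pair is unique
df[["C1", "C2"]].apply(lambda x: x.duplicated().any())
# C1    True
# C2    True
# dtype: bool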
NOTE
pd.DataFrame([[np.nan, np.nan],
              [np.nan, np.nan]]).duplicated()
0 False
1 True
dtype: bool
np.nan is also captured by duplicated. If you want to ignore np.nan, you can try selecting the non-NaN part first.
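For example, a minimal sketch of dropping NaNs before the check:
s = pd.Series([np.nan, np.nan, 1, 2])
s.duplicated().any()           # True:  the second NaN is flagged as a duplicate
s.dropna().duplicated().any()  # False: the NaNs are ignored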
I am trying to clean my pandas data frame of duplicate rows. I know how to remove rows where the column values are the same, but I'm not sure how to do that when column A on row 1 equals column B on row 2, and column B on row 1 equals column A on row 2. I hope that is not too confusing. I've added an example of a table below; I would consider rows 2 and 3 to be duplicates. How would I remove them using pandas?
Edit:
Duplicate rows are not necessarily right above or below each other. I need to keep only one of those rows (doesn't matter which one specifically).
Use np.sort so that each row has its values in the same order, then flag the duplicates:
import pandas as pd
import numpy as np
# toy data
df = pd.DataFrame(data=[[10, 15], [15, 10]], columns=["A", "B"])
# find duplicate rows (order within a row no longer matters after sorting)
duplicated = pd.DataFrame(np.sort(df[["A", "B"]], axis=1), index=df.index).duplicated()
# keep only the first occurrence of each pair
res = df[~duplicated]
print(res)
Output
    A   B
0  10  15
Alternatively, use frozenset to convert each row into a hashable set where order does not matter.
# find duplicate rows
duplicated = df[["A", "B"]].apply(frozenset, axis=1).duplicated()
# keep only the first occurrence of each pair
res = df[~duplicated]
print(res)
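This prints the same single row as the np.sort approach:
    A   B
0  10  15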
I'm trying to identify which rows have a value of NaN in a specific column (index 2), and either delete the rows that have NaN or move the ones that don't have NaN into their own dataframe. Any recommendations on how to go about it either way?
I've tried to create a vector with all of the rows and the specified column, but the data type object is giving me trouble. Also, I tried creating a list and adding all of the rows that != 'nan' in that specific column to the list.
patientsDD = patients.iloc[:,2].values
ddates = []
for value in patients[:,2]:
if value != 'nan':
ddates.append(value)
I'm expecting that it returns all of the rows that != 'nan' in index 2, but nothing is added to the list, and the error I am receiving is '(slice(None, None, None), 2)' is an invalid key.
I'm a newbie to all of this, so I really appreciate any help!
You can use pandas' .isna():
patients[~patients.iloc[:, 2].isna()]
Instead of deleting the rows that are NaN, this selects only the rows that are not NaN.
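If you also want the NaN rows as their own dataframe, a small sketch (assuming patients is your dataframe and column position 2 is the relevant one):
col = patients.iloc[:, 2]
no_nan = patients[col.notna()]   # rows where that column is not NaN
with_nan = patients[col.isna()]  # rows where that column is NaN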
You can try this (assuming df is the name of your data frame):
import numpy as np
df1 = df[np.isfinite(df['index 2'])]
This will give you a new data frame df1 with only the rows that have a finite value in the column index 2. You can also try this:
import pandas as pd
df1 = df[pd.notnull(df['index 2'])]
If you want to drop all the rows that have NaN values in any of the columns, you can use this:
df1 = df.dropna()
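If you only care about NaNs in that one column, dropna with subset is another option (again assuming the column is literally named 'index 2'):
df1 = df.dropna(subset=['index 2'])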
I'm stuck on a particular Python question here. I have 2 dataframes, DF1 and DF2. In both, I have 2 columns, pID and yID (which are not indexed, just the default). I'm looking to add a column Found to DF1 indicating where the respective values of the columns (pID and yID) were found in DF2. Also, I would like to only consider rows in DF2 where aID == 'Text'.
I believe the below gets me the first part of this question; however, I'm unsure how to incorporate the where condition.
DF1['Found'] = (DF1[['pID', 'yID']] == DF2[['pID','yID']]).all(axis=1).astype(bool)
Suggestions or answers would be most appreciated. Thanks.
You could subset the second dataframe to the rows where aID == 'Text', getting a reduced DF from which to select the columns to be compared against the first dataframe.
Use DF.isin() to check whether the values present under these column names match. .all(axis=1) returns True for a row only if both columns are True. Convert the boolean series to integers via astype(int) and assign the result to the new column, Found.
df1_sub = df1[['pID', 'yID']]
df2_sub = df2.query('aID=="Text"')[['pID', 'yID']]
df1['Found'] = df1_sub.isin(df2_sub).all(axis=1).astype(int)
df1
Demo DF's used:
df1 = pd.DataFrame(dict(pID=[1, 2, 3, 4, 5],
                        yID=[10, 20, 30, 40, 50]))
df2 = pd.DataFrame(dict(pID=[1, 2, 8, 4, 5],
                        yID=[10, 12, 30, 40, 50],
                        aID=['Text', 'Best', 'Text', 'Best', 'Text']))
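With these demo frames, the isin approach flags rows 0 and 4 (isin with a DataFrame argument compares by index and column alignment, so only positions that line up in both frames can match):
   pID  yID  Found
0    1   10      1
1    2   20      0
2    3   30      0
3    4   40      0
4    5   50      1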
If it does not matter where those matches occur, another option is to merge the two dataframes on the common 'pID' and 'yID' columns while carrying along df1's original index (via reset_index), so that the merge result tells you which rows of df1 found a match anywhere in the reduced df2.
Use those index values to assign 1 to a new column named Found, then fill its remaining missing elements with 0's.
match_idx = df1_sub.reset_index().merge(df2_sub, on=['pID', 'yID'])['index']
df1.loc[match_idx, 'Found'] = 1
df1['Found'] = df1['Found'].fillna(0).astype(int)
df1 will be modified accordingly after the above steps.
I use Pandas dataframes to manipulate data and I usually visualise them as virtual spreadsheets, with rows and columns defining the positions of individual cells. I'm happy with the methods to slice and dice the dataframes, but there seems to be some odd behaviour when the dataframe contains a single row.
Basically, I want to select rows of data from a large parent dataframe that meet certain criteria and then pass those results as a daughter dataframe to a separate function for further processing. Sometimes there will only be a single record in the parent dataframe that meets the defined criteria and, therefore, the daughter dataframe will only contain a single row. Nevertheless, I still need to be able to access data in the daughter in the same way as for the parent dataframe. To illustrate my point, consider the following dataframe:
import pandas as pd
tempDF = pd.DataFrame({'group':[1,1,1,1,2,2,2,2],
                       'string':['a','b','c','d','a','b','c','d']})
print(tempDF)
Which looks like:
group string
0 1 a
1 1 b
2 1 c
3 1 d
4 2 a
5 2 b
6 2 c
7 2 d
As an example, I can now select those rows where 'group' == 2 and 'string' == 'c', which yields just a single row. As expected, the length of the dataframe is 1 and it's possible to print just a single cell using .loc based on index values in the original dataframe:
tempDF2 = tempDF.loc[((tempDF['group']==2) & (tempDF['string']=='c')),['group','string']]
print(tempDF2)
print('Length of tempDF2 = ',tempDF2.index.size)
print(tempDF2.loc[6,['string']])
Output:
group string
6 2 c
Length of tempDF2 = 1
string c
However, if I select a single row using .loc, then the dataframe is printed in a transposed form and the length of the dataframe is now given as 2 (rather than 1). Clearly, it's no longer possible to select single cell values based on index of original parent dataframe:
tempDF3 = tempDF.loc[6,['group','string']]
print(tempDF3)
print('Length of tempDF3 = ',tempDF3.index.size)
Output:
group 2
string c
Name: 6, dtype: object
Length of tempDF3 = 2
In my mind, both these methods are actually doing the same thing, namely selecting a single row of data. However, in the second example, the rows and columns are transposed making it impossible to extract data in an expected way.
Why should these 2 behaviours exist? What is the point of transposing a single row of a dataframe as a default behaviour? How can I make sure that a dataframe containing a single row isn't transposed when I pass it to another function?
tempDF3 = tempDF.loc[6,['group','string']]
The 6 in the first position of the .loc selection dictates that the return type will be a Series and hence your problem. Instead use [6]:
tempDF3 = tempDF.loc[[6],['group','string']]
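With the list selector the result stays a one-row DataFrame and keeps the original index label, so the earlier style of access still works:
print(tempDF3)
print('Length of tempDF3 = ', tempDF3.index.size)
Output:
  group string
6     2      c
Length of tempDF3 = 1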
So I imported and merged 4 csv's into one dataframe called data. However, upon inspecting the dataframe's index with:
index_series = pd.Series(data.index.values)
index_series.value_counts()
I see that multiple index entries have 4 counts. I want to completely reindex the data dataframe so each row now has a unique index value. I tried:
data.reindex(np.arange(len(data)))
which gave the error "ValueError: cannot reindex from a duplicate axis." A google search leads me to think this error occurs because there are up to 4 rows that share the same index value. Any idea how I can do this reindexing without dropping any rows? I don't particularly care about the order of the rows either, as I can always sort it.
UPDATE:
So in the end I did find a way to reindex like I wanted.
data['index'] = np.arange(len(data))
data = data.set_index('index')
As I understand it, I just added a new column called 'index' to my data frame, and then set that column as my index.
As for my csv's, they were the four csv's under "download loan data" on this page of Lending Club loan stats.
It's pretty easy to replicate your error with this sample data:
In [92]: data = pd.DataFrame( [33,55,88,22], columns=['x'], index=[0,0,1,2] )
In [93]: data.index.is_unique
Out[93]: False
In [94]: data.reindex(np.arange(len(data))) # same error message
The problem is because reindex requires unique index values. In this case, you don't want to preserve the old index values, you merely want new index values that are unique. The easiest way to do that is:
In [95]: data.reset_index(drop=True)
Out[95]:
x
0 33
1 55
2 88
3 22
Note that you can leave off drop=True if you want to retain the old index values as a column.
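For example, without drop=True the old duplicated labels come back as a column named index:
In [96]: data.reset_index()
Out[96]:
   index   x
0      0  33
1      0  55
2      1  88
3      2  22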