Reindex a dataframe with duplicate index values - python

So I imported and merged 4 csv's into one dataframe called data. However, upon inspecting the dataframe's index with:
index_series = pd.Series(data.index.values)
index_series.value_counts()
I see that multiple index entries have 4 counts. I want to completely reindex the data dataframe so each row now has a unique index value. I tried:
data.reindex(np.arange(len(data)))
which gave the error "ValueError: cannot reindex from a duplicate axis." A google search leads me to think this error occurs because there are up to 4 rows that share the same index value. Any idea how I can do this reindexing without dropping any rows? I don't particularly care about the order of the rows either as I can always sort it.
UPDATE:
So in the end I did find a way to reindex like I wanted.
data['index'] = np.arange(len(data))
data = data.set_index('index')
As I understand it, I just added a new column called 'index' to my data frame, and then set that column as my index.
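On toy data (the column name x and the duplicated index values are made up for illustration), that fix looks like this:

```python
import numpy as np
import pandas as pd

# Toy frame with duplicate index values, as you would get after merging CSVs
data = pd.DataFrame({'x': [33, 55, 88, 22]}, index=[0, 0, 1, 2])

# Add a fresh 0..n-1 column, then promote it to be the index
data['index'] = np.arange(len(data))
data = data.set_index('index')

print(data.index.is_unique)  # True
```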
As for my csv's, they were the four csv's under "download loan data" on this page of Lending Club loan stats.

It's pretty easy to replicate your error with this sample data:
In [92]: data = pd.DataFrame( [33,55,88,22], columns=['x'], index=[0,0,1,2] )
In [93]: data.index.is_unique
Out[93]: False
In [94]: data.reindex(np.arange(len(data))) # same error message
The problem is that reindex requires unique index values. In this case, you don't want to preserve the old index values; you merely want new index values that are unique. The easiest way to do that is:
In [95]: data.reset_index(drop=True)
Out[95]:
    x
0  33
1  55
2  88
3  22
Note that you can leave off drop=True if you want to retain the old index values.
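For completeness, a quick sketch of both variants on the same toy frame (drop=True discards the old index; without it, the old index survives as a column):

```python
import pandas as pd

data = pd.DataFrame([33, 55, 88, 22], columns=['x'], index=[0, 0, 1, 2])

fresh = data.reset_index(drop=True)  # old duplicate index discarded
kept = data.reset_index()            # old index kept as an 'index' column

print(fresh.index.tolist())   # [0, 1, 2, 3]
print(kept.columns.tolist())  # ['index', 'x']
```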

Related

How to convert cells into columns in pandas? (python) [duplicate]

The problem is, when I transpose the DataFrame, the header of the transposed DataFrame becomes the Index numerical values and not the values in the "id" column. See below original data for examples:
Original data that I wanted to transpose (but keep the 0,1,2,... Index intact and change "id" to "id2" in final transposed DataFrame).
DataFrame after I transpose, notice the headers are the Index values and NOT the "id" values (which is what I was expecting and needed)
Logic Flow
First this helped to get rid of the numerical index that got placed as the header: How to stop Pandas adding time to column title after transposing a datetime index?
Then this helped to get rid of the index numbers as the header, but now "id" and "index" got shuffled around: Reassigning index in pandas DataFrame
But now my id and index values got shuffled for some reason.
How can I fix this so the columns are [id2,600mpe, au565...]?
How can I do this more efficiently?
Here's my code:
DF = pd.read_table(data,sep="\t",index_col = [0]).transpose() #Add index_col = [0] to not have index values as own row during transposition
m, n = DF.shape
DF.reset_index(drop=False, inplace=True)
DF.head()
This didn't help much: Add indexed column to DataFrame with pandas
If I understand your example, what happens is that transpose takes your actual index (the 0...n sequence) and uses it as the column headers. First, if you want to preserve that numerical index, you can store it as id2:
DF['id2'] = DF.index
Now if you want id to be the column headers then you must set that as an index, overriding the default one:
DF.set_index('id',inplace=True)
DF.T
I don't have your data reproduced, but this should give you the values of id across columns.
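A minimal runnable sketch with made-up data (only the id values 600mpe and au565 come from the question; the measurement rows are invented):

```python
import pandas as pd

# Hypothetical stand-in for the original table: an 'id' column plus measurements
DF = pd.DataFrame({'id': ['600mpe', 'au565'],
                   'm1': [1.0, 2.0],
                   'm2': [3.0, 4.0]})

DF.set_index('id', inplace=True)
T = DF.T  # the 'id' values become the column headers

print(T.columns.tolist())  # ['600mpe', 'au565']
```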

How to delete multiple rows in a pandas DataFrame based on condition?

I know how to delete rows and columns from a dataframe using .drop() method, by passing axis and labels.
Here's the Dataframe:
Now, if I want to remove all rows whose STNAME falls between Arizona and Colorado (inclusive), how should I do it?
I know I could just pass row labels 2 to 7 to the .drop() method, but with a lot of data I won't know the starting and ending indexes, so that won't be possible.
Might be kinda hacky, but here is an option:
index1 = df.index[df['STNAME'] == 'Arizona'].tolist()[0]    # first Arizona row
index2 = df.index[df['STNAME'] == 'Colorado'].tolist()[-1]  # last Colorado row
df = df.drop(np.arange(index1, index2 + 1))
This basically takes the first index number of Arizona and the last index number of Colorado, and deletes every row from the data frame between these indexes.
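A runnable sketch on invented state data (the row contents are made up; only Arizona and Colorado come from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'STNAME': ['Alabama', 'Alabama', 'Arizona', 'Arizona',
                              'Arkansas', 'California', 'Colorado', 'Colorado',
                              'Connecticut']})

index1 = df.index[df['STNAME'] == 'Arizona'].tolist()[0]    # first Arizona row
index2 = df.index[df['STNAME'] == 'Colorado'].tolist()[-1]  # last Colorado row
df = df.drop(np.arange(index1, index2 + 1))

print(df['STNAME'].tolist())  # ['Alabama', 'Alabama', 'Connecticut']
```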

Selecting the first values from a MultiIndexed dataframe with two levels

I am relatively new to Python, so please excuse any confusion which may arise due to my bad terminology.
Anyways, I am currently stuck with trying to obtain the first value for each index of level 2 of a multiindexed dataframe. The df has 2 indexes, level 1 being 'user' and level 2 being 'trial'. Both 'user' and 'trial' are integer values, while 't' are continuous float values.
Basically I want to extract the first 't' value of the following dataframe for each trial, for each user: df=dataframe in question.
I have used df['user'].unique() and df['trial'].unique() (before doing df.set_index(['user','trial'])) and discovered that there are 1040 unique users and 97 unique trials. The main problem is that not every user has the same unique trial numbers (i.e. user 1 has a trial number 5, while user 2 does not, and so on).
Is there anyway to obtain these values and later compile them in a similar dataframe, df2, which is also indexed by 'user' and 'trial'?
Thanks in advance!
Use DataFrame.drop_duplicates:
df = df.reset_index()
df = df.drop_duplicates(subset=['user', 'trial'], keep='first')
df = df.set_index(['user', 'trial'])
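A self-contained sketch on invented user/trial data (the t values are made up):

```python
import pandas as pd

# Hypothetical MultiIndexed frame: repeated (user, trial) pairs with 't' values
df = pd.DataFrame({'user':  [1, 1, 1, 2, 2],
                   'trial': [5, 5, 6, 7, 7],
                   't':     [0.1, 0.9, 0.2, 0.3, 0.8]}).set_index(['user', 'trial'])

df2 = df.reset_index()
df2 = df2.drop_duplicates(subset=['user', 'trial'], keep='first')
df2 = df2.set_index(['user', 'trial'])

print(df2['t'].tolist())  # [0.1, 0.2, 0.3]
```

An equivalent one-liner, while the index is still set, is df.groupby(level=['user', 'trial']).first().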

create Pandas Dataframe with unique index

Can I create a dataframe which has a unique index or columns, similar to creating an unique key in mysql, that it will return an error if I try to add a duplicate index?
Or is my only option to create an if-statement and check for the value in the dataframe before appending it?
EDIT:
It seems my question was a bit unclear. With unique columns I mean that we cannot have non-unique values in a column.
With
df.append(new_row, verify_integrity=True)
we can check for all columns, but how can we check for only one or two columns?
You can use df.append(..., verify_integrity=True) to maintain a unique row index:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12).reshape(3,4), columns=list('ABCD'))
dup_row = pd.DataFrame([[10,20,30,40]], columns=list('ABCD'), index=[1])
new_row = pd.DataFrame([[10,20,30,40]], columns=list('ABCD'), index=[9])
This successfully appends a new row (with index 9):
df.append(new_row, verify_integrity=True)
# A B C D
# 0 0 1 2 3
# 1 4 5 6 7
# 2 8 9 10 11
# 9 10 20 30 40
This raises ValueError because 1 is already in the index:
df.append(dup_row, verify_integrity=True)
# ValueError: Indexes have overlapping values: [1]
While the above works to ensure a unique row index, I'm not aware of a similar method for ensuring a unique column index. In theory you could transpose the DataFrame, append with verify_integrity=True and then transpose again, but generally I would not recommend this since transposing can alter dtypes when the column dtypes are not all the same. (When the column dtypes are not all the same the transposed DataFrame gets columns of object dtype. Conversion to and from object arrays can be bad for performance.)
If you need both unique row- and column- Indexes, then perhaps a better alternative is to stack your DataFrame so that all the unique column index levels become row index levels. Then you can use append with verify_integrity=True on the reshaped DataFrame.
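Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat accepts the same verify_integrity flag, so the example above can be sketched like this in recent versions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list('ABCD'))
dup_row = pd.DataFrame([[10, 20, 30, 40]], columns=list('ABCD'), index=[1])
new_row = pd.DataFrame([[10, 20, 30, 40]], columns=list('ABCD'), index=[9])

# Appending a row with a fresh index label succeeds
out = pd.concat([df, new_row], verify_integrity=True)

# Appending a row whose index label already exists raises ValueError
try:
    pd.concat([df, dup_row], verify_integrity=True)
    raised = False
except ValueError:
    raised = True

print(out.index.tolist())  # [0, 1, 2, 9]
print(raised)              # True
```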
OP's follow-up question:
With df.append(new_row, verify_integrity=True), we can check for all columns, but how can we check for only one or two
columns?
To check uniqueness of just one column, say the column name is value, one can try
df['value'].duplicated().any()
This checks whether any value in the column is duplicated; if so, the column is not unique.
Given two columns, say C1 and C2, to check whether there are duplicated rows, we can still use DataFrame.duplicated:
df[["C1", "C2"]].duplicated()
It checks row-wise uniqueness. You can again use any to check whether any of the returned values is True.
Given 2 columns, say C1 and C2, to check whether each column contains duplicated value, we can use apply.
df[["C1", "C2"]].apply(lambda x: x.duplicated().any())
This will apply the function to each column.
NOTE
pd.DataFrame([[np.nan, np.nan],
[ np.nan, np.nan]]).duplicated()
0 False
1 True
dtype: bool
np.nan will also be captured by duplicated. If you want to ignore np.nan, you can try selecting the non-NaN part first (e.g. with dropna).
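A small sketch of that NaN behaviour (toy series; drop the NaN rows first if they should not count):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 2.0])

print(s.duplicated().any())           # True: the second NaN counts as a duplicate
print(s.dropna().duplicated().any())  # False: NaN rows excluded first
```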

Cannot get right slice bound for non-unique label when indexing data frame with python-pandas

I have such a data frame df:
a b
10 2
3 1
0 0
0 4
....
# about 50,000+ rows
I wish to select df.loc[:5, 'a']. But when I call df.loc[:5, 'a'], I get an error: KeyError: 'Cannot get right slice bound for non-unique label: 5'. When I call df.loc[5], the result contains 250 rows, while there is just one when I use df.iloc[5]. Why does this happen and how can I index it properly? Thank you in advance!
The error message is explained here: if the index is not monotonic, then both slice bounds must be unique members of the index.
The difference between .loc and .iloc is label- vs. integer-position-based indexing - see docs. .loc is intended to select individual labels or slices of labels. That's why .loc[5] selects all rows where the index has the value 5 (in your data, 250 of them; the error arises because that label is not unique). .iloc, in contrast, selects row number 5 (0-indexed). That's why you only get a single row, and its index value may or may not be 5. Hope this helps!
To filter with non-unique indexes, try something like this:
df.loc[(df.index > 0) & (df.index < 2)]
The issue with the way you are indexing is that there are multiple rows with the index 5, so loc does not know which occurrence to use as the slice bound. Just run df.loc[5] and you will see the number of rows sharing that index.
Either sort the frame using sort_index, or first aggregate the data by index and then retrieve.
Hope this helps.
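A sketch of the sort_index route on data shaped like the question's (the values are invented):

```python
import pandas as pd

# Non-unique, non-monotonic index: df.loc[:5, 'a'] would raise here
df = pd.DataFrame({'a': [10, 3, 0, 0], 'b': [2, 1, 0, 4]}, index=[5, 0, 5, 2])

# After sorting, the index is monotonic and label slicing works,
# even though the label 5 is still duplicated
df = df.sort_index()
sliced = df.loc[:5, 'a']

print(len(sliced))  # 4
```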
