Python: retrieve the row index of a DataFrame row

Could I ask how to retrieve the index of a row in a DataFrame?
Specifically, I am able to retrieve the index of rows returned by df.loc:
idx = data.loc[data.name == "Smith"].index
I can even retrieve the row index from df.loc by filtering on data.index, like this:
idx = data.loc[data.index == 5].index
However, I cannot retrieve the index directly from the row itself (i.e., from row.index instead of df.loc[].index). I tried this code:
idx = data.iloc[5].index
The result of this code is the column names.
To provide context, the reason I need to retrieve the index of a specific row (instead of rows from df.loc) is to use df.apply on each row.
I plan to use df.apply to apply a function to each row, copying data from the row immediately above it.
def retrieve_gender(row):
    # This is panel data; only the data for year 2000 are keyed in.
    # Time-invariant data in later years are the same as in 2000.
    if row["Year"] == 2000:
        pass
    elif row["Year"] == 2001:  # To avoid complexity, let's use only year 2001 as an example.
        idx = row.index  # This is the wrong code.
        row["Gender"] = row.iloc[idx-1]["Gender"]
    return row["Gender"]

data["Gender"] = data.apply(retrieve_gender, axis=1)

With Pandas you can loop through your dataframe like this:
for index in range(len(df)):
    if df.loc[index, 'Year'] == 2001:
        df.loc[index, 'Gender'] = df.loc[index - 1, 'Gender']
Note this assumes a default integer RangeIndex, so that index - 1 addresses the row immediately above.

apply gives series indexed by column labels
The problem with idx = data.iloc[5].index is that data.iloc[5] converts the row to a pd.Series object indexed by the column labels, which is why you got the column names back.
The same thing happens inside pd.DataFrame.apply: the series that feeds your retrieve_gender function is indexed by column labels (its row label is only available as row.name), and in any case it has no view of the neighbouring rows, so it cannot read the row above.
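A quick way to see both pieces (assuming the frame from the question):
row = data.iloc[5]
print(row.index)  # the column labels, e.g. ['Year', 'Gender', ...]
print(row.name)   # the row's index label (5 with a default RangeIndex)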
Use vectorised logic instead
With Pandas, row-wise logic is inefficient and not recommended, since it involves a Python-level loop. Use column-wise logic instead. Taking a step back, it seems you wish to implement two rules:
If Year is not 2001, leave Gender unchanged.
If Year is 2001, use Gender from previous row.
np.where + shift
For the above logic, you can use np.where with pd.Series.shift:
import numpy as np

data['Gender'] = np.where(data['Year'] == 2001, data['Gender'].shift(), data['Gender'])
mask + shift
Alternatively, you can use mask + shift:
data['Gender'] = data['Gender'].mask(data['Year'] == 2001, data['Gender'].shift())
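As a quick check, a minimal sketch on made-up panel data (column names follow the question; it assumes the frame is sorted so each person's 2001 row sits directly under their 2000 row):
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'Year':   [2000, 2001, 2000, 2001],
    'Gender': ['F', None, 'M', None],
})

# Where Year is 2001, take Gender from the row above; otherwise keep it.
data['Gender'] = np.where(data['Year'] == 2001,
                          data['Gender'].shift(),
                          data['Gender'])
print(data)  # the 2001 rows now carry 'F' and 'M'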

Related

Pandas: result column by aggregating entire row

I have a pandas dataframe whose cells contain tuples of booleans (real value, predicted value), and I want to create new columns containing the number of true/false positives/negatives.
I know I could loop through the indices and set the column value for each index after looping through the entire row, but I believe that's a pandas anti-pattern.
Is there a cleaner and more efficient way to do this?
Another alternative would be to check the whole dataframe for (True, False) values and sum the number of matches along the columns axis (sum(axis=1)):
df['false_positives'] = df.applymap(lambda v: v == (True, False)).sum(axis=1)
This seems to work fine:
def count_false_positives(row):
    count = 0
    for el in row.index:  # iterate over this row's column labels
        if row[el][0] and not row[el][1]:
            count += 1
    return count

df['false_positives'] = df.apply(count_false_positives, axis=1)
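For comparison, a self-contained sketch of the vectorised idea from the question (the data here is made up; it assumes every cell holds a (real, predicted) tuple and follows the question's labelling of (True, False) as a false positive):
import pandas as pd

df = pd.DataFrame({
    'model_a': [(True, False), (True, True), (False, False)],
    'model_b': [(True, False), (True, False), (False, False)],
})

# Compare every cell to the target tuple, then count matches per row.
df['false_positives'] = df.applymap(lambda v: v == (True, False)).sum(axis=1)
print(df['false_positives'].tolist())  # [2, 1, 0]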

Select a specific slice of data from a main dataframe, conditional on a value in a main dataframe column

I have a main dataframe (df) with a Date column (non-index), a column 'VXX_Full' with values, and a 'signal' column.
I want to iterate through the signal column, and whenever it is 1, I want to capture a slice (20 rows before, 40 rows after) of the 'VXX_Full' column and create a new dataframe holding all the slices. I would like each column name in the new dataframe to be the row number of the original dataframe.
VXX_signal = pd.DataFrame(np.zeros((60, 0)))
counter = 1
for row in df.index:
    if df.loc[row, 'signal'] == 1:
        add_row = df.loc[row - 20:row + 20, 'VXX_Full']
        VXX_signal[counter] = add_row
        counter += 1

VXX_signal
It just doesn't seem to be working. It creates a dataframe, but the values are all NaN. The first slice at least appears to pull data from the main df, though the data doesn't correspond to the correct location. The following columns (there are 30 signals, so 30 columns are created) in the new df are all NaN.
Thanks in advance!
I'm not sure about your current code - but basically all you need is a list of ranges of indexes. If your index is linear, this would be something like:
indexes = list(df[df.signal == 1].index)
ranges = [(i, list(range(i - 20, i + 21))) for i in indexes]  # tuples of (original index, range)
dfs = [df.loc[i[1]].copy().rename(
    columns={'VXX_Full': i[0]}).reset_index(drop=True) for i in ranges]
# EDIT: for only the VXX_Full column:
dfs = [df.loc[i[1], ['VXX_Full']].copy().rename(
    columns={'VXX_Full': i[0]}).reset_index(drop=True) for i in ranges]
# Here we take the -20:+20 slice of df as a separate dataframe, rename
# 'VXX_Full' to the original index value, and reset the index to give it a 0:40 index.
# The new index will be useful when putting all the columns next to each other.
So we made a list of indexes with signal == 1, turned it into a list of ranges and finally a list of dataframes with reset index.
Now we want to merge it all together:
from functools import reduce
merged_df = reduce(lambda left, right: pd.merge(
    left, right, left_index=True, right_index=True), dfs)
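Since every slice now shares the same reset 0..40 index, pd.concat lines them up in one call and should give the same result as the chain of merges:
merged_df = pd.concat(dfs, axis=1)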
I would build the resulting dataframe from a dictionary of lists:
resul = pd.DataFrame({i: df.loc[i - 20 if i >= 20 else 0:
                                i + 40 if i <= len(df) - 40 else len(df),
                                'VXX_Full'].values
                      for i in df.loc[df.signal == 1].index})
The trick is that .values extracts a NumPy array with no associated index.
Beware: the above code assumes that the index of the original dataframe is just the row number. Use reset_index(drop=True) first if it is different.

How can I select data in a multiindex DataFrame and have the result DataFrame get an appropriate index

I have a multiindex DataFrame and I'm trying to select data in it based on certain criteria; so far so good. The problem is that once I have selected my data using .loc and pd.IndexSlice, the resulting DataFrame, which should logically have fewer rows and fewer elements in the first level of the MultiIndex, keeps exactly the same MultiIndex, with some keys in it referring to empty dataframes.
I've tried creating a completely new DataFrame with a new index, but the structure of my data set is complicated and there is not always the same number of elements in a given level, so it is not easy to create a DataFrame with the right shape to put the data in.
import numpy as np
import pandas as pd

np.random.seed(3)  # so my example is reproducible
idx = pd.IndexSlice
iterables = [['A', 'B', 'C'], [0, 1, 2], ['some', 'rdm', 'data']]
my_index = pd.MultiIndex.from_product(iterables,
                                      names=['first', 'second', 'third'])
my_columns = ['col1', 'col2', 'col3']
df1 = pd.DataFrame(data=np.random.randint(10, size=(len(my_index), len(my_columns))),
                   index=my_index,
                   columns=my_columns)
# OK, so let's say I want to keep only the elements in the first level of my
# index (["A", "B", "C"]) for which the total sum in column 3 is less than 35,
# for some reason.
boolean_mask = (df1.groupby(level="first").col3.sum() < 35).tolist()
first_level_to_keep = df1.index.levels[0][boolean_mask].tolist()
# Let's select the wanted data and put it in df2.
df2 = df1.loc[idx[first_level_to_keep, :, :], :]
So far, everything is as expected
The problem is when I want to access the df2 index. I expected the following:
df2.index.levels[0].tolist() == ['B','C']
to be true. But this is what gives a True statement:
df2.index.levels[0].tolist() == ['A','B','C']
So my question is: is there a way to select data and get back a DataFrame whose MultiIndex reflects what is actually in it? I find it weird to be able to select non-existing data in my df2.
I tried to put some images of the dataframes in question, but I couldn't because I don't have enough "reputation"... sorry about that.
Thank you for your time!
Even if you delete the rows corresponding to a particular value in an index level, that value still exists. You can reset the index and then set those columns back as an index in order to generate a MultiIndex with new level values.
df2 = df2.reset_index().set_index(['first','second','third'])
print(df2.index.levels[0].tolist() == ['B','C'])
True
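Alternatively, assuming pandas 0.20 or later, MultiIndex.remove_unused_levels does the same cleanup directly, without the round trip through columns:
df2.index = df2.index.remove_unused_levels()
print(df2.index.levels[0].tolist() == ['B','C'])
True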

Python does not update dataframe while iterating over rows

I don't get why Python won't update my dataframe object.
The code snippet is this:
for index, row in df.iterrows():
    t = df.loc[index, :"score"]
    b = [float(i) for i in t if i != 's']
    m = sum(b) / len(b)
    df.at[index, "score"] = m
    print(df.at[index, "score"])  # Does not print m; it prints 0, the default value
This snippet should take all the values in a row, compute their average, and then write that average back to the dataframe.
Iterating over the rows of a DataFrame is very seldom the way to go.
Instead, use
df.loc[:, :'score'].mean(axis='columns')
which is more readable and much faster.
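If the 's' strings are placeholders for missing values (an assumption, since the question doesn't say), one sketch is to coerce the slice to numbers first; pd.to_numeric turns 's' into NaN, and mean() skips NaN by default:
import pandas as pd

numeric = df.loc[:, :'score'].apply(pd.to_numeric, errors='coerce')
df['score'] = numeric.mean(axis='columns')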
To answer your question directly (why your way doesn't work) we would need more information (see comments).

How to drop rows from a pandas dataframe using a list of indices

Introduction
We have the following dataframe which we create from a CSV file.
data = pd.read_csv(path + name, usecols = ['QTS','DSTP','RSTP','DDATE','RDATE','DTIME','RTIME','DCXR','RCXR','FARE'])
I want to delete specific rows from the dataframe. For this purpose I used a list and appended the ids of the rows we want to delete.
indices = []
for index, row in data.iterrows():
    if row['FARE'] >= 2500.00:
        indices.append(index)
From here I am lost. I don't know how to use the ids in the list to delete the rows from the dataframe.
Question
The list containing the row ids must be used in the dataframe to delete rows. Is it possible to do it?
Constraints
We can't call data.drop(index, inplace=True) for each index because it really slows down the process
We cannot use a filter because I have some special constraints.
If you are trying to remove rows that have 'FARE' values greater than or equal to 2500, you can use a mask that keeps those values less than 2500 -
df_out = df.loc[df.FARE.values < 2500] # Or df[df.FARE.values < 2500]
For large datasets, we might want to work with underlying array data and then construct the output dataframe -
df_out = pd.DataFrame(df.values[df.FARE.values < 2500], columns=df.columns)
To use those indices generated from the loopy code in the question -
df_out = df.loc[np.setdiff1d(df.index, indices)]
Or with masking again -
df_out = df.loc[~df.index.isin(indices)] # or df[~df.index.isin(indices)]
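Note that a single drop call with the whole list is also fast; it is only dropping rows one at a time inside the loop that gets slow:
df_out = df.drop(indices)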
How about filtering the data using the DataFrame.query() method:
cols = ['QTS','DSTP','RSTP','DDATE','RDATE','DTIME','RTIME','DCXR','RCXR','FARE']
df = pd.read_csv(path + name, usecols=cols).query("FARE < 2500")
