Pandas: result column by aggregating entire row - python

I have a pandas DataFrame containing tuples of booleans (real value, predicted value) and want to create new columns containing the number of true/false positives/negatives.
I know I could loop through the indices and set the column value for each index after looping through the entire row, but I believe that's a pandas anti-pattern.
Is there a cleaner and more efficient way to do this?

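Another alternative would be to compare every cell of the dataframe with (True, False) and sum the matches along the columns axis (sum(axis=1)). The comparison has to be done cell by cell, because comparing a whole column with a tuple treats the tuple as a list of values rather than as a single value:
df['false_positives'] = df.apply(lambda col: col.map(lambda x: x == (True, False))).sum(axis=1)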

This seems to work fine:
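def count_false_positives(row):
    count = 0
    # Iterate over the row's own labels rather than the global df.columns
    for el in row.index:
        if row[el][0] and not row[el][1]:
            count += 1
    return count
# Use bracket assignment: attribute assignment does not create a new column
df['false_positives'] = df.apply(count_false_positives, axis=1)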
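As a minimal sketch of the cell-by-cell comparison idea, the helper below counts, per row, how many cells equal a given (real, predicted) tuple. The sample data, the helper name count_pairs and the result column labels are assumptions made for illustration; which tuple corresponds to which of the four categories is left to the reader.

```python
import pandas as pd

# Hypothetical data: each cell is a (real, predicted) tuple, as in the question.
df = pd.DataFrame({
    "a": [(True, False), (True, True), (False, False)],
    "b": [(True, False), (False, True), (True, False)],
})

def count_pairs(frame, pair):
    """Count, per row, the cells of `frame` that equal the tuple `pair`."""
    # col.map compares one cell at a time, so each tuple is treated as a single value.
    matches = frame.apply(lambda col: col.map(lambda cell: cell == pair))
    return matches.sum(axis=1)

# One result column per possible tuple, labelled with the tuple itself.
pairs = [(True, True), (True, False), (False, True), (False, False)]
counts = pd.DataFrame({str(pair): count_pairs(df, pair) for pair in pairs})
print(counts)
```

Building the counts in a separate frame keeps the tuple columns and the integer result columns apart, so the helper can be called repeatedly without tripping over its own output.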

Related

Applying a function that inverts column values using pandas

I'm hoping to get some advice on a problem I'm running into while trying to apply a function that inverts the values in the columns of a dataframe.
For example, if the observation is 0 and the max of the column is 7, I take the absolute value of the observation minus the max: abs(0 - 7) = 7, so the smallest value becomes the largest.
All of the columns have a range similar to the example above. The shape of the sliced df is (16984, 512).
The code I have written creates a bunch of empty columns, which are then replaced with the max values of those columns. The new shape becomes (16984, 1029), including the 5 columns that I sliced off before. Then I use a lambda to apply the function over the columns in question:
# create max cols
col = df.iloc[:, 5:]
col_names = col.columns
maximum = '_max'
for col in df[col_names]:
    max_value = df[col].max()
    df[col+maximum] = np.zeros((16984,))
    df[col+maximum].replace(to_replace = 0, value = max_value)
# for each row and column inverse value of row
def invert_col(x, col):
    """Invert values of a column"""
    return abs(x[col] - x[col+"_max"])
for col in col_names:
    new_df = df.apply(lambda x: invert_col(x, col), axis = 1)
I've tried this both with axis = 1 and without it, and the behaviour is quite different. I am fairly new to Python, so I'm finding it difficult to troubleshoot why this is happening.
When I remove axis = 1, the error I get is a key error: KeyError: 'TV_TIME_LIVE'
TV_TIME_LIVE is the first column in col_names, so it's as if it's not finding it.
When I include axis = 1, I don't get an error, but all the columns in the df get flattened into a Series, with length equal to the original df.
What I'm expecting is a new_df with the same shape (16984,1029) where the values of the 5th to the 517th column have the inverse function applied to them.
I would really appreciate any guidance as to what's going on here and how we might get to the desired output.
Many thanks
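apply is slow. It is better to use a vectorized approach, as below.
axis=1 means that your function receives one row at a time (as a Series indexed by the column names); if you do not specify it, the default axis=0 passes one column at a time (as a Series indexed by the row labels). That is why you get the KeyError without axis=1: pandas looks up 'TV_TIME_LIVE' among the row labels of a column and cannot find it. With axis=1 each call returns a single number per row, so the result is a Series with the same length as the df. If you really must use apply, look at a few examples of how exactly it works.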
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randint(0,7,size=(100, 4)), columns=list('ABCD'))
col_list=df.columns.copy()
for col in col_list:
    df[col+"inversed"]=abs(df[col]-df[col].max())
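The following sketch illustrates two points from this answer under assumed data: what apply passes to the function for each axis setting, and how the inversion can be written for several columns at once without a loop. The column subset value_cols and the "_inverted" suffix are hypothetical names, not taken from the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 8, size=(6, 4)), columns=list("ABCD"))

# What apply hands to the function: s.name is the label of the slice it received.
labels_axis0 = df.apply(lambda s: s.name)          # one call per column
labels_axis1 = df.apply(lambda s: s.name, axis=1)  # one call per row

# Vectorized inversion of a subset of columns in one expression.
value_cols = ["B", "C", "D"]  # hypothetical: the columns to invert
inverted = (df[value_cols] - df[value_cols].max()).abs().add_suffix("_inverted")

# Original columns plus the inverted copies, side by side.
result = pd.concat([df, inverted], axis=1)
```

Subtracting df[value_cols].max() aligns the per-column maxima with the matching columns, so no helper "_max" columns are needed.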

How can I merge/sum together a range of rows in a pandas DataFrame into one row?

I have a pandas DataFrame and I am trying to sum together and merge the last several rows into a single row. Is there a way I can specify an index range and have that range of rows summed and merged into a single row across all the columns?
Thanks
Yes, you should be able to specify an index range and have that range of rows summed and merged into a single row across all the columns:
start_row = 18
df.iloc[start_row] = df.iloc[start_row:].sum()
df = df.iloc[:start_row+1]
Try this: I assume your data has only two columns (count & percent):
df.reset_index().replace({'index':[i for i in range(18,24)]}, {'index':18}).groupby('index', as_index=False).sum()
You can use pandas.DataFrame.groupby with a lambda function:
start = 18
df = df.groupby(lambda x: x if x < start else start).sum()
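A small worked example may help to check what these approaches produce. It assumes a default integer index (0, 1, 2, ...) and two hypothetical columns; the second variant, which sums the tail and concatenates it to the untouched head, is an alternative sketch rather than one of the answers above.

```python
import pandas as pd

df = pd.DataFrame({"count": [1, 2, 3, 4, 5, 6],
                   "percent": [10, 20, 30, 40, 50, 60]})
start = 3  # hypothetical: first row of the range to merge

# Option 1: map every index label >= start onto `start`, then sum per group.
merged_groupby = df.groupby(lambda label: min(label, start)).sum()

# Option 2: sum the tail and attach it to the untouched head.
tail_sum = df.iloc[start:].sum().rename(start)
merged_concat = pd.concat([df.iloc[:start], tail_sum.to_frame().T])

print(merged_groupby)
```

Option 1 works on index labels and Option 2 on positions, so the two agree only when the index is the plain row number.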

List of Dataframes, drop Dataframe column (columns have different names) if row contains a special string

What I have is a list of DataFrames.
Note that the shapes of the dataframes differ, with between 2 and 7 columns, and the columns are named from 0 up to the number of columns minus one (e.g. df1 has 5 columns named 0,1,2,3,4; df2 has 4 columns named 0,1,2,3).
What I would like is to check whether any row in a column contains a certain string, and if so delete that column.
list_dfs1=[df1,df2,df3...df100]
What I have done so far is below, and I get an error that column 5 is not in axis (it is there for some DFs):
for i, df in enumerate(list_dfs1):
    for index,row in df.iterrows():
        if np.where(row.str.contains("DEC")):
            df.drop(index, axis=1)
Any suggestions.
You could try:
for df in list_dfs1:
    for col in df.columns:
        # If you are unsure about column types, cast column as string:
        df[col] = df[col].astype(str)
        # Check if the column contains the string of interest
        if df[col].str.contains("DEC").any():
            df.drop(columns=[col], inplace=True)
If you know that all columns are of type string, you don't have to actually do df[col] = df[col].astype(str).
You can write a custom function that checks whether the dataframe has the pattern or not. You can use pd.Series.str.contains with pd.Series.any
def func(s):
    return s.str.contains('DEC').any()
list_df = [df.loc[:, ~df.apply(func)] for df in list_dfs1]
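If some columns are not strings, the .str accessor raises an error. A possible variant, sketched here with hypothetical sample frames and a hypothetical helper name has_pattern, casts each column to str only for the check, so the original data keeps its dtypes:

```python
import pandas as pd

# Hypothetical frames with integer column names and mixed dtypes.
df1 = pd.DataFrame({0: ["a", "DEC 2020"], 1: [1, 2], 2: ["x", "y"]})
df2 = pd.DataFrame({0: ["q", "r"], 1: ["xDECx", "s"]})
list_dfs1 = [df1, df2]

def has_pattern(col, pattern="DEC"):
    # Cast to str so numeric columns do not break the .str accessor;
    # regex=False treats the pattern as a literal substring.
    return col.astype(str).str.contains(pattern, regex=False).any()

# Keep only the columns in which the pattern never appears.
cleaned = [df.loc[:, ~df.apply(has_pattern)] for df in list_dfs1]
```

This builds new frames instead of dropping columns in place, so the frames in list_dfs1 are left unchanged.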
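I would take another approach: concatenate the list into a single data frame and then eliminate the columns in which the string is found.
import pandas as pd
df = pd.concat(list_dfs1)
Let us say your condition was to eliminate any column with a cell equal to "DEC":
df.mask(df == "DEC").dropna(axis=1, how="any")
Note that df == "DEC" matches only cells that are exactly "DEC", not cells that merely contain it, and that dropna also removes any column that already holds NaN, including the NaN that concat introduces when the dataframes have different numbers of columns.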

Is there a concise way to cumcount string occurrences in a dataframe?

I have a df that is in chronological order (oldest to newest) of UFC fights. Each row contains both fighters. How would I create two new columns:
col_a = cumsum of number of fights R_fighter exists in,
col_b = cumsum of number of fights B_fighter exists in
So as an example, in row 1000 of the df, I'd like a cumcount of the amount of times R_fighter occurs in the dataframe from rows 0 through 999.
I'm struggling to wrap my head around this without using a for loop of some kind.
Let's assume your dataframe is called df and is indexed 0 to n. Then you can use apply and value_counts to add the cumcount columns as follows.
def myct(row,col):
    return df[col][:row.name+1].value_counts()[row[col]]
df['col_a']=df.apply(lambda row: myct(row, 'R_fighter'), axis=1)
df['col_b']=df.apply(lambda row: myct(row, 'B_fighter'), axis=1)
You can use .value_counts():
df['R_fighter'].value_counts()
Or .groupby() with .size():
df.groupby('R_fighter').size()
Note: .value_counts() sorts the resulting Series by count in descending order, while .groupby().size() orders it by the group keys. Both give the total count per fighter rather than a running count.
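For a running count without a Python loop, groupby(...).cumcount() numbers each occurrence of a value starting from 0, which equals the number of earlier rows with the same value. The sketch below uses made-up fighter names. The second half, which counts earlier fights in either corner, is an assumption about what the question intends, and the column names R_prior_fights and B_prior_fights are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "R_fighter": ["Ann", "Bea", "Ann", "Cat"],
    "B_fighter": ["Bea", "Cat", "Cat", "Ann"],
})

# Earlier appearances in the same corner only.
df["col_a"] = df.groupby("R_fighter").cumcount()
df["col_b"] = df.groupby("B_fighter").cumcount()

# Earlier fights in either corner: flatten row by row (R, B, R, B, ...),
# take a running count per name, then fold back into two columns.
names = pd.Series(df[["R_fighter", "B_fighter"]].to_numpy().ravel())
prior = names.groupby(names).cumcount().to_numpy().reshape(-1, 2)
df["R_prior_fights"] = prior[:, 0]
df["B_prior_fights"] = prior[:, 1]
```

Add 1 to any of these columns if the current fight should be included in the count.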

Select a specific slice of data from a main dataframe, conditional on a value in a main dataframe column

I have a main dataframe (df) with a Date column (non-index), a column 'VXX_Full' with values, and a 'signal' column.
I want to iterate through the signal column, and whenever it is 1, I want to capture a slice (20 rows before, 40 rows after) of the 'VXX_Full' column and create a new dataframe with all the slices. I would like the column name of the new dataframe to be the row number of the original dataframe.
VXX_signal = pd.DataFrame(np.zeros((60,0)))
counter = 1
for row in df.index:
    if df.loc[row,'signal'] == 1:
        add_row = df.loc[row - 20:row +20,'VXX_Full']
        VXX_signal[counter] = add_row
        counter +=1
VXX_signal
It just doesn't seem to be working. It creates a dataframe, but the values are all NaN. The first slice at least appears to be getting data from the main df, but the data doesn't correspond to the correct location. The remaining columns (there are 30 signals, so 30 columns are created) in the new df are all NaN.
Thanks in advance!
I'm not sure about your current code - but basically all you need is a list of ranges of indexes. If your index is linear, this would be something like:
indexes = list(df[df.signal==1].index)
ranges = [(i,list(range(i-20,i+21))) for i in indexes] #create tuple (original index,range)
dfs = [df.loc[i[1]].copy().rename(
    columns={'VXX_Full':i[0]}).reset_index(drop=True) for i in ranges]
#EDIT: for only the VXX_Full Column:
dfs = [df.loc[i[1]].copy()[['VXX_Full']].copy().rename(
    columns={'VXX_Full':i[0]}).reset_index(drop=True) for i in ranges]
#here we take the -20:+20 slice of df, make a separate dataframe, then
#we change 'VXX_Full' to the original index value, and reset index to give it 0:40 index.
#The new index will be useful when putting all the columns next to each other.
So we made a list of indexes with signal == 1, turned it into a list of ranges and finally a list of dataframes with reset index.
Now we want to merge it all together:
from functools import reduce
merged_df = reduce(lambda left, right: pd.merge(
    left, right, left_index=True, right_index=True), dfs)
I would build the resulting dataframe from a dictionary of lists:
resul = pd.DataFrame({i:df.loc[i-20 if i >=20 else 0: i+40 if i <= len(df) - 40 else len(df),
    'VXX_Full'].values for i in df.loc[df.signal == 1].index})
The trick is that .values extracts a numpy array with no associated index.
Beware: above code assumes that the index of the original dataframe is just the row number. Use reset_index first if it is different.
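The NaN values in the question come from index alignment: assigning a Series to a DataFrame column matches values by index label, and the labels of each slice (row - 20 onwards) mostly do not exist in the 0..59 index of VXX_signal. A minimal positional sketch, using made-up data and assuming that signals too close to either end of the frame are simply skipped, could look like this:

```python
import numpy as np
import pandas as pd

before, after = 20, 40  # rows to keep before and after each signal
n = 200                 # hypothetical number of rows
df = pd.DataFrame({"VXX_Full": np.arange(n, dtype=float), "signal": 0})
df.loc[[50, 120], "signal"] = 1

values = df["VXX_Full"].to_numpy()
# Row positions (not labels) where the signal fires.
positions = np.flatnonzero(df["signal"].to_numpy() == 1)
# Keep only signals with a complete window, so all columns have equal length.
positions = positions[(positions >= before) & (positions + after < n)]

# Plain arrays carry no index, so nothing is aligned away when the frame is built.
windows = {int(p): values[p - before : p + after + 1] for p in positions}
VXX_signal = pd.DataFrame(windows)
```

Each column is labelled with the row position of its signal and holds before + 1 + after = 61 values, with the signal row itself at position 20.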
