Pythonic Way of Reducing Factor Levels in Large Dataframe

I am attempting to reduce the number of factor levels in a column of a pandas DataFrame so that any factor whose share of the column's rows is lower than a defined threshold (default 1%) is bucketed into a new factor labeled 'Other'. Below is the function I am using to accomplish this task:
def condenseMe(df, column_name, threshold=0.01, newLabel="Other"):
    valDict = dict(df[column_name].value_counts() / len(df[column_name]))
    toCondense = [v for v in valDict.keys() if valDict[v] < threshold]
    if 'Missing' in toCondense:
        toCondense.remove('Missing')
    df[column_name] = df[column_name].apply(lambda x: newLabel if x in toCondense else x)
The issue I am running into is that I am working with a large dataset (~18 million rows) and am attempting to use this function on a column with more than 10,000 levels. Because of this, executing this function on that column takes a very long time to complete. Is there a more pythonic way to reduce the number of factor levels that will execute faster? Any help would be much appreciated!

You can do this with a combination of groupby, transform, and count:
def condenseMe(df, col, threshold=0.01, newLabel="Other"):
    # Create a new Series with the normalized value counts
    counts = df[[col]].groupby(col)[col].transform('count') / len(df)
    # Create a 1D mask based on threshold (ignoring "Missing")
    mask = (counts < threshold) & (df[col] != 'Missing')
    # Assign these masked values a new label (use .loc to avoid chained-assignment issues)
    df.loc[mask, col] = newLabel
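For reference, an equivalent variant (a minimal sketch, not part of the original answer) avoids the groupby entirely by mapping each value to its normalized frequency with value_counts; it assumes the same df, col, threshold and 'Missing' convention as above:
import pandas as pd

def condense_with_map(df, col, threshold=0.01, new_label="Other"):
    # Map every value in the column to its normalized frequency
    freq = df[col].map(df[col].value_counts(normalize=True))
    # Rare levels (below threshold) that are not the 'Missing' placeholder
    mask = (freq < threshold) & (df[col] != 'Missing')
    df.loc[mask, col] = new_label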

Related

Calculate weighted average of a data subset with one column as weights

I have a huge data set and I need to calculate a weighted average for subsets. My approach works but is painfully slow (dozens of minutes on my data) and I would like to find a faster solution. I thought to use DataFrame.agg, but I cannot work out how to pass a second column as the weights.
So far, I build a new, aggregated DataFrame with .apply; see below:
def weighted(df):
    grouped = df.groupby('spot')
    result = pd.DataFrame()
    for column in df.columns:
        result[column] = grouped.apply(weighted_average, column, 'weight')
    return result

def weighted_average(df, target_column, weight_column):
    """The average of a target column is calculated with a second column as weight.
    Input:
        DataFrame
        string for column to average
        string for column as weight
    Output:
        float, weighted average of target column"""
    return sum(df[weight_column] * df[target_column]) / df[weight_column].sum()
Thanks for your attention & help.
Update: I gained a factor of 100 (timed on 1/250 of my data) by doing the following:
def tot_weighting(df, weight_column, list_of_columns):
    norm = df.groupby('spot')[weight_column].transform('sum')
    for column in list_of_columns:
        df[column] = df[weight_column] * df[column] / norm
    result = df.groupby('spot').agg('sum')
    return result
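To illustrate the call on a toy DataFrame (a minimal sketch; the column names 'spot', 'weight' and 'x' are placeholders, and note that tot_weighting modifies df in place):
import pandas as pd

df = pd.DataFrame({
    'spot':   ['s1', 's1', 's2', 's2'],
    'weight': [1.0, 3.0, 2.0, 2.0],
    'x':      [10.0, 20.0, 5.0, 15.0],
})

# Weighted average of 'x' per spot:
# s1 -> (1*10 + 3*20) / 4 = 17.5, s2 -> (2*5 + 2*15) / 4 = 10.0
result = tot_weighting(df, 'weight', ['x'])
print(result['x'])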

Pandas random n samples of consecutive rows / pairs

I have a pandas DataFrame indexed by ID and sorted by value. I want to create a sample of n=20000 pairs (40000 rows in total), where the 2 rows of each pair are consecutive. I want to perform additional calculations on these 2 consecutive/paired rows.
e.g. if I say sample size n=2, I want to randomly pick pairs and find the difference in distance for each of the following picks.
Additional condition: the value difference within a pair can't exceed 4000.
index value distance
cg13869341 15865 1.635450
cg14008030 18827 4.161332
Then the distance of the following pair, etc.:
cg20826792 29425 0.657369
cg33045430 29407 1.708055
Sample original dataframe
index value distance
cg13869341 15865 1.635450
cg14008030 18827 4.161332
cg12045430 29407 0.708055
cg20826792 29425 0.657369
cg33045430 69407 1.708055
cg40826792 59425 0.857369
cg47454306 88407 0.708055
cg60826792 96425 2.857369
I tried using df_sample = df.sample(n=20000), then got a bit lost trying to figure out how to get the next row for each value in df_sample.
The original shape is (480136, 14).
If it doesn't matter to always have (even, odd) pairs (which reduces randomness a bit), you can select n even-positioned rows and take each together with the next one:
N = 20000

# get the positions of N random even-positioned rows
idx = df.loc[::2].sample(n=N).index
# create a boolean mask to identify the rows
m = df.index.to_series().isin(idx)
# select those OR the next ones
df_sample = df.loc[m | m.shift()]
Example output on the toy DataFrame (N=3):
index value distance
2 cg12045430 29407 0.708055
3 cg20826792 29425 0.657369
4 cg33045430 69407 1.708055
5 cg40826792 59425 0.857369
6 cg47454306 88407 0.708055
7 cg60826792 96425 2.857369
Increasing randomness
The drawback of the above approach is the bias towards always having (even, odd) pairs. To overcome this, we can first remove a random fraction of the DataFrame: small enough to still leave enough rows to pick from, but large enough to shift the (even, odd) pairs to (odd, even) at many locations. The fraction of rows to remove should be tuned depending on the initial size and the sampled size; I used 20-30% here:
N = 20000
frac = 0.2

idx = (df
       .drop(df.sample(frac=frac).index)
       .loc[::2].sample(n=N)
       .index
       )
m = df.index.to_series().isin(idx)
df_sample = df.loc[m | m.shift()]

# check:
# len(df_sample)
# 40000
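The value-difference constraint from the question isn't handled above; one possible post-filter (a sketch, not part of the original answer, assuming numpy is imported as np and the sampled pairs sit in adjacent rows of df_sample) could drop pairs whose within-pair difference exceeds 4000:
# Reshape the sampled values into (n_pairs, 2) and keep pairs whose
# within-pair value difference is at most 4000 (hypothetical post-filter).
vals = df_sample['value'].to_numpy().reshape(-1, 2)
keep = np.abs(vals[:, 1] - vals[:, 0]) <= 4000
df_sample = df_sample[np.repeat(keep, 2)]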
Here's my first attempt (I only just noticed your additional constraint, and I'm not sure if you need the precise number of samples; if you do, you'll have to do some fudging after the masking/filtering step below).
import random
import numpy as np

# Temporarily reset index so we have something we can add one to.
df = df.reset_index(level=0)

# Choose the first index of each pair.
# Use random.sample if you don't want repeats,
# or random.choices if you don't mind them.
# The code below does allow overlapping pairs such as (1, 2) and (2, 3).
# (4 pairs here for the toy example; use your sample size on the real data.)
first_indices = np.array(random.sample(sorted(df.index[:-1]), 4))

# Keep only those indices where the diff with the next row down is small enough.
mask = [abs(df.loc[i, "value"] - df.loc[i + 1, "value"]) <= 4000 for i in first_indices]
first_indices = first_indices[mask]

# Interleave this array with the same numbers, plus 1.
c = np.empty((first_indices.size * 2,), dtype=first_indices.dtype)
c[0::2] = first_indices
c[1::2] = first_indices + 1

# Filter
df_sample = df[df.index.isin(c)]

# Restore original index if required.
df = df.set_index("index")
Hope that helps. Regarding the step where I use a mask to filter the indices, this answer might be of help if you need faster alternatives: Filtering (reducing) a NumPy Array

Removing outliers from a single column

I am removing outliers from a dataset.
I decided to remove outliers from each column one by one. I have columns with different numbers of missing values.
I used this code, but it removed the whole row containing the outlier, and due to the many NaN values in my data, the number of rows was reduced drastically.
def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # Interquartile range
    fence_low = q1 - 1.5*iqr
    fence_high = q3 + 1.5*iqr
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out
Then I decided to remove the outliers from each column and replace them with NaN instead.
I wrote this code:
def remove_outlier(df_in, col_name, thres=1.5):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # Interquartile range
    fence_low = q1 - thres*iqr
    fence_high = q3 + thres*iqr
    mask = (df_in[col_name] > fence_high) & (df_in[col_name] < fence_low)
    df_in.loc[mask, col_name] = np.nan
    return df_in
But this code doesn't filter the outliers; it gives the same result as before.
What is wrong with this code? How can I correct it?
Is there any other elegant method to filter outliers?
Check the condition in your second function: it can never be true with &; it should be |. Also note what happens in your first function:
df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
In this snippet, you select rows based on df_in[col_name] > fence_low and df_in[col_name] < fence_high, so every time one of these conditions is not met, the whole row is removed.
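For reference, here is what the corrected second function looks like with | (a sketch applying the fix described above, assuming numpy is imported as np): a value is an outlier if it lies above the high fence or below the low fence, and only those values are set to NaN.
import numpy as np

def remove_outlier(df_in, col_name, thres=1.5):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # Interquartile range
    fence_low = q1 - thres * iqr
    fence_high = q3 + thres * iqr
    # A value is an outlier if it is ABOVE the high fence OR BELOW the low fence
    mask = (df_in[col_name] > fence_high) | (df_in[col_name] < fence_low)
    df_in.loc[mask, col_name] = np.nan
    return df_in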
As a general rule, if you have a column with 30% outliers, 30% of your dataset will disappear, and you have two options:
1. Fill the missing values (ffill, mean, a constant value, ...).
2. Or drop the feature, if it is not mandatory; sometimes it is better to drop a feature than to reduce your dataset too much.
Hope this helps.

downsample big numpy.ndarray on if condition in python

I have a large numpy.ndarray and I need to downsample this array based on the value of one column. My solution works, but it is very slow:
data_table = data_table[[i for i in range(0, len(data_table)) if data_table[i][7] > 0.2 and data_table[i][7] < 0.75]]
Does anybody know the fastest way to do this?
Use column slicing to select the relevant column, compare it against the thresholds in a vectorized manner to get a boolean mask of valid rows, and then index into the array with that mask to get the filtered output:
out = data_table[(data_table[:,7] > 0.2) & (data_table[:,7] < 0.75)]
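For example, on a small random array (a minimal sketch to show the mask in action; the shape and seed are arbitrary), column 7 is compared against both thresholds and only the rows where both comparisons hold are kept:
import numpy as np

rng = np.random.default_rng(0)
data_table = rng.random((10, 8))   # 10 rows, 8 columns; column index 7 is the last one

mask = (data_table[:, 7] > 0.2) & (data_table[:, 7] < 0.75)  # boolean array of shape (10,)
out = data_table[mask]             # keeps only the rows where the mask is True

print(mask)
print(out.shape)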

Pandas - expanding inverse quantile function

I have a dataframe of values:
df = pd.DataFrame(np.random.uniform(0,1,(500,2)), columns = ['a', 'b'])
>>> print df
a b
1 0.277438 0.042671
.. ... ...
499 0.570952 0.865869
[500 rows x 2 columns]
I want to transform this by replacing the values with their percentile, where the percentile is taken over the distribution of all values in prior rows. i.e., if you do df.T.unstack(), it would be a pure expanding sample. This might be more intuitive if you think of the index as a DatetimeIndex, and I'm asking to take the expanding percentile over the entire cross-sectional history.
So the goal is this guy:
a b
0 99 99
.. .. ..
499 58 84
(Ideally I'd like to take the distribution of a value over the set of all values in all rows before and including that row, so not exactly an expanding percentile; but if we can't get that, that's fine.)
I have one really ugly way of doing this, where I transpose and unstack the dataframe, generate a percentile mask, and overlay that mask on the dataframe using a for loop to get the percentiles:
percentile_boundaries_over_time = pd.DataFrame({integer:
    pd.expanding_quantile(df.T.unstack(), integer/100.0)
    for integer in range(0, 101, 1)})

percentile_mask = pd.Series(index=df.unstack().unstack().unstack().index)

for integer in range(0, 100, 1):
    percentile_mask[(df.unstack().unstack().unstack() >= percentile_boundaries_over_time[integer]) &
                    (df.unstack().unstack().unstack() <= percentile_boundaries_over_time[integer+1])] = integer
I've been trying to get something faster to work, using scipy.stats.percentileofscore() and pd.expanding_apply(), but it's not giving the correct output and I'm driving myself insane trying to figure out why. This is what I've been playing with:
perc = pd.expanding_apply(df, lambda x: stats.percentileofscore(x, x[-1], kind='weak'))
Does anyone have any thoughts on why this gives incorrect output? Or a faster way to do this whole exercise? Any and all help much appreciated!
As several other commenters have pointed out, computing percentiles for each row likely involves sorting the data each time. This will probably be the case for any current pre-packaged solution, including pd.DataFrame.rank or scipy.stats.percentileofscore. Repeatedly sorting is wasteful and computationally intensive, so we want a solution that minimizes that.
Taking a step back, finding the inverse quantile of a value relative to an existing data set is analogous to finding the position at which we would insert that value into the data set if it were sorted. The issue is that we also have an expanding set of data. Thankfully, some sorting algorithms are extremely fast at dealing with mostly-sorted data (and inserting a small number of unsorted elements). Hence our strategy is to maintain our own array of sorted data, and with each row iteration add the new values to our existing list and query their positions in the newly expanded sorted set. The latter operation is also fast given that the data is sorted.
I think insertion sort would be the fastest sort for this, but its performance will probably be slower in Python than any native NumPy sort. Merge sort seems to be the best of the available options in NumPy. An ideal solution would involve writing some Cython, but using our above strategy with NumPy gets us most of the way.
This is a hand-rolled solution:
def quantiles_by_row(df):
    """Reconstruct a DataFrame of expanding quantiles by row."""
    # Construct the skeleton of the DataFrame that we'll fill with quantile values
    quantile_df = pd.DataFrame(np.nan, index=df.index, columns=df.columns)
    # Pre-allocate numpy array. We only want to keep the non-NaN values from our DataFrame
    num_valid = np.sum(~np.isnan(df.values))
    sorted_array = np.empty(num_valid)
    # We want to maintain that sorted_array[:length] has data and is sorted
    length = 0
    # Iterate over ndarray rows
    for i, row_array in enumerate(df.values):
        # Extract non-NaN numpy array from row
        row_is_nan = np.isnan(row_array)
        add_array = row_array[~row_is_nan]
        # Add new data to our sorted_array and sort.
        new_length = length + len(add_array)
        sorted_array[length:new_length] = add_array
        length = new_length
        sorted_array[:length].sort(kind="mergesort")
        # Query the relative positions, divide by length to get quantiles
        quantile_row = np.searchsorted(sorted_array[:length], add_array, side="left").astype(float) / length
        # Insert values into quantile_df (boolean column selection via .iloc)
        quantile_df.iloc[i, ~row_is_nan] = quantile_row
    return quantile_df
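A quick way to exercise it on the toy DataFrame from the question (a sketch; the exact numbers will vary with the random data):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.uniform(0, 1, (500, 2)), columns=['a', 'b'])
quantile_df = quantiles_by_row(df)

# Each entry is the expanding quantile (0..1) of that value relative to all
# values in the current and preceding rows; multiply by 100 for percentiles.
print((quantile_df * 100).round().head())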
Based on the data that bhalperin provided (offline), this solution is up to 10x faster.
One final comment: np.searchsorted has options for 'left' and 'right' which determines whether you want your prospective inserted position to be the first or last suitable position possible. This matters if you have a lot of duplicates in your data. A more accurate version of the above solution will take the average of 'left' and 'right':
# Query the relative positions, divide to get quantiles
left_rank_row = np.searchsorted(sorted_array[:length], add_array, side="left")
right_rank_row = np.searchsorted(sorted_array[:length], add_array, side="right")
quantile_row = (left_rank_row + right_rank_row).astype(float) / (length * 2)
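For instance, if the sorted data so far is [1, 2, 2, 2, 3] and the new value is 2, side="left" returns 1 and side="right" returns 4, so the averaged quantile is (1 + 4) / (5 * 2) = 0.5 rather than 0.2 or 0.8.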
It's not quite clear yet, but do you want a cumulative sum divided by the total?
norm = 100.0/df.a.sum()
df['cum_a'] = df.a.cumsum()
df['cum_a'] = df.cum_a * norm
ditto for b
Here's an attempt to implement your 'percentile over the set of all values in all rows before and including that row' requirement. stats.percentileofscore seems to act up when given 2D data, so squeezing seems to help in getting correct results:
a_percentile = pd.Series(np.nan, index=df.index)
b_percentile = pd.Series(np.nan, index=df.index)

for current_index in df.index:
    preceding_rows = df.loc[:current_index, :]
    # Combine values from all columns into a single 1D array
    # * 2 should be * N if you have N columns
    combined = preceding_rows.values.reshape((1, len(preceding_rows) * 2)).squeeze()
    a_percentile[current_index] = stats.percentileofscore(
        combined,
        df.loc[current_index, 'a'],
        kind='weak'
    )
    b_percentile[current_index] = stats.percentileofscore(
        combined,
        df.loc[current_index, 'b'],
        kind='weak'
    )
