Mask python array based on multiple column indices - python

I have a 64*64 array, and would like to mask certain columns. For one column I know I can do:
mask = np.tri(64,k=0,dtype=bool)
col = np.zeros((64,64),bool)
col[:,3] = True
col_mask = col + np.transpose(col)
col_mask = np.tril(col_mask)
col_mask = col_mask[mask]
but how to extend this to multiple indices? I have tried doing col[:,1] & col[:,2] = True but got
SyntaxError: cannot assign to operator
Also, I might have up to 10 columns I would like to mask, so is there a less unwieldy approach? I have also looked at numpy.indices, but I don't think this is what I need. Thank you!

You can update multiple indices at the same time:
idx = [1,2,3,10,50]
col[:,idx] = True
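A minimal sketch of the whole pipeline from the question with a list of column indices, assuming a small 6*6 array instead of 64*64 so the result is easy to inspect (the size `n` and the values in `idx` are made up for illustration):

```python
import numpy as np

n = 6            # illustrative size; the question uses 64
idx = [1, 3]     # illustrative column indices to mask

# mark every listed column at once with list (fancy) indexing
col = np.zeros((n, n), dtype=bool)
col[:, idx] = True

# symmetrise, keep the lower triangle, then flatten it as in the question
col_mask = np.tril(col | col.T)
flat = col_mask[np.tri(n, k=0, dtype=bool)]
```

Here `col | col.T` plays the role of `col + np.transpose(col)` in the question; for boolean arrays both give an element-wise OR.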

Related

Selecting rows from 2 columns based on a list

I have a list which contains a list of indices. Now, I want to select two columns from a dataframe according to the indices.
I am trying:
indices = full_train_df.query("primary == primary").index
X = train_df[["A","B"]][:clean_df_indices].values
y = train_df["year"][:clean_df_indices].values
However, it says that none of them are in the index. What can I do to solve this error?
Use loc like this:
indices = full_train_df.query("primary == primary").index
X = train_df.loc[indices, ["A","B"]].values
y = train_df.loc[indices, "year"].values
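As a rough, self-contained illustration of selecting rows by an index object with loc, assuming train_df shares its row labels with full_train_df (the toy values below are invented):

```python
import pandas as pd

# toy stand-ins for the frames in the question (values are made up)
full_train_df = pd.DataFrame({
    "primary": [1.0, float("nan"), 3.0, 4.0],
    "A": [1, 2, 3, 4],
    "B": [5, 6, 7, 8],
    "year": [2001, 2002, 2003, 2004],
})
train_df = full_train_df[["A", "B", "year"]]

# "primary == primary" is False for NaN, so this keeps the non-missing rows
indices = full_train_df.query("primary == primary").index

# pass the index itself as the row selector
X = train_df.loc[indices, ["A", "B"]].to_numpy()
y = train_df.loc[indices, "year"].to_numpy()
```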

How can I speed up an iterative function on my large pandas dataframe?

I am quite new to pandas and I have a pandas dataframe of about 500,000 rows filled with numbers. I am using python 2.x and am currently defining and calling the method shown below on it. It sets a predicted value to be equal to the corresponding value in series 'B' if two adjacent values in series 'A' are the same. However, it is running extremely slowly: about 5 rows are output per second, and I want to find a way to accomplish the same result more quickly.
def myModel(df):
    A_series = df['A']
    B_series = df['B']
    seriesLength = A_series.size
    # Make a new empty column in the dataframe to hold the predicted values
    df['predicted_series'] = np.nan
    # Make a new empty column to store whether or not
    # the prediction matches B
    df['wrong_prediction'] = np.nan
    prev_B = B_series[0]
    for x in range(1, seriesLength):
        prev_A = A_series[x-1]
        prev_B = B_series[x-1]
        # set the predicted value to equal B if A has two equal values in a row
        if A_series[x] == prev_A:
            if df['predicted_series'][x] > 0:
                df['predicted_series'][x] = df['predicted_series'][x-1]
            else:
                df['predicted_series'][x] = B_series[x-1]
Is there a way to vectorize this or to just make it run faster? Under the current circumstances, it is projected to take many hours. Should it really be taking this long? It doesn't seem like 500,000 rows should be giving my program that much problem.
Something like this should work as you described:
df['predicted_series'] = np.where(A_series.shift() == A_series, B_series, df['predicted_series'])
df.loc[df.A.diff() == 0, 'predicted_series'] = df.B
This will get rid of the for loop and set predicted_series to the value of B when A is equal to previous A.
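A small sketch of the same idea on a toy frame, assuming integer columns A and B (the values are made up), to show which rows the mask selects:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 2, 2, 2, 3],
    "B": [10, 20, 30, 40, 50, 60],
})

# True where A equals the value in the previous row
same_as_prev = df["A"].shift() == df["A"]

# start from all-NaN predictions and fill only the matching rows
df["predicted_series"] = np.nan
df.loc[same_as_prev, "predicted_series"] = df["B"]
```

The assignment aligns df["B"] on the index, so each selected row receives the B value from that same row.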
edit:
Per your comment, change your initialization of predicted_series to be all NaN and then forward fill the values:
df['predicted_series'] = np.nan
df.loc[df.A.diff() == 0, 'predicted_series'] = df.B
df['predicted_series'] = df['predicted_series'].ffill()
For fastest speed, modifying ayhan's answer a bit will perform best:
df['predicted_series'] = np.where(df.A.shift() == df.A, df.B, df['predicted_series'].shift())
That will give you your forward filled values and run faster than my original recommendation.
Solution
df.loc[df.A == df.A.shift(), 'predicted_series'] = df.B.shift()

Python Pandas Compare 2 Large DataFrames of Text for Similarity

I have two large dataframes I want to compare. I want a comparison result capable of a column and / or row wise comparison of similarities by percent. This part is simple. However, I want to be able to make the comparison ignore differences based upon value criteria. A small example is below.
d1 = {'Sample':pd.Series([101,102,103]),
'Col1':pd.Series(['AA','--','BB']),
'Col2':pd.Series(['AB','AA','BB'])}
d2 = {'Sample':pd.Series([101,102,103]),
'Col1':pd.Series(['BB','AB','--']),
'Col2':pd.Series(['AB','AA','AB'])}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
df1 = df1.set_index('Sample')
df2 = df2.set_index('Sample')
comparison = df1.eq(df2)
# for column stats
comparison.sum(axis=0) / float(len(df1.index))
# for row stats
comparison.sum(axis=1) / float(len(df1.columns))
My problem is that when value1 = 'AA' and value2 = '--', I want them to be viewed as equal (so whenever one value is '--', the comparison should always be True), but otherwise perform a normal Boolean comparison. I need an efficient way to do this that doesn't involve excessive looping, as the datasets are quite large.
Below, I'm interpreting "when one is '--' basically always be true" to mean that any comparison against '--' (no matter what the other value is) should return True. In that case, you could use
mask = (df1=='--') | (df2=='--')
to find every location where either df1 or df2 is equal to '--' and then use
comparison |= mask
to update comparison. For example,
import itertools as IT
import numpy as np
import pandas as pd
np.random.seed(2015)
N = 10000
df1, df2 = [pd.DataFrame(
    np.random.choice(list(map(''.join, IT.product(list('ABC'), repeat=2))) + ['--'],
                     size=(N, 2)),
    columns=['Col1', 'Col2']) for i in range(2)]
comparison = df1.eq(df2)
mask = (df1=='--') | (df2=='--')
comparison |= mask
# for column stats
column_stats = comparison.sum(axis=0) / float(len(df1.index))
# for row stats
row_stats = comparison.sum(axis=1) / float(len(df1.columns))
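For a result that can be checked by hand, here is a sketch of the same masking applied to the three-row example from the question, assuming '--' should match anything; `mean` is used as a shorthand for dividing the sum by the length:

```python
import pandas as pd

samples = [101, 102, 103]
df1 = pd.DataFrame({"Col1": ["AA", "--", "BB"],
                    "Col2": ["AB", "AA", "BB"]}, index=samples)
df2 = pd.DataFrame({"Col1": ["BB", "AB", "--"],
                    "Col2": ["AB", "AA", "AB"]}, index=samples)

# plain element-wise equality
comparison = df1.eq(df2)

# treat any cell where either side is '--' as a match
wildcard = (df1 == "--") | (df2 == "--")
comparison = comparison | wildcard

column_stats = comparison.mean(axis=0)  # fraction of matching rows per column
row_stats = comparison.mean(axis=1)     # fraction of matching columns per row
```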
I think a list comprehension should be quite fast:
new_columns = []
for col in df1.columns:
    new_columns.append([x == y or x == '--' or y == '--' for x, y in zip(df1[col], df2[col])])
results = pd.DataFrame(new_columns).T
results.index = df1.index
This outputs the full true/false df.

Python & Numpy - create dynamic, arbitrary subsets of ndarray

I am looking for a general way to do this:
raw_data = np.array(somedata)
filterColumn1 = raw_data[:,1]
filterColumn2 = raw_data[:,3]
cartesian_product = itertools.product(np.unique(filterColumn1), np.unique(filterColumn2))
for val1, val2 in cartesian_product:
    fixed_mask = (filterColumn1 == val1) & (filterColumn2 == val2)
    subset = raw_data[fixed_mask]
I want to be able to use any amount of filterColumns. So what I want is this:
filterColumns = [filterColumn1, filterColumn2, ...]
uniqueValues = map(np.unique, filterColumns)
cartesian_product = itertools.product(*uniqueValues)
for combination in cartesian_product:
    variable_mask = ????
    subset = raw_data[variable_mask]
Is there a simple syntax to do what I want? Otherwise, should I try a different approach?
Edit: This seems to be working
cartesian_product = itertools.product(*uniqueValues)
for combination in cartesian_product:
    variable_mask = True
    for idx, fc in enumerate(filterColumns):
        variable_mask &= (fc == combination[idx])
    subset = raw_data[variable_mask]
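You could use numpy.all and broadcasting for this:
filter_matrix = np.array(filterColumns)  # shape (n_filters, n_rows)
combination_array = np.array(combination)  # shape (n_filters,)
# compare each filter row against its target value
bool_matrix = filter_matrix == combination_array[:, np.newaxis]
# keep the rows of raw_data where every filter matches
subset = raw_data[np.all(bool_matrix, axis=0)]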
There are, however, simpler ways of doing the same thing if your filters are within the matrix, notably through numpy argsort and numpy roll over an axis. First you roll the axes until your filter columns come first, then you sort on them and slice the array vertically to get the rest of the matrix.
In general, if a for loop can be avoided in Python, it is better to avoid it.
Update:
Here is the full code without a for loop:
import numpy as np
# select filtering indexes
filter_indexes = [1, 3]
# generate the test data
raw_data = np.random.randint(0, 4, size=(50,5))
# create a column that we would use for indexing
index_columns = raw_data[:, filter_indexes]
# sort the index columns in lexicographic order over all the indexing columns
argsorts = np.lexsort(index_columns.T)
# sort both the index and the data column
sorted_index = index_columns[argsorts, :]
sorted_data = raw_data[argsorts, :]
# in each indexing column, find if number in row and row-1 are identical
# then group to check if all numbers in corresponding positions in row and row-1 are identical
autocorrelation = np.all(sorted_index[1:, :] == sorted_index[:-1, :], axis=1)
# find out the breakpoints: these are the positions where row and row-1 are not identical
breakpoints = np.nonzero(np.logical_not(autocorrelation))[0]+1
# finally find the desired subsets
subsets = np.split(sorted_data, breakpoints)
An alternative implementation would be to transform the indexing matrix into a string matrix, sum row-wise, get an argsort over the now unique indexing column and split as above.
For convenience, it might be clearer to first roll the indexing columns to the beginning of the matrix, so that the sorting done above is easier to follow.
Something like this?
variable_mask = np.ones_like(filterColumns[0], dtype=bool) # select all rows initially
for column, val in zip(filterColumns, combination):
    variable_mask &= (column == val)
subset = raw_data[variable_mask]
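The loop-based approach above can be sketched end to end as follows, assuming a tiny hand-written raw_data array and using np.logical_and.reduce to combine one boolean mask per filter column (the `subsets` dictionary is only an illustrative way to collect the results):

```python
import itertools
import numpy as np

raw_data = np.array([[0, 1, 9, 2],
                     [1, 1, 8, 2],
                     [2, 0, 7, 3],
                     [3, 1, 6, 3]])
filter_indexes = [1, 3]  # any number of columns can be listed here

filter_columns = [raw_data[:, i] for i in filter_indexes]
unique_values = [np.unique(c) for c in filter_columns]

subsets = {}
for combination in itertools.product(*unique_values):
    # one boolean mask per filter column, AND-ed together
    mask = np.logical_and.reduce(
        [col == val for col, val in zip(filter_columns, combination)])
    if mask.any():  # skip combinations that never occur
        subsets[tuple(int(v) for v in combination)] = raw_data[mask]
```

Starting from an explicitly boolean mask matters: an integer array of ones would be interpreted as row indices rather than as a mask when used to index raw_data.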

Complex Filtering of DataFrame

I've just started working with Pandas and I am trying to figure if it is the right tool for my problem.
I have a dataset:
date, sourceid, destid, h1..h12
I am basically interested in the sum of each of the H1..H12 columns, but, I need to exclude multiple ranges from the dataset.
Examples would be to:
exclude H4, H5, H6 data where sourceid = 4944 and exclude H8, H9-H12 where destination = 481981 and ...
... this can go on for many, many filters as we are constantly removing data to get close to our final model.
I think I saw in a solution that I could build a list of the filters I would want and then create a function to test against, but I haven't found a good example to work from.
My initial thought was to create a copy of the df and just remove the data we didn't want and if we need it back - we could just copy it back in from the origin df, but that seems like the wrong road.
By using masks, you don't have to remove data from the dataframe. E.g.:
mask1 = df.sourceid == 4944
var1 = df[mask1][['H4','H5','H6']].sum()
Or directly do:
var1 = df[df.sourceid == 4944][['H4','H5','H6']].sum()
In case of multiple filters, you can combine the Boolean masks with Boolean operators:
totmask = mask1 & mask2
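One possible way to keep a growing list of exclusions without touching the original data is sketched below. The `filters` list, the `excluded` frame, and the toy values are all assumptions made for illustration; only two of the h columns are shown:

```python
import pandas as pd

df = pd.DataFrame({
    "sourceid": [4944, 4944, 1, 2],
    "destid": [481981, 7, 481981, 7],
    "h4": [1.0, 2.0, 3.0, 4.0],
    "h8": [10.0, 20.0, 30.0, 40.0],
})
value_cols = ["h4", "h8"]

# hypothetical filter list: (row mask, columns to exclude for those rows)
filters = [
    (df["sourceid"] == 4944, ["h4"]),
    (df["destid"] == 481981, ["h8"]),
]

# boolean frame marking every excluded cell
excluded = pd.DataFrame(False, index=df.index, columns=value_cols)
for row_mask, cols in filters:
    excluded.loc[row_mask, cols] = True

# mask() turns excluded cells into NaN, which sum() skips by default
totals = df[value_cols].mask(excluded).sum()
```

Because df itself is never modified, dropping a filter from the list is enough to bring the excluded data back.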
You can use DataFrame.loc[] to set the data to zeros.
Create a dummy DataFrame first:
N = 10000
df = pd.DataFrame(np.random.rand(N, 12), columns=["h%d" % i for i in range(1, 13)], index=["row%d" % i for i in range(1, N+1)])
df["sourceid"] = np.random.randint(0, 50, N)
df["destid"] = np.random.randint(0, 50, N)
Then for each of your filters you can call:
df.loc[df.sourceid == 10, "h4":"h6"] = 0
Since you have 600k rows, creating a mask array with df.sourceid == 10 may be slow. You can create Series objects that map each value to the index of the DataFrame:
sourceid = pd.Series(df.index.values, index=df["sourceid"].values).sort_index()
destid = pd.Series(df.index.values, index=df["destid"].values).sort_index()
and then exclude h4, h5, h6 where sourceid == 10 by:
df.loc[sourceid[10], "h4":"h6"] = 0
To find row ids where sourceid == 10 and destid == 20:
np.intersect1d(sourceid[10].values, destid[20].values, assume_unique=True)
To find row ids where 10 <= sourceid <= 12 and 3 <= destid <= 5:
np.intersect1d(sourceid.loc[10:12].values, destid.loc[3:5].values, assume_unique=True)
sourceid and destid are Series with duplicated index values. When the index values are in order, Pandas uses searchsorted to find the index, which is O(log N) and faster than creating mask arrays, which is O(N).
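A small sketch of the lookup-Series idea, assuming integer row labels and made-up values. It uses label slices such as loc[10:10] so that the lookup always returns a Series, even when only one row matches:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sourceid": [10, 11, 10, 12],
    "destid": [20, 20, 3, 20],
    "h4": [1.0, 2.0, 3.0, 4.0],
}, index=[100, 101, 102, 103])

# map each sourceid / destid value to the row labels that carry it
sourceid = pd.Series(df.index.to_numpy(), index=df["sourceid"].to_numpy()).sort_index()
destid = pd.Series(df.index.to_numpy(), index=df["destid"].to_numpy()).sort_index()

# row labels where sourceid == 10 and destid == 20
rows = np.intersect1d(sourceid.loc[10:10].to_numpy(),
                      destid.loc[20:20].to_numpy())

# zero out the excluded column for those rows only
df.loc[rows, "h4"] = 0
```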
