Complex Filtering of DataFrame - python

I've just started working with Pandas and I am trying to figure out whether it is the right tool for my problem.
I have a dataset:
date, sourceid, destid, h1..h12
I am basically interested in the sum of each of the H1..H12 columns, but, I need to exclude multiple ranges from the dataset.
Examples would be to:
exclude H4, H5, H6 data where sourceid = 4944, and exclude H8, H9-H12 where destination = 481981, and ...
... this can go on for many, many filters, as we are constantly removing data to get close to our final model.
I think I saw in a solution that I could build a list of the filters I would want and then create a function to test against, but I haven't found a good example to work from.
My initial thought was to create a copy of the df and just remove the data we didn't want, and if we needed it back we could just copy it back in from the original df, but that seems like the wrong road.

By using masks, you don't have to remove data from the dataframe. E.g.:
mask1 = df.sourceid == 4944
var1 = df.loc[mask1, ['H4', 'H5', 'H6']].sum()
Or directly do:
var1 = df.loc[df.sourceid == 4944, ['H4', 'H5', 'H6']].sum()
In case of multiple filters, you can combine the Boolean masks with Boolean operators:
totmask = mask1 & mask2
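A minimal sketch of the "list of filters" idea from the question, built on masks. The toy data, the `filters` list and the `keep` frame are hypothetical names chosen for illustration; only the column names follow the question. Each filter pairs a row mask with the columns to exclude for those rows, and the original df is never modified:

```python
import pandas as pd

# Hypothetical toy data; only two of the h-columns are shown.
df = pd.DataFrame({
    "sourceid": [4944, 1, 4944, 2],
    "destid": [7, 481981, 481981, 9],
    "h4": [1.0, 2.0, 3.0, 4.0],
    "h8": [10.0, 20.0, 30.0, 40.0],
})
hcols = ["h4", "h8"]

# Each filter is (row mask, columns to exclude for the matching rows).
filters = [
    (df["sourceid"] == 4944, ["h4"]),
    (df["destid"] == 481981, ["h8"]),
]

# Start by keeping every cell, then switch off the excluded cells.
keep = pd.DataFrame(True, index=df.index, columns=hcols)
for row_mask, cols in filters:
    keep.loc[row_mask, cols] = False

# where() turns excluded cells into NaN, which sum() skips by default.
totals = df[hcols].where(keep).sum()
print(totals)
```

Adding or dropping a filter only means editing the list, so nothing has to be copied back in from the original data.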

You can use DataFrame.loc[] to set the data to zeros (DataFrame.ix[] did the same job in older pandas, but it was removed in pandas 1.0).
Create a dummy DataFrame first:
N = 10000
df = pd.DataFrame(np.random.rand(N, 12), columns=["h%d" % i for i in range(1, 13)], index=["row%d" % i for i in range(1, N+1)])
df["sourceid"] = np.random.randint(0, 50, N)
df["destid"] = np.random.randint(0, 50, N)
Then for each of your filters you can call:
df.loc[df.sourceid == 10, "h4":"h6"] = 0
Since you have 600k rows, creating a mask array with df.sourceid == 10 may be slow. You can instead create Series objects that map each value to the index labels of the DataFrame:
sourceid = pd.Series(df.index.values, index=df["sourceid"].values).sort_index()
destid = pd.Series(df.index.values, index=df["destid"].values).sort_index()
and then exclude h4, h5, h6 where sourceid == 10 by:
df.loc[sourceid[10].values, "h4":"h6"] = 0
To find the row ids where sourceid == 10 and destid == 20:
np.intersect1d(sourceid[10].values, destid[20].values, assume_unique=True)
To find the row ids where 10 <= sourceid <= 12 and 3 <= destid <= 5:
np.intersect1d(sourceid.loc[10:12].values, destid.loc[3:5].values, assume_unique=True)
sourceid and destid are Series with duplicated index values. When the index values are sorted, pandas uses searchsorted to find the matching entries, which is O(log N) and faster than creating mask arrays, which is O(N).
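To make the binary-search idea concrete, here is a rough sketch that calls np.searchsorted directly on sorted keys rather than going through a pandas index. The variable names (`keys`, `order`, `sorted_keys`, `row_positions`) and the seeded toy data are assumptions for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N = 1000
df = pd.DataFrame({"sourceid": rng.integers(0, 50, N)})

keys = df["sourceid"].to_numpy()
order = np.argsort(keys, kind="stable")  # row positions ordered by sourceid
sorted_keys = keys[order]

# Two binary searches bound the block of rows with 10 <= sourceid <= 12.
lo = np.searchsorted(sorted_keys, 10, side="left")
hi = np.searchsorted(sorted_keys, 12, side="right")
row_positions = order[lo:hi]
print(len(row_positions))
```

The sort is paid once up front; every later range lookup costs only the two binary searches plus the slice.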


np.where() computes np.random.choice() only once - pandas

I have this dataframe:
N = 10000
N_Seg = 100
df = pd.DataFrame({"Rut_Num": range(1,N+1),
"Segmento": np.random.choice(
["Afluente", "Afluente","Premium", "Preferente", "Preferente", "Preferente", "Preferente", "Clásico", "Clásico", "Clásico", "Clásico", "Clásico", "Clásico"], N),
"If_Seguro": np.random.choice([0,1,1], N)})
   Rut_Num    Segmento  If_Seguro
0        1     Clásico          1
1        2  Preferente          0
2        3    Afluente          0
3        4  Preferente          0
4        5     Clásico          1
When the column If_Seguro is 1, I need a random number between 1 and N_Seg; if it's 0, I need a 0:
df.loc[:,"id_Seguro"] = np.where(df["If_Seguro"] == 1, np.random.choice(range(1,N_Seg+1),1),0)
As you can see, the true branch of np.where() gives the same number for every 1, whereas I need a separate random number for each 1 in If_Seguro.
Besides, why does np.where() compute np.random.choice() only once for the whole column instead of once for each row?
The expression np.where(df["If_Seguro"] == 1, np.random.choice(range(1,N_Seg+1),1),0) shows what is in my opinion a frequently encountered, but generally undesirable use of where. The solution will also answer your question as to why only one value is being generated.
np.where does not compute much. It just selects values based on a mask from a pair of existing arrays. Normal Python semantics don't change here. You are passing in the result of a function call, not the function itself, so it's the value that is used. This means that, to use np.where, you would need to compute np.random.choice(...) for all of the rows of df, not just the ones where df["If_Seguro"] == 1.
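As a small illustration of that point, the sketch below hands np.where a full array of N candidate values, so each row gets its own draw. The seeded generator `rng` and the name `candidates` are assumptions for the example, not part of the original code:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
N = 1000
N_Seg = 100
df = pd.DataFrame({"If_Seguro": rng.choice([0, 1, 1], N)})

# One candidate per row, drawn before np.where ever runs.
candidates = rng.integers(1, N_Seg + 1, size=N)

# np.where only picks, element by element, between candidates and 0.
df["id_Seguro"] = np.where(df["If_Seguro"] == 1, candidates, 0)
print(df.head())
```

This works, but it draws N values even though only the rows with If_Seguro == 1 end up using them.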
df["If_Seguro"] is a mask, and numpy provides you with some tools for working with masks. For example, the actual number of elements you want to generate is
np.count_nonzero(df["If_Seguro"])
The row locations where you want to insert those values are given by the mask itself. Both numpy and pandas allow you to index with a boolean mask directly. np.where is just an extra layer of inefficiency in many cases.
Finally, to generate N samples from an existing sequence, do either:
np.random.choice(range(1, N_Seg + 1), size=N, replace=True)
replace=True allows the samples to repeat, as your original call to np.where likely intended. A much better way to do the same thing does not involve an explicit sequence object:
np.random.randint(1, N_Seg + 1, N)
In the proposed solution, the sample count will be the number of masked elements, whereas with your original np.where code it should have been N.
So finally we have:
mask = df["If_Seguro"].astype(bool)  # boolean mask; raw 0/1 integers would not work as a mask in .loc or with ~
df.loc[mask, "id_Seguro"] = np.random.randint(1, 1 + N_Seg, np.count_nonzero(mask))
If id_Seguro is not already zeroed out to start with, you can do one of a couple of things. Adding on to the previous:
df.loc[~mask, "id_Seguro"] = 0
Or generating a new array from scratch:
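mask = df["If_Seguro"].to_numpy(dtype=bool)  # boolean array, so result[mask] selects by mask rather than by position
result = np.zeros(N, dtype=int)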
result[mask] = np.random.randint(1, 1 + N_Seg, np.count_nonzero(mask))
df["id_Seguro"] = result

Looping over rows in Pandas dataframe taking too long

I have been running this code for about 45 minutes now and it is still going. Can someone please suggest how I can make it faster?
df4 is a pandas DataFrame. It looks like this:
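df4 = pd.DataFrame({
    'sentiment_score': np.random.choice([0, 1], 3000000),
    'user_id': np.random.choice(['11', '12', '13'], 3000000),
    # 'track_id' is used below but is missing from this excerpt; these values are placeholders
    'track_id': np.random.choice(['a', 'b', 'c'], 3000000),
})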
What I am aiming to have is a new column called rating.
len(df4.index) is 3,037,321.
ratings = []
for index in df4.index:
    rowUserID = df4['user_id'][index]
    rowTrackID = df4['track_id'][index]
    rowSentimentScore = df4['sentiment_score'][index]
    condition = ((df4['user_id'] == rowUserID) & (df4['sentiment_score'] == rowSentimentScore))
    allRows = df4[condition]
    totalSongListendForContext = len(allRows.index)
    rows = df4[(condition & (df4['track_id'] == rowTrackID))]
    songListendForContext = len(rows.index)
    rating = songListendForContext/totalSongListendForContext
    ratings.append(rating)
Overall, you'll need groupby. You can either:
use two groupby with transform to get the size of what you called condition and the size of the condition & (df4['track_id'] == rowTrackID), divide the second by the first:
df4['ratings'] = (df4.groupby(['user_id', 'sentiment_score','track_id'])['track_id'].transform('size')
/ df4.groupby(['user_id', 'sentiment_score'])['track_id'].transform('size'))
Or use groupby with value_counts with the parameter normalize=True and merge the result with df4:
df4 = df4.merge(df4.groupby(['user_id', 'sentiment_score'])['track_id']
                .value_counts(normalize=True).rename('ratings').reset_index(), how='left')
In both cases, you will get the same result as your list ratings (which I assume you wanted as a column). I would say the second option is faster, but it depends on the number of groups you have in your real case.
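A tiny worked example of the two-groupby approach, using a hypothetical four-row frame so the numbers can be checked by hand. The intermediate names `group_size` and `track_size` are introduced here only for readability:

```python
import pandas as pd

# Hypothetical miniature of df4.
df4 = pd.DataFrame({
    "user_id": ["11", "11", "11", "12"],
    "sentiment_score": [1, 1, 1, 0],
    "track_id": ["a", "a", "b", "a"],
})

# Rows sharing the same (user, sentiment) context.
group_size = df4.groupby(["user_id", "sentiment_score"])["track_id"].transform("size")
# Rows sharing the same (user, sentiment, track).
track_size = df4.groupby(["user_id", "sentiment_score", "track_id"])["track_id"].transform("size")

df4["ratings"] = track_size / group_size
print(df4)
```

User "11" with sentiment 1 listened to track "a" twice out of three plays, so those two rows get 2/3 and the "b" row gets 1/3. Each groupby scans the frame once, instead of once per row as in the loop.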

Why isn't there order invariance between the two sets of operations?

I'm handling a CSV file/pandas dataframe, where the first column contains the date.
I want to do some conversion here to datetime, some filtering, sorting and reindexing.
What I experience is that if I change the order of the sets of operations, I get different results (the result of the first configuration is bigger than the other one). Probably the first one is the "good" one.
Can anyone tell me, which sub-operations cause the difference between the results?
Which of those is the "bad" and which is the "good" solution?
Is it possible to secure order independence, so that the user can call those two methods in any order and still get the good results? (Is it possible to get the good results by implementing interchangeable sets of operations?)
jdf1 = x.copy(deep=True)
jdf2 = x.copy(deep=True)
interval = [DATE_START, DATE_END]
dateColName = "Date"
Configuration 1:
# Operation set 1: dropping duplicates, sorting and reindexing the table
jdf1.drop_duplicates(subset=dateColName, inplace=True)
jdf1.sort_values(dateColName, inplace=True)
jdf1.reset_index(drop=True, inplace=True)
# Operation set 2: converting column type and filtering the rows in case the CSV's contents cover a wider interval
jdf1[dateColName] = pd.to_datetime(jdf1[jdf1.columns[0]], format="%Y-%m-%d")
maskL = jdf1[dateColName] < interval[0]
maskR = jdf1[dateColName] > interval[1]
mask = maskL | maskR
jdf1.drop(jdf1[mask].index, inplace=True)
Configuration 2:
# Operation set 2: converting column type and filtering the rows in case the CSV's contents cover a wider interval
jdf2[dateColName] = pd.to_datetime(jdf2[jdf2.columns[0]], format="%Y-%m-%d")
maskL = jdf2[dateColName] < interval[0]
maskR = jdf2[dateColName] > interval[1]
mask = maskL | maskR
jdf2.drop(jdf2[mask].index, inplace=True)
# Operation set 1: dropping duplicates, sorting and reindexing the table
jdf2.drop_duplicates(subset=dateColName, inplace=True)
jdf2.sort_values(dateColName, inplace=True)
jdf2.reset_index(drop=True, inplace=True)
val1 = set(jdf1["Date"].values)
val2 = set(jdf2["Date"].values)
# bigger:
val1 - val2
# empty:
val2 - val1
Thank you for your help!
At first glance they look the same, but they are NOT.
There are two different ways of filtering here, and they affect each other:
Configuration 1:
drop_duplicates() -> removes M rows, leaving ALL - M rows
boolean indexing with mask -> removes N rows, leaving ALL - M - N rows
Configuration 2:
boolean indexing with mask -> removes K rows, leaving ALL - K rows
drop_duplicates() -> removes L rows, leaving ALL - K - L rows
K != M
L != N
If you swap these operations, the result can be different, because both of them remove rows. The order of calling them matters because some rows are removed only by drop_duplicates and some rows only by boolean indexing.
In my opinion both methods are right; it depends on what you need.
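A minimal sketch of how two row-removing steps can fail to commute. It assumes a hypothetical frame in which the filter looks at a column ("value") other than the one used for de-duplication, so it illustrates the general mechanism rather than reproducing the data from the question:

```python
import pandas as pd

# Hypothetical data: two rows share a date but differ in "value".
df = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "value": [5, 50, 7],
})

# Order A: de-duplicate first (keeps the first row per date), then filter.
a = df.drop_duplicates(subset="Date")
a = a[a["value"] > 10]

# Order B: filter first, then de-duplicate.
b = df[df["value"] > 10]
b = b.drop_duplicates(subset="Date")

print(len(a), len(b))
```

In order A the surviving duplicate is the row that the filter then rejects, while in order B the filter has already removed that row before drop_duplicates decides which one to keep.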

How can I speed up an iterative function on my large pandas dataframe?

I am quite new to pandas and I have a pandas dataframe of about 500,000 rows filled with numbers. I am using python 2.x and am currently defining and calling the method shown below on it. It sets a predicted value to be equal to the corresponding value in series 'B' if two adjacent values in series 'A' are the same. However, it is running extremely slowly (about 5 rows are output per second), and I want to find a way to accomplish the same result more quickly.
def myModel(df):
    A_series = df['A']
    B_series = df['B']
    seriesLength = A_series.size
    # Make a new empty column in the dataframe to hold the predicted values
    df['predicted_series'] = np.nan
    # Make a new empty column to store whether or not
    # the prediction matches B
    df['wrong_prediction'] = np.nan
    prev_B = B_series[0]
    for x in range(1, seriesLength):
        prev_A = A_series[x-1]
        prev_B = B_series[x-1]
        # set the predicted value to equal B if A has two equal values in a row
        if A_series[x] == prev_A:
            if df['predicted_series'][x] > 0:
                df['predicted_series'][x] = df['predicted_series'][x-1]
            df['predicted_series'][x] = B_series[x-1]
Is there a way to vectorize this or to just make it run faster? Under the current circumstances, it is projected to take many hours. Should it really be taking this long? It doesn't seem like 500,000 rows should be giving my program that much problem.
Something like this should work as you described:
df['predicted_series'] = np.where(A_series.shift() == A_series, B_series, df['predicted_series'])
df.loc[df.A.diff() == 0, 'predicted_series'] = df.B
This will get rid of the for loop and set predicted_series to the value of B when A is equal to previous A.
Per your comment, change your initialization of predicted_series to be all NaN and then forward fill the values:
df['predicted_series'] = np.nan
df.loc[df.A.diff() == 0, 'predicted_series'] = df.B
df.predicted_series = df.predicted_series.ffill()
For the fastest speed, modifying ayhan's answer a bit will perform best:
df['predicted_series'] = np.where(df.A.shift() == df.A, df.B, df['predicted_series'].shift())
That will give you your forward-filled values and run faster than my original recommendation.
df.loc[df.A == df.A.shift(), 'predicted_series'] = df.B.shift()
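For reference, a small end-to-end sketch of the shift-based approach on hypothetical six-row data, so the result can be checked by hand. The name `same_as_prev` is introduced here for readability, and the prediction is taken from the previous row of B as in the original loop:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"A": [1, 1, 2, 2, 2, 3],
                   "B": [10, 20, 30, 40, 50, 60]})

# True where A equals the value in the previous row.
same_as_prev = df["A"] == df["A"].shift()

df["predicted_series"] = np.nan
# Where A repeats, predict the previous row's B.
df.loc[same_as_prev, "predicted_series"] = df["B"].shift()
# Carry the last prediction forward over the remaining rows.
df["predicted_series"] = df["predicted_series"].ffill()
print(df)
```

The first row stays NaN because there is nothing before it to compare against or to fill from.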

Python & Numpy - create dynamic, arbitrary subsets of ndarray

I am looking for a general way to do this:
raw_data = np.array(somedata)
filterColumn1 = raw_data[:,1]
filterColumn2 = raw_data[:,3]
cartesian_product = itertools.product(np.unique(filterColumn1), np.unique(filterColumn2))
for val1, val2 in cartesian_product:
    fixed_mask = (filterColumn1 == val1) & (filterColumn2 == val2)
    subset = raw_data[fixed_mask]
I want to be able to use any amount of filterColumns. So what I want is this:
filterColumns = [filterColumn1, filterColumn2, ...]
uniqueValues = map(np.unique, filterColumns)
cartesian_product = itertools.product(*uniqueValues)
for combination in cartesian_product:
    variable_mask = ????
    subset = raw_data[variable_mask]
Is there a simple syntax to do what I want? Otherwise, should I try a different approach?
Edit: This seems to be working
cartesian_product = itertools.product(*uniqueValues)
for combination in cartesian_product:
    variable_mask = True
    for idx, fc in enumerate(filterColumns):
        variable_mask &= (fc == combination[idx])
    subset = raw_data[variable_mask]
You could use numpy.all and broadcasting for this:
filter_matrix = np.array(filterColumns)  # shape (n_filters, n_rows)
combination_array = np.array(combination)  # shape (n_filters,)
bool_matrix = filter_matrix == combination_array[:, np.newaxis]  # compare each filter row with its own value
subset = raw_data[np.all(bool_matrix, axis=0)]  # keep the rows where every filter matches
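Put together with the cartesian product from the question, the broadcasting idea might look like the sketch below. The seeded test data and the `subsets` dictionary are assumptions added for illustration:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
raw_data = rng.integers(0, 3, size=(30, 5))
filter_indexes = [1, 3]

# One row per filter column: shape (n_filters, n_rows).
filter_matrix = raw_data[:, filter_indexes].T
unique_values = [np.unique(row) for row in filter_matrix]

subsets = {}
for combination in itertools.product(*unique_values):
    # Shape (n_filters, 1) broadcasts against (n_filters, n_rows).
    wanted = np.array(combination)[:, np.newaxis]
    row_mask = np.all(filter_matrix == wanted, axis=0)
    subsets[combination] = raw_data[row_mask]

print({key: len(rows) for key, rows in subsets.items()})
```

The inner loop over filter columns is gone, although the outer loop over combinations remains.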
There are, however, simpler ways of doing the same thing if your filters are within the matrix, notably through numpy argsort and numpy roll over an axis. First you roll the axes until your filters are ordered as the first columns, then you sort on them and slice the array vertically to get the rest of the matrix.
In general, if a for loop can be avoided in Python, it is better to avoid it.
Here is the full code without a for loop:
import numpy as np
# select filtering indexes
filter_indexes = [1, 3]
# generate the test data
raw_data = np.random.randint(0, 4, size=(50,5))
# create a column that we would use for indexing
index_columns = raw_data[:, filter_indexes]
# sort the index columns in lexicographic order over all the indexing columns
argsorts = np.lexsort(index_columns.T)
# sort both the index and the data column
sorted_index = index_columns[argsorts, :]
sorted_data = raw_data[argsorts, :]
# in each indexing column, find if number in row and row-1 are identical
# then group to check if all numbers in corresponding positions in row and row-1 are identical
autocorrelation = np.all(sorted_index[1:, :] == sorted_index[:-1, :], axis=1)
# find out the breakpoints: these are the positions where row and row-1 are not identical
breakpoints = np.nonzero(np.logical_not(autocorrelation))[0]+1
# finally find the desired subsets
subsets = np.split(sorted_data, breakpoints)
An alternative implementation would be to transform the indexing matrix into a string matrix, sum row-wise, get an argsort over the now unique indexing column and split as above.
For convenience, it might be more interesting to first roll the indexing columns until they are all at the beginning of the matrix, so that the sorting done above is clear.
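One possible sketch of the "collapse the indexing columns into a single key" idea. Note that it swaps the string conversion for np.unique with axis=0 and return_inverse=True, which assigns each distinct row of the indexing matrix an integer group id; the names `unique_rows`, `inverse`, `order` and `counts` are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
raw_data = rng.integers(0, 4, size=(50, 5))
filter_indexes = [1, 3]
index_columns = raw_data[:, filter_indexes]

# Each distinct row of index_columns gets one integer group id.
unique_rows, inverse = np.unique(index_columns, axis=0, return_inverse=True)
inverse = inverse.ravel()  # flatten in case a 2-D inverse is returned

# Sort the rows by group id, then cut at the group boundaries.
order = np.argsort(inverse, kind="stable")
counts = np.bincount(inverse, minlength=len(unique_rows))
subsets = np.split(raw_data[order], np.cumsum(counts)[:-1])

print(len(subsets))
```

Here subsets[i] holds the rows whose filter columns equal unique_rows[i], so the key of each subset is available without inspecting the subset itself.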
Something like this?
variable_mask = np.ones_like(filterColumns[0], dtype=bool)  # select all rows initially; must be boolean to act as a mask
for column, val in zip(filterColumns, combination):
    variable_mask &= (column == val)
subset = raw_data[variable_mask]
