Filter by columns combination in pandas - python

I have two DataFrames, "A" and "B". Each has two columns, "key1" and "key2", and a unique key is the combination of the two. I want to select from the second DataFrame all rows whose ("key1", "key2") combination also appears in DataFrame "A".
Simple example:
import numpy as np
import pandas as pd

A = pd.DataFrame({'a': list(range(20000)) * 100,
                  'b': np.repeat(list(range(100)), 20000)})
B = pd.DataFrame({'a': list(range(40000)) * 100,
                  'b': np.repeat(list(range(100)), 40000),
                  'c': np.random.randint(4000000, size=4000000)})
Solution 1:
%%time
A['marker'] = True
C = B.merge(A, on=['a','b'], how='inner').drop('marker', axis=1)
1.26 s
Solution 2:
%%time
A['marker'] = A['a'].astype(str) + '_' + A['b'].astype(str)
B['marker'] = B['a'].astype(str) + '_' + B['b'].astype(str)
C = B[B.marker.isin(A.marker)]
20.4 s
This works, but is there a more elegant (and fast) solution?

You could try taking a look at pd.MultiIndex and using multi-level indices instead of plain/meaningless integer ones. Not sure if it would be a lot faster in the real data, but modifying your example data slightly:
index1 = pd.MultiIndex.from_arrays([list(range(20000)) * 100, np.repeat(list(range(100)), 20000)])  # former A
index2 = pd.MultiIndex.from_arrays([list(range(40000)) * 100, np.repeat(list(range(100)), 40000)])  # index of B[['a', 'b']]
s = pd.Series(np.random.randint(4000000, size = 4000000), index=index2) #former B['c']
In [93]: %timeit c = s[index1]
1 loops, best of 3: 803 ms per loop
Indexing s with an index (index1) different from its own (index2) is roughly equivalent to your merge operation.
Operations on the index usually tend to be faster than those performed on regular DataFrame columns. Either way, you are probably looking at a marginal improvement here; I don't think you can get this done on the microsecond scale.
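Another option that may be worth benchmarking (not timed here; it assumes a pandas version that has MultiIndex.from_frame, 0.24+) is to build a MultiIndex from the key columns of both frames and filter B with Index.isin:
idx_A = pd.MultiIndex.from_frame(A[['a', 'b']])
idx_B = pd.MultiIndex.from_frame(B[['a', 'b']])
C = B[idx_B.isin(idx_A)]  # keep only rows of B whose (a, b) pair appears in A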

Related

Rownumber without groupby Pandas

I have three columns: id (unique), value, time.
I want to create a new column that gives a simple row_number without any partitioning.
I tried: df['test'] = df.groupby('id_col').cumcount()+1
But the output is all ones.
I expect to get values from 1 to the length of the dataframe.
Also, is there a way to do it in numpy for better performance?
If your index is already ordered starting from 0:
df["row_num"] = df.index + 1
Otherwise:
df["row_num"] = df.reset_index().index + 1
Comparing times with %%timeit, from fastest to slowest: @Scott Boston's method > @Henry Ecker's method > mine
df["row_num"] = range(1,len(df)+1)
Alternative:
df.insert(0, "row_num", range(1,len(df)+1))
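Since the question also asks about numpy, a minimal equivalent using np.arange (the column name row_num is just carried over from above) would be:
import numpy as np
df["row_num"] = np.arange(1, len(df) + 1)  # 1-based row numbers, independent of the index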

Remove first x number of characters from each row in a column of a Python dataframe

I have a pandas dataframe with about 1,500 rows and 15 columns. For one specific column, I would like to remove the first 3 characters of each row. As a simple example here is a dataframe:
import pandas as pd
d = {
    'Report Number': ['8761234567', '8679876543', '8994434555'],
    'Name': ['George', 'Bill', 'Sally']
}
d = pd.DataFrame(d)
I would like to remove the first three characters from each field in the Report Number column of dataframe d.
Use vectorised str methods to slice each string entry
In [11]:
d['Report Number'] = d['Report Number'].str[3:]
d
Out[11]:
Name Report Number
0 George 1234567
1 Bill 9876543
2 Sally 4434555
It is worth noting Pandas "vectorised" str methods are no more than Python-level loops.
Assuming clean data, you will often find a list comprehension more efficient:
# Python 3.6.0, Pandas 0.19.2
d = pd.concat([d]*10000, ignore_index=True)
%timeit d['Report Number'].str[3:] # 12.1 ms per loop
%timeit [i[3:] for i in d['Report Number']] # 5.78 ms per loop
Note these aren't equivalent, since the list comprehension does not deal with null data and other edge cases. For these situations, you may prefer the Pandas solution.
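If you do want a comprehension that tolerates missing values, a minimal sketch (an illustration, not part of the original answer) is to slice only the string entries:
d['Report Number'] = [x[3:] if isinstance(x, str) else x  # leave NaN and other non-strings untouched
                      for x in d['Report Number']]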
You can also call str.slice. To remove the first 3 characters from each string:
d['Report Number'] = d['Report Number'].str.slice(3)
To take the 2nd-4th characters of each string:
d['Report Number'] = d['Report Number'].str.slice(1, 4)

How can I speed up an iterative function on my large pandas dataframe?

I am quite new to pandas and I have a pandas dataframe of about 500,000 rows filled with numbers. I am using python 2.x and am currently defining and calling the method shown below on it. It sets a predicted value equal to the corresponding value in series 'B' if two adjacent values in series 'A' are the same. However, it is running extremely slowly, about 5 rows are processed per second, and I want to find a way to accomplish the same result more quickly.
def myModel(df):
    A_series = df['A']
    B_series = df['B']
    seriesLength = A_series.size
    # Make a new empty column in the dataframe to hold the predicted values
    df['predicted_series'] = np.nan
    # Make a new empty column to store whether or not the prediction matches B
    df['wrong_prediction'] = np.nan
    prev_B = B_series[0]
    for x in range(1, seriesLength):
        prev_A = A_series[x-1]
        prev_B = B_series[x-1]
        # set the predicted value to equal B if A has two equal values in a row
        if A_series[x] == prev_A:
            if df['predicted_series'][x] > 0:
                df['predicted_series'][x] = df['predicted_series'][x-1]
            else:
                df['predicted_series'][x] = B_series[x-1]
Is there a way to vectorize this or to just make it run faster? Under the current circumstances, it is projected to take many hours. Should it really be taking this long? It doesn't seem like 500,000 rows should give my program that much trouble.
Something like this should work as you described:
df['predicted_series'] = np.where(A_series.shift() == A_series, B_series, df['predicted_series'])
df.loc[df.A.diff() == 0, 'predicted_series'] = df.B
This will get rid of the for loop and set predicted_series to the value of B when A is equal to the previous A.
edit:
per your comment, change your initialization of predicted_series to be all NaN and then forward fill the values:
df['predicted_series'] = np.nan
df.loc[df.A.diff() == 0, 'predicted_series'] = df.B
df.predicted_series = df.predicted_series.fillna(method='ffill')
For the fastest speed, modifying ayhan's answer a bit will perform best:
df['predicted_series'] = np.where(df.A.shift() == df.A, df.B, df['predicted_series'].shift())
That will give you your forward-filled values and run faster than my original recommendation.
Solution
df.loc[df.A == df.A.shift(), 'predicted_series'] = df.B.shift()
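To make the shift-and-compare idea concrete, here is a small illustrative example (the toy data below is made up, not from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 2, 3],
                   'B': [10, 20, 30, 40, 50, 60]})
# predicted_series[x] = B[x-1] wherever A[x] equals A[x-1], NaN otherwise
df['predicted_series'] = np.where(df['A'].eq(df['A'].shift()),
                                  df['B'].shift(), np.nan)
print(df)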

Efficiently find matching rows (based on content) in a pandas DataFrame

I am writing some tests and I am using Pandas DataFrames to house a large dataset ~(600,000 x 10). I have extracted 10 random rows from the source data (using Stata) and now I want to write a test to see if those rows are in the DataFrame in my test suite.
As a small example
np.random.seed(2)
raw_data = pd.DataFrame(np.random.rand(5, 3), columns=['one', 'two', 'three'])
random_sample = raw_data.iloc[1]
Here raw_data is a 5x3 frame of random floats, and random_sample is row 1 of raw_data, so a match is guaranteed.
Currently I have written:
for idx, row in raw_data.iterrows():
    if random_sample.equals(row):
        print("match")
        break
Which works but on the large dataset is very slow. Is there a more efficient way to check if an entire row is contained in the DataFrame?
BTW: my example also needs to handle np.NaN equality, which is why I am using the equals() method.
equals doesn't seem to broadcast, but we can always do the equality comparison manually:
>>> df = pd.DataFrame(np.random.rand(600000, 10))
>>> sample = df.iloc[-1]
>>> %timeit df[((df == sample) | (df.isnull() & sample.isnull())).all(1)]
1 loops, best of 3: 231 ms per loop
>>> df[((df == sample) | (df.isnull() & sample.isnull())).all(1)]
0 1 2 3 4 5 6 \
599999 0.07832 0.064828 0.502513 0.851816 0.976464 0.761231 0.275242
7 8 9
599999 0.426393 0.91632 0.569807
which is much faster than the iterative version for me (which takes > 30s.)
But since we have lots of rows and relatively few columns, we could loop over the columns, and in the typical case probably cut down substantially on the number of rows to be looked at. For example, something like
def finder(df, row):
    for col in df:
        df = df.loc[(df[col] == row[col]) | (df[col].isnull() & pd.isnull(row[col]))]
    return df
gives me
>>> %timeit finder(df, sample)
10 loops, best of 3: 35.2 ms per loop
which is roughly an order of magnitude faster, because after the first column there's only one row left.
(I think I once had a much slicker way to do this but for the life of me I can't remember it now.)
The best I have come up with is to take a filtering approach which seems to work quite well and prevents a lot of comparisons when the dataset is large:
tmp = raw_data
for idx, val in random_sample.items():
    try:
        if np.isnan(val):
            continue
    except:
        pass
    tmp = tmp[tmp[idx] == val]
if len(tmp) == 1:
    print("match")
Note: this is actually slower for the small example above, but on a large dataset it is ~9 times faster than the basic iteration.
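As a sketch of the same filtering idea using pd.isna (which handles non-numeric values, so the try/except guard is not needed; this is an illustration, not the original answer):
tmp = raw_data
for col, val in random_sample.items():
    if pd.isna(val):  # skip NaN keys instead of filtering on them
        continue
    tmp = tmp[tmp[col] == val]
if len(tmp) == 1:
    print("match")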

Approach to speed up pandas multilevel index selection?

I have a data frame, and would like to work on a small partition at a time, selected by particular tuples of values of 'a', 'b', 'c'.
df = pd.DataFrame({'a': np.random.randint(0, 10, 10000),
                   'b': np.random.randint(0, 10, 10000),
                   'c': np.random.randint(0, 10, 10000),
                   'value': np.random.randint(0, 100, 10000)})
so I chose to use pandas multiindex:
dfi = df.set_index(['a','b','c'])
dfi.sortlevel(inplace = True)
However, the performance is not great.
%timeit dfi.ix[(2,1,7)] # 511 us
%timeit df[(df['a'].values == 2) &
           (df['b'].values == 1) &
           (df['c'].values == 7)] # 247 us
I suspect there are some overheads somewhere. My program has ~1k tuples, so it takes 511 * 1000 = 0.5s for one run. How can I improve further?
update:
hmm, I forgot to mention that the number of tuples is less than the total Cartesian product of distinct values of 'a', 'b', 'c' in df. Wouldn't groupby do an excess amount of work on indices that don't exist in my tuples?
It's not clear what 'work on' means, but I would do this. The applied function below can be almost any function:
In [33]: %timeit df.groupby(['a','b','c']).apply(lambda x: x.sum())
10 loops, best of 3: 83.6 ms per loop
certain operations are cythonized so very fast
In [34]: %timeit df.groupby(['a','b','c']).sum()
100 loops, best of 3: 2.65 ms per loop
Doing a selection on a multi-index, index by index, is not efficient.
If you are operating on a very small subset of the total groups, then you might want to directly index into the multi-index; groupby wins if you are operating on a fraction (maybe 20%) of the groups or more. You might also want to investigate filter which you can use to pre-filter the groups based on some criteria.
As noted above, the cartesian product of the groups indexers is irrelevant. Only the actual groups will be iterated by groupby (think of a MultiIndex as a sparse representation of the total possible space).
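For example, a minimal sketch of pre-filtering groups with filter (the threshold is made up purely to illustrate the call):
# keep only the (a, b, c) groups whose mean 'value' exceeds 50
subset = df.groupby(['a', 'b', 'c']).filter(lambda g: g['value'].mean() > 50)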
How about:
dfi = df.set_index(['a','b','c'])
dfi.sortlevel(inplace = True)
value = dfi["value"].values
value[dfi.index.get_loc((2, 1, 7))]
The result is an ndarray without an index.
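If you need to look up many tuples, another sketch (the tuples below are illustrative) is to materialise the groups once and probe only the combinations you actually have:
# build each (a, b, c) group of 'value' once, then look up the wanted tuples
groups = dict(iter(df.groupby(['a', 'b', 'c'])['value']))
wanted = [(2, 1, 7), (0, 3, 9)]  # hypothetical tuples of interest
parts = [groups[t] for t in wanted if t in groups]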
