Python 3 occurrence matching - python

I started writing this on a whim, out of little more than curiosity. I've stepped through the code in a visualizer and it iterates the way I'd expect, but it doesn't output what I think it should. Can someone show me what I'm missing? It's just a toy example of what SQL-style join tables look like after processing.
def query(a=[1,2,3,4], b=[3,1,1,2,3,4,5,6]):
    """
    table A        table B        Expected Output    Actual Output
    idx  value     idx  value     indxA  indxB       indxA  indxB
    0    1         0    3         0      1           0      1
    1    2         1    1         0      2           0      1
    2    3         2    1         1      3           1      3
    3    4         3    2         2      0           2      0
                   4    3         2      4           2      0
                   5    4         3      5           3      5
                   6    5
                   7    6

    EXAMPLE
    The value at Table A index 0 occurs at Table B indices 1 and 2.

    PROBLEM
    Anywhere there are multiple matches, only the first occurrence prints.
    """
    for idx, itemA in enumerate(a):
        if itemA in b:
            for itemB in b:
                if itemA == itemB:
                    print("{} {}".format(a.index(itemA), b.index(itemB)))

query()

list.index(x) returns the index of the first item in the list whose value is x, so repeated values in b always report the index of their first occurrence. You should use enumerate(b) instead:
def query(a=[1,2,3,4], b=[3,1,1,2,3,4,5,6]):
    for index_a, value_a in enumerate(a):
        for index_b, value_b in enumerate(b):
            if value_a == value_b:
                print("{} {}".format(index_a, index_b))

query()

For large lists, iterating over every element of one list for every element of the other is slow: it is a quadratic algorithm.
Here is the solution using only lists. I took the printing out, since it would dominate the run time, and return the results in a list instead:
def query_list(a, b):
    res = []
    for index_a, value_a in enumerate(a):
        for index_b, value_b in enumerate(b):
            if value_a == value_b:
                res.append((index_a, index_b))
    return res
This is an alternative implementation using dictionaries:
def query_dict(a, b):
    indices = {value: index for index, value in enumerate(a)}
    res = []
    for index_b, value_b in enumerate(b):
        if value_b in indices:
            res.append((indices[value_b], index_b))
    return sorted(res)
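Note that query_dict assumes the values in a are unique; a duplicate in a would overwrite the earlier index in the dict comprehension. If a may also contain duplicates, a dict of index lists preserves the behaviour of the nested-loop version. A minimal sketch along the same lines (query_dict_multi is my own name, not from the original answer):

from collections import defaultdict

def query_dict_multi(a, b):
    # map each value in a to all of its indices, not just the last one
    indices = defaultdict(list)
    for index_a, value_a in enumerate(a):
        indices[value_a].append(index_a)
    res = []
    for index_b, value_b in enumerate(b):
        for index_a in indices.get(value_b, []):
            res.append((index_a, index_b))
    return sorted(res)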
Generating some example data:
import random
a = list(range(1, 1000))
b = [random.randint(1, 100) for x in range(10000)]
The list version:
%timeit query_list(a, b)
1 loops, best of 3: 1.09 s per loop
is much slower than the dict version:
%timeit query_dict(a, b)
100 loops, best of 3: 11.8 ms per loop
That is roughly a factor of 100.
Using larger example data:
import random
a = list(range(1, 10000))
b = [random.randint(1, 100) for x in range(10000)]
The difference becomes even more pronounced:
%timeit query_list(a, b)
1 loops, best of 3: 11.4 s per loop
%timeit query_dict(a, b)
100 loops, best of 3: 13.7 ms per loop
Going up to a factor of close to 1000.

Related

Vectorize calculation of a Pandas Dataframe

I have a trivial problem that I have solved using loops, but I am trying to see whether I can vectorize some of it to improve performance.
Essentially I have two dataframes (DF_A and DF_B), where each row in DF_B is the sum of the corresponding row in DF_A and the row above it in DF_B. I do have the first row of values in DF_B.
df_a = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [..... more rows]
]
df_b = [
    [1, 2, 3, 4],
    [rows of all 0 values here, so dimensions match df_a]
]
What I am trying to achieve is that, for example, the 2nd row in df_b will be the values of the first row in df_b plus the values of the second row in df_a. So in this case:
df_b.loc[2] = [6,8,10,12]
I was able to accomplish this by looping over the range of df_a, keeping the previous row's value saved off and adding the row at the current index to it. That doesn't seem very efficient.
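For reference, a minimal sketch of the loop approach described above might look like this (illustrative code, assuming df_a and df_b are pandas DataFrames of equal shape with df_b's first row already given):

import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.arange(1, 21).reshape(5, 4))
df_b = pd.DataFrame(np.zeros_like(df_a.values))

prev = df_a.iloc[0].values.copy()      # first row of df_b is given
df_b.iloc[0] = prev
for i in range(1, len(df_a)):
    prev = prev + df_a.iloc[i].values  # previous df_b row + current df_a row
    df_b.iloc[i] = prev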
Here is a numpy solution. This should be significantly faster than a pandas loop, especially since it uses JIT compilation via numba.
import pandas as pd
from numba import jit

a = df_a.values
b = df_b.values

@jit(nopython=True)
def fill_b(a, b):
    for i in range(1, len(b)):
        b[i] = b[i-1] + a[i]
    return b

df_b = pd.DataFrame(fill_b(a, b))

#     0   1   2   3
# 0   1   2   3   4
# 1   6   8  10  12
# 2  15  18  21  24
# 3  28  32  36  40
# 4  45  50  55  60
Performance benchmarking
import pandas as pd, numpy as np
from numba import jit

df_a = pd.DataFrame(np.arange(1, 1000001).reshape(1000, 1000))

@jit(nopython=True)
def fill_b(a, b):
    for i in range(1, len(b)):
        b[i] = b[i-1] + a[i]
    return b

def jp(df_a):
    a = df_a.values
    b = np.empty(df_a.values.shape)
    b[0] = np.arange(1, 1001)
    return pd.DataFrame(fill_b(a, b))

%timeit df_a.cumsum()  # 16.1 ms
%timeit jp(df_a)       # 6.05 ms
You can just create df_b as the cumulative sum over df_a, like so:
df_a = pd.DataFrame(np.arange(1, 17).reshape(4, 4))
df_b = df_a.cumsum()

    0   1   2   3
0   1   2   3   4
1   6   8  10  12
2  15  18  21  24
3  28  32  36  40

Removing usernames from a dataframe that do not appear a certain number of times?

I am trying to understand the code provided below (which I found online but do not fully understand). I essentially want to remove user names that do not appear in my dataframe at least 4 times (other than removing those names, I do not want to modify the dataframe in any other way). Does the following code solve this problem, and if so, can you explain how the filter combined with the lambda achieves it? I have the following:
df.groupby('userName').filter(lambda x: len(x) > 4)
I am also open to alternative solutions/approaches that are easy to understand.
You can check the pandas documentation on groupby filtration; that is exactly what the code does: it keeps the rows of each userName group whose length is greater than 4.
A faster solution for a bigger DataFrame uses transform with boolean indexing:
df[df.groupby('userName')['userName'].transform('size') > 4]
Sample:
df = pd.DataFrame({'userName':['a'] * 5 + ['b'] * 3 + ['c'] * 6})
print (df.groupby('userName').filter(lambda x: len(x) > 4))
userName
0 a
1 a
2 a
3 a
4 a
8 c
9 c
10 c
11 c
12 c
13 c
print (df[df.groupby('userName')['userName'].transform('size') > 4])
userName
0 a
1 a
2 a
3 a
4 a
8 c
9 c
10 c
11 c
12 c
13 c
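To unpack how the two versions work on the sample above: groupby('userName') splits df into one sub-DataFrame per user name, and filter(lambda x: len(x) > 4) calls the lambda once per sub-DataFrame x, keeping that group's rows only when the group has more than 4 rows. transform('size') instead broadcasts each group's row count back onto every row, so comparing with 4 yields a boolean mask. An illustrative sketch using the same df as the sample:

# filter(): the lambda receives each per-user sub-DataFrame and returns
# True to keep that whole group's rows
kept = df.groupby('userName').filter(lambda x: len(x) > 4)

# transform('size'): each row gets its group's row count, giving a mask
mask = df.groupby('userName')['userName'].transform('size') > 4
kept_fast = df[mask]   # same rows as `kept`

Note that > 4 keeps names appearing more than 4 times; for "at least 4 times", as worded in the question, the comparison would be >= 4 (or len(x) >= 4).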
Timings:
np.random.seed(123)
N = 1000000
L = np.random.randint(1000,size=N).astype(str)
df = pd.DataFrame({'userName': np.random.choice(L, N)})
print (df)
In [128]: %timeit (df.groupby('userName').filter(lambda x: len(x) > 1000))
1 loop, best of 3: 468 ms per loop
In [129]: %timeit (df[df.groupby('userName')['userName'].transform(len) > 1000])
1 loop, best of 3: 661 ms per loop
In [130]: %timeit (df[df.groupby('userName')['userName'].transform('size') > 1000])
10 loops, best of 3: 96.9 ms per loop
Using numpy:
def pir(df, k):
    names = df.userName.values
    # encode each name as an integer label
    f, u = pd.factorize(names)
    # count the occurrences of each label
    c = np.bincount(f)
    # keep rows whose name occurs more than k times
    m = c[f] > k
    return df[m]

pir(df, 4)
userName
0 a
1 a
2 a
3 a
4 a
8 c
9 c
10 c
11 c
12 c
13 c
Timing
# @jezrael's large data
np.random.seed(123)
N = 1000000
L = np.random.randint(1000,size=N).astype(str)
df = pd.DataFrame({'userName': np.random.choice(L, N)})
pir(df, 1000).equals(
    df[df.groupby('userName')['userName'].transform('size') > 1000]
)
True
%timeit df[df.groupby('userName')['userName'].transform('size') > 1000]
%timeit pir(df, 1000)
10 loops, best of 3: 78.4 ms per loop
10 loops, best of 3: 61.9 ms per loop
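For completeness, a similar mask can be built without groupby at all, via value_counts and map (my own sketch, not one of the timed answers above):

counts = df['userName'].value_counts()
out = df[df['userName'].map(counts) > 4]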

How to get the number of continuously increasing data points in a Pandas dataframe

Thanks for your help in advance!
I have a dataframe. Say, for column incr, the numbers are 0 1 1 1 -1 0 1 1 .... I would like to parse the list and get, for each data point, how many times the data has increased (or, more exactly, not decreased) up to that point; a decrease resets the output to zero at that data point. For example, for the list (named output['inc_adj'] in the code)
0 1 1 1 -1 0 1 -1
I should get (named output['cont_inc'] in the code)
1 2 3 4 0 1 2 0
I wrote the following code, but it is very inefficient. Any suggestions for improving the efficiency significantly? It feels like I keep reloading the CPU cache between the two loops (if my intuition is correct), but I could not find a better solution at this stage.
output['cont_inc'] = 0
for i in xrange(1, output['inc_adj'].count()):
    j = i
    while output['inc_adj'][j] != -1:
        # for both increase or unchanged
        output['cont_inc'][i] += 1
        j -= 1
Thanks in advance!
If memory allows, I would suggest building a list of all the adjacent-value pairs to compare up front (in my sample using zip), appending the results to a new list, and re-assigning the whole result list back to the DataFrame once it is complete.
Although it sounds odd, in practice it improves performance a little by eliminating the overhead of constant DataFrame index/value lookups:
import pandas as pd
import random

# random DataFrame with values from -1 to 2
df = pd.DataFrame([random.randint(-1, 2) for _ in xrange(999)], columns=['inc_adj'])
df['cont_inc'] = 0

def calc_inc(df):
    inc = [1]
    # I use zip to PREPARE the adjacent values
    for i, n in enumerate(zip(df['inc_adj'][1:], df['inc_adj'][:-1]), 0):
        if n[0] >= n[1]:
            inc.append(inc[i] + 1)
            continue
        inc.append(0)
    df['cont_inc'] = inc

calc_inc(df)
df.head()
inc_adj cont_inc
0 0 1
1 0 2
2 1 3
3 -1 0
4 0 1
%timeit calc_inc(df)
1000 loops, best of 3: 696 µs per loop
As a comparison, here is the same logic using indexing/lookup and in-place assignment:
def calc_inc_using_ix(df):
    for idx, row in df.iterrows():
        try:
            if row['inc_adj'] >= df['inc_adj'][idx-1]:
                row['cont_inc'] = df['cont_inc'][idx-1] + 1
                continue
            row['cont_inc'] = 0
        except KeyError:
            row['cont_inc'] = 1

calc_inc_using_ix(df)
df.head()
inc_adj cont_inc
0 0 1
1 1 2
2 1 3
3 0 0
4 2 1
%timeit calc_inc_using_ix(df)
10 loops, best of 3: 58.5 ms per loop
That said, I'm also interested in any other solutions that will further improve the performance, always willing to learn.
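For what it's worth, here is a possible fully vectorized sketch (my own addition, not from the answers above). It measures each row's distance from the most recent decrease using numpy's maximum.accumulate, and assumes the column is named inc_adj as in the question:

import numpy as np
import pandas as pd

def calc_inc_vectorized(df):
    # True wherever the value decreased relative to the previous row
    decreased = (df['inc_adj'].diff() < 0).values
    pos = np.arange(len(df))
    # index of the most recent decrease at or before each row (-1 if none yet)
    last_dec = np.maximum.accumulate(np.where(decreased, pos, -1))
    # distance from the last decrease: 0 at the decrease itself, counting up afterwards
    df['cont_inc'] = pos - last_dec

df = pd.DataFrame({'inc_adj': [0, 1, 1, 1, -1, 0, 1, -1]})
calc_inc_vectorized(df)
print(df['cont_inc'].tolist())   # [1, 2, 3, 4, 0, 1, 2, 0]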

Count number of elements in each column less than x

I have a DataFrame which looks like below. I am trying to count the number of elements less than 2.0 in each column, then I will visualize the result in a bar plot. I did it using lists and loops, but I wonder if there is a "Pandas way" to do this quickly.
x = []
for i in range(6):
    x.append(df[df.ix[:, i] < 2.0].count()[i])
then I can get a bar plot using list x.
A B C D E F
0 2.142 1.929 1.674 1.547 3.395 2.382
1 2.077 1.871 1.614 1.491 3.110 2.288
2 2.098 1.889 1.610 1.487 3.020 2.262
3 1.990 1.760 1.479 1.366 2.496 2.128
4 1.935 1.765 1.656 1.530 2.786 2.433
In [96]:
df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10), 'c': np.random.randn(10)})
df
Out[96]:
a b c
0 -0.849903 0.944912 1.285790
1 -1.038706 1.445381 0.251002
2 0.683135 -0.539052 -0.622439
3 -1.224699 -0.358541 1.361618
4 -0.087021 0.041524 0.151286
5 -0.114031 -0.201018 -0.030050
6 0.001891 1.601687 -0.040442
7 0.024954 -1.839793 0.917328
8 -1.480281 0.079342 -0.405370
9 0.167295 -1.723555 -0.033937
[10 rows x 3 columns]
In [97]:
df[df > 1.0].count()
Out[97]:
a 0
b 2
c 2
dtype: int64
So in your case:
df[df < 2.0 ].count()
should work
EDIT
some timings
In [3]:
%timeit df[df < 1.0 ].count()
%timeit (df < 1.0).sum()
%timeit (df < 1.0).apply(np.count_nonzero)
1000 loops, best of 3: 1.47 ms per loop
1000 loops, best of 3: 560 us per loop
1000 loops, best of 3: 529 us per loop
So @DSM's suggestions are correct and much faster than my suggestion.
Method chaining is also possible (the comparison operators have corresponding methods, e.g. < is lt(), <= is le()):
df.lt(2).sum()
If you have multiple conditions to consider, e.g. counting the number of values between 2 and 10, you can use boolean operators on two boolean Series:
(df.gt(2) & df.lt(10)).sum()
or you can use pd.eval():
pd.eval("2 < df < 10").sum()
Count the number of values less than 2 or greater than 10:
(df.lt(2) | df.gt(10)).sum()
# or
pd.eval("df < 2 or df > 10").sum()

how to compute a new column based on the values of other columns in pandas - python

Let's say my data frame contains these data:
>>> df = pd.DataFrame({'a': ['l1', 'l2', 'l1', 'l2', 'l1', 'l2'],
...                    'b': ['1', '2', '2', '1', '2', '2']})
>>> df
a b
0 l1 1
1 l2 2
2 l1 2
3 l2 1
4 l1 2
5 l2 2
l1 should correspond to 1 whereas l2 should correspond to 2.
I'd like to create a new column 'c' such that, for each row, c = 1 if a = l1 and b = 1 (or a = l2 and b = 2). If a = l1 and b = 2 (or a = l2 and b = 1) then c = 0.
The resulting data frame should look like this:
a b c
0 l1 1 1
1 l2 2 1
2 l1 2 0
3 l2 1 0
4 l1 2 0
5 l2 2 1
My data frame is very large so I'm really looking for the most efficient way to do this using pandas.
df = pd.DataFrame({'a': numpy.random.choice(['l1', 'l2'], 1000000),
'b': numpy.random.choice(['1', '2'], 1000000)})
A fast solution assuming only two distinct values:
%timeit df['c'] = ((df.a == 'l1') == (df.b == '1')).astype(int)
10 loops, best of 3: 178 ms per loop
@Viktor Kerkes:
%timeit df['c'] = (df.a.str[-1] == df.b).astype(int)
1 loops, best of 3: 412 ms per loop
@user1470788:
%timeit df['c'] = (((df['a'] == 'l1')&(df['b']=='1'))|((df['a'] == 'l2')&(df['b']=='2'))).astype(int)
1 loops, best of 3: 363 ms per loop
@herrfz:
%timeit df['c'] = (df.a.apply(lambda x: x[1:])==df.b).astype(int)
1 loops, best of 3: 387 ms per loop
You can also use the string methods.
df['c'] = (df.a.str[-1] == df.b).astype(int)
df['c'] = (df.a.apply(lambda x: x[1:])==df.b).astype(int)
You can just use logical operators. I'm not sure why you're using strings of 1 and 2 rather than ints, but here's a solution. The astype at the end converts the result from booleans to 0s and 1s.
df['c'] = (((df['a'] == 'l1')&(df['b']=='1'))|((df['a'] == 'l2')&(df['b']=='2'))).astype(int)
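Another option (my own addition, not one of the timed answers above) is to map the labels in a onto the matching b values and compare; a sketch assuming only the two labels shown in the question:

df['c'] = (df.a.map({'l1': '1', 'l2': '2'}) == df.b).astype(int)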
