More efficient way to query for column permutations in SQLite - python

For the sake of a dummy example, let's say that I have a database with columns text, ind and sentid. You can think of it as a table with one row per word, holding the word's text, its position in the sentence, and the ID of the sentence.
To query for specific occurrences of n words, I join the table with itself n times. That table does not have a unique column except for the default rowid. I oftentimes want to query these columns in such a way that the integers in the n ind columns are sequential without any integer between them. Sometimes the order matters, sometimes the order does not. At the same time, each of the n columns also needs to fulfil some requirement, e.g. n0.text = 'a' AND n1.text = 'b' AND n2.text = 'c'. Put differently, in every sentence (unique sentid), find all occurrences of a b c either ordered or in any order (but sequential).
I have solved the ordered case quite easily. For three columns with names ind0, ind1, ind2 you could simply have a query like the following and scale it accordingly as n columns grows.
WHERE ind1 = ind0 + 1 AND ind2 = ind1 + 1
Perhaps there is a better way (do tell me if that is so) but I have found that this works reliably well in terms of performance (query speed).
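For reference, a small helper in the same style as the example code below (which aliases the joined tables as w0, w1, ...) can generate that ordered condition for any n; this is just a sketch of the scaling described above:

def ordered_condition(n: int) -> str:
    # Build "w1.ind = w0.ind + 1 AND w2.ind = w1.ind + 1 AND ..." for n words
    return " AND ".join(f"w{i}.ind = w{i-1}.ind + 1" for i in range(1, n))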
The tougher nut to crack is those cases where the integers also have to be sequential but where the order does not matter (e.g., 7 9 8, 3 1 2, 11 10 9). My current approach is "brute forcing" it by simply generating all possible permutations of orders (e.g., (ind1 = ind0 + 1 AND ind2 = ind1 + 1) OR (ind0 = ind1 + 1 AND ind2 = ind0 + 1) OR ...). But as n grows, this becomes a huge list of possibilities, and my query speed really seems to suffer from it. For example, for n=6 (the max requirement) this will generate 720 potential orders separated by OR. Such an approach, which works, is given below as a minimal but documented example for you to try out.
I am looking for a more generic, SQL-y solution that hopefully positively impacts performance when querying for sequential (but not necessarily ordered) columns.
Fiddle with data and current implementation here, reproducible Python code below.
Note that it is possible to get multiple results per sentid, but only if at least one of the word indices differs between the matched words; permutations of the results themselves are not needed. For example, 1-2-0 and 3-2-1 can both be valid results for one sentid, but 1-2-0 and 2-1-0 cannot both be.
import sqlite3
from itertools import permutations
from pathlib import Path
from random import shuffle
def generate_all_possible_sequences(n: int) -> str:
    """Given an integer, generates all possible permutations of the 'n' given indices with respect to
    order in SQLite. What this means is that it will generate all possible permutations, e.g., for '3':
    0, 1, 2; 0, 2, 1; 1, 0, 2; 1, 2, 0 etc. and then build corresponding SQLite requirements, e.g.,
    0, 1, 2: ind1 = ind0 + 1 AND ind2 = ind1 + 1
    0, 2, 1: ind2 = ind0 + 1 AND ind1 = ind2 + 1
    ...
    and all these possibilities are then concatenated with OR to allow every possibility:
    ((ind1 = ind0 + 1 AND ind2 = ind1 + 1) OR (ind2 = ind0 + 1 AND ind1 = ind2 + 1) OR ...)
    """
    idxs = list(range(n))
    order_perms = []
    for perm in permutations(idxs):
        this_perm_orders = []
        for i in range(1, len(perm)):
            this_perm_orders.append(f"w{perm[i]}.ind = w{perm[i-1]}.ind + 1")
        order_perms.append(f"({' AND '.join(this_perm_orders)})")
    return f"({' OR '.join(order_perms)})"
def main():
    pdb = Path("temp.db")
    if pdb.exists():
        pdb.unlink()
    conn = sqlite3.connect(str(pdb))
    db_cur = conn.cursor()
    # Create a table of words, where each word has its text, its position in the sentence, and the ID of its sentence
    db_cur.execute("CREATE TABLE tbl(text TEXT, ind INTEGER, sentid INTEGER)")
    # Create dummy data
    vals = []
    for sent_id in range(20):
        shuffled = ["a", "b", "c", "d", "e", "a", "c"]
        shuffle(shuffled)
        for word_id, word in enumerate(shuffled):
            vals.append((word, word_id, sent_id))
    # Wrap the values in single quotes for SQLite
    vals = [(f"'{v}'" for v in val) for val in vals]
    # Convert values into INSERT commands
    cmds = [f"INSERT INTO tbl VALUES ({','.join(val)})" for val in vals]
    # Build DB
    db_cur.executescript(f"BEGIN TRANSACTION;{';'.join(cmds)};COMMIT;")
    print(f"BEGIN TRANSACTION;{';'.join(cmds)};COMMIT;\n")
    # Query DB for sequential occurrences in ind0, ind1, and ind2: the order does not matter
    # but they have to be sequential
    query = f"""SELECT w0.ind, w1.ind, w2.ind, w0.sentid
                FROM tbl AS w0
                JOIN tbl AS w1 USING (sentid)
                JOIN tbl AS w2 USING (sentid)
                WHERE w0.text = 'a'
                AND w1.text = 'b'
                AND w2.text = 'c'
                AND {generate_all_possible_sequences(3)}"""
    print(query)
    print()
    print("a_idx\tb_idx\tc_idx\tsentid")
    for res in db_cur.execute(query).fetchall():
        print("\t".join(map(str, res)))
    db_cur.close()
    conn.commit()
    conn.close()
    pdb.unlink()


if __name__ == '__main__':
    main()

Here is a solution that mostly uses the rowids to create all the possible permutations:
WITH cte AS (
    SELECT t0.rowid rowid0, t1.rowid rowid1, t2.rowid rowid2
    FROM tbl AS t0
    JOIN tbl AS t1 ON t1.sentid = t0.sentid AND t1.ind = t0.ind + 1
    JOIN tbl AS t2 ON t2.sentid = t1.sentid AND t2.ind = t1.ind + 1
)
SELECT t0.ind ind0, t1.ind ind1, t2.ind ind2
FROM cte c
JOIN tbl t0 ON t0.rowid IN (c.rowid0, c.rowid1, c.rowid2)
JOIN tbl t1 ON t1.rowid IN (c.rowid0, c.rowid1, c.rowid2) AND t1.rowid <> t0.rowid
JOIN tbl t2 ON t2.rowid IN (c.rowid0, c.rowid1, c.rowid2) AND t2.rowid NOT IN (t0.rowid, t1.rowid)
See a simplified demo.
The query plan (in the above demo) shows that SQLite uses covering indexes and the rowids to perform the joins.
I don't know how this would scale for more columns, as this requirement is a performance killer by design because of the multiple joins and the number of rows that must be returned for each tuple of n columns that satisfies the conditions (= n!).

I found that the easiest way to implement this is to use the following condition:
AND MAX(w0.ind, w1.ind, w2.ind) - MIN(w0.ind, w1.ind, w2.ind) = 2
where 2 is the number of words that we are looking for minus 1. That being said, it is hard to say much about the performance, since, as forpas mentions, "no, indexes can't be used in expressions like MAX(...) - MIN(...) where functions are used."
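For completeness, a small helper in the spirit of the question's generator could build that condition for any n (a sketch; it assumes the n matched words are guaranteed to be distinct rows, e.g. because their text constraints differ, since n distinct indices with max - min = n - 1 are necessarily consecutive; otherwise an explicit distinctness check is also needed):

def sequential_any_order_condition(n: int) -> str:
    # Build "MAX(w0.ind, ..., w{n-1}.ind) - MIN(w0.ind, ..., w{n-1}.ind) = n - 1"
    cols = ", ".join(f"w{i}.ind" for i in range(n))
    return f"MAX({cols}) - MIN({cols}) = {n - 1}"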

Related

How to check if 2 different values are from the same list and obtaining the list name

** I modified the entire question **
I have an example list specified below, and I want to find whether two values are from the same list, and I want to know which list both values come from.
list1 = ['a','b','c','d','e']
list2 = ['f','g','h','i','j']
c = 'b'
d = 'e'
I used a for loop to check whether the values exist in the list, but I am not sure how to obtain which list the value actually comes from.
for x,y in zip(list1,list2):
    if c and d in x or y:
        print(True)
Please advise if there is any workaround.
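As a side note, the condition "c and d in x or y" does not test what it looks like: Python parses it as "(c and (d in x)) or y", which is almost always truthy. A minimal sketch of a direct membership check (the helper name find_source and the named_lists mapping are made up for illustration):

def find_source(value, named_lists):
    # named_lists: dict mapping a name to the list it refers to
    return [name for name, lst in named_lists.items() if value in lst]

named_lists = {"list1": ['a','b','c','d','e'], "list2": ['f','g','h','i','j']}
print(find_source('b', named_lists), find_source('e', named_lists))  # ['list1'] ['list1']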
First you might want to inspect the distribution of values and sizes to see where you can improve the result with the least effort, like this:
df_inspect = df.copy()
df_inspect["size.value"] = df_inspect["size.value"].map(lambda x: ''.join(y.upper() for y in x if y.isalpha()))
df_inspect = df_inspect.groupby("size.value").size().sort_values(ascending=False)
Then create a solution for the most frequently occurring size category, here "Wide":
long = "adasda, 9.5 W US"
short = "9.5 Wide"
def get_intersection(s1, s2):
    # longest common substring of s1 and s2 (brute force over all substrings of s1)
    res = ''
    l_s1 = len(s1)
    for i in range(l_s1):
        for j in range(i + 1, l_s1 + 1):
            t = s1[i:j]
            if t in s2 and len(t) > len(res):
                res = t
    return res
print(len(get_intersection(long, short)) / len(short) >= 0.6)
Then apply the solution to the dataframe
df["defective_attributes"] = df.apply(lambda x: len(get_intersection(x["item_name.value"], x["size.value"])) / len(x["size.value"]) >= 0.6)
Basically, get_intersection searches for the longest intersection (common substring) between the item name and the size. It then takes the length of that intersection and says the row is not defective if at least 60% of size.value is also found in item_name.value.

For cycle gets stuck in Python

My code below is getting stuck on a random point:
import functions
from itertools import product
from random import randrange

values = {}
tables = {}
letters = "abcdefghi"
nums = "123456789"
for x in product(letters, nums):  # unnecessary
    values[x[0] + x[1]] = 0
for x in product(nums, letters):  # unnecessary
    tables[x[0] + x[1]] = 0
for line_cnt in range(1,10):
    for column_cnt in range(1,10):
        num = randrange(1,10)
        table_cnt = functions.which_table(line_cnt, column_cnt)  # Returns a number identifying the table considered
        # gets the values already in the line and column and table considered
        line = [y for x,y in values.items() if x.startswith(letters[line_cnt-1])]
        column = [y for x,y in values.items() if x.endswith(nums[column_cnt-1])]
        table = [x for x,y in tables.items() if x.startswith(str(table_cnt))]
        # if num is not contained in any of these then it's acceptable, otherwise find another number
        while num in line or num in column or num in table:
            num = randrange(1,10)
        values[letters[line_cnt-1] + nums[column_cnt-1]] = num  # Assign the number to the values dictionary
    print(line_cnt)  # debug
    print(sorted(values))  # debug
As you can see, it's a program that generates random sudoku schemes using two dictionaries: values, which contains the complete scheme, and tables, which contains the values for each table.
Example :
5th square on the first line = 3
|
v
values["a5"] = 3
tables["2b"] = 3
So what is the problem? Am I missing something?
import functions
...
table_cnt = functions.which_table(line_cnt, column_cnt) #Returns a number identifying the table considered
It's nice when we can execute the code right away on our own computer to test it. In other words, it would have been nice to replace table_cnt with a fixed value for the example (here, a simple string would have sufficed).
for x in product(letters, nums):
    values[x[0] + x[1]] = 0
Not that important, but this is more elegant:
values = {x+y: 0 for x, y in product(letters, nums)}
And now, the core of the problem:
while num in line or num in column or num in table:
    num = randrange(1,10)
This is where you loop forever. So, you are trying to generate a random sudoku. From your code, this is how you would generate a random list:
nums = []
for _ in range(9):
    num = randrange(1, 10)
    while num in nums:
        num = randrange(1, 10)
    nums.append(num)
The problem with this approach is that you have no idea how long the program will take to finish. It could take one second, or one year (although, that is unlikely). This is because there is no guarantee the program will not keep picking a number already taken, over and over.
In practice it should still take a relatively short time to finish (this approach is not efficient, but the list is very short). However, in the case of the sudoku, you can end up in an impossible setting. For example:
line = [6, 9, 1, 2, 3, 4, 5, 8, 0]
column = [0, 0, 0, 0, 7, 0, 0, 0, 0]
Where those are the first line (or any line, actually) and the last column. When the algorithm tries to find a value for line[8], it will always fail, since 7 is blocked by column.
If you want to keep it this way (i.e., brute force), you should detect such a situation and start over. Again, this is very inefficient, and you should look at how to generate sudokus properly (my naive approach would be to start with a solved one and swap lines and columns randomly, but I know this is not a good way).
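A minimal sketch of that detect-and-restart idea (my own illustration, not the asker's code; it only enforces the row and column constraints, since functions.which_table is not available here, so the 3x3 tables are ignored):

from random import choice

def try_fill():
    # One brute-force pass over the grid; gives up (returns None) as soon as a cell has no legal value.
    grid = [[0] * 9 for _ in range(9)]
    for r in range(9):
        for c in range(9):
            used = set(grid[r][:c]) | {grid[i][c] for i in range(r)}
            candidates = [n for n in range(1, 10) if n not in used]
            if not candidates:
                return None  # impossible setting detected: tell the caller to start over
            grid[r][c] = choice(candidates)
    return grid

attempts, grid = 0, None
while grid is None:  # start over until a complete grid comes out
    grid, attempts = try_fill(), attempts + 1
print("filled rows and columns after", attempts, "attempt(s)")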

Finding the position of a subsequence in a sequence

If T1 is this:
T1 = pd.DataFrame(data = {'val':['B','D','E','A','D','B','A','E','A','D','B']})
and P is this:
P = pd.DataFrame(data = {'val': ['E','A','D','B']})
how do I get the positions of P within T1 ?
In terms of min and max I would like to see this returned
min max
3 6
8 11
If these dataframes were represented as SQL tables I could use this SQL method translated to pandas:
DECLARE @Items INT = (SELECT COUNT(*) FROM #P);

SELECT MIN(t.KeyCol) AS MinKey,
       MAX(t.KeyCol) AS MaxKey
FROM dbo.T1 AS t
INNER JOIN #P AS p ON p.Val = t.Val
GROUP BY t.KeyCol - p.KeyCol
HAVING COUNT(*) = @Items;
This SQL solution is from Pesomannen's reply to http://sqlmag.com/t-sql/identifying-subsequence-in-sequence-part-2
well, you can always do a workaround like this:
t1 = ''.join(T1.val)
p = ''.join(P.val)
start, res = 0, []
while True:
    try:
        res.append(t1.index(p, start))
        start = res[-1] + 1
    except ValueError:  # no further occurrence of p in t1
        break
to get the starting indices, and then figure out the ending indices with a bit of arithmetic and access the dataframe using iloc. You should use 0-based indexing (not 1-based, like you do in the example).
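For instance, a hedged follow-up to the snippet above (assuming 0-based positions are wanted):

# each match starts at some index in res and spans len(p) characters
spans = [(start, start + len(p) - 1) for start in res]
print(spans)  # [(2, 5), (7, 10)] for the example data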
Granted, this doesn't utilize P, but may serve your purposes.
groups = T1.groupby(T1.val).groups
pd.DataFrame({'min': [min(x) for x in groups.values()],
              'max': [max(x) for x in groups.values()]}, index=groups.keys())
yields
   max  min
E    7    2
B   10    0
D    9    1
A    8    3

[4 rows x 2 columns]
I think I've worked it out by following the same approach as the SQL solution - a type of relational division (ie match up on the values, group by the differences in the key columns and select the group that has the count equal to the size of the subsequence):
import pandas as pd
T1 = pd.DataFrame(data = {'val':['B','D','E','A','D','B','A','E','A','D','B']})
# use the index to create a new column that's going to be the key (zero based)
T1 = T1.reset_index()
# do the same for the subsequence that we want to find within T1
P = pd.DataFrame(data = {'val': ['E','A','D','B']})
P = P.reset_index()
# join on the val column
J = T1.merge(P,on=['val'],how='inner')
# group by difference in key columns calculating the min, max and count of the T1 key
FullResult = J.groupby(J['index_x'] - J['index_y'])['index_x'].agg(['min', 'max', 'count'])
# Final result is where the count is the size of the subsequence - in this case 4
FullResult[FullResult['count'] == 4]
Really enjoying using pandas !

Find two disjoint pairs of pairs that sum to the same vector

This is a follow-up to Find two pairs of pairs that sum to the same value.
I have random 2d arrays which I make using
import numpy as np
from itertools import combinations
n = 50
A = np.random.randint(2, size=(m,n))
I would like to determine if the matrix has two disjoint pairs of pairs of columns which sum to the same column vector. I am looking for a fast method to do this. In the previous problem ((0,1), (0,2)) was acceptable as a pair of pairs of column indices but in this case it is not as 0 is in both pairs.
The accepted answer from the previous question is so cleverly optimised I can't see how to make this simple looking change unfortunately. (I am interested in columns rather than rows in this question but I can always just do A.transpose().)
Here is some code to show it testing all 4 by 4 arrays.
n = 4
nxn = np.arange(n*n).reshape(n, -1)
count = 0
for i in xrange(2**(n*n)):
    A = (i >> nxn) % 2
    p = 1
    for firstpair in combinations(range(n), 2):
        for secondpair in combinations(range(n), 2):
            if firstpair < secondpair and not set(firstpair) & set(secondpair):
                if (np.array_equal(A[firstpair[0]] + A[firstpair[1]], A[secondpair[0]] + A[secondpair[1]])):
                    if (p):
                        count += 1
                        p = 0
print count
This should output 3136.
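For orientation (this is my own illustration, not the optimised approach below), the disjointness requirement can be stated directly by hashing each column-pair sum and looking for a previously seen pair with the same sum that shares no column:

from collections import defaultdict
from itertools import combinations
import numpy as np

def has_disjoint_equal_column_pairs(A):
    # Map each pairwise column sum to the list of column-index pairs producing it.
    seen = defaultdict(list)
    for pair in combinations(range(A.shape[1]), 2):
        key = tuple(A[:, pair[0]] + A[:, pair[1]])
        for earlier in seen[key]:
            if not set(earlier) & set(pair):  # equal sums, no shared column
                return True
        seen[key].append(pair)
    return False

print(has_disjoint_equal_column_pairs(np.random.randint(2, size=(10, 10))))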
Here is my solution, extended to do what I believe you want. It isn't entirely clear though; one may get an arbitrary number of row-pairs that sum to the same total; there may exist unique subsets of rows within them that sum to the same value. For instance:
Given this set of row-pairs that sum to the same total
[[19 19 30 30]
[11 16 11 16]]
There exists a unique subset of these rows that may still be counted as valid; but should it?
[[19 30]
[16 11]]
Anyway, I hope those details are easy to deal with, given the code below.
import numpy as np

n = 20
#also works for non-square A
A = np.random.randint(2, size=(n*6,n)).astype(np.int8)
##A = np.array( [[0, 0, 0], [1, 1, 1], [1, 1 ,1]], np.uint8)
##A = np.zeros((6,6))
#force the inclusion of some hits, to keep our algorithm on its toes
##A[0] = A[1]

def base_pack_lazy(a, base, dtype=np.uint64):
    """
    pack the last axis of an array as minimal base representation
    lazily yields packed columns of the original matrix
    """
    a = np.ascontiguousarray( np.rollaxis(a, -1))
    packing = int(np.dtype(dtype).itemsize * 8 / (float(base) / 2))
    for columns in np.array_split(a, (len(a)-1)//packing+1):
        R = np.zeros(a.shape[1:], dtype)
        for col in columns:
            R *= base
            R += col
        yield R

def unique_count(a):
    """returns counts of unique elements"""
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), np.int)
    np.add.at(count, inverse, 1)  #note; this scatter operation requires numpy 1.8; use a sparse matrix otherwise!
    return unique, count, inverse

def voidview(arr):
    """view the last axis of an array as a void object. can be used as a faster form of lexsort"""
    return np.ascontiguousarray(arr).view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1]))).reshape(arr.shape[:-1])

def has_identical_row_sums_lazy(A, combinations_index):
    """
    compute the existence of combinations of rows summing to the same vector,
    given an nxm matrix A and an index matrix specifying all combinations
    naively, we need to compute the sum of each row combination at least once, giving n^3 computations
    however, this isn't strictly required; we can lazily consider the columns, giving an early exit opportunity
    all nicely vectorized of course
    """
    multiplicity, combinations = combinations_index.shape
    #list of indices into combinations_index, denoting possibly interacting combinations
    active_combinations = np.arange(combinations, dtype=np.uint32)
    #keep all packed columns; we might need them later
    columns = []
    for packed_column in base_pack_lazy(A, base=multiplicity+1):  #loop over packed cols
        columns.append(packed_column)
        #compute rowsums only for a fixed number of columns at a time.
        #this is O(n^2) rather than O(n^3), and after considering the first column,
        #we can typically already exclude almost all combinations
        partial_rowsums = sum(packed_column[I[active_combinations]] for I in combinations_index)
        #find duplicates in this column
        unique, count, inverse = unique_count(partial_rowsums)
        #prune those combinations which we can exclude as having different sums, based on columns inspected thus far
        active_combinations = active_combinations[count[inverse] > 1]
        #early exit; no pairs
        if len(active_combinations)==0:
            return False

    """
    we now have a small set of relevant combinations, but we have lost the details of their particulars
    to see which combinations of rows do sum to the same value, we do need to consider rows as a whole
    we can simply apply the same mechanism, but for all columns at the same time,
    but only for the selected subset of row combinations known to be relevant
    """
    #construct full packed matrix
    B = np.ascontiguousarray(np.vstack(columns).T)
    #perform all relevant sums, over all columns
    rowsums = sum(B[I[active_combinations]] for I in combinations_index)
    #find the unique rowsums, by viewing rows as a void object
    unique, count, inverse = unique_count(voidview(rowsums))
    #if not, we did something wrong in deciding on active combinations
    assert(np.all(count>1))
    #loop over all sets of rows that sum to an identical unique value
    for i in xrange(len(unique)):
        #set of indexes into combinations_index;
        #note that there may be more than two combinations that sum to the same value; we grab them all here
        combinations_group = active_combinations[inverse==i]
        #associated row-combinations
        #array of shape=(multiplicity,group_size)
        row_combinations = combinations_index[:,combinations_group]
        #if no duplicate rows involved, we have a match
        if len(np.unique(row_combinations[:,[0,-1]])) == multiplicity*2:
            print row_combinations
            return True
    #none of identical rowsums met uniqueness criteria
    return False

def has_identical_triple_row_sums(A):
    n = len(A)
    idx = np.array( [(i,j,k)
        for i in xrange(n)
            for j in xrange(n)
                for k in xrange(n)
                    if i<j and j<k], dtype=np.uint16)
    idx = np.ascontiguousarray( idx.T)
    return has_identical_row_sums_lazy(A, idx)

def has_identical_double_row_sums(A):
    n = len(A)
    idx = np.array(np.tril_indices(n,-1), dtype=np.int32)
    return has_identical_row_sums_lazy(A, idx)

from time import clock
t = clock()
for i in xrange(1):
##    print has_identical_double_row_sums(A)
    print has_identical_triple_row_sums(A)
print clock()-t
Edit: code cleanup

Computing ratings in matrix in python

I have been trying for a long time to solve this, but I am unable to think of a clean data structure to do the following.
I have a csv file as follows:
            user_id --->
item_id        ratings
   |
   |
   |
   V
So for example:
1,2,3,4,..
a,4, ,2, ,...
b, ,2,3, ,..
c, ,1,2,3,
d
and so on...
The blank value means that user hasn't rated a given item.
Now, for a given user (say 1), I have this dictionary:
weight_vector = {2:0.3422,3:0.222}
The computation I want to do is the following:
For user 1, for the items whose ratings are missing (items b and c), I want to assign a rating as follows:
rating_for_item_for_user_1 = (rating_given_by_user_2 * weight_2 + rating_given_by_user_3 * weight_3) / (weight_2 + weight_3)
If user 2 or 3 has not rated a given item, then weight = 0.
I have a feeling that with numpy this should be fairly straightforward. But have not been able to think straight.
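For what it's worth, a small worked sketch of that weighted average for one user (the names are illustrative; it assumes a 2-D numpy array with rows = items and columns = users, and 0 meaning "not rated"):

import numpy as np

ratings = np.array([[4., 0., 2., 0.],
                    [0., 2., 3., 0.],
                    [0., 1., 2., 3.]])  # rows: items a, b, c; columns: users 1..4
weights = {1: 0.3422, 2: 0.222}         # 0-based column indices of users 2 and 3

user = 0  # user 1 (0-based)
for item in np.where(ratings[:, user] == 0)[0]:
    num = sum(ratings[item, u] * w for u, w in weights.items() if ratings[item, u] != 0)
    den = sum(w for u, w in weights.items() if ratings[item, u] != 0)
    if den > 0:
        ratings[item, user] = num / den
print(ratings[:, user])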
Let's assume that you have a rating matrix and a collection of weight vectors `weights`; then you can simply do the following (assuming that the "empty" fields are zeros; this is a border case you have to think about, because you can still run into division by zero when all of a user's "neighbours" also did not rate a given item):
empty = np.where(ratings == 0)
for (x, y) in zip(empty[0], empty[1]):
    # x indexes the user (row) and y the item (column); weights[x] maps each neighbour n of user x to its weight
    ratings[x, y] = (sum(ratings[n][y] * weights[x][n] for n in weights[x] if ratings[n][y] != 0)
                     / sum(weights[x][n] for n in weights[x] if ratings[n][y] != 0))
To prevent division by zero errors you could just check for it before assignment
empty = np.where(ratings == 0)
for (x, y) in zip(empty[0], empty[1]):
    normalizer = sum(weights[x][n] for n in weights[x] if ratings[n][y] != 0)
    if normalizer > 0:
        ratings[x, y] = sum(ratings[n][y] * weights[x][n] for n in weights[x] if ratings[n][y] != 0) / normalizer
Another possibility is to use defaultdict from collections.
http://docs.python.org/2/library/collections.html#collections.defaultdict
from collections import defaultdict
d = defaultdict(float)  # missing keys default to 0.0
d[x] = 0
If you want it as a matrix, so that you can access it both column-wise and row-wise, you might want to load it into two different data structures, or load it into one, calculate, and then transpose it.
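A hedged sketch of that idea (the file name and column layout are assumptions), loading the CSV into nested defaultdicts keyed both ways:

import csv
from collections import defaultdict

by_item = defaultdict(lambda: defaultdict(float))  # by_item[item_id][user_id] -> rating
by_user = defaultdict(lambda: defaultdict(float))  # by_user[user_id][item_id] -> rating

with open("ratings.csv", newline="") as fh:
    reader = csv.reader(fh)
    user_ids = next(reader)[1:]  # header row: first cell empty, then the user ids
    for row in reader:
        item_id, cells = row[0], row[1:]
        for user_id, cell in zip(user_ids, cells):
            if cell.strip():  # blank cell means "not rated"; defaults stay at 0.0
                by_item[item_id][user_id] = float(cell)
                by_user[user_id][item_id] = by_item[item_id][user_id]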
