How to add a lot of words into FuzzySet quickly? - python

I have a corpus of about 5 million words that I want to put into a FuzzySet.
Currently, it takes almost 5 minutes. Is there any faster way to do that?
This is my code:
import fuzzyset
fuzzy_set = fuzzyset.FuzzySet()
for word in list_of_words: # len(list_of_words)=~5M
    fuzzy_set.add(word)
I know that a for loop is not the fastest way to do things in Python, but I couldn't find any documented way to add a whole list to a FuzzySet.
Thanks for the help.

You could use multiprocessing.
Split your list_of_words into chunks and run them through a pool.
import fuzzyset
import multiprocessing as mp

fuzzy_set = fuzzyset.FuzzySet()

def add_words(chunk):
    for word in chunk:
        fuzzy_set.add(word)

if __name__ == '__main__':
    n = 500000  # or whatever size you want your chunks split up into
    chunks = [list_of_words[x:x + n] for x in range(0, len(list_of_words), n)]
    pool = mp.Pool(mp.cpu_count())
    pool.map(add_words, chunks)
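As a side note, depending on which version of the fuzzyset package is installed, the FuzzySet constructor may accept an iterable directly (this is an assumption worth checking against your installed version); it still loops over the items internally, so it tidies the code more than it speeds it up:

import fuzzyset

# Assumption: FuzzySet.__init__ accepts an iterable and calls add() on each item internally,
# so this is equivalent to the explicit loop above, not inherently faster.
fuzzy_set = fuzzyset.FuzzySet(list_of_words)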

Related

How to create a script that gives me every combination possible of a six digit code

A friend and I want to create a script that gives us every possible permutation of a six-digit code, composed of 36 alphanumeric characters (0-9 and a-z), in alphabetical order, and then be able to see them in a .txt file.
And I want it to use all of the CPU and RAM it can, so that it takes less time to complete the task.
So far, this is the code:
import random
charset = "0123456789abcdefghijklmnopqrstuvwxyz"
links = []
file = open("codes.txt", "a")
for g in range(0, 36**6):
    key = ""
    base = ""
    print(str(g))
    for i in range(0, 6):
        char = random.choice(charset)
        key += char
    base += key
    file.write(base + "\n")
file.close()
This code randomly generates the combinations and immediately writes them to a .txt file, while printing how many codes it has already created. But the output isn't in alphabetical order (I have to sort it afterwards), and it takes too long.
How can the code be improved to give the desired outcome?
Thanks to @R0Best for providing the best answer
Although this post already has 6 answers, I'm not content with any of them, so I've decided to contribute a solution of my own.
First, note that many of the answers provide the combinations or permutations of letters; however, the post actually wants the Cartesian product of the alphabet with itself (repeated N times, where N=6). There are (at this time) two answers that do this, however they both write an excessive number of times, resulting in subpar performance, and they also concatenate their intermediate results in the hottest portion of the loop (also bringing down performance).
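For reference, the straightforward way to generate that Cartesian product is itertools.product, which yields tuples in lexicographic order when the input is sorted; a minimal, unoptimized sketch for comparison with the batched version below:

from itertools import product
from string import digits, ascii_lowercase

ALPHABET = digits + ascii_lowercase  # already in the desired 0-9, a-z order

# One write per code: correct and simple, but far slower than batching the writes.
with open("output.txt", "w") as f:
    for code in product(ALPHABET, repeat=6):
        f.write("".join(code) + "\n")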
In the interest of taking optimization to the absolute max, I present the following code:
from string import digits, ascii_lowercase
from itertools import chain
ALPHABET = (digits + ascii_lowercase).encode("ascii")
def fast_brute_force():
    # Define some constants to make the following sections more readable
    base_size = 6
    suffix_size = 4
    prefix_size = base_size - suffix_size
    word_size = base_size + 1
    # define two containers
    #   word_blob - placeholder words, with hyphens in the unpopulated characters (followed by newline)
    #   sleds - a tuple of repeated bytes, used for substituting a bunch of characters in a batch
    word_blob = bytearray(b"-" * base_size + b"\n")
    sleds = tuple(bytes([char]) for char in ALPHABET)
    # iteratively extend word_blob and sleds, filling in unpopulated characters using the sleds
    # in doing so, we construct a single "blob" that contains concatenated suffixes of the desired
    # output with placeholders so we can quickly substitute in the prefix, write, repeat, in batches
    for offset in range(prefix_size, base_size)[::-1]:
        word_blob *= len(ALPHABET)
        word_blob[offset::word_size] = chain.from_iterable(sleds)
        sleds = tuple(sled * len(ALPHABET) for sled in sleds)
    with open("output.txt", "wb") as f:
        # I've expanded out the logic for substituting in the prefixes into explicit nested for loops
        # to avoid both redundancy (reassigning the same value) and the overhead associated with
        # a recursive implementation
        # I assert this below, so any changes in suffix_size will fail loudly
        assert prefix_size == 2
        for sled1 in sleds:
            word_blob[0::word_size] = sled1
            for sled2 in sleds:
                word_blob[1::word_size] = sled2
                # we write to the raw FileIO since we know we don't need buffering or other fancy
                # bells and whistles, however in practice it doesn't seem that much faster
                f.raw.write(word_blob)
There's a lot of magic happening in that code block, but in a nutshell:
I batch the writes, so that I'm writing 36**4 or 1679616 entries at once, so there's less context switching.
I update all 1679616 entries per batch simultaneously with the new prefix, using bytearray slicing / assignment.
I operate on bytes, write to the raw FileIO, expand the loops for the prefix assignments, and other small optimizations to avoid encoding/buffering/function call overhead/other performance hits.
Note, unless you have a very fast disk and slowish CPU, you won't see much benefit from the smaller optimizations, just the write batching probably.
On my system, it takes about 45 seconds to produce and write the 14,880,348 KB (~15 GB) file, and that's writing to my slowest disk. On my NVMe drive, it takes 6.868 seconds.
The fastest way I can think of is using pypy3 with this code:
import functools
import time
from string import digits, ascii_lowercase

@functools.lru_cache(maxsize=128)
def main():
    cl = []
    cs = digits + ascii_lowercase
    for letter in cs:
        cl.append(letter)
    ct = tuple(cl)
    with open("codes.txt", "w") as file:
        for p1 in ct:
            for p2 in ct:
                for p3 in ct:
                    for p4 in ct:
                        for p5 in ct:
                            for p6 in ct:
                                file.write(f"{p1}{p2}{p3}{p4}{p5}{p6}\n")

if __name__ == '__main__':
    start = time.time()
    main()
    print(f"Done!\nTook {time.time() - start} seconds!")
It writes at around 10-15 MB/s. The total file is around 15 GB, I believe, so it would take roughly 990-1500 seconds to generate. Those results are from an Unraid VM with one 3.4 GHz core of a server CPU and an old SATA3 SSD. You will probably get better results with an NVMe drive and a faster single core CPU.
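That size estimate is easy to check: each line is 6 characters plus a newline, so the file is 36**6 * 7 bytes:

# 36**6 codes, 7 bytes per line (6 characters + newline)
total_bytes = 36 ** 6 * 7
print(total_bytes)             # 15237476352
print(total_bytes / 10 ** 9)   # roughly 15.2 GB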
Using random can be very inefficient. You can try:
from itertools import permutations
from pandas import Series

charset = list("0123456789abcdefghijklmnopqrstuvwxyz")
file = open("codes.txt", "a")
comb = permutations(charset, 6)
comb = list(comb)
comb = list(map(lambda x: ''.join(x), comb))
mySeries = Series(comb)
mySeries = mySeries.sort_values()
for k in mySeries:
    file.write(k + "\n")
file.close()
You could use itertools.permutations from the standard library's itertools module. You can also specify the number of characters in each permutation.
from itertools import permutations
charset = "0123456789abcdefghijklmnopqrstuvwxyz"
c = permutations(charset, 6)
with open('code.txt', 'w') as f:
    for i in c:
        f.write("".join(i) + '\n')
Runs on my computer in about 200 milliseconds to create the permutations iterator, then spends a lot of time writing to the file.
For permutations, this would do the trick:
from itertools import permutations
charset = "0123456789abcdefghijklmnopqrstuvwxyz"
links = []
with open("codes.txt", "w") as f:
    for permutation in permutations(charset, 6):
        f.write(''.join(permutation) + '\n')
FYI, it would create a 7.8 GB file.
For combinations, this would do the trick:
from itertools import combinations
charset = "0123456789abcdefghijklmnopqrstuvwxyz"
links = []
with open("codes.txt", "w") as f:
    for comb in combinations(charset, 6):
        f.write(''.join(comb) + '\n')
FYI, it would create a 10.8 MB file.
First things first: there are better ways to do this, but I want to write something clear and understandable.
Pseudo Code:
base = "";
for(x1=0; x1<charset.length(); x1++)
for(x2=0; x2<charset.length(); x2++)
for(x3=0; x3<charset.length(); x3++)
.
.
.
{ base = charset[x1]+charset[x2]+charset[x3]+.....+charset[x6];
file.write(base + "\n")
}
This is a combination problem where you are trying to get combinations of length 6 from a character set of length 36. This will produce an output of size 36!/(30!*6!). You can refer to itertools for solving a combination problem like yours; see the combinations function in the itertools documentation. It is recommended not to perform such a performance-intensive computation using Python.
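For scale, the two counts differ enormously; a quick arithmetic check:

import math

# 6-character combinations (order ignored, no repetition) from a 36-character set
print(math.comb(36, 6))   # 1947792, i.e. 36!/(30!*6!)

# 6-character codes where every position may be any of the 36 characters
print(36 ** 6)            # 2176782336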

Convert stream of hex values to 16-bit ints

I get packages of binary strings of size 61440 in hex values, something like:
b'004702AF42324fe380ac...'
I need to split those into batches of 4 and convert them to integers. 16-bit would be preferred, but casting this later is not a problem. The way I did it looks like this, and it works:
out = [int(img[i][j:j+4],16) for j in range(0,len(img[i]), 4)]
The issue I'm having is performance. The thing is, I get a minimum of 200 of those a second, possibly more, and without multithreading I can only get through 100-150 a second.
Can I improve the speed of this in some way?
This is a rewrite of my earlier offering showing how multithreading does, in fact, make a very significant difference - possibly depending on the system architecture.
The following code executes in ~0.05s on my machine:-
import random
from datetime import datetime
import concurrent.futures

N = 10
R = 61440
IMG = []
for _ in range(N):
    IMG.append(''.join(random.choice('0123456789abcdef') for _ in range(R)))
"""
now IMG has N elements each containing R pseudo randomly generated hexadecimal values
"""

def tfunc(img, k):
    return k, [int(img[j:j + 4], 16) for j in range(0, len(img), 4)]

R = [0] * N
start = datetime.now()
with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = []
    """
    note that we pass the relevant index to the worker function
    because we can't be sure of the order of completion
    """
    for i in range(N):
        futures.append(executor.submit(tfunc, IMG[i], i))
    for future in concurrent.futures.as_completed(futures):
        k, r = future.result()
        R[k] = r
"""
list R now contains the converted values from the same relative indexes in IMG
"""
print(f'Duration={datetime.now()-start}')
I don't think multi-threading will help in this case as it's purely CPU intensive. The overheads of breaking it down over, say, 4 threads would outweigh any theoretical advantages. Your list comprehension appears to be as efficient as it can be although I'm unclear as to why img seems to have multiple dimensions. I've written the following simulation and on my machine this consistently executes in ~0.8 seconds. I think the performance you'll get from your code is going to be highly dependent on your CPU's capabilities. Here's the code:-
import random
from datetime import datetime
hv = '0123456789abcdef'
img = ''.join(random.choice(hv) for _ in range(61440))
start = datetime.now()
for _ in range(200):
    out = [int(img[j:j + 4], 16) for j in range(0, len(img), 4)]
print(f'Duration={datetime.now()-start}')
I did some more research and found that not multithreading but multiple processes are what I need. That gave me a speedup from 220 batches per second to ~370 batches per second. It probably bottlenecks somewhere else now, since I only got 15% load on all cores, but it puts me comfortably above spec and that's good enough.
from multiprocessing import Pool
import numpy as np

def combine(img):
    return np.array([int(img[j:j + 4], 16) for j in range(0, len(img), 4)]).reshape((24, 640))

p = Pool(20)
img = p.map(combine, tmp)  # tmp holds the incoming hex strings
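As an aside, every group of 4 hex characters is just a big-endian 16-bit value, so the same conversion can also be expressed without a Python-level loop. A sketch, assuming NumPy is available and the hex string has even length:

import numpy as np

hex_str = "004702af42324fe3" * 3840  # dummy 61440-character hex string for illustration

# bytes.fromhex packs the hex characters into raw bytes; viewing them as big-endian
# unsigned 16-bit integers gives the same values as int(hex_str[j:j+4], 16)
values = np.frombuffer(bytes.fromhex(hex_str), dtype=">u2")
assert values[0] == int(hex_str[0:4], 16)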

How to Quickly Remove Similar Strings?

Right now, I have a dataset of around 70,000 tweets and I'm trying to remove tweets that are overly similar. I have decided to use a Levenshtein ratio of greater than 0.9 as a threshold for tweets that are too similar. However, my current code is essentially comparing every tweet to every other tweet, giving me O(N^2), which is painfully slow. It takes over 12 hours to run, which is just too much. I tried parallelizing it, which is what I'm running right now, but I don't know if that will speed it up to an acceptable degree. Is there some way I can accomplish this in O(N)?
import json
import multiprocessing as mp
from pathlib import Path
import Levenshtein

def check_ratio(similar_removed, try_tweet):
    for tweet in similar_removed:
        if Levenshtein.ratio(try_tweet, tweet[1]) > 0.9:
            return 1
    return 0

def main():
    path = Path('C:/Data/Python/JobLoss')
    orig_data = []
    with open(path / 'Processed.json') as f:
        data = json.load(f)
        for tweet in data:
            orig_data.append(tweet[2])
    pool = mp.Pool(mp.cpu_count())
    similar_removed = []
    similar_removed.append((0, orig_data[0]))
    for i in range(1, len(orig_data)):
        print('%s / %s' % (i, len(orig_data)))
        too_similar = 0
        try_tweet = orig_data[i]
        too_similar = pool.apply(check_ratio, args=(similar_removed, try_tweet))
        if not too_similar:
            similar_removed.append((i, try_tweet))
    pool.close()
    final = [data[ind[0]] for ind in similar_removed]
    with open(path / 'ProcessedSimilarRemoved.json', 'w') as f:
        json.dump(final, f)

if __name__ == '__main__':
    main()
I ended up using a method described in the top answer to this question, which uses LSH for sublinear-time queries, giving an overall subquadratic complexity; fast enough for my needs. Essentially you turn each tweet into a set with k-shingling, then use MinHash LSH to quickly bucket similar sets together. Then, instead of comparing each tweet to every other tweet, you only need to look in its bucket for matches, making this method considerably faster than computing the Levenshtein ratio on all pairs.
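A minimal sketch of that idea using the datasketch library (the shingle size, num_perm, and threshold below are illustrative choices, not values taken from the original code):

from datasketch import MinHash, MinHashLSH

def shingles(text, k=4):
    # k-shingling: the set of all overlapping length-k substrings of the tweet
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

tweets = ["lost my job today because of the pandemic",
          "lost my job today because of the pandemic :(",
          "excited to start a new position next week"]

lsh = MinHashLSH(threshold=0.8, num_perm=128)
minhashes = []
for idx, tweet in enumerate(tweets):
    m = MinHash(num_perm=128)
    for sh in shingles(tweet):
        m.update(sh.encode("utf8"))
    minhashes.append(m)
    lsh.insert(str(idx), m)

# Each tweet is only compared against the candidates in its own buckets
for idx, m in enumerate(minhashes):
    print(idx, lsh.query(m))  # keys of candidate near-duplicates, including the tweet itself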

np.array of texts memory size

I have a list of (possibly long) strings.
When I convert it to an np.array I run out of RAM quite quickly, because it seems to take much more memory than a plain list. Why is that, and how can I deal with it? Or maybe I'm just doing something wrong?
The code:
import random
import string
import numpy as np
from sys import getsizeof
cnt = 100
sentences = []
for i in range(0, cnt):
    word_cnt = random.randrange(30, 100)
    words = []
    for j in range(0, word_cnt):
        word_length = random.randrange(20)
        letters = [random.choice(string.ascii_letters) for x in range(0, word_length)]
        words.append(''.join(letters))
    sentences.append(' '.join(words))
list_size = sum([getsizeof(x) for x in sentences]) + getsizeof(sentences)
print(list_size)
arr = np.array(sentences)
print(getsizeof(arr))
print(arr.nbytes)
The output:
76345
454496
454400
I'm not sure if I'm using getsizeof() correctly, but I started to investigate when I noticed memory problems, so I'm pretty sure there's something going on : )
(Bonus question)
I'm trying to run something similar to https://autokeras.com/examples/imdb/. The original example requires about 3GB of memory, and I wanted to use a bigger dataset. Maybe there's some better way?
I'm using python3.6.9 with numpy==1.17.0 on Ubuntu 18.04.
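For what it's worth, the size difference itself is expected: np.array(sentences) builds a fixed-width unicode array, so every element is padded to the length of the longest sentence at 4 bytes per character, whereas dtype=object only stores references to the existing Python strings. A small sketch illustrating the difference:

import numpy as np

sentences = ["short", "a considerably longer sentence than the first one"]

fixed = np.array(sentences)                # fixed-width dtype, e.g. '<U50'
boxed = np.array(sentences, dtype=object)  # stores pointers to the original str objects

print(fixed.dtype, fixed.itemsize, fixed.nbytes)  # every element padded to the longest string
print(boxed.dtype, boxed.nbytes)                  # one pointer (8 bytes on 64-bit) per element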

How to reduce for loop execution time using multiprocessing in python

I have two lists. List X contains 1000 words. List Y contains 500 words. I am trying to find similar words for List X with respect to Y.
I am using Spacy's similarity function.
The problem I am facing is that the for-loop part of the execution takes a long time. I have understood from research that in Python, multithreading only gives an illusion of concurrency and hence does not give any real performance increase. So I thought multiprocessing is the way to go, but I am new to multiprocessing, hence this request for help.
How do I speed up the execution of the for loop part through multiprocessing in python?
The following is my code.
import en_vectors_web_lg
from operator import itemgetter

nlp = en_vectors_web_lg.load()

ListX = ['HSBC', 'JP Morgan', ......]  # 500-word list
ListY = ['Currency', 'Blockchain', .......]  # 1000-word list

s_words = []
for token1 in ListY:
    list_to_sort = []
    for token2 in ListX:
        list_to_sort.append((token1, token2, nlp(str(token1)).similarity(nlp(str(token2)))))
    sorted_list = sorted(list_to_sort, key=itemgetter(2), reverse=True)[0][:2]
    s_words.append(sorted_list)
You can try this:
import en_vectors_web_lg
from multiprocessing import Pool
import itertools

nlp = en_vectors_web_lg.load()

def compare_function(token1, token2, nlp):
    return token1, token2, nlp(str(token1)).similarity(nlp(str(token2)))

tokenlist = [(a, b, nlp) for a, b in itertools.product(ListX, ListY)]
p = Pool(8)
results = p.starmap(compare_function, tokenlist)  # starmap unpacks each (token1, token2, nlp) tuple

If you are under Windows, run the pool call inside a main guard:

if __name__ == '__main__':
    results = p.starmap(compare_function, tokenlist)
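Whichever way the work is split up, note that the loop in the question re-runs nlp(...) on the same tokens over and over; caching each Doc once removes most of that cost regardless of multiprocessing. A sketch of that idea (independent of the pool code above):

import en_vectors_web_lg

nlp = en_vectors_web_lg.load()

# Parse every token exactly once and reuse the resulting Doc objects
docs_x = {token: nlp(str(token)) for token in ListX}
docs_y = {token: nlp(str(token)) for token in ListY}

s_words = []
for token1 in ListY:
    scored = [(token1, token2, docs_y[token1].similarity(docs_x[token2])) for token2 in ListX]
    s_words.append(max(scored, key=lambda t: t[2]))  # best match for token1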
