How to Quickly Remove Similar Strings? - python

Right now, I have a dataset of around 70,000 tweets and I'm trying to remove tweets that are overly similar. I have decided to use a Levenshtein ratio of greater than 0.9 as a threshold for tweets that are too similar. However, my current code is essentially comparing every tweet to every other tweet, giving me O(N^2), which is painfully slow. It takes over 12 hours to run, which is just too much. I tried parallelizing it, which is what I'm running right now, but I don't know if that will speed it up to an acceptable degree. Is there some way I can accomplish this in O(N)?
import json
import multiprocessing as mp
from pathlib import Path

import Levenshtein

def check_ratio(similar_removed, try_tweet):
    for tweet in similar_removed:
        if Levenshtein.ratio(try_tweet, tweet[1]) > 0.9:
            return 1
    return 0

def main():
    path = Path('C:/Data/Python/JobLoss')
    orig_data = []
    with open(path / 'Processed.json') as f:
        data = json.load(f)
        for tweet in data:
            orig_data.append(tweet[2])
    pool = mp.Pool(mp.cpu_count())
    similar_removed = []
    similar_removed.append((0, orig_data[0]))
    for i in range(1, len(orig_data)):
        print('%s / %s' % (i, len(orig_data)))
        too_similar = 0
        try_tweet = orig_data[i]
        too_similar = pool.apply(check_ratio, args=(similar_removed, try_tweet))
        if not too_similar:
            similar_removed.append((i, try_tweet))
    pool.close()
    final = [data[ind[0]] for ind in similar_removed]
    with open(path / 'ProcessedSimilarRemoved.json', 'w') as f:
        json.dump(final, f)

if __name__ == '__main__':
    main()

I ended up using the method described in the top answer to this question, which uses LSH for sublinear-time queries, giving sub-quadratic overall complexity; fast enough for my needs. Essentially you turn each tweet into a set of k-shingles, then use MinHash LSH to quickly bucket similar sets together. Then, instead of comparing each tweet to every other tweet, you only need to look in its bucket for matches, which makes this considerably faster than computing the Levenshtein ratio on all pairs.
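For reference, here is a minimal sketch of that approach using the datasketch library; the shingle size k, num_perm, and threshold values are illustrative assumptions rather than the exact settings I used, and note that LSH buckets by estimated Jaccard similarity of the shingle sets, which is a proxy for (not the same as) the Levenshtein ratio:

from datasketch import MinHash, MinHashLSH

def shingles(text, k=5):
    # character k-shingles of a tweet
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def deduplicate(tweets, threshold=0.9, num_perm=128):
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for i, tweet in enumerate(tweets):
        m = MinHash(num_perm=num_perm)
        for sh in shingles(tweet):
            m.update(sh.encode('utf8'))
        # only tweets hashed into the same bucket are returned as candidates,
        # so each query is far cheaper than a scan over all previous tweets
        if not lsh.query(m):
            lsh.insert(i, m)
            kept.append((i, tweet))
    return kept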

Related

How to create a script that gives me every combination possible of a six digit code

A friend and I want to create a script that gives us every possible permutation of a six-digit code, made up of 36 alphanumeric characters (0-9 and a-z), in alphabetical order, and then writes them to a .txt file.
I also want it to use all of the CPU and RAM it can, so the task completes as quickly as possible.
So far, this is the code:
import random

charset = "0123456789abcdefghijklmnopqrstuvwxyz"
links = []
file = open("codes.txt", "a")
for g in range(0, 36**6):
    key = ""
    base = ""
    print(str(g))
    for i in range(0, 6):
        char = random.choice(charset)
        key += char
    base += key
    file.write(base + "\n")
file.close()
This code randomly generates codes and immediately writes them to a .txt file, while printing how many codes it has already created, but the output isn't in alphabetical order (that has to be done afterwards), and it takes too long.
How can the code be improved to give the desired outcome?
Thanks to @R0Best for providing the best answer
Although this post already has 6 answers, I'm not content with any of them, so I've decided to contribute a solution of my own.
First, note that many of the answers provide combinations or permutations of the letters; however, the post actually wants the Cartesian product of the alphabet with itself (repeated N times, where N=6). There are (at this time) two answers that do this, but they both write an excessive number of times, resulting in subpar performance, and also concatenate their intermediate results in the hottest portion of the loop (also hurting performance).
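(For reference, the straightforward way to get that Cartesian product, already in sorted order, is itertools.product; a minimal unoptimized sketch is shown below, and the code that follows batches these writes for much better throughput.)

from itertools import product
from string import digits, ascii_lowercase

charset = digits + ascii_lowercase  # already in the desired sort order

with open("codes.txt", "w") as f:
    for combo in product(charset, repeat=6):  # 36**6 tuples, in lexicographic order
        f.write("".join(combo) + "\n")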
In the interest of taking optimization to the absolute max, I present the following code:
from string import digits, ascii_lowercase
from itertools import chain

ALPHABET = (digits + ascii_lowercase).encode("ascii")

def fast_brute_force():
    # Define some constants to make the following sections more readable
    base_size = 6
    suffix_size = 4
    prefix_size = base_size - suffix_size
    word_size = base_size + 1

    # define two containers
    #   word_blob - placeholder words, with hyphens in the unpopulated characters (followed by newline)
    #   sleds - a tuple of repeated bytes, used for substituting a bunch of characters in a batch
    word_blob = bytearray(b"-" * base_size + b"\n")
    sleds = tuple(bytes([char]) for char in ALPHABET)

    # iteratively extend word_blob and sleds, and filling in unpopulated characters using the sleds
    # in doing so, we construct a single "blob" that contains concatenated suffixes of the desired
    # output with placeholders so we can quickly substitute in the prefix, write, repeat, in batches
    for offset in range(prefix_size, base_size)[::-1]:
        word_blob *= len(ALPHABET)
        word_blob[offset::word_size] = chain.from_iterable(sleds)
        sleds = tuple(sled * len(ALPHABET) for sled in sleds)

    with open("output.txt", "wb") as f:
        # I've expanded out the logic for substituting in the prefixes into explicit nested for loops
        # to avoid both redundancy (reassigning the same value) and avoiding overhead associated with
        # a recursive implementation
        # I assert this below, so any changes in suffix_size will fail loudly
        assert prefix_size == 2
        for sled1 in sleds:
            word_blob[0::word_size] = sled1
            for sled2 in sleds:
                word_blob[1::word_size] = sled2
                # we write to the raw FileIO since we know we don't need buffering or other fancy
                # bells and whistles, however in practice it doesn't seem that much faster
                f.raw.write(word_blob)
There's a lot of magic happening in that code block, but in a nutshell:
I batch the writes, so that I'm writing 36**4 or 1679616 entries at once, so there's less context switching.
I update all 1679616 entries per batch simultaneously with the new prefix, using bytearray slicing / assignment.
I operate on bytes, write to the raw FileIO, expand the loops for the prefix assignments, and other small optimizations to avoid encoding/buffering/function call overhead/other performance hits.
Note: unless you have a very fast disk and a slowish CPU, you probably won't see much benefit from the smaller optimizations, just from the write batching.
On my system, it takes about 45 seconds to produce and write the 14,880,348 KB (~15 GB) file, and that's writing to my slowest disk. On my NVMe drive, it takes 6.868 seconds.
The fastest way I can think of is using pypy3 with this code:
import functools
import time
from string import digits, ascii_lowercase

@functools.lru_cache(maxsize=128)
def main():
    cl = []
    cs = digits + ascii_lowercase
    for letter in cs:
        cl.append(letter)
    ct = tuple(cl)
    with open("codes.txt", "w") as file:
        for p1 in ct:
            for p2 in ct:
                for p3 in ct:
                    for p4 in ct:
                        for p5 in ct:
                            for p6 in ct:
                                file.write(f"{p1}{p2}{p3}{p4}{p5}{p6}\n")

if __name__ == '__main__':
    start = time.time()
    main()
    print(f"Done!\nTook {time.time() - start} seconds!")
It writes at around 10-15 MB/s. The total file is around 15 GB, I believe, so it would take roughly 990-1500 seconds to generate. These results are from an Unraid VM with a single 3.4 GHz server-CPU core and an old SATA3 SSD; you will probably get better results with an NVMe drive and a faster single core.
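Not part of the answer above, but as a middle ground, here is a hedged sketch of the same nested-loop idea with the writes batched per 36-line block, so file.write() runs 36**5 times instead of 36**6 while the code stays simple:

from string import digits, ascii_lowercase

def batched_main():
    cs = digits + ascii_lowercase
    with open("codes.txt", "w") as file:
        for p1 in cs:
            for p2 in cs:
                for p3 in cs:
                    for p4 in cs:
                        for p5 in cs:
                            prefix = p1 + p2 + p3 + p4 + p5
                            # one write per 36-code block instead of one per code
                            file.write("".join(f"{prefix}{p6}\n" for p6 in cs))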
The random module can be very inefficient. You can try:
from itertools import permutations
from pandas import Series

charset = list("0123456789abcdefghijklmnopqrstuvwxyz")
file = open("codes.txt", "a")
comb = permutations(charset, 6)
comb = list(comb)
comb = list(map(lambda x: ''.join(x), comb))  # a lambda body must be an expression, not a return statement
mySeries = Series(comb)
mySeries = mySeries.sort_values()
for k in mySeries:
    file.write(k + "\n")
file.close()
You could use itertools.permutations from the standard itertools library. You can also specify the number of characters in each combination.
from itertools import permutations

charset = "0123456789abcdefghijklmnopqrstuvwxyz"
c = permutations(charset, 6)
with open('code.txt', 'w') as f:
    for i in c:
        f.write("".join(i) + '\n')
On my computer it takes about 200 milliseconds to set up the permutations, then spends most of its time writing to the file.
For permutations, this would do the trick:
from itertools import permutations

charset = "0123456789abcdefghijklmnopqrstuvwxyz"
links = []
with open("codes.txt", "w") as f:
    for permutation in permutations(charset, 6):
        f.write(''.join(permutation) + '\n')
FYI, it would create a 7.8 GigaByte file
For combinations, this would do the trick:
from itertools import combinations

charset = "0123456789abcdefghijklmnopqrstuvwxyz"
links = []
with open("codes.txt", "w") as f:
    for comb in combinations(charset, 6):
        f.write(''.join(comb) + '\n')
FYI, it would create a 10.8 megabyte file
First thing: there are better ways to do this, but I want to write something clear and understandable.
Pseudo Code:
base = "";
for(x1=0; x1<charset.length(); x1++)
for(x2=0; x2<charset.length(); x2++)
for(x3=0; x3<charset.length(); x3++)
.
.
.
{ base = charset[x1]+charset[x2]+charset[x3]+.....+charset[x6];
file.write(base + "\n")
}
This is a combination problem where you are trying to get combinations of length 6 from a character set of length 36. This will produce an output of size 36!/(30!*6!). You can refer to itertools for solving a combination problem like yours; see the combinations function in the itertools documentation. It is not recommended to perform such a performance-intensive computation in pure Python.
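As a sanity check on the sizes involved (and on the distinction drawn in the accepted answer above), a small illustrative snippet comparing combinations, permutations, and the full Cartesian product (math.comb and math.perm need Python 3.8+):

import math

n, k = 36, 6
print(math.comb(n, k))  # 1,947,792 combinations (order ignored, no repeats)
print(math.perm(n, k))  # 1,402,410,240 permutations (ordered, no repeats)
print(n ** k)           # 2,176,782,336 codes in the full Cartesian product (repeats allowed)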

Finding image similarities in a folder of thousands

I've cobbled together/written some code (thanks, Stack Overflow users!) that checks for similarities between images using imagehash, but now I am having issues checking thousands of images (roughly 16,000). Is there anything I could improve in the code (or a different route entirely) that would more accurately find matches and/or decrease the time required? Thanks!
I first changed the list that is created to itertools.combinations, so it only compares unique pairs of images.
new_loc = os.chdir(r'''myimagelocation''')
dirloc = os.listdir(r'''myimagelocation''')
duplicates = []
dup = []

for f1, f2 in itertools.combinations(dirloc, 2):
    #Honestly not sure which hash method to use, so I went with dhash.
    dhash1 = imagehash.dhash(Image.open(f1))
    dhash2 = imagehash.dhash(Image.open(f2))
    hashdif = dhash1 - dhash2
    if hashdif < 5:  #May change the 5 to find more accurate matches
        print("images are similar due to dhash", "image1", f1, "image2", f2)
        duplicates.append(f1)
        dup.append(f2)

        #Setting up a CSV file with the similar images to review before deleting
        with open("duplicates.csv", "w") as myfile:
            wr = csv.writer(myfile)
            wr.writerows(zip(duplicates, dup))
Currently, this code may take days to process the number of images I have in the folder. I'm hoping to reduce this down to hours if possible.
Try this: instead of hashing each image at comparison time (127,992,000 hashes), hash ahead of time and compare the stored hashes, since they are not going to change (16,000 hashes).
new_loc = os.chdir(r'''myimagelocation''')
dirloc = os.listdir(r'''myimagelocation''')
duplicates = []
dup = []
hashes = []

for file in dirloc:
    hashes.append((file, imagehash.dhash(Image.open(file))))

for pair1, pair2 in itertools.combinations(hashes, 2):
    f1, dhash1 = pair1
    f2, dhash2 = pair2
    #Honestly not sure which hash method to use, so I went with dhash.
    hashdif = dhash1 - dhash2
    if hashdif < 5:  #May change the 5 to find more accurate matches
        print("images are similar due to dhash", "image1", f1, "image2", f2)
        duplicates.append(f1)
        dup.append(f2)

#Setting up a CSV file with the similar images to review before deleting
with open("duplicates.csv", "w") as myfile:  # also move this out of the loop so you aren't rewriting the file every time
    wr = csv.writer(myfile)
    wr.writerows(zip(duplicates, dup))

Python - List appending getting slower?

I have a problem with my Python code. I am processing a big (3.5 GB) JSON file that contains scores, and I need it in chunks of 21984 scores (all the scores for one query). The code works, but my test set is 4000 queries. The first 10 go fast, but after that the time to compute this part of the code increases exponentially, so after 5 hours I'm only at 500 queries. The prints were for logging, and it seems that the problem lies in translating or appending the lines to the list. Does anyone know how to make this faster, or see what causes it to get slower?
def getscorebatch(number):
    print('Creating Batch..')
    batch_temp = list()
    with open(json_file_name, 'r') as FileObj:
        print("Creating slice...")
        lines_gen = islice(FileObj, (21894 * number), ((21894 * number) + 21894))
        print("Appending slice...")
        for line in lines_gen:
            line = line.translate({ord(c): None for c in ':",}{ \n'})
            batch_temp.append(line)
    return batch_temp
UPDATE: I tried to implement your suggestions, and it is way faster! Thank you so much. I'm fairly new to generators, so what I don't understand now is: how do I get the correct chunk from the code? This will give me the first chunk every time.
def generator(file_to_read):
    c = 0
    while c < 21894:
        data = file_to_read.readline()
        c += 1
        if not data:
            break
        yield data

def getscorebatch(number):
    print('Creating Batch..')
    batch_temp = [0]*22000
    with open(json_file_name, 'r') as FileObj:
        gen_file = generator(FileObj)
        batch_temp = [line.translate(line.maketrans("", "", REMOVE)) for line in gen_file]
        print(len(batch_temp))
    return batch_temp
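One way to get successive chunks (not from the original thread; a minimal sketch assuming the same REMOVE characters and chunk size as above) is to open the file once and yield one cleaned chunk per query, instead of re-opening and re-scanning it for every chunk number:

from itertools import islice

CHUNK = 21894
REMOVE = ':",}{ \n'                     # same characters stripped in the code above
TABLE = str.maketrans("", "", REMOVE)   # build the translation table once, not per line

def iter_batches(json_file_name, chunk_size=CHUNK):
    # read the file once and yield successive lists of cleaned lines
    with open(json_file_name, 'r') as file_obj:
        while True:
            batch = [line.translate(TABLE) for line in islice(file_obj, chunk_size)]
            if not batch:
                return
            yield batch

# usage: one pass over the file, one batch per query
# for number, batch in enumerate(iter_batches(json_file_name)):
#     process(number, batch)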

Python input/output optimisation

I think this code takes too long to execute, so maybe there are better ways to do this. I'm not looking for an answer related to parallelising the for loops, or using more than one processor.
What I'm trying to do is read values from "file" using np.genfromtxt(file). I have 209*500*16 of these files. I want to extract the minimum of the highest 1000 values from the 209 loop, and put these 500 values into 16 different files. If a file is missing or the data doesn't have the adequate size, the info is written to the "missing_all" file.
The questions are:
Is this the best method to open a file?
Is this the best method to write to files?
How can I make this code faster?
Code:
import numpy as np
import os.path

output_filename2 = '/home/missing_all.txt'
target2 = open(output_filename2, 'w')

for w in range(16):
    group = 1200 + 50*w
    output_filename = '/home/veto_%s.txt' %(group)
    target = open(output_filename, 'w')
    for z in range(1,501):
        sig_b = np.zeros((209*300))
        y = 0
        for index in range(1,210):
            file = '/home/BandNo_%s_%s/%s_209.dat' %(group,z,index)
            if not os.path.isfile(file):
                sig_b[y:y+300] = 0
                y = y + 300
                target2.write('%s %s %s\n' % (group,z,index))
                continue
            data = np.genfromtxt(file)
            if (data.shape[0] < 300):
                sig_b[y:y+300] = 0
                y = y + 300
                target2.write('%s %s %s\n' % (group,z,index))
                continue
            sig_b[y:y+300] = np.sort(data[:,4])[::-1][0:300]
            y = y + 300
        sig_b = np.sort(sig_b[:])[::-1][0:1000]
        target.write('%s\n' % (sig_b[-1]))
Profiler
You can use a profiler to figure out what parts of your script take the most time. This way you know exactly what takes the most time and can optimize those lines instead of blindly trying to optimize your code. The time invested to figure out how the profiler works will pay for itself easily later on.
Some possible slow-downs
Here are some guesses, but they really are only guesses.
You open() only 17 files, so it probably doesn't matter how exactly you do this.
I don't know much about writing-performance. Using file.write() seems fine to me.
genfromtxt probably takes quite a while (depending on your input files); is loadtxt an alternative for you? The docs state you can use it for data without holes.
Using a binary file format instead of text could speed up reading the file.
You sort your array on every iteration. Is there a way to sort it only at the end?
Usually asking the file system something is not very fast, i.e. os.path.isfile(file) is potentially slow. You could try creating a dict of all the children of the parent directory and using that cached version (a sketch of this follows the snippet below).
Similarly, if most of your files exist, using exceptions can be faster:
try:
    data = np.genfromtxt(file)
except FileNotFoundError: # not sure if this is the correct exception
    sig_b[y:y+300] = 0
    y += 300
    target2.write('%s %s %s\n' % (group,z,index))
    continue
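And a minimal sketch of the cached directory listing mentioned in the list above; the directory and file-name pattern just mirror the question's paths, so treat them as an assumption:

import os

# cache the children of each per-group directory once, instead of calling
# os.path.isfile() for every (group, z, index) combination
existing = {}

def file_exists(group, z, index):
    directory = '/home/BandNo_%s_%s' % (group, z)
    if directory not in existing:
        try:
            existing[directory] = set(os.listdir(directory))
        except FileNotFoundError:
            existing[directory] = set()
    return ('%s_209.dat' % index) in existing[directory]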
I didn't try to understand your code in detail. Maybe you can reduce the necessary work by using a smarter algorithm?
PS: I like that you try to put all equal signs on the same column. Unfortunately here it makes it harder to read your code.

Python compare every line in file with all others

I am implementing a statistical program and have created a performance bottleneck; I was hoping to get some help from the community to point me in the direction of optimization.
I am creating a set for each row in a file and finding the intersection of that set with the set of each other row in the same file. I then use the size of that intersection to filter certain sets from the output. The problem is that I have a nested for loop (O(n^2)), and the standard size of the files coming into the program is just over 20,000 lines. I have timed the algorithm: for under 500 lines it runs in about 20 minutes, but for the big files it takes about 8 hours to finish.
I have 16GB of RAM at my disposal and a significantly quick 4-core Intel i7 processor. I have noticed no significant difference in memory use when copying list1 and using a second list for comparison instead of opening the file again (maybe this is because I have an SSD?). I thought the 'with open' mechanism reads/writes directly to the HDD, which is slower, but noticed no difference when using two lists. In fact, the program rarely uses more than 1GB of RAM during operation.
I am hoping that other people have used a certain datatype, or maybe understand multiprocessing in Python better, and that they might be able to help me speed things up. I appreciate any help and I hope my code isn't too poorly written.
import ast, sys, os, shutil

list1 = []
end = 0
filterValue = 3

# creates output file with filterValue appended to name
with open(arg2 + arg1 + "/filteredSets" + str(filterValue) , "w") as outfile:
    with open(arg2 + arg1 + "/file", "r") as infile:
        # create a list of sets of rows in file
        for row in infile:
            list1.append(set(ast.literal_eval(row)))
        infile.seek(0)
        for row in infile:
            # if file only has one row, no comparisons need to be made
            if not(len(list1) == 1):
                # get the first set from the list and...
                set1 = set(ast.literal_eval(row))
                # ...find the intersection of every other set in the file
                for i in range(0, len(list1)):
                    # don't compare the set with itself
                    if not(pos == i):
                        set2 = list1[i]
                        set3 = set1.intersection(set2)
                        # if the two sets have less than 3 items in common
                        if(len(set3) < filterValue):
                            # and you've reached the end of the file
                            if(i == len(list1)):
                                # append the row in outfile
                                outfile.write(row)
                                # increase position in infile
                                pos += 1
                        else:
                            break
            else:
                outfile.write(row)
Sample input would be a file with this format:
[userID1, userID2, userID3]
[userID5, userID3, userID9]
[userID10, userID2, userID3, userID1]
[userID8, userID20, userID11, userID1]
The output file if this were the input file would be:
[userID5, userID3, userID9]
[userID8, userID20, userID11, userID1]
...because the two sets removed contained three or more of the same user IDs.
This answer is not about how to split the code into functions, name variables, etc. It's about a faster algorithm in terms of complexity.
I'd use a dictionary. I won't write exact code; you can do it yourself.
Sets = dict()
for rowID, row in enumerate(Rows):
    for userID in row:
        if Sets.get(userID) is None:
            Sets[userID] = set()
        Sets[userID].add(rowID)
So, now we have a dictionary which can be used to quickly obtain the row numbers of rows containing a given userID.
BadRows = set()
for rowID, row in enumerate(Rows):
    Intersections = dict()
    for userID in row:
        for rowID_cmp in Sets[userID]:
            if rowID_cmp != rowID:
                Intersections[rowID_cmp] = Intersections.get(rowID_cmp, 0) + 1
    # Now Intersections contains info about how many "times"
    # row numbered rowID_cmp intersects the current row
    filteredOut = False
    for rowID_cmp in Intersections:
        if Intersections[rowID_cmp] >= filterValue:
            BadRows.add(rowID_cmp)
            filteredOut = True
    if filteredOut:
        BadRows.add(rowID)
Having saved the row numbers of all filtered-out rows to BadRows, we now iterate one last time:
for rowID, row in enumerate(Rows):
    if rowID not in BadRows:
        # output row
This works in 3 scans and in O(n log n) time. Maybe you'd have to rework iterating over the Rows array, because it's a file in your case, but that doesn't really change much.
I'm not sure about Python syntax and details, but you get the idea behind my code.
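For completeness, a small sketch of how Rows could be built from the question's input file; this assumes the IDs in the file are quoted so that ast.literal_eval can parse each line, as the question's own code does:

import ast

def load_rows(path):
    # one list of userIDs per non-empty line, e.g. ["userID1", "userID2", ...]
    with open(path) as infile:
        return [list(ast.literal_eval(line)) for line in infile if line.strip()]

# Rows = load_rows(arg2 + arg1 + "/file")   # same path expression as in the question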
First of all, please pack your code into functions which each do one thing well.
def get_data(*args):
    # get the data.
    ...

def find_intersections_sets(list1, list2):
    # do the intersections part.
    ...

def loop_over_some_result(result):
    # insert assertions so that you don't end up looping in infinity:
    assert result is not None
    ...

def myfunc(*args):
    source1, source2 = args
    L1, L2 = get_data(source1), get_data(source2)
    intersects = find_intersections_sets(L1, L2)
    ...

if __name__ == "__main__":
    myfunc()
then you can easily profile the code using:
if __name__ == "__main__":
    import cProfile
    cProfile.run('myfunc()')
which gives you invaluable insight into your code behaviour and allows you to track down logical bugs. For more on cProfile, see How can you profile a python script?
An option to track down a logical flaw (we're all human, right?) is to use a timeout function in a decorator, like this (Python 2) or this (Python 3):
With that, myfunc can be changed to:
def get_data(*args):
    # get the data.
    ...

def find_intersections_sets(list1, list2):
    # do the intersections part.
    ...

def myfunc(*args):
    source1, source2 = args
    L1, L2 = get_data(source1), get_data(source2)
    #timeout(10) # seconds <---- the clever bit!
    intersects = find_intersections_sets(L1, L2)
    ...
...where the timeout operation will raise an error if it takes too long.
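The linked recipes aren't reproduced here, but a minimal sketch of such a decorator (Unix-only, SIGALRM-based; treat the details as an assumption rather than the exact recipe referenced) looks like this:

import signal
from functools import wraps

def timeout(seconds):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            def handler(signum, frame):
                raise TimeoutError(f"{func.__name__} timed out after {seconds}s")
            old = signal.signal(signal.SIGALRM, handler)
            signal.alarm(seconds)
            try:
                return func(*args, **kwargs)
            finally:
                signal.alarm(0)            # cancel the pending alarm
                signal.signal(signal.SIGALRM, old)
        return wrapper
    return decorator

# usage: put @timeout(10) above a def, so a call that runs longer than 10 seconds raises TimeoutError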
Here is my best guess:
import ast

def get_data(filename):
    with open(filename, 'r') as fi:
        data = fi.readlines()
    return data

def get_ast_set(line):
    return set(ast.literal_eval(line))

def less_than_x_in_common(set1, set2, limit=3):
    if len(set1.intersection(set2)) < limit:
        return True
    else:
        return False

def check_infile(datafile, savefile, filtervalue=3):
    list1 = [get_ast_set(row) for row in get_data(datafile)]
    outlist = []
    for row in list1:
        # keep the row only if it shares fewer than `filtervalue` items with
        # every row kept so far (all() is True for the first, empty outlist)
        if all(less_than_x_in_common(row, kept, limit=filtervalue) for kept in outlist):
            outlist.append(row)
    with open(savefile, 'w') as fo:
        fo.writelines(str(row) + '\n' for row in outlist)

if __name__ == "__main__":
    datafile = str(arg2 + arg1 + "/file")
    savefile = str(arg2 + arg1 + "/filteredSets" + str(filterValue))
    check_infile(datafile, savefile)
