Trying to optimize quicksort for larger files - python

Does anyone know how I can optimize this code to handle larger files? It works with smaller inputs, but I need it to run on a file with over 200,000 words. Any suggestions?
Thank you.
import random
import re

def quick_sort(a,i,n):
    if n <= 1:
        return
    mid = (len(a)) // 2
    x = a[random.randint(0,len(a)-1)]
    p = i - 1
    j = i
    q = i + n
    while j < q:
        if a[j] < x:
            p = p + 1
            a[j],a[p] = a[p],a[j]
            j = j + 1
        elif a[j] > x:
            q = q - 1
            a[j],a[q] = a[q],a[j]
        else:
            j = j + 1
    quick_sort(a,i,p-i+1)
    quick_sort(a,q,n-(q-i))
file_name = input("Enter file name: ")
my_list = []
with open(file_name,'r') as f:
    for line in f:
        line = re.sub('[!#?,.:";\']', '', line).lower()
        token = line.split()
        for t in token:
            my_list.append(t)
a = my_list
quick_sort(a,0,len(my_list))
print("List After Calling Quick Sort: ",a)

Your random selection of an index to use for your pivot x is using the whole size of the input list a, not just the part you're supposed to be sorting on the current call. This means that very often your pivot won't be in the current section at all, and so you won't be able to usefully reduce your problem (because all of the values will be on the same side of the pivot). This leads to lots and lots of recursion, and for larger inputs you'll almost always hit the recursion cap.
The fix is simple: just change how you get x:
x = a[random.randrange(i, i+n)]
I like randrange a lot better than randint, but you could use randint(i, i+n-1) if you feel the other way.
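Putting the fix in context, here is the function with the corrected pivot selection (the unused mid line dropped) plus a small self-check at the end; the random test data is my own assumption, just for illustration:

import random

def quick_sort(a, i, n):
    if n <= 1:
        return
    # pick the pivot from the current slice [i, i+n), not from the whole list
    x = a[random.randrange(i, i + n)]
    p = i - 1
    j = i
    q = i + n
    while j < q:
        if a[j] < x:
            p = p + 1
            a[j], a[p] = a[p], a[j]
            j = j + 1
        elif a[j] > x:
            q = q - 1
            a[j], a[q] = a[q], a[j]
        else:
            j = j + 1
    quick_sort(a, i, p - i + 1)
    quick_sort(a, q, n - (q - i))

# hypothetical self-check on 10,000 random "words"
words = ["w%d" % random.randrange(1000) for _ in range(10000)]
expected = sorted(words)
quick_sort(words, 0, len(words))
assert words == expected

With the pivot drawn from the current slice, each call actually splits its range, so the recursion depth stays around log(n) even for large inputs.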

Must you use a quicksort? If you can use a heapq or a PriorityQueue, repeatedly calling .get() (or heapq.heappop()) returns the items in sorted order:
import sys
from queue import PriorityQueue

pq = PriorityQueue()
inp = open(sys.stdin.fileno(), newline='\n')
#inp = ['dag', 'Rug', 'gob', 'kex', 'mog', 'Wes', 'pox', 'sec', 'ego', 'wah'] # for testing
for word in inp:
    word = word.rstrip('\n')
    pq.put(word)
while not pq.empty():
    print(pq.get())
Then test with some large random word input or file e.g.:
shuf /usr/share/dict/words | ./word_pq.py
where shuf is GNU shuf (here /usr/local/bin/shuf).
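Since heapq is also mentioned, here is a roughly equivalent sketch using the heapq module directly, with the same assumed input (one word per line on stdin):

import heapq
import sys

heap = []
for word in sys.stdin:            # one word per line, as in the script above
    heapq.heappush(heap, word.rstrip('\n'))
while heap:
    print(heapq.heappop(heap))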

Related

runtime difference : quick sort and binary search

I have a problem with something I learned.
What I want to do
write the first list and sort it.
write the second list.
define binary search.
compare the runtime of sorting the list (built-in) with the runtime of binary search.
input size: 1,000,000 (both lists).
What is my problem
I learned that the big-O of quicksort is n x log(n), and that of binary search is log(n). So if I repeat the binary search n times, I should get a runtime similar to the quicksort, but I don't.
Code
make first list
import numpy as np

arr = np.random.randint(low=-10000000, high=10000000, size=(1000000,))
with open('./1st_list.txt', 'w') as f:
    f.write(str(len(arr)) + '\n')
    for i in arr:
        f.write(str(i) + '\n')
make second list
arr = np.random.randint(low=-10000000, high=10000000, size=(1000000,))
with open('./2nd_list.txt', 'w') as f:
    f.write(str(len(arr)) + '\n')
    for i in arr:
        f.write(str(i) + '\n')
define binary search
def binary_search(
        arr: list,  # sorted list
        target: int):
    start, end = 0, len(arr)-1
    while True:
        if start > end:
            return None
        mid = (start+end)//2
        if target == arr[mid]:
            return mid+1
        elif target > arr[mid]:
            start = mid+1
        else:
            end = mid-1
load 1st_list
import sys
import time
sys.stdin = open('./1st_list.txt')   # presumably redirected here, as for the 2nd list below
input = sys.stdin.readline

start = time.time()
arr_1st = []
for _ in range(1000000):
    arr_1st.append(int(input().strip()))
arr_1st.sort()
time.time() - start
load 2nd_list
sys.stdin = open('./2nd_list.txt')
input = sys.stdin.readline
arr_2nd = []
for _ in range(1000000):
    arr_2nd.append(int(input().strip()))
compute runtime
import time

start = time.time()
for ele in arr_2nd:
    binary_search(arr_1st, ele)
time.time() - start
result
runtime
On my local machine:
load 1st_list and sort: 0.5s
compute runtime of binary search: 4.2s
I don't know why these are so different.
Thank you for any detailed information.

Extract multiple longest common prefix from a file

I am a newbie in Python and am stuck on a problem: getting the longest common prefixes from a file. I have found solutions on the web for the common prefix between 2 strings, but I'm unable to find any solution for a file.
The program below returns 9, whereas the output I want is 9415007 and 954200701441.
fname = 'Book1 - Copy.csv'
fh = open(fname)
file2 = fh.read()
a = list(file2.split())
prefix_len = len(a[0])
count = 0
lst = list()
for x in a:
    prefix_len = min(prefix_len, len(x))
    while not x.startswith(a[0][: prefix_len]):
        prefix_len = prefix_len-1
prefix = a[0][: prefix_len]
print(prefix)
I expect the output to be 9415007 and 954200701441.
Sample data:
9415007301578
9415007301585
9415007014416
9542007014416
9542007014417
9542007014418
The os.path module contains a commonprefix function that you can use. To find the longest prefix between any two lines, you should first sort the lines and then compare consecutive pairs (keep the longest).
For example:
from os.path import commonprefix
sLines = sorted(lines)
longest = max((commonprefix([a,b]) for a,b in zip(sLines,sLines[1:])),key=len)
common = commonprefix(lines)
print(common,longest) # 9, 954200701441
Note that your sample data only has "9" as the prefix common to all lines, because there are instances of 94... and 95... To get 9415007 as the common prefix, you would need to remove the last 3 lines.
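The snippets above and below assume lines already holds the lines of the file; a minimal way to build it (reusing the file name from the question) might be:

with open('Book1 - Copy.csv') as fh:
    lines = [line.strip() for line in fh if line.strip()]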
If you need to do this on a company by company basis, you will need to group the data by company identifier (first 7 characters):
from collections import defaultdict

companies = defaultdict(list)
for s in lines:
    companies[s[:7]].append(s)
companies = {c: sorted(s) for c, s in companies.items()}
companies = {c: max((commonprefix([a, b]) for a, b in zip(s, s[1:])), key=len) for c, s in companies.items()}
print(companies) # {'9415007': '94150073015', '9542007': '954200701441'}
I'm sure that this is not the best solution, but it is definitely the simplest one.
data = """9415007301578
9415007301585
9415007014416
9542007014416
9542007014417
9542007014418""".splitlines()
longest_prefix = ""
for i in range(len(data) - 1):
temp_prefix = ""
for j in range(min(len(data[i]), len(data[i+1]))):
if data[i][j] == data[i + 1][j]:
temp_prefix += data[i][j]
else:
break
if len(temp_prefix) > len(longest_prefix):
longest_prefix = temp_prefix
print(longest_prefix)
Output:
954200701441

Longest chain of last word of line/first word of next

Okay, so I am trying to find, in a text file, the longest chain in which the last word of one line is the first word of the next (works well for poetry). The Python script I have so far works, but it takes an immensely long time. I am no coding expert and have really no idea about optimization. Am I running through more options than necessary?
How can I reduce the time it takes to run through a longer text?
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
import sys

# Opening the source text
with open("/text.txt") as g:
    all_lines = g.readlines()

def last_word(particular_line):
    if particular_line != "\n":
        particular_line = re.sub(ur'^\W*|\W*$', "", particular_line)
        if len(particular_line) > 1:
            return particular_line.rsplit(None, 1)[-1].lower()

def first_word(particular_line):
    if particular_line != "\n":
        particular_line = re.sub(ur'^\W*|\W*$', "", particular_line)
        if len(particular_line) > 1:
            return particular_line.split(None, 1)[0].lower()

def chain(start, lines, depth):
    remaining = list(lines)
    del remaining[remaining.index(start)]
    possibles = [x for x in remaining if (len(x.split()) > 2) and (first_word(x) == last_word(start))]
    maxchain = []
    for c in possibles:
        l = chain(c, remaining, depth)
        sys.stdout.flush()
        sys.stdout.write(str(depth) + " of " + str(len(all_lines)) + " \r")
        sys.stdout.flush()
        if len(l) > len(maxchain):
            maxchain = l
            depth = str(depth) + "." + str(len(maxchain))
    return [start] + maxchain

#Start
final_output = []

#Finding the longest chain
for i in range(0, len(all_lines)):
    x = chain(all_lines[i], all_lines, i)
    if len(x) > 2:
        final_output.append(x)
final_output.sort(key = len)

#Output on screen
print "\n\n--------------------------------------------"
if len(final_output) > 1:
    print final_output[-1]
else:
    print "Nothing found"
import itertools

def matching_lines(line_pair):
    return line_pair[0].split()[-1].lower() == line_pair[1].split()[0].lower()

line_pairs = ((line, next_line) for line, next_line in itertools.izip(all_lines, all_lines[1:]))
grouped_pairs = itertools.groupby(line_pairs, matching_lines)
print max([len(list(y)) + 1 for x, y in grouped_pairs if x])
although I'm not sure it will be faster (but I think it will be, since it only iterates once and uses mostly builtins).
Yes, this code has a complexity of $O(n^2)$. It means that if your file has $n$ lines, your code performs about $n-1$ iterations for the first line, then $n-2$ for the second line, and so on, for $n$ such elements. For a large $n$, this is proportional to $n^2$. Actually, there's a bug in the code on this line:
del remaining[remaining.index(start)]
where you probably meant to run this:
del remaining[:remaining.index(start)]
(notice the ':' in the square brackets). Without it the runtime expands: instead of (n-1) + (n-2) + (n-3) + ... you get (n-1) + (n-1) + ... + (n-1) = n*(n-1), which is somewhat bigger.
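Spelling out the arithmetic behind those estimates: $(n-1) + (n-2) + \dots + 1 = \frac{n(n-1)}{2} \approx \frac{n^2}{2}$, whereas the buggy version does roughly $n(n-1) \approx n^2$ scans; both are $O(n^2)$, the bug just about doubles the constant.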
You can optimize the code like so: begin with maxchainlen = 0 and curchainlen = 0. Now iterate through the lines, each time comparing the first word of the current line to the last word of the previous line. If they match, increase curchainlen by 1. If they don't, check whether maxchainlen < curchainlen; if so, assign maxchainlen = curchainlen and reset curchainlen to 0. After you finish iterating through the lines, do this check for maxchainlen one more time. Example:
lw = last_word(lines[0])
curchainlen = 0
maxchainlen = 0
for l in lines[1:]:
    if lw == first_word(l):
        curchainlen = curchainlen + 1
    else:
        maxchainlen = max(maxchainlen, curchainlen)
        curchainlen = 0
    lw = last_word(l)
maxchainlen = max(maxchainlen, curchainlen)
print(maxchainlen)
I'd try splitting this job into two phases: first finding the chains and then comparing them. That will simplify the code a lot. Since chains will be a small subset of all the lines in the file, finding them first and then sorting them will be quicker than trying to process the whole thing in one big go.
The first part of the problem is a lot easier if you use the python yield keyword, which is similar to return but doesn't end a function. This lets you loop over your content one line at a time and process it in small bites without needing to hold the whole thing in memory at all times.
Here's a basic way to grab a file one line at a time. It uses yield to pull out the chains as it finds them
def get_chains(*lines):
    # these hold the last token and the
    # members of this chain
    previous = None
    accum = []
    # walk through the lines,
    # seeing if they can be added to the existing chain in `accum`
    for each_line in lines:
        # split the line into words, ignoring case & whitespace at the ends
        pieces = each_line.lower().strip().split(" ")
        if pieces[0] == previous:
            # match? add to accum
            accum.append(each_line)
        else:
            # no match? yield our chain
            # if it is not empty
            if accum:
                yield accum
                accum = []
        # update our idea of the last, and try the next line
        previous = pieces[-1]
    # at the end of the file we need to kick out anything
    # still in the accumulator
    if accum:
        yield accum
When you feed this function a string of lines, it will yield out chains if it finds them and then continue. Whoever calls the function can capture the yielded chains and do things with them.
Once you've got the chains, it's easy to sort them by length and pick the longest. Since Python has built-in list sorting, just collect a list of (length, chain) pairs and sort it. The longest chain will be the last item:
def longest_chain(filename):
    with open(filename, 'rt') as file_handle:
        # if you loop over an open file, you'll get
        # back the lines in the file one at a time
        incoming_chains = get_chains(*file_handle)
        # collect the results into a list, keyed by lengths
        all_chains = [(len(chain), chain) for chain in incoming_chains]
        if all_chains:
            all_chains.sort()
            length, lines = all_chains[-1]
            # found the longest chain
            return "\n".join(lines)
        else:
            # for some reason there are no chains of connected lines
            return []
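A possible way to call it, assuming the poem lives at /text.txt as in the question:

result = longest_chain("/text.txt")
print(result if result else "Nothing found")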

MemoryError while trying to use itertools.permutations, how can I use less memory?

I'm loading a text document containing some random strings, and I'm trying to print every possible permutation of the characters in each string.
If the notepad contains for example:
123
abc
I want my output to be
123,132,213,231,312,321
abc,acb,bac,bca,cab,cba
The text file contains some pretty large strings so I can see why I am getting this MemoryError.
For my first attempt I used this:
import sys
import itertools
import math

def organize(to_print):
    number_list = []
    upper_list = []
    lower_list = []
    for x in range(0,len(to_print)):
        if str(to_print[x]).isdigit() is True:
            number_list.append(to_print[x])
        elif to_print[x].isupper() is True:
            upper_list.append(to_print[x])
        else:
            lower_list.append(to_print[x])
    master_list = number_list + upper_list + lower_list
    return master_list

number = open(*file_dir*, 'r').readlines()
factorial = math.factorial(len(number))
complete_series = ''
for x in range(0,factorial):
    complete_string = ''.join((list(itertools.permutations(organize(number)))[x]))
    complete_series += complete_string+','
edit_series = complete_series[:-1]
print(edit_series)
The reason for def organize is that if I have a string like 1aB, I need to pre-order it by number, uppercase, lowercase before I start the permutations.
I got the memory error here: complete_string = ''.join((list(itertools.permutations(organize(number)))[x])), so my next attempt was to bring it out of the for-loop.
My second attempt is this:
import sys
import itertools
import math

def organize(to_print):
    number_list = []
    upper_list = []
    lower_list = []
    for x in range(0,len(to_print)):
        if str(to_print[x]).isdigit() is True:
            number_list.append(to_print[x])
        elif to_print[x].isupper() is True:
            upper_list.append(to_print[x])
        else:
            lower_list.append(to_print[x])
    master_list = number_list + upper_list + lower_list
    return master_list

number = open(*file_dir*, 'r').readlines()
factorial = math.factorial(len(number))
complete_series = ''
the_permutation = list(itertools.permutations(organize(number)))
for x in range(0,factorial):
    complete_string = ''.join((the_permutation[x]))
    complete_series += complete_string+','
edit_series = complete_series[:-1]
print(edit_series)
But I am still getting a memory error. I don't necessarily need or want the answer directly as this is good learning practice to reduce my inefficiencies, so hints in the right direction would be nice.
Added 3rd attempt:
import sys
import itertools
import math

def organize(to_print):
    number_list = []
    upper_list = []
    lower_list = []
    for x in range(0,len(to_print)):
        if str(to_print[x]).isdigit() is True:
            number_list.append(to_print[x])
        elif to_print[x].isupper() is True:
            upper_list.append(to_print[x])
        else:
            lower_list.append(to_print[x])
    master_list = number_list + upper_list + lower_list
    return master_list

number = open(*file_dir*, 'r').readlines()
factorial = math.factorial(len(number))
complete_series = ''
the_permutation = itertools.permutations(organize(number))
for x in itertools.islice(the_permutation,factorial):
    complete_string = ''.join(next(the_permutation))
    complete_series += complete_string+','
edit_series = complete_series[:-1]
print(edit_series)
Don't call list, just iterate over the permutations:
the_permutation = itertools.permutations(organize(number))
for x in the_permutation:
    complete_string = ''.join(x)
list(itertools.permutations(organize(number))) stores all the permutations in memory, and then you store all the permutations in a string in your loop. Even with the lazy approach there is no guarantee that you will be able to store all the data, depending on how many permutations the_permutation yields.
If you only want a certain number of the permutations, you can call next on the permutations object:
the_permutation = itertools.permutations(organize(number))
for x in range(factorial):
    complete_string = ''.join(next(the_permutation))
Or use itertools.islice:
for x in itertools.islice(the_permutation, factorial):
    complete_string = ''.join(x)
Keep in mind that factorials grow enormously fast, so even for a string of moderate length the number of permutations is enormous. For 12 letters it's ~480 million.
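Pulling that advice together, here is a minimal sketch that streams each permutation straight to output instead of accumulating one giant string. The file name and the per-line character handling are my assumptions (the desired output in the question permutes the characters of each line):

import itertools

def organize(to_print):
    # same grouping as in the question: digits first, then uppercase, then lowercase
    digits = [c for c in to_print if c.isdigit()]
    upper = [c for c in to_print if c.isupper()]
    lower = [c for c in to_print if c.islower()]
    return digits + upper + lower

with open('strings.txt') as f:            # assumed input file, one string per line
    for line in f:
        line = line.strip()
        if not line:
            continue
        first = True
        for p in itertools.permutations(organize(line)):
            # emit each permutation immediately; nothing accumulates in memory
            print('' if first else ',', ''.join(p), sep='', end='')
            first = False
        print()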

Python 3.0+ Calculating Mode

I have written a program to calculate the most frequently occurring number. This works great unless there are 2 equally frequent numbers in a list, such as 7,7,7,9,9,9. For that case I wrote in:
if len(modeList) > 1 and modeList[0] != modeList[1]:
    break
but then I encounter other problems, like a set of numbers such as 7,9,9,9,9. What do I do? Below is my code that calculates one mode.
list1 = [7,7,7,9,9,9,9]
numList=[]
modeList=[]
finalList =[]
for i in range(len(list1)):
    for k in range(len(list1)):
        if list1[i] == list1[k]:
            numList.append(list1[i])
numList.append("EOF")
w = 0
for w in range(len(numList)):
    if numList[w] == numList[w + 1]:
        modeList.append(numList[w])
    if numList[w + 1] == "EOF":
        break
w = 0
lenMode = len(modeList)
print(lenMode)
while lenMode > 1:
    for w in range(lenMode):
        print(w)
        if w != lenMode - 1:
            if modeList[w] == modeList[w + 1]:
                finalList.append(modeList[w])
                print(w)
    lenFinal = len(finalList)
    modeList = []
    for i in range(lenFinal):
        modeList.append(finalList[i])
    finalList = []
    lenMode = len(modeList)
and then
print(modeList)
We have not learned counters, but I would be open to using them if someone could explain!
I would just use collections.Counter for this:
>>> from collections import Counter
>>> c = Counter([7,9,9,9,9])
>>> max(c.items(), key=lambda x:x[1])[0]
9
This is really rather simple. All it does is count how many times each value appears in the list, and then selects the element with the highest count.
I would use statistics.mode() for this. If there is more than one mode, it will raise an exception. If you need to handle multiple modes (it's not clear to me whether that's the case), you probably want to use a collections.Counter object as suggested by NPE.
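If you do need every mode when there is a tie, one way (a sketch building on Counter, not taken from the answers above) is to keep all values whose count equals the highest count:

from collections import Counter

def modes(values):
    counts = Counter(values)
    # highest frequency among all values
    top = max(counts.values())
    # every value that reaches that frequency is a mode
    return [value for value, count in counts.items() if count == top]

print(modes([7,7,7,9,9,9]))  # [7, 9]
print(modes([7,9,9,9,9]))    # [9]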
