Runtime difference: quick sort and binary search - Python

I have a problem with something I learned.
What I want to do
Write the first list and sort it.
Write the second list.
Define a binary search.
Compare the runtime of sorting the first list (built-in sort) with the runtime of binary-searching it once for every element of the second list.
Input size: 1,000,000 (for both lists).
What is my problem
I learned that the big-O of quicksort is n * log(n) and that of binary search is log(n). So if I repeat the binary search n times, I expect a runtime similar to the sort, but that is not what I see.
Code
make first list
import numpy as np

arr = np.random.randint(low=-10000000, high=10000000, size=(1000000,))
with open('./1st_list.txt', 'w') as f:
    f.write(str(len(arr)) + '\n')
    for i in arr:
        f.write(str(i) + '\n')
make second list
arr = np.random.randint(low=-10000000, high=10000000, size=(1000000,))
with open('./2nd_list.txt', 'w') as f:
    f.write(str(len(arr)) + '\n')
    for i in arr:
        f.write(str(i) + '\n')
define binary search
def binary_search(
        arr: list,   # sorted list
        target: int):
    start, end = 0, len(arr) - 1
    while(True):
        if start > end:
            return None
        mid = (start + end) // 2
        if target == arr[mid]:
            return mid + 1
        elif target > arr[mid]:
            start = mid + 1
        else:
            end = mid - 1
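A quick sanity check of the function on a tiny sorted list (illustrative only; the function returns a 1-based position because of the mid + 1):
small = [1, 3, 5, 7, 9]
print(binary_search(small, 7))   # 4, i.e. the 1-based position of 7
print(binary_search(small, 4))   # None, since 4 is not in the list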
load 1st_list
import sys
import time

sys.stdin = open('./1st_list.txt')
input = sys.stdin.readline

start = time.time()
n = int(input())          # the first line of the file stores the list length
arr_1st = []
for _ in range(n):
    arr_1st.append(int(input().strip()))
arr_1st.sort()
print(time.time() - start)
load 2nd_list
sys.stdin = open('./2nd_list.txt')
input = sys.stdin.readline

n = int(input())          # the first line of the file stores the list length
arr_2nd = []
for _ in range(n):
    arr_2nd.append(int(input().strip()))
compute runtime
import time

start = time.time()
for ele in arr_2nd:
    binary_search(arr_1st, ele)
print(time.time() - start)
result
Runtime on my local machine:
load 1st_list and sort: 0.5 s
run 1,000,000 binary searches: 4.2 s
I don't know why these are so different.
Thank you for any detailed information.
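As a point of comparison, here is a minimal sketch that times the same searches with the standard-library bisect module; it assumes arr_1st (already sorted) and arr_2nd have been built as in the code above and is only meant to illustrate the measurement.
import time
from bisect import bisect_left

start = time.time()
for ele in arr_2nd:
    idx = bisect_left(arr_1st, ele)                     # C-implemented binary search
    found = idx < len(arr_1st) and arr_1st[idx] == ele  # membership check at that position
print('bisect searches:', time.time() - start)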

Related

python infinite loop and numpy delete do not work properly

I wrote a function and it does not end. Logically the length of the array should be decreasing, but it gets stuck at 227. I think numpy delete does not work properly, or I made a mistake somewhere?
import math
import numpy as np

def segmenting(file, threshold):
    segments = []
    check = True
    count = 0
    while check == True:
        if len(file) <= 2:
            check = False
        sequence = []
        ids = []
        for i in range(1, len(file)):
            vector = [file[i, 1] - file[0, 1], file[i, 2] - file[0, 2]]
            magnitude = math.sqrt(vector[0]**2 + vector[1]**2)
            print(i)
            if magnitude <= threshold:
                sequence.append(file[i])
                ids.append(i)
            if i == len(file) and len(sequence) == 0:
                file = np.delete(file, 0, axis=0)
                break
        if len(ids) > 0 and len(sequence) > 0:
            segments.append(sequence)
            file = np.delete(file, ids, axis=0)
            print('sequence after :', sequence)
            sequence = []
            ids = []
        print(len(file))
    return segments
The following (simplified) logic will never be executed:
for i in range(1, len(file)):
    if i == len(file):
        file = np.delete(file, 0)
Without having a way to remove the first line of the file, you have no way to exhaust your array. This check is superfluous anyway since after each iteration you won't need the first line anymore.
As a first fix you can put the check outside the loop and only check whether you've found any matches:
for i in range(1, len(file)):
    ...
if len(sequence) == 0:
    file = np.delete(file, 0)
But that way you would have one iteration where you find (and remove) matches, and then one more, with no further matches, where you finally remove the first line. Therefore, as said above, you should simply always remove the first line after each iteration.
With more simplifications, your code can be reduced down to:
def segmenting(file, threshold):
    segments = []
    while len(file) > 2:
        idx = np.sqrt(np.sum((file[1:, 1:3] - file[0, 1:3])**2, axis=1)) <= threshold
        file = file[1:]
        segments.append(list(file[idx]))
        file = file[np.logical_not(idx)]
    return segments
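A quick way to exercise the simplified version on synthetic data (purely illustrative; it assumes the same column layout as above, with the coordinates in columns 1 and 2):
import numpy as np

# 10 rows of (id, x, y); only columns 1 and 2 are used by segmenting()
data = np.hstack([np.arange(10).reshape(-1, 1), np.random.rand(10, 2)])
print(segmenting(data, threshold=0.5))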
It's likely due to the fact that you are removing elements from the file array inside a for loop while also iterating over that same array. Try iterating over a clean (unmodified) version of the file array and do the deletion on a copy of it.
For example, one possible solution is to fix this line
for i in range(1, len(file)):
Change it as below:
N=len(file)
for i in range(1, N):
Also, you could remove the flag variable 'check' and replace it with a break statement, as sketched below.
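For illustration, a minimal sketch of the function with the flag replaced by break and the deletion moved after the inner loop (this assumes the same column layout as above and is only meant to show the restructuring, not a drop-in replacement):
import math
import numpy as np

def segmenting(file, threshold):
    segments = []
    while True:
        if len(file) <= 2:
            break                      # replaces the 'check' flag
        sequence, ids = [], []
        for i in range(1, len(file)):
            vector = [file[i, 1] - file[0, 1], file[i, 2] - file[0, 2]]
            if math.sqrt(vector[0]**2 + vector[1]**2) <= threshold:
                sequence.append(file[i])
                ids.append(i)
        if sequence:
            segments.append(sequence)
        # always drop the matched rows and the first row before the next pass
        file = np.delete(file, [0] + ids, axis=0)
    return segments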

Trying to optimize quicksort for larger files

Does anyone know how I can optimize this code to handle larger files? It works with smaller inputs, but I need it to run on a file with over 200,000 words. Any suggestions?
Thank you.
import random
import re

def quick_sort(a, i, n):
    if n <= 1:
        return
    mid = (len(a)) // 2
    x = a[random.randint(0, len(a)-1)]
    p = i - 1
    j = i
    q = i + n
    while j < q:
        if a[j] < x:
            p = p + 1
            a[j], a[p] = a[p], a[j]
            j = j + 1
        elif a[j] > x:
            q = q - 1
            a[j], a[q] = a[q], a[j]
        else:
            j = j + 1
    quick_sort(a, i, p-i+1)
    quick_sort(a, q, n-(q-i))

file_name = input("Enter file name: ")
my_list = []
with open(file_name, 'r') as f:
    for line in f:
        line = re.sub('[!#?,.:";\']', '', line).lower()
        token = line.split()
        for t in token:
            my_list.append(t)

a = my_list
quick_sort(a, 0, len(my_list))
print("List After Calling Quick Sort: ", a)
Your random selection of an index to use for your pivot x is using the whole size of the input list a, not just the part you're supposed to be sorting on the current call. This means that very often your pivot won't be in the current section at all, and so you won't be able to usefully reduce your problem (because all of the values will be on the same side of the pivot). This leads to lots and lots of recursion, and for larger inputs you'll almost always hit the recursion cap.
The fix is simple, just change how you get x:
x = a[random.randrange(i, i+n)]
I like randrange a lot better than randint, but you could use randint(i, i+n-1) if you feel the other way.
Must you use a quicksort? If you can use heapq or a PriorityQueue, the .get() (or heappop()) calls automatically give you the items back in sorted order:
import sys
from queue import PriorityQueue

pq = PriorityQueue()
inp = open(sys.stdin.fileno(), newline='\n')
#inp = ['dag', 'Rug', 'gob', 'kex', 'mog', 'Wes', 'pox', 'sec', 'ego', 'wah'] # for testing
for word in inp:
    word = word.rstrip('\n')
    pq.put(word)
while not pq.empty():
    print(pq.get())
Then test with some large random word input or file e.g.:
shuf /usr/share/dict/words | ./word_pq.py
where shuf is GNU shuf (e.g. /usr/local/bin/shuf).
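Since heapq is mentioned as an alternative, here is a minimal sketch of the same idea with the heapq module (the word list is reused from the testing line above):
import heapq

words = ['dag', 'Rug', 'gob', 'kex', 'mog', 'Wes', 'pox', 'sec', 'ego', 'wah']
heapq.heapify(words)            # O(n) in-place heap construction
sorted_words = []
while words:
    sorted_words.append(heapq.heappop(words))   # pops the smallest remaining word
print(sorted_words)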

How can I optimize this function which is related to reversal of string?

I have a string: "String"
The first thing you do is reverse it: "gnirtS"
Then you will take the string from the 1st position and reverse it again: "gStrin"
Then you will take the string from the 2nd position and reverse it again: "gSnirt"
Then you will take the string from the 3rd position and reverse it again: "gSntri"
Continue this pattern until you have done every single position, and then you will return the string you have created. For this particular string, you would return: "gSntir"
And I have to repeat this entire procedure x times, where the string and x can be very big (a million or a billion).
My code works fine for small strings, but it gives a timeout error for very long strings.
def string_func(s, x):
    def reversal(st):
        n1 = len(st)
        for i in range(0, n1):
            st = st[0:i] + st[i:n1][::-1]
        return st
    for i in range(0, x):
        s = reversal(s)
    return s
This linear implementation could point you in the right direction:
from collections import deque
from itertools import cycle

def special_reverse(s):
    d, res = deque(s), []
    ops = cycle((d.pop, d.popleft))
    while d:
        res.append(next(ops)())
    return ''.join(res)
You can recognize the slice patterns in the following examples:
>>> special_reverse('123456')
'615243'
>>> special_reverse('1234567')
'7162534'
This works too:
my_string = "String"
my_string_len = len(my_string)
result = ""
for i in range(my_string_len):
    my_string = my_string[::-1]
    result += my_string[0]
    my_string = my_string[1:]
print(result)
And this, though it looks spaghetti :D
s = "String"
lenn = len(s)
resultStringList = []
first_half = list(s[0:int(len(s) / 2)])
second_half = None
middle = None
if lenn % 2 == 0:
    second_half = list(s[int(len(s) / 2) : len(s)][::-1])
else:
    second_half = list(s[int(len(s) / 2) + 1 : len(s)][::-1])
    middle = s[int(len(s) / 2)]
    lenn -= 1
for k in range(int(lenn / 2)):
    print(k)
    resultStringList.append(second_half.pop(0))
    resultStringList.append(first_half.pop(0))
if middle != None:
    resultStringList.append(middle)
print(''.join(resultStringList))
From the pattern of the original string and the result I constructed this algorithm. It has a minimal number of operations.
str = 'Strings'
lens = len(str)
lensh = int(lens/2)
nstr = ''
for i in range(lensh):
    nstr = nstr + str[lens - i - 1] + str[i]
if ((lens % 2) == 1):
    nstr = nstr + str[lensh]
print(nstr)
or a short version using iterator magic:
def string_func(s):
    ops = (iter(reversed(s)), iter(s))
    return ''.join(next(ops[i % 2]) for i in range(len(s)))
which does the right thing for me, while if you're happy using some library code, you can golf it down to:
from itertools import cycle, islice

def string_func(s):
    ops = (iter(reversed(s)), iter(s))
    return ''.join(map(next, islice(cycle(ops), len(s))))
My original version takes 80 µs for a 512-character string, this updated version takes 32 µs, while your version took 290 µs and schwobaseggl's solution is about 75 µs.
I've had a play in Cython and I can get the runtime down to ~0.5 µs. Measuring this under perf_event_open I can see my CPU is retiring ~8 instructions per character, which seems pretty good, while a hard-coded loop in C gets this down to ~4.5 instructions per ASCII char. These don't seem to be very "Pythonic" solutions so I'll leave them out of this answer, but I've included this paragraph to show that the OP has options to make things faster, and that running this a billion times on a string of ~500 characters will still take hundreds of seconds even with relatively careful C code.
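For reference, a minimal sketch of how such timings could be reproduced with the standard-library timeit module (the 512-character random test string is made up, and string_func here refers to the single-pass version defined above):
import random
import string
import timeit

s = ''.join(random.choices(string.ascii_letters, k=512))

# assumes string_func (single-pass version) is defined as above
per_call = timeit.timeit(lambda: string_func(s), number=1000) / 1000
print(per_call, 'seconds per call')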

Longest chain of last word of line/first word of next

Okay, so I am trying to find, in a text file, the longest chain in which the last word of one line is the first word of the next (works well for poetry). The Python script I have so far works, but it still takes an immensely long time. I am no coding expert and have really no idea about optimization. Am I running through more options than necessary?
How can I reduce the time it takes to run through a longer text?
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
import sys

# Opening the source text
with open("/text.txt") as g:
    all_lines = g.readlines()

def last_word(particular_line):
    if particular_line != "\n":
        particular_line = re.sub(ur'^\W*|\W*$', "", particular_line)
        if len(particular_line) > 1:
            return particular_line.rsplit(None, 1)[-1].lower()

def first_word(particular_line):
    if particular_line != "\n":
        particular_line = re.sub(ur'^\W*|\W*$', "", particular_line)
        if len(particular_line) > 1:
            return particular_line.split(None, 1)[0].lower()

def chain(start, lines, depth):
    remaining = list(lines)
    del remaining[remaining.index(start)]
    possibles = [x for x in remaining if (len(x.split()) > 2) and (first_word(x) == last_word(start))]
    maxchain = []
    for c in possibles:
        l = chain(c, remaining, depth)
        sys.stdout.flush()
        sys.stdout.write(str(depth) + " of " + str(len(all_lines)) + " \r")
        sys.stdout.flush()
        if len(l) > len(maxchain):
            maxchain = l
            depth = str(depth) + "." + str(len(maxchain))
    return [start] + maxchain

#Start
final_output = []

#Finding the longest chain
for i in range(0, len(all_lines)):
    x = chain(all_lines[i], all_lines, i)
    if len(x) > 2:
        final_output.append(x)
final_output.sort(key=len)

#Output on screen
print "\n\n--------------------------------------------"
if len(final_output) > 1:
    print final_output[-1]
else:
    print "Nothing found"
import itertools

def matching_lines(line_pair):
    return line_pair[0].split()[-1].lower() == line_pair[1].split()[0].lower()

line_pairs = ((line, next_line) for line, next_line in itertools.izip(all_lines, all_lines[1:]))
grouped_pairs = itertools.groupby(line_pairs, matching_lines)
print max([len(list(y)) + 1 for x, y in grouped_pairs if x])
Although I'm not sure it will be faster, I think it will be, since it only iterates once and uses mostly built-ins.
Yes, this code has a complexity of $O(n^2)$. It means that if your file has n lines, then the number of iterations your code performs is 1 * (n-1) for the first line, then 1 * (n-2) for the second line, and so on, with n such terms. For a big n this is roughly $n^2$. Actually, there's a bug in the code in this line
del remaining[remaining.index(start)]
where you probably meant to run this:
del remaining[:remaining.index(start)]
(notice the ':' in the square brackets). The bug expands the runtime: you now have (n-1) + (n-1) + ... + (n-1) = n*(n-1), which is somewhat bigger than (n-1) + (n-2) + (n-3) + ...
You can optimize the code as follows: begin with maxchainlen = 0 and curchainlen = 0. Now iterate through the lines; each time, compare the first word of the current line to the last word of the previous line. If they match, increase curchainlen by 1. If they don't, check whether maxchainlen < curchainlen; if so, assign maxchainlen = curchainlen and reset curchainlen to 0. After you finish iterating through the lines, do this check for maxchainlen again. Example:
lw = last_word(lines[0])
curchainlen = 0
maxchainlen = 0
for l in lines[1:]:
    if lw == first_word(l):
        curchainlen = curchainlen + 1
    else:
        maxchainlen = max(maxchainlen, curchainlen)
        curchainlen = 0
    lw = last_word(l)
maxchainlen = max(maxchainlen, curchainlen)
print(maxchainlen)
I'd try splitting this job into two phases: first finding the chains and then comparing them. That will simplify the code a lot. Since chains will be a small subset of all the lines in the file, finding them first and then sorting them will be quicker than trying to process the whole thing in one big go.
The first part of the problem is a lot easier if you use the python yield keyword, which is similar to return but doesn't end a function. This lets you loop over your content one line at a time and process it in small bites without needing to hold the whole thing in memory at all times.
Here's a basic way to grab a file one line at a time. It uses yield to pull out the chains as it finds them
def get_chains(*lines):
    # these hold the last token and the
    # members of this chain
    previous = None
    accum = []
    # walk through the lines,
    # seeing if they can be added to the existing chain in `accum`
    for each_line in lines:
        # split the line into words, ignoring case & whitespace at the ends
        pieces = each_line.lower().strip().split(" ")
        if pieces[0] == previous:
            # match? add to accum
            accum.append(each_line)
        else:
            # no match? yield our chain
            # if it is not empty
            if accum:
                yield accum
                accum = []
        # update our idea of the last, and try the next line
        previous = pieces[-1]
    # at the end of the file we need to kick out anything
    # still in the accumulator
    if accum:
        yield accum
When you feed this function a string of lines, it will yield out chains if it finds them and then continue. Whoever calls the function can capture the yielded chains and do things with them.
Once you've got the chains, it's easy to sort them by length and pick the longest. Since Python has built-in list sorting, just collect a list of (chain length, chain) pairs and sort it. The longest chain will be the last item:
def longest_chain(filename):
    with open(filename, 'rt') as file_handle:
        # if you loop over an open file, you'll get
        # back the lines in the file one at a time
        incoming_chains = get_chains(*file_handle)
        # collect the results into a list, keyed by lengths
        all_chains = [(len(chain), chain) for chain in incoming_chains]
    if all_chains:
        all_chains.sort()
        length, lines = all_chains[-1]
        # found the longest chain
        return "\n".join(lines)
    else:
        # for some reason there are no chains of connected lines
        return []
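A quick usage sketch of the two functions above (the sample lines are made up; the file path from the question is shown commented out):
# assumes get_chains() and longest_chain() are defined as above
sample = [
    "the cat sat on the mat\n",
    "mat makers make many mats\n",
    "mats are comfortable\n",
]
for found in get_chains(*sample):
    print(found)

# print(longest_chain("/text.txt"))   # or run it on the file from the question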

How do I obtain data from different files and combining into one array in python?

I have to collect data to prove my hypothesis that typing with your dominant hand is faster than typing with your non-dominant hand. I have written the code below, which gives the participant a random word that they then have to copy. The code times how long it takes to type each word and then saves that data to a new file; a new CSV file is created for each participant tested.
Now I need to write another script that will find the average for each hand for each participant and then create a single array containing the averages, so I can create a graph to show whether or not my hypothesis is true. How would I go about getting data from different files and combining it into one array?
My Script:
import random
import time

name = raw_input('Enter name: ')  # get some name for the file
outfile = file(name + '.csv', 'w')  # create a file for this user's data

# load up a list of 1000 common words
words = file('1-1000.txt').read().split()
ntrials = 50
answers = []

print """Type With Dominant Hand"""
for i in range(ntrials):
    word = random.choice(words)
    tstart = time.time()
    ans = raw_input('Please type ' + word + ': ')
    tstop = time.time()
    answers.append((word, ans, tstop - tstart))
    print >>outfile, 'Dominant', word, ans, tstop - tstart  # write the data to the file
    if (i % 5 == 3):
        go = raw_input('take a break, type y to continue: ')

print """Type With Nondominant Hand"""
for i in range(ntrials):
    word = random.choice(words)
    tstart = time.time()
    ans = raw_input('Please type ' + word + ': ')
    tstop = time.time()
    answers.append((word, ans, tstop - tstart))
    print >>outfile, 'Nondominant', word, ans, tstop - tstart  # write the data to the file
    if (i % 5 == 3):
        go = raw_input('take a break, type y to continue: ')

outfile.close()  # close the file
Sample results from above script:
Dominant sit sit 1.81511306763
Dominant again again 2.54711103439
Dominant from from 1.53057098389
Dominant general general 1.98939108849
Dominant horse horse 1.93938016891
Dominant of of 1.07597017288
Dominant clock clock 1.6587600708
Dominant save save 1.42030906677
Nondominant story story 3.92807888985
Nondominant of of 0.93910908699
Nondominant test test 1.69210004807
Nondominant low low 1.13296699524
Nondominant hit hit 1.15252614021
Nondominant you you 1.22019600868
Nondominant river river 1.42011594772
Nondominant middle middle 1.61595511436
This may seem like another language if you're not familiar with numpy, but here's a solution that takes advantage of it (notice the lack of loops!)
For testing, I created a second user data file, with each entry incremented by 1 second.
import glob
import numpy as np

usecols = [0, 3]  # Columns to extract from data file
str2num = {'Dominant': 0, 'Nondominant': 1}  # Conversion dictionary
converters = {0: (lambda s: str2num[s])}  # Strings -> numbers

userfiles = glob.glob('*.csv')
userdat = np.array([np.loadtxt(f, usecols=usecols, converters=converters)
                    for f in userfiles])

# Create boolean arrays to filter desired results
dom = userdat[..., 0] == 0
nondom = userdat[..., 0] == 1

# Filter and reshape to keep 'per-user' layout
usercnt, _, colcnt = userdat.shape
domdat = userdat[dom].reshape(usercnt, -1, colcnt)
nondomdat = userdat[nondom].reshape(usercnt, -1, colcnt)

domavgs = np.average(domdat, axis=1)[:, 1]
nondomavgs = np.average(nondomdat, axis=1)[:, 1]

print 'Dominant averages by user: ', domavgs
print 'Non-dominant averages by user:', nondomavgs
Output:
Dominant averages by user: [ 1.74707571 2.74707571]
Non-dominant averages by user: [ 1.63763103 2.63763103]
If you're going to be doing a lot of analysis, I'd highly recommend getting your head around numpy.
import os

def avg_one(filename):
    vals = {'Dominant': [], 'Nondominant': []}
    with open(filename) as input:
        for line in input:
            hand, _, _, t = line.strip().split()
            vals[hand].append(float(t))
    d = vals['Dominant']
    nd = vals['Nondominant']
    return (sum(d)/len(d), sum(nd)/len(nd))

data = []
for f in os.listdir('.'):
    if f.endswith('.csv'):
        data.append(avg_one(f))

doms, nondoms = zip(*data)
print "Dominant: " + repr(doms)
print "Nondominant: " + repr(nondoms)
This presumes that there are no other .csv files in the same dir that have a different format (and would fail parsing). And this needs more error checking in general, but it gets the idea across.
persons = ["billy", "bob", "joe", "kim"]
num_dom, total_dom, num_nondom, total_nondom = 0, 0, 0, 0

for person in persons:
    data = file('%s.csv' % person, 'r').readlines()
    for line in data:
        if "Nondominant" in line:
            num_nondom += 1
            total_nondom += float(line.split(' ')[-1].strip())
        elif "Dominant" in line:
            num_dom += 1
            total_dom += float(line.split(' ')[-1].strip())
        else:
            continue

dom_avg = total_dom / num_dom
nondom_avg = total_nondom / num_nondom
print "Average speed with Dominant hand: %s" % dom_avg
print "Average speed with Non-Dominant hand: %s" % nondom_avg
Fill the "persons" array with the names of your subjects and then do what you please with the data.
PS: Heltonbiker noted your idea and I have added it. Also fixed the newline bug by adding strip().
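Since the stated goal is a graph of the per-participant averages, here is a minimal sketch of how the two arrays from the numpy answer above (domavgs and nondomavgs) might be plotted; the values are copied from that answer's example output, matplotlib is assumed to be available, and the plot is purely illustrative:
import numpy as np
import matplotlib.pyplot as plt

# per-user averages taken from the example output above
domavgs = np.array([1.74707571, 2.74707571])
nondomavgs = np.array([1.63763103, 2.63763103])

users = np.arange(len(domavgs))
width = 0.35
plt.bar(users - width / 2, domavgs, width, label='Dominant')
plt.bar(users + width / 2, nondomavgs, width, label='Nondominant')
plt.xlabel('Participant')
plt.ylabel('Average time per word (s)')
plt.legend()
plt.show()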
