Python 3.0+ Calculating Mode - python

I have written a program to calculate the most frequently occurring number. This works great unless a list has two most-occurring numbers, such as 7,7,7,9,9,9. For that I wrote in:
if len(modeList) > 1 and modeList[0] != modeList[1]:
    break
but then I run into other problems with a set of numbers like 7,9,9,9,9. What do I do? Below is my code that will calculate one mode.
list1 = [7,7,7,9,9,9,9]
numList=[]
modeList=[]
finalList =[]
for i in range(len(list1)):
    for k in range(len(list1)):
        if list1[i] == list1[k]:
            numList.append(list1[i])
numList.append("EOF")
w = 0
for w in range(len(numList)):
    if numList[w] == numList[w + 1]:
        modeList.append(numList[w])
    if numList[w + 1] == "EOF":
        break
w = 0
lenMode = len(modeList)
print(lenMode)
while lenMode > 1:
    for w in range(lenMode):
        print(w)
        if w != lenMode - 1:
            if modeList[w] == modeList[w + 1]:
                finalList.append(modeList[w])
                print(w)
    lenFinal = len(finalList)
    modeList = []
    for i in range(lenFinal):
        modeList.append(finalList[i])
    finalList = []
    lenMode = len(modeList)
and then
print(modeList)
We have not learned counters but I would be open to it if someone could explain!

I would just use collections.Counter for this:
>>> from collections import Counter
>>> c = Counter([7,9,9,9,9])
>>> max(c.items(), key=lambda x:x[1])[0]
9
This is really rather simple. All it does is count how many times each value appears in the list, and then selects the element with the highest count.

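I would use statistics.mode() for this. If there is more than one mode, it will raise an exception (StatisticsError) on Python versions before 3.8; from 3.8 onward it returns the first mode encountered, and statistics.multimode() returns all of them. If you need to handle multiple modes (it's not clear to me whether that's the case), you probably want to use a collections.Counter object as suggested by NPE.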
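If ties matter, as in the 7,7,7,9,9,9 case from the question, one possible approach is to take the highest count from a Counter and keep every value that reaches it. This is only a minimal sketch; the function name all_modes is made up for illustration.

```python
from collections import Counter

def all_modes(values):
    """Return every value that occurs the maximum number of times."""
    counts = Counter(values)          # maps value -> number of occurrences
    if not counts:
        return []                     # no data, so no mode
    highest = max(counts.values())
    # Counter keeps first-seen order (Python 3.7+), so ties come back in input order.
    return [value for value, n in counts.items() if n == highest]

print(all_modes([7, 7, 7, 9, 9, 9]))  # [7, 9]
print(all_modes([7, 9, 9, 9, 9]))     # [9]
```

Returning a list in both cases means the caller does not need a special branch for ties.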

Related

Trying to optimize quicksort for larger files

Does anyone know how I can optimize this code to run on larger files? It works with smaller inputs, but I need it to run on a file with over 200,000 words. Any suggestions?
Thank you.
import random
import re
def quick_sort(a,i,n):
    if n <= 1:
        return
    mid = (len(a)) // 2
    x = a[random.randint(0,len(a)-1)]
    p = i - 1
    j = i
    q = i + n
    while j < q:
        if a[j] < x:
            p = p + 1
            a[j],a[p] = a[p],a[j]
            j = j + 1
        elif a[j] > x:
            q = q - 1
            a[j],a[q] = a[q],a[j]
        else:
            j = j + 1
    quick_sort(a,i,p-i+1)
    quick_sort(a,q,n-(q-i))
file_name = input("Enter file name: ")
my_list = []
with open(file_name,'r') as f:
    for line in f:
        line = re.sub('[!#?,.:";\']', '', line).lower()
        token = line.split()
        for t in token:
            my_list.append(t)
a = my_list
quick_sort(a,0,len(my_list))
print("List After Calling Quick Sort: ",a)
Your random selection of an index to use for your pivot x is using the whole size of the input list a, not just the part you're supposed to be sorting on the current call. This means that very often your pivot won't be in the current section at all, and so you won't be able to usefully reduce your problem (because all of the values will be on the same side of the pivot). This leads to lots and lots of recursion, and for larger inputs you'll almost always hit the recursion cap.
The fix is simple. Just change how you get x:
x = a[random.randrange(i, i+n)]
I like randrange a lot better than randint, but you could use randint(i, i+n-1) if you feel the other way.
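With that one-line change applied, the function might look like the sketch below. It keeps the question's three-way partition and its argument convention (sort the n items starting at index i); the unused mid variable is dropped, and the sample word list is made up for illustration.

```python
import random

def quick_sort(a, i, n):
    """Sort the n items of a starting at index i, in place."""
    if n <= 1:
        return
    # Pick the pivot from the slice being sorted, not from the whole list.
    x = a[random.randrange(i, i + n)]
    p = i - 1      # a[i..p] holds items smaller than the pivot
    j = i          # a[p+1..j-1] holds items equal to the pivot
    q = i + n      # a[q..i+n-1] holds items larger than the pivot
    while j < q:
        if a[j] < x:
            p += 1
            a[j], a[p] = a[p], a[j]
            j += 1
        elif a[j] > x:
            q -= 1
            a[j], a[q] = a[q], a[j]
        else:
            j += 1
    quick_sort(a, i, p - i + 1)        # the smaller-than-pivot part
    quick_sort(a, q, n - (q - i))      # the larger-than-pivot part

words = ['pox', 'dag', 'gob', 'dag', 'kex']
quick_sort(words, 0, len(words))
print(words)  # ['dag', 'dag', 'gob', 'kex', 'pox']
```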
Must you use a quicksort? If you can use heapq or a PriorityQueue, repeatedly popping (heapq.heappop() or PriorityQueue.get()) returns the items in sorted order:
import sys
from queue import PriorityQueue
pq = PriorityQueue()
inp = open(sys.stdin.fileno(), newline='\n')
#inp = ['dag', 'Rug', 'gob', 'kex', 'mog', 'Wes', 'pox', 'sec', 'ego', 'wah'] # for testing
for word in inp:
    word = word.rstrip('\n')
    pq.put(word)
while not pq.empty():
    print(pq.get())
Then test with some large random word input or file e.g.:
shuf /usr/share/dict/words | ./word_pq.py
where shuf is GNU /usr/local/bin/shuf.
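The heapq variant mentioned above could be sketched roughly as follows. The helper name heap_sorted is hypothetical, and the input is a small in-memory list rather than stdin.

```python
import heapq

def heap_sorted(words):
    """Return the words in ascending order by popping them off a heap."""
    heap = list(words)       # copy so the caller's list is left alone
    heapq.heapify(heap)      # rearrange into heap order in O(n)
    # Each heappop returns the smallest remaining item.
    return [heapq.heappop(heap) for _ in range(len(heap))]

print(heap_sorted(['dag', 'Rug', 'gob']))  # ['Rug', 'dag', 'gob']
```

Uppercase letters sort before lowercase ones in plain string comparison, which is why 'Rug' comes first.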

Python - multiple combinations maths question

I'm trying to make a program that lists all the 64 codons/triplet base sequences of DNA...
In more mathematical terms, there are 4 letters: A, T, G and C.
I want to list all possible outcomes where each has three letters and a letter can be used multiple times, but I have no idea how!
I know there are 64 possibilities and I wrote them all down on paper but I want to write a program that generates all of them for me instead of me typing up all 64!
Currently, I am at this point but I have most surely overcomplicated it and I am stuck:
list = ['A','T','G','C']
list2 = []
y = 0
x = 1
z = 2
skip = False
back = False
for i in range(4):
    print(list[y],list[y],list[y])
    if i == 0:
        skip = True
    else:
        y=y+1
for i in range(16):
    print(list[y],list[y],list[x])
    print(list[y],list[x], list[x])
    print(list[y],list[x], list[y])
    print(list[y],list[x], list[z])
    if i == 0:
        skip = True
    elif z == 3:
        back = True
        x = x+1
    elif back == True:
        z = z-1
        x = x-1
    else:
        x = x+1
        z = z+1
Any help would be much appreciated!!!!
You should really be using itertools.product for this.
from itertools import product
l = ['A','T','G','C']
combos = list(product(l,repeat=3 ))
# all 64 combinations
Since product() returns an iterator, you don't need to wrap it in list() if you're just going to loop over it. (Also, don't name your list list; it clobbers the built-in.)
If you want a list of strings you can join() them as John Coleman shows in a comment under your question.
list_of_strings = ["".join(c) for c in product(l,repeat=3) ]
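To see what product(l, repeat=3) is doing, it may help to compare it with three nested loops, one per position in the codon. This is just an illustrative sketch; the variable names bases and codons are not from the question.

```python
from itertools import product

bases = ['A', 'T', 'G', 'C']

# Three nested loops: one per position in the codon.
codons = [first + second + third
          for first in bases
          for second in bases
          for third in bases]

# product() yields the same triplets in the same order.
same_as_product = codons == ["".join(c) for c in product(bases, repeat=3)]
print(len(codons), same_as_product)  # 64 True
```

The nested loops only work for a fixed length of three, whereas product takes the length as the repeat argument.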
Look for permutations with repetition; there is plenty of code available for Python.
I would just use a library. If you want to see how it is implemented, look inside the library; library code is usually very efficient.
import itertools
x = [1, 2, 3, 4, 5, 6]
[p for p in itertools.product(x, repeat=2)]

removing numbers which are close to each other in a list

I have a list like
mylist = [75,75,76,77,78,79,154,155,154,156,260,262,263,550,551,551,552]
I need to remove numbers that are close to each other, within at most four of one another, i.e.:
num-4 <= x <= num +4
The list I need at the end should look like:
list = [75,154,260,550]
or
list = [76,156,263,551]
It doesn't really matter which number stays in the list, as long as only one of each group of close numbers remains.
I tried this, which gave me an error:
for i in range(len(l)):
    for j in range(len(l)):
        if i==j or i==j+1 or i==j+2 or i == j+3:
            pp= l.pop(j)
            print(pp)
print(l)
IndexError: pop index out of range
and this one, which doesn't work the way I need:
for q in li:
    for w in li:
        print(q,'////',w)
        if q == w or q ==w+1 or q==w+2 or q==w+3:
            rem = li.remove(w)
Thanks
The below uses groupby to identify runs from the iterable that start with a value start and contain values that differ from start by no more than 4. We then collect all of those start values into a list.
from itertools import groupby
def runs(difference=4):
    start = None
    def inner(n):
        nonlocal start
        if start is None:
            start = n
        elif abs(start-n) > difference:
            start = n
        return start
    return inner
print([next(g) for k, g in groupby(mylist, runs())])
# [75, 154, 260, 550]
This assumes that the input data is already sorted. If it's not, you'll have to sort it: groupby(sorted(mylist), runs()).
You can accomplish this using a set or list; you don't need a dict.
usedValues = set()
newList = []
for v in mylist:
    if v not in usedValues:
        newList.append(v)
        for lv in range(v - 4, v + 5):
            usedValues.add(lv)
print(newList)
This method stores every value within 4 of each number kept so far. When you look at a new value from mylist, you only need to check whether you've already seen something in its ballpark by checking usedValues.
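If the input can be sorted first, the same idea can also be written as a plain loop that compares each value with the last one kept. A minimal sketch, assuming sorting is acceptable; the function name drop_close is hypothetical.

```python
def drop_close(values, difference=4):
    """Keep one representative from each run of values within `difference`."""
    kept = []
    for v in sorted(values):
        # Keep v only if it is farther than `difference` from the last kept value.
        if not kept or v - kept[-1] > difference:
            kept.append(v)
    return kept

mylist = [75, 75, 76, 77, 78, 79, 154, 155, 154, 156,
          260, 262, 263, 550, 551, 551, 552]
print(drop_close(mylist))  # [75, 154, 260, 550]
```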

IndexError: list assignment index out of range Python

def mode(given_list):
    highest_list = []
    highest = 0
    index = 0
    for x in range(0, len(given_list)):
        occurrences = given_list.count(given_list[x])
        if occurrences > highest:
            highest = occurrences
            highest_list[0] = given_list[x]
        elif occurrences == highest:
            highest_list.append(given_list[x])
The code is meant to work out the mode of a given list. I do not understand where I am going wrong.
The exact error I am receiving:
line 30, in mode
    highest_list[0] = given_list[x]
IndexError: list assignment index out of range
The problem is that you have an empty list originally:
highest_list = []
And then in the loop you try to access it at index 0:
highest_list[0] = ...
It's impossible, because it's an empty list and so is not indexable at position 0.
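One way to keep the structure of the original function is to replace the list whenever a higher count is found, instead of assigning to index 0. The sketch below is one possible repair, not the only one; it is renamed modes (a hypothetical name) because it returns every tied value, and it also skips values already recorded.

```python
def modes(given_list):
    """Return a list of the most frequent value(s) in given_list."""
    highest_list = []
    highest = 0
    for value in given_list:
        occurrences = given_list.count(value)
        if occurrences > highest:
            highest = occurrences
            highest_list = [value]          # start over with the new leader
        elif occurrences == highest and value not in highest_list:
            highest_list.append(value)      # record a tie once
    return highest_list

print(modes([1, 2, 3, 3, 4]))       # [3]
print(modes([7, 7, 7, 9, 9, 9]))    # [7, 9]
```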
A better way to find the mode of a list is to use a collections.Counter object:
>>> from collections import Counter
>>> L = [1,2,3,3,4]
>>> counter = Counter(L)
>>> max(counter, key=counter.get)
3
>>> [(mode, n_occurrences)] = counter.most_common(1)
>>> mode, n_occurrences
(3, 2)
As far as getting the mode, you can just use a Counter from the collections library
from collections import Counter
x = [0, 1, 2, 0, 1, 0] #0 is the mode
g = Counter(x)
mode = max(g, key = lambda x: g[x])
At that point, at the start of the loop, highest_list is empty, so there's no first index. You can initialize highest_list as [0] so that there is always at least one "highest value."
That said, you can accomplish this more simply as follows:
def mode(given_list):
    return max(set(given_list), key=given_list.count)
This returns the item of given_list with the highest count(). Making a set first ensures that each distinct item is only counted once.

Loop to Match Parts of List

My code:
#prints out samenodes
f = open('newerfile.txt')
mylist = list(f)
count = 0
i = 1
while count < 1000:
    if mylist[i] == mylist[i+12] and mylist [i+3] == mylist [i+14]:
        print mylist[i]
    count = count+1
    i = i+12
My intention is to look at elt 1, elt 2. If elt 1 == elt 13 AND elt 2==elt 14 I want to print elt 1. Then, I want to look at elt 13 and elt 14. If elt 2 matches elt 13+12 AND elt 14 matches elt 14+12 I want to print it. ETC...
There are certainly parts of my list that fit these criteria, but the program returns no output.
One problem is your indices. Be advised that lists begin with an index of 0.
I'm surprised nobody's answered this yet:
#prints out samenodes
f = open('newerfile.txt')
mylist = list(f)
count = 0
i = 0
while count < 1000:
    #print mylist[i]
    #print mylist[i+12]
    #print mylist[i+13]
    #print mylist[i+14]
    #...use prints to help you debug
    if mylist[i] == mylist[i+12] and mylist [i+1] == mylist [i+13]:
        print mylist[i]
    count = count+1
    i = i+12
This is probably what you want.
To iterate over multiple lists (technically, iterables) in "lockstep", you can use zip. In this case, you want to iterate over four versions of mylist, offset by 0, 12, 1 and 13.
zippedLists = zip(mylist, mylist[12:], mylist[1:], mylist[13:])
Next, you want the 0th, 12th, 24th, etc. elements. This is done with a slice (zip returns a list in Python 2, which this code targets):
slicedList = zippedLists[::12]
Then you can iterate over that:
for elt1, elt13, elt2, elt14 in slicedList:
    if elt1 == elt13 and elt2 == elt14:
        print elt1
Putting it together with the file operations, we get
#prints out samenodes
f = open('newerfile.txt')
mylist = list(f)
zippedLists = zip(mylist, mylist[12:], mylist[1:], mylist[13:])
slicedList = zippedLists[::12]
for elt1, elt13, elt2, elt14 in slicedList:
    if elt1 == elt13 and elt2 == elt14:
        print elt1
Code like this is generally considered more "pythonic" than your current version, as using list indexes is generally discouraged when you are iterating over the list.
Note that if you've got a huge number of elements in your list, the above code creates (and destroys at some point) five extra lists. Therefore, you may get better memory performance if you use the equivalent functions in itertools, which use lazy iterators to avoid copying lists needlessly:
from itertools import islice, izip
#prints out samenodes
f = open('newerfile.txt')
mylist = list(f)
# islice(iterable, start, None) skips the first `start` items without copying
zippedLists = izip(mylist, islice(mylist, 12, None), islice(mylist, 1, None), islice(mylist, 13, None))
slicedList = islice(zippedLists, 0, None, 12)
for elt1, elt13, elt2, elt14 in slicedList:
    if elt1 == elt13 and elt2 == elt14:
        print elt1
There's probably a way in itertools to avoid slurping the entire file into mylist, but I'm not sure I remember what it is - I think that is the use case for itertools.tee.
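For Python 3, where izip no longer exists and the built-in zip is already lazy, the same lockstep idea might be sketched as below. The function name matching_starts and the sample data are made up for illustration; the comparison used is the one stated in the question (elt 1 against elt 13 and elt 2 against elt 14, repeated every 12 elements).

```python
from itertools import islice

def matching_starts(items, period=12):
    """Yield items[i] for i = 0, period, 2*period, ... whenever
    items[i] == items[i + period] and items[i + 1] == items[i + period + 1]."""
    # Four lazy views of the same list, offset by 0, period, 1 and period + 1.
    quads = zip(items,
                islice(items, period, None),
                islice(items, 1, None),
                islice(items, period + 1, None))
    # Only look at every `period`-th position.
    for elt1, elt13, elt2, elt14 in islice(quads, 0, None, period):
        if elt1 == elt13 and elt2 == elt14:
            yield elt1

# Three blocks of 12; the first two blocks start with the same pair.
sample = ['a', 'b'] + ['x'] * 10 + ['a', 'b'] + ['y'] * 10 + ['c', 'd'] + ['z'] * 10
print(list(matching_starts(sample)))  # ['a']
```

Because zip stops at the shortest input, there is no risk of an IndexError near the end of the list.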
