I have this piece of code and was wondering if there is any built-in way to do it faster.
words holds a simple tokenized input string.
freq_unigrams = nltk.FreqDist(words)
unigram_list = []
count = 0
for x in freq_unigrams.keys():
    unigram_list.append(x)
    count += 1
    if count >= 1000:
        break
Does freq_unigrams.keys() return a list? If so, how about the following:
unigram_list = freq_unigrams.keys()[:1000]
This gives you a list containing the first 1000 elements of freq_unigrams.keys(), with no looping.
I suggest:
unigram_list = freq_unigrams.keys()
unigram_list[:] = unigram_list[:1000]
This would not make the copy that: unigram_list = freq_unigrams.keys()[:1000] does.
Although this might be better with iterators:
from itertools import islice
unigram_list[:] = islice(freq_unigrams.iterkeys(),1000)
If your intent is to get the top 1000 most frequent words in the words list you could try:
import collections
# get top words and their frequencies
most_common = collections.Counter(words).most_common(1000)
This is theoretically more efficient:
import itertools
unigram_list = list(itertools.islice(freq_unigrams.iterkeys(), 1000))
...than working off freq_unigrams.keys(), because you're only interested in the top 1000 and not the remaining x, which freq_unigrams.keys() would also have to put into the intermediate list.
A little late...
To take the first 1000 keys in your dictionary and assign them to a new list:
unigram_list = freq_unigrams.keys()[:1000]
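On Python 3, dict.keys() returns a view that cannot be sliced and iterkeys() no longer exists, so the snippets above need adjusting. Here is a minimal sketch, using collections.Counter as a stand-in for nltk.FreqDist (an assumption, to keep the example self-contained) and a made-up words list:

```python
from collections import Counter
from itertools import islice

# Hypothetical input; in the question this would be the tokenized text.
words = "the cat sat on the mat the end".split()

# Counter stands in for nltk.FreqDist here (assumption for illustration).
freq = Counter(words)

# First n keys in iteration order, without building the full key list.
first_three = list(islice(freq, 3))

# The n most frequent words, if that is what is actually wanted.
top_two = [word for word, _ in freq.most_common(2)]

print(first_three)
print(top_two)
```

Iterating the mapping directly with islice stops after the requested number of keys, whereas list(freq)[:n] would build the whole list first.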
I have this problem where you have to find all length-three palindromes and print how many there are.
For example:
aabca
Output:
3
aba
aaa
aca
I already know how to get the count with the code below, which I found on the web:
def count_palindromes(s):  # function name added here for illustration
    res = 0
    unq_str = set(s)
    for ch in unq_str:
        st = s.find(ch)
        ed = s.rfind(ch)
        if st < ed:
            res += len(set(s[st+1:ed]))
    return res
but that only gives the count.
So I tried iterating through the string, taking slices of length three and checking whether each one is a palindrome:
for x in range(len(input1)):
    if not x < 3:
        Str1 = input1[x-3:x]
but then I stopped because this only looks at consecutive characters, not at every combination.
Is there any way to do this?
Thanks
I'm not 100% sure this is correct but hopefully it will put you on the correct track.
import itertools
input = "aabca"
palindromes = set() # use a set to ensure no duplicates
# iterate over all combinations of length 3
for t in itertools.combinations(input, 3):
    # is this a palindrome? If so save
    if t == tuple(reversed(t)):
        palindromes.add(''.join(t))
# output results
print(palindromes)
print(len(palindromes))
There may be an itertools recipe which doesn't generate duplicates but I think this works.
Edit: Using join results in a set of strings rather than string characters.
Edit2: To make this equivalent to keithpjolly's answer:
import itertools
input = "aabca"
palindromes = set() # use a set to ensure no duplicates
# iterate over all combinations of length 3
for a,b,c in itertools.combinations(input, 3):
    # is this a palindrome? If so save
    if a == c:
        palindromes.add(''.join((a,b,c)))
# output results
print(palindromes)
print(len(palindromes))
How about:
from itertools import combinations
s = 'aabca'
p = set([''.join([a,b,c]) for a,b,c in combinations(s, 3) if a == c])
print(p)
print(len(p))
Output:
{'aaa', 'aba', 'aca'}
3
Edit - combinations much better than permutations.
Edit - Forgot about the length.
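As an alternative to generating every combination, the find/rfind counting code from the question can be adapted to collect the palindromes themselves. This is only a sketch; the function name three_char_palindromes is made up for illustration:

```python
def three_char_palindromes(s):
    """Collect distinct length-3 palindromic subsequences of s.

    Hypothetical helper adapted from the counting snippet in the question.
    """
    found = set()
    for ch in set(s):
        first, last = s.find(ch), s.rfind(ch)
        if first < last:
            # Any character strictly between the outermost pair of ch
            # can serve as the middle of ch + mid + ch.
            for mid in set(s[first + 1:last]):
                found.add(ch + mid + ch)
    return found


result = three_char_palindromes("aabca")
print(sorted(result))
print(len(result))
```

For each distinct character this scans the string once, so it avoids the cubic growth of combinations(s, 3) on long inputs.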
I am trying to compare the running time between bubble sort and insert sort. It does this on 10 different lists with the first list containing 1000 numbers and then increases by 1000 each time.
The following loop creates the lists and times each list using both methods. It is SUPPOSED to assign each time to a list, and then to another list because the goal is to have a list of lists that I can later use as data points for a graph. Assume sort_1 and sort_2 are implemented sorting algorithms
counter = 0
empty_list = []
list_limit = 1000
data_point = []
list_of_data_points = []
while list_limit <= 10000:
    while counter < list_limit:
        num = random.randint(1, 10000)
        empty_list.append(num)
        counter += 1
    copied_list = list(empty_list)
    bubble_time = sort_1(copied_list)
    insert_time = sort_2(copied_list)
    data_point.clear()
    data_point.append(bubble_time)
    data_point.append(insert_time)
    list_of_data_points.append(data_point)
    print(list_of_data_points)
    list_limit += 1000
Now the reason print(list_of_data_points) is in the loop is because this is where the error is happening. Everything else is working perfectly, but this is what I get when I run the loop:
[[0.10695279999999996, 0.00023099999999998122]]
[[0.43489919999999993, 0.0004752999999999563], [0.43489919999999993, 0.0004752999999999563]]
[[0.9825091000000001, 0.0007314000000000487], [0.9825091000000001, 0.0007314000000000487], [0.9825091000000001, 0.0007314000000000487]]
As you can see, it will calculate the times just fine. However, each time it is supposed to add the data_point into the list, it seems to be clearing the current list and making each data point the same value.
I am confused because I thought the append() method was supposed to just add the current data point list into the larger list of lists (i.e. list_of_data_points). Is there something wrong with my while loop or is this a list issue that I am misunderstanding?
Just replace
data_point.clear()
by
data_point = []
When you do:
a = [1,2,3]
b = a
the assignment to b copies the reference to the list, not the actual list.
So, if you later change a, such as a[0] = 100, b will become [100,2,3] as well.
When doing data_point = [], you are creating a new list to hold the new values.
So, in the first time, list_of_data_points will be:
[first_list_under_the_name_of_data_point]
The second time:
[first_list_under_the_name_of_data_point, second_list_under_the_name_of_data_point]
And so on.
On the other hand, when you do data_point.clear(), you do not create a new list, so, in the first time:
[first_list_under_the_name_of_data_point]
The second time:
[first_list_under_the_name_of_data_point_but_with_updated_values,
first_list_under_the_name_of_data_point_but_with_updated_values]
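The difference can be reproduced in a few lines. A minimal sketch with made-up values standing in for the timings:

```python
# Case 1: one list object is cleared and reused on every iteration.
shared = []
aliased = []
for value in (1, 2, 3):
    shared.clear()
    shared.append(value)
    aliased.append(shared)  # appends a reference to the same object

# Case 2: a new list object is created on every iteration.
fresh = []
for value in (1, 2, 3):
    point = [value]         # brand-new list each time
    fresh.append(point)

print(aliased)  # every entry shows the last value written
print(fresh)    # every entry keeps its own value
print(aliased[0] is aliased[1])
```

In the first case the outer list holds three references to one inner list, which is why all rows change together.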
data_point gets cleared, then you append bubble_time and insert_time; both are timings of the same data, just sorted with a different method.
That means you append two values for the same data (I guess you just sort it in a different way). If that's what you want, you can do this:
data_point = [bubble_time, insert_time] # replace data_point.clear() and the two data_point.append() calls with this
I am not that good with English, but I am glad if that helped. If I got something wrong, feel free to write a comment so I can help as much as I can.
Try that:
import random
import numpy as np
data_point = []
for i in range(10):
    num_size = (i+1) * 1000
    random_numbers = list(np.random.randint(low = 1, high = 1000, size=num_size))
    bubble_time = len(random_numbers)
    insert_time = len(random_numbers)
    data_point.insert(i, [bubble_time, insert_time])
Maybe, it would be even better for you to work with a dictionary in this case:
import random
import numpy as np
data_dict = {}
for i in range(10):
    num_size = (i+1) * 1000
    random_numbers = list(np.random.randint(low = 1, high = 1000, size=num_size))
    bubble_time = sum(random_numbers)
    insert_time = len(random_numbers)
    data_dict[num_size]= [bubble_time, insert_time]
print(data_dict)
# Output of print(data_dict)
{1000: [527103, 1000], 2000: [994513, 2000], 3000: [1490553, 3000], 4000: [2007809, 4000], 5000: [2539145, 5000], 6000: [3039272, 6000], 7000: [3501393, 7000], 8000: [4011519, 8000], 9000: [4526147, 9000], 10000: [5053926, 10000]}
I chose len() and sum() in order to have two substitutes for your functions.
Working with this dict enables you to plot the data_points with the correct labels easily, for example:
import matplotlib.pyplot as plt
lists = sorted(data_dict.items()) # sorted by key, return a list of tuples
x, y = zip(*lists) # unpack a list of pairs into two tuples
plt.plot(x, [i[0] for i in y])
plt.plot(x, [i[1] for i in y])
plt.show()
This might seem a little bit overwhelming; I just wanted to give you a piece of concise, working code. If you are interested in certain aspects/lines/methods that don't make sense to you, just ask me!
How do I take the first character from each string in a list, join them together, then the second character from each string, join them together, and so on - and eventually create one combined string?
eg. if I have strings like these:
homanif
eiesdnt
ltiwege
lsworar
I want the end result to be helloitsmeiwaswonderingafter
I put together a very hackneyed version of this which does the job but produces an extra line of gibberish. Considering this is prone to index going out of range, I don't think this is a good approach:
final_c = ['homanif', 'eiesdnt', 'ltiwege', 'lsworar']
final_message = ""
current_char = 0
for i in range(len(final_c[1])):
    for c in final_c:
        final_message += c[current_char]
    current_char += 1
final_message += final_c[0][:-1]
print(final_message)
gives me helloitsmeiwaswonderingafterhomani when it should simply stop at helloitsmeiwaswonderingafter.
How do I improve this?
Problems related to iterating in some convoluted order can often be solved elegantly with itertools.
Using zip
You can use zip and itertools.chain together.
from itertools import chain
final_c = ['homanif', 'eiesdnt', 'ltiwege', 'lsworar']
final_message = ''.join(chain.from_iterable(zip(*final_c))) # 'helloitsmeiwaswonderingafter'
If the strings in final_c can have different lengths, you can tweak the code a bit by using itertools.zip_longest, which pads the shorter strings with None; filter(None, ...) then drops the padding.
from itertools import chain, zip_longest
final_message = ''.join(filter(None, chain.from_iterable(zip_longest(*final_c))))
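For strings of unequal length, passing fillvalue='' to zip_longest is another option that avoids the filter step. A small sketch with made-up input:

```python
from itertools import chain, zip_longest

# Hypothetical input with strings of different lengths.
words = ["abc", "de", "f"]

# zip_longest pads exhausted strings with the fill value; using ''
# means the padding simply disappears when joined.
columns = zip_longest(*words, fillvalue="")
merged = "".join(chain.from_iterable(columns))

print(merged)
```

Here the columns are ('a', 'd', 'f'), ('b', 'e', '') and ('c', '', ''), so joining them reads the characters position by position.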
Using cycle
The fun part with itertools is that it offers plenty of clever short solutions for iterating over objects. Here is another using itertools.cycle.
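from itertools import cycle, takewhile
final_c = ['homanif', 'eiesdnt', 'ltiwege', 'lsworar']
# next(w, '') yields '' once a word is exhausted and takewhile then stops the
# endless cycle (a bare next(w) would raise RuntimeError on Python 3.7+)
chars = (next(w, '') for w in cycle([iter(w) for w in final_c]))
final_message = ''.join(takewhile(bool, chars))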
You can use a nested comprehension:
x = ["homanif",
     "eiesdnt",
     "ltiwege",
     "lsworar"]
y = "".join(x[i][j]
            for j in range(len(x[0]))
            for i in range(len(x)))
or use nested joins and zip
y = "".join("".join(y) for y in zip(*x))
Here is a code that works for me :
final_c = ["homanif", "eiesdnt", "ltiwege", "lsworar"]
final_message = ""
current_char = 0
for i in range(len(final_c[1])):
    for c in final_c:
        final_message += c[current_char]
    current_char += 1
# final_message += final_c[0][:-1]
print(final_message)
I hope it helps
I don't understand what you are expecting with the line
final_message += final_c[0][:-1]
The code works just fine without that. Either remove that line or go with something like list comprehensions :
final_message = "".join(final_c[i][j] for j in range(len(final_c[0])) for i in range(len(final_c)))
This gives the expected output:
helloitsmeiwaswonderingafter
It looks like you can treat the input as an n x m matrix, where n is the number of words and m is the number of characters in a word (the following code only works if all your words have the same length):
import numpy as np
n = len(final_c) # number of words in your list
m = len(final_c[0]) # number of characters in a word
# one array element per character, in reading order
array = np.array(list(''.join(final_c)))
# reshape the array so that each row is one word
matrix = array.reshape(n, m)
# transposing and flattening reads the characters column by column
''.join(matrix.transpose().flatten())
import random as rd
ListNumbers1 = []
List1 = []
for j in range(1000):
    ListNumbers1 = rd.randint(1,10000)
How would I get the 50 highest numbers from ListNumbers1 and append them to List1?
Something like this?
List1.extend(sorted(ListNumbers1)[-50:])
You're assigning the same value over and over in your loop, destroying your list in the process. Use append...
Better: create a list comprehension of the numbers, then use heapq.nlargest to directly get the 50 highest numbers:
import random as rd
import heapq
highest_50 = heapq.nlargest(50,[rd.randint(1,10000) for _ in range(1000)])
print(highest_50)
a result:
[9994, 9983, 9983, 9968, 9934, 9925, 9913, 9912, 9909, 9909, 9902, 9889, 9884, 9880, 9811, 9794, 9793, 9792, 9765, 9756, 9750, 9748, 9738, 9737, 9709, 9707, 9704, 9700, 9691, 9686, 9635, 9618, 9617, 9605, 9604, 9593, 9586, 9584, 9573, 9569, 9569, 9557, 9531, 9528, 9522, 9438, 9438, 9431, 9402, 9400]
Just for fun, I have a more efficient solution:
from random import randint
import heapq
# create a list of 1000 random numbers
# take the negative, so the min heap does what we want
dataset = [-randint(1, 10000) for _ in range(1000)]
# O(n) function to impose the heap invariant
heapq.heapify(dataset)
# sorting is O(n log n)
# extracting from a heap is log n per item
# therefore taking the 50 biggest is much more efficient if
# we use a heap to extract only the ones we need
top50 = [-heapq.heappop(dataset) for _ in range(50)]
print(top50)
This is a faster solution because the 50 you want to extract is much less than the 1000 total input size. I renamed the variables, but that's just my personal preference.
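To check that the sort-based and heap-based answers agree, here is a small sketch that runs both on the same data. The fixed seed is only there to make the run repeatable, and the variable names are made up:

```python
import heapq
import random

random.seed(0)  # arbitrary seed, only for repeatability
numbers = [random.randint(1, 10000) for _ in range(1000)]

# Sort everything in descending order, then keep the first 50.
by_sort = sorted(numbers, reverse=True)[:50]

# Let heapq pick the 50 largest without sorting the whole list.
by_heap = heapq.nlargest(50, numbers)

print(by_sort == by_heap)
```

Both approaches return the values in descending order, so the two lists can be compared directly.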
Like that (notice how assigning random number is fixed):
import random as rd
ListNumbers1 = []
List1 = []
for j in range(1000):
    ListNumbers1.append(rd.randint(1,10000)) # fix: append each random element
ListNumbers1.sort() # sort the list, ascending order
List1 = ListNumbers1[-50:] # get last 50 elements of the list and assign it to List1
To search for the n biggest numbers in a list you need two procedures: one to find the biggest, the other to extract the n biggest.
You could also sort the list and take the n first, but this is not my approach because I needed to keep the original list and study it as is. For example, you could also know the offsets of each chosen number, which could be useful in some case.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
def n_biggest(lis, howmany):
    # this helper returns the biggest item of a list with its offset
    def biggest_in_list(lis):
        n = 0
        biggest = lis[0]  # start from the first item so negative numbers work too
        offset = 0
        for item in lis:
            n = n + 1
            if item > biggest:
                biggest = item
                offset = n - 1
        return [biggest, offset]
    # now this is the part where we create the descending list
    image = list(lis)  # a copy prevents finding the same number twice and keeps the original list intact
    result_list = []
    if len(lis) < howmany:  # this will prevent errors if the list is too small
        howmany = len(lis)
    for i in range(howmany):
        find1 = biggest_in_list(image)
        result_list.append(find1[0])
        image.pop(find1[1])
    return result_list
print(n_biggest([5, 4, 6, 10, 233, 422], 3))  # this line was for testing the above
Hope this helps,
Regards,
import random as rd
List1 = sorted([rd.randint(1,10000) for j in range(1000)])[-50:]
Slice the last 50 elements of a sorted list comprehension, then you don't need ListNumbers1
I have a list of words and I'd like to find out how many times each permutation occurs in this list of words.
I'd like to count overlapping permutations as well, so count() doesn't seem to be appropriate.
For example, the permutation aba appears twice in this string:
ababa
However, count() would report one.
So I designed this little script, but I am not sure it is efficient. The array of words comes from an external file; I just removed that part to make it simpler.
import itertools
# Occurrence counting function
def occ(string, sub):
    count = start = 0
    while True:
        start = string.find(sub, start) + 1
        if start > 0:
            count += 1
        else:
            return count
# permutation generator
abc = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
permut = [''.join(p) for p in itertools.product(abc, repeat=2)]
# Transform osd7 in array
arrayofWords = ['word1', "word2", "word3", "word4"]
dict_output = {}
dict_output['total'] = 0
# create the array
for perm in permut:
    dict_output[perm] = 0
# iterate over the arrayofWords and permutation
for word in arrayofWords:
    for perm in permut:
        dict_output[perm] = dict_output[perm] + occ(word, perm)
        dict_output['total'] = dict_output['total'] + occ(word, perm)
It is working, but it takes a looonnnggg time. If I replace product(abc,repeat=2) with product(abc,repeat=3) or product(abc,repeat=4), it will take a full week!
The question: Is there a more efficient way?
Very simple: count only what you need to count.
from collections import defaultdict
quadrigrams = defaultdict(lambda: 0)
for word in arrayofWords:
    for i in range(len(word) - 3):
        quadrigrams[word[i:i+4]] += 1
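The same idea works for any n-gram length. A minimal sketch using collections.Counter; the function name count_ngrams and the sample words are made up for illustration:

```python
from collections import Counter


def count_ngrams(words, n):
    """Count every overlapping substring of length n in each word.

    Hypothetical helper; only substrings that actually occur are stored.
    """
    counts = Counter()
    for word in words:
        # A word of length L has L - n + 1 windows of length n.
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


counts = count_ngrams(["ababa", "abc"], 3)
print(counts["aba"])
print(sum(counts.values()))
```

The work is proportional to the total length of the input rather than to the number of possible letter sequences, and sum(counts.values()) gives the grand total that the question kept under the 'total' key.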
You can use re module to count overlapping match.
import re
print(len(re.findall(r'(?=(aba))', 'ababa')))
Output:
2
More generally,
print(len(re.findall(r'(?=(<pattern>))', '<input_string>')))
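If the substring comes from data rather than being typed by hand, it may contain characters that are special in regular expressions. Here is a sketch that wraps the lookahead in a helper and escapes the substring first; the name count_overlapping is made up for illustration:

```python
import re


def count_overlapping(text, sub):
    """Count occurrences of sub in text, including overlapping ones.

    Hypothetical helper; re.escape makes sub match literally.
    """
    # A lookahead matches at a position without consuming characters,
    # so the search can find a new match starting one character later.
    pattern = "(?=" + re.escape(sub) + ")"
    return len(re.findall(pattern, text))


print(count_overlapping("ababa", "aba"))
print(count_overlapping("aaaa", "aa"))
print(count_overlapping("a.a.a", "a.a"))
```

Without re.escape, the dot in the last call would match any character instead of a literal dot.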