Deciphering script in Python issue

Cheers, I am looking for help with my small Python project. The assignment says the program has to be able to decipher a monoalphabetic substitution cipher, given a complete database of words that will definitely appear (at least once) in the ciphered text.
I have tried to create such a database of the words that get ciphered:
lst_sample = []
n = int(input('Number of words in database: '))
for i in range(n):
    x = input()
    lst_sample.append(x)
The way I am trying to "decipher" is to look at each word's structure: I assign each distinct letter a number based on the order in which it first appears in the word (e.g. feed = 0112 and hood = 0112 are the same, because each is a combination of three different letters in the same arrangement). I use the helper function pattern() for this:
def pattern(word):
    nextNum = 0
    letterNums = {}
    wordPattern = []
    for letter in word:
        if letter not in letterNums:
            letterNums[letter] = str(nextNum)
            nextNum += 1
        wordPattern.append(letterNums[letter])
    return ''.join(wordPattern)
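For example, the two sample words above map to the same pattern:
print(pattern('feed'))                     # '0112'
print(pattern('hood'))                     # '0112'
print(pattern('feed') == pattern('hood'))  # True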
Right after that, I made the database of ciphered words:
lst_en = []
q = input('Insert ciphered words: ')
if q == '':
    print(lst_en)
else:
    lst_en.append(q)
With these two databases I could finally create the deciphering process:
for i in lst_en:
    for q in lst_sample:
        x = q
        word = i
        if pattern(x) == pattern(word):
            print(x)
            print(word)
            print()
If the words in the lst_sample database have different lengths (e.g. food, car, yellow), there is no problem assigning the decrypted words; even when they have the same length, I can still sort them out by their different structure (e.g. puff, sort).
The main problem, which I am not able to solve, comes when two words have the same length and the same structure (e.g. jane, word).
I have no idea how to solve this while keeping the script architecture described above. Is there any way it could be solved using another if statement or anything similar? Is there any way to solve it using the information that the words in lst_sample will for sure be in the ciphered text?
Thanks for all help!
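One possible direction, sketched here purely as an illustration (the helper names mapping() and merge() are not from the question), uses the guarantee that every word in lst_sample occurs in the ciphered text: collect the pattern-matching candidates for each ciphered word, then keep only those combinations that a single consistent letter substitution can explain. With several ciphered words, pairings like jane/word get filtered out whenever their implied letter maps conflict with the rest.

from itertools import product

def mapping(cipher_word, plain_word):
    # letter map implied by pairing the two words, or None if the pairing
    # is not a valid one-to-one substitution
    m = {}
    for c, p in zip(cipher_word, plain_word):
        if m.get(c, p) != p:
            return None
        m[c] = p
    if len(set(m.values())) != len(m):   # two cipher letters mapping to one plain letter
        return None
    return m

def merge(m1, m2):
    # combine two partial letter maps, or return None if they conflict
    merged = dict(m1)
    for c, p in m2.items():
        if merged.get(c, p) != p:
            return None
        merged[c] = p
    if len(set(merged.values())) != len(merged):
        return None
    return merged

# candidates[i] = all sample words whose pattern matches ciphered word i
candidates = [[w for w in lst_sample if pattern(w) == pattern(cw)] for cw in lst_en]

# keep only the assignments where one global substitution explains every pair
for combo in product(*candidates):
    total = {}
    for cw, pw in zip(lst_en, combo):
        m = mapping(cw, pw)
        total = merge(total, m) if m is not None else None
        if total is None:
            break
    if total is not None:
        print(dict(zip(lst_en, combo)))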

Related

TKinter and JSON

I'm looking for a resource to teach me how to connect Tkinter and JSON.
For example, take an input value (a word), search for that word in the JSON, then print out the result of the search.
By the way, I already have the Python application working through the terminal, but I want to go a step further and build a GUI.
Thank you.
import json  # import the JSON module
from difflib import get_close_matches  # difflib provides classes and functions for
                                        # comparing sequences; get_close_matches returns
                                        # a list of the best "good enough" matches

data = json.load(open("data.json"))  # load the JSON file into a Python dictionary

def translate(w):
    w = w.lower()  # change the input to lower case
    if w in data:  # first scenario: the word exists in the dictionary, so return its data
        return data[w]
    elif w.title() in data:  # when the user inputs a proper noun,
        return data[w.title()]  # return the definition of names that start with a capital letter
    elif w.upper() in data:  # definitions of acronyms
        return data[w.upper()]
    elif len(get_close_matches(w, data.keys())) > 0:  # second scenario: compare the word and get the best match
        # ask the user whether the closest match is what they were looking for
        YN = input("Did you mean %s instead? Enter y if yes or n if no:" % get_close_matches(w, data.keys())[0])
        if YN == "y":
            return data[get_close_matches(w, data.keys())[0]]
        elif YN == "n":
            return "The word doesn't exist. Please double check it."
        else:
            return "We didn't understand your entry."
    # third scenario: the word doesn't match anything and can't be found
    else:
        return "The word doesn't exist. Please double check it."

word = input("Enter word: ")
# in some cases the word has more than one definition, so make the output more readable
output = translate(word)
if type(output) == list:
    for item in output:
        print(item)
else:
    print(output)
Here's how to do it!
Sketch some pictures of what your current program would look like if it had a GUI.
Would "Did you mean %s instead?" be a popup box?
Would you have a list of all the known words?
Build the UI using tkinter.
Connect the UI up to the functions in your program. (You do have those, don't you?)
Your program isn't quite ready yet, since you are doing things like calling input() inside your functions. Rework it so that these calls live outside your functions; then it'll probably be ready.
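As a very rough sketch of the "build the UI" and "connect it up" steps (assuming the lookup logic has been pulled out of translate() into a plain function that takes a word and returns a result; the lookup() function and its tiny inline dictionary below are placeholders, not part of the original program):

import tkinter as tk

# placeholder for the reworked dictionary lookup; in the real program this would wrap
# the data loaded from data.json and the get_close_matches logic, without calling input()
def lookup(word):
    data = {"python": "A large snake, or a programming language."}
    return data.get(word.lower(), "The word doesn't exist. Please double check it.")

def on_search():
    result = lookup(entry.get().strip())
    if isinstance(result, list):   # some words have several definitions
        result = "\n".join(result)
    result_var.set(result)

root = tk.Tk()
root.title("Dictionary")

entry = tk.Entry(root, width=30)
entry.pack(padx=10, pady=5)

tk.Button(root, text="Search", command=on_search).pack(pady=5)

result_var = tk.StringVar()
tk.Label(root, textvariable=result_var, wraplength=300, justify="left").pack(padx=10, pady=10)

root.mainloop()

The "Did you mean %s instead?" question would then become a tkinter.messagebox.askyesno() popup rather than an input() call.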

Frequency of keywords in a list

Hi, so I have 2 text files. I have to read the first text file, count the frequency of each word, remove duplicates, and create a list of lists with each word and its count in the file.
My second text file contains keywords, and I need to count the frequency of these keywords in the first text file and return the result, without using any imports, dict, or zips.
I am stuck on how to go about this second part. I have the file open and have removed punctuation etc., but I have no clue how to find the frequency.
I played around with the idea of .find() but no luck as of yet.
Any suggestions would be appreciated. This is my code at the moment; it seems to find the frequency of the keywords in the keyword file, but not in the first text file.
def calculateFrequenciesTest(aString):
    listKeywords = aString
    listSize = len(listKeywords)
    keywordCountList = []
    while listSize > 0:
        targetWord = listKeywords[0]
        count = 0
        for i in range(0, listSize):
            if targetWord == listKeywords[i]:
                count = count + 1
        wordAndCount = []
        wordAndCount.append(targetWord)
        wordAndCount.append(count)
        keywordCountList.append(wordAndCount)
        for i in range(0, count):
            listKeywords.remove(targetWord)
        listSize = len(listKeywords)
    sortedFrequencyList = readKeywords(keywordCountList)  # readKeywords is another function of mine (not shown)
    return keywordCountList
EDIT: I am currently toying with the idea of reopening my first file again, but this time without turning it into a list. I think my errors are somehow coming from counting the frequency within my list of lists. These are the types of results I am getting:
[[['the', 66], 1], [['of', 32], 1], [['and', 27], 1], [['a', 23], 1], [['i', 23], 1]]
You can try something like:
I am taking a list of words as an example.
word_list = ['hello', 'world', 'test', 'hello']
frequency_list = {}
for word in word_list:
    if word not in frequency_list:
        frequency_list[word] = 1
    else:
        frequency_list[word] += 1
print(frequency_list)
RESULT: {'test': 1, 'world': 1, 'hello': 2}
Since you have put a constraint on dicts, I have made use of two lists to do the same task. I am not sure how efficient it is, but it serves the purpose.
word_list = ['hello', 'world', 'test', 'hello']
frequency_list = []
frequency_word = []
for word in word_list:
    if word not in frequency_word:
        frequency_word.append(word)
        frequency_list.append(1)
    else:
        ind = frequency_word.index(word)
        frequency_list[ind] += 1
print(frequency_word)
print(frequency_list)
RESULT : ['hello', 'world', 'test']
[2, 1, 1]
You can change it to how you like or re-factor it as you wish
I agree with @bereal that you should use Counter for this. I see that you have said that you don't want "imports, dict, or zips", so feel free to disregard this answer. Yet, one of the major advantages of Python is its great standard library, and every time you have list available, you'll also have dict, collections.Counter and re.
From your code I'm getting the impression that you want to use the same style that you would have used with C or Java. I suggest trying to be a little more pythonic. Code written this way may look unfamiliar and can take time getting used to. Yet, you'll learn way more.
Clarifying what you're trying to achieve would help. Are you learning Python? Are you solving this specific problem? Why can't you use any imports, dict, or zips?
So here's a proposal utilizing built in functionality (no third party) for what it's worth (tested with Python 2):
#!/usr/bin/python

import re           # string matching
import collections  # collections.Counter basically solves your problem

def loadwords(s):
    """Find the words in a long string.

    Words are separated by whitespace. Typical signs are ignored.
    """
    return (s
            .replace(".", " ")
            .replace(",", " ")
            .replace("!", " ")
            .replace("?", " ")
            .lower()).split()

def loadwords_re(s):
    """Find the words in a long string.

    Words are separated by whitespace. Only characters and ' are allowed in strings.
    """
    return (re.sub(r"[^a-z']", " ", s.lower())
            .split())

# You may want to read this from a file instead
sourcefile_words = loadwords_re("""this is a sentence. This is another sentence.
Let's write many sentences here.
Here comes another sentence.
And another one.
In English, we use plenty of "a" and "the". A whole lot, actually.
""")

# Sets are really fast for answering the question: "is this element in the set?"
# You may want to read this from a file instead
keywords = set(loadwords_re("""
of and a i the
"""))

# Count every word in sourcefile_words (the keywords are not treated specially here)
wordcount_all = collections.Counter(sourcefile_words)

# Look up word counts like this (Counter is a dictionary)
count_this = wordcount_all["this"]  # returns 2
count_a = wordcount_all["a"]        # returns 3

# Only count the words that are in the keywords set
wordcount_keywords = collections.Counter(word
                                         for word in sourcefile_words
                                         if word in keywords)

count_and = wordcount_keywords["and"]             # returns 2
all_counted_keywords = wordcount_keywords.keys()  # returns ['a', 'and', 'the', 'of']
Here is a solution with no imports. It uses nested linear searches which are acceptable with a small number of searches over a small input array, but will become unwieldy and slow with larger inputs.
Still, the input here is quite large, and it handles it in reasonable time. I suspect that if your keywords file were larger (mine has only 3 words) the slowdown would start to show.
Here we take an input file, iterate over the lines, remove punctuation, then split on spaces and flatten all the words into a single list. The list has dupes, so to remove them we sort the list so the dupes come together, and then iterate over it creating a new list containing each string and a count. We do this by incrementing the count for as long as the same word keeps appearing in the list, and moving to a new entry when a new word is seen.
Now you have your word frequency list and you can search it for the required keyword and retrieve the count.
The input text file is here and the keyword file can be cobbled together with just a few words in a file, one per line.
Python 3 code; it indicates, where applicable, how to modify it for Python 2.
# use string.punctuation if you are somehow allowed
# to import the string module.
translator = str.maketrans('', '', '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')

words = []
with open('hamlet.txt') as f:
    for line in f:
        if line:
            line = line.translate(translator)
            # py 2 alternative
            #line = line.translate(None, string.punctuation)
            words.extend(line.strip().split())

# sort the word list, so instances of the same word are
# contiguous in the list and can be counted together
words.sort()

thisword = ''
counts = []

# for each word in the list add to the count as long as the
# word does not change
for w in words:
    if w != thisword:
        counts.append([w, 1])
        thisword = w
    else:
        counts[-1][1] += 1

for c in counts:
    print('%s (%d)' % (c[0], c[1]))

# function to prevent need to break out of nested loop
def findword(clist, word):
    for c in clist:
        if c[0] == word:
            return c[1]
    return 0

# open keywords file and search for each word in the
# frequency list.
with open('keywords.txt') as f2:
    for line in f2:
        if line:
            word = line.strip()
            thiscount = findword(counts, word)
            print('keyword %s appears %d times in source' % (word, thiscount))
If you were so inclined you could modify findword to use a binary search, but it's still not going to be anywhere near a dict. collections.Counter is the right solution when there are no restrictions; it's quicker and less code.
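For what it's worth, a minimal sketch of that binary-search variant (the name findword_bs is just illustrative; it relies on counts being sorted by word, which it already is because words was sorted before counting):

# binary-search version of findword; clist must be sorted by word
def findword_bs(clist, word):
    lo, hi = 0, len(clist) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if clist[mid][0] == word:
            return clist[mid][1]
        elif clist[mid][0] < word:
            lo = mid + 1
        else:
            hi = mid - 1
    return 0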

How do I optimize the speed of my python compression code?

I have made a compression program and have tested it on 10 KB text files, which took no less than 3 minutes. However, I've tested it with a 1 MB file, which is the assessment assigned by my teacher, and it takes longer than half an hour. Compared to my classmates', mine is unusually slow. It might be my computer or my code, but I have no idea. Does anyone know any tips or shortcuts to make my code faster? My compression code is below; if there are any quicker ways of doing the loops, etc., please send me an answer (:
(By the way, my code DOES work, so I'm not asking for corrections, just enhancements or tips, thanks!)
import re  # used to enable functions (loops, etc.) to find patterns in the text file
import os  # used for anything referring to directories (files)
from collections import Counter  # used to keep track of how many times values are added

size1 = os.path.getsize('file.txt')  # find the size (in bytes) of your file, INCLUDING SPACES
print('The size of your file is ', size1,)

words = re.findall('\w+', open('file.txt').read())
wordcounts = Counter(words)  # counts all words, even capitals
common100 = [x for x, it in Counter(words).most_common(100)]  # identifies the 100 most common words

keyword = []
kcount = []
z = dict(wordcounts)
for key, value in z.items():
    keyword.append(key)  # adds each keyword to the array called keyword
    kcount.append(value)

characters = ['$','#','#','!','%','^','&','*','(',')','~','-','/','{','[', ']', '+','=','}','|', '?','cb',
              'dc','fd','gf','hg','kj','mk','nm','pn','qp','rq','sr','ts','vt','wv','xw','yx','zy','bc',
              'cd','df','fg','gh','jk','km','mn','np','pq','qr','rs','st','tv','vw','wx','xy','yz','cbc',
              'dcd','fdf','gfg','hgh','kjk','mkm','nmn','pnp','qpq','rqr','srs','tst','vtv','wvw','xwx',
              'yxy','zyz','ccb','ddc','ffd','ggf','hhg','kkj','mmk','nnm','ppn','qqp','rrq','ssr','tts','vvt',
              'wwv','xxw','yyx','zzy','cbb','dcc','fdd','gff','hgg','kjj','mkk','nmm','pnn','qpp','rqq','srr',
              'tss','vtt','wvv','xww','yxx','zyy','bcb','cdc','dfd','fgf','ghg','jkj','kmk','mnm','npn','pqp',
              'qrq','rsr','sts','tvt','vwv','wxw','xyx','yzy','QRQ','RSR','STS','TVT','VWV','WXW','XYX','YZY',
              'DC','FD','GF','HG','KJ','MK','NM','PN','QP','RQ','SR','TS','VT','WV','XW','YX','ZY','BC',
              'CD','DF','FG','GH','JK','KM','MN','NP','PQ','QR','RS','ST','TV','VW','WX','XY','YZ','CBC',
              'DCD','FDF','GFG','HGH','KJK','MKM','NMN','PNP','QPQ','RQR','SRS','TST','VTV','WVW','XWX',
              'YXY','ZYZ','CCB','DDC','FFD','GGF','HHG','KKJ','MMK','NNM','PPN','QQP','RRQ','SSR','TTS','VVT',
              'WWV','XXW','YYX','ZZY','CBB','DCC','FDD','GFF','HGG','KJJ','MKK','NMM','PNN','QPP','RQQ','SRR',
              'TSS','VTT','WVV','XWW','YXX','ZYY','BCB','CDC','DFD','FGF','GHG','JKJ','KMK','MNM','NPN','PQP',]  # characters which I can use

symbols_words = []
char = 0
for i in common100:
    symbols_words.append(characters[char])  # pairs off one symbol per common word
    char = char + 1

print("Compression has now started")

f = 0
g = 0
no = 0
while no < 100:
    for i in common100:
        for w in words:
            if i == w and len(i) > 1:  # if the values in common100 are ACTUALLY in words
                place = words.index(i)  # find exactly where the most common words are in the text
                symbols = symbols_words[common100.index(i)]  # assigns one character to one common word
                words[place] = symbols  # replaces the word with the symbol
                g = g + 1
    no = no + 1

string = words
stringMade = ' '.join(map(str, string))  # turns the list into a string so it can be written to a text file

file = open("compression.txt", "w")
file.write(stringMade)  # writes everything in the variable 'words' into the new file
file.close()

size2 = os.path.getsize('compression.txt')
no1 = int(size1)
no2 = int(size2)
print('Compression has finished.')
print('Your original file size has been compressed by', 100 - ((100/no1) * no2), 'percent.'
      'The size of your file now is ', size2)
Using something like
word_substitutes = dict(zip(common100, characters))
will give you a dict that maps common words to their corresponding symbol.
Then you can simply iterate over the words:
# Iterate over all the words.
# Use enumerate because we're going to modify the word in-place in the words list.
for word_idx, word in enumerate(words):
    # If the current word is in the `word_substitutes` dict, then we know it's one of the
    # 'common' words and can be replaced by its symbol
    if word in word_substitutes:
        # Replace the word in-place
        replacement_symbol = word_substitutes[word]
        words[word_idx] = replacement_symbol
This will give much better performance, because the dictionary lookup used for the common-word symbol mapping is effectively constant time (hashing) rather than linear. So the overall complexity will be something like O(N), rather than the O(N^3) you get from the 2 nested loops with the .index() call inside them.
The first thing I see that is bad for performance is:
for i in common100:
    for w in words:
        if i == w and len(i) > 1:
            ...
What you are doing is seeing if the word w is in your list of common100 words. However, this check can be done in O(1) time by using a set and not looping through all of your top 100 words for each word.
common_words = set(common100)
for w in words:
    if w in common_words:
        ...
Generally you would do the following:
Measure how much time each "part" of your program needs. You could use a profiler (e.g. cProfile in the standard library) or simply sprinkle some times.append(time.time()) calls into your code and compute the differences (a small sketch follows this list). Then you know which part of your code is slow.
See if you can improve the algorithm of the slow part. gnicholas' answer shows one possibility to speed things up. The while no < 100 loop seems suspicious; maybe that can be improved. This step needs understanding of the algorithms you use. Be careful to select the best data structures for your use case.
If you can't use a better algorithm (because you always use the best way to calculate something) you need to speed up the computations themselves. Numerical stuff benefits from numpy; with cython you can basically compile Python code to C, and numba uses LLVM to compile.
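A tiny sketch of that checkpoint-timing idea (the labels and the two "parts" are placeholders for whatever sections your program actually has):

import time

checkpoints = [("start", time.time())]

# ... part 1: counting the words ...
checkpoints.append(("after word counting", time.time()))

# ... part 2: substituting the symbols ...
checkpoints.append(("after substitution", time.time()))

# print how long each part took
for (_, t0), (label, t1) in zip(checkpoints, checkpoints[1:]):
    print("%-25s %.3f s" % (label, t1 - t0))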

importing random words from a file without duplicates Python

I'm attempting to create a program which selects 10 words from a text file that contains 10 or more words. When importing these 10 words from the text file, I must not import the same word twice. Currently I'm using a list for this, but the same words keep appearing. I have some knowledge of sets and know they cannot hold the same value twice, but as of now I'm clueless on how to solve this. Any help would be much appreciated. Thanks!
Please find the relevant code below (p.s. FileSelection is basically an open-file dialog).
def GameStage03_E():
    global WordList
    if WrdCount >= 10:
        WordList = []
        for n in range(0, 10):
            FileLines = open(FileSelection).read().splitlines()
            RandWrd = random.choice(FileLines)
            WordList.append(RandWrd)
        SelectButton.destroy()
        GameStage01Button.destroy()
        GameStage04_E()
    elif WrdCount <= 10:
        tkinter.messagebox.showinfo("ERROR", " Insufficient Amount Of Words Within Your Text File! ")
Make WordList a set:
WordList = set()
Then update that set instead of appending:
WordList.update(set([RandWrd]))
Of course WordList would be a bad name for a set.
There are a few other problems though:
Don't use uppercase names for variables and functions (follow PEP8)
What happens if you draw the same word twice in your loop? There is no guarantee that WordList will contain 10 items after the loop completes, if words may appear multiple times.
The latter might be addressed by changing your loop to:
while len(WordList) < 10:
    FileLines = open(FileSelection).read().splitlines()
    RandWrd = random.choice(FileLines)
    WordList.update(set([RandWrd]))
You would have to account for the case that there don't exist 10 distinct words after all, though.
Even then the loop would still be quite inefficient as you might draw the same word over and over and over again with random.choice(FileLines). But maybe you can base something useful off of that.
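For what it's worth, one way to avoid the repeated drawing entirely (an alternative sketch, not part of the answer above) is to deduplicate the file's lines first and then let random.sample pick 10 distinct words in one go:

import random

# read the file once and keep a single copy of each non-empty line
with open(FileSelection) as f:
    unique_words = list(set(line.strip() for line in f if line.strip()))

if len(unique_words) >= 10:
    word_list = random.sample(unique_words, 10)  # 10 distinct words, no repeats
else:
    print("Insufficient amount of words within your text file!")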
Not sure I understand you right, but:
Line 3, "if WrdCount >= 10:": where did you give WrdCount a value?
Maybe you intended something along the lines below?
wordset = set()   # note: {} would create a dict, not a set
wrdcount = len(wordset)
while wrdcount < 10:
    # do some work here to update the set
    # break when end-of-file is reached
    ...

Index error in loop to separate body of text by speaker in python

I've got a corpus of text which takes the following form:
JOHN: Thanks for coming, everyone!
(EVERYONE GRUMBLES)
ROGER: They're really glad to see you, huh?
DAVIS: They're glad to see the both of you.
In order to analyze the text, I want to divide it into chunks by speaker. I want to retain John and Roger, but not Davis. I also want to find the number of times certain phrases like (EVERYONE GRUMBLES) occur during each person's speech.
My first thought was to use NLTK, so I imported it and used the following code to remove all the punctuation and tokenize the text, so that each word within the corpus becomes an individual token:
import nltk
from nltk.tokenize import RegexpTokenizer

f = open("text.txt")
raw_t = f.read()
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(raw_t.decode('utf-8'))
text = nltk.Text(tokens)
Then, I thought that I could create a global list, within which I would include all of the instances of John and Roger speaking.
I figured that I'd first check whether each word in the text corpus was upper case and in the list of acceptable names, and if it was, I'd examine every subsequent word until the next occurrence of a term that was both upper case and in the list of acceptable names. I'd then take all the words from the initial instance of a speaker's name through to one word before the next speaker's name, and add this series of tokens/words to my global list.
I've written:
k = 0
i = 0
j = 1
names = ["JOHN", "ROGER"]
global_list = []
for i in range(len(text)):
    if (text[i].isupper() and text[i] in names):
        for j in range(len(text)-i):
            if (text[i+j].isupper() and text[i+j] in names):
                global_list[k] = text[i:(j-1)]
                k += 1
            else: j += 1
    else: i += 1
Unfortunately, this doesn't work, and I get the following index error:
IndexError Traceback (most recent call last)
<ipython-input-49-97de0c68b674> in <module>()
6 for j in range(len(text)-i):
7 if (text[i+j].isupper() and text[i+j] in names):
----> 8 list_speeches[k] = text[i:(j-1)]
9 k+=1
10 else: j+=1
IndexError: list assignment index out of range
I feel like I'm screwing up something really basic here, but I'm not exactly sure why I'm getting this index error. Can anyone shed some light on this?
Break the text into paragraphs with re.split(r"\n\s*\n", text), then examine the first word of each paragraph to see who is speaking. And don't worry about the nltk: you haven't used it yet, and you don't need to.
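A minimal sketch of that paragraph-splitting approach (assuming the speeches are separated by blank lines, as the split pattern implies; the variable names are just illustrative):

import re

with open("text.txt") as f:
    raw_t = f.read()

keep = {"JOHN", "ROGER"}
chunks = []
for para in re.split(r"\n\s*\n", raw_t.strip()):
    speaker = para.split(":", 1)[0].strip()   # the first word of the paragraph
    if speaker in keep:
        chunks.append((speaker, para))

# e.g. count how often a stage direction occurs in each retained chunk
for speaker, para in chunks:
    print(speaker, para.count("(EVERYONE GRUMBLES)"))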
Ok, figured this out after a bit of digging around. The initial loop mentioned in the question had a whole bunch of extraneous content, so I simplified it to:
names = ["JOHN", "ROGER"]
global_list = []
i = 0
for i in range(len(text)):
    if (text[i].isupper()) and (text[i] in names):
        j = 0
        while (text[i+j].islower()) and (text[i+j] not in names):
            j += 1
        global_list.append(text[i:(j-1)])
This generated a list, although, problematically, each item in this list was made up of words starting from the name through to the end of the document. Because each item began at the appropriate name while ending at the last word of the text corpus, it was easy to derive the length of each segment by subtracting the length of the following segment from it:
x = 1
new_list = range(len(global_list)-1)
for x in range(len(global_list)):
    if x == len(global_list):
        new_list[x-1] = global_list[x]
    else:
        new_list[x-1] = global_list[x][:(len(global_list[x])-len(global_list[x+1]))]
(x was set to 1 because the original code gave me the first speaker's content twice).
This wasn't in the least bit pretty, but it ended up working. If anyone's got a prettier way of doing it — and I'm sure it exists, since I think I've messed up the initial loop — I'd love to see it.
