I have a Spanish novel in a plain text file, and I want to make a Python script that puts a translation in brackets after difficult words. I have a list of the words (with translations) I want to do this with in a separate text file, which I have tried to format correctly.
I've forgotten almost everything I knew about Python, which was very little to begin with, so I'm struggling.
This is a script someone helped me with:
bookin = (open("C:\Users\King Kong\Documents\_div_tekstfiler_\coc_es.txt")).read()
subin = open("C:\Users\King Kong\Documents\_div_tekstfiler_\cocdict.txt")
for line in subin.readlines():
    ogword, meaning = line.split()
    subword = ogword + " (" + meaning + ")"
    bookin.replace(ogword, subword)
    ogword = ogword.capitalize()
    subword = ogword + " (" + meaning + ")"
    bookin.replace(ogword, subword)
subin.close()
bookout = open("fileout.txt", "w")
bookout.write(bookin)
bookout.close()
When I ran this, I got this error message:
Traceback (most recent call last):
  File "C:\Python27\translscript_secver.py", line 4, in <module>
    ogword, meaning = line.split()
ValueError: too many values to unpack
The novel is pretty big, and the dictionary I've made consists of about ten thousand key-value pairs.
Does this mean there's something wrong with the dictionary? Or that it's too big?
I've been researching this a lot, but I can't seem to make sense of it. Any advice would be appreciated.
line.split() in ogword, meaning = line.split() returns a list, and in this case it is returning more than 2 items for some line. Write your code in a way that can handle more than two values. For instance, assign line.split() to a list, assert that the list has at least two items, and unpack only the first two:
mylist = line.split()
assert len(mylist) >= 2
ogword, meaning = mylist[:2]
line.split() returns a list of the words (space-separated tokens) in line. The error you get suggests that somewhere your dictionary contains more than just a pair. You may add a trace message to locate the offending line (see below).
If your dictionary contains richer definitions than a single synonym, you may use the following lines, which put the first word in ogword and the following ones in meaning:
words = line.split()
ogword, meaning = words[0], " ".join(words[1:])
If your dictionary syntax is more complex (a compound ogword), you have to rely on an explicit separator. You can still use split to divide your lines: line.split("=") will split a line on "=" characters.
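For example, a minimal sketch (reformatting your dictionary file to use "=" is an assumption here, not how your file currently looks):
# assumes dictionary lines like: echar de menos = to miss
ogword, meaning = line.split("=")
ogword, meaning = ogword.strip(), meaning.strip()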
Edit: to ignore and display badly formatted lines, replace ogword, meaning = line.split() with
try:
    ogword, meaning = line.split()
except ValueError:
    print "wrongly formatted line:", line
    continue
split()
returns a single list, i.e. one item, and you are trying to assign this one thing to two variables.
It will work if the number of items in the list is equal to the number of variables on the left-hand side of the assignment statement, i.e. the list is unpacked and the individual parts are assigned to the variables on the left-hand side.
In this case, as pointed out by @Josvic Zammit, the problem occurs when there are more than 2 items in the list and it cannot be properly "unpacked" and assigned.
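For instance, at the interactive prompt (with made-up dictionary lines):
>>> a, b = "uno one".split()
>>> a, b = "echar de menos".split()
Traceback (most recent call last):
  ...
ValueError: too many values to unpack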
The goal is to a) print a list of unique words from a text file and b) find the longest word.
I cannot use imports in this challenge.
The file handling and main functionality are what I want; however, the list needs to be cleaned. As you can see from the output, words are getting joined with punctuation, and therefore maxLength is obviously incorrect.
with open("doc.txt") as reader, open("unique.txt", "w") as writer:
unwanted = "[],."
unique = set(reader.read().split())
unique = list(unique)
unique.sort(key=len)
regex = [elem.strip(unwanted).split() for elem in unique]
writer.write(str(regex))
reader.close()
maxLength = len(max(regex,key=len ))
print(maxLength)
res = [word for word in regex if len(word) == maxLength]
print(res)
===========
Sample:
pioneered the integrated placement year concept over 50 years ago [7][8][9] with more than 70 per cent of students taking a placement year, the highest percentage in the UK.[10]
Here's a solution that uses str.translate() to replace all the bad characters (including newlines) with spaces before we ever do the split(). (Normally we'd use a regex with re.sub(), but you're not allowed imports.) This makes the cleaning a one-liner, which is really neat:
bad = "[],.\n"
bad_transtable = str.maketrans(bad, ' ' * len(bad))
# We can directly read and clean the entire output, without a reader object:
cleaned_input = open('doc.txt').read().translate(bad_transtable)
#with open("doc.txt") as reader:
# cleaned_input = reader.read().translate(bad_transtable)
# Get list of unique words, in decreasing length
unique_words = sorted(set(cleaned_input.split()), key=lambda w: -len(w))
with open("unique.txt", "w") as writer:
for word in unique_words:
writer.write(f'{word}\n')
max_length = len(unique_words[0])
print ([word for word in unique_words if len(word) == max_length])
Notes:
since the input is already 100% cleaned and split, there's no need to append to a list/insert to a set as we go and then make another cleaning pass later. We can just create unique_words directly (using set() to keep only the uniques). And while we're at it, we might as well use sorted(..., key=lambda w: -len(w)) to sort it in decreasing length. We only need to sort once, and there is no iterative append to lists.
hence we guarantee that max_length = len(unique_words[0])
this approach is also going to be more performant than nested loops (for line in <lines>: for word in line.split(): ... with an iterative append() to a wordlist)
no need to do explicit writer/reader .open()/.close(); that's what the with statement does for you. (It's also more elegant for handling IO when exceptions happen.)
you could also merge the printing of the max_length words into the writer loop, but it's cleaner code to keep them separate.
note we use f-string formatting f'{word}\n' to add the newline back when we write() an output line
in Python we use lower_case_with_underscores for variable names, hence max_length, not maxLength. See PEP 8.
in fact here we don't strictly need a with-statement for the reader, if all we're going to do is slurp its entire contents in one go with open('doc.txt').read(). (That's not scalable for huge files; you'd have to read in chunks or n lines at a time.)
str.maketrans() is a builtin, but if your teacher objects to the str. reference, you can also call it on a bound string, e.g. ' '.maketrans(...)
str.maketrans() is really a throwback to the days when we only had 95 printable ASCII characters, not Unicode. It still works on Unicode, but building and using huge translation dicts is annoying and memory-hungry; on Unicode a regex is easier, since you can define entire character classes.
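For comparison only, since imports are banned in this challenge: the re.sub() equivalent of that translate() cleaning would be a single call (a sketch, not something you can submit):
import re
cleaned_input = re.sub(r'[\[\],.\n]', ' ', open('doc.txt').read())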
Alternative solution if you don't yet know str.translate()
dirty_input = open('doc.txt').read()
cleaned_input = dirty_input
# If you can't use either 're.sub()' or 'str.translate()', have to manually
# str.replace() each bad char one-by-one (or else use a method like str.isalpha())
for bad_char in bad:
    cleaned_input = cleaned_input.replace(bad_char, ' ')
And if you wanted to be ridiculously minimalist, you could write the entire output file in one line with a list comprehension. Don't do this; it would be terrible for debugging, e.g. if you couldn't open/write/overwrite the output file, or got an IOError, or unique_words wasn't a list, etc.:
open("unique.txt", "w").writelines([f'{word}\n' for word in unique_words])
Here is another solution without using any functions.
a = open('doc.txt').read()  # assumption: the input text has been read into 'a'
bad = '`~##$%^&*()-_=+[]{}\\|;\':\".>?<,/?'
clean = ' '
for i in a:
    if i not in bad:
        clean += i
    else:
        clean += ' '
cleans = [i for i in clean.split(' ') if len(i)]
clean_uniq = list(set(cleans))
clean_uniq.sort(key=len)
print(clean_uniq)
print(len(clean_uniq[-1]))
Here is a solution. The trick is to use the Python str method .isalpha() to filter out non-alphabetic characters.
with open("unique.txt", "w") as writer:
with open("doc.txt") as reader:
cleaned_words = []
for line in reader.readlines():
for word in line.split():
cleaned_word = ''.join([c for c in word if c.isalpha()])
if len(cleaned_word):
cleaned_words.append(cleaned_word)
# print unique words
unique_words = set(cleaned_words)
print(unique_words)
# write words to file? depends what you need here
for word in unique_words:
writer.write(str(word))
writer.write('\n')
# print length of longest
print(len(sorted(unique_words, key=len, reverse=True)[0]))
I'm trying to use list indices as arguments for a function that performs regex searches and substitutions over some text files. The different search patterns have been assigned to variables, and I've put the variables in a list that I want to feed to the function as it loops through a given text.
When I call the function using a list index as an argument, nothing happens (the program runs, but no substitutions are made in my text files). However, I know the rest of the code is working, because if I call the function with any of the search variables individually it behaves as expected.
When I give the print function the same list index as I'm trying to use to call my function, it prints exactly what I'm trying to give as my function argument, so I'm stumped!
search1 = re.compile(r'pattern1')
search2 = re.compile(r'pattern2')
search3 = re.compile(r'pattern3')
searches = ['search1', 'search2', 'search2']
i = 0
for …
    …
    def fun(find)
        …
    fun(searches[i])
    if i <= 2:
        i += 1
…
As mentioned, if I use fun(search1) the script edits my text files as wished. Likewise, if I add the line print(searches[i]), it prints search1 (etc.), which is what I'm trying to give as an argument to fun.
Being new to Python and programming, I have a limited investigative skill set, but after poking around as best I could and subsequently running print(searches.index(search1)) and getting a "pattern1 is not in list" error, my leading (and only) theory is that I'm giving my function the actual regex expression rather than the variable it's stored in???
Much thanks for any forthcoming help!
Try changing your searches list to be [search1, search2, search3] instead of ['search1', 'search2', 'search2'] (in which you just use strings, not regex objects).
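A minimal sketch of that fix (fun stands for your existing substitution function from the question):
import re

search1 = re.compile(r'pattern1')
search2 = re.compile(r'pattern2')
search3 = re.compile(r'pattern3')

searches = [search1, search2, search3]  # the compiled pattern objects themselves, not strings
for s in searches:                      # no manual index bookkeeping needed
    fun(s)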
Thanks to all for the help. eyl327's comment that I should use a list or dictionary to store my regular expressions pointed me in the right direction.
However, because I was using regex in my search patterns, I couldn't get it to work until I also created a list of compiled expressions (discovered via this thread on stored regex strings).
Very appreciative of juanpa.arrivillaga's point that I should have provided an MRE (please forgive me; with a highly limited skill set, this in itself can be hard to do). I'll just give an excerpt of a slightly amended version of my actual code demonstrating the answer (once again, please forgive its long-windedness; I'm not presently able to do anything more elegant):
…
# put regex search patterns in a list
rawExps = ['search pattern 1', 'search pattern 2', 'search pattern 3']
# create a new list of compiled search patterns
compiledExps = [regex.compile(expression, regex.V1) for expression in rawExps]

i = 0
storID = 0
newText = ""
for file in filepathList:
    for expression in compiledExps:
        with open(file, 'r') as text:
            thisText = text.read()
        lines = thisText.splitlines()
        setStorID = regex.search(compiledExps[i], thisText)
        if setStorID is not None:
            storID = int(setStorID.group())
        for line in lines:
            def idSub(find):
                global storID
                global newText
                match = regex.search(find, line)
                if match is not None:
                    newLine = regex.sub(find, str(storID), line) + "\n"
                    newText = newText + newLine
                    storID = plus1(int(storID), 1)  # plus1() is my own helper (not shown in this excerpt)
                else:
                    newLine = line + "\n"
                    newText = newText + newLine
            # the list index number can be used as an argument in the function call
            idSub(compiledExps[i])
        if i <= 2:
            i += 1
        write()  # my own output helper (not shown in this excerpt)
        newText = ""
    i = 0
Here is my code
import re

with open('newfiles.txt') as f:
    k = f.read()

p = re.compile(r'[\w\:\-\.\,\']+|[^[\w\:\-\.\'\,]\s]')
originaltext = p.findall(k)

uniquelist = []
for word in originaltext:
    if word not in uniquelist:
        uniquelist.append(word)

indexes = ' '.join(str(uniquelist.index(word)+1) for word in originaltext)
n = p.findall(indexes)

file = open("newfiletwo.txt", "w")
file.write(' '.join(str(e) for e in n))
file.close()

file = open("newfilethree.txt", "w")
file.write(' '.join(uniquelist))
file.close()

with open('newfiletwo.txt') as f:
    indexess = f.read()
with open('newfilethree.txt') as f:
    differentwords = f.read()

differentwords = p.findall(differentwords)
indexess = [uniquelist.index(word) for word in originaltext]

for word in originaltext:
    if not word in differentwords:
        differentwords.append(word)
    i = differentwords.index(word)
    indexess.append(i)

s = ""  # the reconstructed sentence
for i in indexess:
    s = s + differentwords[i] + " "
print(s)
The program basically takes an external text file and computes the position index of each word (if a word repeats, the position of its first occurrence is used), then saves the positions as an external file. While doing this, I have split up the text file, including splitting off punctuation, and saved the different words and punctuation that occur in the file as an external file too. Now for the hard part: using both of these external files - the indexes and the different separated words - I am trying to recreate the original text file, including the punctuation. But the error shown in the title occurs:
Traceback (most recent call last):
  File "E:\Python\Index.py", line 31, in <module>
    s = s + differentwords[i] + " "
IndexError: list index out of range
Not trying to sound rude, but I am something of a beginner; please try to change as little as possible, and in a simple way, as I have created this myself. You guys may know a far shorter way to do this, but this is the level of simplicity I can handle, as the length of the code proves. I have tried shortening the original text file, but that proved no use. Does anyone know why the error occurs and how to fix it? I am not looking for efficiency right now (maybe after another couple of months of learning); the simplest answer (I don't mind long) will be the best. Sorry if I have repeated myself a lot :-)
'newfiles' - A bunch of sentences with punctuation
UPDATE
The code no longer shows the error, but it prints the original sentence twice. The error went away after the removal of the +1 on line 23. Does anyone know why the output repeats twice, though?
The problem is how you decide what is a word and what is not. For instance, is a comma part of a word? In your case it is not treated as such, while it is also not a separator, so you end up with the comma, or the dot, and so on as separate words. I have no access to your input, so I can just provide a sample:
p = re.compile(r'[\w\:\-\.\,]+|[^[\w\:\-\.\,]\s]')
There is one catch in this case: 'Word', 'word', 'Word.' and 'word,' are all separate words, since the dot and the comma are parts of the word. You can't eat your cake and have it. To fix that... you would need to store information about whether there was whitespace before the separation.
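A quick sketch (with a made-up sentence) of how punctuation stays attached under that pattern:
import re
p = re.compile(r'[\w\:\-\.\,]+|[^[\w\:\-\.\,]\s]')
print(p.findall("word, word word."))  # ['word,', 'word', 'word.']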
UPDATE:
Oh yes, the double output. The files that are stored in the middle are OK, so something goes wrong after that. Look at these two lines:
i = differentwords.index(word)
indexess.append(i)
They need to be inside the preceding if statement.
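In other words, the end of that loop should read (only the indentation changes):
for word in originaltext:
    if not word in differentwords:
        differentwords.append(word)
        i = differentwords.index(word)
        indexess.append(i)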
So I'm making a program that reads a text file, and I need to separate all the info into its own variables. The file looks like this:
>1EK9:A.41,52; B.61,74; C.247,257; D.279,289
ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD
YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ
DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT
QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN
YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE
QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN
KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS
SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT
TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV
STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN
The code after the > is a title; the next bits that look like "A.41,52" are numbered positions in the sequence that I need to save for later use; and everything after that is an amino acid sequence. I know how to deal with the amino acid sequence; I just need to know how to separate the important numbers in the first line.
In the past, when I just had a title and a sequence, I did something like this:
for line in nucfile:
    if line.startswith(">"):
        headerline = line.strip("\n")[1:]
    else:
        nucseq += line.strip("\n")
Am I on the right track here? This is my first time, any advice would be fantastic and thanks for reading :)
I suggest you use the split() method.
split() allows you to specify the separator of your choice. Provided the sequence title (here 1EK9) is always separated from the rest of the header by a colon, you could first pass ":" as your separator. You could then split the remainder of the header to recover the numbered positions (e.g. A.41,52), using ";" as the separator.
I hope this helps!
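A minimal sketch of that two-stage split (the header string is taken from the question):
header = "1EK9:A.41,52; B.61,74; C.247,257; D.279,289"
title, rest = header.split(":")
positions = [p.strip() for p in rest.split(";")]
print(title)      # '1EK9'
print(positions)  # ['A.41,52', 'B.61,74', 'C.247,257', 'D.279,289']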
I think what you are trying to do is extract certain parts of the sequence based on the identifiers given to you on the first line (the line starting with >).
This line contains your title, then the sequence names and the data ranges you need to extract.
Try this:
sequence_pairs = {}
with open('somefile.txt') as f:
    header_line = next(f)
    sequence = f.read().replace('\n', '')  # join the sequence lines into one string

title, components = header_line.lstrip('>').split(':')
pairs = components.split(';')
for pair in pairs:
    pair = pair.strip()
    start, end = pair[2:].split(',')
    sequence_pairs[pair[:1]] = sequence[int(start):int(end)+1]

for sequence, data in sequence_pairs.items():
    print('{} - {}'.format(sequence, data))
While the other answers may be very good for tackling the assumed problem in its entirety, the OP has asked for pointers or an example of the typical split-unsplit transform, which is often so successful; I hereby provide some ideas and working code to show this (based on the example in the question).
So let us focus on the else branch below:
from __future__ import print_function

nuc_seq = []  # a list
title_token = '>'
with open('some_file_of_a_kind.txt', 'rt') as f:
    for line in f.readlines():
        s_line = line.strip()  # this strips whitespace
        if line.startswith(title_token):
            headerline = line.strip("\n")[1:]
        else:
            nuc_seq.append(s_line)  # build the list

# now nuc_seq is a list of strings like:
# ['ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD',
#  'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ',
#  ...
# ]
demo_nuc_str = ''.join(nuc_seq)
# now:
# demo_nuc_str == 'ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGADYTYSNGYR ...'
That is a fast and widely deployed paradigm in Python programming (and in programming with powerful data types in general).
If the split-unsplit (a.k.a. join) method is still unclear, just ask, or try searching SO for excellent answers to related questions.
Also note that there is no need for line.strip('\n'), as '\n' is considered whitespace just like ' ' (a string with a space only) or a tabulator '\t'. Sample:
>>> a = ' \t \n '
>>> '+'.join(a.split())
''
So the "joining character" only appears, if there are at least two element sto join and in this case, strip removed all whits space and left us with the empty string.
Upate:
As requested a further analysis of the "coordinate part" in the line called headline of the question:
>1EK9:A.41,52; B.61,74; C.247,257; D.279,289
If you want to retrieve the:
A.41,52; B.61,74; C.247,257; D.279,289
and assume you have the complete line (as above) in the headerline string:
title, coordinates = headerline.split(':')
# so now title == '1EK9' and
# coordinates == 'A.41,52; B.61,74; C.247,257; D.279,289'
Now split on the semicolons and trim the entries:
het_seq = [z.strip() for z in coordinates.split(';')]
# now het_seq == ['A.41,52', 'B.61,74', 'C.247,257', 'D.279,289']
If 'A', 'B', 'C' and 'D' are well-known dimensions, then you can "lose" the ordering info from the input file (as you can always reinforce what you already know ;-) and map the coordinates as key: (ordered coordinate pair):
>>> coord_map = dict(
...     (a, tuple(int(k) for k in bc.split(',')))
...     for a, bc in (abc.split('.') for abc in het_seq))
>>> coord_map
{'A': (41, 52), 'C': (247, 257), 'B': (61, 74), 'D': (279, 289)}
In the context of a micro program:
#!/usr/bin/env python
from __future__ import print_function

het_seq = ['A.41,52', 'B.61,74', 'C.247,257', 'D.279,289']

coord_map = dict(
    (a, tuple(int(k) for k in bc.split(',')))
    for a, bc in (abc.split('.') for abc in het_seq))

print(coord_map)
yields:
{'A': (41, 52), 'C': (247, 257), 'B': (61, 74), 'D': (279, 289)}
One might write this as an explicit nested for loop, but (it being a late European evening) the trick is to read it from the right:
1. for all elements of het_seq
2. split on the dot and store the left part in a and the right part in bc
3. further split bc into a sequence of k's, convert each to an integer, and put them into a tuple (an ordered pair of integer coordinates)
4. arriving at the left, build a tuple of the a (the dimension, like 'A') and the coordinate tuple from step 3
5. in the end, the dict() call constructs a dictionary from these (key, value) pairs, i.e. dict([(key_1, value_1), (key_2, value_2), ...]) gives {key_1: value_1, ...}
So all coordinates are integers, stored as ordered pairs in tuples.
I'd prefer tuples here, although split() generates lists, because:
you will keep those two coordinates, not extend or append to that pair
in Python, mapping and remapping are performed often, and a hashable (that is, immutable) type is ready to become a key in a dict.
One last variant (with no knotted comprehensions):
coord_map = {}
for abc in het_seq:
    a, bc = abc.split('.')
    coord_map[a] = tuple(int(k) for k in bc.split(','))
print(coord_map)
The first four lines produce the same as the above mildly obnoxious "one-liner" (which had already been written on three lines kept together within parentheses).
HTH.
So I'm assuming you are trying to process a FASTA-like file, and the way I would do it is to first get the header and separate the pieces with a regex. Following that, you can store the 'A.41,52', 'B.61,74', ... parts in a list for easy access. The code is as follows:
import re

def processHeader(line):
    positions = re.search(r':(.*)', line).group(1)
    positions = positions.split('; ')
    return positions

dnaSeq = ''
positions = []
with open('myFasta', 'r') as infile:
    for line in infile:
        if '>' in line:
            positions = processHeader(line)
        else:
            dnaSeq += line.strip()
I am not sure I completely understand the goal (and I think this post is more suitable for a comment, but I do not have enough privileges), but I think the key to your solution is using .split(). You can then join the elements of the resulting list just by using +, similar to this:
>>> result = line.split(' ')
>>> result
['1EK9:A.41,52;', 'B.61,74;', 'C.247,257;', 'D.279,289', 'ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD', 'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ', 'DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT', 'QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN',
'YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE', 'QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN', 'KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS', 'SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT', 'TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV', 'STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN']
>>> result[3]+result[4]
'D.279,289ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD'
>>>
etc. You can also use the usual following syntax to extract the elements of the list that you need:
>>> result[5:]
['YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ', 'DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT', 'QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN', 'YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE', 'QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN', 'KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS', 'SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT', 'TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV', 'STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN']
and join them together:
>>> reduce(lambda x, y: x+y, result[5:])
'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQDVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTTQRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGNYYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLAREQIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQNKVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRSSFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDATTTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPVSTNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN'
remember that + on lists produces a list.
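For example:
>>> ['a', 'b'] + ['c']
['a', 'b', 'c']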
By the way, I would not remove '\n' to start with, as you may be able to use it to extract the first line, similar to the way spaces were used above to extract "words".
UPDATE (starting from result):
# getting the A indexes
letter_seq = result[5:]
ind = result[:4]
Aind = ind[0].split('.')[1].replace(';', '')

# getting one long letter seq
long_letter_seq = reduce(lambda x, y: x + y, letter_seq)  # on Python 3: from functools import reduce

# extracting the final seq from long_letter_seq using Aind
output = long_letter_seq[int(Aind.split(',')[0]):int(Aind.split(',')[1])]
The last line is just a union of several operations that were also used earlier.
The same goes for B, C, D, etc., so there is a lot of manual work and calculation...
BE CAREFUL with the indexes of A: numbering in Python starts from 0, which may not be the case in your numbering system.
The more elegant solution would be using re (https://docs.python.org/2/library/re.html) to find a pattern using a mask, but this requires very well-defined rules for how to look up the needed sequence.
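A sketch of that re approach (this particular pattern is my assumption, built from the sample header, not a tested rule for all your data):
import re
header = '>1EK9:A.41,52; B.61,74; C.247,257; D.279,289'
print(re.findall(r'([A-Z])\.(\d+),(\d+)', header))
# [('A', '41', '52'), ('B', '61', '74'), ('C', '247', '257'), ('D', '279', '289')]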
UPDATE 2: it is also not clear to me what the role of the spaces is; so far I have removed them, but they may matter when counting the letters in the original string.
I hope this request is legit.
I'm taking a programming course in Python for engineers, so I'm kind of new at this business.
Anyway, in my homework I was asked to write a function which receives two strings and checks if one is a permutation/anagram of the other (which means they both have exactly the same letters and the same number of appearances of each letter).
I've found some great codes here while searching, but I still don't get what's wrong with my code (and it's important for me to know, for my studying process).
We got a tests file which is supposed to check our functions, and it gave me this error:
Traceback (most recent call last):
  File "C:\Users\Or\Desktop\תכנות\4\hw4\123456789_a4.py", line 110, in <module>
    test_hw4()
  File "C:\Users\Or\Desktop\תכנות\4\hw4\123456789_a4.py", line 97, in test_hw4
    test(is_anagram('Tom Marvolo Riddle','I Am Lord Voldemort'), True)
  File "C:\Users\Or\Desktop\תכנות\4\hw4\123456789_a4.py", line 31, in is_anagram
    s2_list.sort()
NameError: global name 's2_list' is not defined
This is my code:
def is_anagram(string1, string2):
    string1 = string1.lower()  # turns capital letters to small ones
    string2 = string2.lower()
    string1 = string1.replace(" ", "")  # turns the words inside the string to one word
    string2 = string2.replace(" ", "")
    if len(string1) != len(string2):
        return False
    s1_list = [string1[i] for i in range(len(string1))]  # creates a list of string 1 letters
    a2_list = [string1[k] for k in range(len(string1))]
    s1_list.sort()  # sorting the list
    s2_list.sort()
    booli = False
    k = 0
    for i in s1_list:  # for loop which compares each letter in the two lists
        if s1_list[k] == s2_list[k]:
            booli = True
            k = k + 1
        else:
            booli = False
            break
    return booli
Anyone know how to fix it?
Thanks!
It looks like you have a typo with a2_list. That section should read:
s1_list = [string1[i] for i in range(len(string1))] #creates a list of string 1 letters
s2_list = [string2[k] for k in range(len(string2))]
s1_list.sort() #sorting the list
s2_list.sort()
FWIW, here is an interactive prompt example of how to tell if two strings are anagrams of one another:
>>> string1 = 'Logarithm'
>>> string2 = 'algorithm'
>>> sorted(string1.lower()) == sorted(string2.lower()) # see if they are anagrams
True
If you make a listify_string function and use that to set your s1_list and s2_list, it might be easier to see that there are multiple things that look wrong in your code, unless you really intended both lists to be populated from the same string (string1).
def listify(string):
    return [c for c in string]
Then you can simply do s1_list = listify(string1) and s2_list = ... to set the values.
I would probably turn at least the 'check if the two lists are the same' part into a function, so I could use an early return to indicate falseness (instead of starting with booli as True, setting it on each iteration through the loop, and breaking out of the loop if False).
If you look at the join method of Python strings, you might find inspiration for another way to check if s1_list and s2_list are the same.
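For instance, once both lists are sorted, the whole comparison loop collapses to one line (a sketch using your variable names):
''.join(s1_list) == ''.join(s2_list)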
Try this one-liner instead:
sorted(s1.lower().replace(' ', '')) == sorted(s2.lower().replace(' ', ''))
Python strings are sequences, so sorted() accepts them directly and returns a sorted list of their characters; we just need to take care of uppercase letters and whitespace first. The Python equals operator then takes care of the actual comparison.
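For instance, wrapped up as a function and run against the test case from your traceback:
def is_anagram(s1, s2):
    return sorted(s1.lower().replace(' ', '')) == sorted(s2.lower().replace(' ', ''))

print(is_anagram('Tom Marvolo Riddle', 'I Am Lord Voldemort'))  # True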