Python: startswith() not working as intended - python

I am writing a program that loads an external txt file of movies. This part works fine. I then have a function that searches a list of movies generated from the file. The function should print out all movies that start with the search string.
def startsWithSearch(movieList):
    searchString = input("Enter search string: ")
    for movie in movieList:
        if(movie.startswith(searchString) == True):
            print(movie)
However, no movies are printed when I enter a search string, even though there are movies in the list that start with that string.

Given correct input data, your function does work as expected:
def startsWithSearch(movieList):
    searchString = "test4"
    for movie in movieList:
        if(movie.startswith(searchString)):
            print(movie)
startsWithSearch(["test1","testnomatch","test4","test4should","not_test4"])
The output is:
test4
test4should
So the function is correct; the problem must be in your input data.
I know you want a starts-with solution, as your function name says, but when searching for movies it is often more convenient to match anywhere inside the title, so that searching for "mentalist" finds "the mentalist". For that you could just use:
if searchString in movie:
    print(movie)
And, as suggested by Anna, to ignore case:
if searchString.lower() in movie.lower():
Or, fancier, with regular expressions (this needs import re at the top of the file):
if re.match(".*" + searchString,movie,re.I):
Or, if you really only want to match at the beginning of the name:
if re.match(searchString,movie,re.I):
That should be enough alternatives :)

I think the string returned by input() may have a newline or other whitespace at the end. Try adding searchString = searchString.strip() after collecting the input.
You might also want to convert both strings to lower case before comparing.
Also, the line if(movie.startswith(searchString) == True): can be written simply as if movie.startswith(searchString):

I had the same problem when reading from files: startswith was failing because of a newline character read from the file. Use rstrip() to remove the newline character from both strings before you call startswith.
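The suggestions in these answers (strip stray whitespace, ignore case) can be combined in one helper. The following is a minimal sketch, not the asker's code: the function name starts_with_search and the sample titles are made up for illustration, and it returns the matches instead of printing them.

```python
def starts_with_search(movie_list, search_string):
    """Return the movies whose title starts with search_string.

    Both sides are stripped of surrounding whitespace and case-folded
    first, so a stray newline or a difference in case cannot hide a match.
    """
    prefix = search_string.strip().casefold()
    return [
        movie.strip()
        for movie in movie_list
        if movie.strip().casefold().startswith(prefix)
    ]


# Hypothetical data: lines as they might come from readlines(), newline included.
movies = ["The Matrix\n", "the mentalist\n", "  Them!\n", "Matrix Reloaded\n"]
matches = starts_with_search(movies, "THE\r\n")
print(matches)
```

Stripping the search string also removes intentional leading or trailing spaces, so a caller who wants to search for "The " with the space would need a narrower cleanup such as rstrip("\r\n").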

Related

Sanitize input to include whitespace in regex?

I'm using python 3.5 currently.
I am trying to make a tool that takes input, and does a regex search for said "Playername" and returns the matching result. I run into an interesting issue because this is videogame related, and some users have special characters in their names (Clan Tags).
To try to sanitize input, I am using re.escape, but I am not getting the behavior I expected out of it.
For example, I am allowing users to input partial matches and using regex to find a player. So if I input Mall, it should find Mallachar. Here is my current matching setup:
regex_match = r".*" + player_name + r".*"
if re.match(regex_match, str(name_list), re.IGNORECASE):
    player_list.append(players)
Because this is a system where user names are not unique, and a player can change their name, I am searching against a "list" of users.
Anyway, the issue I am running into is when people have spaces or clan tags. For example, if the clan ~DOG~ joins the server, and I have people with the names ~DOG~ Master and ~DOG~ Runner, then when I feed in the string ~DOG~ Run, I get all matches for ~DOG~ .*.
My understanding is that re.escape should be escaping the space so it's a part of my search, so it should be trying to match this
.*~DOG~\sRun.*
But instead it seems to be running this, like it's ignoring everything after ~DOG~:
.*~DOG~.*
Am I misunderstanding how re.escape works?
You can use the in operator to check whether player_name occurs inside another string:
name_list = ['~DOG~ Master', '~DOG~ Runner']
player_name = '~DOG~ Run'
player_list = []
for name in name_list:
    if player_name in name:
        player_list.append(name)
print(player_list)
This prints:
['~DOG~ Runner']
Using in is probably the right way to solve this problem, but to address the regex question itself:
Adding a set of parentheses creates a capture group, so you can retrieve the matched text with groups(). The script so_post.py:
import re
regex_match = r".*(~DOG~ Run).*"
name = "~DOG~ Run"
match = re.match(regex_match, name, re.IGNORECASE)
print(match.groups())
Running python so_post.py prints:
('~DOG~ Run',)
Using named groups lets you use a specific name instead of just a general tuple of matches.
regex_match = r".*(?P<user_clan>~DOG~ Run).*"
name = "~DOG~ Run"
match = re.match(regex_match, name, re.IGNORECASE)
print(match.group("user_clan"))
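For the regex route, one likely cause of the symptom in the question is that the pattern is matched against str(name_list), the text of the whole list, so it succeeds for every player as soon as any one name matches. Below is a minimal sketch, assuming the goal is a case-insensitive partial match against each name separately. find_players and the sample names are hypothetical. re.escape makes the user's text literal, and re.search removes the need for the surrounding .*.

```python
import re


def find_players(name_list, player_name):
    """Return the names containing player_name, ignoring case."""
    # re.escape turns characters such as '.', '[' or a space into literals,
    # so clan tags cannot be misread as regex syntax.
    pattern = re.compile(re.escape(player_name), re.IGNORECASE)
    # Test each name on its own rather than the string form of the list.
    return [name for name in name_list if pattern.search(name)]


names = ["~DOG~ Master", "~DOG~ Runner", "Mallachar", "[CAT] Run.ner"]
print(find_players(names, "~dog~ run"))
print(find_players(names, "Mall"))
print(find_players(names, "[cat] run."))
```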

Using Python regular expressions substitution

I need to write a program that will deidentify names in a medical record. How can I substitute names that COULD include prefixes, suffixes and first initials or first names, but don't HAVE to have all of the above every time. For example, I can get the program to deidentify Dr. S Smith, but not Dr. Smith.
Thank you!
Here's the program I have so far:
# This program removes names and email addresses occurring in a given input file and saves it in an output file.
import re
def deidentify():
    infilename = input("Give the input file name: ")
    outfilename = input("Give the output file name: ")
    infile = open(infilename,"r")
    text = infile.read()
    infile.close()
    # replace names (raw string so that \. reaches the regex engine unchanged)
    nameRE = r"(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+"
    deidentified_text = re.sub(nameRE,"**name**",text)
    outfile = open(outfilename,"w")
    print(deidentified_text, file=outfile)
    outfile.close()
deidentify()
The [A-Z](\.|[a-z]+) term in
"(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+"
matches a first name or initial. You want this part to be optional, so wrap it, together with its leading space, in a group and follow the group with ?:
nameRe = r"(Ms\.|Mr\.|Dr\.|Prof\.)( [A-Z](\.|[a-z]+))?( [A-Z][a-z]+)"
re.sub(nameRe, r"\1\4", text)
The ? after the second group in nameRe says "this part is optional, but still count it as a capture group even if it matches nothing."
The r"\1\4" tells re.sub to build the replacement from the first and fourth capture groups (a capture group starts every time you see an opening parenthesis), that is, the title and the last name.
Try the following:
((?:Ms\.|Mr\.|Dr\.|Prof\.|Mrs\.) (?:[A-Z](?:\.|(?:[a-z])+) )?[A-Z][a-z]+)
However, I'd recommend parsing this file into a Python data structure (dictionaries, objects, whatever), and then you can simply omit the names when you print results, not to mention all the other handy things you can do once your data are in a Python program (e.g., has this patient been with us for more than five years? what percentage of patients have a credit card number as payment information?).
Turns out the answer was that the expression needed to account for spaces using \s. Once this was entered, the program worked.
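To see the optional first-name group in action, here is a small sketch built on the non-capturing pattern suggested above. The sample sentence is invented and deidentify_text is a hypothetical helper. The pattern is only as good as its assumptions: titles from a fixed list, capitalized names, and an initial written with a period.

```python
import re

NAME_RE = re.compile(
    r"(?:Ms\.|Mr\.|Dr\.|Prof\.|Mrs\.) "  # title followed by a space
    r"(?:[A-Z](?:\.|[a-z]+) )?"          # optional initial or first name
    r"[A-Z][a-z]+"                       # last name
)


def deidentify_text(text):
    """Replace every matched name with the placeholder **name**."""
    return NAME_RE.sub("**name**", text)


sample = "Dr. S. Smith saw Dr. Smith and Prof. Jane Doe today."
print(deidentify_text(sample))
```

Because the optional group accepts any capitalized word, a capitalized word that directly follows a last name (as in "Dr. Smith Said") would be swallowed as well, so real records would need a more careful approach.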

Python not reading from list correctly

Okay, here is my issue:
This program reads from a file and builds a list without using rstrip('\n'), which I did on purpose. It then prints the list, sorts it, prints it again, saves the sorted list to a text file, and lets you search the list for a value.
The problem I am having is this:
When I search for a name, no matter how I type it in, the program tells me that it's not in the list.
The code worked until I changed the way I was testing for the variable. Here is the search function:
def searchNames(nameList):
    another = 'y'
    while another.lower() == 'y':
        search = input("What name are you looking for? (Use 'Lastname, Firstname', including comma: ")
        if search in nameList:
            print("The name was found at index", nameList.index(search), "in the list.")
            another = input("Check another name? Y for yes, anything else for no: ")
        else:
            print("The name was not found in the list.")
            another = input("Check another name? Y for yes, anything else for no: ")
For the full code, http://pastebin.com/PMskBtzJ
For the content of the text file: http://pastebin.com/dAhmnXfZ
Ideas? I should note that I have tried adding ( + '\n') to the search variable.
You say you explicitly did not strip off the newlines.
So, your nameList is a list of strings like ['van Rossum, Guido\n', 'Python, Monty\n'].
But your search is the string returned by input, which will not have a newline. So it can't possibly match any of the strings in the list.
There are a few ways to fix this.
First, of course, you could strip the newlines in your list.
Alternatively, you could strip them on the fly during the search:
if search in (name.rstrip() for name in nameList):
Or you could even add them onto the search string:
if search+'\n' in nameList:
If you're doing lots of searches, I would do the stripping just once and keep a list of stripped names around.
As a side note, searching the list to find out if the name is in the list, and then searching it again to find the index, is a little silly. Just search it once:
try:
    i = nameList.index(search)
except ValueError:
    print("The name was not found in the list.")
else:
    print("The name was found at index", i, "in the list.")
another = input("Check another name? Y for yes, anything else for no: ")
The reason for this error is that every entry in your list ends with "\n", for example "john, smith\n". Your search function then uses the input, which does NOT include "\n".
You've not given us much to go on, but maybe using sys.stdin.readline() instead of input() would help? I don't believe 2.x input() is going to leave a newline on the end of your inputs, which would make the "in" operator never find a match. sys.stdin.readline() does leave the newline at the end.
Also 'string' in list_ is slow compared to 'string' in set_ - if you don't really need indices, you might use a set instead, particularly if your collection is large.
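Stripping the newlines once and searching the cleaned data can be sketched as follows. The names are the illustrative ones used above, and build_index is a hypothetical helper. A dictionary is used here so that membership tests are fast and the index is still available.

```python
def build_index(lines):
    """Map each stripped name to the index of its first occurrence."""
    index = {}
    for position, line in enumerate(lines):
        # setdefault keeps the first position if a name appears twice.
        index.setdefault(line.rstrip("\n"), position)
    return index


# Lines as readlines() would return them, newline included.
name_list = ["Python, Monty\n", "van Rossum, Guido\n"]
positions = build_index(name_list)

search = "van Rossum, Guido"   # what input() would return: no trailing newline
if search in positions:
    print("The name was found at index", positions[search], "in the list.")
else:
    print("The name was not found in the list.")
```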

Python: Regex a dictionary using user input wildcards

I would like to be able to search a dictionary in Python using user input wildcards.
I have found this:
import fnmatch
lst = ['this','is','just','a','test', 'thing']
filtered = fnmatch.filter(lst, 'th*')
This matches this and thing. Now if I try to read in a whole file and search through it:
with open('testfilefolder/wssnt10.txt') as f:
    file_contents = f.read().lower()
filtered = fnmatch.filter(file_contents, 'th*')
This doesn't match anything. The difference is that the file I am reading from is a text file (a Shakespeare play), so it contains spaces and is not a list. I can match a single letter: if I just use 't', I get a bunch of t's. So this tells me that I am matching single characters. However, I want to match whole words and, even more, to preserve the wildcard structure.
Since what I would like to happen is that a user enters in text (including what will be a wildcard) that I can substitute it in to the place that 'th*' is. The wild card would do what it should still. That leads to the question, can I just stick in a variable holding the search text in for 'th*'? After some investigation I am wondering if I am somehow supposed to translate the 'th*' for example and have found something such as:
regex = fnmatch.translate('th*')
print(regex)
which outputs th.*\Z(?ms)
Is this the right way to go about doing this? I don't know if it is needed.
What would be the best way to go about "passing in regex formulas"? Also, what do I have wrong in the second block of code, which does not operate on the incoming text the way the first block (correctly) does on the list?
If the problem is just that you "have spaces and it is not a list," why not make it into a list?
with open('testfilefolder/wssnt10.txt') as f:
    file_contents = f.read().lower().split() # split on any whitespace (spaces and newlines) to make a list of words
filtered = fnmatch.filter(file_contents, 'th*')
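On the question of passing in a user-supplied pattern: the pattern is an ordinary string, so a variable can stand in for 'th*'. Here is a minimal sketch, assuming whitespace-separated words. wildcard_words and the sample text are made up, and fnmatch.fnmatchcase is used so the result does not depend on the platform's filename case rules.

```python
import fnmatch


def wildcard_words(text, pattern):
    """Return the words in text that match a shell-style wildcard pattern."""
    words = text.lower().split()   # split on any whitespace, newlines included
    return [word for word in words if fnmatch.fnmatchcase(word, pattern.lower())]


sample_text = "This is just a test\nthing for the wildcard"
user_pattern = "th*"               # in a real program this could come from input()
print(wildcard_words(sample_text, user_pattern))
```

Punctuation stays attached to the words with this approach, so "thing," would also match 'th*'; stripping punctuation first is a separate step.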

python read output

Write a program that outputs the first number within a file specified by the user. It should behave like:
Enter a file name: l11-1.txt
The first number is 20.
You will need to use the file object method .read(1) to read 1 character at a time, and a string object method to check if it is a number. If there is no number, the expected behaviour is:
Enter a file name: l11-2.txt
There is no number in l11-2.txt.
Why is reading 1 character at a time a better algorithm than calling .read() once and then processing the resulting string using a loop?
I have the files, and their contents do correspond to the answers above, but I'm not sure how to make the program output properly.
The code I have so far is below:
filenm = raw_input("Enter a file name: ")
datain=file(filenm,"r")
try:
    c=datain.read(1)
    result = []
    while int(c) >= 0:
        result.append(c)
        c = datain.read(1)
except:
    pass
if len(result) > 0:
    print "The first number is",(" ".join(result))+" . "
else:
    print "There is no number in" , filenm + "."
So far this opens the file and reads it, but the output is always "no number", even if there is one. Can anyone help me?
OK, you've been given some instructions:
read a string input from the user
open the file given by that string
.read(1) a character at a time until you get the first number or EOF
print the number
You've got the first and second parts here (although you should use open instead of file to open a file), what next? The first thing to do is to work out your algorithm: what do you want the computer to do?
Your last line starts looping over the lines in the file, which sounds like not what your teacher wants -- they want you to read a single character. File objects have a .read() method that lets you specify how many bytes to read, so:
c = datain.read(1)
will read a single character into a string. You can then call .isdigit() on that to determine if it's a digit or not:
c.isdigit()
It sounds like you're supposed to keep reading digits until you run out, and then concatenate them all together; if the first thing you read isn't a digit (c.isdigit() is False), you should just error out.
Your datain variable is a file object. Use its .read(1) method to read 1 character at a time. Take a look at the string methods and find one that will tell you if a string is a number.
Why is reading 1 character at a time a better algorithm than calling .read() once and then processing the resulting string using a loop?
Define "better".
In this case, it's "better" because it makes you think.
In some cases, it's "better" because it can save reading an entire line when reading the first few bytes is enough.
In some cases, it's "better" because the entire line may not be sitting around in the input buffer.
You could use a regex like this (searching for an integer or a float):
import re
with open(filename, 'r') as fd:
    # re.search scans the whole text; re.match would only look at its very start
    match = re.search(r'([-]?\d+(\.\d+|))', fd.read())
    if match:
        print 'My first number is', match.groups()[0]
This works with anything like "Hello 111." and will output 111.
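The read(1) approach described in the answers can be sketched as follows. This is one possible reading of the assignment, written for Python 3, and not the official solution: first_number is a hypothetical helper, it treats a number as a run of consecutive digits, and io.StringIO stands in for the opened file so the example is self-contained.

```python
import io


def first_number(stream):
    """Read one character at a time and return the first run of digits, or None."""
    digits = []
    while True:
        c = stream.read(1)
        if c.isdigit():
            digits.append(c)
        elif digits or c == "":
            # Stop at the first non-digit after a number, or at end of file
            # (read(1) returns an empty string there).
            break
    return "".join(digits) or None


for contents in ("Lecture 20 of 30", "no digits here"):
    number = first_number(io.StringIO(contents))
    if number is not None:
        print("The first number is " + number + ".")
    else:
        print("There is no number in the data.")
```

Unlike the int(c) test in the question, isdigit() does not raise on a letter or on the empty string returned at end of file, so no bare except is needed and a leading non-digit no longer ends the search.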
