Using Python regular expressions substitution - python

I need to write a program that will deidentify names in a medical record. How can I substitute names that COULD include prefixes, suffixes and first initials or first names, but don't HAVE to have all of the above every time. For example, I can get the program to deidentify Dr. S Smith, but not Dr. Smith.
Thank you!
Here's the program I have so far:
# This program removes names and email addresses occurring in a given input file and saves it in an output file.
import re

def deidentify():
    infilename = input("Give the input file name: ")
    outfilename = input("Give the output file name: ")
    infile = open(infilename, "r")
    text = infile.read()
    infile.close()
    # replace names
    nameRE = r"(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+"
    deidentified_text = re.sub(nameRE, "**name**", text)
    outfile = open(outfilename, "w")
    print(deidentified_text, file=outfile)
    outfile.close()

deidentify()

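The [A-Z](\.|[a-z]+) term in
"(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+"
is searching for a first name or initial. You want this part to be optional, so wrap it (together with its leading space) in a group and follow that group with ?.
nameRe = r"(Ms\.|Mr\.|Dr\.|Prof\.)( [A-Z](\.|[a-z]+))?( [A-Z][a-z]+)"
re.sub(nameRe, r"\1\4", text)
The ? after the second group in nameRe says "this part is optional": the pattern matches whether or not a first name or initial is present.
The r"\1\4" tells re.sub to replace each match with the contents of the first and fourth capture groups (a capture group starts every time you see an opening parenthesis). Here that keeps the title and the last name and drops the first name or initial; to remove the whole name, pass "**name**" as the replacement instead.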

Try the following:
((?:Ms\.|Mr\.|Dr\.|Prof\.|Mrs\.) (?:[A-Z](?:\.|(?:[a-z])+) )?[A-Z][a-z]+)
However, I'd recommend parsing this file into a Python data structure (dictionaries, objects, whatever), and then you can simply omit the names when you print results, not to mention all the other handy things you can do once your data are in a Python program (e.g., has this patient been with us for more than five years? what percentage of patients have a credit card number as payment information?).

Turns out the answer was that the expression needed to account for spaces using \s. Once this was entered, the program worked.
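Putting the suggestions in this thread together (an optional first-name group, non-capturing groups, and \s for whitespace), a minimal sketch might look like the following. NAME_RE and deidentify_names are hypothetical names, and allowing a bare initial without a period (so that "Dr. S Smith" matches) is an assumption rather than something stated above.

```python
import re

# Title, then an optional first name or initial, then a last name.
# \s+ is used instead of a literal space so tabs or repeated spaces also match.
NAME_RE = re.compile(
    r"(?:Ms|Mrs|Mr|Dr|Prof)\.\s+"      # required title
    r"(?:[A-Z](?:\.|[a-z]+)?\s+)?"     # optional initial ("S", "S.") or first name
    r"[A-Z][a-z]+"                     # required last name
)

def deidentify_names(text, placeholder="**name**"):
    """Return text with every matched name replaced by placeholder."""
    return NAME_RE.sub(lambda _match: placeholder, text)

print(deidentify_names("Dr. S Smith and Dr. Smith saw Ms. Jane Doe."))
# prints: **name** and **name** saw **name**.
```

Be aware that a pattern like this over-matches when a capitalized word directly follows the last name (for example "Dr. Smith Called" would be swallowed whole), so it is only a starting point for real medical records.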

Related

Best regex pattern to replace input function from a separate python file

I am new to regex so please explain how you got to the answer. Anyway I want to know the best way to match input function from a separate python file.
For example:
match.py
a = input("Enter a number")
b = input()
print(a+b)
Now I want to match ONLY the input statements and replace each one with a random number. I will do this in a separate file, main.py. So my aim is to replace the input calls in match.py with random numbers so I can check that the output comes out as expected. You can think of match.py as a coding exercise where the user writes code in that file, and main.py as the file that evaluates whether the user's code is right. To do that, I need to replace the input myself and check that it works for all kinds of inputs. I searched for "regex patterns for python input function" but did not find anything useful. I have a current way of doing it, but I don't think it works in all cases. I need a pattern that works in all cases allowed by Python syntax. Here is the current main.py (it doesn't work for all cases: when the prompt is a string in single quotes, it is not replaced. I could just add a single quote to the pattern, but I also need to detect when both kinds of quote are used):
# Evaluating python file checking if input 2 numbers and print sum is correct
import re
import subprocess
input_pattern = re.compile(r"input\s?\([\"]?[\w]*[\"]?\)")
file = open("match.py", 'r')
read = file.read()
file.close()
code = read
matches = input_pattern.findall(code)
for match in matches:
    code = code.replace(match, '8')
file = open("match.py", 'w')
file.write(code)
file.close()
process = subprocess.Popen('python3 match.py', stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
out = process.communicate()[0]
print(out == b"16\n")
file = open("match.py", 'w')
file.write(read)
file.close()
Please let me know if you don't understand this question.
The following regex statement is very close to what you need:
input\s?\((?(?=[\"\'])[\"\'].*[\"\']\)|\))
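I am using a conditional regex statement. Note that this lookahead form of conditional is supported by PCRE-style engines but not by Python's built-in re module, which only accepts conditionals on a group number or name. I also think it may need a nested conditional to avoid the situation where the user enters something like: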
input(' text ")
But hopefully this gets you on the right track.
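For Python's own re module, one way to require matching quotes is a backreference instead of a conditional. The sketch below is only an illustration: INPUT_CALL and replace_inputs are hypothetical names, and the pattern assumes the prompt is either absent or a plain string literal without escaped quotes.

```python
import re

# ([\"']) captures the opening quote, (?:(?!\1).)* consumes characters that are
# not that same quote, and \1 requires the same quote to close the string.
# \b keeps names such as raw_input or my_input from matching.
INPUT_CALL = re.compile(r"\binput\s*\(\s*(?:([\"'])(?:(?!\1).)*\1)?\s*\)")

def replace_inputs(source, replacement):
    """Replace input(), input("...") and input('...') calls with replacement."""
    return INPUT_CALL.sub(lambda _match: replacement, source)

sample = 'a = input("Enter a number")\nb = input()\nprint(a+b)'
print(replace_inputs(sample, "8"))
```

A call with mismatched quotes such as input(' text ") is left untouched. Prompts built from variables, f-strings, or strings with escaped quotes are not handled, and the text "input(" inside a comment or another string would still be replaced, so this is not a pattern for every case Python syntax allows.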

Parsing a file in python

Caveat emptor: I can spell p-y-t-h-o-n and that's pretty much all there is to my knowledge. I tried to take some online classes but after about 20 lectures learning not much, I gave up long time ago. So, what I am going to ask is very simple but I need help:
I have a file with the following structure:
object_name_here:
    object_owner:
        - me#my.email.com
        - user#another.email.com
    object_id: some_string_here
    identification: some_other_string_here
And this block repeats itself hundreds of times in the same file.
Other than object_name_here, which is unique and required, every other line may or may not be present, and there can be anywhere from zero to 10+ email addresses.
What I want to do is export this information into a flat file, like /etc/passwd, with a twist.
For instance, I want the block above to yield a line like this:
object_name_here:object_owner=me#my.email.com,user#another.email.com:object_id=some_string_here:identification=some_other_string_here
Again, neither the number of fields nor the length of their contents is fixed by any means. I am sure this is a pretty easy task to accomplish with Python, but I don't know how. I don't even know where to start.
Final Edit: Okay, I am able to write a shell script (bash, ksh etc.) to parse the information, but, when I asked this question originally, I was under the impression that, python had a simpler way of handling uniform or semi-uniform data structures as this one. My understanding was proven to be not very accurate. Sorry for wasting your time.
As jaypb points out, regular expressions are a good idea here. If you're interested in some python 101, I'll give you some simple code to get you started on your own solution.
The following code is a quick and dirty way to lump every six lines of a file into one line of a new file:
# open some files to read and write
oldfile = open("oldfilename", "r")
newfile = open("newfilename", "w")
# initiate variables and iterate over the input file
count = 0
outputLine = ""
for line in oldfile:
    # we're going to append lines in the file to the variable outputLine
    # iterating over a file yields one line at a time as a string
    # str.strip() will remove whitespace at the beginning and end of a string
    outputLine = outputLine + line.strip()
    # increment the counter
    count = count + 1
    # you know your interesting stuff is six lines long, so
    # write the output string to file and reset it every six lines
    if count % 6 == 0:
        newfile.write(outputLine + "\n")
        outputLine = ""
# clean up
oldfile.close()
newfile.close()
This isn't exactly what you want to do but it gets you close. For instance, if you want to get rid of " - " from the beginning of the email addresses and replace it with "=", instead of just appending to outputLine you'd do something like
if some condition:
    outputLine = outputLine + '=' + line[3:]
That last bit is a Python slice; [3:] means "give me everything from index 3 onward," that is, skip the first three characters, and it works for things like strings or lists.
That'll get you started. Use google and the python docs (for instance, googling "python strip" takes you to the built-in types page for python 2.7.10) to understand every line above, then change things around to get what you need.
Since you are replacing text substrings with different text substrings, this is a pretty natural place to use regular expressions.
Python, fortunately, has an excellent regular expressions library called re.
You will probably want to heavily utilize
re.sub(pattern, repl, string)
Look at the documentation here:
https://docs.python.org/3/library/re.html
Update: Here's an example of how to use the regular expression library:
#!/usr/bin/env python
import re

body = None
with open("sample.txt") as f:
    body = f.read()
# Replace emails followed by other emails
body = re.sub(r" * - ([a-zA-Z.#]*)\n * -", r"\1,", body)
# Replace declarations of object properties
body = re.sub(r" +([a-zA-Z_]*): *[\n]*", r"\1=", body)
# Strip newlines
body = re.sub(r":?\n", ":", body)
print(body)
Example output:
$ python example.py
object_name_here:object_owner=me#my.email.com, user#another.email.com:object_id=some_string_here:identification=some_other_string_here
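As an alternative to chained substitutions, the file can also be walked line by line, which copes with any number of email addresses and with several blocks in one file. This is only a sketch: flatten_blocks is a hypothetical name, and it assumes that object names start in column 0, that every other line is indented, and that list items start with "- ".

```python
def flatten_blocks(text):
    """Turn indented blocks into one passwd-style line per object."""
    records = []  # one list of fields per object; the first field is the object name
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if not raw[0].isspace():
            # An unindented line starts a new object.
            records.append([line.rstrip(":")])
        elif line.startswith("- "):
            # A list item extends the value of the most recent field.
            field = records[-1][-1]
            separator = "" if field.endswith("=") else ","
            records[-1][-1] = field + separator + line[2:].strip()
        else:
            # "key: value" (the value may be empty) becomes "key=value".
            key, _, value = line.partition(":")
            records[-1].append(key.strip() + "=" + value.strip())
    return [":".join(fields) for fields in records]

# Usage (file names are placeholders):
# with open("sample.txt") as src, open("flat.txt", "w") as dst:
#     dst.write("\n".join(flatten_blocks(src.read())) + "\n")
```

Each object ends up on its own line, so the result can be processed further with ordinary line-oriented tools.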

Python: startswith() not working as intended

I am writing a program that loads an external txt file of movies. This part works fine. I then have a function that searches a list of movies generated from the file. The function should print out all movies that start with the search string.
def startsWithSearch(movieList):
    searchString = input("Enter search string: ")
    for movie in movieList:
        if(movie.startswith(searchString) == True):
            print(movie)
However, no movies are printed when I enter a search string, even though there are movies in the list that start with that string.
If given correct input data, your function does work as expected:
def startsWithSearch(movieList):
    searchString = "test4"
    for movie in movieList:
        if movie.startswith(searchString):
            print(movie)

startsWithSearch(["test1", "testnomatch", "test4", "test4should", "not_test4"])
output is:
test4
test4should
So it is all correct; the problem must be your input data.
I know you want a startswith solution, as your function name says, but when actually searching for movies it is a lot more convenient to find any match inside the string, so that a search for "mentalist" finds "the mentalist". For that you could just use:
if searchString in movie:
    print(movie)
And, as suggested by Anna, to ignore case:
if searchString.lower() in movie.lower():
or even fancier, with regular expressions (this needs import re on the first line, and re.escape so that characters like "." or "(" in the search string are matched literally):
if re.search(re.escape(searchString), movie, re.I):
or, if you really just want to match at the beginning of the name:
if re.match(re.escape(searchString), movie, re.I):
That should be enough alternatives :)
I would think it might be that the input() function is returning with a new line at the end. Try adding searchString = searchString.strip() after collecting the input data.
Also you might want to try converting both to lower case before comparing.
Also the line if(movie.startswith(searchString) == True): can just be written as if movie.startswith(searchString):
I had the same problem when reading from files: startswith was failing because of the newline character read from the file. Use rstrip() to remove the newline character from both strings before you call startswith.
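The suggestions above (strip stray whitespace and newlines, compare in lower case) can be combined in one small function. This is a sketch only; starts_with_search is a hypothetical name, and it returns the matches instead of printing them so that the result is easy to check.

```python
def starts_with_search(movies, search):
    """Return the titles that start with search, ignoring case and surrounding whitespace."""
    needle = search.strip().lower()
    matches = []
    for movie in movies:
        title = movie.strip()  # drops the trailing newline left over from reading a file
        if title.lower().startswith(needle):
            matches.append(title)
    return matches

# Titles as they might come from iterating over a text file, newline included.
movies = ["The Mentalist\n", "the matrix\n", "Alien\n"]
print(starts_with_search(movies, "the"))
# prints: ['The Mentalist', 'the matrix']
```

If the original function prints nothing, printing repr(movie) for a few entries is a quick way to see whether the list items carry hidden whitespace.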

Python: Regex a dictionary using user input wildcards

I would like to be able to search a dictionary in Python using user input wildcards.
I have found this:
import fnmatch
lst = ['this','is','just','a','test', 'thing']
filtered = fnmatch.filter(lst, 'th*')
This matches this and thing. Now if I try to input a whole file and search through
with open('testfilefolder/wssnt10.txt') as f:
    file_contents = f.read().lower()
filtered = fnmatch.filter(file_contents, 'th*')
This doesn't match anything. The difference is that the file I am reading from is a text file (a Shakespeare play), so it contains spaces and is a single string, not a list. I can match things such as a single letter: if I just use 't', I get a bunch of t's. So this tells me that I am matching single characters. However, I want to match whole words and, even more, to preserve the wildcard structure.
What I would like to happen is that a user enters text (including a wildcard) that I can substitute in place of 'th*', with the wildcard still doing what it should. That leads to the question: can I just put a variable holding the search text in place of 'th*'? After some investigation I am wondering whether I am supposed to translate the 'th*' somehow, and have found something such as:
regex = fnmatch.translate('th*')
print(regex)
which outputs th.*\Z(?ms)
Is this the right way to go about doing this? I don't know if it is needed.
What would be the best way to go about "passing in regex formulas"? And what do I have wrong in the second piece of code, given that it does not operate on the incoming text the way the first (correctly) operates on the list?
If the problem is just that you "have spaces and it is not a list," why not make it into a list?
with open('testfilefolder/wssnt10.txt') as f:
    file_contents = f.read().lower().split()  # split on any whitespace to make a list of words
filtered = fnmatch.filter(file_contents, 'th*')
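To the question of whether a variable can stand in for 'th*': fnmatch.filter accepts any string as its pattern, so user input can be passed straight through without calling fnmatch.translate first. A minimal sketch, where wildcard_words is a hypothetical name:

```python
import fnmatch

def wildcard_words(text, pattern):
    """Return the words of text that match a shell-style wildcard pattern."""
    words = text.lower().split()  # split on any whitespace, including newlines
    return fnmatch.filter(words, pattern.lower())

print(wildcard_words("This is just a test thing\nthat works", "th*"))
# prints: ['this', 'thing', 'that']
```

Words keep any attached punctuation here, so "thing," would also match 'th*' and be returned with its comma; stripping punctuation first is left out of this sketch.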

python read output

Write a program that outputs the first number within a file specified by the user. It should behave like:
Enter a file name: l11-1.txt
The first number is 20.
You will need to use the file object method .read(1) to read 1 character at a time, and a string object method to check if it is a number. If there is no number, the expected behaviour is:
Enter a file name: l11-2.txt
There is no number in l11-2.txt.
Why is reading 1 character at a time a better algorithm than calling .read() once and then processing the resulting string using a loop?
I have the files and it does correspond to the answers above but im not sure how to make it output properly.
The code i have so far is below:
filenm = raw_input("Enter a file name: ")
datain=file(filenm,"r")
try:
    c=datain.read(1)
    result = []
    while int(c) >= 0:
        result.append(c)
        c = datain.read(1)
except:
    pass
if len(result) > 0:
    print "The first number is",(" ".join(result))+" . "
else:
    print "There is no number in" , filenm + "."
so far this opens the file and reads it but the output is always no number even if there is one. Can anyone help me ?
OK, you've been given some instructions:
read a string input from the user
open the file given by that string
.read(1) a character at a time until you get the first number or EOF
print the number
You've got the first and second parts here (although you should use open instead of file to open a file), what next? The first thing to do is to work out your algorithm: what do you want the computer to do?
Your last line starts looping over the lines in the file, which sounds like not what your teacher wants -- they want you to read a single character. File objects have a .read() method that lets you specify how many bytes to read, so:
c = datain.read(1)
will read a single character into a string. You can then call .isdigit() on that to determine if it's a digit or not:
c.isdigit()
It sounds like you're supposed to keep reading digits until you run out, and then concatenate them all together; if the first thing you read isn't a digit (c.isdigit() is False) you should just error out.
Your datain variable is a file object. Use its .read(1) method to read 1 character at a time. Take a look at the string methods and find one that will tell you if a string is a number.
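The steps described above can be sketched as follows. This is written for Python 3 (the question uses Python 2), first_number is a hypothetical name, and skipping any non-digit characters before the first number is an assumption based on the expected output in the exercise.

```python
def first_number(path):
    """Return the first run of digits in the file as a string, or None if there is none."""
    digits = []
    with open(path) as f:
        while True:
            c = f.read(1)      # read exactly one character
            if c == "":        # an empty string means end of file
                break
            if c.isdigit():
                digits.append(c)
            elif digits:
                # First non-digit after the number: stop without reading the rest.
                break
    return "".join(digits) or None

# Usage (the file name comes from the user, as in the exercise):
# name = input("Enter a file name: ")
# number = first_number(name)
# if number is None:
#     print("There is no number in " + name + ".")
# else:
#     print("The first number is " + number + ".")
```

Because the loop stops as soon as the number ends, the remainder of the file is never read, which is one practical sense in which reading a character at a time can be "better".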
Why is reading 1 character at a time a better algorithm than calling .read() once and then processing the resulting string using a loop?
Define "better".
In this case, it's "better" because it makes you think.
In some cases, it's "better" because it can save reading an entire line when reading the first few bytes is enough.
In some cases, it's "better" because the entire line may not be sitting around in the input buffer.
You could use a regex like this (searching for an integer or a float). Note that re.search is needed rather than re.match, because re.match only matches at the very start of the string:
import re
with open(filename, 'r') as fd:
    match = re.search(r'([-]?\d+(\.\d+|))', fd.read())
    if match:
        print('My first number is %s' % match.group(1))
This works with anything like "Hello 111.", which will output 111.
