Python: extract substring from between parentheses

I have a string that is formatted like this:
"Name Surname (ID), Name2 Surname2 (ID2)"
Each ID starts with a letter that is followed by a few digits. The string can contain any number of people (one, two as in the example, or more). People can also have several names or surnames, so the number of words per person is not consistent.
I want to extract a substring built of the IDs separated by commas, so for this example it would look like this:
"ID, ID2"
Right now I tried this approach:
import re
string = "Bob Rob Smith (L1234567), John Doe (k12345678)"
result = re.findall(r'[a-zA-Z][0-9]+', string)
','.join(result)
And it works perfectly fine, but I wonder if there's a simpler approach that doesn't require any additional modules. Do you guys have any ideas?

I also think using re is a good approach. If you must avoid re at any price, you might do:
s = "Bob Rob Smith (L1234567), John Doe (k12345678)"
result = s.replace(')','(').split('(')[1::2]
print(result)
Output:
['L1234567', 'k12345678']
Explanation: I want to split at ( and ), but the .split method of str accepts only one delimiter, so I first replace ) with (, then split and take the odd-indexed elements. This method works if ( and ) are used solely around IDs, s does not start with ( or ), and there is at least one character between any two brackets.
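To make the intermediate steps visible, here is a minimal sketch of the same replace-and-split idea that also joins the result into the comma-separated string the question asks for. The variable names `pieces`, `ids`, and `joined` are only illustrative.

```python
s = "Bob Rob Smith (L1234567), John Doe (k12345678)"

# After replacing ')' with '(', a single delimiter marks both ends of each ID.
pieces = s.replace(')', '(').split('(')
# pieces alternates between text outside the brackets and text inside them,
# so the IDs sit at the odd indexes.
ids = pieces[1::2]

joined = ', '.join(ids)
print(pieces)
print(joined)
```

The trailing empty string in `pieces` comes from the final `)` and is skipped by the `[1::2]` slice.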

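You could split on ), and take the last 8 characters from each element of the resulting list, but regex is the correct approach. Note that this assumes every ID is exactly 8 characters long, so it would truncate the 9-character k12345678 from the question's example: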
[s[-8:] for s in mystring[:-1].split('),')]

To me, the regex approach seems best.
Assuming that you do not know exactly how many digits your IDs have (quote: "followed by a few digits"), you could loop through the whole string and collect what's inside the parentheses:
s = "Bob Rob Smith (L1234567), John Doe (k12345678)"
res = []
word = ''
inside = False  # True while we are between '(' and ')'; avoids shadowing the built-in open()
for x in s:
    if x == '(':
        inside = True
        continue
    if x == ')':
        inside = False
        res.append(word)
        word = ''
    if inside:
        word += x
print(res)
OUTPUT:
['L1234567', 'k12345678']
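Since every answer here still recommends re, a slightly stricter variant of the question's pattern may be worth sketching. It only accepts an ID when it is wrapped in parentheses, so a stray letter followed by digits inside a name would not be picked up. The function name `extract_ids` is hypothetical, and the pattern assumes an ID is exactly one letter followed by one or more digits.

```python
import re

# One letter followed by digits, captured only when enclosed in parentheses.
ID_IN_PARENS = re.compile(r'\(([A-Za-z][0-9]+)\)')

def extract_ids(text):
    """Return the parenthesised IDs joined by ', '."""
    return ', '.join(ID_IN_PARENS.findall(text))

print(extract_ids("Bob Rob Smith (L1234567), John Doe (k12345678)"))
```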

Related

How to extract 3 or more words after a specific word

I've been trying to extract 3 or more words after Diagnosis: or diagnosis: to no avail.
This is the code I've been trying:
'diagnosis: \s+((?:\w+(?:\s+|$)){2})'
It prints an empty result.
I have managed to make this code work:
"Diagnosis: (\w+)",
"diagnosis: (\w+)",
which gives me the immediate word after Diagnosis: or diagnosis:.
How can I make it work for 3 or more words?
##title Extract Diagnosis { form-width: "20%" }
def extract_Diagnosis(clinical_information):
    PATTERNS = [
        "diagnosis: (\w+).",
        "Diagnosis: (\w+).",
    ]
    for pattern in PATTERNS:
        matches = re.findall(pattern, clinical_information)
        if len(matches) > 0:
            break
    Diagnosis = ''.join([t for t in matches if t.isalpha()])
    return Diagnosis
for index, text in enumerate(texts):
    print(extract_Diagnosis(text))
    print("#"*79, index)
What I'm looking for is 3 or more words that come after diagnosis: or Diagnosis: in 20 PDFs. I've already turned the PDFs to text and extracted the paragraph that contains "diagnosis:" (clinical information).
Ok, a new answer that focuses more on the problems with your code than problems with your regular expression. So first of all, your regular expression needs to be tweaked just a little bit by removing the initial space character and changing 2 to 3:
diagnosis:\s+((?:\w+(?:\s+|$)){3})
Your code has a number of issues. Here's a version of your code that kinda works, although it may not be doing exactly what you want:
import re
def extract_Diagnosis(clinical_information):
    PATTERNS = [r"diagnosis:\s+((?:\w+(?:\s+|$)){3})"]
    matches = []
    for pattern in PATTERNS:
        matches = re.findall(pattern, clinical_information)
        if len(matches) > 0:
            break
    Diagnosis = ''.join([t for t in matches])
    return Diagnosis
texts = ["diagnosis: a b c blah blah blah diagnosis: asdf asdf asdf x x x "]
for index, text in enumerate(texts):
    print(extract_Diagnosis(text))
    print("#"*79, index)
Result:
a b c asdf asdf asdf.
Here are the things I fixed with your code:
I replaced the two regular expressions with the one expression in your question, with the modifications mentioned above.
I added an r to the front of the string constant containing the regular expression. This specifies a "raw string" in Python. You need to either do this or double up your backslashes.
You were filtering your results with the expression if t.isalpha(). Given your expression, this will always be False because what you are matching will always contain spaces as well as word characters. I see no reason for this test anyway, since you know exactly what you're getting because what you get matched your regular expression.
I fixed indentation so that everything worked. It may be that you had that right in your original code and it just got messed up moving it into your question.
I hope this helps!
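The answer's single pattern only matches the lower-case spelling diagnosis:. One possible way to cover both spellings is the re.IGNORECASE flag. The sketch below is an illustration under assumptions, not the answerer's code: the function name `first_words_after_diagnosis`, the parameter `n`, and the sample text are all made up for the example.

```python
import re

def first_words_after_diagnosis(text, n=3):
    """Return the first n words following each 'diagnosis:' (any letter case)."""
    # One word, then exactly n - 1 further words, each preceded by whitespace.
    pattern = rf"diagnosis:\s+(\w+(?:\s+\w+){{{n - 1}}})"
    return re.findall(pattern, text, flags=re.IGNORECASE)

sample = ("Diagnosis: acute viral bronchitis with cough. "
          "Later diagnosis: mild seasonal asthma")
print(first_words_after_diagnosis(sample))
```

Because the group ends on a word rather than on whitespace, each result comes back without a trailing space, so the matches can be joined with any separator.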

Hiding values of a column with ****xy using Python

I am stuck on a coding problem in Python. I have a CSV file with two columns, Flag and Customer_name, and I am using data frames. If Flag is 0 I want to print the complete name, and if Flag is 1 I want to hide the first n-2 characters of the customer name with "*". For example,
if flag=1 then,
display ********th (for john smith)
Thanks in advance
You can create the number of '*' needed and then add the last two letters:
name = 'john smith'
name_update = '*' * (len(name)-2) + name[-2:]
print(name_update)
output:
********th
As you used dataframe as a tag, I assume that you are working with pandas.DataFrame. In that case you might use a regular expression for the task.
import pandas as pd
df = pd.DataFrame({'name':['john smith']})
df['redacted'] = df['name'].str.replace(r'.(?=..)', '*', regex=True)  # regex=True is needed in pandas 2.0+, where the default is a literal match
print(df)
Output:
         name    redacted
0  john smith  ********th
Explanation: I used a positive lookahead (a kind of zero-length assertion) and replace a character with * if and only if at least two more characters follow it, which is true for all but the last 2 characters.
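Neither answer uses the Flag column from the question. A minimal sketch of the conditional part, using only the standard library so it runs without pandas, could look like this. The helper name `mask_name` and the sample rows are assumptions made for illustration; the column names Flag and Customer_name come from the question.

```python
import csv
import io

def mask_name(name, flag):
    """Return name unchanged for flag 0; hide all but the last two characters for flag 1."""
    if str(flag) == '1':
        return '*' * max(len(name) - 2, 0) + name[-2:]
    return name

# Hypothetical CSV content standing in for the real file.
sample_csv = "Flag,Customer_name\n0,jane doe\n1,john smith\n"
rows = csv.DictReader(io.StringIO(sample_csv))
masked = [mask_name(row['Customer_name'], row['Flag']) for row in rows]
print(masked)
```

With a pandas DataFrame the same helper could be applied row by row, for example with `df.apply(lambda r: mask_name(r['Customer_name'], r['Flag']), axis=1)`.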

Python regex preference if multiple matches

I am searching for city names in a string:
mystring = r'SDM\Austin'  # raw string, so \A is not treated as an escape sequence
city_search = r'(SD|Austin)'
mo_city = re.search(city_search,mystring,re.IGNORECASE)
city = mo_city.group(1)
print(city)
This will return city as 'SD'.
Is there a way to make 'Austin' the preference?
Switching the order to (Austin|SD) doesn't work.
The answer is the same as How can I find all matches to a regular expression in Python?, but the use case is a little different since one match is preferred.
You're using re.search; use re.findall instead, which returns a list of all matches.
So if you modify your code to:
mystring = r'SDM\Austin'
city_search = r'(SD|Austin)'
mo_city = re.findall(city_search,mystring,re.IGNORECASE)
city = mo_city[1]
print(city)
it will work fine, outputting:
Austin
So, mo_city is a list: ['SD', 'Austin'] and since we want to assign the second element (Austin) to city, we take index 1 with mo_city[1].
Brief
You already have a great answer here (using findall instead of search with regex). This is another alternative (without using regex) that checks a string against a list of strings and returns matches. Based on the sample code you provided, this should work for you and is probably easier than the regex method.
Code
cities = ['SD', 'Austin']  # renamed from list to avoid shadowing the built-in
s = r'SDM\Austin'
for city in cities:
    if city in s:
        print('"{}" exists in "{}"'.format(city, s))
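Taking index 1 of the findall result depends on where the names happen to appear in the string. If the goal is a fixed preference regardless of position, one option is to search for each candidate in priority order and stop at the first hit. This is only a sketch: the function name `find_preferred_city` and its parameter `cities_by_priority` are hypothetical.

```python
import re

def find_preferred_city(text, cities_by_priority):
    """Return the first city from the priority list that occurs in text, else None."""
    for city in cities_by_priority:
        # re.escape keeps city names with special characters literal.
        match = re.search(re.escape(city), text, re.IGNORECASE)
        if match:
            return match.group(0)
    return None

print(find_preferred_city(r'SDM\Austin', ['Austin', 'SD']))
```

When the preferred name is absent, the function falls back to the next entry in the list, so 'SDM' alone would still yield 'SD'.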

Python Regex: how to not select whitespace before last string?

I am a newbie struggling to separate a database into columns with regex.findall().
I want to separate these Dutch street names into name and number.
Roemer Visscherstraat 15
Vondelstraat 102-huis
For the number I use
\S*$
Which works just fine. For the street name I use
^\S.+[^\S$]
Or: use everything but the last element, which may be a number or a combination of a number and something else.
Problem is: Python then also keeps the last whitespace after the last name, so I get:
'Roemer Visscherstraat '
Any way I can stop this from happening?
Also, findall returns a list consisting of the bit of the database I wanted and an empty string. How does this happen, and can I prevent it somehow?
Thanks so much in advance for your help.
You can rstrip() the name to remove any spaces at the end of it:
>>>'Roemer Visscherstraat '.rstrip()
'Roemer Visscherstraat'
But if the input is similar to the one you posted, you can simply use split() instead of regex, for example:
st = 'Roemer Visscherstraat 15'
data = st.split()
num = data[-1]  # last whitespace-separated element is the number
name = ' '.join(data[:-1])  # everything before it is the street name
print('Name: {}, Number: {}'.format(name, num))
output:
Name: Roemer Visscherstraat, Number: 15
For the number you should use the following:
\S+$
Using a + instead of a * will ensure that you have at least one character in the match.
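This also explains the extra empty string the question mentions. A quick check, using the first street from the question, shows the difference between the two quantifiers on current Python 3:

```python
import re

street = 'Roemer Visscherstraat 15'

# \S*$ can match zero characters, so after matching '15' it matches
# once more as an empty string at the very end of the input.
star_matches = re.findall(r'\S*$', street)

# \S+$ needs at least one character, so only the number is returned.
plus_matches = re.findall(r'\S+$', street)

print(star_matches)
print(plus_matches)
```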
For the street name you can use the following:
^.+(?=\s\S+$)
This selects the text up to, but not including, the whitespace before the number.
However, what you may consider doing is using one regex match with capture groups instead. The following would work:
^(.+(?=\s\S+$))\s(\S+$)
In this case, the first capture group gives you the street name, and the second gives you the number.
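As a short illustration of the capture-group version, the following sketch applies that pattern to the two addresses from the question. The names `STREET_RE` and `split_address` are only illustrative.

```python
import re

# Group 1: street name, group 2: house number (last whitespace-free chunk).
STREET_RE = re.compile(r'^(.+(?=\s\S+$))\s(\S+$)')

def split_address(address):
    """Return (name, number), or None if the address has no separating space."""
    match = STREET_RE.match(address)
    return match.groups() if match else None

for address in ['Roemer Visscherstraat 15', 'Vondelstraat 102-huis']:
    print(split_address(address))
```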
([^\d]*)\s+(\d.*)
In this regex, the first group captures everything before the whitespace that precedes the number, and the second group captures the number itself.
My assumption is that the number begins with a digit and the name does not contain a digit.
Take a look at https://regex101.com/r/eW0UP2/1
Roemer Visscherstraat 15
Full match 0-24 `Roemer Visscherstraat 15`
Group 1. 0-21 `Roemer Visscherstraat`
Group 2. 22-24 `15`
Vondelstraat 102-huis
Full match 24-46 `Vondelstraat 102-huis`
Group 1. 24-37 `Vondelstraat`
Group 2. 38-46 `102-huis`

How do you effectively use regular expressions to find alliterative expressions?

I have an assignment that requires me to use regular expressions in Python to find alliterative expressions in a file that consists of a list of names. Here are the specific instructions:
"Open a file and return all of the alliterative names in the file.
For our purposes a "name" is two sequences of letters separated by
a space, with capital letters only in the leading positions.
We call a name alliterative if the first and last names begin
with the same letter, with the exception that s and sh are considered
distinct, and likewise for c/ch and t/th. The names file will contain a list of strings separated by commas. Suggestion: Do this in two stages." This is my attempt so far:
def check(regex, string, flags=0):
    return not (re.match("(?:" + regex + r")\Z", string, flags=flags)) is None
def alliterative(names_file):
    f = open(names_file)
    string = f.read()
    lst = string.split(',')
    lst2 = []
    for i in lst:
        x=lst[i]
        if re.search(r'[A-Z][a-z]* [A-Z][a-z]*', x):
            k=x.split(' ')
            if check('{}'.format(k[0][0]), k[1]):
                if not check('[cst]', k[0][0]):
                    lst2.append(x)
                elif len(k[0])==1:
                    if len(k[1])==1:
                        lst2.append(x)
                    elif not check('h',k[1][1]):
                        lst2.append(x)
                elif len(k[1])==1:
                    if not check('h',k[0][1]):
                        lst2.append(x)
    return lst2
There are two issues that I have. First, what I coded seems to make sense to me. The general idea is that I first check that the names are in the correct format (first name, last name, letters only, only the first letters of the first and last names capitalized), then check whether the starting letters of the first and last names match, then check whether those first letters are c, s, or t. If they aren't, we add the name to the new list; if they are, we check that we aren't accidentally matching a [cst] with a [cst]h. The code compiles, but when I tried to run it on this list of names:
Umesh Vazirani, Vijay Vazirani, Barbara Liskov, Leslie Lamport, Scott Shenker, R2D2 Rover, Shaq, Sam Spade, Thomas Thing
it returns an empty list instead of ["Vijay Vazirani", "Leslie Lamport", "Sam Spade", "Thomas Thing"], which it is supposed to return. I added print statements to alliterative to see where things were going wrong, and it seems that the line
if check('{}'.format(k[0][0]), k[1]):
is an issue.
More than the issues with my program though, I feel like I am missing the point of regular expressions: am I overcomplicating this? Is there a nicer way to do this with regular expressions?
Please consider improving your question.
As written, the question is only useful to those who want an answer to exactly the same question, which I think is very unlikely.
Please think about how to generalize it so that this Q&A can be helpful to others.
I think your direction is about right.
It's a good idea to check the validity of the input using a regular expression. r'[A-Z][a-z]* [A-Z][a-z]*' is a good expression.
You can group parts of the pattern with parentheses, so that you can easily get the first and last name later on.
Keep in mind the difference between re.match and re.search. re.search(r'[A-Z][a-z]* [A-Z][a-z]*', 'aaRob Smith') returns a match object, because re.search looks for the pattern anywhere in the string.
A comment on general programming style:
Better to name variables first and last for readability, rather than k[0] and k[1] (and how was the letter k picked!?)
Here's one way to do it:
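The difference between the matching functions mentioned above can be checked directly. This small sketch uses the pattern from the question; re.fullmatch is added here as a further standard-library option and is not part of the original answer.

```python
import re

pattern = r'[A-Z][a-z]* [A-Z][a-z]*'

# search scans the whole string, so the 'Rob Smith' part is found.
found_anywhere = re.search(pattern, 'aaRob Smith') is not None
# match only tries at position 0, where 'a' is not an upper-case letter.
found_at_start = re.match(pattern, 'aaRob Smith') is not None
# fullmatch requires the entire string to fit the pattern.
whole_string_ok = re.fullmatch(pattern, 'Rob Smith') is not None
whole_string_bad = re.fullmatch(pattern, 'R2D2 Rover') is not None

print(found_anywhere, found_at_start, whole_string_ok, whole_string_bad)
```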
import re
FULL_NAME_RE = re.compile(r'^([A-Z][a-z]*) ([A-Z][a-z]*)$')
def is_alliterative(name):
    """Returns True if it matches the alliterative requirement otherwise False"""
    # Reject anything that does not match the name requirement
    match = FULL_NAME_RE.match(name)
    if not match:
        return False
    first, last = match.group(1, 2)
    first, last = first.lower(), last.lower()  # lower-case both to simplify comparisons
    if first[0] != last[0]:
        return False
    if first[0] in 'cst':  # Check sh/ch/th
        # Both names must agree on whether an 'h' follows the first letter
        return _is_cst_h(first) == _is_cst_h(last)
    # All checks passed!
    return True
def _is_cst_h(text):
    """Returns True if text starts with 'ch', 'sh', or 'th'."""
    # Assumes the caller has already checked that the first letter is c, s, or t
    return text[1:].startswith('h')
names = [
    'Umesh Vazirani', 'Vijay Vazirani', 'Barbara Liskov',
    'Leslie Lamport', 'Scott Shenker', 'R2D2 Rover', 'Shaq', 'Sam Spade', 'Thomas Thing'
]
print([name for name in names if is_alliterative(name)])
# Output:
# ['Vijay Vazirani', 'Leslie Lamport', 'Sam Spade', 'Thomas Thing']
Try this regular expression (here names must be the whole comma-separated string read from the file, not a list):
[a[0] for a in re.findall('((?P<caps>[A-Z])[a-z]*\\s(?P=caps)[a-z]*)', names)]
Note: It does not handle the sh/ch/th special case.
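One way to cover the sh/ch/th case while keeping the work in regular expressions is the two-stage approach the assignment suggests: first validate the name, then compare the leading "sound" of each word, where Ch, Sh, and Th count as units of their own. The sketch below is one possible reading of the assignment, not the posted answer's method; `NAME_RE`, `ONSET_RE`, and `alliterative_names` are illustrative names.

```python
import re

NAME_RE = re.compile(r'([A-Z][a-z]*) ([A-Z][a-z]*)')
# Try the two-letter units first, then fall back to a single capital.
ONSET_RE = re.compile(r'[CST]h|[A-Z]')

def is_alliterative(name):
    """Stage 1: validate the format. Stage 2: compare the leading units."""
    match = NAME_RE.fullmatch(name)
    if not match:
        return False
    first, last = match.groups()
    return ONSET_RE.match(first).group() == ONSET_RE.match(last).group()

def alliterative_names(text):
    """Split a comma-separated string and keep the alliterative names."""
    candidates = [part.strip() for part in text.split(',')]
    return [name for name in candidates if is_alliterative(name)]

sample = ("Umesh Vazirani, Vijay Vazirani, Barbara Liskov, Leslie Lamport, "
          "Scott Shenker, R2D2 Rover, Shaq, Sam Spade, Thomas Thing")
print(alliterative_names(sample))
```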
