The namelist is:
[J. A. Rubiño-Martín, R. Rebolo, M. Aguiar, R. Génova-Santos, F. Gómez-Reñasco, J. M. Herreros, R.J. Hoyland, C. López-Caraballo, A. E. Pelaez Santos, V. Sanchez de la Rosa]
and I need to split it into
[[J. A.], [Rubiño-Martín], [R.], [Rebolo], [M.], [Aguiar], [R.], [Génova-Santos], [F.], [Gómez-Reñasco], [J. M.], [Herreros], [R.J.], [Hoyland], [C.], [López-Caraballo], [A. E.], [Pelaez Santos], [V.], [Sanchez de la Rosa]]
using python regex
For the given input, this regex works. The first group matches any number of tokens followed by a dot, repeated greedily. The second group matches everything after the last dot that is followed by one or more spaces.
^(.+\.)+\s+(.+)$
https://regex101.com/r/Jxy3Un/1
But as pointed out in the comments, it could easily break if you get names that don't follow this rather strict pattern.
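As a rough sketch of how this could be applied in Python, the snippet below splits the list on commas and runs a slightly simplified form of the pattern (a single greedy group instead of the repeated one) on each name. The helper name `split_name` and the fallback of returning the whole name when nothing matches are assumptions made for illustration, not part of the original answer.

```python
import re

names = ("J. A. Rubiño-Martín, R. Rebolo, M. Aguiar, R. Génova-Santos, "
         "F. Gómez-Reñasco, J. M. Herreros, R.J. Hoyland, C. López-Caraballo, "
         "A. E. Pelaez Santos, V. Sanchez de la Rosa")

# Greedy group 1 runs up to the last dot that is followed by whitespace;
# group 2 takes the rest of the name.
pattern = re.compile(r'^(.+\.)\s+(.+)$')

def split_name(name):
    """Return [initials, surname], or [name] if the pattern does not match."""
    name = name.strip()
    m = pattern.match(name)
    if m is None:
        return [name]  # hypothetical fallback for names without initials
    return [m.group(1), m.group(2)]

parts = [piece for name in names.split(',') for piece in split_name(name)]
print(parts)
```

Names that contain commas or that lack dotted initials would need extra handling.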
I have a problem with searching for words in a text.
In my code I look for words within an Italian text (this is divided into strings, based on the paragraphs) but when I have words like "e", "in", "ad", it tells me that it finds them many times but in reality, these are words like "begin", "adduce" and any word that contains the e. Is there an efficient way to avoid this "mistake"? I have searched everywhere but I just can't find anything, I think it's a simple problem but I'm not an expert at all, thanks to those who will help me. I would like to do it without importing any libraries
sample text:
['sostanza di cieli ed astri cercai per oceani. di donarmi il diluvio ti dissi io, o musa, scorgendo il destino.', " o zeus che infiniti addurre volle, principiando con stormi arditi fulmini di ira molto funesta laddove si alzasse eccessivamente il volare negato all'uomo.", 'imperterrita irrefrenabile poiché poiché memore di ciò, da qualunque principio, memore di di di ciò di ciò, da qualunque principio, ad ogni costo, dea figlia di zeus, narrane cagione e spirito. ']
I have to find these words (not all of them are necessarily in the text, for example 'e' is missing):
uomo,
dissi io,
o musa,
molto,
eccessivamente,
e,
in,
di ciò
expected output: uomo, dissi io, o musa, molto, eccessivamente, di ciò
You likely want something more advanced which understands the grammar of the language you're trying to parse, but this may work for you
split each paragraph up into individual words
check each word for closeness to your word (ie Levenshtein distance or another metric)
Perhaps
import difflib
def iter_test_words(source_paragraph, words_to_check):
    for word_test in source_paragraph.split():  # split by whitespace
        yield difflib.get_close_matches(word_test, words_to_check, n=1, cutoff=0.9)
Some further help
you could try/except and find the first index in the returned list [0] to find anomalous words (IndexError)
you likely need to tune your cutoff as-needed (or even dynamically; ie re-try for anomalies) to get good results
again, using and configuring a library for your needs will probably give better results .. ideally something which
understands the grammar
understands subtle (for computers) word variations (ie. for your case, are Italian tenses of "to go" andando and andato the same? but that ondato "wave" is another concept despite being a better textual match)
>>> import difflib
>>> difflib.get_close_matches("andato", ["andando", "ondato"])
['ondato', 'andando']
>>> difflib.SequenceMatcher(None, "andato", "andando").ratio()
0.7692307692307693
>>> difflib.SequenceMatcher(None, "andato", "ondato").ratio()
0.8333333333333334
You can use a regular expression for this purpose. The special sequence \b matches word boundaries. For example, the pattern \bin\b matches the beginning of a word, followed by "in", followed by the end of a word.
Here is the code:
>>> import re
>>> len(re.findall(r'\bin\b', 'begin in begin end'))
1
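A minimal sketch of how the same idea could be applied to a list of search terms follows. It uses only the first two paragraphs of the sample text, and the function name `find_whole_words` is made up for this example. `re.escape` is used so that any punctuation in a search term is treated literally.

```python
import re

paragraphs = [
    'sostanza di cieli ed astri cercai per oceani. di donarmi il diluvio ti dissi io, o musa, scorgendo il destino.',
    " o zeus che infiniti addurre volle, principiando con stormi arditi fulmini di ira molto funesta laddove si alzasse eccessivamente il volare negato all'uomo.",
]
targets = ['uomo', 'dissi io', 'o musa', 'molto', 'eccessivamente', 'e', 'in']

def find_whole_words(paragraphs, targets):
    """Return the targets that occur as whole words/phrases in any paragraph."""
    found = []
    for target in targets:
        # \b on both sides prevents matching 'in' inside 'infiniti', etc.
        pattern = re.compile(r'\b' + re.escape(target) + r'\b')
        if any(pattern.search(p) for p in paragraphs):
            found.append(target)
    return found

found = find_whole_words(paragraphs, targets)
print(', '.join(found))  # uomo, dissi io, o musa, molto, eccessivamente
```

To count occurrences instead of only testing for presence, `len(pattern.findall(p))` could be summed over the paragraphs.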
Basically, I want to remove the certain phrase patterns embedded in my text data:
Starts with an upper case letter and ends with an Em Dash "—"
Starts with an Em Dash "—" and ends with a "Read Next"
Say, I've got the following data:
CEBU CITY—The widow of slain human rights lawyer .... citing figures from the NUPL that showed that 34 lawyers had been killed in the past two years. —WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next
and
Manila, Philippines—President .... but justice will eventually push its way through their walls of impunity, ... —REPORTS FROM MELVIN GASCON, JULIE M. AURELIO, DELFIN T. MALLARI JR., JEROME ANING, JOVIC YEE, GABRIEL PABICO LALU, PATHRICIA ANN V. ROXAS, DJ YAP, AFP, APRead Next
I want to remove the following phrases:
"CEBU CITY—"
"—WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next"
"Manila, Philippines—"
"—REPORTS FROM MELVIN GASCON, JULIE M. AURELIO, DELFIN T. MALLARI JR., JEROME ANING, JOVIC YEE, GABRIEL PABICO LALU, PATHRICIA ANN V. ROXAS, DJ YAP, AFP, APRead Next"
I assume this needs two regexes, one for each of the patterns listed above.
The regex: —[A-Z].*Read Next\s*$ may work on the pattern # 2 but only when there are no other em dashes in the text data. It will not work when pattern # 1 occurs as it will remove the chunk from the first em dash it has seen until the "Read Next" string.
I have tried the following regex for pattern # 1:
^[A-Z]([A-Za-z]).+(—)$
But why does it not work? That regex was supposed to look for a phrase that starts with any upper case letter, followed by a string of any length, as long as it ends with an "—".
The character you are treating as a hyphen - is not actually a hyphen but an Em Dash, so you need a regex that has an Em Dash instead of a hyphen at the start:
^—[A-Z].*Read Next\s*$
Here is the explanation for this regex,
^ --> Start of input
— --> Matches a literal Em Dash whose Unicode Decimal Code is 8212
[A-Z] --> Matches an upper case letter
.* --> Matches any character zero or more times
Read Next --> Matches these literal words
\s* --> This is for matching any optional white space that might be present at the end of line
$ --> End of input
The regex that should take care of this -
^—[A-Z]+(.)*(Read Next)$
You can try implementing this regex on your data and see if it works out.
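For text where the dateline and the byline both sit in the same string as the article body, one possible approach is sketched below. It is not taken from the answers above: the names `DATELINE`, `BYLINE` and `strip_dateline_and_byline` are invented for illustration, and it assumes every text begins with a dateline and that the byline starts at the last Em Dash. Note that the attempted pattern in the question ends in `$`, which requires the Em Dash to be the very last character of the input, so it cannot match a dateline that is followed by the article text.

```python
import re

# Assumption: the text starts with a dateline that runs up to the first Em Dash.
DATELINE = re.compile(r'^[A-Z][^—]*—')
# Assumption: the byline starts at the last Em Dash and ends with "Read Next".
BYLINE = re.compile(r'—[^—]*Read Next\s*$')

def strip_dateline_and_byline(text):
    """Remove a leading 'PLACE—' dateline and a trailing '—...Read Next' byline."""
    text = DATELINE.sub('', text, count=1)
    text = BYLINE.sub('', text, count=1)
    return text.strip()

sample = ('CEBU CITY—The widow of slain human rights lawyer .... citing figures '
          'from the NUPL that showed that 34 lawyers had been killed in the past '
          'two years. —WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next')
cleaned = strip_dateline_and_byline(sample)
print(cleaned)
```

If an article has no dateline but uses an Em Dash in its body, `DATELINE` would remove too much, so a length limit on the dateline may be worth adding for real data.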
I am trying to extract some sub-strings from another string, and I have identified patterns that should yield the correct results, however I think there are some small flaws in my implementation.
s = 'Arkansas BaseballMiami (Ohio) at ArkansasFeb 17, 2017 at Fayetteville, Ark. (Baum Stadium)Score by Innings123456789RHEMiami (Ohio)000000000061Arkansas60000010X781Miami (Ohio) starters: 1/lf HALL, D.; 23/3b YACEK; 36/1b HAFFEY; 40/c SENGER; 7/dh HARRIS; 8/rf STEPHENS; 11/ss TEXIDOR; 2/2b VOGELGESANG; 5/cf SADA; 32/p GNETZ;Arkansas starters: 8/dh E. Cole; 9/ss J. Biggers; 17/lf L. Bonfield; 33/c G. Koch; 28/cf D. Fletcher; 20/2b C. Shaddy; 24/1b C Spanberger; 15/rf J. Arledge; 6/3b H. Wilson; 16/p B. Knight;Miami (Ohio) 1st - HALL, D. struck out swinging.'
Here is my attempt at regex formulas to achieve my desired outputs:
teams = re.findall(r'(;|[0-9])(.*?) starters', s)
pitchers = re.findall('/p(.*?);', s)
The pitchers search seems to work, however the teams outputs the following:
[('1', '7, 2017 at Fayetteville, Ark. (Baum Stadium)Score by Innings123456789RHEMiami (Ohio)000000000061Arkansas60000010X781Miami (Ohio)'), ('1', '/lf HALL, D.; 23/3b YACEK; 36/1b HAFFEY; 40/c SENGER; 7/dh HARRIS; 8/rf STEPHENS; 11/ss TEXIDOR; 2/2b VOGELGESANG; 5/cf SADA; 32/p GNETZ;Arkansas')]
DESIRED OUTPUTS:
['Miami (Ohio)', 'Arkansas']
[' GNETZ', ' B. Knight']
I can worry about stripping out the leading spaces in the pitchers names later.
(;|[0-9]) can be replaced with [;0-9]. Then what I think you're trying to express is "get me the string before starters and immediately after the last number/semicolon that comes before the starters", for which you can say "there must be no other numbers/semicolons in between", i.e.
teams = re.findall(r'[;0-9]([^;0-9]*) starters', s)
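A quick check of this pattern on a shortened version of the input string is sketched below (the string is trimmed for readability, not the full original). The `\s*` added to the pitchers pattern is an optional tweak, not part of the original attempt, that drops the leading space from each name.

```python
import re

# Shortened excerpt of the original string, kept only for illustration.
s = ('Arkansas60000010X781Miami (Ohio) starters: 1/lf HALL, D.; 23/3b YACEK; '
     '32/p GNETZ;Arkansas starters: 8/dh E. Cole; 9/ss J. Biggers; '
     '16/p B. Knight;Miami (Ohio) 1st - HALL, D. struck out swinging.')

# Text between the last digit/semicolon and " starters".
teams = re.findall(r'[;0-9]([^;0-9]*) starters', s)
# \s* swallows the space after "/p" so the names need no later stripping.
pitchers = re.findall(r'/p\s*(.*?);', s)

print(teams)     # ['Miami (Ohio)', 'Arkansas']
print(pitchers)  # ['GNETZ', 'B. Knight']
```

This relies on team names never containing digits or semicolons.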
I'm trying to parse a string containing a name and a degree. I have a long list of these. Some contain no degrees, some contain one, and some contain multiple.
Example strings:
Sam da Man J.D.
Green Eggs Jr. Ed.M.
Argle Bargle Sr. MA
Cersei Lannister M.A. Ph.D.
As far as I can tell, the degrees come in the following patterns:
x.x.
x.x.x.
x.x.xx.
x.xx.
xx.x.
x.xxx.
two caps (ex: 'MA')
How would I parse this?
I'm new to regex and breaking down this problem has proved very time-consuming. I've been using this post and tried split = re.split('\s+|([.])',s) and split = re.split('\s+|\.',s) but these still split on the first space.
I have thought, in response to the first comment, about the degree designations. I've been trying to make a regex that recognizes 'x.x' and then a wildcard afterwards because there are several patterns within the degrees which look like this: x.x(something):
x.x.
x.x.x.
x.x.xx.
and then I'd have a few more to classify.
Alternatively, classifying the name might be easier?
Or even listing the degrees in a collection and searching for them?
{'M.A.T.','Ph.D.','MA','J.D.','Ed.M.', 'M.A.', 'M.B.A.', 'Ed.S.', 'M.Div.', 'M.Ed.', 'RN', 'B.S.Ed.'}
Try to change your "Jr.", "Sr.", ... replacing them with something like this: "Jr~", "Sr~", ...
This is the regular expression for doing that:
/ (Jr|Sr)\. / $1~ /g
You obtain this string:
Sam da Man J.D.
Green Eggs Jr~ Ed.M.
Argle Bargle Sr~ MA
Cersei Lannister M.A. Ph.D.
Now you can easily capture degrees with this regular expression:
/ (MA|RN|([A-Z][a-z]?[a-z]?\.)+) /g
You can use this:
'[ ](MA|RN|([A-Z][a-z]?[a-z]?\.){2,3})'
It does not match a word that contains only one dot, such as "Jr." or "Sr.".
I think the best approach is either creating a list or regex of specific degrees you're looking for, instead of trying to define patterns like x.x. that will match several different degrees. A pattern like this is too general, and may match many other values in free text (in this case, people's initials).
import re
s = """Sam da Man J.D.
Green Eggs Jr. Ed.M.
Argle Bargle Sr. MA
Cersei Lannister M.A. Ph.D.
Albus Dumbledore M.A.T.
"""
# Dots are escaped so they match a literal "." rather than any character.
pattern = r"M\.A\.T\.|Ph\.D\.|MA|J\.D\.|Ed\.M\.|M\.A\.|M\.B\.A\.|Ed\.S\.|M\.Div\.|M\.Ed\.|RN|B\.S\.Ed\."
degrees = re.findall(pattern, s)
print(degrees)
Output:
['J.D.', 'Ed.M.', 'MA', 'M.A.', 'Ph.D.', 'M.A.T.']
If you're looking to get the names that appear between the degrees in a block of text like the one above, you can use re.split.
names = re.split(pattern, s)
names = [n.strip() for n in names if n.strip()]
print(names)
Output:
['Sam da Man', 'Green Eggs Jr.', 'Argle Bargle Sr.', 'Cersei Lannister', 'Albus Dumbledore']
Note that I had to strip the remaining strings and remove empty strings from the results to capture just the names. Doing that operation on the result allows the regex to be much simpler.
Note also that this can still fail when a specific degree could also be someone's initials, (e.g., J.D. Salinger). You may need to make adjustments or other allowances based on your real data.
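One way to reduce the initials problem, sketched here under the assumption that each line holds one person and that degrees only appear at the end of the line, is to build the alternation from the set of known degrees and accept only a trailing run of them. The names `DEGREES`, `trailing` and `split_name_and_degrees` are made up for this example.

```python
import re

DEGREES = {'M.A.T.', 'Ph.D.', 'MA', 'J.D.', 'Ed.M.', 'M.A.', 'M.B.A.',
           'Ed.S.', 'M.Div.', 'M.Ed.', 'RN', 'B.S.Ed.'}

# Escape each degree so "." is literal; try longer degrees first.
alternation = '|'.join(re.escape(d) for d in sorted(DEGREES, key=len, reverse=True))
# One or more whitespace-separated degrees, anchored at the end of the line.
trailing = re.compile(rf'((?:\s+(?:{alternation}))+)\s*$')

def split_name_and_degrees(line):
    """Return (name, [degrees]) for a single 'name degree degree ...' line."""
    m = trailing.search(line)
    if m is None:
        return line.strip(), []
    return line[:m.start()].strip(), m.group(1).split()

for line in ['Sam da Man J.D.', 'Green Eggs Jr. Ed.M.',
             'Cersei Lannister M.A. Ph.D.', 'J.D. Salinger']:
    print(split_name_and_degrees(line))
```

Because a degree must be preceded by whitespace and sit at the end of the line, leading initials such as "J.D." in "J.D. Salinger" are left in the name.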
I have a text file containing entries like this:
#markwarner VIRGINIA - Mark Warner
#senatorleahy VERMONT - Patrick Leahy NO
#senatorsanders VERMONT - Bernie Sanders
#orrinhatch UTAH - Orrin Hatch NO
#jimdemint SOUTH CAROLINA - Jim DeMint NO
#senmikelee UTAH -- Mike Lee
#kaybaileyhutch TEXAS - Kay Hutchison
#johncornyn TEXAS - John Cornyn
#senalexander TENNESSEE - Lamar Alexander
I have written the following to remove the 'NO' and the dashes using regular expressions:
import re
politicians = open('testfile.txt')
text = politicians.read()
# Grab the 'no' votes
# Should be 11 entries
regex = re.compile(r'(no\s#[\w+\d+\.]*\s\w+\s?\w+?\s?\W+\s\w+\s?\w+)', re.I)
no = regex.findall(text)
## Make the list a string
newlist = ' '.join(no)
## Replace the dashes in the string with a space
deldash = re.compile(r'\s-*\s')
a = deldash.sub(' ', newlist)
# Delete 'NO' in the string
delno = re.compile(r'NO\s')
b = delno.sub('', a)
# make the string into a list
# problem with #jimdemint SOUTH CAROLINA Jim DeMint
regex2 = re.compile(r'(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+)', re.I)
lst1 = regex2.findall(b)
for i in lst1:
    print(i)
When I run the code, it captures the twitter handle, state and full names other than the surname of Jim DeMint. I have stated that I want to ignore case for the regex.
Any ideas? Why is the expression not capturing this surname?
It's missing it because his state name contains two words: SOUTH CAROLINA
Have your second regex be this, it should help
(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+(?:\s\w+)?)
I added
(?:\s\w+)?
This is an optional, non-capturing group that matches a whitespace character followed by one or more word characters (letters, digits or underscore).
http://regexr.com?31fv5 shows that it properly matches the input with the NOs and dashes stripped
EDIT:
If you want one master regex to capture and split everything properly, after you remove the Nos and dashes, use
((#[\w]+?\s)((?:(?:[\w]+?)\s){1,2})((?:[\w]+?\s){2}))
Which you can play with here: http://regexr.com?31fvk
The full match is available in $1, the Twitter handle in $2, the state in $3 and the name in $4.
Each capturing group works as follows:
(#[\w]+?\s)
This matches a # sign followed by at least one word character, but as few as possible, up to the next whitespace character.
((?:(?:[\w]+?)\s){1,2})
This matches and captures 1 or two words, which should be the state. This only works because of the next piece, which MUST have two words
((?:[\w]+?\s){2})
This matches and captures exactly two words, where a word is defined as as few word characters as possible followed by a whitespace character.
text=re.sub(' (NO|-+)(?= |$)','',text)
And to capture everything:
re.findall(r'(#\w+) ([A-Z ]+[A-Z]) (.+?(?= #|$))',text)
Or all at once:
re.findall(r'(#\w+) ([A-Z ]+[A-Z])(?: NO| -+)? (.+?(?= #|$))',text)
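If the file keeps one entry per line, as in the question, a line-based variant may be easier to reason about. The following is only a sketch under that assumption: the name `entry` and the inline sample text are illustrative, and it treats the state as an all-uppercase run and the trailing "NO" as optional.

```python
import re

# A few lines from the question, used here as inline sample data.
text = """#markwarner VIRGINIA - Mark Warner
#senatorleahy VERMONT - Patrick Leahy NO
#jimdemint SOUTH CAROLINA - Jim DeMint NO
#senmikelee UTAH -- Mike Lee
"""

# handle, uppercase state (one or more words), dash(es), name, optional "NO"
entry = re.compile(
    r'^(#\w+)[ \t]+([A-Z][A-Z ]*?)[ \t]+-+[ \t]+(.+?)(?:[ \t]+NO)?[ \t]*$',
    re.MULTILINE,  # let ^ and $ match at every line, not just the whole string
)

entries = entry.findall(text)
for handle, state, name in entries:
    print(handle, state, name)
```

Here `re.MULTILINE` is what allows `^` and `$` to anchor at each line of the file.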