I would like to pick up whole words in a string that are separated by a space, comma, or period.
text = 'OTC GLUCOSAM-CHOND-MSM1-C-MANG-BOR test, dosage uncertain'
p = r"(?i)\b([A-Za-z]+[\s*|\,|\.]+)\b"
for m in regex.finditer(p, str(text)):
    print(m.group())
I expect to get:
OTC
GLUCOSAM-CHOND-MSM1-C-MANG-BOR
test
dosage
uncertain
but what I got:
OTC
BOR
test,
dosage
To get a list of the words you want, you can use the findall() function of the re module. Also, try changing the regular expression to the one shown below:
import re
text = 'OTC GLUCOSAM-CHOND-MSM1-C-MANG-BOR test, dosage uncertain'
# one or more word characters, optionally followed by hyphen-joined word parts
result = re.findall(r'\w+(?:-\w+)*', text)
print(result)
# outputs: ['OTC', 'GLUCOSAM-CHOND-MSM1-C-MANG-BOR', 'test', 'dosage', 'uncertain']
import re
text = 'OTC GLUCOSAM-CHOND-MSM1-C-MANG-BOR test, dosage uncertain'
p = r"[a-zA-Z-\d]*"
for m in re.finditer(p, str(text)):
    # the * quantifier also yields empty matches, so skip those
    if len(m.group().strip()) > 0:
        print(m.group())
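Another way to read the requirement (words separated by a space, comma, or period) is to split on those separators instead of matching the words. The following is a minimal sketch under that assumption; the variable name words is only illustrative.

```python
import re

text = 'OTC GLUCOSAM-CHOND-MSM1-C-MANG-BOR test, dosage uncertain'

# Split on runs of whitespace, commas, or periods; drop the empty pieces
# that appear when the string starts or ends with a separator.
words = [w for w in re.split(r'[\s,.]+', text) if w]
print(words)
```

Because hyphens are not in the separator set, hyphenated tokens such as GLUCOSAM-CHOND-MSM1-C-MANG-BOR stay in one piece.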
I have the following string for which I want to extract data:
text_example = '\nExample text \nTECHNICAL PARTICULARS\nLength oa: ...............189.9m\nLength bp: ........176m\nBreadth moulded: .......26.4m\nDepth moulded to main deck: ....9.2m\n'
Every variable I want to extract starts with \n
The value I want to get starts with a colon ':' followed by more than 1 dot
When it doesn't start with a colon followed by dots, I don't want to extract that value.
For example my preferred output looks like:
LOA = 189.9
LBP = 176.0
BM = 26.4
DM = 9.2
import re
text_example = '\nExample text \nTECHNICAL PARTICULARS\nLength oa: ...............189.9m\nLength bp: ........176m\nBreadth moulded: .......26.4m\nDepth moulded to main deck: ....9.2m\n'
# capture all the characters BEFORE the ':' character
variables = re.findall(r'(.*?):', text_example)
# matches all floats and integers (does not account for minus signs)
values = re.findall(r'(\d+(?:\.\d+)?)', text_example)
# zip into a dictionary (this assumes both regexes return the same number of results)
result = dict(zip(variables, values))
print(result)
--> {'Length oa': '189.9', 'Breadth moulded': '26.4', 'Length bp': '176', 'Depth moulded to main deck': '9.2'}
You can also build a single regex that captures the label and the value together:
re.findall(r'(\\n|\n)([A-Za-z\s]*)(?:(\:\s*\.+))(\d*\.*\d*)',text_example)[2]
('\n', 'Breadth moulded', ': .......', '26.4')
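A variant that ties each value to its own label in one pass may be more robust than zipping two separate result lists. The sketch below assumes every wanted line has the shape "label: ..dots..number" and requires at least two dots, as the question describes; the names pattern and particulars are illustrative only.

```python
import re

text_example = ('\nExample text \nTECHNICAL PARTICULARS\n'
                'Length oa: ...............189.9m\nLength bp: ........176m\n'
                'Breadth moulded: .......26.4m\nDepth moulded to main deck: ....9.2m\n')

# Label at the start of a line, then ':', optional spaces,
# at least two dots, and finally an integer or decimal number.
pattern = re.compile(r'^([^:\n]+):\s*\.{2,}\s*(\d+(?:\.\d+)?)', re.MULTILINE)

particulars = {label.strip(): float(value)
               for label, value in pattern.findall(text_example)}
print(particulars)
```

Lines without a colon followed by dots, such as "TECHNICAL PARTICULARS", are simply not matched, so they never reach the dictionary.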
I have a complex text where I am categorizing different keywords stored in a dictionary:
text = 'data-ls-static="1">Making Bio Implants, Drug Delivery and 3D Printing in Medicine,MEDICINE</h3>'
sector = {"med tech": ['Drug Delivery' '3D printing', 'medicine', 'medical technology', 'bio cell']}
This can successfully find my keywords and categorize them, with some limitations:
pattern = r'[a-zA-Z0-9]+'
[cat for cat in sector if any(x in re.findall(pattern,text) for x in sector[cat])]
The limitations that I cannot solve are:
For example, keywords like "Drug Delivery" that contain a space are not recognized and therefore not categorized.
I was not able to make the pattern case insensitive, as words like MEDICINE are not recognized. I tried to add (?i) to the pattern but it doesn't work.
The categorized keywords go into a pandas df, but they are printed inside []. I tried to loop over the results again to take the brackets out, but they are still there.
Data to pandas df:
ind_list = []
for site in url_list:
    ind = [cat for cat in indication if any(x in re.findall(pattern,soup_string) for x in indication[cat])]
    ind_list.append(ind)
websites['Indication'] = ind_list
Current output:
   Website       Sector                Sub-sector  Therapeutical Area            Focus  URL status
0  url3.com      [med tech]            []          []                            []     []
1  www.url1.com  [med tech, services]  []          [oncology, gastroenterology]  []     []
2  www.url2.com  [med tech, services]  []          [orthopedy]                   []     []
In the output I get [] that I'd like to avoid.
Can you help me with these points?
Thanks!
Here are some hints on the problems that can readily be spotted:
Why can't it match keywords like "Drug Delivery" that contain a space? This is because the regex pattern r'[a-zA-Z0-9]+' does not match a space. You can change it to r'[a-zA-Z0-9 ]+' (added a space after 9) if you also want to match a space. However, if you want to support other types of whitespace (e.g. \t, \n), you need to change this regex pattern further. Note that re.findall then returns whole runs of words, so the membership test only succeeds when a run equals the keyword exactly; the substring check further down avoids this.
Why isn't the match case insensitive? Your code fragment any(x in re.findall(pattern,text) for x in sector[cat]) requires x to have the same upper/lower case BOTH in the result of re.findall and in sector[cat]. This constraint cannot be bypassed by setting flags=re.I in the re.findall() call, because the flag only affects how the pattern matches, not the later membership test. I suggest converting both sides to the same case before checking, for example to lower case: any(x.lower() in re.findall(pattern, text.lower()) for x in sector[cat]). Here .lower() is applied to both text and x.
With the above 2 changes, it should allow you to capture some categorized keywords.
Actually, for this particular case, you may not need regular expressions and re.findall at all. You may just check e.g. sector[cat][i].lower() in text.lower(). That is, change the list comprehension as follows:
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
Edit
Test Run with 2-word phrase:
text = 'drug delivery'
sector = {"med tech": ['Drug Delivery', '3D printing', 'medicine', 'medical technology', 'bio cell']}
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
Output: # Successfully got the categorizing keyword even with dictionary values of different upper/lower cases
['med tech']
text = 'Drug Store fast delivery'
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
Output: # Correctly doesn't match with extra words in between
[]
You could try an approach other than regex.
I would suggest difflib when you need to match two similar words.
findall is pretty wasteful here since you are repeatedly breaking up the string for each keyword.
If you want to test whether the keyword is in the string:
[cat for cat in sector if any(re.search(word, text, re.I) for word in sector[cat])]
# Output: ['med tech']
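Putting the hints from these answers together, one possible sketch combines a case-insensitive search with word boundaries, so that multi-word keywords match and "medicine" does not match inside a longer word. The function name categorize, the second category "services" with its keyword, and the idea of joining the result into a plain string for the DataFrame cell are assumptions for illustration, not part of the original code.

```python
import re

text = 'data-ls-static="1">Making Bio Implants, Drug Delivery and 3D Printing in Medicine,MEDICINE</h3>'
sector = {
    "med tech": ['Drug Delivery', '3D printing', 'medicine', 'medical technology', 'bio cell'],
    "services": ['consulting'],  # hypothetical second category
}

def categorize(text, mapping):
    """Return the categories whose keywords occur in text as whole words or phrases."""
    found = []
    for category, keywords in mapping.items():
        # re.escape guards against keywords that contain regex metacharacters
        alternation = '|'.join(re.escape(k) for k in keywords)
        if re.search(r'\b(?:' + alternation + r')\b', text, re.IGNORECASE):
            found.append(category)
    return found

categories = categorize(text, sector)
# Joining the list gives a plain string, so a table cell shows no brackets.
cell_value = ', '.join(categories)
print(categories, repr(cell_value))
```

Storing cell_value instead of the list is one way to avoid the [] in the printed DataFrame; an empty result becomes an empty string.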
Given a file like this:
# For more information about CC-CEDICT see:
# http://cc-cedict.org/wiki/
A A [A] /(slang) (Tw) to steal/
AA制 AA制 [A A zhi4] /to split the bill/to go Dutch/
AB制 AB制 [A B zhi4] /to split the bill (where the male counterpart foots the larger portion of the sum)/(theater) a system where two actors take turns in acting the main role, with one actor replacing the other if either is unavailable/
A咖 A咖 [A ka1] /class "A"/top grade/
A圈兒 A圈儿 [A quan1 r5] /at symbol, #/
A片 A片 [A pian4] /adult movie/pornography/
I want to build a json object that:
skips lines that start with #
breaks each line into 4 parts:
traditional characters (from the start ^ until the first space)
simplified characters (from the first space to the second)
pinyin (between the square brackets [...])
the gloss, from the first / to the last / (note that there can be slashes within the gloss, e.g. /adult movie/pornography/)
I am currently doing it as such:
>>> for line in text.split('\n'):
...     if line.startswith('#'): continue;
...     line = line.strip()
...     simple, _, line = line.partition(' ')
...     trad, _, line = line.partition(' ')
...     print simple, trad
...
A A
AA制 AA制
AB制 AB制
A咖 A咖
A圈兒 A圈儿
A片 A片
To get the [...], I had to do:
>>> import re
>>> line = "A片 A片 [A pian4] /adult movie/pornography/"
>>> simple, _, line = line.partition(' ')
>>> trad, _, line = line.partition(' ')
>>> re.findall(r'\[.*\]', line)[0].strip('[]')
'A pian4'
And to find the /.../, I had to do:
>>> line = "A片 A片 [A pian4] /adult movie/pornography/"
>>> re.findall(r'\/.*\/$', line)[0].strip('/')
'adult movie/pornography'
How do I use regex groups to catch all of them at once, without doing multiple partitions/splits/findall?
I could extract the info using regular expressions instead. This way, you can catch blocks in groups and then handle them as desired:
import re
with open("myfile") as f:
    data = f.read().split('\n')
    for line in data:
        if line.startswith('#'): continue
        m = re.search(r"^([^ ]*) ([^ ]*) \[([^]]*)\] \/(.*)\/$", line)
        if m:
            print(m.groups())
The regular expression splits the string into the following groups:
^([^ ]*) ([^ ]*) \[([^]]*)\] \/(.*)\/$
  ^^^^^   ^^^^^     ^^^^^       ^^
  1)      2)        3)          4)
That is:
the first word.
the second word.
the text within [ and ].
the text from / up to the / before the end of the line.
It returns:
('A', 'A', 'A', '(slang) (Tw) to steal')
('AA制', 'AA制', 'A A zhi4', 'to split the bill/to go Dutch')
('AB制', 'AB制', 'A B zhi4', 'to split the bill (where the male counterpart foots the larger portion of the sum)/(theater) a system where two actors take turns in acting the main role, with one actor replacing the other if either is unavailable')
('A咖', 'A咖', 'A ka1', 'class "A"/top grade')
('A圈兒', 'A圈儿', 'A quan1 r5', 'at symbol, #')
('A片', 'A片', 'A pian4', 'adult movie/pornography')
p = re.compile(r"(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/$")
m = p.match(line)
if m:
    simple, trad, pinyin, gloss = m.groups()
See https://docs.python.org/2/howto/regex.html#grouping for more details.
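Since the stated goal is a JSON object, named groups make the step from match to dictionary short. This is a minimal sketch, assuming the four-part line layout from the question; the key names (traditional, simplified, pinyin, gloss) and the choice to split the gloss on / into a list are assumptions, not something the question specifies.

```python
import json
import re

text = """# For more information about CC-CEDICT see:
# http://cc-cedict.org/wiki/
A A [A] /(slang) (Tw) to steal/
A片 A片 [A pian4] /adult movie/pornography/"""

# One named group per field; the gloss group spans from the first / to the last /.
entry = re.compile(
    r'^(?P<traditional>\S+)\s+(?P<simplified>\S+)\s+'
    r'\[(?P<pinyin>[^\]]*)\]\s+/(?P<gloss>.*)/$'
)

entries = []
for line in text.splitlines():
    line = line.strip()
    if not line or line.startswith('#'):
        continue  # skip blank lines and comment lines
    m = entry.match(line)
    if m:
        record = m.groupdict()
        # Assumption: each /-separated sense becomes its own list item.
        record['gloss'] = record['gloss'].split('/')
        entries.append(record)

print(json.dumps(entries, ensure_ascii=False, indent=2))
```

Passing ensure_ascii=False keeps the Chinese characters readable in the output instead of escaping them.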
This might help:
preg = re.compile(r'^(?<!#)(\w+)\s(\w+)\s(\[.*?\])\s/(.+)/$',
                  re.MULTILINE | re.UNICODE)
with open('your_file') as f:
    for line in f:
        match = preg.match(line)
        if match:
            print(match.groups())
I created following regex to match all the four groups:
^(.*)\s(.*)\s(\[.*\])\s(\/.*\/)
This assumes exactly one whitespace character between the groups; if there can be more, replace each \s with \s+.
I'm using Python to search some words (also multi-token) in a description (string).
To do that I'm using a regex like this
result = re.search(word, description, re.IGNORECASE)
if result:
    print("Found: " + result.group())
But what I need is to obtain the 2 words before and after the match. For example, if I have something like this:
Parking here is horrible, this shop sucks.
"here is" is the phrase that I am looking for. So after I have matched it with my regex, I need the 2 words (if they exist) before and after the match.
In the example:
Parking here is horrible, this
"Parking" and "horrible, this" are the words that I need.
ATTENTION
The description can be very long and the pattern "here is" can appear multiple times.
How about string operations?
line = 'Parking here is horrible, this shop sucks.'
before, term, after = line.partition('here is')
before = before.rsplit(maxsplit=2)[-2:]
after = after.split(maxsplit=2)[:2]
Result:
>>> before
['Parking']
>>> after
['horrible,', 'this']
Try this regex: ((?:[a-z,]+\s+){0,2})here is\s+((?:[a-z,]+\s*){0,2})
with re.findall and re.IGNORECASE set
I would do it like this (edit: added anchors to cover most cases):
(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)
Like this you will always have 4 groups (might have to be trimmed) with the following behavior:
If group 1 is empty, there was no word before (group 2 is empty too)
If group 2 is empty, there was only one word before (group 1)
If group 1 and 2 are not empty, they are the words before in order
If group 3 is empty, there was no word after
If group 4 is empty, there was only one word after
If group 3 and 4 are not empty, they are the words after in order
Based on your clarification, this becomes a bit more complicated. The solution below deals with scenarios where the searched pattern may in fact also be in the two preceding or two subsequent words.
import re

line = "Parking here is horrible, here is great here is mediocre here is here is "
print(line)
pattern = "here is"
r = re.search(pattern, line, re.IGNORECASE)
output = []
if r:
    while line:
        # consume the text up to and including the next occurrence
        before, match, line = line.partition(pattern)
        if match:
            if not output:
                before = before.split()[-2:]
            else:
                # the previous match itself counts as preceding words
                before = ' '.join([pattern, before]).split()[-2:]
            after = line.split()[:2]
            output.append((before, after))
print(output)
Output from my example would be:
[(['Parking'], ['horrible,', 'here']), (['is', 'horrible,'], ['great', 'here']), (['is', 'great'], ['mediocre', 'here']), (['is', 'mediocre'], ['here', 'is']), (['here', 'is'], [])]
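For the case where the phrase occurs several times, a shorter alternative is to let re.finditer locate every occurrence and slice the surrounding text around each match. This is only a sketch: the function name context_words and the parameter n are made up for illustration, and it treats an earlier or later occurrence of the phrase as ordinary neighbouring words.

```python
import re

def context_words(description, phrase, n=2):
    """Yield (before, match, after) for every case-insensitive occurrence of phrase."""
    for m in re.finditer(re.escape(phrase), description, re.IGNORECASE):
        before = description[:m.start()].split()[-n:]  # up to n words before
        after = description[m.end():].split()[:n]      # up to n words after
        yield before, m.group(), after

results = list(context_words('Parking here is horrible, this shop sucks.', 'here is'))
print(results)
```

Note that a plain substring search also finds "here is" inside "there is"; wrapping the escaped phrase in \b anchors would be one way to rule that out.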
The purpose of this code is to make a program that searches for a person's name (on Wikipedia, specifically) and uses keywords to come up with reasons why that person is significant.
I'm having issues with this specific line, "if fact_amount < 5 and (terms in sentence.lower()):", because I get this error: "TypeError: coercing to Unicode: need string or buffer, list found".
If you could offer some guidance it would be greatly appreciated, thank you.
import requests
import nltk
import re
#You will need to install requests and nltk
terms = ['pronounced'
'was a significant'
'major/considerable influence'
'one of the (X) most important'
'major figure'
'earliest'
'known as'
'father of'
'best known for'
'was a major']
names = ["Nelson Mandela","Bill Gates","Steve Jobs","Lebron James"]
#List of people that you need to get info from
for name in names:
    print name
    print '==============='
    #Goes to the wikipedia page of the person
    r = requests.get('http://en.wikipedia.org/wiki/%s' % (name))
    #Parses the raw html into text
    raw = nltk.clean_html(r.text)
    #Tries to split each sentence.
    #sort of buggy though
    #For example St. Mary will split after St.
    sentences = re.split('[?!.][\s]*',raw)
    fact_amount = 0
    for sentence in sentences:
        #I noticed that important things came after 'he was' and 'she was'
        #Seems to work for my sample list
        #Also there may be buggy sentences, so I return 5 instead of 3
        if fact_amount < 5 and (terms in sentence.lower()):
            #remove the reference notation that wikipedia has
            #ex [ 33 ]
            sentence = re.sub('[ [0-9]+ ]', '', sentence)
            #removes newlines
            sentence = re.sub('\n', '', sentence)
            #removes trailing and leading whitespace
            sentence = sentence.strip()
            fact_amount += 1
            #sentence is formatted. Print it out
            print sentence + '.'
    print
You should be checking it the other way around:
sentence.lower() in terms
terms is a list and sentence.lower() is a string. You can check whether a particular string is in a list, but you cannot check whether a list is in a string. Note that this form is only true when the whole lowercased sentence equals one of the terms.
You might mean any(t in sentence.lower() for t in terms), which checks whether any term from the terms list occurs in the sentence string.
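The any() form can be sketched in isolation as below. The helper name mentions_term and the sample sentences are illustrative assumptions. The terms list here is shortened and written with commas between the items, because adjacent string literals without commas are concatenated by Python into one long string.

```python
# Commas matter: 'a' 'b' without a comma would become the single string 'ab'.
terms = ['was a significant', 'major figure', 'known as', 'father of', 'best known for']

def mentions_term(sentence, terms):
    """Return True if any term occurs as a substring of the lowercased sentence."""
    lowered = sentence.lower()
    return any(term in lowered for term in terms)

print(mentions_term('He is best known for his work on relativity.', terms))
print(mentions_term('The weather was mild that year.', terms))
```

The first call prints True and the second prints False, since only the first sentence contains one of the terms.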