I have several strings from which I need to extract the block numbers. The block numbers follow formats like "3rd block", "pine block", "block 2" and "block no 4". Note that these are just the formats; the actual numbers can change. I have combined them with OR conditions in my regex.
The problem is that at times the regex also pulls in the word before "block": for "main phase block 2" I need "block 2" to be extracted, not the preceding word. Using re.search only returns the first result, and the OR alternation has its own limitations.
What I want is to add exceptions or condition my regex with something like
if 1 or 2 digits (like 23, 3, 6, 7, etc.) occur before the word "block", extract "block" together with the word that follows it.
Eg :
string = "rmv clusters phase 2 block 1 , flat no 209 dev." #extract "block 1" and not "2 block".
if words "phase , apartment or building" come before "block", extract word that follows block (irrespective of whether its a number or word)
Eg:
string 2 = "sky line apartments block 2 chandra layout" #extract "block 2" and not "apartments block"
Here is what I have done. But I've got no idea about adding conditions.
p = re.compile(r'(block[^a-z]\s\d*)|(\w+\sblock[^a-z])|(block\sno\s\d+)')
q = p.search(str)
This is part of a larger function.
Tested on Python 2.7 and 3.3.
import re
strings = ("rmv clusters phase 2 block 1 , flat no 209 dev."
"sky line apartments block 2 chandra layout"
"foo bar 99 block baz") # tests rule 1.
Here are the rules you stated you wanted:
if 1 or 2 digits (like 23, 3, 6, 7, etc.) occur before the word "block", extract "block" together with the word that follows it.
if the words "phase", "apartment" or "building" come before "block", extract the word that follows "block" (irrespective of whether it's a number or a word). (I'm inferring you want the word "block" too.)
So
regex = re.compile(r'''
    (?:\d{1,2}\s)(block\s\w*)                     # rule 1
    |                                             # or
    (?:(phase|apartment|building).*?)(block\s\w+) # rule 2
    ''', re.X)
found = regex.finditer(strings)
for i in found:
    print(i.groups())
prints:
(None, 'phase', 'block 1')
(None, 'apartment', 'block 2')
('block baz', None, None)
None is the default value for a group that did not take part in the match, so you can pick a preference and use the short-circuiting or operator to return the first group if it's non-empty, or the other group if the first is empty (i.e. evaluates as False in a Python boolean context).
>>> found = regex.finditer(strings)
>>> for i in found:
... print(i.group(1) or i.group(3))
...
block 1
block 2
block baz
So to put this thing into a simple function:
def block(str):
    regex = re.compile(r'''
        (?:\d{1,2}\s)(block\s\w*)                     # rule 1
        |                                             # or
        (?:(phase|apartment|building).*?)(block\s\w+) # rule 2
        ''', re.X)
    match = regex.search(str)
    if not match:
        return ''
    else:
        return match.group(1) or match.group(3) or ''
usage:
>>> block("foo bar 99 block baz")
'block baz'
>>> block("sky line apartments block 2 chandra layout")
'block 2'
Why don't you write multiple regexes? See the following snippet in Python 3:
def getBlockMatch(string):
    import re
    p1Regex = re.compile(r'block\s+\d+')
    p2Regex = re.compile(r'(block[^a-z]\s\d*)|(\w+\sblock[^a-z])|(block\sno\s\d+)')
    if p1Regex.search(string) is not None:
        return p1Regex.findall(string)
    else:
        return p2Regex.findall(string)
string = "rmv clusters phase 2 block 1 , flat no 209 dev."
print(getBlockMatch(string))
string = "sky line apartments block 2 chandra layout"
print(getBlockMatch(string))
Outputs:
['block 1']
['block 2']
>>> import re
>>> string = "rmv clusters phase 2 block 1 , flat no 209 dev."
>>> string2 = "sky line apartments block 2 chandra layout"
>>> print re.findall(r'block\s+\d+', string)
['block 1']
>>> print re.findall(r'block\s+\d+', string2)
['block 2']
Related
I have a long list of entries in a file in the following format:
<space><space><number><space>=<space>"<word/phrase/sentence>"
e.g.
12345 = "Section 3 is ready for review"
24680 = "Bob to review Chapter 4"
I need to find a way of inserting additional text at the beginning of the word/phrase/sentence, but only if it doesn't start with one of several key words.
Additional text: 'Complete: '
List of key words: key_words_list = ['Section', 'Page', 'Heading']
e.g.
12345 = "Section 3 is ready for review" (no changes needed - sentence starts with 'Section' which is in the list)
24680 = "Complete: Bob to review Chapter 4" ('Complete: ' added to start of sentence because first word wasn't in list)
This could be done with a lot of string splitting and if statements, but regex seems like it should be a more concise and much neater solution. I have the following, which doesn't take the list into account:
for line in lines:
    line = re.sub(r'(^\s\s[0-9]+\s=\s")', r'\1Complete: ', line)
I also have some code that manages to identify the lines that require changes:
print([w for w in re.findall('^\s\s[0-9]+\s=\s"([\w+=?\s?,?.?]+)"', line) if w not in key_words_list])
Is regex the best option for what I need and if so, what am I missing?
Example inputs:
12345 = "Section 3 is ready for review"
24680 = "Bob to review Chapter 4"
Example outputs:
12345 = "Section 3 is ready for review"
24680 = "Complete: Bob to review Chapter 4"
You can use a regex like
^\s{2}[0-9]+\s=\s"(?!(?:Section|Page|Heading)\b)
Details:
^ - start of string
\s{2} - two whitespaces
[0-9]+ - one or more digits
\s=\s - a = enclosed with a single whitespace on both ends
" - a " char
(?!(?:Section|Page|Heading)\b) - a negative lookahead that fails the match if there is Section, Page or Heading whole word immediately to the right of the current location.
See the Python demo:
import re
texts = ['  12345 = "Section 3 is ready for review"', '  24680 = "Bob to review Chapter 4"']
add = 'Complete: '
key_words_list = ['Section', 'Page', 'Heading']
pattern = re.compile(fr'^\s{{2}}[0-9]+\s=\s"(?!(?:{"|".join(key_words_list)})\b)')
for text in texts:
    print(pattern.sub(fr'\g<0>{add}', text))
# => 12345 = "Section 3 is ready for review"
# 24680 = "Complete: Bob to review Chapter 4"
For all intents and purposes, I am a Python user and use the Pandas library on a daily basis. Named capture groups in regex are extremely useful. So, for example, it is relatively trivial to extract occurrences of specific words or phrases and to produce concatenated strings of the results in new columns of a dataframe. An example of how this might be achieved is given below:
import numpy as np
import pandas as pd
import re
myDF = pd.DataFrame(['Here is some text',
'We all love TEXT',
'Where is the TXT or txt textfile',
'Words and words',
'Just a few works',
'See the text',
'both words and text'],columns=['origText'])
print("Original dataframe\n------------------")
print(myDF)
# Define regex to find occurrences of 'text' or 'word' as separate named groups
myRegex = """(?P<textOcc>t[e]?xt)|(?P<wordOcc>word)"""
myCompiledRegex = re.compile(myRegex,flags=re.I|re.X)
# Extract all occurrences of 'text' or 'word'
myMatchesDF = myDF['origText'].str.extractall(myCompiledRegex)
print("\nDataframe of matches (with multi-index)\n--------------------")
print(myMatchesDF)
# Collapse resulting multi-index dataframe into single rows with concatenated fields
myConcatMatchesDF = myMatchesDF.groupby(level = 0).agg(lambda x: '///'.join(x.fillna('')))
myConcatMatchesDF = myConcatMatchesDF.replace(to_replace = "^/+|/+$",value = "",regex = True) # Remove '///' at start and end of strings
print("\nCollapsed and concatenated matches\n----------------------------------")
print(myConcatMatchesDF)
myDF = myDF.join(myConcatMatchesDF)
print("\nFinal joined dataframe\n----------------------")
print(myDF)
This produces the following output:
Original dataframe
------------------
origText
0 Here is some text
1 We all love TEXT
2 Where is the TXT or txt textfile
3 Words and words
4 Just a few works
5 See the text
6 both words and text
Dataframe of matches (with multi-index)
--------------------
        textOcc wordOcc
  match
0 0        text     NaN
1 0        TEXT     NaN
2 0         TXT     NaN
  1         txt     NaN
  2        text     NaN
3 0         NaN    Word
  1         NaN    word
5 0        text     NaN
6 0         NaN    word
  1        text     NaN
Collapsed and concatenated matches
----------------------------------
            textOcc      wordOcc
0              text
1              TEXT
2  TXT///txt///text
3                    Word///word
5              text
6              text         word
Final joined dataframe
----------------------
                           origText           textOcc      wordOcc
0                 Here is some text              text
1                  We all love TEXT              TEXT
2  Where is the TXT or txt textfile  TXT///txt///text
3                   Words and words                    Word///word
4                  Just a few works               NaN          NaN
5                      See the text              text
6               both words and text              text         word
I've printed each stage to try to make it easy to follow.
The question is, can I do something similar in R. I've searched the web but can't find anything that describes the use of named groups (although I'm an R-newcomer and so might be searching for the wrong libraries or descriptive terms).
I've been able to identify those items that contain one or more matches but I cannot see how to extract specific matches or how to make use of the named groups. The code I have so far (using the same dataframe and regex as in the Python example above) is:
origText = c('Here is some text','We all love TEXT','Where is the TXT or txt textfile','Words and words','Just a few works','See the text','both words and text')
myDF = data.frame(origText)
myRegex = "(?P<textOcc>t[e]?xt)|(?P<wordOcc>word)"
myMatches = grep(myRegex,myDF$origText,perl=TRUE,value=TRUE,ignore.case=TRUE)
myMatches
[1] "Here is some text" "We all love TEXT" "Where is the TXT or txt textfile" "Words and words"
[5] "See the text" "both words and text"
myMatchesRow = grep(myRegex,myDF$origText,perl=TRUE,value=FALSE,ignore.case=TRUE)
myMatchesRow
[1] 1 2 3 4 6 7
The regex seems to be working and the correct rows are identified as containing a match (i.e. all except row 5 in the above example). However, my question is, can I produce an output that is similar to that produced by Python where the specific matches are extracted and listed in new columns in the dataframe that are named using the group names contained in the regex?
Base R does capture the information about the names, but it doesn't have a good helper to extract the matches by name. I wrote a wrapper called regcapturedmatches to help. You can use it with:
myRegex = "(?<textOcc>t[e]?xt)|(?<wordOcc>word)"
m<-regexpr(myRegex, origText, perl=T, ignore.case=T)
regcapturedmatches(origText,m)
Which returns
textOcc wordOcc
[1,] "text" ""
[2,] "TEXT" ""
[3,] "TXT" ""
[4,] "" "Word"
[5,] "" ""
[6,] "text" ""
[7,] "" "word"
I'm using Python to search some words (also multi-token) in a description (string).
To do that I'm using a regex like this
result = re.search(word, description, re.IGNORECASE)
if result:
    print("Trovato: " + result.group())
But what I need is to obtain the first 2 words before and after the match. For example, if I have something like this:
Parking here is horrible, this shop sucks.
"here is" is the word that I looking for. So after I matched it with my regex I need the 2 words (if exists) before and after the match.
In the example:
Parking here is horrible, this
"Parking" and horrible, this are the words that I need.
ATTENTION
The description can be very long, and the pattern "here is" can appear multiple times.
How about string operations?
line = 'Parking here is horrible, this shop sucks.'
before, term, after = line.partition('here is')
before = before.rsplit(maxsplit=2)[-2:]
after = after.split(maxsplit=2)[:2]
Result:
>>> before
['Parking']
>>> after
['horrible,', 'this']
Try this regex: ((?:[a-z,]+\s+){0,2})here is\s+((?:[a-z,]+\s*){0,2})
with re.findall and re.IGNORECASE set
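A quick sketch of how that pattern might be applied (using the sample sentence from the question; the exact output is from my own run, so treat it as illustrative):
import re

line = 'Parking here is horrible, this shop sucks.'
pattern = r'((?:[a-z,]+\s+){0,2})here is\s+((?:[a-z,]+\s*){0,2})'
# Each match is a tuple: (up to two words before, up to two words after).
print(re.findall(pattern, line, re.IGNORECASE))
# [('Parking ', 'horrible, this ')]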
I would do it like this (edit: added anchors to cover most cases):
(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)
Like this you will always have 4 groups (might have to be trimmed) with the following behavior:
If group 1 is empty, there was no word before (group 2 is empty too)
If group 2 is empty, there was only one word before (group 1)
If group 1 and 2 are not empty, they are the words before in order
If group 3 is empty, there was no word after
If group 4 is empty, there was only one word after
If group 3 and 4 are not empty, they are the words after in order
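For example, a rough sketch of how those four groups could be consumed in Python (the trimming and filtering are my own additions, not part of the original answer):
import re

pattern = re.compile(r'(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)', re.IGNORECASE)

line = 'Parking here is horrible, this shop sucks.'
m = pattern.search(line)
if m:
    # The groups keep their surrounding whitespace, so trim them and drop empties.
    words_before = [g.strip() for g in (m.group(1), m.group(2)) if g.strip()]
    words_after = [g.strip() for g in (m.group(3), m.group(4)) if g.strip()]
    print(words_before, words_after)  # ['Parking'] ['horrible,', 'this']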
Based on your clarification, this becomes a bit more complicated. The solution below deals with scenarios where the searched pattern may in fact also be in the two preceding or two subsequent words.
line = "Parking here is horrible, here is great here is mediocre here is here is "
print line
pattern = "here is"
r = re.search(pattern, line, re.IGNORECASE)
output = []
if r:
while line:
before, match, line = line.partition(pattern)
if match:
if not output:
before = before.split()[-2:]
else:
before = ' '.join([pattern, before]).split()[-2:]
after = line.split()[:2]
output.append((before, after))
print output
Output from my example would be:
[(['Parking'], ['horrible,', 'here']), (['is', 'horrible,'], ['great', 'here']), (['is', 'great'], ['mediocre', 'here']), (['is', 'mediocre'], ['here', 'is']), (['here', 'is'], [])]
I want to find the number of times a substring occurred in a string. I was doing this
termCount = content.count(term)
But if I search for "Ford", it returns results like:
"Ford Motors" Result: 1 Correct
"cannot afford Ford" Result: 2 Incorrect
"ford is good" Result: 1 Correct
The search term can contain multiple words, like "Ford Motors" or "Ford Auto".
For example, if I search for "Ford Motor":
"Ford Motors" Result: 1 Correct
"cannot afford Ford Motor" Result: 1 Correct
"Ford Motorway" Result: 1 InCorrect
What I want is to search case-insensitively and match whole words only. That is, if I search for a substring, it should be matched as a whole word or phrase (in the case of multiple words), not as part of another word. I also need the count of the occurrences. How do I achieve this?
You can use regex; in this case use re.findall and then take the length of the matched list:
re.findall(r'\byour_term\b',s)
Demo
>>> s="Ford Motors cannot afford Ford Motor Ford Motorway Ford Motor."
>>> import re
>>> def counter(str,term):
... return len(re.findall(r'\b{}\b'.format(term),str))
...
>>> counter(s,'Ford Motor')
2
>>> counter(s,'Ford')
4
>>> counter(s,'Fords')
0
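Note the question also asked for case-insensitive matching, which the snippet above doesn't do. A possible variant (my own tweak, not part of the original answer) passes re.IGNORECASE and escapes the term in case it contains regex metacharacters:
import re

def counter(text, term):
    # \b...\b keeps the match to whole words/phrases; re.escape guards against
    # metacharacters in the term; re.IGNORECASE makes the search case-insensitive.
    return len(re.findall(r'\b{}\b'.format(re.escape(term)), text, re.IGNORECASE))

s = "Ford Motors cannot afford Ford Motor Ford Motorway ford motor."
print(counter(s, 'Ford Motor'))  # 2
print(counter(s, 'ford'))        # 4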
I would split the strings by spaces so that we have independent words and then from there I would carry out the count.
terms = ['Ford Motors', 'cannot afford Ford', 'ford is good']
splitWords = []
for term in terms:
    # take each string in the list and split it into words,
    # then add these words to a list called splitWords.
    splitWords.extend(term.lower().split())
print(splitWords.count("ford"))
Hello all. I want to pick up the texts 'DesignerXXX' from a text file which contains the contents below:
C DesignerTEE edBore 1 1/42006
Cylinder SingleVerticalB DesignerHHJ e 1 1/8Cooling 1
EngineBore 11/16 DesignerTDT 8Length 3Width 3
EngineCy DesignerHEE Inline2008Bore 1
Height 4TheChallen DesignerTET e 1Stroke 1P 305
Height 8C 606Wall15ccG DesignerQBG ccGasEngineJ 142
Height DesignerEQE C 60150ccGas2007
An idea is to use 'Designer' as a key and split each line into 2 parts: before the key and after the key.
file_object = open('C:\\file.txt')
lines = file_object.readlines()
for line in lines:
    if 'Designer' in line:
        where = line.find('Designer')
        before = line[0:where]
        after = line[where:len(line)]
file_object.close()
In the 'before the key' part, I need to find the LAST space (' ') and replace it with another symbol/character.
In the 'after the key' part, I need to find the FIRST space (' ') and replace it with another symbol/character.
Then I can slice the line and pick up what I want according to the new symbols/characters.
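Roughly, I imagine something like this (just a sketch of the idea, with '|' as an arbitrary marker):
line = 'Cylinder SingleVerticalB DesignerHHJ e 1 1/8Cooling 1'  # one sample line
where = line.find('Designer')
before, after = line[:where], line[where:]

i = before.rfind(' ')  # the LAST space before the key
if i != -1:
    before = before[:i] + '|' + before[i + 1:]

j = after.find(' ')  # the FIRST space after the key
if j != -1:
    after = after[:j] + '|' + after[j + 1:]

print((before + after).split('|')[1])  # DesignerHHJ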
Is there a better way to pick up the wanted text? If not, how can I replace those particular spaces?
With the string replace function I can limit how many replacements are made, but not exactly which occurrences get replaced. How can I do that?
thanks
Using regular expressions, it's a trivial task:
>>> s = '''C DesignerTEE edBore 1 1/42006
... Cylinder SingleVerticalB DesignerHHJ e 1 1/8Cooling 1
... EngineBore 11/16 DesignerTDT 8Length 3Width 3
... EngineCy DesignerHEE Inline2008Bore 1
... Height 4TheChallen DesignerTET e 1Stroke 1P 305
... Height 8C 606Wall15ccG DesignerQBG ccGasEngineJ 142
... Height DesignerEQE C 60150ccGas2007'''
>>> import re
>>> exp = 'Designer[A-Z]{3}'
>>> re.findall(exp, s)
['DesignerTEE', 'DesignerHHJ', 'DesignerTDT', 'DesignerHEE', 'DesignerTET', 'DesignerQBG', 'DesignerEQE']
The regular expression is Designer[A-Z]{3}, which means the letters Designer followed by exactly three letters from capital A to capital Z.
So it won't match all of DesignerABCD (4 letters), and it won't match Designer123 (123 are not letters).
It also won't match Designerabc (abc are small letters). To make it ignore the case, you can pass an optional flag re.I as a third argument; but this will also match designerabc (you have to be very specific with regular expressions).
So, to make it match Designer followed by exactly 3 upper- or lower-case letters, you'd have to change the expression to Designer[A-Za-z]{3}.
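A small check of that case-sensitivity point (a sketch using names modelled on the sample above):
import re

sample = 'Designerabc DesignerTEE'
print(re.findall(r'Designer[A-Z]{3}', sample))         # ['DesignerTEE']
print(re.findall(r'Designer[A-Z]{3}', sample, re.I))   # ['Designerabc', 'DesignerTEE']
print(re.findall(r'Designer[A-Za-z]{3}', sample))      # ['Designerabc', 'DesignerTEE']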
If you want to search and replace, then you can use re.sub for substituting matches; so if I want to replace all matches with the word 'hello':
>>> x = re.sub(exp, 'hello', s)
>>> print(x)
C hello edBore 1 1/42006
Cylinder SingleVerticalB hello e 1 1/8Cooling 1
EngineBore 11/16 hello 8Length 3Width 3
EngineCy hello Inline2008Bore 1
Height 4TheChallen hello e 1Stroke 1P 305
Height 8C 606Wall15ccG hello ccGasEngineJ 142
Height hello C 60150ccGas2007
And what if there are characters both before and after 'Designer', and the length of those characters is not fixed? I tried
'[Aa-zZ]Designer[Aa-zZ]{0~9}', but it doesn't work.
For these things, there are special characters in regular expressions. Briefly summarized below:
When you want to say "1 or more, but at least 1", use +
When you want to say "0 or any number, but there maybe none", use *
When you want to say "none but if it exists, only repeats once" use ?
You use this after the expression you want to be modified with the "repetition" modifiers.
For more on this, have a read through the documentation.
Now, your requirement is "there are characters but the length is not fixed"; based on this, we have to use +.
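Under that assumption, a minimal sketch might look like this (the character class is my guess at what counts as a 'character' here; use * instead of + if the surrounding characters may be absent):
import re

s = 'prefixDesignerTEEsuffix and DesignerHHJ alone'
# '+' requires at least one letter on each side of 'Designer'.
print(re.findall(r'[A-Za-z]+Designer[A-Za-z]+', s))
# ['prefixDesignerTEEsuffix']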
Try re.sub. The regular expression matches your keyword surrounded by spaces. The second parameter of sub replaces the surrounding spaces with your_special_char (a hyphen in my script).
>>> import re
>>> with open('file.txt') as file_object:
... your_special_char = '-'
... for line in file_object:
... formated_line = re.sub(r'(\s)(Designer[A-Z]{3})(\s)', r'%s\2%s' % (your_special_char,your_special_char), line)
... print formated_line
...
C -DesignerTEE-edBore 1 1/42006
Cylinder SingleVerticalB-DesignerHHJ-e 1 1/8Cooling 1
EngineBore 11/16-DesignerTDT-8Length 3Width 3
EngineCy-DesignerHEE-Inline2008Bore 1
Height 4TheChallen-DesignerTET-e 1Stroke 1P 305
Height 8C 606Wall15ccG-DesignerQBG-ccGasEngineJ 142
Height-DesignerEQE-C 60150ccGas2007
Maroun Maroun mentioned 'Why not simply split the string?', so I'm guessing one working way is:
import re
file_object = open('C:\\file.txt')
lines = file_object.readlines()
b = []
for line in lines:
    a = line.split()
    for aa in a:
        b.append(aa)
for bb in b:
    if 'Designer' in bb:
        print bb
file_object.close()