I have a footer that I extracted from a PDF using regex. An example footer is shown below:
footer_text = 'company name. (ABC) Q1 2020 Here is some text 01-Jan-2019 1-888-1234567 www.company.com 2 Copyright 2001-2019 some relevant text here'
I want to find this string throughout my text and replace it with a space, since I don't need it in the middle of my text extraction. However, the footer contains the page number, which changes each time, so it is not a simple find and replace. I am able to find the page number using
result = re.search(r"\s[\d]\s", footer_text)
But I don't know how to make this 2 match any number during my find and replace. Any pointers?
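Assuming that the footer text does contain something that matches r'\s\d+\s' (I am allowing for page numbers >= 10), you first want to create a regex by replacing the page number with the regex that matches it. Note that the backslashes in the replacement string are doubled: since Python 3.7, re.sub raises an error for an unknown escape such as \s in the replacement.
regex = re.sub(r'\\ \d+\\ ', r'\\s\\d+\\s', re.escape(footer_text))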
Now you can match any footer regardless of page number. The code then is:
>>> import re
...
... footer_text = 'company name. (ABC) Q1 2020 Here is some text 01-Jan-2019 1-888-1234567 www.company.com 11 Copyright 2001-2019some relevant text here'
...
... regex = re.sub(r'\\ \d+\\ ', r'\\s\\d+\\s', re.escape(footer_text)) # doubled backslashes are needed in the replacement since Python 3.7
... replacement = ' ' # a single space (should this instead be '' for an empty string?)
...
... some_text = "abc" + footer_text + "def"
... print(regex)
... print(some_text)
... print(re.sub(regex, replacement, some_text))
...
company\ name\.\ \(ABC\)\ Q1\s\d+\sHere\ is\ some\ text\ 01\-Jan\-2019\ 1\-888\-1234567\ www\.company\.com\s\d+\sCopyright\ 2001\-2019some\ relevant\ text\ here
abccompany name. (ABC) Q1 2020 Here is some text 01-Jan-2019 1-888-1234567 www.company.com 11 Copyright 2001-2019some relevant text heredef
abc def
For simpler copying:
import re
footer_text = 'company name. (ABC) Q1 2020 Here is some text 01-Jan-2019 1-888-1234567 www.company.com 11 Copyright 2001-2019some relevant text here'
regex = re.sub(r'\\ \d+\\ ', r'\\s\\d+\\s', re.escape(footer_text)) # doubled backslashes are needed in the replacement since Python 3.7
replacement = ' ' # a single space (should this instead be '' for an empty string?)
some_text = "abc" + footer_text + "def"
print(regex)
print(some_text)
print(re.sub(regex, replacement, some_text))
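If you need to do this for more than one footer, the same idea can be wrapped in a small helper. This is only a sketch: build_footer_pattern is a name I made up, and it assumes, as above, that every space-delimited run of digits in the sample footer (the year after Q1 as well as the page number) may vary. It uses a function as the replacement, which is inserted literally, so no doubled backslashes are needed.

```python
import re

def build_footer_pattern(sample_footer):
    """Compile a regex that matches the sample footer with any number
    in place of each space-delimited run of digits (hypothetical helper)."""
    escaped = re.escape(sample_footer)
    # A function replacement is inserted as-is, so '\s\d+\s' is not
    # parsed as a replacement template by re.sub.
    generalized = re.sub(r'\\ \d+\\ ', lambda m: r'\s\d+\s', escaped)
    return re.compile(generalized)

footer = ('company name. (ABC) Q1 2020 Here is some text 01-Jan-2019 '
          '1-888-1234567 www.company.com 2 Copyright 2001-2019 some relevant text here')
pattern = build_footer_pattern(footer)

# Simulate the footer of page 7 embedded in extracted text.
page_7 = footer.replace(' 2 Copyright', ' 7 Copyright')
cleaned = pattern.sub(' ', 'end of page' + page_7 + 'start of next page')
print(cleaned)
```

Compiling the pattern once and reusing it is a reasonable choice when the same footer has to be removed from many pages.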
I have multiple strings to postprocess, where a lot of the acronyms have a missing closing bracket. Assume the string text below, but also assume that this type of missing bracket happens often.
My code below only adds the closing bracket to each extracted acronym on its own, not within the full string/sentence. Any tips on how to do this efficiently, and preferably without needing to iterate?
import re
#original string
text = "The dog walked (ABC in the park"
#Desired output:
desired_output = "The dog walked (ABC) in the park"
#My code:
acronyms = re.findall(r'\([A-Z]*\)?', text)
for acronym in acronyms:
    if ')' not in acronym: #find those without a closing bracket ')'.
        print(acronym + ')') #add the closing bracket ')'.
#current output:
>>'(ABC)'
You may use
text = re.sub(r'(\([A-Z]+(?!\))\b)', r"\1)", text)
With this approach, you no longer need to check beforehand whether the acronym already has a closing ), see a demo on regex101.com.
In full:
import re
#original string
text = "The dog walked (ABC in the park"
text = re.sub(r'(\([A-Z]+(?!\))\b)', r"\1)", text)
print(text)
This yields
The dog walked (ABC) in the park
See a working demo on ideone.com.
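To see how the same substitution behaves on several strings at once, here is a small sketch. The sample sentences other than the first are made up for illustration; the pattern is the one from above, compiled once.

```python
import re

# Same pattern as above: an opening bracket, uppercase letters, and no ')'
# directly after the last letter.
UNCLOSED = re.compile(r'(\([A-Z]+(?!\))\b)')

samples = [
    "The dog walked (ABC in the park",   # missing bracket
    "The cat (DEF) slept",               # already closed, left alone
    "Both (GHI and (JKL) appear",        # mixed
]
fixed = [UNCLOSED.sub(r'\1)', s) for s in samples]
for line in fixed:
    print(line)
```

Acronyms that are already closed are not touched, because the lookahead rejects a ')' after the last letter and the word boundary prevents a shorter partial match.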
For the example you have provided, I don't see the need for regex.
You can just use some string methods:
text = "The dog walked (ABC in the park"
withoutClosing = [word for word in text.split() if word.startswith('(') and not word.endswith(')') ]
withoutClosing
Out[45]: ['(ABC']
Now you have the words without closing parenthesis, you can just replace them:
for eachWord in withoutClosing:
    text = text.replace(eachWord, eachWord+')')
text
Out[46]: 'The dog walked (ABC) in the park'
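One thing to watch with text.replace: it replaces every occurrence, so if the same unclosed word appears twice, the loop appends the bracket twice to each. A possible variant, sketched below, fixes each word as it is rebuilt. The function name close_brackets is made up, and the sketch assumes each acronym is a single whitespace-delimited word and that collapsing runs of whitespace into one space is acceptable.

```python
def close_brackets(text):
    """Append ')' to every word that starts with '(' but does not end
    with ')' (hypothetical helper; assumes single-word acronyms)."""
    fixed_words = [
        word + ')' if word.startswith('(') and not word.endswith(')') else word
        for word in text.split()
    ]
    # Note: joining on a single space collapses any repeated whitespace.
    return ' '.join(fixed_words)

print(close_brackets("The dog walked (ABC in the park"))
print(close_brackets("(ABC and (ABC"))
```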
I have a long list of entries in a file in the following format:
<space><space><number><space>=<space>"<word/phrase/sentence>"
e.g.
12345 = "Section 3 is ready for review"
24680 = "Bob to review Chapter 4"
I need to find a way of inserting additional text at the beginning of the word/phrase/sentence, but only if it doesn't start with one of several key words.
Additional text: 'Complete: '
List of key words: key_words_list = ['Section', 'Page', 'Heading']
e.g.
12345 = "Section 3 is ready for review" (no changes needed - sentence starts with 'Section' which is in the list)
24680 = "Complete: Bob to review Chapter 4" ('Complete: ' added to start of sentence because first word wasn't in list)
This could be done with a lot of string splitting and if statements but regex seems like it should be a more concise and much neater solution. I have the following that doesn't take account of the list:
for line in lines:
    line = re.sub(r'(^\s\s[0-9]+\s=\s")', r'\1Complete: ', line)
I also have some code that manages to identify the lines that require changes:
print([w for w in re.findall(r'^\s\s[0-9]+\s=\s"([\w+=?\s?,?.?]+)"', line) if w not in key_words_list])
Is regex the best option for what I need and if so, what am I missing?
Example inputs:
12345 = "Section 3 is ready for review"
24680 = "Bob to review Chapter 4"
Example outputs:
12345 = "Section 3 is ready for review"
24680 = "Complete: Bob to review Chapter 4"
You can use a regex like
^\s{2}[0-9]+\s=\s"(?!(?:Section|Page|Heading)\b)
See the regex demo. Details:
^ - start of string
\s{2} - two whitespaces
[0-9]+ - one or more digits
\s=\s - a = enclosed with a single whitespace on both ends
" - a " char
(?!(?:Section|Page|Heading)\b) - a negative lookahead that fails the match if there is Section, Page or Heading whole word immediately to the right of the current location.
See the Python demo:
import re
texts = ['  12345 = "Section 3 is ready for review"', '  24680 = "Bob to review Chapter 4"']
add = 'Complete: '
key_words_list = ['Section', 'Page', 'Heading']
pattern = re.compile(fr'^\s{{2}}[0-9]+\s=\s"(?!(?:{"|".join(key_words_list)})\b)')
for text in texts:
    print(pattern.sub(fr'\g<0>{add}', text))
# =>   12345 = "Section 3 is ready for review"
#      24680 = "Complete: Bob to review Chapter 4"
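If the key words or the added text might ever contain regex or backslash characters, a slightly more defensive variant could look like the sketch below. The function name add_prefix and its parameters are my own invention; the pattern is the same one as above, only with each key word passed through re.escape and a function used as the replacement.

```python
import re

def add_prefix(lines, key_words, prefix='Complete: '):
    """Insert prefix after the opening quote unless the sentence starts
    with one of key_words as a whole word (hypothetical helper)."""
    alternatives = '|'.join(re.escape(word) for word in key_words)
    pattern = re.compile(r'^\s{2}[0-9]+\s=\s"(?!(?:' + alternatives + r')\b)')
    # A function replacement is inserted literally, so prefix needs no escaping.
    return [pattern.sub(lambda m: m.group(0) + prefix, line) for line in lines]

lines = ['  12345 = "Section 3 is ready for review"',
         '  24680 = "Bob to review Chapter 4"',
         '  13579 = "Sectional sofa delivered"']
result = add_prefix(lines, ['Section', 'Page', 'Heading'])
for line in result:
    print(line)
```

The third (made-up) line shows the effect of \b: "Sectional" is not the whole word "Section", so the prefix is still added.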
I have Python strings that follow one of two formats:
"#gianvitorossi/ FALL 2012 #highheels ..."
OR:
"#gianvitorossi FALL 2012 #highheels ..."
I want to extract just the #gianvitorossi portion.
I'm trying the following:
...
company = p['edge_media_to_caption']['edges'][0]['node']['text']
company = company.replace('/','')
company = company.replace('\t','')
company = company.replace('\n','')
c = company.split(' ')
company = c[0]
This works in some of the names. However, in the example below:
My code is returning #gianvitorossi FALL rather than just #gianvitorossi as expected.
You should split with the '/' character
company = "mystring"
c = company.split('/')
company = c[0]
Well, it worked on my machine. For trailing characters such as a slash, you can use rstrip(your_symbols).
You could do that using a regular expression. Here is what you could do:
import re
text1 = "#gianvitorossi/ FALL 2012 #highheels ..."
text2 = "#gianvitorossi FALL 2012 #highheels ..."
patt = "#[A-Za-z]+"
print(re.findall(patt, text1))
If your text might include numbers, you could modify the code as follows:
import re
text1 = "#gianvitorossi/ FALL 2012 #highheels ..."
text2 = "#gianvitorossi FALL 2012 #highheels ..."
patt = "#[A-Za-z0-9]+"
print(re.findall(patt, text1))
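Keep in mind that re.findall returns every hashtag in the string, not only the first one. If only the leading tag is wanted, a sketch along these lines may be closer to what the question asks for. The function name first_hashtag is made up, and it assumes the tag sits at the start of the caption and consists of letters, digits and underscores.

```python
import re

def first_hashtag(caption):
    """Return the hashtag at the start of caption, or None if there is
    none (hypothetical helper)."""
    match = re.match(r'\s*(#\w+)', caption)
    return match.group(1) if match else None

text1 = "#gianvitorossi/ FALL 2012 #highheels ..."
text2 = "#gianvitorossi FALL 2012 #highheels ..."
print(first_hashtag(text1))
print(first_hashtag(text2))
```

Because the match stops at the first character that is not part of the tag, it does not matter whether the tag is followed by a slash, a space or a tab.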
You can get it by using split and replace, which if your requirements above are exhaustive, should be enough:
s.split(' ')[0].replace('/','')
An example:
s = ["#gianvitorossi/ FALL 2012 #highheels ...","#gianvitorossi FALL 2012 #highheels ..."]
for i in s:
    print(i.split(' ')[0].replace('/',''))
#gianvitorossi
#gianvitorossi
If you don't want to use regular expressions, you could use this:
original = "#gianvitorossi/ FALL 2012 #highheels ..."
extract = original.split(' ')[0]
if extract[-1] == "/":
    extract = extract[:-1]
I have the following text, which I want to convert to a desired format using Python regex:
text = "' PowerPoint PresentationOctober 11th, 2011(Visit) to Lap Chec1Edit or delete me in ‘view’ then ’slide master’.'"
I used the following code:
reg = re.compile(r"[^\w']")
text = reg.sub(' ', text)
However it gives output as text = "'PowerPoint PresentationOctober 11th 2011 Visit to Lap Chec1Edit or delete me in â viewâ then â slide masterâ'" which is not a desired output.
My desired output should be text = '"PowerPoint PresentationOctober 11th, 2011(Visit) to Lap Chec1Edit or delete me in view then slide master.'"
I want to remove special characters except the following: []()-,.
Rather than removing the chars, you may fix them using the right encoding:
text = text.encode('windows-1252').decode('utf-8')
# => ' PowerPoint PresentationOctober 11th, 2011Visit to Lap Chec1Edit or delete me in ‘view’ then ’slide master’.'
See the Python demo
If you want to remove them later, it will become much easier, like text.replace('‘', '').replace('’', ''), or re.sub(r'[’‘]+', '', text).
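To illustrate why the encode/decode round trip works, here is a self-contained sketch. It first produces the garbled text on purpose by decoding UTF-8 bytes as Windows-1252 (my assumption about how the original text got damaged), then reverses it and strips the curly quotes. The sample string is shortened from the one in the question.

```python
import re

# Text with curly quotes (U+2018 and U+2019), written with escapes so the
# example does not depend on the encoding of this file.
original = '\u2018view\u2019 then \u2019slide master\u2019.'

# Reproduce the damage: UTF-8 bytes wrongly decoded as Windows-1252.
garbled = original.encode('utf-8').decode('windows-1252')

# Reverse it: back to the raw bytes, then decode them as UTF-8.
repaired = garbled.encode('windows-1252').decode('utf-8')

# With the quotes restored, removing them is a simple substitution.
stripped = re.sub('[\u2018\u2019]+', '', repaired)
print(stripped)
```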
I found the answer myself; it was as simple as the following. Thanks for the replies.
reg = re.compile(r"[^\w'\,\.\(\)\[\]]")
text = reg.sub(' ', text)
I need to match a regular expression pattern in the given text with Python.
The text is:
"""
2010 Toyota FJ Cruiser FJ CRUISER
Int. Color:
Ext. Color:
Black
Trans:
Automatic
VIN:
JTEZU4BF7AK009445
Stock:
122821B
DIFFERENTIALBLACK
Status:
Body Style:
SUV
Engine:
Gas V6 4.0L/241
Dealership: Universal Toyota
$29,988*
Price
View More Information
Compare?
"""
From this text I need to extract "JTEZU4BF7AK009445" (length is 17), the value that appears after VIN:.
I used this pattern:
vin_pattern = re.compile('([A-Z0-9]{17})')
vin = re.findall(vin_pattern,text)
["JTEZU4BF7AK009445","DIFFERENTIALBLACK"]
But DIFFERENTIALBLACK should not be matched.
I also used the pattern
price_pat = re.compile(r'(\$[0-9\,\.]+)')
to match the price ("$" sign + value).
I need to apply this price pattern only within 50 characters before and after the place where the VIN pattern matches, because in some cases the text contains more than one price value.
How should this be done?
Let's first simplify your text a bit by normalizing all whitespace to a single space:
t2 = re.sub(r'[\n\t\ ]+', ' ', t) # t is your original text
It makes looking for a VIN much easier task:
re.findall('[A-Z]{3}[A-Z0-9]{10}[0-9]{4}', t2)
Out[2]: ['JTEZU4BF7AK009445']
Then you can get the position of VIN: in your string and pass vin_position - 50, vin_position + 50 into the .findall method of a compiled pattern:
r2 = re.compile(r'(\$[0-9\,\.]+)')
r2.findall(t2, t2.find('VIN:') - 50, t2.find('VIN:') + 50)
Out[4]: []
In your text the price is more than 50 chars from VIN, i.e. you need to extend this boundary (100 works just fine):
r2.findall(t2, t2.find('VIN:') - 100, t2.find('VIN:') + 100)
Out[5]: ['$29,988']
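Putting the steps together, a sketch of a helper might look like this. The name prices_near_vin, the window parameter and the short sample text are all made up for illustration. Unlike the snippets above, it measures the window from the matched VIN itself rather than from the 'VIN:' label, and it adds word boundaries around the VIN pattern.

```python
import re

VIN_RE = re.compile(r'\b[A-Z]{3}[A-Z0-9]{10}[0-9]{4}\b')
PRICE_RE = re.compile(r'\$[0-9,.]+')

def prices_near_vin(text, window=50):
    """Return (vin, prices) where prices are the price strings found within
    `window` characters of the VIN (hypothetical helper)."""
    flat = re.sub(r'\s+', ' ', text)          # normalize all whitespace
    vin = VIN_RE.search(flat)
    if vin is None:
        return None, []
    start = max(0, vin.start() - window)      # do not run past the start
    end = vin.end() + window                  # endpos past the end is fine
    return vin.group(0), PRICE_RE.findall(flat, start, end)

# Made-up sample: one price far from the VIN, one close to it.
sample = ("Price $1,000 far away " + "x" * 80 +
          "\nVIN:\nJTEZU4BF7AK009445\nStock:\n1\n$29,988*\nPrice")
vin, prices = prices_near_vin(sample)
print(vin, prices)
```

The window size is something you would have to tune for your real data, as the 50 versus 100 comparison above shows.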
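A dirty hack, but it will work.
import re
st = "....your string...."
# Note: [^Stock] is a character class (any character except S, t, o, c, k), not the word "Stock"
x = re.findall(r"VIN:([^Stock]+)", st)
y = "".join(x)
y = y.strip(" \n")  # strip() returns a new string, so keep the result
print(y)
output = 'JTEZU4BF7AK009445'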
If you don't have to use regexes (they are a pain in the a**), I would recommend the following solution:
yourstr = """ ... whatever ... """
lst = yourstr.split()
vin = lst[lst.index('VIN:') + 1]
price = [i for i in lst if '$' in i][0]
I hope this will be sufficient!