I am trying to get the character on a new line after a specific letter using regex. My raw data looks like the below:
Total current charges (please see Current account details) $38,414.69
ID Number
1001166UNBEB
ACCOUNT SUMMARY
SVL0
BALANCE OVERDUE - PLEASE PAY IMMEDIATELY $42,814.80
I want to get the ID Number
My attempt is here:
ID_num = re.compile(r'[^ID Number[\r\n]+([^\r\n]+)]{12}')
The length of ID num is always 12, and always after ID Number which is why I am specifying the length in my expression and trying to detect the elements after that.
But this is not working as desired.
Would anyone help me, please?
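Your regex is not working because the whole pattern is wrapped in [ ], which defines a character set rather than a group, so remove the outer brackets.
The {12} quantifier also needs to go on the character class inside the capturing group; placed after the group, it repeats the group and only the last repetition is captured. The leading ^ is dropped because, without re.MULTILINE, it only matches at the very start of the string.
Your pattern would look like: r'ID Number[\r\n]+([^\r\n]{12})'
But you can simplify your pattern to: ID Number[\s]+(\w+)
\s matches \r and \n, and \w matches digits and letters (as well as the underscore).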
import re
s = """
Total current charges (please see Current account details) $38,414.69
ID Number
1001166UNBEB
ACCOUNT SUMMARY
SVL0
BALANCE OVERDUE - PLEASE PAY IMMEDIATELY $42,814.80
"""
print(re.findall(r"ID Number[\s]+(\w+)", s))
# ['1001166UNBEB']
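Since the question says the ID is always 12 characters long, the length can also be enforced in the pattern. This is only a sketch: it assumes the ID consists of word characters (letters, digits, underscore), and the names text and id_pattern are made up for illustration.

```python
import re

text = """Total current charges (please see Current account details) $38,414.69
ID Number
1001166UNBEB
ACCOUNT SUMMARY
"""

# \w{12} requires exactly 12 word characters; the trailing \b rejects
# tokens that are longer than 12 characters instead of truncating them.
id_pattern = re.compile(r"ID Number\s+(\w{12})\b")

match = id_pattern.search(text)
id_number = match.group(1) if match else None
print(id_number)  # 1001166UNBEB
```

Using search() returns None when nothing matches, so the missing-ID case can be handled explicitly.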
I have a seemingly simple problem that for the life of me is just outside my reach of understanding. What I mean by that is that I can come up with many complex ways to attempt this, but there must be an easy way.
What I am trying to do is find and replace a substring in a string, but the catch is that it is based on a mix of a defined region and then variable regions based on length.
Here is an example:
sequence = 'AATCGATCGTATATCTGCGTAGACTCTGTGCATGC' and I want to replace AATCGATCGTA with <span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>
So in this example the first part will always be constant AATCGA and will be used to locate the region to replace. This is then followed by a "spacer", in this case a single character but could be more than one and needs to be specified, and finally the last bit that will follow the "tail", in this case four characters, but could also be more or less. A set-up in this case would be:
to_find = 'AATCGA'
spacer = 'T' #Variable based on number and not on the character
tail = 'CGTA' #Variable based on number and not on the character
With this information I need to do something like:
new_seq = sequence.replace(f'{to_find}{len(spacer)}{len(tail)}', f'<span color="blue">{to_find}</span><span>{spacer}</span><span color="green">{tail}</span>')
print(new_seq)
<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC
But the spacer could be 3 characters from the end of to_find and it may vary, the same with the tail section. Also, it could be in reverse where the to_find is on the right hand side and then the tail is in the start.
Any help would be much appreciated!
I'm not quite sure I understand you fully. Nevertheless, you don't seem to be too far off. Just use regex.
import re
sequence = 'AATCGATCGTATATCTGCGTAGACTCTGTGCATGC'
expected_new_seq = '<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC'
to_find = 'AATCGA'
spacer = 'T' # Variable based on number and not on the character
tail = 'CGTA' # Variable based on number and not on the character
# In this case, the pattern is (AATCGA)(.{1})(.{4})
# It matches "AATCGA" that is followed by 1 character and then 4 characters.
# AATCGA is captured in group 1, then the next unknown character is captured
# in group 2, and the next 4 unknown characters are captured in group 3
# (the brackets create capturing groups).
pattern = f'({to_find})(.{{{len(spacer)}}})(.{{{len(tail)}}})'
# \1 refers to capture group 1 (to_find), \2 refers to capture group 2 (spacer),
# and \3 refers to capture group 3 (tail).
# This no longer needs to be an f-string. But making it a raw string means we
# don't need to escape the backslashes
repl = r'<span color="blue">\1</span><span>\2</span><span color="green">\3</span>'
new_seq = re.sub(pattern, repl, sequence)
print(new_seq)
print(new_seq == expected_new_seq)
Output:
<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC
True
Have a play around with it here (also includes interactive explanation): https://regex101.com/r/2mshrI/1
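If the spacer and tail lengths vary, the same idea can be wrapped in a small function. The sketch below is an assumption about how you might package it: highlight, spacer_len and tail_len are hypothetical names, and re.escape is added so that a to_find containing regex metacharacters is still matched literally.

```python
import re

def highlight(sequence, to_find, spacer_len, tail_len):
    """Wrap the anchor, spacer, and tail in <span> tags (hypothetical helper)."""
    pattern = (
        f"({re.escape(to_find)})"   # group 1: the constant anchor, matched literally
        f"(.{{{spacer_len}}})"      # group 2: spacer_len arbitrary characters
        f"(.{{{tail_len}}})"        # group 3: tail_len arbitrary characters
    )
    repl = (
        r'<span color="blue">\1</span>'
        r'<span>\2</span>'
        r'<span color="green">\3</span>'
    )
    return re.sub(pattern, repl, sequence)

sequence = "AATCGATCGTATATCTGCGTAGACTCTGTGCATGC"
new_seq = highlight(sequence, "AATCGA", spacer_len=1, tail_len=4)
print(new_seq)
```

Note that re.sub replaces every non-overlapping occurrence of the anchor, not only the first one; pass count=1 to re.sub if only the first should be replaced.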
You wrote: "Also, it could be in reverse where the to_find is on the right hand side and then the tail is in the start."
How do you know when to replace in reverse instead of forward? After all, all you're doing is matching a short string followed or preceded by n characters. I imagine you'd get matches in both directions, so which replacement do you carry out? Please provide more examples: longer input with expected output.
I am looking for a way to count the occurrences found in a string based on my regex. I used findall() and it returns a list, but the len() of the list is only 1. Shouldn't the len() of the list be 2?
import re
string1 = r'Total $200.00 Total $900.00'
regex = r'(.*Total.*|.*Invoice.*|.*Amount.*)?(\s+?\$\s?[1-9]{1,10}.*(?:[.,]\d{3})*(?:[.,]\d{2})?)'
patt = re.findall(regex,string1)
print(patt)
print(len(patt))
Result:
> [('Total $200.00 Total', ' $900.00')]
> 1
I am not sure whether my regex is causing the miscount. I am looking to get the Total from a file, but there are many variations of it.
Examples:
Total $900.00
Invoice Amt $500.00
Total 800.00
etc.
I am looking to count this because there could be multiple invoice details in one file.
First off, because that's a common misconception:
There is no need to match "all text up to the match" or "all the text after a match". You can drop those .* in your regex. Start with what you actually want to match.
import re
string1 = 'Total $200.00 Total $900.00'
amount_pattern = r'(?:Total|Amt|Invoice Amt|Others)[:\s]*\$([\d\.,]*\d)'
amount_expr = re.compile(amount_pattern, re.IGNORECASE)
amount_expr.findall(string1)
# -> ['200.00', '900.00']
\$([\d\.,]*\d) is a half-way reasonable approximation of prices ("things that start with a $ and then contain a bunch of digits and possibly dots and commas"). The final \d makes sure we are not accidentally matching sentence punctuation. It might be good enough, but you know what data you are working with. Feel free to come up with a more specific sub-expression. Include an optional leading - if you expect to see negative amounts.
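As a rough sketch of a more permissive sub-expression, the variant below makes the $ optional (the question lists "Total 800.00") and allows a leading minus sign. It assumes the minus comes before the dollar sign, as in -$1,250.50, which is a guess about the data; the sample text is made up from the question's examples.

```python
import re

text = """Total $900.00
Invoice Amt $500.00
Total 800.00
Total -$1,250.50"""

amount_expr = re.compile(
    # group 1: optional minus sign, group 2: the digits with separators
    r"(?:Total|Invoice Amt|Amt)[:\s]*(-?)\$?([\d.,]*\d)",
    re.IGNORECASE,
)

# findall returns one (sign, digits) tuple per match because there are two groups.
amounts = [
    float(sign + digits.replace(",", ""))
    for sign, digits in amount_expr.findall(text)
]
print(amounts)       # [900.0, 500.0, 800.0, -1250.5]
print(len(amounts))  # 4
```

Making the $ optional also means any number directly after "Total" is treated as an amount, so this is only suitable if the labels are never followed by unrelated numbers.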
Try:
>>> re.findall(r'(\w*\s+\$\d+\.\d+)', string1)
['Total $200.00', 'Total $900.00']
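The issue you are having is that your regex has two capture groups, so re.findall returns a list of tuples, one tuple per match, with one element per group. Your greedy .* makes the whole string a single match, so the list holds one tuple (with two strings inside) and its length is 1.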
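To illustrate how groups change what re.findall returns, here is a small sketch using the string from the question; the patterns are simplified assumptions for demonstration, not a complete amount matcher.

```python
import re

string1 = "Total $200.00 Total $900.00"

# Two capturing groups: each list element is a (label, amount) tuple.
pairs = re.findall(r"(\w+)\s+\$(\d+\.\d+)", string1)
print(pairs)       # [('Total', '200.00'), ('Total', '900.00')]
print(len(pairs))  # 2

# No capturing groups (only a non-capturing one): each element is the whole match.
whole = re.findall(r"(?:\w+)\s+\$\d+\.\d+", string1)
print(whole)       # ['Total $200.00', 'Total $900.00']

# If only the count matters, iterate over the match objects instead.
count = sum(1 for _ in re.finditer(r"\$\d+\.\d+", string1))
print(count)       # 2
```

In every case the length of the list is the number of matches; the groups only change the shape of each element.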
I am trying to extract information from PDF.
Simple search worked:
filecontent = ReadDoc.getContent("c:\\temp\\pdf_1.pdf")
match = re.search(r'Document ID: (\d+)', filecontent)
if match:
    docid = match.group(1)
But I am stuck when the text around the value is a longer phrase that varies.
For example, I want to extract '$999,999.00', which may appear in the document as "Total Cumulative Payment (USD) $999,999.00" or "Total cumulative payment $55587323.23". Note that the text differs between the two, so I need some kind of fuzzy search: find the sentence, then extract the $ amount from it.
Similarly I also need to extract some date, number, amount, money in between phrases/words.
Appreciate your help!
I think this should do what you want:
import re
textlist = ["some other amount as $32,4545.34 and Total Cumulative Payment (USD) $999,999.00 and such","Total cumulative payment $55587323.23"]
matchlist = []
for text in textlist:
    # findall returns every $ amount in the text, or an empty list
    match = re.findall(r"(\$[.\d,]+)", text)
    if match:
        matchlist.extend(match)
print(matchlist)
results:
['$32,4545.34', '$999,999.00', '$55587323.23']
The regex looks for a $ and then grabs digits, periods, and commas up to the next character that is none of those (for example, a space). Depending on what other kind of data you are parsing it may need to be tweaked; I am assuming you only want to capture periods, commas, and digits.
Update:
It will now find any number of occurrences and put them all in a list.
Well something like this can be done with regular expressions:
import re
source = 'total cumulative payment $2000.00; some other amount $1234.56. Total Cumulative Payment (USD) $5,600,000.06'
matches = re.findall( r'total\s+cumulative\s+payment[^$0-9]+\$([0-9,.]+)', source, re.IGNORECASE )
amounts = [ float( x.replace( ',', '' ).rstrip('.') ) for x in matches ]
This will match the two specific examples you've given. But you haven't given much of an idea of how loose the matching criteria should be, or what the rules are. The solution above will miss amounts if the source document has a spelling mistake in the word "cumulative". Or if the amount appears without the dollar sign. It also allows any amount of intervening text between "total cumulative payment" and the dollar amount (so you'll get a false positive from source = "This document contains information about total cumulative payment values, (...3 more pages of introductory material...) and by the way you owe me $20.") Now, these things can be tweaked and improved - but only if you know what is going to be important and what is not, and tighten the specification of the question accordingly.
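One possible tightening, sketched below, is to cap how much text may sit between the phrase and the amount. The limit of 10 characters is an arbitrary assumption chosen so that " (USD) " still fits; the sample source string is made up from the false-positive scenario described above.

```python
import re

source = (
    "This document discusses total cumulative payment values at length, "
    "and by the way you owe me $20. "
    "Total Cumulative Payment (USD) $5,600,000.06"
)

# Allow between 1 and 10 characters (none of them a digit or $) between
# the phrase and the dollar sign. The cap of 10 is an assumption to tune.
pattern = re.compile(
    r"total\s+cumulative\s+payment[^$0-9]{1,10}\$([0-9,]+(?:\.[0-9]{2})?)",
    re.IGNORECASE,
)

amounts = [float(m.replace(",", "")) for m in pattern.findall(source)]
print(amounts)  # [5600000.06]
```

The "$20" is no longer picked up because far more than 10 characters separate it from the phrase. The amount sub-pattern also ends on a digit, so trailing sentence punctuation does not need to be stripped afterwards.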
I am (a newbie,) struggling with separating a database in columns with regex.findall().
I want to separate these Dutch street names into name and number.
Roemer Visscherstraat 15
Vondelstraat 102-huis
For the number I use
\S*$
Which works just fine. For the street name I use
^\S.+[^\S$]
Or: use everything but the last element, which may be a number or a combination of a number and something else.
Problem is: Python then also keeps the last whitespace after the last name, so I get:
'Roemer Visscherstraat '
Any way I can stop this from happening?
Also, findall returns a list consisting of the bit of the database I wanted and an empty string. How does this happen, and can I prevent it somehow?
Thanks so much in advance for your help.
You can rstrip() the name to remove any spaces at the end of it:
>>> 'Roemer Visscherstraat '.rstrip()
'Roemer Visscherstraat'
But if the input is similar to the one you posted, you can simply use split() instead of regex, for example:
st = 'Roemer Visscherstraat 15'
data = st.split()
num = data[-1]                # the last token is the house number
name = ' '.join(data[:-1])    # everything before it is the street name
print('Name: {}, Number: {}'.format(name, num))
output:
Name: Roemer Visscherstraat, Number: 15
For the number you should use the following:
\S+$
Using a + instead of a * will ensure that you have at least one character in the match.
For the street name you can use the following:
^.+(?=\s\S+$)
This selects the text up to, but not including, the whitespace before the number.
However, what you may consider doing is using one regex match with capture groups instead. The following would work:
^(.+(?=\s\S+$))\s(\S+$)
In this case, the first capture group gives you the street name, and the second gives you the number.
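As a sketch of how the combined pattern could be used from Python, assuming each address is a separate string (the names addresses, address_re and results are made up for illustration):

```python
import re

addresses = ["Roemer Visscherstraat 15", "Vondelstraat 102-huis"]

# Group 1: the street name; group 2: the final whitespace-free token.
address_re = re.compile(r"^(.+(?=\s\S+$))\s(\S+$)")

results = []
for address in addresses:
    m = address_re.match(address)
    if m:
        name, number = m.groups()
        results.append((name, number))

print(results)
# [('Roemer Visscherstraat', '15'), ('Vondelstraat', '102-huis')]
```

Using match() with groups instead of findall() avoids both the trailing space in the name and the extra empty string in the result list.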
([^\d]*)\s+(\d.*)
In this regex, the first group captures everything before the whitespace that precedes the number, and the second group captures the number itself.
My assumption is that the number begins with a digit and the name has no digits in it.
take a look at https://regex101.com/r/eW0UP2/1
Roemer Visscherstraat 15
Full match 0-24 `Roemer Visscherstraat 15`
Group 1. 0-21 `Roemer Visscherstraat`
Group 2. 22-24 `15`
Vondelstraat 102-huis
Full match 24-46 `Vondelstraat 102-huis`
Group 1. 24-37 `Vondelstraat`
Group 2. 38-46 `102-huis`
I am new to using regex.
I have a string in the form
Waco, Texas
Unit Dose 13 and
SECTION 011100 SUMMARY OF WORK
INDEX PAGE
PART 1. - GENERAL 1
1.1. RELATED DOCUMENTS 1
1.2. PROJECT DESCRIPTION 1
1.3. OWNER 1
1.4. ARCHITECT/ENGINEER 2
1.5. PURCHASE CONTRACTS 2
1.6. OWNER-FURNISHED ITEMS 2
1.7. CONTRACTOR-FURNISHED ITEMS 3
1.8. CONTRACTOR USE OF PREMISES 3
1.9. OWNER OCCUPANCY 3
1.10. WORK RESTRICTIONS 4
PART 2. - PRODUCTS - NOT APPLICABLE 4
PART 3. - EXECUTION - NOT APPLICABLE 4
I apologize for the extra white space, but this is the form of the word document I parsed to obtain the string.
I need to capture all of the headings under PART 1, PART 2, and PART 3 and store them in separate lists. So far I have
matchedtext = re.findall('(?<=PART) (.*?) (?=PART)', text, re.DOTALL)
If I understand correctly, these look arounds should use PART as a sort of base point and grab the text in between. However, matchedtext does not fill with anything when I run the code.
The second part of my problem is once I have the text in between the different occurrences of PART how can I save just the capitalized headings in a list with a string for each heading. Some of my strings from the word documents contain lowercase words, but I just want the words that are all in caps.
So to summarize how can I grab the text between specific words in a string and once I have them how can I save the words as individual strings in a list.
Thanks for the help! :D
You don't even need to use regex, just use the split function for strings. If s is the name of your string, it would be:
s.split('PART')
This will include the text before the first PART, so don't use the first element of the list:
texts_between_parts = s.split('PART')[1:]
You can later check if a word is all upper case using the string method isupper.
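Putting the two steps together might look like the sketch below. It uses a shortened, made-up copy of the text from the question and assumes one heading per line; words that are not fully upper case (numbers, dashes, lowercase words) are dropped.

```python
text = """SECTION 011100 SUMMARY OF WORK
PART 1. - GENERAL 1
1.1. RELATED DOCUMENTS 1
1.2. PROJECT DESCRIPTION 1
PART 2. - PRODUCTS - NOT APPLICABLE 4
PART 3. - EXECUTION - NOT APPLICABLE 4
"""

# Everything before the first 'PART' is skipped with [1:].
parts = text.split("PART")[1:]

headings_per_part = []
for part in parts:
    headings = []
    for line in part.splitlines():
        # str.isupper() is False for tokens without letters, such as '1.' or '-'.
        words = [w for w in line.split() if w.isupper()]
        if words:
            headings.append(" ".join(words))
    headings_per_part.append(headings)

print(headings_per_part)
```

One caveat with splitting on the bare string 'PART': a heading that contains it inside another word (for example DEPARTMENT) would also be split, so a stricter separator may be needed for real documents.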
I would forget about grabbing everything between PART 1 and PART 2, etc. Instead, I would parse each line with the following regex and use Group 1 to determine the grouping of the headings.
^(\d)(\.|\d)+\s+([^a-z]+?)\s+\d$
Group 1 is the Part Number/Section
Group 2 is the Sub Section (because the group is repeated, only the last character it matched is kept)
Group 3 is the Heading
import re
p = re.compile(r'^(\d)(\.|\d)+\s+([^a-z]+?)\s+\d$')
m = p.match('1.4. ARCHITECT/ENGINEER 2')
if m:
    print('Match found: ', m.groups())
else:
    print('No match')
Match found: ('1', '.', 'ARCHITECT/ENGINEER')
import re
p = re.compile(r'^(\d)(\.|\d)+\s+([^a-z]+?)\s+\d$')
m = p.match('1.4. ARCHITECT/ENGINEER 2')
if m:
    print('Section: ', m.group(1))
    print('Heading: ', m.group(3))
else:
    print('No match')
# Output
# Section: 1
# Heading: ARCHITECT/ENGINEER
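To collect the headings for a whole document, the same pattern can be applied line by line and the results grouped by Group 1. The following is a sketch on a shortened, made-up sample; like the pattern above, it assumes each heading line ends in a single-digit page number.

```python
import re
from collections import defaultdict

text = """PART 1. - GENERAL 1
1.1. RELATED DOCUMENTS 1
1.4. ARCHITECT/ENGINEER 2
1.10. WORK RESTRICTIONS 4
PART 2. - PRODUCTS - NOT APPLICABLE 4
"""

line_re = re.compile(r"^(\d)(\.|\d)+\s+([^a-z]+?)\s+\d$")

# Map each section number (Group 1) to the list of its headings (Group 3).
headings = defaultdict(list)
for line in text.splitlines():
    m = line_re.match(line.strip())
    if m:
        headings[m.group(1)].append(m.group(3))

print(dict(headings))
# {'1': ['RELATED DOCUMENTS', 'ARCHITECT/ENGINEER', 'WORK RESTRICTIONS']}
```

The "PART n." lines themselves do not start with a digit, so they are skipped; only the numbered sub-headings are collected.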