Finding an unknown string within a string - python

This question is similar to one I asked a few days ago, but I'm looking for something completely different.
I have a string that is a portion of code. Its general form is known, but part of it varies. I currently find it by using:
for num, line in enumerate(code, 1):
    if re.match("function (.*) {", line):
        ...  # num is the line number where the function starts
That gets me through a good portion of what I need, as I need to know the line number that it starts at. My problem starts here. What I need is just the part where I am using the (.*) regular expression.

You mean the text between ( and )?
Use capturing groups:
m = re.match("function (.*) {", line)
if m:
    print(m.group(1))

The match object that is returned contains the contents of all groups. I would use re.search if 'function' isn't always at the beginning of a line, and '.+' to match function names with at least one character.
line_to_fn = {}
for num, line in enumerate(code, 1):
    match = re.search(r"function (.+) {", line)
    if match:
        matches = match.groups()
        assert len(matches) == 1, repr(matches)
        line_to_fn[num] = matches[0]
# line_to_fn: {1: 'something', 5: 'something_else'}
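To see the difference between re.match and re.search in practice, here is a minimal sketch. The sample lines in code and the helper names_by_line are made up for illustration; they are not from the question.

```python
import re

# Hypothetical sample input; the second function definition is indented.
code = [
    "function first {",
    "  echo hi",
    "}",
    "    function second {",
]

pattern = re.compile(r"function (.+) {")

def names_by_line(lines, finder):
    """Map 1-based line numbers to the text captured by group 1."""
    found = {}
    for num, line in enumerate(lines, 1):
        m = finder(line)
        if m:
            found[num] = m.group(1)
    return found

# re.match only looks at the start of each line, so the indented one is missed.
with_match = names_by_line(code, pattern.match)
# re.search scans the whole line and finds both definitions.
with_search = names_by_line(code, pattern.search)

print(with_match)   # {1: 'first'}
print(with_search)  # {1: 'first', 4: 'second'}
```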

Related

Get last part after number regex python

There are always 2 digits in between, and I want to extract everything before the 3 (so Salvatore) and everything after the 2 (so Abdulla).
For example I have the following:
txt = "Salvatore32Abdulla"
first = re.findall("^\D+", txt)
last = re.search(,txt)
Expected result:
first = 'Salvatore'
last = 'Abdulla'
I can get the first part, but after 2 I can't get the last part
You could also do this in a single line by slightly changing the solution suggested by @ctwheels, as follows. I would suggest using re.findall, as it gets the job done in one go.
import re
txt = "Salvatore32Abdulla"
Option-1
Single line extraction of the non-numeric parts.
first, last = re.findall(r"\D+", txt)
print((first, last))
('Salvatore', 'Abdulla')
Option-2
If you would (for some reason) also want to keep track of the number in between:
first, num, last = re.findall(r"(\D+)(\d{2})(\D+)", txt)[0]  # with groups, findall returns a list of tuples
print((first, num, last))
('Salvatore', '32', 'Abdulla')
Option-3
As an extension of Option-2, consider text of the form 'Salvatore####...###Abdulla', where ####...### denotes a continuous block of digits separating the non-numeric parts. If you don't know how many digits are in between, you could use the following:
first, num, last = re.findall(r"(\D+)(\d*)(\D+)", txt)[0]  # with groups, findall returns a list of tuples
print((first, num, last))
('Salvatore', '32', 'Abdulla')
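One detail about the grouped patterns in Option-2 and Option-3: when a pattern contains more than one group, re.findall returns a list of tuples (one tuple per match), so the three values have to be taken from the first element. A short sketch of that behaviour, using \d+ as an assumed "one or more digits" variant:

```python
import re

txt = "Salvatore32Abdulla"

# Three groups: findall returns one tuple per match, not a flat list.
rows = re.findall(r"(\D+)(\d+)(\D+)", txt)
print(rows)  # [('Salvatore', '32', 'Abdulla')]

# Unpack the first (and here only) match.
first, num, last = rows[0]
print((first, num, last))  # ('Salvatore', '32', 'Abdulla')
```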
Why am I not getting the expected results?
You currently have one issue with your regex and one with your code.
Your regex contains ^, which anchors it to the start of the string. This will only allow you to match Salvatore. You're using findall (which is the appropriate choice if you change the regex to simply \D+), but right now it's only getting one result.
The second re.search call is not needed as you can capture first and last with the findall given an appropriate pattern (see below).
How do I fix it?
See code in use here
import re
txt = "Salvatore32Abdulla"
x = re.findall(r"\D+", txt)
print(x)
Result:
['Salvatore', 'Abdulla']
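For comparison, a small sketch of what the ^ anchor does to the result (the variable names anchored and unanchored are just for illustration):

```python
import re

txt = "Salvatore32Abdulla"

# With ^ the pattern can only match at the very start of the string.
anchored = re.findall(r"^\D+", txt)
# Without it, findall collects every run of non-digits.
unanchored = re.findall(r"\D+", txt)

print(anchored)    # ['Salvatore']
print(unanchored)  # ['Salvatore', 'Abdulla']
```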
You could use a regex like this:
txt = "Salvatore32Abdulla"
regex = r"(\D+)\d\d(\D+)"
match = re.match(regex, txt)
first = match.group(1)
last = match.group(2)
Part after last digit:
match = re.search(r'\D+$',txt)
if match:
print(match.group())
See Python proof | regex proof.
Results: Abdulla
EXPLANATION
--------------------------------------------------------------------------------
  \D+    non-digits (all but 0-9) (1 or more times (matching the most amount possible))
--------------------------------------------------------------------------------
  $      before an optional \n, and the end of the string

How to use regex to tell if first and last character of a string match?

I'm relatively new to using Python and regex, and I wanted to check whether a string's first and last characters are the same.
If first and last characters are same, then return 'True' (Ex: 'aba')
If first and last characters are not same, then return 'False' (Ex: 'ab')
Below is the code, I've written:
import re
string = 'aba'
pattern = re.compile(r'^/w./1w$')
matches = pattern.finditer(string)
for match in matches:
    print(match)
But from the above code, I don't see any output
If and only if you really want to use regex (for learning purposes):
import re
string = 'aba'
string2 = 'no match'
pattern = re.compile(r'^(.).*\1$')
if re.match(pattern, string):
    print('ok')
else:
    print('nok')
if re.match(pattern, string2):
    print('ok')
else:
    print('nok')
output:
ok
nok
Explanations:
^(.).*\1$
^ start of line anchor
(.) match the first character of the line and store it in a group
.* match any characters any time
\1 backreference to the first group, in this case the first character to impose that the first char and the last one are equal
$ end of line anchor
Demo: https://regex101.com/r/DaOPEl/1/
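One edge case worth knowing about: ^(.).*\1$ needs at least two characters, so a one-character string such as 'a' does not match even though its first and last characters are trivially the same. Below is a minimal sketch of a variant that accepts it, assuming that is the behaviour you want. SAME_ENDS and same_ends are hypothetical names.

```python
import re

# Same idea as ^(.).*\1$, but the "anything, then group 1 again" tail is optional,
# so a single character on its own also counts. DOTALL lets . match newlines too.
SAME_ENDS = re.compile(r"(.)(?:.*\1)?", re.DOTALL)

def same_ends(text):
    """Return True if text is non-empty and starts and ends with the same character."""
    return SAME_ENDS.fullmatch(text) is not None

print(same_ends("aba"))  # True
print(same_ends("ab"))   # False
print(same_ends("a"))    # True
print(re.match(r"^(.).*\1$", "a"))  # None: the original pattern rejects 'a'
```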
Otherwise the best approach is to simply use the comparison string[0] == string[-1]
string = 'aba'
if string[0] == string[-1]:
    print('same')
output:
same
Why do you overengineer with a regex at all? One principle of programming should be keeping it simple, like:
string[0] == string[-1]
Or is there a need for regex?
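A small sketch of the non-regex approach wrapped in a function. It compares with ==, which checks values; is checks object identity and is not guaranteed to work for strings. Returning False for an empty string is an assumption here, and first_equals_last is a hypothetical name.

```python
def first_equals_last(text):
    """Return True if text is non-empty and its first and last characters are equal."""
    # bool(text) guards against IndexError on the empty string.
    return bool(text) and text[0] == text[-1]

print(first_equals_last("aba"))  # True
print(first_equals_last("ab"))   # False
print(first_equals_last(""))     # False
```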
The above answer by @Tobias is perfect and simple, but if you want a solution using regex, try the code below.
Code:
import re
string = 'abbaaaa'
pattern = re.compile(r'^(.).*\1$')
matches = pattern.finditer(string)
for match in matches:
    print(match)
Output :
<_sre.SRE_Match object; span=(0, 7), match='abbaaaa'>
I think this is the regex you are trying to execute (note that it matches only three-character strings, because a single . sits between the two ends):
Code:
import re
string = 'aba'
pattern = re.compile(r'^(\w).(\1)$')
matches = pattern.finditer(string)
for match in matches:
    print(match.group(0))
Output:
aba
If you want to check with regex, use the code below:
import re
string = 'aba is a cowa'
pat = r'^(.).*\1$'
if re.findall(pat, string):
    print(string)
The pattern matches only when the first and last characters of the string are the same. In that case re.findall returns a list containing the matching character and the string is printed; otherwise the list is empty and the string is skipped.

How to match numeric characters with no white space following

I need to match lines in text document where the line starts with numbers and the numbers are followed by nothing.... I want to include numbers that have '.' and ',' separating them.
Currently, I have:
p = re.compile(r'\$?\s?[0-9]+')
for i, line in enumerate(letter):
    m = p.match(line)
    if m is not None:
        print(m)
        print(line)
Which gives me this:
"15,704" and "416" -> this is good, I want this
but also this:
"$40 million...." -> I do not want to match this line or any line where the numbers are followed by words.
I've tried:
p = re.compile('\$?\s?[0-9]+[ \t\n\r\f\v]')
But it doesn't work. One reason is that it turns out there is no white space after the numbers I'm trying to match.
Appreciate any tips or tricks.
If you want to match the whole string with a regex, you have 2 choices:
Either call re.fullmatch(pattern, string) (note full in the function name). It tries to match just the whole string.
Or put a $ anchor at the end of your regex and call re.match(pattern, string). It tries to find a match from the start of the string.
Actually, you could also add ^ at the start of the regex and call re.search(pattern, string), but it would be a very strange combination.
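The two choices are not quite identical when a line still carries its trailing newline, as lines read from a file usually do: $ also matches just before a final newline, while fullmatch insists on consuming everything. A minimal sketch, with the pattern and sample line chosen only for illustration:

```python
import re

pattern = r"[\d,.]+"
line = "15,704\n"  # as read from a file, trailing newline included

# fullmatch must consume the whole string, newline included, so this fails.
full = re.fullmatch(pattern, line)
# $ matches at the end of the string or just before a final newline.
dollar = re.match(pattern + "$", line)
# Stripping the newline first makes fullmatch succeed.
stripped = re.fullmatch(pattern, line.rstrip("\n"))

print(bool(full), bool(dollar), bool(stripped))  # False True True
```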
I also have a remark concerning how you specified your conditions, which may be incomplete: you gave e.g. the $40 million string and stated that the only reason to reject it is the space and letters after $40.
So actually you should have written that you want to match a string:
Possibly starting with $.
After the $ there can be a space (maybe, I'm not sure).
Then there can be a sequence of digits, dots or commas.
And nothing more.
And one more remark concerning Python literals: apparently you have forgotten to prepend the pattern with r. If you use an r-string literal, you do not have to double the backslashes inside.
So I think the most natural solution is to call a function devoted just to matching the whole string (i.e. fullmatch), without adding start / end anchors, and the whole script can be:
import re
pat = re.compile(r'(?:\$\s?)?[\d,.]+')
lines = ["416", "15,704", "$40 million"]
for line in lines:
    if pat.fullmatch(line):
        print(line)
Details concerning the regex:
(?: - A non-capturing group.
\$ - Consisting of a $ char.
\s? - And an optional space.
)? - End of the non-capturing group, and ? stating that the whole group is optional.
[\d,.]+ - A sequence of digits, commas and dots (note that between [ and ] the dot represents itself, so no backslash is needed).
If you would like to reject strings like 2...5 or 3.,44 (no consecutive dots or commas allowed), change the last part of the above regex to:
[\d]+(?:[,.][\d]+)*
Details:
[\d]+ - A sequence of digits.
(?: - A non-capturing group.
[,.] - Either a comma or a dot (single).
[\d]+ - Another sequence of digits.
)* - End of the non-capturing group, it may occur several times.
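Put together, the stricter variant could look like the sketch below. It combines the optional $ prefix from the earlier pattern with the digits-separator-digits part; the sample strings are made up for illustration.

```python
import re

# Optional "$" (with optional space), then digit runs joined by single commas or dots.
strict = re.compile(r"(?:\$\s?)?\d+(?:[,.]\d+)*")

samples = ["416", "15,704", "$ 1.250,75", "2...5", "3.,44", "$40 million"]
accepted = [s for s in samples if strict.fullmatch(s)]

print(accepted)  # ['416', '15,704', '$ 1.250,75']
```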
With a little modification to your code:
import re
letter = ["15,704", "$40 million"]
p = re.compile(r'^\d{1,3}([\.,]\d{3})*$')  # Numbers separated by commas or points
for i, line in enumerate(letter):
    m = p.match(line)
    if m:
        print(line)
Output:
15,704
You could use the following regex:
import re
pattern = re.compile(r'^[0-9,.]+\s*$')
lines = ["416", "15,704", "$40 million...."]
for line in lines:
    if pattern.match(line):
        print(line)
Output
416
15,704
The pattern ^[0-9,.]+\s*$ matches any run of digits, commas and dots, followed by zero or more whitespace characters. If you want to match only numbers with at most one , or ., use the following pattern: '^\d+[,.]?\d+\s*$'. Code:
import re
pattern = re.compile(r'^\d+[,.]?\d+\s*$')
lines = ["416", "15,704", "$40 million...."]
for line in lines:
    if pattern.match(line):
        print(line)
Output
416
15,704
The pattern ^\d+[,.]?\d+\s*$ matches everything that starts with a group of digits (\d+) followed by an optional , or . ([,.]?) followed by a group of digits, with an optional group of spaces \s*.
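Note that because both \d+ parts are mandatory, this pattern needs at least two digits, so a line containing just "5" is rejected. If single digits should count as well (an assumption about the requirements), the separator and second digit run can be made optional together. A sketch comparing the two, with relaxed as a hypothetical name:

```python
import re

original = re.compile(r"^\d+[,.]?\d+\s*$")
# Assumed variant: the separator plus second digit run is optional as a unit.
relaxed = re.compile(r"^\d+(?:[,.]\d+)?\s*$")

lines = ["5", "416", "15,704", "$40 million...."]
by_original = [line for line in lines if original.match(line)]
by_relaxed = [line for line in lines if relaxed.match(line)]

print(by_original)  # ['416', '15,704']
print(by_relaxed)   # ['5', '416', '15,704']
```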

Find and extract two substrings from string

I have some strings (in fact they are lines read from a file). The lines are just copied to some other file, but some of them are "special" and need a different treatment.
These lines have the following syntax:
someText[SUBSTRING1=SUBSTRING2]someMoreText
So, what I want is: When I have a line on which this "mask" can be applied, I want to store SUBSTRING1 and SUBSTRING2 into variables. The braces and the = shall be stripped.
I guess this consists of several tasks:
Decide if a line contains this mask
If yes, get the positions of the substrings
Extract the substrings
I'm sure this is an easy task with regex, however, I'm not used to it. I can write a huge monster function using string manipulation, but I guess this is not the "Python Way" to do this.
Any suggestions on this?
re.search() returns None if it doesn't find a match. \w matches an alphanumeric character (or underscore), + means 1 or more. Parentheses indicate the capturing groups.
import re
s = """
bla bla
someText[SUBSTRING1=SUBSTRING2]someMoreText"""
results = {}
for line_num, line in enumerate(s.split('\n')):
    m = re.search(r'\[(\w+)=(\w+)\]', line)
    if m:
        # group(0) is the whole match; the captured substrings are groups 1 and 2
        results[line_num] = {'first': m.group(1), 'second': m.group(2)}
print(results)
^[^\[\]]*\[([^\]\[=]*)=([^\]\[=]*)\][^\]\[]*$
You can try this. Group 1 and Group 2 hold the two strings you want. See demo.
https://regex101.com/r/pT4tM5/26
import re
p = re.compile(r'^[^\[\]]*\[([^\]\[=]*)=([^\]\[=]*)\][^\]\[]*$', re.MULTILINE)
test_str = "someText[SUBSTRING1=SUBSTRING2]someMoreText\nsomeText[SUBSTRING1=SUBSTRING2someMoreText\nsomeText[SUBSTRING1=SUBSTRING2]someMoreText"
print(re.findall(p, test_str))
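If you prefer to address the two substrings by name rather than by group number, named groups are one option. A minimal sketch, assuming neither substring contains [, ] or =; MASK, parse_special and the group names first and second are hypothetical.

```python
import re

# Named groups make the call site self-describing.
MASK = re.compile(r"\[(?P<first>[^\[\]=]+)=(?P<second>[^\[\]=]+)\]")

def parse_special(line):
    """Return (first, second) if the line contains [first=second], else None."""
    m = MASK.search(line)
    if m is None:
        return None
    return m.group("first"), m.group("second")

print(parse_special("someText[SUBSTRING1=SUBSTRING2]someMoreText"))
# ('SUBSTRING1', 'SUBSTRING2')
print(parse_special("bla bla"))  # None
```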

Regex - matching all text between two strings

I'm currently parsing a log file that has the following structure:
1) timestamp, preceded by # character and followed by \n
2) arbitrary # of events that happened after that timestamp and all followed by \n
3) repeat..
Here is an example:
#100
04!
03!
02!
#1299
0L
0K
0J
0E
#1335
06!
0X#
0[#
b1010 Z$
b1x [$
...
Please forgive the seemingly cryptic values, they are encodings representing certain "events".
Note: Event encodings may also use the # character.
What I am trying to do is to count the number of events that happen at a certain time.
In other words, at time 100, 3 events happened.
I am trying to match all text between two timestamps - and count the number of events by simply counting the number of newlines enclosed in the matched text.
I'm using Python's regex engine, and I'm using the following expression:
pattern = re.compile('(#[0-9]{2,}.*)(?!#[0-9]+)')
Note: The {2,} is because I want timestamps with at least two digits.
I match a timestamp, continue matching any other characters until hitting another timestamp - ending the matching.
What this returns is:
#100
#1299
#1335
So, I get the timestamps - but none of the events data - what I really care about!
I'm thinking the reason for this is that the negative lookahead is "greedy" - but I'm not completely sure.
There may be an entirely different regex that makes this much simpler - open to any suggestions!
Any help is much appreciated!
-k
I think a regex is not a good tool for the job here. You can just use a loop.
>>> import collections
>>> d = collections.defaultdict(list)
>>> with open('/tmp/spam.txt') as f:
...     t = 'initial'
...     for line in f:
...         if line.startswith('#'):
...             t = line.strip()
...         else:
...             d[t].append(line.strip())
...
>>> for k, v in d.items():
...     print(k, len(v))
...
#100 3
#1299 4
#1335 6
If you insist on a regex-based solution, I propose this (s holds the contents of the log file):
>>> import re
>>> pat = re.compile(r'(^#[0-9]{2,})\s*\n((?:[^#].*\n)*)', re.MULTILINE)
>>> for t, e in pat.findall(s):
...     print(t, e.count('\n'))
...
#100 3
#1299 4
#1335 6
Explanation:
(
  ^             anchor to start of line in multiline mode
  #[0-9]{2,}    line starting with # followed by numbers
)
\s*             skip whitespace just in case (eg. Windows line separator)
\n              new line
(
  (?:           repeat non-capturing group inside capturing group to capture all repetitions
    [^#].*\n    line not starting with #
  )*
)
You seem to have misunderstood what negative lookahead does. When it follows .*, the regex engine first tries to consume as many characters as possible and only then checks the lookahead. If the negative lookahead fails at that position (because the forbidden pattern does follow), the engine backtracks character by character until it succeeds.
You could, however, use positive lookahead together with the non-greedy .*?. Here the .*? will consume characters until the lookahead sees either a # at start of a line, or the end of the whole string:
re.compile(r'(^#[0-9]{2,})\s*\n(.*?)(?=^#|\Z)', re.DOTALL | re.MULTILINE)
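As a rough check, here is that pattern applied to the sample data from the question. The log string is copied by hand with the trailing "..." line left out, and counting events with splitlines is my own choice for this sketch.

```python
import re

# Sample data from the question, without the trailing "..." line.
log = (
    "#100\n04!\n03!\n02!\n"
    "#1299\n0L\n0K\n0J\n0E\n"
    "#1335\n06!\n0X#\n0[#\nb1010 Z$\nb1x [$\n"
)

pat = re.compile(r"(^#[0-9]{2,})\s*\n(.*?)(?=^#|\Z)", re.DOTALL | re.MULTILINE)

# Each match yields (timestamp, event block); one event per line in the block.
counts = {stamp: len(body.splitlines()) for stamp, body in pat.findall(log)}

print(counts)  # {'#100': 3, '#1299': 4, '#1335': 5}
```

Events such as 0X# do not end a block, because the lookahead only accepts a # at the start of a line.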
The reason is that the dot doesn't match newlines, so your expression will only match the lines containing the timestamp; the match won't go across multiple lines. You could pass the "dotall" flag to re.compile so that your expression will match across multiple lines. Since you say the "event encodings" might also contain a # character, you might also want to use the multiline flag and anchor your match with ^ at the beginning so it only matches the # at the beginning of a line.
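A quick way to see both effects, using a shortened copy of the sample log (shortened only for brevity). Without the dotall flag the match stops at the end of the timestamp line; with the flag alone, the greedy .* runs to the end of the input, so the pattern still needs the ^ anchoring and a way to stop at the next timestamp.

```python
import re

log = "#100\n04!\n03!\n#1299\n0L\n"
original = r"(#[0-9]{2,}.*)(?!#[0-9]+)"

# Default flags: . stops at the newline, so only the timestamps come back.
without_dotall = re.findall(original, log)
# DOTALL alone: the greedy .* swallows everything up to the end of the string.
with_dotall = re.findall(original, log, re.DOTALL)

print(without_dotall)  # ['#100', '#1299']
print(with_dotall)     # ['#100\n04!\n03!\n#1299\n0L\n']
```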
You could just loop through the data line by line and have a dictionary that just stores the number of events associated with each timestamp; no regex required. For example:
with open('exampleData') as example:
    eventCountsDict = {}
    currEvent = None
    for line in example:
        if line[0] == '#':  # replace this line with more specific timestamp details if event encodings can start with a '#'
            eventCountsDict[line] = 0
            currEvent = line
        else:
            eventCountsDict[currEvent] += 1
print(eventCountsDict)
That code prints {'#100\n': 3, '#1299\n': 4, '#1335\n': 5} for your example data (not counting the ...).
