How can I get only the words that match my regex in python? Because everything I tried also prints the full line where the string was found.
The regex is the following:
\b([1-9][0-9]{1,2})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\/([0-9]{1,2})\b
It matches an IP address followed by a CIDR prefix (e.g. 12.0.0.0/8).
The text in which I am searching this is as follows:
04/30","172.18.186.0/24","172.18.185.0/24","172.18.177.16/28","dwefwf-1.RI-nc_wefwfwefwefpat_intweb_fe","172.18.176.16/28","edefwfwf
t_pat_infwef_fe","172.18.178.16/28","dwefwefwef-wefwffwefwefwef_dr_efwefeb_fe","172.18.176.80/28","DSwefwfH2.
RI-nc_rat_dr_fweweb_fe","172.18.178.48/28","172.18.177.208/28","wefwef
wefwtfweapp_fe","172.18.176.208/28","wfwfwefwefwefH2.RI-nwefwefdr_app_fe","172.18.177.192/28","de1dfwwf-1.wefewf","172.18.176.1
92/28","
You should modify your regex as follows:
\b(([1-9][0-9]{1,2})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\/([0-9]{1,2}))\b
and then extract the first matched group: \1
Demo: http://repl.it/R0W/1 (It takes a while to run)
I think your regexp works correctly. If you want only the matched string, use the match object's group() method, like this:
import re
regexp = r'\b([1-9][0-9]{1,2})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\/([0-9]{1,2})\b'
text = '''04/30","172.18.186.0/24","172.18.185.0/24","172.18.177.16/28","dwefwf-1.RI-nc_wefwfwefwefpat_intweb_fe","172.18.176.16/28","edefwfwf
t_pat_infwef_fe","172.18.178.16/28","dwefwefwef-wefwffwefwefwef_dr_efwefeb_fe","172.18.176.80/28","DSwefwfH2.
RI-nc_rat_dr_fweweb_fe","172.18.178.48/28","172.18.177.208/28","wefwef
wefwtfweapp_fe","172.18.176.208/28","wfwfwefwefwefH2.RI-nwefwefdr_app_fe","172.18.177.192/28","de1dfwwf-1.wefewf","172.18.176.1
92/28","'''
for i in re.finditer(regexp, text):
    print(i.group(0))  # group(0) is the whole match, not the line it was found in
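As a minimal sketch of the same idea in Python 3 (the sample text is shortened here, and the names CIDR and matches are only illustrative), collecting group(0) from each match yields just the matched strings rather than the surrounding line:

```python
import re

# Same pattern as in the question; the sample text is a shortened excerpt.
CIDR = re.compile(
    r'\b([1-9][0-9]{1,2})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})/([0-9]{1,2})\b'
)

text = '04/30","172.18.186.0/24","172.18.185.0/24","dwefwf-1.RI-nc","172.18.177.16/28"'

# group(0) is the whole match, however many capture groups the pattern has.
matches = [m.group(0) for m in CIDR.finditer(text)]
print(matches)  # ['172.18.186.0/24', '172.18.185.0/24', '172.18.177.16/28']
```

By contrast, re.findall with this pattern would return a tuple of the five capture groups for each match, which is why finditer with group(0) is used here.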
I have the following path stored as a python string 'C:\ABC\DEF\GHI\App\Module\feature\src' and I would like to extract the word Module that is located between words \App\ and \feature\ in the path name. Note that there are file separators '\' in between which ought not to be extracted, but only the string Module has to be extracted.
I had a few ideas on how to do it:
Write a RegEx that matches a string between \App\ and \feature\
Write a RegEx that matches a string after \App\ --> App\\[A-Za-z0-9]*\\, and then split that matched string in order to find the Module.
I think the 1st solution is better, but unfortunately it is beyond my RegEx knowledge and I am not sure how to do it.
I would much appreciate any help.
Thank you in advance!
The regex you want is:
(?<=\\App\\).*?(?=\\feature\\)
Explanation of the regex:
(?<=behind)rest matches all instances of rest where behind comes immediately before it. This is called a positive lookbehind.
rest(?=ahead) matches all instances of rest where ahead comes immediately after it. This is a positive lookahead.
\ is a reserved character in regex patterns, so to use it as part of the pattern itself we have to escape it; hence \\
.* matches any character (except a newline), zero or more times.
? after .* makes the match non-greedy, so it stops at the first \feature\ that follows \App\.
Note that .*? also matches \ characters, so if there are further path components between \App\ and \feature\ they are all included in the match.
The full code would be something like:
import re

path = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'
start = '\\App\\'
end = '\\feature\\'
# re.escape doubles each backslash so that it is matched literally
pattern = rf"(?<={re.escape(start)}).*?(?={re.escape(end)})"
print(pattern)  # (?<=\\App\\).*?(?=\\feature\\)
print(re.search(pattern, path)[0])  # Module
A link on regex lookarounds that may be helpful: https://www.regular-expressions.info/lookaround.html
This can also be done with str.find, something like:
path = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'
start = '\\App\\'
end = '\\feature\\'
# slice from just after the first start marker up to the last end marker
print(path[path.find(start) + len(start):path.rfind(end)])
Output
Module
You are looking for groups. With some small modifications you can extract only the part between App and feature.
(?:App\\\\)([A-Za-z0-9]*)(?:\\\\feature)
The parentheses ( ) define a capture group, which you can get with match.group(1). Using (?:foo) defines a non-capturing group, i.e. one that is not included in your result. Try the expression here: https://regex101.com/r/24mkLO/1
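A minimal sketch of the capture-group approach in Python follows. It assumes the path is held in a raw string and the pattern is also a raw string, so a single literal backslash is written \\ in the pattern; the variable names are illustrative.

```python
import re

# Raw string: each backslash in the path is a single literal backslash.
path = r'C:\ABC\DEF\GHI\App\Module\feature\src'

# The (?:...) groups only anchor the match; the one capturing group holds the name.
match = re.search(r'(?:App\\)([A-Za-z0-9]*)(?:\\feature)', path)
if match:
    module = match.group(1)
    print(module)  # Module
```

Unlike the lookaround version, the whole match here (group(0)) still includes App\ and \feature; only group(1) is the bare name.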
I am facing a challenge in Python where I have a list that contains multiple strings. I want to use a Regex (findall) to search for any occurrence of each of the list's elements in a text file.
import re
name_list = ['friend', 'boy', 'man']
example_string = "friend"
file= open('file.txt', 'r')
lines= file.read()
Then comes the re.findall expression. I configured it such that it finds any occurrence in the text file where a desired string is found between a number in parentheses (\d) and a period. It works perfectly when I place a string variable inside the regular expression, as seen below.
find = re.findall(r"([^(\d)]*?"+example_string+r"[^.]*)", lines)
However, I want to be able to replace example_string with some sort of mechanism that returns each of the elements in name_list as individual strings to be placed and searched for in the regular expression. The lists I work with can get much larger than the list in this example, so please keep that in mind.
As a beginner, I tried simply replacing the string in re.findall with the list I have, only to quickly realize that that would result in an error. The solution to this must allow me to use re.findall in the aforementioned manner, so most of the challenge lies in manipulating the list so that it can produce each of its elements as individual strings to be placed within re.findall.
Thank you for your insights.
for name in name_list:
    find = re.findall(r"([^(\d)]*?"+name+r"[^.]*)", lines)
    # ... do stuff with the results
This iterates through each item in name_list and runs the same regex as before.
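As a rough sketch of how the loop might collect its results, the version below stores the matches for each name in a dictionary and passes each name through re.escape so that names containing regex metacharacters are treated as plain text. The sample text standing in for file.txt and the results dictionary are assumptions made for illustration.

```python
import re

name_list = ['friend', 'boy', 'man']
# Stand-in for the contents of file.txt (assumed sample text).
lines = "(1) my friend is here. (2) a boy runs."

results = {}
for name in name_list:
    # re.escape keeps names with regex metacharacters from altering the pattern.
    pattern = r"([^(\d)]*?" + re.escape(name) + r"[^.]*)"
    results[name] = re.findall(pattern, lines)

print(results)
# {'friend': [' my friend is here'], 'boy': [' a boy runs'], 'man': []}
```

For long lists this runs one pass over the text per name; joining the names into a single alternation is an option if that becomes slow.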
The pattern that you use ([^(\d)]*?[^.]*) for this match is not correct, see the match here.
"I configured it such that it finds any occurrence in the text file where a desired string is found between a number in parentheses (\d) and a period."
This is because the construct [^(\d)] is a negated character class: it matches any single character except (, ) or a digit, rather than a number in parentheses.
The next negated character class [^.]* matches any char except a dot, but the final dot is not matched.
A pattern that finds everything between a number in parentheses and a dot can use a capture group, whose contents re.findall returns:
\(\d+\)([^.]*(?:friend|boy|man)[^.]*)\.
See a regex 101 demo
For example, if the content of file.txt is:
this is (10) with friend and a text.
Example code, assembling the words into a non-capturing group using '|'.join(name_list):
import re

name_list = ['friend', 'boy', 'man']
pattern = rf"\(\d+\)([^.]*(?:{'|'.join(name_list)})[^.]*)\."
# the with block closes the file once it has been read
with open('file.txt', 'r') as file:
    lines = file.read()
print(re.findall(pattern, lines))
Output
[' with friend and a text']
I have a webpage's source. It's just a ton of random numbers and letters and function names, saved as a string in python3. I want to find the text that says \"followerCount\": in the source code of this string, but I also want to find a little bit of the text that follows it (n characters). This would hopefully have the piece of text I'm looking for. Can I search for a specific part of a string and the n characters that follow it in python3?
Use .find() to get the position:
html = "... lots of html source ..."
position = html.find('"followerCount":')
Then use string slicing to extract that part of the string:
n = 50 # or however many characters you want
print(html[position:position+n])
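One detail worth guarding against: str.find returns -1 when the marker is absent, and slicing from -1 silently returns the wrong text. A small sketch that handles this case is shown below; the helper name text_after and the sample html are hypothetical, and unlike the slice above it counts n characters after the marker rather than from its start.

```python
def text_after(source, marker, n):
    """Return the marker plus the n characters after it, or None if absent."""
    position = source.find(marker)
    if position == -1:  # str.find signals "not found" with -1
        return None
    return source[position:position + len(marker) + n]


html = 'junk{"followerCount":12345,"other":1}junk'
print(text_after(html, '"followerCount":', 5))  # "followerCount":12345
print(text_after(html, '"missing":', 5))        # None
```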
A standard way of looking for text based on a pattern is a regex. For example, here you can ask for any three characters following "followerCount":
import re
s = 'a bunch of randoms_characters/"followerCount":123_more_junk'
match = re.search(r'(?<="followerCount":).{3}', s)
if match:
    print(match.group(0))
    # prints 123
Alternatively you can make a regex without the lookbehind and capture the three characters in a group:
import re
s = 'a bunch of randoms_characters/"followerCount":123_more_junk'
match = re.search(r'"followerCount":(.{3})', s)
if match:
    print(match.group(1))
    # prints 123
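If the number of characters should be a variable rather than a fixed 3, the capture-group version can be built from n. The following is only a sketch: the helper name chars_after is hypothetical, and it returns up to n characters so that a marker near the end of the string still matches.

```python
import re


def chars_after(source, marker, n):
    """Return up to n characters following marker, or None if marker is absent."""
    # re.escape treats the marker as literal text; {{ and }} are literal braces in an f-string.
    pattern = rf'{re.escape(marker)}(.{{0,{n}}})'
    match = re.search(pattern, source)
    return match.group(1) if match else None


s = 'a bunch of randoms_characters/"followerCount":123_more_junk'
print(chars_after(s, '"followerCount":', 3))  # 123
```

Because . does not match newlines by default, the captured text stops at a line break unless re.DOTALL is passed to re.search.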
I have this string:
'Is?"they'
I want to find the question mark (?) in the string, and put it at the end of the string. The output should look like this:
'Is"they?'
I am using the following regular expression in python 2.7. I don't know why my regex is not working.
import re
regs = re.sub('(\w*)(\?)(\w*)', '\\1\\3\\2', 'Is?"they')
print regs
Is?"they # this is the output of my regex.
Your regex doesn't do what you want because " is not in the \w character class, so the third group matches nothing and the ? stays where it is. You would need to change it to something like:
regs = re.sub(r'(\w*)(\?)(["\w]*)', r'\1\3\2', 'Is?"they')
As shown here, " is not matched by \w. Hence, it would probably be best to just use a .:
>>> import re
>>> re.sub(r"(.*)(\?)(.*)", r'\1\3\2', 'Is?"they')
'Is"they?'
. matches any character in a regex (except newlines, unless re.DOTALL is set).
Also, you'll notice that I used a raw-string for the second argument of re.sub. Doing so is cleaner than having all those backslashes.
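To see the difference side by side, the short sketch below runs the original pattern and the . based pattern on the same input, written in Python 3 syntax with raw strings throughout. The variable names are illustrative only.

```python
import re

s = 'Is?"they'

# \w does not match the double quote, so the third group is empty and nothing moves.
original = re.sub(r'(\w*)(\?)(\w*)', r'\1\3\2', s)

# . matches any character except a newline, so the third group takes the rest.
fixed = re.sub(r'(.*)(\?)(.*)', r'\1\3\2', s)

print(original)  # Is?"they
print(fixed)     # Is"they?
```

Because the first (.*) is greedy, this moves the last question mark in the string if there are several.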
I have a text file and every time that the word "get" occurs I need to insert an # sign after it.
In Python how do I add a character after a specific word using regex? Right now I am parsing the line word by word and I don't understand regex enough to write the code.
Use re.sub() to provide replacements, using a backreference to re-use matched text:
import re
text = re.sub(r'(get)', r'\1#', text)
The (...) parentheses mark a group, which \1 refers to in the replacement. So get is replaced by get#.
Demo:
>>> import re
>>> text = 'Do you get it yet?'
>>> re.sub(r'(get)', r'\1#', text)
'Do you get# it yet?'
The pattern will match get anywhere in the string; if you need to limit it to whole words, add \b anchors:
text = re.sub(r'(\bget\b)', r'\1#', text)
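The effect of the \b anchors is easiest to see on a sentence that contains get inside a longer word. The sample sentence below is made up for illustration:

```python
import re

text = 'I get it, do not forget to get more.'

# Without anchors, the "get" inside "forget" is also changed.
anywhere = re.sub(r'(get)', r'\1#', text)

# With \b on both sides, only the standalone word is changed.
whole_words = re.sub(r'(\bget\b)', r'\1#', text)

print(anywhere)     # I get# it, do not forget# to get# more.
print(whole_words)  # I get# it, do not forget to get# more.
```

For a text file, the same re.sub call can be applied to the string returned by reading the file.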