I am trying to extract the paths from a given file that meet some criteria.
Example:
I have a small file with contents something like:
contentsaasdf /net/super/file-1.txt othercontents...
data is in /sample/random/folder/folder2/file-2.txt otherdata...
filename /otherfile/other-3.txt somewording
I want to extract the paths from the file whose file name matches file-*.txt.
In the above example, I need the paths below as output:
/net/super/file-1.txt
/sample/random/folder/folder2/file-2.txt
Any suggestions with Python code?
I am trying a regex, but I am facing issues with multiple folders, etc. Something like:
FileRegEx = re.compile('.*(file-\\d.txt).*', re.IGNORECASE|re.DOTALL)
You don't need .*; just use character classes properly:
r'[\/\w]+file-[^.]+\.txt'
[\/\w]+ will match any combination of word characters and /, and [^.]+ will match any combination of characters except a dot.
Demo:
https://regex101.com/r/ytsZ0D/1
Note that this regex may be too general. If you want to exclude some cases, you can use ^ within a character class (a negated class) or another more specific pattern, depending on your needs.
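As a quick check, here is a minimal sketch that runs the pattern above against the sample text from the question. The variable names text and path_re are only for illustration.

```python
import re

# Sample text copied from the question.
text = """contentsaasdf /net/super/file-1.txt othercontents...
data is in /sample/random/folder/folder2/file-2.txt otherdata...
filename /otherfile/other-3.txt somewording"""

# [/\w]+          : the directory part (slashes and word characters)
# file-[^.]+\.txt : the file name, e.g. file-1.txt
path_re = re.compile(r'[/\w]+file-[^.]+\.txt')

paths = path_re.findall(text)
print(paths)
```

One thing to keep in mind: a directory name containing a character outside [/\w], such as a hyphen, would cut the match short, so the class may need widening for real data.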
Assuming your filenames are white-space separated ...
\\s(\\S+/file-\\d+\\.txt)\\s
\\s - match a white-space character
\\S+ - matches one or more non-whitespace characters
\\d+ - matches one or more digits
\\. - matches a literal period, instead of letting . match any character
You can avoid the double backslashes using r'' strings:
r'\s(\S+/file-\d+\.txt)\s'
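Because the pattern requires a whitespace character on both sides, a path at the very start or end of the text is skipped. The sketch below shows this and one possible variant using lookarounds; the variant and the names strict and relaxed are assumptions for illustration, not part of the answer above.

```python
import re

# Pattern from the answer: needs real whitespace before and after the path.
strict = re.compile(r'\s(\S+/file-\d+\.txt)\s')

# Variant (an assumption): the lookarounds only assert that no
# non-whitespace character is adjacent, which also holds at the
# start and end of the string.
relaxed = re.compile(r'(?<!\S)(\S+/file-\d+\.txt)(?!\S)')

text = "/net/super/file-1.txt and /sample/folder/file-2.txt"

strict_hits = strict.findall(text)
relaxed_hits = relaxed.findall(text)
print(strict_hits)
print(relaxed_hits)
```

Here strict finds nothing, since neither path has whitespace on both sides, while relaxed returns both paths.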
Try this:
import re
re.findall(r'/.+\.txt', s)
# Output: ['/net/super/file-1.txt', '/sample/random/folder/folder2/file-2.txt', '/otherfile/other-3.txt']
Output:
>>> import re
>>>
>>> s = """contentsaasdf /net/super/file-1.txt othercontents...
... data is in /sample/random/folder/folder2/file-2.txt otherdata...
... filename /otherfile/other-3.txt somewording"""
>>>
>>> re.findall(r'/.+\.txt', s)
['/net/super/file-1.txt', '/sample/random/folder/folder2/file-2.txt', '/otherfile/other-3.txt']
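The result above also contains /otherfile/other-3.txt, which does not fit the file-*.txt criterion from the question. One way to narrow it down, sketched here under the assumption that the paths use forward slashes, is to filter the candidates by base name with the standard fnmatch module. The names candidates and wanted are only illustrative.

```python
import fnmatch
import posixpath
import re

s = """contentsaasdf /net/super/file-1.txt othercontents...
data is in /sample/random/folder/folder2/file-2.txt otherdata...
filename /otherfile/other-3.txt somewording"""

# Step 1: collect every /...txt candidate, as in the answer above.
candidates = re.findall(r'/.+\.txt', s)

# Step 2: keep only the candidates whose file name matches file-*.txt.
wanted = [p for p in candidates
          if fnmatch.fnmatchcase(posixpath.basename(p), 'file-*.txt')]
print(wanted)
```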
Related
I'm trying to match exactly one file in the Printer Spooler folder with a regex. Basically, what I'm going for is to match the one file called *[filename].SPL in %windir%\spool\PRINTERS using Python. I thought of using a dynamically generated regex along the lines of: [zero or more leading zeros] + [filename].SPL
I tried a few regular expressions on regex101.com, but always had the issue that the line break after the previous file name matched as well.
File format:
02980.SPL
20980.SPL
00011.SPL
00001.SPL
Expressions I came up with:
[\r\n][^1-9]+1.SPL
[^1-9].*1.SPL
You may use
r'^0*{}\.SPL$'.format(filename)
See the regex demo online.
If filename is 1, the pattern will look like ^0*1\.SPL$ and will match:
^ - start of string
0* - zero or more 0 chars
1\.SPL - 1.SPL substring
$ - end of string.
See the Python demo:
import re
l = ['02980.SPL','20980.SPL','00011.SPL','00001.SPL']
filename=1
rx = re.compile(r'^0*{}\.SPL$'.format(filename))
print([f for f in l if rx.search(f) ])
# => ['00001.SPL']
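Two small variations on the demo above, offered as a sketch rather than a definitive version: re.escape guards against a file name that contains regex metacharacters, and re.MULTILINE lets ^ and $ anchor at line breaks when the whole listing is one string (the situation described on regex101). The helper name spool_pattern is hypothetical.

```python
import re

def spool_pattern(filename):
    """Hypothetical helper: leading zeros, the escaped name, then .SPL."""
    return re.compile(r'0*{}\.SPL'.format(re.escape(str(filename))))

files = ['02980.SPL', '20980.SPL', '00011.SPL', '00001.SPL']

# fullmatch() anchors at both ends, so ^ and $ are not needed here.
rx = spool_pattern(1)
matches = [f for f in files if rx.fullmatch(f)]
print(matches)

# If the names arrive as one multi-line string, anchor per line instead.
listing = "\n".join(files)
line_matches = re.findall(r'^0*1\.SPL$', listing, re.MULTILINE)
print(line_matches)
```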
For the string "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']", I want to find each "#..'...'" part, such as "#id~'objectnavigator-card-list'" or "#class~'outbound-alert-settings'". But when I use the regex ((#.+)\~(\'.*?\')), it finds "#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings'". How should I modify the regex to find these strings?
Use non-capturing groups for the inner brackets and, instead of a greedy wildcard, match only characters that are not the terminating character, e.g.:
re.findall(r"((?:#[^\~]+)\~(?:\'[^\]]*?\'))", test)
On your test string returns:
["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]
Limit the characters you want to match between the quotes to not match the quote:
>>> re.findall(r'#[a-z]+~\'[-a-z]*\'', x)
I find it's much easier to look for only the characters I know are going to be in a matching section rather than omitting characters from more permissive matches.
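To make the difference concrete, here is a small sketch comparing a greedy .+ with a negated character class on the string from the question. The names greedy and tight are only for illustration, and the tight pattern is one possible spelling of the idea, not the only one.

```python
import re

s = "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']"

# Greedy: .+ runs to the end of the string and backtracks only as far
# as the LAST '~', so both attributes end up in a single match.
greedy = re.search(r"#.+~'.*?'", s).group()

# Negated classes: [^~]+ cannot cross a '~' and [^']* cannot cross a
# quote, so each attribute is matched on its own.
tight = re.findall(r"#[^~]+~'[^']*'", s)

print(greedy)
print(tight)
```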
For your current test string's input you can try this pattern:
import re
a = "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']"
# match everything that begins with '#', up to but not including ']'
regex = re.compile(r'(#[^\]]+)')
strings = re.findall(regex, a)
# Or simply:
# strings = re.findall('(#[^\\]]+)', a)
print(strings)
Output:
["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]
I have web URLs that look like this:
http://example.com/php?id=2/*
http://example.com/php?id=2'
http://example.com/php?id=2*/"
What I need to do is grab the last characters of the string. I've tried:
for url in html_page:
    syntax = list(url)[-1]
    # <= *
    # <= '
    # etc...
However, this will only grab the last character of the string. Is there a way I could grab all of the trailing characters, as long as they are special characters?
Use a regex. Assuming that by "special character" you mean "anything besides A-Za-z0-9 and the underscore":
>>> import re
>>> re.search(r"\W+$", "http://example.com/php?id=2*/'").group()
"*/'"
\W+ matches one or more "non-word" characters, and $ anchors the search to the end of the string.
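Note that re.search returns None when a URL has no trailing special characters, so calling .group() directly would raise an AttributeError in that case. A minimal sketch that handles this follows; the helper name trailing_special and the fourth URL are assumptions added for illustration.

```python
import re

TRAILING_NONWORD = re.compile(r"\W+$")

def trailing_special(url):
    """Hypothetical helper: trailing non-word characters, or '' if none."""
    m = TRAILING_NONWORD.search(url)
    return m.group() if m else ""

urls = [
    "http://example.com/php?id=2/*",
    "http://example.com/php?id=2'",
    "http://example.com/php?id=2*/\"",
    "http://example.com/php?id=2",   # nothing special at the end
]
for url in urls:
    print(repr(trailing_special(url)))
```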
Use a regular expression?
import re
addr = "http://example.com/php?id=2*/"
chars = re.search(r"[\*\./_]{0,4}$", addr).group()
Characters you want to match are the ones between the [] brackets. You may want to add or remove characters depending on what you expect to encounter.
For example, you would (probably) not want to match a trailing '=' character in a URL, which the \W-based answer would match.
{0,4} means to match 0-4 characters (defaults to being greedy)
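The sketch below contrasts the two approaches on a URL that ends in '='. The URL is a made-up variation of the ones in the question, and the variable names are illustrative. Since {0,4} also allows zero characters, the explicit-class pattern always finds a (possibly empty) match at the end of the string, so .group() is safe to call.

```python
import re

url = "http://example.com/php?id="   # hypothetical URL ending in '='

# \W matches any non-word character, so the trailing '=' is captured.
nonword = re.search(r"\W+$", url).group()

# The explicit class lists only the characters of interest; '=' is not
# among them, so the result is an empty string.
listed = re.search(r"[*./_]{0,4}$", url).group()

print(repr(nonword))
print(repr(listed))
```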
I have a Python raw string that has five backslash characters followed by a double quote. I am trying to pattern-match it using Python's re module.
The output must print the matching pattern and, in addition, the two characters before and after it.
import re
command = r'abc\\\\\"abc'
search_string = '.{2}\\\\\\\\\\".{2}'
pattern = re.compile(search_string)
ts_name = pattern.findall(command)
print(ts_name)
The output shows,
['\\\\\\\\"ab']
I expected
['bc\\\\\"ab']
Anomalies:
1) The two extra characters at the front (bc) are missing
2) Magically, it prints eight backslashes when the input string contains just five backslashes
You can simplify (shorten) your regex and use the search function to get your output:
import re
command = r'abc\\\\\"abc'
search_string = r'.{2}(?:\\){5}".{2}'
print(re.compile(search_string).search(command).group())
Output:
bc\\\\\"ab
Your regex should also use r prefix.
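Both anomalies from the question come down to counting backslashes at each layer. The following sketch walks through it; the variable names non_raw, raw, short and full are only for illustration.

```python
import re

command = r'abc\\\\\"abc'          # raw string: 5 backslashes, then "
non_raw = '.{2}\\\\\\\\\\".{2}'    # 10 typed backslashes -> 5 in the pattern
raw = r'.{2}(?:\\){5}".{2}'        # regex: 5 escaped (literal) backslashes

# The regex engine reads the 5 pattern backslashes in non_raw as
# \\ \\ \" : only two literal backslashes followed by a quote.
short = re.search(non_raw, command).group()
full = re.search(raw, command).group()

# Number of backslashes actually contained in each match.
print(short.count('\\'), full.count('\\'))

print(full)      # print() shows the string itself
print([full])    # a list shows repr(), which doubles every backslash
```

The non-raw pattern only demands two backslashes, so .{2} ends up consuming two of the five backslashes instead of "bc", and the match holds four backslashes in total. Printing a list displays each of them doubled, which is where the eight backslashes in the question come from.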
Just add a capturing group around the part of the pattern you want:
search_string = r'(.{2}\\{5}".{2})'
and access it on the resulting match object with:
match.group(1)
I have a string:
This is #lame
Here I want to extract lame. But here is the issue: the above string can also be
This is lame
Here I don't extract anything. And then the string can be:
This is #lame but that is #not
Here I extract lame and not.
So, the output I am expecting in each case is:
[lame]
[]
[lame,not]
How do I extract these in a robust way in Python?
Use re.findall() to find multiple patterns; in this case for anything that is preceded by #, consisting of word characters:
re.findall(r'(?<=#)\w+', inputtext)
The (?<=..) construct is a positive lookbehind assertion; it only matches if the current position is preceded by a # character. So the above pattern matches 1 or more word characters (the \w character class) only if those characters are preceded by a # symbol.
Demo:
>>> import re
>>> re.findall(r'(?<=#)\w+', 'This is #lame')
['lame']
>>> re.findall(r'(?<=#)\w+', 'This is lame')
[]
>>> re.findall(r'(?<=#)\w+', 'This is #lame but that is #not')
['lame', 'not']
If you plan on reusing the pattern, do compile the expression first, then use the .findall() method on the compiled regular expression object:
at_words = re.compile(r'(?<=#)\w+')
at_words.findall(inputtext)
This saves you a cache lookup every time you call .findall().
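For comparison, a capturing group gives the same results as the lookbehind here, because findall returns only the group when the pattern contains exactly one. The sketch below checks both on the three strings from the question; the names LOOKBEHIND, CAPTURE and tagged_words are hypothetical.

```python
import re

LOOKBEHIND = re.compile(r'(?<=#)\w+')   # '#' is asserted, not consumed
CAPTURE = re.compile(r'#(\w+)')         # '#' is consumed, the group is returned

def tagged_words(text):
    """Hypothetical helper: every word that directly follows a '#'."""
    return LOOKBEHIND.findall(text)

samples = [
    'This is #lame',
    'This is lame',
    'This is #lame but that is #not',
]
for sample in samples:
    print(tagged_words(sample), CAPTURE.findall(sample))
```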
You should use the re library. Here is an example:
import re
test_case = "This is #lame but that is #not"
# capture only the word after '#', so the '#' itself is not returned
regular = re.compile(r"#(\w+)")
lst = regular.findall(test_case)
This will give the output you requested:
import re
regex = re.compile(r'(?<=#)\w+')
print(regex.findall('This is #lame'))
print(regex.findall('This is lame'))
print(regex.findall('This is #lame but that is #not'))