Extract paths from lines in a file using Python

I am trying to extract paths that meet certain criteria from a given file.
Example:
I have a small file with contents something like this:
contentsaasdf /net/super/file-1.txt othercontents...
data is in /sample/random/folder/folder2/file-2.txt otherdata...
filename /otherfile/other-3.txt somewording
I want to extract the paths from the file that contain file-*.txt.
In the above example, I need the following paths as output:
/net/super/file-1.txt
/sample/random/folder/folder2/file-2.txt
Any suggestions with Python code?
I am trying a regex, but I am running into issues with multiple folders, etc. Something like:
FileRegEx = re.compile('.*(file-\\d.txt).*', re.IGNORECASE|re.DOTALL)

You don't need .*; just use character classes properly:
r'[\/\w]+file-[^.]+\.txt'
[\/\w]+ will match any combinations of word characters and /. And [^.]+ will match any combination of characters except dot.
Demo:
https://regex101.com/r/ytsZ0D/1
Note that this regex is fairly general. If you need to exclude some cases, you can use a negated character class ([^...]) or another more specific pattern, depending on your needs.
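As a quick check of the pattern above, here is a minimal sketch that runs it over the sample lines from the question. The variable names (text, pattern, paths) are illustrative only.

```python
import re

# Sample text copied from the question.
text = """contentsaasdf /net/super/file-1.txt othercontents...
data is in /sample/random/folder/folder2/file-2.txt otherdata...
filename /otherfile/other-3.txt somewording"""

# Word characters and slashes, then "file-", then anything up to the ".txt" suffix.
pattern = re.compile(r'[/\w]+file-[^.]+\.txt')

paths = pattern.findall(text)
print(paths)
# ['/net/super/file-1.txt', '/sample/random/folder/folder2/file-2.txt']
```

Keep in mind that [^.]+ also accepts spaces and newlines, so a stricter class such as \d+ may be preferable if only numbered files are expected.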

Assuming your filenames are white-space separated ...
\\s(\\S+/file-\\d+\\.txt)\\s
\\s - match a white-space character
\\S+ - matches one or more non-whitespace characters
\\d+ - matches one or more digits
\\. - matches a literal period, instead of the "match any character" meaning of an unescaped .
You can avoid the double backslashes using r'' strings:
r'\s(\S+/file-\d+\.txt)\s'
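Since the question is about reading paths from a file, the following sketch applies the whitespace-delimited pattern line by line. The helper name extract_paths and the temporary file are assumptions made for illustration. Note that the pattern requires whitespace on both sides of the path, so a path at the very start of a line would be missed.

```python
import os
import re
import tempfile

PATH_RE = re.compile(r'\s(\S+/file-\d+\.txt)\s')

def extract_paths(filename):
    """Return every whitespace-delimited .../file-<digits>.txt path in the file."""
    found = []
    with open(filename) as handle:
        for line in handle:
            # Each line keeps its trailing newline, which satisfies the final \s.
            found.extend(PATH_RE.findall(line))
    return found

# Hypothetical usage: write the sample lines to a temporary file, then scan it.
sample = (
    "contentsaasdf /net/super/file-1.txt othercontents...\n"
    "data is in /sample/random/folder/folder2/file-2.txt otherdata...\n"
    "filename /otherfile/other-3.txt somewording\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write(sample)

result = extract_paths(tmp.name)
os.remove(tmp.name)
print(result)
```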

Try this (note that it matches every path ending in .txt, not only file-*.txt):
import re
re.findall(r'/.+\.txt', s)
Output:
>>> import re
>>>
>>> s = """contentsaasdf /net/super/file-1.txt othercontents...
... data is in /sample/random/folder/folder2/file-2.txt otherdata...
... filename /otherfile/other-3.txt somewording"""
>>>
>>> re.findall(r'/.+\.txt', s)
['/net/super/file-1.txt', '/sample/random/folder/folder2/file-2.txt', '/otherfile/other-3.txt']
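The pattern above returns every path ending in .txt, including /otherfile/other-3.txt, which the question wants to exclude. One possible way to narrow the result, sketched here under the assumption that paths contain no whitespace, is to collect candidates with a regex and then filter on the last path component with fnmatch. The names candidates and wanted are illustrative.

```python
import fnmatch
import posixpath
import re

s = """contentsaasdf /net/super/file-1.txt othercontents...
data is in /sample/random/folder/folder2/file-2.txt otherdata...
filename /otherfile/other-3.txt somewording"""

# Candidate paths: a slash followed by non-whitespace characters ending in ".txt".
candidates = re.findall(r'/\S+\.txt', s)

# Keep only those whose final component looks like file-*.txt.
wanted = [p for p in candidates
          if fnmatch.fnmatchcase(posixpath.basename(p), 'file-*.txt')]
print(wanted)
```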

Related

Match File Names with Preceding Zeros

I'm trying to match exactly one file in the Printer Spooler folder with a regex. Basically, I want to match the one file called *[filename].SPL in %windir%\spool\PRINTERS using Python. I thought of using a dynamically generated regex along the lines of: [zero or more zeros] + [filename].SPL
I tried a few regular expressions on regex101.com, but the line break after the previous file name always matched as well.
File format:
02980.SPL
20980.SPL
00011.SPL
00001.SPL
Expressions I came up with:
[\r\n][^1-9]+1.SPL
[^1-9].*1.SPL
You may use
r'^0*{}\.SPL$'.format(filename)
See the regex demo online.
If filename is 1, the pattern will look like ^0*1\.SPL$ and will match:
^ - start of string
0* - zero or more 0 chars
1\.SPL - 1.SPL substring
$ - end of string.
See the Python demo:
import re
l = ['02980.SPL','20980.SPL','00011.SPL','00001.SPL']
filename = 1
rx = re.compile(r'^0*{}\.SPL$'.format(filename))
print([f for f in l if rx.search(f)])
# => ['00001.SPL']
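If the file name may contain characters that are special in a regex (a dot, for instance), it could be safer to escape it before formatting it into the pattern. The sketch below shows one way to do that; the helper name find_spool_file is made up for illustration, and fullmatch is used here in place of the ^ and $ anchors.

```python
import re

def find_spool_file(names, filename):
    """Return the names equal to optional leading zeros + filename + '.SPL'."""
    # re.escape neutralises any regex metacharacters in the file name.
    rx = re.compile(r'0*{}\.SPL'.format(re.escape(str(filename))))
    return [name for name in names if rx.fullmatch(name)]

spool = ['02980.SPL', '20980.SPL', '00011.SPL', '00001.SPL']
print(find_spool_file(spool, 1))     # ['00001.SPL']
print(find_spool_file(spool, 2980))  # ['02980.SPL']
```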

how to use python regex find matched string?

For the string "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']", I want to find "#..'...'" substrings such as "#id~'objectnavigator-card-list'" or "#class~'outbound-alert-settings'". But when I use the regex ((#.+)\~(\'.*?\')), it finds "#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings'". How can I modify the regex to find these strings?
Use non-capturing groups for the inner brackets and, instead of a greedy .+, match anything that is not the terminating character, e.g.:
re.findall(r"((?:#[^\~]+)\~(?:\'[^\]]*?\'))", test)
On your test string returns:
["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]
Limit the characters you want to match between the quotes to not match the quote:
>>> re.findall(r'#[a-z]+~\'[-a-z]*\'', x)
I find it's much easier to look for only the characters I know are going to be in a matching section rather than omitting characters from more permissive matches.
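To see why the original pattern overshoots and how a restricted character class avoids it, here is a small comparison. It is only a sketch: the variable name x follows the snippet above, and the character classes assume that attribute names are lowercase letters and that values contain only lowercase letters and hyphens.

```python
import re

x = "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']"

# Greedy ".+" runs to the end of the string and backtracks only to the last "~".
greedy = re.search(r"#.+~'.*?'", x).group()
print(greedy)

# Restricting each part to the characters it may contain keeps the matches separate.
strict = re.findall(r"#[a-z]+~'[-a-z]*'", x)
print(strict)
```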
For your current test string's input you can try this pattern:
import re
a = "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']"
# match everything from '#' up to (but not including) the next ']'
regex = re.compile(r'(#[^\]]+)')
strings = re.findall(regex, a)
# Or simply:
# strings = re.findall('(#[^\\]]+)', a)
print(strings)
Output:
["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]

Get the last 4 characters of a string as long as they are special characters

I have web URLs that look like this:
http://example.com/php?id=2/*
http://example.com/php?id=2'
http://example.com/php?id=2*/"
What I need to do is grab the last characters of the string. I've tried:
for url in html_page:
    syntax = list(url)[-1]
    # <= *
    # <= '
    # etc...
However, this only grabs the last character of the string. Is there a way to grab the trailing characters as long as they are special characters?
Use a regex. Assuming that by "special character" you mean "anything besides A-Za-z0-9 and the underscore":
>>> import re
>>> re.search(r"\W+$", "http://example.com/php?id=2*/'").group()
"*/'"
\W+ matches one or more "non-word" characters, and $ anchors the search to the end of the string.
Use a regular expression:
import re
addr = "http://example.com/php?id=2*/"
# re.search takes the pattern first, then the string to search
chars = re.search(r"[\*\./_]{0,4}$", addr).group()
Characters you want to match are the ones between the [] brackets. You may want to add or remove characters depending on what you expect to encounter.
For example, you would (probably) not want to match the '=' character in your example URLs, which the other answer would match.
{0,4} means to match 0-4 characters (defaults to being greedy)
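The difference between the two answers can be sketched side by side. In this example the character class is extended with both quote characters so that it covers the URLs from the question; that extension, the extra fourth URL, and the names SPECIAL_TAIL, NON_WORD_TAIL and tail are assumptions for illustration.

```python
import re

urls = [
    "http://example.com/php?id=2/*",
    "http://example.com/php?id=2'",
    'http://example.com/php?id=2*/"',
    "http://example.com/php?id=",
]

# Assumption: quotes are added to the class so the question's URLs are covered.
SPECIAL_TAIL = re.compile(r"""[*./_'"]{0,4}$""")
NON_WORD_TAIL = re.compile(r"\W+$")

def tail(pattern, url):
    """Return the trailing run matched by pattern, or '' when nothing matches."""
    match = pattern.search(url)
    return match.group() if match else ""

for url in urls:
    print(repr(tail(SPECIAL_TAIL, url)), repr(tail(NON_WORD_TAIL, url)))
```

For the last URL the explicit class returns an empty string, while \W+$ returns the trailing "=".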

Python regex, pattern-match multiple backslash characters

I have a python raw string, that has five backslash characters followed by a double quote. I am trying to pattern-match using python re.
The output must print the matching pattern. In addition, two characters before/after the pattern.
import re
command = r'abc\\\\\"abc'
search_string = '.{2}\\\\\\\\\\".{2}'
pattern = re.compile(search_string)
ts_name = pattern.findall(command)
print(ts_name)
The output shows,
['\\\\\\\\"ab']
I expected
['bc\\\\\"ab']
Anomalies:
1) The two extra characters at the front (bc) are missing
2) Magically, it prints eight backslashes when the input string contains just five backslashes
You can simplify (shorten) your regex and use the search function to get your output:
import re
command = r'abc\\\\\"abc'
search_string = r'.{2}(?:\\){5}".{2}'
print(re.compile(search_string).search(command).group())
Output:
bc\\\\\"ab
Your regex should also use the r prefix, so that the backslashes reach the regex engine unchanged.
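The two anomalies in the question can be reproduced with a short experiment. Without the r prefix, Python itself consumes half of the typed backslashes before the regex engine sees the pattern, and printing a list shows each string's repr, in which every backslash appears doubled. The sketch below is illustrative; the names non_raw, raw, short_match and full_match are not from the original post.

```python
import re

command = r'abc\\\\\"abc'   # raw string: five backslashes, then a double quote
print(command.count('\\'))   # 5

# Without the r prefix, the ten typed backslashes become five, which the
# regex engine reads as two literal backslashes plus an escaped double quote.
non_raw = '.{2}\\\\\\\\\\".{2}'
short_match = re.findall(non_raw, command)[0]
print(short_match)           # printed directly, so no repr doubling

# With the r prefix, \\{5} means exactly five literal backslashes.
raw = r'.{2}\\{5}".{2}'
full_match = re.findall(raw, command)[0]
print(full_match)
```

Here short_match holds four backslashes (two consumed by .{2} and two matched literally), which is why the list in the question shows eight.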
Alternatively, add a capturing group around the part you want in the pattern:
match = re.search(r'a(bc\\{5}"ab)c', command)
and access it with:
match.group(1)

extracting multiple instances regex python

I have a string:
This is #lame
Here I want to extract lame. But here is the issue: the above string can also be
This is lame
Here I don't extract anything. And the string can also be:
This is #lame but that is #not
Here I extract lame and not.
So, the output I am expecting in each case is:
[lame]
[]
[lame,not]
How do I extract these in a robust way in Python?
Use re.findall() to find multiple patterns; in this case for anything that is preceded by #, consisting of word characters:
re.findall(r'(?<=#)\w+', inputtext)
The (?<=..) construct is a positive lookbehind assertion; it only matches if the current position is preceded by a # character. So the above pattern matches 1 or more word characters (the \w character class) only if those characters were preceded by an # symbol.
Demo:
>>> import re
>>> re.findall(r'(?<=#)\w+', 'This is #lame')
['lame']
>>> re.findall(r'(?<=#)\w+', 'This is lame')
[]
>>> re.findall(r'(?<=#)\w+', 'This is #lame but that is #not')
['lame', 'not']
If you plan on reusing the pattern, do compile the expression first, then use the .findall() method on the compiled regular expression object:
at_words = re.compile(r'(?<=#)\w+')
at_words.findall(inputtext)
This saves you a cache lookup every time you call .findall().
You should use the re library. Here is an example:
import re
test_case = "This is #lame but that is #not"
regular = re.compile(r"#\w*")
# each match includes the leading '#', e.g. '#lame'
lst = regular.findall(test_case)
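The pattern above keeps the leading # in each match. If only the word is wanted, a capturing group is one alternative to the lookbehind shown earlier: findall returns just the group when the pattern contains exactly one. The names tag_words and extract_tags are illustrative.

```python
import re

tag_words = re.compile(r'#(\w+)')

def extract_tags(text):
    """Return the words that directly follow a '#' in text."""
    return tag_words.findall(text)

print(extract_tags("This is #lame"))                   # ['lame']
print(extract_tags("This is lame"))                    # []
print(extract_tags("This is #lame but that is #not"))  # ['lame', 'not']
```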
This will give the output you requested:
import re
regex = re.compile(r'(?<=#)\w+')
print(regex.findall('This is #lame'))
print(regex.findall('This is lame'))
print(regex.findall('This is #lame but that is #not'))
