Ignore specific words in path string - python

I have to process long paths and I'd like to ignore specific words:
'/home/me/data/dataset/images/dark-side_23---83971436re.jpg'
'/home/me/data/dataset/images/medium-side_23---83971436re.jpg'
'/home/me/data/dataset/images/others_23---83971436re.jpg'
So the output should be:
side
side
others
I'm using this regex:
pat = re.compile(r'/([^/]+)_\d+---.*.jpg$')
re.search(pat, path_string).groups()
And I've tried something with a negative lookahead, but it doesn't work:
pat = re.compile(r'/(?!dark|medium)([^/]+)_\d+---.*.jpg$')
Any ideas?
Edit: Sorry, I forgot to mention that there could be other strings like:
'/home/me/data/dataset/images/light-side_23---83971436re.jpg'
Where it should return:
light-side
So using the "-" character won't be useful in this case.

You may use
(?:(?:dark|medium)-)?([^/]+)_\d+---[^/]*\.jpg$
See the regex demo
Details
(?:(?:dark|medium)-)? - an optional group matching 1 or 0 repetitions of
(?:dark|medium) - dark or medium words (if you want to only avoid matching them as whole words use (?:\b(?:dark|medium)-)?)
- - a hyphen
([^/]+) - Group 1: any one or more chars other than /
_ - an underscore
\d+ - 1+ digits
--- - three hyphens
[^/]* - 0+ chars other than /
\.jpg - .jpg substring (. is special, thus, must be escaped)
$ - end of string.
Python demo:
import re
strs = ['/home/me/data/dataset/images/dark-side_23----83971436re.jpg',
'/home/me/data/dataset/images/medium-side_23---83971436re.jpg',
'/home/me/data/dataset/images/others_23---83971436re.jpg',
'/home/me/data/dataset/images/light-side_23---83971436re.jpg']
rx = re.compile(r'(?:(?:dark|medium)-)?([^/]+)_\d+---[^/]*\.jpg$')
for s in strs:
    m = rx.search(s)
    if m:
        print(m.group(1))
Output:
side
side
others
light-side
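As for why the lookahead attempt from the question returns nothing for the dark- and medium- names: a negative lookahead only rejects a match position, it does not skip the unwanted prefix. The following is a minimal sketch comparing the two patterns; the helper first_group and the variable names are illustrative, not part of the original answer.

```python
import re

paths = [
    '/home/me/data/dataset/images/dark-side_23---83971436re.jpg',
    '/home/me/data/dataset/images/medium-side_23---83971436re.jpg',
    '/home/me/data/dataset/images/others_23---83971436re.jpg',
    '/home/me/data/dataset/images/light-side_23---83971436re.jpg',
]

# Pattern from the question: the lookahead sits right after the last slash,
# so a name starting with "dark" or "medium" is rejected outright.
lookahead_rx = re.compile(r'/(?!dark|medium)([^/]+)_\d+---.*.jpg$')

# Pattern from the answer: the prefix is matched optionally, outside the group.
optional_rx = re.compile(r'(?:(?:dark|medium)-)?([^/]+)_\d+---[^/]*\.jpg$')


def first_group(rx, s):
    """Return group 1 of the first match, or None when nothing matches."""
    m = rx.search(s)
    return m.group(1) if m else None


lookahead_results = [first_group(lookahead_rx, p) for p in paths]
optional_results = [first_group(optional_rx, p) for p in paths]

print(lookahead_results)  # [None, None, 'others', 'light-side']
print(optional_results)   # ['side', 'side', 'others', 'light-side']
```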
NOTE that you may simplify it a bit if you first grab the last subpart by using os.path.basename(os.path.normpath(s)). Then, you may use r'^(?:(?:dark|medium)-)?(.+)_\d+---.*\.jpg$'. See this Python demo.
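A minimal sketch of that basename variant, assuming the same four sample paths; the function name extract_label is hypothetical and only used for illustration.

```python
import os
import re

paths = [
    '/home/me/data/dataset/images/dark-side_23---83971436re.jpg',
    '/home/me/data/dataset/images/medium-side_23---83971436re.jpg',
    '/home/me/data/dataset/images/others_23---83971436re.jpg',
    '/home/me/data/dataset/images/light-side_23---83971436re.jpg',
]

# The pattern only ever sees the file name, so it can be anchored at both ends.
rx = re.compile(r'^(?:(?:dark|medium)-)?(.+)_\d+---.*\.jpg$')


def extract_label(path):
    """Take the last path component, then strip the optional prefix and the suffix."""
    name = os.path.basename(os.path.normpath(path))
    m = rx.search(name)
    return m.group(1) if m else None


labels = [extract_label(p) for p in paths]
print(labels)  # ['side', 'side', 'others', 'light-side']
```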

Using ([^/]+)_\d+---.*\.jpg$ with a condition:
import re
str_list = ['/home/me/data/dataset/images/dark-side_23----83971436re.jpg',
'/home/me/data/dataset/images/medium-side_23---83971436re.jpg',
'/home/me/data/dataset/images/others_23---83971436re.jpg',
'/home/me/data/dataset/images/light-side_23---83971436re.jpg']
pat = re.compile(r'([^/]+)_\d+---.*\.jpg$')
for s in str_list:
    if "light" in s:
        print(re.search(pat, s).group(1))
    else:
        print(re.search(pat, s).group(1).rpartition('-')[2])
OUTPUT:
side
side
others
light-side

Related

Cannot seem to figure out this regex involving forward slash

I am trying to capture instances in my dataframe where a string has the following format:
/random a/random b/random c/capture this/random again/random/random
Where a string is preceded by four instances of /, and more than two / appear after it, I would like the string captured and returned in a different column. If it is not applicable to that row, return None.
In this instance capture this should be captured and placed into a new column.
This is what I tried:
import re

def extract_special_string(df, column):
    df['special_string_a'] = df[column].apply(lambda x: re.search(r'(?<=\/{4})[^\/]+(?=\/[^\/]{2,})', x).group(0) if re.search(r'(?<=\/{4})[^\/]+(?=\/[^\/]{2,})', x) else None)

extract_special_string(df, 'column')
However nothing is being captured. Can anybody help with this regex? Thanks.
You can use
df['special_string_a'] = df[column].str.extract(r'^(?:[^/]*/){4}([^/]+)(?:/[^/]*){2}', expand=False)
See the regex demo
Details:
^ - start of string
(?:[^/]*/){4} - four occurrences of any zero or more chars other than / and then a / char
([^/]+) - Capturing group 1: one or more chars other than a / char
(?:/[^/]*){2} - two occurrences of a / char and then any zero or more chars other than /.
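The pattern can also be checked without a dataframe. The sketch below uses only the re module; the sample strings other than the one from the question are made up for illustration. It also shows why the original attempt captured nothing: (?<=\/{4}) requires four consecutive slashes immediately before the match, not four slash-separated segments.

```python
import re

pattern = re.compile(r'^(?:[^/]*/){4}([^/]+)(?:/[^/]*){2}')


def extract_special_string(value):
    """Return the fifth slash-separated field if at least two more slashes follow it."""
    m = pattern.match(value)
    return m.group(1) if m else None


samples = [
    '/random a/random b/random c/capture this/random again/random/random',
    '/a/b/c/d/e',        # only one slash after the fourth field: no match
    'no slashes here',   # no match
]
results = [extract_special_string(s) for s in samples]
print(results)  # ['capture this', None, None]

# The lookbehind from the question needs the literal text "////" before the match.
original_attempt = re.search(r'(?<=/{4})[^/]+(?=/[^/]{2,})', samples[0])
print(original_attempt)  # None
```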
An alternative regex approach would be to use non-greedy quantifiers.
import re
s = '/random a/random b/random c/capture this/random again/random/random'
pattern = r'/(?:.*?/){3}(.*?)(?:/.*?){2,}'
m = re.match(pattern, s)
print(m.group(1)) # 'capture this'
/(?:.*?/){3} - match the part before capture this, matching any but as few as possible characters between each pair of /s (use noncapturing group to ignore the contents)
(.*?) - capture capture this (since this is a capturing group, we can fetch the contents from <match_object>.group(1))
(?:/.*?){2,} - same as the first part, match as few characters as possible in between each pair of /s

how to extract the front and back of a designated special token using regex?

How to extract the front and back of a designated special token(in this case, -, not #)?
And if those that are connected by - are more than two, I want to extract those too. (In the example, Bill-Gates-Foundation)
e.g)
from 'Meinda#Bill-Gates-Foundation#drug-delivery' -> ['Bill-Gates-Foundation', 'drug-delivery']
I tried p = re.compile('#(\D+)\*(\D+)')
but that was not what I wanted.
You can exclude the # char from the match and repeat the - part 1 or more times:
#([^\s#-]+(?:-[^\s#-]+)+)
Explanation
# Match literally
( Capture group 1 (returned by re.findall)
[^\s#-]+ Match 1+ non whitespace chars except - and #
(?:-[^\s#-]+)+ Repeat 1+ times matching - and again 1+ non whitespace chars except - and #
) Close group 1
Regex demo
import re
pattern = r"#([^\s#-]+(?:-[^\s#-]+)+)"
s = r"Meinda#Bill-Gates-Foundation#drug-delivery"
print(re.findall(pattern, s))
Output
['Bill-Gates-Foundation', 'drug-delivery']
#ahmet-buğra-buĞa gave an answer with regex.
If you don't have to use regex, then an easier way is to just use split.
test_str = "Meinda#Bill-Gates-Foundation#drug-delivery"
test_str.split("#")[1:]
This outputs
['Bill-Gates-Foundation', 'drug-delivery']
You can make it a function like so
def get_list_of_strings_after_first(original_str, token_to_split_on):
    # Split on the token passed in and drop the part before the first token
    return original_str.split(token_to_split_on)[1:]

get_list_of_strings_after_first("Meinda#Bill-Gates-Foundation#drug-delivery", "#")
This gives the same output
['Bill-Gates-Foundation', 'drug-delivery']
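Note that the two approaches only agree when every part after the first # contains a hyphen. A small sketch of the difference, assuming an extra made-up token "solo" appended to the sample string:

```python
import re

# "solo" is a hypothetical extra token with no hyphen, added for illustration.
s = "Meinda#Bill-Gates-Foundation#drug-delivery#solo"

# The regex keeps only #-prefixed tokens that contain at least one hyphen.
regex_tokens = re.findall(r"#([^\s#-]+(?:-[^\s#-]+)+)", s)

# split keeps everything after the first "#", hyphenated or not.
split_tokens = s.split("#")[1:]

print(regex_tokens)  # ['Bill-Gates-Foundation', 'drug-delivery']
print(split_tokens)  # ['Bill-Gates-Foundation', 'drug-delivery', 'solo']
```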

Pandas regex to remove digits before consecutive dots

I have a string Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23.
Removing all the numbers that are before the dot and after the word.
Ignoring the first part of the string i.e. "Node57Name123".
Should not remove the digits if they are inside words.
Tried re.sub(r"\d+","",string) but it removed all the other digits as well.
The output should look like this "Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape"
Can you please point me in the right direction?
You can use
re.sub(r'^([^.]*\.)|\d+(?![^.])', r'\1', text)
See the regex demo.
Details:
^([^.]*\.) - zero or more chars other than a dot and then a . char at the start of the string captured into Group 1 (referred to with \1 from the replacement pattern)
| - or
\d+(?![^.]) - one or more digits followed with a dot or end of string (=(?=\.|$)).
See the Python demo:
import re
text = r'Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23'
print( re.sub(r'^([^.]*\.)|\d+(?![^.])', r'\1', text) )
## => Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape
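The same substitution can be written with the explicit lookahead mentioned in the details. This sketch assumes Python 3.5 or later, where a reference to an unmatched group in the replacement (\1 when the second alternative matched) is replaced with an empty string; for strings without a trailing newline the two lookaheads behave the same.

```python
import re

text = 'Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23'

# Negated-class form from the answer: "not followed by a non-dot character".
compact = re.sub(r'^([^.]*\.)|\d+(?![^.])', r'\1', text)

# Explicit form: "followed by a dot or the end of the string".
explicit = re.sub(r'^([^.]*\.)|\d+(?=\.|$)', r'\1', text)

print(compact)
print(compact == explicit)  # True
```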
Just to give you a non-regex alternative using rstrip(): we can feed this function a set of characters to remove from the right of the string, e.g. rstrip('0123456789'). Alternatively, we can use the digits constant from the string module:
from string import digits
s = 'Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23'
x = '.'.join([s.split('.')[0]] + [i.rstrip(digits) for i in s.split('.')[1:]])
print(x)
Prints:
Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape
EDIT:
If you must use a regular pattern, it seems that the following covers your sample:
(\.[^.]*?)\d+\b
Replace with the 1st capture group, see the online demo
( - Open capture group:
\.[^.]*? - A literal dot followed by 0+ non-dot characters (lazy).
) - Close capture group.
\d+\b - Match 1+ digits up to a word-boundary.
A sample:
import re
s = 'Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23'
x = re.sub(r'(\.[^.]*?)\d+\b', r'\1', s)
print(x)
Prints:
Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape

Python Regex: Capture overlapping parts

Given a string s = "<foo>abcaaa<bar>a<foo>cbacba<foo>c" I'm trying to write a regular expression which will extract portions of: angle brackets with the text inside and the surrounding text. Like this:
<foo>abcaaa
abcaaa<bar>a
a<foo>cbacba
cbacba<foo>c
So expected output should look like this:
["<foo>abcaaa", "abcaaa<bar>a", "a<foo>cbacba", "cbacba<foo>c"]
I found this question, How to find overlapping matches with a regexp?, which brought me a little bit closer to the desired result, but my regex still doesn't work.
regex = r"(?=([a-c]*)\<(\w+)\>([a-c]*))"
Any ideas how to solve this problem?
You can match overlapping content with standard regex syntax by using capturing groups inside lookaround assertions, since those may match parts of the string without consuming the matched substring and hence precluding it from further matches. In this specific example, we match either the beginning of the string or a > as anchor for the lookahead assertion which captures our actual targets:
(?:\A|>)(?=([a-c]*<\w+>[a-c]*))
See regex demo.
In Python we then use the property of re.findall() to return only the text captured by groups when capturing groups are present in the expression:
import re
text = '<foo>abcaaa<bar>a<foo>cbacba<foo>c'
expr = r'(?:\A|>)(?=([a-c]*<\w+>[a-c]*))'
captures = re.findall(expr, text)
print(captures)
Output:
['<foo>abcaaa', 'abcaaa<bar>a', 'a<foo>cbacba', 'cbacba<foo>c']
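The underlying trick is easier to see on a smaller input. A minimal sketch, using a made-up string 'abcd', of how a capturing group inside a lookahead yields overlapping results while a plain pattern does not:

```python
import re

sample = 'abcd'  # illustrative input, not from the question

# A plain pattern consumes what it matches, so matches cannot overlap.
consuming = re.findall(r'\w\w', sample)

# A lookahead consumes nothing; the group inside it is still captured,
# and the scan advances one character at a time.
overlapping = re.findall(r'(?=(\w\w))', sample)

print(consuming)    # ['ab', 'cd']
print(overlapping)  # ['ab', 'bc', 'cd']
```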
You need to set the left- and right-hand boundaries to < or > chars or start/end of string.
Use
import re
text = "<foo>abcaaa<bar>a<foo>cbacba<foo>c"
print( re.findall(r'(?=(?<![^<>])([a-c]*<\w+>[a-c]*)(?![^<>]))', text) )
# => ['<foo>abcaaa', 'abcaaa<bar>a', 'a<foo>cbacba', 'cbacba<foo>c']
See the Python demo online and the regex demo.
Pattern details
(?= - start of a positive lookahead to enable overlapping matches
(?<![^<>]) - start of string, < or >
([a-c]*<\w+>[a-c]*) - Group 1 (the value extracted): 0+ a, b or c chars, then <, 1+ word chars, > and again 0+ a, b or c chars
(?![^<>]) - end of string, < or > must follow immediately
) - end of the lookahead.
You may use this regex code in python:
>>> import re
>>> s = '<foo>abcaaa<bar>a<foo>cbacba<foo>c'
>>> reg = r'([^<>]*<[^>]*>)(?=([^<>]*))'
>>> print ( [''.join(i) for i in re.findall(reg, s)] )
['<foo>abcaaa', 'abcaaa<bar>a', 'a<foo>cbacba', 'cbacba<foo>c']
RegEx Demo
RegEx Details:
([^<>]*<[^>]*>): Capture group #1 to match 0 or more characters that are not < and > followed by <...> string.
(?=([^<>]*)): Lookahead to assert that we have 0 or more non-<> characters ahead of current position. We have capture group #2 inside this lookahead.
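The ''.join step is needed because re.findall() returns one tuple per match when the pattern has more than one capturing group. A short sketch of the intermediate value, using the same string and pattern:

```python
import re

s = '<foo>abcaaa<bar>a<foo>cbacba<foo>c'
reg = r'([^<>]*<[^>]*>)(?=([^<>]*))'

# Each tuple is (group 1: consumed text, group 2: text seen by the lookahead).
pairs = re.findall(reg, s)
print(pairs)

# The text in group 2 is not consumed, so it starts the next match as well.
joined = [''.join(pair) for pair in pairs]
print(joined)
```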

Remove duplicate words in a string using regex

I'm working on my regex skills and I found that some of my strings have duplicate words at the start. I would like to remove the duplicates and keep just one occurrence of the word -
server_server_dev1_check_1233.zzz
server_server_qa1_run_1233.xyz
server_server_dev2_1233.qqa
server_dev1_1233.zzz
data_data_dev9_check_660.log
I used the below regex but i get both server_server in my output,
((.*?))_(?!\D)
How can I reduce the output to just one server_ if there are two or more, and keep it as is if there is only one server_?
The output doesn't have to contain the digits and also the part after . i.e. .zzz, .xyz etc
Expected output -
server_dev1_check
server_qa1_run
server_dev2
server_dev1
data_dev9_check
You could use a backreference to the word in your search expression:
>>> import re
>>> s = "server_server_dev1_check_1233.zzz"
>>> re.sub(r"(.*_)\1",r"\1",s)
'server_dev1_check_1233.zzz'
and use the "many times" suffix so that it still works if there are more than 2 occurrences:
>>> s = 'server_server_server_dev1_check_1233.zzz'
>>> re.sub(r"(.*_)\1{1,}",r"\1",s)
'server_dev1_check_1233.zzz'
getting rid of the suffix is not the hardest part, just capture the rest and discard the end:
>>> re.sub(r"(.*_)\1{1,}(.*)(_\d+\..*)",r"\1\2",s)
'server_dev1_check'
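One caveat: because \1{1,} requires at least one repetition, a name without a duplicated prefix does not match at all and comes back unchanged, suffix included. The sketch below runs the pattern over all five sample names from the question and compares it with a variant anchored at the start that allows zero repetitions; the variable names are illustrative.

```python
import re

names = [
    "server_server_dev1_check_1233.zzz",
    "server_server_qa1_run_1233.xyz",
    "server_server_dev2_1233.qqa",
    "server_dev1_1233.zzz",          # no duplicated prefix
    "data_data_dev9_check_660.log",
]

# Requires the prefix to be repeated at least once.
needs_duplicate = [re.sub(r"(.*_)\1{1,}(.*)(_\d+\..*)", r"\1\2", n) for n in names]

# Anchored variant: the repeated part is optional, so single prefixes match too.
optional_duplicate = [re.sub(r"^([^_]+)(?:_\1)*(.*)_\d+\.\w+$", r"\1\2", n) for n in names]

print(needs_duplicate)
# ['server_dev1_check', 'server_qa1_run', 'server_dev2', 'server_dev1_1233.zzz', 'data_dev9_check']
print(optional_duplicate)
# ['server_dev1_check', 'server_qa1_run', 'server_dev2', 'server_dev1', 'data_dev9_check']
```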
You may use a single re.sub call to match and remove what you do not need and match and capture what you need:
re.sub(r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$', r'\1\2', s)
See the regex demo
Details
^ - start of string
([^_]+) - Capturing group 1: any 1+ chars other than _
(?:_\1)* - zero or more repetitions of _ followed with the same substring as in Group 1 (thanks to the inline backreference \1 that retrieves the text from Group 1)
(.*) - Group 2: any 0+ chars, as many as possible
_ - an underscore
\d+ - 1+ digits
\. - a dot
\w+ - 1+ word chars ([^.]+ will also do, 1 or more chars other than .)
$ - end of string.
The replacement pattern is \1\2, i.e. the contents of Group 1 and 2 are concatenated and make up the resulting value.
Python demo:
import re
rx = r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$'
strs = ["server_server_dev1_check_1233.zzz", "server_server_qa1_run_1233.xyz", "server_server_dev2_1233.qqa", "server_dev1_1233.zzz", "data_data_dev9_check_660.log"]
for s in strs:
    print(re.sub(rx, r'\1\2', s))
Output:
server_dev1_check
server_qa1_run
server_dev2
server_dev1
data_dev9_check
