insert char with regular expression - python

I have a string '(abc)def(abc)' and I would like to turn it into '(a|b|c)def(a|b|c)'. I can do that by:
word = '(abc)def(abc)'
pattern = ''
index = 0  # start scanning at the first character
while index < len(word):
    if word[index] == '(':
        pattern += word[index]
        index += 1
        # add '|' after every character except the last one before ')'
        while word[index+1] != ')':
            pattern += word[index]+'|'
            index += 1
        pattern += word[index]
    else:
        pattern += word[index]
    index += 1
print(pattern)
But I would like to use a regular expression to make it shorter. How can I use a regular expression to insert the character '|' only between characters that are inside the parentheses?

How about
>>> import re
>>> re.sub(r'(?<=[a-zA-Z])(?=[a-zA-Z-][^)(]*\))', '|', '(abc)def(abc)')
'(a|b|c)def(a|b|c)'
(?<=[a-zA-Z]) Positive lookbehind. Ensures that the position to insert at is preceded by a letter.
(?=[a-zA-Z-][^)(]*\)) Positive lookahead. Ensures that the position is followed by a letter (or a hyphen, which the class [a-zA-Z-] also allows)
[^)(]*\) ensures that the letter is within the ()
[^)(]* matches anything other than ( or )
\) ensures that the run of characters other than ( or ) is followed by )
This part is crucial: it prevents a match inside def, since def is followed by ( rather than by ).
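To see how the lookarounds behave on other inputs, here is a minimal sketch that wraps the same re.sub call in a helper. The name insert_pipes is made up for illustration. The last call shows a caveat: the pattern only looks for a closing ), so letters in front of an unmatched ) get separated too.

```python
import re

# Same pattern as in the answer above; insert_pipes is a hypothetical helper name.
PIPE_POSITION = re.compile(r'(?<=[a-zA-Z])(?=[a-zA-Z-][^)(]*\))')

def insert_pipes(text):
    """Insert '|' at every position that both lookarounds accept."""
    return PIPE_POSITION.sub('|', text)

print(insert_pipes('(abc)def(abc)'))  # (a|b|c)def(a|b|c)
print(insert_pipes('x(ab)y'))         # x(a|b)y
print(insert_pipes('def'))            # def
print(insert_pipes('ab)'))            # a|b)  (no check for an opening parenthesis)
```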

I don't have enough reputation to comment, but the regex you are looking for will look like this:
"(.*)"
For each string you find, insert the '|' between each pair of characters.
Let me explain each part of the regex:
( - represents the literal character (in Python's re module it has to be escaped as \( to be matched literally).
. - A dot in regex represents any possible character.
* - In regex, this sign represents zero or more appearances of the previous character.
) - represents the literal character (escaped as \) in Python's re module).
This way, you are looking for any appearance of "()" with characters between them.
Hope I helped :)
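The idea of finding each parenthesized group and then separating its characters can be sketched with a replacement function passed to re.sub. This is only an illustration, assuming the groups are never nested; the escaped pattern \(([^()]*)\) and the helper names are assumptions, not part of the answer above.

```python
import re

def pipe_groups(text):
    """Put '|' between the characters of every (...) group in text."""
    def join_chars(match):
        # match.group(1) is the text between the parentheses
        return '(' + '|'.join(match.group(1)) + ')'
    # \( and \) match literal parentheses; [^()]* forbids nesting
    return re.sub(r'\(([^()]*)\)', join_chars, text)

print(pipe_groups('(abc)def(abc)'))  # (a|b|c)def(a|b|c)
```

Text outside the parentheses is never touched, because only the matched groups are rewritten.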

([^(])(?=[^(]*\))(?!\))
Try this. Replace with \1|. See demo.
https://regex101.com/r/sH8aR8/13
import re
p = re.compile(r'([^(])(?=[^(]*\))(?!\))')
test_str = "(abc)def(abc)"
subst = r"\1|"  # raw string, so \1 is a backreference and not the character chr(1)
result = re.sub(p, subst, test_str)

If you have only single characters in your round brackets, then what you could do would be to simply replace the round brackets with square ones. So the initial regex will look like this: (abc)def(abc) and the final regex will look like so: [abc]def[abc]. From a functional perspective, (a|b|c) has the same meaning as [abc].
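A quick way to convince yourself of that equivalence is to run both forms against a few strings. The sketch below uses made-up test strings and compares whether each pattern matches the whole string. One difference it does not show: the parenthesized form also creates capture groups, while the character class does not.

```python
import re

alternation = re.compile(r'(a|b|c)def(a|b|c)')
char_class = re.compile(r'[abc]def[abc]')

candidates = ['adefc', 'bdefb', 'ddefa', 'abdefc', 'cdef']

# fullmatch succeeds only if the entire string matches the pattern
results = {
    text: (bool(alternation.fullmatch(text)), bool(char_class.fullmatch(text)))
    for text in candidates
}
for text, (with_groups, with_class) in results.items():
    print(text, with_groups, with_class)
```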

A simple Python version to achieve the same thing. Regex is a bit hard to read and often hard to debug or change.
word = '(abc)def(abc)'
split_w = word.replace('(', ' ').replace(')', ' ').split()
split_w[0] = '|'.join( list(split_w[0]) )
split_w[2] = '|'.join( list(split_w[2]) )
print("(%s)%s(%s)" % tuple(split_w))
We split the given string into three parts, pipe-separate the first and the last part and join them back.

Related

find and replace the word on a string based on the condition

Hello, helpful people at Stack Overflow,
I have a couple of questions about manipulating a string in Python.
First question:
If I have a string like:
'What's the use?'
and I want to locate the first letter after 'the'
(in "What's the use?" that letter is u),
what is the best way to do it?
Second question:
If I want to change something in this string based on the letter I found in the first question,
how could I do it?
Thanks for helping!
You could use a regex replacement to remove all content up to and including the first 'the' (along with any following whitespace). Then, just access the first character of the result.
import re
inp = "What's the use?"
inp = re.sub(r'^.*?\bthe\b\s*', '', inp)
print("First character after first 'the' is: " + inp[0])
This prints:
First character after first 'the' is: u
Another re take:
import re
sample = "What is the use?"
pattern = r"""
(?<=\bthe\b) # look-behind to ensure 'the' is there. This is non-capturing.
\s+ # one or more whitespace characters
(\w) # Only one alphanumeric or underscore character
"""
# re.X is for verbose, which handles multi-line patterns
m = re.search(pattern, sample, flags=re.X)
# re.search returns None when nothing matches, so test before reading the group
if m is not None:
    print(f"First character after first 'the' is: {m.group(1)}")
You can find the index of 'u' by using the str.index() method. Then you can take the parts of the string before and after it with slicing.
s = "What's the use?"
character_index = s.lower().index('the ') + 4
print(character_index)
# 11
print(s[:character_index] + '*' + s[character_index+1:])
# What's the *se?
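For the second question (changing the string depending on the letter that was found), one possible sketch is to let re.sub call a function on the match. The helper name replace_after_the and the replacer callbacks are hypothetical; count=1 limits the change to the first 'the'.

```python
import re

def replace_after_the(text, replacer):
    """Apply replacer to the first word character after the first 'the'."""
    return re.sub(
        r'(\bthe\s+)(\w)',                            # group 1: 'the' plus spaces, group 2: the letter
        lambda m: m.group(1) + replacer(m.group(2)),  # keep the prefix, change the letter
        text,
        count=1,
    )

print(replace_after_the("What's the use?", str.upper))
# What's the Use?
print(replace_after_the("What's the use?", lambda ch: '*' if ch == 'u' else ch))
# What's the *se?
```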

Python, RegEx, Replace a certain part of a match

I am trying to replace a certain part of a match that a regex found.
The relevant strings have the following format:
"<Random text>[Text1;Text2;....;TextN]<Random text>"
So basically there can be N texts separated by a ";" inside the brackets.
My goal is to change the ";" into a "," (but only for the strings which are in this format) so that I can keep the ";" as a separator for a CSV file. So the result should be:
"<Random text>[Text1,Text2,...,TextN]<Random text>"
I can match the relevant strings with something like
re.compile(r'\[".*?((;).*?){1,4}"\]')
but if I try to use the sub method it replaces the whole string.
I have searched stackoverflow and I am pretty sure that "capture groups" might be the solution but I am not really getting there.
Can anyone help me?
I ONLY want to change the ";" in the ["Text1;...;TextN"]-parts of my text file.
Try this regex:
;(?=(?:(?!\[).)*])
Replace each match with a ,
Explanation:
; - matches a ;
(?=(?:(?!\[).)*]) - makes sure that the above ; is followed by a closing ] somewhere later in the string but before any opening bracket [
(?=....) - positive lookahead
(?:(?!\[).)* - 0+ occurrences of any character that is not [
] - matches a ]
If you want to match a ; before a closing ] and not matching [ in between you could use:
;(?=[^[]*])
; Match literally
(?= Positive lookahead, assert what is on the right is
[^[]* Negated character class, match 0+ times any char except [
] Match literally
) Close lookahead
Note that this will also match if there is no leading [
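Both the intended behaviour and that caveat can be checked with Python's built-in re module. This is a small sketch with made-up sample strings; the pattern is written with escaped brackets, which matches the same text and avoids Python's warning about a possible nested set.

```python
import re

# Escaped spelling of ;(?=[^[]*]) for Python's re module
SEMICOLON_BEFORE_BRACKET = re.compile(r';(?=[^\[]*\])')

def fix(text):
    """Replace every ';' that is followed by ']' with no '[' in between."""
    return SEMICOLON_BEFORE_BRACKET.sub(',', text)

print(fix('a;b["T1;T2;T3"]c;d'))  # a;b["T1,T2,T3"]c;d
print(fix('a;b]c;d'))             # a,b]c;d  (matches although there is no leading [)
```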
If you also want to make sure that there is a leading [ you could make use of the PyPI regex module and use \G and \K to match a single ;
(?:\[(?=[^[\]]*])|\G(?!^))[^;[\]]*\K;
import regex
pattern = r"(?:\[(?=[^[\]]*])|\G(?!^))[^;[\]]*\K;"
test_str = ("[\"Text1;Text2;....;TextN\"];asjkdjksd;ajksdjksad[\"Text1;Text2;....;TextN\"]\n\n"
".[\"Text1;Text2\"]...long text...[\"Text1;Text2;Text3\"]....long text...[\"Text1;...;TextN\"]...long text...\n\n"
"I ONLY want to change the \";\" in the [\"Text1;...;TextN\"]")
result = regex.sub(pattern, ",", test_str)
print (result)
Output
["Text1,Text2,....,TextN"];asjkdjksd;ajksdjksad["Text1,Text2,....,TextN"]
.["Text1,Text2"]...long text...["Text1,Text2,Text3"]....long text...["Text1,...,TextN"]...long text...
I ONLY want to change the ";" in the ["Text1,...,TextN"]
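If installing the third-party regex module is not an option, a similar effect can be sketched with the standard re module: match each whole bracketed segment and replace the semicolons inside it in a callback. This is an alternative technique, not the \G / \K approach shown above, and it assumes the brackets are not nested; the function name is made up.

```python
import re

def commas_inside_brackets(text):
    """Turn ';' into ',' only inside [...] segments."""
    # \[[^\[\]]*\] matches one bracketed segment that contains no other brackets
    return re.sub(
        r'\[[^\[\]]*\]',
        lambda m: m.group(0).replace(';', ','),
        text,
    )

sample = 'a;b["Text1;Text2;Text3"]c;d]e;f'
print(commas_inside_brackets(sample))  # a;b["Text1,Text2,Text3"]c;d]e;f
```

Because the opening [ is part of the match, a stray ] without a leading [ is left alone.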
You can try this code sample:
x = 'anbhb["Text1;Text2;...;TextN"]nbgbyhuyg["Text1;Text2;...;TextN"][]nhj,kji,'
for i in range(len(x)):
    # an opening [" marks the start of a bracketed list
    if x[i] == '[' and x[i + 1] == '"':
        # walk forward towards the closing quote, replacing each ';' on the way
        while x[i+2] != '"':
            list1 = list(x)
            if x[i] == ';':
                list1[i] = ','
                x = ''.join(list1)
            i = i + 1
print(x)

regex - how to select a word that has a '-' in it?

I am learning Regular Expressions, so apologies for a simple question.
I want to select the words that have a '-' (minus sign) in it but not at the beginning and not at the end of the word
I tried (using findall):
r'\b-\b'
for
str = 'word semi-column peace'
but, of course got only:
['-']
Thank you!
What you actually want to do is a regex like this:
\w+-\w+
This means: match one or more word characters (\w+, where the '+' means "at least once"), then a '-', then one or more word characters again. Note that \w covers letters, digits and the underscore.
str is a built-in name, so it is better not to use it as a variable name.
import re
st = 'word semi-column peace'
# \w+ word - \w+ word after -
print(re.findall(r"\b\w+-\w+\b",st))
['semi-column']
"a '-' (minus sign) in it but not at the beginning and not at the end of the word"
Since "-" is not a word character, you can't use word boundaries (\b) to reject words with hyphens at the beginning or end. A string like "-not-wanted-" still produces a match (not-wanted) for both \b\w+-\w+\b and \w+-\w+.
We need to add an extra condition before and after the word:
Before: (?<![-\w]) not preceded by a hyphen or a word character.
After: (?![-\w]) not followed by a hyphen or a word character.
Also, a word may have more than 1 hyphen in it, and we need to allow it. What we can do here is repeat the last part of the word ("hyphen and word characters") once or more:
\w+(?:-\w+)+ matches:
\w+ one or more word characters
(?:-\w+)+ a hyphen and one or more word characters, and also allows this last part to repeat.
Regex:
(?<![-\w])\w+(?:-\w+)+(?![-\w])
Code:
import re
pattern = re.compile(r'(?<![-\w])\w+(?:-\w+)+(?![-\w])')
text = "-abc word semi-column peace -not-wanted- one-word dont-match- multi-hyphenated-word"
result = re.findall(pattern, text)
print(result)
# ['semi-column', 'one-word', 'multi-hyphenated-word']
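To make the difference from the simpler \b-based pattern concrete, the following sketch runs both patterns over the same test text. The variable names loose and strict are only labels chosen here.

```python
import re

text = "-abc word semi-column peace -not-wanted- one-word dont-match- multi-hyphenated-word"

# Word boundaries alone: hyphens around the word do not prevent a match
loose = re.findall(r'\b\w+-\w+\b', text)
# Lookarounds: reject anything touching an extra hyphen or word character
strict = re.findall(r'(?<![-\w])\w+(?:-\w+)+(?![-\w])', text)

print(loose)
# ['semi-column', 'not-wanted', 'one-word', 'dont-match', 'multi-hyphenated']
print(strict)
# ['semi-column', 'one-word', 'multi-hyphenated-word']
```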
You can also use the following regex:
>>> st = "word semi-column peace"
>>> print(re.findall(r"\S+\-\S+", st))
['semi-column']
You can try something like this. Centering on the hyphen, I match in both directions until I hit whitespace or another hyphen. That also means that when a word is surrounded by hyphens (e.g. -test-cats-), the surrounding hyphens are not included in the match. The regular expression should also work with findall.
import re
st = 'word semi-column peace'
# note: inside [...] the '|' and the second '^' are literal characters, so they are excluded too
m = re.search(r'([^ | ^-]+-[^ | ^-]+)', st)
if m:
    print(m.group(1))

Python: Ignore a # / and random numbers in a string

I use some code to scrape information from a website, pass it to Google, and print directions.
I'm having an issue with some of the information: the site sometimes adds a # followed by 3 random digits, then a / and another 3 digits, e.g. #037/100.
How can I use Python to ignore this "#037/100" string?
I currently use
for i, part in enumerate(list(addr_p)):
    if '#' in part:
        del addr_p[i]
        break
to remove the # if found, but I'm not sure how to do it for the random numbers.
Any ideas?
If you find yourself wanting to remove "a # followed by three digits, a forward slash, and three more digits" from a string s, you could do
import re
s = "this is a string #123/234 with other stuff"
t = re.sub(r'#\d{3}\/\d{3}', '', s)
print(t)
Result (the spaces on both sides of the removed text remain, so there are two in a row):
this is a string  with other stuff
Explanation:
# - literal character '#'
\d{3} - exactly three digits
\/ - forward slash (the escape is optional in Python, where / has no special meaning in a regex)
\d{3} - exactly three digits
And the whole thing that matches the above (if it's present) is replaced with '' - i.e. "removed".
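Applied to a list of address parts, the same substitution might look like the sketch below. The contents of addr_p are made-up sample data and strip_marker is a hypothetical helper name. The added \s* is an assumption about what is wanted: it also removes the whitespace in front of the marker so no double space is left behind.

```python
import re

# \s* consumes the whitespace before the marker as well
MARKER = re.compile(r'\s*#\d{3}/\d{3}')

def strip_marker(part):
    """Remove a '#ddd/ddd' marker (and the spaces before it) from one string."""
    return MARKER.sub('', part)

addr_p = ['12 Example Road #037/100', 'Sampletown']  # made-up sample data
print([strip_marker(p) for p in addr_p])
# ['12 Example Road', 'Sampletown']
print(strip_marker("this is a string #123/234 with other stuff"))
# this is a string with other stuff
```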
import re
re.sub(r'#[0-9]+\/[0-9]+$', '', addr_p[i])
I'm no wizard with regular expressions, but I'd imagine you could do something like this.
You could even handle '#' in the regexp as well.
If the format is always the same, then you could check if the line starts with a #, then set the string to itself without the first 8 characters.
if part[0:1] == '#':
    part = part[8:]
If the first character is a #, this drops the first 8 characters (the length of "#037/100") and keeps the rest of the string.
I'd double your problems and match against a regular expression for this.
import re
regex = re.compile(r'([\w\s]+)#\d+\/\d+([\w\s]+)')
m = regex.match('This is a string with a #123/987 in it')
if m:
    s = m.group(1) + m.group(2)
    print(s)
A more concise way:
import re
s = "this is a string #123/234 with other stuff"
t = re.sub(r'#\S+', '', s)
print(t)

Regex divide with upper-case

I would like to turn strings like 'HDMWhoSomeThing' into 'HDM Who Some Thing' with a regex.
So I would like to extract words which start with an upper-case letter or consist of upper-case letters only. Notice that in the string 'HDMWho' the last upper-case letter is in fact the first letter of the word Who and should not be included in the word HDM.
What is the correct regex to achieve this goal? I have tried many regexes similar to [A-Z][a-z]+ but without success. [A-Z][a-z]+ gives me 'Who Some Thing', without 'HDM' of course.
Any ideas?
Thanks,
Rukki
#! /usr/bin/env python
import re
from collections import deque
pattern = r'([A-Z]{2,}(?=[A-Z]|$)|[A-Z](?=[a-z]|$))'
chunks = deque(re.split(pattern, 'HDMWhoSomeMONKEYThingXYZ'))
result = []
while len(chunks):
    buf = chunks.popleft()
    if len(buf) == 0:
        continue
    # a lone capital is the first letter of a word: glue the rest of the word back on
    if re.match(r'^[A-Z]$', buf) and len(chunks):
        buf += chunks.popleft()
    result.append(buf)
print(' '.join(result))
Output:
HDM Who Some MONKEY Thing XYZ
Judging by lines of code, this task is a much more natural fit with re.findall:
pattern = r'([A-Z]{2,}(?=[A-Z]|$)|[A-Z][a-z]*)'
print(' '.join(re.findall(pattern, 'HDMWhoSomeMONKEYThingX')))
Output:
HDM Who Some MONKEY Thing X
Try to split with this regular expression:
/(?=[A-Z][a-z])/
And if your regular expression engine does not support splitting empty matches, try this regular expression to put spaces between the words:
/([A-Z])(?![A-Z])/
Replace it with " $1" (space plus match of the first group). Then you can split at the space.
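In Python, the two suggestions above could be written roughly as follows. This is a sketch under two assumptions: splitting on an empty (lookahead-only) match needs Python 3.7 or newer, and Python spells the backreference \1 instead of $1.

```python
import re

s = 'HDMWhoSomeThing'

# Split at the empty position before "uppercase followed by lowercase".
# The filter drops the empty string that appears if s starts with such a word.
parts = [p for p in re.split(r'(?=[A-Z][a-z])', s) if p]
print(parts)  # ['HDM', 'Who', 'Some', 'Thing']

# Replacement variant: put a space before each capital that is not
# followed by another capital, then trim a possible leading space.
spaced = re.sub(r'([A-Z])(?![A-Z])', r' \1', s).strip()
print(spaced)  # HDM Who Some Thing
```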
one liner :
' '.join(a or b for a,b in re.findall('([A-Z][a-z]+)|(?:([A-Z]*)(?=[A-Z]))',s))
using regexp
([A-Z][a-z]+)|(?:([A-Z]*)(?=[A-Z]))
So 'words' in this case are:
Any number of uppercase letters - unless the last uppercase letter is followed by a lowercase letter.
One uppercase letter followed by any number of lowercase letters.
so try:
([A-Z]+(?![a-z])|[A-Z][a-z]*)
The first alternation includes a negative lookahead (?![a-z]), which handles the boundary between an all-caps word and an initial caps word.
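A short check of that pattern with re.findall, as a sketch (the helper name split_words is made up):

```python
import re

# First alternative: a run of capitals not followed by a lowercase letter.
# Second alternative: one capital followed by any number of lowercase letters.
WORDS = re.compile(r'([A-Z]+(?![a-z])|[A-Z][a-z]*)')

def split_words(text):
    """Join the words found by the pattern with single spaces."""
    return ' '.join(WORDS.findall(text))

print(split_words('HDMWhoSomeThing'))           # HDM Who Some Thing
print(split_words('HDMWhoSomeMONKEYThingXYZ'))  # HDM Who Some MONKEY Thing XYZ
```

When the run of capitals would swallow the first letter of the next word, the negative lookahead fails and the engine backtracks by one letter, which is what leaves W for Who.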
Maybe '[A-Z]*?[A-Z][a-z]+'?
Edit: This seems to work: [A-Z]{2,}(?![a-z])|[A-Z][a-z]+
import re
def find_stuff(text):  # parameter renamed from str to avoid shadowing the built-in
    p = re.compile(r'[A-Z]{2,}(?![a-z])|[A-Z][a-z]+')
    m = p.findall(text)
    result = ''
    for x in m:
        result += x + ' '
    print(result)
find_stuff('HDMWhoSomeThing')
find_stuff('SomeHDMWhoThing')
Prints out:
HDM Who Some Thing
Some HDM Who Thing
