Removing variable length characters from a string in python

I have strings that are of the form below:
<p>The is a string.</p>
<em>This is another string.</em>
They are read in from a text file one line at a time. I want to separate these into words. For that I am just splitting the string using split().
Now I have a set of words but the first word will be <p>The rather than The. Same for the other words that have <> next to them. I want to remove the <..> from the words.
I'd like to do this in one line. What I mean is that I want to pass as a parameter something of the form <*>, like I would on the command line. I was thinking of using the replace() function for this, but I am not sure what its argument would look like.
For example, how could I change <..> below in a way that it will mean that I want to include anything that is between < and >:
x = x.replace("<..>", "")

Unfortunately, str.replace does not support Regex patterns. You need to use re.sub for this:
>>> from re import sub
>>> sub("<[^>]*>", "", "<p>The is a string.</p>")
'The is a string.'
>>> sub("<[^>]*>", "", "<em>This is another string.</em>")
'This is another string.'
>>>
[^>]* matches zero or more characters that are not >.
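Since the question goes on to split each line into words, the substitution and the split can be chained. This is a minimal sketch, not part of the original answer; the names TAG, strip_tags and lines are made up for illustration, and the two sample lines stand in for lines read from the file.

```python
import re

# "<", then any run of characters that are not ">", then ">"
TAG = re.compile(r"<[^>]*>")

def strip_tags(line):
    """Remove every <...> chunk from the line, then split it into words."""
    return TAG.sub("", line).split()

# Stand-in for lines read from the text file (assumption)
lines = ["<p>The is a string.</p>", "<em>This is another string.</em>"]
for line in lines:
    print(strip_tags(line))
```

Note that punctuation stays attached to the words (for example 'string.'), because split() only splits on whitespace.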

No Need for a 2-Step Solution
You don't need to 1. Split then 2. Replace. The two solutions below show you how to do it with one single step.
Option 1: Match All Instead of Splitting
Match All and Split are Two Sides of the Same Coin, and in this case it is safer to match all:
<[^>]+>|(\w+)
The words will be in Group 1.
Use it like this:
import re
subject = '<p>The is a string.</p><em>This is another string.</em>'
regex = re.compile(r'<[^>]+>|(\w+)')
# findall returns Group 1 for every match; tag matches leave it empty, so drop those
matches = [group for group in regex.findall(subject) if group]
print(matches)
Output
['The', 'is', 'a', 'string', 'This', 'is', 'another', 'string']
Discussion
This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."
The left side of the alternation | matches complete <tags>. We will ignore these matches. The right side matches and captures words to Group 1, and we know they are the right ones because they were not matched by the expression on the left.
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
Article about matching a pattern unless...
Option 2: One Single Split
<[^>]+>|[ .]
On the left side of the |, we use <complete tags> as a split delimiter. On the right side, we use a space character or a period.
Output
This
is
a
string
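The answer shows only the pattern and the output for Option 2, so here is a minimal sketch of how the split might be run. The subject string is an assumption chosen to reproduce the output above; the filtering step is needed because re.split returns empty strings wherever two delimiters are adjacent.

```python
import re

# Assumed input; the original answer does not show it
subject = "<em>This is a string.</em>"

# Split on complete tags, spaces, and periods, then drop the empty
# strings produced where delimiters sit next to each other.
words = [w for w in re.split(r"<[^>]+>|[ .]", subject) if w]
for w in words:
    print(w)
```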

Related

Need Regex that matches all patterns with format as `{word}{.,#}{word}` with strict matching

So I have been trying to construct a regex that can detect the pattern {word}{.,#}{word} and separate it into [word, ',' (or '.', '#'), word].
But I am not able to create one that does strict matching for this pattern and ignores everything else.
I used the following regex
r"[\w]+|[.]"
This one does well, but it doesn't do strict matching: if none of the characters (, # or .) occur in the text, it still gives me words, which I don't want.
I would like a regex that strictly matches the above pattern and gives me the splits (using re.findall), and otherwise returns the whole word as it is.
Please note: the words on either side of the {,.#} do not both have to be present, but at least one must be.
Some example text for reference:
no.16 would give me ['no','.','16']
#400 would give me ['#','400']
word1.word2 would give me ['word1','.','word2']
Looking forward to some help and assistance from all regex gurus out there
EDIT:
I forgot to add this. @viktor's version works as needed with only one problem: it ignores ALL other words during re.findall.
E.g. ONE TWO THREE #400 with viktor's regex gives me [('', '#', '400')]
but what I expected was ['ONE','TWO','THREE','#','400']
This can be done with NLTK or spaCy, but I am not able to use those.

I suggest using
(\w+)?([.,#])((?(1)\w*|\w+))
See the regex demo.
Details
(\w+)? - An optional group #1: one or more word chars
([.,#]) - Group #2: ., , or #
((?(1)\w*|\w+)) - Group #3: if Group 1 matched, match zero or more word chars (the word is optional on the right side then), else, match one or more word chars (there must be a word on the right side of the punctuation chars since there is no word before them).
See the Python demo:
import re
pattern = re.compile(r'(\w+)?([.,#])((?(1)\w*|\w+))')
strings = ['no.16', '#400', 'word1.word2', 'word', '123']
for s in strings:
    print(s, ' -> ', pattern.findall(s))
Output:
no.16 -> [('no', '.', '16')]
#400 -> [('', '#', '400')]
word1.word2 -> [('word1', '.', 'word2')]
word -> []
123 -> []
The answer to your edit is
if re.search(r'\w[.,#]|[.,#]\w', text):
    print( re.findall(r'[.,#]|[^\s.,#]+', text) )
If the input string contains a word char followed by any of the three punctuation symbols, or one of those symbols followed by a word char, you can find and extract all occurrences of the [.,#]|[^\s.,#]+ pattern, namely a ., , or #, or one or more chars other than whitespace, ., , and #.
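The check-then-extract logic can be wrapped in a small function. This is a sketch only: the name tokenize and the fallback of returning the input as a one-element list (to cover the "return the whole word as it is" requirement from the question) are assumptions, not part of the answer.

```python
import re

# A word char next to one of the punctuation symbols, on either side
HAS_PATTERN = re.compile(r"\w[.,#]|[.,#]\w")
# Either one punctuation symbol, or a run of chars that are neither
# whitespace nor one of the punctuation symbols
TOKEN = re.compile(r"[.,#]|[^\s.,#]+")

def tokenize(text):
    """Split text into words and punctuation if the pattern occurs in it."""
    if HAS_PATTERN.search(text):
        return TOKEN.findall(text)
    # Assumed fallback: hand the text back untouched
    return [text]

for sample in ["ONE TWO THREE #400", "no.16", "word1.word2", "word"]:
    print(sample, "->", tokenize(sample))
```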

I hope this code will solve your problem if you want to split the string by any of the mentioned special characters:
import re
a = 'no.16'
b = '#400'
c = 'word1.word2'
lst = [a, b, c]
for elem in lst:
    # The capturing group keeps the delimiter in the result
    result = re.split(r'(\.|#|,)', elem)
    # Drop the empty strings that split produces at the edges
    while '' in result:
        result.remove('')
    print(result)

You could do something like this:
import re
text = "no.16"
pattern = re.compile(r"(\w+)([.,#])(\w+)")
result = list(filter(None, pattern.split(text)))
The list(filter(...)) part is needed to remove the empty strings that split returns at the beginning and end of the list (see the related question on extra empty strings in re.split results).
However, this will only work if your string only contains these two words separated by one of the delimiters specified by you. If there is additional content before or after the pattern, this will also be returned by split.

Python regular expression to replace everything but specific words

I am trying to do the following with a regular expression:
import re
x = re.compile('[^(going)|^(you)]') # words to replace
s = 'I am going home now, thank you.' # string to modify
print(re.sub(x, '_', s))
The result I get is:
'_____going__o___no______n__you_'
The result I want is:
'_____going_________________you_'
Since the ^ can only be used inside brackets [], this result makes sense, but I'm not sure how else to go about it.
I even tried '([^g][^o][^i][^n][^g])|([^y][^o][^u])' but it yields '_g_h___y_'.

Not quite as easy as it first appears, since there is no "not" in REs except ^ inside [ ] which only matches one character (as you found). Here is my solution:
import re
def subit(m):
    stuff, word = m.groups()
    return ("_" * len(stuff)) + word
s = 'I am going home now, thank you.' # string to modify
print(re.sub(r'(.+?)(going|you|$)', subit, s))
Gives:
_____going_________________you_
To explain. The RE itself (I always use raw strings) matches one or more of any character (.+) but is non-greedy (?). This is captured in the first parentheses group (the brackets). That is followed by either "going" or "you" or the end-of-line ($).
subit is a function (you can call it anything within reason) which is called for each substitution. A match object is passed, from which we can retrieve the captured groups. The first group we just need the length of, since we are replacing each character with an underscore. The returned string is substituted for that matching the pattern.

Here is a one regex approach:
>>> re.sub(r'(?!going|you)\b([\S\s]+?)(\b|$)', lambda x: (x.end() - x.start())*'_', s)
'_____going_________________you_'
The idea is that most regex engines (most of them use a traditional NFA) analyze strings character by character, so to exclude whole words you have to define the allowed matches as words, using word boundaries together with a negative lookahead. Because sub replaces each match with the replacement string, you can't just pass '_': a part like I am is found as three matches (I, ' ', am) and would become three underscores instead of four. Instead, pass a function as the second argument of sub and multiply '_' by the length of the matched string.
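Another way to get the same result, offered here as a sketch rather than as either answerer's method, is to put the words to keep on the left of an alternation and a single "any character" on the right. Whenever the right side matches, Group 1 is empty and the character becomes an underscore. The names KEEP_OR_ANY and mask are made up for this example.

```python
import re

# Left side: a protected whole word, captured in Group 1.
# Right side: any single character (DOTALL so newlines count too).
KEEP_OR_ANY = re.compile(r'\b(going|you)\b|.', re.DOTALL)

def mask(text):
    """Replace every character with '_' except the protected words."""
    return KEEP_OR_ANY.sub(lambda m: m.group(1) or '_', text)

s = 'I am going home now, thank you.'
print(mask(s))
```

Because of the \b on both sides, a longer word such as young is not treated as protected.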

Find and extract two substrings from string

I have some strings (in fact they are lines read from a file). The lines are just copied to some other file, but some of them are "special" and need a different treatment.
These lines have the following syntax:
someText[SUBSTRING1=SUBSTRING2]someMoreText
So, what I want is: when a line matches this "mask", I want to store SUBSTRING1 and SUBSTRING2 in variables. The brackets and the = should be stripped.
I guess this consists of several tasks:
Decide if a line contains this mask
If yes, get the positions of the substrings
Extract the substrings
I'm sure this is an easy task with regex; however, I'm not used to it. I can write a huge monster function using string manipulation, but I guess this is not the "Python Way" to do this.
Any suggestions on this?

re.search() returns None if it doesn't find a match. \w matches a word character (a letter, digit, or underscore), and + means one or more. Parentheses indicate the capturing groups.
import re
s = """
bla bla
someText[SUBSTRING1=SUBSTRING2]someMoreText"""
results = {}
for line_num, line in enumerate(s.split('\n')):
    m = re.search(r'\[(\w+)=(\w+)\]', line)
    if m:
        # group(0) is the whole match; the two substrings are groups 1 and 2
        results.update({line_num: {'first': m.group(1), 'second': m.group(2)}})
print(results)

^[^\[\]]*\[([^\]\[=]*)=([^\]\[=]*)\][^\]\[]*$
You can try this. Group 1 and Group 2 hold the two strings you want. See the demo.
https://regex101.com/r/pT4tM5/26
import re
p = re.compile(r'^[^\[\]]*\[([^\]\[=]*)=([^\]\[=]*)\][^\]\[]*$', re.MULTILINE)
test_str = "someText[SUBSTRING1=SUBSTRING2]someMoreText\nsomeText[SUBSTRING1=SUBSTRING2someMoreText\nsomeText[SUBSTRING1=SUBSTRING2]someMoreText"
re.findall(p, test_str)
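For the line-by-line use case described in the question, the three tasks (decide, locate, extract) collapse into one search per line. The following is a sketch under assumptions: it uses a shorter, unanchored variant of the pattern above, and the names MASK and parse_mask are hypothetical.

```python
import re

# "[", text without brackets or "=", "=", more such text, "]"
MASK = re.compile(r"\[([^\[\]=]*)=([^\[\]=]*)\]")

def parse_mask(line):
    """Return (substring1, substring2) if the line contains [A=B], else None."""
    m = MASK.search(line)
    return m.groups() if m else None

for line in ["plain text", "someText[SUBSTRING1=SUBSTRING2]someMoreText"]:
    pair = parse_mask(line)
    if pair is None:
        print("copy unchanged:", line)
    else:
        first, second = pair
        print("special:", first, second)
```

If the positions are needed as well, the match object's span(1) and span(2) methods give the start and end indexes of each group.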

Best way to split a string for the last space

I'm wondering what the best way is to split a space-separated string at the last space that is not inside [, {, ( or ". For instance I could have:
a = 'a b c d e f "something else here"'
b = 'another parse option {(["gets confusing"])}'
For a it should parse into ['a', 'b', 'c', 'd', 'e', 'f'], ["something else here"]
and b should parse into ['another', 'parse', 'option'], ['{(["gets confusing"])}']
Right now I have this:
import sys
def getMin(aList):
    # Smallest index in aList, ignoring -1 (what str.find returns for "not found")
    smallest = sys.maxsize
    for item in aList:
        if item < smallest and item != -1:
            smallest = item
    return smallest
myList = []
myList.append(b.find('['))
myList.append(b.find('{'))
myList.append(b.find('('))
myList.append(b.find('"'))
myMin = getMin(myList)
print(b[:myMin], b[myMin:])
I'm sure there are better ways to do this, and I'm open to all suggestions.

Matching vs. Splitting
There is an easy solution. The key is to understand that matching and splitting are two sides of the same coin. When you say "match all", that means "split on what I don't want to match", and vice-versa. Instead of splitting, we're going to match, and you'll end up with the same result.
The Reduced, Simple Version
Let's start with the simplest version of the regex so you don't get scared by something long:
{[^{}]*}|\S+
This matches all the items of your second string—the same as if we were splitting (see demo)
The left side of the | alternation matches complete sets of {braces}.
The right side of the | matches any characters that are not whitespace characters.
It's that simple!
The Full Regex
We also need to match "full quotes", (full parentheses) and [full brackets]. No problem: we just add them to the alternation. Just for clarity, I'm throwing them together in a non-capture group (?: so that the \S+ pops out on its own, but there is no need.
(?:{[^{}]*}|"[^"]*"|\([^()]*\)|\[[^][]*\])|\S+
See demo.
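In Python, the full regex might be applied as in the sketch below. It uses the same alternation, with the braces and the brackets inside the character class escaped to keep newer Python versions from misreading them; the names TOKENS and split_last, and the choice to return the last token as a one-element list, are assumptions made to mirror the output shape asked for in the question.

```python
import re

# Complete {braces}, "quotes", (parentheses) or [brackets], else a run of non-whitespace
TOKENS = re.compile(r'(?:\{[^{}]*\}|"[^"]*"|\([^()]*\)|\[[^\[\]]*\])|\S+')

def split_last(text):
    """Return (all tokens but the last, [last token])."""
    tokens = TOKENS.findall(text)
    return tokens[:-1], tokens[-1:]

a = 'a b c d e f "something else here"'
b = 'another parse option {(["gets confusing"])}'
print(split_last(a))
print(split_last(b))
```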
Notes and Potential Improvements
We could replace the quoted string regex by one that accepts escaped quotes
We could replace the brace, brackets and parentheses expressions by recursive expressions to allow nested constructions, but you'd have to use Matthew Barnett's (awesome) regex module instead of re
The technique is related to a simple and beautiful trick to Match (or replace) a pattern except when...
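As an illustration of the first note, one common way to accept escaped quotes is to allow either a backslash followed by any character, or any character that is neither a quote nor a backslash. This is a sketch of that idea, not the answerer's own expression; QUOTED and the sample text are made up.

```python
import re

# Opening quote, then any mix of escaped chars and ordinary chars, then closing quote
QUOTED = re.compile(r'"(?:\\.|[^"\\])*"')

text = r'say "a \"nested\" quote" now'
print(QUOTED.findall(text))
```

This sub-pattern could take the place of "[^"]*" in the alternation of the full regex.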
Let me know if you have questions!

You can use regular expressions:
import re
def parse(text):
    # \[ is escaped inside the set so newer Python versions do not warn about a possible nested set
    m = re.search(r'(.*) ([\[({"].*)', text)
    if not m:
        return None
    return m.group(1).split(), [m.group(2)]
The first part (.*) catches everything up to the section in quotes or brackets, and the second part catches anything starting at one of the characters ( [ { or ".
If you need something more robust, this has a more complicated regular expression, but it will make sure that the opening token is matched, and it makes the last expression optional.
def parse(text):
    m = re.search(r'(.*?)(?: ("[^"]*"|\([^)]*\)|\[[^]]*\]|\{[^}]*\}))?$', text)
    if not m:
        return None
    # group(2) is None when the text has no trailing quoted or bracketed part
    return m.group(1).split(), [m.group(2)]

Perhaps this link will help:
Split a string by spaces -- preserving quoted substrings -- in Python
It explains how to preserve quoted substrings when splitting a string by spaces.

Match single quotes from python re

How do I match the following? I want all the names within the single quotes:
This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'
How can I extract only the names within the single quotes?
name = re.compile(r'^\'+\w+\'')

The following regex finds all single words enclosed in quotes:
In [6]: re.findall(r"'(\w+)'", s)
Out[6]: ['Tom', 'Harry', 'rock']
Here:
the ' matches a single quote;
the \w+ matches one or more word characters;
the ' matches a single quote;
the parentheses form a capture group: they define the part of the match that gets returned by findall().
If you only wish to find words that start with a capital letter, the regex can be modified like so:
In [7]: re.findall(r"'([A-Z]\w*)'", s)
Out[7]: ['Tom', 'Harry']

I'd suggest
r = re.compile(r"\B'\w+'\B")
apos = r.findall("This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'")
Result:
>>> apos
["'Tom'", "'Harry'", "'rock'"]
The "negative word boundaries" (\B) prevent matches like the 'n' in words like Rock'n'Roll.
Explanation:
\B # make sure that we're not at a word boundary
' # match a quote
\w+ # match one or more alphanumeric characters
' # match a quote
\B # make sure that we're not at a word boundary
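To see the effect of \B, the pattern can be compared with a version that leaves it out. This is a small sketch; the names strict and loose are made up, and the Rock'n'Roll string is the example mentioned above.

```python
import re

strict = re.compile(r"\B'\w+'\B")   # quotes must not touch a word character on the outside
loose = re.compile(r"'\w+'")        # same pattern without the \B checks

s = "This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'"
print(strict.findall(s))

# The apostrophes in Rock'n'Roll sit right next to word characters
print(loose.findall("Rock'n'Roll"))
print(strict.findall("Rock'n'Roll"))
```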

^ ('hat' or 'caret', among other names) in regex means "start of the string" (or, given particular options, "start of a line"), which you don't care about. Omitting it makes your regex work fine:
>>> re.findall(r'\'+\w+\'', s)
["'Tom'", "'Harry'", "'rock'"]
The regexes others have suggested might be better for what you're trying to achieve; this is the minimal change to fix your problem.

Your regex can only match a pattern following the start of the string. Try something like: r"'([^']*)'"
