I need to truncate a string at the special characters '-', '(' and '/' when they are preceded by one whitespace character, i.e. at ' -', ' (' or ' /'.
How can I do that?
patterns = r'[-/()]'
try:
    return row.split(re.findall(patterns, row)[0], 1)[0]
except:
    return row
The above code picks up all the special characters, but without requiring the leading space.
patterns=r'[s-/()]'
This one does not work.
Try this pattern:
patterns=r'^\s[-/()]'
Remove the ^ if the match does not have to be at the start of the string.
It looks like you want to get the part of the string before the first occurrence of the \s[-(/] pattern.
Use:
return re.sub(r'\s[-(/].*', '', row)
This returns row with everything removed from the first occurrence of a whitespace character (\s) followed by -, ( or / ([-(/]) onwards.
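As a minimal sketch of how that substitution behaves, here it is wrapped in a function. The name truncate_at_marker and the sample strings are made up for illustration; only the pattern comes from the answer above. Note that \s also matches tabs and newlines, and that .* stops at a newline unless re.DOTALL is passed.

```python
import re

def truncate_at_marker(row):
    # Remove everything from the first whitespace character that is
    # directly followed by '-', '(' or '/' up to the end of the line.
    return re.sub(r'\s[-(/].*', '', row)

print(truncate_at_marker('Widget - blue'))       # Widget
print(truncate_at_marker('Widget (large) / x'))  # Widget
print(truncate_at_marker('well-known/thing'))    # well-known/thing (no leading space, so unchanged)
```

Because re.sub returns the input unchanged when nothing matches, no try/except is needed.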
Please try this pattern patterns = r'\s+-|\s\/|\s\(|\s\)'
I don't know how to write a regex that replaces every underscore with ' ' except when the underscore is part of a hashtag.
For example, given some text, we want to replace all underscores except in cases like #please_help_me.
The simplest way would probably be to match all contiguous words with underscores in them, then pass a function/lambda to re.sub to remove the underscores the old-fashioned way, only if the first character is not #:
import re

sample = 'Here is_a_sample string #with_a_hashtag'
rstr = r'(#?(?:\w*_)+)'
# in this case, this matches like so:
# 'is_a_'
# '#with_a_'
new_sample = re.sub(
    rstr,
    # keep hashtag matches as they are; replace underscores everywhere else
    lambda s: s.group(0) if s.group(0).startswith('#') else s.group(0).replace('_', ' '),
    sample)
print(new_sample)
# Here is a sample string #with_a_hashtag
The regex involved here is pretty simple:
as a capturing group (()),
zero or one # symbols (#?)
followed by a non-capturing group, repeated at least once ((?: )+), of
any number of word-like characters followed by an underscore (\w*_)
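To check the breakdown above, the sketch below compiles the same pattern once and prints what it matches before doing the substitution. The names UNDERSCORE_RUN and spaces_outside_hashtags are illustrative, not from the answer.

```python
import re

# Same pattern as above: an optional '#', then one or more
# "word characters followed by an underscore" chunks.
UNDERSCORE_RUN = re.compile(r'(#?(?:\w*_)+)')

def spaces_outside_hashtags(text):
    def fix(match):
        chunk = match.group(0)
        # Leave the chunk alone if it belongs to a hashtag.
        return chunk if chunk.startswith('#') else chunk.replace('_', ' ')
    return UNDERSCORE_RUN.sub(fix, text)

sample = 'Here is_a_sample string #with_a_hashtag'
print(UNDERSCORE_RUN.findall(sample))   # ['is_a_', '#with_a_']
print(spaces_outside_hashtags(sample))  # Here is a sample string #with_a_hashtag
```

The match stops at the last underscore of each word, which is enough because the remaining characters contain no underscores to replace.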
I have the following string:
>>> repr(s)
" NBCUniversal\\n63 VOLGAFILM, INC VOLGAFILMINC\\n64 Video Service Corp "
I want to match the string immediately before each \\n, that is, everything back to the preceding whitespace character. The output should be:
['NBCUniversal', 'VOLGAFILMINC']
Here is what I have so far:
re.findall(r'[^s].+\\n\d{1,2}', s)
What would be the correct regex for this?
EDIT: Sorry, I didn't read your question carefully.
If you want to find all groups of letters immediately before a literal \n, re.findall is appropriate. You can obtain the result you want with:
>>> import re
>>> s = " NBCUniversal\\n63 VOLGAFILM, INC VOLGAFILMINC\\n64 Video Service Corp "
>>> re.findall(r'(?i)[a-z]+(?=\\n)', s)
['NBCUniversal', 'VOLGAFILMINC']
OLD ANSWER:
re.findall is not the appropriate method since you only need one result (that is a pair of strings). Here the re.search method is more appropriate:
>>> import re
>>> s = " NBCUniversal\\n63 VOLGAFILM, INC VOLGAFILMINC\\n64 Video Service Corp "
>>> # (?i) must come first: Python 3.11+ rejects global inline flags placed later in the pattern
>>> res = re.search(r'(?i)^[^a-z\\]*([a-z]+)\\n[^a-z\\]*([a-z]+)', s)
>>> res.groups()
('NBCUniversal', 'VOLGAFILM')
Note: I have assumed that there are no other characters between the first word and the literal \n, but if it isn't the case, you can add [^a-z\\]* before the \\n in the pattern.
If you want to fix your existing code instead of replacing it, you're on the right track; you've just got a few minor problems.
Let's start with your pattern:
>>> re.findall(r'[^s].+\\n\d{1,2}', s)
[' NBCUniversal\\n63 VOLGAFILM, INC VOLGAFILMINC\\n64']
The first problem is that .+ will match everything that it can, all the way up to the very last \\n\d{1,2}, rather than just to the next \\n\d{1,2}. To fix that, add a ? to make it non-greedy:
>>> re.findall(r'[^s].+?\\n\d{1,2}', s)
[' NBCUniversal\\n63', ' VOLGAFILM, INC VOLGAFILMINC\\n64']
Notice that we now have two strings, as we should. The problem is, those strings don't just have whatever matched the .+?, they have whatever matched the entire pattern. To fix that, wrap the part you want to capture in () to make it a capturing group:
>>> re.findall(r'[^s](.+?)\\n\d{1,2}', s)
[' NBCUniversal', ' VOLGAFILM, INC VOLGAFILMINC']
That's nicer, but it still has a bunch of extra stuff on the left end. Why? Well, you're capturing everything after [^s]. That means any character except the letter s. You almost certainly meant [\s], meaning any character in the whitespace class. (Note that \s is already the whitespace class, so [\s], meaning the class consisting of the whitespace class, is unnecessary.) That's better, but that's still only going to match one space, not all the spaces. And it will match the earliest space it can that still leaves .+? something to match, not the latest. So if you want to soak up all the excess spaces, you need to repeat it:
>>> re.findall(r'\s+(.+?)\\n\d{1,2}', s)
['NBCUniversal', 'VOLGAFILM, INC VOLGAFILMINC']
Getting closer… but the .+? matches anything, including the space between VOLGAFILM and VOLGAFILMINC, and again, the \s+ is going to match the first run of spaces it can, leaving the .+? to match everything after that.
You could fiddle with the prefix, but there's an easier solution. If you don't want spaces in your capture group, just capture a run of nonspaces instead of a run of anything, using \S:
>>> re.findall(r'\s+(\S+?)\\n\d{1,2}', s)
['NBCUniversal', 'VOLGAFILMINC']
And notice that once you've done that, the \s+ isn't really doing anything anymore, so let's just drop it:
>>> re.findall(r'(\S+?)\\n\d{1,2}', s)
['NBCUniversal', 'VOLGAFILMINC']
I've obviously made some assumptions above that are correct for your sample input, but may not be correct for real data. For example, if you had a string like Weyland-Yutani\\n…, I'm assuming you want Weyland-Yutani, not just Yutani. If you have a different rule, like only letters, just change the part in parentheses to whatever fits that rule, like (\w+?) or ([A-Za-z]+?).
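To make that trade-off concrete, here is a small sketch that swaps the class inside the capture group. The helper names_before_marker, its token parameter and the Weyland-Yutani test string are illustrative assumptions, not part of the question.

```python
import re

# Raw strings, so \n below is a backslash followed by 'n', not a newline.
s = r" NBCUniversal\n63 VOLGAFILM, INC VOLGAFILMINC\n64 Video Service Corp "
t = r"Acme Weyland-Yutani\n12 Other Corp"  # hypothetical extra sample

def names_before_marker(text, token=r'\S'):
    # token is the character class allowed inside the captured name.
    pattern = r'(' + token + r'+?)\\n\d{1,2}'
    return re.findall(pattern, text)

print(names_before_marker(s))               # ['NBCUniversal', 'VOLGAFILMINC']
print(names_before_marker(t))               # ['Weyland-Yutani']
print(names_before_marker(t, r'[A-Za-z]'))  # ['Yutani']
print(names_before_marker(t, r'\w'))        # ['Yutani']
```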
Assuming that the input actually has the sequence \n (backslash followed by letter 'n') and not a newline, this will work:
>>> re.findall(r'(\S+)\\n', s)
['NBCUniversal', 'VOLGAFILMINC']
If the string actually contains newlines then replace \\n with \n in the regular expression.
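A short sketch of the difference, using one copy of the sample with a literal backslash-n and one with real newlines. The variable names are illustrative.

```python
import re

pattern_literal = r'(\S+)\\n'  # a backslash followed by the letter 'n'
pattern_newline = r'(\S+)\n'   # an actual newline character

# Raw string: contains backslash + 'n'.
literal = r" NBCUniversal\n63 VOLGAFILM, INC VOLGAFILMINC\n64 Video Service Corp "
# Ordinary string: contains real newlines.
real = " NBCUniversal\n63 VOLGAFILM, INC VOLGAFILMINC\n64 Video Service Corp "

print(re.findall(pattern_literal, literal))  # ['NBCUniversal', 'VOLGAFILMINC']
print(re.findall(pattern_newline, real))     # ['NBCUniversal', 'VOLGAFILMINC']
print(re.findall(pattern_literal, real))     # []
print(re.findall(pattern_newline, literal))  # []
```

Printing repr(s) is a quick way to tell which of the two cases applies to your data.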
I simply want to add a string after the (0 or more) tabs at the beginning of a string.
i.e.
a = '\t\t\tHere is the next part of string. More garbage.'
(insert Added String here.)
to
b = '\t\t\t Added String here. Here is the next part of string. More garbage.'
What is the easiest/simplest way to go about it?
Simple:
re.sub(r'^(\t*)', r'\1 Added String here. ', inputtext)
The ^ caret matches the start of the string, and \t matches a tab character, of which there should be zero or more (*). The parentheses capture the matched tabs for use in the replacement string, where \1 inserts them again in front of the string you need to add.
Demo:
>>> import re
>>> a = '\t\t\tHere is the next part of string. More garbage.'
>>> re.sub(r'^(\t*)', r'\1 Added String here. ', a)
'\t\t\t Added String here. Here is the next part of string. More garbage.'
>>> re.sub(r'^(\t*)', r'\1 Added String here. ', 'No leading tabs.')
' Added String here. No leading tabs.'
Here is what I'm trying to do... I have a string structured like this:
stringparts.bst? (carriage return)
765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99 (carriage return)
SPAM /198975/
I need it to match or return this:
765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99
What RegEx will do the trick?
I have tried this, but to no avail :(
bst\?(.*)\n
Thanks in advance.
I tried this, assuming the newline is a single character:
>>> import re
>>> s
'stringparts.bst?\n765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99\nSPAM /198975/'
>>> m = re.match(r'.*bst\?\s(.+)\s', s)
>>> print(m.group(1))
765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99
Your regex will match everything between the bst? and the first newline, which is nothing. I think you want to match everything between the first two newlines.
bst\?\n(.*)\n
will work, but you could also use
\n(.*)\n
although that one may pick the wrong line if the input has more lines before the one you want.
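A quick sketch comparing the three patterns on the sample from the question, assuming plain \n line breaks. The variable names are illustrative.

```python
import re

s = ('stringparts.bst?\n'
     '765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99\n'
     'SPAM /198975/')

original = re.search(r'bst\?(.*)\n', s).group(1)    # captures nothing
anchored = re.search(r'bst\?\n(.*)\n', s).group(1)  # the line after bst?
shortcut = re.search(r'\n(.*)\n', s).group(1)       # first line between two newlines

print(repr(original))  # ''
print(anchored)        # 765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99
print(anchored == shortcut)  # True
```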
The following is more robust against different kinds of line breaks, and works if you have a whole list of such strings. The ^ and $ represent the beginning and end of a line, but not the actual line break character (hence the \s+ sequence).
import re

# With re.MULTILINE, ^ and $ match at the start and end of every line.
BST_RE = re.compile(
    r"bst\?.*$\s+^(.*)$",
    re.MULTILINE
)

INPUT_STR = r"""
stringparts.bst?
765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99
SPAM /198975/
stringparts.bst?
another
SPAM /.../
"""

# findall returns the captured group for every match
occurrences = BST_RE.findall(INPUT_STR)
for occurrence in occurrences:
    print(occurrence)
This pattern allows additional whitespace before the \n:
r'bst\?\s*\n(.*?)\s*\n'
If you don't expect any whitespace within the string to be captured, you could use a simpler one, where \s+ consumes whitespace, including the \n, and (\S+) captures all the consecutive non-whitespace:
r'bst\?\s+(\S+)'
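As a rough comparison of the two patterns, the sketch below runs both over a few made-up inputs (shortened values, trailing spaces, \r\n line breaks, and a value containing a space). The names LOOSE, COMPACT and samples are illustrative.

```python
import re

LOOSE = re.compile(r'bst\?\s*\n(.*?)\s*\n')  # tolerates whitespace around the value
COMPACT = re.compile(r'bst\?\s+(\S+)')       # value must not contain whitespace

samples = [
    'stringparts.bst?\nabc123\nSPAM /198975/',
    'stringparts.bst?  \nabc123  \nSPAM /198975/',
    'stringparts.bst?\r\nabc123\r\nSPAM /198975/',
]
for text in samples:
    # Both patterns capture 'abc123' for each of these inputs.
    print(LOOSE.search(text).group(1), COMPACT.search(text).group(1))

spaced = 'stringparts.bst?\nabc 123\nSPAM /198975/'
print(LOOSE.search(spaced).group(1))    # abc 123
print(COMPACT.search(spaced).group(1))  # abc
```

The last two lines show the caveat: the shorter pattern stops at the first whitespace character inside the value.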