I'm not getting the desired output: re.sub leaves me with only the last item when I use a Python regular expression. Please explain what I'm doing wrong.
srr = "http://www.google.com/#image-1CCCC| http://www.google.com/#image-1VVDD| http://www.google.com/#image-123| http://www.google.com/#image-123| http://www.google.com/#image-1CE005XG03"
re.sub("http://.*[#]", "", srr)
'image-1CE005XG03'
Desired output, without the http://www.google.com/# prefix, from the above string:
image-1CCCC|image-1VVDD|image-123|image-1CE005XG03
I would use re.findall here, rather than trying to do a replacement to remove the portions you don't want:
src = "http://www.google.com/#image-1CCCC| http://www.google.com/#image-1VVDD| http://www.google.com/#image-123| http://www.google.com/#image-123| http://www.google.com/#image-1CE005XG03"
matches = re.findall(r'https?://www\.\S+#([^|\s]+)', src)
output = '|'.join(matches)
print(output) # image-1CCCC|image-1VVDD|image-123|image-123|image-1CE005XG03
Note that if you want to be more specific and match only Google URLs, you may use the following pattern instead:
https?://www\.google\.\S+#([^|\s]+)
>>> "|".join(re.findall(r'#([^|\s]+)', srr))
'image-1CCCC|image-1VVDD|image-123|image-123|image-1CE005XG03'
Here is another solution,
"|".join(i.split("#")[-1] for i in srr.split("|"))
image-1CCCC|image-1VVDD|image-123|image-123|image-1CE005XG03
Using a correct regex in re.sub, as suggested in the comment above:
import re
srr = "http://www.google.com/#image-1CCCC| http://www.google.com/#image-1VVDD| http://www.google.com/#image-123| http://www.google.com/#image-123| http://www.google.com/#image-1CE005XG03"
print (re.sub(r"\s*https?://[^#\s]*#", "", srr))
Output:
image-1CCCC|image-1VVDD|image-123|image-123|image-1CE005XG03
RegEx Details:
\s*: Match 0 or more whitespaces
https?: Match http or https
://: Match ://
[^#\s]*: Match 0 or more of any characters that are not # and whitespace
#: Match a #
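To see why the original pattern kept only the last item, it may help to compare it with two alternatives side by side. This is a minimal sketch on a shortened version of the question's string; the variable names greedy, lazy and negated are only illustrative.

```python
import re

srr = ("http://www.google.com/#image-1CCCC| "
       "http://www.google.com/#image-1VVDD| "
       "http://www.google.com/#image-1CE005XG03")

# Greedy .* runs to the end of the string and backtracks to the LAST '#',
# so everything before the final fragment is swallowed by a single match.
greedy = re.sub(r"http://.*#", "", srr)

# Lazy .*? stops at the FIRST '#' after each "http://".
lazy = re.sub(r"\s*http://.*?#", "", srr)

# A negated character class cannot cross a '#' or whitespace at all.
negated = re.sub(r"\s*https?://[^#\s]*#", "", srr)

print(greedy)   # image-1CE005XG03
print(lazy)     # image-1CCCC|image-1VVDD|image-1CE005XG03
print(negated)  # image-1CCCC|image-1VVDD|image-1CE005XG03
```

The negated class is usually the safer choice here, since it never has to backtrack across several URLs.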
I want to remove special characters from the start or end of a string,
#Can't&
Using a regular expression, I've tried
`[^\w\s]`
But this regular expression also removes the ' inside the word and returns the word below,
Cant
I can't seem to wrap my head around this; any ideas would be highly appreciated.
Can be simplified like this:
res = re.sub(r'^\W+|\W+$', '', txt)
Use the following approach (using regex alternation ..|..):
import re
s = "#Can't&"
res = re.sub(r'^[^\w\s]+|[^\w\s]+$', '', s)
print(res) # Can't
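The two patterns above behave the same on "#Can't&" but differ once the string also has surrounding whitespace, because \W includes whitespace while [^\w\s] does not. A small sketch, where the padded sample string is an assumption added for illustration:

```python
import re

plain = "#Can't&"
padded = "  #Can't& "   # hypothetical variant with surrounding spaces

# [^\w\s] excludes whitespace, so it cannot get past the outer spaces.
punct_only = re.sub(r"^[^\w\s]+|[^\w\s]+$", "", padded)

# \W also matches whitespace, so spaces and symbols at the ends both go.
non_word = re.sub(r"^\W+|\W+$", "", padded)

print(repr(re.sub(r"^[^\w\s]+|[^\w\s]+$", "", plain)))  # "Can't"
print(repr(punct_only))                                 # "  #Can't& "
print(repr(non_word))                                   # "Can't"
```

In both patterns the inner apostrophe survives because each alternative is anchored to one end of the string.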
I have a dataset that includes Tweets from Twitter. Some of them also have user mentions such as #thisisauser. I am trying to remove those mentions while doing other cleaning steps.
def clean_text(row, options):
    if options['lowercase']:
        row = row.lower()
    if options['decode_html']:
        txt = BeautifulSoup(row, 'lxml')
        row = txt.get_text()
    if options['remove_url']:
        row = row.replace('http\S+|www.\S+', '')
    if options['remove_mentions']:
        row = row.replace('#[A-Za-z0-9]+', '')
    return row
clean_config = {
    'remove_url': True,
    'remove_mentions': True,
    'decode_utf8': True,
    'lowercase': True
}
df['tweet'] = df['tweet'].apply(clean_text, args=(clean_config,))
However, when I run the above code, all the Twitter mentions are still in the text. I verified with an online regex tool that my regex works correctly, so the problem must be in the Pandas code.
You are misusing the string replace method: it does not accept regular expressions, only fixed strings (see the docs at https://docs.python.org/2/library/stdtypes.html#str.replace for more).
The right way to do this is to use the re module, like so:
import re
re.sub("#[A-Za-z0-9]+","", "#thisisauser text")
' text'
The problem is with the way you used the replace method, not with pandas.
See the output from the REPL:
>>> my_str ="#thisisause"
>>> my_str.replace('#[A-Za-z0-9]+', '')
'#thisisause'
replace doesn't support regex. Instead, use Python's regular expression library, as mentioned in the other answer:
>>> import re
>>> my_str
'hello #username hi'
>>> re.sub("#[A-Za-z0-9]+","",my_str)
'hello  hi'
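Putting the two answers together, the question's function could be rewritten along these lines. This is only a sketch: the BeautifulSoup step is left out to keep it dependency-free, options.get is used so that keys missing from the config do not raise KeyError, the dot in www\. is escaped, and the sample tweet is made up.

```python
import re

def clean_text(row, options):
    # options.get() returns None (falsy) for keys absent from the config.
    if options.get('lowercase'):
        row = row.lower()
    if options.get('remove_url'):
        # re.sub interprets the pattern as a regex; str.replace does not.
        row = re.sub(r'http\S+|www\.\S+', '', row)
    if options.get('remove_mentions'):
        row = re.sub(r'#[A-Za-z0-9]+', '', row)
    return row

clean_config = {
    'remove_url': True,
    'remove_mentions': True,
    'lowercase': True,
}

cleaned = clean_text("Hello #thisisauser see http://example.com now", clean_config)
print(repr(cleaned))  # 'hello  see  now'
```

The function can still be passed to df['tweet'].apply(clean_text, args=(clean_config,)) as in the question. The doubled spaces left behind can be collapsed in a later step if needed.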
To remove Twitter mentions, or words that start with a # char, in Pandas, you can use any of the following:
df['tweet'] = df['tweet'].str.replace(r'\s*#\w+', '', regex=True)
df['tweet'] = df['tweet'].str.replace(r'\s*\B#\w+', '', regex=True)
df['tweet'] = df['tweet'].str.replace(r'\s*#\S+', '', regex=True)
df['tweet'] = df['tweet'].str.replace(r'\s*#\S+\b', '', regex=True)
If you need to remove the leading/trailing spaces that remain after the replacement, add .str.strip() after the .str.replace call.
Details:
\s*#\w+ - zero or more whitespace characters, a #, and one or more "word" characters (letters, digits and underscores; in Python 3.x also other connector punctuation and some diacritics)
\s*\B#\w+ - zero or more whitespace characters, a position that is not a word boundary (so the # must not directly follow a word character), a #, and one or more "word" characters
\s*#\S+ - zero or more whitespace characters, a #, and one or more non-whitespace characters
\s*#\S+\b - zero or more whitespace characters, a #, and one or more non-whitespace characters (as many as possible) followed by a word boundary.
Without Pandas, use one of the above regexps in re.sub:
text = re.sub(r'...pattern here...', '', text)
# or
text = re.sub(r'...pattern here...', '', text).strip()
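The differences between the four patterns are easier to see on a single string. The sample text below is invented so that each pattern gives a different result. This is a plain re.sub sketch, not the Pandas call itself.

```python
import re

text = "hi #a-b! foo#bar"   # hypothetical sample input

patterns = [
    r'\s*#\w+',      # stops at the first non-word character
    r'\s*\B#\w+',    # additionally skips a '#' glued to a preceding word
    r'\s*#\S+',      # takes everything up to the next whitespace
    r'\s*#\S+\b',    # same, but backs off to end at a word boundary
]

results = {p: re.sub(p, '', text) for p in patterns}
for p, out in results.items():
    print(f"{p:<12} -> {out!r}")
# \s*#\w+      -> 'hi-b! foo'
# \s*\B#\w+    -> 'hi-b! foo#bar'
# \s*#\S+      -> 'hi foo'
# \s*#\S+\b    -> 'hi! foo'
```

Which one fits depends on whether tokens such as foo#bar should count as mentions and whether trailing punctuation should be kept.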
I have a string like the one below
string = "your invoice number IVR/20170531/XVII/V/12652967 and IVR/20170531/XVII/V/13652967"
I want to get the invoice numbers IVR/20170531/XVII/V/12652967 and IVR/20170531/XVII/V/13652967 into a list, using a regex with this pattern
result = re.findall(r'INV[/]\d{8}[/](M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))[/](M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))[/]\d{7,9}',string)
but the result is
[('XVII', '', '','', '', '', '', '', 'X', 'VII', '', '', '', 'V','','','', '', '', '', '', '', '', '', '', 'V')]
I tried this pattern on http://regexr.com/ and the result is as expected, but in Python it is not.
You should modify your pattern: either wrap the whole regular expression in parentheses and read the text back as the first group, or use group(0), which always holds the entire match. You can read more about groups and back-references in the documentation of Python's re module.
invoices = []
# Your pattern was slightly incorrect
pattern = re.compile(r'IVR[/]\d{8}[/](M{1,4}(CM|CD|D?C{0,3})|(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))[/](M{1,4}(CM|CD|D?C{0,3})|(XC|XL|L?X{0,3})|(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))[/]\d{7,9}')
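# For each invoice pattern you find in string, append it to list.
# group(0) is the entire match; group(1) would be only the first
# parenthesised part, because the pattern above has no outer parentheses.
for invoice in pattern.finditer(string):
    invoices.append(invoice.group(0))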
Note:
You should also use pattern.finditer(), because that way you can iterate through all the matches of the pattern in the text you called string. From the re.finditer documentation:
re.finditer(pattern, string, flags=0)
Return an iterator yielding MatchObject instances over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match.
string = "your invoice number IVR/20170531/XVII/V/12652967 and IVR/20170531/XVII/V/13652967"
results = []
# regexpattern is a placeholder for your pattern string
matches = re.finditer(regexpattern, string)
for match in matches:
    # group() with no argument returns the entire match
    results.append(match.group())
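You need to add ?: at the start of every group to make the groups non-capturing. When a pattern contains capturing groups, re.findall returns only the captured groups instead of the whole match.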
Try with this regex:
IVR[/]\d{8}[/](?:M{0,4}(?:CM|CD|D?C{0,3})|(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))[/](?:M{0,4}(?:CM|CD|D?C{0,3})|(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))[/]\d{8}
Basically you need to add ?: for each group.
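The effect of capturing versus non-capturing groups on re.findall can be checked with a much shorter pattern. In the sketch below the Roman-numeral part is replaced by the simplified class [IVXLCDM]+, which is an assumption made for brevity: it also accepts invalid numerals such as IIII, unlike the full pattern above.

```python
import re

string = ("your invoice number IVR/20170531/XVII/V/12652967 "
          "and IVR/20170531/XVII/V/13652967")

ROMAN = r"[IVXLCDM]+"   # simplified stand-in, not a strict Roman-numeral check

capturing_pattern = r"IVR/\d{8}/(" + ROMAN + r")/(" + ROMAN + r")/\d{7,9}"
non_capturing_pattern = r"IVR/\d{8}/(?:" + ROMAN + r")/(?:" + ROMAN + r")/\d{7,9}"

# With capturing groups, findall returns one tuple of groups per match.
capturing = re.findall(capturing_pattern, string)

# With only non-capturing groups, findall returns the whole matches.
non_capturing = re.findall(non_capturing_pattern, string)

# finditer gives match objects, so group(0) works with either pattern.
whole = [m.group(0) for m in re.finditer(capturing_pattern, string)]

print(capturing)      # [('XVII', 'V'), ('XVII', 'V')]
print(non_capturing)  # ['IVR/20170531/XVII/V/12652967', 'IVR/20170531/XVII/V/13652967']
print(whole == non_capturing)  # True
```

This also explains the list of mostly empty tuples in the question: every parenthesised part of the long pattern is a capturing group, and findall reports each of them.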
You can try this one to retrieve number, roman, roman and number values:
IVR\/(\d{8})\/(M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\/(M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\/(\d{7,9})
Snippet
import re
string = "your invoice number IVR/20170531/XVII/V/12652967 and IVR/20170531/XVII/V/13652967"
pattern = r"IVR\/(\d{8})\/(M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\/(M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\/(\d{7,9})"
for match in re.findall(pattern, string):
    print(match)
I have a string like this:
----------
FT Weekend
----------
Why do we run marathons?
Are marathons and cycling races about more than exercise? What does the
literature of endurance tell us about our thirst for self-imposed hardship?
I want to delete the part from ---------- to the next ---------- included.
I have been using re.sub:
pattern =r"-+\n.+\n-+"
re.sub(pattern, '', thestring)
pattern =r"-+\n.+?\n-+"
re.sub(pattern, '', thestring,flags=re.DOTALL)
Just use the DOTALL flag. The problem with your regex was that by default . does not match \n, so you need to explicitly add the DOTALL flag to make it match \n.
See demo.
https://regex101.com/r/hR7tH4/24
or
pattern =r"-+\n[\s\S]+?\n-+"
re.sub(pattern, '', thestring)
if you don't want to add a flag.
Your regex doesn't match the expected part because .+ doesn't match the newline character. You can use the re.DOTALL flag (or re.S) to force . to match newlines, but instead of that you can use a negated character class:
>>> print(re.sub(r"-+[^-]+-+", '', s))
''
Why do we run marathons?
Are marathons and cycling races about more than exercise? What does the
literature of endurance tell us about our thirst for self-imposed hardship?
>>>
Or, to be more precise, you can do:
>>> print(re.sub(r"-+[^-]+-+[^\w]+", '', s))
Why do we run marathons?
Are marathons and cycling races about more than exercise? What does the
literature of endurance tell us about our thirst for self-imposed hardship?
>>>
The problem with your regex (-+\n.+\n-+) is that . matches any character but a newline, and that it is too greedy (.+), and can span across multiple ------- entities.
You can use the following regex:
pattern = r"(?s)-+\n.+?\n-+"
The (?s) singleline option makes . match any character including newline.
The .+? pattern will match 1 or more characters but as few as possible to match up to the next ----.
For a more thorough cleanup that also removes the surrounding whitespace, I'd recommend:
pattern = r"(?s)\s*-+\n.+?\n-+\s*"
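The patterns discussed in this thread can be compared on one input. This is a sketch only: the sample text is shortened, and a second header line is added as an assumption, so that the block between the dashed lines really spans several lines.

```python
import re

thestring = (
    "----------\n"
    "FT Weekend\n"
    "Second header line\n"      # hypothetical extra line
    "----------\n"
    "Why do we run marathons?\n"
)

# Without DOTALL, '.' cannot cross the newline inside the header block,
# so nothing matches and the text comes back unchanged.
no_flag = re.sub(r"-+\n.+\n-+", "", thestring)

# With DOTALL and a lazy quantifier, the match ends at the next dashed line.
dotall = re.sub(r"-+\n.+?\n-+", "", thestring, flags=re.DOTALL)

# The inline (?s) flag is equivalent; \s* also removes surrounding whitespace.
thorough = re.sub(r"(?s)\s*-+\n.+?\n-+\s*", "", thestring)

print(no_flag == thestring)  # True
print(repr(dotall))          # '\nWhy do we run marathons?\n'
print(repr(thorough))        # 'Why do we run marathons?\n'
```

The lazy .+? matters once the text contains more than one dashed block, since a greedy .+ under DOTALL would run on to the last dashed line.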