regex search remove word - python

I want remove first 4 words from paragraph
Original : Mywebsite 21 12 34 have 10000 traffic
What i want result : have 10000 traffic
i have 1000 of line same as original paragraph ( Mywebsite 21 12 34 have 10000 traffic)
i have regex search code which is work like this :
Below code is remove first word from sentence :
^\w+\s+(.*) = replace with $1
Following code will remove all numbers from line :
[0-9 ]+ = replace with space
I want combine above code, and make one regex search code work as i explain above, but not to affect any other words same line .

If your lines are all in the exact same format, i.e. if you always need to remove the first 4 words you can do something like this which is way simpler to understand than a RegEx:
# Iterate through all your lines
for line in lines:
# Split the line string on spaces to create an array of words.
words = line.split(' ')
# Exclude the 4 first words and re-join the string with the remaining words.
line = ' '.join(words[4:])

The pattern that you tried ^\w+\s+(.*) matches 1+ word chars, 1+ whitespace chars and then any char except a newline until the end of the string so that will match the whole string.
To remove the first word and the following 3 times 2 digits, you might use:
^\s*\w+(?: \d{2}){3}\s*
^ Start of string
\s* Match 0+ whitespace chars
\w+ Match 1+ word chars
(?: \d{2}){3} Repeat 3 times matching a space and 2 digits
\s* Match 0+ whitespace chars
Regex demo | Python demo
Note that \s also matches a newline. If you only want to match spaces or tabs you could use [ \t] instead.

You may use
re.sub(r'^(\w+\s)[\d\s]+', r'\1', text)
See the regex demo a
The pattern will match
^ - start of string
(\w+\s) - Capturing group 1: one or more word chars and a whitespace
[\d\s]+ - 1+ whitespace or digit chars.
Python demo:
import re
rx = re.compile(r"^(\w+\s)[\d\s]+")
s = "Mywebsite 21 12 34 have 10000 traffic"
print( rx.sub(r"\1", s) ) # => Mywebsite have 10000 traffic

Related

changing a word surrounded by numbers using re.sub

I want to change "to" to "-" only when it is surrounded by numbers or numbers and spaces e.g. 34to55 or 34 to 55 both to be changed to 34-55. Also, I don't want to change it when the word begins or ends with alphabets e.g. abc34to55def
I tried
re.sub(?<![a-z])(\d)to(\d)(?!\[a-z]), \\1-\\2, 34to55)
but it doesn't give me what I want.
Any help would be greatly appreciated
You can use 2 capture groups with word boundaries and use the groups in the replacement.
\b(\d+)\s*to\s*(\d+)\b
\b A word boundary to prevent a partial match
(\d+) Capture group 1, match 1+ digits
\s*to\s* Match to between optional whitespace chars
(\d+) Capture group 2, match 1+ digits
\b A word boundary
and replace with
\1-\2
Regex demo | Python demo
import re
pattern = r"\b(\d+)\s*to\s*(\d+)\b"
s = ("34to55 or 34 to 55\n"
"abc34to55def")
result = re.sub(pattern, r"\1-\2", s)
if result:
print (result)
Result
34-55 or 34-55
abc34to55def

Regex to extract first 5 digit+character from last hyphen

I am trying to extract first 5 character+digit from last hyphen.
Here is the example
String -- X008-TGa19-ER751QF7
Output -- X008-TGa19-ER751
String -- X002-KF13-ER782cPU80
Output -- X002-KF13-ER782
My attempt -- I could manage to take element from the last -- (\w+)[^-.]*$
But now how to take first 5, then return my the entire value as the output as shown in the example.
You can optionally repeat a - and 1+ word chars from the start of the string. Then match the last - and match 5 word chars.
^\w+(?:-\w+)*-\w{5}
^ Start of string
\w+ Math 1+ word chars
(?:-\w+)* Optionally repeat - and 1+ word chars
-\w{5} Match - and 5 word chars
Regex demo
import re
regex = r"^\w+(?:-\w+)*-\w{5}"
s = ("X008-TGa19-ER751QF7\n"
"X002-KF13-ER782cPU80")
print(re.findall(regex, s, re.MULTILINE))
Output
['X008-TGa19-ER751', 'X002-KF13-ER782']
Note that \w can also match _.
If there can also be other character in the string, to get the first 5 digits or characters except _ after the last hyphen, you can match word characters without an underscore using a negated character class [^\W_]{5}
Repeat that 5 times while asserting no more underscore at the right.
^.*-[^\W_]{5}(?=[^-]*$)
Regex demo
(\w+-\w+-\w{5}) seems to capture what you're asking for.
Example:
https://regex101.com/r/PcPSim/1
If you are open for non-regex solution, you can use this which is based on splitting, slicing and joining the strings:
>>> my_str = "X008-TGa19-ER751QF7"
>>> '-'.join(s[:5] for s in my_str.split('-'))
'X008-TGa19-ER751'
Here I am splitting the string based on hyphen -, slicing the string to get at max five chars per sub-string, and joining it back using str.join() to get the string in your desired format.
^(.*-[^-]{5})[^-]*$
Capture group 1 is what you need
https://regex101.com/r/SYz9i5/1
Explanation
^(.*-[^-]{5})[^-]*$
^ Start of line
( Capture group 1 start
.* Any number of any character
- hyphen
[^-]{5} 5 non-hyphen character
) Capture group 1 end
[^-]* Any number of non-hyphen character
$ End of line
Another simpler one is
^(.*-.{5}).*$
This should be quite straight-forward.
This is making use of behaviour greedy match of first .*, which will try to match as much as possible, so the - will be the last one with at least 5 character following it.
https://regex101.com/r/CFqgeF/1/

Split String into alpha & punctuation with exceptions regex

I am trying to split a string into 2 parts : alphanum chars & special chars. I want to limit the occurence of the escape character
b.sc.... = ['b.sc.','...'] (Preserve "." inside word & outside word just once)
really???? = ['really','????'] (split when any other special char encountered)
I went through a lot of SO questions before posting here. I have come up with this till now: re.findall(r"[\w+|\-.+\w]+|\W+,text)`
How to proceed further?
You can use
[re.sub(r'([.-])+', r'\1', x) for x in re.findall(r'\w+(?:-+\w+)+|\w+(?:\.+\w+)*\.?|[^\w\s]+', text)]
See this regex demo
Details
\w+(?:-+\w+)+ - one or more word chars followed with one or more occurrences of - and one or more word chars
| - or
\w+(?:\.+\w+)*\.? - one or more word chars followed with one or more occurrences of . and one or more word chars and then an optional dot
| - or
[^\w\s]+ - one or more non-word and non-whitespace chars.
The re.sub(r'([.-])+', r'\1', x) part is a post-processing step to replace one or more consecutive . or - chars with a single occurrence.

How to exclude two words from regex?

I have this regex:
\]\s*(AN|AV)\s*1\s*([\w\s]+)\s*2\s*([\w\s]+)\s*3\s*([\w\s][^cui]+)
That should match
] AN 1 words 2 words 3 words
or
] AV 1 words 2 words 3 words
The words after 3 should exclude "da cui", so "da\scui", but it doesn't work. Try it here: https://regex101.com/r/kI7Tan/1
What am I doing wrong?
Sample string:
campo] AN 1 campo 2 prato con penna B sps a 1 3 da cui campo con penna C as a 1 cfr Nota filologica
Expected output: it won't match it because of the "da cui". So basically I want to match all words without the string "da cui".
The final capture group of the regex ( ([\w\s][^cui]+) ) matches ...
Exactly 1 word character due to the first character class.
This class does not match a whitespace due to the preceding \s* in the regex.
Any number of characters other than c, u, i.
If you want to exclude matches contingent on the word(s) da cui, use a negative lookahead.
\]\s*(AN|AV)\s*1\s*([\w\s]+)\s*2\s*([\w\s]+)\s*3\s*(?!.*da cui)(.*)
See the demo (regex101).
Update
Capture group reintroduced to the regex.
You may use either of the two:
\]\s*(AN|AV)\s*1\s*([\w\s]+?)\s*2\s*([\w\s]+?)\s*3\s*((?:(?!cui).)*)
\]\s*(AN|AV)\s*1\s*([\w\s]+?)\s*2\s*([\w\s]+?)\s*3\s*(.*?)(?=cui|$)
See the regex demo
The (?:(?!cui).)* is a tempered greedy token that matches any char, 0 or more occurrences, as many as possible, that does not start a cui char sequence. The (.*?)(?=cui|$) pattern captures 0+ chars other than line break chars, as few as possible, up to the cui char sequence or end of string.
My interpretation of the question, as it concerns the string that follows one or more spaces after 3 (to the end of the line), is that if the string da cui is present in that string an empty string is to be saved to capture group 4, else that string is to be saved to capture group 4.
You could use the following regular expression.
\]\s*(AN|AV)\s+1\s+([\w\s]+)\s+2\s+([\w\s]+)\s+3\s+((?=.*\bda cui\b)|(?!=.*\bda cui\b).*)
Demo
This replaces 3\s*([\w\s][^cui]+) in the OP's regex with 3\s+((?=.*\bda cui\b)|(?!=.*\bda cui\b).*).
Python's regex engine performs the following steps after matching 3.
\s+ match 1+ spaces
( begin capture group 4
(?=.*\bda cui\b) match 0+ chars, then 'da cui' in a positive lookahead
| or
(?!=.*\bda cui\b) match 0* chars, then 'da cui' in a negative lookahead
.* match rest of line
) end capture group 4
If the positive lookahead succeeds an empty string is saved to the capture group.

Regex to find sentences of a minimum length

I am trying to create a regular expression that finds sentences with a minimum length.
Really my conditions are:
there must at least be 5 words in a sequence
words in sequence must be distinct
sequence must be followed by some punctuation character.
So far I have tried
^(\b\w*\b\s?){5,}\s?[.?!]$
If my sample text is:
This is a sentence I would like to parse.
This is too short.
Single word
Not not not distinct distinct words words.
Another sentence that I would be interested in.
I would like to match on strings 1 and 5.
I am using the python re library. I am using regex101 to test and it appears the regex I have above is doing quite a bit of work regards to backtracking so I imagine those knowledgable in regex may be a bit appalled (my apologies).
You can use the following regex to identify the strings that meet all three conditions:
^(?!.*\b(\w+)\b.+\b\1\b)(?:.*\b\w+\b){5}.*[.?!]\s*$
with the case-indifferent flag set.
Demo
Python's regex engine performs the following operations.
^ # match beginning of line
(?! # begin negative lookahead
.+ # match 1+ chars
\b(\w+)\b # match a word in cap grp 1
.+ # match 1+ chars
\b\1\b # match the contents of cap grp 1 with word breaks
) # end negative lookahead
(?: # begin non-cap grp
.+ # match 1+ chars
\b\w+\b # match a word
) # end non-cap grp
{5} # execute non-cap grp 5 times
.* # match 0+ chars
[.?!] # match a punctuation char
\s* # match 0+ whitespaces
$ # match end of line
Items 1. and 3. are easily done by regex, but
2. words in sequence must be distinct
I don't see how you could do it with a regex pattern. Remember that regex is a string-matching operation; it doesn't do heavy logic. This problem doesn't sound like a regex problem to me.
I recommend splitting the string in the character " " and checking word by word. Quickier, no sweat.
Edit
can be done with a lookahead as Cary said.

Categories