I have a sample string from a text file. I want to find every word that sits directly before a colon, going back only as far as the preceding whitespace.
I have written code like this:
import re
text = 'From: mathew <mathew#mantis.co.uk>\nSubject: Alt.Atheism FAQ: Atheist Resources\n\nArchive-name: atheism/resources\nAlt-atheism-archive-name:'
email_data = re.findall(r"[^\s].*(?=:)", text)
print(email_data)
Output:
['From', 'Subject: Alt.Atheism FAQ', 'Archive-name', 'Alt-atheism-archive-name']
Desired Output:
['From', 'Subject', 'FAQ', 'Archive-name', 'Alt-atheism-archive-name']
The code keeps matching along the rest of the line because of the .* I used. I want to restrict the match to the text between the preceding whitespace and the colon, so I added [^\s], but it is not working. What could I do instead?
You may use
email_data = re.findall(r"\S[^:\s]+(?=:)", text)
See the Python demo and the regex demo.
Details
\S - a non-whitespace char
[^:\s]+ - 1+ chars other than : and whitespace
(?=:) - immediately to the right, there must be a : char (it is not consumed, not added to the match value).
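As a minimal sketch of the pattern described above, the snippet below runs it on the sample string from the question (with the wrapped Archive-name line rejoined, which is an assumption about the original input). The names pattern, labels, ends_before_colon and one_letter are illustrative only. It also checks two consequences of the description: the colon stays outside each match, and \S followed by [^:\s]+ needs at least two characters.

```python
import re

# Sample input from the question; the line break inside "Archive-name" is assumed to be a wrap artifact.
text = ('From: mathew <mathew#mantis.co.uk>\n'
        'Subject: Alt.Atheism FAQ: Atheist Resources\n\n'
        'Archive-name: atheism/resources\n'
        'Alt-atheism-archive-name:')

pattern = re.compile(r"\S[^:\s]+(?=:)")

labels = pattern.findall(text)
print(labels)

# The lookahead does not consume the colon: the character right after each match is ':'.
ends_before_colon = all(text[m.end()] == ":" for m in pattern.finditer(text))
print(ends_before_colon)

# \S plus [^:\s]+ requires two or more characters, so a one-letter label is not matched.
one_letter = pattern.findall("a: b")
print(one_letter)
```

If one-character labels matter, dropping the leading \S (that is, using [^:\s]+(?=:)) would match them as well.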
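Use the re.IGNORECASE flag with the regex pattern
\b[a-z-]+(?=:(?:\s|$))
It matches a run of letters and hyphens that starts at a word boundary and is followed by a colon, where the colon itself must be followed by whitespace or the end of the string.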
https://regex101.com/r/0UHsbo/1
https://ideone.com/oz91bP
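This suggestion is given without inline code, so here is a hedged sketch of how it might be applied to the sample string from the question (with the wrapped Archive-name line rejoined, an assumption about the input). The names pattern, names and url_case are illustrative, and the URL is a placeholder.

```python
import re

# Sample input from the question; the line break inside "Archive-name" is assumed to be a wrap artifact.
text = ('From: mathew <mathew#mantis.co.uk>\n'
        'Subject: Alt.Atheism FAQ: Atheist Resources\n\n'
        'Archive-name: atheism/resources\n'
        'Alt-atheism-archive-name:')

# Letters and hyphens only, followed by a colon that is itself followed by whitespace or end of string.
pattern = re.compile(r"\b[a-z-]+(?=:(?:\s|$))", re.IGNORECASE)

names = pattern.findall(text)
print(names)

# A colon that is followed by something else (here "//") does not count.
url_case = pattern.findall("see http://example.com")
print(url_case)
```

Compared with a pattern that only looks for a colon, this one is stricter about what follows the colon, which may or may not be what the input needs.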
Related
I have a text where I would like to remove all uppercase consecutive characters up to a colon. I have only figured out how to remove all characters up to the colon itself; which results in the current output shown below.
Input Text
text = 'ABC: This is a text. CDEFG: This is a second text. HIJK: This is a third text'
Desired output:
'This is a text. This is a second text. This is a third text'
Current code & output:
re.sub(r'^.+[:]', '', text)
#current output
'This is a third text'
Can this be done with a one-liner regex, or do I need to iterate over the characters, check each one with isupper(), and then apply a regex?
You can use
\b[A-Z]+:\s*
\b - a word boundary to prevent a partial match
[A-Z]+: - match 1+ uppercase chars A-Z and a :
\s* - match optional whitespace chars
import re
text = 'ABC: This is a text. CDEFG: This is a second text. HIJK: This is a third text'
print(re.sub(r'\b[A-Z]+:\s*', '', text))
Output
This is a text. This is a second text. This is a third text
I am trying to split a text wherever a new line starts with a digit.
a="""1.pharagraph1
text1
2.pharagraph2
text2
3.pharagraph3
text3
"""
The expected result would be:
['1.pharagraph1 text1' , '2.pharagraph2 text2', '3.pharagraph3 text3']
I tried: re.split('\n\d{1}',a) and it doesn't work for this task.
You can use a lookahead to only split when the newline and spaces are followed by a digit:
import re
result = re.split(r'\n\s+(?=\d)', a)
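A runnable sketch of this split, assuming the sample really is indented (the literal below is a guess at the intended input, and parts and cleaned are illustrative names). Splitting alone leaves the inner newline and indentation inside each piece, so the sketch collapses whitespace afterwards with str.split and str.join.

```python
import re

# Assumed input: every line after the first is indented by two spaces.
a = "1.pharagraph1\n  text1\n  2.pharagraph2\n  text2\n  3.pharagraph3\n  text3\n"

# Split at a newline plus indentation, but only when a digit comes next.
parts = re.split(r"\n\s+(?=\d)", a)

# Each piece still contains "\n  text"; collapse any run of whitespace into one space.
cleaned = [" ".join(part.split()) for part in parts]
print(cleaned)
```

Note that \s+ needs at least one whitespace character after the newline, so this variant does not split input whose numbered lines start in the first column.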
If you really have leading spaces and you did not make a typo when creating a sample string, you can use
[re.sub(r'[^\S\n]*\n[^\S\n]*', ' ', x).strip() for x in re.split(r'\n[^\S\n]*(?=\d)', a)]
# => ['1.pharagraph1 text1', '2.pharagraph2 text2', '3.pharagraph3 text3']
See the Python demo.
The \n[^\S\n]*(?=\d) pattern matches a newline and then any zero or more horizontal whitespaces ([^\S\n]*) followed with a digit. Then, inside each match, every sequence of 0+ horizontal whitespaces, newline and 0+ horizontal whitespaces is replaced with a space.
If the string has no leading whitespace, you can use a simpler approach:
import re
a="""1.pharagraph1
text1
2.pharagraph2
text2
3.pharagraph3
text3"""
print( [x.replace("\n"," ") for x in re.split(r'\n(?=\d)', a)] )
# => ['1.pharagraph1 text1', '2.pharagraph2 text2', '3.pharagraph3 text3']
See the online Python demo. Here, the string is simply split at a newline that is followed with a digit (\n(?=\d)) and then all newlines are replaced with a space.
I want to split a string into sentences.
But there are some exceptions that I did not expect:
str = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
Desired split:
split = ['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE. Name.', 'Text.']
How can I do this using a regex in Python?
My efforts so far:
str = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
split = re.split('(?<=[.|?|!|...])\s', str)
print(split)
I got:
['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE.', 'Name.', 'Text.']
Expected (for that part):
['UPPERCASE.UPPERCASE. Name.']
The whitespace matched by \s in [A-Z]+\. Name should not be split on.
You can use
(?<=[.?!])(?<![A-Z]\.(?=\s+Name))\s+
See the regex demo. Details:
(?<=[.?!]) - a positive lookbehind that requires ., ? or ! immediately to the left of the current location
(?<![A-Z]\.(?=\s+Name)) - a negative lookbehind that fails the match if an uppercase letter and a . are immediately to the left of the current location and 1+ whitespaces + Name follow immediately to the right. The variable-length \s+ sits inside the nested lookahead rather than in the lookbehind body itself, which is why the pattern works with Python re, whose lookbehinds must be fixed-width. The \s+ in the lookahead is needed to look past the whitespace and check for Name; that same whitespace is then matched and consumed by the final \s+ below.
\s+ - one or more whitespace chars.
See the Python demo:
import re
text = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
print(re.split(r'(?<=[.?!])(?<![A-Z]\.(?=\s+Name))\s+', text))
# => ['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE. Name.', 'Text.']
s = 'hello "ok and #com" name'
s.split()
Is there a way to split this into a list on whitespace, without splitting on the whitespace inside quotes, while still allowing special characters inside the quotes?
["hello", '"ok and #com"', "name"]
I want the output to look like this, with special characters allowed inside the quotes no matter what.
Can someone help me with this?
(I've looked at other posts that are related to this, but those posts don't allow the special characters when I have tested it.)
You can do it with re.split(). Regex pattern from: https://stackoverflow.com/a/11620387/42346
import re
re.split(r'\s+(?=[^"]*(?:"[^"]*"[^"]*)*$)',s)
Returns:
['hello', '"ok and #com"', 'name']
Explanation of regex:
\s+ # match whitespace
(?= # start lookahead
[^"]* # match any number of non-quote characters
(?: # start non-capturing group, repeated zero or more times
"[^"]*" # one quoted portion of text
[^"]* # any number of non-quote characters
)* # end non-capturing group
$ # match end of the string
) # end lookahead
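The commented breakdown above can be handed to the regex engine almost verbatim through the re.VERBOSE flag, which ignores unescaped whitespace and # comments outside character classes. A minimal sketch (the name QUOTE_AWARE_WS is made up for this example, and it assumes the quotes in the input are balanced):

```python
import re

# Same pattern as in the answer, written out with re.VERBOSE so the comments live in the regex.
QUOTE_AWARE_WS = re.compile(r'''
    \s+                  # match whitespace
    (?=                  # start lookahead
        [^"]*            # any number of non-quote characters
        (?:              # start non-capturing group, repeated zero or more times
            "[^"]*"      # one quoted portion of text
            [^"]*        # any number of non-quote characters
        )*               # end non-capturing group
        $                # match end of the string
    )                    # end lookahead
''', re.VERBOSE)

s = 'hello "ok and #com" name'
tokens = QUOTE_AWARE_WS.split(s)
print(tokens)
```

The lookahead only succeeds when an even number of quote characters remains to the right, which is what keeps whitespace inside a quoted portion from being treated as a separator.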
One option is to use regular expressions to capture the strings in quotes, delete them, and then to split the remaining text on whitespace. Note that this won't work if the order of the resulting list matters.
import re
items = []
s = 'hello "ok and #com" name'
# regex to find quoted strings
patt = re.compile(r'(".*?")')
# collect every quoted string (re.search would only find the first one)
for item in patt.findall(s):
    items.append(item)
# split on whitespace after removing quoted strings
for item in patt.sub('', s).split():
    items.append(item)
print(items)
# => ['"ok and #com"', 'hello', 'name']
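If the original order does matter, one alternative (not part of the answer above) is to match the tokens directly instead of deleting the quoted parts: each token is either a quoted run or a run of non-whitespace. This is only a sketch, assuming quotes are balanced and never nested or escaped; the name tokens is illustrative.

```python
import re

s = 'hello "ok and #com" name'

# Try a complete quoted run first; otherwise fall back to a run of non-whitespace characters.
tokens = re.findall(r'"[^"]*"|\S+', s)
print(tokens)
```

Because re.findall scans left to right, the tokens come back in the order they appear in the string.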
I have a multiline string which looks like this:
st = '''emp:firstinfo\n
:secondinfo\n
thirdinfo
'''
print(st)
What I am trying to do is remove the second ':' from my string, and get output which looks like this:
'''emp:firstinfo\n
secondinfo\n
thirdinfo
'''
Simply put, if a line starts with a ':', I want to drop that colon.
Here's what I've done:
mat_obj = re.match(r'(.*)\n*([^:](.*))\n*(.*)' , st)
print(mat_obj.group())
I can't see my mistake. Could anyone please help me understand where I am going wrong?
You may use re.sub with this regex:
>>> print (re.sub(r'([^:\n]*:[^:\n]*\n)\s*:(.+)', r'\1\2', st))
emp:firstinfo
secondinfo
thirdinfo
RegEx Details:
(: Start 1st capture group
[^:\n]*: Match 0 or more of any character that is not : and newline
:: Match a colon
[^:\n]*: Match 0 or more of any character that is not : and newline
\n: Match a new line
): End 1st capture group
\s*: Match 0 or more whitespaces
:: Match a colon
(.+): Match 1 or more of any characters (except newlines) in 2nd capture group
\1\2: Is used in replacement to put back substring captured in groups 1 and 2.
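The breakdown above can be tried end to end with a small script. This is a sketch under an assumption: the input below is a simplified stand-in for the question's string (single newlines, no indentation), since the exact whitespace of the original is not recoverable. The name result is illustrative.

```python
import re

# Simplified stand-in for the question's input (assumed layout).
st = "emp:firstinfo\n:secondinfo\nthirdinfo\n"

# Group 1 keeps the "key:value" line; the colon that opens the next line is matched but not captured.
result = re.sub(r"([^:\n]*:[^:\n]*\n)\s*:(.+)", r"\1\2", st)
print(result)
```

Only the colon that directly follows a key:value line is dropped, because it is the one part of the match that is not put back by \1\2.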
You can use sub instead, just don't capture the undesired part.
(.*\n)[^:]*:(.*\n)(.*)
Replace by
\1\2\3
import re
regex = r"(.*\n)[^:]*:(.*\n)(.*)"
test_str = ("emp:firstinfo\\n\n"
" :secondinfo\\n\n"
" thirdinfo")
subst = "\\1\\2\\3"
# You can limit the number of replacements with the count argument (0 means replace all)
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
print(result)
# Import the regular expression module
import re
# Remove every lowercase ASCII letter by replacing it with an empty string
text = "The film Pulp Fiction was released in year 1994"
result = re.sub(r"[a-z]", "", text)
print(result)