Split python string if newline starts with digit - python

I am trying to split a text wherever a new line starts with a digit.
a="""1.pharagraph1
text1
2.pharagraph2
text2
3.pharagraph3
text3
"""
The expected result would be:
['1.pharagraph1 text1' , '2.pharagraph2 text2', '3.pharagraph3 text3']
I tried re.split('\n\d{1}', a), but it doesn't work for this task.

You can use a lookahead to split only where the newline and any following whitespace are followed by a digit:
import re
result = re.split(r'\n\s*(?=\d)', a)
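A small sketch of why the attempt from the question loses the digits while a lookahead keeps them, using the sample string from the question. The variable names consumed and kept are only illustrative.

```python
import re

a = """1.pharagraph1
text1
2.pharagraph2
text2
3.pharagraph3
text3
"""

# The digit is part of the separator here, so re.split drops it.
consumed = re.split(r'\n\d', a)
# The lookahead only checks for the digit; it stays in the next chunk.
kept = re.split(r'\n(?=\d)', a)

print(consumed)
# => ['1.pharagraph1\ntext1', '.pharagraph2\ntext2', '.pharagraph3\ntext3\n']
print(kept)
# => ['1.pharagraph1\ntext1', '2.pharagraph2\ntext2', '3.pharagraph3\ntext3\n']
```

The chunks still contain the inner newlines, so a second step is needed to join each heading with its text.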

If you really have leading spaces and you did not make a typo when creating a sample string, you can use
[re.sub(r'[^\S\n]*\n[^\S\n]*', ' ', x).strip() for x in re.split(r'\n[^\S\n]*(?=\d)', a)]
# => ['1.pharagraph1 text1', '2.pharagraph2 text2', '3.pharagraph3 text3']
The \n[^\S\n]*(?=\d) pattern matches a newline and then zero or more horizontal whitespaces ([^\S\n]*), provided a digit follows. Then, inside each chunk returned by re.split, every sequence of 0+ horizontal whitespaces, a newline and 0+ horizontal whitespaces is replaced with a single space.
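The two steps can be wrapped in a helper. This is only a sketch: the function name split_numbered and the indented sample string are assumptions made for illustration, not part of the question.

```python
import re

def split_numbered(text):
    """Split before each line that starts with a digit, then join wrapped lines."""
    # Step 1: split at a newline (plus optional indentation) that precedes a digit.
    chunks = re.split(r'\n[^\S\n]*(?=\d)', text.strip())
    # Step 2: replace each remaining line break (and its indentation) with one space.
    return [re.sub(r'[^\S\n]*\n[^\S\n]*', ' ', chunk).strip() for chunk in chunks]

# Hypothetical input with leading spaces on the continuation and heading lines.
indented = """1.pharagraph1
    text1
  2.pharagraph2
    text2
  3.pharagraph3
    text3
"""
print(split_numbered(indented))
# => ['1.pharagraph1 text1', '2.pharagraph2 text2', '3.pharagraph3 text3']
```

Note that a continuation line which itself starts with a digit would also trigger a split, so this only fits input where digits at line start always mark a new item.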
If the string has no leading whitespace, you can use a simpler approach:
import re
a="""1.pharagraph1
text1
2.pharagraph2
text2
3.pharagraph3
text3"""
print( [x.replace("\n"," ") for x in re.split(r'\n(?=\d)', a)] )
# => ['1.pharagraph1 text1', '2.pharagraph2 text2', '3.pharagraph3 text3']
Here, the string is simply split at each newline that is followed by a digit (\n(?=\d)), and then the remaining newlines are replaced with a space.

Related

how can I perform conditional splitting with exceptions in python

I want to split a string into sentences.
But there are some exceptions that I did not expect:
str = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
Desired split:
split = ['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE. Name.', 'Text.']
How can I do this with a regex in Python?
My efforts so far:
str = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
split = re.split('(?<=[.|?|!|...])\s', str)
print(split)
I got:
['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE.', 'Name.', 'Text.']
Expected:
['UPPERCASE.UPPERCASE. Name.']
The whitespace in [A-Z]+\. Name should not be split on.
You can use
(?<=[.?!])(?<![A-Z]\.(?=\s+Name))\s+
Details:
(?<=[.?!]) - a positive lookbehind that requires ., ? or ! immediately to the left of the current location
(?<![A-Z]\.(?=\s+Name)) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is an uppercase letter and a . that are followed by 1+ whitespaces and Name. Note that the + is used inside the lookahead, not directly in the lookbehind, which is why the pattern works with Python re, where a lookbehind must have a fixed width. The \s+ in the lookahead is necessary to check for Name after the whitespace that will be matched and consumed by the next \s+ pattern below.
\s+ - one or more whitespace chars.
See the Python demo:
import re
text = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
print(re.split(r'(?<=[.?!])(?<![A-Z]\.(?=\s+Name))\s+', text))
# => ['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE. Name.', 'Text.']
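For readability, the same pattern as above can be written with re.VERBOSE so that each part carries its own comment. This is just a sketch; the constant name SENTENCE_BREAK is made up for the example.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE.
SENTENCE_BREAK = re.compile(r"""
    (?<=[.?!])                # the whitespace must follow ., ? or !
    (?<![A-Z]\.(?=\s+Name))   # but not an uppercase letter and a dot that precede whitespace + Name
    \s+                       # the whitespace that is split on
""", re.VERBOSE)

text = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
print(SENTENCE_BREAK.split(text))
# => ['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE. Name.', 'Text.']
```

Compiling the pattern once also helps when many strings are split in a loop.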

regex to match string before colon till whitespace

I have a sample string from a text file. I want to find every word that is immediately followed by a colon, going back only as far as the preceding whitespace.
I have written code like this:
import re
text = 'From: mathew <mathew#mantis.co.uk>\nSubject: Alt.Atheism FAQ: Atheist Resources\n\nArchive-name: atheism/resources\nAlt-atheism-archive-name:'
email_data = re.findall("[^\s].*(?=:)", text)
print(email_data)
Output:
['From', 'Subject: Alt.Atheism FAQ', 'Archive-name', 'Alt-atheism-archive-name']
Desired Output:
['From', 'Subject', 'FAQ', 'Archive-name', 'Alt-atheism-archive-name']
The code picks up everything up to the newline character because of the .* I used. I want to restrict the match to the nearest whitespace, so I put [^\s] in front, but it is not working. What could I do instead?
You may use
email_data = re.findall(r"\S[^:\s]+(?=:)", text)
Details
\S - a non-whitespace char
[^:\s]+ - 1+ chars other than : and whitespace
(?=:) - immediately to the right, there must be a : char (it is not consumed, not added to the match value).
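A runnable sketch of this pattern on the sample text from the question (the string is split over several literals here only for readability):

```python
import re

text = ('From: mathew <mathew#mantis.co.uk>\n'
        'Subject: Alt.Atheism FAQ: Atheist Resources\n\n'
        'Archive-name: atheism/resources\n'
        'Alt-atheism-archive-name:')

email_data = re.findall(r"\S[^:\s]+(?=:)", text)
print(email_data)
# => ['From', 'Subject', 'FAQ', 'Archive-name', 'Alt-atheism-archive-name']
```

Because \S and [^:\s]+ together need at least two characters, a one-character label such as A: would not be matched; dropping the leading \S is one way to allow that, if such labels can occur.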
Use the re.IGNORECASE flag with the regex pattern
\b[a-z-]+(?=:(?:\s|$))
https://regex101.com/r/0UHsbo/1
https://ideone.com/oz91bP
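A short sketch of how this stricter pattern behaves. The sample header line is made up for illustration; it contains a colon that is not followed by whitespace.

```python
import re

strict = re.compile(r'\b[a-z-]+(?=:(?:\s|$))', re.IGNORECASE)

sample = 'Subject: Re:Hello'  # hypothetical input, not from the question

# Only letters and hyphens are allowed, and the colon must be followed
# by whitespace or the end of the string.
print(strict.findall(sample))
# => ['Subject']

# The looser pattern accepts any colon, so it also returns 'Re'.
print(re.findall(r'\S[^:\s]+(?=:)', sample))
# => ['Subject', 'Re']
```

Which of the two is preferable depends on whether a colon inside a value should count as a label separator.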

Splitting whitespace string into list but not splitting whitespace in quotes and also allow special characters (like $, %, etc) in quotes in Python

s = 'hello "ok and #com" name'
s.split()
Is there a way to split this into a list on whitespace characters, without splitting on whitespace inside quotes, and while allowing special characters inside the quotes?
["hello", '"ok and #com"', "name"]
I want it to output this, while allowing special characters inside the quotes no matter what.
Can someone help me with this?
(I've looked at other related posts, but the solutions there did not allow the special characters when I tested them.)
You can do it with re.split(). Regex pattern from: https://stackoverflow.com/a/11620387/42346
import re
re.split(r'\s+(?=[^"]*(?:"[^"]*"[^"]*)*$)',s)
Returns:
['hello', '"ok and #com"', 'name']
Explanation of regex:
\s+ # match whitespace
(?= # start lookahead
[^"]* # match any number of non-quote characters
(?: # start non-capturing group, repeated zero or more times
"[^"]*" # one quoted portion of text
[^"]* # any number of non-quote characters
)* # end non-capturing group
$ # match end of the string
) # end lookahead
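A quick check of the pattern with other special characters inside the quotes, plus one unbalanced input. Both sample strings and the constant name QUOTE_AWARE_WS are made up for this sketch.

```python
import re

QUOTE_AWARE_WS = re.compile(r'\s+(?=[^"]*(?:"[^"]*"[^"]*)*$)')

# Special characters inside the quotes do not affect the split.
print(QUOTE_AWARE_WS.split('price "$5 and 10%" ok'))
# => ['price', '"$5 and 10%"', 'ok']

# The lookahead counts the quotes that remain to the right, so an
# unclosed quote shifts which whitespace is treated as "outside".
print(QUOTE_AWARE_WS.split('a "b c'))
# => ['a "b', 'c']
```

In other words, the approach assumes the quotes in the input are balanced.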
One option is to use regular expressions to capture the strings in quotes, delete them, and then to split the remaining text on whitespace. Note that this won't work if the order of the resulting list matters.
import re
items = []
s = 'hello "ok and #com" name'
# regex to find quoted strings
patt = re.compile(r'(".*?")')
# collect every quoted string first (findall also handles more than one)
for item in patt.findall(s):
    items.append(item)
# split on whitespace after removing quoted strings
for item in patt.sub('', s).split():
    items.append(item)
print(items)
# => ['"ok and #com"', 'hello', 'name']
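If the order does matter, one possible alternative is to match the tokens instead of splitting: either a quoted run or a run of non-whitespace characters. This is a sketch of a different technique (re.findall with alternation), not the method from the answers above.

```python
import re

s = 'hello "ok and #com" name'

# Try a double-quoted run first; otherwise take a run of non-whitespace.
tokens = re.findall(r'"[^"]*"|\S+', s)
print(tokens)
# => ['hello', '"ok and #com"', 'name']
```

This assumes each quoted part starts at a token boundary; input such as key="a b" would be cut at the inner space, because the \S+ branch matches first.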

Regex - Remove space between two punctuation marks but not between punctuation mark and letter

I have the following regex for removing spaces between punctuation marks.
re.sub(r'\s*(\W)\s*', r'\1', s)
which works fine in almost all of my test cases, except for this one:
This is! ? a test! ?
For which I need to have
This is!? a test!?
and get
This is!?a test!?
How do I NOT remove the space between that ? and 'a'? What am I missing?
This should work:
import re
text = 'This is! ? a test! ?'  # avoid shadowing the built-in str
res = re.sub(r'(?<=[?!])\s+(?=[?!])', '', text)
print(res)
Output:
This is!? a test!?
Explanation:
(?<=[?!]) # positive lookbehind, make sure we have a punctuation before (you can add all punctuations you want to check)
\s+ # 1 or more spaces
(?=[?!]) # positive lookahead, make sure we have a punctuation after
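To cover all ASCII punctuation rather than only ? and !, the character class can be built from string.punctuation. This is a sketch under that assumption; the names PUNCT, GAP and collapse_punct_gaps are made up for the example.

```python
import re
import string

# Character class with every ASCII punctuation mark; re.escape keeps
# characters such as ], ^ and - literal inside the brackets.
PUNCT = "[" + re.escape(string.punctuation) + "]"
GAP = re.compile(rf"(?<={PUNCT})\s+(?={PUNCT})")

def collapse_punct_gaps(text):
    """Remove whitespace that sits between two punctuation marks."""
    return GAP.sub("", text)

print(collapse_punct_gaps("This is! ? a test! ?"))
# => This is!? a test!?
print(collapse_punct_gaps("Wait ! ? Really . . ."))
# => Wait !? Really ...
```

Whitespace next to a letter is left alone because at least one of the two lookarounds fails there.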
Try this:
import re
string = "This is! ? a test! ?"
string = re.sub(r"(\W)\s*(\W)", r"\1\2", string)
print(string)
Output:
This is!? a test!?
In order to match a punctuation char with a regex in Python, you may use (?:[^\w\s]|_) pattern, it matches any char but a letter, digit or whitespace.
So, you need to match one or more whitespaces (\s+) that is immediately preceded with a punctuation char ((?<=[^\w\s]|_)) and is immediately followed with such a char ((?=[^\w\s]|_)):
(?<=[^\w\s]|_)\s+(?=[^\w\s]|_)
Python demo:
import re
text = "This is! ? a test! ?"
print( re.sub(r"(?<=[^\w\s]|_)\s+(?=[^\w\s]|_)", "", text) )
# => This is!? a test!?
Another option is to use the PyPI regex module, with \p{Punct} inside positive lookarounds to match the punctuation marks.
For example:
import regex
pattern = r"(?<=\p{Punct})\s+(?=\p{Punct})"
s = 'This is! ? a test! ?'
print(regex.sub(pattern, '', s))
Output
This is!? a test!?
Note that \s could also match a newline. You could also use [^\S\r\n] to match a whitespace char except newlines.
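A small sketch of that difference with the standard re module. The sample text with a line break between two punctuation marks is made up for illustration.

```python
import re

text = "Really!\n? Yes! ?"  # hypothetical input containing a newline

# \s+ also removes the line break between ! and ?.
any_ws = re.sub(r"(?<=[^\w\s]|_)\s+(?=[^\w\s]|_)", "", text)
# [^\S\r\n]+ only removes whitespace that stays on the same line.
same_line = re.sub(r"(?<=[^\w\s]|_)[^\S\r\n]+(?=[^\w\s]|_)", "", text)

print(repr(any_ws))
# => 'Really!? Yes!?'
print(repr(same_line))
# => 'Really!\n? Yes!?'
```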

Matching an apostrophe only within a word or string

I'm looking for a Python regex that can match 'didn't' and returns only the character that is immediately preceded by an apostrophe, like 't, but not the 'd or t' at the beginning and end.
I have tried (?=.*\w)^(\w|')+$ but it only matches the apostrophe at the beginning.
Some more examples:
'I'm' should only match 'm and not 'I
'Erick's' should only return 's and not 'E
The text will always start and end with an apostrophe and can include apostrophes within the text.
To match an apostrophe inside a whole string, match it anywhere but at the start or end of the string:
(?!^)'(?!$)
Often, the apostrophe is searched for only inside a word (in fact, inside a pair of words where the second one is shortened); then you may use
\b'\b
Here, the ' is preceded and followed with a word boundary, so the ' must have a word char (a letter, digit or _) on each side. Yes, digits and the _ char are allowed on both sides.
If you need to match a ' only between two letters, use
(?<=[A-Za-z])'(?=[A-Za-z]) # ASCII only
(?<=[^\W\d_])'(?=[^\W\d_]) # Any Unicode letters
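The three variants can be compared by listing the positions they match. This is only a sketch; the helper name positions and the second sample string are made up for illustration.

```python
import re

def positions(pattern, text):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

INNER = r"(?!^)'(?!$)"                    # anywhere but the string start/end
WORD = r"\b'\b"                           # a word char on both sides
LETTERS = r"(?<=[A-Za-z])'(?=[A-Za-z])"   # an ASCII letter on both sides

print(positions(INNER, "'didn't'"))
# => [5]
print(positions(WORD, "rock'n'roll 5'6"))
# => [4, 6, 13]
print(positions(LETTERS, "rock'n'roll 5'6"))
# => [4, 6]
```

The apostrophe between the digits 5 and 6 is accepted by the word-boundary version but rejected by the letters-only one.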
As for this current question, here is a bunch of possible solutions:
import re
s = "'didn't'"
print(s.strip("'")[s.strip("'").find("'")+1])
print(re.search(r'\b\'(\w)', s).group(1))
print(re.search(r'\b\'([^\W\d_])', s).group(1))
print(re.search(r'\b\'([a-z])', s, flags=re.I).group(1))
print(re.findall(r'\b\'([a-z])', "'didn't know I'm a student'", flags=re.I))
The s.strip("'")[s.strip("'").find("'")+1] gets the character after the first ' after stripping the leading/trailing apostrophes.
The re.search(r'\b\'(\w)', s).group(1) solution gets the word char (\w: a letter, digit or _; for ASCII input that is [a-zA-Z0-9_], and it can be adjusted from here) after a ' that is preceded with a word char (due to the \b word boundary).
The re.search(r'\b\'([^\W\d_])', s).group(1) solution is almost identical to the one above, but it only fetches a letter, as [^\W\d_] matches any word char except digits and _.
Note that the re.search(r'\b\'([a-z])', s, flags=re.I).group(1) solution is nearly identical to the above one, but you cannot make it Unicode aware with re.UNICODE.
The last re.findall(r'\b\'([a-z])', "'didn't know I'm a student'", flags=re.I) just shows how to fetch multiple letter chars from a string input.
