I have a large file (f) with a lot of dialogue. I need a regex that will concatenate the split quotes (i.e. "Hello," Josh said enthusiastically, "I have a question!"), but not delete the middle portion. So, for this example, the output would be, "Hello, I have a question!" and then "Josh said enthusiastically" would be retained somewhere. I think I am on the right track, but haven't found something that works for these specifications. Here is the code I have already tried out:
for line in f:
    re.findall(r'"(.*?)"', line)
    output_file.write(line)
and
split = re.compile(r'''
(,\")
(.*?)
(,)
( )
(")''', re.VERBOSE)
for line in f:
    m = split_quote.match(split)
    if m:
        output_file.write(m.group(1) + m.group(5))
Thank you for any help!
How about something like this?
/(".+?)"(.+?),\s+?"(.+?[.?!]+")/g
Then replace the capture groups in this order:
$1 $3$2.
like so:
m.group(1) + " " + m.group(3) + m.group(2) + "."
Example:
"Hello," Josh said enthusiastically, "I have a question!"
to
"Hello, I have a question!" Josh said enthusiastically.
Explanation:
http://bsite.cc/inoD/Screen%20Shot%202017-01-18%20at%206.01.22%20PM.png
First part matches a ", and then any characters until it sees another quote.
"Hello,"| Josh said enthusiastically, "I have a question!"
Second part matches text in the middle of the quotes, until it reaches a comma (also matches whitespace after comma and the first quote)
"Hello," Josh said enthusiastically, | "I have a question!"
Third group matches until the next quote
"Hello," Josh said enthusiastically, "I have a question!"
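The pattern above is written in JavaScript-style /.../g notation. A minimal Python sketch of the same replacement, assuming each line holds at most one split quote of this shape (the helper name join_split_quote is hypothetical), might look like this:

```python
import re

# Same three capture groups as above, without the /.../g delimiters.
pattern = re.compile(r'(".+?)"(.+?),\s+?"(.+?[.?!]+")')

def join_split_quote(line):
    # Reorder the groups as $1 $3$2. so the narration follows the joined quote.
    return pattern.sub(
        lambda m: m.group(1) + " " + m.group(3) + m.group(2) + ".",
        line,
    )

line = '"Hello," Josh said enthusiastically, "I have a question!"'
print(join_split_quote(line))
# "Hello, I have a question!" Josh said enthusiastically.
```

Lines that do not match the pattern are returned unchanged, so the function can be applied to every line of the file.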
Try this regex:
(?<=\")([^\s].*?[^\s])(?=\")|(?<=\")\s(.*?)\s(?=\")
The regex above matches the two quoted strings, Hello, and I have a question!, in group 1, so you can print them together. It matches the unquoted portion, Josh said enthusiastically, in group 2, which is handy if you decide to use it later.
Check out demo: https://regex101.com/r/m7nqnu/1
Here is working Python code:
import re

text = '''"Hello," Josh said enthusiastically, "I have a question!"'''
pattern = r"(?<=\")([^\s].*?[^\s])(?=\")|(?<=\")\s(.*?)\s(?=\")"

# Group 1 captures the quoted parts
print('Group 1:')
for m in re.finditer(pattern, text):
    if m.group(1) is not None:
        print(m.group(1))

# Group 2 captures the unquoted middle part
print('Group 2:')
for m in re.finditer(pattern, text):
    if m.group(2) is not None:
        print(m.group(2))
Output:
Group 1:
Hello,
I have a question!
Group 2:
Josh said enthusiastically,
As long as there are no quotes within quotes, and all quotes properly match, and the phrase always consists of two quoted parts with an unquoted part in the middle:
parts = [x.strip() for x in re.findall(r'"([^"]+)', text)]
print(parts[0] + " " + parts[2])
# Hello, I have a question!
print(parts[1])
# Josh said enthusiastically,
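Under the same assumptions (no nested quotes, every quote properly closed), a plain str.split on the quote character, rather than a regex, can handle any number of quoted parts per line. This is only a sketch, and the function name split_dialogue is made up for illustration:

```python
def split_dialogue(line):
    # Splitting on '"' alternates between text outside quotes (even indexes)
    # and text inside quotes (odd indexes).
    pieces = line.split('"')
    spoken = [p.strip() for p in pieces[1::2] if p.strip()]
    narration = [p.strip() for p in pieces[0::2] if p.strip()]
    return " ".join(spoken), " ".join(narration)

text = '"Hello," Josh said enthusiastically, "I have a question!"'
spoken, narration = split_dialogue(text)
print(spoken)
# Hello, I have a question!
print(narration)
# Josh said enthusiastically,
```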
I have been trying to learn how I can remove special characters on random given strings. A random given string could be something like:
uh\n haha - yes 'nope' \t tuben\xa01337
and I have used both regex and string.translate to try what could work out for me:
import re
random_string = "uh\n haha - yes 'nope' \t tuben\xa01337"
print(re.sub(r"/[' \n \t\r]|(\xa0)/g", '', random_string))
print("-------")
print(random_string.translate(str.maketrans({c: "" for c in "\n \xa0\t\r"})))
The output of that returns:
uh
haha - yes 'nope' tuben 1337
-------
uhhaha-yes'nope'tuben1337
The problem is that neither attempt works as I want, since I want the output to be:
uh haha - yes nope tuben 1337
How can I do that? The rules are:
\n, \t, \xa0 and similar characters should each be replaced with a single space
' and " should be removed without inserting a space
Two or more consecutive whitespace characters should be collapsed into a single space
Any special characters should be removed as well
You can use
import re
random_string = "uh\n haha - yes 'nope' \t tuben\xa01337"
random_string = re.sub(r"\s+", " ", random_string).strip().replace('"', '').replace("'", '')
print(random_string)
See the Python demo.
Notes:
re.sub(r"\s+", " ", random_string) - shrinks any chunks of one or more whitespace chars into a single regular space char
.strip() - removes leading/trailing whitespace
.replace('"', '').replace("'", '') - removes " and ' chars.
/[' \n \t\r]|(\xa0)/g
This is syntax that is used by tools like sed or Vim, not Python's re module.
The equivalent would be
print(re.sub(r"[' \n \t\r]|(\xa0)", '', random_string))
which prints
uhhaha-yesnopetuben1337
which is not far off, but you also removed all spaces.
If you don't remove the spaces,
print(re.sub(r"['\n\t\r]|(\xa0)", '', random_string))
you get
uh haha - yes nope  tuben1337
which has too many spaces between nope and tuben (and none before 1337).
A solution is to use the inverse regular expression (which matches runs of characters you want to keep) with re.findall to get a list of words, which you can then re-join:
result = re.findall(r"[^' \n\t\r\xa0]+", random_string)
print(' '.join(result))
which prints
uh haha - yes nope tuben 1337
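The same split-and-rejoin idea can also be sketched without a regular expression, because str.split() with no arguments splits on any run of whitespace, including \n, \t and \xa0. The function name normalize is hypothetical:

```python
def normalize(text):
    # Drop both kinds of quote characters first.
    for quote in ("'", '"'):
        text = text.replace(quote, "")
    # split() with no arguments splits on runs of whitespace and
    # discards leading and trailing whitespace.
    return " ".join(text.split())

random_string = "uh\n haha - yes 'nope' \t tuben\xa01337"
print(normalize(random_string))
# uh haha - yes nope tuben 1337
```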
This regular expression will do the trick:
>>> print(re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", random_string)))
uh haha yes nope tuben 1337
The outer re.sub matches each run of whitespace (including \n, \t, \r and \xa0) and replaces it with a single space.
The inner re.sub removes every character that is neither a word character nor whitespace, such as - and '. The /.../g delimiters from your attempt are dropped, because Python's re module treats them as literal characters.
I have a regex like this: "[a-z|A-Z|0-9]: " that will match one alphanumeric character, a colon, and a space. I wonder how to split the string while keeping the alphanumeric character in the first part of the result. I cannot change the regex because in some cases the string has a special character before the colon and space.
Example:
line = re.split("[a-z|A-Z|0-9]: ", "A: ") # Result: ['A', '']
line = re.split("[a-z|A-Z|0-9]: ", ":: )5: ") # Result: [':: )5', '']
line = re.split("[a-z|A-Z|0-9]: ", "Delicious :): I want to eat this again") # Result: ['Delicious :)', 'I want to eat this again']
Update:
Actually, my problem is splitting from a review file. Suppose I have a file that every line has this pattern: [title]: [review]. I want to get the title and review, but some of the titles have a special character before a colon and space, and I don't want to match them. However, it seems that the character before a colon and space that I want to match apparently is an alphanumeric one.
You could split using a negative lookbehind with a single colon or use a character class [:)] where you can specify which characters should not occur directly to the left.
(?<!:):[ ]
In parts
(?<!:) Negative lookbehind, assert what is on the left is not a colon
:[ ] Match a colon followed by a space (Added square brackets only for clarity)
For example
import re
pattern = r"(?<!:): "
line = re.split(pattern, "A: ") # Result: ['A', '']
print(line)
line = re.split(pattern, ":: )5: ") # Result: [':: )5', '']
print(line)
line = re.split(pattern, "Delicious :): I want to eat this again") # Result: ['Delicious :)', 'I want to eat this again']
print(line)
Output
['A', '']
[':: )5', '']
['Delicious :)', 'I want to eat this again']
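If a review body can itself contain ": ", splitting only once keeps it intact. The sketch below assumes each line has the [title]: [review] shape described in the question; parse_review is a hypothetical helper name:

```python
import re

pattern = re.compile(r"(?<!:): ")

def parse_review(line):
    # maxsplit=1 stops after the first separator, so any later ": "
    # stays inside the review text.
    parts = pattern.split(line, maxsplit=1)
    if len(parts) != 2:
        return None  # no separator found on this line
    title, review = parts
    return title, review

print(parse_review("Delicious :): Pros: tasty, cheap"))
# ('Delicious :)', 'Pros: tasty, cheap')
```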
Solution
First of all, as your examples show, you need to match characters other than a-zA-Z0-9, so we should just use the . matcher, which matches any character except a newline.
So I think the expression you're looking for might be this one:
(.*?):(?!.*:) (.*)
You can use it like so:
import re
pattern = r"(.*?):(?!.*:) (.*)"
matcher = re.compile(pattern)
txt1 = "A: "
txt2 = ":: )5: "
txt3 = "Delicious :): I want to eat this again"
result1 = matcher.search(txt1).groups() # ('A', '')
result2 = matcher.search(txt2).groups() # (':: )5', '')
result3 = matcher.search(txt3).groups() # ('Delicious :)', 'I want to eat this again')
Explanation
We use capture groups (the parentheses) to get the different parts in the string into different groups, search then finds these groups and outputs them in the tuple.
The (?!.*:) part is called "Negative Lookahead", and we use it to make sure we start capturing from the last : we find.
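One consequence of anchoring on the last colon is worth seeing with a review that contains a colon of its own. The input string below is invented for illustration:

```python
import re

matcher = re.compile(r"(.*?):(?!.*:) (.*)")

# The lookahead rejects every colon that has another colon after it,
# so the split happens at the last colon on the line.
result = matcher.search("Great: pros: tasty").groups()
print(result)
# ('Great: pros', 'tasty')
```

If review bodies may contain colons, this puts part of the review into the title group, so the pattern fits best when only titles contain extra colons.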
Edit
BTW, if, as you mentioned, you have many lines each containing a review, you can use this snippet to get all of the reviews separated by title and body at once:
import re
pattern = r"(.*?):(?!.*:) (.*)\n?"
matcher = re.compile(pattern)
# Trailing spaces are written explicitly, since the pattern needs ": " after each title
reviews = (
    "A: \n"
    ":: )5: \n"
    "Delicious :): I want to eat this again\n"
)
parsed_reviews = matcher.findall(reviews) # [('A', ''), (':: )5', ''), ('Delicious :)', 'I want to eat this again')]
I want to be able to remove all punctuation and single quotes ' from a string, unless the single quote ' is in the middle of a word.
At this point I have the following code:
with open('test.txt','r') as f:
    for line in f:
        line = line.lower()
        line = re.sub(r"[^a-z ']+", " ", line)
        print(line)
If there happens to be a line in test.txt like:
Here is some stuff. 'Now there are quotes.' Now there's not.
The result I want is:
here is some stuff now there are quotes now there's not
But the result I get is:
here is some stuff 'now there are quotes' now there's not
How can I remove the single quotes ' from a string if they're at the beginning or end of the word but not in the middle? Thanks for the help!
Split the string, use strip() on each word to remove leading and trailing characters on it, then join it all back together.
>>> s = "'here is some stuff 'now there are quotes' now there's not'"
>>> print(' '.join(w.strip("'") for w in s.split()).lower())
here is some stuff now there are quotes now there's not
Using regular expressions, you could first remove every ' that doesn't follow a letter, then remove every ' that doesn't precede a letter (thus keeping only the ones that both follow and precede a letter):
line = "Here is some stuff. 'Now there are quotes.' Now there's not."
print(re.sub(r"'([^A-Za-z])", r"\1", re.sub(r"([^A-Za-z])'", r"\1", line)))
# Here is some stuff. Now there are quotes. Now there's not.
It is probably more efficient to do it @TigerhawkT3's way, though the two methods produce different results if you have something like 'this'. If you want to remove that second ' too, then the regular expressions method is probably the simplest you can do.
Here's another solution using regular expressions with lookarounds.
This method will preserve any whitespace your string may have.
import re
rgx = re.compile(r"(?<!\w)\'|\'(?!\w)")
# Regex explanation:
# (?<!\w)\'  match any quote not preceded by a word character
# |          or
# \'(?!\w)   match any quote not followed by a word character
s = "'here is some stuff 'now there are quotes' now there's not'"
print(rgx.sub('', s)) # here is some stuff now there are quotes now there's not
If a word is a sequence of 1+ letters, digits and underscores that can be matched with \w+ you may use
re.sub(r"(?!\b'\b)'", "", text)
See the regex demo. Here, ' is matched unless it is both preceded and followed by a letter, digit or _.
Or, if words are strictly linguistic words that only consist of letters, use
re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) # ASCII only
re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) # any Unicode letter support
See Demo #2 (ASCII only letters) and Demo #3 (see last line in the demo text). Here, ' is matched unless it is both preceded and followed by a letter (ASCII or any).
Python demo:
import re
text = "'text... 'some quotes', there's none'. three 'four' can't, '2'4', '_'_', 'l'école'"
print( re.sub(r"(?!\b'\b)'", "", text) )
# => text... some quotes, there's none. three four can't, 2'4, _'_, l'école
print( re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, lécole
print( re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, l'école
Here is a complete solution to remove whatever you don't want in a string:
def istext(text):
    # True if the word contains at least one alphanumeric character
    return any(x.isalnum() for x in text)

def stripit(text, ofwhat):
    # Strip every unwanted character from both ends of the word
    return text.strip(ofwhat)

def purge(text, notwanted="'\"!#$%&/()=?*+-.,;:_<>|\\[]{}"):
    lines = text.splitlines()
    lines = [" ".join(stripit(word, notwanted) for word in line.split() if istext(word)) for line in lines]
    return "\n".join(lines)

>>> print(purge("'Nice, .to, see! you. Isn't it?'"))
Nice to see you Isn't it
Note: this also collapses every run of whitespace within a line into a single space and drops leading and trailing whitespace.
In a sentence, how can I remove apostrophes, double quotes, commas and so on from all words, except in words like it's, what's etc.? At the end of the sentence there must also be a space between the last word and the full stop.
For example
Input Sentence :
"'This has punctuation, and it's hard to remove. ?"
Desired Output Sentence :
This has punctuation and it's hard to remove .
Use a negative look-behind
(?<!\w)["'?]|,(?= )
Remove the matched ', " and ? characters (and any comma followed by a space) through re.sub.
And your code would be,
>>> s = '\"\'This has punctuation, and it\'s hard to remove. ?\" '
>>> m = re.sub(r'(?<!\w)[\"\'\?]|,(?= )', r'', s)
>>> m
"This has punctuation and it's hard to remove. "
I propose this code:
import re
sentences = [""""'This has punctuation, and it's hard to remove. ?" """,
"Did you see Cress' haircut?.",
"This 'thing' hasn't a really bad habit, you know?.",
"'I bought this for $30 from Best Buy it's. What a waste of money! The ear gels are 'comfortable at first, but what's after an hour."]
for s in sentences:
    # Remove the specified characters
    new_s = re.sub(r"""["?,$!]|'(?!(?<! ')[ts])""", "", s)
    # Deal with the final dot
    new_s = re.sub(r"\.", " .", new_s)
    print(new_s)
Output:
This has punctuation and it's hard to remove .
Did you see Cress haircut .
This thing hasn't a really bad habit you know .
I bought this for 30 from Best Buy it's . What a waste of money The ear gels are comfortable at first but what's after an hour .
The regex:
["?,$!]          # Match " ? , $ or !
|                # OR
'                # A ' if it does not have...
(?!
    (?<! ')
    [ts]         # t or s after it, provided it has no ` '` before the t or s
)
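The commented layout above can be kept in the code itself with re.VERBOSE. This is a sketch rather than the answer's own code: in verbose mode unescaped whitespace is ignored, so the literal space in the lookbehind is written as [ ] here.

```python
import re

pattern = re.compile(r"""
    ["?,$!]          # match " ? , $ or !
    |                # or
    '                # an apostrophe, unless...
    (?!
        (?<![ ]')    # ...it is not preceded by a space
        [ts]         # and it is followed by t or s
    )
""", re.VERBOSE)

cleaned = pattern.sub("", "This 'thing' hasn't a really bad habit, you know?.")
print(cleaned)
# This thing hasn't a really bad habit you know.
```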
Use this (the group inside the lookbehind, shown as (...), is missing from the source):
(?<![tT](...))["'?:;,.]
If you also want to leave the period at the end of a line (as long as it is preceded by a space):
(?<![tT](...))(?<! (?=.$))["'?:;,.]
My take on this is to remove all punctuation (including quotation marks) at either end of a word. So split the sentence into words (separated by whitespace) and strip any leading or trailing punctuation from each word:
>>> ''.join(e.strip(string.punctuation) for e in re.split(r"(\s)", st))
"This has punctuation and it's hard to remove "
Use the string.strip(chars) method to remove the outer quotes, like this:
output = chaine.strip("\"")
Be careful: inside a string literal you have to escape some characters with a \, such as the quote character that delimits the literal, and \ itself. Alternatively, enclose one kind of quote in the other: "'" or '"'.
Edit: hmm, I didn't think about the apostrophes. If they are the only problem, you can strip the rest first and then parse the string manually with a for loop: record the index of each apostrophe found and, if it is followed by an 's', leave it. You have to decide on the lexical/semantic rules before coding it.
Edit 2:
If the string is a single sentence that always ends with a dot and always needs the space, then use this at the end:
chaine[:-1]+" "+chaine[-1:]
I am very new to Python.
I want to change a sentence if it contains repeated words.
Correct:
Ex. "this just so so so nice" --> "this just so nice"
Ex. "this is just is is" --> "this is just is"
Right now I am using this regex, but it also changes repeated letters.
Ex. "My friend and i is happy" --> "My friend and is happy" (it removes the "i" and the space) ERROR
text = re.sub(r'(\w+)\1', r'\1', text) #remove duplicated words in row
How can I make the same change, but checking whole words instead of letters?
text = re.sub(r'\b(\w+)( \1\b)+', r'\1', text) #remove duplicated words in row
The \b matches the empty string, but only at the beginning or end of a word.
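To see the effect of the word boundaries, here is a small sketch that runs the pattern on the sentences from the question (the function name collapse_repeats is hypothetical):

```python
import re

def collapse_repeats(text):
    # \b on both sides forces \1 to match a whole word, so the "i" in
    # "i is" is not treated as a repeat of the first letter of "is".
    return re.sub(r'\b(\w+)( \1\b)+', r'\1', text)

print(collapse_repeats("this just so so so nice"))   # this just so nice
print(collapse_repeats("this is just is is"))        # this is just is
print(collapse_repeats("My friend and i is happy"))  # My friend and i is happy
```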
Non-regex solution using itertools.groupby:
>>> strs = "this is just is is"
>>> from itertools import groupby
>>> " ".join([k for k,v in groupby(strs.split())])
'this is just is'
>>> strs = "this just so so so nice"
>>> " ".join([k for k,v in groupby(strs.split())])
'this just so nice'
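If repeats that differ only in case should also be collapsed, groupby accepts a key function. A minimal sketch, assuming the first spelling of each run is the one to keep (collapse_repeats_ci is a made-up name):

```python
from itertools import groupby

def collapse_repeats_ci(text):
    words = text.split()
    # Group consecutive words that are equal ignoring case and keep
    # the first word of each group.
    return " ".join(next(group) for _, group in groupby(words, key=str.lower))

print(collapse_repeats_ci("Hello hello world world"))
# Hello world
```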
\b: matches a word boundary
\w: matches any word character
\1: backreference to the first captured group, i.e. the word that is repeated
import re
def Remove_Duplicates(Test_string):
    Pattern = r"\b(\w+)(?:\W\1\b)+"
    return re.sub(Pattern, r"\1", Test_string, flags=re.IGNORECASE)
Test_string1 = "Good bye bye world world"
Test_string2 = "Ram went went to to his home"
Test_string3 = "Hello hello world world"
print(Remove_Duplicates(Test_string1))
print(Remove_Duplicates(Test_string2))
print(Remove_Duplicates(Test_string3))
Result:
Good bye world
Ram went to his home
Hello world