Let's assume that I want to remove a comma from a sentence, but in this particular way.
I ate pineapples, grapes -> I ate pineapples I ate grapes
we know python 2.0, 3.0 well -> we know python 2.0 well we know python 3.0 well
Basically, I want to keep the surrounding words and repeat them for each comma-separated item. Is there an easy way to do this using the 're' library in Python?
You're basically splitting the string on the comma, keeping the first sentence, and repeating it with the last word of the first sentence replaced by the words after the comma.
s = "I ate pineapples, grapes"
s1 = "we know python 2.0, 3.0 well"
def my_split(string):
    sep = string.split(',')                        # split on the comma(s)
    sentence = ' '.join(sep[0].split()[:-1])       # everything before the last word of the first part
    words = [sep[0].split()[-1], *sep[1:]]         # the comma-separated items
    return ' '.join(f'{sentence} {w.strip()}' for w in words)
print(my_split(s))
print(my_split(s1))
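Since the question asks about the re module specifically, here is one possible regex-based sketch (my own illustration, not part of the answer above; the helper name expand_commas is made up, and it assumes each comma-separated item is a single word, as in the examples): it captures the words before the items, the items themselves, and any trailing words, then repeats the prefix and suffix around each item.
import re
def expand_commas(s):
    # (prefix words)(single-word items separated by commas)(optional trailing words)
    m = re.match(r'^(.*?)(\S+(?:\s*,\s*\S+)+)(.*)$', s)
    if not m:
        return s
    prefix, items, suffix = m.groups()
    return ' '.join(f'{prefix.strip()} {item.strip()}{suffix}'.strip()
                    for item in items.split(','))
print(expand_commas("I ate pineapples, grapes"))        # I ate pineapples I ate grapes
print(expand_commas("we know python 2.0, 3.0 well"))    # we know python 2.0 well we know python 3.0 well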
I'm testing the spaCy library, but I'm having trouble cleaning up the sentences (i.e. removing special characters, punctuation, and patterns like [Verse], [Chorus], \n, ...) before working with the library.
I have removed these elements to some extent; however, when I perform the tokenization, I notice that there are extra white spaces, and that terms like "it's" are split into "it" and "s".
Here is my code with some text examples:
import pandas as pd
import spacy
nlp = spacy.load('en_core_web_sm')   # assuming the small English model is installed
text1 = "[Intro] Well, alright [Chorus] Well, it's 1969, okay? All across the USA It's another year for me and you"
text2 = "[Verse 1] For fifty years they've been married And they can't wait for their fifty-first to roll around"
text3 = "Passion that shouts And red with anger I lost myself Through alleys of mysteries I went up and down Like a demented train"
df = pd.DataFrame({'text':[text1, text2, text3]})
replacer ={'\n':' ',"[\[].*?[\]]": " ",'[!"#%\'()*+,-./:;<=>?#\[\]^_`{|}~1234567890’”“′‘\\\]':" "}
df['cleanText'] = df['text'].replace(replacer, regex=True)
df.head()
df['new_col'] = df['cleanText'].apply(lambda x: nlp(x))
df
#Output:
result1 = " Well alright Well it s okay All across the USA It s another year for me and you"
result2 = " For fifty years they ve been married And they can t wait for their fifty first to roll around"
result3 = "Passion that shouts And red with anger I lost myself Through alleys of mysteries I went up and down Like a demented train"
When I try to tokenize, I get, for example: ( , Well, , alright, , Well, , it, s, ...)
I used the same logic to remove the characters when tokenizing via nltk and there it worked. Does anyone know what I might be doing wrong?
This regex pattern removes almost all of the extra white space: change the " " replacement values to "" and finally add ' +': ' ', like this:
replacer = {'\n':'',"[\[].*?[\]]": "",'[!"#%\'()*+,-./:;<=>?#\[\]^_`{|}~1234567890’”“′‘\\\]':"", ' +': ' '}
Then, after applying the regex pattern, call the strip() method to remove the white space at the beginning and end:
df['cleanText'] = df['cleanText'].apply(lambda x: x.strip())
and when you define the column new_col using nlp():
df['new_col'] = df['cleanText'].apply(lambda x: nlp(x))
>>> df
                                                text                                          cleanText                                            new_col
0  [Intro] Well, alright [Chorus] Well, it's 1969...  Well alright Well its okay All across the USA ...  (Well, alright, Well, its, okay, All, across, ...
1  [Verse 1] For fifty years they've been married...  For fifty years theyve been married And they c...  (For, fifty, years, they, ve, been, married, A...
2  Passion that shouts And red with anger I lost ...  Passion that shouts And red with anger I lost ...  (Passion, that, shouts, And, red, with, anger,...
[3 rows x 3 columns]
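Putting the answer together, here is a minimal end-to-end sketch of the cleanup (the spaCy model name en_core_web_sm is an assumption; the question does not say which model is used):
import pandas as pd
import spacy
nlp = spacy.load('en_core_web_sm')            # assumed model
text1 = "[Intro] Well, alright [Chorus] Well, it's 1969, okay? All across the USA It's another year for me and you"
df = pd.DataFrame({'text': [text1]})
replacer = {'\n': '',
            r'\[.*?\]': '',                   # drop [Intro], [Chorus], [Verse 1], ...
            r'[!"#%\'()*+,./:;<=>?\[\]^_`{|}~0-9’”“′‘\\-]': '',   # punctuation, digits, curly quotes
            ' +': ' '}                        # collapse runs of spaces
df['cleanText'] = df['text'].replace(replacer, regex=True).apply(lambda x: x.strip())
df['new_col'] = df['cleanText'].apply(nlp)
print(df['cleanText'][0])
# Well alright Well its okay All across the USA Its another year for me and you
print([t.text for t in df['new_col'][0]][:5])
# ['Well', 'alright', 'Well', 'its', 'okay']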
I'm not terribly familiar with Python regex, or regex in general, but I'm hoping to demystify it all a bit more with time.
My problem is this: given a string like ' Apple Banana Cucumber Alphabetical Fruit Whoops', I'm trying to use Python's 're.findall' function to produce a list that looks like this: my_list = [' Apple', ' Banana', ' Cucumber', ' Alphabetical Fruit', ' Whoops']. In other words, I'm trying to find a regex expression that can [look for a bunch of whitespace followed by some non-whitespace], and then check if there is a single space with some more non-whitespace characters after that.
This is the function I've written that gets me cloooose but not quite:
re.findall("\s+\S+\s{1}\S*", my_list)
Which results in:
[' Apple ', ' Banana ', ' Cucumber ', ' Alphabetical Fruit']
I think this result makes sense. It first finds the whitespace, then some non-whitespace, but then it looks for at least one whitespace (which leaves out 'Whoops'), and then looks for any number of other non-whitespace characters (which is why there's no space after 'Alphabetical Fruit'). I just don't know what character combination would give me the intended result.
Any help would be hugely appreciated!
-WW
You can do:
\s+\w+(?:\s\w+)?
\s+\w+ matches one or more whitespace characters, followed by one or more of [A-Za-z0-9_]
(?:\s\w+)? is an optional (?, zero or one) non-capturing group ((?:)) that matches a whitespace (\s) followed by one or more of [A-Za-z0-9_] (\w+). Essentially this is to match Fruit in Alphabetical Fruit.
Example:
In [701]: text = ' Apple Banana Cucumber Alphabetical Fruit Whoops'
In [702]: re.findall(r'\s+\w+(?:\s\w+)?', text)
Out[702]:
[' Apple',
' Banana',
' Cucumber',
' Alphabetical Fruit',
' Whoops']
Your pattern works already, just make the second part (the 'compound word' part) optional:
\s+\S+(\s\S+)?
https://regex101.com/r/Ua8353/3/
(fixed \s{1} per @heemayl)
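As a quick check, with the optional part written as a non-capturing group (with a capturing group, re.findall would return only the group instead of the whole match) and a sample string that has runs of spaces between the top-level items, as described in the question:
import re
text = '  Apple  Banana  Cucumber  Alphabetical Fruit  Whoops'
print(re.findall(r'\s+\S+(?:\s\S+)?', text))
# ['  Apple', '  Banana', '  Cucumber', '  Alphabetical Fruit', '  Whoops']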
I am trying to get all names that start with a capital letter and end with a full stop on the same line, where the number of characters is between 3 and 5.
My text is as follows:
King. Great happinesse
Rosse. That now Sweno, the Norwayes King,
Craues composition:
Nor would we deigne him buriall of his men,
Till he disbursed, at Saint Colmes ynch,
Ten thousand Dollars, to our generall vse
King. No more that Thane of Cawdor shall deceiue
Our Bosome interest: Goe pronounce his present death,
And with his former Title greet Macbeth
Rosse. Ile see it done
King. What he hath lost, Noble Macbeth hath wonne.
I am testing it out on this link. I am trying to get all words of between 3 and 5 characters but haven't succeeded.
Does this produce your desired output?
import re
re.findall(r'[A-Z].{2,4}\.', text)
When text contains the text in your question it will produce this output:
['King.', 'Rosse.', 'King.', 'Rosse.', 'King.']
The regex pattern matches an initial capital letter followed by any 2 to 4 characters and then a literal dot/period. You can tighten that up if required, e.g. using [a-z] in the pattern: [A-Z][a-z]{2,4}\. would match an upper case character followed by between 2 and 4 lowercase characters, followed by a literal dot/period.
If you don't want duplicates you can use a set to get rid of them:
>>> set(re.findall(r'[A-Z].{2,4}\.', text))
set(['Rosse.', 'King.'])
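For reference, the tightened pattern produces the same matches on this sample:
>>> re.findall(r'[A-Z][a-z]{2,4}\.', text)
['King.', 'Rosse.', 'King.', 'Rosse.', 'King.']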
You may have your own reasons for wanting to use regexes here, but Python provides a rich set of string methods and (IMO) it's easier to understand the code using these:
matched_words = []
for line in open('text.txt'):          # text.txt holds the sample text above
    words = line.split()
    for word in words:
        # starts with a capital, ends with a full stop, 3 to 5 characters long (excluding the dot)
        if word[0].isupper() and word[-1] == '.' and 3 <= len(word) - 1 <= 5:
            matched_words.append(word)
print matched_words
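Assuming the sample text above is saved as text.txt, this loop produces the same list as the regex approach: ['King.', 'Rosse.', 'King.', 'Rosse.', 'King.'].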
I am following a tutorial to identify and print the words on either side of a particular word in a string;
f is the string Mango grapes Lemon Ginger Pineapple
def findFruit(f):
    global fruit
    found = [re.search(r'(.*?) (Lemon) (.*?)$', word) for word in f]
    for i in found:
        if i is not None:
            fruit = i.group(1)
            fruit = i.group(3)
grapes and Ginger are output when I print fruit. However, what I want the output to look like is "grapes" # "Ginger" (note the double quotes and the # sign).
You can use string formatting here, via the str.format() method:
def findFruit(f):
    found = re.search(r'.*? (.*?) Lemon (.*?) .*?$', f)
    if found is not None:
        print '"{}" # "{}"'.format(found.group(1), found.group(2))
Or, a lovely solution Kimvais posted in the comments:
print '"{0}" # "{1}"'.format(*found.groups())
I've made some edits. Firstly, a for-loop isn't needed here (nor is a list comprehension): you were iterating through each letter of the string instead of each word, and even then you don't want to iterate through each word.
I also changed your regular expression (do note that I'm not that great at regex, so there is probably a better solution).
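A quick check with the sample string from the question (assuming the corrected function above):
>>> findFruit('Mango grapes Lemon Ginger Pineapple')
"grapes" # "Ginger"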
Given a set of space-delimited words that may come in any order, how can I match only the words from a given set? For example, say I have:
apple monkey banana dog and I want to match apple and banana. How might I do that?
Here's what I've tried:
m = re.search("(?P<fruit>[apple|banana]*)", "apple monkey banana dog")
m.groupdict() --> {'fruit':'apple'}
But I want to match both apple and banana.
In (?P<fruit>[apple|banana]*)
[apple|banana]* defines a character class, i.e. this token matches a single a, p, l, e, |, b or n, and the * then says 'match this 0 or more times'. (You probably meant to use +, anyway, which would mean 'match one or more times'.)
What you want is (apple|banana) which will match the string apple or the string banana.
Learn more: http://www.regular-expressions.info/reference.html
For your next question, to get all matches a regex makes against a string, not just the first, use http://docs.python.org/2/library/re.html#re.findall
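For example, a minimal illustration of findall with the corrected alternation:
>>> import re
>>> re.findall(r'apple|banana', 'apple monkey banana dog')
['apple', 'banana']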
If you want the pattern to be able to match repeated words, you're going to fail on the whitespace between them. Try this:
import re
input = ['apple', 'banana', 'orange']
# a group matching any one word from the list
reg_string = '(' + '|'.join(input) + ')'
# optional whitespace, but only when it is followed by another word from the list
lookahead_string = r'(\s(?=' + '|'.join(input) + r'))?' + reg_string + '?'
out_reg_string = reg_string + (len(input) - 1) * lookahead_string
matches = re.findall(out_reg_string, string_to_match)
where string_to_match is the string you are searching within. out_reg_string can be used to match something like:
"apple banana orange"
"apple orange"
"apple banana"
"banana apple"
or any other ordering or combination of the words in your input list.
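Note that out_reg_string contains several capturing groups, so re.findall will return tuples of group values rather than the full matched text; re.search(...).group(0) shows the whole matched span. A small sketch (the list is renamed to words here to avoid shadowing the built-in input):
import re
words = ['apple', 'banana', 'orange']
reg_string = '(' + '|'.join(words) + ')'
lookahead_string = r'(\s(?=' + '|'.join(words) + r'))?' + reg_string + '?'
out_reg_string = reg_string + (len(words) - 1) * lookahead_string
print(re.search(out_reg_string, 'apple banana orange').group(0))   # apple banana orange
print(re.search(out_reg_string, 'banana apple').group(0))          # banana apple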