How to split a string on the two characters "[" and "]" - python

For example, calling something like x.split() on the following:
x = """[Chorus: Rihanna & Swizz Beatz]
I just wanted you to know
...more lyrics
[Verse 2: Kanye West & Swizz Beatz]
I be Puerto Rican day parade floatin'
... more lyrics"""
should give
["I just wanted you to know ... more lyrics", "I be Puerto Rican day parade floatin' ... more lyrics"]
Also, how would you save the deleted parts in brackets? Thank you. Splitting on an unknown string between two delimiters is hard :/

Use re.split:
>>> import re
>>> x = """[Chorus: Rihanna & Swizz Beatz] I just wanted you to know...more lyrics [Verse 2: Kanye West & Swizz Beatz] I be Puerto Rican day parade floatin' ... more lyrics"""
>>> [i.strip() for i in re.split(r'[\[\]]', x) if i]
# ['Chorus: Rihanna & Swizz Beatz', 'I just wanted you to know...more lyrics', 'Verse 2: Kanye West & Swizz Beatz', "I be Puerto Rican day parade floatin' ... more lyrics"]
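If you want only the lyrics and not the tags (as in your desired output), a small variant of the same call is to split on the entire bracketed tag; a sketch building on the answer above, not part of it:
>>> [i.strip() for i in re.split(r'\[.*?\]', x) if i.strip()]
# ['I just wanted you to know...more lyrics', "I be Puerto Rican day parade floatin' ... more lyrics"]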

data = x.split(']')
print(data)
data = data[1:]
print(data)
location = 0
for i in data:
    data[location] = i.split('[')[0]
    location = location + 1
print(data)
I got this output for your initial input
['I just wanted you to know...more lyrics', "I be Puerto Rican day parade floatin'... more lyrics"]
I hope this helps

Per the python documentation: https://docs.python.org/2/library/re.html
Python is by and large an excellent language with good consistency, but there are still some quirks that should be ironed out. You would think that the re.split() function would simply take an argument to decide whether the delimiter is returned. It turns out that whether it returns the delimiter is determined by the pattern itself: if you surround your regex with parentheses (making it a capturing group), re.split() will return the delimiter as part of the resulting list.
Here are two ways you might try to accomplish your goal:
re.split("]",string_here)
and
re.split("(])",string_here)
The first way will return the string with your delimiter removed. The second way will return the string with your delimiter still there, as a separate entry.
For example, running the first example on the string "This is ] a string." would produce:
['This is ', ' a string.']
And running the second example would produce:
['This is ', ']', ' a string.']
Personally, I'm not sure why they made this strange design choice.
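A quick self-contained demo of both behaviors (note that an unmatched ] outside a character class is treated as a literal by Python's re):
import re

s = "This is ] a string."
print(re.split(r"]", s))    # ['This is ', ' a string.']
print(re.split(r"(])", s))  # ['This is ', ']', ' a string.']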

import re
...
input = '[youwontseethis]what[hi]ever'
...
output = re.split(r'\[.*?\]', input)
print(output)
# ['', 'what', 'ever']
If the input string starts immediately with a 'tag' like your example, the first item in the resulting list will be an empty string. If you don't want this behavior you could also do this:
import re
...
input = '[youwontseethis]what[hi]ever'
...
output = re.split(r'\[.*?\]', input)
output = output[1:] if output[0] == '' else output
print(output)
# ['what', 'ever']
To get the tags, simply replace the
output = re.split(r'\[.*?\]', input)
with
output = re.findall(r'\[.*?\]', input)
# ['[youwontseethis]', '[hi]']
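If you want the tags and the text between them paired up, you can combine the two calls above; a small sketch (the [1:] assumes the string starts with a tag, as in this example):
import re

s = '[youwontseethis]what[hi]ever'
tags = re.findall(r'\[.*?\]', s)     # ['[youwontseethis]', '[hi]']
texts = re.split(r'\[.*?\]', s)[1:]  # ['what', 'ever']
for tag, text in zip(tags, texts):
    print(tag, '->', text)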

Related

How to extract only sentences from some texts in Python?

I have a text that is in the following form:
document = "Hobby: I like going to the mountains.\n Something (where): To Everest mountain.\n\n The reason: I want to go because I like nature.\n Activities: I'd like to go hiking and admiring the beauty of the nature. "
I want to extract only the sentences from this text, without the "Hobby:", "Something (where):", "The reason:" parts. Only the sentences. For example, "To Everest mountain" would not count as a sentence, since it is not a full sentence.
The idea is that I need to get rid of those labels followed by ":" (Hobby:, The reason:); it doesn't matter what is written before the ":", the point is to remove it when it appears at the beginning of the line, and to extract only the sentences from what remains.
I'd appreciate any idea.
You can just use the split() method. First, split by \n, then check whether ":" is in the line and append to the final list the part after the ": ". Here is the code:
document = "Hobby: I like going to the mountains.\n Something (where): To Everest mountain.\n\n The reason: I want to go because I like nature.\n Activities: I'd like to go hiking and admiring the beauty of the nature. "
sentences = []
for element in document.split("\n"):
    if ":" in element:
        sentences.append(element.split(": ")[1])
print(*sentences, sep="\n")
And the output will be:
I like going to the mountains.
To Everest mountain.
I want to go because I like nature.
I'd like to go hiking and admiring the beauty of the nature.
But if a sentence can itself contain ": ", you should use the following code:
document = "Hobby: I like go: ing to the mountain: s.\n Something (where): To Everest mountain.\n\n The reason: I want to go because I like nature.\n Activities: I'd like to go hiking and admiring the beauty of the nature. "
sentences = []
for element in document.split("\n"):
    if ":" in element:
        sentences.append(element.split(": ")[1:])
for line in sentences:
    print(": ".join(line))
Output:
I like go: ing to the mountain: s.
To Everest mountain.
I want to go because I like nature.
I'd like to go hiking and admiring the beauty of the nature.
Hope that helped!
If the text file is structured so that each sentence is separated by a newline character, parsing it with a regex might be feasible.
As the other answer mentioned, use the split() function to separate the lines:
lines = document.split("\n")
With that you can apply regex to each line:
import re
sentences = []
for line in lines:
    # () included in the class so that labels like "Something (where):" also match
    result = re.search(r"^[a-zA-Z0-9\s()]*:(.*)", line)
    if not result:
        continue
    sentences.extend(result.groups())
print(sentences)
To check out what the regex does, visit a website such as https://regex101.com/
In short: it matches alphanumeric characters (plus whitespace and parentheses) up to the first : symbol, then grabs everything after it. The ^ symbol is crucial here, as it anchors the match at the start of the line; this way you won't grab text after a second : further along.
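The same idea can be written as a comprehension if you prefer it compact (a sketch assuming Python 3.8+ for the := assignment expression):
import re

pattern = re.compile(r"^[a-zA-Z0-9\s()]*:(.*)")
sentences = [m.group(1).strip() for line in lines if (m := pattern.search(line))]
print(sentences)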

Difficulties in removing characters and white space to tokenize text via Spacy

I'm testing the Spacy library, but I'm having trouble cleaning up the sentences (i.e. removing special characters, punctuation, and patterns like [Verse], [Chorus], \n, ...) before working with the library.
I have removed these elements to some extent; however, when I perform the tokenization, I notice that there are extra white spaces, as well as the separation of terms like "it" and "s" (it's).
Here is my code with some text examples:
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")  # assuming a small English model; the question doesn't say which one was used

text1 = "[Intro] Well, alright [Chorus] Well, it's 1969, okay? All across the USA It's another year for me and you"
text2 = "[Verse 1] For fifty years they've been married And they can't wait for their fifty-first to roll around"
text3 = "Passion that shouts And red with anger I lost myself Through alleys of mysteries I went up and down Like a demented train"
df = pd.DataFrame({'text':[text1, text2, text3]})
replacer ={'\n':' ',"[\[].*?[\]]": " ",'[!"#%\'()*+,-./:;<=>?#\[\]^_`{|}~1234567890’”“′‘\\\]':" "}
df['cleanText'] = df['text'].replace(replacer, regex=True)
df.head()
df['new_col'] = df['cleanText'].apply(lambda x: nlp(x))
df
#Output:
result1 = " Well alright Well it s okay All across the USA It s another year for me and you"
result2 = " For fifty years they ve been married And they can t wait for their fifty first to roll around"
result3 = "Passion that shouts And red with anger I lost myself Through alleys of mysteries I went up and down Like a demented train"
When I try to tokenize, I get, for example: ( , Well, , alright, , Well, , it, s, ...)
I used the same logic to remove the characters before tokenizing via nltk and there it worked. Does anyone know what I might be doing wrong?
This regex pattern removes almost all of the extra white space: change the " " replacements to "" and finally add ' +': ' ', like this:
replacer = {'\n':'',"[\[].*?[\]]": "",'[!"#%\'()*+,-./:;<=>?#\[\]^_`{|}~1234567890’”“′‘\\\]':"", ' +': ' '}
Then, after applying the regex patterns, call the strip() method to remove the white space at the beginning and end:
df['cleanText'] = df['cleanText'].apply(lambda x: x.strip())
and when you define the column new_col using nlp():
df['new_col'] = df['cleanText'].apply(lambda x: nlp(x))
>>> df
                                                text                                          cleanText                                            new_col
0  [Intro] Well, alright [Chorus] Well, it's 1969...  Well alright Well its okay All across the USA ...  (Well, alright, Well, its, okay, All, across, ...
1  [Verse 1] For fifty years they've been married...  For fifty years theyve been married And they c...  (For, fifty, years, they, ve, been, married, A...
2  Passion that shouts And red with anger I lost ...  Passion that shouts And red with anger I lost ...  (Passion, that, shouts, And, red, with, anger,...

[3 rows x 3 columns]
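For reference, a condensed, self-contained sketch of the same cleaning step (the punctuation class here is a simplified stand-in for the full one above, and the replacements rely on dict insertion order so that the space-collapsing pattern runs last):
import pandas as pd

text1 = "[Intro] Well, alright [Chorus] Well, it's 1969, okay?"
df = pd.DataFrame({'text': [text1]})

# drop bracketed tags, drop punctuation/digits, then collapse runs of spaces
replacer = {r'\[.*?\]': '', r"[!\"#%'(),.?0-9]": '', r' +': ' '}
df['cleanText'] = df['text'].replace(replacer, regex=True).str.strip()
print(df['cleanText'][0])  # Well alright Well its okay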

Split by regex of new line and capital letter

I've been struggling to split my string by a regex expression in Python.
I have a text file which I load that is in the format of:
"Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch
at Kate's house. Kyle went home at 9. \nSome other sentence
here\n\u2022Here's a bulleted line"
I'd like to get the following output:
['Peter went to the gym; he worked out for two hours','Kyle ate lunch
at Kate's house. He went home at 9.', 'Some other sentence here',
'\u2022Here's a bulleted line']
I'm looking to split my string by a new line and a capital letter or a bullet point in Python.
I've tried tackling the first half of the problem, splitting my string by just a new line and capital letter.
Here's what I have so far:
print re.findall(r'\n[A-Z][a-z]+',str,re.M)
This just gives me:
[u'\nKyle', u'\nSome']
which is just the first word. I've tried variations of that regex expression but I don't know how to get the rest of the line.
I assume that to also split by the bullet point, I would just include an OR regex expression that is in the same format as the regex of splitting by a capital letter. Is this the best way?
I hope this makes sense and I'm sorry if my question is in anyway unclear. :)
You can use this split function:
>>> str = u"Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch at Kate's house. Kyle went home at 9. \nSome other sentence here\n\u2022Here's a bulleted line"
>>> print re.split(u'\n(?=\u2022|[A-Z])', str)
[u'Peter went to the gym; \nhe worked out for two hours ',
u"Kyle ate lunch at Kate's house. Kyle went home at 9. ",
u'Some other sentence here',
u"\u2022Here's a bulleted line"]
You can split at a \n that is followed by a capital letter or the bullet character:
import re
s = """
Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch
at Kate's house. Kyle went home at 9. \nSome other sentence
here\n\u2022Here's a bulleted line
"""
new_list = filter(None, re.split('\n(?=•)|\n(?=[A-Z])', s))
Output:
['Peter went to the gym; \nhe worked out for two hours ', "Kyle ate lunch \nat Kate's house. Kyle went home at 9. ", 'Some other sentence \nhere', "•Here's a bulleted line\n"]
Or, without using the symbol for the bullet character:
new_list = filter(None, re.split('\n(?=\u2022)|\n(?=[A-Z])', s))
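If you're on Python 3, the same approach works with a list comprehension instead of filter (a sketch, assuming Python 3 where filter returns an iterator rather than a list):
import re

s = ("Peter went to the gym; \nhe worked out for two hours \n"
     "Kyle ate lunch at Kate's house. Kyle went home at 9. \n"
     "Some other sentence here\n\u2022Here's a bulleted line")
parts = [p for p in re.split('\n(?=\u2022|[A-Z])', s) if p]
print(parts)
# ['Peter went to the gym; \nhe worked out for two hours ',
#  "Kyle ate lunch at Kate's house. Kyle went home at 9. ",
#  'Some other sentence here', "•Here's a bulleted line"]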

replace more than one pattern python

I have reviewed various links, but they all showed how to replace multiple words in one pass. However, instead of words I want to replace patterns, e.g.
RT @amrightnow: "The Real Trump" Trump About You" Watch Make #1
https:\/\/t.co\/j58e8aacrE #tcot #pjnet #1A #2A #Tru mp #trump2016
https:\/\/t.co\u2026
When I perform the following two commands on the above text I get the desired output:
result = re.sub(r"http\S+","",sent)
result1 = re.sub(r"@\S+","",result)
This way I am removing all the URLs and @ handles from the tweet. The output will be something like the following:
>>> result1
'RT "The Real Trump" Trump About You" Watch Make #1 #tcot #pjnet #1A #2A #Tru mp #trump2016 '
Could someone let me know the best way to do this? I will basically be reading tweets from a file; I want to read each tweet and replace these handles and URLs with blanks.
You need the regex "or" operator which is the pipe |:
re.sub(r"http\S+|@\S+","",sent)
If you have a long list of patterns that you want to remove, a common trick is to use join to create the regular expression:
to_match = [r'http\S+',
            r'@\S+',
            r'something_else_you_might_want_to_remove']
re.sub('|'.join(to_match), '', sent)
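One caveat worth noting (my addition, not part of the answer): if the things you join are literal strings rather than regex patterns, escape them first so characters like . or + aren't treated as regex syntax:
import re

literals = ['t.co', 'C++']  # hypothetical literal substrings to remove
pattern = '|'.join(re.escape(w) for w in literals)
print(re.sub(pattern, '', 'see t.co for C++ tips'))  # 'see  for  tips'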
You can use an "or" pattern by separating the patterns with |:
import re
s = u'RT @amrightnow: "The Real Trump" Trump About You" Watch Make #1 https:\/\/t.co\/j58e8aacrE #tcot #pjnet #1A #2A #Tru mp #trump2016 https:\/\/t.co\u2026'
result = re.sub(r"http\S+|@\S+", "", s)
print result
Output
RT "The Real Trump" Trump About You" Watch Make #1 #tcot #pjnet #1A #2A #Tru mp #trump2016
See the subsection '|' in the regular expression syntax documentation.

nested for loop for splitting string based on multiple delimiters

I'm working on a Python assignment which requires a text to be delimited, sorted and printed as:
sentences are delimited by .
phrases by ,
then printed
What I've done so far:
text = "what time of the day is it. i'm heading out to the ball park, with the kids, on this nice evening. are you willing to join me, on the walk to the park, tonight."
for i, phrase in enumerate(text.split(',')):
    print('phrase #%d: %s' % (i+1, phrase))
phrase #1: what time of the day is it. i'm heading out to the ball park
phrase #2: with the kids
phrase #3: on this nice evening. are you willing to join me
phrase #4: on the walk to the park
phrase #5: tonight.
I know a nested for loop is needed and have tried with:
for s, sentence in enumerate(text.split('.')):
    for p, phrase in enumerate(text.split(',')):
        print('sentence #%d:','phrase #%d: %s' %(s+1,p+1,len(sentence),phrase))
TypeError: not all arguments converted during string formatting
A hint and/or a simple example would be welcomed.
You probably want:
'sentence #%d:\nphrase #%d: %d %s\n' %(s+1,p+1,len(sentence),phrase)
And in the inner loop, you certainly want to split sentence, not text again
TypeError: not all arguments converted during string formatting
Is a hint.
Your loops are fine.
'sentence #%d:','phrase #%d: %s' %(s+1,p+1,len(sentence),phrase)
is wrong.
Count the %d and %s conversion specifications. Count the values after the % operator.
The numbers aren't the same, are they? That's a TypeError.
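You can reproduce the mismatch directly in the REPL (three conversion specifications, four values):
>>> 'sentence #%d: phrase #%d: %s' % (1, 2, 5, 'hello')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: not all arguments converted during string formatting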
There are a couple of issues with your code snippet:
for s, sentence in enumerate(text.split('.')):
    for p, phrase in enumerate(text.split(',')):
        print('sentence #%d:','phrase #%d: %s' %(s+1,p+1,len(sentence),phrase))
If I understand you correctly, you want to split the text into sentences delimited by '.'. Then each of these sentences you want to split into phrases, which are delimited by ','. So the second line should actually split the output of the outer loop's enumeration, something like
for p, phrase in enumerate(sentence.split(',')):
The print statement: if you ever see an error like this TypeError, you can be sure a value of one type was supplied where another type was promised. But there is no assignment here? It's an indirect one, through the print formatting. What you committed to in the format string is that you would supply three parameters, of which the first two would be integers (%d) and the last a string (%s). But you ended up supplying four values (s+1, p+1, len(sentence), phrase), which is inconsistent with your format specifier. Either drop the third value (len(sentence)), like
print('sentence #%d:, phrase #%d: %s' %(s+1,p+1,phrase))
or add one more format specifier to the print statement:
print('sentence #%d:, phrase #%d:, length #%d, %s' %(s+1,p+1,len(sentence),phrase))
Assuming you want the latter, that leaves us with:
for s, sentence in enumerate(text.split('.')):
    for p, phrase in enumerate(sentence.split(',')):
        print('sentence #%d:, phrase #%d:, length #%d, %s' %(s+1,p+1,len(sentence),phrase))
>>> sen = [words[1] for words in enumerate(text.split(". "))]
>>> for each in sen: print(each.split(", "))
...
['what time of the day is it']
["i'm heading out to the ball park", 'with the kids', 'on this nice evening']
['are you willing to join me', 'on the walk to the park', 'tonight.']
It's up to you to transform this unassigned output to your liking.
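Pulling the fixes together, here is one way the whole exercise might look (a sketch; the strip() call and the empty-phrase check are my additions to tidy the output):
text = ("what time of the day is it. i'm heading out to the ball park, "
        "with the kids, on this nice evening. are you willing to join me, "
        "on the walk to the park, tonight.")

for s, sentence in enumerate(text.split('.')):
    for p, phrase in enumerate(sentence.split(',')):
        phrase = phrase.strip()
        if phrase:  # skip the empty string after the trailing '.'
            print('sentence #%d, phrase #%d: %s' % (s + 1, p + 1, phrase))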
