How to split a string like this? - python

I would like to split a string like this:
"[41.298669629999999, -81.915329330000006] 6 2011-08-28 19:02:36 Work needs to fly by ... I'm so excited to see Spy Kids 4 with then love of my life"
into something like this:
[41.298669629999999, -81.915329330000006] [6 2011-08-28 19:02:36] [Work needs to fly by ... I'm so excited to see Spy Kids 4 with then love of my life]
Does anyone know how to do this? Thanks.

import re

r = re.compile(r"(\[.*\]) ([^A-Za-z]*) (.*)")
s = "[41.298669629999999, -81.915329330000006] 6 2011-08-28 19:02:36 Work needs to fly by ... I'm so excited to see Spy Kids 4 with then love of my life"
m = r.match(s)
if m:
    print(m.group(1) + " [" + m.group(2) + "] [" + m.group(3) + "]")
results in
[41.298669629999999, -81.915329330000006] [6 2011-08-28 19:02:36] [Work needs to fly by ... I'm so excited to see Spy Kids 4 with then love of my life]
The regular expression reads:
Find the string between (and including) square brackets []
After a space, find anything which isn't A-Z or a-z
After a space, find everything until the end of the string
Once the three parts have been found, it prints them (with square brackets added around parts 2 and 3). The exact regular expression you'll need will depend on the exact rules for how you want to split the string up, which you haven't specified in your question. Hopefully this will help you get started.
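If you find numbered groups hard to follow, the same pattern can be written with named groups. This is just a readability sketch of the regex above; the group names (coords, meta, text) are ones I chose, not part of the original answer.

```python
import re

# Same pattern as above, with named groups instead of positional ones.
pattern = re.compile(r"(?P<coords>\[.*\]) (?P<meta>[^A-Za-z]*) (?P<text>.*)")
s = ("[41.298669629999999, -81.915329330000006] 6 2011-08-28 19:02:36 "
     "Work needs to fly by ... I'm so excited to see Spy Kids 4 with then love of my life")
m = pattern.match(s)
if m:
    print(m.group('coords'))  # the bracketed coordinates
    print(m.group('meta'))    # the number and timestamp
    print(m.group('text'))    # the rest of the message
```

Named groups make the print line self-documenting, which helps once the pattern grows beyond two or three groups.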

Related

Difficulties in removing characters and white space to tokenize text via Spacy

I'm testing the spaCy library, but I'm having trouble cleaning up the sentences (i.e. removing special characters, punctuation, and patterns like [Verse], [Chorus], \n ...) before working with the library.
I have removed these elements to some extent; however, when I perform the tokenization, I notice that there is extra white space, in addition to the separation of terms like "it" and "s" (it's).
Here is my code with some text examples:
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")  # assuming an English pipeline is installed

text1 = "[Intro] Well, alright [Chorus] Well, it's 1969, okay? All across the USA It's another year for me and you"
text2 = "[Verse 1] For fifty years they've been married And they can't wait for their fifty-first to roll around"
text3 = "Passion that shouts And red with anger I lost myself Through alleys of mysteries I went up and down Like a demented train"

df = pd.DataFrame({'text': [text1, text2, text3]})
replacer = {'\n': ' ', r"\[.*?\]": " ", r'[!"#%\'()*+,-./:;<=>?#\[\]^_`{|}~1234567890’”“′‘\\]': " "}
df['cleanText'] = df['text'].replace(replacer, regex=True)
df.head()
df['new_col'] = df['cleanText'].apply(lambda x: nlp(x))
df
#Output:
result1 = " Well alright Well it s okay All across the USA It s another year for me and you"
result2 = " For fifty years they ve been married And they can t wait for their fifty first to roll around"
result3 = "Passion that shouts And red with anger I lost myself Through alleys of mysteries I went up and down Like a demented train"
When I try to tokenize, I get, for example: ( , Well, , alright, , Well, , it, s, ...)
I used the same logic to remove the characters before tokenizing via nltk, and there it worked. Does anyone know what I might be doing wrong?
This regex pattern removes almost all of the extra white space: change the " " replacement values to "" and finally add ' +': ' ', like this
replacer = {'\n': '', r"\[.*?\]": '', r'[!"#%\'()*+,-./:;<=>?#\[\]^_`{|}~1234567890’”“′‘\\]': '', ' +': ' '}
Then, after applying the regex pattern, call the strip() method to remove white space at the beginning and end.
df['cleanText'] = df['cleanText'].apply(lambda x: x.strip())
and when you define the column new_col using nlp():
df['new_col'] = df['cleanText'].apply(lambda x: nlp(x))
>>> df
                                                text                                          cleanText                                            new_col
0  [Intro] Well, alright [Chorus] Well, it's 1969...  Well alright Well its okay All across the USA ...  (Well, alright, Well, its, okay, All, across, ...
1  [Verse 1] For fifty years they've been married...  For fifty years theyve been married And they c...  (For, fifty, years, they, ve, been, married, A...
2  Passion that shouts And red with anger I lost ...  Passion that shouts And red with anger I lost ...  (Passion, that, shouts, And, red, with, anger,...
[3 rows x 3 columns]
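To see the cleaning in isolation (no pandas or spaCy required), the same replacer patterns can be sketched as a plain function over a single string. The clean name and the trimmed-down punctuation class here are mine, not from the answer:

```python
import re

def clean(text):
    # Remove bracketed tags like [Verse] or [Chorus].
    text = re.sub(r"\[.*?\]", "", text)
    # Remove punctuation and digits (a trimmed version of the class above).
    text = re.sub(r"[!\"#%'()*+,\-./:;<=>?\[\]^_`{|}~0-9]", "", text)
    # Collapse runs of spaces, then trim the ends.
    return re.sub(r" +", " ", text).strip()

print(clean("[Intro] Well, alright [Chorus] Well, it's 1969, okay?"))
# Well alright Well its okay
```

Note that stripping the apostrophe is what turns "it's" into "its" before spaCy ever sees the text; if you want spaCy to split "it's" into "it" and "'s" properly, leave the apostrophes in place and let the tokenizer handle them.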

Is there a shorter way to replace words in a string? [duplicate]

This question already has answers here:
How to replace multiple substrings of a string?
(28 answers)
Closed 3 years ago.
This is my task
journey = """Just a small tone girl
Leaving in a lonely whirl
She took the midnight tray going anywhere
Just a seedy boy
Bored and raised in South Detroit or something
He took the midnight tray going anywhere"""
Gross. Okay, so for this exercise, your job is to use Python's string replace method to fix this string up and print the new version out to the console.
This is what I did
journey = """ just a small tone girl
Leaving in a lonely whirl
she took a midnight tray going anywhere
Just a seedy boy
bored and raised in south detroit or something
He took the midnight tray going anywhere"""
journeyEdit = (journey.replace("tone", "town")
                      .replace("tray", "train")
                      .replace("seedy", "city")
                      .replace("Leaving", "living")
                      .replace("bored", "born")
                      .replace("whirl", "world")
                      .replace("or something", " "))
print(journeyEdit)
Here is a sample way to replace words in text using Python's re package. Please find the code below for guidance.
import re

journey = """ just a small tone girl Leaving in a lonely whirl she took a
midnight tray going anywhere Just a seedy boy bored and raised in south
detroit or something He took the midnight tray going anywhere"""

# define desired replacements here
journeydict = {"tone": "town",
               "tray": "train",
               "seedy": "city",
               "Leaving": "living",
               "bored": "born",
               "whirl": "world"}

# use these three lines to do the replacement
# (Python 3 renamed dict.iteritems to dict.items, so use .items() on recent versions)
rep = dict((re.escape(k), v) for k, v in journeydict.items())
pattern = re.compile("|".join(rep.keys()))
text = pattern.sub(lambda m: rep[re.escape(m.group(0))], journey)

print(journey)
print(text)
Probably a longer way than the one you gave ;-).
As given at How to replace multiple substrings of a string?:
import re

journey = """ just a small tone girl Leaving in a lonely whirl she took a
midnight tray going anywhere Just a seedy boy bored and raised in south
detroit or something He took the midnight tray going anywhere"""

rep = {"tone": "town",
       "tray": "train",
       "seedy": "city",
       "Leaving": "living",
       "bored": "born",
       "whirl": "world",
       "or something": " "}

# use these three lines to do the replacement
# (Python 3 renamed dict.iteritems to dict.items, so use rep.items() on Python 3)
rep = dict((re.escape(k), v) for k, v in rep.items())
pattern = re.compile("|".join(rep.keys()))
journeyEdit = pattern.sub(lambda m: rep[re.escape(m.group(0))], journey)

print(journeyEdit)
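One caveat with the alternation approach above: without word boundaries, a key that appears inside another word gets replaced too. A minimal sketch that wraps the pattern in \b anchors (the multi_replace helper name is mine, not from the answers):

```python
import re

def multi_replace(text, rep):
    # Escape each key, join with |, and wrap the alternation in \b...\b
    # so only whole words match (e.g. "tone" won't hit "stone").
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, rep)) + r")\b")
    # The matched text equals the original (unescaped) key, so it can be
    # used directly as the dict lookup.
    return pattern.sub(lambda m: rep[m.group(0)], text)

print(multi_replace("Just a small tone girl", {"tone": "town"}))
# Just a small town girl
```

For multi-word keys like "or something" the boundary trick still works, since \b only cares about the characters at the very edges of the match.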

Splitting by particular punctuation

Let's assume that I want to remove a comma from a sentence, but in this particular way.
I ate pineapples, grapes -> I ate pineapples I ate grapes
we know python 2.0, 3.0 well -> we know python 2.0 well we know python 3.0 well
Basically, I want to keep everything from where the comma didn't occur. Is there an easy way to do it using the 're' library in Python?
You're basically splitting the string by a comma, keeping the first sentence, then repeating it while replacing the last word of the first sentence with the words after the comma.
s = "I ate pineapples, grapes"
s1 = "we know python 2.0, 3.0 well"

def my_split(string):
    sep = string.split(',')
    sentence = ' '.join(sep[0].split()[:-1])
    words = [sep[0].split()[-1], *sep[1:]]
    return ' '.join(f'{sentence} {w.strip()}' for w in words)

print(my_split(s))   # I ate pineapples I ate grapes
print(my_split(s1))  # we know python 2.0 we know python 3.0 well
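Since the question asks about the re library specifically, the same idea can be sketched with re.split, so commas and any white space around them are handled in one place. The expand_commas name is mine, not from the answer above:

```python
import re

def expand_commas(string):
    # Split on commas, folding in the surrounding white space.
    parts = re.split(r'\s*,\s*', string)
    # Everything before the last word of the first part is the shared prefix.
    first_words = parts[0].split()
    prefix = ' '.join(first_words[:-1])
    items = [first_words[-1]] + parts[1:]
    return ' '.join(f'{prefix} {item}' for item in items)

print(expand_commas("we know python 2.0, 3.0 well"))
# we know python 2.0 we know python 3.0 well
```

Using re.split here mainly saves the per-item strip() calls; the repetition logic is the same as in the answer above.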

Iterate and replace words in lines of a tuple python

I want to iterate through this tuple and for each line, iterate through the words to find and replace some words (internet addresses, precisely) using regex while leaving them as lines.
aList = [
"being broken changes people, \nand rn im missing the old me",
"#SaifAlmazroui #troyboy621 #petr_hruby you're all missing the point",
"#News #Detroit Detroit water customer receives shutoff threat over missing 10 cents: - Theresa Braxton is a l... T.CO/CHPBRVH9WKk",
"#_EdenRodwell \ud83d\ude29\ud83d\ude29ahh I love you!! Missing u, McDonald's car park goss soon please \u2764\ufe0f\u2764\ufe0fxxxxx",
"This was my ring tone, before I decided change was good and missing a call was insignificant T.CO?BUXLVZFDWQ",
"want to go on holiday again, missing the sun\ud83d\ude29\u2600\ufe0f"
]
This code below almost does that, but it breaks the list into words separated by lines:
i = 0
while i < len(aList):
    for line in aList[i].split():
        line = re.sub(r"^[http](.*)\/(.*)$", "", line)
        print(line)
    i += 1
I'd love to have these results, with the internet addresses removed from each line:
[
"being broken changes people, \nand rn im missing the old me",
"#SaifAlmazroui #troyboy621 #petr_hruby you're all missing the point",
"#News #Detroit Detroit water customer receives shutoff threat over missing 10 cents: - Theresa Braxton is a ",
"#_EdenRodwell \ud83d\ude29\ud83d\ude29ahh I love you!! Missing u, McDonald's car park goss soon please \u2764\ufe0f\u2764\ufe0fxxxxx",
"This was my ring tone, before I decided change was good and missing a call was insignificant",
"want to go on holiday again, missing the sun\ud83d\ude29\u2600\ufe0f"
]
Thanks
From this:
re.sub(r"^[http](.*)\/(.*)$", "", line)
it looks to me as if you expect that all your URLs will be at the end of the line. In that case, try:
[re.sub('http://.*', '', s) for s in aList]
Here, http:// matches anything that starts with http://. .* matches everything that follows.
Example
Here is your list with some URLs added:
aList = [
"being broken changes people, \nand rn im missing the old me",
"#SaifAlmazroui #troyboy621 #petr_hruby you're all missing the point",
"#News #Detroit Detroit water customer receives shutoff threat over missing 10 cents: - Theresa Braxton is a http://example.com/CHPBRVH9WKk",
"#_EdenRodwell ahh I love you!! Missing u, McDonald's car park goss soon please xxxxx",
"This was my ring tone, before I decided change was good and missing a call was insignificant http://example.com?BUXLVZFDWQ",
"want to go on holiday again, missing the sun"
]
Here is the result:
>>> [re.sub('http://.*', '', s) for s in aList]
['being broken changes people, \nand rn im missing the old me',
"#SaifAlmazroui #troyboy621 #petr_hruby you're all missing the point",
'#News #Detroit Detroit water customer receives shutoff threat over missing 10 cents: - Theresa Braxton is a ',
"#_EdenRodwell ahh I love you!! Missing u, McDonald's car park goss soon please xxxxx",
'This was my ring tone, before I decided change was good and missing a call was insignificant ',
'want to go on holiday again, missing the sun']
Your question is a little unclear, but I think I get what you're going for:
newlist = [re.sub(r"{regex}", "", line) for line in alist]
This should iterate through a list of strings and replace any match of your regex pattern with an empty string, using a Python list comprehension.
Side note: looking closer at your regex, it's not doing what you think it's doing. "[http]" is a character class that matches a single "h", "t", or "p", not the literal string "http". I would look at this Stack Overflow post about matching URLs with regex:
Regex to find urls in string in Python
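Building on that side note, here is a hedged sketch of the list-comprehension approach with a pattern that actually matches URLs: https?://\S+ matches "http" or "https" followed by any run of non-whitespace, so addresses are removed wherever they appear in a line, not only at the end (the strip_urls helper name is mine):

```python
import re

URL = re.compile(r'https?://\S+')

def strip_urls(lines):
    # Remove every URL, then trim any white space left dangling at the end.
    return [URL.sub('', line).rstrip() for line in lines]

print(strip_urls(["visit http://example.com/page", "no url here"]))
```

A \S+ pattern is deliberately greedy and crude; it will also eat trailing punctuation stuck to a URL. For anything stricter, the linked post on matching URLs is the better starting point.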

How to return a word in a string if it starts with a certain character? (Python)

I'm building a reddit bot for practice that converts US dollars into other commonly used currencies, and I've managed to get the conversion part working fine, but now I'm a bit stuck trying to pass the characters that directly follow a dollar sign to the converter.
This is sort of how I want it to work:
def run_bot():
    subreddit = r.get_subreddit("randomsubreddit")
    comments = subreddit.get_comments(limit=25)
    for comment in comments:
        comment_text = comment.body
        # If comment contains a string that starts with '$',
        # pass the rest of the 'word' to a variable
So for example, if it were going over a comment like this:
"I bought a boat for $5000 and it's awesome"
It would assign '5000' to a variable that I would then put through my converter
What would be the best way to do this?
(Hopefully that's enough information to go off, but if people are confused I'll add more)
You could use the re.findall function.
>>> import re
>>> re.findall(r'\$(\d+)', "I bought a boat for $5000 and it's awesome")
['5000']
>>> re.findall(r'\$(\d+(?:\.\d+)?)', "I bought two boats for $5000 $5000.45")
['5000', '5000.45']
OR
>>> s = "I bought a boat for $5000 and it's awesome"
>>> [i[1:] for i in s.split() if i.startswith('$')]
['5000']
If you're dealing with prices as float numbers, you can use this:
import re

s = "I bought a boat for $5000 and it's awesome"
matches = re.findall(r"\$(\d*\.\d+|\d+)", s)
print(matches)  # ['5000']

s2 = "I bought a boat for $5000.52 and it's awesome"
matches = re.findall(r"\$(\d*\.\d+|\d+)", s2)
print(matches)  # ['5000.52']
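For the bot, the captured strings still need to become numbers before conversion. A small sketch (dollar_amounts is a name I made up) that feeds the same pattern through float():

```python
import re

def dollar_amounts(text):
    # Capture each $-prefixed amount and convert it to a float so it can
    # go straight into the currency converter.
    return [float(m) for m in re.findall(r"\$(\d*\.\d+|\d+)", text)]

print(dollar_amounts("I bought a boat for $5000.52 and a paddle for $20"))
# [5000.52, 20.0]
```

Since the regex only ever captures digits with at most one dot, the float() call can't raise here; amounts with thousands separators like $5,000 would need a different pattern.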
