Difficulties in removing characters and white space to tokenize text via spaCy - python

I'm testing the spaCy library, but I'm having trouble cleaning up the sentences (i.e. removing special characters, punctuation, and patterns like [Verse], [Chorus], \n ...) before working with the library.
I have removed these elements to some extent; however, when I tokenize, I notice extra white-space tokens, as well as contractions being split into separate terms like "it" and "s" (it's).
Here is my code with some text examples:
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")  # any English model works here

text1 = "[Intro] Well, alright [Chorus] Well, it's 1969, okay? All across the USA It's another year for me and you"
text2 = "[Verse 1] For fifty years they've been married And they can't wait for their fifty-first to roll around"
text3 = "Passion that shouts And red with anger I lost myself Through alleys of mysteries I went up and down Like a demented train"

df = pd.DataFrame({'text': [text1, text2, text3]})

# Replace newlines, [bracketed] tags, and punctuation/digits with spaces
replacer = {'\n': ' ', "[\[].*?[\]]": " ", '[!"#%\'()*+,-./:;<=>?#\[\]^_`{|}~1234567890’”“′‘\\\]': " "}
df['cleanText'] = df['text'].replace(replacer, regex=True)
df.head()

df['new_col'] = df['cleanText'].apply(lambda x: nlp(x))
df
#Output:
result1 = " Well alright Well it s okay All across the USA It s another year for me and you"
result2 = " For fifty years they ve been married And they can t wait for their fifty first to roll around"
result3 = "Passion that shouts And red with anger I lost myself Through alleys of mysteries I went up and down Like a demented train"
When I try to tokenize, I get, for example: ( , Well, , alright, , Well, , it, s, ...)
I used the same logic to remove the characters before tokenizing via NLTK, and there it worked. Does anyone know what I might be doing wrong?

This regex pattern removes almost all of the extra white space: change the replacement strings from " " to "", and finally add ' +': ' ', like this:
replacer = {'\n': '', "[\[].*?[\]]": "", '[!"#%\'()*+,-./:;<=>?#\[\]^_`{|}~1234567890’”“′‘\\\]': "", ' +': ' '}
Then, after applying the regex pattern, call the strip() method to remove the white space at the beginning and end:
df['cleanText'] = df['cleanText'].apply(lambda x: x.strip())
Then define the column new_col using nlp() as before:
df['new_col'] = df['cleanText'].apply(lambda x: nlp(x))
>>> df
                                                text                                          cleanText                                            new_col
0  [Intro] Well, alright [Chorus] Well, it's 1969...  Well alright Well its okay All across the USA ...  (Well, alright, Well, its, okay, All, across, ...
1  [Verse 1] For fifty years they've been married...  For fifty years theyve been married And they c...  (For, fifty, years, they, ve, been, married, A...
2  Passion that shouts And red with anger I lost ...  Passion that shouts And red with anger I lost ...  (Passion, that, shouts, And, red, with, anger,...
[3 rows x 3 columns]
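Note that removing the apostrophes merges contractions into tokens like "its" and "theyve". If you would rather keep spaCy's native handling of contractions, an alternative sketch is to strip only the [bracketed] tags with a regex and filter unwanted tokens after tokenization (is_punct, is_space and like_num are standard spaCy token attributes):
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # any English model works here

def clean_tokens(text):
    # Strip only the [Verse]/[Chorus]-style tags, then let spaCy tokenize
    text = re.sub(r"\[.*?\]", " ", text)
    return [t.text for t in nlp(text)
            if not (t.is_punct or t.is_space or t.like_num)]

print(clean_tokens(text1))
# roughly: ['Well', 'alright', 'Well', 'it', "'s", 'okay', 'All', 'across', 'the', 'USA', 'It', "'s", ...]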

Related

Extract phrase count from text files based on a keyword

I have a set of text files with blurbs of text and I need to search these for a particular keyword such that a set of words before and/or after the keyword (i.e. phrases) are returned along with a count of the phrases across the files. For example, contents of a few of files are:
File 1: This is a great day. I wish I could go to a beautiful green park today but unfortunately, we are in a lockdown!
File 2: Having a beautiful green park close to your house is great.
File 3: I visited a green park today. My friend also visited a green park today.
So if I search for the keyword park, I'm looking for the output to be a set of phrases (let's say one word before & after park), ranked based on how many times the phrase occurs across files. So in this example, the output should be:
green park today: 2
green park close: 1
Is there a way I can achieve this in Python, maybe using some NLP libraries, or even without them? I have some code in my post here, but that doesn't serve the purpose (I'll perhaps delete that post once I get a response to this one).
Thank you
Based on your expected output above, it looks like you only want to add one to the count for a single phrase per file (even if it appears several times in the same file). Below is an example of how you can do this without any special NLP libraries, just defining "words" as chains of non-space characters delimited by spaces (I'm assuming you know how to read text from a file, so I'm leaving that part out).
from collections import Counter

str1 = "This is a great day. I wish I could go to a beautiful green park today but unfortunately, we are in a lockdown!"
str2 = "Having a beautiful green park close to your house is great."
str3 = "I visited a green park today. My friend also visited a green park today."

# Pad each word list so the search word can also be first or last
str1_words = ["START"] + str1.split(" ") + ["END"]
str2_words = ["START"] + str2.split(" ") + ["END"]
str3_words = ["START"] + str3.split(" ") + ["END"]
print(str1_words)

all_phrases = []
SEARCH_WORD = "park"
for words in [str1_words, str2_words, str3_words]:
    phrases = []
    for i in range(1, len(words) - 1):
        if words[i] == SEARCH_WORD:
            phrases.append(" ".join(words[i-1:i+2]))
    # Only count each phrase once for this text
    phrases = set(phrases)
    all_phrases.extend(phrases)

phrase_count = Counter(all_phrases)
print(phrase_count.most_common())
The output is:
[('green park today', 1), ('green park close', 1), ('green park today.', 1)]
This perfectly demonstrates the problem with the definition of a "word" above - punctuation is treated as part of the word. For a better way to do it, look into the NLTK library, specifically methods for "word tokenization".
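For instance, here is a minimal sketch using NLTK's word_tokenize (assuming nltk is installed and its punkt tokenizer data has been downloaded):
from collections import Counter
from nltk.tokenize import word_tokenize

SEARCH_WORD = "park"
all_phrases = []
for text in [str1, str2, str3]:
    # word_tokenize splits punctuation off, so "today." becomes "today", "."
    words = ["START"] + word_tokenize(text) + ["END"]
    phrases = {" ".join(words[i-1:i+2])
               for i in range(1, len(words) - 1)
               if words[i] == SEARCH_WORD}
    all_phrases.extend(phrases)

print(Counter(all_phrases).most_common())
# [('green park today', 2), ('green park close', 1)]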
Hopefully the above gives you an idea of how to get started on this.

Getting quantity and unit

I want to get the bold parts (the quantity and its unit) in the sentences below.
Examples:
SmellNice Coffee 450 gr
Clean 2 k Rice
LukaLuka 1,5lt cold drink
Jumbo 7 gutgut eggs 12'li
Espresso 5 Klasik 10 Ad
The expression below works well except for the last two.
\d+[.,]?\d*\s*[’']?\s*(gr|g|kg|k|adet|ad|lı|li|lu|lü|cc|cl|ml|lt|l|mm|cm|mt|m)
I added \s|$ to the end of the expression, thinking that if the unit is not the last word, then there should be a space after it. But it didn't work. Briefly, how can I capture all the bold expressions?
It works with brackets:
\d+[.,]?\d*\s*[’']?\s*(gr|g|kg|k|adet|ad|lı|li|lu|lü|cc|cl|ml|lt|l|mm|cm|mt|m)(\s+|$)
x2 = (
    r"\d+"       # digit(s)
    r"[,'\s]"    # space, comma, or apostrophe
    r"[\d*\s*]?" # optional digit or space
    r"((gr)|g|(kg)|k|(adet)|([Aa]d)|(lı)|(li)|(lu)|(lü)|(cc)|(cl)|(ml)|(lt)|l|(mm)|(cm)|(mt)|m)"  # all the units to look for
    r"(\s+|$)"   # it's gotta be followed by a space, or the end of the line
)
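To sanity-check, here is a small sketch that runs the compiled pattern over each of the sample strings from the question:
import re

pattern = re.compile(x2)
samples = [
    "SmellNice Coffee 450 gr",
    "Clean 2 k Rice",
    "LukaLuka 1,5lt cold drink",
    "Jumbo 7 gutgut eggs 12'li",
    "Espresso 5 Klasik 10 Ad",
]
for s in samples:
    m = pattern.search(s)
    # group(0) may include the trailing space captured by (\s+|$)
    print(m.group(0).strip() if m else "no match")
# 450 gr / 2 k / 1,5lt / 12'li / 10 Ad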

How to extract string that contains specific characters in Python

I'm trying to extract ONLY one string that contains the $ character. The input is based on output that I extracted using BeautifulSoup.
Code
price = [m.split() for m in re.findall(r"\w+/$(?:\s+\w+/$)*", soup_content.find('blockquote', { "class": "postcontent restore" }).text)]
Input
For Sale is my Tag Heuer Carrera Calibre 6 with box and papers and extras.
39mm
47 ish lug to lug
19mm in between lugs
Pretty thin but not sure exact height. Likely around 12mm (maybe less)
I've owned it for about 2 years. I absolutely love the case on this watch. It fits my wrist and sits better than any other watch I've ever owned. I'm selling because I need cash and other pieces have more sentimental value
I am the second owner, but the first barely wore it.
It comes with barely worn blue leather strap, extra suede strap that matches just about perfectly and I'll include a blue Barton Band Elite Silicone.
I also purchased an OEM bracelet that I personally think takes the watch to a new level. This model never came with a bracelet and it was several hundred $ to purchase after the fact.
The watch was worn in rotation and never dropped or knocked around.
The watch does have hairlines, but they're nearly all superficial. A bit of time with a cape cod cloth would take care of a lot of them. The pics show the imperfections at the "worst" possible angle to show the nature of the scratches.
The bracelet has a few desk diving marks, but all in all, the watch and bracelet are in very good shape.
Asking $2000 obo. PayPal shipped. CONUS.
It's a bit hard to compare with others for sale, as this one includes the bracelet.
The output should be like this.
2000
You don't need a regex. Instead, you can iterate over the lines and over each word, check whether it starts with '$', and extract the word:
[word[1:] for line in s.split('\n') for word in line.split() if word.startswith('$') and len(word) > 1]
where s is your paragraph.
which outputs:
['2000']
Since this is very simple, you don't need a regex solution; this should suffice:
words = text.split()
words_with_dollar = [word for word in words if '$' in word]
print(words_with_dollar)
>>> ['$', '$2000']
If you don't want the dollar sign alone, you can add a filter like this:
words_with_dollar = [word for word in words if '$' in word and '$' != word]
print(words_with_dollar)
>>> ['$2000']
I would do something like this (provided input is the string you wrote above; note that input shadows the Python built-in, so a different name is safer):
price_start = input.find('$')
price = input[price_start:].split(' ')[0]
That works IF there is only one occurrence, like you said.
Alternatively, you could use a regex like this:
price = re.findall(r'\S*\$\S*\d', input)[0]
price = price.replace('$', '')
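For example, a quick run on the asking-price line (text here stands in for the full post):
import re

text = "Asking $2000 obo. PayPal shipped. CONUS."
price = re.findall(r'\S*\$\S*\d', text)[0]  # '$2000'
price = price.replace('$', '')
print(price)  # 2000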

Splitting by particular punctuation

Let's assume that I want to remove a comma from a sentence, but in this particular way.
I ate pineapples, grapes -> I ate pineapples I ate grapes
we know python 2.0, 3.0 well -> we know python 2.0 well we know python 3.0 well
Basically, I want to keep the parts of the sentence where the comma didn't happen, repeating them for each comma-separated alternative. Is there an easy way to do it using the 're' library in Python?
You're basically splitting the string at the comma, keeping the first sentence, but repeating it with the last word of the first sentence replaced by the words after each comma.
s = "I ate pineapples, grapes"
s1 = "we know python 2.0, 3.0 well"

def my_split(string):
    sep = string.split(',')
    sentence = ' '.join(sep[0].split()[:-1])
    words = [sep[0].split()[-1], *sep[1:]]
    return ' '.join(f'{sentence} {w.strip()}' for w in words)

print(my_split(s))   # I ate pineapples I ate grapes
print(my_split(s1))  # we know python 2.0 we know python 3.0 well
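Since the question asks about the re library specifically, here is a rough equivalent using re.split (a sketch under the same assumption: each alternative replaces the last word before the first comma):
import re

def my_split_re(string):
    # Split on commas plus any surrounding whitespace
    parts = re.split(r'\s*,\s*', string)
    head, last = parts[0].rsplit(' ', 1)
    alternatives = [last, *parts[1:]]
    return ' '.join(f'{head} {alt}' for alt in alternatives)

print(my_split_re("I ate pineapples, grapes"))
# I ate pineapples I ate grapes
print(my_split_re("we know python 2.0, 3.0 well"))
# we know python 2.0 we know python 3.0 well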

How to split a string like this?

I would like to split a string like this:
"[41.298669629999999, -81.915329330000006] 6 2011-08-28 19:02:36 Work needs to fly by ... I'm so excited to see Spy Kids 4 with then love of my life"
into something like this:
[41.298669629999999, -81.915329330000006] [6 2011-08-28 19:02:36] [Work needs to fly by ... I'm so excited to see Spy Kids 4 with then love of my life]
Does anyone know how to do this? Thanks.
import re

r = re.compile(r"(\[.*\]) ([^A-Za-z]*) (.*)")
s = "[41.298669629999999, -81.915329330000006] 6 2011-08-28 19:02:36 Work needs to fly by ... I'm so excited to see Spy Kids 4 with then love of my life"
m = r.match(s)
if m:
    print(m.group(1) + " [" + m.group(2) + "] [" + m.group(3) + "]")
results in
[41.298669629999999, -81.915329330000006] [6 2011-08-28 19:02:36] [Work needs to fly by ... I'm so excited to see Spy Kids 4 with then love of my life]
The regular expression reads:
Find string between (and including) square brackets []
After a space, find anything which isn't A-Z or a-z
After a space, find everything until the end of the string
Once the three parts have been found, it prints them (with square brackets added around parts 2 and 3 above). The exact regular expression you'll need will depend on what the exact rules are for how you want to split the string up, which you haven't specified in your question. Hopefully this will help you get started.
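The same pattern can also be written with named groups, which some find easier to read (a stylistic variant, same behavior):
import re

r = re.compile(r"(?P<coords>\[.*\]) (?P<meta>[^A-Za-z]*) (?P<text>.*)")
m = r.match(s)
if m:
    print(f"{m.group('coords')} [{m.group('meta')}] [{m.group('text')}]")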
