How to extract string that contains specific characters in Python - python

I'm trying to extract ONLY the one string that contains the $ character. The input is based on output that I extracted using BeautifulSoup.
Code
price = [m.split() for m in re.findall(r"\w+/$(?:\s+\w+/$)*", soup_content.find('blockquote', { "class": "postcontent restore" }).text)]
Input
For Sale is my Tag Heuer Carrera Calibre 6 with box and papers and extras.
39mm
47 ish lug to lug
19mm in between lugs
Pretty thin but not sure exact height. Likely around 12mm (maybe less)
I've owned it for about 2 years. I absolutely love the case on this watch. It fits my wrist and sits better than any other watch I've ever owned. I'm selling because I need cash and other pieces have more sentimental value
I am the second owner, but the first barely wore it.
It comes with barely worn blue leather strap, extra suede strap that matches just about perfectly and I'll include a blue Barton Band Elite Silicone.
I also purchased an OEM bracelet that I personally think takes the watch to a new level. This model never came with a bracelet and it was several hundred $ to purchase after the fact.
The watch was worn in rotation and never dropped or knocked around.
The watch does have hairlines, but they nearly all superficial. A bit of time with a cape cod cloth would take care of a lot it them. The pics show the imperfections in at "worst" possible angle to show the nature of scratches.
The bracelet has a few desk diving marks, but all in all, the watch and bracelet are in very good shape.
Asking $2000 obo. PayPal shipped. CONUS.
It's a big hard to compare with others for sale as this one includes the bracelet.
The output should be like this.
2000

You don't need a regex. Instead you can iterate over lines and over each word to check for starting with '$' and extract the word:
[word[1:] for line in s.split('\n') for word in line.split() if word.startswith('$') and len(word) > 1]
where s is your paragraph.
which outputs:
['2000']

Since this is very simple, you don't need a regex solution; this should suffice:
words = text.split()
words_with_dollar = [word for word in words if '$' in word]
print(words_with_dollar)
>>> ['$', '$2000']
If you don't want the dollar sign alone, you can add a filter like this:
words_with_dollar = [word for word in words if '$' in word and '$' != word]
print(words_with_dollar)
>>> ['$2000']

I would do something like this (provided input is the string you wrote above):
price_start = input.find('$')
price = input[price_start:].split(' ')[0]
This only works if the price is the first (or only) occurrence of '$' in the text, as you said.
Alternatively, you could use a regex like this:
price = re.findall(r'\S*\$\S*\d', input)[0]
price = price.replace('$', '')
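A minimal sketch of the regex route, assuming a price is a `$` directly followed by digits (optionally with thousands separators or a decimal part). The name `extract_prices` and the shortened sample text are made up for illustration. A bare `$`, as in "several hundred $", is skipped because no digit follows it.

```python
import re

# Assumption: a price is '$' directly followed by digits, with optional
# thousands separators and an optional decimal part.
PRICE_RE = re.compile(r"\$(\d+(?:,\d{3})*(?:\.\d+)?)")

def extract_prices(text):
    """Return every amount that directly follows a '$', without the sign."""
    return PRICE_RE.findall(text)

sample = (
    "it was several hundred $ to purchase after the fact.\n"
    "Asking $2000 obo. PayPal shipped. CONUS."
)
prices = extract_prices(sample)
print(prices)  # ['2000']
```

If the amount is needed as a number, `int(prices[0])` works for this sample; amounts with separators or cents would need extra handling.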

Related

Extract phrase count from text files based on a keyword

I have a set of text files with blurbs of text and I need to search these for a particular keyword such that a set of words before and/or after the keyword (i.e. phrases) are returned along with a count of the phrases across the files. For example, contents of a few of files are:
File 1: This is a great day. I wish I could go to a beautiful green park today but unfortunately, we are in a lockdown!
File 2: Having a beautiful green park close to your house is great.
File 3: I visited a green park today. My friend also visited a green park today.
So if I search for the keyword park, I'm looking for the output to be a set of phrases (let's say one word before & after park), ranked based on how many times the phrase occurs across files. So in this example, the output should be:
green park today: 2
green park close: 1
Is there a way I can achieve this in Python, maybe using some NLP libraries or even without them. I have some code in my post here but that doesn't solve the purpose (I'll perhaps delete that post once I get a response to this one).
Thank you
Based on your expected output above, it looks like you only want to add one to the count for a single phrase per file (even if it appears several times in the same file). Below is an example of how you can do this without any special NLP libraries, just defining "words" as chains of non-space characters delimited by spaces (I'm assuming you know how to read text from a file so leaving that part out).
from collections import Counter
str1 = "This is a great day. I wish I could go to a beautiful green park today but unfortunately, we are in a lockdown!"
str2 = "Having a beautiful green park close to your house is great."
str3 = "I visited a green park today. My friend also visited a green park today."
str1_words = ["START"] + str1.split(" ") + ["END"]
str2_words = ["START"] + str2.split(" ") + ["END"]
str3_words = ["START"] + str3.split(" ") + ["END"]
print(str1_words)
all_phrases = []
SEARCH_WORD = "park"
for words in [str1_words, str2_words, str3_words]:
    phrases = []
    for i in range(1, len(words) - 1):
        if words[i] == SEARCH_WORD:
            phrases.append(" ".join(words[i-1:i+2]))
    # Only count each phrase once for this text
    phrases = set(phrases)
    all_phrases.extend(phrases)
phrase_count = Counter(all_phrases)
print(phrase_count.most_common())
The output is:
[('green park today', 1), ('green park close', 1), ('green park today.', 1)]
This perfectly demonstrates the problem with the definition of a "word" above - punctuation is treated as part of the word. For a better way to do it, look into the NLTK library, specifically methods for "word tokenization".
Hopefully the above gives you an idea of how to get started on this.
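As a rough illustration of the punctuation issue, here is a sketch that swaps the space-based split for a simple regex tokenizer (`\w+`), so that "today." and "today" count as the same word. This is not NLTK tokenization, only a stand-in; the function name `phrase_counts` and the `window` parameter are made up for this example. Unlike the code above, it does not pad with START/END, so a phrase at the edge of a text is simply shorter.

```python
import re
from collections import Counter

def phrase_counts(texts, keyword, window=1):
    """Count keyword phrases across texts, at most once per text."""
    counts = Counter()
    for text in texts:
        # \w+ keeps runs of letters, digits and underscores, dropping punctuation.
        words = re.findall(r"\w+", text)
        phrases = set()
        for i, word in enumerate(words):
            if word == keyword:
                start = max(i - window, 0)
                phrases.add(" ".join(words[start:i + window + 1]))
        # A set was used, so each phrase adds at most one per text.
        counts.update(phrases)
    return counts

texts = [
    "This is a great day. I wish I could go to a beautiful green park today "
    "but unfortunately, we are in a lockdown!",
    "Having a beautiful green park close to your house is great.",
    "I visited a green park today. My friend also visited a green park today.",
]
ranked = phrase_counts(texts, "park").most_common()
print(ranked)  # [('green park today', 2), ('green park close', 1)]
```

Note that `\w+` also splits contractions such as "don't" into two tokens, which is one reason a proper tokenizer is worth looking into.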

How to count the number of words in the list, provided there is more than one?

For example I have a text,
text = '''
Wales’ greatest moment. Lille is so close to the Belgian border, this was
essentially a home game for one of the tournament favourites. Their confident supporters mingled
with their new Welsh fans on the streets, buying into the carnival spirit - perhaps
more relaxed than some might have been before a quarter-final because they
thought this was their time.
In the driving rain, Wales produced the best performance in their history to carry
the nation into uncharted territory. Nobody could quite believe it.
'''
I need to count how many times a word occurs in this text; the word is entered with input().
The data must be stored in a list, dict or set; this is a required condition.
It is also not clear how to ignore punctuation marks.
My solution, but perhaps there is a cleaner way.
text = list(text.split(' '))
word = input('Enter a word: ')
for i in text:
    if text.count(word) < 2:
        break
    if word in text:
        print(f'{word} - {text.count(word)}')
        break
Output:
this - 2
the - 7
The word 'moment' occurs only once in the text, so we do not print it.
You can think of this as two steps:
Clean the input
Find the count
A fast way to clean the input is to strip all the punctuation first using translate combined with string.punctuation:
import string
clean = text.translate(str.maketrans('', '', string.punctuation)).split()
Now you have all the text with no punctuation and can split it into words and count:
import string
clean = text.translate(str.maketrans('', '', string.punctuation)).split()
word = "this"
count = clean.count(word)
if count > 1:
    print(f'{word} - {count}')
    # prints: this - 2
Since you are using count you don't need to loop. Just be careful not to call count multiple times if you don't need to. Each time you do, it needs to look through the whole list. Notice above the code calls it once and saves it so we can use the count in multiple places.
You can use collections.Counter() to get a dictionary of the number of occurrences of each element in a list:
import collections
text = '''
Wales’ greatest moment. Lille is so close to the Belgian border, this was
essentially a home game for one of the tournament favourites. Their confident supporters mingled
with their new Welsh fans on the streets, buying into the carnival spirit - perhaps
more relaxed than some might have been before a quarter-final because they
thought this was their time.
In the driving rain, Wales produced the best performance in their history to carry
the nation into uncharted territory. Nobody could quite believe it.
'''
word = input('Enter a word: ')
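# Remove punctuation from text, keeping letters, spaces and newlines
# (dropping newlines would glue together the words around each line break)
for char in text:
    if char.lower() not in "abcdefghijklmnopqrstuvwxyz \n":
        text = text.replace(char, "")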
wordcount = collections.Counter(text.split())
print(f"{word} - {wordcount[word]}")
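To cover the "more than one occurrence" condition from the question, the two ideas above can be combined. This is only a sketch: the function name `repeated_word_count` and the short sample sentence are invented for illustration, and matching is case-sensitive. Since `string.punctuation` only covers ASCII characters, the typographic apostrophe (as in "Wales’") is added by hand.

```python
import string
from collections import Counter

def repeated_word_count(text, word):
    """Return how often `word` occurs, or None if it occurs fewer than twice."""
    # Delete ASCII punctuation plus the typographic apostrophe.
    table = str.maketrans("", "", string.punctuation + "’")
    counts = Counter(text.translate(table).split())
    count = counts[word]  # a Counter returns 0 for missing words
    return count if count > 1 else None

sample = "In the rain, the team played on. Nobody left the ground; this was it."
for word in ("the", "this"):
    count = repeated_word_count(sample, word)
    if count is not None:
        print(f"{word} - {count}")
# prints: the - 3
```

If many words are looked up in the same text, building the `Counter` once outside the function would avoid recounting the text for every query.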

Check how many words from a given list occur in list of text/strings

I have a list of text data which contains reviews, something like this:
1. 'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.'
2. 'Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".',
3. 'This is a confection that has been around a few centuries. It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar. And it is a tiny mouthful of heaven. Not too chewy, and very flavorful. I highly recommend this yummy treat. If you are familiar with the story of C.S. Lewis\' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.
I have a separate list of words, and I want to know which of them exist in these reviews:
['food','science','good','buy','feedback'....]
I want to know which of these words are present in each review, and select the reviews that contain a certain number of them. For example, let's say I only select reviews that contain at least 3 of the words from this list: I want to display all those reviews, and also show which of the words were encountered in each review.
I have the code for selecting reviews containing at least 3 of the words, but how do I get the second part, which tells me exactly which words were encountered? Here is my initial code:
keywords = list(words)
text = list(df.summary.values)
sentences=[]
for element in text:
    if len(set(keywords)&set(element.split(' '))) >=3:
        sentences.append(element)
To answer the second part, allow me to revisit how to approach the first part. A handy approach here is to cast your review strings into sets of word strings.
Like this:
review_1 = "I have bought several of the Vitality canned dog food products and"
review_1 = set(review_1.split(" "))
Now the review_1 set contains one of every word. Then take your list of words, convert it to a set, and do an intersection.
words = ['food','science','good','buy','feedback'....]
words = set(['food','science','good','buy','feedback'....])
matches = review_1.intersection(words)
The resulting set, matches, contains all the words that are common. The length of this is the number of matches.
Now, this does not work if you cared about how many of each word matches. For example, if the word "food" is found twice in the review and "science" is found once, does that count as matching three words?
If so, let me know via comment and I can write some code to update the answer to include that scenario.
EDIT: Updating to include comment question
If you want to keep a count of how many times each word repeats, then hang onto the review list. Only cast it to set when performing the intersection. Then, use the 'count' list method to count the number of times each match appears in the review. In the example below, I use a dictionary to store the results.
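# Keep the review as a list of words so that list.count() can be used later
review_1 = "I have bought several of the Vitality canned dog food products and".split(" ")
words = ['food','science','good','buy','feedback'....]
words = set(['food','science','good','buy','feedback'....])
# Cast the review to a set only for the intersection
matches = set(review_1).intersection(words)
match_counts = dict()
for match in matches:
    # Count occurrences in the review list, not in the keyword set
    match_counts[match] = review_1.count(match)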
You can use set intersection for finding the common words:
def filter_reviews(data, *, trigger_words = frozenset({'food', 'science', 'good', 'buy', 'feedback'})):
    for review in data:
        words = review.split() # use whatever method is appropriate to get the words
        common = trigger_words.intersection(words)
        if len(common) >= 3:
            yield review, common
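For reference, here is a self-contained sketch of the same set-intersection idea with a small usage example. The function name `matching_reviews`, the `min_matches` parameter and the two sample reviews are made up for illustration. It also lower-cases the text and strips ASCII punctuation before splitting, which is an addition the answers above do not make.

```python
import string

def matching_reviews(reviews, keywords, min_matches=3):
    """Yield (review, matched_keywords) for reviews with enough keyword hits."""
    keywords = frozenset(keywords)
    strip_punct = str.maketrans("", "", string.punctuation)
    for review in reviews:
        # Lower-case and drop punctuation so "price," still matches "price".
        words = review.lower().translate(strip_punct).split()
        common = keywords.intersection(words)
        if len(common) >= min_matches:
            yield review, common

reviews = [
    "Good food at a fair price, I would buy it again.",
    "The peanuts were small and unsalted.",
]
keywords = ["food", "science", "good", "buy", "feedback"]
results = list(matching_reviews(reviews, keywords))
for review, found in results:
    print(sorted(found), "->", review)
# prints: ['buy', 'food', 'good'] -> Good food at a fair price, I would buy it again.
```

Because the function yields pairs, each selected review comes with the exact set of keywords that triggered it.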

Iterate over a text and find the distance between predefined substrings

I decided I wanted to take a text and find how close some labels were in the text. Basically, the idea is to check if two persons are less than 14 words apart and if they are we say that they are related.
My naive implementation is working, but only if the person is a single word, because I iterate over words.
text = """At this moment Robert who rises at seven and works before
breakfast came in He glanced at his wife her cheek was
slightly flushed he patted it caressingly What s the
matter my dear he asked She objects to my doing nothing
and having red hair said I in an injured tone Oh of
course he can t help his hair admitted Rose It generally
crops out once in a generation said my brother So does the
nose Rudolf has got them both I must premise that I am going
perforce to rake up the very scandal which my dear Lady
Burlesdon wishes forgotten--in the year 1733 George II
sitting then on the throne peace reigning for the moment and
the King and the Prince of Wales being not yet at loggerheads
there came on a visit to the English Court a certain prince
who was afterwards known to history as Rudolf the Third of Ruritania"""
involved = ['Robert', 'Rose', 'Rudolf the Third',
'a Knight of the Garter', 'James', 'Lady Burlesdon']
# my naive implementation
ws = text.split()
l = len(ws)
for wi,w in enumerate(ws):
    # Skip if the word is not a person
    if w not in involved:
        continue
    # Check next x words for any involved person
    x = 14
    for i in range(wi+1,wi+x):
        # Avoid list index error
        if i >= l:
            break
        # Skip if the word is not a person
        if ws[i] not in involved:
            continue
        # Print related
        print(ws[wi],ws[i])
Now I would like to upgrade this script to allow for multi-word names such as 'Lady Burlesdon'. I am not entirely sure what is the best way to proceed. Any hints are welcome.
You could first preprocess your text so that all the names in the text are replaced with single-word ids. The ids would have to be strings that you would not expect to appear as other words in the text. As you preprocess the text, you could keep a mapping of ids to names to know which name corresponds to which id. This would allow you to keep your current algorithm as is.
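A possible sketch of that preprocessing step, under a few assumptions: the id format `__PERSON<n>__` is arbitrary (any string that cannot occur in the text would do), names are separated from neighbouring words by whitespace as in the sample text, and the helper names `replace_names` and `related_pairs` are invented here. Whitespace is collapsed first so that a name split across a line break, such as "Lady" / "Burlesdon", is still found.

```python
import re

def replace_names(text, names):
    """Replace every name with a one-word id; return (new_text, id_to_name)."""
    id_to_name = {f"__PERSON{i}__": name for i, name in enumerate(names)}
    name_to_id = {name: pid for pid, name in id_to_name.items()}
    # Collapse whitespace so a name split across a line break still matches.
    text = " ".join(text.split())
    # Try longer names first so they win over shorter, overlapping ones.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b")
    return pattern.sub(lambda m: name_to_id[m.group(0)], text), id_to_name

def related_pairs(text, names, max_distance=14):
    """Return name pairs that occur fewer than max_distance words apart."""
    new_text, id_to_name = replace_names(text, names)
    words = new_text.split()
    positions = [(i, w) for i, w in enumerate(words) if w in id_to_name]
    pairs = []
    for a, (i, first) in enumerate(positions):
        for j, second in positions[a + 1:]:
            if j - i >= max_distance:
                break  # later names are even further away
            pairs.append((id_to_name[first], id_to_name[second]))
    return pairs

sample = ("Robert came in and Rose spoke to Lady\n"
          "Burlesdon while Rudolf the Third of Ruritania waited")
names = ["Robert", "Rose", "Rudolf the Third", "Lady Burlesdon"]
pairs = related_pairs(sample, names, max_distance=5)
print(pairs)
```

With the small `max_distance=5` used here, only neighbouring names are paired; "Robert" and "Lady Burlesdon" are seven words apart and are therefore not reported. Note that each multi-word name counts as a single word once it has been replaced.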

Finding the surrounding sentence of a char/word in a string

I am trying to get sentences from a string that contain a given substring using python.
I have access to the string (an academic abstract) and a list of highlights with start and end indexes. For example:
{
    abstract: "...long abstract here...",
    highlights: [
        {
            concept: 'a word',
            start: 1,
            end: 10
        },
        {
            concept: 'cancer',
            start: 123,
            end: 135
        }
    ]
}
I am looping over each highlight, locating its start index in the abstract (the end doesn't really matter as I just need to get a location within a sentence), and then somehow need to identify the sentence that index occurs in.
I am able to tokenize the abstract into sentences using nltk.tokenize.sent_tokenize, but by doing that I render the index location useless.
How should I go about solving this problem? I suppose regexes are an option but the nltk tokenizer seems such a nice way of doing it that it would be a shame not to make use of it.. Or somehow reset the start index by finding the number of chars since the previous full stop/exclamation mark/question mark?
You are right, the NLTK tokenizer is really what you should be using in this situation since it is robust enough to handle delimiting mostly all sentences including ending a sentence with a "quotation." You can do something like this (paragraph from a random generator):
Start with,
from nltk.tokenize import sent_tokenize
paragraph = "How does chickens harden over the acceptance? Chickens comprises coffee. Chickens crushes a popular vet next to the eater. Will chickens sweep beneath a project? Coffee funds chickens. Chickens abides against an ineffective drill."
highlights = ["vet","funds"]
sentencesWithHighlights = []
Most intuitive way:
for sentence in sent_tokenize(paragraph):
    for highlight in highlights:
        if highlight in sentence:
            sentencesWithHighlights.append(sentence)
            break
But using this method we actually have what is effectively a 3x nested for loop. This is because we first check each sentence, then each highlight, then each subsequence in the sentence for the highlight.
We can get better performance since we know the start index for each highlight:
highlightIndices = [100,169]
subtractFromIndex = 0
for sentence in sent_tokenize(paragraph):
    for index in highlightIndices:
        # Is this index inside the current sentence?
        if 0 <= index - subtractFromIndex < len(sentence):
            sentencesWithHighlights.append(sentence)
            break
    # Assumes sentences are separated by exactly one space
    subtractFromIndex += len(sentence) + 1
In either case we get:
sentencesWithHighlights = ['Chickens crushes a popular vet next to the eater.', 'Coffee funds chickens.']
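If the spacing between sentences is not known in advance, one alternative is to look up each tokenized sentence in the original text and record its real character span. The sketch below assumes the tokenizer returns sentences verbatim and in order; the helper names `sentence_spans` and `sentence_at` are made up, and a simple regex split stands in for `sent_tokenize` so that the example runs without NLTK installed.

```python
import re

def sentence_spans(text, sentences):
    """Return (start, end, sentence) for each sentence, located in order."""
    spans = []
    cursor = 0
    for sentence in sentences:
        start = text.find(sentence, cursor)
        if start == -1:
            # The tokenizer altered the sentence text; skipped in this sketch.
            continue
        end = start + len(sentence)
        spans.append((start, end, sentence))
        cursor = end
    return spans

def sentence_at(spans, index):
    """Return the sentence whose span contains the character index, else None."""
    for start, end, sentence in spans:
        if start <= index < end:
            return sentence
    return None

paragraph = "How does chickens harden over the acceptance? Chickens comprises coffee. Chickens crushes a popular vet next to the eater. Will chickens sweep beneath a project? Coffee funds chickens. Chickens abides against an ineffective drill."
# Stand-in for nltk.tokenize.sent_tokenize (assumption: same sentence strings).
sentences = re.split(r"(?<=[.!?])\s+", paragraph)
spans = sentence_spans(paragraph, sentences)
found = [sentence_at(spans, index) for index in (100, 169)]
print(found)
# ['Chickens crushes a popular vet next to the eater.', 'Coffee funds chickens.']
```

The spans are computed once, so each highlight lookup only needs to compare its start index against them.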
I assume that all your sentences end with one of these three characters: !?.
What about looping over the list of highlights, creating a regexp group:
(?:list|of|your highlights)
Then matching your whole abstract against this regexp:
/(?:[\.!\?]|^)\s*([^\.!\?]*(?:list|of|your highlights)[^\.!\?]*?)(?=\s*[\.!\?])/ig
This way you would get the sentence containing at least one of your highlights in the first subgroup of each match (RegExr).
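In Python, the pattern above could be assembled roughly as follows. This is a sketch rather than a drop-in solution: the function name `sentences_with_highlights` is invented, `re.escape` is added to guard against highlights that contain regex metacharacters, and the `g` flag is replaced by `findall`. The returned sentences do not include their closing punctuation, and a highlight also matches inside longer words.

```python
import re

def sentences_with_highlights(text, highlights):
    """Return sentences (without end punctuation) containing any highlight."""
    group = "|".join(re.escape(h) for h in highlights)
    pattern = re.compile(
        r"(?:[.!?]|^)\s*([^.!?]*(?:" + group + r")[^.!?]*?)(?=\s*[.!?])",
        re.IGNORECASE,
    )
    # With a single capturing group, findall returns that group's text.
    return pattern.findall(text)

paragraph = "How does chickens harden over the acceptance? Chickens comprises coffee. Chickens crushes a popular vet next to the eater. Will chickens sweep beneath a project? Coffee funds chickens. Chickens abides against an ineffective drill."
result = sentences_with_highlights(paragraph, ["vet", "funds"])
print(result)
# ['Chickens crushes a popular vet next to the eater', 'Coffee funds chickens']
```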
Another option (though it's tough to say how reliable it would be with variably defined text), would be to split the text into a list of sentences and test against them:
re.split(r'(?<=\?|!|\.)\s{0,2}(?=[A-Z]|$)', text)
