Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
I have a string that looks like:
str_in = """Lemons: J2020, M2021. Eat by 9/03/28
Strawberries: N2023, O2024. Buy by 10/10/20"""
How do I get just "J2020, M2021, N2023, O2024"?
What I have so far is very hardcoded. It is:
str_in = str_in.replace("Lemons:", "")
str_in = str_in.replace("Strawberries:", "")
str_in = str_in.replace("Buy by", "")
I don't know how to get rid of the date if the date changes from the number specified. Is there a RegEx form I could use?
Based on your original post and your follow-up comments, you can explicitly fetch the strings you want to keep by using this regex: \b[A-Z]+\d+\b. It matches 1 or more uppercase letters followed by 1 or more digits, bounded as a single word. To test it and other regexes in the future, use this great online tool.
The findall() function in the re module is best used here because it returns all non-overlapping matches of the pattern as a list. For more on findall() and other kinds of matching methods, check out this tutorial.
Putting all that together, the code would be:
values = re.findall(r'\b[A-Z]+\d+\b', str_in)
Be sure to import re first.
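Put together as a self-contained script, this might look like the sketch below. The two-line str_in is written with an explicit \n, which is an assumption about how the original string is stored, and the name joined is only illustrative.

```python
import re

# Sample input from the question, written as one string with an explicit newline
str_in = ("Lemons: J2020, M2021. Eat by 9/03/28\n"
          "Strawberries: N2023, O2024. Buy by 10/10/20")

# One or more uppercase letters followed by one or more digits, as a whole word
values = re.findall(r'\b[A-Z]+\d+\b', str_in)

# Join the matches into the single string the question asks for
joined = ", ".join(values)
print(joined)
```

The dates are skipped because they contain no uppercase letter, and words such as "Lemons" are skipped because they contain no digit.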
I just saw your edited question, so, here's my edited answer
import re
re_pattern = re.compile(r'(\w+),\s(\w+)\.')
data = [
    'Lemons: J2020, M2021. Eat by 9/03/28',
    'Strawberries: N2023, O2024. Buy by 10/10/20',
    'Peaches: N12345, O123456. Buy by 10/10/20',
]
for line in data:
    match = re_pattern.search(line)
    if match:
        print(match.group(1), match.group(2))
import re
string = "Lemons: J2020, M2021. Eat by 9/03/28 Strawberries: N2023, O2024. Buy by 10/10/20"
array = re.findall(r"\b[A-Z]\d{4}\b", string)
result = ', '.join(array)
The result string is "J2020, M2021, N2023, O2024"
The array is ['J2020', 'M2021', 'N2023', 'O2024']
The regex matches 1 or 2 uppercase letters at the beginning of the required text, followed by one or more digits. I think the OP has enough information to test it on this basis.
import re
str_in = "Lemons: J2020, M2021. Eat by 9/03/28 \
Strawberries: N2023, O2024. Buy by 10/10/20"
result = re.findall(r'([A-Z]{1,2}\d+)', str_in)
print(result)
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
How do I extract [Andrey] and [21] from info?
info = "my name is [Andrey] and I am [21] years old"
result = ["[Andrey]", "[21]"]
I am sure other ways would be better. But I tried this and it worked.
If you want to extract the characters inside [] without knowing their position, you can use this method:
Run a for loop through the string.
When you find the character [, append all the following characters to a string until you find ].
Add each of these strings to a list to collect the results. Here is the code:
info = "my name is [Andrey] and I am [21] years old"
s = []   # list to collect the extracted strings
s1 = ""  # text of the bracket group currently being read
for i in range(len(info)):
    if info[i] == "[":
        j = i + 1
        # copy characters until the closing bracket
        while info[j] != "]":
            s1 += info[j]
            j += 1
        s.append(s1)
        s1 = ""  # make s1 empty to search for another string inside []
print(s)
Output will be:
['Andrey', '21']
You could use a regex for this.
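For the regex route, a minimal sketch could look like this. The variable names with_brackets and without_brackets are illustrative, and the pattern assumes the brackets are never nested.

```python
import re

info = "my name is [Andrey] and I am [21] years old"

# "[" followed by any run of characters that are not "]", then "]"
with_brackets = re.findall(r"\[[^\]]*\]", info)

# Same pattern with a capture group, which returns only the text inside the brackets
without_brackets = re.findall(r"\[([^\]]*)\]", info)

print(with_brackets)
print(without_brackets)
```

Unlike index-based approaches, this does not depend on where the bracketed parts sit in the sentence.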
Or, for this specific case, split the string into a list and pick the items by index with a list comprehension:
>>> lst = info.split()
>>> print([lst[index] for index in [3, 7]])
['[Andrey]', '[21]']
Another way is to convert the string to a list first and then select the items by index with the help of itemgetter:
>>> info = "my name is [Andrey] and I am [21] years old"
>>> lst = info.split()
>>> lst
['my', 'name', 'is', '[Andrey]', 'and', 'I', 'am', '[21]', 'years', 'old']
>>> from operator import itemgetter
>>> print(itemgetter(3,7)(lst))
('[Andrey]', '[21]')
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 4 years ago.
I've been looking into developing a discord bot that can reply to messages by reading their contents and checking if they appear in a list.
My problem is, I need to find a reliable way of getting python to look for certain words from a text, see if they appear in the given list and output the words that are detected.
I've managed to get it working somewhat myself with the following code:
if any(word in text for word in word_list):
    print("Word Spotted")
I would really appreciate some help.
Here's some code that does something like what you're describing. But really it sounds like you need to spend a significant amount of time working through some basic Python tutorials before you will be able to implement this.
import re
key_words = set(['foo', 'bar', 'baz'])
typed_str = 'You are such a Foo BAR!'
print(key_words & set(re.findall('[a-z]+', typed_str.lower())))
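If the detected words should come back in the order they appear in the message, a list-based variant is one option. This is a minimal sketch; find_keywords is a hypothetical helper name, not something from the answer above.

```python
import re

def find_keywords(text, keywords):
    """Return the keywords found in text, in order of appearance (case-insensitive)."""
    wanted = {k.lower() for k in keywords}
    # Split the lowercased text into alphabetic words and keep only the wanted ones
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w in wanted]

hits = find_keywords('You are such a Foo BAR!', ['foo', 'bar', 'baz'])
print(hits)
```

Matching whole words avoids false positives from substring tests, where for example "bar" would also be found inside "barn".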
I'm not sure exactly what is being asked, but here are some things to consider (in no particular order) if you are building a bot that takes in raw user input:
capitalization sensitivity
spell check
understanding intent simplistically
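The first two points can be roughed out with the standard library alone. The sketch below lowercases the input and uses difflib.get_close_matches for fuzzy matching, which is a stand-in for real spell checking. The name detect_keywords and the 0.8 cutoff are assumptions chosen for illustration.

```python
import difflib
import re

def detect_keywords(text, keywords, cutoff=0.8):
    """Return keywords that appear in text, tolerating case differences and small typos."""
    keywords = [k.lower() for k in keywords]
    found = []
    for word in re.findall(r"[a-z]+", text.lower()):
        # Closest keyword to this word, if its similarity is at least `cutoff`
        match = difflib.get_close_matches(word, keywords, n=1, cutoff=cutoff)
        if match and match[0] not in found:
            found.append(match[0])
    return found

detected = detect_keywords("Hi, i want to talk about codee and pYtHon", ["python", "code"])
print(detected)
```

A lower cutoff tolerates more typos but also raises the chance of unrelated words being mistaken for keywords.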
If your environment allows access to libraries you might consider checking out TextBlob. The following commands will give you the functionality needed for the example below.
pip install textblob
python -m textblob.download_corpora
core function
from textblob import TextBlob, Word
import copy
def score_intent(rawstring, keywords, weights=None, threshold=0.01, debug=False):
    """
    rawstring: string of text with words that you want to detect
    keywords: list of words that you are looking for
    weights: (optional) dictionary with relative weights of words you want
    threshold: spellcheck confidence threshold
    debug: boolean for extra print statements to help debug
    """
    allwords = TextBlob(rawstring).words
    allwords = [w.upper() for w in allwords]
    keywords = [k.upper() for k in keywords]
    processed_input_as_list = spellcheck_subject_matter_specific(rawstring, keywords, threshold=threshold, debug=debug)
    common_words = intersection(processed_input_as_list, keywords)
    intent_score = len(common_words)
    if weights:
        for special_word in weights.keys():
            if special_word.upper() in common_words:
                # the minus one is so we don't double count a word.
                intent_score = intent_score + weights[special_word] - 1
    if debug:
        print("intent score: %s" % intent_score)
        print("words of interest found in text: {}".format(common_words))
    # you could return common_words and score intent based on the list.
    # return common_words, intent_score
    return common_words
utilities for intersection & spellchecking
def intersection(a, b):
    """
    a and b are lists
    function returns a list that is the intersection of the two
    """
    return list(set(a) & set(b))
def spellcheck_subject_matter_specific(rawinput, subject_matter_vector, threshold=0.01, capitalize=True, debug=False):
    """
    rawinput: all the text that you want to check for spelling
    subject_matter_vector: only the words that are worth spellchecking for (since the function can be sort of sensitive it might correct words that you don't want to correct)
    threshold: the spell check confidence needed to update the word to the correct spelling
    capitalize: boolean determining if you want the return string to be capitalized.
    """
    new_input = copy.copy(rawinput)
    for w in TextBlob(rawinput).words:
        spellchecked_vec = w.spellcheck()
        if debug:
            print("Word: %s" % w)
            print("Spellchecked Guesses & Confidences: %s" % spellchecked_vec)
            print("Only spellchecked confidences greater than {} and in this list {} will be included".format(threshold, subject_matter_vector))
        corrected_words = [z[0].upper() for z in spellchecked_vec if z[1] > threshold]
        important_words = intersection(corrected_words, subject_matter_vector)
        # append the corrected keywords so they are picked up later
        for new_word in important_words:
            new_input = new_input + ' ' + new_word
    inputBlob = TextBlob(new_input)
    processed_input = inputBlob.words
    if capitalize:
        processed_input = [word.upper() for word in processed_input]
    return processed_input
Usage Example
discord_str = "Hi, i want to talk about codee and pYtHon"
words2detect = ["python","code"]
score_intent(rawstring=discord_str,keywords=words2detect,threshold=0.01,debug=True)
output
intent score: 2
words of interest found in text: ['PYTHON', 'CODE']
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 4 years ago.
How can I remove noise from the edges of a word (or sequence of words)? By noise I mean: 's, 're, ., ?, ,, ;, etc. In other words, punctuation and contractions. It needs to be removed only from the left and right edges; noise within a word should remain.
examples:
Apple. Apple
Donald Trump's Trump
They're They
I'm I
¿Hablas espanol? Hablas espanol
$12 12
H4ck3r H4ck3r
What's up What's up
So basically, remove apostrophes, verb contractions and punctuation, but only at the string edges (right/left). It seems strip doesn't work with full matches, and I couldn't find a suitable re method that works only on the edges.
What about
import re
strings = ['Apple.', "Trump's", "They're", "I'm", "¿Hablas", "$12", "H4ck3r"]
rx = re.compile(r'\b\w+\b')
filtered = [m.group(0) for string in strings for m in [rx.search(string)] if m]
print(filtered)
Yielding
['Apple', 'Trump', 'They', 'I', 'Hablas', '12', 'H4ck3r']
Instead of eating something away from the left or right, it simply takes the first run of word characters (\w, i.e. [a-zA-Z0-9_], plus other Unicode letters and digits when matching str patterns in Python 3).
To apply it "in the wild", you could split the sentence first, like so:
sentence = "Apple. Trump's They're I'm ¿Hablas $12 H4ck3r"
rx = re.compile(r'\b\w+\b')
filtered = [m.group(0) for string in sentence.split() for m in [rx.search(string)] if m]
print(filtered)
This obviously yields the same list as above.
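Taking the first run of word characters means a multi-word input such as "What's up" comes back as "What". If the noise should be trimmed from the outer edges only, a sketch along these lines may help. The helper name strip_edges and the list of contraction suffixes are assumptions, so extend the list as needed.

```python
import re

# Leading non-word characters, OR an optional trailing contraction suffix
# followed by trailing non-word characters at the very end of the string.
EDGE_NOISE = re.compile(r"^\W+|(?:'(?:s|re|m|ve|ll|d))?\W*$", re.IGNORECASE)

def strip_edges(text):
    """Remove punctuation and contraction suffixes from the outer edges only."""
    return EDGE_NOISE.sub("", text)

samples = ["Apple.", "Donald Trump's", "They're", "I'm",
           "¿Hablas espanol?", "$12", "H4ck3r", "What's up"]
cleaned = [strip_edges(s) for s in samples]
print(cleaned)
```

Because the suffix alternative is anchored with $, the 's inside "What's up" is left alone while the one at the end of "Donald Trump's" is removed.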
Use pandas:
import pandas as pd
s = pd.Series(['Apple.', "Trump's", "They're", "I'm", "¿Hablas", "$12", "H4ck3r"])
s.str.extract(r'(\w+)')
Output:
0 Apple
1 Trump
2 They
3 I
4 Hablas
5 12
6 H4ck3r
Name: 0, dtype: object
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
The desired result is either a function or a way to find where a sentence is located within a list of strings.
sentence = 'The cat went to the pool yesterday'
structure = ['The cat went,', 'to the pool yesterday.','I wonder if you realize the effect you are having on me. It hurts. A lot.']
For example:
def findsentence(sentence, list_of_strings):
    # do something to get the output: a vector of positions where the sentence is found in the string list
    return output
findsentence(sentence, structure)
> (0, 1)  # because the phrase is split across the list...
Caution!!
The challenge is not to find the exact sentence. Look at the example: part of the sentence is at position 0 of structure and part is at position 1.
So this is not a simple string manipulation problem.
Use the following:
sentence = "foo sam bar go"
structure = ["rq", "foo sam", "bar go", "ca", "da"]
def findsentencelist(sentence, list_of_strings):
    l = []
    # collect the index of every item that occurs in the sentence
    for i, item in enumerate(list_of_strings):
        if item in sentence:
            l.append(i)
    return l
print(findsentencelist(sentence, structure))
Hopefully this will help you, Yahli.
EDIT :
There is a problem with your variables.
Your sentence MUST be a string - not a list.
Edit your variables and try this function again :)
SECOND EDIT:
I think I've finally understood what you're trying to do. Let me know if this one works better.
THIRD EDIT:
Jesus, Hopefully this one would solve your problem. Let me know if it did the trick :)
I just removed the punctuation from structure to make it work:
sentence = 'The cat went to the pool yesterday'
structure = ['The cat went,', 'to the pool yesterday.','I wonder if you realize the effect you are having on me. It hurts. A lot.','Life is too short as it is. In short, she had a cushion job.']
import string
def findsentence(sentence, list_of_strings):
    # str.maketrans('', '', chars) builds a table that deletes chars (Python 3)
    strip_punct = str.maketrans('', '', string.punctuation)
    return tuple(i for i, s in enumerate(list_of_strings) if s.translate(strip_punct) in sentence)
print(findsentence(sentence, structure))
# (0, 1)
After removing the punctuation, you can use this code to get the indices:
for i, j in enumerate(structure):
    if j in sentence:
        print(i)
Hope this solves your problem. There are quite a few other solutions, as Python is flexible.
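A plain substring test reports every piece that occurs somewhere in the sentence, whether or not the pieces are adjacent or in order. A stricter variant, sketched below, looks for a run of consecutive list items whose words add up to exactly the sentence. The names normalize and find_sentence_span are hypothetical, and ignoring case and punctuation is an assumption about what should count as a match.

```python
import string

def normalize(text):
    """Lowercase, drop ASCII punctuation, and split into words."""
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table).lower().split()

def find_sentence_span(sentence, pieces):
    """Return indices of consecutive pieces that together spell out the sentence."""
    target = normalize(sentence)
    words = [normalize(p) for p in pieces]
    for start in range(len(pieces)):
        collected = []
        for end in range(start, len(pieces)):
            collected = collected + words[end]
            if collected == target:
                return tuple(range(start, end + 1))
            if len(collected) >= len(target):
                break  # too long already, try the next starting piece
    return ()

sentence = 'The cat went to the pool yesterday'
structure = ['The cat went,', 'to the pool yesterday.',
             'I wonder if you realize the effect you are having on me. It hurts. A lot.']
span = find_sentence_span(sentence, structure)
print(span)
```

This only matches when the sentence begins and ends on piece boundaries; a sentence that starts partway through a list item would need word-level offsets instead.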
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 5 years ago.
I have a string of text
text = u"Hey, there, hope you are doing good?????? or maybe not?"
and a tokenized version produced with spaCy. I'm using spaCy because I want to be able to use its other features, such as part-of-speech tagging and lemmatization. The problem I'd love to solve is removing stop words like ['?', ',', 'you'] from the tokens. The tokenized version of text is saved in toks:
token = nlp(text)
toks = []
for t in token:
    toks.append(t.lower_)
I was thinking of using multiple while loops like this:
while "?" in token.text:
    toks.remove("?")
while "," in token.text:
    toks.remove(",")
while "you" in token.text:
    toks.remove("you")
but I keep getting ValueError: list.remove(x): x not in list which is perfectly understandable, as it keeps removing until there is nothing to remove which thereby leads to an error.
However I found a way to handle the error using
while True:
    try:
        if '?' in token.text:
            toks.remove('?')
    except:
        try:
            if ',' in token.text:
                toks.remove(',')
        except:
            try:
                if 'you' in token.text:
                    toks.remove('you')
            except:
                break
I'm not getting the error any more, but I feel like there should be a better way to solve the problem without nested loops. Can you suggest a cleaner way?
Since you seem to want to exclude all tokens from a given set of tokens, it's easier to just ignore them while creating the toks list:
from spacy.lang.en import English  # spaCy 2+; older releases used "from spacy.en import English"
unwanted_tokens = {'?', ',', 'you'}
text = u"Hey, there, hope you are doing good?????? or maybe not?"
nlp = English()
tokens = nlp(text)
toks = []
for t in tokens:
    if t.lower_ not in unwanted_tokens:
        toks.append(t.lower_)
>>> toks
[u'hey', u'there', u'hope', u'are', u'doing', u'good', u' ', u'or', u'maybe', u'not']
The for loop could be replaced by a list comprehension:
toks = [t.lower_ for t in tokens if t.lower_ not in unwanted_tokens]
If, for reasons that you don't show in your question, you must remove the tokens after toks has been created, then you can just use a list comprehension:
toks = [t for t in toks if t not in unwanted_tokens]
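To see the filtering step in isolation, here is a small sketch that does not need spaCy. The toks list is written out by hand as an assumption about what the tokenizer would produce, and filtered is an illustrative name.

```python
# Hand-written stand-in for the lowercased tokens of the example sentence
toks = ['hey', ',', 'there', ',', 'hope', 'you', 'are', 'doing',
        'good', '?', '?', 'or', 'maybe', 'not', '?']

unwanted_tokens = {'?', ',', 'you'}

# Build a new list instead of calling toks.remove() repeatedly,
# so there is never an attempt to remove a value that is no longer present
filtered = [t for t in toks if t not in unwanted_tokens]
print(filtered)
```

Using a set for unwanted_tokens keeps each membership test cheap, and building a new list avoids the ValueError entirely.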
Use the str.replace method, with the empty string as the new string:
for target in ['?', ',', 'you']:
    text = text.replace(target, '')
This loops through the strings that need to be removed and replaces every occurrence with the empty string. Note that it works on the raw text rather than on tokens, so it also removes 'you' from inside longer words such as 'your'.