Remove non-selected text in Python Script - python

I have a huge amount of text data to deal with (through Orange) and need to clean it up somehow, which means removing all the useless words from every line. Here is the code I put in the Python Script (in Orange):
for i in range(1):
    print(in_data[i])
The data is one word per column.
Running the script prints:
['1', 'NSW', 'Worst service ever', '0', 'I've', 'experi', 'drop', 'calls', 'late', 'voicemail', 'messages', 'poor', 'batteri', 'life', 'and', 'bad', '3G', 'coverage.', 'Complain', 'to', 'the', 'call', 'centr', 'doe', 'noth', 'and', 'thei', 'refus', 'to', 'replac', 'my', 'phone', 'or', 'let', 'me', 'out', 'of', 'the', 'contract', 'I', 'just', 'signed.', 'Thei', 'deni', 'there', 'is', 'ani', 'Dropped calls']
I plan to remove all the useless words. For example, I want to keep only "Dropped calls" and "Complain" and remove all the rest. Given this large amount of data, I need a for loop to clean each line. But what method can keep the words I want and remove all the rest?

If the order of words is not important, you could define a set of useful words and take its intersection with the set of words on each line.
useful_words = set(['Complain', 'Dropped calls', 'lolcat'])
for i in range(len(in_data)):
    filtered_words = useful_words.intersection(set(in_data[i]))
    print(filtered_words)
(This is just a rough draft, which still needs some form of text preprocessing and normalization, but you get the idea.)
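For instance, a minimal sketch of such a normalization pass (an illustration on my part, assuming each row of in_data is a list of strings as in the question) could lowercase both sides before intersecting:
# hypothetical normalization: lowercase both sides before intersecting
useful_words = set(w.lower() for w in ['Complain', 'Dropped calls', 'lolcat'])
for row in in_data:
    filtered_words = useful_words.intersection(w.lower() for w in row)
    print(filtered_words)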

The following should be an efficient solution in both time and space:
# generator that yields every word which is in the set of words to keep
def filter_gen(words, words_to_keep):
    for word in words:
        if word in words_to_keep:
            yield word

words_to_keep = set(("Bb", "Dd"))
words = ["Aa", "Bb", "Cc", "Dd"]
res = [word for word in filter_gen(words, words_to_keep)]

>>> res
['Bb', 'Dd']
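Since the generator's output is materialized into a list anyway, the same set membership test in a plain list comprehension does the job in one line:
res = [word for word in words if word in words_to_keep]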

Related

I compare two identical sentences with GLEU (NLTK) and don't get 1.0. Why?

I'm trying to use the GLEU score from NLTK to evaluate machine translation quality. I wanted to check the code with two identical sentences (I'm comparing two sentences, not corpora), but as a result I'm getting 0.015151515151515152. What am I doing wrong? Two identical sentences should get a score of 1.0.
My code:
from nltk.translate.gleu_score import sentence_gleu
hyp1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'which', 'ensures', 'that', 'the', 'military', 'always', 'obeys', 'the', 'commands', 'of', 'the', 'party']
ref1a = ['It', 'is', 'a', 'guide', 'to', 'action', 'which', 'ensures', 'that', 'the', 'military', 'always', 'obeys', 'the', 'commands', 'of', 'the', 'party']
gleu_score = sentence_gleu(ref1a, hyp1)
print(gleu_score)
My result:
0.015151515151515152
Process finished with exit code 0
Am I mistaken? Please help!
The first parameter to sentence_gleu should be a list of lists (a list of reference sentences, where each sentence is a list of words).
Try calling it like this:
gleu_score = sentence_gleu([ref1a], hyp1)
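With the reference wrapped in a list, the two identical sentences score 1.0 as expected:
gleu_score = sentence_gleu([ref1a], hyp1)
print(gleu_score)  # 1.0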

removing quotes and double quotes from a list of words

This is my first question on this site, so please forgive any formatting or language errors. The question is based on the book "Think Python" by Allen Downey. The exercise is to write a Python program that reads a book in text format and removes all the whitespace (such as spaces and tabs), punctuation, and other symbols. I have tried many different ways to remove the punctuation, but it never removes the quotes and double quotes; they persistently stay. I'll copy-paste the last code I tried:
import string

def del_punctuation(item):
    '''
    This function deletes punctuation from a word.
    '''
    punctuation = string.punctuation
    for c in item:
        if c in punctuation:
            item = item.replace(c, '')
    return item

def break_into_words(filename):
    '''
    This function reads a file and breaks it into
    a list of the words used, in lower case.
    '''
    book = open(filename)
    words_list = []
    for line in book:
        for item in line.split():
            item = del_punctuation(item)
            item = item.lower()
            #print(item)
            words_list.append(item)
    return words_list

print(break_into_words('input.txt'))
I have not included the code that removes the whitespace, as it works perfectly; I have only included the code for removing punctuation. All the punctuation characters are removed except the quotes and double quotes. Please help me find the bug in the code, or is it something to do with my IDE or interpreter?
Thanks in advance
input.txt:
“Why, my dear, you must know, Mrs. Long says that Netherfield is
taken by a young man of large fortune from the north of England;
that he came down on Monday in a chaise and four to see the
place, and was so much delighted with it that he agreed with Mr.
Morris immediately; that he is to take possession before
Michaelmas, and some of his servants are to be in the house by
the end of next week.”
“What is his name?”
“Bingley.”
“Is he married or single?”
“Oh! single, my dear, to be sure! A single man of large fortune;
four or five thousand a year. What a fine thing for our girls!”
“How so? how can it affect them?”
“My dear Mr. Bennet,” replied his wife, “how can you be so
tiresome! You must know that I am thinking of his marrying one of
them.”
“Is that his design in settling here?”
The output I get is copied below:
['“why', 'my', 'dear', 'you', 'must', 'know', 'mrs', 'long', 'says', 'that', 'netherfield', 'is', 'taken', 'by', 'a', 'young', 'man', 'of', 'large', 'fortune', 'from', 'the', 'north', 'of', 'england', 'that', 'he', 'came', 'down', 'on', 'monday', 'in', 'a', 'chaise', 'and', 'four', 'to', 'see', 'the', 'place', 'and', 'was', 'so', 'much', 'delighted', 'with', 'it', 'that', 'he', 'agreed', 'with', 'mr', 'morris', 'immediately', 'that', 'he', 'is', 'to', 'take', 'possession', 'before', 'michaelmas', 'and', 'some', 'of', 'his', 'servants', 'are', 'to', 'be', 'in', 'the', 'house', 'by', 'the', 'end', 'of', 'next', 'week”', '“what', 'is', 'his', 'name”', '“bingley”', '“is', 'he', 'married', 'or', 'single”', '“oh', 'single', 'my', 'dear', 'to', 'be', 'sure', 'a', 'single', 'man', 'of', 'large', 'fortune', 'four', 'or', 'five', 'thousand', 'a', 'year', 'what', 'a', 'fine', 'thing', 'for', 'our', 'girls”', '“how', 'so', 'how', 'can', 'it', 'affect', 'them”', '“my', 'dear', 'mr', 'bennet”', 'replied', 'his', 'wife', '“how', 'can', 'you', 'be', 'so', 'tiresome', 'you', 'must', 'know', 'that', 'i', 'am', 'thinking', 'of', 'his', 'marrying', 'one', 'of', 'them”', '“is', 'that', 'his', 'design', 'in', 'settling', 'here”']
It has removed all the punctuation except the double quotes and single quotes (there are single quotes in the input, I guess).
Thanks
Real texts may contain many tricky symbols: the en-dash –, the em-dash —, over ten different kinds of quotes (" ' ` ‘ ’ “ ” « » ‹ ›), et cetera, et cetera...
It makes little sense to try to enumerate all the possible punctuation symbols. A common approach is to keep only the letters (and spaces). The easiest way is to use a regular expression:
import re
text = '''“Why, my dear, you must know, Mrs. Long says that Netherfield is
taken by a young man of large fortune from the north of England;
that he came down on Monday in a chaise and four to see the
place, and was so much delighted with it that he agreed with Mr.
Morris immediately; that he is to take possession before
Michaelmas, and some of his servants are to be in the house by
the end of next week.”
“What is his name?”
“Bingley.”
“Is he married or single?”
“Oh! single, my dear, to be sure! A single man of large fortune;
four or five thousand a year. What a fine thing for our girls!”
“How so? how can it affect them?”
“My dear Mr. Bennet,” replied his wife, “how can you be so
tiresome! You must know that I am thinking of his marrying one of
them.”
“Is that his design in settling here?”'''
# remove everything except letters, spaces, \n and, for example, dashes
text = re.sub(r"[^A-Za-z \n-]", "", text)
# split the text by spaces and \n
output = text.split()
print(output)
But actually the matter is much more complicated than it looks at first glance. Say, is I'm two words? Probably so. What about someone's? Or rock'n'roll?
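If you want to keep such words intact, one possible tweak (my own suggestion, not a complete solution) is to allow the ASCII apostrophe in the character class:
# also keep ASCII apostrophes, so I'm / someone's / rock'n'roll survive as single tokens
text = re.sub(r"[^A-Za-z' \n-]", "", text)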
I think your text contains the character ” as double quotes instead of ". ” doesn't exist in string.punctuation, so you are not removing it. Maybe it is better to change your del_punctuation function a little:
def del_punctuation(item):
    '''
    This function deletes punctuation from a word.
    '''
    punctuation = string.punctuation
    for c in item:
        if c in punctuation:
            item = item.replace(c, '')
    item = item.replace('”', '')
    item = item.replace('“', '')
    return item
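You can check that claim directly: string.punctuation contains only ASCII characters, so the curly quotes are not in it:
>>> import string
>>> '”' in string.punctuation
False
>>> '"' in string.punctuation
True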

How do I remove English stop words from a dataframe column using a custom dictionary of stop words

I'm writing a function that takes in a dataframe (df) of tweets as input. I need to tokenize the tweets, remove the stop words, and add the output to a new column. I can't import anything except numpy and pandas.
The stop words are in a dictionary as follows:
stop_words_dict = {
    'stopwords': [
        'where', 'done', 'if', 'before', 'll', 'very', 'keep', 'something', 'nothing', 'thereupon',
        'may', 'why', '’s', 'therefore', 'you', 'with', 'towards', 'make', 'really', 'few', 'former',
        'during', 'mine', 'do', 'would', 'of', 'off', 'six', 'yourself', 'becoming', 'through',
        'seeming', 'hence', 'us', 'anywhere....}
This is what I attempted: a function to remove the stop words.
def stop_words_remover(df):
    stop_words = list(stop_words_dict.values())
    df["Without Stop Words"] = df["Tweets"].str.lower().str.split()
    df["Without Stop Words"] = df["Without Stop Words"].apply(lambda x: [word for word in x if word not in stop_words])
    return df
So if this was my input:
[#bongadlulane, please, send, an, email, to,]
This is the expected output:
[#bongadlulane, send, email, mediadesk#eskom.c]
But I keep getting the former instead of the latter.
Any insight would be really appreciated. Thank you.
Your problem is in this line:
stop_words = list(stop_words_dict.values())
This returns a list whose single element is the whole list of stop words, so no individual word ever matches.
Replace it with:
stop_words = stop_words_dict['stopwords']
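A minimal end-to-end sketch of the fixed function (with a hypothetical shortened stop word list and a made-up tweet, for illustration only):
import pandas as pd

# hypothetical stop word list; the real one is much longer
stop_words_dict = {'stopwords': ['an', 'to', 'if', 'please', 'very']}

def stop_words_remover(df):
    stop_words = stop_words_dict['stopwords']  # the corrected lookup
    df["Without Stop Words"] = df["Tweets"].str.lower().str.split()
    df["Without Stop Words"] = df["Without Stop Words"].apply(
        lambda words: [w for w in words if w not in stop_words])
    return df

df = pd.DataFrame({"Tweets": ["#bongadlulane please send an email to mediadesk"]})
print(stop_words_remover(df)["Without Stop Words"][0])
# ['#bongadlulane', 'send', 'email', 'mediadesk']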

How to create a 3rd list from a nested list based on items in another list

I have a list of some users:
list_of_users=['#elonmusk', '#YouTube','#FortniteGame','#BillGates','#JeffBezos']
and a nested list made of tweets, split into words:
tweets_splitted_by_words=[['#MrBeastYT', '#BillGates', 'YOU’RE', 'THE', 'LAST', 'ONE', 'FINISH', 'THE', 'MISSION', '#TeamTrees'], ['#MrBeastYT', '#realDonaldTrump', 'do', 'something', 'useful', 'with', 'your', 'life', 'and', 'donate', 'to', '#TeamTrees'], ['Please', 'please', 'donate']]
I want to create a third list made of the sublists of tweets_splitted_by_words that contain at least one of the users in list_of_users.
Output that I want:
output=[['#MrBeastYT', '#BillGates', 'YOU’RE', 'THE', 'LAST', 'ONE', 'FINISH', 'THE', 'MISSION', '#TeamTrees']]
I tried the following code, but it didn't work out:
tweets_per_user_mentioned = []
giorgia = []
for r in range(len(tweets_splitted_by_words)):
    giorgia.append(r)
for _i in range(len(giorgia)):
    if _i in range(len(list_of_users)):
        tweets_per_user_mentioned.append(tweets_splitted_by_words[r])
    else:
        pass
print(tweets_per_user_mentioned)
Since you will be performing lookups against the list of users, it is a good idea to use a set data structure. Sets provide O(1) lookup, which greatly reduces the time complexity of many problems.
For filtering, I'd just use Python's built-in any and a list comprehension:
set_of_users = set(list_of_users)
filtered_tweets = [tweet for tweet in tweets_splitted_by_words
                   if any(word in set_of_users for word in tweet)]
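With the sample data above, this yields exactly the desired output:
print(filtered_tweets)
# [['#MrBeastYT', '#BillGates', 'YOU’RE', 'THE', 'LAST', 'ONE', 'FINISH', 'THE', 'MISSION', '#TeamTrees']]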

How to write a Python program that reads from a text file and builds a dictionary which maps each word to the words that follow it

I am having difficulty writing a Python program that reads from a text file and builds a dictionary which maps each word that appears in the file to a list of all the words that immediately follow it in the file. The list of words can be in any order and should include duplicates.
For example, the key "and" might have the list ["then", "best", "after", ...], listing all the words which came after "and" in the text.
Any idea would be a great help.
A couple of ideas:
Set up a collections.defaultdict for your output. This is a dictionary with a default value for keys that don't yet exist (in this case, as aelfric5578 suggests, an empty list).
Build a list of all the words in your file, in order.
Use zip(lst, lst[1:]) to create pairs of consecutive list elements, as the small demo below shows.
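For instance, the pairing step on its own (with a made-up word list) looks like this:
>>> lst = ['and', 'then', 'and', 'best']
>>> list(zip(lst, lst[1:]))
[('and', 'then'), ('then', 'and'), ('and', 'best')]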
Welcome to stackoverflow.com.
Are you sure you need a dictionary?
It will take a lot of memory if the text is long, just to repeat the same data several times across several entries, whereas a function will give you the desired list(s) on demand.
For example:
s = """In Newtonian physics, free fall is any motion
of a body where its weight is the only force acting
upon it. In the context of general relativity where
gravitation is reduced to a space-time curvature,
a body in free fall has no force acting on it and
it moves along a geodesic. The present article
concerns itself with free fall in the Newtonian domain."""
import re
def say_me(word,li=re.split('\s+',s)):
for i,w in enumerate(li):
if w==word:
print '\n%s at index %d followed by\n%s' % (w,i,li[i+1:])
say_me('free')
Result:
free at index 3 followed by
['fall', 'is', 'any', 'motion', 'of', 'a', 'body', 'where', 'its', 'weight', 'is', 'the', 'only', 'force', 'acting', 'upon', 'it.', 'In', 'the', 'context', 'of', 'general', 'relativity', 'where', 'gravitation', 'is', 'reduced', 'to', 'a', 'space-time', 'curvature,', 'a', 'body', 'in', 'free', 'fall', 'has', 'no', 'force', 'acting', 'on', 'it', 'and', 'it', 'moves', 'along', 'a', 'geodesic.', 'The', 'present', 'article', 'concerns', 'itself', 'with', 'free', 'fall', 'in', 'the', 'Newtonian', 'domain.']
free at index 38 followed by
['fall', 'has', 'no', 'force', 'acting', 'on', 'it', 'and', 'it', 'moves', 'along', 'a', 'geodesic.', 'The', 'present', 'article', 'concerns', 'itself', 'with', 'free', 'fall', 'in', 'the', 'Newtonian', 'domain.']
free at index 58 followed by
['fall', 'in', 'the', 'Newtonian', 'domain.']
The assignment li=re.split(r'\s+', s) is a way to bind the parameter li to the object re.split(r'\s+', s) passed as an argument.
This binding is done only once: at the moment the definition of the function is read by the interpreter to create the function object. li is a parameter defined with a default argument.
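A tiny illustration of that one-time binding, using a hypothetical function with a mutable default:
>>> def f(x, cache=[]):   # the default list is created once, at definition time
...     cache.append(x)
...     return cache
...
>>> f(1)
[1]
>>> f(2)   # the same list object is reused across calls
[1, 2]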
Here is what I would do:
from collections import defaultdict

# My example line:
s = 'In the face of ambiguity refuse the temptation to guess'
# The previous string is quite easy to tokenize, but in the real world you'll have to:
# - remove commas, dots, etc.
# - probably encode to ASCII (the unidecode 3rd-party module can be helpful)
# - probably also normalize case
lst = s.lower().split(' ')  # naive tokenizer
ddic = defaultdict(list)
for word1, word2 in zip(lst, lst[1:]):
    ddic[word1].append(word2)
# ddic contains what you want (but is a defaultdict)
# if you want to work with a "classical" dictionary, just cast it
# (often it's not needed):
dic = dict(ddic)
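For the example line, dic then comes out as:
{'in': ['the'], 'the': ['face', 'temptation'], 'face': ['of'],
 'of': ['ambiguity'], 'ambiguity': ['refuse'], 'refuse': ['the'],
 'temptation': ['to'], 'to': ['guess']}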
Sorry if I seem to be stealing the commenters' ideas, but this is almost the same code I have used in some of my projects (pre-computation for similar document algorithms).
