Remove all special characters and numbers and stop words - python

I have a list of issues like the one below. I would like to remove all special characters and numbers from this list of issues, and then do tokenization and stop-word removal:
issue=[[hi iam !#going $%^ to uk&*(us \\r\\ntomorrow {morning} by
the_way two-three!~`` [problems]:are there;]
[happy"journey" (and) \\r\\n\\rbring 576 chachos?>]]
I have tried the code below, but I am not getting the desired output:
import re
ab=re.sub('[^A-Za-z0-9]+', '', issue)
bc=re.split(r's, ab)
I would like to see output like below:
issue_output=[['hi','going','uk','us','tomorrow','morning',
'way','two','three','problems' ]
[ 'happy','journey','bring','chachos']]

There are two glaring problems with the code you have posted. The first is that your input list issue is not formatted as valid Python, which makes it impossible to parse; depending on how you actually want it formatted, the answer to your question might change. The second, more general problem is that you are trying to call re.sub on a list, when you want to apply the substitution to the list's elements. You can use a list comprehension for that:
issue_output = [re.sub(r'[^A-Za-z0-9]+', ' ', item) for item in issue]
Since there is no valid Python list provided in the question, I will assume the values in the list based on my best guess.
issue = [
    ['hi iam !#going $%^ to uk&*(us \\r\\ntomorrow {morning} by the_way two-three!~`` [problems]:are there;'],
    ['happy"journey" (and) \\r\\n\\rbring 576 chachos?>']
]
In this case, when you have a list of lists of strings, you need to adjust the list comprehension for that.
cleaned_issue = [[re.sub(r'[^A-Za-z0-9]+', ' ', item) for item in inner_list] for inner_list in issue]
This returns a list of lists with strings inside:
[['hi iam going to uk us r ntomorrow morning by the way two three problems are there '], ['happy journey and r n rbring 576 chachos ']]
If you want to have the separate words in that list, simply split() them after substitution.
tokenized_issue = [[re.sub(r'[^A-Za-z0-9]+', ' ', item).split() for item in inner_list][0] for inner_list in issue]
This gives the result of:
[['hi', 'iam', 'going', 'to', 'uk', 'us', 'r', 'ntomorrow', 'morning', 'by', 'the', 'way', 'two', 'three', 'problems', 'are', 'there'], ['happy', 'journey', 'and', 'r', 'n', 'rbring', '576', 'chachos']]
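The question also asks for stop-word removal, which the snippets above do not cover. Here is a minimal sketch using NLTK's English stop-word list (an assumption on my part; it requires nltk to be installed and its stopwords corpus downloaded, and it also strips digits so that 576 disappears, as in the desired output):

import re
from nltk.corpus import stopwords  # requires nltk.download('stopwords') once

stop_words = set(stopwords.words('english'))

issue_output = [
    [word for item in inner_list
          for word in re.sub(r'[^A-Za-z]+', ' ', item).lower().split()
          if word not in stop_words]
    for inner_list in issue
]

This drops words such as 'to', 'by', 'and' and 'are'; note that NLTK's list will not drop 'iam', so the result is close to, but not exactly, the output asked for.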

Related

Convert list of string representations of sentences into vocabulary set

I have a list of string representations of sentences that looks something like this:
original_format = ["This is a question", "This is another question", "And one more too"]
I want to convert this list into a set of unique words in my corpus. Given the above list, the output would look something like this:
{'And', 'This', 'a', 'another', 'is', 'more', 'one', 'question', 'too'}
I've figured out a way to do this, but it takes a very long time to run. I am interested in a more efficient way of converting from one format to another (especially since my actual dataset contains >200k sentences).
FYI, what I'm doing right now is creating an empty set for the vocab and then looping through each sentence (split by spaces) and unioning with the vocab set. Using the original_format variable as defined above, it looks like this:
vocab = set()
for q in original_format:
    vocab = vocab.union(set(q.split(' ')))
Can you help me run this conversion more efficiently?
You can use itertools.chain with set. This avoids nested for loops and list construction.
from itertools import chain
original_format = ["This is a question", "This is another question", "And one more too"]
res = set(chain.from_iterable(i.split() for i in original_format))
print(res)
{'And', 'This', 'a', 'another', 'is', 'more', 'one', 'question', 'too'}
Or for a truly functional approach:
from itertools import chain
from operator import methodcaller
res = set(chain.from_iterable(map(methodcaller('split'), original_format)))
Using a simple set comprehension:
{j for i in original_format for j in i.split()}
Output:
{'too', 'is', 'This', 'And', 'question', 'another', 'more', 'one', 'a'}
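Since the question is about speed, a quick way to compare the three approaches on your own data is timeit (a sketch; the scaling factor and the numbers are illustrative and depend on your corpus). The original loop is slow mainly because set.union builds a brand-new set on every iteration, whereas the other versions build one set in a single pass.

import timeit
from itertools import chain

original_format = ["This is a question", "This is another question", "And one more too"] * 10000

def loop_union():
    vocab = set()
    for q in original_format:
        vocab = vocab.union(set(q.split(' ')))
    return vocab

def chained():
    return set(chain.from_iterable(q.split() for q in original_format))

def set_comprehension():
    return {w for q in original_format for w in q.split()}

for fn in (loop_union, chained, set_comprehension):
    print(fn.__name__, timeit.timeit(fn, number=10))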

Getting second element of a tuple (in a list of tuples) as a string

I have an output that is a list of tuples. It looks like this:
annot1=[(402L, u"[It's very seldom that you're blessed to find your equal]"),
(415L, u'[He very seldom has them in this show or his movies]')…
I need to use only the second part of each tuple, so I can apply split and get each word in the sentence separately.
At this point, I’m not able to isolate the second part of the tuple (the text).
This is my code:
def scope_match(annot1):
    scope = annot1[1:]
    scope_string = ''.join(scope)
    scope_set = set(scope_string.split(' '))
But I get:
TypeError: sequence item 0: expected string, tuple found
I tried to use annot1[1], but it gives me the second tuple in the list instead of the second element of each tuple.
You can do something like this with list comprehensions:
annot1=[(402L, u"[It's very seldom that you're blessed to find your equal]"),
(415L, u'[He very seldom has them in this show or his movies]')]
print [a[1].strip('[]').encode('utf-8').split() for a in annot1]
Output:
[["It's", 'very', 'seldom', 'that', "you're", 'blessed', 'to', 'find', 'your', 'equal'], ['He', 'very', 'seldom', 'has', 'them', 'in', 'this', 'show', 'or', 'his', 'movies']]
You can calculate the intersection of strings in corresponding positions in annot1 and annot2 like this:
for x, y in zip(annot1, annot2):
    print set(x[1].strip('[]').encode('utf-8').split()).intersection(y[1].strip('[]').encode('utf-8').split())
annot1 is a list of tuples. To get the string from each of the elements, you can do something like this
def scope_match(annot1):
    for pair in annot1:
        string = pair[1]
        print string  # or whatever you want to do
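The snippets above are Python 2 (the L suffix and the print statements). If you are on Python 3, a minimal sketch of scope_match that unpacks each tuple and returns one set of words could look like this:

def scope_match(annot1):
    scope_set = set()
    for _, text in annot1:                       # ignore the id, keep the text
        scope_set.update(text.strip('[]').split())
    return scope_set

Called on the list above, it returns the union of the words from both sentences; pass a single tuple in a one-element list if you want one set per sentence.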

How do I put single word as an array to a list in Python?

I need to make such a list of words in python:
list_of_words = ['saffron'], ['aloha'], ['leave'],['cola'],['packing']\
by choosing random words from another list, word_bank = ['cola', 'home', 'undone', 'some', 'good', ....], until, let's say, len(list_of_words) == 15.
I have never used that before. What is it called?
Where should I search for it?
How do I obtain such a list?
Maybe that is what you are looking for:
import random
word_bank = ['cola', 'home', 'undone', 'some', 'good']
tuple([[x] for x in random.sample(word_bank, 5)])
Possible output:
(['cola'], ['some'], ['good'], ['undone'], ['home'])
Here is my solution to what I believe you are saying:
import random
list_of_words = []
word_bank = ['cola', 'home', 'undone', 'some', 'good']
while len(list_of_words) < 15:
    list_of_words.append(random.choice(word_bank))
This creates an empty list and then appends random choices from word_bank to it until it holds 15 words. I wasn't sure exactly why you wanted a list of lists, but this puts it into a format similar to word_bank.
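If you really do want every element to be a single-word list, as in the question's example, here is a sketch of the same loop with the extra brackets (repeats are allowed, since word_bank has fewer than 15 words):

import random

word_bank = ['cola', 'home', 'undone', 'some', 'good']
list_of_words = []
while len(list_of_words) < 15:
    list_of_words.append([random.choice(word_bank)])  # note the extra [...] around the word

print(list_of_words)  # e.g. [['cola'], ['good'], ['home'], ...]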

In Python, how to check if words in a string are keys in a dictionary?

For a class I am tackling the Twitter sentiment analysis problem. I have looked at the other questions on the site and they don't help with my particular issue.
I am given a string that is one tweet with its letters changed so that they are all in lowercase. For example,
'after 23 years i still love this place. (# tel aviv kosher pizza) http://t.co/jklp0uj'
as well as a dictionary of words where the key is the word and the value is the value for the sentiment for that word. To be more specific, a key can be a single word (such as 'hello'), more than one word separated by a space (such as 'yellow hornet'), or a hyphenated compound word (such as '2-dimensional'), or a number (such as '365').
I need to find the sentiment of the tweet by adding the sentiments for every eligible word and dividing by the number of eligible words (by eligible word, I mean a word that is in the dictionary). I'm not sure of the best way to go about checking whether a tweet has a word in the dictionary.
I tried using the "key in string" convention with looping through all the keys, but this was problematic because there are a lot of keys and word-in-words would be counted (e.g. eradicate counts cat, ate, era, etc. as well)
I then tried using .split(' ') and looping through the elements of the resultant list but I ran into problems because of punctuation and keys which are two words.
Anyone have any ideas on how I can more suitably tackle this?
For example: using the example above, still : -0.625, love : 0.625, and every other word is not in the dictionary, so this should return (-0.625 + 0.625)/2 = 0.
The whole point of dictionaries is that they are quick at looking things up:
for word in instring.split():
    if wordsdict.has_key(word):
        print word
You would probably do better at getting rid of punctuation, etc, (thank-you Soke), by using regular expressions rather than split, e.g.
for word in re.findall(r'[\w]+', instring):
    if wordsdict.get(word) is not None:
        print word
Of course you will have to have some maximum length of word grouping, possibly found with a single run through the dictionary, and then take your pairs, triples, etc. and check them as well.
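A sketch of that idea (the function name is made up for illustration; wordsdict is the sentiment dictionary used above): find the longest key once, then slide a window of every length up to that maximum over the tweet's tokens and look each phrase up in the dictionary.

import string

def tweet_sentiment(tweet, wordsdict):
    tokens = [w.strip(string.punctuation) for w in tweet.lower().split()]
    max_len = max(len(k.split()) for k in wordsdict)   # longest key, counted in words
    scores = []
    for n in range(1, max_len + 1):                    # single words, pairs, triples, ...
        for i in range(len(tokens) - n + 1):
            phrase = ' '.join(tokens[i:i + n])
            if phrase in wordsdict:
                scores.append(wordsdict[phrase])
    return sum(scores) / len(scores) if scores else 0.0

For the example tweet and {'still': -0.625, 'love': 0.625} this returns 0.0; note that a multi-word phrase and the single words inside it are both counted if both happen to be keys.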
You can use NLTK, which is very powerful for what you want to do; it can also be done with split:
>>> import string
>>> a= 'after 23 years i still love this place. (# tel aviv kosher pizza) http://t.co/jklp0uj'
>>> import nltk
>>> my_dict = {'still' : -0.625, 'love' : 0.625}
>>> words = nltk.word_tokenize(a)
>>> words
['after', '23', 'years', 'i', 'still', 'love', 'this', 'place.', '(', '#', 'tel', 'aviv', 'kosher', 'pizza', ')', 'http', ':', '//t.co/jklp0uj']
>>> sum(my_dict.get(x.strip(string.punctuation),0) for x in words)/2
0.0
using split:
>>> words = a.split()
>>> words
['after', '23', 'years', 'i', 'still', 'love', 'this', 'place.', '(#', 'tel', 'aviv', 'kosher', 'pizza)', 'http://t.co/jklp0uj']
>>> sum(my_dict.get(x.strip(string.punctuation),0) for x in words)/2
0.0
my_dict.get(key, default): get returns the value if the key is found in the dictionary, otherwise it returns default, which in this case is 0.
Check this example (you asked about 'place'):
>>> import string
>>> my_dict = {'still' : -0.625, 'love' : 0.625,'place':1}
>>> a= 'after 23 years i still love this place. (# tel aviv kosher pizza) http://t.co/jklp0uj'
>>> words = nltk.word_tokenize(a)
>>> sum(my_dict.get(x.strip(string.punctuation),0) for x in words)/2
0.5
Going by the length of the dictionary key might be one solution.
For example, you have the dict as:
Sentimentdict = {"habit":5, "bad habit":-1}
the sentence might be:
s1="He has good habit"
s2="He has bad habit"
s1 should get a better sentiment score than s2. Now, you can do this:
for w in sorted(Sentimentdict.keys(), key=len, reverse=True):  # check the longest keys first
    if w in s1:
        s1 = s1.replace(w, '')  # remove the matched words, then do your sentiment calculation
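As a fuller sketch of that idea (the averaging at the end is my assumption about what "your sentiment calculation" should be):

Sentimentdict = {"habit": 5, "bad habit": -1}

def sentence_sentiment(sentence, sentiment_dict):
    scores = []
    for key in sorted(sentiment_dict, key=len, reverse=True):  # longest keys first
        if key in sentence:
            scores.append(sentiment_dict[key])
            sentence = sentence.replace(key, '')  # so 'habit' is not counted again inside 'bad habit'
    return sum(scores) / float(len(scores)) if scores else 0.0

print(sentence_sentiment("He has good habit", Sentimentdict))  # 5.0
print(sentence_sentiment("He has bad habit", Sentimentdict))   # -1.0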

Parse out elements from a pattern

I am trying to parse the result output from a natural language parser (Stanford parser).
Some of the results are as below:
dep(Company-1, rent-5')
conj_or(rent-5, share-10)
amod(information-12, personal-11)
prep_about(rent-5, you-14)
amod(companies-20, non-affiliated-19)
aux(provide-23, to-22)
xcomp(you-14, provide-23)
dobj(provide-23, products-24)
aux(requested-29, 've-28)
The results I am trying to get are:
['dep', 'Company', 'rent']
['conj_or', 'rent', 'share']
['amod', 'information', 'personal']
...
['amod', 'companies', 'non-affiliated']
...
['aux', 'requested', "'ve"]
First I tried to directly get these elements out, but failed.
Then I realized regex should be the right way forward.
However, I am totally unfamiliar with regex. With some exploration, I got:
m = re.search('(?<=())\w+', line)
m2 =re.search('(?<=-)\d', line)
and stuck.
The first one can correctly get the first elements, e.g. 'dep', 'amod', 'conj_or', but I actually have not totally figured out why it is working...
The second line is trying to get the second elements, e.g. 'Company', 'rent', 'information', but I can only get the number after the word. I cannot figure out how to match what comes before the hyphen rather than what comes after it...
BTW, I also cannot figure out how to deal with exceptions such as 'non-affiliated' and "'ve".
Could anyone give some hints or help? Highly appreciated.
It is difficult to give an optimal answer without knowing the full range of possible outputs, however, here's a possible solution:
>>> [re.findall(r'[A-Za-z_\'-]+[^-\d\(\)\']', line) for line in s.split('\n')]
[['dep', 'Company', 'rent'],
['conj_or', 'rent', 'share'],
['amod', 'information', 'personal'],
['prep_about', 'rent', 'you'],
['amod', 'companies', 'non-affiliated'],
['aux', 'provide', 'to'],
['xcomp', 'you', 'provide'],
['dobj', 'provide', 'products'],
['aux', 'requested', "'ve"]]
It works by finding all the groups of contiguous letters ([A-Za-z] represents the range from capital A to Z and from lowercase a to z), plus the characters "_", "'" and "-", on each line.
Furthermore, it enforces the rule that the matched string must not end with any of a given list of characters ([^...] is the syntax for "must not match any of the characters ...").
The character \ escapes characters like "(" or ")" that would otherwise be interpreted by the regex engine as instructions.
Finally, s is the example string you gave in the question...
HTH!
Here is something you're looking for:
([\w-]*)\(([\w-]*)-\d*, ([\w-]*)-\d*\)
The parentheses around [\w-]* are for grouping, so that you can access the data as:
ex = r'([\w-]*)\(([\w-]*)-\d*, ([\w-]*)-\d*\)'
m = re.match(ex, line)
print(m.group(1), m.group(2), m.group(3))  # relation, first word, second word
Btw, I recommend using "Kodos", a program written in Python+PyQt, to learn and test regular expressions. It's my favourite tool for testing regexes.
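To apply this to every line at once, here is a sketch that assumes the parser output is held in one string s, one relation per line; the character classes are widened to [\w'-] so that tokens like 've and non-affiliated (which the pattern above does not capture) also match:

import re

pattern = re.compile(r"(\w+)\(([\w'-]+)-\d+'?, ([\w'-]+)-\d+'?\)")
items = []
for line in s.splitlines():
    m = pattern.match(line.strip())
    if m:                                # skip blank or non-matching lines
        items.append(list(m.groups()))   # [relation, first word, second word]
print(items)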
If the results from the parser are as regular as suggested, regexes may not be necessary:
from pprint import pprint
source = """
dep(Company-1, rent-5')
conj_or(rent-5, share-10)
amod(information-12, personal-11)
prep_about(rent-5, you-14)
amod(companies-20, non-affiliated-19)
aux(provide-23, to-22)
xcomp(you-14, provide-23)
dobj(provide-23, products-24)
aux(requested-29, 've-28)
"""
items = []
for line in source.splitlines():
    head, sep, tail = line.partition('(')
    if head:
        item = [head]
        head, sep, tail = tail.strip('()').partition(', ')
        item.append(head.rpartition('-')[0])
        item.append(tail.rpartition('-')[0])
        items.append(item)
pprint(items)
Output:
[['dep', 'Company', 'rent'],
['conj_or', 'rent', 'share'],
['amod', 'information', 'personal'],
['prep_about', 'rent', 'you'],
['amod', 'companies', 'non-affiliated'],
['aux', 'provide', 'to'],
['xcomp', 'you', 'provide'],
['dobj', 'provide', 'products'],
['aux', 'requested', "'ve"]]
