I would like to do stopword removal. I have a list of about 15,000 strings; those strings are short texts. My code is the following:
h = []
for w in clean.split():
    if w not in cachedStopWords:
        h.append(w)
    if w in cachedStopWords:
        h.append(" ")
print(h)
I understand that .split() is necessary so that each word, rather than each whole string, is compared against the list of stopwords. But it does not seem to work, because a list cannot be split. (Without any kind of splitting, h equals clean, because obviously nothing matches.)
Does anyone have an idea how else I could split the different strings in the list while still preserving the different cases?
A very minimal example:
stops = {'remove', 'these', 'words'}
strings = ['please do not remove these words', 'removal is not cool', 'please please these are the bees\' knees', 'there are no stopwords here']
strings_cleaned = [' '.join(word for word in s.split() if word not in stops) for s in strings]
Or you could do:
strings_cleaned = []
for s in strings:
    word_list = []
    for word in s.split():
        if word not in stops:
            word_list.append(word)
    s_string = ' '.join(word_list)
    strings_cleaned.append(s_string)
This is a lot uglier (I think) than the one-liner before it, but perhaps more intuitive.
Make sure you convert your container of stopwords to a set (a hash-based container whose lookups are O(1) on average, unlike lists, whose lookups are O(n)).
Edit: This is just a general, very straightforward example of how to remove stopwords. Your use case might be a little different, but since you haven't provided a sample of your data, we can't help any further.
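For your actual setup, a hedged sketch (assuming clean is the list of ~15,000 short texts and cachedStopWords is currently a list):
# convert the stopword list to a set once: membership tests become O(1) on average
cachedStopWords = set(cachedStopWords)
# clean is assumed to be the list of ~15,000 short texts
clean = [' '.join(w for w in text.split() if w not in cachedStopWords) for text in clean]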
So I was recently working on this function:
# counts owls
def owl_count(text):
    # sets all text to lowercase
    text = text.lower()
    # splits the text into a list of words
    text = text.split()
    # saves the indices of "owl" in the list
    indices = [i for i, x in enumerate(text) if x == "owl"]
    # counts occurrences of "owl" in the text
    owl_count = len(indices)
    # returns owl count and indices
    return owl_count, indices
My goal was to count how many times "owl" occurs in the string and save the indices of it. The issue I kept running into was that it would not count "owls" or "owl." I tried splitting it into a list of individual characters but I couldn't find a way to search for three consecutive elements in the list. Do you guys have any ideas on what I could do here?
PS. I'm definitely a beginner programmer so this is probably a simple solution.
thanks!
If you don't want to use a huge library like NLTK, you can filter for words that start with 'owl' rather than words equal to 'owl':
indices = [i for i, x in enumerate(text) if x.startswith("owl")]
In this case words like 'owlowlowl' will pass too; for real-world text you should use NLTK to tokenize words properly.
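If you would rather stay in the standard library, here is a minimal sketch that strips surrounding punctuation from each token before comparing (assuming only the forms "owl" and "owls" should count):
import string
def owl_count(text):
    # lowercase, split on whitespace, and strip surrounding punctuation
    words = [w.strip(string.punctuation) for w in text.lower().split()]
    # record indices where the cleaned token is exactly "owl" or "owls"
    indices = [i for i, w in enumerate(words) if w in ("owl", "owls")]
    return len(indices), indices
print(owl_count("An owl. Two owls! No owlowlowl."))  # (2, [1, 3])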
Python has built-in functions for this. This type of string matching falls under regular expressions, which you can explore in more detail later:
import re
a_string = "your string"
substring = "substring that you want to check"
matches = re.finditer(substring, a_string)
matches_positions = [match.start() for match in matches]
print(matches_positions)
finditer() returns an iterator, and start() returns the starting index of each match. Simply put, it returns the indices of all occurrences of the substring in the string.
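To avoid matching inside longer words like 'owlowlowl', a hedged sketch using a word-boundary pattern that matches "owl" or "owls" only as whole words:
import re
text = "An owl. Two owls! No owlowlowl."
# \b anchors the match at word boundaries; s? optionally matches the plural
matches = [(m.start(), m.group()) for m in re.finditer(r'\bowls?\b', text, re.IGNORECASE)]
print(matches)  # [(3, 'owl'), (12, 'owls')]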
I'm trying to take a string input, like a sentence, and find all the words that have their reverse words in the sentence. I have this so far:
s = "Although he was stressed when he saw his desserts burnt, he managed to stop the pots from getting ruined"
def semordnilap(s):
    s = s.lower()
    b = "!@#$,"
    for char in b:
        s = s.replace(char, "")
    s = s.split(' ')
    dict = {}
    index = 0
    for i in range(0, len(s)):
        originalfirst = s[index]
        sortedfirst = ''.join(sorted(str(s[index])))
        for j in range(index + 1, len(s)):
            next = ''.join(sorted(str(s[j])))
            if sortedfirst == next:
                dict.update({originalfirst: s[j]})
        index += 1
    print(dict)
semordnilap(s)
So this works for the most part, but if you run it, you can see that it also pairs "he" and "he" as an anagram, which is not what I am looking for. Any suggestions on how to fix that, and also on whether the runtime could be made faster if I were to input a large text file instead?
You could split the string into a list of words and then compare lowercase versions of all combinations where one of the pair is reversed. The following example uses re.findall() to split the string into a list of words and itertools.combinations() to compare them:
import itertools
import re
s = "Although he was stressed when he saw his desserts burnt, he managed to stop the pots from getting ruined"
words = re.findall(r'\w+', s)
pairs = [(a, b) for a, b in itertools.combinations(words, 2) if a.lower() == b.lower()[::-1]]
print(pairs)
# OUTPUT
# [('was', 'saw'), ('stressed', 'desserts'), ('stop', 'pots')]
EDIT: I still prefer the solution above, but per your comment regarding doing this without importing any packages, see below. However, note that str.translate() used this way may have unintended consequences depending on the nature of your text (like stripping @ from email addresses); in other words, you may need to deal with punctuation more carefully than this. Also, I would typically import string and use string.punctuation rather than the literal string of punctuation characters below, but avoided that in keeping with your request to do this without imports.
s = "Although he was stressed when he saw his desserts burnt, he managed to stop the pots from getting ruined"
words = s.translate(str.maketrans('', '', '!"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~')).split()
length = len(words)
pairs = []
for i in range(length - 1):
    for j in range(i + 1, length):
        if words[i].lower() == words[j].lower()[::-1]:
            pairs.append((words[i], words[j]))
print(pairs)
# OUTPUT
# [('was', 'saw'), ('stressed', 'desserts'), ('stop', 'pots')]
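Regarding runtime on a large text file: the pairwise comparison above is O(n^2) in the number of words. A hedged sketch of a linear-time alternative that keeps a set of the words seen so far and checks whether each new word's reversal has already appeared (assuming each matching pair should be reported once):
import re
s = "Although he was stressed when he saw his desserts burnt, he managed to stop the pots from getting ruined"
words = re.findall(r'\w+', s.lower())
seen = set()
pairs = []
for word in words:
    reversed_word = word[::-1]
    # skip palindromes and exact repeats such as "he"/"he"
    if reversed_word in seen and reversed_word != word:
        pairs.append((reversed_word, word))
    seen.add(word)
print(pairs)  # [('was', 'saw'), ('stressed', 'desserts'), ('stop', 'pots')]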
Sometimes I have a string like this:
string = "Hett, Agva,"
and sometimes it will contain duplicates:
string = "Hett, Agva, Delf, Agva, Hett,"
How can I check whether my string has duplicates and, if it does, remove them?
UPDATE:
So in the second string I need to remove Agva, and Hett, because each of them appears twice in the string.
Iterate over the parts (words), adding each part to a set of seen parts and to a list of parts if it is not already in that set. Finally, reconstruct the string:
seen = set()
parts = []
for part in string.split(','):
    if part.strip() not in seen:
        seen.add(part.strip())
        parts.append(part)
no_dups = ','.join(parts)
(note that I had to add some calls to .strip() as there are spaces at the start of some of the words which this method removes)
which gives:
'Hett, Agva, Delf,'
Why use a set?
Querying whether an element is in a set is O(1) on average, since elements are stored by hash, which makes lookup constant time. On the other hand, lookup in a list is O(n), as Python must iterate over the list until the element is found. This makes a set much more efficient for this task: for each new word you can instantly check whether you have seen it before, whereas with a list of seen elements you would have to iterate over it, which takes much longer as the list grows.
Oh, and to just check whether there are duplicates, test whether the length of the split list equals the length of the set of that list (building the set removes the duplicates but loses the order).
I.e.
def has_dups(string):
    parts = string.split(',')
    return len(parts) != len(set(parts))
which works as expected:
>>> has_dups('Hett, Agva,')
False
>>> has_dups('Hett, Agva, Delf, Agva, Hett,')
True
You can use toolz.unique, or equivalently the unique_everseen recipe in the itertools docs, or equivalently @JoeIddon's explicit solution.
Here's the solution using 3rd party toolz:
x = "Hett, Agva, Delf, Agva, Hett,"
from toolz import unique
res = ', '.join(filter(None, unique(x.replace(' ', '').split(','))))
print(res)
'Hett, Agva, Delf'
I've removed whitespace and used filter to clean up a trailing , which may not be required.
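For reference, the unique_everseen recipe from the itertools docs (which toolz.unique mirrors) looks roughly like this:
from itertools import filterfalse
def unique_everseen(iterable, key=None):
    # yield unique elements in order, remembering everything seen so far
    seen = set()
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen.add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen.add(k)
                yield element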
If you will only ever receive a string in this exact format, you can do the following:
import numpy as np
string_words = string.split(',')
uniq_words = np.unique(string_words)
string = ""
for word in uniq_words:
    string += word + ", "
string = string[:-1]
What this code does: it splits the string into a list of words, finds the unique items, and merges them back into a single string. Note that np.unique returns the words sorted, so the original order is not preserved.
If the order of words is important, you can make a list of the words in the string and then iterate over that list to build a new list of unique words.
string = "Hett, Agva, Delf, Agva, Hett,"
words_list = string.split()
unique_words = []
for w in words_list:
    if w not in unique_words:
        unique_words.append(w)
new_string = ' '.join(unique_words)
print(new_string)
Output:
'Hett, Agva, Delf,'
Quick and easy approach (note that a set does not preserve the original order):
', '.join(
set(
filter( None, [ i.strip() for i in string.split(',') ] )
)
)
Hope it helps. Please feel free to ask if anything is not clear :)
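If you want a one-liner that also preserves the original order, here is a sketch using dict.fromkeys (dictionaries keep insertion order in Python 3.7+):
string = "Hett, Agva, Delf, Agva, Hett,"
# dict.fromkeys drops duplicates while keeping first-seen order
res = ', '.join(dict.fromkeys(w.strip() for w in string.split(',') if w.strip()))
print(res)  # Hett, Agva, Delf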
So I have this huge list of strings in Hebrew and English, and I want to extract from them only those in Hebrew, but couldn't find a regex example that works with Hebrew.
I have tried the stupid method of comparing every character:
import string
data = []
for s in slist:
    found = False
    for c in string.ascii_letters:
        if c in s:
            found = True
    if not found:
        data.append(s)
And it works, but it is of course very slow and my list is HUGE.
Instead of this, I tried comparing only the first letter of the string to string.ascii_letters which was much faster, but it only filters out those that start with an English letter, and leaves the "mixed" strings in there. I only want those that are "pure" Hebrew.
I'm sure this can be done much better... Help, anyone?
P.S: I prefer to do it within a python program, but a grep command that does the same would also help
To check whether a string contains any ASCII letters (i.e. non-Hebrew), use:
re.search('[' + string.ascii_letters + ']', s)
If this returns a match, your string is not pure Hebrew.
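Per the P.S., a grep equivalent of this check simply prints the lines that contain no ASCII letters (assuming one string per line in strings.txt):
grep -v '[A-Za-z]' strings.txt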
This one should work:
import re
data = [s for s in slist if re.match('^[a-zA-Z ]+$', s)]
This will pick all the strings that consist of lowercase and uppercase English letters and spaces. If the strings are allowed to contain digits or punctuation marks, the allowed characters should be included into the regex.
Edit: Just noticed that this filters out the English-only strings, whereas you need it the other way round. You can try this instead:
data = [s for s in slist if not re.match('^.*[a-zA-Z].*$', s)]
This will discard any string that contains at least one English letter.
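Alternatively, you can match Hebrew directly instead of excluding Latin. The Hebrew Unicode block is U+0590 to U+05FF, so a sketch that keeps only strings made up of Hebrew characters and whitespace (extend the character class if digits or punctuation should be allowed, as noted above):
import re
# keep strings consisting solely of Hebrew-block characters and whitespace
hebrew_re = re.compile(r'^[\u0590-\u05FF\s]+$')
data = [s for s in slist if hebrew_re.match(s)]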
Python has extensive unicode support. It depends on what you're asking for. Is a hebrew word one that contains only hebrew characters and whitespace, or is it simply a word that contains no latin characters? Either way, you can do so directly. Just create the criteria set and test for membership.
Note that testing for membership in a set is much faster than iteration through string.ascii_letters.
Please note that I do not speak hebrew so I may have missed a letter or two of the alphabet.
import string
def is_hebrew(word):
    hebrew = set("אבגדהוזחטיכךלמנסעפצקרשתםןףץ" + string.whitespace)
    for char in word:
        if char not in hebrew:
            return False
    return True
def contains_latin(word):
    # a generator expression like this is a terser way of expressing the
    # concept in is_hebrew above
    return any(char in set("abcdefghijklmnopqrstuvwxyz") for char in word.lower())
hebrew_words = [word for word in words if is_hebrew(word)]
non_latin_words = [word for word in words if not contains_latin(word)]
Another option would be to create a dictionary of hebrew words:
hebrew_words = {...}
And then you iterate through the list of words and compare them against this dictionary ignoring case. This will work much faster than other approaches (O(n) where n is the length of your list of words).
The downside is that you need to get all or most Hebrew words from somewhere. It should be possible to find such a list on the web in CSV or some other form; parse it and put it into a Python dictionary or set. However, this only makes sense if you need to process such lists of words often and quickly. Another problem is that the dictionary may not contain every Hebrew word, in which case it will not give a completely correct answer.
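A minimal sketch of that approach, assuming a hypothetical file hebrew_words.txt with one Hebrew word per line:
# hebrew_words.txt is a hypothetical word list, one word per line
with open('hebrew_words.txt', encoding='utf-8') as f:
    hebrew_words = {line.strip() for line in f}
# keep only the strings whose every word appears in the word list
data = [s for s in slist if all(w in hebrew_words for w in s.split())]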
Try this (in Python 3, pass re.ASCII so that \w matches only ASCII word characters; otherwise Hebrew letters count as \w too):
>>> import re
>>> list(filter(lambda x: re.match(r'^[^\w]+$', x, re.ASCII), slist))
I'm new to Python and found a couple of suggestions for finding the longest WORD in a string, but none which accounted for a string with a number of words which match the longest length.
After playing around, I settled on this:
inputsentence = raw_input("Write a sentence: ").split()
longestwords = []
for word in inputsentence:
    if len(word) == len(max(inputsentence, key=len)):
        longestwords.append(word)
That way I have a list of the longest words that I can do something with. Is there any better way of doing this?
NB: Assume inputsentence contains no integers or punctuation, just a series of words.
If you'll be doing this with short amounts of text only, there's no need to worry about runtime efficiency: Programming efficiency, in coding, reviewing and debugging, is far more important. So the solution you have is fine, since it's clear and sufficiently efficient for even thousands of words. (However, you ought to calculate len(max(inputsentence, key=len)) just once, before the for loop.)
But suppose you do want to do this with a large corpus, which might possibly be several gigabytes long? Here's how to do it in one pass, without ever storing every word in memory (note that inputcorpus might be an iterator or function that reads the corpus in stages): Save all the longest words only. If you see a word that's longer than the current maximum, it's obviously the first one at this length, so you can start a fresh list.
maxlength = 0
maxwords = []  # unnecessary: will be re-initialized below
for word in inputcorpus:
    if len(word) > maxlength:
        maxlength = len(word)
        maxwords = [word]
    elif len(word) == maxlength:
        maxwords.append(word)
If a certain word of maximal length repeats, you'll end up with several copies. To avoid that, just use set( ) instead of a list (and adjust initializing and extending).
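Concretely, the set variant might look like this (a sketch; only the container operations change):
maxlength = 0
maxwords = set()
for word in inputcorpus:
    if len(word) > maxlength:
        maxlength = len(word)
        maxwords = {word}  # start a fresh set at the new maximum length
    elif len(word) == maxlength:
        maxwords.add(word)  # duplicates are ignored automatically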
How about this:
from itertools import groupby as gb
inputsentence = raw_input("Write a sentence: ").split()
# sort longest first, group words of equal length, and take the first
# group, i.e. all words of maximal length
lwords = list(next(gb(sorted(inputsentence, key=len, reverse=True), key=len))[1])
Make it a defaultdict with the length as the key and adapt the following:
from collections import defaultdict
words = inputsentence.split()
dd = defaultdict(list)
for word in words:
    dd[len(word)].append(word)
key_by_len = sorted(dd)
print(dd[key_by_len[-1]])  # the largest length key holds the longest words
Hope this helps:
print max(raw_input().split(), key=len)
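Note that max() returns only the first of the longest words; to keep all ties in the same spirit, a short sketch:
words = raw_input("Write a sentence: ").split()
longest = max(len(w) for w in words)
print([w for w in words if len(w) == longest])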