Searching for duplicates and remove them - python

Sometimes I have a string like this:
string = "Hett, Agva,"
and sometimes it contains duplicates:
string = "Hett, Agva, Delf, Agva, Hett,"
How can I check whether my string has duplicates and, if it does, remove them?
UPDATE:
In the second string I need to remove Agva, and Hett, because each of them appears twice in the string.

Iterate over the parts (words), adding each part to a set of seen parts and to a list of parts if it is not already in that set. Finally, reconstruct the string:
seen = set()
parts = []
for part in string.split(','):
    if part.strip() not in seen:
        seen.add(part.strip())
        parts.append(part)
no_dups = ','.join(parts)
(Note that I had to add some calls to .strip() because some of the words have a leading space, which .strip() removes before the comparison.)
which gives:
'Hett, Agva, Delf,'
Why use a set?
Querying whether an element is in a set is O(1) in the average case, since elements are stored by hash, which makes lookup constant time. On the other hand, lookup in a list is O(n), as Python must iterate over the list until the element is found. A set is therefore much more efficient for this task: for each new word you can check immediately whether you have seen it before, whereas scanning a list of seen elements would take much longer for a large list.
To just check whether there are duplicates, compare the length of the split list with the length of the set of that list (converting to a set removes the duplicates but loses the order).
I.e.
def has_dups(string):
    # strip each part so that 'Hett' and ' Hett' compare equal
    parts = [part.strip() for part in string.split(',')]
    return len(parts) != len(set(parts))
which works as expected:
>>> has_dups('Hett, Agva,')
False
>>> has_dups('Hett, Agva, Delf, Agva, Hett,')
True
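As a compact variant of the approach above, here is a minimal sketch that relies on dict.fromkeys, which keeps the first occurrence of each key in insertion order (guaranteed since Python 3.7). The function name remove_dups is hypothetical, and unlike the loop above it normalizes the separator to ', ' and drops the trailing comma.

```python
def remove_dups(string):
    # split on commas and strip the surrounding spaces from each word
    parts = [part.strip() for part in string.split(',')]
    # dict.fromkeys keeps only the first occurrence of each word, in order;
    # the "if part" filter drops the empty piece left by a trailing comma
    unique = dict.fromkeys(part for part in parts if part)
    return ', '.join(unique)


print(remove_dups("Hett, Agva, Delf, Agva, Hett,"))  # Hett, Agva, Delf
```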

You can use toolz.unique, or equivalently the unique_everseen recipe in the itertools docs, or equivalently @JoeIddon's explicit solution.
Here's the solution using the third-party toolz library:
from toolz import unique
x = "Hett, Agva, Delf, Agva, Hett,"
res = ', '.join(filter(None, unique(x.replace(' ', '').split(','))))
print(res)
Hett, Agva, Delf
I've removed whitespace and used filter to clean up a trailing , which may not be required.

If you will only ever receive a string in this format, you can do the following:
import numpy as np
# split on commas, strip surrounding spaces, and drop the empty piece left by the trailing comma
string_words = [w.strip() for w in string.split(',') if w.strip()]
# np.unique returns the unique items in sorted order, so the original order is lost
uniq_words = np.unique(string_words)
string = ", ".join(uniq_words)
This code splits the words into a list, finds the unique items, and then merges them back into a string.

If the order of the words is important, you can make a list of the words in the string and then iterate over it to build a new list of unique words.
string = "Hett, Agva, Delf, Agva, Hett,"
words_list = string.split()
unique_words = []
for w in words_list:
    if w not in unique_words:
        unique_words.append(w)
new_string = ' '.join(unique_words)
print(new_string)
Output:
Hett, Agva, Delf,

Quick and easy approach (note that a set does not preserve the original order):
', '.join(
    set(
        filter(None, [i.strip() for i in string.split(',')])
    )
)
Hope it helps. Please feel free to ask if anything is not clear :)
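The UPDATE in the question can also be read as asking to drop every word that occurs more than once, rather than keeping one copy of each. Under that assumed reading, a minimal sketch with collections.Counter could look like this; the function name drop_repeated is hypothetical.

```python
from collections import Counter


def drop_repeated(string):
    # split on commas, strip spaces, and ignore empty pieces
    parts = [p.strip() for p in string.split(',') if p.strip()]
    counts = Counter(parts)
    # keep only the words that occur exactly once, in their original order
    return ', '.join(p for p in parts if counts[p] == 1)


print(drop_repeated("Hett, Agva, Delf, Agva, Hett,"))  # Delf
```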

Related

Find the word from the list given and replace the words so found

My question is pretty simple, but I haven't been able to find a proper solution.
Given below is my program:
given_list = ["Terms","I","want","to","remove","from","input_string"]
input_string = input("Enter String:")
if any(x in input_string for x in given_list):
    #Find the detected word
    #Not in bool format
    a = input_string.replace(detected_word,"")
    print("Some Task",a)
Here, given_list contains the terms I want to exclude from the input_string.
Now, the problem I am facing is that any() produces a bool result, whereas I need the word that any() detected so that I can replace it with a blank and then perform some task.
Edit: any() function is not required at all, look for useful solutions below.
Iterate over given_list and replace them:
for i in given_list:
    input_string = input_string.replace(i, "")
print("Some Task", input_string)
No need to detect at all:
for w in given_list:
    input_string = input_string.replace(w, "")
str.replace will not do anything if the word is not there and the substring test needed for the detection has to scan the string anyway.
The problem with finding each word and replacing it is that Python has to iterate over the whole string repeatedly. Another problem is that you will match substrings where you don't want to. For example, "to" is in the exclude list, so you'd end up changing "tomato" to "ma".
It seems that you want to replace whole words. Parsing is a whole new subject, but let's simplify. I'm just going to assume everything is lowercase with no punctuation, although that can be improved later. Let's use input_string.split() to iterate over whole words.
We want to replace some words with nothing, so let's just iterate over the input_string, and filter out the words we don't want, using the builtin function of the same name.
exclude_list = ["terms","i","want","to","remove","from","input_string"]
input_string = "one terms two i three want to remove"
keepers = filter(lambda w: w not in exclude_list, input_string.lower().split())
output_string = ' '.join(keepers)
print (output_string)
one two three
Note that we create an iterator that allows us to go through the whole input string just once. And instead of replacing words, we just basically skip the ones we don't want by having the iterator not return them.
Since filter requires a function for the boolean check on whether to include or exclude each word, we had to define one. I used "lambda" syntax to do that. You could just replace it with
def keep(word):
    return word not in exclude_list
keepers = filter(keep, input_string.lower().split())
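The whole-word idea above can also be sketched with a regular expression, which keeps the original casing and punctuation of the remaining text. This is only an illustration under assumptions: the function name remove_words is hypothetical, matching is made case-insensitive with re.IGNORECASE, and runs of whitespace left behind are collapsed to a single space.

```python
import re


def remove_words(text, words):
    # \b anchors the match at word boundaries, so "to" does not match inside "tomato"
    pattern = re.compile(
        r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b',
        re.IGNORECASE,
    )
    cleaned = pattern.sub('', text)
    # collapse the whitespace left behind by the removed words
    return ' '.join(cleaned.split())


print(remove_words("I want to eat a tomato", ["I", "want", "to"]))  # eat a tomato
```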
To answer your question about any, use an assignment expression (Python 3.8+).
if any((word := x) in input_string for x in given_list):
    # match captured in variable word
    a = input_string.replace(word, "")
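On Python versions without assignment expressions, a similar effect can be had with next() over a generator, which returns the first matching term or a default. The sketch below uses a hard-coded, hypothetical input string in place of input() so that it runs on its own.

```python
given_list = ["Terms", "I", "want", "to", "remove", "from", "input_string"]
input_string = "please remove this"  # hypothetical input instead of input()

# first listed term that occurs in the string, or None if there is none
detected_word = next((x for x in given_list if x in input_string), None)

if detected_word is not None:
    a = input_string.replace(detected_word, "")
    print("Some Task", a)
```

Like the any() version, this only finds the first matching term and still matches substrings rather than whole words.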

How to count occurrences of a word in a string in a way that still works with periods and endings

so I was recently working on this function here:
# counts owls
def owl_count(text):
    # sets all text to lowercase
    text = text.lower()
    # sets text to list
    text = text.split()
    # saves indices of owl in list
    indices = [i for i, x in enumerate(text) if x == "owl"]
    # counts occurrences of owl in text
    owl_count = len(indices)
    # returns owl count and indices
    return owl_count, indices
My goal was to count how many times "owl" occurs in the string and save the indices of it. The issue I kept running into was that it would not count "owls" or "owl." I tried splitting it into a list of individual characters but I couldn't find a way to search for three consecutive elements in the list. Do you guys have any ideas on what I could do here?
PS. I'm definitely a beginner programmer so this is probably a simple solution.
thanks!
If you don't want to use huge libraries like NLTK, you can select words that start with 'owl' instead of words equal to 'owl':
indices = [i for i, x in enumerate(text) if x.startswith("owl")]
In this case words like 'owlowlowl' will match too; to tokenize real-world text properly, one should use NLTK.
Python has built-in functions for this. This kind of string matching is done with regular expressions, which you can study in detail later.
import re
a_string = "your string"
substring = "substring that you want to check"
# re.escape makes the substring match literally
matches = re.finditer(re.escape(substring), a_string)
matches_positions = [match.start() for match in matches]
print(matches_positions)
finditer() returns an iterator of match objects, and start() returns the starting index of each match.
Simply put, this gives the indices of all occurrences of the substring in the string.
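To get word indices (as in the question) rather than character positions, one option is to extract the words with a regular expression first, which discards trailing punctuation. This is a rough sketch under assumptions: words are taken to be runs of the letters a-z, and only "owl" and "owls" are counted; adjust the set to taste.

```python
import re


def owl_count(text):
    # runs of letters only, so "owl." and "owls!" become "owl" and "owls"
    words = re.findall(r"[a-z]+", text.lower())
    # indices of the words that count as an owl (assumed: singular and plural)
    indices = [i for i, w in enumerate(words) if w in {"owl", "owls"}]
    return len(indices), indices


print(owl_count("An owl. Two owls! Not a bowl or owlet."))  # (2, [1, 3])
```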

Python: Replace single character with multiple values

I have a string with "?" as placeholders. I need to loop through the string and replace each "?" with the next value in a list.
For example:
my_str = "Take the source data from ? and pass to ?;"
params = ['temp_table', 'destination_table']
Here's what I've tried, which works:
params_counter = 0
for letter in my_str:
    if letter == '?':
        # Overwrite my_str with a new version, where the first ? is replaced
        my_str = my_str.replace('?', params[params_counter], 1)
        params_counter += 1
The problem is it seems rather slow to loop through every single letter in the string, multiple times. My example is a simple string but real situations could be strings that are much longer.
What are more elegant or efficient ways to accomplish this?
I saw this question, which is relatively similar, but they're using dictionaries and are replacing multiple values with multiple values, rather than my case which is replacing one value with multiple values.
You don't need to iterate over the letters; replace with a count of 1 replaces only the first occurrence of the string.
A clever solution is to split on the string you are searching for, zip the resulting list with a list of replacements of the same length, and then join:
list1 = my_str.split('?')
params = ['temp_table', 'destination_table']
zipped = zip(list1, params)
replaced = ''.join([elt for sublist in zipped for elt in sublist])
In [19]: replaced
Out[19]: 'Take the source data from temp_table and pass to destination_table'
You can also use multi-character strings, which is what would kill your method:
my_str = "Take the source data from magicword and pass to magicword;"
list1 = my_str.split('magicword')
params = ['temp_table', 'destination_table']
zipped = zip(list1, params)
replaced = ''.join([elt for sublist in zipped for elt in sublist])
In [25]: replaced
Out[25]: 'Take the source data from temp_table and pass to destination_table'
Note that if your params list is shorter than the number of occurrences of your search string, the result is cut off:
In []: my_str = "Take the source data from ? and pass to ?; then do again with ? and to ?"
Out[22]: 'Take the source data from temp_table and pass to destination_table'
Also note that the last piece after the final occurrence of the search string is dropped :(. Something like
replaced = ''.join([elt for sublist in zipped for elt in sublist] + [list1[-1]])
will do the trick.
While @E. Serra's answer is a great answer, the comment from @jasonharper made me realize there's a MUCH simpler answer. So incredibly simple that I'm surprised that I completely missed it!
Instead of looping through the string, I should loop through the parameters. That'll allow me to replace the first instance of "?" with the current parameter that I'm looking at. Then I overwrite the string, allowing it to execute correctly on the next iteration.
Unlike the other solution posted, it won't cut off the end of my string either.
my_str = "Take the source data from ? and pass to ?;"
params = ['temp_table', 'destination_table']
for item in params:
    my_str = my_str.replace('?', item, 1)
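A single-pass alternative, sketched here as one possible option rather than a recommendation from the answers above, is to let re.sub pull the next value from an iterator for each "?". The function name fill_placeholders is hypothetical. Placeholders beyond the end of the list are left as "?", and a "?" inside an inserted value is never replaced again because the string is scanned only once.

```python
import re


def fill_placeholders(template, params):
    values = iter(params)
    # for every "?", take the next parameter; if none are left, keep the "?"
    return re.sub(r"\?", lambda m: next(values, m.group()), template)


my_str = "Take the source data from ? and pass to ?;"
params = ['temp_table', 'destination_table']
print(fill_placeholders(my_str, params))
# Take the source data from temp_table and pass to destination_table;
```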

Splitting to words in a list of strings

I would like to do a stopword removal.
I have a list which consists of about 15,000 strings. Those strings are short texts. My code is the following:
h = []
for w in clean.split():
    if w not in cachedStopWords:
        h.append(w)
    if w in cachedStopWords:
        h.append(" ")
print(h)
I understand that .split() is necessary so that not every whole string is being compared to the list of stopwords. But it does not seem to work because it cannot split lists. (Without any kind of splitting h = clean, because nothing matches obviously.)
Does anyone have an idea how else I could split the different strings in the list while still preserving the different cases?
A very minimal example:
stops = {'remove', 'these', 'words'}
strings = ['please do not remove these words', 'removal is not cool', 'please please these are the bees\' knees', 'there are no stopwords here']
strings_cleaned = [' '.join(word for word in s.split() if word not in stops) for s in strings]
Or you could do:
strings_cleaned = []
for s in strings:
    word_list = []
    for word in s.split():
        if word not in stops:
            word_list.append(word)
    s_string = ' '.join(word_list)
    strings_cleaned.append(s_string)
This is a lot uglier (I think) than the one-liner before it, but perhaps more intuitive.
Make sure you're converting your container of stopwords to a set (a hash-based container whose lookups are O(1) on average, unlike lists, whose lookups are O(n)).
Edit: This is just a general, very straightforward example of how to remove stopwords. Your use case might be a little different, but since you haven't provided a sample of your data, we can't help any further.
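If "preserving the different cases" in the question means that words should keep their original capitalization while still being matched against lowercase stopwords, one possible sketch is to lowercase only for the comparison. That reading is an assumption, and the function name remove_stopwords is hypothetical.

```python
def remove_stopwords(strings, stops):
    # lowercase the stopwords once and store them in a set for fast lookup
    lowered = {s.lower() for s in stops}
    # compare case-insensitively, but keep each surviving word as written
    return [
        ' '.join(w for w in s.split() if w.lower() not in lowered)
        for s in strings
    ]


stops = {'remove', 'these', 'words'}
print(remove_stopwords(['Remove THESE words please', 'Nothing to drop'], stops))
# ['please', 'Nothing to drop']
```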

How can I make multiple replacements in a string using a dictionary?

Suppose we have:
d = {
    'Спорт':'Досуг',
    'russianA':'englishA'
}
s = 'Спорт russianA'
How can I replace each appearance within s of any of d's keys, with the corresponding value (in this case, the result would be 'Досуг englishA')?
Using re:
import re
s = 'Спорт not russianA'
d = {
    'Спорт':'Досуг',
    'russianA':'englishA'
}
keys = (re.escape(k) for k in d.keys())
pattern = re.compile(r'\b(' + '|'.join(keys) + r')\b')
result = pattern.sub(lambda x: d[x.group()], s)
# Output: 'Досуг not englishA'
This will match whole words only. If you don't need that, use the pattern:
pattern = re.compile('|'.join(re.escape(k) for k in d.keys()))
Note that in this case you should sort the words descending by length if some of your dictionary entries are substrings of others.
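The sorting advice above matters because regex alternation tries the alternatives from left to right. A minimal sketch with a hypothetical mapping, in which one key is a prefix of another:

```python
import re

# hypothetical mapping in which one key ('cat') is a prefix of another ('category')
mapping = {'cat': 'feline', 'category': 'class'}


def replace_all(text, mapping):
    # longest keys first, so 'category' is tried before 'cat'
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group()], text)


print(replace_all('cat category', mapping))  # feline class
```

Without the sort, the pattern 'cat|category' would match the 'cat' inside 'category' first and produce 'feline felineegory'.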
You could use the reduce function:
from functools import reduce
reduce(lambda x, y: x.replace(y, d[y]), d, s)
Solution found here (I like its simplicity):
def multipleReplace(text, wordDict):
    for key in wordDict:
        text = text.replace(key, wordDict[key])
    return text
One way, without re:
d = {
    'Спорт':'Досуг',
    'russianA':'englishA'
}
s = 'Спорт russianA'.split()
for n,i in enumerate(s):
    if i in d:
        s[n]=d[i]
print(' '.join(s))
Almost the same as ghostdog74, though independently created. One difference:
using d.get() instead of d[] can handle items not in the dict.
>>> d = {'a':'b', 'c':'d'}
>>> s = "a c x"
>>> foo = s.split()
>>> ret = []
>>> for item in foo:
...     ret.append(d.get(item, item))  # Try to get from dict, otherwise keep value
...
>>> " ".join(ret)
'b d x'
With the warning that it fails if a key contains a space, this is a compressed solution similar to ghostdog74's and extaneon's answers:
d = {
    'Спорт':'Досуг',
    'russianA':'englishA'
}
s = 'Спорт russianA'
' '.join(d.get(i,i) for i in s.split())
I used this in a similar situation (my string was all in uppercase):
def translate(string, wdict):
    for key in wdict:
        string = string.replace(key, wdict[key].lower())
    return string.upper()
hope that helps in some way... :)
Using regex
We can build a regular expression that matches any of the lookup dictionary's keys, by creating regexes to match each individual key and combine them with |. We use re.sub to do the substitution, by giving it a function to do the replacement (this function, of course, will do the dict lookup). Putting it together:
import re
# assuming global `d` and `s` as in the question
# a function that does the dict lookup with the global `d`.
def lookup(match):
    return d[match.group()]
# Make the regex.
joined = '|'.join(re.escape(key) for key in d.keys())
pattern = re.compile(joined)
result = pattern.sub(lookup, s)
Here, re.escape is used to escape any characters with special meaning in the keys (so that they don't interfere with building the regex, and are matched literally).
This regex pattern will match the substrings anywhere they appear, even if they are part of a word or span across multiple words. To avoid this, modify the regex so that it checks for word boundaries:
# pattern = re.compile(joined)
pattern = re.compile(rf'\b({joined})\b')
Using str.replace iteratively
Simply iterate over the .items() of the lookup dictionary, and call .replace with each. Since this method returns a new string, and does not (cannot) modify the string in place, we must reassign the results inside the loop:
for to_replace, replacement in d.items():
    s = s.replace(to_replace, replacement)
This approach is simple to write and easy to understand, but it comes with multiple caveats.
First, it has the disadvantage that it works sequentially, in a specific order. That is, each replacement has the potential to interfere with other replacements. Consider:
s = 'one two'
s = s.replace('one', 'two')
s = s.replace('two', 'three')
This will produce 'three three', not 'two three', because the 'two' from the first replacement will itself be replaced in the second step. This is normally not desirable; however, in the rare case when it should work this way, this approach is the only practical one.
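For contrast, a single-pass regex substitution (the technique from the regex section) never re-examines text it has already replaced. A minimal sketch using the same example, with the hypothetical name swaps for the mapping:

```python
import re

swaps = {'one': 'two', 'two': 'three'}

# one alternation over all keys; each position in the input is matched at most once
pattern = re.compile('|'.join(re.escape(key) for key in swaps))
result = pattern.sub(lambda match: swaps[match.group()], 'one two')

print(result)  # two three
```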
This approach also cannot easily be fixed to respect word boundaries, because it must match literal text, and a "word boundary" can be marked in multiple different ways - by varying kinds of whitespace, but also without text at the beginning and end of the string.
Finally, keep in mind that a dict is not an ideal data structure for this approach. If we will iterate over the dict, then its ability to do key lookup is useless; and in Python 3.5 and below, the order of dicts is not guaranteed (making the sequential replacement problem worse). Instead, it would be better to specify a list of tuples for the replacements:
d = [('Спорт', 'Досуг'), ('russianA', 'englishA')]
s = 'Спорт russianA'
for to_replace, replacement in d:  # no more `.items()` call
    s = s.replace(to_replace, replacement)
By tokenization
The problem becomes much simpler if the string is first cut into pieces (tokenized), in such a way that anything that should be replaced is now an exact match for a dict key. That would allow for using the dict's lookup directly, and processing the entire string in one go, while also not building a custom regex.
Suppose that we want to match complete words. We can use a simpler, hard-coded regex that will match whitespace, and which uses a capturing group; by passing this to re.split, we split the string into whitespace and non-whitespace sections. Thus:
import re
tokenizer = re.compile('([ \t\n]+)')
tokenized = tokenizer.split(s)
Now we look up each of the tokens in the dictionary: if present, it should be replaced with the corresponding value, and otherwise it should be left alone (equivalent to replacing it with itself). The dictionary .get method is a natural fit for this task. Finally, we join the pieces back up. Thus:
s = ''.join(d.get(token, token) for token in tokenized)
More generally, for example if the strings to replace could have spaces in them, a different tokenization rule will be needed. However, it will usually be possible to come up with a tokenization rule that is simpler than the regex from the first section (that matches all the keys by brute force).
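Putting the tokenization steps together, a self-contained sketch might look like the following. The function name replace_tokens is hypothetical, and \s+ is used as an assumed, slightly broader whitespace rule than the [ \t\n]+ class above.

```python
import re


def replace_tokens(text, mapping):
    # the capturing group makes re.split keep the whitespace runs as tokens,
    # so the original spacing is preserved when the pieces are joined again
    tokens = re.split(r'(\s+)', text)
    # replace tokens that are keys of the mapping; leave everything else alone
    return ''.join(mapping.get(token, token) for token in tokens)


d = {'Спорт': 'Досуг', 'russianA': 'englishA'}
print(replace_tokens('Спорт not russianA', d))  # Досуг not englishA
```

Note that a token such as 'russianA.' (with attached punctuation) would not match a key under this rule.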
Special case: replacing single characters
If the keys of the dict are all one character (technically, Unicode code point) each, there are more specific techniques that can be used. See Best way to replace multiple characters in a string? for details.
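One such technique, shown here as an assumption about what that question covers, is str.translate with a table built by str.maketrans. The mapping char_map is a hypothetical example.

```python
# hypothetical single-character mapping
char_map = {'a': '1', 'b': '2'}

# str.maketrans accepts a dict whose keys are single characters
table = str.maketrans(char_map)
result = 'abcab'.translate(table)

print(result)  # 12c12
```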
