How to remove duplicate chars in a string? - python

I've got this problem and I simply can't get it right. I have to remove duplicated chars from a string.
phrase = "oo rarato roeroeu aa rouroupa dodo rerei dde romroma"
The output should be: "O rato roeu a roupa do rei de roma"
I tried things like:
def remove_duplicates(value):
    var = ""
    for i in value:
        if i in value:
            if i in var:
                pass
            else:
                var = var + i
    return var
print(remove_duplicates(phrase))
But it's not there yet...
Any pointers to guide me here?

It seems from your example that you want to remove REPEATED SEQUENCES of characters, not duplicate chars across the whole string. So this is what I'm solving here.
You can use a regular expression. I'm not sure how inefficient it is, but it works.
>>> import re
>>> phrase = str("oo rarato roeroeu aa rouroupa dodo rerei dde romroma")
>>> re.sub(r'(.+?)\1+', r'\1', phrase)
'o rato roeu a roupa do rei de roma'
How this substitution proceeds down the string:
oo -> o
" " -> " "
rara -> ra
to -> to
" "-> " "
roeroe -> roe
etc..
Edit: Works for the other example string which should not be modified:
>>> phrase = str("Barbara Bebe com Bernardo")
>>> re.sub(r'(.+?)\1+', r'\1', phrase)
'Barbara Bebe com Bernardo'
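For reuse, the substitution can be wrapped in a small function. This is only a sketch: the names REPEAT and collapse_repeats are made up for illustration, and the pattern is the one used in this answer.

```python
import re

# Hypothetical helper names; the pattern is the one from the answer above.
REPEAT = re.compile(r'(.+?)\1+')

def collapse_repeats(text):
    """Replace each run of an immediately repeated substring with one copy."""
    return REPEAT.sub(r'\1', text)

print(collapse_repeats("oo rarato roeroeu aa rouroupa dodo rerei dde romroma"))
print(collapse_repeats("Barbara Bebe com Bernardo"))
```

Because the group is lazy, the shortest repeating unit is tried first at each position, which is why "rara" collapses to "ra" rather than the regex looking for a longer unit.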

What you can do is form a set out of the string and then sort the remaining letters according to their original order.
def remove_duplicates(word):
    unique_letters = set(word)
    sorted_letters = sorted(unique_letters, key=word.index)  # this will give you a list
    return ''.join(sorted_letters)
words = phrase.split(' ')
new_phrase = ' '.join(remove_duplicates(word) for word in words)

A string in Python behaves like a sequence of characters. Lists can contain duplicates, but sets cannot. So if we convert the characters to a set and then back to a string, we get a string without duplicates ;P
I've seen a suggestion to use regex for replacing patterns. This will work, but it is a slower and overcomplicated solution (and harder for humans to read).
Regex is a heavy and costly weapon.
Also, you do not want to remove duplicates from the whole string, but from each word in the string:
First, split your string into a list of words.
For each of the words, remove the duplicate letters.
Join the words back into a string.
phrase = "oo rarato roeroeu aa rouroupa dodo rerei dde romroma"
words = phrase.split(' ')
# words is now ['oo', 'rarato', 'roeroeu', 'aa', 'rouroupa', 'dodo', 'rerei', 'dde', 'romroma']
words_without_duplicates = []
for word in words:
    # a set has no defined order, so the letters can come out scrambled
    word = ''.join(set(word))
    words_without_duplicates.append(word)
phrase = ' '.join(words_without_duplicates)
# phrase is now e.g. 'o oatr oeur a auopr od eir ed oamr' (the letter order can vary between runs)
Of course, that can be optimized, but you wanted to be guided, so this is better to show the idea. It will be faster than regex too.
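Since a set does not keep the letters in their original order, an order-preserving variant of the same split / dedupe / join idea can be sketched with dict.fromkeys. The function names here are hypothetical, and the sketch assumes Python 3.7+, where dicts keep insertion order.

```python
def dedupe_word(word):
    # dict keys are unique and keep insertion order (Python 3.7+),
    # so the first occurrence of each letter is kept, in order
    return ''.join(dict.fromkeys(word))

def dedupe_phrase(phrase):
    # split into words, dedupe each word, join the words back together
    return ' '.join(dedupe_word(w) for w in phrase.split(' '))

print(dedupe_phrase("oo rarato roeroeu aa rouroupa dodo rerei dde romroma"))
```

Note that this removes every repeated letter in a word, not only repeated sequences, so "Barbara" would become "Barb".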

I added a space at the end of the string. After that, this works.
Code:
phrase = "oo rarato roeroeu aa rouroupa dodo rerei dde romroma "
print(phrase)
ch = ""
for i in phrase:
    if i == " ":
        # a space ends the current word: print it and start over
        print(ch)
        ch = ""
    elif i not in ch:
        ch = ch + i
Output:
o
rato
roeu
a
roupa
do
rei
de
roma

Related

Remove from a list of strings, those strings that have only empty spaces or that are made up of less than 3 alphanumeric characters

import re
sentences_list = ['Hay 5 objetos rojos sobre la mesada de ahí.', 'Debajo de la mesada hay 4 objetos', '', ' ', "\taa!", '\t\n \n', '\n ', 'ai\n ', 'Salto rapidamente!!!', 'y la vio volar', '!', ' aa', 'aa', 'día']
#The problem with this is that there are several cases that need to be eliminated
# and the complexity to figure that out should be resolved with a regex.
sentences_list = [i for a,i in enumerate(sentences_list) if i != ' ']
print(repr(sentences_list)) #print the already filtered list to verify
I got these strings with a sentence separator, the problem is that some sentences aren't really sentences or aren't really linguistically significant units.
Those strings that have less than 3 alphanumeric characters (that is, 2 characters or less) must be eliminated from the list.
Those strings that are empty "" or " " , or that are made up of single symbols "...!", ";", ".\n", "\taa!" must be eliminated from the list.
Those strings that have only escape characters and nothing else, except symbols or that have less than 3 alphanumeric characters, for example "\t\n ab ." , "\n .", "\n" must be eliminated from the list.
This is how the correct list should look after having filtered those elements that are substrings that do not meet the conditions
['Hay 5 objetos rojos sobre la mesada de ahí.', 'Debajo de la mesada hay 4 objetos', 'Salto rapidamente!!!', 'y la vio volar', 'día']
You can count the number of alphanumeric characters in a string by calling .isalnum() on each character and summing these values. Then you can keep only the strings that have at least 3 of these with a list comprehension.
sentences_list = [s for s in sentences_list if sum(c.isalnum() for c in s) >= 3]
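As a runnable check of that idea against the list from the question, here is a small sketch. The helper name has_min_alnum and its minimum parameter are assumptions introduced only for illustration.

```python
def has_min_alnum(s, minimum=3):
    # str.isalnum() is True for letters and digits, including accented letters
    return sum(c.isalnum() for c in s) >= minimum

sentences_list = ['Hay 5 objetos rojos sobre la mesada de ahí.',
                  'Debajo de la mesada hay 4 objetos', '', ' ', "\taa!",
                  '\t\n \n', '\n ', 'ai\n ', 'Salto rapidamente!!!',
                  'y la vio volar', '!', ' aa', 'aa', 'día']

filtered = [s for s in sentences_list if has_min_alnum(s)]
print(filtered)
```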
You could use this if clause:
if len(re.sub(r"\W", "", i)) >= 3
Here is a simple solution to filter out what you need:
from string import punctuation
for sentence in sentences_list:
    trimmed_sentence = sentence.strip().strip(punctuation)
    if len(trimmed_sentence) > 2:
        print(sentence)
Basically, it goes through all the sentences in the list, trims the whitespace (spaces, newlines) and then the punctuation marks from both ends, and keeps the sentence only if what remains is longer than 2 characters.
See the Python documentation for str.strip() to learn more.

how to remove instances and possible multiple instances of a certain word in a string and return a string (CODEWARS dubstep)

I have had a go at the CODEWARS dubstep challenge using python.
My code is below, it works and I pass the kata test. However, it took me a long time and I ended up using a brute force approach (newbie).
(basically replacing and stripping the string until it worked)
Any ideas with comments on how my code could be improved please?
TASK SUMMARY:
Let's assume that a song consists of some number of words (that don't contain WUB). To make the dubstep remix of this song, Polycarpus inserts a certain number of words "WUB" before the first word of the song (the number may be zero), after the last word (the number may be zero), and between words (at least one between any pair of neighbouring words), and then the boy glues together all the words, including "WUB", in one string and plays the song at the club.
For example, a song with words "I AM X" can transform into a dubstep remix as "WUBWUBIWUBAMWUBWUBX" and cannot transform into "WUBWUBIAMWUBX".
song_decoder("WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB")
# => WE ARE THE CHAMPIONS MY FRIEND
song_decoder("AWUBBWUBC"), "A B C","WUB should be replaced by 1 space"
song_decoder("AWUBWUBWUBBWUBWUBWUBC"), "A B C","multiples WUB should be replaced by only 1 space"
song_decoder("WUBAWUBBWUBCWUB"), "A B C","heading or trailing spaces should be removed"
Thanks in advance, (I am new to stackoverflow also)
MY CODE:
def song_decoder(song):
    new_song = song.replace("WUB", " ")
    new_song2 = new_song.strip()
    new_song3 = new_song2.replace("  ", " ")  # collapse double spaces
    new_song4 = new_song3.replace("  ", " ")  # and once more for what is left
    return new_song4
I don't know if this improves it, but I would use split and join:
text = 'WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB'
text = text.replace("WUB", " ")
print(text)
words = text.split()
print(words)
text = " ".join(words)
print(text)
Result
WE ARE THE CHAMPIONS MY FRIEND
['WE', 'ARE', 'THE', 'CHAMPIONS', 'MY', 'FRIEND']
WE ARE THE CHAMPIONS MY FRIEND
EDIT:
A slightly different version. I split on WUB, but that creates empty elements between two consecutive WUBs, so they need to be removed:
text = 'WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB'
words = text.split("WUB")
print(words)
words = [x for x in words if x] # remove empty elements
#words = list(filter(None, words)) # remove empty elements
print(words)
text = " ".join(words)
print(text)
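Put together as the function the kata asks for, the split-and-join approach could look like the sketch below. It relies on str.split() with no arguments, which splits on any run of whitespace and drops leading and trailing empties.

```python
def song_decoder(song):
    # turn every WUB into a space, then let split() swallow the extra spaces
    return " ".join(song.replace("WUB", " ").split())

print(song_decoder("WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB"))
```

Unlike repeated replace calls, this does not depend on how many WUBs appear in a row.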

Using .replace effectively on text

I'm attempting to capitalize all words in a section of text that only appear once. I have the bit that finds which words only appear once down, but when I go to replace the original word with the .upper version, a bunch of other stuff gets capitalized too. It's a small program, so here's the code.
from collections import Counter
from string import punctuation
path = input("Path to file: ")
with open(path) as f:
    word_counts = Counter(word.strip(punctuation) for line in f for word in line.replace(")", " ").replace("(", " ")
                          .replace(":", " ").replace("", " ").split())
wordlist = open(path).read().replace("\n", " ").replace(")", " ").replace("(", " ").replace("", " ")
unique = [word for word, count in word_counts.items() if count == 1]
for word in unique:
    print(word)
    wordlist = wordlist.replace(word, str(word.upper()))
print(wordlist)
The output should be "Genesis 37:1 Jacob lived in the land of his father's SOJOURNINGS, in the land of Canaan.", as sojournings is the first word that only appears once. Instead, it outputs "GenesIs 37:1 Jacob lIved In the land of hIs FATher's SOJOURNINGS, In the land of Canaan." Because some of the unique words also occur as substrings of other words, those parts get capitalized as well.
Any ideas?
I rewrote the code pretty significantly since some of the chained replace calls might prove to be unreliable.
import string
# The sentence.
sentence = "Genesis 37:1 Jacob lived in the land of his father's SOJOURNINGS, in the land of Canaan."
# Remove punctuation (Python 3 form of str.translate).
rm_punc = sentence.translate(str.maketrans('', '', string.punctuation))
words = rm_punc.split(' ')  # split on spaces to get a list of words
# Find all unique word occurrences.
single_occurrences = []
for word in words:
    # if word only occurs 1 time, append it to the list
    if words.count(word) == 1:
        single_occurrences.append(word)
# For each unique word, find its index and capitalize the letter at that index
# in the initial string (the letter at that index is also the first letter of
# the word). Note that strings are immutable, so we are actually creating a new
# string on each iteration. Also, sometimes small words occur inside of other
# words, e.g. 'an' inside of 'land'. In order to make sure that our call to
# `index()` doesn't find these small words, we keep track of `start`, which
# makes sure we only ever search from the end of the previously found word.
start = 0
for word in single_occurrences:
    try:
        word_idx = start + sentence[start:].index(word)
    except ValueError:
        # Could not find word in sentence. Skip it.
        pass
    else:
        # Update counter.
        start = word_idx + len(word)
        # Rebuild sentence with capitalization.
        first_letter = sentence[word_idx].upper()
        sentence = sentence[:word_idx] + first_letter + sentence[word_idx+1:]
print(sentence)
Text replacement by patterns calls for regex.
Your text is a bit tricky; you have to
remove digits
remove punctuation
split into words
care about capitalisation: 'It's' vs 'it's'
only replace full matches: 'remote' vs 'mote' when replacing mote
etc.
This should do it; see the comments inside for explanations.
bible.txt is from your link.
from collections import Counter
from string import punctuation, digits
import re
from collections import defaultdict
with open(r"SO\AllThingsPython\P4\bible.txt") as f:
    s = f.read()
# get a set of unwanted characters and clean the text
ps = set(punctuation + digits)
s2 = ''.join(c for c in s if c not in ps)
# split into words
s3 = s2.split()
# create a set of all capitalizations of each word
repl = defaultdict(set)
for word in s3:
    repl[word.upper()].add(word)  # e.g. {..., 'IN': {'In', 'in'}, 'THE': {'The', 'the'}, ...}
# count all words _upper case_ and use those that only occur once
single_occurence_upper_words = [w for w, n in Counter(w.upper() for w in s3).most_common() if n == 1]
text = s
# now the replace part - for all upper single words
for upp in single_occurence_upper_words:
    # for all occurring capitalizations in the text
    for orig in repl[upp]:
        # use regex replace to find the original word from our repl dict with
        # space/punctuation before/after it and replace it with the uppercase word
        text = re.sub(f"(?<=[{punctuation} ])({orig})(?=[{punctuation} ])", upp, text)
print(text)
Output (shortened):
Genesis 37:1 Jacob lived in the land of his father's SOJOURNINGS, in the land of Canaan.
2 These are the GENERATIONS of Jacob.
Joseph, being seventeen years old, was pasturing the flock with his brothers. He was a boy with the sons of Bilhah and Zilpah, his father's wives. And Joseph brought a BAD report of them to their father. 3 Now Israel loved Joseph more than any other of his sons, because he was the son of his old age. And he made him a robe of many colors. [a] 4 But when his brothers saw that their father loved him more than all his brothers, they hated him
and could not speak PEACEFULLY to him.
<snipp>
The regex uses lookahead '(?=...)' and lookbehind '(?<=...)' syntax to make sure we replace only full words; see the regex syntax documentation.
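A more compact way to get whole-word replacement is a single re.sub pass with a replacement function. This is only a sketch under stated assumptions: the name upper_single_words is hypothetical, a "word" is assumed to be a run of ASCII letters and apostrophes, and words are counted case-insensitively.

```python
import re
from collections import Counter

# Assumed definition of a word: ASCII letters and apostrophes only.
WORD = re.compile(r"[A-Za-z']+")

def upper_single_words(text):
    # count every word once, ignoring case
    counts = Counter(w.lower() for w in WORD.findall(text))

    def swap(match):
        word = match.group(0)
        # uppercase only the words that occur exactly once in the text
        return word.upper() if counts[word.lower()] == 1 else word

    # each match is a whole word, so substrings of other words are never touched
    return WORD.sub(swap, text)

print(upper_single_words("in the inn, the inn"))
```

Here "in" is unique and gets uppercased, while the "in" inside "inn" is left alone, which is the failure a chained str.replace runs into.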

Getting list of string array into separate string arrays in python

This is my code.
SENTENCE = "He sad might have lung cancer. It’s just a rumor."
sent=(sent_tokenize(SENTENCE))
The output is
['He sad might have lung cancer.', 'It’s just a rumor.']
I want to get this array as
['He sad might have lung cancer.']
['It’s just a rumor.']
Is there any way of doing this, and if so, how?
Since you want to split on sentence boundaries, you can simply do this:
for sentence in SENTENCE.split('.'):
    sentence = sentence.strip()
    if sentence:  # skip the empty piece after the final '.'
        single_sentence = [sentence + '.']
If you actually want all the lists containing a single sentence in the same data structure, you'd have to use a list of lists or a dictionary:
my_sentences = []
for sentence in SENTENCE.split('.'):
    sentence = sentence.strip()
    if sentence:
        my_sentences.append([sentence + '.'])
To shorten this using a list comprehension:
my_sentences = [[s.strip() + '.'] for s in SENTENCE.split('.') if s.strip()]
The only drawback is that strip() is called twice per piece here, so it does slightly more work on a massive number of sentences.
The solution using re.split() function:
import re
s = "He sad might have lung cancer. It’s just a rumor."
parts = [l if l[-1] == '.' else l + '.' for l in re.split(r'\.\s?(?!$)', s)]
print(parts)
The output:
['He sad might have lung cancer.', 'It’s just a rumor.']
The pattern r'\.\s?(?!$)' defines the separator as a . (plus an optional whitespace character), except for the one at the end of the text, which (?!$) excludes.
l if l[-1] == '.' else l + '.' restores the . at the end of each part (the delimiter is not captured while splitting).
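If the sentences have already been split (for example by sent_tokenize, as in the question), wrapping each one in its own list needs no further splitting. A minimal sketch, starting from the list shown in the question rather than calling NLTK:

```python
# The list below is the tokenizer output quoted in the question.
sentences = ['He sad might have lung cancer.', 'It’s just a rumor.']

# wrap every sentence in its own single-element list
separate = [[s] for s in sentences]

for item in separate:
    print(item)
```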

Find Pattern in Textfile From Several Elements In Several Lists?

I am a beginner, been learning python for a few months as my very first programming language. I am looking to find a pattern from a text file. My first attempt has been using regex, which does work but has a limitation:
import re
noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
CC_list = ['and', 'or']
noun_list_pattern1 = r'\b\w+\b,\s\b\w+\b,\sand\s\b\w+\b|\b\w+\b,\s\b\w+\b,\sor\s\b\w+\b|\b\w+\b,\s\b\w+\b\sand\s\b\w+\b|\b\w+\b,\s\b\w+\b\sor\s\b\w+\b'
with open('test_sentence.txt', 'r') as input_f:
    read_input = input_f.read()
word = re.findall(noun_list_pattern1, read_input)
for w in word:
    print(w)
So at this point you may be asking why the lists are in this code, since they are not being used. Well, I have been racking my brains, trying all sorts of for loops and if statements in functions to find a way to replicate the regex pattern using the lists.
The limitation with regex is that the \b\w+\b code, which appears a number of times in noun_list_pattern1, actually only finds words - any words - but not specific nouns. This could raise false positives. I want to narrow things down more by using the elements in the lists above instead of the regex.
Since I actually have 4 different regexes in the pattern (the alternatives are separated by |), I will just go with 1 of them here. So I would need to find a pattern such as:
'noun in noun_list' + ', ' + 'noun in noun_list' + ', ' + 'C in CC_list' + ' ' + 'noun in noun_list'
Obviously, the line quoted above is not real Python code, but an expression of my thoughts about the match needed. Where I say noun in noun_list I mean an iteration through noun_list; C in CC_list is an iteration through CC_list; ', ' is a literal string match for a comma and whitespace.
Hopefully I have made myself clear!
Here is the content of the test_sentence.txt file that I am using:
I need to buy are bacon, cheese and eggs.
I also need to buy milk, cheese, and bacon.
What's your favorite: milk, cheese or eggs.
What's my favorite: milk, bacon, or eggs.
Break your problem down a little. First, you need a pattern that will match the words from your list, but no other. You can accomplish that with the alternation operator | and the literal words. red|green|blue, for example, will match "red", "green", or "blue", but not "purple". Join the noun list with that character, and add the word boundary metacharacters along with parentheses to group the alternations:
noun_patt = r'\b(' + '|'.join(nouns) + r')\b'
Do the same for your list of conjunctions:
conj_patt = r'\b(' + '|'.join(conjunctions) + r')\b'
The overall match you want to make is "one or more noun_patt match, each optionally followed by a comma, followed by a match for the conj_patt and then one more noun_patt match". Easy enough for a regex:
patt = r'({0},? )+{1} {0}'.format(noun_patt, conj_patt)
You don't really want to use re.findall() here, but re.search(), since you're only expecting one match per line:
>>> for line in lines:
...     print(re.search(patt, line).group(0))
...
bacon, cheese and eggs
milk, cheese, and bacon
milk, cheese or eggs
milk, bacon, or eggs
As a note, you're close to, if not rubbing up against, the limits of regular expressions as far as parsing English. Any more complex than this, and you will want to look into actual parsing, perhaps with NLTK.
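Assembled into one runnable piece, the steps of this answer might look like the sketch below. It uses the list names from the question (noun_list, CC_list), hard-codes the four test lines instead of reading the file, and adds re.escape and non-capturing groups as extra precautions that are not part of the original answer.

```python
import re

noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
CC_list = ['and', 'or']

# alternation of the literal words, wrapped in word boundaries
noun_patt = r'\b(?:' + '|'.join(map(re.escape, noun_list)) + r')\b'
conj_patt = r'\b(?:' + '|'.join(map(re.escape, CC_list)) + r')\b'

# one or more "noun[,] " parts, then a conjunction, then a final noun
patt = re.compile(r'(?:{0},? )+{1} {0}'.format(noun_patt, conj_patt))

lines = [
    "I need to buy are bacon, cheese and eggs.",
    "I also need to buy milk, cheese, and bacon.",
    "What's your favorite: milk, cheese or eggs.",
    "What's my favorite: milk, bacon, or eggs.",
]

found = []
for line in lines:
    m = patt.search(line)
    if m:  # skip lines without a match instead of failing on None
        found.append(m.group(0))

print(found)
```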
In actuality, you don't necessarily need regular expressions, as there are a number of ways to do this using just your original lists.
noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
conjunctions = ['and', 'or']
# This assumes that the file has been read into a list of newline-delimited lines called `rawlines`
for line in rawlines:
    matches = [noun for noun in noun_list if noun in line] + [conj for conj in conjunctions if conj in line]
    if len(matches) == 4:
        for match in matches:
            print(match)
The match count is 4 because each target line contains 3 nouns plus 1 conjunction. (Note that this could also be the case for repeated nouns or conjunctions.)
EDIT:
This version prints the lines that are matched and the words matched. Also fixed the possible multiple word match problem:
words_matched = []
matching_lines = []
for l in lst:
    matches = [noun for noun in noun_list if noun in l] + [conj for conj in conjunctions if conj in l]
    invalid = True
    valid_count = 0
    for match in matches:
        if matches.count(match) == 1:
            valid_count += 1
    if valid_count == len(matches):
        invalid = False
    if not invalid:
        words_matched.append(matches)
        matching_lines.append(l)
for line, matches in zip(matching_lines, words_matched):
    print(line, matches)
However, if this doesn't suit you, you can always build the regex as follows (using the itertools module):
import itertools
# The number of nouns per permutation is 3 (as seen in your examples)
for nouns, conj in itertools.product(itertools.permutations(noun_list, 3), conjunctions):
    matches = [noun for noun in nouns]
    matches.append(conj)
    # matches[:2] holds the first 2 nouns, matches[-1] is the conjunction, and matches[2:-1] holds the remaining noun(s) (with more than 3 nouns, these would be the elements between the 2nd and the last).
    regex_string = r',\s'.join(matches[:2]) + r'\s' + matches[-1] + r'\s' + r',\s'.join(matches[2:-1])
    print(regex_string)
    # ... do regex related matching here
The caveat of this method is that it is pure brute force: it generates every permutation of the nouns combined with every conjunction, and each resulting pattern then has to be tested against each line. Hence it is horrendously slow, but for examples like the ones given without a comma before the conjunction, it generates exact matches.
Adapt as required.
