Replacing characters with some restrictions - python

I have a list of strings in Python and I want to create a function that performs many replacements. Some of them are:
Replace "o agrícola" with "trabajo agrícola"
Replace "ob agricola" with "trabajo agrícola"
One solution that comes to my mind is:
text = str(text).replace('o agricola','trabajo agrícola')
text = str(text).replace('t agricola','trabajo agrícola')
One of the issues:
text = 'obrero agricola' is transformed into 'obrertrabajo agrícola'
Is there a solution that respects these two conditions?
'obrero agricola' is mapped to 'obrero agrícola'
'o something' is mapped to 'o something' (a standalone 'o' followed by some other word is left unchanged)

Use word boundaries along with re.sub:
import re
text = 'ob agricola and obrero agricola'
text = re.sub(r'\b(?:o agrícola|ob agricola)', 'trabajo agrícola', text)
print(text) # trabajo agrícola and obrero agricola
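If there are many replacements, the same word-boundary idea can be driven from a dictionary. This is only a sketch: the `replacements` mapping and the `fix` helper are made-up names, and the extra 'agricola' entry is an assumption added to cover the 'obrero agricola' case from the question.

```python
import re

# Hypothetical mapping of fragments to replacements; extend as needed.
replacements = {
    'o agricola': 'trabajo agrícola',
    'ob agricola': 'trabajo agrícola',
    'agricola': 'agrícola',  # assumption: also fix the missing accent on its own
}

# Try longer keys first so 'ob agricola' wins over the bare 'agricola'.
keys = sorted(replacements, key=len, reverse=True)
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, keys)) + r')\b')

def fix(text):
    # Look up the replacement for whichever key actually matched.
    return pattern.sub(lambda m: replacements[m.group(0)], text)

print(fix('ob agricola and obrero agricola'))  # trabajo agrícola and obrero agrícola
print(fix('o something'))                      # o something
```

Because of the leading \b, the final 'o' of 'obrero' never starts a match, so only whole fragments are replaced.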

Related

How to remove duplicate chars in a string?

I've got this problem and I simply can't get it right. I have to remove duplicated chars from a string.
phrase = "oo rarato roeroeu aa rouroupa dodo rerei dde romroma"
The output should be: "O rato roeu a roupa do rei de roma"
I tried things like:
def remove_duplicates(value):
    var=""
    for i in value:
        if i in value:
            if i in var:
                pass
            else:
                var=var+i
    return var
print(remove_duplicates(phrase))
But it's not there yet...
Any pointers to guide me here?
It seems from your example that you want to remove REPEATED SEQUENCES of characters, not duplicate chars across the whole string. So this is what I'm solving here.
You can use a regular expression. I'm not sure how inefficient it is, but it works.
>>> import re
>>> phrase = str("oo rarato roeroeu aa rouroupa dodo rerei dde romroma")
>>> re.sub(r'(.+?)\1+', r'\1', phrase)
'o rato roeu a roupa do rei de roma'
How this substitution proceeds down the string:
oo -> o
" " -> " "
rara -> ra
to -> to
" "-> " "
roeroe -> roe
etc..
Edit: Works for the other example string which should not be modified:
>>> phrase = str("Barbara Bebe com Bernardo")
>>> re.sub(r'(.+?)\1+', r'\1', phrase)
'Barbara Bebe com Bernardo'
What you can do is form a set out of the string and then sort the remaining letters according to their original order.
def remove_duplicates(word):
    unique_letters = set(word)
    sorted_letters = sorted(unique_letters, key=word.index) # this will give you a list
    return ''.join(sorted_letters)
words = phrase.split(' ')
new_phrase = ' '.join(remove_duplicates(word) for word in words)
A string in Python is a sequence of characters, right? Sequences can contain duplicates, but sets cannot. So if we convert the characters to a set and then back to a list, we get a list without duplicates ;P
I've seen a suggestion to use a regex for replacing patterns. That will work, but it will be a slow and overcomplicated solution (and hard for humans to read).
Regex is a heavy and costly weapon.
Also, you are not removing duplicates from the whole string, but from each word in the string:
First, split your string into a list of words.
For each of the words, remove duplicate letters.
Join the words back into a string.
phrase = "oo rarato roeroeu aa rouroupa dodo rerei dde romroma"
words = phrase.split(' ')
print(words) # ['oo', 'rarato', 'roeroeu', 'aa', 'rouroupa', 'dodo', 'rerei', 'dde', 'romroma']
words_without_duplicates = []
for word in words:
    # a set drops repeated letters, but it does not keep their order
    word = ''.join(set(word))
    words_without_duplicates.append(word)
phrase = ' '.join(words_without_duplicates)
print(phrase) # e.g. 'o oatr oeur a auopr od eir ed oamr' (letter order varies, since sets are unordered)
Of course, that can be optimized, but you wanted to be guided, so this is better for showing the idea. It will be faster than regex too.
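The set-based loop loses the order of the letters, which is why its output looks scrambled. One possible order-preserving variant of the same split / dedupe / join idea is sketched below; `dedupe_letters` is a made-up name, and it relies on dicts keeping insertion order (Python 3.7+).

```python
phrase = "oo rarato roeroeu aa rouroupa dodo rerei dde romroma"

def dedupe_letters(word):
    # dict keys are unique and keep insertion order,
    # so each letter stays at the position of its first occurrence.
    return ''.join(dict.fromkeys(word))

result = ' '.join(dedupe_letters(word) for word in phrase.split(' '))
print(result)  # o rato roeu a roupa do rei de roma
```

Note that this drops every repeated letter within a word, not just repeated sequences, so a word like 'Barbara' would become 'Barb'. Whether that is acceptable depends on the real input.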
I added a space at the end of the string so that the last word is flushed too. After that, this works:
Code
phrase =("oo rarato roeroeu aa rouroupa dodo rerei dde romroma ")
ch=""
for i in phrase:
    if i ==" ":
        # end of a word: print what was collected and start over
        print(ch)
        ch=""
    elif i not in ch:
        ch=ch+i
Output
o
rato
roeu
a
roupa
do
rei
de
roma

Python find text in string

I have the following string for which I want to extract data:
text_example = '\nExample text \nTECHNICAL PARTICULARS\nLength oa: ...............189.9m\nLength bp: ........176m\nBreadth moulded: .......26.4m\nDepth moulded to main deck: ....9.2m\n'
Every variable I want to extract starts with \n
The value I want comes after a colon ':' followed by more than one dot
When the value is not preceded by a colon followed by dots, I don't want to extract it.
For example my preferred output looks like:
LOA = 189.9
LBP = 176.0
BM = 26.4
DM = 9.2
import re
text_example = '\nExample text \nTECHNICAL PARTICULARS\nLength oa: ...............189.9m\nLength bp: ........176m\nBreadth moulded: .......26.4m\nDepth moulded to main deck: ....9.2m\n'
# capture all the characters BEFORE the ':' character
variables = re.findall(r'(.*?):', text_example)
# matches all floats and integers (does not account for minus signs)
values = re.findall(r'(\d+(?:\.\d+)?)', text_example)
# zip into a dictionary (this assumes both regexes return the same number of results)
result = dict(zip(variables, values))
print(result)
--> {'Length oa': '189.9', 'Breadth moulded': '26.4', 'Length bp': '176', 'Depth moulded to main deck': '9.2'}
You can also build a single regex as a workaround:
re.findall(r'(\\n|\n)([A-Za-z\s]*)(?:(\:\s*\.+))(\d*\.*\d*)',text_example)[2]
('\n', 'Breadth moulded', ': .......', '26.4')
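To get label/value pairs directly, and only for lines where the colon is followed by at least two dots, something like the following might work. It is a sketch under assumptions: the name `measurements` is made up, the labels are kept as they appear in the text (not shortened to LOA, LBP, ...), and negative numbers are not handled.

```python
import re

text_example = (
    '\nExample text \nTECHNICAL PARTICULARS\n'
    'Length oa: ...............189.9m\n'
    'Length bp: ........176m\n'
    'Breadth moulded: .......26.4m\n'
    'Depth moulded to main deck: ....9.2m\n'
)

# Start of a line, a label without ':' or newline, a colon,
# optional spaces, two or more dots, then an integer or decimal number.
pattern = re.compile(r'^([^:\n]+):\s*\.{2,}\s*(\d+(?:\.\d+)?)', re.MULTILINE)

measurements = {
    label.strip(): float(value)
    for label, value in pattern.findall(text_example)
}
print(measurements)
# {'Length oa': 189.9, 'Length bp': 176.0, 'Breadth moulded': 26.4, 'Depth moulded to main deck': 9.2}
```

Matching label and value in one pattern avoids relying on two separate findall calls returning lists of the same length.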

Replace space between indicated phrases with underscore

I've a text file where important phrases are indicated with special symbols. To be exact, they will start with <highlight> and end with <\highlight>.
For example,
"<highlight>machine learning<\highlight> is gaining more popularity, so do <highlight>block chain<\highlight>."
In this sentence, important phrases are segmented by <highlight> and <\highlight>.
I need to remove the <highlight> and <\highlight>, and replace the space connecting words surrounded by them with underscore. Namely, convert "<highlight>machine learning<\highlight>" to "machine_learning". The whole sentence after processing will be "machine_learning is gaining more popularity, so do block_chain".
Try this:
>>> import re
>>> text = "<highlight>machine learning<\\highlight> is gaining more popularity, so do <highlight>block chain<\\highlight>."
>>> re.sub(r"<highlight>(.*?)<\\highlight>", lambda x: x.group(1).replace(" ", "_"), text)
'machine_learning is gaining more popularity, so do block_chain.'
There you go:
import re
txt = "<highlight>machine learning<\\highlight> is gaining more popularity, so do <highlight>block chain<\\highlight>."
# raw string: \\ in the pattern matches one literal backslash
words = re.findall(r'<highlight>(.*?)<\\highlight', txt)
for w in words:
    txt = txt.replace(w, w.replace(' ', '_'))
txt = txt.replace('<highlight>', '')
txt = txt.replace('<\\highlight>', '')
print(txt)

Tagging words based on a dictionary/list in Python

I have the following dictionary of gene names:
gene_dict = {"repA1":1, "leuB":1}
# the actual dictionary is longer, around ~30K entries.
# or in list format
# gene_list = ["repA1", "leuB"]
Given any sentence, I want to search for the terms that are contained in the above dictionary and tag them.
For example given this sentence:
mytext = "xxxxx repA1 yyyy REPA1 zzz."
It will be then tagged as:
xxxxx <GENE>repA1</GENE> yyyy <GENE>REPA1</GENE> zzz.
Is there any efficient way to do that? In practice we would process a couple of million sentences.
If your "gene_list" is not really-really-really long, you could use a compiled regular expression, like
import re
gene_list = ["repA1", "leuB"]
regexp = re.compile('|'.join(gene_list), flags=re.IGNORECASE)
result = re.sub(regexp, r'<GENE>\g<0></GENE>', 'xxxxx repA1 yyyy REPA1 zzz.')
and run it in a loop over all your sentences. I think this should be quite fast.
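A slightly more defensive variant of the compiled-regex idea is sketched here. It escapes the gene names and adds word boundaries so that partial words are not tagged; `tag_genes` is a made-up helper name, and how well a 30K-way alternation performs is something you would have to measure.

```python
import re

gene_list = ["repA1", "leuB"]

# Escape each name and try longer names first, so a name that is a
# prefix of another one does not shadow it.
alternation = "|".join(
    re.escape(gene) for gene in sorted(gene_list, key=len, reverse=True)
)
gene_re = re.compile(rf"\b(?:{alternation})\b", flags=re.IGNORECASE)

def tag_genes(sentence):
    # \g<0> re-inserts the matched text with its original casing.
    return gene_re.sub(r"<GENE>\g<0></GENE>", sentence)

print(tag_genes("xxxxx repA1 yyyy REPA1 zzz."))
# xxxxx <GENE>repA1</GENE> yyyy <GENE>REPA1</GENE> zzz.
```

Compiling once and reusing `gene_re` for every sentence keeps the per-sentence cost down to the substitution itself.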
If most of the sentences are short and separated by single spaces, something like:
gene_dict = {"repA1":1, "leuB":1}
format_gene = "<GENE>{}</GENE>".format
mytext = " ".join(format_gene(word) if word in gene_dict else word for word in mytext.split())
is going to be faster.
For slightly longer sentences or sentences you cannot reform with " ".join it might be more efficient or more correct to use several .replaces:
gene_dict = {"repA1":1, "leuB":1}
genes = set(gene_dict)
format_gene = "<GENE>{}</GENE>".format
to_replace = genes.intersection(mytext.split())
for gene in to_replace:
    mytext = mytext.replace(gene, format_gene(gene))
Each of these assumes that splitting the sentences will not take excessive time, which is fair assuming gene_dict is much longer than the sentences.
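The split-based versions above compare words case-sensitively, so "REPA1" would not be tagged. A possible case-insensitive variant of the lookup idea is sketched below; `genes_lower` and `tag_tokens` are made-up names, and it assumes gene names consist only of word characters (letters, digits, underscore).

```python
import re

gene_dict = {"repA1": 1, "leuB": 1}

# Lowercased lookup set, built once and reused for every sentence.
genes_lower = {gene.lower() for gene in gene_dict}

def tag_tokens(sentence):
    # The capture group makes re.split keep the words themselves,
    # so punctuation and spacing survive the round trip unchanged.
    parts = re.split(r"(\w+)", sentence)
    return "".join(
        "<GENE>{}</GENE>".format(part) if part.lower() in genes_lower else part
        for part in parts
    )

print(tag_tokens("xxxxx repA1 yyyy REPA1 zzz."))
# xxxxx <GENE>repA1</GENE> yyyy <GENE>REPA1</GENE> zzz.
```

Each token costs one set lookup, so the work per sentence does not grow with the size of the gene dictionary.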

Identifying strings which can't be spelt in a list item

I have a list
['mPXSz0qd6j0 youtube ', 'lBz5XJRLHQM youtube ', 'search OpHQOO-DwlQ ',
'sachin 47427243 ', 'alex smith ', 'birthday JEaM8Lg9oK4 ',
'nebula 8x41n9thAU8 ', 'chuck norris ',
'searcher O6tUtqPcHDw ', 'graham wXqsg59z7m0 ', 'queries K70QnTfGjoM ']
Is there some way to identify the strings which can't be spelt in the list item and remove them?
You can use, e.g. PyEnchant for basic dictionary checking and NLTK to take minor spelling issues into account, like this:
import enchant
import nltk
spell_dict = enchant.Dict('en_US') # or whatever language supported
def get_distance_limit(w):
    '''
    The word is considered good
    if it's no further from a known word than this limit.
    '''
    return len(w)/5 + 2 # just for example, allowing around 1 typo per 5 chars.
def check_word(word):
    if spell_dict.check(word):
        return True # a known dictionary word
    # try similar words
    max_dist = get_distance_limit(word)
    for suggestion in spell_dict.suggest(word):
        if nltk.edit_distance(suggestion, word) < max_dist:
            return True
    return False
Add case normalisation and a filter for digits and you'll get a pretty good heuristic.
It is entirely possible to compare your list members to words that you don't believe to be valid for your input.
This can be done in many ways, partially depending on your definition of "properly spelled" and what you end up using for a comparison list. If you decide that numbers preclude an entry from being valid, or underscores, or mixed case, you could test for regex matching.
Post regex, you would have to decide what a valid character to split on should be. Is it spaces (are you willing to break on 'ad hoc' ('ad' is an abbreviation, 'hoc' is not a word))? Is it hyphens (this will break on hyphenated last names)?
With these above criteria decided, it's just a decision of what word, proper name, and common slang list to use and a list comprehension:
word_list[:] = [term for term in word_list if passes_my_membership_criteria(term)]
where passes_my_membership_criteria() is a function that contains the rules for staying in the list of words, returning False for things that you've decided are not valid.
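As a rough illustration of what such a function could look like, here is a sketch. Everything in it is an assumption: `KNOWN_WORDS` is a tiny hypothetical stand-in for a real word list, the regex rules (no digits, hyphens, underscores or mid-word capitals) are just one possible set of criteria, and it filters the individual tokens inside each list item rather than whole items.

```python
import re

# Hypothetical stand-in vocabulary; a real one would come from a
# dictionary file, a spell checker, or a list of proper names.
KNOWN_WORDS = {'youtube', 'search', 'alex', 'smith', 'birthday',
               'nebula', 'chuck', 'norris', 'graham', 'queries'}

def passes_my_membership_criteria(term):
    # Reject tokens with digits, hyphens or underscores ...
    if re.search(r'[\d_-]', term):
        return False
    # ... or with a capital letter right after a lowercase one.
    if re.search(r'[a-z][A-Z]', term):
        return False
    return term.lower() in KNOWN_WORDS

items = ['mPXSz0qd6j0 youtube ', 'alex smith ', 'birthday JEaM8Lg9oK4 ']
cleaned = [
    ' '.join(t for t in item.split() if passes_my_membership_criteria(t))
    for item in items
]
print(cleaned)  # ['youtube', 'alex smith', 'birthday']
```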
