Python efficient mass replacing unknown characters - python

A PHP4 + MySQL4 based project was ported to a Django 1.1 project, and some letters got mixed up in the process.
What is the best (most efficient) way to replace them in this fashion?
The problem for me is that I cannot get the values for those letters. Is there an online tool to do that?
I have a TextField with various letters and I want to replace them in this fashion:
àèæëáðøûþ => ąčęėįšųūž
ÀÈÆËÁÐØÛÞ => ĄČĘĖĮŠŲŪŽ
I had a similar case where I had to clean up the text, so I used this:
def clean(string):
    return ''.join([c for c in string if ord(c) > 31 or ord(c) in [9, 10, 13]])
Update: I managed to extract the Unicode values by looking at Django debug messages (replace_from:replace_to):
{'\xe0':'\u0105', '\xe8':'\u010d', '\xe6':'\u0119', '\xeb':'\u0117', '\xe1':'\u012f',
'\xf0':'\u0161', '\xf8':'\u0173', '\xfb':'\u016b', '\xfe':'\u017e',
'\xc0':'\u0104', '\xc8':'\u010c', '\xc6':'\u0118', '\xcb':'\u0116', '\xc1':'\u012e',
'\xd0':'\u0160', '\xd8':'\u0172', '\xdb':'\u016a', '\xde':'\u017d'}
So the main problem remains: doing the replacement.

Try the str.replace() method - should work with unicode strings.
str.replace(old, new[, count])
Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
Make sure your old and new strings are of type Unicode
(that applies to your input data as well).
Find out what your input (non-unicode) string is supposed to be encoded in.
For example, it may be in latin1 encoding.
Use the builtin str.decode() method to create a Unicode version of your data,
and feed that to str.replace().
>>> unioldchars = oldchars.decode("latin1")
>>> newdata = data.replace(unioldchars, newchars)
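If the mix-up comes from text that was decoded with the wrong codec, re-decoding may fix every character at once instead of replacing them one by one. The sketch below is written for Python 3 and assumes the stored text was really Windows-1257 (Baltic) that got read as Latin-1. That codec pair is a guess inferred from the mapping in the question, not something the question states, and `redecode` is a hypothetical helper name.

```python
def redecode(text, wrong="latin-1", right="cp1257"):
    """Undo a wrong decode: turn the text back into bytes, then decode again.

    Both codec names are assumptions; adjust them to match the real data.
    """
    return text.encode(wrong).decode(right)


# '\xe0\xe8\xe6' is the mis-decoded form of the first three letters in the question.
fixed = redecode("\xe0\xe8\xe6")
```

This only works if every character in the text fits the wrongly assumed codec; otherwise `encode` raises `UnicodeEncodeError`, and a per-character mapping is the safer route.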

I'd do it myself. The built-in replace function is of little use if you want multiple, efficient replacements.
Give this a look: http://code.activestate.com/recipes/81330-single-pass-multiple-replace/
EDIT: WAIT, you wanted to do the replacement client-side, like in the text-box?
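A minimal sketch of the single-pass idea, written for Python 3. It is not a copy of the linked recipe; `multiple_replace` is a name chosen here for illustration, and the sample mapping uses two entries from the question's update.

```python
import re


def multiple_replace(text, mapping):
    """Replace every key of `mapping` found in `text` in a single pass."""
    if not mapping:
        return text
    # Longest keys first, so that a longer key wins over its own prefix.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: mapping[match.group(0)], text)


# Two entries of the mapping from the question's update.
sample = {"\xe0": "\u0105", "\xe8": "\u010d"}
fixed = multiple_replace("\xe0b\xe8", sample)
```

Because the text is scanned once, a replacement value is never itself replaced again, which can happen when chaining several `replace()` calls.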

string.translate(s, table[, deletechars])
Delete all characters from s that are in deletechars (if
present), and then translate the characters using table, which must be
a 256-character string giving the translation for each character value,
indexed by its ordinal. If table is None, then only the character deletion
step is performed.
See also http://docs.python.org/library/string.html#string.maketrans
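The 256-character table described above covers byte strings only, so it cannot hold targets such as U+0105. For Unicode text, the translate method accepts a mapping of code points instead. A short sketch for Python 3, using the two character rows from the question; `fix_letters` is a hypothetical name.

```python
# Characters as they currently appear, and what they should become
# (both strings copied from the question, position by position).
WRONG = "àèæëáðøûþÀÈÆËÁÐØÛÞ"
RIGHT = "ąčęėįšųūžĄČĘĖĮŠŲŪŽ"

# str.maketrans builds a dict keyed by code point, so it is not limited
# to 256 entries.
TABLE = str.maketrans(WRONG, RIGHT)


def fix_letters(text):
    """Translate every mapped character in one pass; others pass through."""
    return text.translate(TABLE)
```

The table is built once and reused, so each call is a single pass over the text regardless of how many characters are mapped.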

Related

Remove "." and "\" from a string

My project captures a log number from a Google Sheet using the gspread module. The problem is that the log number comes back as the string ".\1300". I only want the number in the string, but I could not remove the other characters using the code below.
I tried using the .replace() function to replace "\" with "", but it failed.
a='.\1362'
a.replace('\\',"")
Should obtain the string "1362" without the symbol.
But the result obtained is ".^2"
The problem is that \136 has a special meaning (similar to \n for newline, \t for tab, etc.): it is an octal escape, and octal 136 is 94, the character code of ^.
Check out the following example:
a = '.\1362'
a = a.replace('\\',"")
print(a)
b = r'.\1362'
b = b.replace('\\',"")
print(b)
Produces
.^2
.\1362
Now, if your Google Sheets module sends .\1362 instead of .\\1362, it is very likely because you are in fact supposed to receive .^2. Or there's a problem with your character encoding somewhere along the way.
The r modifier I put on the b variable means raw string, meaning Python will not interpret backslashes and will leave your string alone. It only affects string literals typed into the source code, so it is only really useful when typing the strings in manually. You could perhaps try the following, although it merely copies a string that already exists and cannot undo an escape that has already been interpreted:
a = r'{}'.format(yourStringFromGoogle)
Edit: As pointed out in the comments, the original code did in fact discard the result of the .replace() method. I've updated the code, but please note that the string interpolation issue remains the same.
When you do a='.\1362', a will only have three characters:
a = '.\1362'
print(len(a)) # => 3
That is because \136 represents a single character. If you want to create a six-character string with a dot, a backslash, and the digits 1362, you either need to escape the backslash or create a raw string:
a = r'.\1362'
print(len(a)) # => 6
In either case, calling replace on a string will not replace the characters in that string. a will still be what it was before calling replace. Instead, replace returns a new string:
a = r'.\1362'
b = a.replace('\\', '')
print(a) # => .\1362
print(b) # => .1362
So, if you want to replace characters, calling replace is the way to do it, but you've got to save the result in a new variable or overwrite the old.
See String and Bytes literals in the official python documentation for more information.
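The points in this answer can be checked side by side. The snippet below is a small sketch for Python 3; the variable names are chosen here for illustration.

```python
escaped = '.\1362'      # \136 is read as an octal escape, giving '^'
doubled = '.\\1362'     # escaped backslash: dot, backslash, 1, 3, 6, 2
raw = r'.\1362'         # raw literal: the same six characters

print(len(escaped), len(doubled), len(raw))  # prints: 3 6 6

# replace() returns a new string; the original is left unchanged.
cleaned = raw.replace('\\', '')
```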
Your string should contain 2 backslashes, like this: '.\\1362', or use r'.\1362' (which declares the literal as raw, so backslashes in it are not treated as escapes). If there is only one backslash, Python will understand \136 to mean ^, as you can see.
What's happening here is that \1362 is being interpreted as ^2 because of the backslash, so the string needs to be raw before you're able to use it. You can try
a = r'{}'.format(rawInputString)
or, if you're on Python 3.6+,
a = rf'{rawInputString}'
though note that the r prefix only applies to the literal text in the source code, so neither line changes the contents of rawInputString.
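Since the goal in the question is the bare number, another option is to keep only the digits of whatever string arrives. This is a sketch under the assumption that the value read from the sheet really contains a dot and a backslash at runtime; `digits_only` is a hypothetical helper, not part of gspread.

```python
import string


def digits_only(value):
    """Keep only the ASCII digits of `value`, dropping everything else."""
    return ''.join(ch for ch in value if ch in string.digits)
```

If the backslash was already interpreted as an escape before the string reached this function (as in the literal '.\1362'), the digits 1, 3 and 6 are gone and cannot be recovered this way.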

python Regex to find values between 2 strings

I have a string: a = '*1357*0123456789012345678901234567890123456789*2468*'
I want to find a value between 1357 and 2468 which is 0123456789012345678901234567890123456789.
I want to use regex or easier method to extract the value.
I tried re.findall(r'1357\.(.*?)2468', a), but I don't know what I'm doing wrong.
You have a couple of problems here:
You're escaping the . after 1357, which means a literal ., which isn't what you meant to have
You aren't treating the * characters (which do need to be escaped, of course).
To make a long story short:
re.findall(r'1357\*(.*?)\*2468', a)
If you want a slightly more general or flexible method, you can use this:
re.findall(r'\*\d+\*(\d+)\*\d+\*',a)
Which gives you the same output:
['0123456789012345678901234567890123456789']
But the advantage is that it gives you the value between any pair of numeric values that are surrounded by *. For instance, this works for your string, but also for the string a = '*0101*0123456789012345678901234567890123456789*0*', etc.
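Both patterns from this answer can be run together as a quick check. The variable names below are chosen for illustration, and the second sample string is a shortened variant of the one mentioned above.

```python
import re

a = '*1357*0123456789012345678901234567890123456789*2468*'

# Fixed markers: \* matches a literal asterisk, (.*?) captures lazily.
between_markers = re.findall(r'1357\*(.*?)\*2468', a)

# General form: a digit run enclosed by two other *-delimited digit runs.
general = r'\*\d+\*(\d+)\*\d+\*'
between_any = re.findall(general, a)
other = re.findall(general, '*0101*0123456789*0*')
```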

Searching words without diacritics in a sorted list of words

I've been trying to come up with an efficient solution for the following problem. I have a sorted list of words that contain diacritics and I want to be able to do a search without using diacritics. So for example I want to match 'kříž' just using 'kriz'. After a bit of brainstorming I came up with the following and I want to ask you, more experienced (or clever) ones, whether it's optimal or there's a better solution. I'm using Python but the problem is language independent.
First I provide a mapping of those characters that have some diacritical siblings. So in case of Czech:
cz_map = {'a' : ('á',), ... 'e' : ('é', 'ě') ... }
Now I can easily create all variants of an input word. So for 'lama' I get: ['lama', 'láma', 'lamá', 'lámá']. I could already use this to search for words that match any of those permutations, but for words like 'nepredvidatelny' (unpredictable) one gets 13824 permutations. Even though my laptop has a shiny Intel i5 logo on it, this solution is too naive for my taste.
Here's an improvement I came up with. The dictionary of words I'm using has a variant of binary search for prefix matching (it returns the word at the lowest index with a matching prefix) that is very useful in this case. I start with the first character and search for its prefix in the dictionary; if it's there, I stack it up so that the next character is tested appended to all of these stacked-up sequences. This way I'm propagating only those strings that lead to a match. Here's the code:
def dia_search(word, cmap, dictionary):
    prefixes = ['']
    for c in word:
        # each character maps to itself
        subchars = [c]
        # and some diacritical siblings if they exist
        if c in cmap:
            subchars += cmap[c]
        # build a list of matching prefixes for the next round
        prefixes = [p+s for s in subchars
                    for p in prefixes
                    if dictionary.psearch(p+s)>0]
    return prefixes
This technique gives very good results but could it be even better? Or is there a technique that doesn't need the character mapping as in this case? I'm not sure this is relevant but the dictionary I'm using isn't sorted by any collate rules so the sequence is 'a', 'z', 'á' not 'a', 'á', 'z' as one could expect.
Thanks for all comments.
EDIT: I cannot create any auxiliary precomputed database that would be a copy of the original one but without diacritics. Let's say the original database is too big to be replicated.
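A self-contained sketch of the prefix-pruning search described above, written for Python 3. The dictionary class is not shown in the question, so `PrefixDictionary` and `has_prefix` are hypothetical stand-ins built on `bisect`; they assume the words are sorted in plain code-point order, as the question describes.

```python
import bisect


class PrefixDictionary:
    """Hypothetical stand-in for the question's dictionary object."""

    def __init__(self, words):
        self.words = sorted(words)  # plain code-point order, no collation

    def has_prefix(self, prefix):
        # The first word >= prefix either starts with the prefix or no word does.
        i = bisect.bisect_left(self.words, prefix)
        return i < len(self.words) and self.words[i].startswith(prefix)


def dia_search(word, cmap, dictionary):
    prefixes = ['']
    for c in word:
        # each character maps to itself plus its diacritical siblings
        subchars = [c] + list(cmap.get(c, ()))
        # keep only the extended prefixes that still occur in the dictionary
        prefixes = [p + s for p in prefixes for s in subchars
                    if dictionary.has_prefix(p + s)]
    return prefixes


cz_map = {'a': ('á',), 'r': ('ř',), 'i': ('í',), 'z': ('ž',)}
words = PrefixDictionary(['kříž', 'lama', 'láma'])
matches = dia_search('kriz', cz_map, words)
```

Each round costs one binary search per surviving candidate, so the work grows with the number of live prefixes rather than with the full number of permutations.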
Using the standard library only (str.maketrans and str.translate), you could do this:
intab = "řížéě" # ...add all the other characters
outtab = "rizee" # and the characters you want them translated to
transtab = str.maketrans(intab, outtab)
strg = "abc kříž def "
print(strg.translate(transtab)) # abc kriz def
This is for Python 3.
In Python 2, string.maketrans only accepts byte strings of equal length, so it cannot map these multi-byte characters. Use unicode literals and a dict of code points instead:
intab = u"řížéě"
outtab = u"rizee"
transtab = dict(zip(map(ord, intab), outtab))
# the rest remains the same, with strg as a unicode literal
Have a look at Unidecode, which you can use to convert the diacritics into the closest ASCII, e.g. unidecode(u'kříž').
As has been suggested, what you want to do is translate your Unicode words (containing diacritics) to the closest version in the standard 26-letter alphabet.
One way of implementing this would be to create a second list of words (of the same size of the original) with the corresponding translations. Then you do the query in the translated list, and once you have a match look up the corresponding location in the original list.
Or in case you can alter the original list, you can translate everything in-place and strip duplicates.
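For the translation step itself, the standard library offers a way that needs no hand-written character mapping: decompose each character and drop the combining marks. A minimal sketch for Python 3; `strip_diacritics` is a name chosen here.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

This covers letters that decompose into a base letter plus accents, which includes the Czech examples above. Letters without such a decomposition (for example ł or ø) pass through unchanged and would still need an explicit mapping.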

"ValueError: invalid literal for int() with base 10" trying to get character's ASCII code

I am a beginner in python. I came across this question in codewars.
Jaden is known for some of his philosophy that he delivers via Twitter. When writing on Twitter, he is known for almost always capitalizing every word.
Your task is to convert strings to how they would be written by Jaden Smith. The strings are actual quotes from Jaden Smith, but they are not capitalized in the same way he originally typed them.
Example :
Not Jaden-Cased: "How can mirrors be real if our eyes aren't real"
Jaden-Cased: "How Can Mirrors Be Real If Our Eyes Aren't Real"
This is my attempt (I am supposed to code using a function)
def toJadenCase(string):
    l = len(string)
    for i in range(0,l):
        if string[i] == ' ':
            y = string[i]
            string[i+1] = chr(int(y)-32)
    return srting
s = raw_input()
print toJadenCase(s)
When run, the following errors showed up
How can mirrors be real if our eyes aren't real (this is the input string)
Traceback (most recent call last):
  File "jaden_smith.py", line 9, in <module>
    print toJadenCase(s)
  File "jaden_smith.py", line 6, in toJadenCase
    string[i+1] = chr(int(y)-32)
ValueError: invalid literal for int() with base 10: ''
I couldn't understand these errors even after googling them. Any help would be appreciated. It would also be great if other errors in my code were highlighted and better code were suggested.
Thanks in advance :D
As Goodies points out, string should not be used as a variable name
Following the Zen of Python, this is technically a function that does exactly what you're trying to achieve:
def toJadenCase(quote):
    return quote.title()
Edit:
Revised version to deal with apostrophes:
import string
def toJadenCase(quote):
    return string.capwords(quote)
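The reason for the revision shows up when both calls are run on the example sentence. A short comparison for Python 3; the variable names are chosen here for illustration.

```python
import string

quote = "How can mirrors be real if our eyes aren't real"

titled = quote.title()              # treats the apostrophe as a word boundary
capworded = string.capwords(quote)  # splits on whitespace only

print(titled.split()[-2])     # prints: Aren'T
print(capworded.split()[-2])  # prints: Aren't
```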
First, you have to understand that strings are immutable, so you cannot set a single character inside a string; instead you build a new string from the old one and replace the old one (this can usually still be done in one pass, so it's not a big complication).
Second, for most operations of this kind, it is much better to use the methods of the string object itself rather than redo everything from scratch.
That said, there is still some complication with the question, but a function that does what you want is in the string module:
import string
s="How can mirrors be real if our eyes aren't real"
newstring=string.capwords(s)
If you prefer (why?!) a DIY solution (using string methods):
newstring=' '.join([ss.capitalize() for ss in s.split()])
Note that using split without argument splits the string on any whitespace (e.g. tabs etc.), that I think is the desired behavior.
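To connect the immutability point at the start of this answer with the original attempt: assigning to string[i+1] fails because a string cannot be changed in place, and int(' ') fails because a space is not a number, whereas ord() is the call that returns a character code. A sketch of the same loop idea that works, assuming ASCII input; `capitalize_after_spaces` is a hypothetical name.

```python
def capitalize_after_spaces(text):
    """Upper-case the first letter of the text and every letter after a space."""
    chars = list(text)  # a list is mutable, unlike the string itself
    for i, ch in enumerate(chars):
        starts_word = i == 0 or chars[i - 1] == ' '
        if starts_word and 'a' <= ch <= 'z':
            chars[i] = chr(ord(ch) - 32)  # ASCII lowercase to uppercase
    return ''.join(chars)
```

The range check keeps letters that are already upper-case, and non-letters, from being shifted to an unrelated character.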
If you want to do this without using a function that already exists, this is how I would do it, and I'll explain everything:
Assuming you get a string with ONLY text-based words and all words start with a letter:
def toJadenCase(string):
    # strip() removes the whitespace around the text; split() then splits on
    # whitespace (the default; pass a character to split on that instead)
    # and returns a list of the words in the sentence.
    words = string.strip().split()
    li = []  # initialize empty list
    for word in words:
        # I could use word[0].upper() + word[1:] here, but I wanted to do it
        # without that. ord gives the ASCII number, subtracting 32 makes a
        # lowercase letter uppercase, and chr changes it back to a string.
        # Strings can be indexed like lists, so word[0] and word[1:] work;
        # word[1:] just means from index 1 to the end.
        if 'a' <= word[0] <= 'z':  # leave already-capitalized words alone
            word = chr(ord(word[0]) - 32) + word[1:]
        li.append(word)  # this appends the word to the list
    return ' '.join(li)  # this joins all of the words in the list with a space
Now, if you want something a lot more concise (you can use .capitalize()):
def toJadenCaseShort(string):
    return ' '.join([x.capitalize() for x in string.strip().split()])
which returns:
>>> toJadenCaseShort("hello my friends")
'Hello My Friends'
Basically, it uses a list comprehension over the stripped and split words, capitalizes each one, and then joins them with spaces!
Of course, you could just use string.title() as mark s. says, but what's the fun in that? :)
Here is the answer that passed for me
import string
def toJadenCase(str):
    quote = string.capwords(str)
    return quote #Do not use print(quote) as it adds spaces

How to compare unicode strings with entity ref to non-unicode string

I am evaluating hundreds of thousands of HTML files. I am looking for particular parts of the files. There can be small variations in the way the files were created.
For example, in one file I can have a section heading (after I converted it to upper case and split then joined the text to get rid of possibly inconsistent white space):
u'KEY1A\x97RISKFACTORS'
In another file I could have:
'KEY1ARISKFACTORS'
I am trying to create a dictionary of possible responses, and I want to compare these two and conclude that they are equal. But every substitution I try to run on the first string to remove the '\x97' does not seem to work.
There are a fair number of variations of keys with various representations of entities, so I would really like to create a dictionary more or less automatically, so that I have something like:
key_dict={u'KEY1A\x97RISKFACTORS':'KEY1ARISKFACTORS','KEY1ARISKFACTORS':'KEY1ARISKFACTORS', ...}
I am assuming that since when I run
S1='A'
S2=u'A'
S1==S2
I get
True
I should be able to compare these once the html entities are handled
What I specifically tried to do is
new_string=u'KEY1A\x97RISKFACTORS'.replace('|','')
I got an error
Sorry, I have been at this since last night. SLott pointed out something, and I see I used the wrong label. I hope this makes more sense.
You are correct that if S1='A' and S2 = u'A', then S1 == S2. Instead of assuming this though, you can do a simple test:
key_dict= {u'A':'Value1',
'A':'Value2'}
print key_dict
print u'A' == 'A'
This outputs:
{u'A': 'Value2'}
True
That resolved, let's look at:
new_string=u'KEY1A\x97DEMOGRAPHICRESPONSES'.replace('|','')
There's a problem here, \x97 is the value you're trying to replace in the target string. However, your search string is '|', which is hex value 0x7C (ascii and unicode) and clearly not the value you need to replace. Even if the target and search string were both ascii or unicode, you'd still not find the '\x97'. Second problem is that you are trying to search for a non-unicode string in a unicode string. The easiest solution, and one that makes the most sense is to simply search for u'\x97':
print u'KEY1A\x97DEMOGRAPHICRESPONSES'
print u'KEY1A\x97DEMOGRAPHICRESPONSES'.replace(u'\x97', u'')
Outputs:
KEY1A\x97DEMOGRAPHICRESPONSES
KEY1ADEMOGRAPHICRESPONSES
Why not the obvious .replace(u'\x97','')? Where does the idea of that '|' come from?
>>> s = u'KEY1A\x97DEMOGRAPHICRESPONSES'
>>> s.replace(u'\x97', '')
u'KEY1ADEMOGRAPHICRESPONSES'
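For building the key dictionary more or less automatically, one option is to reduce every heading to a single canonical form before comparing. A sketch for Python 3, where every str is already Unicode and no u prefix is needed; `normalize_key` is a hypothetical helper, and the sample headings are made up to resemble the ones in the question.

```python
def normalize_key(heading):
    """Drop the stray U+0097 character and all whitespace, then upper-case."""
    without_dash = heading.replace('\x97', '')
    return ''.join(without_dash.split()).upper()


headings = ('Key 1A\x97Risk Factors', 'KEY1A RISK FACTORS')
keys = {normalize_key(h) for h in headings}
```

Other entity variants could be handled by adding further replace() calls, or a translate table, inside the same function.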
