Comparing unicode with unicode in python

I am trying to count the number of identical words in an Urdu document that is saved in UTF-8.
For example, I have a document containing three identical words separated by spaces:
خُداوند خُداوند خُداوند
I tried to count the words by reading the file using the following code:
file_obj = codecs.open(path, encoding="utf-8")
lst = repr(file_obj.readline()).split(" ")
word = lst[0]
count = 0
for w in lst:
    if word == w:
        count += 1
print count
But the value of count I am getting is 1, while I should get 3.
How does one compare Unicode strings?

Remove the repr() from your code. Use repr() only to create debug output; you are turning a unicode value into a string that can be pasted back into the interpreter.
This means your line from the file is now stored as:
>>> repr(u'خُداوند خُداوند خُداوند\n').split(" ")
["u'\\u062e\\u064f\\u062f\\u0627\\u0648\\u0646\\u062f", '\\u062e\\u064f\\u062f\\u0627\\u0648\\u0646\\u062f', "\\u062e\\u064f\\u062f\\u0627\\u0648\\u0646\\u062f\\n'"]
Note the double backslashes (escaped unicode escapes) and the first string starts with u' and the last string ends with \\n'. These values are obviously never equal.
Remove the repr(), and use .split() without arguments to remove the trailing whitespace too:
lst = file_obj.readline().split()
and your code will work:
>>> res = u'خُداوند خُداوند خُداوند\n'.split()
>>> res[0] == res[1] == res[2]
True
You may need to normalize the input first; some characters can be expressed either as one unicode codepoint or as two combining codepoints. Normalizing moves all such characters to a composed or decomposed state. See Normalizing Unicode.
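To make the normalization point concrete, here is a minimal Python 3 sketch using the standard-library unicodedata module. It is not part of the original answer: the sample strings and the count_words helper are made up for illustration, and NFC is just one reasonable choice of normalization form.

```python
import unicodedata
from collections import Counter

def count_words(line, form="NFC"):
    """Normalize the line, split on whitespace, and count each word."""
    normalized = unicodedata.normalize(form, line)
    return Counter(normalized.split())

composed = "caf\u00e9"      # 'é' stored as one code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# The two spellings look identical but compare unequal until normalized.
print(composed == decomposed)   # False

counts = count_words(composed + " " + decomposed + "\n")
print(counts[composed])         # 2
```

Without the normalize() call the two spellings would be counted as different words.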

Try removing the repr?
lst = file_obj.readline().split(" ")
The point is that you should at least print variables like lst and w to see what they are.

Comparing unicode strings in Python:
a = u'Artur'
print(a)
b = u'\u0041rtur'
print(b)
if a == b:
    print('the same')
result:
Artur
Artur
the same


Convert number values into ascii characters?

The part where I need to go from the number values I obtained to characters to spell out a word is not working; it says I need to use an integer for the last part?
# accept string
print "This program reduces and decodes a coded message and determines if it is a palindrome"
string=(str(raw_input("The code is:")))
# change it to lower case
string_lowercase=string.lower()
print "lower case string is:", string_lowercase
# strip special characters
specialcharacters="1234567890~`!@#$%^&*()_-+={[}]|\:;'<,>.?/"
for char in specialcharacters:
    string_lowercase=string_lowercase.replace(char,"")
print "With the specials stripped out the string is:", string_lowercase
# input offset
offset=(int(raw_input("enter offset:")))
# conversion of text to ASCII code
result=[]
for i in string_lowercase:
    code=ord(i)
    result.append([code-offset])
# conversion from ASCII code to text
text=''.join(chr(i) for i in result)
print "The decoded string is:", text.format(chr(result))
It looks like you have a list of lists instead of a list of ints when you call result.append([code-offset]). This means later when you call chr(i) for i in result, you are passing a list instead of an int to chr().
Try changing this to result.append(code-offset).
Other small suggestions:
raw_input already gives you a string, so there's no need to explicitly cast it.
Your removal of special characters can be more efficiently written as:
special_characters = "1234567890~`!@#$%^&*()_-+={[}]|\\:;'<,>.?/"
string_lowercase = ''.join(c for c in string_lowercase if c not in special_characters)
This iterates through string_lowercase once, instead of once per character in special_characters.
When appending to the list, use code-offset instead of [code-offset]. With the brackets you store each value as a one-item list instead of storing the ASCII value directly.
Hence your code should be:
result = []
for i in string_lowercase:
    code = ord(i)
    result.append(code-offset)
However, you may simplify this code as:
result = [ord(ch)-offset for ch in string_lowercase]
You may even further simplify your code. The one line to get decoded string will be:
decoded_string = ''.join(chr(ord(ch)-offset) for ch in string_lowercase)
Example with offset as 2:
>>> string_lowercase = 'abcdefghijklmnopqrstuvwxyz'
>>> offset = 2
>>> decoded_string = ''.join(chr(ord(ch)-offset) for ch in string_lowercase)
>>> decoded_string
'_`abcdefghijklmnopqrstuvwx'
You are passing a list to chr when it only accepts integers. Try result.append(code-offset). [code-offset] is a one-item list.
Specifically, instead of:
result=[]
for i in string_lowercase:
    code=ord(i)
    result.append([code-offset])
use:
result=[]
for i in string_lowercase:
    code=ord(i)
    result.append(code-offset)
If you understand list comprehension, this works too: result = [ord(i)-offset for i in string_lowercase]
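Putting the suggestions from these answers together, the whole decoding step can be sketched as a single Python 3 function. The name decode and the sample message are hypothetical, and string.digits plus string.punctuation is assumed to be a close stand-in for the hand-written list of special characters in the question (it is not character-for-character identical).

```python
import string

def decode(message, offset):
    """Lowercase the message, drop digits and punctuation,
    then shift every remaining character down by `offset`."""
    unwanted = set(string.digits + string.punctuation)
    cleaned = "".join(c for c in message.lower() if c not in unwanted)
    # ord() gives the integer code, chr() turns the shifted integer back into a character
    return "".join(chr(ord(c) - offset) for c in cleaned)

print(decode("Khoor!", 3))  # hello
```

As in the question's code, spaces are not stripped, so they would be shifted along with the letters.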

AttributeError: 'str' object has no attribute 'remove' [duplicate]

There is a string, for example, EXAMPLE.
How can I remove the middle character, i.e., M from it? I don't need the code. I want to know:
Do strings in Python end in any special character?
Which is a better way - shifting everything right to left starting from the middle character OR creation of a new string and not copying the middle character?
In Python, strings are immutable, so you have to create a new string. You have a few options of how to create the new string. If you want to remove the 'M' wherever it appears:
newstr = oldstr.replace("M", "")
If you want to remove the central character:
midlen = len(oldstr) // 2
newstr = oldstr[:midlen] + oldstr[midlen+1:]
You asked if strings end with a special character. No, you are thinking like a C programmer. In Python, strings are stored with their length, so any byte value, including \0, can appear in a string.
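As a small illustration of the point about length-based storage, the Python 3 sketch below puts a NUL character in the middle of a string and then removes the central character by slicing. The sample string is invented for the demonstration.

```python
s = "EX\0MPLE"             # a NUL character at index 2, purely for illustration
print(len(s))              # 7: the NUL counts like any other character

mid = len(s) // 2          # index 3, the 'M'
without_mid = s[:mid] + s[mid + 1:]   # builds a new string; s itself is unchanged
print(repr(without_mid))   # 'EX\x00PLE'
```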
To remove the character at a specific position:
s = s[:pos] + s[(pos+1):]
To remove a specific character:
s = s.replace('M','')
This is probably the best way:
original = "EXAMPLE"
removed = original.replace("M", "")
Don't worry about shifting characters and such. Most Python code takes place on a much higher level of abstraction.
Strings are immutable. But you can convert them to a list, which is mutable, and then convert the list back to a string after you've changed it.
s = "this is a string"
l = list(s) # convert to list
l[1] = "" # "delete" letter h (the item actually still exists but is empty)
l[1:2] = [] # really delete letter h (the item is actually removed from the list)
del(l[1]) # another way to delete it
p = l.index("a") # find position of the letter "a"
del(l[p]) # delete it
s = "".join(l) # convert back to string
You can also create a new string, as others have shown, by taking everything except the character you want from the existing string.
How can I remove the middle character, i.e., M from it?
You can't, because strings in Python are immutable.
Do strings in Python end in any special character?
No. They are similar to lists of characters; the length of the list defines the length of the string, and no character acts as a terminator.
Which is a better way - shifting everything right to left starting from the middle character OR creation of a new string and not copying the middle character?
You cannot modify the existing string, so you must create a new one containing everything except the middle character.
Use the translate() method:
>>> s = 'EXAMPLE'
>>> s.translate(None, 'M')
'EXAPLE'
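The two-argument translate(None, 'M') call above only works on Python 2 byte strings. On Python 3, str.translate takes a single table argument, so a rough equivalent (a sketch, not the original answerer's code) builds the table with str.maketrans:

```python
s = "EXAMPLE"

# The third argument of str.maketrans lists the characters to delete.
table = str.maketrans("", "", "M")
result = s.translate(table)
print(result)  # EXAPLE

# A dict mapping code points to None does the same thing.
same_result = s.translate({ord("M"): None})
```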
def kill_char(string, n): # n = position of which character you want to remove
    begin = string[:n] # from beginning to n (n not included)
    end = string[n+1:] # n+1 through end of string
    return begin + end
print kill_char("EXAMPLE", 3) # "M" removed
I have seen this somewhere here.
card = random.choice(cards)
cardsLeft = cards.replace(card, '', 1)
How to remove one character from a string:
Here is an example where there is a stack of cards represented as characters in a string.
One of them is drawn (import random module for the random.choice() function, that picks a random character in the string).
A new string, cardsLeft, is created to hold the remaining cards given by the string function replace() where the last parameter indicates that only one "card" is to be replaced by the empty string...
On Python 2, you can use UserString.MutableString to do it in a mutable way:
>>> import UserString
>>> s = UserString.MutableString("EXAMPLE")
>>> type(s)
<class 'UserString.MutableString'>
>>> del s[3] # Delete 'M'
>>> s = str(s) # Turn it into an immutable value
>>> s
'EXAPLE'
MutableString was removed in Python 3.
Another way is with a function.
Below is a way to remove all vowels from a string, just by calling the function:
def disemvowel(s):
    return s.translate(None, "aeiouAEIOU")
Here's what I did to slice out the "M":
s = 'EXAMPLE'
s1 = s[:s.index('M')] + s[s.index('M')+1:]
To delete a char or a sub-string once (only the first occurrence):
main_string = main_string.replace(sub_str, replace_with, 1)
NOTE: Here 1 can be replaced with any int, giving the number of occurrences you want to replace.
You can simply use list comprehension.
Assume that you have the string my name is and you want to remove the character m. Use the following code:
"".join([x for x in "my name is" if x != 'm'])
If you want to delete/ignore characters in a string, and, for instance, you have this string,
"[11:L:0]"
from a web API response or something similar, such as a CSV file. Let's say you are using requests:
import requests
udid = 123456
url = 'http://webservices.yourserver.com/action/id-' + str(udid)
s = requests.Session()
s.verify = False
resp = s.get(url, stream=True)
content = resp.content
Loop and get rid of unwanted chars:
for line in resp.iter_lines():
    line = line.replace("[", "")
    line = line.replace("]", "")
    line = line.replace('"', "")
Optional split, and you will be able to read values individually:
listofvalues = line.split(':')
Now accessing each value is easier:
print listofvalues[0]
print listofvalues[1]
print listofvalues[2]
This will print
11
L
0
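The cleaning and splitting steps above can be tried without a network call. The following Python 3 sketch works on the literal sample value from this answer; parse_fields is a hypothetical helper name, and the bytes check is there because requests' iter_lines() yields bytes by default on Python 3.

```python
def parse_fields(line):
    """Strip brackets and double quotes, then split on ':'."""
    if isinstance(line, bytes):
        line = line.decode("utf-8")   # assumes the payload is UTF-8 text
    for unwanted in '[]"':
        line = line.replace(unwanted, "")
    return line.split(":")

values = parse_fields("[11:L:0]")
print(values)  # ['11', 'L', '0']
```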
Two new string removal methods were introduced in Python 3.9:
# str.removeprefix("prefix_to_be_removed")
# str.removesuffix("suffix_to_be_removed")
s = 'EXAMPLE'
# In this case the position of 'M' is 3
s = s[:3] + s[3:].removeprefix('M')
OR
s = s[:4].removesuffix('M') + s[4:]
# output: 'EXAPLE'
from random import randint
def shuffle_word(word):
    newWord=""
    for i in range(0,len(word)):
        pos=randint(0,len(word)-1)
        newWord += word[pos]
        word = word[:pos]+word[pos+1:]
    return newWord
word = "Sarajevo"
print(shuffle_word(word))
Strings are immutable in Python so both your options mean the same thing basically.

replace all characters in a string with asterisks

I have a string
name = "Ben"
that I turn into a list
word = list(name)
I want to replace the characters of the list with asterisks. How can I do this?
I tried using the .replace function, but that was too specific and didn't change all the characters at once.
I need a general solution that will work for any string.
I want to replace the characters of the list w/ asterisks
Instead, create a new string object with only asterisks, like this
word = '*' * len(name)
In Python, you can multiply a string with a number to get the same string concatenated. For example,
>>> '*' * 3
'***'
>>> 'abc' * 3
'abcabcabc'
You may replace the characters of the list with asterisks in the following ways:
Method 1
for i in range(len(word)):
    word[i]='*'
This method is better IMO because no extra resources are used as the elements of the list are literally "replaced" by asterisks.
Method 2
word = ['*'] * len(word)
OR
word = list('*' * len(word))
In this method, a new list of the same length (containing only asterisks) is created and is assigned to 'word'.
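The practical difference between the two methods shows up when another name refers to the same list. The Python 3 sketch below uses made-up variable names to demonstrate it: Method 1 changes the list that every reference sees, while Method 2 rebinds the name to a new list and leaves the old one alone.

```python
# Method 1: overwrite each element in place
word = list("Ben")
in_place_alias = word             # second name for the same list object
for i in range(len(word)):
    word[i] = "*"
print(in_place_alias)             # ['*', '*', '*']

# Method 2: build a new list and rebind the name
word = list("Ben")
rebound_alias = word
word = ["*"] * len(word)
print(rebound_alias)              # ['B', 'e', 'n']
print(word)                       # ['*', '*', '*']
```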
I want to replace the characters of the list with asterisks. How can I
do this?
I will answer this question quite literally. There may be times when you have to perform it as a single step, particularly when using it inside an expression.
You can leverage the str.translate method and use a 256-character translation table to mask all characters with asterisks:
>>> name = "Ben"
>>> name.translate("*"*256)
'***'
Note that because strings are immutable, this creates a new string instead of mutating the original one.
Probably you are looking for something like this?
def blankout(instr, r='*', s=1, e=-1):
    if '@' in instr:
        # Handle E-Mail addresses
        a = instr.split('@')
        if e == 0:
            e = len(instr)
        return instr.replace(a[0][s:e], r * (len(a[0][s:e])))
    if e == 0:
        e = len(instr)
    return instr.replace(instr[s:e], r * len(instr[s:e]))

how to get the last part of a string before a certain character?

I am trying to print the last part of a string before a certain character.
I'm not quite sure whether to use the string .split() method or string slicing or maybe something else.
Here is some code that doesn't work but I think shows the logic:
x = 'http://test.com/lalala-134'
print x['-':0] # beginning at the end of the string, return everything before '-'
Note that the number at the end will vary in size so I can't set an exact count from the end of the string.
You are looking for str.rsplit(), with a limit:
print x.rsplit('-', 1)[0]
.rsplit() searches for the splitting string from the end of input string, and the second argument limits how many times it'll split to just once.
Another option is to use str.rpartition(), which will only ever split just once:
print x.rpartition('-')[0]
For splitting just once, str.rpartition() is the faster method as well; if you need to split more than once you can only use str.rsplit().
Demo:
>>> x = 'http://test.com/lalala-134'
>>> print x.rsplit('-', 1)[0]
http://test.com/lalala
>>> 'something-with-a-lot-of-dashes'.rsplit('-', 1)[0]
'something-with-a-lot-of'
and the same with str.rpartition()
>>> print x.rpartition('-')[0]
http://test.com/lalala
>>> 'something-with-a-lot-of-dashes'.rpartition('-')[0]
'something-with-a-lot-of'
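One detail worth checking before choosing between the two methods is what happens when the separator is missing. The Python 3 sketch below shows the difference; before_last is a hypothetical helper name, not something from the standard library.

```python
def before_last(text, sep):
    """Return everything before the last `sep`, or the whole text if `sep` is absent."""
    head, found, _tail = text.rpartition(sep)
    return head if found else text

x = "http://test.com/lalala-134"
print(before_last(x, "-"))                    # http://test.com/lalala

# With no separator present the two methods disagree:
print("no dashes".rsplit("-", 1)[0])          # no dashes
print(repr("no dashes".rpartition("-")[0]))   # ''
```

rsplit() returns the whole string as its only item, while rpartition() puts the whole string in the last slot and leaves the first one empty, which is why the helper checks the middle element.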
The difference between split and partition is that split returns a list without the delimiter and splits wherever the delimiter occurs in the string, i.e.
x = 'http://test.com/lalala-134-431'
a,b,c = x.split('-')
print(a)
"http://test.com/lalala"
print(b)
"134"
print(c)
"431"
whereas partition divides the string at the first delimiter only and always returns a tuple of three values:
x = 'http://test.com/lalala-134-431'
a,b,c = x.partition('-')
print(a)
"http://test.com/lalala"
print(b)
"-"
print(c)
"134-431"
Since you want the last value, you can use rpartition; it works the same way but searches for the delimiter from the end of the string:
x = 'http://test.com/lalala-134-431'
a,b,c = x.rpartition('-')
print(a)
"http://test.com/lalala-134"
print(b)
"-"
print(c)
"431"

