Python strip() unicode string? - python

How can you use string methods like strip() on a unicode string? And can't you access the characters of a unicode string as you would with ordinary strings (e.g. mystring[0:4])?

It's working as usual, as long as they are actually unicode, not str (note: every string literal must be preceded by u, like in this example):
>>> a = u"coțofană"
>>> a
u'co\u021bofan\u0103'
>>> a[-1]
u'\u0103'
>>> a[2]
u'\u021b'
>>> a[3]
u'o'
>>> a.strip(u'ă')
u'co\u021bofan'

Maybe it's a bit late to answer to this, but if you are looking for the library function and not the instance method, you can use that as well.
Just use:
yourunicodestring = u' a unicode string with spaces all around '
unicode.strip(yourunicodestring)
In some cases it's easier to use this one, for example inside a map function like:
unicodelist=[u'a',u' a ',u' foo is just...foo ']
map (unicode.strip,unicodelist)

You can do every string operation on a unicode string. In fact, in Python 3 all strs are unicode.
>>> my_unicode_string = u"abcşiüğ"
>>> my_unicode_string[4]
u'i'
>>> my_unicode_string[3]
u'\u015f'
>>> print(my_unicode_string[3])
ş
>>> my_unicode_string[3:]
u'\u015fi\xfc\u011f'
>>> print(my_unicode_string[3:])
şiüğ
>>> print(my_unicode_string.strip(u"ğ"))
abcşiü

See the Python docs on Unicode strings and the following section on string methods. Unicode strings support all of the same methods and operations as normal ASCII strings.
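For readers on Python 3, here is a minimal sketch of the same operations. It assumes Python 3, where the u prefix is optional and the unbound method is str.strip rather than unicode.strip; the variable names are only illustrative.

```python
# Python 3: every str is a Unicode string, so no u prefix is needed.
word = "co\u021bofan\u0103"       # "coțofană", written with escapes for safety

last_char = word[-1]              # indexing works per code point
first_four = word[0:4]            # slicing works as with any string
stripped = word.strip("\u0103")   # strip() accepts a set of characters to remove

# str.strip plays the role that unicode.strip played in Python 2.
padded = ["a", " a ", " foo is just...foo "]
cleaned = list(map(str.strip, padded))

print(last_char, first_four, stripped, cleaned)
```

map() returns a lazy iterator in Python 3, which is why the sketch wraps it in list().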

Related

Python Unicode Casting on Variable Bug

I've found out this weird python2 behavior related to unicode and variable:
>>> u"\u2730".encode('utf-8').encode('hex')
'e29cb0'
This is the result I expect, but I want to control the first part (u"\u2730") dynamically.
>>> type(u"\u2027")
<type 'unicode'>
Good, so the first part is cast as unicode. Now declaring a string variable and casting it to unicode:
>>> a='20'
>>> b='27'
>>> myvar='\u'+a+b.decode('utf-8')
>>> type(myvar)
<type 'unicode'>
>>> print myvar
\u2027
It seems that now I can use the variable in my original code, right?
>>> myvar.encode('utf-8').encode('hex')
'5c7532303237'
The result, as you can see, is not the original one. It seems that Python is treating 'myvar' as a string instead of unicode. Am I missing something?
Anyway, my final goal is to loop over Unicode from \u0000 to \uFFFF, convert each character to a string, and convert that string to hex. Is there an easy way?
unichr() in Python 2 or chr() in Python 3 is the way to construct a character from a number. \uxxxx escape codes can only be typed directly in code.
Python 2:
>>> a='20'
>>> b='27'
>>> unichr(int(a+b,16))
u'\u2027'
Python 3:
>>> a='20'
>>> b='27'
>>> chr(int(a+b,16))
'‧'
You are confusing a Unicode escape sequence with the literal characters \ and u. It's like confusing r"\n" (or "\\n") with an actual newline. Escape sequences are only interpreted inside string literals, so you need to decode the string with 'unicode_escape':
>>> import codecs
>>> a='20'
>>> b='27'
>>> myvar='\u'+a+b.decode('utf-8')
>>> myvar
u'\\u2027'
>>> myvar.decode('unicode_escape')
u'\u2027'
>>> print(myvar.decode('unicode_escape'))
‧
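The asker's final goal (looping from \u0000 to \uFFFF and getting the UTF-8 hex of each character) could be sketched like this. It assumes Python 3; the function names utf8_hex and bmp_utf8_hex are made up for illustration. The surrogate range U+D800 to U+DFFF is skipped because lone surrogates cannot be encoded as UTF-8.

```python
def utf8_hex(code_point):
    """Return the UTF-8 encoding of one code point as a hex string."""
    return chr(code_point).encode("utf-8").hex()


def bmp_utf8_hex():
    """Yield (code_point, hex_string) for U+0000..U+FFFF, skipping surrogates."""
    for code_point in range(0x10000):
        if 0xD800 <= code_point <= 0xDFFF:
            continue  # lone surrogates raise UnicodeEncodeError under UTF-8
        yield code_point, utf8_hex(code_point)


table = dict(bmp_utf8_hex())
print(utf8_hex(0x2730))  # e29cb0, the value expected in the question
```

In Python 2 the same idea would use unichr() and .encode('utf-8').encode('hex') instead.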

C-style escaping in python

How do I escape (and unescape) the C escaped characters( newlines, slashes etc) for a string in python?
I guess JSON.encode( string) does this, but is there a better way?
Use str.encode('string-escape') in Python 2.7:
>>> '12\t34\n'.encode('string-escape')
'12\\t34\\n'
>>> '12\\t34\\n'.decode('string-escape')
'12\t34\n'
In Python 3, use str.encode('unicode-escape') or str.encode('unicode-escape').decode('utf-8'):
>>> '12\t34\n'.encode('unicode-escape')
b'12\\t34\\n'
>>> b'12\\t34\\n'.decode('unicode-escape')
'12\t34\n'
>>> '12\t34\n'.encode('unicode-escape').decode('utf-8')
'12\\t34\\n'
>>> '12\\t34\\n'.encode('utf-8').decode('unicode-escape')
'12\t34\n'
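The Python 3 round trip above can be wrapped in two small helpers. This is only a sketch: c_escape and c_unescape are hypothetical names, and the unicode_escape codec follows Python's escaping rules, which overlap with C's for common cases like \t and \n but are not identical.

```python
def c_escape(text):
    # Control characters and non-ASCII characters become backslash escapes.
    return text.encode("unicode_escape").decode("ascii")


def c_unescape(text):
    # Assumes the input is pure ASCII, which c_escape() output always is.
    # Encoding as ASCII first makes non-ASCII input fail loudly instead of
    # being silently misread by the unicode_escape codec.
    return text.encode("ascii").decode("unicode_escape")


original = "12\t34\n"
escaped = c_escape(original)
restored = c_unescape(escaped)
print(escaped)
```

Encoding to ASCII rather than UTF-8 in c_unescape is a deliberate choice here: unicode_escape decodes bytes as Latin-1, so UTF-8 encoded non-ASCII input would come back garbled.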

How to use hex() without 0x in Python?

The hex() function in python, puts the leading characters 0x in front of the number. Is there anyway to tell it NOT to put them? So 0xfa230 will be fa230.
The code is
import fileinput
f = open('hexa', 'w')
for line in fileinput.input(['pattern0.txt']):
    f.write(hex(int(line)))
    f.write('\n')
(Recommended)
Python 3 f-strings: Answered by #GringoSuave
>>> i = 3735928559
>>> f'{i:x}'
'deadbeef'
Alternatives:
format builtin function (good for single values only)
>>> format(3735928559, 'x')
'deadbeef'
And sometimes we still may need to use str.format formatting in certain situations #Eumiro
(Though I would still recommend f-strings in most situations)
>>> '{:x}'.format(3735928559)
'deadbeef'
(Legacy) f-strings should solve all of your needs, but printf-style formatting is what we used to do #msvalkon
>>> '%x' % 3735928559
'deadbeef'
Without string formatting #jsbueno
>>> i = 3735928559
>>> i.to_bytes(4, "big").hex()
'deadbeef'
Hacky Answers (avoid)
hex(i)[2:] #GuillaumeLemaître
>>> i = 3735928559
>>> hex(i)[2:]
'deadbeef'
This relies on string slicing instead of using a function / method made specifically for formatting as hex. This is why it may give unexpected output for negative numbers:
>>> i = -3735928559
>>> hex(i)[2:]
'xdeadbeef'
>>> f'{i:x}'
'-deadbeef'
Use this code:
'{:x}'.format(int(line))
it allows you to specify a number of digits too:
'{:06x}'.format(123)
# '00007b'
For Python 2.6 use
'{0:x}'.format(int(line))
or
'{0:06x}'.format(int(line))
You can simply write
hex(x)[2:]
to get the first two characters removed.
Python 3.6+:
>>> i = 240
>>> f'{i:x}' # 02x to pad with zeros
'f0'
Old style string formatting:
In [3]: "%x" % 127
Out[3]: '7f'
New style
In [7]: '{:x}'.format(127)
Out[7]: '7f'
Using capital letters as format characters yields uppercase hexadecimal
In [8]: '{:X}'.format(127)
Out[8]: '7F'
Docs are here.
'x' - Outputs the number in base 16, using lower-case letters for the digits above 9.
>>> format(3735928559, 'x')
'deadbeef'
'X' - Outputs the number in base 16, using upper-case letters for the digits above 9.
>>> format(3735928559, 'X')
'DEADBEEF'
You can find more information about that in Python's documentation:
Format Specification Mini-Language
format()
F-strings
Python 3's formatted literal strings (f-strings) support the Format Specification Mini-Language, which designates x for hexadecimal numbers. The output doesn't include 0x.
So you can do this:
>>> f"{3735928559:x}"
'deadbeef'
See the spec for other bases like binary, octal, etc.
Edit: str.removeprefix
Since Python 3.9, there is now a str.removeprefix method, which allows you to write the following more obvious code:
>>> hexadecimal = hex(3735928559)
>>> hexadecimal.removeprefix('0x')
'deadbeef'
Note that this does NOT work for negative numbers ❌:
>>> negadecimal = hex(-3735928559)
>>> negadecimal.removeprefix('0x')
'-0xdeadbeef'
Besides string formatting, it is worth keeping in mind that when we work with numbers and their hexadecimal representation we are usually dealing with byte content and are interested in how the bytes relate.
The bytes class in Python 3 has been enriched with methods over the 3.x series, and int.to_bytes combined with bytes.hex() provides full control of the hex-digit output while preserving the semantics of the transform (not to mention keeping the intermediate bytes object ready for use in any binary protocol that requires the number):
In [8]: i = 3735928559
In [9]: i.to_bytes(4, "big").hex()
Out[9]: 'deadbeef'
Besides that, bytes.hex() allows some control over the output, such as specifying a separator for the hex digits:
In [10]: bytes.hex?
Docstring:
Create a string of hexadecimal numbers from a bytes object.
  sep
    An optional single character or byte to separate hex bytes.
  bytes_per_sep
    How many bytes between separators. Positive values count from the
    right, negative values count from the left.
Example:
>>> value = b'\xb9\x01\xef'
>>> value.hex()
'b901ef'
>>> value.hex(':')
'b9:01:ef'
>>> value.hex(':', 2)
'b9:01ef'
>>> value.hex(':', -2)
'b901:ef'
(That said, in most scenarios where just a quick print is wanted, I'd probably go with f-string formatting, as in the accepted answer: f"{mynumber:04x}", for the simple reason of "fewer things to remember".)
While all of the previous answers will work, a lot of them have caveats like not being able to handle both positive and negative numbers or only work in Python 2 or 3. The version below works in both Python 2 and 3 and for positive and negative numbers:
Since Python returns a string hexadecimal value from hex() we can use string.replace to remove the 0x characters regardless of their position in the string (which is important since this differs for positive and negative numbers).
hexValue = hexValue.replace('0x','')
EDIT: wjandrea made a good point in that the above implementation doesn't handle values that contain 0X instead of 0x, which can occur in int literals. With this use case in mind, you can use the following case-insensitive implementation for Python 2 and 3:
import re
hexValue = re.sub('0x', '', hexValue, flags=re.IGNORECASE)
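Decimal to hexadecimal; this worked for me:
hex(number).lstrip("0x").rstrip("L")
Note that lstrip("0x") strips any leading run of the characters 0 and x rather than the literal prefix, so hex(0) becomes an empty string and negative numbers keep their 0x.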
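To pull the answers above together, here is a small sketch of the asker's conversion using format(), which handles zero and negative numbers without any slicing or stripping. It assumes Python 3; to_hex and convert_lines are hypothetical helper names, and the input is passed as a list of lines rather than read from pattern0.txt.

```python
def to_hex(number):
    """Format an int as lowercase hex with no 0x prefix."""
    return format(number, "x")


def convert_lines(lines):
    """Convert decimal text lines to hex strings, skipping blank lines."""
    return [to_hex(int(line)) for line in lines if line.strip()]


print(to_hex(1024560))    # fa230
print(to_hex(-255))       # -ff
print(convert_lines(["255\n", "16\n", "\n"]))
```

int() ignores surrounding whitespace, so the trailing newline on each line does not need to be stripped first.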

Python: convert a dot separated hex values to string?

In this post: Print a string as hex bytes? I learned how to print a string as an "array" of hex bytes. Now I need the reverse:
So for example the input would be: 73.69.67.6e.61.74.75.72.65 and the output would be a string.
You can use the built-in binascii module. Do note, however, that this function will only work on ASCII-encoded characters.
binascii.unhexlify(hexstr)
Your input string will need to be dotless, but that is easy to arrange with a simple
string = string.replace('.','')
Another (arguably safer) method would be to use base64 in the following way:
import base64
encoded = base64.b16encode(b'data to be encoded')
print (encoded)
data = base64.b16decode(encoded)
print (data)
or in your example:
data = base64.b16decode(b"7369676e6174757265", True)
print (data.decode("utf-8"))
The string can be sanitised before input into the b16decode method.
Note that I am using python 3.2 and you may not necessarily need the b out the front of the string to denote bytes.
Example was found here
Without binascii:
>>> a="73.69.67.6e.61.74.75.72.65"
>>> "".join(chr(int(e, 16)) for e in a.split('.'))
'signature'
>>>
or better (Python 2 only, since str.decode('hex') does not exist in Python 3):
>>> a="73.69.67.6e.61.74.75.72.65"
>>> "".join(e.decode('hex') for e in a.split('.'))
PS: works with unicode:
>>> a='.'.join(x.encode('hex') for x in 'Hellö Wörld!')
>>> a
'48.65.6c.6c.94.20.57.94.72.6c.64.21'
>>> print "".join(e.decode('hex') for e in a.split('.'))
Hellö Wörld!
>>>
EDIT:
No need for a generator expression here (thx to thg435):
a.replace('.', '').decode('hex')
Use string split to get a list of strings, then base 16 for decoding the bytes.
>>> inp="73.69.67.6e.61.74.75.72.65"
>>> ''.join((chr(int(i,16)) for i in inp.split('.')))
'signature'
>>>
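In Python 3, where str.decode('hex') no longer exists, bytes.fromhex does the same job. A minimal sketch follows; the helper names are hypothetical and UTF-8 is an assumed default encoding, not something the question specifies.

```python
def dotted_hex_to_text(dotted, encoding="utf-8"):
    """Decode a dot-separated string of hex bytes into text."""
    raw = bytes.fromhex(dotted.replace(".", ""))
    return raw.decode(encoding)


def text_to_dotted_hex(text, encoding="utf-8"):
    """Inverse helper: encode text as dot-separated two-digit hex bytes."""
    return ".".join(f"{byte:02x}" for byte in text.encode(encoding))


print(dotted_hex_to_text("73.69.67.6e.61.74.75.72.65"))  # signature
```

Decoding the whole byte string at once, rather than one chr() per byte, is what lets multi-byte characters survive the round trip.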

Why is it when I print something, there is always a unicode next to it? (Python)

[u'Iphones', u'dont', u'receieve', u'messages']
Is there a way to print it without the "u" in front of it?
What you are seeing is the __repr__() representation of the unicode string which includes the u to make it clear. If you don't want the u you could print the object (using __str__) - this works for me:
print [str(x) for x in l]
Probably better is to read up on python unicode and encode using the particular unicode codec you want:
print [x.encode() for x in l]
[edit]: to clarify repr and why the u is there: the goal of repr is to provide a convenient string representation, "to return a string that would yield an object with the same value when passed to eval()". That is, you can copy and paste the printed output and get the same object (a list of unicode strings).
Python contains string classes for both unicode strings and regular strings. The u before a string indicates that it is a unicode string.
>>> mystrings = [u'Iphones', u'dont', u'receieve', u'messages']
>>> [str(s) for s in mystrings]
['Iphones', 'dont', 'receieve', 'messages']
>>> type(u'Iphones')
<type 'unicode'>
>>> type('Iphones')
<type 'str'>
See http://docs.python.org/library/stdtypes.html#sequence-types-str-unicode-list-tuple-buffer-xrange for more information about the string types available in Python.
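The repr-versus-str distinction still applies in Python 3, even though the u prefix is gone there. A short sketch, assuming Python 3: printing a list shows the repr() of each element (with quotes), while joining the elements prints their plain text.

```python
words = ["Iphones", "dont", "receieve", "messages"]

as_list = str(words)        # a list's str() uses repr() for each element
as_text = " ".join(words)   # joining yields the plain text of the elements

print(as_list)
print(as_text)
```

In Python 2 the same join works on unicode strings and avoids the u'...' display without converting each element with str().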
