Basically what I want to do is print u'\u1001', but I do not want the 1001 hardcoded. Is there a way I can use a variable or string for this? Also, it would be nice if I could retrieve this code again, when I use the output as input.
According to the Python documentation on Unicode:
One-character Unicode strings can also be created with the unichr()
built-in function, which takes integers and returns a Unicode string
of length 1 that contains the corresponding code point. The reverse
operation is the built-in ord() function that takes a one-character
Unicode string and returns the code point value:
unichr(40960)
results in the character:
u'\ua000'
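The quoted answer is written for Python 2. In Python 3, chr() does the job of unichr(). A minimal sketch of the round trip, with the code point held in a variable instead of hardcoded (the variable names are only illustrative):

```python
# Python 3: chr() takes the place of Python 2's unichr().
code_point = 0x1001            # any integer, e.g. computed or read from input

char = chr(code_point)         # same character as the literal '\u1001'
recovered = ord(char)          # back to the integer code point

print(ascii(char))             # '\u1001'
print(hex(recovered))          # 0x1001

# Starting from a string of hex digits instead of an integer:
hex_text = "1001"
same_char = chr(int(hex_text, 16))
print(same_char == char)       # True
```

ord() is what recovers the code when the output is fed back in as input.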
so I have the following text:
'\x56\x10\x34\xf8'
According to the assignment I got, this is written in hex. However, I can't convert it into readable text or binary using the unhexlify function from the binascii library (Python).
So I guess this is a three-part question:
I would really appreciate it if someone could identify the form in which this text is written, give a brief explanation of it (or a link that explains it), and give code that converts it from its current form to binary and readable text.
This is a hex representation of ASCII text.
In Python, anything with a '\x' is a hex escape, with the 2 digits after it representing the character's value.
'\x41' represents the capital 'A' character
You can turn most of the characters into standard ASCII using a str() call; however, some hex numbers do not map to readable ASCII characters. For example, '\x10' is a non-printable control character (the newline character '\n' is '\x0a', not '\x10').
However, strings written in this way do not act any differently from their ASCII counterparts. The string '\x54\x47\x47\x19' can be indexed or printed. Index 0 would return '\x54' (displayed as 'T') and index 1 would return '\x47' (displayed as 'G'). Converting it is not necessary.
Assuming those values map to ASCII character values, you should be able to loop over those values (possibly converting to int) and then feed them into chr(val) to obtain the corresponding character.
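As a rough Python 3 sketch of that loop, assuming the text is handled as a bytes object (the variable names are made up for illustration). It also shows why unhexlify did not help: unhexlify expects hex digits such as '561034f8', not a string that already contains the escaped bytes.

```python
import binascii

# Python 3 sketch: treat the data as bytes rather than text.
data = b'\x56\x10\x34\xf8'

# Each \xNN escape is one byte; iterating over bytes yields integers 0..255.
values = list(data)                                  # [86, 16, 52, 248]

# Hex-digit text, which is the form binascii.unhexlify() takes as input.
hex_text = binascii.hexlify(data).decode('ascii')    # '561034f8'

# Binary (base-2) text, eight bits per byte.
bits = ' '.join(format(b, '08b') for b in data)

# Only bytes in the printable ASCII range are shown as characters;
# everything else is replaced by a dot.
readable = ''.join(chr(b) if 32 <= b < 127 else '.' for b in data)  # 'V.4.'

print(values)
print(hex_text)
print(bits)
print(readable)
```

Only two of the four bytes ('V' and '4') are printable ASCII, so this particular string is probably not meant to be read as text.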
I have a large file where any unicode character that wasn't in UTF-8 got replaced by its code point in angle brackets (e.g. the "👍" was converted to "<U+0001F44D>"). Now I want to revert this with a regex substitution.
I've tried to accomplish this with
re.sub(r'<U\+([A-F0-9]+)>',r'\U\1', str)
but obviously this won't work because we cannot insert the group into this unicode escape.
What's the best/easiest way to do this? I found many questions trying to do the exact opposite but nothing useful to 're-encode' these code points as actual characters...
When you have the number of a character, you can call chr(number) to get the character with that code point.
Because we have a string of hex digits, we need to read it as an int with base 16.
Both of those together:
>>> chr(int("0001F44D", 16))
'👍'
However, now we have a small function call, not a string we can simply substitute in! A quick search showed that you can pass a function to re.sub.
Now we get:
re.sub(r'<U\+([A-F0-9]+)>', lambda x: chr(int(x.group(1), 16)), my_str)
PS Don't name your string just str - you'll shadow the builtin str meaning type.
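Put together as a self-contained sketch (the function name restore_code_points and the sample text are invented for illustration; the pattern here also accepts lowercase hex digits, which the original pattern does not):

```python
import re

def restore_code_points(text):
    """Replace each <U+XXXX> placeholder with the character it names."""
    return re.sub(
        r'<U\+([0-9A-Fa-f]+)>',
        lambda match: chr(int(match.group(1), 16)),  # hex digits -> int -> character
        text,
    )

sample = "Nice work <U+0001F44D> and caf<U+00E9>"
restored = restore_code_points(sample)

# ascii() shows the result with escapes, independent of console encoding.
print(ascii(restored))
```

chr() raises ValueError for values above 0x10FFFF, so a malformed placeholder in the file would stop the substitution instead of being skipped silently.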
Using Python, how would I encode all the characters of a string to a URL-encoded string?
As of now, just about every answer eventually references the same methods, such as urllib.parse.quote() or urllib.parse.urlencode(). While these answers are technically valid (they follow the required specifications for encoding special characters in URLs), I have not managed to find a single answer that describes how to encode other/non-special characters as well (such as lowercase or uppercase letters).
How do I take a string and encode every character into a URL-encoded string?
This gist reveals a very nice answer to this problem. The final function code is as follows:
def encode_all(string):
    return "".join("%{0:0>2}".format(format(ord(char), "x")) for char in string)
Let's break this down.
The first thing to notice is that the return value is a generator expression (... for char in string) wrapped in a str.join call ("".join(...)). This means we will be performing an operation for each character in the string, then finally joining each outputted string together (with the empty string, "").
The operation performed on each character in the string is "%{0:0>2}".format(format(ord(char), "x")). This can be broken down into the following:
ord(char): Convert each character to the corresponding number.
format(..., "x"): Convert the number to a hexadecimal value.
"%{0:0>2}".format(...): Format the hexadecimal value into a string (with a prefixed "%").
When you look at the whole function from an overview, it is converting each character to a number, converting that number to hexadecimal, then jamming all the hexadecimal values into a string (which is then returned).
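The steps can be traced for a single character, and the result checked with urllib.parse.unquote. The encode_all_utf8 variant at the end is not from the gist. It is a hypothetical alternative, added on the assumption that non-ASCII input should be percent-encoded as UTF-8 bytes, which the original function does not do.

```python
from urllib.parse import unquote

def encode_all(string):
    return "".join("%{0:0>2}".format(format(ord(char), "x")) for char in string)

# The three steps for a single character:
number = ord("A")                          # 65
hex_digits = format(number, "x")           # '41'
escaped = "%{0:0>2}".format(hex_digits)    # '%41'

encoded = encode_all("Hi there")
print(encoded)                             # %48%69%20%74%68%65%72%65
print(unquote(encoded))                    # Hi there

# Hypothetical variant (an assumption, not part of the gist): encode the
# UTF-8 bytes so that characters outside ASCII also round-trip.
def encode_all_utf8(string):
    return "".join("%{:02X}".format(byte) for byte in string.encode("utf-8"))

print(encode_all_utf8("é"))                # %C3%A9
```

For plain ASCII input the two functions differ only in the case of the hex digits.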
I come from a C background and started learning Python a few days ago. My question is: what is the end-of-string notation in Python? In C we have \0. Is there anything like that in Python?
There isn't one. Python strings store the length of the string independent from the string contents.
There is nothing like that in Python. A string is simply a string. The following:
test = "Hello, world!"
is simply a string of 13 characters. It's a self-contained object and it knows how many characters it contains, so there is no need for an end-of-string notation.
Python's string management is internally a little more complex than that. A string is a sequence type, so from a Python coder's point of view it is more an array of characters than anything. (And so it has no terminating character, just a length.)
If you must know: internally, the character arrays holding Python strings' data are null terminated. But the string object stores a couple of other properties as well (e.g. the hash of the string, for use as a key in dictionaries).
For more detailed info (especially for C coders) see here: http://www.laurentluce.com/posts/python-string-objects-implementation/
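A quick way to see this from the Python side (a small sketch; the embedded-null string is just a made-up example): a null character inside a string is an ordinary character and does not cut the string short.

```python
test = "Hello, world!"
print(len(test))              # 13

# '\0' is an ordinary character here, not a terminator.
text = "abc\0def"
print(len(text))              # 7
print(text.index("\0"))       # 3
print(text[4:])               # def
```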
In Python, if I have a string like:
a =" Hello - to - everybody"
And I do
a.split('-')
then I get
[u'Hello', u'to', u'everybody']
This is just an example.
How can I get a simple list without that annoying u'??
The u means that it's a unicode string - your original string must also have been a unicode string. Generally it's a good idea to keep strings Unicode as trying to convert to normal strings could potentially fail due to characters with no equivalent.
The u is purely used to let you know it's a unicode string in the representation - it will not affect the string itself.
In general, unicode strings work exactly as normal strings, so there should be no issue with leaving them as unicode strings.
In Python 3.x, unicode strings are the default, and don't have the u prepended (instead, bytes (the equivalent to old strings) are prepended with b).
If you really, really need to convert to a normal string (rarely the case, but potentially an issue if you are using an extension library that doesn't support unicode strings, for example), take a look at unicode.encode() and unicode.decode(). You can either do this before the split, or after the split using a list comprehension.
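A sketch of the "after the split" option in Python 3 terms, where str.encode() yields bytes (the analogue of Python 2's plain strings). The strip() call is an addition for tidiness and is not part of the original question.

```python
a = " Hello - to - everybody"

# Split, then strip the surrounding spaces from each piece.
parts = [piece.strip() for piece in a.split('-')]
print(parts)                  # ['Hello', 'to', 'everybody']

# Converting each piece after the split with a list comprehension.
encoded = [piece.encode('utf-8') for piece in parts]
print(encoded)                # [b'Hello', b'to', b'everybody']

# The prefix is part of the repr only; printing an element shows plain text.
print(parts[0])               # Hello
```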
I have the opposite problem. The str '第一回\u3000甄士隐梦幻识通灵 贾雨村风尘怀闺秀' needs to be split on the unicode character. But I made a mistake and wrote split('\u'), which led to a unicode syntax error.
I should have written split('\u3000')
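A short Python 3 sketch of both cases, using a made-up sample string. The incomplete escape is passed to compile() as source text so that the error can be caught and shown instead of stopping the script.

```python
text = "chapter\u3000title"          # U+3000 is the ideographic space

parts = text.split("\u3000")
print(parts)                         # ['chapter', 'title']

# A bare '\u' is an incomplete escape: Python rejects it when the source is
# compiled, before split() is ever called.
error = None
try:
    compile(r"'\u'", "<example>", "eval")
except SyntaxError as exc:
    error = exc

print(type(error).__name__)          # SyntaxError
```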