Python Requests Reading text [duplicate] - python

This question already has an answer here:
How to remove Byte Order Mark in python
(1 answer)
Closed 7 months ago.
I am trying to learn text processing with NLTK, following the NLTK book.
When I try to read a text, the result is slightly different from what I expect.
import requests
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = requests.get(url)
response.text[:25]
How can I read the text without the leading '\ufeff' (the part highlighted in the uploaded image)?

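What you are seeing is a Unicode character, the byte order mark (BOM, U+FEFF), at the start of the string.
One option is to encode the string to ASCII with errors ignored, which drops every non-ASCII character, including the BOM.
Example:
a = u'\ufeffHello World'
print(a.encode('ascii', 'ignore').decode('ascii'))
Hello World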

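The simple answer is to print the value instead of just evaluating it in the shell:
print(response.text[:25])
This should print:
The Project Gutenberg EB
The BOM is still the first character of the string; it is just invisible when printed. When you evaluate an expression, the shell shows repr() of the value, which escapes such characters, so
print(repr(response.text[:25]))
will again print:
'\ufeffThe Project Gutenberg EB'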
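If the goal is to actually remove the BOM rather than hide it, a minimal sketch follows. It uses a hypothetical byte string as a stand-in for response.content so that it runs without network access; the sample text is an assumption, not the real file contents.

```python
import codecs

# Stand-in for response.content: UTF-8 bytes that start with a BOM.
# The sample text is hypothetical.
raw = codecs.BOM_UTF8 + "The Project Gutenberg EBook".encode("utf-8")

# Plain UTF-8 decoding keeps the BOM as the first character.
with_bom = raw.decode("utf-8")

# The "utf-8-sig" codec drops a leading BOM if one is present.
without_bom = raw.decode("utf-8-sig")

# For text that is already decoded, strip the character explicitly.
stripped = with_bom.lstrip("\ufeff")

print(repr(with_bom[:8]))
print(repr(without_bom[:8]))
```

With requests, setting response.encoding = "utf-8-sig" before reading response.text should have the same effect, since response.text is decoded with whatever response.encoding names.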

Related

How to decode this "%E3%83%9C" string in python? [duplicate]

This question already has answers here:
Url decode UTF-8 in Python
(5 answers)
Closed 6 months ago.
So I have the following string
"%E3%83%9C%E3%83%89%E3%82%AB%E3%81%95%E3%82%93"
It actually means this
ボドカさん
This string seems to be encoded in UTF-8 because when I write this in python
encoded_str = b'\xe3\x83\x9c\xe3\x83\x89\xe3\x82\xab\xe3\x81\x95\xe3\x82\x93'
print(encoded_str)
print(encoded_str.decode('utf-8'))
Here is the output I get
b'\xe3\x83\x9c\xe3\x83\x89\xe3\x82\xab\xe3\x81\x95\xe3\x82\x93'
ボドカさん
But now I would like a script that will allow me to decode any string in the initial format and here is my code.
import re
import os
mystr = "%E3%83%9C%E3%83%89%E3%82%AB%E3%81%95%E3%82%93"
mystr = mystr.lower()
mystr = re.sub('%', r'\\x', mystr)
encoded_str = bytes(mystr, "utf-8")
print(mystr)
print(encoded_str)
print(encoded_str.decode('utf-8'))
Output:
\xe3\x83\x9c\xe3\x83\x89\xe3\x82\xab\xe3\x81\x95\xe3\x82\x93
b'\\xe3\\x83\\x9c\\xe3\\x83\\x89\\xe3\\x82\\xab\\xe3\\x81\\x95\\xe3\\x82\\x93'
\xe3\x83\x9c\xe3\x83\x89\xe3\x82\xab\xe3\x81\x95\xe3\x82\x93
I tried so many possibilities but I couldn't find the right way to encode my string properly, the way the b'STRING' literal does. I always get extra \ characters from the encoding process, which then spoil the decoding process too.
I tried all the encoding methods existing in python for the bytes() function.
I need help please. Thank you.
Stack overflow banned me for that question lol
mystr = "%E3%83%9C%E3%83%89%E3%82%AB%E3%81%95%E3%82%93"
encoded_str = bytes.fromhex(mystr.replace('%', ''))
print(encoded_str.decode('utf-8'))
Output:
ボドカさん
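The bytes.fromhex approach only works when every character in the string is percent-encoded. As a hedged alternative, the standard library's urllib.parse.unquote handles mixed strings as well; the second input below is a made-up example, not from the question.

```python
from urllib.parse import unquote

mystr = "%E3%83%9C%E3%83%89%E3%82%AB%E3%81%95%E3%82%93"

# unquote() replaces %XX escapes and decodes the bytes as UTF-8 by default.
decoded = unquote(mystr)
print(decoded)

# Hypothetical mixed input: plain ASCII stays as-is, escapes are decoded.
mixed = unquote("name=%E3%83%9C&lang=ja")
print(mixed)
```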

extract some arabic/persian (unicode) words with regex using python [duplicate]

This question already has answers here:
Python and regular expression with Unicode
(2 answers)
Closed 2 years ago.
I need to extract some specific names in Arabic/Persian (something like proper nouns in English), using python re library.
example (the word "شرکت" means "company" and we want to extract what the company name is):
input: شرکت تست گستران خلیج فارس
output: تست گستران خلیج فارس
I've seen [this answer] and it would be fine to replace "university" with "شرکت" in that example but I don't understand how to find the keywords by regex with Arabic Unicode when it's not possible to use that in this way:
re.match("شرکت", "\u0634\u0631\u06A9\u062A") # returns None
Python 2 does not parse string literals as unicode by default (whether you paste non-ASCII letters or use a \u escape in the code). You have to be explicit about it:
re.match(u"شرکت", u"\u0634\u0631\u06A9\u062A")
Otherwise, the Arabic pattern is stored as its encoded bytes, which are different from the Unicode code points, and the string on the right keeps its literal backslashes, since Python 2 byte strings do not recognize \u as an escape.
Another option is to import the Python 3 behavior from __future__. In Python 3 every string literal is unicode, which makes the u"..." prefix redundant:
from __future__ import unicode_literals
This makes string literals unicode without the u"" prefix.
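For the extraction itself, here is a minimal Python 3 sketch. It assumes the company name is simply everything that follows the keyword and some whitespace; the pattern and the variable names are illustrative, not taken from the linked answer.

```python
import re

# The keyword from the question ("company"), written with escapes
# so the code points are unambiguous.
keyword = "\u0634\u0631\u06A9\u062A"

# Sample input from the question: keyword followed by the company name.
name = "تست گستران خلیج فارس"
text = keyword + " " + name

# Capture everything after the keyword and the whitespace that follows it.
match = re.match(keyword + r"\s+(.+)", text)
company = match.group(1) if match else None
print(company)
```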

Python3 saving JSON with unicode single quote [duplicate]

This question already has answers here:
Saving UTF-8 texts with json.dumps as UTF-8, not as a \u escape sequence
(12 answers)
Closed 1 year ago.
I know this has been asked before on Stackoverflow and on other sites but I cannot seem to be able to save a JSON file using escaped Unicode characters (Python3). I have read a lot of tutorials.
What am I missing? I have tried a lot of things but nothing works. I have also tried encoding/decoding in UTF-8 but I am obviously missing something.
Just to be clear, I have managed to get it working for other characters like й (0439), but I am having trouble with a single quote being encoded.
If I have the following dict:
import json
data = {"key": "Test \u0027TEXT\u0027 around"}
I want to save it exactly as it is in a new JSON file, but no matter what I do it always ends up as a single character, which is what is encoded in Unicode.
The following 2 blocks print the exact same thing: {"key": "Test 'TEXT' around"}.
print(json.dumps(data))
print(json.dumps(data, ensure_ascii=False))
Is there any way to keep the Unicode string literal? I want to have that very string as a value: "Test \u0027TEXT\u0027 around"
The behavior you are describing has nothing to do with JSON. This is simply how Python 3 handles strings. Open the shell and write:
>>> "Test \u0027TEXT\u0027 around"
"Test 'TEXT' around"
If you do not want Python to interpret the special characters, you should use raw strings (or maybe even byte sequences):
>>> r"Test \u0027TEXT\u0027 around"
'Test \\u0027TEXT\\u0027 around'
Reference:
https://docs.python.org/2.0/ref/strings.html
https://docs.python.org/3/library/stdtypes.html#binaryseq
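Note that a raw string changes the data: the JSON file then contains a doubled backslash. If the file itself should contain \u0027, one possible workaround (an assumption on my part, not something the json module offers as an option) is to post-process the serialized text. In json.dumps output an apostrophe can only occur inside string values, and \u0027 is a valid JSON escape for it.

```python
import json

data = {"key": "Test 'TEXT' around"}

# Serialize normally, then rewrite each apostrophe as its JSON escape.
encoded = json.dumps(data).replace("'", "\\u0027")
print(encoded)

# Any JSON parser decodes the escape back to the same character.
round_trip = json.loads(encoded)

# For comparison: a raw string stores a literal backslash,
# which json.dumps then escapes as two backslashes.
raw_data = {"key": r"Test \u0027TEXT\u0027 around"}
raw_encoded = json.dumps(raw_data)
print(raw_encoded)
```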

Python: Replace URLEncoded characters in String with what they represent [duplicate]

This question already has answers here:
Decode HTML entities in Python string?
(6 answers)
Closed 7 years ago.
I've been banging my head against the wall with this for a while. I'm trying to parse an RSS feed with Python's BeautifulSoup, and every now and then I get errors like:
I don&#39;t know what I am talking about
I can't seem to find any python library that will replace those characters with what they should be, so the resulting string looks like this:
I don't know what I am talking about
The closest I've gotten was
urllib.unquote(post_content).decode('utf-8')
But that still does not replace the url encoded character with a '. Does anyone know a good way to replace those urlencoded characters with the ASCII characters they represent? There are also other errors, like ( and ) appearing as &#40; and &#41;.
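Those weird strings are called HTML entities, not URL encoding, which is why urllib.unquote leaves them alone. You can decode them as described in the question "Decode HTML entities in Python string?", which says to use the function unescape from the html module (html.unescape, available since Python 3.4).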
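A minimal sketch of decoding HTML entities in Python 3 follows. The sample string is an assumption about what the feed text might look like, using numeric character references for the apostrophe and the parentheses.

```python
import html

# Hypothetical feed text containing numeric character references.
raw = "I don&#39;t know what I am talking about &#40;really&#41;"

# html.unescape() converts named and numeric references to characters.
clean = html.unescape(raw)
print(clean)
```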

how to normalize or decode an URL in python? [duplicate]

This question already has answers here:
Decode escaped characters in URL
(5 answers)
Url decode UTF-8 in Python
(5 answers)
Closed 9 years ago.
I have a link like below
http%253A%252F.....25252520.doc
How do I convert this to a normal link in Python? The link has lots of encoded stuff.
Apply urllib.unquote twice:
>>> import urllib
>>> strs = urllib.unquote("http%253A%252F.....25252520.doc")
>>> urllib.unquote(strs)
'http:/.....25252520.doc'
Use urllib.unquote():
Replace %xx escapes by their single-character equivalent.
It looks as if you have a double or even triple encoded URL; the http:// part has been encoded to http%253A%252F, which decodes to http%3A%2F, which in turn becomes http:/. The URL itself may contain another stage of encoding, but you didn't share enough of the actual URL with us to determine that.
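In Python 3 the function lives in urllib.parse. When the number of encoding layers is unknown, one option is to decode until the string stops changing. This is a sketch with a hypothetical helper name and example URL; it will over-decode a URL that legitimately contains a literal percent sign.

```python
from urllib.parse import unquote


def fully_unquote(url, max_rounds=5):
    """Apply unquote() until the string stops changing (hypothetical helper)."""
    for _ in range(max_rounds):
        decoded = unquote(url)
        if decoded == url:
            break
        url = decoded
    return url


# Made-up URL: the scheme is double encoded, the space is triple encoded.
example = "http%253A%252F%252Fexample.com%252Fa%252520b.doc"
print(fully_unquote(example))
```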
