I was using bz2 earlier to try to decompress an input. The input I wanted to decode was already in compressed format, so I pasted it directly into the interactive Python console:
>>> import bz2
>>> bz2.decompress(input)
This worked just fine without any errors. However, I got different results when I tried to extract the text from an HTML file and then decompress it:
file = open("example.html", "r")
contents = file.read()
# Insert code to pull out the text, which is of type 'str'
result = bz2.decompress(parsedString)
I've checked the string I parsed against the original one, and they look identical. Furthermore, when I copy and paste the string I wish to decompress into my .py file (basically enclosing it in double quotes ""), it works fine. I have also tried opening the file with "rb" in the hope that Python would treat the .html file as binary, but that failed to work as well.
My questions are: what is the difference between these two strings? They are both of type 'str', so I'm assuming there is an underlying difference I am missing. Furthermore, how would I go about retrieving the bz2 content from the .html file so that it won't be treated as an invalid data stream? Any help is appreciated. Thanks!
My guess is that the html file has the text representation of the data instead of the actual binary data in the file itself.
For instance take a look at the following code:
>>> t = '\x80'
>>> t
'\x80'
But say I create a text file whose contents are the four characters \x80 (rather than the single byte) and do:
with open('file') as f:
    t = f.read()
then inspecting t would give back:
'\\x80'
If this is the case, you could use eval to get the desired result:
result = bz2.decompress(eval('"' + parsedString + '"'))
Just make sure that you only do this for trusted data.
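A safer alternative to eval is ast.literal_eval, which only accepts Python literals and cannot execute arbitrary code. A small sketch with made-up data (the escaped_text value here is hypothetical, standing in for the escaped text pulled from the HTML):

```python
import ast

# Hypothetical: the HTML contained the escaped text form of the data,
# i.e. the six characters \x80ab rather than the raw bytes
escaped_text = r'\x80ab'

# Wrap it in a bytes literal and evaluate it safely
raw = ast.literal_eval('b"' + escaped_text + '"')
print(raw)  # b'\x80ab'
```

This still assumes the escaped text contains no stray double quotes; but unlike eval, the worst a malformed input can do is raise an exception.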
I have obtained a data.txt file from an online source. When I open the file with Notepad, I see the random characters as shown in the figure.
I attempted to open the file using the following python code snippet:
import pickle

my_file = 'data.txt'
f = open(my_file, 'rb')
print(f)
ff = pickle.load(f)
print(ff)
f.close()
The first print operation gives <_io.BufferedReader name='data.txt'> in the console, and the second print operation displays all the data of the data.txt file in readable form.
I want to edit the data in the data.txt file with my own data sets. I googled for possible solutions. Most of the available solutions (for example this) suggest changing the encoding of the data.txt file to UTF-8; at present it is ANSI. I changed the encoding to UTF-8 as suggested. However, the problem persists (the file still contains gibberish). Moreover, when I tried to view the contents of the file (now UTF-8 encoded) using the above Python code snippet, I got the following error:
_pickle.UnpicklingError: invalid load key, '\xef'.
The python code shows that the file has valid data. However, I'm unable to edit the data with my own data sets. Any help, please!
The error:
_pickle.UnpicklingError: invalid load key, '\xef'.
means that the first byte pickle read, \xef, is not a valid pickle opcode. That byte is the start of the UTF-8 byte order mark (EF BB BF) that Notepad prepended when you re-saved the file as UTF-8. The file isn't plain text at all: it is binary pickle data, so re-saving it through a text editor corrupts it. Go back to the original file, load it with pickle, edit the resulting Python object, and write it back with pickle.dump().
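A minimal sketch of that workflow (the dict structure here is an assumption; inspect what pickle.load() actually returns for your file first):

```python
import pickle

# Stand-in for the downloaded file (assumed structure)
original = {'readings': [1.2, 3.4]}
with open('data.txt', 'wb') as f:
    pickle.dump(original, f)

# Load the binary pickle, edit the Python object, never the raw file
with open('data.txt', 'rb') as f:
    data = pickle.load(f)
data['readings'].append(5.6)

# Write it back in binary mode
with open('data.txt', 'wb') as f:
    pickle.dump(data, f)

with open('data.txt', 'rb') as f:
    print(pickle.load(f))  # {'readings': [1.2, 3.4, 5.6]}
```

The key point is that every open() uses 'rb' or 'wb'; the file never passes through a text editor or a text-mode handle.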
I am trying to write strings containing special characters, such as Chinese characters and French accents, to a CSV file. At first I was getting the classic Unicode encode error and looked online for a solution. Many resources told me to use .encode('utf-8', errors='ignore') to solve the problem. This places bytes in the Excel file. In my code shown below, I tried getting the function that appends the character to the CSV file to convert the character to UTF-8. This makes the program run without error; however, when I open the Excel document I see that instead of "é" and "蒋" being added to the file, I see "é" and "è’‹".
import csv

def appendToCSV(specialCharacter):
    with open('myCSVFile.csv', "a", newline="", encoding='utf-8') as csvFile:
        csvFileWriter = csv.writer(csvFile)
        csvFileWriter.writerow([specialCharacter])

appendToCSV('é')
appendToCSV('蒋')
I would like to display the characters in the Excel document exactly as shown; any help would be appreciated. Thank you.
Use utf-8-sig for the encoding. Excel requires the byte order mark (BOM) signature or it will interpret the file in the local ANSI encoding.
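With the asker's code the only change needed is the codec name (file name as in the question). Writing all the rows in a single open avoids the utf-8-sig codec emitting a second BOM on a later append:

```python
import csv

# utf-8-sig writes a byte order mark at the start of the file,
# which Excel uses to detect UTF-8 instead of assuming local ANSI
with open('myCSVFile.csv', 'w', newline='', encoding='utf-8-sig') as csvFile:
    csvFileWriter = csv.writer(csvFile)
    csvFileWriter.writerow(['é'])
    csvFileWriter.writerow(['蒋'])
```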
I'm pretty sure your Excel worksheet is set to use "Latin-1". Try switching the setting to UTF-8.
Note:
>>> x = "蒋"
>>> bs = x.encode()
>>> bs
b'\xe8\x92\x8b'
>>> bs.decode("latin")
'è\x92\x8b'
And:
>>> x = 'é'
>>> bs = x.encode()
>>> bs.decode('latin-1')
'Ã©'
I am trying to download a .jpg file from a url using requests module in python. This is what I tried.
There is no error. but I am unable to open the output file.
>>> import requests
>>> l = requests.get("http://www.mosta2bal.com/vb/imgcache/2/9086screen.jpg")
>>> l
<Response [200]>
>>> l.text
u'\ufffd\ufffd\ufffd\ufffd\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\ufffd\ufffd\x12EExif\x00\x00MM\x00*\x00\x00\x00\x08\x00\x07\x01\x12\x00\x03\x......long text
>>> l.encoding
>>> import codecs
>>> f = codecs.open('out.jpg', mode="w", encoding="utf-8")
>>> f.write(l.text)
You're trying to access binary data as if it were text. This means that Requests has to guess an encoding for it (and any guess it makes will be wrong, because it's not text) and decode it, just so you can re-encode it as UTF-8 when writing. If you're really lucky, maybe Requests will guess UTF-8, and your data will just happen to be data that can be round-tripped as UTF-8, so it might work one time in a thousand, at best.
Just ask requests for binary response content, and save it to a binary file.
While we're at it, you never actually close the file. You're just sitting there in the interactive interpreter with an open file object that hasn't been flushed yet. So, it's entirely possible that the last buffer worth of data, or even all the data, won't be there yet. That's exactly what the with statement is for.
So:
l = requests.get("http://www.mosta2bal.com/vb/imgcache/2/9086screen.jpg")
with open('out.jpg', 'wb') as f:
    f.write(l.content)
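The mangling is easy to reproduce locally, because the first bytes of any JPEG are already invalid UTF-8. A small sketch (no network, not specific to requests):

```python
# Every JPEG starts with bytes like these; they are not valid UTF-8
jpeg_header = b'\xff\xd8\xff\xe0\x00\x10JFIF'

# Binary mode round-trips the bytes exactly
with open('demo.jpg', 'wb') as f:
    f.write(jpeg_header)
with open('demo.jpg', 'rb') as f:
    assert f.read() == jpeg_header

# Decoding as text replaces each invalid byte with U+FFFD,
# which is exactly the '\ufffd' garbage shown in the question's l.text
print(repr(jpeg_header.decode('utf-8', errors='replace')))
```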
First, as @abarnert mentioned in the comment, an image file consists of binary data, not text. To get the data, use the .content attribute:
data = l.content
with open('image.jpg', 'wb') as image:  # open it in binary mode
    image.write(data)
I'm writing a program to 'manually' rearrange a csv file into proper JSON syntax, using a short Python script. From the input file I use readlines() to format the file as a list of rows, which I manipulate and concatenate into a single string, which is then output to a separate .txt file. The output, however, contains gibberish instead of the Hebrew characters that were present in the input file, and the output is double-spaced horizontally (a whitespace character appears between each pair of characters). As far as I can tell, the problem has to do with the encoding, but I haven't been able to figure out what. When I check the encoding of the input and output files (using the .encoding attribute), both return None, which means they use the system default. Technical details: Python 2.7, Windows 7.
While there are a number of questions out there on this topic, I didn't find a direct answer to my problem.
Detecting the system defaults won't help me in this case, because I need the program to be portable.
Here's the code:
def txt_to_JSON(csv_list):
    ...some manipulation of the list...
    return JSON_string

file_name = "input_file.txt"
my_file = open(file_name)
# make each line of the input file a value in a list
lines = my_file.readlines()
# break up each line into a list such that each 'column' is a value in that list
for i in range(0, len(lines)):
    lines[i] = lines[i].split("\t")
J_string = txt_to_JSON(lines)
json_file = open("output_file.txt", "w+")
json_file.write(J_string)
json_file.close()
All data needs to be encoded to be stored on disk. If you don't know the encoding, the best you can do is guess. There's a library for that: https://pypi.python.org/pypi/chardet
I highly recommend Ned Batchelder's presentation
http://nedbatchelder.com/text/unipain.html
for details.
There's an explanation about the use of "unicode" as an encoding on windows: What's the difference between Unicode and UTF-8?
TLDR:
Microsoft uses UTF16 as encoding for unicode strings, but decided to call it "unicode" as they also use it internally.
Even if Python 2 is a bit lenient about string/unicode conversions, you should get used to always decoding on input and encoding on output.
In your case
filename = 'where your data lives'
with open(filename, 'rb') as f:
    encoded_data = f.read()
decoded_data = encoded_data.decode("UTF-16")
# do stuff, resulting in result (all on unicode strings)
result = txt_to_JSON(decoded_data)
encoded_result = result.encode("UTF-16")  # really, just using UTF-8 for everything makes things a lot easier
outfile = 'where your data goes'
with open(outfile, 'wb') as f:
    f.write(encoded_result)
You need to tell Python to use the Unicode character encoding to decode the Hebrew characters.
Here's a link to how you can read Unicode characters in Python: Character reading from file in Python
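A quick round trip shows the decode-on-input / encode-on-output pattern applied to Hebrew text (UTF-8 here is an assumption; match the codec to whatever your input file actually uses):

```python
# -*- coding: utf-8 -*-
hebrew = u'שלום עולם'                 # "hello world"

encoded = hebrew.encode('utf-8')      # what lives on disk: bytes
decoded = encoded.decode('utf-8')     # what you manipulate: a unicode string

print(decoded == hebrew)              # True
```

As long as every read decodes and every write encodes with the same codec, the Hebrew characters survive untouched.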
I just started using Python, I am trying to make a program that writes the lyrics of a song on the screen opened from the internet "www....../lyrics.txt".
My first code:
import urllib.request
lyrics=urllib.request.urlopen("http://hereIsMyUrl/lyrics.txt")
text=lyrics.read()
print(text)
When I ran this code, it didn't give me the lyrics as they are written on the website; it gave me '\r\n' at all the places that should have been new lines, and all the lyrics in one long messy string. For example:
Some lyrics here\r\nthis should already be the next line\r\nand so on.
I searched the internet for codes to replace the '\r\n' commands with new lines and tried the following:
import urllib.request
lyrics=urllib.request.urlopen("http://hereIsMyUrl/lyrics.txt")
text=lyrics.read()
text=text.replace("\r\n","\n")
print(text)
I hoped it would at least replace something, but instead it gave me a runtime error:
TypeError: expected bytes, bytearray or buffer compatible object
I searched the internet about that error, but I didn't find anything connected to opening files from the internet.
I have been stuck at this point for hours and have no idea how to continue.
Please help!
Thanks in advance!
Your example is not working because the data returned by the read call is a bytes object. You need to decode it using an appropriate encoding. See also the docs for request.urlopen, file.read and bytes operations.
A complete working example is given below:
#!/usr/bin/env python3
import urllib.request
# Example URL
url = "http://ntl.matrix.com.br/pfilho/oldies_list/top/lyrics/black_or_white.txt"
# Open URL: returns file-like object
lyrics = urllib.request.urlopen(url)
# Read raw data, this will return a "bytes object"
text = lyrics.read()
# Print raw data
print(text)
# Print decoded data:
print(text.decode('utf-8'))
# If you still need newline conversion, you could use the following
text = text.decode('utf-8')
text = text.replace('\r\n', '\n')
print(text)
In Python 3, bytes are treated differently from text strings. After the line
text=lyrics.read()
If you try this
print(type(text))
It returns
<class 'bytes'>
So it is not a string; it's a bytes object (an immutable sequence of bytes).
When you're calling text=text.replace("\r\n","\n") you're passing it strings, which is the reason for the error message. So you have two options.
Option 1: convert the variable text from bytes to str by adding this line right after the text = lyrics.read() line:
text = text.decode("utf-8")
Option 2: change the replace call to use bytes instead of strings:
text = text.replace(b"\r\n", b"\n")
I recommend option 1 just in case you have more string manipulation to do on the text.
The following works for me in Python 3.2:
import urllib.request
lyrics = urllib.request.urlopen("http://google.com/")
text = str(lyrics.read(), 'utf-8')
text = text.replace("\r\n", "\n")
print(text)
Key difference was that lyrics.read() was returning a bytes object rather than a string, which replace() did not know how to handle. Passing the bytes to str() along with an encoding decodes them before performing the replace. (Note that a bare str(lyrics.read()) would only give you the printable repr, b'...', with literal backslash escapes instead of real line breaks.)