Unzipping a gzip file that contains a csv - python

I have just hit an endpoint and can pull down a gzip compressed file.
I have tried saving it and extracting the CSV inside, but I keep getting encoding errors whether I try decoding the binary data as UTF-8 or as UTF-16.
To write to the saved gzip I write in binary mode:
r = requests.get(url, auth=auth, stream=True)
with gzip.open('file.gz', 'wb') as f:
    f.write(r.content)
Where r.content looks like:
b'PK\x03\x04\x14\x00\x08\x08\x08\x00f\x8dKM\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00A\x00\x00\x00RANKTRACKING_report_created_at_11_10_18_17_41-20181011-174141.csv\xec\xbdk\x8f\xe3V\x96\xae\xf9}\x80\xf9\x0f\ ... '
To extract the file on my machine manually I first have to extract to zip and then I can extract that to get the csv. I have tried the same there but ran into encoding errors there too.
Looking for a way to pull out this csv so I can print lines in python console.

That's not a gzip file. That's a zip file. You are then taking the zip file that you retrieved from the URL, and trying to compress it again as a gzip file. So now you have a zip file inside a gzip file. You have moved one step further away from extracting the CSV contents, as opposed to one step closer.
You need to use zipfile to extract the contents of the zip file that you downloaded.
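A minimal sketch of that approach, assuming the downloaded bytes are a zip archive whose first member is a UTF-8 encoded CSV. The helper name read_csv_lines_from_zip and the default encoding are assumptions for illustration, not something stated in the question.

```python
import io
import zipfile


def read_csv_lines_from_zip(zip_bytes, encoding="utf-8"):
    """Return the text lines of the first member of an in-memory zip archive."""
    # Wrap the raw bytes in a file-like object so zipfile can seek in them.
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        first_member = archive.namelist()[0]  # assumes the archive is not empty
        # archive.read() returns the decompressed member as bytes.
        return archive.read(first_member).decode(encoding).splitlines()
```

With the response from the question, this could be used as `for line in read_csv_lines_from_zip(r.content): print(line)`. If decoding fails, the CSV is probably not UTF-8, and a different encoding would need to be passed in.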

Related

how to compress json files?

I am currently writing json files to disk using
print('writing to disk .... ')
f = open('mypath/myfile', 'wb')
f.write(getjsondata.read())
f.close()
Which works perfectly, except that the json files are very large and I would like to compress them. How can I do that automatically? What should I do?
Thanks!
Python has a standard module for zlib, which can compress and decompress data for you. You can use it directly on your data and write (and read) a custom format, or use the gzip module, which wraps zlib to read and write gzip-compatible files while automatically compressing or decompressing the data, so that it looks like an ordinary file object.
gzip.open thus neatly replaces the built-in open for interacting with files, and all you need is this:
import gzip
print('writing to disk .... ')
with gzip.open('mypath/myfile', 'wb') as f:
    f.write(getjsondata.read())
(with a change in the open line because I highly recommend using the with syntax to handle file objects.)
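As a hedged sketch of the same idea for data that is already a Python object rather than a raw response body: gzip.open also supports text modes ('wt' and 'rt'), so json.dump and json.load can work on the compressed file directly. The function names write_json_gz and read_json_gz are hypothetical.

```python
import gzip
import json


def write_json_gz(path, obj):
    """Serialize obj as JSON into a gzip-compressed file."""
    # 'wt' opens the compressed file in text mode, which json.dump expects.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(obj, f)


def read_json_gz(path):
    """Load a JSON document back from a gzip-compressed file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```

Reading goes through the same module, so no separate decompression step is needed before parsing.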

When extracting my .json.gz file, some characters are added to it - and the file cannot be stored as a json file

I am trying to unzip some .json.gz files, but gzip adds some characters to it, and hence makes it unreadable for JSON.
What do you think is the problem, and how can I solve it?
If I use unzipping software such as 7zip to unzip the file, this problem disappears.
This is my code:
with gzip.open('filename', 'rb') as f:
    json_content = json.loads(f.read())
This is the error I get:
Exception has occurred: json.decoder.JSONDecodeError
Extra data: line 2 column 1 (char 1585)
I used this code:
with gzip.open('filename', mode='rb') as f:
    print(f.read())
and realized that the file starts with b' (as shown below):
b'{"id":"tag:search.twitter.com,2005:5667817","objectType":"activity"
I think the b' is what makes the file unworkable for the next stage. Do you have any solution to remove the b'? There are millions of these zipped files, and I cannot do that manually.
I uploaded a sample of these files in the following link
just a few json.gz files
The problem isn't the b prefix you see with print(f.read()). That prefix only means the data is a bytes object (a sequence of integer byte values) rather than a str (a sequence of Unicode characters), and json.loads() accepts either in Python 3.6 and later. The JSONDecodeError occurs because the content of the gzipped file is not one valid JSON document. The format looks like JSON Lines, which the standard library json module doesn't (directly) support.
Dunes' answer to the question that @Charles Duffy marked this as a duplicate of (at one point) wouldn't have worked as presented because of this formatting issue. However, from the sample file you linked in your question, it looks like there is a valid JSON object on each line of the file. If that's true of all of your files, then a simple workaround is to process each file line by line.
Here's what I mean:
import json
import gzip

filename = '00_activities.json.gz'  # Sample file.
json_content = []
with gzip.open(filename, 'rb') as gzip_file:
    for line in gzip_file:  # Read one line.
        line = line.rstrip()
        if line:  # Any JSON data on it?
            obj = json.loads(line)
            json_content.append(obj)
print(json.dumps(json_content, indent=4))  # Pretty-print data parsed.
Note that the output it prints shows what valid JSON might have looked like.
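Since the question mentions millions of files, a generator variant may be easier to reuse. This is only a sketch under the same assumption as above (one JSON object per line, UTF-8 encoded); the name iter_json_lines is hypothetical.

```python
import gzip
import json


def iter_json_lines(path):
    """Yield one parsed object per non-blank line of a gzipped JSON Lines file."""
    # 'rt' decodes the decompressed bytes to text while reading.
    with gzip.open(path, "rt", encoding="utf-8") as gzip_file:
        for line in gzip_file:
            line = line.strip()
            if line:  # Skip blank lines between records.
                yield json.loads(line)
```

Because it yields objects one at a time, a caller can process a large file without holding every record in memory, for example `for obj in iter_json_lines('00_activities.json.gz'): ...`.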

how to decompress .tar.bz2 in memory with python

How to decompress *.bz2 file in memory with python?
The bz2 file comes from a csv file.
I use the code below to decompress it in memory, it works, but it brings some dirty data such as filename of the csv file and author name of it, is there any other better way to handle it?
#!/usr/bin/python
# -*- coding: utf-8 -*-
import StringIO
import bz2
with open("/app/tmp/res_test.tar.bz2", "rb") as f:
    content = f.read()
compressedFile = StringIO.StringIO(content)
decompressedFile = bz2.decompress(compressedFile.buf)
compressedFile.seek(0)
with open("/app/tmp/decompress_test", 'w') as outfile:
    outfile.write(decompressedFile)
I found this question, which deals with gzip; my data is in bz2 format, and when I tried to follow its instructions, bz2 did not seem to work that way.
Edit:
Both the answer from @metatoaster and the code above leave extra dirty data in the final decompressed file.
For example: my original data is attached below and in csv format with the name res_test.csv:
Then I cd into the directory containing the file and compress it with tar -cjf res_test.tar.bz2 res_test.csv to get res_test.tar.bz2. This file simulates the bz2 data that I will get from the internet, which I want to decompress in memory without writing it to disk first. But what I get is the data below, which contains too much dirty data:
The data is still there, but submerged in noise. Is it possible to decompress it into exactly the original data, instead of decompressing it and then extracting the real data from all that noise?
For generic bz2 decompression, the BZ2File class may be used.
from bz2 import BZ2File
with BZ2File("/app/tmp/res_test.tar.bz2") as f:
    content = f.read()
content should then contain the decompressed contents of the file.
However, given that this is a tar file (an archive file that is normally extracted to disk as a directory of files), the tarfile module could be used instead, and it has extended mode flags for handling bz2. Assuming the target file contains a res_test.csv, the following can be used:
import tarfile
tf = tarfile.open('/app/tmp/res_test.tar.bz2', 'r:bz2')
csvfile = tf.extractfile('res_test.csv').read()
The r:bz2 mode opens the archive so that it can seek backwards. This matters because the stream mode r|bz2 does not allow random access, which makes it impractical to call extractfile on arbitrary members. The second line calls extractfile to get a file-like object for 'res_test.csv' and reads its contents as a byte string.
The transparent open mode ('r:*') is typically recommended, however, so that no failure occurs if the input tar file is compressed with gzip instead.
Naturally, tarfile.open also accepts arbitrary stream objects through its fileobj argument. If the file was already opened using BZ2File, this can be used as well:
with BZ2File("/app/tmp/res_test.tar.bz2") as f:
    tf = tarfile.open(fileobj=f, mode='r:')
    csvfile = tf.extractfile('res_test.csv').read()
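For the case in the question, where the compressed data arrives from the network and should never touch the disk, the same calls can be pointed at an in-memory buffer. This is a minimal sketch; the helper name read_tar_member is hypothetical, and it assumes the whole archive fits in memory as a bytes object.

```python
import io
import tarfile


def read_tar_member(archive_bytes, member_name):
    """Return the raw bytes of one member of an in-memory tar archive."""
    # 'r:*' lets tarfile detect the compression (bz2, gzip, ...) by itself.
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tf:
        member = tf.extractfile(member_name)
        if member is None:  # extractfile returns None for non-file members
            raise ValueError(f"{member_name!r} is not a regular file")
        return member.read()
```

Reading the member through tarfile is what removes the "dirty data": the file name, owner, and padding seen after a plain bz2 decompression are the tar headers, which tarfile parses instead of returning them as content.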

Bytes from gzip file to text in python

Once the contents of a gzip file is extracted into a string called text, it looks like gibberish. How can I turn it into something human-readable?
with open("zipped_ex.gz.2016") as f:
    text = f.read()
print(text)
Note: I'm not searching for a way to go from zipped_ex.gz.2016 to the contents. Instead, I'm searching for a way to go from the bytestring to the contents.
import gzip
with gzip.GzipFile("zipped_ex.gz.2016") as f:
    text = f.read()
print(text)
On disk, the file is a binary blob that is not human-readable.
To work with the data inside the archive, you need to extract it somehow.
In this case, that happens in memory via the GzipFile class, which decompresses the archive on the fly, so f.read() returns the archive's contents rather than the compressed bytes stored on disk.
The same module can be used on a bytes string:
import io
import gzip
f = io.BytesIO(b"Your compressed gzip-file content here")
with gzip.GzipFile(fileobj=f) as fh:
    plain_text = fh.read()
print(plain_text)
Note: a gzip file is a single compressed data stream. If it wraps a tar archive holding several text files, have a look at this question: How do I compress a folder with the Python GZip module?
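When the compressed data is already held as a bytes object, the standard library's gzip.decompress does the same job in one call. The sketch below also decodes the result to text; the function name gzip_bytes_to_text and the UTF-8 default are assumptions, since the question does not say how the content is encoded.

```python
import gzip


def gzip_bytes_to_text(compressed, encoding="utf-8"):
    """Decompress gzip data held in memory and decode it to a str."""
    # gzip.decompress returns bytes; decoding turns them into readable text.
    return gzip.decompress(compressed).decode(encoding)
```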

Can't write long JSON output to text file

I have a long string (8,315 characters) worth of JSON, but I can't seem to write it to a .txt file using Python without it being truncated.
I write the JSON to a text file and then upload it via FTP, but both the .txt file on my system and the .txt file on the FTP server are truncated.
Here's the code:
# Upload the results
host = ftputil.FTPHost('ftp.website.com', 'username', 'password')
jsonOutput = json.dumps(full_json)
f = open('C:/Comparison.txt', 'w')
f.write(jsonOutput)
host.upload('C:/Comparison.txt', '/public_html/Comparison.txt')
f.close()
print jsonOutput
The JSON output in the console is valid and whole, but it is truncated in the .txt file that is written (and then the .txt file after it is uploaded).
Most of the time, the output will end at http://www.digikey.com/product-detail/en/A000073/1050-10 when the full URL is actually http://www.digikey.com/product-detail/en/A000073/1050-1041-ND/3476357 (and then of course, it cuts off the rest of the JSON)
I'm not sure if this makes any difference, but I also tried f.write(re.escape(jsonOutput)) with the same results.
Can anyone help with this?
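The file is uploaded before it is closed, so the last buffered chunk may not have been written to disk yet. Close the file first, for example by using with, and upload afterwards:
with open('C:/Comparison.txt', 'w') as f:
    json.dump(full_json, f)
# The file is closed (and flushed) here, so it is safe to upload.
host.upload('C:/Comparison.txt', '/public_html/Comparison.txt')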
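A small sketch of why the with form helps: once the block exits, the file is closed and its buffer flushed, so the size on disk matches the serialized JSON. The helper name write_json_file is hypothetical, and the size check is only there to illustrate the point.

```python
import json
import os


def write_json_file(path, obj):
    """Write obj as JSON and return the file size once the file is closed."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f)
    # The with block has closed the file, so all buffered data is on disk.
    return os.path.getsize(path)
```

Any step that reads the file from disk, such as an FTP upload, should run only after this point.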
