I am working with a SQL Server database table similar to this
USER_ID varchar(50), FILE_NAME ntext, FILE_CONTENT ntext
sample data:
USER_ID: 1
FILE_NAME: (AttachedFiles:1)=file1.pdf
FILE_CONTENT: (AttachedFiles:1)=H4sIAAAAAAAAAOy8VXQcy7Ku….
Using regular expressions, I have successfully isolated the "content" of the FILE_CONTENT field by removing the "(AttachedFiles:1)=" part, resulting in a string similar to this:
content_str = "H4sIAAAAAAAAAOy8VXQcy7Ku22JmZmZmspiZGS2WLGa0xc…"
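For illustration, the prefix-stripping looks roughly like this (a simplified sketch; the exact pattern here is just an example, not my real code):
import re

raw_value = "(AttachedFiles:1)=H4sIAAAAAAAAAOy8VXQcy7Ku…"  # value read from FILE_CONTENT

# Illustrative pattern only: strip the "(AttachedFiles:<n>)=" prefix, keeping the base64 payload.
match = re.match(r"\(AttachedFiles:\d+\)=(.*)", raw_value, re.DOTALL)
content_str = match.group(1) if match else raw_value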
My plan was to reconstruct the file using this string to download it from the database. During my investigation process, I found this post and proceeded to replicate the code like this:
content_str = 'H4sIAAAAAAAAAO19B0AUR/v33...'
with open(os.path.expanduser('test.pdf'), 'wb') as f:
    f.write(base64.decodestring(content_str))
...getting a TypeError: expected bytes-like object, not str
Investigating further, I found this other post and proceeded like this:
content_str = 'H4sIAAAAAAAAAO19B0AUR/v33...'
encoded = content_str.encode('ascii')
with open(os.path.expanduser('test.pdf'), 'wb') as f:
    f.write(base64.decodestring(encoded))
...which successfully creates a PDF file. However, when I try to open it, I get an error saying the file is corrupt.
I kindly ask for any suggestions on how to proceed. I am even open to rethinking the process I've come up with, if necessary. Many thanks in advance!
The value of FILE_CONTENT is base64-encoded: a string drawn from an alphabet of 64 characters that encodes raw bytes. All you need to do is base64-decode the string and write the resulting bytes directly to a file.
import base64
import os

content_str = "H4sIAAAAAAAAAOy8VXQcy7Ku22JmZmZmspiZGS2WLGa0xc=="
with open(os.path.expanduser('test.pdf'), 'wb') as fp:
    fp.write(base64.b64decode(content_str))
The base64 sequence "H4sI" at the start of your content string translates to the bytes 0x1f, 0x8b, 0x08. These bytes are not normally at the start of a PDF file, but indicate a gzip-compressed data stream. It's possible that a PDF reader won't understand this.
I don't know for certain if gzip compression is a valid part of the PDF file format, but it's a valid part of web communication, so maybe the file stream was compressed for transfer/download and has not been decompressed before writing it to the database.
If your PDF reader does not accept the data as is, decompress it before saving it to file:
import gzip
# ...
with open(os.path.expanduser('test.pdf'), 'wb') as fp:
    fp.write(gzip.decompress(base64.b64decode(content_str)))
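If you are not sure whether every row is compressed, one option is to check for the gzip magic bytes after base64-decoding and only decompress when they are present. A minimal sketch (it assumes the prefix stripping from your question has already happened):
import base64
import gzip
import os

decoded = base64.b64decode(content_str)

# gzip streams start with the magic bytes 0x1f 0x8b; plain PDFs start with b'%PDF'.
if decoded[:2] == b'\x1f\x8b':
    decoded = gzip.decompress(decoded)

with open(os.path.expanduser('test.pdf'), 'wb') as fp:
    fp.write(decoded)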
Simply put, I need to work with uploaded files without saving them to the server.
In a CLI script, using open() everything works fine, but with Flask and a file sent by an AJAX request, neither the open() function nor the stream.read() method helps me work with the CSV.
open() itself throws an exception:
csv_f = open(request.files['csvfile'].stream, 'rb')
TypeError: expected str, bytes or os.PathLike object, not SpooledTemporaryFile
Using .read() I can print it:
csv_f = request.files['csvfile'].stream.read()
data = csv.reader(csv_f, delimiter = ',')
print(csv_f)
b'...'
but iterating also throws an exception:
for row in data:
_csv.Error: iterator should return strings, not int (did you open the file in text mode?)
I just need a way to work with CSV files on the fly using the csv module.
I found out the problem: the file comes through the request as a binary stream, not normal text. That's why it has a read method but isn't usable when iterating. I had to use .decode(), like this:
request.files['csvfile'].stream.read().decode("utf-8")
instead of this
request.files['csvfile'].stream.read()
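To then iterate over rows with the csv module, the decoded text can be wrapped in an in-memory stream. A rough sketch of how this can look in a Flask view (the route and the 'csvfile' field name are just examples):
import csv
import io

from flask import Flask, request

app = Flask(__name__)

@app.route('/upload', methods=['POST'])
def upload():
    # Decode the binary stream to text, then wrap it so csv.reader can iterate over lines.
    text = request.files['csvfile'].stream.read().decode('utf-8')
    reader = csv.reader(io.StringIO(text), delimiter=',')
    rows = list(reader)
    return 'parsed {} rows'.format(len(rows))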
I'm working on a project, and a key thing that I'm stuck on is being able to read in encrypted data from a file. I've done some looking around, and I can't find anything specific about this issue.
Data is encrypted by a Python implementation of DES, and the ciphertext comes from this return statement: return bytes.fromhex('').join(result). For example, encrypting b'This' gives this result:
b'\xc5lP\x04\x8c\xe2\xa8\x05'
I then place this encryption into a file (opened as "wb") using out_file.write(data).
My problem is that when I try to read the encrypted data from the file, nothing gets read. The code below shows that I can read in data the way I want when plaintext is used, but not when this formatting of encrypted text is. I need the read-in data as a bytes type.
with open(filename, "rb") as in_file:
    buffer = in_file.read()
Using this on a file with the plaintext This, printing buffer looks like:
b'This'
However, doing this on a file with the encrypted plaintext formed from bytes.fromhex(''), printing buffer gives nothing:
b''
Are there any suggestions on how to either format the encrypted text to put it into a file so that it can be read, or reading data from a file in this particular format? I'm just not understanding why this format is not being interpreted properly as bytes when I read it in from a file.
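For reference, this is the minimal round trip I would expect to work, with the example ciphertext bytes hard-coded (a simplified sketch, separate from my actual encryption code):
data = b'\xc5lP\x04\x8c\xe2\xa8\x05'  # example ciphertext bytes from above

with open('cipher.bin', 'wb') as out_file:
    out_file.write(data)

with open('cipher.bin', 'rb') as in_file:
    buffer = in_file.read()

print(buffer)          # expected: b'\xc5lP\x04\x8c\xe2\xa8\x05'
print(buffer == data)  # expected: True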
Currently, I can't get the data received from my other client software written to a file in a way that appends, and also adds a space after each dump. I've tried quite a few different approaches, but I'm left with this now and I'm a bit stumped.
At the moment I can no longer get the file to be written at all, and I'm not sure what I've done to break that part of my code.
while True:
    data = s.recv(1024).decode('utf-8')
    if data:
        with open("data.txt", 'w') as f:
            json.dump(data, f, ensure_ascii=False)
I am expecting a file to appear that is not overwritten each time I receive new data, allowing me to develop the search and table features of my application.
What you are currently doing for each block:
Decode the block as UTF-8
Open a file, truncating the previous contents ('w' mode)
Re-encode the data
Dump it to the file
Why this is a bad way to do it:
Your blocks are not necessarily going to respect UTF-8 code point boundaries: a multi-byte character can be split across two recv calls (see the sketch after this list). You need to accumulate all the data before you decode.
Not only are you truncating the existing file by using 'w' instead of 'a' mode, but opening and closing a file over and over is very inefficient and generally a bad idea.
You are not going to get the same result back if the original block fell across a UTF-8 boundary. Worst case, your whole dataset will be trash.
You have no way of ending the stream. You probably want to close the file eventually and decode it.
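To illustrate the boundary problem, here is a small self-contained sketch (not your socket code): a multi-byte character split across two chunks fails to decode per chunk, but decodes fine once the bytes are accumulated.
payload = 'héllo'.encode('utf-8')  # b'h\xc3\xa9llo'; the 'é' is two bytes

chunk1, chunk2 = payload[:2], payload[2:]  # split in the middle of 'é'

try:
    chunk1.decode('utf-8')
except UnicodeDecodeError as exc:
    print('per-chunk decode fails:', exc)

print((chunk1 + chunk2).decode('utf-8'))  # accumulating first works: héllo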
How you should do it:
Open an output file (in binary mode)
Loop until the stream ends
Dump all your raw binary packets to a file
Close the file
Decode the file when you read it
Sample code:
with open('data.txt', 'wb') as file:
    while True:
        data = s.recv(1024)
        if not data:
            break
        file.write(data)
If the binary stream contains UTF-8 encoded JSON data, that's what you will get in your file.
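When you later read the file back, decode and parse it in one step. A minimal sketch, assuming the stream contained a single JSON document:
import json

with open('data.txt', 'rb') as file:
    raw = file.read()

obj = json.loads(raw.decode('utf-8'))  # decode once, after all the data has arrived
print(obj)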
I am trying to unzip some .json.gz files, but gzip adds some characters to it, and hence makes it unreadable for JSON.
What do you think is the problem, and how can I solve it?
If I use unzipping software such as 7zip to unzip the file, this problem disappears.
This is my code:
with gzip.open('filename', 'rb') as f:
    json_content = json.loads(f.read())
This is the error I get:
Exception has occurred: json.decoder.JSONDecodeError
Extra data: line 2 column 1 (char 1585)
I used this code:
with gzip.open('filename', mode='rb') as f:
    print(f.read())
and realized that the file starts with b' (as shown below):
b'{"id":"tag:search.twitter.com,2005:5667817","objectType":"activity"
I think the b' is what makes the file unusable in the next stage. Do you have any solution to remove the b'? There are millions of these zipped files, and I cannot do that manually.
I uploaded a sample of these files in the following link
just a few json.gz files
The problem isn't the b prefix you're seeing with print(f.read()); that just means the data is a bytes object (a sequence of raw byte values) rather than a regular Python string of characters, and json.loads() will accept either. The JSONDecodeError is because the data in the gzipped file isn't valid JSON as a whole, which is required. The format looks like something known as JSON Lines, which the Python standard library's json module doesn't directly support.
Dunes' answer to the question that @Charles Duffy marked this (at one point) as a duplicate of wouldn't have worked as presented because of this formatting issue. However, from the sample file you linked in your question, it looks like there is a valid JSON object on each line of the file. If that's true of all of your files, then a simple workaround is to process each file line by line.
Here's what I mean:
import json
import gzip

filename = '00_activities.json.gz'  # Sample file.

json_content = []
with gzip.open(filename, 'rb') as gzip_file:
    for line in gzip_file:  # Read one line.
        line = line.rstrip()
        if line:  # Any JSON data on it?
            obj = json.loads(line)
            json_content.append(obj)

print(json.dumps(json_content, indent=4))  # Pretty-print data parsed.
Note that the output it prints shows what valid JSON might have looked like.
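Since you mention there are millions of these files, the same per-file logic can be wrapped in a loop over the archive names. A sketch (the directory layout and what you do with each parsed file are assumptions):
import glob
import gzip
import json

# Hypothetical directory containing the .json.gz files.
for filename in glob.glob('activities/*.json.gz'):
    objects = []
    with gzip.open(filename, 'rb') as gzip_file:
        for line in gzip_file:
            line = line.rstrip()
            if line:
                objects.append(json.loads(line))
    print(filename, len(objects))  # Replace with whatever processing you need.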
I'm using Python 3.5 on Windows.
I have this little piece of code that downloads close to one hundred CSV files from different URLs stored in Links.txt:
from urllib import request

new_lines = 'None'

def download_data(csv_url):
    response = request.urlopen(csv_url)
    csv = response.read()
    csv_str = str(csv)
    global new_lines
    new_lines = csv_str.split("\\n")

with open('Links.txt') as file:
    for line in file:
        URL = line
        file_name = URL[54:].rsplit('.ST', 1)[0]
        download_data(URL)
        save_destination = 'C:\\Download data\\Data\\' + file_name + '.csv'
        fx = open(save_destination, "w")
        for lines in new_lines:
            fx.write(lines + "\n")
        fx.close()
The problem is that the generated CSV files always start with b' and, after the last line of data, there is another ' followed by a couple of empty rows to wrap things up. I do not see these characters when I look at the files in the browser (before I download them).
This creates problems when I want to import and use the data in a database. Do you have any idea on why this happens and how I can get the code to write the CSV files correctly?
Tips that can make the code faster/better, or adjustments for other flaws in the code are obviously very welcome.
What's happening is that urllib returns its data as bytes; anything that prints as b'...' is a bytes object, not a regular string.
Your immediate problem could be solved by decoding the response with decode('utf-8') (as Chedy2149 shows), which converts the data from bytes to a string.
However, you can sidestep this problem entirely by downloading the file directly to disk. You currently go through the work of downloading the data, splitting it, and writing it to disk, but all of that seems unnecessary because your code ultimately just writes the file's contents to disk without doing anything else with them.
You can use urllib.request.urlretrieve and download to a file directly.
Here's an example, modified from your code.
import urllib.request

def download_data(url, file_to_save):
    filename, rsp = urllib.request.urlretrieve(url, file_to_save)
    # Assuming everything worked, the file has been downloaded to file_to_save

with open('Links.txt') as file:
    for line in file:
        url = line.rstrip()  # adding this here to remove the extraneous '\n' from the string
        file_name = url[54:].rsplit('.ST', 1)[0]
        save_destination = 'C:\\Download data\\Data\\' + file_name + '.csv'
        download_data(url, save_destination)
In the download_data function you need to convert the byte string csv response to a plain string.
Try replacing csv_str = str(csv) with csv_str = csv.decode('utf-8').
This should properly decode the byte string returned by response.read().
The problem is that response.read() returns a bytes object, and str() doesn't convert it to a string the way you expect. Use csv_str = csv.decode() instead.
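For reference, a minimal sketch of download_data with that change applied (returning the lines instead of using the global is a small extra cleanup, not something the question's code does):
from urllib import request

def download_data(csv_url):
    response = request.urlopen(csv_url)
    csv_bytes = response.read()
    # Decode the raw bytes instead of calling str(), so no b'...' wrapper ends up in the file.
    csv_str = csv_bytes.decode('utf-8')
    return csv_str.split('\n')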