Simply put, I need to work with uploaded files without saving them on the server.
In a CLI script, using open() everything works fine, but in Flask, with the file sent as form data by an AJAX request, neither the open() function nor the stream.read() method let me work with the CSV.
open() throws an exception itself:
csv_f = open(request.files['csvfile'].stream, 'rb')
TypeError: expected str, bytes or os.PathLike object, not SpooledTemporaryFile
Using .read() I can print it:
csv_f = request.files['csvfile'].stream.read()
data = csv.reader(csv_f, delimiter = ',')
print(csv_f)
b'...'
but iterating also throws an exception:
for row in data:
_csv.Error: iterator should return strings, not int (did you open the file in text mode?)
All I need is a way to work with CSV files using the csv module on the fly.
I found out the problem: the file comes through the request as a binary stream, not normal text. That's why it has a read() method but is useless when iterating. I had to use .decode(), like this:
request.files['csvfile'].stream.read().decode("utf-8")
instead of this:
request.files['csvfile'].stream.read()
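Putting it together, here is a minimal sketch of the full fix (assuming the same 'csvfile' field name; csv.reader accepts any iterable of strings, so the decoded text just needs to be split into lines):

import csv
import io

from flask import request

# Inside a Flask view function:
text = request.files['csvfile'].stream.read().decode('utf-8')
data = csv.reader(io.StringIO(text), delimiter=',')
for row in data:
    print(row)  # each row is now a list of strings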
I am using Python 3.10.5 and Spyder. I have been trying to convert a large (7 GB) JSON file to CSV format. I eventually want this to work on 1 TB+ JSON files (I know... can't be helped), but I have had issues manipulating the file. To preserve memory, I had the idea to stream the JSON records and append them to a CSV file piece by piece. I tried writing from the stream first, and when that did not work I tried to simply print individual JSON records to the console, but nothing happens at all, not even an error. This is my code:
For the CSV conversion:
import ijson
import pandas as pd

with open(lrg_file, "rb") as f:
    for record in ijson.items(f, "item"):
        row_x = pd.Series(record, index=['id', 'type', 'actor', 'repo', 'payload', 'public', 'created_at'])
        row_x.to_csv(dump_out, mode='a', index=False, header=False)
For printing to console:
with open(lrg_file, "rb") as f:
    for record in ijson.items(f, "item"):
        obj = json.dumps(record, indent=4)
        print(obj)
My code works on smaller files (2-25 MB) but the same code fails on the larger file; it simply runs silently without printing or throwing an error or doing anything at all. I did find that I can use the ijson.parse function and it will print from the larger file, using the following code:
with open(lrg_file, "rb") as f:
    parser = ijson.parse(f)
    for record in parser:
        obj = json.dumps(record, indent=4)
        print(obj)
That allows me to see the prefix/event/value trio for each event in the file, but it does not make for an efficient means of converting to CSV. Any explanation for why my larger file just won't work with my CSV conversion code?
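For reference, here is a batched variant of the same conversion loop (a sketch only, reusing lrg_file, dump_out, and the column list from the question): it accumulates records and writes them in chunks, which keeps memory bounded while avoiding one to_csv call per record.

import ijson
import pandas as pd

cols = ['id', 'type', 'actor', 'repo', 'payload', 'public', 'created_at']
batch, batch_size = [], 10000

with open(lrg_file, 'rb') as f:
    for record in ijson.items(f, 'item'):
        batch.append(record)
        if len(batch) >= batch_size:
            # Write a chunk of rows and reset the buffer.
            pd.DataFrame(batch, columns=cols).to_csv(dump_out, mode='a', index=False, header=False)
            batch.clear()

if batch:  # flush whatever is left over
    pd.DataFrame(batch, columns=cols).to_csv(dump_out, mode='a', index=False, header=False)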
I have a piece of code that has been working for a while that uses Python’s DictReader.
The code initializes the csv reader, csv_reader = csv.DictReader(my_csv) and then I access csv_reader.fieldnames. Historically this has been working fine.
However, today it started throwing the error iterator should return strings, not bytes (did you open the file in text mode?) when I try to access csv_reader.fieldnames.
csv_reader.__dict__ shows an object with an attribute _fieldnames, and it is empty. I’m not sure why this changed or what I can do to resolve it, any suggestions are welcome.
You might need to specify your file's encoding explicitly:
with open('my.csv', 'rt', encoding='utf-8') as file:
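A slightly fuller sketch (the file name is illustrative), opening the file in text mode with the encoding pinned down before handing it to DictReader:

import csv

with open('my.csv', 'rt', encoding='utf-8') as file:
    csv_reader = csv.DictReader(file)
    print(csv_reader.fieldnames)  # the header row, read on first access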
I am trying to unzip some .json.gz files, but gzip adds some characters to it, and hence makes it unreadable for JSON.
What do you think is the problem, and how can I solve it?
If I use unzipping software such as 7zip to unzip the file, this problem disappears.
This is my code:
with gzip.open('filename', 'rb') as f:
    json_content = json.loads(f.read())
This is the error I get:
Exception has occurred: json.decoder.JSONDecodeError
Extra data: line 2 column 1 (char 1585)
I used this code:
with gzip.open('filename', mode='rb') as f:
    print(f.read())
and realized that the file starts with b' (as shown below):
b'{"id":"tag:search.twitter.com,2005:5667817","objectType":"activity"
I think b' is what makes the file unworkable for the next stage. Do you have any solution to remove the b'? There are millions of these zipped files, and I cannot do that manually.
I uploaded a sample of these files in the following link
just a few json.gz files
The problem isn't the b prefix you're seeing with print(f.read()): that just means the data is a bytes sequence (i.e. integer byte values), not a sequence of UTF-8 characters (i.e. a regular Python string), and json.loads() will accept either. The JSONDecodeError is because the data in the gzipped file isn't in valid JSON format, which is required. The format looks like something known as JSON Lines, which the Python standard library json module doesn't (directly) support.
Dunes' answer to the question that Charles Duffy at one point marked this as a duplicate of wouldn't have worked as presented, because of this formatting issue. However, from the sample file you linked to in your question, it looks like there is a valid JSON object on each line of the file. If that's true of all of your files, then a simple workaround is to process each file line by line.
Here's what I mean:
import json
import gzip

filename = '00_activities.json.gz'  # Sample file.
json_content = []

with gzip.open(filename, 'rb') as gzip_file:
    for line in gzip_file:  # Read one line.
        line = line.rstrip()
        if line:  # Any JSON data on it?
            obj = json.loads(line)
            json_content.append(obj)

print(json.dumps(json_content, indent=4))  # Pretty-print data parsed.
Note that the output it prints shows what valid JSON might have looked like.
I am working with a SQL Server database table similar to this
USER_ID varchar(50), FILE_NAME ntext, FILE_CONTENT ntext
sample data:
USER_ID: 1
FILE_NAME: (AttachedFiles:1)=file1.pdf
FILE_CONTENT: (AttachedFiles:1)=H4sIAAAAAAAAAOy8VXQcy7Ku….
By means of regular expressions, I have successfully isolated the "content" of the FILE_CONTENT field by removing the "(AttachedFiles:1)=" part, resulting in a string similar to this:
content_str = "H4sIAAAAAAAAAOy8VXQcy7Ku22JmZmZmspiZGS2WLGa0xc…"
My plan was to reconstruct the file using this string to download it from the database. During my investigation process, I found this post and proceeded to replicate the code like this:
content_str = 'H4sIAAAAAAAAAO19B0AUR/v33...'
with open(os.path.expanduser('test.pdf'), 'wb') as f:
    f.write(base64.decodestring(content_str))
...getting a TypeError: expected bytes-like object, not str
Investigating further, I found this other post and proceeded like this:
content_str = 'H4sIAAAAAAAAAO19B0AUR/v33...'
encoded = content_str.encode('ascii')
with open(os.path.expanduser('test.pdf'), 'wb') as f:
    f.write(base64.decodestring(encoded))
...which successfully creates a PDF. However, when trying to open it, I get an error saying that the file is corrupt.
I kindly ask you for any suggestions on how to proceed. I am even open to rethinking the process I've come up with, if necessary. Many thanks in advance!
The value of the FILE_CONTENT is base64-encoded. This means it's a string consisting of 64 possible characters which represent raw bytes. All you need to do is base64-decode the string and write the resulting bytes directly to a file.
import base64
import os

content_str = "H4sIAAAAAAAAAOy8VXQcy7Ku22JmZmZmspiZGS2WLGa0xc=="
with open(os.path.expanduser('test.pdf'), 'wb') as fp:
    fp.write(base64.b64decode(content_str))
The base64 sequence "H4sI" at the start of your content string translates to the bytes 0x1f, 0x8b, 0x08. These bytes are not normally at the start of a PDF file, but indicate a gzip-compressed data stream. It's possible that a PDF reader won't understand this.
I don't know for certain if gzip compression is a valid part of the PDF file format, but it's a valid part of web communication, so maybe the file stream was compressed for transfer/download and has not been decompressed before writing it to the database.
If your PDF reader does not accept the data as is, decompress it before saving it to file:
import gzip
# ...
with open(os.path.expanduser('test.pdf'), 'wb') as fp:
    fp.write(gzip.decompress(base64.b64decode(content_str)))
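If you're not sure whether a given row's content is compressed, one option (a sketch, not something the database schema guarantees) is to check for the gzip magic number before deciding:

import base64
import gzip
import os

raw = base64.b64decode(content_str)
if raw[:2] == b'\x1f\x8b':  # gzip magic number, per the bytes noted above
    raw = gzip.decompress(raw)

with open(os.path.expanduser('test.pdf'), 'wb') as fp:
    fp.write(raw)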
How can I open a text file, read the contents of the file and create a hash table from this content? So far I have tried:
import json
from pprint import pprint

json_data = open('/home/azoi/Downloads/yes/1.txt').read()
data = json.loads(json_data)
pprint(data)
I suggest this solution:
import json
from pprint import pprint

with open("/home/azoi/Downloads/yes/1.txt") as f:
    data = json.load(f)
pprint(data)
The with statement ensures that your file is automatically closed whatever happens, and that your program throws the correct exception if the open fails. The json.load function directly loads data from an open file handle.
Additionally, I strongly suggest reading and understanding the Python tutorial. It's essential reading and won't take too long.
To open a file you have to call open() correctly, something like:
json_data = open('/home/azoi/Downloads/yes/1.txt', 'r')
where the first string is the path to the file and the second is the mode: r = read, w = write, a = append.
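To make the modes concrete, a small sketch (file names are illustrative):

# 'r' reads an existing file; 'w' creates or overwrites; 'a' appends to the end.
with open('notes.txt', 'w') as f:
    f.write('first line\n')
with open('notes.txt', 'a') as f:
    f.write('second line\n')
with open('notes.txt', 'r') as f:
    print(f.read())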