I exported some data from my database in the form of JSON, which is essentially just one [list] with a bunch (900K) of {objects} inside it.
I'm trying to import it on my production server now, but I'm on a cheap web host, and they don't like it when I eat all their resources for 10 minutes.
How can I split this file into smaller chunks so that I can import it piece by piece?
Edit: Actually, it's a PostgreSQL database. I'm open to other suggestions on how I can export all the data in chunks. I've got phpPgAdmin installed on my server, which supposedly can accept CSV, Tabbed and XML formats.
I had to fix phihag's script:
import json

with open('fixtures/PostalCodes.json', 'r') as infile:
    o = json.load(infile)

chunkSize = 50000
for i in xrange(0, len(o), chunkSize):
    with open('fixtures/postalcodes_' + ('%02d' % (i//chunkSize)) + '.json', 'w') as outfile:
        json.dump(o[i:i+chunkSize], outfile)
dump:
pg_dump -U username -t table database > filename
restore:
psql -U username < filename
(I don't know what the heck pg_restore does, but it gives me errors)
The tutorials on this conveniently leave this information out, esp. the -U option which is probably necessary in most circumstances. Yes, the man pages explain this, but it's always a pain to sift through 50 options you don't care about.
I ended up going with Kenny's suggestion... although it was still a major pain. I had to dump the table to a file, compress it, upload it, extract it, and then I tried to import it, but the data was slightly different on production and there were some missing foreign keys (postal codes are attached to cities). Of course, I couldn't just import the new cities, because then it throws a duplicate key error instead of silently ignoring it, which would have been nice. So I had to empty that table and repeat the process for cities, only to realize something else was tied to cities, so I had to empty that table too. Got the cities back in, and then finally I could import my postal codes.

By now I've obliterated half my database because everything is tied to everything and I've had to recreate all the entries. Lovely. Good thing I haven't launched the site yet. Also, "emptying" or truncating a table doesn't seem to reset the sequences/autoincrements, which I'd like, because there are a couple of magic entries I want to have ID 1. So... I'd have to delete or reset those too (I don't know how), so I manually edited the PKs for those back to 1.
I would have run into similar problems with phihag's solution, plus I would have had to import 17 files one at a time, unless I wrote another import script to match the export script. Although he did answer my question literally, so thanks.
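For reference, a matching import loop would not need to be long. A minimal sketch, assuming the chunks are Django fixtures that can be fed to manage.py loaddata one at a time (the fixtures/ path hints at that, but it's only an assumption):

import glob
import subprocess

# Load each chunk fixture separately so the server never has to hold the
# whole 95 MB array in memory at once; assumes a Django project with manage.py.
for chunk in sorted(glob.glob('fixtures/postalcodes_*.json')):
    subprocess.check_call(['python', 'manage.py', 'loaddata', chunk])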
In Python:
import json

with open('file.json') as infile:
    o = json.load(infile)

chunkSize = 1000
for i in xrange(0, len(o), chunkSize):
    with open('file_' + str(i//chunkSize) + '.json', 'w') as outfile:
        json.dump(o[i:i+chunkSize], outfile)
I turned phihag's and mark's work into a tiny script (gist), also copied below:
#!/usr/bin/env python
# based on http://stackoverflow.com/questions/7052947/split-95mb-json-array-into-smaller-chunks
# usage: python json-split filename.json
# produces multiple filename_N.json chunks of about 1.49 MB each

import json
import sys

with open(sys.argv[1], 'r') as infile:
    o = json.load(infile)

chunkSize = 4550
for i in xrange(0, len(o), chunkSize):
    with open(sys.argv[1] + '_' + str(i//chunkSize) + '.json', 'w') as outfile:
        json.dump(o[i:i+chunkSize], outfile)
Assuming you have the option to go back and export the data again...:
pg_dump - extract a PostgreSQL database into a script file or other archive file.
pg_restore - restore a PostgreSQL database from an archive file created by pg_dump.
If that's no use, it might be useful to know what you're going to be doing with the output so that another suggestion can hit the mark.
I know this question is from a while back, but I think this newer solution is hassle-free.
You can use pandas 0.21.0+, which supports a chunksize parameter as part of read_json. You can load one chunk at a time and save the JSON:
import pandas as pd

chunks = pd.read_json('file.json', lines=True, chunksize=20)
for i, c in enumerate(chunks):
    c.to_json('chunk_{}.json'.format(i))
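Note that chunksize only takes effect together with lines=True, i.e. line-delimited JSON. If the export is a single JSON array (as in the question above), a one-off conversion gets it into a shape read_json can stream; a sketch, with the file names as placeholders:

import json
import pandas as pd

# One-off conversion: turn a single JSON array into line-delimited JSON
# so that read_json(..., lines=True, chunksize=...) can stream it.
with open('file.json') as infile:
    records = json.load(infile)
pd.DataFrame(records).to_json('file.jsonl', orient='records', lines=True)

chunks = pd.read_json('file.jsonl', lines=True, chunksize=20)
for i, c in enumerate(chunks):
    c.to_json('chunk_{}.json'.format(i), orient='records', lines=True)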
Related
I have a series of Python scripts doing Oracle SQL database checks and outputting the results into individual JSON files, one per output. That's all working fine, although there are probably better ways to do it.
Then I'm using the following, which I found on here, to merge the JSON files into a single one:
import json
import glob

result = []
for f in glob.glob("*.json"):
    with open(f, "rb") as infile:
        result.append(json.load(infile))

with open("merged_file.json", "w") as outfile:
    json.dump(result, outfile)
This works roughly how I wanted the output to be; the only issue I'm facing is the "names" of the result sets (I'm sure my lack of terminology knowledge is what's made this hard).
To visualise what I mean, this is the output without the data (I realise this isn't the correct output format; it's from a GUI viewer of the file, to better show the structure):
[]JSON
{}0
{}RMTID
{}FLIGHTS
{}HOURLY
{}WEATHER
{}CALIBRATION
{}GAPS
{}1
{}RMTID
{}FLIGHTS
{}HOURLY
{}WEATHER
{}CALIBRATION
{}GAPS
I'm looking to have the 0 and 1 be the labels of the result set, so that the output would look like:
[]JSON
{}LAX
{}RMTID
{}FLIGHTS
{}HOURLY
{}WEATHER
{}CALIBRATION
{}GAPS
{}LGW
{}RMTID
{}FLIGHTS
{}HOURLY
{}WEATHER
{}CALIBRATION
{}GAPS
Is that possible with Python? Or should I be looking at alternative solutions?
I've seen other suggestions for similar questions that suggest just writing the outputs directly into one file rather than merging multiple files; however, the results come from different database connections, and finding a solution for "queuing" database connections and storing the outputs has been above my skill level.
Thanks!
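One way to get labelled result sets is to build a dict keyed by a label instead of appending to a list. A minimal sketch, assuming the label (LAX, LGW, ...) can be derived from each file's name, which may not match how your files are actually named:

import json
import glob
import os

result = {}
for f in glob.glob("*.json"):
    # hypothetical: take the label from the file name, e.g. "LAX.json" -> "LAX"
    label = os.path.splitext(os.path.basename(f))[0]
    with open(f, "rb") as infile:
        result[label] = json.load(infile)

with open("merged_file.json", "w") as outfile:
    json.dump(result, outfile)

The top level of merged_file.json then becomes an object keyed by those labels instead of a numbered array.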
So recently I've been using Repl.it to host my Python code, but whenever I'm offline, any information stored in the JSON file is rolled back after a bit of time. I now know this is a Repl-specific problem after doing some research, but is there any way I can fix it? My code itself is quite a few lines long, so I would rather not switch to a completely different storage method.
To successfully store data in json files in replit.com, it's important to load and dump it the correct way.
An example of storing data in json files:
with open("sample.json", "r") as file:
sample = json.load(file)
sample["item"] = "Value"
with open("sample.json", "w") as file:
json.dump(sample, file)
Let me know if you've already followed these steps.
I'm currently learning how to code and I have run into this challenge that I have been trying to solve for the last couple of days.
I have over 2000 CSV files that I would like to import into a particular PostgreSQL table at once, instead of using the Import Data function in pgAdmin 4, which only lets you import one CSV file at a time. How should I go about doing this? I'm using Windows.
A simple way is to use Cygwin or an inner Ubuntu shell (WSL) to run this script:
all_files=("file_1.csv" "file_2.csv") # OR u can change to * in dir
dir_name=<path_to_files>
export PGUSER=<username_here>
export PGPASSWORD=<password_here>
export PGHOST=localhost
export PGPORT=5432
db_name=<dbname_here>
echo "write db"
for file in ${all_files[*]}; do
psql -U$db_name -a -f $dir_name/"${file}"".sql" >/dev/null
done
If you want to do this purely in Python, then I have given an approach below. It's possible that you wouldn't need to chunk the list (that is, you could hold all of the files in memory at once and not need to work in batches). It's also possible that all the files are radically different sizes and you'd need something more sophisticated than simple batches to avoid creating an in-memory file object that exceeds your RAM. Or, you might choose to do it in 2000 separate transactions, but I suspect some kind of batching will be faster (untested).
import csv
import io
import os
import psycopg2

CSV_DIR = 'the_csv_folder/'  # Relative path here, might need to be an absolute path


def chunks(l, n):
    """
    https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
    """
    n = max(1, n)
    return [l[i:i+n] for i in range(0, len(l), n)]


# Get a list of all the CSV files in the directory
all_files = os.listdir(CSV_DIR)

# Chunk the list of files. Let's go with 100 files per chunk, can be changed
chunked_file_list = chunks(all_files, 100)

# Iterate the chunks and aggregate the files in each chunk into a single
# in-memory file
for chunk in chunked_file_list:

    # This is the file to aggregate into
    string_buffer = io.StringIO()
    csv_writer = csv.writer(string_buffer)

    for file in chunk:
        with open(CSV_DIR + file) as infile:
            reader = csv.reader(infile)
            # Transfer the read rows to the aggregated in-memory file
            csv_writer.writerows(reader)

    # Rewind the buffer so copy_from reads it from the beginning
    string_buffer.seek(0)

    # Now we have aggregated the chunk, copy the file to Postgres
    with psycopg2.connect(dbname='the_database_name',
                          user='the_user_name',
                          password='the_password',
                          host='the_host') as conn:
        c = conn.cursor()

        # Headers need to be the table field names, in the order they appear
        # in the csv
        headers = ['first_name', 'last_name', ...]

        # Now upload the data as though it was a file
        c.copy_from(string_buffer, 'the_table_name', sep=',', columns=headers)
        conn.commit()
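One caveat with the above: copy_from with sep=',' treats the buffer as plain delimited text, so quoted fields containing commas or newlines will break. If your CSVs use quoting, psycopg2's copy_expert with an explicit COPY ... CSV statement is an alternative; a sketch, reusing the cursor and buffer from the loop above, with the table and column names as placeholders:

# Hypothetical alternative to the copy_from call above, for quoted CSV data
string_buffer.seek(0)
c.copy_expert(
    "COPY the_table_name (first_name, last_name) FROM STDIN WITH (FORMAT csv)",
    string_buffer,
)
conn.commit()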
I'm trying to replace the zeros with a value. So far this is my code, but what do I do next?
g = open("January.txt", "r+")
for i in range(3):
dat_month = g.readline()
Month: January
Item: Lawn
Total square metres purchased:
0
monthly value = 0
You could do that, but it is not the usual approach, and it is certainly not the correct approach for text files.

The correct way to do it is to write another file, with the information you want updated in place, and then rename the new file to the old one. That is the only sane way of doing this with text files, since the size in bytes of the fields is variable.

As for the impression that you are "writing 200 bytes to the disk" instead of changing a single byte: don't let that fool you. At the operating-system level, all file access has to be done in blocks, which are usually a couple of kilobytes long (in special cases, with tuned filesystems, it could be a couple of hundred bytes). Either way, you will never, in a user-space program, much less in a high-level language like Python, trigger a disk write of less than a few hundred bytes.
Now, for the code:
import os

my_number = <number you want to place in the line you want to rewrite>

with open("January.txt", "r") as in_file, open("newfile.txt", "w") as out_file:
    for line in in_file:
        if line.strip() == "0":
            out_file.write(str(my_number) + "\n")
        else:
            out_file.write(line)

os.unlink("January.txt")
os.rename("newfile.txt", "January.txt")
So, that is the general idea - of course, you should not write code with all the values hard-coded in that way (i.e. the values to be checked and written fixed in the program code, as the filenames are).
As for the with statement: it is a construct of the language which is very appropriate for opening files and manipulating them in a block, as in this case - but it is not strictly needed.
Programming aside, the concept you have to keep in mind is this:
When you use an application that lets you edit a text file, a spreadsheet, or an image, you, as the user, may have the impression that after you are done and have saved your work, the updates are committed to the same file. In the vast, vast majority of use cases, that is not what happens: the application internally uses a pattern like the one I presented above - a completely new file is written to disk and the old one is deleted, or renamed. The few exceptions could be simple database applications, which could replace fixed-width fields inside the file itself on updates. Modern-day databases certainly do not do that, resorting instead to appending the most recent, updated information to the end of the file. PDF files are another kind that were not designed to be replaced entirely on each update: but also in that case, the updated information is written at the end of the file, even if the update applies to a page at the beginning of the rendered document.
dat_month = dat_month.replace("0", "45678")
Then, to write it to a file, you do:
with open("Outfile.txt", "wt") as outfile:
    outfile.write(dat_month)
Try this:
import fileinput
import itertools
import sys

# With inplace=True, whatever is written to stdout replaces the file's contents
with fileinput.input('January.txt', inplace=True) as file:
    # Copy the first three lines through unchanged
    beginning = tuple(itertools.islice(file, 3))
    sys.stdout.writelines(beginning)
    # Replace the '0' on the next two lines
    sys.stdout.write(next(file).replace('0', 'a value'))
    sys.stdout.write(next(file).replace('0', 'a value'))
    # Copy the rest of the file through unchanged
    sys.stdout.writelines(file)
I am using urllib.urlopen with Python 2.7 to read csv files located on an external webserver:
# Try & Except statements removed for clarity
import urllib
import csv
url = ...
csv_file = urllib.urlopen(url)
for row in csv.reader(csv_file):
    do_something()
All 100+ files can be read fine, except one that has been updated recently and that returns:
Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
The file is accessible here. According to my text editor, its mode is Mac (CR), as opposed to Windows (CRLF) for the other files.
Based on this thread, I found that Python's urlopen should handle all newline formats correctly. Therefore, the problem likely comes from somewhere else, though I have no clue where. The file opens fine in all my text editors and spreadsheet editors.
Does anyone have an idea of how to diagnose the problem?
* EDIT *
The creator of the file informed me by email that I was not the only one to experience such issues, so he decided to regenerate it. The code above now works fine again. Unfortunately, using a new file also means that the issue can no longer be reproduced, nor the proposed solutions properly tested.
Before closing the question, I want to thank all the stackers who dedicated some of their time to figure out a solution and post it here.
It might be a corrupt .csv file? Otherwise, this code runs perfectly.
#!/usr/bin/python
import urllib
import csv
url = "http://www.football-data.co.uk/mmz4281/1213/I1.csv"
csv_file = urllib.urlopen(url)
for row in csv.reader(csv_file):
    print row
Credits to J.F. Sebastian for the .csv file.
Although, you might want to consider sharing the specific .csv file with us, so we can try to re-create the error.
The following code runs without any error:
#!/usr/bin/env python
import csv
import urllib2
r = urllib2.urlopen('http://www.football-data.co.uk/mmz4281/1213/I1.csv')
for row in csv.reader(r):
    print row
I was having the same problem with a downloaded csv.
I know the fix would be to use open with 'rU', but I would rather not have to save the file to disk just to open it back up into a variable. That seems unnecessary.
file = open(filepath,'rU')
mydata = csv.reader(file)
So if someone has a better solution, that would be nice. Stack Overflow links that got me this far:
CSV new-line character seen in unquoted field error
Open the file in universal-newline mode using the CSV Django module
I found what I actually wanted with StringIO, or cStringIO, or io:
Using Python, how do I to read/write data in memory like I would with a file?
I ended up getting io working:
import csv
import urllib2
import io

# warning: it's a 20MB csv
url = 'http://poweredgec.com/latest_poweredge-11g.csv'
urlRead = urllib2.urlopen(url).read()

# Wrap the downloaded data in an in-memory file. Normalising the line endings
# first gives the same effect as opening a file on disk with 'rU'.
ramFile = io.BytesIO(urlRead.replace('\r\n', '\n').replace('\r', '\n'))

csvCurrent = csv.reader(ramFile)
csvTuple = map(tuple, csvCurrent)
print csvTuple
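As a side note, csv.reader accepts any iterable of lines, so splitting the downloaded string yourself also sidesteps the newline issue without an intermediate file object. A sketch under the same Python 2 / urllib2 assumptions as the snippet above:

import csv
import urllib2

url = 'http://poweredgec.com/latest_poweredge-11g.csv'
urlRead = urllib2.urlopen(url).read()

# splitlines() treats \n, \r\n and lone \r as line breaks, which is the same
# universal-newline behaviour that opening a file on disk with 'rU' would give
csvTuple = map(tuple, csv.reader(urlRead.splitlines()))
print csvTuple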