I have a decent sized collection in MongoDB and I need to export the entire thing to JSON using PyMongo. Right now I'm just doing:
import json
results = db.collection_name.find()
with open('collection-data.json', 'w') as f:
json.dump(list(results), f)
This ends up crashing the kernel because it eats up all my memory. Is there a way to save the collection in chunks so that I don't retrieve all of the data at one time?
Try this in your shell:
mongoexport --db <database-name> --collection <collection-name> --out output.json
Related
TL;DR: asyncio vs multi-processing vs threading vs. some other solution to parallelize for loop that reads files from GCS, then appends this data together into a pandas dataframe, then writes to BigQuery...
I'd like to make parallel a python function that reads hundreds of thousands of small .json files from a GCS directory, then converts those .jsons into pandas dataframes, and then writes the pandas dataframes to a BigQuery table.
Here is a not-parallel version of the function:
import gcsfs
import pandas as pd
from my.helpers import get_gcs_file_list
def load_gcs_to_bq(gcs_directory, bq_table):
# my own function to get list of filenames from GCS directory
files = get_gcs_file_list(directory=gcs_directory) #
# Create new table
output_df = pd.DataFrame()
fs = gcsfs.GCSFileSystem() # Google Cloud Storage (GCS) File System (FS)
counter = 0
for file in files:
# read files from GCS
with fs.open(file, 'r') as f:
gcs_data = json.loads(f.read())
data = [gcs_data] if isinstance(gcs_data, dict) else gcs_data
this_df = pd.DataFrame(data)
output_df = output_df.append(this_df)
# Write to BigQuery for every 5K rows of data
counter += 1
if (counter % 5000 == 0):
pd.DataFrame.to_gbq(output_df, bq_table, project_id=my_id, if_exists='append')
output_df = pd.DataFrame() # and reset the dataframe
# Write remaining rows to BigQuery
pd.DataFrame.to_gbq(output_df, bq_table, project_id=my_id, if_exists='append')
This function is straightforward:
grab ['gcs_dir/file1.json', 'gcs_dir/file2.json', ...], the list of file names in GCS
loop over each file name, and:
read the file from GCS
converts the data into a pandas DF
appends to a main pandas DF
every 5K loops, write to BigQuery (since the appends get much slower as the DF gets larger)
I have to run this function on a few GCS directories each with ~500K files. Due to the bottleneck of reading/writing this many small files, this process will take ~24 hours for a single directory... It would be great if I could make this more parallel to speed things up, as it seems like a task that lends itself to parallelization.
Edit: The solutions below are helpful, but I am particularly interested in running in parallel from within the python script. Pandas is handling some data cleaning, and using bq load will throw errors. There is asyncio and this gcloud-aio-storage that both seem potentially useful for this task, maybe as better options than threading or multiprocessing...
Rather than add parallel processing to your python code, consider invoking your python program multiple times in parallel. This is a trick that lends itself more easily to a program that takes a list of files on the command line. So, for the sake of this post, let's consider changing one line in your program:
Your line:
# my own function to get list of filenames from GCS directory
files = get_gcs_file_list(directory=gcs_directory) #
New line:
files = sys.argv[1:] # ok, import sys, too
Now, you can invoke your program this way:
PROCESSES=100
get_gcs_file_list.py | xargs -P $PROCESSES your_program
xargs will now take the file names output by get_gcs_file_list.py and invoke your_program up to 100 times in parallel, fitting as many file names as it can on each line. I believe the number of file names is limited to the maximum command size allowed by the shell. If 100 processes is not enough to process all your files, xargs will invoke your_program again (and again) until all file names it reads from stdin are processed. xargs ensures that no more than 100 invocations of your_program are running simultaneously. You can vary the number of processes based on the resources available to your host.
Instead of doing this, you can directly use bq command.
The bq command-line tool is a Python-based command-line tool for BigQuery.
When you use this command, loading takes place in google's network which is very fast than we creating a dataframe and loading to table.
bq load \
--autodetect \
--source_format=NEWLINE_DELIMITED_JSON \
mydataset.mytable \
gs://mybucket/my_json_folder/*.json
For more information - https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json#loading_json_data_into_a_new_table
I have a JSON file that is about 8GB in size. When I try to convert the file using this script:
import csv
import json
infile = open("filename.json","r")
outfile = open("data.csv","w")
writer = csv.writer(outfile)
for row in json.loads(infile.read()):
writer.write(row)
I get this error:
Traceback (most recent call last):
File "E:/Thesis/DataDownload/PTDataDownload/demo.py", line 9, in <module>
for row in json.loads(infile.read()):
MemoryError
I'm sure this has to do with the size of the file. Is there a way to ensure the file will convert to a CSV without the error?
This is a sample of my JSON code:
{"id":"tag:search.twitter.com,2005:905943958144118786","objectType":"activity","actor":{"objectType":"person","id":"id:twitter.com:899030045234167808","link":"http://www.twitter.com/NAJajsjs3","displayName":"NAJajsjs","postedTime":"2017-08-19T22:07:20.000Z","image":"https://pbs.twimg.com/profile_images/905943685493391360/2ZavxLrD_normal.jpg","summary":null,"links":[{"href":null,"rel":"me"}],"friendsCount":23,"followersCount":1,"listedCount":0,"statusesCount":283,"twitterTimeZone":null,"verified":false,"utcOffset":null,"preferredUsername":"NAJajsjs3","languages":["tr"],"favoritesCount":106},"verb":"post","postedTime":"2017-09-08T00:00:45.000Z","generator":{"displayName":"Twitter for iPhone","link":"http://twitter.com/download/iphone"},"provider":{"objectType":"service","displayName":"Twitter","link":"http://www.twitter.com"},"link":"http://twitter.com/NAJajsjs3/statuses/905943958144118786","body":"#thugIyfe Beyonce do better","object":{"objectType":"note","id":"object:search.twitter.com,2005:905943958144118786","summary":"#thugIyfe Beyonce do better","link":"http://twitter.com/NAJajsjs3/statuses/905943958144118786","postedTime":"2017-09-08T00:00:45.000Z"},"inReplyTo":{"link":"http://twitter.com/thugIyfe/statuses/905942854710775808"},"favoritesCount":0,"twitter_entities":{"hashtags":[],"user_mentions":[{"screen_name":"thugIyfe","name":"dari.","id":40542633,"id_str":"40542633","indices":[0,9]}],"symbols":[],"urls":[]},"twitter_filter_level":"low","twitter_lang":"en","display_text_range":[10,27],"retweetCount":0,"gnip":{"matching_rules":[{"tag":null,"id":6134817834619900217,"id_str":"6134817834619900217"}]}}
(sorry for the ugly formatting)
An alternative may be that I have about 8000 smaller json files that I combined to make this file. They are each within their own folder with just the single json in the folder. Would it be easier to convert each of these individually and then combine them into one csv?
The reason I am asking this is because I have very basic python knowledge and all the answers to similar questions that I have found are way more complicated than I can understand. Please help this new python user to read this json as a csv!
Would it be easier to convert each of these individually and then combine them into one csv?
Yes, it certainly would
For example, this will put each JSON object/array (whatever is loaded from the file) onto its own line of a single CSV.
import json, csv
from glob import glob
with open('out.csv', 'w') as f:
for fname in glob("*.json"): # Reads all json from the current directory
with open(fname) as j:
f.write(str(json.load(j)))
f.write('\n')
Use glob pattern **/*.json to find all json files in nested folders
Not really clear what for row in ... was doing for your data since you don't have an array. Unless you wanted each JSON key to be a CSV column?
Yes, it is absolutely can be done in a very easy way. I opened a 4GB json file in a few seconds. For me, I dont need to convert to csv. But it can be done in a very easy way.
start the mongodb with Docker.
create a temporary database on mongodb, e.g. test
copy the json file to into the Docker container
run mongoimport command
docker exec -it container_id mongoimport --db test --collection data --file /tmp/data.json --jsonArray
run the mongo export command to export to csv
docker exec -it container_id mongoexport --db test --collection data --csv --out data.csv --fields id,objectType
I'm trying to load an extremely large JSON file in Python. I've tried:
import json
data = open('file.json').read()
loaded = json.loads(data)
but that gives me a SIGKILL error.
I've tried:
import pandas as pd
df = pd.read_json('file.json')
and I get an out-of-memory error.
I'd like to try to use ijson to stream my data and only pull a subset into it at a time. However, you need to know what the schema of the JSON file is so that you know what events to look for. I don't actually know what the schema of my JSON file is. So, I have two questions:
Is there a way to load or stream a large json file in Python without knowing the schema? Or a way to convert a JSON file into another format (or into a postgresql server, for example)?
Is there a tool for spitting out what the schema of my JSON file is?
UPDATE:
Used head file.json to get an idea of what my JSON file looks like. From there it's a bit easier.
I would deal with smaller pieces of the file. Take a look at Lazy Method for Reading Big File in Python?. You can adapt the proposed answer to parse your JSON object by object.
You can read in chunks, something like this
f=open("file.json")
while True:
data = f.read(1024)
if not data:
break
yield data
Line by line option
data = []
with open('file') as f:
for line in f:
data.append(json.loads(line))
Also look at
https://www.dataquest.io/blog/python-json-tutorial/
Look for more answers with jsonline
I'm relatively new to Python, and extremely new to MongoDB (as such, I'll only be concerned with taking the text files and converting them). I'm currently trying to take a bunch of .txt files that are in JSON to move them into MongoDB. So, my approach is to open each file in the directory, read each line, convert it from JSON to a dictionary, and then over-write that line that was JSON as a dictionary. Then it'll be in a format to send to MongoDB
(If there's any flaw in my reasoning, please point it out)
At the moment, I've written this:
"""
Kalil's step by step iteration / write.
JSON dumps takes a python object and serializes it to JSON.
Loads takes a JSON string and turns it into a python dictionary.
So we return json.loads so that we can take that JSON string from the tweet and save it as a dictionary for Pymongo
"""
import os
import json
import pymongo
rootdir='~/Tweets'
def convert(line):
line = file.readline()
d = json.loads(lines)
return d
for subdir, dirs, files in os.walk(rootdir):
for file in files:
f=open(file, 'r')
lines = f.readlines()
f.close()
f=open(file, 'w')
for line in lines:
newline = convert(line)
f.write(newline)
f.close()
But it isn't writing.
Which... As a rule of thumb, if you're not getting the effect that you're wanting, you're making a mistake somewhere.
Does anyone have any suggestions?
When you decode a json file you don't need to convert line by line as the parser will iterate over the file for you (that is unless you have one json document per line).
Once you've loaded the json document you'll have a dictionary which is a data structure and cannot be directly written back to file without first serializing it into a certain format such as json, yaml or many others (the format mongodb uses is called bson but your driver will handle the encoding for you).
The overall process to load a json file and dump it into mongo is actually pretty simple and looks something like this:
import json
from glob import glob
from pymongo import Connection
db = Connection().test
for filename in glob('~/Tweets/*.txt'):
with open(filename) as fp:
doc = json.load(fp)
db.tweets.save(doc)
a dictionary in python is an object that lives within the program, you can't save the dictionary directly to a file unless you pickle it (pickling is a way to save objects in files so you can retrieve it latter). Now I think a better approach would be to read the lines from the file, load the json which converts that json to a dictionary and save that info into mongodb right away, no need to save that info into a file.
I exported some data from my database in the form of JSON, which is essentially just one [list] with a bunch (900K) of {objects} inside it.
Trying to import it on my production server now, but I've got some cheap web server. They don't like it when I eat all their resources for 10 minutes.
How can I split this file into smaller chunks so that I can import it piece by piece?
Edit: Actually, it's a PostgreSQL database. I'm open to other suggestions on how I can export all the data in chunks. I've got phpPgAdmin installed on my server, which supposedly can accept CSV, Tabbed and XML formats.
I had to fix phihag's script:
import json
with open('fixtures/PostalCodes.json','r') as infile:
o = json.load(infile)
chunkSize = 50000
for i in xrange(0, len(o), chunkSize):
with open('fixtures/postalcodes_' + ('%02d' % (i//chunkSize)) + '.json','w') as outfile:
json.dump(o[i:i+chunkSize], outfile)
dump:
pg_dump -U username -t table database > filename
restore:
psql -U username < filename
(I don't know what the heck pg_restore does, but it gives me errors)
The tutorials on this conveniently leave this information out, esp. the -U option which is probably necessary in most circumstances. Yes, the man pages explain this, but it's always a pain to sift through 50 options you don't care about.
I ended up going with Kenny's suggestion... although it was still a major pain. I had to dump the table to a file, compress it, upload it, extract it, then I tried to import it, but the data was slightly different on production and there were some missing foreign keys (postalcodes are attached to cities). Of course, I couldn't just import the new cities, because then it throws a duplicate key error instead of silently ignoring it, which would have been nice. So I had to empty that table, repeat the process for cities, only to realize something else was tied to cities, so I had to empty that table too. Got the cities back in, then finally I could import my postal codes. By now I've obliterated half my database because everything is tied to everything and I've had to recreate all the entries. Lovely. Good thing I haven't launched the site yet. Also "emptying" or truncating a table doesn't seem to reset the sequences/autoincrements, which I'd like, because there are a couple magic entries I want to have ID 1. So..I'd have to delete or reset those too (I don't know how), so I manually edited the PKs for those back to 1.
I would have ran into similar problems with phihag's solution, plus I would have had to import 17 files one at a time, unless I wrote another import script to match the export script. Although he did answer my question literally, so thanks.
In Python:
import json
with open('file.json') as infile:
o = json.load(infile)
chunkSize = 1000
for i in xrange(0, len(o), chunkSize):
with open('file_' + str(i//chunkSize) + '.json', 'w') as outfile:
json.dump(o[i:i+chunkSize], outfile)
I turned phihag's and mark's work into a tiny script (gist)
also copied below:
#!/usr/bin/env python
# based on http://stackoverflow.com/questions/7052947/split-95mb-json-array-into-smaller-chunks
# usage: python json-split filename.json
# produces multiple filename_0.json of 1.49 MB size
import json
import sys
with open(sys.argv[1],'r') as infile:
o = json.load(infile)
chunkSize = 4550
for i in xrange(0, len(o), chunkSize):
with open(sys.argv[1] + '_' + str(i//chunkSize) + '.json', 'w') as outfile:
json.dump(o[i:i+chunkSize], outfile)
Assuming you have the option to go back and export the data again...:
pg_dump - extract a PostgreSQL database into a script file or other archive file.
pg_restore - restore a PostgreSQL database from an archive file created by pg_dump.
If that's no use, it might be useful to know what you're going to be doing with the output so that another suggestion can hit the mark.
I know this is question is from a while back, but I think this new solution is hassle-free.
You can use pandas 0.21.0 which supports a chunksize parameter as part of read_json. You can load one chunk at a time and save the json:
import pandas as pd
chunks = pd.read_json('file.json', lines=True, chunksize = 20)
for i, c in enumerate(chunks):
c.to_json('chunk_{}.json'.format(i))