I have a bson file: xyz.bson full of useful data and I'd like to query/process the data using python. Is there a simple example/tutorial out there I can get started with?
You could use the mongorestore command to import the data into a MongoDB server and then query it by connecting to that server.
If you want to stream the data as though it were a flat JSON file on disk rather than loading it into a mongod, you can use this small python-bson-streaming library:
https://github.com/bauman/python-bson-streaming
from sys import argv
from bsonstream import KeyValueBSONInput
for path in argv[1:]:
    with open(path, 'rb') as f:
        # Remove fast_string_prematch if you do not need the fast string match.
        stream = KeyValueBSONInput(fh=f, fast_string_prematch="something")
        for doc_id, dict_data in stream:
            if doc_id:
                ...  # process dict_data here
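If you would rather not depend on a streaming library, the file layout is simple enough to walk by hand: a .bson dump is a sequence of documents, each starting with a little-endian int32 that holds the document's total size (including those four bytes). The sketch below only splits the file into raw documents; the helper name iter_bson_documents is made up for illustration, and decoding each chunk is left to a BSON decoder such as bson.decode from PyMongo's bson package (assumed to be installed, not shown here).

```python
import struct


def iter_bson_documents(fh):
    """Yield each BSON document in an open binary file as raw bytes.

    Hypothetical helper: it relies only on the BSON rule that every
    document begins with its own total length as a little-endian int32.
    """
    while True:
        header = fh.read(4)
        if not header:
            return  # clean end of file
        if len(header) < 4:
            raise ValueError("truncated BSON length prefix")
        # Total document size, including the 4 length bytes themselves.
        (size,) = struct.unpack("<i", header)
        if size < 5:
            raise ValueError("invalid BSON document size: %d" % size)
        body = fh.read(size - 4)
        if len(body) != size - 4:
            raise ValueError("truncated BSON document")
        yield header + body
```

Open the file with open('xyz.bson', 'rb') and loop over iter_bson_documents(f); only one document is held in memory at a time.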
You can use sonq to query a .bson file directly from bash, or import the library and use it in Python.
A few examples:
Query a .bson file
sonq -f '{"name": "Stark"}' source.bson
Convert query results to a newline separated .json file
sonq -f '{"name": {"$ne": "Stark"}}' -o target.json source.bson
Query a .bson file in python
from sonq.operation import query_son
record_list = list(query_son('source.bson', filters={"name": {"$in": ["Stark"]}}))
I want to import a directory with multiple subdirectories and a lot of JSON files into MongoDB via a Python script. However, I can only import multiple JSON files via the GUI in Compass, or one file at a time with a script, using the following code that I gathered from another question on Stack Overflow (How to import JSON file to MongoDB using Python):
import json
from pymongo import MongoClient
client = MongoClient('localhost', 27017)
db = client['acme']
collection_posts = db['posts']
with open('9995-f0763044.json') as f:
    file_data = json.load(f)
collection_posts.insert_one(file_data)
client.close()
How can I change this so I can loop through an entire directory and import all of the JSON files? I have seen the insert_many() method, but as far as I understand it, the specific filenames still have to be written into the code. Ideally I would just enter a directory in the script and it would scan and upload all the JSON files in that directory. Is this even possible? Thanks for your help.
Something like this?
import glob
import json
# collection_posts and client are set up as in the question's code
filelist = glob.glob('your/path/*.json')
for filename in filelist:
    with open(filename) as f:
        file_data = json.load(f)
    collection_posts.insert_one(file_data)
client.close()
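The glob pattern above only matches files directly inside one directory, while the question also mentions subdirectories. A minimal sketch of a recursive variant follows, assuming each file holds a single JSON object and that the collection argument behaves like a pymongo collection (it only needs an insert_many method). The function name load_json_tree and the batch_size default are made up for illustration.

```python
import json
from pathlib import Path


def load_json_tree(root_dir, collection, batch_size=500):
    """Insert every *.json file under root_dir (recursively) into collection.

    Hypothetical helper: assumes one JSON object per file and a
    pymongo-style collection exposing insert_many().
    """
    batch, inserted = [], 0
    # rglob descends into all subdirectories of root_dir.
    for path in sorted(Path(root_dir).rglob("*.json")):
        with path.open(encoding="utf-8") as f:
            batch.append(json.load(f))
        if len(batch) >= batch_size:
            collection.insert_many(batch)
            inserted += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        collection.insert_many(batch)
        inserted += len(batch)
    return inserted
```

With the question's setup this would be called as load_json_tree('your/path', collection_posts). Batching keeps the number of round trips to the server low without holding every file in memory at once.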
I am using this code to read a JSON file and pass its content to a subprocess. It works only for small JSON files. If the content exceeds a length of 33766, it fails with this error:
FileNotFoundError: [WinError 206] The filename or extension is too long.
This happens because the length exceeds 33766. How can I pass the JSON file using Popen? I read that this can solve the problem. Any suggestions would help. I am new here :\
import subprocess
import json
import os
from pprint import pprint
auth = "authorization: token 1234"
file = "jsoninput11.json"
with open(file) as fd:
    json_content = fd.read()
subprocess.run(["grpcurl", "-plaintext","-H", auth,"-d","#",json_content,"-format","json","100.20.20.1:5000","api.Service/Method"])
I am not sure, but the problem may be related to bufsize (see: Very large input and piping using subprocess.Popen).
Does it work with capture_output=False?
subprocess.run(["grpcurl", "-plaintext","-H", auth,"-d","#",json_content,"-format","json","100.20.20.1:5000","api.Service/Method"], capture_output=False)
On the other hand, if you need the output, you may have to work with the PIPE of Popen.
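Another angle on the WinError 206: in the question's call the whole JSON text is one command-line argument, and Windows caps a process command line at roughly 32K characters. Sending the payload over stdin avoids that limit. The sketch below is a generic helper (the name run_with_stdin is made up); the commented grpcurl call assumes that "-d @" tells grpcurl to read the request body from stdin, which is worth confirming against grpcurl -help for your version.

```python
import subprocess


def run_with_stdin(command, payload):
    """Run command, feed payload (a str) on stdin, and return its stdout.

    Hypothetical helper: the payload never appears in the argument list,
    so the command-line length limit does not apply to it.
    """
    result = subprocess.run(
        command,
        input=payload,        # written to the child's stdin
        capture_output=True,  # collect stdout and stderr
        text=True,            # str in, str out
        check=True,           # raise CalledProcessError on non-zero exit
    )
    return result.stdout


# Assumed usage with the question's variables (not executed here):
# output = run_with_stdin(
#     ["grpcurl", "-plaintext", "-H", auth, "-d", "@", "-format", "json",
#      "100.20.20.1:5000", "api.Service/Method"],
#     json_content,
# )
```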
I know I should include some code, but I have nothing useful yet.
There is a ~300GB JSON file on GCS at gs://path/listings_all.json. Ultimately I'm trying to import it into BigQuery, but its data structure is rejected (I exported it from MongoDB with mongoexport):
invalid field name "$date". Fields must contain
only letters, numbers, and underscores, start with a letter or
underscore, and be at most 128 characters long
So my approach now is to read the source file line by line from GCS, process it, and upload each processed line to BigQuery using the Python API.
Below is a simple reader I put together to test with a sample of 100 lines from the original huge file:
import json
from pprint import pprint
with open('schema_in_10.json') as f:
    for line in f:
        j_content = json.loads(line)
        # print(j_content['id'], j_content['city'], j_content['country'], j_content['state'], j_content['country_code'], j_content['smart_location'], j_content['address'], j_content['market'], j_content['neighborhood'])
        # // geo { lat, lng}'])
        print('------')
        pprint(j_content['is_location_exact'])
        pprint(j_content['zipcode'])
        pprint(j_content['name'])
Can you please help me read or stream a huge JSON file line by line from Google Cloud Storage with Python 3?
smart_open now has support for streaming GCS files.
from smart_open import open
# stream from GCS
with open('gs://my_bucket/my_file.txt') as fin:
    for line in fin:
        print(line)
# stream content *into* GCS (write mode):
with open('gs://my_bucket/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')
Reading it line by line and then trying to stream to BigQuery won't scale with 300GB on your local machine, and you'll struggle to get this working TBH.
There are a couple of scalable options:
Write a Cloud Dataflow pipeline to read your file from GCS (it will scale for you and read in parallel), correct the field name, and then write to BigQuery. See here.
Load it directly into BigQuery using CSV instead of JSON as the format, with a delimiter that doesn't appear in your data. This will load each record into a single STRING column, and then you can use BigQuery's JSON functions to extract what you need. See here.
Here is an example implementation of a solution in GCP Dataflow that corresponds to the first suggestion in the accepted answer. You'll need to implement the json correction in function json_processor. You can run this code in a Datalab notebook.
# Datalab might need an older version of pip
# !pip install pip==9.0.3
import apache_beam as beam
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import StandardOptions
project_id = 'my-project'
bigquery_dataset_name = 'testdataset' # needs to exist
table_name = 'testtable'
bucket_name = 'my-bucket'
json_file_gcs_path = 'gs://path/to/my/file.json'
schema = "name:STRING,zipcode:STRING"
def json_processor(row):
    import json
    d = json.loads(row)
    return {'name': d['name'], 'zipcode': d['zipcode']}
options = beam.options.pipeline_options.PipelineOptions()
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = project_id
google_cloud_options.job_name = "myjob"
google_cloud_options.staging_location = 'gs://{}/binaries'.format(bucket_name)
google_cloud_options.temp_location = 'gs://{}/temp'.format(bucket_name)
options.view_as(StandardOptions).runner = 'DataflowRunner'
google_cloud_options.region = "europe-west1"
p = beam.Pipeline(options=options)
(p | "read_from_gcs" >> beam.io.ReadFromText(json_file_gcs_path)
   | "json_processor" >> beam.Map(json_processor)
   | "write_to_bq" >> beam.io.Write(beam.io.gcp.bigquery.BigQuerySink(table=table_name,
                                                       dataset=bigquery_dataset_name,
                                                       project=project_id,
                                                       schema=schema,
                                                       create_disposition='CREATE_IF_NEEDED',
                                                       write_disposition='WRITE_EMPTY'))
)
p.run()
Parsing a json file line by line with the builtin json parser is not going to work (unless it's actually a "json lines" doc of course), so you want a streaming parser instead.
But while this will solve the memory use issue, it won't fix invalid json, so your best bet is to first fix the invalid json source as a pure text file, either in python or using sed or some similar tool, then use the incremental parser to parse your content.
def fixfile(sourcepath, destpath):
    with open(sourcepath) as source, open(destpath, "w") as dest:
        for line in source:
            # you may want to use a regexp if this simple solution
            # breaks something else
            line = line.replace("$date", "date")
            dest.write(line)
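If the file really is newline-delimited JSON (one document per line, which is what the question's own reader assumes), a slightly safer variant of the text replacement is to parse each line and rename only the keys, so that string values that happen to contain "$date" are left alone. This is a sketch under that assumption; the function names are made up, and note that a renamed key could collide with an existing key of the same name in the same object.

```python
import json


def strip_dollar_keys(value):
    """Recursively rename keys such as "$date" to "date".

    Hypothetical helper: only dictionary keys are touched, values are
    copied through unchanged.
    """
    if isinstance(value, dict):
        return {key.lstrip("$"): strip_dollar_keys(item)
                for key, item in value.items()}
    if isinstance(value, list):
        return [strip_dollar_keys(item) for item in value]
    return value


def fix_json_lines(source, dest):
    """Copy newline-delimited JSON from source to dest with cleaned keys."""
    for line in source:
        if line.strip():  # skip blank lines
            cleaned = strip_dollar_keys(json.loads(line))
            dest.write(json.dumps(cleaned) + "\n")
```

Both arguments are open text file objects, so the same function works with local files or with any file-like stream, and only one line is held in memory at a time.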
I am trying to store the JSON as a text file. I am able to print the data, but I am not able to store it in the file, and the output also comes out with unicode characters.
Please find the code below.
import json
from pprint import pprint
with open('20150827_abc_json') as data_file:
    f=open("file.txt","wb")
    f.write(data=json.load(data_file))
    print (data)>f
    f.close()
When I execute it, the file gets created but it is zero bytes. How can I get rid of the unicode characters and store the output?
Output:
u'Louisiana', u'city': u'New Olreans'
To serialize JSON to a file, use the json.dump function. Try the following code:
import json
from pprint import pprint
with open('20150827_abc_json') as data_file, open('file.txt','w') as f:
    data=json.load(data_file)
    print(data)
    json.dump(data,f)
The print syntax is wrong: you put only a single > where there should be two of them (>>). In Python 2 the redirect form is print >>f, data.
In Python 3 (or Python 2 if you from __future__ import print_function) you can also write it in a more explicit way:
print("blah blah", file=yourfile)
I would also suggest using a context manager for both files:
with open('20150827_abc_json') as data_file:
    with open("file.txt","wb") as file:
        ...
Otherwise you risk that an error will leave your destination file open.
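For reference, here is a sketch of the same task in Python 3, where strings are Unicode by default. The u'...' prefix in the question's output is only how Python 2 displays unicode objects when printing a dict; it is not part of the data, and json.dump does not write it. The function name copy_json_as_text is made up; ensure_ascii=False and indent are standard json.dump parameters.

```python
import json


def copy_json_as_text(src_path, dest_path):
    """Load JSON from src_path and write it to dest_path as readable text.

    Hypothetical helper for Python 3.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(dest_path, "w", encoding="utf-8") as dest:
        data = json.load(src)
        # ensure_ascii=False keeps non-ASCII characters as-is instead of
        # writing \uXXXX escapes; indent makes the file human-readable.
        json.dump(data, dest, ensure_ascii=False, indent=2)
    return data
```

Both files are closed when the with block exits, even if json.load raises.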
I'm trying to back up my database using the following script:
import xmlrpclib
sock = xmlrpclib.ServerProxy('http://localhost:8069/xmlrpc/db')
backup_file = open('backup.dump', 'wb') # Same extension used by Odoo
backup_file.write(sock.dump('mypassword', 'mydb'))
backup_file.close()
At this point the content of the file is something like this:
UEsDBBQAAAAIADGEbkVAyAv5JvGAAMH+wQEIAAAAZHVtcC5zcWzsvWtz3EaSNvrdvwLxbrzH5K7N
GWv3ndjjGc8GTdG2ZinJI9LWzjlxogPsRlMYo4E2gJZE//pTV6CuQFUhE/RcFDFjNrrxZNYt88ms
2+eff/L559n3Tdc/tMXtn2+yXd7n93lXZLvT4Ui+++ST2+u7rOvzvjgUdb/py0PRnPrsq+y3v2df
Vc32J/vptirpr4t62+zK+oF88ekPd9/856e/l3D1Lm93m21T75v2QH6x6fqW/Kcjv2xqgfGuIND7
U73ty6be3BOkgn6/z6uu0MQQgM2h6Lr8gf3gQ97WBOv3n1D9SfFe5Yfiy+xYHR+6n6vfZ3ePR/Lx
+n/url/dvnj96vfZLZF0yL/MPv999vpDXbTkL1byqzfXl3fX4y+zF99kr17fkQcvbu9uJWD29sXd
d9nt1XfXLy+z48NmS2qwaqh0TfyIYihy9frly+tXdxNq8B9k5FULJHtxm336/c1vjg+08Y5tsy12
pzavsiqvH06kPj6lerA6L/J2+25zzPt3pIqOp/uq3H6m60t/tiv2+aki7ZzfV0V3zLcFbbtPjW8/
lP27TVPulObQCptvt82JNIz4ryzq3eXXN9djQbkSY2nJzwapX2ZqE7AXTdTs7JOM/Ct3WVn3xUPR
ssZ59cPNzWfsi2Pe0s5RFfte/kL7oi0f3hnfkN5akH6Xt/m2J3jv8/aRdKSz3/3HuYG9bQsyIjZk
tBQZ7fykRx+OGa0WOgzok+yXpi74j9uC9PNtWRXZfdNURV4LjFNL9Ng+bsYSaOAn8/mHtnQ9PnVF
...
...
When backing up through the Odoo Database Management I get a zipped file which is what I'm trying to achieve. For example test_2014-11-12_16-06-35Z.dump:
Is there a way to "reconstruct" all those bytes into a valid Odoo backup file? I tried StringIO and BytesIO with no success. Any help will be much appreciated.
Solution
Thanks to @André I finally have a solution:
import base64
import xmlrpclib
sock = xmlrpclib.ServerProxy('http://localhost:8069/xmlrpc/db')
backup_file = open('backup.dump', 'wb')
backup_file.write(base64.b64decode(sock.dump('mypassword', 'mydb')))
backup_file.close()
The dump() function encodes the file in Base64 before returning it. You can decode it with the base64 command:
base64 -d [dump file] > [decoded file]
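A quick way to confirm the decode worked is to check that the result is a zip archive, since the dump text in the question starts with "UEsDB", which is the Base64 form of the zip signature bytes PK\x03\x04. The sketch below is written for Python 3 (where xmlrpclib is named xmlrpc.client); the function name save_base64_backup is made up, and the RPC call itself is left out because it is unchanged from the solution above.

```python
import base64
import zipfile


def save_base64_backup(payload, dest_path):
    """Decode a Base64 dump, write it to dest_path, and report zip validity.

    Hypothetical helper: payload is the Base64 text (str or bytes) as
    returned by the dump() call in the thread.
    """
    raw = base64.b64decode(payload)
    with open(dest_path, "wb") as backup_file:
        backup_file.write(raw)
    # True when the written file has a readable zip structure.
    return zipfile.is_zipfile(dest_path)
```

A False return value would suggest the payload was truncated or was not Base64-encoded zip data in the first place.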