How can I import JSON files - python

I have a problem. I have several JSON files, and I do not want to create the collections and import the files manually. I found the question Bulk import of .json files in arangodb with python, but unfortunately I get the error [OUT] AttributeError: 'Database' object has no attribute 'collection'.
How can I import several JSON files into collections fully automatically via Python?
import json
from pyArango.connection import *

conn = Connection(username="root", password="")
db = conn.createDatabase(name="test")
a = db.collection('collection_name')  # <- here is the error
for x in list_of_json_files:
    with open(x, 'r') as json_file:
        data = json.load(json_file)
    a.import_bulk(data)
I also looked at the documentation from ArangoDB https://www.arangodb.com/tutorials/tutorial-python/

There is no collection method on the db instance, which you are trying to call in your code on this line:
a = db.collection('collection_name') # <- here is the error
According to the docs, you should use the createCollection method of the db instance:
studentsCollection = db.createCollection(name="Students")
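Putting it together, a minimal sketch of a fully automatic import could look like the following (the collection names and JSON file names are made up here; it uses pyArango's createDocument/save pattern, since import_bulk belongs to the python-arango driver rather than pyArango):
import json
from pyArango.connection import Connection

conn = Connection(username="root", password="")
db = conn.createDatabase(name="test")

# hypothetical mapping of target collection names to JSON files
files = {"Students": "students.json", "Courses": "courses.json"}

for name, path in files.items():
    # create the collection only if it does not exist yet
    coll = db[name] if db.hasCollection(name) else db.createCollection(name=name)
    with open(path, "r") as json_file:
        docs = json.load(json_file)  # each file is expected to hold a list of dicts
    for d in docs:
        coll.createDocument(d).save()  # wrap each dict in a document and save it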

Why can't I connect to databases via an Airflow connection?

After pushing my DAG I get this error.
I am new to data engineering. I have tried to solve this error in different ways to the best of my knowledge, but nothing worked. I want to write a DAG that consists of two tasks: the first exports data from a database table on one server as CSV files, and the second imports these CSV files into database tables on another server. An Airflow variable contains the DAG configuration and the SQL scripts for exporting and importing the data.
Please tell me how I can solve this error.
I have this exporting code:
def export_csv():
    # imports and Airflow variables are loaded inside the function to reduce scheduler load
    import json
    import pandas as pd
    from airflow.models import Variable
    from airflow.providers.microsoft.mssql.hooks.mssql import MsSqlHook

    instruction_data = json.loads(Variable.get('MAIN_SOURCE_DAMDI_INSTRUCTIONS'))
    requirement_data = instruction_data['requirements']
    lst = requirement_data['scripts']
    ms_hook = MsSqlHook(mssql_conn_id='OKTELL')
    connection = ms_hook.get_conn()
    cursor = connection.cursor()
    for i in lst:
        # run each export script and dump the result set to its CSV file
        result = cursor.execute(i['export_script'])
        df = pd.DataFrame(result)
        df.to_csv(i['filename'], index=False, header=None, sep=',', encoding='utf-8')
    cursor.close()
And this is my task for exporting:
export_csv_func = PythonOperator(
    task_id='export_csv_func',
    python_callable=export_csv,
    mssql_conn_id='OKTELL'
)
P.S. I import the libraries and Airflow variables inside the function because there was previously a lot of load on the server, and this approach helped to reduce the load.
When using the PythonOperator you pass args to a callable via op_args and/or op_kwargs. In this case, if you wanted to pass the mssql_conn_id arg you can try:
export_csv_func = PythonOperator(
    task_id='export_csv_func',
    python_callable=export_csv,
    op_kwargs={'mssql_conn_id': 'OKTELL'},
)
Then you need to update the export_csv() function signature to accept this kwarg too.
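For example, the function could then look something like this (only a sketch of the relevant lines; the rest of the body stays as in the question):
def export_csv(mssql_conn_id):
    import json
    import pandas as pd
    from airflow.models import Variable
    # the connection id now arrives via op_kwargs instead of being hard-coded
    ms_hook = MsSqlHook(mssql_conn_id=mssql_conn_id)
    # ... rest of the body unchanged ...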

API call bombs in loop when object not found

My code is erroring out when the object "#odata.nextLink" is not found in the JSON. I thought the while loop was supposed to account for this? I apologize if this is rudimentary, but this is my first Python project, so I don't know the basics yet.
Also, for what it is worth, the API results are quite limited; there is no "total pages" value I can extract.
# Import Python ODBC module
import pyodbc
import requests
import json
import sys

cnxn = pyodbc.connect(driver="{ODBC Driver 17 for SQL Server}", server="theplacewherethedatais", database="oneofthosedbthings", uid="u", pwd="pw")
cursor = cnxn.cursor()
storedProc = "exec GetGroupsAndCompanies"
for irow in cursor.execute(storedProc):
    strObj = str(irow[0])
    strGrp = str(irow[1])
    print(strObj+" "+strGrp)
    response = requests.get(irow[2], timeout=300000, auth=('u', 'pw')).json()
    data = response["value"]
    while response["#odata.nextLink"]:
        response = requests.get(response["#odata.nextLink"], timeout=300000, auth=('u', 'pw')).json()
        data.extend(response["value"])
cnxn.commit()
cnxn.close()
You can use the in keyword to test if a key is present:
while "#odata.nextLink" in response:

How to load mongodb databases with special character in their names in pandas dataframe?

I'm trying to import MongoDB collection data into a pandas DataFrame. When the database name is simple, like 'admin', it loads into the DataFrame. However, when I try one of my required databases, named asdev-Admin (line 5), I get an empty DataFrame. Apparently the error is related to the special character in the db name, but I don't know how to get around it. How do I resolve this?
import pymongo
import pandas as pd
from pymongo import MongoClient
client = MongoClient()
db = client.asdev-Admin
collection = db.system.groups
data = pd.DataFrame(list(collection.find()))
print(data)
The error states: NameError: name 'Admin' is not defined
You can change db = client.asdev-Admin to db = client['asdev-Admin'].
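In full, the access becomes something like this (a sketch; client.asdev-Admin is parsed by Python as client.asdev - Admin, which is why it complains that Admin is not defined, while bracket access treats the whole string as the name):
db = client['asdev-Admin']        # dictionary-style access handles the hyphen
collection = db['system.groups']  # the same style also works for the collection name
data = pd.DataFrame(list(collection.find()))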

Reading metadata results from pyoai

I'm working with the pyoai library on Python 3.7 to harvest metadata using the OAI-PMH protocol, but I'm running into trouble when reading the list of records.
from oaipmh.client import Client
from oaipmh.metadata import MetadataRegistry, oai_dc_reader
URL = 'http://revista-iberoamericana.pitt.edu/ojs/index.php/Iberoamericana/oai'
registry = MetadataRegistry()
registry.registerReader('oai_dc', oai_dc_reader)
client = Client(URL, registry)
for record in client.listRecords(metadataPrefix='oai_dc'):
    print(record)
I was expecting some kind of XML output in the tuples, but the results look like this:
(<oaipmh.common.Header object at 0x00000251FAA16A20>, <oaipmh.common.Metadata object at 0x00000251FAA160B8>, None)
(<oaipmh.common.Header object at 0x00000251FA9DB5C0>, <oaipmh.common.Metadata object at 0x00000251FA9C6518>, None)
(<oaipmh.common.Header object at 0x00000251FA9DB0F0>, <oaipmh.common.Metadata object at 0x00000251FA9DB208>, None)
Could you tell me if I'm forgetting something?
You can use record[1].getMap()
https://tinker.edu.au/resources/recipes/api-via-oai-pmh/
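For example, a sketch of unpacking the record tuples and reading the Dublin Core fields (getMap() returns a dict mapping field names such as 'title' or 'creator' to lists of values):
for header, metadata, about in client.listRecords(metadataPrefix='oai_dc'):
    fields = metadata.getMap()                       # e.g. {'title': [...], 'creator': [...]}
    print(header.identifier(), fields.get('title'))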

TypeError when importing json data with pymongo

I am trying to import JSON data from a link containing valid JSON data into MongoDB.
When I run the script I get the following error:
TypeError: document must be an instance of dict, bson.son.SON, bson.raw_bson.RawBSONDocument, or a type that inherits from collections.MutableMapping
What am I missing here or doing wrong?
import pymongo
import urllib.parse
import requests
replay_url = "http://live.ksmobile.net/live/getreplayvideos?"
userid = 769630584166547456
url2 = replay_url + urllib.parse.urlencode({'userid': userid}) + '&page_size=1000'
print(f"Replay url: {url2}")
raw_replay_data = requests.get(url2).json()
uri = 'mongodb://testuser:password@ds245687.mlab.com:45687/liveme'
client = pymongo.MongoClient(uri)
db = client.get_default_database()
replays = db['replays']
replays.insert_many(raw_replay_data)
client.close()
I saw that you are getting the video information data for 22 videos.
You can use:
replays.insert_many(raw_replay_data['data']['video_info'])
to save them.
You can also make one of the fields the _id of each MongoDB document. Use the following loop before insert_many:
for i in raw_replay_data['data']['video_info']:
    i['_id'] = i['vid']
This will use the 'vid' field as your '_id'. Just make sure that 'vid' is unique across all videos.
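Putting both suggestions together, the end of the script could look like this (a sketch that assumes the response keeps the data/video_info structure mentioned above):
videos = raw_replay_data['data']['video_info']
for v in videos:
    v['_id'] = v['vid']          # reuse the video id as the document _id
replays.insert_many(videos)      # insert_many expects a list of dicts
client.close()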
