I have a Python script that pulls all records from a SQL table, transforms them, and dumps them to a JSON file. It looks something like this:
data = []

# accumulate data into one large list
for record in get_sql_data():
    obj = MyObject(record)
    transformed_obj = transform(obj)
    data.append(transformed_obj)

# dump all data at once to json file
with open("result.json", "w") as f:
    json.dump({
        "name": "My data",
        "timestamp": datetime.datetime.now(),
        "data": data,
    }, f)
This used to work fine when my SQL table was small. Now it has millions of records and the script crashes because it runs out of memory when adding objects to the data list.
Is there any way to incrementally dump these objects to the file so I don't need to create this massive data list? Something like this:
with open("result.json", "w") as f:
for record in get_sql_data():
obj = MyObject(record)
transformed_obj = transform(obj)
# write transformed_obj to json file
Note: the object transformation step is very complex, so I can't do it in the database. I need to do it in Python.
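One possible approach, shown here only as a minimal sketch: keep the same top-level shape (a JSON object with a "data" array), write that envelope by hand, and json.dump() each transformed element individually so only one object is in memory at a time. This reuses get_sql_data(), MyObject and transform() from the question and assumes transform() returns something JSON-serializable.

import datetime
import json

with open("result.json", "w") as f:
    # write the top-level object and the opening of the "data" array by hand
    f.write('{"name": "My data", "timestamp": %s, "data": [' % json.dumps(str(datetime.datetime.now())))
    first = True
    for record in get_sql_data():
        transformed_obj = transform(MyObject(record))
        if not first:
            f.write(",")  # comma between array elements
        json.dump(transformed_obj, f)  # serialize one element at a time
        first = False
    f.write("]}")  # close the "data" array and the top-level object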
Related
The question is very self-explanatory.
I need to write to or append at a specific key/value of a JSON object via Python.
I'm not sure how to do it because I'm not good with JSON, but here is an example of how I tried to do it (I know it is wrong).
with open('info.json', 'a') as f:
    json.dumps(data, ['key1'])
This is the JSON file:
{"key0":"xxxxx#gmail.com","key1":"12345678"}
A typical usage pattern for JSON in Python is to load the JSON object into Python, edit that object, and then write the resulting object back out to a file.
import json

with open('info.json', 'r') as infile:
    my_data = json.load(infile)

my_data['key1'] = my_data['key1'] + 'random string'
# perform other alterations to my_data here, as appropriate ...

with open('info.json', 'w') as outfile:
    json.dump(my_data, outfile)
Contents of 'info.json' are now
{"key0": "xxxxx#gmail.com", "key1": "12345678random string"}
The key operations were json.load(fp), which deserialized the file into a Python object in memory, and json.dump(obj, fp), which reserialized the edited object to the file being written out.
This may be unsuitable if you're editing very large JSON objects and cannot easily pull the entire object into memory at once, but if you're just trying to learn the basics of Python's JSON library it should help you get started.
An example of appending data to a JSON object using the json library:
import json
raw = '{ "color": "green", "type": "car" }'
data_to_add = { "gear": "manual" }
parsed = json.loads(raw)
parsed.update(data_to_add)
You can then serialize the result with json.dumps, or write it back to a file with json.dump.
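For instance, continuing the snippet above (a small sketch; "data.json" is just an assumed example file name):

# write the merged object back out to a file
with open("data.json", "w") as f:
    json.dump(parsed, f, indent=2)

# or keep it as a string
as_text = json.dumps(parsed)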
I am trying to parse deeply nested JSON data which is saved as a .dms file. I saved some transactions from the file as a .json file. When I use the json.load() function to read the .json file, I get the error:
JSONDecodeError: Extra data: line 2 column 1 (char 4392)
Opening the .dms file in a text editor, I copied 3 transactions from it and saved them as a .json file. The transactions in the file are not separated by commas; they are separated by new lines. When I saved a single transaction as a .json file and used json.load(), it read successfully. But when I try the JSON file with 3 transactions, it shows the error above.
import json

d = json.load(open('t3.json'))

# or

with open('t3.json') as f:
    data = json.load(f)
print(data)
The example transaction is:
{
    "header": {
        "msgType": "SOURCE_EVENT"
    },
    "content": {
        "txntype": "ums",
        "ISSUE": {
            "REQUEST": {
                "messageTime": "2019-06-06 21:54:11.492",
                "Code": "655400"
            },
            "RESPONSE": {
                "Time": "2019-06-06 21:54:11.579"
            }
        },
        "DATA": {
            "UserId": "021"
        }
    }
}
{"header": {...}}
{"header": {...}}
This is how my JSON data from the API looks. I wrote it here in a readable way, but in the file it is all written continuously, and each transaction starts on a new line where its header begins. The .dms file has 3500 transactions. Two transactions are never separated by commas, only by new lines. But within a transaction there can be extra spaces inside a value, e.g. "company": "Target Chips 123 CA".
The output I need:
I need to make a CSV by extracting the values of the keys messageType, messageTime, and userid from each transaction.
Please help me clear the error and suggest a way to extract the data I need from every transaction and put it in a .csv file for further analysis and machine learning modeling.
If each object is contained within a single line, then read one line at a time and decode each line separately:
with open(fileName, 'r') as file_to_read:
    for line in file_to_read:
        json_line = json.loads(line)
If objects are spread over multiple lines, then ideally try to fix the source of the data; otherwise use my library jsonfinder. Here is an example answer that may help.
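To get from the line-by-line parsing to a CSV, here is a minimal sketch. It assumes each transaction sits on its own line and uses the key paths visible in the example transaction above (header.msgType, content.ISSUE.REQUEST.messageTime, content.DATA.UserId); the file names are just placeholders, so adjust them and the paths to your real data.

import csv
import json

with open("t3.json") as file_to_read, open("transactions.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["msgType", "messageTime", "UserId"])  # header row
    for line in file_to_read:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        txn = json.loads(line)
        writer.writerow([
            txn.get("header", {}).get("msgType"),
            txn.get("content", {}).get("ISSUE", {}).get("REQUEST", {}).get("messageTime"),
            txn.get("content", {}).get("DATA", {}).get("UserId"),
        ])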
I would like to ask how to use Python to import a .txt file into MongoDB.
The .txt file is huge (ca. 800 MB) but has a simple data structure:
title1...TAB...text1text1text1text1text1text1\n
title2...TAB...text2text2text2text2text2text2\n
title3...TAB...text3text3text3text3text3text3\n
The ...TAB... means there is a tab character, or a wide space. (Sorry, I don't know exactly how to describe it.)
The desired MongoDB format should look like:
{
    "title": title1,
    "description": text1text1text1text1text1text1\n,
    "extra": EMPTY
}
... and so on.
I tried the code from storing full text from txt file into mongodb:
from pymongo import MongoClient
client = MongoClient()
db = client.test_database # use a database called "test_database"
collection = db.files # and inside that DB, a collection called "files"
f = open('F:\\ttt.txt') # open a file
text = f.read() # read the entire contents, should be UTF-8 text
# build a document to be inserted
text_file_doc = {"file_name": "F:\\ttt.txt", "contents" : text }
# insert the contents into the "file" collection
collection.insert(text_file_doc)
To be honest, as a newbie I don't quite understand what the code means, so it is no surprise that the code above doesn't work for my purpose.
Could anybody please help me out of this problem? Any help will be highly appreciated!
It boils down to how your input file is formatted.
If it consistently follows the format you outlined, i.e. there are no tabs/whitespace characters in the title portion and the "extra" field is always empty, you could go for something like this:
# your mongo stuff goes here

file_content = []

with open("ttt.txt") as f:
    for line in f:
        # assuming tabs and not multiple space characters
        title, desc = line.strip().split("\t", maxsplit=1)
        file_content.append({"title": title, "description": desc, "extra": None})

# insert the documents directly as a list of dicts, not as a JSON string
collection.insert_many(file_content)
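Since the file is around 800 MB, building the whole file_content list in memory may be wasteful. A sketch of a batched variant, assuming the same ttt.txt format and the database/collection names from the question (the batch size is arbitrary):

from pymongo import MongoClient

client = MongoClient()
collection = client.test_database.files

batch, BATCH_SIZE = [], 1000
with open("ttt.txt", encoding="utf-8") as f:
    for line in f:
        title, desc = line.strip().split("\t", maxsplit=1)
        batch.append({"title": title, "description": desc, "extra": None})
        if len(batch) >= BATCH_SIZE:
            collection.insert_many(batch)  # flush a full batch
            batch = []
if batch:
    collection.insert_many(batch)  # flush the remainder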
So I have a function in python which generates a dict like so:
player_data = {
    "player": "death-eater-01",
    "guild": "monster",
    "points": 50
}
I get this data by calling a function. Once I get this data I want to write this into a file, so I call:
g = open('team.json', 'a')
with g as outfile:
    json.dump(player_data, outfile)
This works fine. However, my problem is that since a team consists of multiple players, I call the function again to get new player data:
player_data = {
    "player": "moon-master",
    "guild": "mage",
    "points": 250
}
Now when I write this data into the same file, the JSON breaks... as in, it shows up like so (with no comma between the two objects):
{
    "player": "death-eater-01",
    "guild": "monster",
    "points": 50
}
{
    "player": "moon-master",
    "guild": "mage",
    "points": 250
}
What I want is to store both of these as proper JSON in the file. For various reasons I cannot prepare the full JSON object upfront and then save it in a single shot; I have to do it incrementally due to network breakage, performance, and other issues.
Can anyone guide me on how to do this? I am using Python.
You shouldn't append data to an existing file. Rather, you should build up a list in Python first which contains all the dicts you want to write, and only then dump it to JSON and write it to the file.
If you really can't do that, one option would be to load the existing file, convert it back to Python, then append your new dict, dump to JSON and write it back replacing the whole file.
To produce valid JSON you will need to load the previous contents of the file, append the new data to that and then write it back to the file.
Like so:
import json
import os

def append_player_data(player_data, file_name="team.json"):
    if os.path.exists(file_name):
        with open(file_name, 'r') as f:
            all_data = json.load(f)
    else:
        all_data = []
    all_data.append(player_data)
    with open(file_name, 'w') as f:
        json.dump(all_data, f)
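Usage would then look something like this (a sketch; each call reloads team.json, appends the new player, and rewrites the whole file as one JSON array):

player_data = {"player": "moon-master", "guild": "mage", "points": 250}
append_player_data(player_data)  # team.json now contains a JSON array of all players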
I am working with JSON files in Python. I have keys in a list; for each key I retrieve a JSON record, which I store in a file. Each JSON record contains that key inside it.
In a separate process, I need to read that JSON file back, retrieve each record, get the key and do something else.
The problem is that I cannot read the JSON lines back from the file I stored this way:
dataFile = open(output_file, "w")
for line in keys_list:
    json_line = fetch_json_str(line)
    data = simplejson.loads(json_line)
    dataFile.write(simplejson.dumps(data, sort_keys=True))
It reads everything into one line, and the length of the returned list, len(json_lines), is 0:
json_lines = [line.strip() for line in open(tmp_load_file)]
for line in json_lines:
    data = simplejson.loads(simplejson.dumps(line))
What am I doing wrong? Is there any way to get the JSON lines back without changing the way I stored them? Changing it would require re-processing all the JSON files I have already stored this way.
You need to add a line separator when you write the JSON string:
dataFile.write(simplejson.dumps(data, sort_keys=True) + '\n')
Otherwise, all of the JSON records end up on a single line and the JSON parser can't tell them apart.
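With one record per line, reading them back is just a matter of parsing each line on its own. A minimal sketch, assuming simplejson as above and that tmp_load_file still names the file written earlier:

import simplejson

records = []
with open(tmp_load_file) as f:
    for line in f:
        line = line.strip()
        if line:  # skip any blank lines
            records.append(simplejson.loads(line))  # one JSON record per line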