Related
When working with json.dump() I noticed that it appears to be rewriting the entire document. Is this correct, and is there another way to append to the dictionary like .append() deos with lists?
When I write the function like this and change the key value (name), it would appear that the item is being appended.
filename = "infohere.json"
name = "Bob"
numbers = 20
#Write to JSON
def writejson(name = name, numbers = numbers):
with open(filename, "r") as info:
xdict = json.load(info)
xdict[name] = numbers
with open(filename, "w") as info:
json.dump(xdict, info)
When you write it out like this however, you can see that the code clearly writes over the entire dictionary/json file.
filename = infohere.json
dict = {"Bob":23, "Mark":50}
dict2 = {Ricky":40}
#Write to JSON
def writejson2(dict):
with open(filehere, "w") as info:
json.dump(dict, info)
writejson(dict)
writejson(dict2)
In the second example it only ever shows up the last date input leading me to believe that this is rewriting the entire document. If the case is that it writes the whole document during each json.dump, does this cause issues with larger json file, if so is there another method like .append() but for dealing with json.
Thanks in advance.
Neither.
json.dump doesn't decide whether to delete prior content when it writes to a file. That decision happens when you run open(filehere, "w"); that is what deletes old content.
But: Normal JSON isn't amenable to appends.
A single JSON document is one object. There are variants on the format that allow multiple documents in one file, the most common of which is JSONL (which has one JSON document per line). Unless you're using such a format, trying to append JSON to a non-empty file usually won't result in something that can be successfully parsed.
I'm fairly new to Python and I am attempting to group various postcodes together under predefined labels. For example "SA31" would be labelled a "HywelDDAPostcode"
I have some code where I read lots of postcodes from a singled columned file into a list and compare them with postcodes that are in predefined lists. However, when I output my postcode labels only the Label "UKPostcodes" is outputted for every postcode in my original file. It would appear that the first two conditions in my code always evaluate to false no matter what. Am I doing the right thing using "in"? Or perhaps it's a file reading issue? I'm not sure
The input file is simply a file which contains a list of postcodes ( in reality it has thousands of rows)
The CSV file
Here is my code:
import csv
with open('postcodes.csv', newline='') as f:
reader = csv.reader(f)
your_list = list(reader)
my_list =[]
HywelDDAPostcodes=["SA46","SY23","SY24","SA18","SA16","SA43","SA31","SA65","SA61","SA62","SA17","SA48","SA40","SA19","SA20","SA44","SA15","SA14","SA73","SA32","SA67","SA45",
"SA38","SA42","SA41","SA72","SA71","SA69","SA68","SA33","SA70","SY25","SA34","LL40","LL42","LL36","SY18","SY17","SY20","SY16","LD6"]
NationalPostcodes=["LL58","LL59","LL60","LL61","LL62","LL63","LL64","LL65","LL66","LL67","LL68","LL69","LL70","LL71","LL72","LL73","LL74","LL75","LL76","LL77","LL78",
"NP1","NP2","NP23","NP3","CF31","CF32","CF33","CF34","CF35","CF36","CF3","CF46","CF81","CF82","CF83","SA35","SA39","SA4","SA47","LL16","LL18","LL21","LL22","LL24","LL25","LL26","LL27","LL28","LL29","LL30","LL31","LL32","LL33","LL34","LL57","CH7","LL11","LL15","LL16","LL17","LL18","LL19","LL20","LL21","LL22","CH1","CH4","CH5","CH6","CH7","LL12","CF1","CF32","CF35","CF5","CF61","CF62","CF63","CF64","CF71","LL23","LL37","LL38","LL39","LL41","LL43","LL44","LL45","LL46","LL47","LL48","LL49","LL51","LL52","LL53","LL54","LL55","LL56","LL57","CF46","CF47","CF48","NP4","NP5","NP6","NP7","SA10","SA11","SA12","SA13","SA8","CF3","NP10","NP19","NP20","NP9","SA36","SA37","SA63","SA64","SA66","CF44","CF48","HR3","HR5","LD1","LD2","LD3","LD4","LD5","LD7","LD8","NP8","SY10","SY15","SY19","SY21","SY22","SY5","CF37","CF38","CF39","CF4","CF40","CF41","CF42","CF43","CF45","CF72","SA1","SA2","SA3","SA4","SA5","SA6","SA7","SA1","NP4","NP44","NP6","LL13","LL14","SY13","SY14"]
NationalPostcodes2= list(dict.fromkeys(NationalPostcodes))
labels=["HywelDDA","NationalPostcodes","UKPostcodes"]
for postcode in your_list:
#print(postcode)
if postcode in HywelDDAPostcodes:
my_list.append(labels[0])
if postcode in NationalPostcodes2:
my_list.append(labels[1])
else:
my_list.append(labels[2])
with open('DiscretisedPostcodes.csv','w') as result_file:
wr = csv.writer(result_file, dialect='excel')
for item in my_list:
wr.writerow([item,])
If anyone has any advice as to what could be causing the issue or just any advice surrounding Python, in general, I would very much appreciate it. Thank you!
The reason why your comparison block isn't working is that when you use csv reader to read your file, each line is being added to your_list as a list. So you are making a list of lists and when you compare those things it doesn't match.
['LL58'] == 'LL58' # fails
So, inspect your_list and see what I mean. You should make a shell your_list before you read the file and append each new reading to it. Then inspect that to make sure it looks good. It would also behoove you to use the strip() command to strip off whitespace from each item. I can't recall if csv reader does that automatically.
Also... a better structure for testing for membership is to use sets instead of lists. in will work for lists, but it is MUCH faster for sets, so I would put your comparison items into sets.
Lastly, it isn't clear what you are trying to do with NationalPostcodes2. Just use your NationalPostcodes, but put them in a set with {}.
#Jeff H's answer is correct, but for what it's worth here's how I might write this code (untested):
# Note: Since, as you wrote, these are only single-column files I did not use the csv
# module, as it will just add additional unnecessary overhead.
# Read the known data from files--this will always be more flexible and maintainable than
# hard-coding them in your code. This is just one possible scheme for doing this; e.g.
# you could also put all of them into a single JSON file
standard_postcode_files = {
'HywelDDA': 'hyweldda.csv',
'NationalPostcodes': 'nationalpostcodes.csv',
'UKPostcodes': 'ukpostcodes.csv'
}
def read_postcode_file(filename):
with open(filename) as f:
# exclude blank lines and strip additional whitespace
return [line.strip() for line in f if line.strip()]
standard_postcodes = {}
for key, filename in standard_postcode_files.items():
standard_postcodes[key] = set(read_postcode_file(filename))
# Assuming all post codes are unique to a set, map postcodes to the set they belong to
postcodes_reversed = {v: k for k, s in standard_postcodes.items() for v in s}
your_postcodes = read_postcode_file('postcodes.csv')
labels = [postcodes_reversed[code] for code in your_postcodes]
with open('DiscretisedPostCodes.csv', 'w') as f:
for label in labels:
f.write(label + '\n')
I would probably do other things like not make the input filename hard-coded. If you need to work with multiple columns using the csv module would also be fine with minimal additional changes, but since you're just writing one item per line I figured it was unnecessary.
I have a huge text file that contains several JSON objects inside of it that I want to parse into a csv file. Just because i'm dealing with someone else's data I cannot really change the format its being delivered in.
Since I dont know how many objects JSON objects I just can create a couple set of dictionaries, wrap them in a list and then json.loads() the list.
Also, since all the objects are in a single text line I can't a regex expression to separete each individual json object and then put them on a list.(It's a super complicated and sometimes triple nested json at some points.
Here's, my current code
def json_to_csv(text_file_name,desired_csv_name):
#Cleans up a bit of the text file
file = fileinput.FileInput(text_file_name, inplace=True)
ile = fileinput.FileInput(text_file_name, inplace=True)
for line in file:
sys.stdout.write(line.replace(u'\'', u'"'))
for line in ile:
sys.stdout.write(re.sub(r'("[\s\w]*)"([\s\w]*")', r"\1\2", line))
#try to load the text file to content var
with open(text_file_name, "rb") as fin:
content = json.load(fin)
#Rest of the logic using the json data in content
#that uses it for the desired csv format
This code gives a ValueError: Extra data: line 1 column 159816 because there is more than one object there.
I seen similar questions in Google and StackOverflow. But none of those solutions none because of the fact that it's just one really long line in a text file and I dont know how many objects there are in the file.
If you are trying to split apart the highest level braces you could do something like
string = '{"NextToken": {"value": "...'
objects = eval("[" + string + "]")
and then parse each item in the list.
I have a file that I wish to parse. It has data in the json format, but the file is not a json file. I want to loop through the file, and pull out the ID where totalReplyCount is greater than 0.
{ "totalReplyCount": 0,
"newLevel":{
"main":{
"url":"http://www.someURL.com",
"name":"Ronald Whitlock",
"timestamp":"2016-07-26T01:22:03.000Z",
"text":"something great"
},
"id":"z12wcjdxfqvhif5ee22ys5ejzva2j5zxh04"
}
},
{ "totalReplyCount": 4,
"newLevel":{
"main":{
"url":"http://www.someUR2L.com",
"name":"other name",
"timestamp":"2016-07-26T01:22:03.000Z",
"text":"something else great"
},
"id":"kjsdbesd2wd2eedd23rf3r3r2e2dwe2edsd"
}
},
My initial attempt was to do the following
def readCsv(filename):
with open(filename, 'r') as csvFile:
for row in csvFile["totalReplyCount"]:
print row
but I get an error stating
TypeError: 'file' object has no attribute 'getitem'
I know this is just an attempt at printing and not doing what I want to do, but I am a novice at python and lost as to what I am doing wrong. What is the correct way to do this? My end result should look like this for the ids:
['insdisndiwneien23e2es', 'lsndion2ei2esdsd',....]
EDIT 1- 7/26/16
I saw that I made a mistake in my formatting when I copied the code (it was late, I was tired..). I switched it to a proper format that is more like JSON. This new edit properly matches file I am parsing. I then tried to parse it with JSON, and got the ValueError: Extra data: line 2 column 1 - line X column 1:, where line X is the end of the line.
def readCsv(filename):
with open(filename, 'r') as file:
data=json.load(file)
pprint(data)
I also tried DictReader, and got a KeyError: 'totalReplyCount'. Is the dictionary un-ordered?
EDIT 2 -7/27/16
After taking a break, coming back to it, and thinking it over, I realized that what I have (after proper massaging of the data) is a CSV file, that contains a proper JSON object on each line. So, I have to parse the CSV file, then parse each line which is a top level, whole and complete JSON object. The code I used to try and parse this is below but all I get is the first string character, an open curly brace '{' :
def readCsv(filename):
with open(filename, 'r') as csvfile:
for row in csv.DictReader(csvfile):
for item in row:
print item[0]
I am guessing that the DictReader is converting the json object to a string, and that is why I am only getting a curly brace as opposed to the first key. If I was to do print item[0:5] I would get a mish mash of the first 4 characters in an un-ordered fashion on each line, which I assume is because the format has turned into an un-ordered list? I think I understand my problem a little bit better, but still wrapping my head around the data structures and the methods used to parse them. What am I missing?
After reading the question and all the above answers, please check if this is useful to you.
I have considered input file as simple file not as csv or json file.
Flow of code is as follow:
Open and read a file in reverse order.
Search for ID in line. Extract ID and store in temp variable.
Go on reading file line by line and search totalReplyCount.
Once you got totalReplyCount, check it if it greater than 0.
If yes, then store temp ID in id_list and re-initialize temp variable.
import re
tmp_id_to_store = ''
id_list = []
for line in reversed(open("a.txt").readlines()):
m = re.search('"id":"(\w+)"', line.rstrip())
if m:
tmp_id_to_store = m.group(1)
n = re.search('{ "totalReplyCount": (\d+),', line.rstrip())
if n:
fou = n.group(1)
if int(fou) > 0:
id_list.append(tmp_id_to_store)
tmp_id_to_store = ''
print id_list
More check points can be added.
As the error stated, Your csvFile is a file object, it is not a dict object, so you can't get an item out of it.
if your csvFile is in CSV format, you can use the csv module to read each line of the csv into a dict :
import csv
with open(filename) as csvfile:
reader = csv.DictReader(csvfile)
for row in reader:
print row['totalReplyCount']
note the DictReader method from the csv module, it will read your csv line and parse it into dict object
If your input file is JSON why not just use the JSON library to parse it and then run a for loop over that data. Then it is just a matter of iterating over the keys and extracting data.
import json
from pprint import pprint
with open('data.json') as data_file:
data = json.load(data_file)
pprint(data)
Parsing values from a JSON file using Python?
Look at Justin Peel's answer. It should help.
Parsing values from a JSON file in Python , this link has it all # Parsing values from a JSON file using Python? via stackoverflow.
Here is a shell one-liner, should solve your problem, though it's not python.
egrep -o '"(?:totalReplyCount|id)":(.*?)$' filename | awk '/totalReplyCount/ {if ($2+0 > 0) {getline; print}}' | cut -d: -f2
output:
"kjsdbesd2wd2eedd23rf3r3r2e2dwe2edsd"
I have some json files with 500MB.
If I use the "trivial" json.load() to load its content all at once, it will consume a lot of memory.
Is there a way to read partially the file? If it was a text, line delimited file, I would be able to iterate over the lines. I am looking for analogy to it.
There was a duplicate to this question that had a better answer. See https://stackoverflow.com/a/10382359/1623645, which suggests ijson.
Update:
I tried it out, and ijson is to JSON what SAX is to XML. For instance, you can do this:
import ijson
for prefix, the_type, value in ijson.parse(open(json_file_name)):
print prefix, the_type, value
where prefix is a dot-separated index in the JSON tree (what happens if your key names have dots in them? I guess that would be bad for Javascript, too...), theType describes a SAX-like event, one of 'null', 'boolean', 'number', 'string', 'map_key', 'start_map', 'end_map', 'start_array', 'end_array', and value is the value of the object or None if the_type is an event like starting/ending a map/array.
The project has some docstrings, but not enough global documentation. I had to dig into ijson/common.py to find what I was looking for.
So the problem is not that each file is too big, but that there are too many of them, and they seem to be adding up in memory. Python's garbage collector should be fine, unless you are keeping around references you don't need. It's hard to tell exactly what's happening without any further information, but some things you can try:
Modularize your code. Do something like:
for json_file in list_of_files:
process_file(json_file)
If you write process_file() in such a way that it doesn't rely on any global state, and doesn't
change any global state, the garbage collector should be able to do its job.
Deal with each file in a separate process. Instead of parsing all the JSON files at once, write a
program that parses just one, and pass each one in from a shell script, or from another python
process that calls your script via subprocess.Popen. This is a little less elegant, but if
nothing else works, it will ensure that you're not holding on to stale data from one file to the
next.
Hope this helps.
Yes.
You can use jsonstreamer SAX-like push parser that I have written which will allow you to parse arbitrary sized chunks, you can get it here and checkout the README for examples. Its fast because it uses the 'C' yajl library.
It can be done by using ijson. The working of ijson has been very well explained by Jim Pivarski in the answer above. The code below will read a file and print each json from the list. For example, file content is as below
[{"name": "rantidine", "drug": {"type": "tablet", "content_type": "solid"}},
{"name": "nicip", "drug": {"type": "capsule", "content_type": "solid"}}]
You can print every element of the array using the below method
def extract_json(filename):
with open(filename, 'rb') as input_file:
jsonobj = ijson.items(input_file, 'item')
jsons = (o for o in jsonobj)
for j in jsons:
print(j)
Note: 'item' is the default prefix given by ijson.
if you want to access only specific json's based on a condition you can do it in following way.
def extract_tabtype(filename):
with open(filename, 'rb') as input_file:
objects = ijson.items(input_file, 'item.drugs')
tabtype = (o for o in objects if o['type'] == 'tablet')
for prop in tabtype:
print(prop)
This will print only those json whose type is tablet.
On your mention of running out of memory I must question if you're actually managing memory. Are you using the "del" keyword to remove your old object before trying to read a new one? Python should never silently retain something in memory if you remove it.
Update
See the other answers for advice.
Original answer from 2010, now outdated
Short answer: no.
Properly dividing a json file would take intimate knowledge of the json object graph to get right.
However, if you have this knowledge, then you could implement a file-like object that wraps the json file and spits out proper chunks.
For instance, if you know that your json file is a single array of objects, you could create a generator that wraps the json file and returns chunks of the array.
You would have to do some string content parsing to get the chunking of the json file right.
I don't know what generates your json content. If possible, I would consider generating a number of managable files, instead of one huge file.
Another idea is to try load it into a document-store database like MongoDB.
It deals with large blobs of JSON well. Although you might run into the same problem loading the JSON - avoid the problem by loading the files one at a time.
If path works for you, then you can interact with the JSON data via their client and potentially not have to hold the entire blob in memory
http://www.mongodb.org/
"the garbage collector should free the memory"
Correct.
Since it doesn't, something else is wrong. Generally, the problem with infinite memory growth is global variables.
Remove all global variables.
Make all module-level code into smaller functions.
in addition to #codeape
I would try writing a custom json parser to help you figure out the structure of the JSON blob you are dealing with. Print out the key names only, etc. Make a hierarchical tree and decide (yourself) how you can chunk it. This way you can do what #codeape suggests - break the file up into smaller chunks, etc
You can parse the JSON file to CSV file and you can parse it line by line:
import ijson
import csv
def convert_json(self, file_path):
did_write_headers = False
headers = []
row = []
iterable_json = ijson.parse(open(file_path, 'r'))
with open(file_path + '.csv', 'w') as csv_file:
csv_writer = csv.writer(csv_file, ',', '"', csv.QUOTE_MINIMAL)
for prefix, event, value in iterable_json:
if event == 'end_map':
if not did_write_headers:
csv_writer.writerow(headers)
did_write_headers = True
csv_writer.writerow(row)
row = []
if event == 'map_key' and not did_write_headers:
headers.append(value)
if event == 'string':
row.append(value)
So simply using json.load() will take a lot of time. Instead, you can load the json data line by line using key and value pair into a dictionary and append that dictionary to the final dictionary and convert it to pandas DataFrame which will help you in further analysis
def get_data():
with open('Your_json_file_name', 'r') as f:
for line in f:
yield line
data = get_data()
data_dict = {}
each = {}
for line in data:
each = {}
# k and v are the key and value pair
for k, v in json.loads(line).items():
#print(f'{k}: {v}')
each[f'{k}'] = f'{v}'
data_dict[i] = each
Data = pd.DataFrame(data_dict)
#Data will give you the dictionary data in dataFrame (table format) but it will
#be in transposed form , so will then finally transpose the dataframe as ->
Data_1 = Data.T