I have a csv-writing script that runs within a sequence over a set of gathered urls, like so: threaded(urls, write_csv, num_threads=5). The script writes to the csv correctly, but it seems to rewrite the first row for each url rather than writing a new row for each subsequent url that is passed. The result is that the final csv has one row, containing the data from the last url. Do I need to add a counter and index to accomplish this, or restructure the program entirely? Here's the relevant code:
import csv
from thready import threaded

def get_links():
    # gather urls
    threaded(urls, write_csv, num_threads=5)

def write_csv(url):
    # the data dict with values that were previously assigned is defined here
    data = {
        'scrapeUrl': url,
        'model': final_model_num,
        'title': final_name,
        'description': final_description,
        'price': str(final_price),
        'image': final_first_image,
        'additional_image': final_images,
        'quantity': '1',
        'subtract': '1',
        'minimum': '1',
        'status': '1',
        'shipping': '1'
    }
    # currently this writes the values but only to one row even though multiple urls are passed in
    with open("local/file1.csv", "w") as f:
        writer = csv.writer(f, delimiter=",")
        writer.writerows([data.keys()])
        writer.writerow([s.encode('ascii', 'ignore') for s in data.values()])

if __name__ == '__main__':
    get_links()
It appears that one problem is this line...
with open("local/file1.csv", "w") as f:
The output file is overwritten on each function call ("w" indicates the file mode is write). When an existing file is opened in write mode it is cleared. Since the file is cleared every time the function is called it's giving the appearance of only writing one row.
The bigger issue is that it is not good practice for multiple threads to write to a single file.
You could try this...
valid_chars = "-_.() abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
filename = ''.join(c for c in url if c in valid_chars)
with open("local/%s.csv" % filename, "w") as f:
# rest of code...
...which will write each url to a different file (assuming the urls are unique). You could then recombine the files later. A better approach would be to put the data in a Queue and write it all after the call to threaded. Something like this...
import csv
import Queue
from thready import threaded

output_queue = Queue.Queue()

def get_links():
    # gather urls
    urls = ['www.google.com'] * 25
    threaded(urls, write_csv, num_threads=5)

def write_csv(url):
    data = {'cat': 1, 'dog': 2}
    output_queue.put(data)

if __name__ == '__main__':
    get_links()  # thready blocks until internal input queue is cleared

    csv_out = csv.writer(file('output.csv', 'wb'))
    while not output_queue.empty():
        d = output_queue.get()
        csv_out.writerow(d.keys())
        csv_out.writerow(d.values())
Opening a file in write mode erases whatever was already in the file (as documented here). If you have multiple threads opening the same file, whichever one opens the file last will "win" and write its data to the file. The others will have their data overwritten by the last one.
You should probably rethink your approach. Multithreaded access to external resources like files is bound to cause problems. A better idea is to have the threaded portion of your code only retrieve the data from the urls, and then return it to a single-thread part that writes the data sequentially to the file.
If you only have a small number of urls, you could dispense with threading altogether and just write a direct loop that iterates over the urls, opens the file once, and writes all the data.
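For illustration, here is a minimal sketch of the "fetch in threads, write in one place" idea using concurrent.futures from the standard library rather than thready; scrape() is a hypothetical stand-in for whatever fetching and parsing produces the data dict:

import csv
from concurrent.futures import ThreadPoolExecutor

def scrape(url):
    # hypothetical: fetch and parse one url here; the placeholder dict
    # just keeps the sketch runnable
    return {'scrapeUrl': url, 'title': 'placeholder'}

def main(urls):
    # threads only fetch/parse; nothing here touches the output file
    with ThreadPoolExecutor(max_workers=5) as pool:
        rows = list(pool.map(scrape, urls))

    # a single writer outputs every row after all threads are done
    with open('local/file1.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

if __name__ == '__main__':
    main(['http://example.com/a', 'http://example.com/b'])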
Related
I am running a loop with data coming in and writing the data to a json file. Here's what it looks like in a minimal verifiable concrete example.
import json
import random
import string

dc_master = {}
for i in range(100):
    # Below mimics an API call that returns new data.
    name = ''.join(random.choice(string.ascii_uppercase) for _ in range(15))
    dc_info = {}
    dc_info['Height'] = 'NA'

    dc_master[name] = dc_info

    with open("myfile.json", "w") as filehandle:
        filehandle.write(json.dumps(dc_master))
As you can see from the above, every time it loops, it creates a new dc_info. That becomes the value of the new key-value pair (with the key being the name) that gets written to the json file.
The one disadvantage of the above is that when it fails and I restart, I have to start again from the very beginning. Should I open the json file for reading into dc_master, add a name:dc_info pair to the dictionary, and then write dc_master back to the json file on every turn of the loop? Or should I just append to the json file even if an entry is a duplicate, and rely on the fact that when I load it back into a dictionary later, the duplicates take care of themselves automatically?
Additional information: There are occasionally timeouts, so I want to be able to restart somewhere in the middle if needed. The number of key-value pairs in dc_info is about 30 and the number of overall name:dc_info pairs is about 1000, so it's not huge. Reading it out and writing it back in again is not onerous, but I would like to know if there's a more efficient way of doing it.
I think a full script for fetching and storing API results should look like the example code below. At least, I always structure this kind of long-running batch task the same way.
Each API call result is written as a separate JSON line in the result file.
The script may be stopped in the middle, e.g. due to an exception; the file will still be correctly closed and flushed thanks to the with manager. On restart, the script reads the already-processed result lines back from the file.
Only results that have not been processed yet (those whose id is not in processed_ids) will be fetched from the API. The id field may be anything that uniquely identifies each API call result.
Each new result is appended to the json-lines file thanks to the a (append) file mode. buffering specifies the write-buffer size in bytes; the file is flushed and written in blocks of this size, so the disk isn't stressed with a sequence of tiny one-line writes. Using a large buffer is fine because Python's with block correctly flushes and writes out all buffered bytes whenever the block exits, whether due to an exception or any other reason, so you'll never lose even a single small result that has already been passed to f.write(...).
The final results are printed to the console.
Because your task is interesting and common (at least I've had similar tasks many times), I've also implemented a multi-threaded version of the single-threaded code below; it is especially useful when fetching data from the Internet, since it's usually necessary to download data in several parallel threads. The multi-threaded version can be found and run here and here. The multi-threading can be extended to multi-processing too, for efficiency, using ideas from my other answer.
The single-threaded version of the code follows:
Try this code online here!
import json, os, random, string

fname = 'myfile.json'
enc = 'utf-8'
id_field = 'id'

def ReadResults():
    results = []
    processed_ids = set()
    if os.path.exists(fname):
        with open(fname, 'r', encoding = enc) as f:
            data = f.read()
        results = [json.loads(line) for line in data.splitlines() if line.strip()]
        processed_ids = {r[id_field] for r in results}
    return (results, processed_ids)

# First read already processed elements
results, processed_ids = ReadResults()

with open(fname, 'a', buffering = 1 << 20, encoding = enc) as f:
    for id_ in range(100):
        # !!! Only process ids that are not in processed_ids !!!
        if id_ in processed_ids:
            continue
        # Below mimics an API call that returns new data.
        # Should fetch only those objects that correspond to id_.
        name = ''.join(random.choice(string.ascii_uppercase) for _ in range(15))
        # Fill necessary result fields
        result = {}
        result['id'] = id_
        result['name'] = name
        result['field0'] = 'value0'
        result['field1'] = 'value1'
        cid = result[id_field]  # There should be some unique id field
        assert cid not in processed_ids, f'Processed {cid} twice!'
        f.write(json.dumps(result, ensure_ascii = False) + '\n')
        results.append(result)
        processed_ids.add(cid)

print(ReadResults()[0])
I think you're fine and I'd loop over the whole thing and keep writing to the file, as it's cheap.
As for retries, you would have to check for a timeout and then see if the JSON file is already there, load it up, count your keys and then fetch the missing number of entries.
Also, your example can be simplified a bit.
import json
import random
import string

dc_master = {}
for _ in range(100):
    name = ''.join(random.choice(string.ascii_uppercase) for _ in range(15))
    dc_master.update({name: {"Height": "NA"}})

with open("myfile.json", "w") as jf:
    json.dump(dc_master, jf, sort_keys=True, indent=4)
EDIT:
On second thought, you probably want to use a JSON list instead of a dictionary as the top level element, so it's easier to check how much you've got already.
import json
import os
import random
import string

output_file = "myfile.json"
max_entries = 100
dc_master = []

def do_your_stuff(data_container, n_entries=max_entries):
    for _ in range(n_entries):
        name = ''.join(random.choice(string.ascii_uppercase) for _ in range(15))
        data_container.append({name: {"Height": "NA"}})
    return data_container

def dump_data(data, file_name):
    with open(file_name, "w") as jf:
        json.dump(data, jf, sort_keys=True, indent=4)

if not os.path.isfile(output_file):
    dump_data(do_your_stuff(dc_master), output_file)
else:
    with open(output_file) as f:
        data = json.load(f)
    if len(data) < max_entries:
        new_entries = max_entries - len(data)
        dump_data(do_your_stuff(data, new_entries), output_file)
        print(f"Added {new_entries} entries.")
    else:
        print("Nothing to update.")
I've got the following script which puts the results of a list of several API calls into a list and then writes that list to a JSON file, but I'm restricted to 2 calls per second.
with open('data.json', 'a') as fp:
    json.dump([requests.get(url).json() for url in urls], fp, indent=2)
Is it possible to achieve this with time.sleep(0.5)? If so, I'm not quite sure how to at this stage.
Any help would be appreciated!
You can first collect the data, then in the end JSON-encode it and write it down:
import json
import requests
import time

results = []  # a list to hold the results

for url in urls:  # iterate over your URLs sequence
    results.append(requests.get(url).json())  # fetch and append to the results
    time.sleep(0.5)  # sleep for at least half a second

with open("data.json", "a") as f:  # not sure if you want to append, tho
    json.dump(results, f, indent=2)  # write everything down as JSON
Effectively, you're doing the same thing anyway; you just needed to unravel the list comprehension so that you can inject time.sleep() into it.
I have some JSON files of 500 MB.
If I use the "trivial" json.load() to load its content all at once, it will consume a lot of memory.
Is there a way to read partially the file? If it was a text, line delimited file, I would be able to iterate over the lines. I am looking for analogy to it.
There was a duplicate to this question that had a better answer. See https://stackoverflow.com/a/10382359/1623645, which suggests ijson.
Update:
I tried it out, and ijson is to JSON what SAX is to XML. For instance, you can do this:
import ijson

for prefix, the_type, value in ijson.parse(open(json_file_name)):
    print(prefix, the_type, value)
where prefix is a dot-separated index in the JSON tree (what happens if your key names have dots in them? I guess that would be bad for Javascript, too...), the_type describes a SAX-like event, one of 'null', 'boolean', 'number', 'string', 'map_key', 'start_map', 'end_map', 'start_array', 'end_array', and value is the value of the object or None if the_type is an event like starting/ending a map/array.
The project has some docstrings, but not enough global documentation. I had to dig into ijson/common.py to find what I was looking for.
So the problem is not that each file is too big, but that there are too many of them, and they seem to be adding up in memory. Python's garbage collector should be fine, unless you are keeping around references you don't need. It's hard to tell exactly what's happening without any further information, but some things you can try:
Modularize your code. Do something like:
for json_file in list_of_files:
    process_file(json_file)
If you write process_file() in such a way that it doesn't rely on any global state, and doesn't change any global state, the garbage collector should be able to do its job.
Deal with each file in a separate process. Instead of parsing all the JSON files at once, write a program that parses just one, and pass each one in from a shell script, or from another python process that calls your script via subprocess.Popen. This is a little less elegant, but if nothing else works, it will ensure that you're not holding on to stale data from one file to the next.
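A rough sketch of that second option (the file list and the child script name are illustrative):

import subprocess
import sys

list_of_files = ['a.json', 'b.json', 'c.json']   # illustrative

for json_file in list_of_files:
    # each file is parsed in its own interpreter; its memory is released
    # when that process exits
    p = subprocess.Popen([sys.executable, 'process_one.py', json_file])
    p.wait()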
Hope this helps.
Yes.
You can use jsonstreamer, a SAX-like push parser that I have written, which will allow you to parse arbitrary-sized chunks; you can get it here and check out the README for examples. It's fast because it uses the C yajl library.
This can be done using ijson. How ijson works has been explained very well by Jim Pivarski in the answer above. The code below will read a file and print each JSON object from the list. For example, suppose the file content is as below:
[{"name": "rantidine", "drug": {"type": "tablet", "content_type": "solid"}},
{"name": "nicip", "drug": {"type": "capsule", "content_type": "solid"}}]
You can print every element of the array using the below method
import ijson

def extract_json(filename):
    with open(filename, 'rb') as input_file:
        jsonobj = ijson.items(input_file, 'item')
        jsons = (o for o in jsonobj)
        for j in jsons:
            print(j)
Note: 'item' is the default prefix given by ijson.
If you want to access only specific JSON objects based on a condition, you can do it in the following way.
def extract_tabtype(filename):
    with open(filename, 'rb') as input_file:
        objects = ijson.items(input_file, 'item.drug')
        tabtype = (o for o in objects if o['type'] == 'tablet')
        for prop in tabtype:
            print(prop)
This will print only those JSON objects whose type is tablet.
On your mention of running out of memory I must question if you're actually managing memory. Are you using the "del" keyword to remove your old object before trying to read a new one? Python should never silently retain something in memory if you remove it.
Update
See the other answers for advice.
Original answer from 2010, now outdated
Short answer: no.
Properly dividing a json file would take intimate knowledge of the json object graph to get right.
However, if you have this knowledge, then you could implement a file-like object that wraps the json file and spits out proper chunks.
For instance, if you know that your json file is a single array of objects, you could create a generator that wraps the json file and returns chunks of the array.
You would have to do some string content parsing to get the chunking of the json file right.
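As a very rough sketch of that idea (assuming the file is a single top-level array of objects, and ignoring some edge cases a real parser would need to handle):

import json

def iter_array_items(path, chunk_size=64 * 1024):
    """Yield one decoded object at a time from a file that contains a single
    top-level JSON array of objects, without loading the whole file."""
    depth = 0           # nesting depth of {}/[] inside the current item
    in_string = False
    escape = False
    started = False     # set once the opening '[' of the array is seen
    buf = []            # characters of the item currently being collected
    with open(path, 'r', encoding='utf-8') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            for ch in chunk:
                if not started:
                    if ch == '[':
                        started = True
                    continue
                if in_string:
                    buf.append(ch)
                    if escape:
                        escape = False
                    elif ch == '\\':
                        escape = True
                    elif ch == '"':
                        in_string = False
                elif ch == '"':
                    in_string = True
                    buf.append(ch)
                elif ch in '{[':
                    depth += 1
                    buf.append(ch)
                elif ch in '}]':
                    if depth == 0 and ch == ']':
                        return              # closing ']' of the top-level array
                    depth -= 1
                    buf.append(ch)
                    if depth == 0:
                        yield json.loads(''.join(buf))
                        buf = []
                elif depth > 0:
                    buf.append(ch)
                # commas and whitespace between top-level items are skipped

# usage: for obj in iter_array_items('big.json'): ...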
I don't know what generates your json content. If possible, I would consider generating a number of manageable files, instead of one huge file.
Another idea is to try loading it into a document-store database like MongoDB.
It deals with large blobs of JSON well. Although you might run into the same problem loading the JSON - avoid the problem by loading the files one at a time.
If that path works for you, then you can interact with the JSON data via their client and potentially not have to hold the entire blob in memory.
http://www.mongodb.org/
"the garbage collector should free the memory"
Correct.
Since it doesn't, something else is wrong. Generally, the problem with infinite memory growth is global variables.
Remove all global variables.
Make all module-level code into smaller functions.
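For example, a minimal sketch of that restructuring (summarize() is a hypothetical stand-in for whatever small result you actually need to keep):

import json
import sys

def summarize(data):
    # hypothetical: keep only the small piece you actually need
    return len(data)

def process_file(path):
    with open(path) as f:
        data = json.load(f)
    return summarize(data)   # `data` becomes collectable once this returns

def main(paths):
    results = [process_file(p) for p in paths]
    print(results)

if __name__ == '__main__':
    main(sys.argv[1:])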
In addition to #codeape's answer:
I would try writing a custom json parser to help you figure out the structure of the JSON blob you are dealing with. Print out the key names only, etc. Make a hierarchical tree and decide (yourself) how you can chunk it. This way you can do what #codeape suggests: break the file up into smaller chunks, etc.
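Rather than a fully hand-written parser, a quick way to see the key hierarchy is to reuse ijson from the answers above; this sketch prints each distinct key path once:

import ijson

def print_key_paths(path):
    # print every distinct dotted key path seen in the file, once
    seen = set()
    with open(path, 'rb') as f:
        for prefix, event, value in ijson.parse(f):
            if event == 'map_key':
                key_path = f'{prefix}.{value}' if prefix else value
                if key_path not in seen:
                    seen.add(key_path)
                    print(key_path)

# usage: print_key_paths('huge.json')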
You can convert the JSON file to a CSV file, parsing it record by record:
import ijson
import csv

def convert_json(file_path):
    did_write_headers = False
    headers = []
    row = []
    iterable_json = ijson.parse(open(file_path, 'r'))
    with open(file_path + '.csv', 'w', newline='') as csv_file:
        csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for prefix, event, value in iterable_json:
            if event == 'end_map':
                if not did_write_headers:
                    csv_writer.writerow(headers)
                    did_write_headers = True
                csv_writer.writerow(row)
                row = []
            if event == 'map_key' and not did_write_headers:
                headers.append(value)
            if event == 'string':
                row.append(value)
Simply using json.load() will take a lot of time. Instead, you can load the JSON data line by line, build a dictionary of key-value pairs for each line, add that dictionary to an overall dictionary, and finally convert that to a pandas DataFrame, which will help with further analysis.
import json
import pandas as pd

def get_data():
    with open('Your_json_file_name', 'r') as f:
        for line in f:
            yield line

data = get_data()
data_dict = {}

for i, line in enumerate(data):
    each = {}
    # k and v are the key and value pair
    for k, v in json.loads(line).items():
        # print(f'{k}: {v}')
        each[f'{k}'] = f'{v}'
    data_dict[i] = each

Data = pd.DataFrame(data_dict)
# Data gives you the dictionary data as a DataFrame (table format), but it will
# be in transposed form, so finally transpose the dataframe:
Data_1 = Data.T
When I use pickle, it works fine and I can dump and load.
The problem is that if I close the program and try to dump again, it replaces the old file data with the new dump. Here is my code:
import pickle
import os
import time

dictionary = dict()

def read():
    with open('test.txt', 'rb') as f:
        a = pickle.load(f)
        print(a)
        time.sleep(2)

def dump():
    chs = raw_input('name and number')
    n = chs.split()
    dictionary[n[0]] = n[1]
    with open('test.txt', 'wb') as f:
        pickle.dump(dictionary, f)

Inpt = raw_input('Option : ')
if Inpt == 'read':
    read()
else:
    dump()
When you open a file in w mode (or wb), that tells it to write a brand-new file, erasing whatever was already there.
As the docs say:
The most commonly-used values of mode are 'r' for reading, 'w' for writing (truncating the file if it already exists), and 'a' for appending…
In other words, you want to use 'ab', not 'wb'.
However, when you append new dumps to the same file, you end up with a file made up of multiple separate values. If you only call load once, it's just going to load the first one. If you want to load all of them, you need to write code that does that. For example, you can load in a loop until EOFError.
Really, it looks like what you're trying to do is not to append to the pickle file, but to modify the existing pickled dictionary.
You could do that with a function that loads and merges all of the dumps together, like this:
def Load():
    d = {}
    with open('test.txt', 'rb') as f:
        while True:
            try:
                a = pickle.load(f)
            except EOFError:
                break
            else:
                d.update(a)
    # do stuff with d
But that's going to get slower and slower the more times you run your program, as you pile on more and more copies of the same values. To do that right you need to load the old dictionary, modify that, and then dump the modified version. And for that, you want w mode.
However, a much better way to persist a dictionary, at least if the keys are strings, is to use dbm (if the values are also strings) or shelve (otherwise) instead of a dictionary in the first place.
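For example, a minimal sketch using shelve (the filename and prompt are illustrative); the shelf behaves like a persistent dictionary, so there is no separate load or dump step:

import shelve

def dump():
    chs = input('name and number: ')   # raw_input on Python 2
    name, number = chs.split()
    db = shelve.open('phonebook')      # created on first use, reopened afterwards
    db[name] = number
    db.close()

def read():
    db = shelve.open('phonebook')
    print(dict(db))
    db.close()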
Opening a file in "wb" mode truncates the file -- that is, it deletes the contents of the file, and then allows you to work on it.
Usually, you'd open the file in append ("ab") mode to add data at the end. However, Pickle doesn't support appending, so you'll have to save your data to a new file (come up with a different file name -- ask the user or use a command-line parameter such as -o test.txt?) each time the program is run.
On a related topic, don't use Pickle. It's unsafe. Consider using JSON instead (it's in the standard lib -- import json).
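A minimal sketch of the same read/modify/write cycle with json (the file name is illustrative):

import json
import os

FILENAME = 'test.json'

def load():
    if os.path.exists(FILENAME):
        with open(FILENAME) as f:
            return json.load(f)
    return {}

def save(name, number):
    data = load()                    # read the existing dictionary
    data[name] = number              # modify it
    with open(FILENAME, 'w') as f:   # rewrite the whole file
        json.dump(data, f)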
I am a beginner in Python. I am trying to figure out why the second 'for' loop doesn't work in the following script. I mean that I only get the result of the first 'for' loop, but nothing from the second one. I copied and pasted my script and the data csv below.
It would be helpful if you could tell me why it behaves this way and how to make the second 'for' loop work as well.
My SCRIPT:
import csv

file = "data.csv"
fh = open(file, 'rb')
read = csv.DictReader(fh)

for e in read:
    print(e['a'])

for e in read:
    print(e['b'])
"data.csv":
a,b,c
tree,bough,trunk
animal,leg,trunk
fish,fin,body
The csv reader is an iterator over the file. Once you go through it once, you read to the end of the file, so there is no more to read. If you need to go through it again, you can seek to the beginning of the file:
fh.seek(0)
This will reset the file to the beginning so you can read it again. Depending on the code, it may also be necessary to skip the field name header:
next(fh)
This is necessary for your code, since the DictReader consumed that line the first time around to determine the field names, and it's not going to do that again. It may not be necessary for other uses of csv.
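Putting those two lines into your script, the second loop becomes (a sketch):

fh.seek(0)   # rewind the underlying file object
next(fh)     # skip the header row; the DictReader already knows the field names

for e in read:
    print(e['b'])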
If the file isn't too big and you need to do several things with the data, you could also just read the whole thing into a list:
data = list(read)
Then you can do what you want with data.
I have created a small function which takes the path of a csv file, reads it, and returns a list of dicts all at once; you can then loop through the list very easily:
import csv

def read_csv_data(path):
    """
    Reads CSV from given path and returns a list of dicts with mapping
    """
    data = csv.reader(open(path))
    # Read the column names from the first line of the file
    fields = next(data)
    data_lines = []
    for row in data:
        items = dict(zip(fields, row))
        data_lines.append(items)
    return data_lines
Regards