Best way to update a json file as data is coming in - python

I am running a loop with data coming in and writing the data to a JSON file. Here's what it looks like as a minimal, complete, verifiable example:
import json
import random
import string

dc_master = {}
for i in range(100):
    # Below mimics an API call that returns new data.
    name = ''.join(random.choice(string.ascii_uppercase) for _ in range(15))
    dc_info = {}
    dc_info['Height'] = 'NA'
    dc_master[name] = dc_info
    with open("myfile.json", "w") as filehandle:
        filehandle.write(json.dumps(dc_master))
As you can see from the above, every time it loops, it creates a new dc_info. That becomes the value of the new key-value pair (with the key being the name) that gets written to the json file.
The one disadvantage of the above is that when it fails and I restart, I have to start again from the very beginning. Should I open the JSON file for reading into dc_master, add a name:dc_info pair to the dictionary, and then write dc_master back to the JSON file on every turn of the loop? Or should I just append to the JSON file even if it creates duplicates, relying on the fact that when I need the data I will load it back into a dictionary, which takes care of duplicates automatically?
Additional information: there are occasional timeouts, so I want to be able to restart somewhere in the middle if needed. The number of key-value pairs in dc_info is about 30 and the number of overall name:dc_info pairs is about 1000, so it's not huge. Reading it out and writing it back again is not onerous, but I'd like to know whether there's a more efficient way of doing it.
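For comparison, a minimal sketch of the read-on-restart, rewrite-each-iteration variant described above (the helper names load_master and save_master are just illustrative, and it assumes the same myfile.json):
import json
import os

def load_master(path="myfile.json"):
    # Resume from whatever was written before a crash or timeout.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def save_master(dc_master, path="myfile.json"):
    # Rewrite the whole dict each iteration; cheap for ~1000 small entries.
    with open(path, "w") as f:
        json.dump(dc_master, f)
The loop would call load_master() once at startup, skip any name already present, and call save_master() after adding each new entry.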

I think the full script for fetching and storing API results should look like the example code below; at least, I always structure long-running sets of tasks this way.
I write each result of an API call as a separate single-line JSON object in the result file (JSON Lines).
The script may be stopped in the middle, e.g. due to an exception; the file will still be correctly closed and flushed thanks to the with context manager. On restart, the script reads the already-processed result lines back from the file.
Only those results that have not been processed already (whose id is not in processed_ids) will be fetched from the API. The id field may be anything that uniquely identifies each API call result.
Each new result is appended to the JSON Lines file thanks to the 'a' (append) mode. buffering specifies the write-buffer size in bytes; the file is flushed and written in blocks of this size, so the disk is not stressed with a long sequence of tiny one-line writes. Using a large buffer is totally fine because Python's with block correctly flushes and writes out all buffered bytes whenever the block exits, whether due to an exception or any other reason, so you'll never lose even a single small result that has already been passed to f.write(...).
Final results are printed to the console.
Because your task is very interesting and important (at least I have had similar tasks many times), I've also implemented a multi-threaded version of the single-threaded code below; it is especially useful when fetching data from the Internet, where it is usually necessary to download data in several parallel threads. The multi-threaded version can be found and run here and here. This multi-threading can also be extended to multi-processing for efficiency, using ideas from another answer of mine.
Next is the single-threaded version of the code.
Try the code online here!
import json, os, random, string

fname = 'myfile.json'
enc = 'utf-8'
id_field = 'id'

def ReadResults():
    results = []
    processed_ids = set()
    if os.path.exists(fname):
        with open(fname, 'r', encoding = enc) as f:
            data = f.read()
        results = [json.loads(line) for line in data.splitlines() if line.strip()]
        processed_ids = {r[id_field] for r in results}
    return (results, processed_ids)

# First read already processed elements
results, processed_ids = ReadResults()

with open(fname, 'a', buffering = 1 << 20, encoding = enc) as f:
    for id_ in range(100):
        # !!! Only process ids that are not in processed_ids !!!
        if id_ in processed_ids:
            continue
        # Below mimics an API call that returns new data.
        # Should fetch only those objects that correspond to id_.
        name = ''.join(random.choice(string.ascii_uppercase) for _ in range(15))
        # Fill necessary result fields
        result = {}
        result['id'] = id_
        result['name'] = name
        result['field0'] = 'value0'
        result['field1'] = 'value1'
        cid = result[id_field]  # There should be some unique id field
        assert cid not in processed_ids, f'Processed {cid} twice!'
        f.write(json.dumps(result, ensure_ascii = False) + '\n')
        results.append(result)
        processed_ids.add(cid)

print(ReadResults()[0])

I think you're fine, and I'd loop over the whole thing and keep writing to the file, as it's cheap.
As for retries, you would have to check for a timeout, then see if the JSON file is already there, load it up, count your keys, and then fetch the missing number of entries.
Also, your example can be simplified a bit.
import json
import random
import string

dc_master = {}
for _ in range(100):
    name = ''.join(random.choice(string.ascii_uppercase) for _ in range(15))
    dc_master.update({name: {"Height": "NA"}})

with open("myfile.json", "w") as jf:
    json.dump(dc_master, jf, sort_keys=True, indent=4)
EDIT:
On second thought, you probably want to use a JSON list instead of a dictionary as the top level element, so it's easier to check how much you've got already.
import json
import os
import random
import string

output_file = "myfile.json"
max_entries = 100
dc_master = []

def do_your_stuff(data_container, n_entries=max_entries):
    for _ in range(n_entries):
        name = ''.join(random.choice(string.ascii_uppercase) for _ in range(15))
        data_container.append({name: {"Height": "NA"}})
    return data_container

def dump_data(data, file_name):
    with open(file_name, "w") as jf:
        json.dump(data, jf, sort_keys=True, indent=4)

if not os.path.isfile(output_file):
    dump_data(do_your_stuff(dc_master), output_file)
else:
    with open(output_file) as f:
        data = json.load(f)
    if len(data) < max_entries:
        new_entries = max_entries - len(data)
        dump_data(do_your_stuff(data, new_entries), output_file)
        print(f"Added {new_entries} entries.")
    else:
        print("Nothing to update.")

Related

python: efficient way to check only a couple of rows in a csvreader iterator?

I have a (very large) CSV file that looks something like this:
header1,header2,header3
name0,rank0,serial0
name1,rank1,serial1
name2,rank2,serial2
I've written some code that processes the file, and writes it out (using csvwriter) modified as such, with some information I compute appended to the end of each row:
header1,header2,header3,new_hdr4,new_hdr5
name0,rank0,serial0,salary0,base0
name1,rank1,serial1,salary1,base1
name2,rank2,serial2,salary2,base2
What I'm trying to do is structure the script so that it auto-detects whether or not the CSV file it's reading has already been processed. If it has been processed, I can skip a lot of expensive calculations later. I'm trying to understand whether there is a reasonable way of doing this within the reader loop. I could just open the file once, read in enough to do the detection, and then close and reopen it with a flag set, but this seems hackish.
Is there a way to do this within the same reader? The logic is something like:
read first N lines   ### (N is small)
if (some condition)
    already_processed = TRUE
    read_all_csv_without_processing
else
    read_all_csv_WITH_processing
I can't just use the iterator that reader gives me, because by the time I've gotten enough lines to do my conditional check, I don't have any good way to go back to the beginning of the CSV. Is closing and reopening it really the most elegant way to do this?
If you're using the usual Python method to read the file (with open("file.csv", "r") as f: or equivalent), you can "reset" the read position by calling f.seek(0).
Here is a piece of code that should (I guess) look a bit more like the way you're reading your file. It demonstrates that resetting csvfile with csvfile.seek(0) will also reset csvreader:
import csv

with open('so.txt', 'r') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',')
    for row in csvreader:
        print('Checking if processed')
        print(', '.join(row))
        #if condition:
        if True:
            print('File already processed')
            already_processed = True
            print('Resetting the file')
            csvfile.seek(0)
            for row in csvreader:
                print(', '.join(row))
            break
I suppose if you do not want to just test the first few lines of the file, you could create a single iterator by chaining a list of the lines already read with the continuation of the csv reader.
Given:
header1,header2,header3
name0,rank0,serial0
name1,rank1,serial1
name2,rank2,serial2
You can do:
import csv
from itertools import chain

with open(fn) as f:
    reader = csv.reader(f)
    header = next(reader)
    N = 2
    p_list = []
    for i in range(N):   # N is however many you need to set processed flag
        p_list.append(next(reader))
    print("p_list:", p_list)
    # now use p_list to determine if processed, e.g. processed = check(p_list)
    for row in chain(iter(p_list), reader):   # chain creates a single csv reader...
        # handle processed or not here from the stream of rows...
        # if not processed:
        #     process
        # else:
        #     handle already processed...
        # print row just to show csv data is complete:
        print(row)
Prints:
p_list: [['name0', 'rank0', 'serial0'], ['name1', 'rank1', 'serial1']]
['name0', 'rank0', 'serial0']
['name1', 'rank1', 'serial1']
['name2', 'rank2', 'serial2']
I think what you're trying to achieve is to use the first lines to decide between the types of processing, then reuse those lines for read_all_csv_WITH_processing or read_all_csv_without_processing, while still not loading the full csv file in memory. To achieve that, you can load the first lines into a list and concatenate that with the rest of the file using itertools.chain, like this:
import csv
import itertools

top_lines = []
reader_iterator = csv.reader(fil)
do_heavy_processing = True
while True:
    # Can't use "for line in reader_iterator" directly here, since we want to
    # stop after the first N iterations and keep the iterator usable afterwards
    line = next(reader_iterator)
    top_lines.append(line)
    if some_condition(line):
        do_heavy_processing = False
        break
    elif not_worth_going_further(line):
        break
full_file = itertools.chain(top_lines, reader_iterator)
if do_heavy_processing:
    read_all_csv_WITH_processing(full_file)
else:
    read_all_csv_without_processing(full_file)
I will outline what I consider a much better approach. I presume this is happening over various runs. What you need to do is persist the files seen between runs and only process what has not been seen:
import pickle
import glob

def process(fle):
    # read_all_csv_with_processing
    pass

def already_process(fle):
    # read_all_csv_without_processing
    pass

try:
    # If it exists, we ran the code previously.
    with open("seen.pkl", "rb") as f:
        seen = pickle.load(f)
except IOError as e:
    # Else it's the first run, so just create the set.
    print(e)
    seen = set()

for file in glob.iglob("path_where_files_are/*.csv"):
    # if not seen before, just process
    if file not in seen:
        process(file)
    else:
        # already processed so just do whatever
        already_process(file)
    seen.add(file)

# Persist the set.
with open("seen.pkl", "wb") as f:
    pickle.dump(seen, f)
Even if for some strange reason you somehow process the same files in the same run, all you need to do then is implement the seen set logic.
Another alternative would be to use a unique marker in the file that you add at the start if processed.
# something here
header1,header2,header3,new_hdr4,new_hdr5
name0,rank0,serial0,salary0,base0
name1,rank1,serial1,salary1,base1
name2,rank2,serial2,salary2,base2
Then all you would need to check is the very first line. Also, if you want to get the first n lines from a file, or start from a certain row, use itertools.islice; a small sketch follows.
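For example (the marker string is just a placeholder):
from itertools import islice

MARKER = '# processed'   # hypothetical marker written as the file's first line

def already_processed(path, n=1):
    # Peek at only the first n lines of the file.
    with open(path, newline='') as f:
        head = list(islice(f, n))
    return bool(head) and head[0].strip() == MARKER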
To be robust you might want to wrap your code in a try/finally in case it errors so you don't end up going over the same files already processed on the next run:
try:
    for file in glob.iglob("path_where_files_are/*.csv"):
        # if not seen before, just process
        if file not in seen:
            process(file)
        else:
            # already processed so just do whatever
            already_process(file)
        seen.add(file)
finally:
    # persist the set.
    with open("seen.pkl", "wb") as f:
        pickle.dump(seen, f)

How to read a JSON file in python? [duplicate]

I have some JSON files of 500 MB each.
If I use the "trivial" json.load() to load its content all at once, it will consume a lot of memory.
Is there a way to read partially the file? If it was a text, line delimited file, I would be able to iterate over the lines. I am looking for analogy to it.
There was a duplicate to this question that had a better answer. See https://stackoverflow.com/a/10382359/1623645, which suggests ijson.
Update:
I tried it out, and ijson is to JSON what SAX is to XML. For instance, you can do this:
import ijson

for prefix, the_type, value in ijson.parse(open(json_file_name)):
    print(prefix, the_type, value)
where prefix is a dot-separated index into the JSON tree (what happens if your key names contain dots? I guess that would be bad for JavaScript, too...), the_type describes a SAX-like event, one of 'null', 'boolean', 'number', 'string', 'map_key', 'start_map', 'end_map', 'start_array', 'end_array', and value is the value of the object, or None if the_type is an event like starting/ending a map/array.
The project has some docstrings, but not enough global documentation. I had to dig into ijson/common.py to find what I was looking for.
So the problem is not that each file is too big, but that there are too many of them, and they seem to be adding up in memory. Python's garbage collector should be fine, unless you are keeping around references you don't need. It's hard to tell exactly what's happening without any further information, but some things you can try:
Modularize your code. Do something like:
for json_file in list_of_files:
    process_file(json_file)
If you write process_file() in such a way that it doesn't rely on any global state, and doesn't change any global state, the garbage collector should be able to do its job.
Deal with each file in a separate process. Instead of parsing all the JSON files at once, write a program that parses just one, and pass each one in from a shell script, or from another Python process that calls your script via subprocess.Popen (a minimal sketch follows). This is a little less elegant, but if nothing else works, it will ensure that you're not holding on to stale data from one file to the next.
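A minimal sketch of the one-process-per-file idea (the worker script name process_one.py and the file list are placeholders):
import subprocess
import sys

# Launch a short-lived Python process per file; all memory used for parsing
# is returned to the OS when each child exits.
for json_file in ['a.json', 'b.json', 'c.json']:   # placeholder file list
    subprocess.run([sys.executable, 'process_one.py', json_file], check=True)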
Hope this helps.
Yes.
You can use jsonstreamer, a SAX-like push parser that I have written, which will allow you to parse arbitrarily sized chunks. You can get it here and check out the README for examples. It's fast because it uses the C yajl library.
This can be done using ijson. The workings of ijson are very well explained by Jim Pivarski in the answer above. The code below reads a file and prints each JSON object from the list. For example, suppose the file content is as below:
[{"name": "rantidine", "drug": {"type": "tablet", "content_type": "solid"}},
{"name": "nicip", "drug": {"type": "capsule", "content_type": "solid"}}]
You can print every element of the array using the below method
import ijson

def extract_json(filename):
    with open(filename, 'rb') as input_file:
        jsonobj = ijson.items(input_file, 'item')
        jsons = (o for o in jsonobj)
        for j in jsons:
            print(j)
Note: 'item' is the prefix ijson uses for each element of a top-level array.
If you want to access only specific objects based on a condition, you can do it in the following way:
def extract_tabtype(filename):
    with open(filename, 'rb') as input_file:
        objects = ijson.items(input_file, 'item.drug')
        tabtype = (o for o in objects if o['type'] == 'tablet')
        for prop in tabtype:
            print(prop)
This will print only those drug objects whose type is tablet.
On your mention of running out of memory I must question if you're actually managing memory. Are you using the "del" keyword to remove your old object before trying to read a new one? Python should never silently retain something in memory if you remove it.
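If you do try that, a rough sketch of the idea (process() and list_of_files are placeholders for your own code):
import gc
import json

for json_file in list_of_files:   # placeholder list of file names
    with open(json_file) as f:
        data = json.load(f)
    process(data)                 # placeholder for your per-file work
    del data                      # drop the only reference before the next file
    gc.collect()                  # usually unnecessary, but makes the intent explicit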
Update
See the other answers for advice.
Original answer from 2010, now outdated
Short answer: no.
Properly dividing a json file would take intimate knowledge of the json object graph to get right.
However, if you have this knowledge, then you could implement a file-like object that wraps the json file and spits out proper chunks.
For instance, if you know that your json file is a single array of objects, you could create a generator that wraps the json file and returns chunks of the array.
You would have to do some string content parsing to get the chunking of the json file right.
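For what it's worth, here is a rough sketch of that kind of string-level chunking, assuming the top-level value is a JSON array; a real streaming parser such as ijson (see the accepted answer) is far more robust:
import json

def iter_array_items(path, chunk_size=65536):
    # Yield each element of a top-level JSON array without loading the whole
    # file, by tracking bracket/brace depth and string state character by character.
    depth = 0          # nesting depth relative to the top-level array
    in_string = False  # inside a JSON string literal?
    escaped = False    # was the previous character a backslash?
    started = False    # have we seen the opening '[' yet?
    buf = []           # characters of the element currently being read
    with open(path, 'r', encoding='utf-8') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            for ch in chunk:
                if not started:
                    if ch == '[':
                        started = True
                    continue
                if in_string:
                    buf.append(ch)
                    if escaped:
                        escaped = False
                    elif ch == '\\':
                        escaped = True
                    elif ch == '"':
                        in_string = False
                elif ch == '"':
                    in_string = True
                    buf.append(ch)
                elif ch in '{[':
                    depth += 1
                    buf.append(ch)
                elif ch == ']' and depth == 0:     # end of the top-level array
                    if ''.join(buf).strip():
                        yield json.loads(''.join(buf))
                    return
                elif ch in '}]':
                    depth -= 1
                    buf.append(ch)
                elif ch == ',' and depth == 0:     # boundary between elements
                    if ''.join(buf).strip():
                        yield json.loads(''.join(buf))
                    buf = []
                else:
                    buf.append(ch)

# Usage: for obj in iter_array_items('big.json'): ...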
I don't know what generates your JSON content. If possible, I would consider generating a number of manageable files instead of one huge file.
Another idea is to try loading it into a document-store database like MongoDB.
It deals with large blobs of JSON well. Although you might run into the same problem loading the JSON, you can avoid it by loading the files one at a time.
If that path works for you, then you can interact with the JSON data via its client and potentially avoid having to hold the entire blob in memory.
http://www.mongodb.org/
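A rough sketch of that route (assuming a local MongoDB instance, hypothetical database/collection names, and that each file's top level is a list of objects):
import json
from pymongo import MongoClient

client = MongoClient()                      # assumes mongod is running locally
coll = client.mydb.big_json                 # hypothetical database/collection names

with open('big.json') as f:                 # load one file at a time, as suggested
    coll.insert_many(json.load(f))          # assumes the top level is a list of objects

for doc in coll.find({'type': 'tablet'}):   # then query without holding everything in RAM
    print(doc)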
"the garbage collector should free the memory"
Correct.
Since it doesn't, something else is wrong. Generally, the problem with infinite memory growth is global variables.
Remove all global variables.
Make all module-level code into smaller functions.
In addition to @codeape's answer:
I would try writing a custom JSON parser to help you figure out the structure of the JSON blob you are dealing with. Print out the key names only, etc. Make a hierarchical tree and decide (yourself) how you can chunk it. This way you can do what @codeape suggests: break the file up into smaller chunks, etc.
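A quick sketch of that idea, reusing ijson (mentioned above) rather than writing a parser from scratch, to print a rough outline of the key structure:
import ijson
from collections import Counter

def summarize_structure(path):
    # Count (prefix, event) pairs to get a picture of the JSON tree
    # without loading the whole file into memory.
    counts = Counter()
    with open(path, 'rb') as f:
        for prefix, event, value in ijson.parse(f):
            if event in ('start_map', 'start_array', 'map_key'):
                counts[(prefix, event)] += 1
    for (prefix, event), n in sorted(counts.items()):
        print(prefix or '<root>', event, 'x', n)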
You can convert the JSON file to a CSV file and then process it line by line:
import ijson
import csv

def convert_json(file_path):
    did_write_headers = False
    headers = []
    row = []
    iterable_json = ijson.parse(open(file_path, 'r'))
    with open(file_path + '.csv', 'w') as csv_file:
        csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for prefix, event, value in iterable_json:
            if event == 'end_map':
                if not did_write_headers:
                    csv_writer.writerow(headers)
                    did_write_headers = True
                csv_writer.writerow(row)
                row = []
            if event == 'map_key' and not did_write_headers:
                headers.append(value)
            if event == 'string':
                row.append(value)
Simply using json.load() will take a lot of time. Instead, you can load the JSON data line by line, build a dictionary from each line's key-value pairs, add that dictionary to a final dictionary keyed by line number, and convert the result to a pandas DataFrame for further analysis:
import json
import pandas as pd

def get_data():
    with open('Your_json_file_name', 'r') as f:
        for line in f:
            yield line

data = get_data()
data_dict = {}
for i, line in enumerate(data):
    each = {}
    # k and v are the key and value pair
    for k, v in json.loads(line).items():
        # print(f'{k}: {v}')
        each[f'{k}'] = f'{v}'
    data_dict[i] = each

# Data holds the dictionary data as a DataFrame (table format), but it will be
# in transposed form, so finally transpose the DataFrame:
Data = pd.DataFrame(data_dict)
Data_1 = Data.T

Storing data of a dict in Json

I have a simple program that takes input from the user and puts it into a dict. Then I want to store that data in a JSON file (I searched, and JSON seemed the most suitable option).
For example:
import json

mydict = {}
while True:
    user = input("Enter key: ")
    if user in mydict.keys():  # if the key already exists, only print
        print("{} ---> {}".format(user, mydict[user]))
        continue
    mn = input("Enter value: ")
    app = input("Apply?: ")
    if app == "y":
        mydict[user] = mn
        with open("mydict.json", "a+") as f:
            json.dump(mydict, f)
        with open("mydict.json") as t:
            mydict = json.load(t)
Every time the user enters a key and value, I want to add them to the dict and then store the dict in the JSON file. And every time, I want to read the JSON file back so I can refresh the dict in the program.
The code above raised ValueError: Extra data. I understand the error occurred because I'm adding the dict to the JSON file every time, so there is more than one dict in the file. But how can I write the whole dict at once? I don't want to use w mode because I don't want to overwrite the file, and I'm new to JSON.
The program must run indefinitely and I have to refresh the dict every time; that's why I couldn't find a solution or know what to try, since I'm new to JSON.
If you want to use JSON, then you will have to use the 'w' option when opening the file for writing. The 'a+' option appends your full dict to the file right after its previously saved version.
Why not use csv instead? With the 'a+' option, any newly entered user info will be appended to the end of the file, and transforming its content into a dict at reading time is quite easy and should look something like:
import csv

with open('your_dict.json', 'r') as fp:
    yourDict = {key: value for key, value in csv.reader(fp, delimiter='\t')}
while the saving counterpart would look like:
yourDictWriter = csv.writer(open('your_dict.json', 'a+'), delimiter='\t')
#...
yourDictWriter.writerow([key, value])
Another approach would be to use MongoDB, a database designed for storing json documents. Then you won't have to worry about overwriting files, encoding json, and so on, since the database and driver will manage this for you. (Also note that it makes your code more concise.) Assuming you have MongoDB installed and running, you could use it like this:
from pymongo import MongoClient

client = MongoClient()
db = client.test_database.test_collection
while True:
    user = input("Enter key: ")
    if db.find_one({'user': user}):  # if the key already exists, only print
        print("{} ---> {}".format(user, db.find_one({'user': user})['value']))
        continue
    mn = input("Enter value: ")
    app = input("Apply?: ")
    if app == "y":
        db.insert_one({'user': user, 'value': mn})
With your code as it is right now, you have no reason to append to the file. You're converting the entire dict to JSON and writing it all to file anyway, so it doesn't matter if you lose the previous data. a is no more efficient than w here. In fact it's worse because the file will take much more space on disk.
Using the CSV module as Schmouk said is a good approach, as long as your data has a simple structure. In this case you just have a table with two columns and many rows, so a CSV is appropriate, and also more efficient.
If each row has a more complex structure, such as nested dicts and or lists, or different fields for each row, then JSON will be more useful. You can still append one row at a time. Just write each row into a single line, and have each row be an entire JSON object on its own. Reading files one line at a time is normal and easy in Python.
By the way, there is no need for the last two lines. You only need to read in the JSON when the program starts (as Moses has shown you) so you have access to data from previous runs. After that you can just use the variable mydict and it will remember all the keys you've added to it.
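A minimal sketch of that one-JSON-object-per-line approach, reusing the mydict.json name from the question (the helper names are just illustrative):
import json

def append_record(path, key, value):
    # One small JSON object per line; appending never disturbs earlier lines.
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps({key: value}) + '\n')

def load_records(path):
    # Rebuild the dict from all previously written lines; later lines win.
    mydict = {}
    try:
        with open(path, encoding='utf-8') as f:
            for line in f:
                if line.strip():
                    mydict.update(json.loads(line))
    except FileNotFoundError:
        pass
    return mydict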

Python removing duplicates and saving the result

I am trying to remove duplicates from a 3-column, tab-delimited txt file. A row should be treated as a duplicate as long as its first two columns match an earlier row, even if the third column differs.
from operator import itemgetter
import sys

input = sys.argv[1]
output = sys.argv[2]

#Pass any column number you want, note that indexing starts at 0
ig = itemgetter(0,1)

seen = set()
data = []

for line in input.splitlines():
    key = ig(line.split())
    if key not in seen:
        data.append(line)
        seen.add(key)

file = open(output, "w")
file.write(data)
file.close()
First, I get error
key = ig(line.split())
IndexError: list index out of range
Also, I can't see how to save the result to output.txt
People say saving to output.txt is a really basic matter, but no tutorial has helped.
I tried methods that use codecs, methods that use with, and methods that use file.write(data), and none of them helped.
I could learn MATLAB quite easily. The online tutorial was fantastic and a series of Google searches always helped a lot.
But I can't find a helpful tutorial for Python yet. This is obviously because I am a complete novice. For complete novices like me, what would be the best tutorial with 1) comprehensiveness AND 2) lots of examples AND 3) line-by-line explanations that don't leave any line unexplained?
And why is the above code causing an error and not saving the result?
I'm assuming that since you assign input to the first command-line argument with input = sys.argv[1] and output to the second, you intend those to be your input and output file names. But you're never opening any file for the input data, so you're calling .splitlines() on a file name, not on the file contents.
Next, splitlines() is the wrong approach here anyway. To iterate over a file line by line, simply use for line in f, where f is an open file. Those lines will include the newline at the end of the line, so it needs to be stripped if it's not supposed to be part of the third column's data.
Then you're opening and closing the file inside your loop, which means you'll try to write the entire contents of data to the file every iteration, effectively overwriting any data written to the file before. Therefore I moved that block out of the loop.
It's good practice to use the with statement for opening files. with open(out_fn, "w") as outfile will open the file named out_fn and assign the open file to outfile, and close it for you as soon as you exit that indented block.
input is a builtin function in Python. I therefore renamed your variables so no builtin names get shadowed.
You're trying to directly write data to the output file. This won't work since data is a list of lines. You need to join those lines first in order to turn them in a single string again before writing it to a file.
So here's your code with all those issues addressed:
from operator import itemgetter
import sys

in_fn = sys.argv[1]
out_fn = sys.argv[2]

getkey = itemgetter(0, 1)

seen = set()
data = []

with open(in_fn, 'r') as infile:
    for line in infile:
        line = line.strip()
        key = getkey(line.split())
        if key not in seen:
            data.append(line)
            seen.add(key)

with open(out_fn, "w") as outfile:
    outfile.write('\n'.join(data))
Why is the above code causing error?
Because you haven't opened the file, you are trying to work with the string input.txt rather than with the file. Then, when you try to access your item, you get a list index out of range error because line.split() returns ['input.txt'].
How to fix that: open the file and then work with it, not with its name.
For example, you can do (I tried to stay as close to your code as possible)
input = sys.argv[1]
infile = open(input, 'r')
(...)
lines = infile.readlines()
infile.close()
for line in lines:
    (...)
Why is this not saving result?
Because you are opening/closing the file inside the loop. What you need to do is write the data once you're out of the loop. Also, you cannot write directly a list to a file. Hence, you need to do something like (outside of your loop):
outfile = open(output, "w")
for item in data:
    outfile.write(item)
outfile.close()
All together
There are other ways of reading/writing files, and they are pretty well documented on the internet, but I tried to stay close to your code so that you would better understand what was wrong with it.
from operator import itemgetter
import sys

input = sys.argv[1]
infile = open(input, 'r')
output = sys.argv[2]

#Pass any column number you want, note that indexing starts at 0
ig = itemgetter(0, 1)

seen = set()
data = []

lines = infile.readlines()
infile.close()

for line in lines:
    print(line)
    key = ig(line.split())
    if key not in seen:
        data.append(line)
        seen.add(key)

print(data)

outfile = open(output, "w")
for item in data:
    outfile.write(item)
outfile.close()
PS: it seems to produce the result that you needed; see also Python to remove duplicates using only some, not all, columns.

csv writer not printing rows properly

I have a csv script that runs within a sequence on a set of gathered urls like so: threaded(urls, write_csv, num_threads=5). The script writes to the csv correctly but seems to rewrite the first row for each url rather than writing to new rows for each subsequent url that is passed. The result is that the final csv has one row with the data from the last url. Do I need to add a counter and index to accomplish this or restructure the program entirely? Here's the relevant code:
import csv
from thready import threaded

def get_links():
    #gather urls
    threaded(urls, write_csv, num_threads=5)

def write_csv(url):
    #the data dict with values that were previously assigned is defined here
    data = {
        'scrapeUrl': url,
        'model': final_model_num,
        'title': final_name,
        'description': final_description,
        'price': str(final_price),
        'image': final_first_image,
        'additional_image': final_images,
        'quantity': '1',
        'subtract': '1',
        'minimum': '1',
        'status': '1',
        'shipping': '1'
    }
    #currently this writes the values but only to one row even though multiple urls are passed in
    with open("local/file1.csv", "w") as f:
        writer = csv.writer(f, delimiter=",")
        writer.writerows([data.keys()])
        writer.writerow([s.encode('ascii', 'ignore') for s in data.values()])

if __name__ == '__main__':
    get_links()
It appears that one problem is this line...
with open("local/file1.csv", "w") as f:
The output file is overwritten on each function call ("w" indicates the file mode is write). When an existing file is opened in write mode it is cleared. Since the file is cleared every time the function is called it's giving the appearance of only writing one row.
The bigger issue is that it is not good practice for multiple threads to write to a single file.
You could try this...
valid_chars = "-_.() abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
filename = ''.join(c for c in url if c in valid_chars)

with open("local/%s.csv" % filename, "w") as f:
    # rest of code...
...which will write each url to a different file (assuming the urls are unique). You could then recombine the files later. A better approach would be to put the data in a Queue and write it all after the call to threaded. Something like this...
import csv
import queue

from thready import threaded

output_queue = queue.Queue()

def get_links():
    #gather urls
    urls = ['www.google.com'] * 25
    threaded(urls, write_csv, num_threads=5)

def write_csv(url):
    data = {'cat': 1, 'dog': 2}
    output_queue.put(data)

if __name__ == '__main__':
    get_links()  # thready blocks until its internal input queue is cleared
    with open('output.csv', 'w', newline='') as f:
        csv_out = csv.writer(f)
        while not output_queue.empty():
            d = output_queue.get()
            csv_out.writerow(d.keys())
            csv_out.writerow(d.values())
Opening a file in write mode erases whatever was already in the file (as documented here). If you have multiple threads opening the same file, whichever one opens the file last will "win" and write its data to the file. The others will have their data overwritten by the last one.
You should probably rethink your approach. Multithreaded access to external resources like files is bound to cause problems. A better idea is to have the threaded portion of your code only retrieve the data from the urls, and then return it to a single-thread part that writes the data sequentially to the file.
If you only have a small number of urls, you could dispense with threading altogether and just write a direct loop that iterates over the urls, opens the file once, and writes all the data.
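A rough sketch of that single-threaded variant (scrape() stands in for whatever per-url work write_csv() currently does):
import csv

def scrape(url):
    # Hypothetical stand-in for the per-url scraping currently done in write_csv().
    return {'scrapeUrl': url, 'model': '', 'title': '', 'price': ''}

def write_all(urls, path='local/file1.csv'):
    rows = [scrape(u) for u in urls]          # fetch everything first, one thread
    with open(path, 'w', newline='') as f:    # open the output file exactly once
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)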
