I am processing some text files to search for patterns and count them. As the files are very large, processing time is an important issue. I have Python code that keeps the counters updated and stored in MongoDB. In order to make it run faster, I am trying to reduce the number of DB operations.
The original version incremented the counter for every single occurrence:
mlcol.find_one_and_update(
    {"connip": conip},
    {"$inc": {ts: 1}},
    upsert=True
)
As this took too long, what I did was keep the counters in memory, in dictionaries, and periodically go through this data to store it:
for conip in conCounter.keys():
    d = conCounter[conip]
    for ts in d.keys():
        mlcol.find_one_and_update(
            {"connip": conip},
            {"$inc": {ts: d[ts]}},
            upsert=True
        )
This way the process is much faster, but I see that it still takes a very long time to update every single counter individually.
Is there a way to launch multiple updates in a single command?
Any other idea to make this go faster?
As explained by Alex Blex, creating an index and using a bulk execution solved the issue:
mlcol.create_index("connip")

bulk = mlcol.initialize_unordered_bulk_op()
for conip in conCounter.keys():
    d = conCounter[conip]
    for ts in d.keys():
        bulk.find({"connip": conip}).upsert().update({"$inc": {ts: d[ts]}})
res = bulk.execute()
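Note that initialize_unordered_bulk_op() has since been deprecated and removed in newer PyMongo releases. A rough equivalent on current drivers (a sketch, assuming the same conCounter layout of {connip: {ts: count}} and hypothetical database/collection names) batches the increments with bulk_write and UpdateOne:

from pymongo import MongoClient, UpdateOne

client = MongoClient()              # assumes a local MongoDB instance
mlcol = client["mydb"]["mlcol"]     # hypothetical database and collection names
mlcol.create_index("connip")

requests = []
for conip, d in conCounter.items():
    for ts, count in d.items():
        requests.append(UpdateOne({"connip": conip}, {"$inc": {ts: count}}, upsert=True))

if requests:
    res = mlcol.bulk_write(requests, ordered=False)  # unordered, like the bulk op above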
Related
I was working on a problem recently that required me to go through a very large folder (~600,000 files) and return the list of filenames that matched a certain criterion. The original version was a normal list comprehension stored in a variable. This isn't the actual code but it gives the gist.
def filter_files(file_path):
    filtered = [f.path for f in os.scandir(file_path) if f.path.endswith('.png')]
    return filtered
When monitoring this one, it would start out fast and then get progressively slower and slower. I presume this is because it's trying to store so much data in the variable.
I then rewrote it to be:
def filter_files(file_path):
    return [f.path for f in os.scandir(file_path) if f.path.endswith('.png')]
And called it like:
def test(file_path):
    filtered = filter_files(file_path)
This one never slows down. It maintains the same speed the entire time.
My question is: what, under the hood, causes this difference? The data is still being stored in a variable, and it's still being processed as a list comprehension. What about writing the comprehension directly in the return statement avoids the issues of the first version? Thanks!
There is no difference between those two pieces of code. None at all. Both of them are creating a list, and then managing a reference to that list.
The likely cause of your issue is caching. In the first case, the file system had to keep going out to the disk over and over to fetch more entries. By the time you finished that run, the directory was in the file cache and could be read immediately. Reboot and try again, and you'll see the second version takes the same amount of time.
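A quick way to see the cache effect (a sketch, assuming a hypothetical directory path) is to time the very same function twice in a row; the first call pays the disk cost and the second is served from the OS file cache, regardless of whether the comprehension is assigned to a variable or returned directly:

import os
import timeit

def filter_files(file_path):
    return [f.path for f in os.scandir(file_path) if f.path.endswith('.png')]

path = '/data/images'  # hypothetical folder with many files
print(timeit.timeit(lambda: filter_files(path), number=1))  # cold: hits the disk (if not already cached)
print(timeit.timeit(lambda: filter_files(path), number=1))  # warm: served from the file cache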
This is my code for iterating over ListOfDocuments, which is a list of over 500,000 dicts. Each of those dicts has around 30 key-value pairs that I need.
for document in ListOfDocuments:
    for field in document:
        if field == "USELESS":
            continue
        ExtraList[AllParameters[field]] = document[field]
    ExtraList[AllParameters["C_Name"]] = filename.split(".")[0]
    AppendingDataframe.loc[len(AppendingDataframe)] = ExtraList
What I'm trying to do is store all possible column names in AllParameters, loop through ListOfDocuments, loop through each obtained dict, save each key-value pair in ExtraList, and finally append ExtraList to AppendingDataframe.
This approach is extremely slow even on the most powerful machines, and I know this is not the right way to do it. Any help would be very much appreciated.
Edit:
A sample document looks like a normal key-value dict with over 30 keys.
E.g.
{' FKey':12,'Skey':22,'NConfig':'NA','SCHEMA':'CD123...}
And I'd like to extract and store the individual key-value pairs.
Make threads. You can find out how many files you need to look through and possibly split the work across 4 threads. This will make the process much faster, as it allows the documents to be read at the same time.
You could start by making a method that accepts a list of files and loops through those. Then you could pass a few sections of the main list to that method and run them in threads, as in the sketch below. That should provide a decent increase in speed.
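A minimal sketch of that idea, assuming a hypothetical process_documents helper that builds one row dict per document (mirroring the original loop) and a thread pool working on chunks of ListOfDocuments:

from concurrent.futures import ThreadPoolExecutor

def process_documents(documents):
    # Hypothetical helper: build one row per document, skipping the "USELESS"
    # field as in the original loop.
    rows = []
    for document in documents:
        rows.append({field: value for field, value in document.items() if field != "USELESS"})
    return rows

def process_in_threads(documents, n_threads=4):
    # Split the list into roughly equal chunks and process them concurrently.
    chunk_size = (len(documents) + n_threads - 1) // n_threads
    chunks = [documents[i:i + chunk_size] for i in range(0, len(documents), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_threads) as executor:
        results = executor.map(process_documents, chunks)
    return [row for chunk in results for row in chunk]

# rows = process_in_threads(ListOfDocuments)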
You can do this by writing a function that processes a single entry of the list and then using multiprocessing:
import multiprocessing as multi
from multiprocessing import Manager

manager = Manager()
data = manager.list([])

def func(a):        # Implement here the function
    data.append(a)  # that processes one dict from the list

p = multi.Pool(processes=16)
p.map(func, ListOfDocuments)
print(data)
In Python, I'm trying to merge multiple JSON files obtained from TinyDB.
I was not able to find a way to directly merge two TinyDB JSON files so that the auto-generated keys continue in sequence instead of restarting when the next file is opened.
In code terms, I want to merge large amounts of data like this:
hello1 = {"1": "bye", "2": "good", ..., "20000": "goodbye"}
hello2 = {"1": "dog", "2": "cat", ..., "15000": "monkey"}
into:
hello3 = {"1": "bye", "2": "good", ..., "20000": "goodbye", "20001": "dog", "20002": "cat", ..., "35000": "monkey"}
Because I could not find the correct way to do it with TinyDB, I simply opened and transformed them as classic JSON files, loading each file and then doing:
Data = Data['_default']
The problem I have is that at the moment the code works, but it has serious memory problems. After a few seconds, the merged DB contains about 28 MB of data, but then (probably) the cache saturates and it starts adding the rest of the data very slowly.
So I need to empty the cache after a certain amount of data, or probably I need to change the way I do this!
This is the code that I use:
Try1.purge()
Try1 = TinyDB('FullDB.json')

with open('FirstDataBase.json') as Part1:
    Datapart1 = json.load(Part1)
Datapart1 = Datapart1['_default']
for dets in range(1, len(Datapart1)):
    Try1.insert(Datapart1[str(dets)])

with open('SecondDatabase.json') as Part2:
    Datapart2 = json.load(Part2)
Datapart2 = Datapart2['_default']
for dets in range(1, len(Datapart2)):
    Try1.insert(Datapart2[str(dets)])
Question: Merge Two TinyDB Databases ... probably I need to change the way I do this!
From TinyDB Documentation
Why Not Use TinyDB?
...
You are really concerned about performance and need a high speed database.
Single-row insertions into a DB are always slow; try db.insert_multiple(...).
The second one, with a generator, gives you the option to hold down the memory footprint.
# From a list
Try1.insert_multiple([{"1": "bye", "2": "good", ..., "20000": "goodbye"}])

or

# From a generator function
Try1.insert_multiple(generator())
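As a concrete illustration of the generator variant (a sketch, assuming each source file is a TinyDB JSON file whose documents live under the '_default' table, as in the question):

import json
from tinydb import TinyDB

def documents(path):
    # Yield the documents of one TinyDB JSON file, one at a time, in key order.
    with open(path) as f:
        table = json.load(f)['_default']
    for key in sorted(table, key=int):
        yield table[key]

Try1 = TinyDB('FullDB.json')
Try1.insert_multiple(documents('FirstDataBase.json'))   # IDs continue 1..20000
Try1.insert_multiple(documents('SecondDatabase.json'))  # IDs continue 20001..35000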
Following the suggestions given here, I have stored my data using ZODB, created by the following piece of code:
# structure of the data [around 3.5 GB on disk]
bTree_container = {key1:[ [2,.44,0], [1,.23,0], [4,.21,0] ...[10,000th element] ], key2:[ [3,.77,0], [1,.22,0], [6,.98,0] ..[10,000th element] ] ..10,000th key:[[5,.66,0], [2,.32,0], [8,.66,0] ..[10,000th element]]}
# Code used to build the above-mentioned data set
for Gnodes in G.nodes():  # Gnodes iterates over 10,000 values
    Gvalue = someoperation(Gnodes)
    for i, Hnodes in enumerate(H.nodes()):  # Hnodes iterates over 10,000 values
        Hvalue = someoperation(Hnodes)
        score = someoperation(Gvalue, Hvalue)
        # build a list corresponding to every value of Gnodes (key)
        btree_container.setdefault(Gnodes, PersistentList()).append([Hnodes, score, 0])
        if i % 5000 == 0:  # save the data temporarily to disk
            transaction.savepoint(True)
transaction.commit()  # flush all the data to disk
Now, I want to (in a separate module) (1) modify the stored data and (2) sort it. Following is the code that I was using:
from ZODB.FileStorage import FileStorage
from ZODB.DB import DB
import transaction

storage = FileStorage('Data.fs')
db = DB(storage)
connection = db.open()
root = connection.root()
sim_sorted = root[0]

# Substitute the last element in every list of every key (indicated by 0 above) by 1.
# This code exhausts all the memory; it never gets to the 2nd part, i.e. the sorting.
for x in sim_sorted.iterkeys():
    for i, y in enumerate(sim_sorted[x]):
        y[2] = 1  # index 2 is the last element of the 3-element lists shown above
        if i % 5000 == 0:
            transaction.savepoint()

# Sort all the lists associated with every key in reverse order, using the middle element as key.
[sim_sorted[keys].sort(key=lambda x: -x[1]) for keys in sim_sorted.iterkeys()]
However, the code used for editing the values is eating up all the memory (it never gets to the sorting). I am not sure how this works, but I have a feeling there is something terribly wrong with my code and that ZODB is pulling everything into memory, hence the issue. What would be the correct way to achieve the desired effect, i.e. the substitution and sorting of stored elements in ZODB, without running into memory issues? The code is also very slow; any suggestions to speed it up?
[Note: It's not necessary for me to write these changes back to the database]
EDIT
There seems to be a small improvement in memory usage when I add connection.cacheMinimize() after the inner loop, but after some time the entire RAM is consumed again, which leaves me puzzled.
Are you certain it's not the sorting that's killing your memory?
Note that I'd expect that each PersistentList has to fit into memory; it is one persistent record so it'll be loaded as a whole on access.
I'd modify your code to run like this and see what happens:
for x in sim_sorted.iterkeys():
    for y in sim_sorted[x]:
        y[2] = 1
    sim_sorted[x].sort(key=lambda y: -y[1])
    transaction.savepoint()
Now you process the whole list in one go and sort it; after all, it's already loaded into memory as a whole. After processing, you tell the ZODB that you are done with this stage, and the whole changed list is flushed to temporary storage. There is little point in flushing it when you're only halfway done.
If this still doesn't fit into memory for you, you'll need to rethink your data structure and split up the large lists into smaller persistent records so you can work on chunks of it at a time without loading the whole thing in one.
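One sketch of that restructuring, assuming you are free to change how the data is written: store each key's entries as several smaller PersistentList chunks inside a BTree, so only one chunk needs to be loaded at a time:

from BTrees.IOBTree import IOBTree
from persistent.list import PersistentList
import transaction

CHUNK = 1000  # hypothetical chunk size

def store_chunked(root, key, rows):
    # rows is the full list of [Hnodes, score, 0] entries for one key; store it
    # as small persistent records keyed by chunk start index.
    chunks = IOBTree()
    for start in range(0, len(rows), CHUNK):
        chunks[start] = PersistentList(rows[start:start + CHUNK])
    root[key] = chunks
    transaction.savepoint(True)

def update_chunks(chunks):
    # Touch one small persistent record at a time; only that chunk is in memory.
    for start in chunks.keys():
        for y in chunks[start]:
            y[2] = 1
        transaction.savepoint(True)
    # A full sort of one key's entries would then mean merging per-chunk sorted
    # results (e.g. with heapq.merge) instead of sorting one huge list in memory.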
I am using Python to parse incoming comma-separated strings. I want to do some calculations on the data afterwards.
Each string is about 800 characters long, with 120 comma-separated fields.
There are 1.2 million such strings to process.
for v in item.values():
    l.extend(get_fields(v.split(',')))
# process l
get_fields uses operator.itemgetter() to extract around 20 fields out of 120.
This entire operation takes about 4-5 minutes excluding the time to bring in the data.
In a later part of the program I insert these lines into an in-memory SQLite table for further use.
But overall, 4-5 minutes just for parsing and getting a list is not good for my project.
I run this processing in around 6-8 threads.
Might switching to C/C++ help?
Are you loading a dict with your file records? Probably better to process the data directly:
datafile = open("file_with_1point2million_records.dat")
# uncomment the next line to skip over a header record
# next(datafile)
l = sum((get_fields(v.split(',')) for v in datafile), [])
This avoids creating any overall data structures, and only accumulates the desired values as returned by get_fields.
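Note that sum() with a list start value copies the accumulated list on every addition, which gets expensive for 1.2 million records; a flattening variant with itertools.chain (a sketch, assuming the same get_fields helper and data file as above) avoids that repeated copying:

import itertools

with open("file_with_1point2million_records.dat") as datafile:
    l = list(itertools.chain.from_iterable(
        get_fields(v.split(',')) for v in datafile
    ))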
Your program might be slowing down trying to allocate enough memory for 1.2M strings. In other words, the speed problem might not be due to the string parsing/manipulation, but rather to the l.extend. To test this hypothesis, you could put a print statement in the loop:
for v in item.values():
    print('got here')
    l.extend(get_fields(v.split(',')))
If the print statements get slower and slower, you can probably conclude l.extend is the culprit. In this case, you may see significant speed improvement if you can move the processing of each line into the loop.
PS: You probably should be using the csv module to take care of the parsing for you at a higher level, but I don't think that will affect the speed very much.
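For reference, a minimal sketch of what that csv-based variant might look like, assuming the same get_fields helper and a file of comma-separated records:

import csv

with open("file_with_1point2million_records.dat", newline='') as datafile:
    reader = csv.reader(datafile)  # handles the comma splitting (and any quoting) for you
    l = []
    for row in reader:
        l.extend(get_fields(row))  # row is already a list of the 120 fields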