How to append data to an existing LMDB in Python?

I have around 1 million images to put in this dataset, appended 10,000 at a time.
I'm sure the map_size is wrong (set with reference to this article).
I used this line to create the set:
env = lmdb.open(Path + 'mylmdb', map_size=int(1e12))
and I use this line every 10,000 samples to write data to the file, where X and Y are placeholders for the data to be put in the LMDB:
env = create(env, X[:counter, :, :, :], Y, counter)
def create(env, X, Y, N):
    with env.begin(write=True) as txn:
        # txn is a Transaction object
        for i in range(N):
            datum = caffe.proto.caffe_pb2.Datum()
            datum.channels = X.shape[1]
            datum.height = X.shape[2]
            datum.width = X.shape[3]
            datum.data = X[i].tobytes()  # or .tostring() if numpy < 1.9
            datum.label = int(Y[i])
            str_id = '{:08}'.format(i)
            # The encode is only essential in Python 3
            txn.put(str_id.encode('ascii'), datum.SerializeToString())
            #pdb.set_trace()
    return env
How can I edit this code so that new data is appended to this LMDB instead of replacing existing entries? The present method writes to the same key positions each time. I have checked the length after generation with env.stat().

Let me expand on my comment above.
All entries in LMDB are stored under unique keys, and your database already contains keys for i = 0, 1, 2, .... You need a way to find a unique key for each new entry. The simplest way to do that is to find the largest key in the existing DB and keep adding to it.
Assuming that existing keys are consecutive,
max_key = env.stat()["entries"]
Otherwise, a more thorough approach is to iterate over all existing keys:
max_key = 0
with env.begin() as txn:
    for key, _value in txn.cursor():
        max_key = max(max_key, int(key))
Finally, simply replace the line in your for loop that builds str_id,
str_id = '{:08}'.format(i)
by
str_id = '{:08}'.format(max_key + 1 + i)
to append to the existing database.
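Putting the two pieces together, here is a minimal sketch of an append-aware writer. The key-offset logic is taken from the answer above; the helper name append_to_lmdb is my own, and it assumes the existing keys are the zero-padded integers written by create:
import lmdb
import caffe

def append_to_lmdb(env, X, Y, N):
    # Find the largest existing key so new entries do not overwrite old ones.
    max_key = 0
    with env.begin() as txn:
        for key, _ in txn.cursor():
            max_key = max(max_key, int(key))
    # Write the new samples after the existing ones.
    with env.begin(write=True) as txn:
        for i in range(N):
            datum = caffe.proto.caffe_pb2.Datum()
            datum.channels = X.shape[1]
            datum.height = X.shape[2]
            datum.width = X.shape[3]
            datum.data = X[i].tobytes()
            datum.label = int(Y[i])
            str_id = '{:08}'.format(max_key + 1 + i)
            txn.put(str_id.encode('ascii'), datum.SerializeToString())
    return env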

Related

Count occurrences of a specific string within multi-valued elements in a set

I have generated a list of genes
genes = ['geneName1', 'geneName2', ...]
and a set of their interactions:
geneInt = {('geneName1', 'geneName2'), ('geneName1', 'geneName3'),...}
I want to find out how many interactions each gene has and put that in a vector (or dictionary) but I struggle to count them. I tried the usual approach:
interactionList = []
for gene in genes:
    interactions = geneInt.count(gene)
    interactionList.append(interactions)
but of course the code fails because my set contains elements that are made out of two values while I need to iterate over the single values within.
I would argue that you are using the wrong data structure to hold interactions. You can represent interactions as a dictionary keyed by gene name, whose values are a set of all the genes it interacts with.
Let's say you currently have a process that does something like this at some point:
geneInt = set()
...
geneInt.add((gene1, gene2))
Change it to
geneInt = collections.defaultdict(set)
...
geneInt[gene1].add(gene2)
If the interactions are symmetrical, add a line
geneInt[gene2].add(gene1)
Now, to count the number of interactions, you can do something like
intCounts = {gene: len(ints) for gene, ints in geneInt.items()}
Counting from your original set is simple as well, if the interactions are one-way:
intCounts = dict.fromkeys(genes, 0)
for gene, _ in geneInt:
    intCounts[gene] += 1
If each interaction is two-way, there are three possibilities:
Both interactions are represented in the set: the above loop will work.
Only one interaction of a pair is represented: change the loop to
for gene1, gene2 in geneInt:
    intCounts[gene1] += 1
    if gene1 != gene2:
        intCounts[gene2] += 1
Some reverse interactions are represented, some are not. In this case, transform geneInt into a dictionary of sets as shown in the beginning.
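As a quick end-to-end illustration of the dictionary-of-sets approach (the gene names and interactions here are made-up sample data, and the interactions are assumed to be symmetrical):
import collections

raw_interactions = {('geneName1', 'geneName2'), ('geneName1', 'geneName3')}

geneInt = collections.defaultdict(set)
for gene1, gene2 in raw_interactions:
    geneInt[gene1].add(gene2)
    geneInt[gene2].add(gene1)

intCounts = {gene: len(ints) for gene, ints in geneInt.items()}
print(intCounts)  # {'geneName1': 2, 'geneName2': 1, 'geneName3': 1}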
Try something like this,
interactions = {}
for gene in genes:
    interactions_count = 0
    for tup in geneInt:
        interactions_count += tup.count(gene)
    interactions[gene] = interactions_count
Use a dictionary, and keep incrementing the value for every gene you see in each tuple in the set geneInt.
interactions_counter = dict()
for interaction in geneInt:
    for gene in interaction:
        interactions_counter[gene] = interactions_counter.get(gene, 0) + 1
The dict.get(key, default) method returns the value at the given key, or the specified default if the key doesn't exist.
For the set geneInt={('geneName1', 'geneName2'), ('geneName1', 'geneName3')}, we get:
interactions_counter = {'geneName1': 2, 'geneName2': 1, 'geneName3': 1}

Iterating through a list and assigning an ID to groups of elements in a list

I have a list of lists called new_order_list and I am iterating through it. I would like to create sub-batches of 20 unique ids from these lists. The same id may appear in the next list, so I keep track of ids already seen in the order_chk_lst list. If an id repeats, I would like to skip that element and check the next one. I am assigning a unique ID to each sub-batch (of 20 elements). I have tried the following code, but I am not getting more than 20 ids. I would really appreciate your feedback. Thank you.
new_order_list
[5029339601, 5029339775, 5029338374, 5029338219, 5029339927, 5029338917, 5029338917, 5029338219, 5029339601, 5029338905, 5029339320, 5029338282, 5029338374, 5029339109, 5029339320, 5029369758, 5029338282, 5029369758, 5029368075, 5029368652, 5029339941, 5029368652, 5029369810, 5029339584, 5029339584, 5029339775, 5029369810, 5029338531, 5029368003, 5029339536, 5029340252, 5029338531, 5029339137, 5029340252, 5029368003, 5029339137, 5029339536, 5029338531, 5029367966, 5029339109, 5029338390, 5029368075, 5029339576, 5029368083, 5029338209, 5029338417, 5029338905, 5029339576, 5029339941, 5029368075, 5029339895, 5029340051, 5029368075, 5029338390, 5029370218, 5029370218, 5029338209, 5029340051, 5029339895, 5029367966, 5029338417]
[5029370469, 5029368482, 5029370383, 5029340357, 5029340357, 5029370563, 5029370469, 5029340412, 5029339528, 5029370121, 5029370121, 5029370121, 5029368482, 5029368535, 5029370563, 5029339528, 5029370328, 5029368866, 5029369260, 5029369260, 5029369326, 5029370469, 5029338175, 5029338175, 5029368535, 5029368866, 5029368248, 5029340270, 5029339842, 5029339528, 5029340287, 5029338230, 5029368248, 5029368535, 5029368866, 5029340270, 5029339513, 5029369326, 5029368528, 5029340412, 5029339842, 5029338230, 5029370469, 5029370328, 5029369961, 5029340287, 5029370563, 5029370383, 5029340476, 5029340476]
Implementation:
MAX_ORDER = 20
batch_id = 10000000
sub_batch_id = 10000000
for i, order in enumerate(new_order_list):
    if order in order_chk_lst:
        # if the id is repeated, skip to the next element
        # (I think I am making a mistake here, as the value of `i` will still advance)
        continue
    order_chk_lst.append(order)
    # Increment batch_id every time the order count reaches MAX_ORDER
    if i % MAX_ORDER == 0:
        batch_id = 1
    # assign sub_batch_id for each zone (i == 0 will be the first assignment within the batch)
    # This is my function which will assign the batch id (added here for reference)
    sub_batch_assign, sub_batch_id = assign_sub_batch(zones, sub_batch_id)
    # e.g. sub_batch_assign = {"1A": 10000000, "1B": 10000001, "1D": 10000002}
def assign_sub_batch(zones: list, sub_batch_id: int) -> (dict, int):
    sub_batch_assign = {}
    for zone in zones:
        sub_batch_assign[zone] = sub_batch_id
        sub_batch_id += 1
    return (sub_batch_assign, sub_batch_id)
If you want unique items, just turn the contents of new_order_list into a set; that will remove all the duplicates. Iterate over the list of lists and use update to add the items to order_chk_lst:
order_chk_lst = set()
for lst in new_order_list:
    order_chk_lst.update(lst)
You can also change it back to a list if you really need it to be a list:
order_chk_lst = list(order_chk_lst)
If the order is important, you can use the fact that dicts preserve insertion order since Python 3.6:
order_chk_dict = {}
for lst in new_order_list:
    order_chk_dict.update(dict.fromkeys(lst))
order_chk_lst = list(order_chk_dict.keys())
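To go one step further toward the stated goal (sub-batches of 20 unique ids, each with its own batch id), here is a rough sketch; the chunking into groups of MAX_ORDER and the starting value of batch_id are my assumptions based on the question, not part of the answer above:
MAX_ORDER = 20
batch_id = 10000000

# De-duplicate while preserving order, as shown above.
seen = {}
for lst in new_order_list:
    seen.update(dict.fromkeys(lst))
unique_orders = list(seen)

# Assign one batch_id per group of MAX_ORDER unique ids.
batches = {}
for start in range(0, len(unique_orders), MAX_ORDER):
    for order in unique_orders[start:start + MAX_ORDER]:
        batches[order] = batch_id
    batch_id += 1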

Spark Streaming updateStateByKey with tuple as a value

Is it possible to use the updateStateByKey() function with a tuple as a value? I am using PySpark and my input is (word, (count, tweet_id)), which means word is the key and the tuple (count, tweet_id) is the value. The task of updateStateByKey is, for each word, to sum its counts and build a list of all tweet_ids containing that word.
I implemented the following update function; however, I get the error "list index out of range" for new_values with index 1:
def updateFunc(new_values, last_sum):
    count = 0
    tweets_id = []
    if last_sum:
        count = last_sum[0]
        tweets_id = last_sum[1]
    return sum(new_values[0]) + count, tweets_id.extend(new_values[1])
And calling the method:
running_counts.updateStateByKey(updateFunc)
I've found the solution. The problem was with checkpointing, which persists the current state to disk in case of a failure. It caused problems because when I changed my definition of the state, the checkpoint still held the old state without a tuple. Therefore, I deleted the checkpoint from disk and implemented the final solution as:
def updateFunc(new_values, last_sum):
    count = 0
    counts = [field[0] for field in new_values]
    ids = [field[1] for field in new_values]
    if last_sum:
        count = last_sum[0]
        new_ids = last_sum[1] + ids
    else:
        new_ids = ids
    return sum(counts) + count, new_ids
Finally, the answer to my question is: yes, the state can be a tuple or any other data type for storing more values.
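For context, here is a minimal sketch of how such an update function might be wired into a PySpark streaming job. The stream source, the checkpoint directory, and the way (word, (1, tweet_id)) pairs are produced are all assumptions for illustration, not part of the original setup:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="WordTweetCounts")
ssc = StreamingContext(sc, batchDuration=10)
ssc.checkpoint("hdfs:///tmp/tweet_checkpoint")  # required by updateStateByKey

# Assume each incoming line is "tweet_id<TAB>tweet text".
lines = ssc.socketTextStream("localhost", 9999)
pairs = lines.map(lambda line: line.split("\t", 1)) \
             .flatMap(lambda rec: [(word, (1, rec[0])) for word in rec[1].split()])

running_counts = pairs.updateStateByKey(updateFunc)
running_counts.pprint()

ssc.start()
ssc.awaitTermination()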

Python load large number of files

I'm trying to load a large number of files saved in the Ensight gold format into a numpy array. To do this I've written my own class, libvec, which reads the geometry file and then preallocates the arrays Python will use to store the data, as shown in the code below.
N = len(file_list)
# Create the class object and read geometry file
gvec = vec.libvec(os.path.join(current_dir, casefile))
x, y, z = gvec.xyz()
# Preallocate arrays
U_temp = np.zeros((len(y), len(x), N), dtype=np.dtype('f4'))
V_temp = np.zeros((len(y), len(x), N), dtype=np.dtype('f4'))
u_temp = np.zeros((len(x), len(x), N), dtype=np.dtype('f4'))
v_temp = np.zeros((len(x), len(y), N), dtype=np.dtype('f4'))
# Read the individual files into the previously allocated arrays
for idx, current_file in enumerate(file_list):
    U, V = gvec.readvec(os.path.join(current_dir, current_file))
    U_temp[:, :, idx] = U
    V_temp[:, :, idx] = V
    del U, V
However, this takes seemingly forever, so I was wondering if you have any idea how to speed this process up. The code reading the individual files into the array structure can be seen below:
def readvec(self, filename):
    # We are assuming for the moment that the naming scheme PIV__vxy.case / PIV__vxy.geo does not change;
    # should that not be the case, appropriate changes have to be made to the corresponding files.
    data_temp = np.loadtxt(filename, dtype=np.dtype('f4'), delimiter=None, converters=None, skiprows=4)
    # U value
    for i in range(len(self.__y)):
        # y value counter
        for j in range(len(self.__x)):
            # x value counter
            self.__U[i, j] = data_temp[i*len(self.__x) + j]
    # V value
    for i in range(len(self.__y)):
        # y value counter
        for j in range(len(self.__x)):
            # x value counter
            self.__V[i, j] = data_temp[len(self.__x)*len(self.__y) + i*len(self.__x) + j]
    # W value
    if len(self.__z) > 1:
        for i in range(len(self.__y)):
            # y value counter
            for j in range(len(self.__xd)):
                # x value counter
                self.__W[i, j] = data_temp[2*len(self.__x)*len(self.__y) + i*len(self.__x) + j]
        return self.__U, self.__V, self.__W
    else:
        return self.__U, self.__V
Thanks a lot in advance and best regards,
J
It's a bit hard to say without any test input/output to compare against, but I think the following gives you the same U/V arrays as your nested for loops in readvec, and it should be considerably faster than the loops.
size_x, size_y = len(self.__x), len(self.__y)
U = data_temp[:size_y*size_x].reshape(size_y, size_x)
V = data_temp[size_y*size_x:2*size_y*size_x].reshape(size_y, size_x)
Assigning these directly into U_temp and V_temp should also help. Right now you're making three copies of your data to get it into U_temp and V_temp:
From the file to data_temp
From data_temp to self.__U/V
From U/V into U_temp/V_temp
Although my guess is that the two nested for loops, accessing one element at a time, are what is causing the slowness.
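To make the suggestion concrete, here is a sketch of readvec rewritten with reshape. The attribute names follow the original class, the method is assumed to live inside libvec, and the W branch mirrors the original if; this is untested against real Ensight files:
def readvec(self, filename):
    data_temp = np.loadtxt(filename, dtype=np.dtype('f4'), skiprows=4)
    size_x, size_y = len(self.__x), len(self.__y)
    plane = size_x * size_y
    # Each field is stored as one contiguous block of size_x*size_y values.
    self.__U = data_temp[:plane].reshape(size_y, size_x)
    self.__V = data_temp[plane:2*plane].reshape(size_y, size_x)
    if len(self.__z) > 1:
        self.__W = data_temp[2*plane:3*plane].reshape(size_y, size_x)
        return self.__U, self.__V, self.__W
    return self.__U, self.__V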

File to File Comparison of Primary Key to Select Specific Records

I am currently trying to put together a Python script to compare two text files (tab-separated values). The smaller file consists of one field per record of key values (much like a database primary key), whereas the larger file has a first-field key and up to thousands of fields per record, with tens of thousands of records.
I am trying to select (from the larger file) only the records which match their corresponding key in the smaller file, and output these to a new text file. The keys occur in the first field of each record.
I have hit a wall. Admittedly, I have been trying for loops, and thus far have had minimal success. I got it to display the key values of each file--a small victory!
I may be a glutton for punishment, as I am bent on using python (2.7) to solve this, rather than import it into something SQL based; I will never learn otherwise!
UPDATE: I have the following code thus far. Is the use of forward-slash correct for the write statement?
# Defining some counters, and setting them to zero.
counter_one = 0
counter_two = 0
counter_three = 0
counter_four = 0
# Defining a couple arrays for sorting purposes.
array_one = []
array_two = []
# This module opens the list of records to be selected.
with open("c:\lines_to_parse.txt") as f0:
    LTPlines = f0.readlines()
    for i, line in enumerate(LTPlines):
        returned_line = line.split()
        array_one.append(returned_line)
for line in array_one:
    counter_one = counter_one + 1
# This module opens the file to be trimmed as an array.
with open('c:\target_data.txt') as f1:
    targetlines = f1.readlines()
    for i, line in enumerate(targetlines):
        array_two.append(line.split())
for line in array_two:
    counter_two = counter_two + 1
# The last module performs a logical check
# of the data and writes to a tertiary file.
with open("c:/research/results", 'w') as f2:
    while counter_three <= 3:  # ****Arbitrarily set, to test if the program will work.
        if array_one[counter_three][0] == array_two[counter_four][0]:
            f2.write(str(array_two[counter_four]))
            counter_three = (counter_three + 1)
            counter_four = (counter_four + 1)
        else:
            counter_four = (counter_four + 1)
You could create a dictionary from the keys in the small file: each key from the small file becomes a dict key, with the value True (the value is not important). Keep this dict in memory.
Then open the file you will write to (the output file) and the larger file. For each line in the larger file, check whether its key exists in the dictionary, and if it does, write that line to the output file.
I am not sure if this is clear enough, or if that was your problem.
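A minimal sketch of that approach, using a plain set instead of a dict with True values since only membership matters; the file names follow the ones in the question, written with forward slashes, which Python accepts on Windows and which avoid accidental escape sequences such as the \t in 'c:\target_data.txt':
# Load the keys from the smaller file (one key field per record).
with open("c:/lines_to_parse.txt") as f0:
    wanted_keys = set(line.strip().split("\t")[0] for line in f0 if line.strip())

# Stream the larger file and keep only records whose first field matches a key.
with open("c:/target_data.txt") as f1, open("c:/research/results", "w") as f2:
    for line in f1:
        key = line.split("\t", 1)[0]
        if key in wanted_keys:
            f2.write(line)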
