random access to re-creatable randomly generated big data - python

I need to generate much data of mostly basic data types for stress testing NoSQL databases (Cassandra right now, maybe others in future). I also need to be able to re-create this randomly created data in the future and, more problematicly, retrieve random entries from this already generated data to generate queries.
Re-creating the data itself imposes no problem via providing the same seed to the randomness generator. The hard part is retrieving a random item from the generated data. The obious way would be to store all of it in a data structure, but we are talking about potentially GBs of data, so this should not be an option (or am I wrong here?).
The random re-generation of previously generated items should also be as fast as possible, synchonisable over different threads and ideally provide a way to specify the underlaying distribution for both the generated data and the selection of test data items.
[edit] I just found out, that the random.jumpahead(n)-function might come in handy, only problem is it does not work with the pseudo number generator (PNGR) used since python 2.3. But the old one is still available (random.WichmannHill()), where I could just "jump ahead" n steps from my initial seed.
And just to be clear: I'm using python 2.7.
[edit2] What this question might boil down to is skipping n generation steps. You can do it with the original PNGR with some code like I found here:
def skip(n):
for _ in xrange(n):
random.random()
But, as said in the source and tested by me, this is only efficient for n<~100.000, which is way to small. Using random.WichmannHill() I can use jumpahead(n) for any n with the same perfomance.

If you already know 1) the number of entries you will be generating, and 2) the number of random entries you need from that data, you could just obtain the random entries as you are generating them, storing only those in a data structure.
Say you need to create a million entries for your NoSQL database, and you know you'll want to grab 100 random items out of there to test queries. You can generate 100 random numbers between 1 and 1,000,000, and as you're generating the entries for your stress test, you can take the entries that match up with your randomly-generated numbers and store those specific ones in a data structure. Alternately, you can just save a randomly generated entry to your data structure with some probability m/n, where m is the number of random test queries you need, and n is the total volume of data you're creating.
Basically, it's going to be much easier to obtain the random data while it's being generated than to store everything and pluck data randomly from there. As for how to generate the data, that's going to probably be dependent on your NoSQL implementation and the specific data format you want to use.
EDIT: as dcorking pointed out, you don't even need to store the test items themselves in a data structure. You can just execute them as they show up while you're generating data. All you would need to store is the sequence that determines which tests get run. Or, if you don't want to run the same tests every time, you can just randomly select certain elements to be your test elements as I mentioned above, and store nothing at all.

Related

Creating random 1 mil records normally distributed python

So I want to create 1 million random records that will distribute normally.
I have 12 time spaces:
hours_dic = {1:"00:00-02:00",2:"02:00-05:00",3:"05:00-07:00",4:"07:00-08:00",5:"08:00-10:00",
6:"10:00-12:00",7:"12:00-16:00" ,8:"15:00-17:00",9:"17:00-19:00",10:"19:00-21:00",11:"21:00-22:00",12:"22:00-00:00"}
My problem is that I want the randomized records would distribute normally. Lets say that the mean is "17:00-19:00", right now the std is not important as those records are "dummy data" for a research. I would like to generate the records and later to put them in EXCEL.
To clarify, those hours represent using time, and I want to generate the data with the assumption that the majority use the app at the afternoon.
I thought about using numpy:
x = np.random.normal(loc=9,scale=1,size=1000000)
And then maybe using .map with the hours_dic.
However, I can't find a way to make the generated number integers only (as I want to use the dictionary) and to ensure the distribution is as I wanted.
Thanks for any help, if any elaboration is required please ask.
(If there's an Excel solution I'd like to know it too)

Best way to get (millions of rows of) data into Janusgraph via Tinkerpop, with a specific model

Just started out with Tinkerpop and Janusgraph, and I'm trying to figure this out based on the documentation.
I have three datasets, each containing about 20 milions rows (csv files)
There is a specific model in which the variables and rows need to be connected, e.g. what are vertices, what are labels, what are edges, etc.
After having everything in a graph, I'd like to of course use some basic Gremlin to see how well the model works.
But first I need a way to get the data into Janusgraph.
Possibly there exist scripts for this.
But otherwise, is it perhaps something to be written in python, to open a csv file, get each row of a variable X, and add this as a vertex/edge/etc. ...?
Or am I completely misinterpreting Janusgraph/Tinkerpop?
Thanks for any help in advance.
EDIT:
Say I have a few files, each of which contain a few million rows, representing people, and several variables, representing different metrics. A first example could look like thid:
metric_1 metric_2 metric_3 ..
person_1 a e i
person_2 b f j
person_3 c g k
person_4 d h l
..
Should I translate this to files with nodes that are in the first place made up of just the values, [a,..., l].
(and later perhaps more elaborate sets of properties)
And are [a,..., l] then indexed?
The 'Modern' graph here seems to have an index (number 1,...,12 for all the nodes and edges, independent of their overlapping label/category), e.g. should each measurement be indexed separately and then linked to a given person_x to which they belong?
Apologies for these probably straightforward questions, but I'm fairly new to this.
Well, the truth is bulk loading of real user data into JanusGraph is a real pain. I've been using JanuGraph since it's very first version about 2 years ago and its still a pain to bulk load data. A lot of it is not necessarily down to JanusGraph because different users have very different data, different formats, different graph models (ie some mostly need one vertex with one edge ( ex. child-mother ) others deal with one vertex with many edges ( ex user followers ) ) and last but definitely not least, the very nature of the tool deals with large data sets, not to mention the underlying storage and index databases mostly come preconfigured to replicate massively (i.e you might be thinking 20m rows but you actually end up inserting 60m or 80m entries)
All said, I've had moderate success in bulk loading a some tens of millions in decent timeframes (again it will be painful but here are the general steps).
Provide IDs when creating graph elements. If importing from eg MySQL think of perhaps combining the tablename with the id value to create unique IDs eg users1, tweets2
Don't specify schema up front. This is because JanusGraph will need to ensure the data conforms on each inserting
Don't specify index up front. Just related to above but really deserves its own entry. Bulk insert first index later
Please, please, please, be aware of the underlying database features for bulk inserts and activate them i.e read up on Cassandra, ScyllaDB, Big Table, docs especially on replication and indexing
After all the above, configure JanusGraph for bulk loading, ensure your data integrity is correct (i.e no duplicate ids) and consider some form of parallelizing insert request e.g some kind of map reduce system
I think I've covered the major points, again, there's no silver bullet here and the process normally involves quite some trial and error for example the bulk insert rates, too low is bad e.g 10 per second while too high is equally bad eg 10k per second and it almost always depends on your data so its a case by case basis, can't recommend where you should start.
All said and done, give it a real go, bulk load is the hardest part in my opinion and the struggles are well worth the new dimension it gives your application.
All the best!
JanusGraph uses pluggable storage backends and indexs. For testing purposes, a script called bin/janusgraph.sh is packaged with the distribution. It allows to quickly get up and running by starting Cassandra and Elasticsearch (it also starts a gremlin-server but we won't use it)
cd /path/to/janus
bin/janusgraph.sh start
Then I would recommend loading your data using a Groovy script. Groovy scripts can be executed with the Gremlin console
bin/gremlin.sh -e scripts/load_data.script
An efficient way to load the data is to split it into two files:
nodes.csv: one line per node with all attributes
links.csv: one line per link with source_id and target_id and all the links attributes
This might require some data preparation steps.
Here is an example script
The trick to speed up the process is to keep a mapping between your id and the id created by JanusGraph during the creation of the nodes.
Even if it is not mandatory, I strongly recommend you to create an explicit schema for your graph before loading any data. Here is an example script

Split large collection into smaller ones?

I have a collection that is potentially going to be very large. Now I know MongoDB doesn't really have a problem with this, but I don't really know how to go about designing a schema that can handle a very large dataset comfortably. So I'm going to give an outline of the problem.
We are collecting large amounts of data for our customers. Basically, when we gather this data it is represented as a 3-tuple, lets say (a, b, c), where b and c are members of sets B and C respectively. In this particular case we know that the B and C sets will not grow very much over time. For our current customers we are talking about ~200,000 members. However, the A set is the one that keeps growing over time. Currently we are at about ~2,000,000 members per customer, but this is going to grow (possibly rapidly.) Also, there are 1->n relations between b->a and c->a.
The workload on this data set is basically split up into 3 use cases. The collections will be periodically updated, where A will get the most writes, and B and C will get some, but not many. The second use case is random access into B, then aggregating over some number of documents in C that pertain to b \in B. And the last usecase is basically streaming a large subset from A and B to generate some new data.
The problem that we are facing is that the indexes are getting quite big. Currently we have a test setup with about 8 small customers, the total dataset is about 15GB in size at the moment, and indexes are running at about 3GB to 4GB. The problem here is that we don't really have any hot zones in our dataset. It's basically going to get an evenly distributed load amongst all documents.
Basically we've come up with 2 options to do this. The one that I described above, where all data for all customers is piled into one collection. This means we'd have to create an index om some field that links the documents in that collection to a particular customer.
The other options is to throw all b's and c's together (these sets are relatively small) but divide up the C collection, one per customer. I can imangine this last solution being a bit harder to manage, but since we rarely access data for multiple customers at the same time, it would prevent memory problems. MongoDB would be able to load the customers index into memory and just run from there.
What are your thoughts on this?
P.S.: I hope this wasn't too vague, if anything is unclear I'll go into some more details.
It sounds like the larger set (A if I followed along correctly), could reasonably be put into its own database. I say database rather than collection, because now that 2.2 is released you would want to minimize lock contention between the busier database and the others, and to do that a separate database would be best (2.2 introduced database level locking). That is looking at this from a single replica set model, of course.
Also the index sizes sound a bit out of proportion to your data size - are you sure they are all necessary? Pruning unneeded indexes, combining and using compound indexes may well significantly reduce the pain you are hitting in terms of index growth (it would potentially make updates and inserts more efficient too). This really does need specifics and probably belongs in another question, or possibly a thread in the mongodb-user group so multiple eyes can take a look and make suggestions.
If we look at it with the possibility of sharding thrown in, then the truly important piece is to pick a shard key that allows you to make sure locality is preserved on the shards for the pieces you will frequently need to access together. That would lend itself more toward a single sharded collection (preserving locality across multiple related sharded collections is going to be very tricky unless you manually split and balance the chunks in some way). Sharding gives you the ability to scale out horizontally as your indexes hit the single instance limit etc. but it is going to make the shard key decision very important.
Again, specifics for picking that shard key are beyond the scope of this more general discussion, similar to the potential index review I mentioned above.

Saving large Python arrays to disk for re-use later --- hdf5? Some other method?

I'm currently rewriting some python code to make it more efficient and I have a question about saving python arrays so that they can be re-used / manipulated later.
I have a large number of data, saved in CSV files. Each file contains time-stamped values of the data that I am interested in and I have reached the point where I have to deal with tens of millions of data points. The data has got so large now that the processing time is excessive and inefficient---the way the current code is written the entire data set has to be reprocessed every time some new data is added.
What I want to do is this:
Read in all of the existing data to python arrays
Save the variable arrays to some kind of database/file
Then, the next time more data is added I load my database, append the new data, and resave it. This way only a small number of data need to be processed at any one time.
I would like the saved data to be accessible to further python scripts but also to be fairly "human readable" so that it can be handled in programs like OriginPro or perhaps even Excel.
My question is: whats the best format to save the data in? HDF5 seems like it might have all the features I need---but would something like SQLite make more sense?
EDIT: My data is single dimensional. I essentially have 30 arrays which are (millions, 1) in size. If it wasn't for the fact that there are so many points then CSV would be an ideal format! I am unlikely to want to do lookups of single entries---more likely is that I might want to plot small subsets of data (eg the last 100 hours, or the last 1000 hours, etc).
HDF5 is an excellent choice! It has a nice interface, is widely used (in the scientific community at least), many programs have support for it (matlab for example), there are libraries for C,C++,fortran,python,... It has a complete toolset to display the contents of a HDF5 file. If you later want to do complex MPI calculation on your data, HDF5 has support for concurrently read/writes. It's very well suited to handle very large datasets.
Maybe you could use some kind of key-value database like Redis, Berkeley DB, MongoDB... But it would be nice some more info about the schema you would be using.
EDITED
If you choose Redis for example, you can index very long lists:
The max length of a list is 232 - 1 elements (4294967295, more than 4
billion of elements per list). The main features of Redis Lists from
the point of view of time complexity are the support for constant time
insertion and deletion of elements near the head and tail, even with
many millions of inserted items. Accessing elements is very fast near
the extremes of the list but is slow if you try accessing the middle
of a very big list, as it is an O(N) operation.
I would use a single file with fixed record length for this usecase. No specialised DB solution (seems overkill to me in that case), just plain old struct (see the documentation for struct.py) and read()/write() on a file. If you have just millions of entries, everything should be working nicely in a single file of some dozens or hundreds of MB size (which is hardly too large for any file system). You also have random access to subsets in case you will need that later.

How to design a memory and computationally intensive program to run on Google App Engine

I have a problem with my code running on google app engine. I dont know how to modify my code to suit GAE. The following is my problem
for j in range(n):
for d in range(j):
for d1 in range(d):
for d2 in range(d1):
# block which runs in O(n^2)
Efficiently the entire code block is O(N^6) and it will run for more than 10 mins depending on n. Thus I am using task queues. I will also be needing a 4 dimensional array which is stored as a list (eg A[j][d][d1][d2]) of n x n x n x n ie needs memory space O(N^4)
Since the limitation of put() is 10 MB, I cant store the entire array. So I tried chopping into smaller chunks and store it and when retrieve combine them. I used the json function for this but it doesnt support for larger n (> 40).
Then I stored the whole matrix as individual entities of lists in datastore ie each A[j][d][d1] entity. So there is no local variable. When i access A[j][d][d1][d2] in my code I would call my own functions getitem and putitem to get and put data from datastore (used caching also). As a result, my code takes more time for computation. After few iterations, I get the error 203 raised by GAE and task fails with code 500.
I know that my code may not be best suited for GAE. But what is the best way to implement it on GAE ?
There may be even more efficient ways to store your data and to iterate over it.
Questions:
What datatype are you storing, list of list ... of int?
What range of the nested list does your innermost loop O(n^2) portion typically operate over?
When you do the putitem, getitem how many values are you retrieving in a single put or get?
Ideas:
You could try compressing your json (and base64 for cut and pasting). 'myjson'.encode('zlib').encode('base64')
Using a divide and conquer (map reduce) as #Robert suggested. You may be able to use a dictionary with tuples for keys, this may be fewer lookups then A[j][d][d1][d2] in your inner loop. It would also allow you to sparsly populate your structure. You would need to track and know your bounds of what data you loaded in another way. A[j][d][d1][d2] becomes D[(j,d,d1,d2)] or D[j,d,d1,d2]
You've omitted important details like the expected size of n from your question. Also, does the "# block which runs in O(n^2)" need access to the entire matrix, or are you simply populating the matrix based on the index values?
Here is a general answer: you need to find a way to break this up into smaller chunks. Maybe you can use some type of divide and conquer strategy and use tasks for parallelism. How you store your matrix depends on how you split the problem up. You might be able to store submatrices, or perhaps subvectors using the index values as key-names; again, this will depend on your problem and the strategy you use.
An alternative, if for some reason you can not figure out how to parallelize your algorithm, is to use a continuation strategy of some type. In other works, figure out about how many iterations you can typically do within the time constraints (leaving a safety margin), then once you hit that limit save your data and insert a new task to continue the processing. You'll just need to pass in the starting position, then resume running from there. You may be able to do this easily by giving a starting parameter to the outermost range, but again it depends on the specifics of your problem.
Sam, just give you an idea and pointer on where to start.
If what you need is somewhere between storing the whole matrix and storing the numbers one-by-one, may be you will be interested to use pickle to serialize your list, and store them in datastore for later retrieval.
list is a python object, and you should be able to serialize it.
http://appengine-cookbook.appspot.com/recipe/how-to-put-any-python-object-in-a-datastore

Categories