I want to create 1 million random records that are normally distributed.
I have 12 time slots:
hours_dic = {1: "00:00-02:00", 2: "02:00-05:00", 3: "05:00-07:00",
             4: "07:00-08:00", 5: "08:00-10:00", 6: "10:00-12:00",
             7: "12:00-15:00", 8: "15:00-17:00", 9: "17:00-19:00",
             10: "19:00-21:00", 11: "21:00-22:00", 12: "22:00-00:00"}
My problem is that I want the randomized records to be normally distributed. Let's say the mean is "17:00-19:00"; the standard deviation is not important right now, as these records are dummy data for a research project. I would like to generate the records and later export them to Excel.
To clarify, these hours represent usage times, and I want to generate the data under the assumption that the majority of users use the app in the afternoon.
I thought about using numpy:
x = np.random.normal(loc=9, scale=1, size=1000000)
And then maybe using .map with the hours_dic.
However, I can't find a way to make the generated numbers integers only (which I need in order to use them as dictionary keys) and to ensure the distribution comes out the way I want.
Thanks for any help, if any elaboration is required please ask.
(If there's an Excel solution I'd like to know it too)
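For concreteness, here is a rough sketch of the direction I'm considering (rounding with np.rint and clipping with np.clip are my guesses at the missing integer step, and the output file name is arbitrary):

import numpy as np
import pandas as pd

hours_dic = {1: "00:00-02:00", 2: "02:00-05:00", 3: "05:00-07:00",
             4: "07:00-08:00", 5: "08:00-10:00", 6: "10:00-12:00",
             7: "12:00-15:00", 8: "15:00-17:00", 9: "17:00-19:00",
             10: "19:00-21:00", 11: "21:00-22:00", 12: "22:00-00:00"}

# Draw from a normal centred on slot 9 ("17:00-19:00"), round to the
# nearest integer, then clip so every value is a valid key from 1 to 12.
x = np.random.normal(loc=9, scale=1, size=1000000)
slots = np.clip(np.rint(x).astype(int), 1, 12)

# Map slot numbers to their time windows and export to Excel.
records = pd.Series(slots).map(hours_dic)
records.to_frame('usage_window').to_excel('records.xlsx', index=False)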
I'm totally new to this forum and new to Python, so I would appreciate it if anybody could help me.
I am trying to build a script in Python based on data that I have in an Excel spreadsheet. I'd like to create an app/script where I can estimate the pregnancy due date and the conception date (for animals) based on measurements that I have taken during ultrasounds. I am able to estimate it with a calculator, but it takes some conversion (from cm to mm, and days to months). To do that in Python, I figured I'd create a variable for each measurement and set each variable equal to its value in days (an integer).
Here is the problem: the main column of my data set is the actual measurement of the babies in mm (known as BPD), but a BPD value can be a whole number like 5 mm or a decimal like 6.4 mm. Since I can't put a period or a dot in a variable name, what would be the best way to handle my data and assign variables to it? I have tried BPD_4.8 = 77 (days), but Python tells me there's a syntax error (I'm sure lol), whereas if I type BPD_5 = 78 it seems to work. I haven't mastered lists and tuples, nor do I really know how to use them properly, so I'll keep looking online and see what happens.
I'm sure it's something super silly for you guys, but I'm really pulling my hair out and I have nothing but 2 inches of hair lol
This is what my current screen looks like... HELP :(
Howdy and welcome to Stack Overflow. The short answer is:
Use a better data structure
You really shouldn't be encoding valuable information into variable names like that. What's going to happen when you want to calculate something with your BPD measurements? Or when you have duplicate BPDs?
This is bad practice. It might seem like a lot of effort to figure out how to do this properly, but it will be more than worth it if you intend to keep using Python :)
I'll give you a couple options...
Option 1: Use a dictionary
Dictionaries are common data structures in almost any language, so it pays to know how to use them.
Dictionaries hold information about an object using key/value pairs. For example you might have:
measurements = {
    'animal_1': {'bpd': 4.6, 'due_date_days': 55},
    'animal_2': {'bpd': 5.2, 'due_date_days': 77},
}
An advantage of dictionaries is that they are explicit, i.e. values have keys which explicitly identify what the information is assigned to. E.g. measurements['animal_1']['due_date_days'] would return the due date for animal 1.
A disadvantage is that it will be harder to compute information / examine relationships than you'll be used to in Excel.
Option 2: Use Pandas
Pandas is a data science library for Python. It's fast, has similar functionality to Excel and is probably well suited to your use case.
I'd recommend you take the time to do a tutorial or two. If you're planning to use Python for data analysis then it's worth using the language and any suitable libraries properly.
You can check out some Pandas tutorials here: https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html
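For example, a first sketch with your data might look something like this (the file name and column names are just guesses at your spreadsheet's layout):

import pandas as pd

# Load the spreadsheet; each row pairs a BPD measurement with its
# estimated gestation time, so decimals like 6.4 are just ordinary data.
df = pd.read_excel('measurements.xlsx')               # assumed file name
days = df.loc[df['bpd_mm'] == 6.4, 'gestation_days']  # assumed column names
print(days)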
Good luck!
I have two data frames: one with all my data (called 'data') and one with the latitudes and longitudes of the different stations where each observation starts and ends (called 'info'). I am trying to get a data frame where I'll have the latitude and longitude next to each station in each observation. My code in Python:
for i in range(0, 15557580):
    for j in range(0, 542):
        if data.year[i] == '2018' and data.station[i] == info.station[j]:
            data.latitude[i] = info.latitude[j]
            data.longitude[i] = info.longitude[j]
            break
But since I have about 15 million observations, doing it this way takes a lot of time. Is there a quicker way of doing it?
Thank you very much (I am still new to this)
Edit:
My file info looks like this (about 500 observations, one for each station):
My file data looks like this (there are other variables not shown here; about 15 million observations, one for each trip):
And what I am looking to get is that, when the station numbers match, the resulting data would look like this:
This is one solution. You can also use pandas.merge to add 2 new columns to data and perform the equivalent mapping.
# create series mappings from info
s_lat = info.set_index('station')['latitude']
s_lon = info.set_index('station')['longitude']
# calculate Boolean mask on year
mask = data['year'] == '2018'
# apply mappings, if no map found use fillna to retrieve original data
data.loc[mask, 'latitude'] = data.loc[mask, 'station'].map(s_lat)\
.fillna(data.loc[mask, 'latitude'])
data.loc[mask, 'longitude'] = data.loc[mask, 'station'].map(s_lon)\
.fillna(data.loc[mask, 'longitude'])
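For completeness, here is a sketch of the pandas.merge route mentioned above (it assumes you are happy to drop and re-create the coordinate columns; apply the year mask first, as above, if only 2018 rows should be filled):

# one coordinate row per station, then a single vectorised left join
# instead of ~15 million loop iterations
coords = info[['station', 'latitude', 'longitude']].drop_duplicates('station')
data = data.drop(columns=['latitude', 'longitude']).merge(
    coords, on='station', how='left')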
This is a very common and important issue when anyone starts to deal with large datasets. Big data is a whole subject in itself; here is a quick introduction to the main concepts.
1. Prepare your dataset
In big data, 80% to 90% of the time is spent gathering, filtering and preparing your datasets. Create subsets of the data, optimized for your further processing.
2. Optimize your script
Short code does not always mean optimized code in terms of performance. In your case, without knowing your dataset, it is hard to say exactly how you should process it; you will have to figure out on your own how to avoid as much computation as possible while getting exactly the same result. Try to avoid any unnecessary computation.
You can also consider splitting the work over multiple threads if appropriate.
As a general rule, you should avoid for loops that you break out of in the middle. Whenever you don't know in advance precisely how many iterations you will need, prefer a while loop (or a do...while loop in languages that offer one).
3. Consider using distributed storage and computing
This is a subject in itself that is way too big to be all explained here.
Storing, accessing and processing data in a serialized way is fast for small amounts of data but very inappropriate for large datasets. Instead, we use distributed storage and computing frameworks.
These aim at doing everything in parallel, relying on a concept named MapReduce.
The first distributed data storage framework was Hadoop (with, e.g., the Hadoop Distributed File System, or HDFS). This framework has its advantages and flaws, depending on your application.
In any case, if you are willing to use this framework, it will probably be more appropriate not to run MapReduce directly on top of HDFS, but to use a higher-level, preferably in-memory framework such as Spark or Apache Ignite on top of HDFS. Also, depending on your needs, have a look at frameworks such as Hive, Pig or Sqoop.
Again, this subject is a whole different world, but it might very well suit your situation. Feel free to read up on all these concepts and frameworks, and leave your questions in the comments if needed.
I have a dataframe with around 200,000 observations and several variables. I want to run a regression and use one of the variables (Location) as a location dummy. There are around 3,600 different values, and currently Location is a string. I read that it might be faster for Pandas to use categorical variables, so I try to run the following code:
df["Location_Cat"] = df.Location.astype("category")
However, this makes my computer run like crazy, and after 2 minutes I still don't get a response.
Do you have any idea why this could be the case? Or is it generally not recommended to create category columns with so many categories?
I need to generate a lot of data, mostly of basic data types, for stress-testing NoSQL databases (Cassandra right now, maybe others in the future). I also need to be able to re-create this randomly created data later and, more problematically, retrieve random entries from the already generated data in order to generate queries.
Re-creating the data itself poses no problem: just provide the same seed to the random number generator. The hard part is retrieving a random item from the generated data. The obvious way would be to store all of it in a data structure, but we are talking about potentially GBs of data, so this should not be an option (or am I wrong here?).
The random re-generation of previously generated items should also be as fast as possible, synchronisable over different threads, and ideally provide a way to specify the underlying distribution for both the generated data and the selection of test data items.
[edit] I just found out that the random.jumpahead(n) function might come in handy; the only problem is that it does not work with the pseudo-random number generator (PRNG) used since Python 2.3. But the old one is still available (random.WichmannHill()), where I could just "jump ahead" n steps from my initial seed.
And just to be clear: I'm using python 2.7.
[edit2] What this question might boil down to is skipping n generation steps. You can do it with the default PRNG with some code like I found here:
import random

def skip(n):
    for _ in xrange(n):
        random.random()
But, as stated in the source and confirmed by my own tests, this is only efficient for n < ~100,000, which is way too small. Using random.WichmannHill() I can call jumpahead(n) for any n with the same performance.
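For reference, this is roughly what the WichmannHill route looks like (Python 2.x only; the seed and step count here are arbitrary):

import random

rng = random.WichmannHill(12345)  # legacy generator with a working jumpahead
rng.jumpahead(5000000)            # acts as if 5,000,000 calls to random() were made
value = rng.random()              # reproducible for this seed and offset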
If you already know 1) the number of entries you will be generating, and 2) the number of random entries you need from that data, you could just obtain the random entries as you are generating them, storing only those in a data structure.
Say you need to create a million entries for your NoSQL database, and you know you'll want to grab 100 random items out of there to test queries. You can generate 100 random numbers between 1 and 1,000,000, and as you're generating the entries for your stress test, you can take the entries that match up with your randomly generated numbers and store those specific ones in a data structure. Alternatively, you can just save a randomly generated entry to your data structure with some probability m/n, where m is the number of random test queries you need and n is the total volume of data you're creating.
Basically, it's going to be much easier to obtain the random data while it's being generated than to store everything and pluck data randomly from there. As for how to generate the data, that's going to probably be dependent on your NoSQL implementation and the specific data format you want to use.
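Here is a minimal sketch of that idea (the entry generator and the counts are placeholders):

import random

TOTAL = 1000000   # entries you plan to generate (placeholder)
SAMPLES = 100     # random test items you want to keep

picker = random.Random(99)  # separate RNG so the data seed stays reproducible
wanted = set(picker.sample(xrange(TOTAL), SAMPLES))

random.seed(42)  # the seed that makes the data set re-creatable
test_items = []
for i in xrange(TOTAL):
    entry = random.random()        # stand-in for the real record generator
    if i in wanted:
        test_items.append(entry)   # keep only the pre-picked entries
    # ... write entry to the database here ...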
EDIT: as dcorking pointed out, you don't even need to store the test items themselves in a data structure. You can just execute them as they show up while you're generating data. All you would need to store is the sequence that determines which tests get run. Or, if you don't want to run the same tests every time, you can just randomly select certain elements to be your test elements as I mentioned above, and store nothing at all.
I'm going to be running a large number of simulations producing a large amount of data that needs to be stored and accessed again later. Output data from my simulation program is written to text files (one per simulation). I plan on writing a Python program that reads these text files and then stores the data in a format more convenient for analyzing later. After quite a bit of searching, I think I'm suffering from information overload, so I'm putting this question to Stack Overflow for some advice. Here are the details:
My data will basically take the form of a multidimensional array where each entry will look something like this:
data[ stringArg1, stringArg2, stringArg3, stringArg4, intArg1 ] = [ floatResult01, floatResult02, ..., floatResult12 ]
Each argument has roughly the following numbers of potential values:
stringArg1: 50
stringArg2: 20
stringArg3: 6
stringArg4: 24
intArg1: 10,000
Note, however, that the data set will be sparse. For example, for a given value of stringArg1, only about 16 values of stringArg2 will be filled in. Also, for a given combination of (stringArg1, stringArg2) roughly 5000 values of intArg1 will be filled in. The 3rd and 4th string arguments are always completely filled.
So, with these numbers my array will have roughly 50*16*6*24*5000 = 576,000,000 result lists.
I'm looking for the best way to store this array such that I can save it and reopen it later to either add more data, update existing data, or query existing data for analysis. Thus far I've looked into three different approaches:
a relational database
PyTables
Python dictionary that uses tuples as the dictionary keys (using pickle to save & reload)
There's one issue I run into with all three approaches: I always end up storing every tuple combination of (stringArg1, stringArg2, stringArg3, stringArg4, intArg1), either as a field in a table or as the keys in the Python dictionary. From my (possibly naive) point of view, it seems like this shouldn't be necessary. If these were all integer arguments, they would just form the address of each data entry in the array, and there wouldn't be any need to store all the potential address combinations in a separate field. For example, if I had a 2x2 array = [[100, 200], [300, 400]], you would retrieve values by asking for the value at an address, array[0][1]. You wouldn't need to store all the possible address tuples (0,0) (0,1) (1,0) (1,1) somewhere else. So I'm hoping to find a way around this.
What I would love to be able to do is define a table in PyTables, where cells in this first table contain other tables. For example, the top-level tables would have two columns. Entries in the first column would be the possible values of stringArg1. Each entry in the second column would be a table. These sub-tables would then have two columns, the first being all the possible values of stringArg2, the second being another column of sub-sub-tables...
That kind of solution would be straightforward to browse and query (particularly if I could use ViTables to browse the data). The problem is PyTables doesn't seem to support having the cells of one table contain other tables. So I seem to have hit a dead end there.
I've been reading up on data warehousing and the star schema approach, but it still seems like your fact table would need to contain tuples of every possible argument combination.
Okay, so that's pretty much where I am. Any and all advice would be very much appreciated. At this point I've been searching around so much that my brain hurts. I figure it's time to ask the experts.
Why not use a single big table to keep all ~500 million entries? If you use on-the-fly compression (the Blosc compressor is recommended here), most of the duplicated entries will be deduplicated, so the storage overhead is kept to a minimum. I'd recommend giving this a try; sometimes the simple solution works best ;-)
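A rough sketch of what that could look like with PyTables (the file name and column sizes are guesses, not from the question):

import tables

class Result(tables.IsDescription):
    stringArg1 = tables.StringCol(32)
    stringArg2 = tables.StringCol(32)
    stringArg3 = tables.StringCol(32)
    stringArg4 = tables.StringCol(32)
    intArg1 = tables.Int32Col()
    results = tables.Float64Col(shape=(12,))

# Blosc compression is enabled through a Filters object on the table.
filters = tables.Filters(complevel=5, complib='blosc')
h5 = tables.open_file('results.h5', mode='w')
table = h5.create_table('/', 'results', Result, filters=filters)

row = table.row
row['stringArg1'] = 'example'
row['intArg1'] = 1
row['results'] = [0.0] * 12
row.append()
table.flush()
h5.close()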
Is there a reason the basic 6-table approach doesn't apply?
I.e., tables 1-5 would be single-column tables defining the valid values for each of the fields, and then the final table would be a 5-column table defining the entries that actually exist.
Alternatively, if every value always exists for the 3rd and 4th string arguments as you describe, the 6th table could consist of just 3 columns (string1, string2, int1), and you could generate the combinations with string3 and string4 dynamically via a Cartesian join.
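As a sketch, the 3-column variant might look like this in SQLite (table and column names are illustrative, and only two of the twelve result columns are shown):

import sqlite3

con = sqlite3.connect('results.db')
con.executescript("""
    CREATE TABLE string1_values (value TEXT PRIMARY KEY);     -- ~50 rows
    CREATE TABLE string2_values (value TEXT PRIMARY KEY);     -- ~20 rows
    CREATE TABLE int1_values    (value INTEGER PRIMARY KEY);  -- ~10,000 rows
    -- fact table: one row per combination that actually exists
    CREATE TABLE results (
        string1 TEXT REFERENCES string1_values(value),
        string2 TEXT REFERENCES string2_values(value),
        int1    INTEGER REFERENCES int1_values(value),
        floatResult01 REAL,
        floatResult12 REAL  -- columns 02..11 omitted for brevity
    );
""")
con.close()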
I'm not entirely sure what you're trying to do here, but it looks like you're trying to create a (potentially) sparse multidimensional array. So I won't go into details for solving your specific problem, but the best package I know that deals with this is NumPy. NumPy can "be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases."
I've used NumPy many times for simulation data processing, and it provides many useful tools, including easy file storage/access.
Hopefully you'll find something in its very easy-to-read documentation:
Numpy Documentation with Examples
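As an illustration, a structured array for records like the ones in the question might look like this (the field sizes are guesses):

import numpy as np

# One record per argument combination that actually exists; the twelve
# float results live in a fixed-size sub-array field.
dtype = np.dtype([('stringArg1', 'U16'), ('stringArg2', 'U16'),
                  ('stringArg3', 'U16'), ('stringArg4', 'U16'),
                  ('intArg1', 'i4'), ('results', 'f8', (12,))])
data = np.zeros(2, dtype=dtype)
data[0] = ('a', 'b', 'c', 'd', 42, np.arange(12.0))

np.save('sim_results.npy', data)     # simple file storage...
loaded = np.load('sim_results.npy')  # ...and reloading later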