I have a dataframe with around 200,000 observations and several variables. I want to run a regression and use one of the variables (Location) as a location dummy. There are around 3,600 distinct values, and Location is currently stored as a string. I read that it might be faster for pandas to use categorical variables, so I tried running the following code:
df["Location_Cat"] = df.Location.astype("category")
However, this makes my computer run like crazy, and after 2 minutes I still have no result.
Do you have any idea why this could be the case? Or is it generally not recommended to create category columns with this many categories?
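To narrow things down, I suppose I could time the conversion on a small sample first; a sketch of what I have in mind:
import time

# Time the conversion on a 10,000-row sample to see whether the
# conversion itself is the bottleneck.
sample = df.Location.sample(10000, random_state=0)
start = time.time()
sample.astype("category")
print(time.time() - start)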
I want to create 1 million random records that are normally distributed.
I have 12 time spans:
hours_dic = {1: "00:00-02:00", 2: "02:00-05:00", 3: "05:00-07:00",
             4: "07:00-08:00", 5: "08:00-10:00", 6: "10:00-12:00",
             7: "12:00-15:00", 8: "15:00-17:00", 9: "17:00-19:00",
             10: "19:00-21:00", 11: "21:00-22:00", 12: "22:00-00:00"}
My problem is that I want the randomized records to be normally distributed. Let's say the mean is "17:00-19:00"; the standard deviation is not important right now, as these records are dummy data for a research project. I would like to generate the records and later export them to Excel.
To clarify, those hours represent usage times, and I want to generate the data under the assumption that the majority use the app in the afternoon.
I thought about using numpy:
x = np.random.normal(loc=9,scale=1,size=1000000)
And then maybe using .map with the hours_dic.
However, I can't find a way to make the generated numbers integers only (as I want to use the dictionary) while ensuring the distribution stays as I want.
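Something along these lines is what I have in mind, though I'm not sure rounding and clipping is the right approach (clipping piles the tails onto slots 1 and 12):
import numpy as np
import pandas as pd

# Draw 1M normal values centred on slot 9 ("17:00-19:00"), round to
# integers, and clip into the valid 1..12 range of hours_dic.
x = np.random.normal(loc=9, scale=1, size=1000000)
slots = np.clip(np.rint(x), 1, 12).astype(int)

# Map slot numbers to their time spans and export for Excel.
# (Note: 1M rows is close to Excel's 1,048,576-row sheet limit.)
records = pd.Series(slots).map(hours_dic)
records.to_frame("time_span").to_excel("records.xlsx", index=False)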
Thanks for any help; if any elaboration is required, please ask.
(If there's an Excel solution I'd like to know it too.)
Essentially, I am trying to use json_normalize to normalize a column in my pandas dataframe. The normalized portion adds only 5 columns, which doesn't seem ludicrous to me, though the dataframe does have around 21,000 rows (which still doesn't seem like much).
I have tried to use a list comprehension to create a dataframe containing only the information that needs to be normalized, which I will later join back to the main dataframe.
dfNormalized = pandas.concat(
    [df.append(json_normalize(x)) for x in data['email_addresses']],
    ignore_index=True)
The code will run for a little bit before my computer freezes, along with every other program. I am then forced to manually power off the machine. Am I doing this inefficiently? Does my computer just suck? I thought this worked yesterday, but maybe I'm remembering wrong, because every time I try it today the ol' machine blows up.
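I also wonder whether the df.append inside the comprehension is the culprit; it returns a copy of the entire dataframe for every element, so memory grows very fast. A sketch of what I actually intend might be:
import pandas
from pandas.io.json import json_normalize  # or pandas.json_normalize in newer versions

# Normalize each entry on its own, without re-appending the whole
# dataframe inside the comprehension.
dfNormalized = pandas.concat(
    [json_normalize(x) for x in data['email_addresses']],
    ignore_index=True)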
I have JSON with data about some products, and I have already converted it into a flat table with pandas, so now I have a few columns of data. I manually selected some products and put them into one group. I have sorted them by name, for example, but it is more complicated than that; there are also some features and requirements that need to be checked.
So what I want is to create a script that will group my products in a similar way to the few groups I created manually, based on my own judgment.
I'm totally new to machine learning, but I have read about it and watched some tutorials; I just haven't seen this type of case.
I saw that if I use a KNN classifier, for example, I have to provide as input every group that exists, and then it will assign each product to one of those groups. But in my case it must be more complicated, I guess, since I want the script to create those groups on its own, in a way similar to the ones I selected.
I was thinking about unsupervised machine learning, but that doesn't look like the solution, because I have my own data that I want to provide; it seems like I need some kind of hybrid with supervised machine learning.
import pandas as pd
from pandas.io.json import json_normalize
from sklearn import preprocessing

# Flatten the JSON and encode product names as integers.
data = pd.read_json('recent.json')['results']
data = json_normalize(data)
le = preprocessing.LabelEncoder()
product_name = le.fit_transform(data['name'])
(Just some code to show what I have done so far.)
I don't know if what I want makes sense. I already attempted this problem the normal way, without machine learning, just with ifs and loops, but I wish I could also do it in a "smarter" way.
The code above doesn't show much. If you have data about products where each entry contains feature fields, you can cluster them with hierarchical (agglomerative) clustering, which is an unsupervised method.
"I have to put in input every group that exists"
No; just define a metric between two entries, and the method builds the classes, or an entire dendrogram, according to it, so you can select classes from the dendrogram however you want. If you look at each node there, it contains the common features of the items in its class, so it provides an automatic description of the class.
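A minimal sketch of that idea with scipy, assuming data is the flattened dataframe from the question (the numeric feature columns here are hypothetical placeholders):
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric columns; replace with whatever fields
# actually describe your products after json_normalize.
features = data[["price", "weight", "rating"]]

# Standardize so no single feature dominates the distance metric.
X = StandardScaler().fit_transform(features)

# Build the dendrogram (Ward linkage on Euclidean distances)...
Z = linkage(X, method="ward")

# ...and cut it into, say, 5 groups; tune this until the result
# resembles the grouping you made by hand.
data["group"] = fcluster(Z, t=5, criterion="maxclust")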
I am working with a huge H2OFrame (~150 GB, ~200 million rows) that I need to manipulate a little. To be more specific: I have to use the frame's ip column to find the location/city name for each IP and add this information to each of the frame's rows.
Converting the frame to a plain Python object and manipulating it locally is not an option, due to the frame's huge size. What I was hoping I could do is use my H2O cluster to create a new H2OFrame city_names from the original frame's ip column and then merge the two frames.
My question is similar to the one posed here, and what I gathered from that question's answer is that there is no way in H2O to do complex manipulations of each of the frame's rows. Is that really the case? After all, H2OFrame's apply function only accepts a lambda, not custom methods.
One option I thought of was to use Spark/Sparkling Water for this kind of data manipulation and then convert the Spark frame to an H2OFrame for the machine learning operations. However, if possible I would prefer to avoid that and use only H2O, not least because of the overhead such a conversion creates.
So I guess it comes down to this: is there any way to do this kind of manipulation using only H2O? And if not, is there another option that doesn't require changing my cluster architecture (i.e. without turning my H2O cluster into a Sparkling Water cluster)?
Yes: when using apply with an H2OFrame, you cannot pass a named function; only a lambda is accepted. For example, if you try passing a tryit function, you will get the following error showing the limitation:
H2OValueError: Argument `fun` (= <function tryit at 0x108d66410>) does not satisfy the condition fun.__name__ == "<lambda>"
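To illustrate (a minimal sketch; note that even the lambda body is limited to expressions H2O can translate into its own operations):
import h2o
h2o.init()

frame = h2o.H2OFrame([[1, 2, 3, 4], [1, 2, 3, 4]])

# Accepted: a lambda built from H2O-translatable expressions.
row_sums = frame.apply(lambda x: x.sum(), axis=1)

# Rejected with the H2OValueError above: any named function.
def tryit(x):
    return x.sum()
# frame.apply(tryit, axis=1)  # raises the error shown above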
As you already know, Sparkling Water is another option: perform all the data munging in Spark first and then push your data into H2O for ML.
If you want to stick with H2O as it is, your option is to loop through the dataframe and process the elements your way. The following approach could be a little time-consuming depending on your data; however, it does not require you to change your environment. (A rough sketch follows the steps below.)
Create a new H2OFrame by selecting only the "ip" column, and add empty location, city, and other columns to it, filled with NA.
Loop through all the ip values and, based on each "ip", find the location/city and fill in the corresponding column values.
Finally, cbind the new H2OFrame with the original H2OFrame.
Check that the "ip" and "ip0" columns match 100% to confirm a proper merge, then remove the duplicate "ip0" column.
Remove the extra helper H2OFrame to save memory.
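A rough sketch of those steps, assuming a hypothetical ip_to_city() lookup on the client and that the single ip column (unlike the full 150 GB frame) fits in client memory; original_frame stands in for your big frame:
import h2o
import pandas as pd

# Pull only the ip column to the client; the full frame stays in H2O.
ips = original_frame["ip"].as_data_frame()["ip"]

# ip_to_city() is a hypothetical lookup you would have to supply.
cities = pd.DataFrame({"ip0": ips,
                       "city": [ip_to_city(ip) for ip in ips]})

# Push the result back into the cluster and bind it column-wise.
city_frame = h2o.H2OFrame(cities)
merged = original_frame.cbind(city_frame)

# Verify "ip" and "ip0" match row-for-row, then drop the duplicate.
assert (merged["ip"] == merged["ip0"]).sum() == merged.nrow
merged = merged.drop("ip0")

# Remove the helper frame to free cluster memory.
h2o.remove(city_frame)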
If your ip --> city algorithm is a lookup table, you could create it as a data frame and then use h2o.merge. For an example, this video (starting at around the 59-minute mark) shows how to merge weather data into the airlines data.
For IP addresses, I imagine you might want to first truncate them to the first two or three octets.
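For instance (a sketch; the lookup rows and the big frame are placeholders):
import h2o
import pandas as pd

# Hypothetical lookup table: first three octets of the ip -> city.
lookup = h2o.H2OFrame(pd.DataFrame({
    "ip_prefix": ["192.168.0", "10.0.1"],
    "city": ["Springfield", "Shelbyville"]}))

# Derive the same prefix on the big frame by stripping the last octet,
# then merge on the shared "ip_prefix" column (merge keys generally
# need to be categorical, hence asfactor()).
big["ip_prefix"] = big["ip"].gsub("\\.\\d+$", "").asfactor()
lookup["ip_prefix"] = lookup["ip_prefix"].asfactor()
merged = big.merge(lookup, all_x=True)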
If you don't have a lookup table, it becomes an interesting question whether it is quicker to turn your complex algorithm into such a lookup table and do the h2o.merge, or to stick with downloading your huge data in batches, running the algorithm locally in the client, uploading a batch of answers, and doing h2o.cbind at the end.
BTW, the cool and trendy approach would be to sample a million of your IP addresses, look up the correct answers on the client to make a training data set, and then use H2O to build a machine learning model. You can then use h2o.predict() to create the new city column in your real data. (You will want to at least split the IP address into 4 columns first, though.) (My hunch is that a deep random forest would work best, but I would definitely experiment a bit.)
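A sketch of that idea, assuming the ip has already been split into integer columns ip1..ip4 and that ip_to_city() is again a hypothetical client-side lookup:
import h2o
from h2o.estimators import H2ORandomForestEstimator

# Take a ~0.5% sample (~1M of 200M rows) and label it on the client.
sample = big.split_frame(ratios=[0.005])[0]
local = sample.as_data_frame()
local["city"] = [ip_to_city(ip) for ip in local["ip"]]

# Train on the labelled sample inside the cluster.
train = h2o.H2OFrame(local)
model = H2ORandomForestEstimator(ntrees=50)
model.train(x=["ip1", "ip2", "ip3", "ip4"], y="city", training_frame=train)

# Predict the city column for the full frame.
big["city"] = model.predict(big)["predict"]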
I am new to programming and would appreciate any advice regarding my assignment. Before you read what follows, please accept my apologies for being so silly.
Context
Every week I receive several .txt documents. Each document goes something like this:
№;Date;PatientID;Checkpoint;Parameter1;Parameter2;Parameter3\n
1;01.02.2014;0001;1;25.3;24.2;40.0\n
2;01.02.2014;0002;1;22.1;19.1;40.7\n
3;02.02.2014;0003;1;20.1;19.3;44.2\n
4;04.02.2014;0001;2;22.8;16.1;39.3\n
...
The first line contains the column names, and every other line represents an observation. In fact there are over 200 columns and about 3,000 lines in each .txt file I receive. Moreover, each week the column names may differ slightly from the week before, and each week the number of observations grows.
My job is to select the observations that satisfy certain parameter requirements and build boxplots for some of the parameters.
What I think I should do
I want to write a program in Python 2.7.6 that consists of four parts.
Code that would turn every observation into an object, so that I can access attributes like this:
obs1.checkpoint = 1
obs4.patientid = "0001"
I literally want the column names to become attribute names.
Having done this, it would be nice to create an object for every unique PatientID, with the objects representing that patient's observations stored as attributes of the patient object. My goal here is to make it easy to check whether a patient's parameter increases from checkpoint 1 to checkpoint 2.
Code that would select the observations I need.
Code that would build boxplots.
Code that would combine the three parts above into one program.
What I've found so far
I have found some working code that dynamically adds attributes to instances:
http://znasibov.info/blog/html/2010/03/10/python-classes-dynamic-properties.html
I'm afraid I don't fully understand how it works yet, but I think it might be of use in my case to turn column names into attribute names.
I have also found that creating variables dynamically is frowned upon:
http://nedbatchelder.com/blog/201112/keep_data_out_of_your_variable_names.html
Questions
Is it a good idea to turn every line in the table into an object?
If so, how do I go about it?
How do I create as many objects as there are lines in the table, and how do I name these objects?
What is the best way to turn column names into class attributes?
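A minimal sketch of one way to approach parts 1 and 2 (Python 2.7; the file name is a placeholder, column names are lower-cased so they work as attribute names, and a column like "№" would still need special handling):
import csv
from collections import defaultdict

class Observation(object):
    def __init__(self, row):
        # row maps column name -> value; turn each pair into an attribute.
        for name, value in row.items():
            setattr(self, name.lower(), value)

patients = defaultdict(list)  # PatientID -> list of Observation objects
with open("week.txt") as f:   # placeholder file name
    for row in csv.DictReader(f, delimiter=";"):
        obs = Observation(row)
        patients[obs.patientid].append(obs)

# Example: did Parameter1 increase from checkpoint 1 to checkpoint 2?
for pid, obs_list in patients.items():
    by_cp = dict((o.checkpoint, o) for o in obs_list)
    if "1" in by_cp and "2" in by_cp:
        rose = float(by_cp["2"].parameter1) > float(by_cp["1"].parameter1)
        print "%s: %s" % (pid, rose)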