I have JSON data about some products, which I have already converted into a flat table with pandas, so now I have a few columns of data. I manually selected some products and put them into one group. I sorted them by name, for example, but it is more complicated than that: there are also some features and requirements which need to be checked.
So what I want is to create a script which will group my products in a way similar to the few groups I created manually based on my own judgment.
I'm totally new to machine learning, but I have read about it and watched some tutorials, and I haven't seen this type of case.
I saw that if I use a KNN classifier, for example, I have to provide as input every group that exists, and then it will assign each single product to one of those groups. But my case must be more complicated, I guess, since I want the script to create those groups on its own, in a way similar to the ones I selected.
I was thinking about unsupervised machine learning, but that doesn't look like the solution because I have my own data which I want to provide; it seems like I need some kind of hybrid with supervised machine learning.
import pandas as pd
from pandas import json_normalize  # in older pandas: from pandas.io.json import json_normalize
from sklearn import preprocessing

data = pd.read_json('recent.json')['results']
data = json_normalize(data)
le = preprocessing.LabelEncoder()
product_name = le.fit_transform(data['name'])
This is just some code to show what I have done so far.
I don't know if what I want makes sense. I have already attempted this problem the ordinary way, without machine learning, just with ifs and loops, but I wish I could also do it in a "smarter" way.
The code above doesn't show much. If you have data about products where each entry contains a set of fields, you can cluster them with an unsupervised method such as k-means or hierarchical clustering.
"I have to provide as input every group that exists"
No, just define a metric between two entries and the method builds the classes, or an entire dendrogram, from that, so you can select classes from the dendrogram as you like. If you look at each node there, it contains the features common to the items in its class, so it provides an automatic description of the class.
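As a minimal sketch of that idea, assuming the numeric feature columns have already been extracted from the flattened product table (the random array below is just a stand-in for your real features):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# stand-in for the real product feature matrix (one row per product)
features = np.random.rand(20, 3)

Z = linkage(features, method='ward')              # build the hierarchy
labels = fcluster(Z, t=4, criterion='maxclust')   # cut it into 4 groups
print(labels)

# scipy.cluster.hierarchy.dendrogram(Z) can be plotted with matplotlib to
# inspect the tree and pick a cut that matches the groups you built by hand.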
I am currently working on a big categorical dataset using both pandas and sklearn. As I now want to do feature extraction, or just be able to build a model, I need to encode the categorical features, since they are not handled natively by sklearn models.
Here is the issue: for one of the categories, I have more than 110,000 different values. I cannot possibly encode this directly, as there would be (and there already is) a memory error.
Also, this feature cannot be removed from the dataset, so that option is out of the question.
So I see multiple ways to deal with this:
Use FeatureHasher from sklearn (as mentioned in several related topics). But while it makes encoding easy, there is no explanation of how to link the hashed feature back to the DataFrame and thus complete the feature extraction.
Use fuzzy matching to reduce the number of values in the feature and then use one-hot encoding. The issue here is that it will still create many dummy variables, and I will lose much of the usefulness of the encoded feature.
So I have multiple questions: first, do you know how to use FeatureHasher to link a pandas DataFrame to an sklearn model through hashing? If so, how am I supposed to do it?
Second, can you think of any other way to do it that might be easier, or that would work better for my problem?
Here are some screenshots of the dataset I have, so that you can understand the issue more fully.
One screenshot (not reproduced here) shows the number/percentage of distinct values per feature.
Also, the values inside the biggest feature to encode, called 'commentaire' ('commentary' in English), are strings that are sometimes long sentences and sometimes just short words.
As requested, here is the current code to test feature hashing:
from sklearn.feature_extraction import FeatureHasher

fh = FeatureHasher(input_type='string')
hashedCommentaire = fh.transform(dataFrame['commentaire'])
This raises a MemoryError, so I can reduce the number of features; let's set it to 100:
fh = FeatureHasher(input_type='string', n_features=100)
hashedCommentaire = fh.transform(dataFrame['commentaire'])
print(hashedCommentaire.toarray())
print(hashedCommentaire.shape)
This runs without error and produces the following output (screenshot 'feature hashing output', not reproduced here).
Can I then use the hashing result directly in my DataFrame? The issue is that we totally lose track of the values of 'commentaire'. If we then want to predict on new data, will the hashing output be consistent with the previous one? And how do we know the hashing is "sufficient"? Here I used 100 features, but there were more than 110,000 distinct values to begin with.
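For what it's worth, here is a rough sketch of one way to combine the hashed column with the rest of the data. numeric_df is a hypothetical name for the already-numeric part of the DataFrame, and the tokenisation is just whitespace splitting:

from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction import FeatureHasher

fh = FeatureHasher(n_features=100, input_type='string')
# with input_type='string', each sample should be an iterable of tokens,
# so split the free-text column first
hashed = fh.transform(text.split() for text in dataFrame['commentaire'])

X = hstack([csr_matrix(numeric_df.values), hashed])  # keep everything sparse
# X can now be passed to an sklearn estimator; because hashing is stateless,
# re-using the same FeatureHasher settings (same n_features) on new data
# produces columns that line up with the old ones.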
Following another suggestion, I began to explore the use of tf-idf. A screenshot of a sample of 'commentaire' values (not reproduced here) shows that the values are strings, sometimes whole sentences.
The thing is, while exploring this data, I noticed that some values are very close to each other; that's why I wanted to explore fuzzy matching at first, and now tf-idf.
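A hedged sketch of the tf-idf route on the 'commentaire' column alone (max_features is used here only to keep the vocabulary, and therefore memory use, bounded):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)
tfidf_commentaire = vectorizer.fit_transform(dataFrame['commentaire'].fillna(''))

print(tfidf_commentaire.shape)  # (number of rows, vocabulary size)
# unlike hashing, the vectorizer keeps its vocabulary, so each column can be
# mapped back to a word with vectorizer.get_feature_names_out()
# (get_feature_names() in older scikit-learn versions)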
I am new to programming and would appreciate any advice regarding my assignment. Please forgive me in advance if the questions that follow seem naive.
Context
Every week I receive several .txt documents. Each document goes something like this:
№;Date;PatientID;Checkpoint;Parameter1;Parameter2;Parameter3/n
1;01.02.2014;0001;1;25.3;24.2;40.0/n
2;01.02.2014;0002;1;22.1;19.1;40.7/n
3;02.02.2014;0003;1;20.1;19.3;44.2/n
4;04.02.2014;0001;2;22.8;16.1;39.3/n
...
The first line contains column names, and every other line represents an observation. In fact there are over 200 columns and about 3000 lines in each .txt file I get. Moreover, every week column names may be slightly different from what they were a week before, and every week the number of observations increases.
My job is to select the observations that satisfy certain parameter requirements and build boxplots for some of the parameters.
What I think I should do
I want to make a program in Python 2.7.6 that would consist of four parts.
Code that would turn every observation into an object, so that I can access attributes like this:
obs1.checkpoint = 1
obs4.patientid = "0001"
I literally want the column names to become attribute names.
Having done this, it would be nice to create an object for every unique PatientID, and to make the observation objects related to that patient attributes of the patient object. My goal here is to make it easy to check whether a patient's parameters increase from checkpoint 1 to checkpoint 2. (A rough sketch of this idea appears after the list below.)
Code that would select the observations I need.
Code that would build boxplots.
Code that would combine the three parts above into one program.
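Here is a rough sketch of the first part under the assumptions above (semicolon-separated files with a header line); the file name 'week.txt' and the class name Observation are made up:

import csv

class Observation(object):
    def __init__(self, row):
        # turn every column name into an attribute on the instance
        for name, value in row.items():
            setattr(self, name.lower(), value)

observations = []
with open('week.txt', 'rb') as f:  # 'rb' for the csv module on Python 2.7
    reader = csv.DictReader(f, delimiter=';')
    for row in reader:
        observations.append(Observation(row))

# group observations by patient so checkpoint comparisons are easy
patients = {}
for obs in observations:
    patients.setdefault(obs.patientid, []).append(obs)

print(patients['0001'][0].checkpoint)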
What I've found so far
I have found some working code that dynamically adds attributes to instances:
http://znasibov.info/blog/html/2010/03/10/python-classes-dynamic-properties.html
I'm afraid I don't fully understand how it works yet, but I think it might be of use in my case to turn column names into attribute names.
I have also found that creating variables dynamically is frowned upon:
http://nedbatchelder.com/blog/201112/keep_data_out_of_your_variable_names.html
Questions
Is it a good idea to turn every line in the table into an object?
If so, how do I go about it?
How do I create as many objects as there are lines in the table, and how do I name these objects?
What is the best way to turn column names into class attributes?
I currently have a function that writes one to four entries into a database every 12 hours. When certain conditions are met, the function is called again to write another one to four entries based on the previous ones. Since time isn't the only factor, I have to check whether the conditions are met, and because the entries are all in the same database I have to differentiate them by the time they were written (there is a DateTimeField in the model).
How could I achieve this? Is there a built-in function in Django that I just couldn't find, or do I have to look at a rather complicated solution?
As a sketch, I'd expect something like this:
latest = []
allData = myManyToManyField.objects.filter(externalId=2)  # filter() gives an iterable queryset
for data in allData:
    if data.Timestamp.checkIfLatest():  # imaginary checkIfLatest() returning True/False
        latest.append(data)
Or, even better, something like this (although I don't think that's implemented):
latest = myManyToManyField.objects.get.latest.filter(externalId=2)
The Django documentation is very good, especially with regard to querysets and model-layer functions; it's usually the first place you should look. It sounds like you want .latest(), but it's hard to tell with your requirements regarding conditions.
latest_entry = m2mfield.objects.latest('mydatefield')
if latest_entry.somefield:
    # do something
Or perhaps you wanted:
latest_entry = m2mfield.objects.filter(somefield=True).latest('mydatefield')
You might also be interested in order_by(), which orders the rows by a field you specify. You could then iterate over the related objects until you find the one that matches a condition.
But without more information on what these conditions are, it's hard to be more specific.
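As an illustrative sketch only (instance, m2mfield, somefield and mydatefield are placeholder names matching the snippets above):

# newest related row that meets the condition, or None if there is none
qs = instance.m2mfield.filter(somefield=True).order_by('-mydatefield')
latest_matching = qs.first()
if latest_matching is not None:
    print(latest_matching.mydatefield)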
It's just a thought: you could keep an epoch timestamp (the current time of the entry) as a field in the database, even as the primary key, and compare it with the previous entries to differentiate them.
I have a completed program that does the following:
1) Reads formatted data (a sequence of numbers and associated labels) from a serial port in real time.
2) Does minor manipulations to the data.
3) Plots the data in real time in a GUI I wrote using PyQt.
4) Updates data statistics in the GUI.
5) Allows post-analysis of the data after collection is stopped.
There are two dialogs (separate classes) that are called from within the main window in order to select certain preferences in plotting and statistics.
My question is the following: right now my data is read in and stored in several global variables that are appended to as data comes in, about 20 times per second (a 2D list for the numerical values and 1D lists for the various associated text values). Would it be better to create a class in which to store the data and its various attributes, and then use instances of this data class to drive everything else, such as the plotting of the data and the statistics associated with it?
I have a hunch that the answer is yes, but I need a bit of guidance on how to make this happen if it is the best way forward. For instance, would every single datum be a new instance of the data class? Would I then pass the instances one by one, or as a list, to the other classes and methods? What is the most elegant way to do the passing?
If I'm not being specific enough, please let me know what other information would help me get a good answer.
A reasonably good rule of thumb is that if what you are doing needs more than 20 lines of code, it is worth considering an object-oriented design rather than global variables, and if you get to 100 lines you should already be using classes. The purists will probably say never to use globals, but IMHO for a simple linear script that is probably overkill.
Be warned that you will probably get a lot of answers expressing horror that you are not doing so already.
There are some really good (and some free) books that introduce object-oriented programming in Python; a quick Google search should find the help you need.
Added Comments to the answer to preserve them:
So at 741 lines, I'll take that as a yes to OOP:) So specifically on the data class. Is it correct to create a new instance of the data class 20x per second as data strings come in, or is it more appropriate to append to some data list of an existing instance of the class? Or is there no clear preference either way? – TimoB
I would append/extend your existing instance. – seth
I think I see the light now. I can instantiate the data class when the "start data" button is pressed, and append to that instance in the subsequent thread that does the serial reading. THANKS! – TimoB
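A rough sketch of that approach (the class and method names here are made up, not from the original program):

class DataRecorder(object):
    """Single container for everything the serial reader collects."""

    def __init__(self):
        self.values = []  # 2D list: one row of numbers per sample
        self.labels = []  # associated text values, one per sample

    def add_sample(self, numbers, label):
        # called ~20x per second from the serial-reading thread
        self.values.append(list(numbers))
        self.labels.append(label)

    def latest(self):
        return self.values[-1], self.labels[-1]

# instantiate once when "start data" is pressed, then hand this one object
# to the plotting and statistics code instead of using globals
recorder = DataRecorder()
recorder.add_sample([25.3, 24.2, 40.0], 'checkpoint 1')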
I'm trying to tweak this code: http://snipperize.todayclose.com/snippet/py/Use-NLTK-Toolkit-to-Classify-Documents--5671027/ to accept some additional features. It seems to determine the class based on having separate files for separate classes of information, which is fine. But I'd also like to be able to add some additional data for it to look for. What needs to be modified? Any good resources? The NLTK/Python book doesn't address this.
What do you mean by "feature"? It seems to me that you just want to add more data, not features.
If you want to consider new features, you have to modify the extract-words function according to your needs.
If you just need more data, which may be stored in different files, you should edit the main code to take sets of filenames into account rather than a single file per class.
That of course implies a modification to the loop at line 74: you have to add another inner loop to iterate over all the filenames in the set.
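Since the linked snippet isn't reproduced here, the following is only a generic sketch of that suggestion: one set of filenames per class, an extra inner loop over each set, and a stand-in word-presence feature extractor in place of the snippet's own extract-words function. The filenames are hypothetical.

import nltk

def extract_words(text):
    # stand-in feature extractor: simple word-presence features
    return dict((word.lower(), True) for word in text.split())

class_files = {
    'positive': ['pos_1.txt', 'pos_2.txt'],  # hypothetical filenames
    'negative': ['neg_1.txt', 'neg_2.txt'],
}

training_set = []
for label, filenames in class_files.items():
    for filename in filenames:               # the extra inner loop
        with open(filename) as f:
            training_set.append((extract_words(f.read()), label))

classifier = nltk.NaiveBayesClassifier.train(training_set)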