Dataset for a python application - python

I am working on an application to predict a disease from it's symptoms, I have some trouble making a dataset.
If someone has a dataset on this, please link it to drive and share it here.
Also I have a question on a good model for this(sklearn only). I am currently using decision tree classifier as my model for the project. Give suggestions if you have any.
Thank you for reading.
EDIT: Got the solution

You can make your own from this csv template:
Sickness, Symptom1, Symptom2, Symptom4
Covid-19, Cough, Loss of taste, Fever, Chills
Common Cold, Sneezing, Cough, Runny Nose, Headache
ignore bullet points, just for formatting. then use pandas read csv to read the data. if u need more help #mention me

I see that you are having trouble finding a dataset. I made a quick search, and i found this one in kaggle. It would require preprocessing, since many of the symptoms are nulls in the columns. Maybe you could make it so each column is a specific sympton, with values 1 (or 0) if the symptom is (or isn't) present. This would have the problem that the number of 0s would be very high. You can try that and see if it works.
You can also see another implementation with Random Forest in this link, with very different preprocessing. It is an advanced model of Decision Tree. However, the Decision Tree is more interpretable, if that is what you need.

Related

Recommendation for ML algorithm to differentiate between ban or allowance

I am new to Machine Learning and wanted to see if any of you could recommend an algorithm I could apply onto a project I'm doing. Basically I want to scrape popular housing websites and look at their descriptions to see if they allow/disallow something, for example pets. The problem is a simple search for pets leads contradictory results: 'pets allowed' and 'no additional cost for pets' or 'no pets' or 'I don't accept at this time'. As seen from these examples, often negative keywords 'no' are used to indicate pets are allowed, whereas positive keywords like 'accept' are used to indicate a ban. As such, I was wondering if there was any algorithm I could use (preferably in python) differentiate between the two. (Note: I can't run training data to generate an algorithm myself as the thing I am actually looking for is quite niche).
Thank you very much for your help!!
The keyword you're looking for is "document classification". This is a document classification problem. You start with documents (i.e. webpages) and you want to classify them as "allows pets" or "doesn't allow pets" (or whatever). There are a lot of good tutorials out there for performing document classification but a full explanation is beyond the scope of a StackOverflow answer.
You won't be able to do this for your particular niche case without providing at least some training data but you could gather, say 30 example websites, extract their text, manually add labels ("does fit my niche" vs "doesn't fit my niche"), and then run through a standard document classification system and see if that gets you the accuracy you want. Also in order for this to work with a small amount of training data (like your 30 documents), you'll need to start from a pretrained model.
Good luck!

Linear Regression Model that improves as the user selects and trains data

I'm developing a script that detects peaks on a signal data from a biological source. I want to create a semi-automated model that helps predict which peaks are the correct ones. This script improves as the user manually selects a few of these peaks to help teach the model which ones are correct.
The workflow I'm trying to attain is this:
1. User manually selects data
2. Script obtains the correct data and fits it into the model
3. Use the model to predict the likelihood of a given peak to be correct.
4. Hopefully with enough data and training, it could be automated to run through the rest.
I also don't know the name of the general topic and I'm struggling to find what to google.
I've tried to fit it on linear regression model in scikit learn but I don't have enough datasets (as it learns from the user's first intervention). Is what I'm doing possible?
Sorry for the general-ness of this answer but the OP asked for general topics.
It sounds like semi-supervised learning and here for scikit-learn and here for more details may work.
There is no labeled data to start. A manual process is started to gain some labeled data. Soon, semi-supervised can kick in and take over - with a process measuring its accuracy. A match to your situation and a good place to start.
Eventually you may have "enough" correctly labeled data that you can investigate fitting a classic algorithm to predict the remainder. "Enough" being relative to how hard the problem is. Could be tens, hundreds, thousands, ...
Depending on other details of your situation, Reinforcement learning may work. As you described the situation, this may not work but there may be other details in your environment to leverage this family.
Word of warning - machine learning and semi-supervised in particular may not always work great to every problem. Measure, measure, measure.
Thank you everyone for all your help. I was talking to a colleague and he referred me to Online Machine Learning. I think this was the one I was looking for. Although I would not be handling time-series data nor streaming data from online, the method i think is sufficient for my needs. This method allows that data is trained one by one and not as a batch. I think SciKit Learn currently does not have the ability of out-of-the-box online machine learning.
This i think gives a great rundown on the strengths of online machine learning (also showcasing of the creme python library).
Thanks again!

Pandas - How to predict values of a another column with values from neighboring columns

I have a pandas DataFrame that consists of several rows and columns. I am specifically interested in two columns. See the below example.
UID Item Composition
1 Water Hydrogen,Oxygen
2 Sulfuric acid Hydrogen,Sulfur,Oxygen
3 Alcohol Spirit
4 Hydrochloric acid Hydrogen,Chloride
5 Citric Acid Hydrogen,Carbon, Oxygen
Let's say we have a very long list. I would like to predict the Item column by learning the Composition column. Please suggest the best method to do this using python libraries.
One approach may be using sklearn library (decision tree classifier) as you have few features only. The composition will need to be separated and coded to numeric values. I'm not an expert in this filed, you can find plenty resources about it here and elsewhere. It helped me with similar problem to yours. Just a suggestion.
Thank you #B.Malysz for taking the time to comment and giving me a direction. I did go through the decision trees and kept on reading a lot of material and finally found that using TF-IDF vectorizer I was able to build a logic that can solve this problem. I was able to predict the item from its composition with quite a good accuracy level. I also tried to use LinearSVC, Randomforestclassifier or logisticregression and test to see which gives better prediction results.
Unfortunately I have been down voted by some people for asking this question :(

Python NLTK difference between a sentiment and an incident

Hi i want to implement a system which can identify whether the given sentence is an incident or a sentiment.
I was going through python NLTK and found out that there is a way to find out positivity or negativity of a sentense.
Found out the ref link: ref link
I want to achieve like
My new Phone is not as good as I expected should be treated as sentiment
and Camera of my phone is not working should be considered as incident.
I gave a Idea of making my own clusters for training my system for finding out such but not getting a desired solution is there a built-in way to find that or any idea on how can be approach for solution of same.
Advance thanks for your time.
If you have, or can construct, a corpus of appropriately categorized sentences, you could use it to train a classifier. There can be as many categories as you need (two, three or more).
You'll have to do some work (reading and experimenting) to find the best features to use for the task. I'd start by POS-tagging the sentence so you can pull out the verb(s), etc. Take a look at the NLTK book's chapter on classifiers.
Use proper training/testing methodology (always test on data that was not seen during training), and make sure you have enough training data-- it's easy to "overtrain" your classifier so that it does well on the training data, by using characteristics that coincidentally correlate with the category but will not recur in novel data.

Python NTL - Identifying text interest / topic

I am attempting to build a model that will attempt to identify the interest category / topic of supplied text. For example:
"Enjoyed playing a game of football earlier."
would resolve to a top level category like:
"Sport".
I'm not sure what the correct terminology is for what I am trying to achieve here so Google hasn't turned up any libraries that may be able to help. With that in mind, my approach would be something like:
Extract features from text. Use tagging to classify each feature / identify names / places. Would probably use NTLK for this, or Topia.
Run a Naive Bayes classifier for each interest category ("Sport", "Video Games", "Politics" etc.) and get a relevancy % for each category.
Identify which category has the highest % accuracy and categorise the text.
My approach would likely involve having individual corpora for each interest category and I'm sure the accuracy would be fairly miserable - I understand it will never be that accurate.
Generally looking for some advice on the viability of what I am trying to accomplish, but the crux of my question: a) is my approach is correct? b) are there any libraries / resources that may be of assistance?
You seem to know a lot of the right terminology. Try searching for "document classification." That is the general problem you are trying to solve. A classifier trained on a representative corpus will be more accurate than you think.
(a) There is no one correct approach. The approach you outline will
work, however.
(b) Scikit
Learn
is a wonderful library for this sort of work.
There is plenty of other information, including tutorials, online about this topic:
This Naive Bayesian Classifier on github probably already does most of what you want to accomplish.
This NLTK tutorial explains the topic in depth.
If you really want to get into it, I am sure a Google Scholar search will turn up thousands of academic articles in computer science and linguistics about exactly this topic.
You should check out Latent Dirichlet Allocation it will give you categories without labels , as always ed chens bolg is a good start.

Categories