Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 2 years ago.
Why is pre-processing so important, and what are the basic steps for doing it? Can anyone help? I am working in Python.
I have a dataframe containing null values. The data also contains outliers and is not uniformly distributed.
My question is: what procedure should I follow in order to fill the null values? Should I remove outliers, given that this might lead to loss of information? And what are the steps to make the data uniformly distributed?
Firstly, it really does not matter which language you are working in. Both Python and R are popular in data science.
Secondly, you cannot feed raw data into a machine learning model; you need to clean it first. Here are some simple steps:
1. Handle missing values: Data often contains missing values, which you have to either fill in or drop. How to fill them? There are plenty of methods, such as imputing the column mean, median, or most frequent value.
2. Remove skewness and outliers: Data generally contains values that fall outside the range of the rest. You have to bring those values back within that range, for example by clipping or transforming them.
3. One-hot encoding: Categorical values need to be transformed into an encoded numeric format.
There are more steps than these; plenty of blogs cover them in detail.
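The three steps above can be sketched with pandas as follows. This is only a minimal illustration, not a recommended protocol: the column names and values are made up, median/mode imputation and the 1.5 * IQR clipping rule are assumptions chosen for the example, and other choices may suit your data better.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric and one categorical column with gaps.
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0, 29.0, 120.0],
    "city": ["Pune", "Delhi", "Pune", None, "Delhi", "Pune"],
})

# Step 1: fill missing values (median for numeric, most frequent for categorical).
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Step 2: clip outliers to the 1.5 * IQR fences instead of dropping rows,
# so no records are lost.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df["age"] = df["age"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Step 3: one-hot encode the categorical column.
df = pd.get_dummies(df, columns=["city"])

print(df)
```

Clipping keeps every row, which addresses the worry about losing information; whether the extreme values are errors or genuine signal is still a judgment call for your dataset.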
Related
Closed 5 months ago.
Is there any Hierarchical Agglomerative Clustering implementation (in Python) available that preserves the order of data points? For example, I want the output something like this.
(((seg1, seg2), (seg3, seg4)), seg5)
but not like this
(((seg1, seg5), (seg2, seg3)), seg4)
E.g., Actual output with existing implementation
Expected output (any implementation?)
Vijaya, from what I know, there is only one public library that does order preserving hierarchical clustering (ophac), but that will only return a trivial hierarchy if your data is totally ordered (which is the case with the sections of a book).
There is a theory that may offer a theoretical answer to your question, but no industry-strength algorithms currently exist: https://arxiv.org/abs/2109.04266. I have an implementation of this theory that can handle up to 20 elements, so if this is of interest, let me know and I will share the code.
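For totally ordered data such as book sections, one simple way to get the kind of output asked for in the question is to allow merges only between neighbouring clusters. The sketch below is neither ophac nor the theory from the linked paper; it is a plain adjacency-constrained agglomerative scheme, and the function name, the one-dimensional feature values, and the use of cluster means as the distance are all assumptions made for illustration.

```python
def order_preserving_clustering(labels, values):
    """Agglomerative clustering that only merges adjacent clusters.

    labels: segment names in their original order.
    values: one numeric feature per segment (hypothetical 1-D features).
    Returns the hierarchy as nested tuples.
    """
    # Each cluster is (tree, member_values); start with one cluster per segment.
    clusters = [(label, [value]) for label, value in zip(labels, values)]

    def mean(xs):
        return sum(xs) / len(xs)

    while len(clusters) > 1:
        # Pick the adjacent pair whose cluster means are closest.
        best = min(
            range(len(clusters) - 1),
            key=lambda i: abs(mean(clusters[i][1]) - mean(clusters[i + 1][1])),
        )
        left, right = clusters[best], clusters[best + 1]
        # Replace the pair by the merged cluster, keeping its position.
        clusters[best:best + 2] = [((left[0], right[0]), left[1] + right[1])]

    return clusters[0][0]


segments = ["seg1", "seg2", "seg3", "seg4", "seg5"]
features = [1.0, 1.2, 5.0, 5.1, 12.0]  # made-up values
tree = order_preserving_clustering(segments, features)
print(tree)
```

Because only neighbours can merge, a result such as (seg1, seg5) can never appear. If scikit-learn is an option, its AgglomerativeClustering accepts a connectivity matrix that can express the same neighbour-only constraint.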
Closed 2 years ago.
I am currently working on my master thesis. My topic is the development of a duplicate checking system in ABAP for a customer in the SAP environment. Up to now, the whole thing works in such a way that if the checking system does not know exactly whether an incoming data record is a duplicate or not, a clerk intervenes and makes the final decision. The clerk is to be "replaced by a machine learning model", which is to be located on a Python server, so that the clerk intervenes less in the process and becomes more and more of an "AI trainer". The communication between the duplicate checking system and the ML model is done by a REST API. The ML-model should decide whether incoming data records are duplicates or not.
My first problem is that I don't have any training data to create an ML model. The second problem is that I still do not know exactly how my training data should be structured. It is possible to get data records from the client, but for various reasons this is quite unlikely. Is there a way to generate "synthetic data" to create an ML model? What could it look like for my application? Which tools could I use to make my work a little easier?
Many greetings
You can't.
When you don't have any real-world data, along with how humans classified it, you cannot train an ML system to classify real-world data.
What you can do is train a system with data you generated and classified in a way you believe to be similar to what the program might encounter in the wild. This would allow you to try your methodology and see if it works at all. But if you want to create an ML model which is actually useful in the real world, you need to repeat the training process with real-world data.
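As a rough sketch of what "data you generated and classified" could look like for duplicate checking, the snippet below builds labelled record pairs: a record paired with a typo-corrupted copy of itself is labelled 1 (duplicate), and a pair of different records is labelled 0. The records, the typo operations, and the 50/50 split are all invented for illustration, and real duplicates will likely differ in ways this does not capture.

```python
import random


def add_typo(text, rng):
    """Return a copy of text with one random typo (drop, double, or swap)."""
    operation = rng.choice(["drop", "double", "swap"])
    if operation == "swap":
        # Swap two neighbouring characters.
        i = rng.randrange(len(text) - 1)
        return text[:i] + text[i + 1] + text[i] + text[i + 2:]
    i = rng.randrange(len(text))
    if operation == "drop":
        return text[:i] + text[i + 1:]
    # "double": repeat the character at position i.
    return text[:i] + text[i] + text[i:]


def make_labelled_pairs(records, n_pairs, seed=0):
    """Build (record_a, record_b, label) triples; label 1 means duplicate."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        original = rng.choice(records)
        if rng.random() < 0.5:
            pairs.append((original, add_typo(original, rng), 1))
        else:
            other = rng.choice([r for r in records if r != original])
            pairs.append((original, other, 0))
    return pairs


# Hypothetical customer records.
records = [
    "Anna Schmidt, Berlin",
    "Jonas Weber, Hamburg",
    "Maria Fischer, Munich",
    "Lukas Wagner, Cologne",
]
pairs = make_labelled_pairs(records, n_pairs=10)
for a, b, label in pairs:
    print(label, "|", a, "|", b)
```

Pairs like these are enough to exercise the REST round trip and a first model end to end, but as the answer says, the model has to be retrained on real, human-labelled records before it is useful in production.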
There is a lot of variability in what your data COULD be. You need to narrow down what it will actually look like.
You can always create random data:
>>> import numpy as np
>>> N = 5
>>> np.random.random((N, N))  # N x N array of random floats in [0, 1)
Create a whole bunch of random arrays, then copy one of them, and test whether your method flags the copy as a duplicate.
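The experiment described above might look like this. It is a minimal sketch under simple assumptions: the planted duplicate is an exact copy, and `find_duplicates` is a hypothetical brute-force helper that compares every pair, which only scales to small collections.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5

# Ten random N x N arrays, plus an exact copy of the one at index 3.
arrays = [rng.random((N, N)) for _ in range(10)]
arrays.append(arrays[3].copy())


def find_duplicates(arrays):
    """Return index pairs (i, j), i < j, whose arrays are element-wise equal."""
    duplicates = []
    for i in range(len(arrays)):
        for j in range(i + 1, len(arrays)):
            if np.array_equal(arrays[i], arrays[j]):
                duplicates.append((i, j))
    return duplicates


print(find_duplicates(arrays))
```

For near-duplicates rather than exact copies, `np.allclose` with a tolerance could replace `np.array_equal`.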
Closed 2 years ago.
I want to forecast upcoming total users on a daily basis within Python using a machine learning algorithm. Check the pattern below:
Looking at this graph, I was wondering if someone knows which forecasting method in Python I should use to predict?
Thanks!
If you have no additional data except the user counts over time that you have shown, the only thing you can do is try to find a function of time that gives a good approximation of that plot (ordinary curve fitting). I suppose that's not what you want.
To make a prediction (which can be done by approaches other than machine learning, too), you need other data that is somehow correlated with the data you want to predict.
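The ordinary curve fitting mentioned above can be sketched with `numpy.polyfit`. The daily user counts below are synthetic (a made-up linear trend plus noise), and the degree-1 polynomial is an assumption; the actual plot may call for a different function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic history: 30 days of total users with a linear trend plus noise.
days = np.arange(30)
users = 1000 + 12.5 * days + rng.normal(0, 20, size=days.size)

# Fit a straight line users ~ slope * day + intercept.
slope, intercept = np.polyfit(days, users, deg=1)

# Extrapolate the fitted line one week ahead.
future_days = np.arange(30, 37)
forecast = slope * future_days + intercept

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(forecast.round())
```

Extrapolating a fitted curve assumes the trend continues unchanged, which is exactly the limitation the answer points out.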
Closed 6 years ago.
I have about 3,000 words and I would like to group them into about 20-50 different categories. My words are typical phrases you might find in company names. "Face", "Book", "Sales", "Force", for example.
The libraries I have been looking at so far are pandas and scikit-learn. I'm wondering if there is a machine-learning or deep-learning algorithm that would be well suited for this?
The topics I have been looking at are Classification (identifying which category an object belongs to) and Dimensionality Reduction (reducing the number of random variables to consider).
When I search for putting words into categories on Google, it brings up kids puzzles such as "things you do with a pencil" - draw. Or "parts of a house" - yard, room.
For deep learning to work on this, you would have to develop a large dataset, most likely manually. The largest natural language processing dataset was, in fact, created manually.
But if you were able to find a dataset that a model could learn from, then a model such as gradient boosted trees would be one, among others, that is well suited to multi-class classification like this. A classic library for this is xgboost.
Closed 6 years ago.
I'm planning to write a script that reads in text input data. This would consist of certain terms, e.g. "red car".
What machine learning tools for Python should I use if I want to identify potential matches to a term from my text input data within a database of terms and sentences?
For example, I would want similarly spelled terms (e.g. misspelled terms) like "redd car" to be identified and listed in the output of my script.
Edit 1: I have a method of identifying string similarity using FuzzyWuzzy, which returns a number representing how similar two strings are to each other. My question now is how to divide the words in the database into "similar" and "not similar" using machine learning approaches.
Without knowing much about your setup, I would recommend using the scikit-learn package for your project. It supports almost every aspect of machine learning, including but not limited to:
Classification
Regression
Clustering
Dimensionality reduction
Model selection
Preprocessing
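For the "similar" versus "not similar" split asked about in the question, one very small clustering-style sketch is to score every database term against the query and split the scores into two groups with a one-dimensional two-means procedure. Everything here is an assumption for illustration: `difflib.SequenceMatcher` from the standard library stands in for the FuzzyWuzzy score, the database terms are made up, and `two_means_threshold` is a hypothetical helper, not a scikit-learn function.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity in [0, 1]; a stand-in for a FuzzyWuzzy-style score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def two_means_threshold(scores, iterations=20):
    """Split 1-D scores into a low and a high group; return the cut point."""
    low_center, high_center = min(scores), max(scores)
    for _ in range(iterations):
        midpoint = (low_center + high_center) / 2
        low = [s for s in scores if s <= midpoint]
        high = [s for s in scores if s > midpoint]
        if not low or not high:
            break  # all scores fell on one side; nothing to split
        low_center = sum(low) / len(low)
        high_center = sum(high) / len(high)
    return (low_center + high_center) / 2


query = "red car"
database = ["redd car", "red cars", "blue truck", "green house"]  # made-up terms

scores = {term: similarity(query, term) for term in database}
threshold = two_means_threshold(list(scores.values()))

similar = [term for term, score in scores.items() if score > threshold]
not_similar = [term for term, score in scores.items() if score <= threshold]

print("similar:", similar)
print("not similar:", not_similar)
```

With labelled examples of matches and non-matches, the same scores could instead be fed as features to a scikit-learn classifier, which would learn the cut-off rather than infer it from the score distribution.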