Training a model with several large CSV files [closed] - python

I have a dataset composed of several large csv files. Their total size is larger than the RAM of the machine on which the training is executed.
I need to train an ML model from Scikit-Learn, TensorFlow, or PyTorch (think SVR, not deep learning) on the whole dataset, which is impossible to load into memory at once. Any recommendation on how to overcome this, please?

I have been in this situation before, and my suggestion would be to take a step back and look at the problem again.
Does your model absolutely need all of the data at once, or can it be trained in batches? It is also possible that the model itself can be trained in batches but the library you are using does not support it. In that situation, either look for a library that does support batch training or, if no such library exists (unlikely), "reinvent the wheel" yourself, i.e., implement the model from scratch with batch support. However, as your question mentions, you want to use a model from Scikit-Learn, TensorFlow, or PyTorch. If you want to stick with those libraries, there are techniques such as the ones Alexey Larionov and I'mahdi mentioned in the comments to your question regarding PyTorch and TensorFlow; a sketch of the Scikit-Learn route follows below.
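For example, with Scikit-Learn you can stream the CSV files in chunks with pandas and feed them to an estimator that supports partial_fit, such as SGDRegressor (note that the kernel-based SVR has no partial_fit, so a linear SGD-based regressor is used here as a stand-in). A minimal sketch, assuming the files share the same columns and the target column is named "target" (a hypothetical name):

import glob
import pandas as pd
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# SGDRegressor learns incrementally via partial_fit, so the full
# dataset never needs to fit in RAM at once.
model = SGDRegressor()
scaler = StandardScaler()

# First pass: fit the scaler incrementally on every chunk.
for path in glob.glob("data/*.csv"):          # hypothetical location of the CSV files
    for chunk in pd.read_csv(path, chunksize=100_000):
        scaler.partial_fit(chunk.drop(columns=["target"]))

# Second pass: train the model chunk by chunk.
for path in glob.glob("data/*.csv"):
    for chunk in pd.read_csv(path, chunksize=100_000):
        X = scaler.transform(chunk.drop(columns=["target"]))
        y = chunk["target"]
        model.partial_fit(X, y)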
Is all of your data actually relevant? Once I found that a whole subset of my data was useless for the problem I was trying to solve; another time I found that it was only marginally helpful. Dimensionality reduction, numerosity reduction, and statistical modeling may be your friends here. Here is a link to a Wikipedia page about data reduction:
https://en.wikipedia.org/wiki/Data_reduction
Not only will data reduction reduce the amount of memory you need, it will also improve your model. Bad data in means bad data out.
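If dimensionality reduction looks promising, it can also be done out of core, in the same chunked fashion as the training sketch above. A minimal sketch using Scikit-Learn's IncrementalPCA (the file name, chunk size, and number of components are arbitrary example values):

import pandas as pd
from sklearn.decomposition import IncrementalPCA

# IncrementalPCA fits principal components chunk by chunk,
# so the reduction itself never requires the full dataset in memory.
ipca = IncrementalPCA(n_components=20)

for chunk in pd.read_csv("data/part1.csv", chunksize=100_000):  # hypothetical file
    ipca.partial_fit(chunk.drop(columns=["target"]))

# Later, transform each chunk before feeding it to the model:
# X_reduced = ipca.transform(chunk.drop(columns=["target"]))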

Related

Captcha Security and Deep Learning [closed]

I came across this research paper: http://www.cs.sjsu.edu/~pollett/papers/neural_net_plain.pdf.
These researchers have come up with a way to break character-based CAPTCHAs, and it seems they have succeeded: they trained their CNN on 13 million CAPTCHAs and achieved accuracies above 95%.
How can we make a CAPTCHA secure so that it isn't bypassed by a deep learning model?
First of all, CAPTCHAs are meant to stop automated users/bots. Yes, if you have access to the actual CAPTCHA generator and train a deep learning model on that distribution, chances are it will perform well.
CAPTCHAs are getting harder, and they can be made harder still. But it takes resources to generate them, actual computational resources (unless they are random photographs rather than synthetic images). If a website really needs to be bot-proof, it can be made so.
By "bot" we usually mean web-scraping tools or automated users that try to do things like human users, only much faster. If you additionally integrate deep learning models, it is possible to bypass CAPTCHAs in most cases, but that may be overkill depending on your needs.
Protecting websites from bots also attracts less attention and resources than problems like facial recognition or self-driving cars (a relative statement).

How to train a ML Model for detecting duplicates without having data sets [closed]

I am currently working on my master thesis. My topic is the development of a duplicate checking system in ABAP for a customer in the SAP environment. Up to now, the whole thing works in such a way that if the checking system does not know exactly whether an incoming data record is a duplicate or not, a clerk intervenes and makes the final decision. The clerk is to be "replaced by a machine learning model", which is to be located on a Python server, so that the clerk intervenes less in the process and becomes more and more of an "AI trainer". The communication between the duplicate checking system and the ML model is done by a REST API. The ML-model should decide whether incoming data records are duplicates or not.
My first problem is that I don't have any training data with which to create an ML model. The second problem is that I still do not know exactly what structure that training data should have. It would be possible to get real data records from the client, but for various reasons that is quite unlikely. Is there a way to generate "synthetic data" to create an ML model? What might it look like for my application? Which tools could make my work a little easier?
Best regards
You can't.
If you don't have any real-world data, nor the way humans classified it, then you cannot train an ML system to classify real-world data.
What you can do is train a system with data you generated and classified in a way you believe to be similar to what the program might encounter in the wild. This would allow you to try your methodology and see if it works at all. But if you want to create an ML model which is actually useful in the real world, you need to repeat the training process with real-world data.
There is a lot of variability in what your data COULD be. You need to narrow down what it will look like.
You can always create random data:
>>> import numpy as np
>>> N = 5
>>> np.random.random((N, N))
Create yourself a whole bunch of random arrays, copy one of them, and then test whether your system can detect which one is the duplicate.
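Expanding on that idea, here is a minimal sketch of how synthetic training data for a duplicate detector could be generated: random "records", half of which are near-duplicates with small perturbations. All sizes, noise levels, and names are hypothetical and only illustrate the approach:

import numpy as np

rng = np.random.default_rng(seed=0)
N_RECORDS, N_FEATURES = 1000, 10

# Base records: purely random feature vectors.
records = rng.random((N_RECORDS, N_FEATURES))

# Build labeled pairs: unrelated pairs get label 0,
# near-duplicates (a copy plus small noise) get label 1.
pairs, labels = [], []
for i in range(N_RECORDS):
    j = (i + rng.integers(1, N_RECORDS)) % N_RECORDS  # guaranteed j != i
    pairs.append((records[i], records[j]))
    labels.append(0)

    noisy_copy = records[i] + rng.normal(scale=0.01, size=N_FEATURES)
    pairs.append((records[i], noisy_copy))
    labels.append(1)

# A model (or a simple distance threshold) can then be trained and
# evaluated on these pairs to decide "duplicate" vs. "not duplicate".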

Preprocessing machine learning data [closed]

This may be a stupid question, but I am new to ML and can't seem to find a clear answer.
I have implemented a ML algorithm on a Python web app.
Right now I am storing the data that the algorithm uses in an offline CSV file, and every time the algorithm is run, it analyzes all of the data (one new piece of data gets added each time the algorithm is used).
Apologies if I am being too vague, but I am wondering how one should generally go about implementing the data and algorithm properly so that:
The data isn't stored in a CSV (Do I simply store it in a database like I would with any other type of data?)
Some form of preprocessing is used so that the ML algorithm doesn't have to analyze the same data repeatedly each time it is used (or does it have to given that one new piece of data is added every time the algorithm is used?).
The data isn't stored in a CSV (Do I simply store it in a database like I would with any other type of data?)
You can store it in whatever format you like. A database works just as well as a CSV and makes appending the one new record per run straightforward; a sketch follows below.
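For instance, a minimal sketch using Python's built-in sqlite3 module (the file, table, and column names are hypothetical placeholders for your own data):

import sqlite3

conn = sqlite3.connect("app_data.db")  # hypothetical database file
conn.execute(
    "CREATE TABLE IF NOT EXISTS samples (feature1 REAL, feature2 REAL, label REAL)"
)

# Append the one new record produced by each run of the algorithm.
conn.execute("INSERT INTO samples VALUES (?, ?, ?)", (0.1, 0.2, 1.0))
conn.commit()

# Load everything back when (re)training.
rows = conn.execute("SELECT feature1, feature2, label FROM samples").fetchall()
conn.close()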
Some form of preprocessing is used so that the ML algorithm doesn't have to analyze the same data repeatedly each time it is used (or does it have to given that one new piece of data is added every time the algorithm is used?).
This depends very much on what algorithm you use. Some algorithms can easily be implemented to learn in an incremental manner. For example, Linear/Logistic Regression implemented with Stochastic Gradient Descent could easily just run a quick update on every new instance as it gets added. For other algorithms, full re-trains are the only option (though you could of course elect not to always do them over and over again for every new instance; you could, for example, simply re-train once per day at a set point in time).
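As a concrete illustration of the incremental route, here is a minimal sketch with Scikit-Learn's SGDClassifier, which supports partial_fit; the model is persisted between runs with joblib (the file name, feature values, and class list are hypothetical):

import os
import joblib
import numpy as np
from sklearn.linear_model import SGDClassifier

MODEL_PATH = "model.joblib"  # hypothetical location for the persisted model

# Load the previously trained model, or start a fresh one.
if os.path.exists(MODEL_PATH):
    model = joblib.load(MODEL_PATH)
else:
    model = SGDClassifier()

# One new instance arrives; update the model without retraining from scratch.
x_new = np.array([[0.3, 0.7]])
y_new = np.array([1])
model.partial_fit(x_new, y_new, classes=np.array([0, 1]))

joblib.dump(model, MODEL_PATH)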

Multiclass MultiOutput Classification with both categorical and continous attribute without encoding in python [closed]

I'm working on a machine learning (data-mining) project; I'm done with the data exploration and data preparation steps, which were done in Python.
Now I'm facing this issue: I have categorical attributes in my dataset.
After some research I've found that the most appropriate algorithms for that kind of data are a decision tree or a random forest classifier.
But I've read some similar questions about decision trees and categorical attributes and found that the library I'm using (scikit-learn) doesn't work with categorical attributes (check here and here). To make it work with categorical data I need to encode my categorical variables into numerical ones, but I don't want to use encoding because I will lose some properties of my attributes and some information, according to this answer, and also some of my attributes have more than 100 different values.
So I want to know:
Is there any other Python library that can build decision trees with categorical data without any encoding?
In this answer it was suggested that other libraries like WEKA can build decision trees with categorical attributes, so my question is: can I combine two languages in the same machine learning project?
Would it work to do data exploration and preparation in Python, train the model in WEKA (Java), and deploy it in a Python-Flask web app?
Is that possible?
The answer you linked about encoding categorical inputs is just saying that you should avoid numerical encoding when your categories don't have an inherent order. It correctly recommends that you use a one-hot encoding in this case.
Simply put, machine learning models operate on numbers, so even if you find a library that takes your raw categories without explicit encoding, it will still have to internally encode them before it can perform any computation.
100 categories is not a lot, and most off-the-shelf libraries will handle such inputs just fine. I recommend you try xgboost.
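As a minimal sketch of the one-hot route recommended above, using scikit-learn (the column names and toy data are hypothetical placeholders for your own dataset):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical example data with one categorical and one numeric attribute.
X = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],
    "size": [1.2, 3.4, 2.1, 0.7],
})
y = [0, 1, 1, 0]

# One-hot encode the categorical column, pass the numeric one through.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["color"])],
    remainder="passthrough",
)

clf = Pipeline([("preprocess", preprocess), ("model", RandomForestClassifier())])
clf.fit(X, y)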

Loading Big Data on Python [closed]

I'm fairly new to data science and only started using Python roughly two months ago. I've been trying to do a Kaggle competition for fun (cats vs. dogs) to learn things along the way. However, I'm stuck at the very first step. The problem is that the training set contains about 25,000 .jpg images of cats and dogs, and the whole directory is approximately 800 MB in size. Whenever I try to load the directory into Python and save all the images in a matrix (say we have 100 images of size (300, 200); I would like to save them in a matrix of size 100 x 60000), I get either a memory error or the system just stops processing. I'm using Canopy on a Mac.
I've been trying to read a lot on the internet to find out how people deal with these big images, but it has been a week and I still couldn't find any good source. I would highly appreciate it if somebody helped me out or just sent me a link that describes the situation.
Here's the link to the Kaggle competition (you can see there are no prizes involved; it's just for the sake of learning):
https://www.kaggle.com/c/dogs-vs-cats/data
The question is: how do I load this big dataset into Python using Canopy and start training a neural network? Or, more generally, how do I deal with big datasets on a single computer without memory errors?
I would recommend making an index of the items that you wish to read (a directory listing). Then read just the first item, train on just that item, remove it from memory, move on to the next item, and repeat. You shouldn't need to have more than a few items in memory at any given time; a sketch follows below.
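A minimal sketch of that loop, assuming the images live in a local train/ directory and that your model exposes some incremental update step (train_on_batch below is a stand-in for whatever your framework provides, e.g. Keras' Model.train_on_batch):

import os
import numpy as np
from PIL import Image

TRAIN_DIR = "train"  # hypothetical path to the unpacked Kaggle training images

# Index of items to read: just the directory listing.
filenames = sorted(os.listdir(TRAIN_DIR))

for name in filenames:
    # Load and flatten a single image; only this one image is in memory.
    img = Image.open(os.path.join(TRAIN_DIR, name)).resize((200, 300))
    x = np.asarray(img, dtype=np.float32).reshape(1, -1) / 255.0
    # Assumes file names start with "cat" or "dog", as in the Kaggle training set.
    y = np.array([1.0 if name.startswith("cat") else 0.0])

    # model.train_on_batch(x, y)  # incremental update with this single example

    del img, x  # free this image before loading the next one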
