I'm fairly new to data science and only started using Python about two months ago. I've been trying the Kaggle Dogs vs. Cats competition for fun, to learn things along the way, but I'm stuck at the very first step. The training set contains about 25,000 .jpg images of cats and dogs, and the whole directory is roughly 800 MB. Whenever I try to load the directory into Python and store all the images in one matrix (say we have 100 images of size (300, 200); I would like to store them in a matrix of size 100 × 60000), I get either a memory error or the system just stops responding. I'm using Canopy on a Mac.
I've read a lot on the internet trying to find out how people deal with such large image sets, but after a week I still haven't found a good source. I would really appreciate it if somebody could help me out or point me to a link that covers this situation.
Here's the link to the Kaggle competition (you can see there are no prizes involved; it's just for the sake of learning):
https://www.kaggle.com/c/dogs-vs-cats/data
The question is: how do I load this large dataset into Python using Canopy and start training a neural network? Or, more generally, how do I deal with large datasets on a single computer without running into memory errors?
I would recommend making an index of the items you wish to read (a directory listing). Then read just the first item, train on just that item, remove it from memory, move on to the next item, and repeat. You shouldn't need more than a few items in memory at any given time.
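A minimal sketch of that loop, assuming the unzipped Kaggle images sit in a `train/` directory and using Pillow and NumPy (choices of mine, not part of the question):

```python
import os
import numpy as np
from PIL import Image

TRAIN_DIR = "train"  # assumed location of the unzipped Kaggle images

# Build the index first: just filenames, no pixel data in memory yet.
filenames = [f for f in os.listdir(TRAIN_DIR) if f.endswith(".jpg")]

for name in filenames:
    # The label comes from the filename, e.g. "cat.123.jpg" or "dog.456.jpg".
    label = 1 if name.startswith("dog") else 0

    # Load ONE image, resize to a fixed shape, flatten to a feature vector.
    img = Image.open(os.path.join(TRAIN_DIR, name)).convert("L")
    img = img.resize((200, 300))  # (width, height) -> a 300x200 pixel array
    x = np.asarray(img, dtype=np.float32).ravel() / 255.0  # 60000 features

    # train_on_one(x, label)  # hypothetical incremental update of your model

    del img, x  # drop the references so memory is freed before the next file
```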
I have a dataset composed of several large CSV files. Their total size is larger than the RAM of the machine the training runs on.
I need to train an ML model from Scikit-Learn, TensorFlow, or PyTorch (think SVR, not deep learning), and I need to use the whole dataset, which is impossible to load at once. Any recommendation on how to overcome this, please?
I have been in this situation before, and my suggestion would be to take a step back and look at the problem again.
Does your model absolutely need all of the data at once, or can it be trained in batches? It's also possible that the model you are using can be trained in batches but the library you are using does not support it. In that case, either find a library that does support batches or, if such a library does not exist (unlikely), "reinvent the wheel" yourself, i.e., implement the model from scratch with batch support. However, as your question mentions, you need to use a model from Scikit-Learn, TensorFlow, or PyTorch. So if you truly want to stick with those libraries, there are techniques such as the ones Alexey Larionov and I'mahdi mentioned in the comments on your question, in relation to PyTorch and TensorFlow.
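For example, with Scikit-Learn you could approximate an SVR-style regressor with `SGDRegressor` (a linear model; with the `epsilon_insensitive` loss it behaves like a linear SVR, not a kernel one) and feed it one CSV chunk at a time via pandas. The file name and column names below are assumptions:

```python
import pandas as pd
from sklearn.linear_model import SGDRegressor

# epsilon_insensitive loss makes this a linear analogue of SVR.
model = SGDRegressor(loss="epsilon_insensitive")

# Stream the CSV in chunks that fit comfortably in RAM.
for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
    X = chunk.drop(columns=["target"]).to_numpy()  # "target" is assumed
    y = chunk["target"].to_numpy()
    model.partial_fit(X, y)  # update the model on this chunk only
```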
Is all of your data actually relevant? Once I found that a whole subset of my data was useless for the problem I was trying to solve; another time I found that it was only marginally helpful. Dimensionality reduction, numerosity reduction, and statistical modeling may be your friends here. Here is a link to the Wikipedia page on data reduction:
https://en.wikipedia.org/wiki/Data_reduction
Not only will data reduction reduce the amount of memory you need, it can also improve your model: bad data in means bad data out.
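As a concrete illustration of out-of-core dimensionality reduction, Scikit-Learn's `IncrementalPCA` never needs the full dataset in memory. A sketch; the file name and column layout are assumptions carried over from the snippet above:

```python
import pandas as pd
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=20)  # number of components to keep (tune)

# First pass: learn the projection chunk by chunk.
for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
    ipca.partial_fit(chunk.drop(columns=["target"]).to_numpy())

# Second pass: project each chunk into the reduced space before training.
for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
    X_reduced = ipca.transform(chunk.drop(columns=["target"]).to_numpy())
    # ...feed X_reduced (and chunk["target"]) to your incremental model
```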
I'm trying to remove the line noise from this captcha so that I can implement an algorithm to read it. However, I'm having difficulty making it readable to an AI using techniques such as OpenCV thresholding combined with some features of PIL.Image. I also tried an algorithm to "chop" the image, which gave me better results, but still far from what I expected. I want to know if there is an alternative that removes noise from captchas like this one effectively.
(I'm using Python.)
Initially, the captcha looks like this (image omitted).
Once processed with OpenCV + Pillow, I got this (image omitted).
Later, using the "chop method", this is what I have (image omitted).
However, I need a better final image, and I think this combination of methods is not the right one. Is there a better alternative?
I think you could try minisom: https://github.com/JustGlowing/minisom
Self-organizing maps (SOMs) are a type of neural network that groups clusters of points in the data. With an appropriate threshold, a SOM could help you remove the lines that do not belong to the numbers/letters; combining that with the chop method could do the job.
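A rough sketch of that idea: treat each foreground pixel as a 2-D point, train a small SOM on those points so its neurons drift toward the dense character blobs, and discard pixels that end up far from every neuron (the thin noise lines tend to). The SOM size and the distance radius are guesses you would need to tune:

```python
import cv2
import numpy as np
from minisom import MiniSom

# Threshold the captcha so the characters and lines become foreground pixels.
img = cv2.imread("captcha.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Every foreground pixel becomes a 2-D data point (row, col).
points = np.argwhere(binary > 0).astype(float)

# A small SOM: its neurons gravitate toward the dense blobs (the characters).
som = MiniSom(1, 10, input_len=2, sigma=1.0, learning_rate=0.5)
som.train_random(points, 1000)

# Keep only pixels near their winning neuron; pixels on thin stray lines
# tend to sit far from any dense cluster. 15 px is an assumed radius.
cleaned = np.zeros_like(binary)
for p in points:
    w = som.get_weights()[som.winner(p)]
    if np.linalg.norm(p - w) < 15.0:
        cleaned[int(p[0]), int(p[1])] = 255

cv2.imwrite("cleaned.png", cleaned)
```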
I came across this research paper: http://www.cs.sjsu.edu/~pollett/papers/neural_net_plain.pdf
These researchers came up with a way to break character-based CAPTCHAs, and it seems they succeeded: they trained their CNN on 13 million captchas and achieved accuracies higher than 95%.
How can we make a CAPTCHA secure so that it isn't bypassed by a deep learning model?
First of all, captchas are meant to stop automated users/bots. Yes, if you have access to the actual captcha generator and you train a deep learning model on that distribution, chances are it will perform well.
Captchas are getting harder, and they can be made harder still. But generating captchas takes actual computational resources (unless they are random images rather than synthetic ones). If a truly bot-proof website is needed, it can be built.
By "bot" we usually mean web-scraping tools or automated users that try to behave like human users, only much faster. If you also integrate deep learning models into such a bot, it is possible to bypass the captchas in most cases, but that may be overkill, depending on your needs.
Protecting websites from bots is also, relatively speaking, a less important problem than facial recognition or self-driving cars, so less effort goes into hardening captchas.
Let's say I have a lot of .asm files in a Python program (they could also be binary strings, hex strings, or whatever you like). How can I use those files to generate new files that function roughly the same? (It's for an assembly game.)
The thing is, I have a lot of assembly players that were really good at the game, and I wondered if I could somehow use natural selection to breed better assembly bots.
This sounds a lot like superoptimization (Wikipedia).
e.g. STOKE starts with a sequence of asm instructions and stochastically modifies it, looking for shorter/faster sequences that do the same thing.
(Or STOKE can start from scratch, looking for an asm sequence that gives the desired result for a set of test cases.)
It's open source, so have a look at the algorithms it uses to modify asm and test-run the code. It's certainly possible if you have data structures that represent operands and opcodes.
See also Applying Genetic Programming to Bytecode and Assembly, an academic paper from 2014.
I haven't read it, but hopefully it addresses ways to recombine code from different mutations and get something useful more often than garbage that steps on the registers used by the other code. (That's the major trick with random changes to code, especially in assembly, where there are lots of non-useful instruction sequences.)
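For the bot-breeding part of your question, the skeleton is an ordinary genetic algorithm over instruction sequences. Everything below is hypothetical: the toy instruction set is made up, and `evaluate` stands in for however your game scores a bot:

```python
import random

# A made-up toy instruction set; each program is a list of
# (opcode, dst, src) tuples. Programs must be at least 2 instructions long,
# and you need at least 2 seed programs.
OPCODES = ["mov", "add", "sub", "cmp", "jmp"]
REGS = ["ax", "bx", "cx", "dx"]

def random_instr():
    return (random.choice(OPCODES), random.choice(REGS), random.choice(REGS))

def mutate(program, rate=0.05):
    # Point mutation: occasionally swap an instruction for a random one.
    return [random_instr() if random.random() < rate else ins
            for ins in program]

def crossover(a, b):
    # One-point crossover: a prefix of one parent plus a suffix of the other.
    # In raw assembly many offspring will clobber each other's registers;
    # that is exactly the garbage the paper above tries to avoid.
    cut = random.randint(1, min(len(a), len(b)) - 1)
    return a[:cut] + b[cut:]

def evolve(seed_programs, evaluate, generations=100, pop_size=50):
    # Seed the population with your existing good players.
    population = list(seed_programs)
    for _ in range(generations):
        ranked = sorted(population, key=evaluate, reverse=True)
        parents = ranked[: pop_size // 4]  # keep the top quarter
        population = parents[:]
        while len(population) < pop_size:
            a, b = random.sample(parents, 2)
            population.append(mutate(crossover(a, b)))
    return max(population, key=evaluate)
```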
This may be a stupid question, but I am new to ML and can't seem to find a clear answer.
I have implemented a ML algorithm on a Python web app.
Right now I am storing the data that the algorithm uses in an offline CSV file, and every time the algorithm is run, it analyzes all of the data (one new piece of data gets added each time the algorithm is used).
Apologies if I am being too vague, but I am wondering how one should generally set up the data and the algorithm so that:
The data isn't stored in a CSV (Do I simply store it in a database like I would with any other type of data?)
Some form of preprocessing is used so that the ML algorithm doesn't have to analyze the same data repeatedly each time it is used (or does it have to given that one new piece of data is added every time the algorithm is used?).
The data isn't stored in a CSV (Do I simply store it in a database like I would with any other type of data?)
You can store it in whatever format you like.
Some form of preprocessing is used so that the ML algorithm doesn't have to analyze the same data repeatedly each time it is used (or does it have to given that one new piece of data is added every time the algorithm is used?).
This depends very much on which algorithm you use. Some algorithms can easily be implemented to learn incrementally. For example, linear/logistic regression implemented with stochastic gradient descent can simply run a quick update on every new instance as it gets added. For other algorithms, a full re-train is the only option (though you don't have to re-train for every single new instance; you could, for example, simply re-train once per day at a set point in time).
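A sketch of the incremental route with Scikit-Learn's `SGDClassifier`; the model path and the shape of your feature vectors are placeholders, not a prescription:

```python
import joblib
import numpy as np
from sklearn.linear_model import SGDClassifier

MODEL_PATH = "model.joblib"  # assumed location for the persisted model

def bootstrap(X, y):
    """One-time initial fit on the data you already have in the CSV."""
    model = SGDClassifier(loss="log_loss")  # logistic regression via SGD
    model.partial_fit(X, y, classes=np.unique(y))  # classes needed first time
    joblib.dump(model, MODEL_PATH)

def update_and_predict(x_new, y_new=None):
    """Load the fitted model, optionally update it on ONE new instance."""
    model = joblib.load(MODEL_PATH)
    if y_new is not None:
        # One SGD step on this single instance; no re-scan of old data.
        model.partial_fit([x_new], [y_new])
        joblib.dump(model, MODEL_PATH)
    return model.predict([x_new])[0]
```

(Note: the loss is named "log" rather than "log_loss" in older Scikit-Learn versions.)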