Captcha Security and Deep Learning [closed] - python

I came across this research paper: http://www.cs.sjsu.edu/~pollett/papers/neural_net_plain.pdf.
The researchers describe a way to break character-based CAPTCHAs, and they appear to have succeeded: they trained a CNN on 13 million CAPTCHAs and achieved accuracies above 95%.
How can we make a CAPTCHA secure so that it isn't bypassed by a deep learning model?

First of all, CAPTCHAs are meant to stop automated users/bots. Yes, if you have the actual CAPTCHA generator and train a deep learning model on that distribution, chances are it will perform well.
CAPTCHAs are getting harder, and they can be made harder still. But generating them takes real computational resources (unless they are random images rather than synthetic ones). If a truly bot-proof website is needed, it can be built.
By "bot" we usually mean web-scraping tools or automated users that try to do things like human users, only much faster. If you also integrate deep learning models into them, it is possible to bypass CAPTCHAs in most cases, but that may be overkill depending on your needs.
Saving websites from bots is, relatively speaking, a lower-stakes problem than facial recognition or self-driving cars.
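To make that first point concrete, here is a minimal sketch, assuming the third-party `captcha` package (an assumption for illustration, not something from the cited paper): anyone who holds the generator can produce unlimited labeled training data for a CNN.

import random
import string
from captcha.image import ImageCaptcha  # third-party package; assumed available

generator = ImageCaptcha(width=160, height=60)

def labeled_sample():
    # The random 4-character text is both the CAPTCHA content and the training label.
    text = "".join(random.choices(string.ascii_uppercase + string.digits, k=4))
    return generator.generate_image(text), text  # (PIL image, ground-truth string)

# A few samples; the paper reportedly trained on millions of pairs like these.
dataset = [labeled_sample() for _ in range(10)]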

Related

Training a model with several large CSV files [closed]

I have a dataset composed of several large CSV files. Their total size is larger than the RAM of the machine on which the training is executed.
I need to train an ML model from scikit-learn, TensorFlow, or PyTorch (think SVR, not deep learning), and I need to use the whole dataset, which is impossible to load at once. Any recommendation on how to overcome this, please?
I have been in this situation before, and my suggestion would be to take a step back and look at the problem again.
Does your model absolutely need all of the data at once, or can it be trained in batches? It is also possible that the model you are using can be trained in batches but the library you are using does not support that. In that situation, either find a library that does support batches or, if such a library does not exist (unlikely), "reinvent the wheel" yourself, i.e., create the model from scratch and allow batches. However, as your question mentions, you need to use a model from scikit-learn, TensorFlow, or PyTorch. If you want to stick with those libraries, there are techniques such as the ones Alexey Larionov and I'mahdi mentioned in comments to your question in relation to PyTorch and TensorFlow; a sketch of one scikit-learn option follows below.
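Here is a hedged sketch of the batching idea (not the commenters' exact code): pandas reads each CSV in chunks and scikit-learn's SGDRegressor learns incrementally via partial_fit. The file pattern and the "target" column name are assumptions.

import glob
import pandas as pd
from sklearn.linear_model import SGDRegressor

# With an epsilon-insensitive loss, SGDRegressor behaves like a linear SVR
# but supports incremental training via partial_fit.
model = SGDRegressor(loss="epsilon_insensitive")

for path in glob.glob("data_*.csv"):          # hypothetical file names
    for chunk in pd.read_csv(path, chunksize=100_000):
        y = chunk.pop("target").to_numpy()    # hypothetical label column
        X = chunk.to_numpy()
        model.partial_fit(X, y)               # updates weights without loading everything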
Is all of your data actually relevant? Once, I found that a whole subset of my data was useless for the problem I was trying to solve; another time I found that it was only marginally helpful. Dimensionality reduction, numerosity reduction, and statistical modeling may be your friends here. Here is a link to a Wikipedia page about data reduction:
https://en.wikipedia.org/wiki/Data_reduction
Not only will data reduction reduce the amount of memory you need, it will also improve your model: bad data in means bad data out.
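As one concrete data-reduction option (a sketch under the same hypothetical file layout as above, not a prescription), scikit-learn's IncrementalPCA can fit a dimensionality reduction chunk by chunk, out of core:

import glob
import pandas as pd
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=20)        # 20 components is an arbitrary choice

for path in glob.glob("data_*.csv"):          # same hypothetical files as above
    for chunk in pd.read_csv(path, chunksize=100_000):
        ipca.partial_fit(chunk.drop(columns=["target"]).to_numpy())

# Afterwards, transform each chunk with ipca.transform(...) before training on it.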

Designing a game advice [closed]

I'm trying to design a game similar to Plague Inc.: there is a deadly virus (quite ironic) that is spreading, and the aim of the game is to stop it from spreading.
I've split the world into 13 regions, and each region will have several key details I need to track, such as the number of cases, the number of deaths, and the population. Some of these details will be dynamic; for example, the number of cases and deaths should go up or down over time.
I'm extremely new to Python and was hoping for some expertise on how to design this game. Any guidance on the best ways to represent this data would be much appreciated!
Hello Aran Khalastchi,
In my experience, Python is not really a graphical programming language; it is more of a text-based language. I wouldn't suggest Python as your go-to unless you are using a library for graphics. If not, I definitely recommend Unity or Godot, and if you want to go fully raw code (no engines/libraries) I recommend Java, as it has its own graphics. If I am wrong, please forgive me :)
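On the data-representation part of the question (which the answer above does not cover), here is a hedged sketch of one common approach: a dataclass per region, stored in a dictionary. The region names and numbers are made up for illustration.

from dataclasses import dataclass

@dataclass
class Region:
    name: str
    population: int
    cases: int = 0
    deaths: int = 0

    def update(self, new_cases: int, new_deaths: int) -> None:
        """Advance one game turn by changing the dynamic counters."""
        self.cases = max(0, self.cases + new_cases)
        self.deaths = max(0, self.deaths + new_deaths)

# Three of the thirteen regions, keyed by name for easy lookup.
regions = {name: Region(name, population=50_000_000)
           for name in ["Europe", "East Asia", "North America"]}
regions["Europe"].update(new_cases=120, new_deaths=3)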

How to train a ML Model for detecting duplicates without having data sets [closed]

I am currently working on my master's thesis. My topic is the development of a duplicate checking system in ABAP for a customer in the SAP environment. Up to now, the whole thing works in such a way that if the checking system does not know for certain whether an incoming data record is a duplicate, a clerk intervenes and makes the final decision. The clerk is to be "replaced by a machine learning model", located on a Python server, so that the clerk intervenes less in the process and becomes more and more of an "AI trainer". The communication between the duplicate checking system and the ML model is done via a REST API. The ML model should decide whether incoming data records are duplicates or not.
My first problem is that I don't have any training data to create an ML model. The second problem is that I still do not know exactly what structure my training data should have. It is possible to get client data records from the customer, but for various reasons this is quite unlikely. Is there a way to generate "synthetic data" to create an ML model? What could it look like for my application? Which tools could I use to make my work a little easier?
Kind regards
You can't.
If you don't have any real-world data, along with how humans classified it, then you cannot train an ML system to classify real-world data.
What you can do is train a system on data you generated and classified in a way you believe to be similar to what the program might encounter in the wild. This would let you try out your methodology and see whether it works at all. But if you want an ML model that is actually useful in the real world, you will need to repeat the training process with real-world data.
There is a lot of variability in what your data could be; you need to narrow down what it will actually look like.
You can always create random data:
>>> import numpy as np
>>> N = 5
>>> data = np.random.random((N, N))    # N random "records" with N features each
>>> data = np.vstack([data, data[0]])  # append a copy of the first record as a known duplicate
Create a whole bunch of random records, copy one of them, and then test whether your system can catch the duplicate.
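If rows of random numbers feel too far from real customer records, here is a hedged sketch using the third-party Faker package (an assumption for illustration, not a requirement) to generate synthetic records and inject labeled duplicates with small typos:

import random
from faker import Faker  # third-party package for fake names/addresses; assumed available

fake = Faker()

def record():
    return {"name": fake.name(), "street": fake.street_address(), "city": fake.city()}

records = [record() for _ in range(1000)]

labeled_pairs = []
for _ in range(200):
    original = random.choice(records)
    near_duplicate = dict(original)
    near_duplicate["name"] = near_duplicate["name"].replace("a", "e", 1)  # simulate a typo
    labeled_pairs.append((original, near_duplicate, 1))   # 1 = duplicate
for _ in range(200):
    a, b = random.sample(records, 2)
    labeled_pairs.append((a, b, 0))                       # 0 = not a duplicate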

Letter segmentation from simple CAPTCHA using K-means clustering in python [closed]

Context:
I'm trying to learn machine learning using Python 3. My intended goal is to create a CNN program that can guess a simple 4-letter, 72*24-pixel CAPTCHA image like the one below:
[CAPTCHA image displaying "VDF5"] This challenge was inspired by https://medium.com/@ageitgey/how-to-break-a-captcha-system-in-15-minutes-with-machine-learning-dbebb035a710, which I thought would be a great challenge for me to learn k-means clustering and CNNs.
Edit---
I see I was being too much of a "build me this" guy. Now that I've found scikit-learn, I'll try to learn it and apply that instead. Sorry for annoying you all.
It seems as if you are looking to build a machine learning algorithm for educational purposes. If so, import TensorFlow and get to it! However, since your question reads like "create this for me", you might be better off simply using an existing implementation from the scikit-learn package: import scikit-learn, make an instance of KNeighborsClassifier, train it, and boom, you've cracked this problem.
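For the segmentation step the question actually asks about, here is a minimal sketch (my own illustration, not the answerer's code) that clusters the coordinates of dark pixels with scikit-learn's KMeans to isolate the four letters; the file name and the ink threshold are assumptions.

import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

img = np.array(Image.open("captcha.png").convert("L"))   # hypothetical 72x24 grayscale image
ys, xs = np.nonzero(img < 128)                           # coordinates of "ink" pixels
coords = np.column_stack([xs, ys]).astype(float)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(coords)

# Sort clusters left to right by centroid x-position, then crop one letter per cluster.
letters = []
for label in np.argsort(kmeans.cluster_centers_[:, 0]):
    pts = coords[kmeans.labels_ == label]
    x0, x1 = int(pts[:, 0].min()), int(pts[:, 0].max()) + 1
    letters.append(img[:, x0:x1])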

Loading Big Data on Python [closed]

I'm fairly new to data science and started using Python roughly two months ago. I've been trying to do a Kaggle competition for fun (cats vs. dogs) to learn things along the way. However, I'm stuck at the very first step. The training set contains about 25,000 .jpg images of cats and dogs, and the whole directory is approximately 800 MB in size. Whenever I try to load the directory into Python and save all the images in a matrix (say we have 100 images of size (300, 200); I would like to save them in a matrix of size 100 x 60000), I get either a memory error or the system just stops processing. I'm using Canopy on a Mac.
I've been trying to read a lot on the internet to find out how people deal with these big images, but it has been a week and I still couldn't find any good source. I would highly appreciate it if somebody helped me out or just sent me a link that describes the situation.
Here's the link to the Kaggle competition (you can see there are no prizes involved; it's just for the sake of learning):
https://www.kaggle.com/c/dogs-vs-cats/data
The question is: how do I load this big dataset into Python using Canopy and start training a neural network? Or, more generally, how do I deal with big datasets on a single computer without a memory error?
I would recommend making an index of the items you wish to read (a directory listing). Then read just the first item, train on just that item, remove it from memory, move on to the next item, and repeat. You shouldn't need to have more than a few items in memory at any given time.
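A hedged sketch of that idea, assuming a flat directory of .jpg files readable with Pillow (the directory name and image size are placeholders); only one small batch ever sits in memory:

import os
import numpy as np
from PIL import Image

def image_batches(directory, batch_size=32, size=(200, 300)):
    paths = [os.path.join(directory, f) for f in sorted(os.listdir(directory))
             if f.endswith(".jpg")]
    for i in range(0, len(paths), batch_size):
        batch = []
        for path in paths[i:i + batch_size]:
            img = Image.open(path).convert("L").resize(size)   # grayscale, fixed size
            batch.append(np.asarray(img, dtype=np.float32).ravel() / 255.0)
        yield np.stack(batch)   # shape: (batch, 200 * 300) flattened pixels

# Usage: feed each batch to an incrementally trainable model instead of
# loading all 25,000 images at once.
for X in image_batches("train"):
    pass  # e.g. model.partial_fit(X, labels_for_this_batch)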
