python captcha decoder library - python

I need a Captcha decoder for python to read simple image captchas like the following picture:
Do you know of a library that can help me read this captcha?
If you don't know of a library for reading captchas, could you help me to read this (and others like this) with PIL?

I hope this captcha is not used anywhere.
Following is a dummy way to decode it. Basically what you need are the patterns from 0 to 9 as present in these captchas. From your examples, I have only the patterns for 0 3 4 5 7 8. Since everything is fixed on them, you know where to split each character. You also know each character is a number of fixed size and fixed font. If it also includes letters or more characters, but of fixed size and font, then the following code can be easily adapted.
What the code does is: a) load the patterns (I considered they are named n0.png, n1.png, ...); b) split the captcha in NUMS pieces; c) do a sum of squared differences between each pattern and each split number; d) decide that the the split number is the one with the smallest sum. It returns a list for each number, in order, present in the captcha. To obtain the initial patterns, you can uncomment the lines that save the split numbers, place a return after that piece, and adjust the file names.
import sys
from PIL import Image, ImageOps
PAT_SIZE = (8, 10)
NUMS = 3
FIRST_NUM_OFFSET = 5
NUM_OFFSET = (1, 3)
NUMBERS = []
for i in xrange(10):
try:
NUMBERS.append(Image.open('n%d.png' % i).load())
except IOError:
print "I do not know the pattern for the number %d." % i
NUMBERS.append(None)
def magic(fname):
captcha = ImageOps.grayscale(Image.open(fname))
im = captcha.load()
# Split numbers
num = []
for n in xrange(NUMS):
x1, y1 = (FIRST_NUM_OFFSET + n * (NUM_OFFSET[0] + PAT_SIZE[0]),
NUM_OFFSET[1])
num.append(captcha.crop((x1, y1, x1 + PAT_SIZE[0], y1 + PAT_SIZE[1])))
# If you want to save the split numbers:
#for i, n in enumerate(num):
# n.save('%d.png' % i)
def sqdiff(a, b):
if None in (a, b): # XXX This is here just to handle missing pattern.
return float('inf')
d = 0
for x in xrange(PAT_SIZE[0]):
for y in xrange(PAT_SIZE[1]):
d += (a[x, y] - b[x, y]) ** 2
return d
# Calculate a dummy sum of squared differences between the patterns
# and each number. We assume the smallest diff is the number in the
# "captcha".
result = []
for n in num:
n_sqdiff = [(sqdiff(p, n.load()), i) for i, p in enumerate(NUMBERS)]
result.append(min(n_sqdiff)[1])
return result
print magic(sys.argv[1])

I hope you are using it in good faith and you are not going to harm (/spam) anyone.
I won't write you the script nor forward you to an external plugin. But incase you are writing this by your own, this may help:
In case you are trying to decode a specific captcha pattern you should collect all chars (I saw from the examples you attached that it's only numbers so it shouldn't be alot of work).
Put all of the chars in one file and analyze it with PIL
Save in an array each char, its position and its meaning.
Get a Captcha image - Clear the background noise if necessary.
Split the Captcha image to char-sized and cross it through your self-made dictionary of chars.

It is a nice project to do for academic reasons, I was interested in this a while ago. You have a few options:
You write your own with the help from this site: [scrubbed dead link]
You use OpenCV to do the matching.
If think there was a dedicated libary for neural network image matching but i can't seem to find it.
Basically as the others said, you want to remove the noise, split into single chars and compare it using a chosen technique to the model chars.

Related

Improving the performance of a sliding-window fragment function in Python 3

I have a script in Python 3.6.8 which reads through a very large text file, where each line is an ASCII string drawn from the alphabet {a,b,c,d,e,f}.
For each line, I have a function which fragments the string using a sliding window of size k, and then increments a fragment counter dictionary fragment_dict by 1 for each fragment seen.
The same fragment_dict is used for the entire file, and it is initialized for all possible 5^k fragments mapping to zero.
I also ignore any fragment which has the character c in it. Note that c is uncommon, and most lines will not contain it at all.
def fragment_string(mystr, fragment_dict, k):
for i in range(len(mystr) - k + 1):
fragment = mystr[i:i+k]
if 'c' in fragment:
continue
fragment_dict[fragment] += 1
Because my file is so large, I would like to optimize the performance of the above function as much as possible. Could anyone provide any potential optimizations to make this function faster?
I'm worried I may be rate limited by the speed of Python loops, in which case I would need to consider dropping down into C/Cython.
Numpy may help in speeding up your code:
x = np.array([ord(c) - ord('a') for c in mystr])
filter = np.geomspace(1, 5**(k-1), k, dtype=int)
fragment_dict = collections.Counter(np.convolve(x, filter,mode='valid'))
The idea is, represent each k length segment is a k-digit 5-ary number. Then, converting a list of 0-5 integers equivalent to the string to its 5-ary representation is equivalent to applying a convolution with [1,5,25,125,...] as filter.

How to use a regex to search for contiguous incrementing sequences

I would like to use regex to increase the speed of my searches for specific records within a large binary image. It seems like regex searches always outperform my own search methods, so that's why I'm looking into this. I have already implemented the following, which works, but is not very fast.
My binary image is loaded into a Numpy memmap as words.
I_FILE = np.memmap(opts.image_file, dtype='uint32', mode='r')
And here is start of my search loop currently (which works):
for i in range(0, FILESIZE - 19):
if (((I_FILE[i] + 1 == I_FILE[i + 19]) or (I_FILE[i - 19] + 1 == I_FILE[i])) and I_FILE[i] < 60):
...do stuff...
This is seeking out records that are 19 bytes long that start with a decimal sequence number between 0 and 59. It looks for an incrementing sequence on either a record before or after the current search location to validate the record.
I've seen a few examples where folks have crafted variables into string using re.escape (like this: How to use a variable inside a regular expression?) But I can't seem to figure out how to search for a changing value sequence.
I managed to make it work with regex, but it was a bit more complicated than I expected. The regex expressions look for two values between 0 and 59 that are separated by 72 bytes (18 words). I used two regex searches to ensure that I wouldn't miss records at the end of a sequence:
# First search uses the lookahead assertion to not consume large amounts of data.
SearchPattern1 = re.compile(b'[\0-\x3B]\0\0\0(?=.{72}[\1-\x3B]\0\0\0)', re.DOTALL)
# Again using the positive lookbehind assertion (?<= ... ) to grab the ending entries.
SearchPattern2 = re.compile(b'(?<=[\0-\x3B]\0\0\0.{72})[\1-\x3B]\0\0\0', re.DOTALL)
Next, perform both searches and combine the results.
HitList1 = [m.start(0) for m in SearchPattern1.finditer(I_FILE)]
HitList2 = [m.start(0) for m in SearchPattern2.finditer(I_FILE)]
AllHitList = list(set(HitList1 + HitList2))
SortedHitList = sorted(AllHitList)
Now I run a search that has the same conditions as my original solution, but it runs on a much smaller set of data!
for i in range(0, len(SortedHitList)):
TestLoc = SortedHitList[i]
if (I_FILE[TestLoc] + 1 == I_FILE[TestLoc + 19]) or (I_FILE[TestLoc - 19] + 1 == I_FILE[TestLoc]):
... do stuff ...
The result was very successful! The original solution took 58 seconds to run on a 300 MB binary file, while the new regex solution took only 2 seconds!!

parsing scientific publication page ranges in python

I need to parse a set of strings that contain page ranges as they appear in metadata of scientific and other publications. I don't have a complete spec of the pagination format, and I am not even sure if one exists, but examples of strings I need to process are:
6-10, 19-22
xlvii-xlviii
111S-2S
326
A078-132
XC-CIII
Ideally, I'd like to return the number of pages for each string, e.g. 9 for 6-10, 19-22. If that's too hard, at least whether it's a single page or more. The latter is pretty easy actually since commas and dashes seem to be the only delimiters in the examples I've seen so far. But I do very much prefer to get the right count.
I can write my own parser but I am curious whether there are any existing packages that can already do this out of the box or with minimal mods.
Here's a solution that supports parsing "normal" numbers as well as roman numerals. For parsing roman numerals, install the roman package (easy_install roman). You can enhance the parse_num function to support additional formats.
import roman
def parse_num(p):
p = p.strip()
try:
return roman.fromRoman(p.upper())
except:
return int(p)
def parse_pages(s):
count = 0
for part in s.split(','):
rng = part.split('-', 1)
a, b = parse_num(rng[0]), parse_num(rng[-1])
count += b - a + 1
return count
>>> parse_pages('17')
1
>>> parse_pages('6-10, 19-22')
9
>>> parse_pages('xlvii-xlviii')
2
>>> parse_pages('XC-CIII')
14

Python encryption program (ECC)

I'm trying to code an encryption program that will encode a user's Inputed - if that's a word - string. The encryption method is just a basic use of an elliptic curve encryption and I am currently working on the encryption part of the program at the moment before I work on the mathematical, inverse modules etc. Etc. Required for public and private key calculations. Currently I am using the key pub = 5 and a max value (derived from the product of 2 random primes) of 91. This is all the information needed and the word I am testing the encryption on is 'happy'.
So far here is the code.
word = 'happy'
pub = 5
m = 91
for i in range(pub):
if i == 0:
word = word
else:
word = output
for x in word:
a = [(((ord(z)*ord(z))+1)/m) for z in word]
b = [chr(i) for i in a]
c = [str(i) for i in b]
d = ''.join([str(i) for i in c])
output = d
What I am trying to do is encrypt each letter by multiplying the ASCII value it belongs too by itself and then use the chr() function to rejoin the string after a process of adding 1 then dividing by m , thus creating a new word. Then, using that new string, set it as the value of word for the next cycle in the loop, so the process continues until it has finished pub amount of times and encrypted the word. I'm having a lot of difficulties with this and I don't know where to start with explaining the issues. I'm relatively new to Python and any suggestions and/or advice on completing this fast would be very much appreciated. Thank you in advance.
First, check that your math is right. Your formula (z**2 + 1)/m grows quadratically. My understanding of crypto is quite limited, but it doesn't look right to me. It should be some kind of one-to-one mapping from input to output. But it maps several neighboring characters to the same output. Also, the results grow with every round.
You can only convert the integers back to ascii characters for a range up to 256. That's what your error message says. It's proably thrown in the second iteration of your outer for loop.
You probably need to get the value range down to 256 again.
I suppose you miss a crucial part off the algorithm you are trying to implement, maybe some modulo operation.
Also some minor python hints:
You can use the built in power operator **, so you don't have to evaluate ord() twice.
((ord(z) ** 2) + 1) / m
You can do the conversion back to the string in one step like this:
output = ''.join([str(chr(i)) for i in a])

Project Euler #13 in Python, trying to find smart solution

I'm trying to solve problem 13 from Euler project, and I'm trying to make the solution beautiful (at least, not ugly). Only "ugly thing" I do is that I'm pre-formating the input and keep it in the solution file (due to some technical reasons, and 'cause I want to concentrate on numeric part of problem)
The problem is "Work out the first ten digits of the sum of the following one-hundred 50-digit numbers."
I wrote some code, that should work, as far as I know, but it gives wrong result. I've checked input several times, it seems to be OK...
nums=[37107287533902102798797998220837590246510135740250,
46376937677490009712648124896970078050417018260538,
74324986199524741059474233309513058123726617309629,
91942213363574161572522430563301811072406154908250,
23067588207539346171171980310421047513778063246676,
89261670696623633820136378418383684178734361726757,
28112879812849979408065481931592621691275889832738,
44274228917432520321923589422876796487670272189318,
47451445736001306439091167216856844588711603153276,
70386486105843025439939619828917593665686757934951,
62176457141856560629502157223196586755079324193331,
64906352462741904929101432445813822663347944758178,
92575867718337217661963751590579239728245598838407,
58203565325359399008402633568948830189458628227828,
80181199384826282014278194139940567587151170094390,
35398664372827112653829987240784473053190104293586,
86515506006295864861532075273371959191420517255829,
71693888707715466499115593487603532921714970056938,
54370070576826684624621495650076471787294438377604,
53282654108756828443191190634694037855217779295145,
36123272525000296071075082563815656710885258350721,
45876576172410976447339110607218265236877223636045,
17423706905851860660448207621209813287860733969412,
81142660418086830619328460811191061556940512689692,
51934325451728388641918047049293215058642563049483,
62467221648435076201727918039944693004732956340691,
15732444386908125794514089057706229429197107928209,
55037687525678773091862540744969844508330393682126,
18336384825330154686196124348767681297534375946515,
80386287592878490201521685554828717201219257766954,
78182833757993103614740356856449095527097864797581,
16726320100436897842553539920931837441497806860984,
48403098129077791799088218795327364475675590848030,
87086987551392711854517078544161852424320693150332,
59959406895756536782107074926966537676326235447210,
69793950679652694742597709739166693763042633987085,
41052684708299085211399427365734116182760315001271,
65378607361501080857009149939512557028198746004375,
35829035317434717326932123578154982629742552737307,
94953759765105305946966067683156574377167401875275,
88902802571733229619176668713819931811048770190271,
25267680276078003013678680992525463401061632866526,
36270218540497705585629946580636237993140746255962,
24074486908231174977792365466257246923322810917141,
91430288197103288597806669760892938638285025333403,
34413065578016127815921815005561868836468420090470,
23053081172816430487623791969842487255036638784583,
11487696932154902810424020138335124462181441773470,
63783299490636259666498587618221225225512486764533,
67720186971698544312419572409913959008952310058822,
95548255300263520781532296796249481641953868218774,
76085327132285723110424803456124867697064507995236,
37774242535411291684276865538926205024910326572967,
23701913275725675285653248258265463092207058596522,
29798860272258331913126375147341994889534765745501,
18495701454879288984856827726077713721403798879715,
38298203783031473527721580348144513491373226651381,
34829543829199918180278916522431027392251122869539,
40957953066405232632538044100059654939159879593635,
29746152185502371307642255121183693803580388584903,
41698116222072977186158236678424689157993532961922,
62467957194401269043877107275048102390895523597457,
23189706772547915061505504953922979530901129967519,
86188088225875314529584099251203829009407770775672,
11306739708304724483816533873502340845647058077308,
82959174767140363198008187129011875491310547126581,
97623331044818386269515456334926366572897563400500,
42846280183517070527831839425882145521227251250327,
55121603546981200581762165212827652751691296897789,
32238195734329339946437501907836945765883352399886,
75506164965184775180738168837861091527357929701337,
62177842752192623401942399639168044983993173312731,
32924185707147349566916674687634660915035914677504,
99518671430235219628894890102423325116913619626622,
73267460800591547471830798392868535206946944540724,
76841822524674417161514036427982273348055556214818,
97142617910342598647204516893989422179826088076852,
87783646182799346313767754307809363333018982642090,
10848802521674670883215120185883543223812876952786,
71329612474782464538636993009049310363619763878039,
62184073572399794223406235393808339651327408011116,
66627891981488087797941876876144230030984490851411,
60661826293682836764744779239180335110989069790714,
85786944089552990653640447425576083659976645795096,
66024396409905389607120198219976047599490197230297,
64913982680032973156037120041377903785566085089252,
16730939319872750275468906903707539413042652315011,
94809377245048795150954100921645863754710598436791,
78639167021187492431995700641917969777599028300699,
15368713711936614952811305876380278410754449733078,
40789923115535562561142322423255033685442488917353,
44889911501440648020369068063960672322193204149535,
41503128880339536053299340368006977710650566631954,
81234880673210146739058568557934581403627822703280,
82616570773948327592232845941706525094512325230608,
22918802058777319719839450180888072429661980811197,
77158542502016545090413245809786882778948721859617,
72107838435069186155435662884062257473692284509516,
20849603980134001723930671666823555245252804609722,
53503534226472524250874054075591789781264330331690]
result_sum = []
tmp_sum = 0
for j in xrange(50):
for i in xrange(100):
tmp_sum += nums[i] % 10
nums[i] =nums[i] / 10
result_sum.insert(0,int(tmp_sum % 10))
tmp_sum = tmp_sum / 10
for i in xrange(10):
print result_sum[i]
Your code works by adding all the numbers in nums like a person would: adding column by column. Your code does not work because when you are summing the far left column, you treat it like every other column. Whenever people get to the far left, they write down the entire sum. So this line
result_sum.insert(0,int(tmp_sum % 10))
doesn't work for the far left column; you need to insert something else into result_sum in that case. I would post the code, but 1) I'm sure you don't need it, and 2) it's agains the Project-Euler tag rules. If you would like, I can email it to you, but I'm sure that won't be necessary.
You could save the numbers in a file (with a number on each line), and read from it:
nums = []
with open('numbers.txt', 'r') as f:
for num in f:
nums.append(int(num))
# nums is now populated with all of the numbers, so do your actual algorithm
Also, it looks like you want to store the sum as an array of digits. The cool thing about Python is that it automatically handles large integers. Here is a quote from the docs:
Plain integers (also just called integers) are implemented using long in C, which gives them at least 32 bits of precision (sys.maxint is always set to the maximum plain integer value for the current platform, the minimum value is -sys.maxint - 1). Long integers have unlimited precision.
So using an array of digits isn't really necessary if you are working with Python. In C, it is another story...
Also, regarding your code, you need to factor in the digits in tmp_sum, which contains your carry-over digits. You can add them into result_sum like this:
while tmp_sum:
result_sum.insert(0,int(tmp_sum % 10))
tmp_sum /= 10
This will fix your issue. Here, it works.
Since you already have all the numbers in a list, you should be able to take the sum of them pretty easily. Then you just need to take the first ten digits of the sum. I won't put any code here, though.
As Simple as this :
Values.txt will contain all digits.
nums = []
with open("values.txt",'r') as f:
for num in f:
nums.append(int(num))
print(str(sum(nums))[:10])
Just as easy is storing it in csv and using pandas:
def foo():
import pandas as pd
table = pd.read_csv("data.txt", header = None, usecols = [0])
and then iterate through panda dataframe:
sum = 0
for x in range(len(table)):
sum += int(table[0][x])
return str(sum)[:10]
just keep in mind that Python handles the large digits for you.

Categories