parsing scientific publication page ranges in python

I need to parse a set of strings that contain page ranges as they appear in metadata of scientific and other publications. I don't have a complete spec of the pagination format, and I am not even sure if one exists, but examples of strings I need to process are:
6-10, 19-22
xlvii-xlviii
111S-2S
326
A078-132
XC-CIII
Ideally, I'd like to return the number of pages for each string, e.g. 9 for 6-10, 19-22. If that's too hard, at least whether it's a single page or more. The latter is pretty easy actually since commas and dashes seem to be the only delimiters in the examples I've seen so far. But I do very much prefer to get the right count.
I can write my own parser but I am curious whether there are any existing packages that can already do this out of the box or with minimal mods.

Here's a solution that supports parsing "normal" numbers as well as roman numerals. For parsing roman numerals, install the roman package (pip install roman). You can enhance the parse_num function to support additional formats.
import roman
def parse_num(p):
    p = p.strip()
    try:
        return roman.fromRoman(p.upper())
    except roman.InvalidRomanNumeralError:
        # not a roman numeral, so treat it as a plain integer
        return int(p)
def parse_pages(s):
    count = 0
    for part in s.split(','):
        rng = part.split('-', 1)
        # a single page yields a one-element list, so rng[0] and rng[-1] coincide
        a, b = parse_num(rng[0]), parse_num(rng[-1])
        count += b - a + 1
    return count
>>> parse_pages('17')
1
>>> parse_pages('6-10, 19-22')
9
>>> parse_pages('xlvii-xlviii')
2
>>> parse_pages('XC-CIII')
14
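The two remaining examples from the question, 111S-2S and A078-132, are not handled by parse_num above. Here is a minimal sketch of one possible extension, using only the standard library. It rests on two guesses that are not a documented pagination rule: letters attached to a number are a prefix or suffix that can be ignored when counting, and an end page with fewer digits than the start page is abbreviated, so 111S-2S means pages 111 to 112. The names roman_to_int, range_length and count_pages are made up for this sketch.

```python
import re

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(text):
    """Convert a roman numeral such as 'xlvii' to an int (no strict validation)."""
    text = text.upper()
    if not re.fullmatch(r"[IVXLCDM]+", text):
        raise ValueError("not a roman numeral: %r" % text)
    total, prev = 0, 0
    for ch in reversed(text):
        value = ROMAN_VALUES[ch]
        # a smaller value standing before a larger one is subtracted (IV = 4)
        total += -value if value < prev else value
        prev = value
    return total


def range_length(part):
    """Number of pages in one 'start-end' token or single-page token."""
    first, _, last = part.strip().partition("-")
    first = first.strip()
    last = last.strip() or first          # a single page is a range of length 1
    d1, d2 = re.search(r"\d+", first), re.search(r"\d+", last)
    if d1 is None and d2 is None:
        a, b = roman_to_int(first), roman_to_int(last)
    elif d1 is None or d2 is None:
        raise ValueError("cannot mix roman and arabic pages: %r" % part)
    else:
        s1, s2 = d1.group(), d2.group()
        a, b = int(s1), int(s2)
        if b < a and len(s2) < len(s1):
            # ASSUMPTION: '111-2' is an abbreviation of '111-112'
            b = int(s1[:len(s1) - len(s2)] + s2)
    if b < a:
        raise ValueError("end page before start page: %r" % part)
    return b - a + 1


def count_pages(s):
    """Total page count of a comma-separated list of pages and ranges."""
    return sum(range_length(part) for part in s.split(","))


for example in ["6-10, 19-22", "xlvii-xlviii", "111S-2S", "326", "A078-132", "XC-CIII"]:
    print(example, "->", count_pages(example))
```

Whether A078-132 really means 55 pages depends on the publisher, so it is worth checking the results against a few known records before trusting the counts.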

Related

Numeric ID to very short unique strings

I have rather long IDs 1000000000109872 and would like to represent them as strings.
However all the libraries for Rust I've found such as hash_ids and block_id produce strings that are way bigger.
Ideally I'd like 4 to maybe 5 characters, numbers are okay but only uppercase letters. Doesn't need to be cryptographically secure as long as it's unique.
Is there anything that fits my needs?
I've tried this website: https://v2.cryptii.com/decimal/base64 and for 1000000000109872 I get 4rSw, this is very short which is great. But it's not uppercase.

This is the absolute best you can do if you want to guarantee no collisions without having any specific guarantees on the range of the inputs beyond "unsigned int" and you want it to be stateless:
def base_36(n: int) -> str:
    if not isinstance(n, int):
        raise TypeError("Check out https://mypy.readthedocs.io/")
    if n < 0:
        raise ValueError("IDs must be non-negative")
    if n < 10:
        return str(n)
    if n < 36:
        return chr(n - 10 + ord('A'))
    return base_36(n // 36) + base_36(n % 36)
print(base_36(1000000000109872)) # 9UGXNOTWDS
If you're willing to avoid collisions by keeping track of id allocations, you can of course do much better:
ids: dict[int, int] = {}
def stateful_id(n: int) -> str:
    return base_36(ids.setdefault(n, len(ids)))
print(stateful_id(1000000000109872)) # 0
print(stateful_id(1000000000109454)) # 1
print(stateful_id(1000000000109872)) # 0
or if some parts of the ID can be safely truncated:
MAGIC_NUMBER = 1000000000000000
def truncated_id(n: int) -> str:
    if n < MAGIC_NUMBER:
        raise ValueError(f"IDs must be >= {MAGIC_NUMBER}")
    return base_36(n - MAGIC_NUMBER)
print(truncated_id(1000000000109872)) # 2CS0
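A short ID is only useful if it can be mapped back to the original. The sketch below checks the round trip for the stateless and the truncated variants. It uses an iterative encoder (encode_base36, a name made up here) and Python's built-in int(s, 36) for decoding; OFFSET plays the role of MAGIC_NUMBER above.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
OFFSET = 1000000000000000  # same value as MAGIC_NUMBER in the answer


def encode_base36(n):
    """Encode a non-negative int using digits and uppercase letters."""
    if n < 0:
        raise ValueError("IDs must be non-negative")
    if n == 0:
        return "0"
    out = []
    while n:
        n, remainder = divmod(n, 36)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


def decode_base36(s):
    """Inverse of encode_base36; int() accepts any base from 2 to 36."""
    return int(s, 36)


original = 1000000000109872
full = encode_base36(original)              # stateless, 10 characters
short = encode_base36(original - OFFSET)    # truncated variant

print(full, decode_base36(full) == original)
print(short, decode_base36(short) + OFFSET == original)
```

The truncated form stays at 4 characters only while the part above OFFSET is smaller than 36^4 = 1,679,616.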

Short Answer: Impossible.
Long Answer: You're asking to represent about 10^16 distinct IDs with only 36^5 possible codes (5 characters, each an uppercase letter or a digit).
Each such character can take one of 36 values (10 digits + 26 letters). But 36^5 = 60,466,176 is less than 10^8, far short of 10^16, so it cannot work.
Since 36^10 < 10^16 < 36^11, you'll need at least 11 such characters to represent IDs of up to 10^16.

As you already stated that there is even a checksum inside the original ID, I assume the new representation should contain all of its data.
In this case, your question is strongly related to lossless compression and information content.
Information content says that every data contains a certain amount of information. Information can be measured in bits.
The sad news is that no matter what, you cannot magically reduce your data to fewer bits. It will always keep the same number of bits. You can only change the representation to store those bits as compactly as possible, but you cannot reduce their number.
You might think of jpg or compressed movies, that are stored very compact; the problem there is they are lossy. They discard information not perceived by the human eye/ear and just delete them.
In your case, there is no trickery possible. You will always have a smallest and a largest ID that you handed out. And all the IDs between your smallest and largest ID have to be distinguishable.
Now some math. If you know the amount of possible states of your data (e.g. the amount of distinguishable IDs), you can compute the required information content like this: log2(N), where N is the number of possible states.
So let's say you have 1000000 different IDs, that would mean you need log2(1000000) = 19.93 bits to represent those IDs. You will never be able to reduce this number to anything less.
Now to actually represent them: You say you want to store them in a string made of 26 different uppercase letters or 10 different digits. This is called a base36 encoding.
Each digit of this can carry log2(36) = 5.17 bits of information. Therefore, to store your 1000000 different IDs, you need at least 19.93/5.17 = 3.85 digits, which rounds up to 4.
This is exactly what @Samwise's answer shows you. His answer is the mathematically optimal way to encode this. You will never get better than his answer. And the number of digits will always grow as the number of possible IDs you want to represent grows. There's just no mathematical way around that.
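The arithmetic above can be checked with a few lines of Python. This is only a sketch: min_digits is a name made up here, and it counts with exact integers rather than with log2 so that rounding cannot affect the result.

```python
def min_digits(n_states, base=36):
    """Smallest number of base-`base` digits that can tell n_states values apart."""
    if n_states < 1:
        raise ValueError("need at least one state")
    digits, capacity = 1, base
    while capacity < n_states:
        capacity *= base
        digits += 1
    return digits


print(min_digits(1000000))   # the 1000000 IDs from the example
print(min_digits(10 ** 16))  # every value below 10^16
print(min_digits(36 ** 5))   # the most that 5 characters can hold
```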

How to count cells containing numbers in specific range with cells that contain both text and numbers

I thought I could easily sort this issue out but it took me ages to solve just half of it.
I have a table that contains 100 data cells in a row. Data in each cell are either text-only or text and numbers (see layout at bottom).
I need a function that COUNTs how many cells are present in the table that report the value of N2 OR E to be >=37.
Negative
Positive (N2: 23, E: 23)
Negative
Positive (N2: 37, E: 26)
Positive (N2: 31, E: 38)
Function answer: 2
So far I could only extract each N2 number with a function [=MID(A2,15,FIND(",",A2)-15)] that considers the 15th character, then a second function counts how many extracted numbers (they have been extracted in B row) are >=37, [=COUNTIF(B2:B100, ">=37")] but have not a clue on how to take the E value into account.
In addition, I would like the function to consider cells containing the N2 value OR the E value >=37.
Is there the chance to have one big function that does that? Is there the chance not to rely on KUTools for Excel?

If you have the newest version of Excel, you can use FILTERXML after making some minor changes. First concatenate the whole range with CONCAT, then eliminate all ","s and replace ")"s with spaces in the concatenated string.
For example, the below gets you all the instances over 36 (if you only want the number of times, wrap it in a COUNT):
=FILTERXML("<t><s>"&SUBSTITUTE(
SUBSTITUTE(SUBSTITUTE(CONCAT($F$2:$F$7), ")", " "), ",", ""), " ",
"</s><s>")&"</s></t>", "//s[number()>=37]")
For more info on dealing with strings, see here.
EDIT: Thanks @MarkBalhoff for catching a missing space in the formula and
@JvdV for giving another way with =IFERROR(COUNT(FILTERXML("<t><s>"&SUBSTITUTE(TEXTJOIN(" ",,F2:F6)," ","</s><s>")&"</s></t>","//s[translate(.,',','')*1>=37 or translate(following::*[2],')','')*1>=37]")),0)

Since you include the python tag and also reference KU-Tools, I assume you have some familiarity with VBA.
You could easily, and flexibly, implement the logic in Excel VBA using regular expressions.
For this function, I allowed three arguments:
The range to search
The threshold for the values
A list of values to look for
In the regex, the pattern looks for the digits that follow either of the strings in "searchFor". Note that, as written, you need to include the colons in the searchFor strings, and that the strings are case-sensitive (easily changed).
Option Explicit
Function CountVals(r As Range, Threshold As Long, ParamArray searchFor() As Variant) As Long
    Dim RE As Object, MC As Object, M As Object
    Dim counter As Long
    Dim vSrc As Variant, v As Variant
    Dim sPat As String
    'read range into variant array for fastest processing
    vSrc = r
    'create Pattern
    sPat = "(?:" & Join(searchFor, "|") & ")\s*(\d+)"
    'initialize regex
    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Global = True
        .ignorecase = False 'or change to true if capitalization not important
        .Pattern = sPat
        counter = 0
        'check each string for the values
        For Each v In vSrc
            Set MC = .Execute(v)
            For Each M In MC
                If CLng(M.submatches(0)) >= Threshold Then
                    counter = counter + 1
                    Exit For 'count each cell at most once, even if both values qualify
                End If
            Next M
        Next v
        CountVals = counter
    End With
End Function
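Since the question also carries the python tag, the same logic can be sketched in Python with the standard re module. This is an illustration only: count_cells and its parameters are names invented here, and the cell texts are assumed to have been read into a list of strings already.

```python
import re


def count_cells(cells, threshold=37, labels=("N2:", "E:")):
    """Count the cells in which at least one labelled number is >= threshold."""
    # Same idea as the VBA pattern: one of the labels, optional spaces, then digits.
    pattern = re.compile(
        "(?:" + "|".join(re.escape(label) for label in labels) + r")\s*(\d+)"
    )
    return sum(
        any(int(m.group(1)) >= threshold for m in pattern.finditer(cell))
        for cell in cells
    )


cells = [
    "Negative",
    "Positive (N2: 23, E: 23)",
    "Negative",
    "Positive (N2: 37, E: 26)",
    "Positive (N2: 31, E: 38)",
]
print(count_cells(cells))
```

Using any() per cell means a cell where both N2 and E reach the threshold is still counted once.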

How to use a regex to search for contiguous incrementing sequences

I would like to use regex to increase the speed of my searches for specific records within a large binary image. It seems like regex searches always outperform my own search methods, so that's why I'm looking into this. I have already implemented the following, which works, but is not very fast.
My binary image is loaded into a Numpy memmap as words.
I_FILE = np.memmap(opts.image_file, dtype='uint32', mode='r')
And here is start of my search loop currently (which works):
for i in range(0, FILESIZE - 19):
    if (((I_FILE[i] + 1 == I_FILE[i + 19]) or (I_FILE[i - 19] + 1 == I_FILE[i])) and I_FILE[i] < 60):
        ...do stuff...
This is seeking out records that are 19 words (76 bytes) long and start with a decimal sequence number between 0 and 59. It looks for an incrementing sequence in the record either before or after the current search location to validate the record.
I've seen a few examples where folks have crafted variables into string using re.escape (like this: How to use a variable inside a regular expression?) But I can't seem to figure out how to search for a changing value sequence.

I managed to make it work with regex, but it was a bit more complicated than I expected. The regex expressions look for two values between 0 and 59 that are separated by 72 bytes (18 words). I used two regex searches to ensure that I wouldn't miss records at the end of a sequence:
# First search uses the lookahead assertion to not consume large amounts of data.
SearchPattern1 = re.compile(b'[\0-\x3B]\0\0\0(?=.{72}[\1-\x3B]\0\0\0)', re.DOTALL)
# Again using the positive lookbehind assertion (?<= ... ) to grab the ending entries.
SearchPattern2 = re.compile(b'(?<=[\0-\x3B]\0\0\0.{72})[\1-\x3B]\0\0\0', re.DOTALL)
Next, perform both searches and combine the results.
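# The patterns are byte patterns, so m.start(0) is a byte offset. Keep only the
# word-aligned hits and convert them to uint32 indices before indexing I_FILE.
HitList1 = [m.start(0) // 4 for m in SearchPattern1.finditer(I_FILE) if m.start(0) % 4 == 0]
HitList2 = [m.start(0) // 4 for m in SearchPattern2.finditer(I_FILE) if m.start(0) % 4 == 0]
AllHitList = list(set(HitList1 + HitList2))
SortedHitList = sorted(AllHitList)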
Now I run a search that has the same conditions as my original solution, but it runs on a much smaller set of data!
for i in range(0, len(SortedHitList)):
    TestLoc = SortedHitList[i]
    if (I_FILE[TestLoc] + 1 == I_FILE[TestLoc + 19]) or (I_FILE[TestLoc - 19] + 1 == I_FILE[TestLoc]):
        ... do stuff ...
The result was very successful! The original solution took 58 seconds to run on a 300 MB binary file, while the new regex solution took only 2 seconds!!
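To see the idea in isolation, here is a small self-contained sketch that runs the forward-looking pattern over a synthetic buffer built with struct instead of a memmap. It assumes little-endian uint32 words, and find_records, FILLER and the test data are invented for the illustration. It also shows the conversion from the byte offsets reported by the regex to word indices.

```python
import re
import struct

RECORD_WORDS = 19                    # record length in uint32 words
GAP_BYTES = (RECORD_WORDS - 1) * 4   # 72 bytes between two sequence words

# A word holding 0..59, followed 72 bytes later by a word holding 1..59
# (little-endian layout assumed: low byte first, then three zero bytes).
CANDIDATE = re.compile(
    rb"[\x00-\x3b]\x00\x00\x00(?=.{%d}[\x01-\x3b]\x00\x00\x00)" % GAP_BYTES,
    re.DOTALL,
)


def find_records(words):
    """Return word indices of records whose successor has sequence number + 1."""
    data = struct.pack("<%dI" % len(words), *words)
    hits = []
    for m in CANDIDATE.finditer(data):
        if m.start() % 4:            # skip matches that are not word-aligned
            continue
        i = m.start() // 4           # byte offset -> word index
        if words[i] + 1 == words[i + RECORD_WORDS]:
            hits.append(i)
    return hits


FILLER = 0xFFFFFFFF                  # payload that can never look like a sequence word
words = [FILLER, FILLER]             # two junk words before the first record
for seq in (5, 6, 7):
    words += [seq] + [FILLER] * (RECORD_WORDS - 1)

print(find_records(words))
```

Only the lookahead half is shown, so the last record of a run (sequence number 7 here) is not reported; that is the case the lookbehind pattern in the answer covers.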

Project Euler #13 in Python, trying to find smart solution

I'm trying to solve problem 13 from Project Euler, and I'm trying to make the solution beautiful (or at least not ugly). The only "ugly thing" I do is pre-format the input and keep it in the solution file (for some technical reasons, and because I want to concentrate on the numeric part of the problem).
The problem is "Work out the first ten digits of the sum of the following one-hundred 50-digit numbers."
I wrote some code, that should work, as far as I know, but it gives wrong result. I've checked input several times, it seems to be OK...
nums=[37107287533902102798797998220837590246510135740250,
46376937677490009712648124896970078050417018260538,
74324986199524741059474233309513058123726617309629,
91942213363574161572522430563301811072406154908250,
23067588207539346171171980310421047513778063246676,
89261670696623633820136378418383684178734361726757,
28112879812849979408065481931592621691275889832738,
44274228917432520321923589422876796487670272189318,
47451445736001306439091167216856844588711603153276,
70386486105843025439939619828917593665686757934951,
62176457141856560629502157223196586755079324193331,
64906352462741904929101432445813822663347944758178,
92575867718337217661963751590579239728245598838407,
58203565325359399008402633568948830189458628227828,
80181199384826282014278194139940567587151170094390,
35398664372827112653829987240784473053190104293586,
86515506006295864861532075273371959191420517255829,
71693888707715466499115593487603532921714970056938,
54370070576826684624621495650076471787294438377604,
53282654108756828443191190634694037855217779295145,
36123272525000296071075082563815656710885258350721,
45876576172410976447339110607218265236877223636045,
17423706905851860660448207621209813287860733969412,
81142660418086830619328460811191061556940512689692,
51934325451728388641918047049293215058642563049483,
62467221648435076201727918039944693004732956340691,
15732444386908125794514089057706229429197107928209,
55037687525678773091862540744969844508330393682126,
18336384825330154686196124348767681297534375946515,
80386287592878490201521685554828717201219257766954,
78182833757993103614740356856449095527097864797581,
16726320100436897842553539920931837441497806860984,
48403098129077791799088218795327364475675590848030,
87086987551392711854517078544161852424320693150332,
59959406895756536782107074926966537676326235447210,
69793950679652694742597709739166693763042633987085,
41052684708299085211399427365734116182760315001271,
65378607361501080857009149939512557028198746004375,
35829035317434717326932123578154982629742552737307,
94953759765105305946966067683156574377167401875275,
88902802571733229619176668713819931811048770190271,
25267680276078003013678680992525463401061632866526,
36270218540497705585629946580636237993140746255962,
24074486908231174977792365466257246923322810917141,
91430288197103288597806669760892938638285025333403,
34413065578016127815921815005561868836468420090470,
23053081172816430487623791969842487255036638784583,
11487696932154902810424020138335124462181441773470,
63783299490636259666498587618221225225512486764533,
67720186971698544312419572409913959008952310058822,
95548255300263520781532296796249481641953868218774,
76085327132285723110424803456124867697064507995236,
37774242535411291684276865538926205024910326572967,
23701913275725675285653248258265463092207058596522,
29798860272258331913126375147341994889534765745501,
18495701454879288984856827726077713721403798879715,
38298203783031473527721580348144513491373226651381,
34829543829199918180278916522431027392251122869539,
40957953066405232632538044100059654939159879593635,
29746152185502371307642255121183693803580388584903,
41698116222072977186158236678424689157993532961922,
62467957194401269043877107275048102390895523597457,
23189706772547915061505504953922979530901129967519,
86188088225875314529584099251203829009407770775672,
11306739708304724483816533873502340845647058077308,
82959174767140363198008187129011875491310547126581,
97623331044818386269515456334926366572897563400500,
42846280183517070527831839425882145521227251250327,
55121603546981200581762165212827652751691296897789,
32238195734329339946437501907836945765883352399886,
75506164965184775180738168837861091527357929701337,
62177842752192623401942399639168044983993173312731,
32924185707147349566916674687634660915035914677504,
99518671430235219628894890102423325116913619626622,
73267460800591547471830798392868535206946944540724,
76841822524674417161514036427982273348055556214818,
97142617910342598647204516893989422179826088076852,
87783646182799346313767754307809363333018982642090,
10848802521674670883215120185883543223812876952786,
71329612474782464538636993009049310363619763878039,
62184073572399794223406235393808339651327408011116,
66627891981488087797941876876144230030984490851411,
60661826293682836764744779239180335110989069790714,
85786944089552990653640447425576083659976645795096,
66024396409905389607120198219976047599490197230297,
64913982680032973156037120041377903785566085089252,
16730939319872750275468906903707539413042652315011,
94809377245048795150954100921645863754710598436791,
78639167021187492431995700641917969777599028300699,
15368713711936614952811305876380278410754449733078,
40789923115535562561142322423255033685442488917353,
44889911501440648020369068063960672322193204149535,
41503128880339536053299340368006977710650566631954,
81234880673210146739058568557934581403627822703280,
82616570773948327592232845941706525094512325230608,
22918802058777319719839450180888072429661980811197,
77158542502016545090413245809786882778948721859617,
72107838435069186155435662884062257473692284509516,
20849603980134001723930671666823555245252804609722,
53503534226472524250874054075591789781264330331690]
result_sum = []
tmp_sum = 0
for j in xrange(50):
    for i in xrange(100):
        tmp_sum += nums[i] % 10
        nums[i] =nums[i] / 10
    result_sum.insert(0,int(tmp_sum % 10))
    tmp_sum = tmp_sum / 10
for i in xrange(10):
    print result_sum[i]

Your code works by adding all the numbers in nums like a person would: adding column by column. Your code does not work because when you are summing the far left column, you treat it like every other column. Whenever people get to the far left, they write down the entire sum. So this line
result_sum.insert(0,int(tmp_sum % 10))
doesn't work for the far left column; you need to insert something else into result_sum in that case. I would post the code, but 1) I'm sure you don't need it, and 2) it's against the Project-Euler tag rules. If you would like, I can email it to you, but I'm sure that won't be necessary.

You could save the numbers in a file (with a number on each line), and read from it:
nums = []
with open('numbers.txt', 'r') as f:
    for num in f:
        nums.append(int(num))
# nums is now populated with all of the numbers, so do your actual algorithm
Also, it looks like you want to store the sum as an array of digits. The cool thing about Python is that it automatically handles large integers. Here is a quote from the docs:
Plain integers (also just called integers) are implemented using long in C, which gives them at least 32 bits of precision (sys.maxint is always set to the maximum plain integer value for the current platform, the minimum value is -sys.maxint - 1). Long integers have unlimited precision.
So using an array of digits isn't really necessary if you are working with Python. In C, it is another story...
Also, regarding your code, you need to factor in the digits in tmp_sum, which contains your carry-over digits. You can add them into result_sum like this:
while tmp_sum:
    result_sum.insert(0,int(tmp_sum % 10))
    tmp_sum /= 10
This will fix your issue. Here, it works.
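The column-by-column idea, including the flush of the leftover carry, can be tried on a tiny input. The sketch below is written for Python 3 (so it uses // where the Python 2 code above uses /); column_sum_digits is a name made up here, and three 3-digit numbers stand in for the hundred 50-digit ones.

```python
def column_sum_digits(nums, width):
    """Add numbers column by column and return the digits of the total."""
    nums = list(nums)          # work on a copy, the loop consumes the digits
    digits = []
    carry = 0
    for _ in range(width):
        for i in range(len(nums)):
            carry += nums[i] % 10
            nums[i] //= 10
        digits.insert(0, carry % 10)
        carry //= 10
    # The leftmost column: write down whatever is still in the carry.
    while carry:
        digits.insert(0, carry % 10)
        carry //= 10
    return digits


sample = [987, 654, 321]
print(column_sum_digits(sample, 3))
print([int(d) for d in str(sum(sample))])  # same digits via the built-in sum
```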

Since you already have all the numbers in a list, you should be able to take the sum of them pretty easily. Then you just need to take the first ten digits of the sum. I won't put any code here, though.

As simple as this:
values.txt will contain all the numbers, one per line.
nums = []
with open("values.txt",'r') as f:
    for num in f:
        nums.append(int(num))
print(str(sum(nums))[:10])

Just as easy is storing the numbers in a CSV file, reading it with pandas and then iterating through the pandas DataFrame:
def foo():
    import pandas as pd
    # dtype=str keeps each 50-digit number as an exact string
    table = pd.read_csv("data.txt", header=None, usecols=[0], dtype=str)
    total = 0
    for x in range(len(table)):
        total += int(table[0][x])
    return str(total)[:10]
Just keep in mind that Python handles the large integers for you.

python captcha decoder library

I need a Captcha decoder for python to read simple image captchas like the following picture:
Do you know of a library that can help me read this captcha?
If you don't know of a library for reading captchas, could you help me to read this (and others like this) with PIL?

I hope this captcha is not used anywhere.
Following is a dummy way to decode it. Basically what you need are the patterns from 0 to 9 as present in these captchas. From your examples, I have only the patterns for 0 3 4 5 7 8. Since everything is fixed on them, you know where to split each character. You also know each character is a number of fixed size and fixed font. If it also includes letters or more characters, but of fixed size and font, then the following code can be easily adapted.
What the code does is: a) load the patterns (I considered they are named n0.png, n1.png, ...); b) split the captcha in NUMS pieces; c) do a sum of squared differences between each pattern and each split number; d) decide that the split number is the one with the smallest sum. It returns a list for each number, in order, present in the captcha. To obtain the initial patterns, you can uncomment the lines that save the split numbers, place a return after that piece, and adjust the file names.
import sys
from PIL import Image, ImageOps
PAT_SIZE = (8, 10)
NUMS = 3
FIRST_NUM_OFFSET = 5
NUM_OFFSET = (1, 3)
NUMBERS = []
for i in range(10):
    try:
        NUMBERS.append(Image.open('n%d.png' % i).load())
    except IOError:
        print("I do not know the pattern for the number %d." % i)
        NUMBERS.append(None)
def magic(fname):
    captcha = ImageOps.grayscale(Image.open(fname))
    im = captcha.load()
    # Split numbers
    num = []
    for n in range(NUMS):
        x1, y1 = (FIRST_NUM_OFFSET + n * (NUM_OFFSET[0] + PAT_SIZE[0]),
                  NUM_OFFSET[1])
        num.append(captcha.crop((x1, y1, x1 + PAT_SIZE[0], y1 + PAT_SIZE[1])))
    # If you want to save the split numbers:
    #for i, n in enumerate(num):
    #    n.save('%d.png' % i)
    def sqdiff(a, b):
        if None in (a, b): # XXX This is here just to handle missing pattern.
            return float('inf')
        d = 0
        for x in range(PAT_SIZE[0]):
            for y in range(PAT_SIZE[1]):
                d += (a[x, y] - b[x, y]) ** 2
        return d
    # Calculate a dummy sum of squared differences between the patterns
    # and each number. We assume the smallest diff is the number in the
    # "captcha".
    result = []
    for n in num:
        n_sqdiff = [(sqdiff(p, n.load()), i) for i, p in enumerate(NUMBERS)]
        result.append(min(n_sqdiff)[1])
    return result
print(magic(sys.argv[1]))
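The matching step can be tried without any image files. The following sketch replaces the PIL images with tiny 3x3 grayscale grids stored as nested lists; PATTERNS, the noisy test image and the function names are all invented for the illustration, so it only demonstrates the split and sum-of-squared-differences logic, not real captcha sizes.

```python
def sq_diff(a, b):
    """Sum of squared pixel differences between two equally sized glyphs."""
    return sum(
        (pa - pb) ** 2
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )


def split_glyphs(image, width):
    """Cut an image (list of pixel rows) into fixed-width glyphs, left to right."""
    count = len(image[0]) // width
    return [[row[k * width:(k + 1) * width] for row in image] for k in range(count)]


def decode(image, patterns, width):
    """Label each glyph with the pattern that has the smallest squared difference."""
    return "".join(
        min(patterns, key=lambda label: sq_diff(glyph, patterns[label]))
        for glyph in split_glyphs(image, width)
    )


# Hypothetical 3x3 reference patterns (255 = ink, 0 = background).
PATTERNS = {
    "0": [[255, 255, 255], [255, 0, 255], [255, 255, 255]],
    "1": [[0, 255, 0], [0, 255, 0], [0, 255, 0]],
    "7": [[255, 255, 255], [0, 0, 255], [0, 0, 255]],
}

# A noisy "1" followed by a noisy "7", side by side.
noisy = [
    [5, 250, 0, 250, 240, 255],
    [0, 245, 10, 10, 0, 245],
    [0, 255, 0, 0, 20, 255],
]
print(decode(noisy, PATTERNS, 3))
```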

I hope you are using it in good faith and you are not going to harm (/spam) anyone.
I won't write you the script nor forward you to an external plugin. But in case you are writing this on your own, this may help:
In case you are trying to decode a specific captcha pattern, you should collect all the chars (I saw from the examples you attached that it's only numbers, so it shouldn't be a lot of work).
Put all of the chars in one file and analyze it with PIL
Save in an array each char, its position and its meaning.
Get a Captcha image - Clear the background noise if necessary.
Split the Captcha image to char-sized and cross it through your self-made dictionary of chars.

It is a nice project to do for academic reasons, I was interested in this a while ago. You have a few options:
You write your own with the help from this site: [scrubbed dead link]
You use OpenCV to do the matching.
I think there was a dedicated library for neural network image matching, but I can't seem to find it.
Basically as the others said, you want to remove the noise, split into single chars and compare it using a chosen technique to the model chars.
