I have a script in Python 3.6.8 which reads through a very large text file, where each line is an ASCII string drawn from the alphabet {a,b,c,d,e,f}.
For each line, I have a function which fragments the string using a sliding window of size k, and then increments a fragment counter dictionary fragment_dict by 1 for each fragment seen.
The same fragment_dict is used for the entire file, and it is initialized so that each of the 5^k possible fragments maps to zero.
I also ignore any fragment which has the character c in it. Note that c is uncommon, and most lines will not contain it at all.
def fragment_string(mystr, fragment_dict, k):
    for i in range(len(mystr) - k + 1):
        fragment = mystr[i:i+k]
        if 'c' in fragment:
            continue
        fragment_dict[fragment] += 1
Because my file is so large, I would like to optimize the performance of the above function as much as possible. Could anyone provide any potential optimizations to make this function faster?
I'm worried I may be rate limited by the speed of Python loops, in which case I would need to consider dropping down into C/Cython.
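NumPy may help speed up your code:
import collections
import numpy as np
x = np.array([ord(c) - ord('a') for c in mystr])
weights = 6 ** np.arange(k)  # [1, 6, 36, ...] as exact integer powers
fragment_dict = collections.Counter(np.convolve(x, weights, mode='valid'))
The idea is to represent each length-k segment as a k-digit number. Mapping a-f to the integers 0-5 gives six possible digits, so the base has to be 6 (with base 5, different fragments could collide). Converting the list of digits to those numbers is then equivalent to applying a convolution with [1, 6, 36, 216, ...] as the filter. Note that the Counter keys are integers rather than strings, and that fragments containing c still have to be filtered out separately.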
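A fuller sketch of the same encoding idea, under a few assumptions of mine: the digits are taken in base 6 (one per letter of a-f), windows containing c are masked out with a second convolution, and the integer keys are decoded back to strings at the end. The names count_fragments and decode are hypothetical and do not come from the original post.

```python
import numpy as np

ALPHABET = "abcdef"
BASE = len(ALPHABET)

def decode(key, k):
    """Turn an integer key back into its k-letter fragment."""
    letters = []
    for _ in range(k):
        key, digit = divmod(key, BASE)
        letters.append(ALPHABET[digit])
    # np.convolve gives the first letter of a window the highest weight,
    # so the least significant digit (extracted first) is the last letter.
    return "".join(reversed(letters))

def count_fragments(mystr, k):
    """Count the length-k windows of mystr that do not contain 'c'."""
    if len(mystr) < k:
        return {}
    raw = np.frombuffer(mystr.encode("ascii"), dtype=np.uint8)
    codes = raw.astype(np.int64) - ord("a")
    weights = BASE ** np.arange(k, dtype=np.int64)
    keys = np.convolve(codes, weights, mode="valid")
    # Number of 'c' characters in each window; keep only windows with none.
    is_c = (codes == ALPHABET.index("c")).astype(np.int64)
    c_hits = np.convolve(is_c, np.ones(k, dtype=np.int64), mode="valid")
    values, counts = np.unique(keys[c_hits == 0], return_counts=True)
    return {decode(int(v), k): int(n) for v, n in zip(values, counts)}

print(count_fragments("abcab", 2))
```

The per-line counts can then be added into the shared fragment_dict. Whether this beats the plain loop depends on line length and k, so it is worth timing on real data.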
I would like to use regex to increase the speed of my searches for specific records within a large binary image. It seems like regex searches always outperform my own search methods, so that's why I'm looking into this. I have already implemented the following, which works, but is not very fast.
My binary image is loaded into a NumPy memmap as 32-bit words.
I_FILE = np.memmap(opts.image_file, dtype='uint32', mode='r')
And here is the start of my current search loop (which works):
for i in range(0, FILESIZE - 19):
    if (((I_FILE[i] + 1 == I_FILE[i + 19]) or (I_FILE[i - 19] + 1 == I_FILE[i])) and I_FILE[i] < 60):
        ...do stuff...
This seeks out records that are 19 words long and start with a decimal sequence number between 0 and 59. It looks for an incrementing sequence number in the record either before or after the current search location to validate the record.
I've seen a few examples where folks have built variables into the pattern string using re.escape (like this: How to use a variable inside a regular expression?), but I can't figure out how to search for a changing value sequence.
I managed to make it work with regex, but it was a bit more complicated than I expected. The patterns look for two values between 0 and 59 that are separated by 72 bytes (18 words). I used two regex searches to ensure that I wouldn't miss records at the end of a sequence:
# First search uses the lookahead assertion to not consume large amounts of data.
SearchPattern1 = re.compile(b'[\0-\x3B]\0\0\0(?=.{72}[\1-\x3B]\0\0\0)', re.DOTALL)
# Again using the positive lookbehind assertion (?<= ... ) to grab the ending entries.
SearchPattern2 = re.compile(b'(?<=[\0-\x3B]\0\0\0.{72})[\1-\x3B]\0\0\0', re.DOTALL)
Next, perform both searches and combine the results.
HitList1 = [m.start(0) for m in SearchPattern1.finditer(I_FILE)]
HitList2 = [m.start(0) for m in SearchPattern2.finditer(I_FILE)]
AllHitList = list(set(HitList1 + HitList2))
SortedHitList = sorted(AllHitList)
Now I run a search that has the same conditions as my original solution, but it runs on a much smaller set of data!
for TestLoc in SortedHitList:
    # m.start() is a byte offset, while I_FILE is indexed in 32-bit words.
    if TestLoc % 4:
        continue
    w = TestLoc // 4
    if (I_FILE[w] + 1 == I_FILE[w + 19]) or (I_FILE[w - 19] + 1 == I_FILE[w]):
        ... do stuff ...
The result was very successful! The original solution took 58 seconds to run on a 300 MB binary file, while the new regex solution took only 2 seconds!!
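The lookahead search can be tried on synthetic data. This is a minimal sketch, assuming little-endian 32-bit words and 19-word records whose first word is the sequence number. The record layout, the 0xFFFFFFFF filler, and the helper name find_sequence_starts are my assumptions for illustration only.

```python
import re
import struct

RECORD_WORDS = 19
WORD = 4  # bytes per uint32

# A sequence word 0..59, then (without consuming them) 72 payload bytes
# followed by a sequence word 1..59.
CANDIDATE = re.compile(
    rb"[\x00-\x3B]\x00\x00\x00(?=.{72}[\x01-\x3B]\x00\x00\x00)", re.DOTALL
)

def find_sequence_starts(data):
    """Return byte offsets of records whose successor has the next sequence number."""
    n_words = len(data) // WORD
    words = struct.unpack("<%dI" % n_words, data[:n_words * WORD])
    starts = []
    for m in CANDIDATE.finditer(data):
        offset = m.start()
        if offset % WORD:
            continue  # not word-aligned, so it cannot be a record start
        w = offset // WORD
        if words[w] + 1 == words[w + RECORD_WORDS]:
            starts.append(offset)
    return starts

# Three consecutive records with sequence numbers 5, 6, 7.
records = [struct.pack("<19I", seq, *([0xFFFFFFFF] * 18)) for seq in (5, 6, 7)]
data = b"".join(records)
print(find_sequence_starts(data))
```

The regex only narrows down the candidates cheaply; the explicit comparison afterwards is still what confirms that the sequence number really increments.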
I need to parse a set of strings that contain page ranges as they appear in metadata of scientific and other publications. I don't have a complete spec of the pagination format, and I am not even sure if one exists, but examples of strings I need to process are:
6-10, 19-22
xlvii-xlviii
111S-2S
326
A078-132
XC-CIII
Ideally, I'd like to return the number of pages for each string, e.g. 9 for 6-10, 19-22. If that's too hard, at least whether it's a single page or more. The latter is pretty easy actually since commas and dashes seem to be the only delimiters in the examples I've seen so far. But I do very much prefer to get the right count.
I can write my own parser but I am curious whether there are any existing packages that can already do this out of the box or with minimal mods.
Here's a solution that supports parsing "normal" numbers as well as Roman numerals. For parsing Roman numerals, install the roman package (pip install roman). You can enhance the parse_num function to support additional formats.
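import roman

def parse_num(p):
    p = p.strip()
    try:
        return roman.fromRoman(p.upper())
    except roman.InvalidRomanNumeralError:
        # Not a Roman numeral, so treat it as a plain integer.
        return int(p)

def parse_pages(s):
    count = 0
    for part in s.split(','):
        rng = part.split('-', 1)
        # For a single page, rng[0] and rng[-1] are the same element.
        a, b = parse_num(rng[0]), parse_num(rng[-1])
        count += b - a + 1
    return count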
>>> parse_pages('17')
1
>>> parse_pages('6-10, 19-22')
9
>>> parse_pages('xlvii-xlviii')
2
>>> parse_pages('XC-CIII')
14
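The examples 111S-2S and A078-132 from the question are not covered by parse_num as written. One possible extension is sketched below, and the interpretation is a guess on my part: letters are treated as a prefix or suffix label and ignored, and a shorter end page is read as an abbreviation (111-2 meaning 111-112). The names digits_of and count_pages are hypothetical, and Roman numerals are left out to keep the sketch free of third-party imports.

```python
import re

def digits_of(token):
    """Return the digit run in a page token, e.g. '111S' -> '111', 'A078' -> '078'."""
    match = re.search(r"\d+", token)
    if match is None:
        raise ValueError("no digits in page token: %r" % token)
    return match.group()

def count_pages(s):
    """Count pages in a comma-separated list of single pages and ranges."""
    total = 0
    for part in s.split(","):
        first, _, last = part.strip().partition("-")
        a = digits_of(first)
        b = digits_of(last) if last else a
        if len(b) < len(a):
            # Assumed abbreviation: '111-2' is read as 111-112.
            b = a[:len(a) - len(b)] + b
        total += int(b) - int(a) + 1
    return total

print(count_pages("111S-2S"), count_pages("A078-132"), count_pages("6-10, 19-22"))
```

Without a spec for the pagination format these rules may miscount some real-world strings, so treat the result as a best effort.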
I'm trying to write an encryption program that encodes a string entered by the user. The encryption method is a basic use of elliptic curve encryption, and I am working on the encryption part of the program first, before tackling the mathematical parts (modular inverses, etc.) required for the public and private key calculations. Currently I am using the key pub = 5 and a maximum value (derived from the product of 2 random primes) of 91. This is all the information needed, and the word I am testing the encryption on is 'happy'.
So far here is the code.
word = 'happy'
pub = 5
m = 91
for i in range(pub):
    if i == 0:
        word = word
    else:
        word = output
    for x in word:
        a = [(((ord(z)*ord(z))+1)/m) for z in word]
        b = [chr(i) for i in a]
        c = [str(i) for i in b]
        d = ''.join([str(i) for i in c])
        output = d
What I am trying to do is encrypt each letter by multiplying its ASCII value by itself, adding 1, dividing by m, and then using the chr() function to rejoin the results into a new word. That new string then becomes the value of word for the next cycle of the loop, so the process repeats pub times until the word is encrypted. I'm having a lot of difficulty with this and I don't know where to start explaining the issues. I'm relatively new to Python, and any suggestions or advice on completing this would be very much appreciated. Thank you in advance.
First, check that your math is right. Your formula (z**2 + 1)/m grows quadratically. My understanding of crypto is quite limited, but it doesn't look right to me. It should be some kind of one-to-one mapping from input to output. But it maps several neighboring characters to the same output. Also, the results grow with every round.
You can only convert integers back to ASCII characters for values below 256. That's what your error message says. It's probably thrown in the second iteration of your outer for loop.
You probably need to get the value range down to below 256 again.
I suppose you are missing a crucial part of the algorithm you are trying to implement, maybe some modulo operation.
Also, some minor Python hints:
You can use the built in power operator **, so you don't have to evaluate ord() twice.
((ord(z) ** 2) + 1) / m
You can do the conversion back to the string in one step like this:
output = ''.join(chr(i) for i in a)
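If a modulo is indeed the missing step, the loop might look like the sketch below. It only illustrates how reducing modulo m keeps every value inside the range chr() accepts. The formula (z**2 + 1) % m is my assumption, it is not invertible, and it is not a real cipher. The function name scramble is hypothetical.

```python
def scramble(word, rounds=5, m=91):
    """Apply the squaring step `rounds` times, reducing modulo m each time."""
    for _ in range(rounds):
        # Every code lands in range(m), so chr() always succeeds.
        codes = [(ord(ch) ** 2 + 1) % m for ch in word]
        word = "".join(chr(code) for code in codes)
    return word

result = scramble("happy")
print([ord(ch) for ch in result])
```

Because the output of each round is fed straight back in as the next input, no special case for the first iteration is needed.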
I'm trying to solve problem 13 from Project Euler, and I'm trying to make the solution beautiful (or at least not ugly). The only "ugly" thing I do is pre-format the input and keep it in the solution file (for technical reasons, and because I want to concentrate on the numeric part of the problem).
The problem is "Work out the first ten digits of the sum of the following one-hundred 50-digit numbers."
I wrote some code that should work, as far as I know, but it gives the wrong result. I've checked the input several times, and it seems to be OK...
nums=[37107287533902102798797998220837590246510135740250,
46376937677490009712648124896970078050417018260538,
74324986199524741059474233309513058123726617309629,
91942213363574161572522430563301811072406154908250,
23067588207539346171171980310421047513778063246676,
89261670696623633820136378418383684178734361726757,
28112879812849979408065481931592621691275889832738,
44274228917432520321923589422876796487670272189318,
47451445736001306439091167216856844588711603153276,
70386486105843025439939619828917593665686757934951,
62176457141856560629502157223196586755079324193331,
64906352462741904929101432445813822663347944758178,
92575867718337217661963751590579239728245598838407,
58203565325359399008402633568948830189458628227828,
80181199384826282014278194139940567587151170094390,
35398664372827112653829987240784473053190104293586,
86515506006295864861532075273371959191420517255829,
71693888707715466499115593487603532921714970056938,
54370070576826684624621495650076471787294438377604,
53282654108756828443191190634694037855217779295145,
36123272525000296071075082563815656710885258350721,
45876576172410976447339110607218265236877223636045,
17423706905851860660448207621209813287860733969412,
81142660418086830619328460811191061556940512689692,
51934325451728388641918047049293215058642563049483,
62467221648435076201727918039944693004732956340691,
15732444386908125794514089057706229429197107928209,
55037687525678773091862540744969844508330393682126,
18336384825330154686196124348767681297534375946515,
80386287592878490201521685554828717201219257766954,
78182833757993103614740356856449095527097864797581,
16726320100436897842553539920931837441497806860984,
48403098129077791799088218795327364475675590848030,
87086987551392711854517078544161852424320693150332,
59959406895756536782107074926966537676326235447210,
69793950679652694742597709739166693763042633987085,
41052684708299085211399427365734116182760315001271,
65378607361501080857009149939512557028198746004375,
35829035317434717326932123578154982629742552737307,
94953759765105305946966067683156574377167401875275,
88902802571733229619176668713819931811048770190271,
25267680276078003013678680992525463401061632866526,
36270218540497705585629946580636237993140746255962,
24074486908231174977792365466257246923322810917141,
91430288197103288597806669760892938638285025333403,
34413065578016127815921815005561868836468420090470,
23053081172816430487623791969842487255036638784583,
11487696932154902810424020138335124462181441773470,
63783299490636259666498587618221225225512486764533,
67720186971698544312419572409913959008952310058822,
95548255300263520781532296796249481641953868218774,
76085327132285723110424803456124867697064507995236,
37774242535411291684276865538926205024910326572967,
23701913275725675285653248258265463092207058596522,
29798860272258331913126375147341994889534765745501,
18495701454879288984856827726077713721403798879715,
38298203783031473527721580348144513491373226651381,
34829543829199918180278916522431027392251122869539,
40957953066405232632538044100059654939159879593635,
29746152185502371307642255121183693803580388584903,
41698116222072977186158236678424689157993532961922,
62467957194401269043877107275048102390895523597457,
23189706772547915061505504953922979530901129967519,
86188088225875314529584099251203829009407770775672,
11306739708304724483816533873502340845647058077308,
82959174767140363198008187129011875491310547126581,
97623331044818386269515456334926366572897563400500,
42846280183517070527831839425882145521227251250327,
55121603546981200581762165212827652751691296897789,
32238195734329339946437501907836945765883352399886,
75506164965184775180738168837861091527357929701337,
62177842752192623401942399639168044983993173312731,
32924185707147349566916674687634660915035914677504,
99518671430235219628894890102423325116913619626622,
73267460800591547471830798392868535206946944540724,
76841822524674417161514036427982273348055556214818,
97142617910342598647204516893989422179826088076852,
87783646182799346313767754307809363333018982642090,
10848802521674670883215120185883543223812876952786,
71329612474782464538636993009049310363619763878039,
62184073572399794223406235393808339651327408011116,
66627891981488087797941876876144230030984490851411,
60661826293682836764744779239180335110989069790714,
85786944089552990653640447425576083659976645795096,
66024396409905389607120198219976047599490197230297,
64913982680032973156037120041377903785566085089252,
16730939319872750275468906903707539413042652315011,
94809377245048795150954100921645863754710598436791,
78639167021187492431995700641917969777599028300699,
15368713711936614952811305876380278410754449733078,
40789923115535562561142322423255033685442488917353,
44889911501440648020369068063960672322193204149535,
41503128880339536053299340368006977710650566631954,
81234880673210146739058568557934581403627822703280,
82616570773948327592232845941706525094512325230608,
22918802058777319719839450180888072429661980811197,
77158542502016545090413245809786882778948721859617,
72107838435069186155435662884062257473692284509516,
20849603980134001723930671666823555245252804609722,
53503534226472524250874054075591789781264330331690]
result_sum = []
tmp_sum = 0
for j in xrange(50):
    for i in xrange(100):
        tmp_sum += nums[i] % 10
        nums[i] = nums[i] / 10
    result_sum.insert(0, int(tmp_sum % 10))
    tmp_sum = tmp_sum / 10
for i in xrange(10):
    print result_sum[i]
Your code works by adding all the numbers in nums like a person would: adding column by column. Your code does not work because when you are summing the far left column, you treat it like every other column. Whenever people get to the far left, they write down the entire sum. So this line
result_sum.insert(0,int(tmp_sum % 10))
doesn't work for the far left column; you need to insert something else into result_sum in that case. I would post the code, but 1) I'm sure you don't need it, and 2) it's against the Project-Euler tag rules. If you would like, I can email it to you, but I'm sure that won't be necessary.
You could save the numbers in a file (with a number on each line), and read from it:
nums = []
with open('numbers.txt', 'r') as f:
    for num in f:
        nums.append(int(num))
# nums is now populated with all of the numbers, so do your actual algorithm
Also, it looks like you want to store the sum as an array of digits. The cool thing about Python is that it automatically handles large integers. Here is a quote from the docs:
Plain integers (also just called integers) are implemented using long in C, which gives them at least 32 bits of precision (sys.maxint is always set to the maximum plain integer value for the current platform, the minimum value is -sys.maxint - 1). Long integers have unlimited precision.
So using an array of digits isn't really necessary if you are working with Python. In C, it is another story...
Also, regarding your code, you need to factor in the digits in tmp_sum, which contains your carry-over digits. You can add them into result_sum like this:
while tmp_sum:
    result_sum.insert(0, int(tmp_sum % 10))
    tmp_sum //= 10
This will fix your issue. Here, it works.
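The column-by-column addition, including the final carry, can be sketched on a small made-up list. This uses Python 3 syntax, hence // for integer division. The function name column_sum_digits and the sample numbers are mine, not the asker's.

```python
def column_sum_digits(nums, width):
    """Add `width`-digit numbers column by column and return the digits of the sum."""
    nums = list(nums)  # work on a copy so the caller's list is untouched
    digits = []
    carry = 0
    for _ in range(width):
        carry += sum(n % 10 for n in nums)  # add the current rightmost column
        nums = [n // 10 for n in nums]      # drop that column
        digits.insert(0, carry % 10)
        carry //= 10
    # After the leftmost column, write down whatever is left of the carry.
    while carry:
        digits.insert(0, carry % 10)
        carry //= 10
    return digits

print(column_sum_digits([95, 87, 46], 2))
```

The result can be checked against Python's own arithmetic by comparing it with the digits of str(sum(nums)).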
Since you already have all the numbers in a list, you should be able to take the sum of them pretty easily. Then you just need to take the first ten digits of the sum. I won't put any code here, though.
It's as simple as this:
values.txt contains all the numbers, one per line.
nums = []
with open("values.txt", 'r') as f:
    for num in f:
        nums.append(int(num))
print(str(sum(nums))[:10])
Just as easy is storing the numbers in a CSV file and using pandas:
import pandas as pd

def foo():
    # dtype=str keeps each 50-digit number exact instead of letting pandas convert it.
    table = pd.read_csv("data.txt", header=None, usecols=[0], dtype=str)
    # Then iterate through the pandas DataFrame, summing with Python's own integers.
    total = 0
    for x in range(len(table)):
        total += int(table[0][x])
    return str(total)[:10]
Just keep in mind that Python handles the large numbers for you.