Efficient way to check last term of a line in Python file

I'm writing a Python script that takes in a (potentially large) file. Here is an example of a way that input file could be formatted:
class1 1:v1 2:v2 3:v3 4:v4 5:v5
class2 1:v6 4:v7 5:v8 6:v9
class1 3:v10 4:v11 5:v12 6:v13 8:v14
class2 1:v15 2:v16 3:v17 5:v18 7:v19
Where class1 and class2 are some number, e.g. 1 and -1. (A curious user may notice that this is a LIBSVM-related file, but knowing the software isn't necessary in this case.) The values v1, v2, ..., v19 represent any integer or float value. Obviously, my files would be much larger than this, in terms of total lines and length per line, which is why I'm concerned about efficiency here.
I am trying to check what is the greatest value to the left of a colon. In LIBSVM, these are called "features" and are always integers here. For instance, in the example I outlined above, line 1 has 5 as its largest feature. Line 2 has 6 as its largest feature, line 3 has 8 as its largest feature, and finally, line 4 has 7 as its largest feature. Since 8 is the largest of these values, that is my desired value. I'm looking at a file with possibly thousands of features per line, and many hundreds of thousands of lines.
The file satisfies the following properties:
The features must be strictly increasing. I.e. "3:v1 4:v2" is allowed, but not "3:v1 3:v2."
The features are not necessarily consecutive and can be skipped. In the first example I gave, the first line has its features in consecutive order (1,2,3,4,5) and skips features 6, 7, and 8. The other 3 lines do not have their features in consecutive order. That's okay, as long as those features are strictly increasing.
Right now, my approach is to read each line, split it on spaces, split the final term on the colon, and then read the feature number. Following that, I track the maximum such featureNum.
file1 = open(...)
max_feature = 0
for line in file1:
    linesplit = line.rstrip('\n').split(' ')
    val = linesplit[-1]  # the last "feature:value" term on the line
    valsplit = val.split(':')
    featureNum = int(valsplit[0])  # convert, so the comparison below is numeric
    if featureNum > max_feature:
        max_feature = featureNum
print max_feature
file1.close()
But I'm hoping there is a better or more efficient way of doing this, e.g. some way of analyzing the file by only getting those terms that directly precede a newline character (maybe to avoid reading all the lines?). I'm new to Python so it wouldn't surprise me if I missed something obvious.
Possible reference: http://docs.python.org/library/stdtypes.html

Since you don't care about all the features in a line but just the last one, you don't need to split the whole line. I don't know if this is actually faster, though; you need to time it and see. It definitely isn't as Pythonic as splitting the entire line.
def last_feature(line):
    start = line.rfind(' ') + 1
    end = line.rfind(':')
    return int(line[start:end])

with open(...) as file1:
    largest = max(last_feature(line) for line in file1)
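If you want to compare the two versions, a rough timing harness along these lines would do ('data.txt' is a placeholder file name, and this is only a sketch):
import timeit
print timeit.timeit(
    "max(last_feature(line) for line in open('data.txt'))",
    setup="from __main__ import last_feature",
    number=3)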

Related

Fastest way to compare two huge csv files in python(numpy)

I am trying to find the intersection of two pretty big csv files of
phone numbers (one has 600k rows, and the other has 300 million). I am currently using pandas to open both files, converting the needed columns into 1d numpy arrays, and then using numpy's intersect to get the intersection. Is there a better way of doing this, either with python or any other method? Thanks for any help.
import pandas as pd
import numpy as np
df_dnc = pd.read_csv('dncTest.csv', names = ['phone'])
df_test = pd.read_csv('phoneTest.csv', names = ['phone'])
dnc_phone = df_dnc['phone']
test_phone = df_test['phone']
np.intersect1d(dnc_phone, test_phone)
I will give you a general solution with some Python pseudo code. What you are trying to solve here is the classical problem from the book "Programming Pearls" by Jon Bentley.
This is solved very efficiently with just a simple bit array, hence my comment asking how long (how many digits) the phone numbers are.
Let's say the phone number is at most 10 digits long; then the max phone number you can have is 9 999 999 999 (spaces are used for better readability). Here we can use 1 bit per number to identify whether the number is in the set or not (bit set or not set, respectively), so we are going to use 9 999 999 999 bits, one to identify each possible number, i.e.:
bits[0] identifies the number 0 000 000 000
bits[193] identifies the number 0 000 000 193
the number 659-234-4567 would be addressed by bits[6592344567]
Doing so we'd need to pre-allocate 9 999 999 999 bits initially set to 0, which is 9 999 999 999 / 8 / 1024 / 1024 ≈ 1192 MB, i.e. around 1.2 GB of memory.
Holding the resulting intersection at the end is comparatively cheap: at most 600k ints will be stored, and 64 bits * 600k ≈ 4.6 MB raw (actually a Python int is not stored that efficiently and may use much more); if these are strings you'll probably end up with even higher memory requirements.
Parsing a phone number string from the CSV file (line by line or with a buffered file reader), converting it to a number and then doing a constant-time memory lookup will IMO be faster than dealing with strings and merging them. Unfortunately, I don't have these phone number files to test with, but I would be interested to hear your findings.
from bitstring import BitArray

max_number = 9999999999
found_phone_numbers = BitArray(length=max_number + 1)

# replace this function with opening the file and retrieving
# the next phone number found in it
def number_from_file_iterator(dummy_data):
    for number in dummy_data:
        yield number

def calculate_intersect():
    # should open file1 and get the generator of numbers from it;
    # we use dummy data here
    for number in number_from_file_iterator([1, 25, 77, 224322323, 8292, 1232422]):
        found_phone_numbers[number] = True
    # open the second file and check if each number is there
    for number in number_from_file_iterator([4, 24, 224322323, 1232422, max_number]):
        if found_phone_numbers[number]:
            yield number

number_intersection = set(calculate_intersect())
print number_intersection
I used BitArray from the bitstring pip package, and it needed around 2 secs to initialize the entire bit array. Afterwards, scanning the file uses constant memory. At the end I used a set to store the items.
Note 1: This algorithm can be modified to just use a list. In that case, in the second loop, as soon as a number's bit matches, that bit must be reset so that duplicates do not match again.
Note 2: Storing in the set/list occurs lazily, because we use a generator in the second for loop. Runtime complexity is linear, i.e. O(N).
Read the 600k phone numbers into a set.
Input the larger file row by row, checking each row against the set.
Write matches to an output file immediately.
That way you don't have to load all the data in memory at once.
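A minimal sketch of that streaming approach (borrowing the file names from the question; adjust the stripping/parsing to your actual format):
with open('dncTest.csv') as f:
    dnc = set(line.strip() for line in f)  # 600k numbers fit comfortably in a set

with open('phoneTest.csv') as big, open('matches.csv', 'w') as out:
    for line in big:
        if line.strip() in dnc:
            out.write(line)  # write each match immediately; nothing else is kept in memory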

Calculating Length of Sequences from .PBS File

I am new here. I am looking for help with a bioinformatics-type task I have: to calculate the total length of all the sequences in a .pbs file.
The file, when opened, displays something like:
The Length is 102
The Length is 1100
The Length is 101
The Length is 111200
The Length is 102
The lengths are given as a list of lines mixing letters and numbers. I need help figuring out what Python code to write to add all the lengths together; the values differ from line to line.
So far my code is:
f = open('lengthofsequence2.pbs.o8767272', 'r')
lines = f.readlines()
f.close()

def lengthofsequencesinpbsfile(i):
    for x in i:
        if
            return x +=

print lengthofsequencesinpbsfile(lines)
I am not sure what to do with the for loop. I want to just count the numbers after the statement "The Length is..."
Thank You!
"The Length is " has 14 characters so line[14:] will give you the substring corresponding to the number you are after (starting after the 14th character), you then just have to convert it to int with int(line[14:]) before adding to your total: total += int(line[14:])
You need to parse your input to get the data you want to work with.
a. x.replace('The Length is ','') - this removes the unwanted text.
b. int(x.replace('The Length is ','')) - this converts the digit characters to an integer.
Add to a total: total += int(x.replace('The Length is ',''))
All of this is directly accessible via Google. I looked for Python string functions and type-conversion functions. I've only looked briefly at Python and never programmed with it, but I think these two items should help you do what you want to do.
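For instance, plugged into the lines list from the question (again assuming every line matches the "The Length is N" pattern):
total = 0
for x in lines:
    total += int(x.replace('The Length is ', ''))
print total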

for/in/if List Comprehension Becomes Very Slow With Large Number of Matches

I have the following list comprehension in my Python 2.7 code which returns the line number (index) and the line from a long list of lines:
results = [[lines.index(line), line] for line in lines
           if search_item in line.lower()]
This is lightning quick if the number of results is low:
The search item is: [ 1330 ]
Before string pre-processing, the time is: 0.0000
The number of lines is: 1,028,952
After string pre-processing, the time is: 0.2500
The number of results is: 249
"String pre-processing" is what I am calling the results = operation above.
Here is the same operation but with "1330" as the search item instead of " 1330 ". This one yields 6,049 matches instead of 249:
The search item is: [1330]
Before string pre-processing, the time is: 0.0000
The number of lines is: 1,028,952
After string pre-processing, the time is: 10.3180
The number of results is: 6,049
As you can see, 10 sec vs. 1/4 sec... Furthermore, " 1330 " and "1330" searches run in 2.4 and 3.2 sec respectively using a for loop:
for lineNum, line in enumerate(lines):
    if search_item in line.lower():
        return lineNum, line
So, the list comprehension gives a 10x improvement in performance in the case of 249 results but is more than 3x slower for 6,049 results...
Obviously, the issue is not in the if/in portion of list comprehension (both searches scan through all 1M+ lines and either accept or reject each one) but in constructing a results list which is "long'ish" in the second case. In other words, the bottleneck appears to be in the
results = [lines.index(line), line]
portion of the comprehension.
I guess I am very surprised that list comprehension becomes so slow for large'ish result sets (and 6K is really not that large). What am I missing? Is there a different method I should be using that would consistently outperform the for loop?
The list.index() call has to search through all lines to find a match. For N lines you execute O(N^2) steps: 1,000 lines becomes a million steps, etc. For 6k lines, that's 36 million steps.*
If all you need is a line number, use the enumerate() function to generate one:
results = [[index, line] for index, line in enumerate(lines)
           if search_item in line.lower()]
enumerate() adds a running counter as you go, leaving your algorithm to execute only O(N) steps. You were already using it in your full for loop, but not in your list comprehension.
There will be a difference in the output if you have duplicate lines, however: lines.index() finds the first match, while enumerate() produces unique line numbers.
* Big-O notation gives us the asymptotic behaviour of algorithms. Since list.index() for a given line x only has to scan (up to) x lines to find the index, and you do that for each line you iterate over, you take 1 + 2 + 3 + ... + x steps in total, which is a triangle number. So in total 'only' (N * (N + 1)) / 2 steps are taken, roughly 1/2 N^2 steps. But when N tends to infinity, multipliers no longer matter, and you end up with O(N^2).
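To see the duplicate-lines difference concretely (a tiny made-up example, not from the question):
lines = ['foo', 'bar', 'foo']
print [[lines.index(l), l] for l in lines if 'foo' in l]   # [[0, 'foo'], [0, 'foo']]
print [[i, l] for i, l in enumerate(lines) if 'foo' in l]  # [[0, 'foo'], [2, 'foo']]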

Project Euler #13 in Python, trying to find smart solution

I'm trying to solve problem 13 from Project Euler, and I'm trying to make the solution beautiful (at least, not ugly). The only "ugly" thing I do is pre-format the input and keep it in the solution file (for some technical reasons, and because I want to concentrate on the numeric part of the problem).
The problem is "Work out the first ten digits of the sum of the following one-hundred 50-digit numbers."
I wrote some code that should work, as far as I know, but it gives the wrong result. I've checked the input several times and it seems to be OK...
nums=[37107287533902102798797998220837590246510135740250,
46376937677490009712648124896970078050417018260538,
74324986199524741059474233309513058123726617309629,
91942213363574161572522430563301811072406154908250,
23067588207539346171171980310421047513778063246676,
89261670696623633820136378418383684178734361726757,
28112879812849979408065481931592621691275889832738,
44274228917432520321923589422876796487670272189318,
47451445736001306439091167216856844588711603153276,
70386486105843025439939619828917593665686757934951,
62176457141856560629502157223196586755079324193331,
64906352462741904929101432445813822663347944758178,
92575867718337217661963751590579239728245598838407,
58203565325359399008402633568948830189458628227828,
80181199384826282014278194139940567587151170094390,
35398664372827112653829987240784473053190104293586,
86515506006295864861532075273371959191420517255829,
71693888707715466499115593487603532921714970056938,
54370070576826684624621495650076471787294438377604,
53282654108756828443191190634694037855217779295145,
36123272525000296071075082563815656710885258350721,
45876576172410976447339110607218265236877223636045,
17423706905851860660448207621209813287860733969412,
81142660418086830619328460811191061556940512689692,
51934325451728388641918047049293215058642563049483,
62467221648435076201727918039944693004732956340691,
15732444386908125794514089057706229429197107928209,
55037687525678773091862540744969844508330393682126,
18336384825330154686196124348767681297534375946515,
80386287592878490201521685554828717201219257766954,
78182833757993103614740356856449095527097864797581,
16726320100436897842553539920931837441497806860984,
48403098129077791799088218795327364475675590848030,
87086987551392711854517078544161852424320693150332,
59959406895756536782107074926966537676326235447210,
69793950679652694742597709739166693763042633987085,
41052684708299085211399427365734116182760315001271,
65378607361501080857009149939512557028198746004375,
35829035317434717326932123578154982629742552737307,
94953759765105305946966067683156574377167401875275,
88902802571733229619176668713819931811048770190271,
25267680276078003013678680992525463401061632866526,
36270218540497705585629946580636237993140746255962,
24074486908231174977792365466257246923322810917141,
91430288197103288597806669760892938638285025333403,
34413065578016127815921815005561868836468420090470,
23053081172816430487623791969842487255036638784583,
11487696932154902810424020138335124462181441773470,
63783299490636259666498587618221225225512486764533,
67720186971698544312419572409913959008952310058822,
95548255300263520781532296796249481641953868218774,
76085327132285723110424803456124867697064507995236,
37774242535411291684276865538926205024910326572967,
23701913275725675285653248258265463092207058596522,
29798860272258331913126375147341994889534765745501,
18495701454879288984856827726077713721403798879715,
38298203783031473527721580348144513491373226651381,
34829543829199918180278916522431027392251122869539,
40957953066405232632538044100059654939159879593635,
29746152185502371307642255121183693803580388584903,
41698116222072977186158236678424689157993532961922,
62467957194401269043877107275048102390895523597457,
23189706772547915061505504953922979530901129967519,
86188088225875314529584099251203829009407770775672,
11306739708304724483816533873502340845647058077308,
82959174767140363198008187129011875491310547126581,
97623331044818386269515456334926366572897563400500,
42846280183517070527831839425882145521227251250327,
55121603546981200581762165212827652751691296897789,
32238195734329339946437501907836945765883352399886,
75506164965184775180738168837861091527357929701337,
62177842752192623401942399639168044983993173312731,
32924185707147349566916674687634660915035914677504,
99518671430235219628894890102423325116913619626622,
73267460800591547471830798392868535206946944540724,
76841822524674417161514036427982273348055556214818,
97142617910342598647204516893989422179826088076852,
87783646182799346313767754307809363333018982642090,
10848802521674670883215120185883543223812876952786,
71329612474782464538636993009049310363619763878039,
62184073572399794223406235393808339651327408011116,
66627891981488087797941876876144230030984490851411,
60661826293682836764744779239180335110989069790714,
85786944089552990653640447425576083659976645795096,
66024396409905389607120198219976047599490197230297,
64913982680032973156037120041377903785566085089252,
16730939319872750275468906903707539413042652315011,
94809377245048795150954100921645863754710598436791,
78639167021187492431995700641917969777599028300699,
15368713711936614952811305876380278410754449733078,
40789923115535562561142322423255033685442488917353,
44889911501440648020369068063960672322193204149535,
41503128880339536053299340368006977710650566631954,
81234880673210146739058568557934581403627822703280,
82616570773948327592232845941706525094512325230608,
22918802058777319719839450180888072429661980811197,
77158542502016545090413245809786882778948721859617,
72107838435069186155435662884062257473692284509516,
20849603980134001723930671666823555245252804609722,
53503534226472524250874054075591789781264330331690]
result_sum = []
tmp_sum = 0
for j in xrange(50):
    for i in xrange(100):
        tmp_sum += nums[i] % 10
        nums[i] = nums[i] / 10
    result_sum.insert(0, int(tmp_sum % 10))
    tmp_sum = tmp_sum / 10

for i in xrange(10):
    print result_sum[i]
Your code works by adding all the numbers in nums like a person would: column by column. It does not work because when you sum the far-left column, you treat it like every other column. When people get to the far left, they write down the entire sum. So this line
result_sum.insert(0,int(tmp_sum % 10))
doesn't work for the far-left column; you need to insert something else into result_sum in that case. I would post the code, but 1) I'm sure you don't need it, and 2) it's against the Project-Euler tag rules. If you would like, I can email it to you, but I'm sure that won't be necessary.
You could save the numbers in a file (with a number on each line), and read from it:
nums = []
with open('numbers.txt', 'r') as f:
    for num in f:
        nums.append(int(num))
# nums is now populated with all of the numbers, so do your actual algorithm
Also, it looks like you want to store the sum as an array of digits. The cool thing about Python is that it automatically handles large integers. Here is a quote from the docs:
Plain integers (also just called integers) are implemented using long in C, which gives them at least 32 bits of precision (sys.maxint is always set to the maximum plain integer value for the current platform, the minimum value is -sys.maxint - 1). Long integers have unlimited precision.
So using an array of digits isn't really necessary if you are working with Python. In C, it is another story...
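For example:
print 2 ** 100  # 1267650600228229401496703205376 -- no overflow, no digit array needed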
Also, regarding your code, you need to factor in the digits in tmp_sum, which contains your carry-over digits. You can add them into result_sum like this:
while tmp_sum:
    result_sum.insert(0, int(tmp_sum % 10))
    tmp_sum /= 10
This will fix your issue. Here, it works.
Since you already have all the numbers in a list, you should be able to take the sum of them pretty easily. Then you just need to take the first ten digits of the sum. I won't put any code here, though.
As simple as this:
values.txt will contain all the numbers, one per line.
nums = []
with open("values.txt", 'r') as f:
    for num in f:
        nums.append(int(num))
print(str(sum(nums))[:10])
Just as easy is storing the numbers in a csv file and using pandas:
def foo():
    import pandas as pd
    # read the single column of numbers
    table = pd.read_csv("data.txt", header=None, usecols=[0])
    # then iterate through the pandas dataframe, accumulating the total
    total = 0
    for x in range(len(table)):
        total += int(table[0][x])
    return str(total)[:10]
Just keep in mind that Python handles the large digits for you.

More efficient solution to loop nesting required

I am trying to compare two files. Here are the two files' contents:
File 1          File 2
"d.complex.1"   "d.complex.1"
1               4
5               5
48              47
65              21
d.complex.10    d.complex.10
46              5
21              46
109             121
192             192
There are 2000 d.complex entries in each file in total. I am trying to compare both files, but each d.complex group in the first file has to be checked against all 2000 d.complex entries in the second file, and the values that do not match are to be printed out. For example, in the files above, in file1's d.complex.1 the number 48 is not present in file2's d.complex.1, so that number has to be stored in a list (to print out later). Then the same d.complex.1 has to be compared with d.complex.10 of file2, and since 1, 48 and 65 are not there, they have to be appended to the list.
The method I chose to achieve this was to use sets and then do an intersection. The code I wrote was:
import string

first_complex = open("file1.txt", "r")
first_complex_lines = first_complex.readlines()
first_complex_lines = map(string.strip, first_complex_lines)
first_complex.close()

second_complex = open("file2.txt", "r")
second_complex_lines = second_complex.readlines()
second_complex_lines = map(string.strip, second_complex_lines)
second_complex.close()

list_1 = []
list_2 = []

res_1 = []
for line in first_complex_lines:
    if line.startswith("d.complex"):
        res_1.append([])
    res_1[-1].append(line)

res_2 = []
for line in second_complex_lines:
    if line.startswith("d.complex"):
        res_2.append([])
    res_2[-1].append(line)

h = len(res_1)
k = len(res_2)

for i in res_1:
    for j in res_2:
        print i[0]
        print j[0]
        target_set = set(i)
        target_set_1 = set(j)
        for s in target_set:
            if s not in target_set_1:
                print s
The above code is giving an output like this (just an example):
1
48
65
d.complex.1.dssp
d.complex.1.dssp
46
21
109
d.complex.1.dssp
d.complex.1.dssp
d.complex.10.dssp
Though the above output is correct, I want a more efficient way of doing this; can anyone help me? Also, two d.complex.1.dssp lines are printed instead of one, which is also not good.
What I would like to have is:
d.complex.1
d.complex.1 (name from file2)
1
48
65
d.complex.1
d.complex.10 (name from file2)
1
48
65
I am new to Python, so my approach above might be flawed. Also, I have never used sets before :(. Can someone give me a hand here?
Pointers:
Use list comprehensions or generator expressions to simplify data processing; they're more readable.
Generate the sets just once.
Use functions so you don't repeat yourself, especially when doing the same task twice.
I've made a few assumptions about your input data, you might want to try something like this.
def parsefile(filename):
    ret = {}
    cur = None
    for line in (x.strip() for x in open(filename, 'r')):
        if line.startswith('d.complex'):
            cur = set()
            ret[line] = cur
        if cur is None or not line.isdigit():
            continue
        cur.add(int(line))
    return ret

def compareStructures(first, second):
    # Iterate through key,value pairs in first
    for firstcmplx, firstmembers in first.iteritems():
        # Iterate through key,value pairs in second
        for secondcmplx, secondmembers in second.iteritems():
            notinsecond = firstmembers - secondmembers
            if notinsecond:
                # There are items in first that aren't in second
                print firstcmplx
                print secondcmplx
                print "\n".join([str(x) for x in notinsecond])

first = parsefile("myFirstFile.txt")
second = parsefile("mySecondFile.txt")
compareStructures(first, second)
Edited for fixes.. shows how much I rely on running the code to test it :) Thanks Alex
There's already a good answer, by @MattH, focused on the Python details of your problem; while it can be improved in several details, the improvements would only gain you some percentage points in efficiency -- worthwhile, but not great.
The only hope for a huge boost in efficiency (as opposed to "kai-zen" incremental improvement) is a drastic change in the algorithm -- which may or may not be possible depending on characteristics of your data that you do not reveal, and some details about your precise requirements.
The crucial part is: roughly, what range of numbers will be present in the file, and roughly, how many numbers per "d.complex.N" stanza? You already told us there are going to be about 2000 stanzas per file (and that's also crucial of course) and the impression is that in each file they're going to be ordered by contiguous increasing N -- 1, 2, 3, and so on (is that so?).
Your algorithm builds two maps stanza->numbers (not with top efficiency, but that's what #MattH's answer focuses on enhancing) so then inevitably it needs N squared stanza-to-stanza checks -- as N is 2,000, it needs 4 million such checks.
Consider building reversed maps, number->stanzas -- if the range of numbers and the typical size of (amount of numbers in) a stanza are both reasonably limited, those will be more compact. For example, if the numbers are between 1 and 200, and there are about 4 numbers per stanza, this implies a number will typically be in (2000 * 4) / 200 -> 40 stanzas, so such a mapping would have 200 entries of about 40 stanzas each. It only needs 200 squared (40,000) checks, rather than 4 million, to obtain the joint information for each number. (Then, depending on the exact output format you need, formatting that info may require very substantial effort again -- if you absolutely require 4 million "stanza-pair" sections as the final result, then of course there's no way to avoid 4 million output operations, which will inevitably be very costly.)
But all of this depends on those numbers that you're not telling us -- average stanza population, and range of numbers in the files, as well as details on what constraints you must absolutely respect for output format (if the numbers are reasonable, the output format constraints are going to be the key constraint on the big-O performance you can get out of any program).
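For illustration, the reversed mapping could be built from the stanza->numbers dicts that parsefile above produces in a few lines (just a sketch of the idea, not a tuned implementation):
from collections import defaultdict

def invert(stanza_map):
    # number -> set of stanza names that contain that number
    inverted = defaultdict(set)
    for stanza, numbers in stanza_map.iteritems():
        for n in numbers:
            inverted[n].add(stanza)
    return inverted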
Remember, to quote Fred Brooks:
Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowcharts; they'll be obvious.
Brooks was writing in the '60s (though his collection of essays, "The Mythical Man-Month", was published later, in the '70s), whence the quaint use of "flowcharts" (where we'd say code or algorithms) and "tables" (where we'd say data or data structures) -- but the general concept is still perfectly valid: the organization and nature of your data, in all kinds of programs focused on data processing (such as yours), can be even more important than the organization of the code, especially since it constrains the latter;-).
