I am trying to compare two files. Here are the contents of the two files:
File 1              File 2
"d.complex.1"       "d.complex.1"
1                   4
5                   5
48                  47
65                  21
d.complex.10        d.complex.10
46                  5
21                  46
109                 121
192                 192
There are 2000 d.complex entries in total in each file. I am trying to compare the two files, but the catch is that the values listed under d.complex.1 in the first file have to be checked against all 2000 d.complex entries in the second file, and any entry that does not match has to be printed out. For example, in the files above, the number 48 under d.complex.1 in file1 is not present under d.complex.1 in file2, so that number has to be stored in a list (to print out later). Then the same d.complex.1 has to be compared with d.complex.10 of file2, and since 1, 48 and 65 are not there, they have to be appended to the list.
The method I chose to achieve this was to use sets and then do an intersection. The code I wrote was:
import string

first_complex = open("file1.txt", "r")
first_complex_lines = first_complex.readlines()
first_complex_lines = map(string.strip, first_complex_lines)
first_complex.close()

second_complex = open("file2.txt", "r")
second_complex_lines = second_complex.readlines()
second_complex_lines = map(string.strip, second_complex_lines)
second_complex.close()

list_1 = []
list_2 = []

res_1 = []
for line in first_complex_lines:
    if line.startswith("d.complex"):
        res_1.append([])
    res_1[-1].append(line)

res_2 = []
for line in second_complex_lines:
    if line.startswith("d.complex"):
        res_2.append([])
    res_2[-1].append(line)

h = len(res_1)
k = len(res_2)

for i in res_1:
    for j in res_2:
        print i[0]
        print j[0]
        target_set = set(i)
        target_set_1 = set(j)
        for s in target_set:
            if s not in target_set_1:
                print s
The above code is giving an output like this (just an example):
1
48
65
d.complex.1.dssp
d.complex.1.dssp
46
21
109
d.complex.1.dssp
d.complex.1.dssp
d.complex.10.dssp
Though the above output is correct, I want a more efficient way of doing this; can anyone help me? Also, two d.complex.1.dssp lines are printed instead of one, which is not good.
What I would like to have is:
d.complex.1
d.complex.1 (name from file2)
1
48
65
d.complex.1
d.complex.10 (name from file2)
1
48
65
I am new to Python, so my approach above might be flawed. Also, I have never used sets before :(. Can someone give me a hand here?
Pointers:
Use list comprehensions or generator expressions to simplify data processing; they are more readable.
Just generate the sets once.
Use functions to not repeat yourself, especially doing the same task twice.
I've made a few assumptions about your input data; you might want to try something like this.
def parsefile(filename):
    ret = {}
    cur = None
    for line in (x.strip() for x in open(filename, 'r')):
        if line.startswith('d.complex'):
            cur = set()
            ret[line] = cur
        if cur is None or not line.isdigit():
            continue
        cur.add(int(line))
    return ret

def compareStructures(first, second):
    # Iterate through key,value pairs in first
    for firstcmplx, firstmembers in first.iteritems():
        # Iterate through key,value pairs in second
        for secondcmplx, secondmembers in second.iteritems():
            notinsecond = firstmembers - secondmembers
            if notinsecond:
                # There are items in first that aren't in second
                print firstcmplx
                print secondcmplx
                print "\n".join([str(x) for x in notinsecond])

first = parsefile("myFirstFile.txt")
second = parsefile("mySecondFile.txt")
compareStructures(first, second)
Edited for fixes.. shows how much I rely on running the code to test it :) Thanks Alex
There's already a good answer, by @MattH, focused on the Python details of your problem, and while it can be improved in several respects, those improvements would only gain you some percentage points in efficiency -- worthwhile, but not a huge win.
The only hope for a huge boost in efficiency (as opposed to "kai-zen" incremental improvement) is a drastic change in the algorithm -- which may or may not be possible depending on characteristics of your data that you do not reveal, and some details about your precise requirements.
The crucial part is: roughly, what range of numbers will be present in the file, and roughly, how many numbers per "d.complex.N" stanza? You already told us there are going to be about 2000 stanzas per file (and that's also crucial of course) and the impression is that in each file they're going to be ordered by contiguous increasing N -- 1, 2, 3, and so on (is that so?).
Your algorithm builds two maps stanza -> numbers (not with top efficiency, but that's what @MattH's answer focuses on enhancing), so it then inevitably needs N squared stanza-to-stanza checks -- as N is 2,000, it needs 4 million such checks.
Consider building reversed maps, number -> stanzas: if the range of numbers and the typical size of a stanza (the amount of numbers in it) are both reasonably limited, those maps will be more compact. For example, if the numbers are between 1 and 200 and there are about 4 numbers per stanza, a given number will typically be in (2000 * 4) / 200 -> 40 stanzas, so such a mapping would have 200 entries of about 40 stanzas each. It then only needs 200 squared (40,000) checks, rather than 4 million, to obtain the joint information for each number. (Then, depending on your exact output-format needs, formatting that information may require substantial effort again -- if you absolutely require all 4 million "stanza-pair" sections as the output, then of course there's no way to avoid 4 million output operations, which will inevitably be very costly.)
But all of this depends on those numbers that you're not telling us -- average stanza population, and range of numbers in the files, as well as details on what constraints you must absolutely respect for output format (if the numbers are reasonable, the output format constraints are going to be the key constraint on the big-O performance you can get out of any program).
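For concreteness, here is a minimal sketch of the reversed-map idea (the helper name is mine, the file names are taken from the question's code, and the expensive per-pair output formatting is deliberately left out):
from collections import defaultdict

def number_to_stanzas(filename):
    # Map each number to the set of stanza names it appears under.
    mapping = defaultdict(set)
    stanza = None
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if line.startswith('d.complex'):
                stanza = line
            elif stanza is not None and line.isdigit():
                mapping[int(line)].add(stanza)
    return mapping

map1 = number_to_stanzas("file1.txt")
map2 = number_to_stanzas("file2.txt")
all_stanzas_2 = set().union(*map2.values())

# A number n from file1 is "missing" from exactly those file2 stanzas
# that are not in map2[n]; no per-stanza-pair set intersections needed.
missing = {}   # number -> set of file2 stanzas it is absent from
for n in map1:
    missing[n] = all_stanzas_2 - map2.get(n, set())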
Remember, to quote Fred Brooks:
Show me your flowcharts and conceal
your tables, and I shall continue to
be mystified. Show me your tables, and
I won’t usually need your flowcharts;
they’ll be obvious.
Brooks was writing in the '60s (though his collection of essays, "The Mythical Man-Month", was published later, in the '70s), whence the quaint use of "flowcharts" (where we'd say code or algorithms) and "tables" (where we'd say data or data structures) -- but the general concept is still perfectly valid: the organization and nature of your data, in all kinds of programs focused on data processing (such as yours), can be even more important than the organization of the code, especially since it constrains the latter;-).
I'm trying to standardize a dataframe column of international phone numbers. I've managed to get rid of everything else except for duplicate dialing codes.
For instance, some German numbers are in the following format: "00 49 49 7 123 456 789", i.e. they contain two consecutive dialing codes (49). I was wondering if there's an easy fix to get rid of the duplicate and leave the number as "00 49 7 123 456 789".
I have tried some regex and itertools.groupby solutions, however with no success, as the variations in the different dialing codes cause issues.
I would appreciate any help, thank you.
This is a very data-driven problem, therefore the solution may change a lot depending on the actual data you are dealing with. Anyway, this will do what you want to achieve:
number = "00 49 49 7 123 456 789"
# Split the number to a list of number parts
num_parts = number.split()
# Define the latest position to look for a dial number
dial_code_max_pos = 4
# Iterator of tuples in the form of (number_part, next_number_part)
first_parts_with_sibling = zip(
num_parts[:dial_code_max_pos],
num_parts[1:dial_code_max_pos]
)
# Re-build the start of the number but removing the parts
# that have an identical right-sibling
first_parts_with_no_duplicates = [
num[0] for num in first_parts_with_sibling
if len(set(num)) > 1 # This is the actual filter
]
# Compose back the number
number = ' '.join(first_parts_with_no_duplicates + num_parts[dial_code_max_pos - 1:])
Again, this kind of normalization is very dangerous in production: you could end up losing valuable data to an algorithm that does not cover every possible kind of input.
As @Clément said in his comment, be sure to make a few checks on the original number (e.g. its length) before applying any transformation.
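For example, a minimal sanity check along those lines might look like this (the helper name and the digit thresholds are purely illustrative):
def plausible_length(number, min_digits=8, max_digits=17):
    # Very rough guard: only all-digit, plausibly sized numbers get normalized.
    digits = number.replace(' ', '')
    return digits.isdigit() and min_digits <= len(digits) <= max_digits

print(plausible_length("00 49 49 7 123 456 789"))  # True -> safe to de-duplicate
print(plausible_length("00 49"))                   # False -> leave it alone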
I came across this question where 8 queens should be placed on a chessboard such that none can kill each other. This is how I tried to solve it:
import itertools

def allAlive(position):
    qPosition = []
    for i in range(8):
        qPosition.append(position[2*i:(2*i)+2])
    hDel = list(qPosition)  # Horizontal
    for i in range(8):
        a = hDel[0]
        del hDel[0]
        l = len(hDel)
        for j in range(l):
            if a[:1] == hDel[j][:1]:
                return False
    vDel = list(qPosition)  # Vertical
    for i in range(8):
        a = vDel[0]
        l = len(vDel)
        for j in range(l):
            if a[1:2] == vDel[j][1:2]:
                return False
    cDel = list(qPosition)  # Cross
    for i in range(8):
        a = cDel[0]
        l = len(cDel)
        for j in range(l):
            if abs(ord(a[:1]) - ord(cDel[j][:1])) == 1 and abs(int(a[1:2]) - int(cDel[j][1:2])) == 1:
                return False
    return True

chessPositions = ['A1','A2','A3','A4','A5','A6','A7','A8','B1','B2','B3','B4','B5','B6','B7','B8','C1','C2','C3','C4','C5','C6','C7','C8','D1','D2','D3','D4','D5','D6','D7','D8','E1','E2','E3','E4','E5','E6','E7','E8','F1','F2','F3','F4','F5','F6','F7','F8','G1','G2','G3','G4','G5','G6','G7','G8','H1','H2','H3','H4','H5','H6','H7','H8']
qPositions = [''.join(p) for p in itertools.combinations(chessPositions, 8)]
for i in qPositions:
    if allAlive(i) == True:
        print(i)
Traceback (most recent call last):
qPositions=[''.join(p) for p in itertools.combinations(chessPositions,8)]
MemoryError
I'm still a newbie. How can I overcome this error? Or is there any better way to solve this problem?
What you are trying to do is impossible ;)!
qPositions=[''.join(p) for p in itertools.combinations(chessPositions,8)]
means that you will get a list with length 64 choose 8 = 4426165368, since len(chessPositions) = 64, which you cannot store in memory. Why not? Combining what I stated in the comments and what @augray said in his answer, the result of the above operation would be a list which would take
(64 choose 8) * 2 * 8 bytes ~ 66GB
of RAM, since it will have 64 choose 8 elements, each element will have 8 substrings like 'A1', and each such substring consists of 2 characters. One character takes 1 byte.
You have to find another way. I am not answering that because that is your job. The n-queens problem is a classic constraint-satisfaction problem, usually solved with backtracking rather than brute force. I suggest you google 'n queens problem python' and search for an answer. Then try to understand the code and the backtracking technique.
I did some searching for you; take a look at this video. As suggested by @Jean-François Fabre: backtracking. Your job now is to watch the video once, twice, ... as many times as it takes until you understand the solution to the problem. Then open up your favourite editor (mine is Vi :D) and code it down!
This is one case where it's important to understand the "science" (or more accurately, math) part of computer science as much as it is important to understand the nuts and bolts of programming.
From the documentation for itertools.combinations, we see that the number of items returned is n! / r! / (n-r)!, where n is the length of the input collection (in your case the number of chess positions, 64) and r is the length of the subsequences you want returned (in your case 8). As @campovski has pointed out, this results in 4,426,165,368. Each returned subsequence will consist of 8*2 characters, each of which is a byte (not to mention the overhead of the other data structures needed to hold these and calculate the answer). Each character is 1 byte, so in total, just counting the memory consumption of the resulting subsequences gives 4,426,165,368 * 2 * 8 = 70,818,645,888. Dividing this by 1024^3 gives the number of gigabytes of memory held by these subsequences: about 66 GB.
I'm assuming you don't have that much memory :-). Calculating the answer to this question will require a well-thought-out algorithm, not just "brute force". I recommend doing some research on the problem; Wikipedia looks like a good place to start.
As the other answers stated, you can't fit every combination in memory, and you shouldn't use brute force because it will be slow. However, if you do want to go the brute-force route, you can constrain the problem: eliminate duplicate rows and columns up front and only check the diagonals.
from itertools import permutations
#All possible letters
letters = ['a','b','c','d','e','f','g','h']
#All possible numbers
numbers = [str(i) for i in range(1,len(letters)+1)]
#All possible permutations, given rows != each other and columns != each other
r = [zip(letters, p) for p in permutations(numbers,8)]
#Formatted for your function
points = [''.join([''.join(z) for z in b]) for b in r]
Also as a note, this line of code attempts to first find all of the combinations, then feed your function, which is a waste of memory.
qPositions=[''.join(p) for p in itertools.combinations(chessPositions,8)]
If you decide you do want to use a brute-force method, it is possible. Just don't materialize all the combinations up front: iterate over them lazily and feed your check function one candidate at a time.
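As a rough sketch of that idea (still brute force, just lazy): generate one row-permutation at a time, so rows and columns are distinct by construction, and only the diagonals need checking; nothing large is ever stored.
from itertools import permutations

def no_two_on_a_diagonal(cols):
    # cols[i] is the column index (0-7) of the queen in row i
    return all(abs(cols[i] - cols[j]) != abs(i - j)
               for i in range(8) for j in range(i + 1, 8))

solutions = 0
for cols in permutations(range(8)):   # candidates are consumed one at a time
    if no_two_on_a_diagonal(cols):
        solutions += 1
print(solutions)   # 92 solutions on the standard 8x8 board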
I am trying to find the intersecting subset between two pretty big CSV files of phone numbers (one has 600k rows, the other has 300 million). I am currently using pandas to open both files, converting the needed columns into 1-D NumPy arrays, and then using NumPy's intersect to get the intersection. Is there a better way of doing this, either with Python or any other method? Thanks for any help.
import pandas as pd
import numpy as np
df_dnc = pd.read_csv('dncTest.csv', names = ['phone'])
df_test = pd.read_csv('phoneTest.csv', names = ['phone'])
dnc_phone = df_dnc['phone']
test_phone = df_test['phone']
np.intersect1d(dnc_phone, test_phone)
I will give you a general solution with some Python pseudo-code. What you are trying to solve here is the classical problem from the book "Programming Pearls" by Jon Bentley.
This is solved very efficiently with just a simple bit array, hence my comment asking how long the phone numbers are (how many digits they have).
Let's say the phone number is at most 10 digits long; then the max phone number you can have is 9 999 999 999 (spaces are used for better readability). Here we can use 1 bit per number to identify whether the number is in the set or not (bit set or not set, respectively), thus we are going to use 10 000 000 000 bits, one per possible number, i.e.:
bits[0] identifies the number 0 000 000 000
bits[193] identifies the number 0 000 000 193
having a number 659 234-4567 would be addressed by the bits[6592344567]
Doing so, we'd need to pre-allocate about 10 000 000 000 bits initially set to 0, which is 10 000 000 000 / 8 / 1024 / 1024 / 1024 ≈ 1.2 GB of memory.
Holding the raw numbers themselves in memory is not free either: even just the 600k ints from the smaller file take 64 bits * 600k ≈ 4.8 MB as raw machine words, and Python ints and sets are not stored that efficiently, so the real figure is much higher; if they are kept as strings, you'll probably end up with even greater memory requirements, and a plain set of the 300 million numbers from the larger file would far exceed the 1.2 GB of the bit array.
Parsing a phone number string from the CSV file (line by line or with a buffered file reader), converting it to a number and then doing a constant-time memory lookup will be, IMO, faster than dealing with strings and merging them. Unfortunately, I don't have these phone number files to test, but I would be interested to hear your findings.
from bitstring import BitArray

max_number = 9999999999
found_phone_numbers = BitArray(length=max_number + 1)

# replace this function with the file-open function that retrieves
# the next found phone number
def number_from_file_iterator(dummy_data):
    for number in dummy_data:
        yield number

def calculate_intersect():
    # should open file1 and get the generator of numbers from it;
    # we use dummy data here
    for number in number_from_file_iterator([1, 25, 77, 224322323, 8292, 1232422]):
        found_phone_numbers[number] = True

    # open the second file and check if the number is there
    for number in number_from_file_iterator([4, 24, 224322323, 1232422, max_number]):
        if found_phone_numbers[number]:
            yield number

number_intersection = set(calculate_intersect())
print number_intersection
I used BitArray from the bitstring pip package, and it needed around 2 seconds to initialize the entire bit string. Afterwards, scanning the file uses constant memory. At the end I used a set to store the items.
Note 1: This algorithm can be modified to just use a list. In that case, in the second loop, as soon as a number's bit matches, that bit must be reset so that duplicates do not match again (see the sketch after these notes).
Note 2: Storing into the set/list happens lazily, because we use a generator in the second for loop. Runtime complexity is linear, i.e. O(N).
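Continuing with the names from the snippet above (and dummy data again, this time with a duplicate), Note 1 might look roughly like this:
# Collect matches in a plain list and clear each bit on its first hit,
# so a duplicate in the second file is not reported twice.
matches = []
for number in number_from_file_iterator([4, 24, 224322323, 224322323, 1232422]):
    if found_phone_numbers[number]:
        matches.append(number)
        found_phone_numbers[number] = False  # reset the bit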
Read the 600k phone numbers into a set.
Input the larger file row by row, checking each row against the set.
Write matches to an output file immediately.
That way you don't have to load all the data in memory at once.
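A minimal sketch of that approach, reusing the CSV file names from the question (the output file name 'matches.csv' is made up):
# Build a set from the small file, then stream the large file once.
with open('dncTest.csv') as small_file:
    dnc_phones = set(line.strip() for line in small_file)

with open('phoneTest.csv') as big_file, open('matches.csv', 'w') as out:
    for line in big_file:
        phone = line.strip()
        if phone in dnc_phones:
            out.write(phone + '\n')  # write matches immediately, keep memory flat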
I am solving homework 1 of the Caltech Machine Learning course (http://work.caltech.edu/homework/hw1.pdf). To solve questions 7-10 we need to implement a PLA (perceptron learning algorithm). This is my implementation in Python:
import sys, math, random

w = []       # stores the weights
data = []    # stores the vector X(x1,x2,...)
output = []  # stores the output(y)

# returns 1 if dot product is more than 0
def sign_dot_product(x):
    global w
    dot = sum([w[i]*x[i] for i in xrange(len(w))])
    if dot > 0:
        return 1
    else:
        return -1

# checks if a point is misclassified
def is_misclassified(rand_p):
    return (True if sign_dot_product(data[rand_p]) != output[rand_p] else False)

# loads data in the following format:
# x1 x2 ... y
# In the present case for d=2
# x1 x2 y
def load_data():
    f = open("data.dat", "r")
    global w
    for line in f:
        data_tmp = ([1] + [float(x) for x in line.split(" ")])
        data.append(data_tmp[0:-1])
        output.append(data_tmp[-1])

def train():
    global w
    w = [random.uniform(-1, 1) for i in xrange(len(data[0]))]  # initializes w with random weights
    iter = 1
    while True:
        rand_p = random.randint(0, len(output)-1)  # randomly picks a point
        check = [0]*len(output)  # check is a list. The ith location is 1 if the ith point is correctly classified
        while not is_misclassified(rand_p):
            check[rand_p] = 1
            rand_p = random.randint(0, len(output)-1)
            if sum(check) == len(output):
                print "All points successfully satisfied in ", iter-1, " iterations"
                print iter-1, w, data[rand_p]
                return iter-1
        sign = output[rand_p]
        w = [w[i] + sign*data[rand_p][i] for i in xrange(len(w))]  # changing weights
        if iter > 1000000:
            print "greater than 1000"
            print w
            return 10000000
        iter += 1

load_data()

def simulate():
    #tot_iter=train()
    tot_iter = sum([train() for x in xrange(100)])
    print float(tot_iter)/100

simulate()
The problem: according to the answer to question 7, it should take around 15 iterations for the perceptron to converge on a training set of that size, but my implementation takes an average of 50,000 iterations. The training data is supposed to be randomly generated, but I am generating data for simple lines such as x=4, y=2, etc. Is this the reason I am getting the wrong answer, or is something else wrong? A sample of my training data (separable using y=2):
1 2.1 1
231 100 1
-232 1.9 -1
23 232 1
12 -23 -1
10000 1.9 -1
-1000 2.4 1
100 -100 -1
45 73 1
-34 1.5 -1
It is in the format x1 x2 output(y)
It is clear that you are doing a great job learning both Python and classification algorithms with your effort.
However, because of some stylistic inefficiencies in your code, it is difficult to help you, and there is a chance that part of the problem is a miscommunication between you and the professor.
For example, does the professor wish for you to use the Perceptron in "online mode" or "offline mode"? In "online mode" you should move sequentially through the data point and you should not revisit any points. From the assignment's conjecture that it should require only 15 iterations to converge, I am curious if this implies the first 15 data points, in sequential order, would result in a classifier that linearly separates your data set.
By instead sampling randomly with replacement, you might be causing yourself to take much longer (although, depending on the distribution and size of the data sample, this is admittedly unlikely since you'd expect roughly that any 15 points would do about as well as the first 15).
The other issue is that after you detect a correctly classified point (cases where not is_misclassified evaluates to True), if you then draw a new random point that is misclassified, your code drops down into the larger section of the outer while loop and then goes back to the top, where it overwrites the check vector with all 0s.
This means that the only way your code will detect that it has correctly classified all the points is if, within a single run of the inner while loop, the random sampling happens to visit every point and finds each of them correctly classified, without ever drawing a misclassified point that would reset the process.
I can't quite formalize why I think that will make the program take much longer, but it seems like your code is requiring a much stricter form of convergence, where it sort of has to learn everything all at once on one monolithic pass way late in the training stage after having been updated a bunch already.
One easy way to check if my intuition about this is crappy would be to move the line check=[0]*len(output) outside of the while loop altogether and only initialize it one time.
Some general advice to make the code easier to manage:
Don't use global variables. Instead, let the function that loads and preps the data return the data structures it builds.
There are a few places where you say, for example,
return (True if sign_dot_product(data[rand_p])!=output[rand_p] else False)
This kind of thing can be simplified to
return sign_dot_product(data[rand_p]) != output[rand_p]
which is easier to read and conveys what criteria you're trying to check for in a more direct manner.
I doubt efficiency plays an important role since this seems to be a pedagogical exercise, but there are a number of ways to refactor your use of list comprehensions that might be beneficial. And if possible, just use NumPy which has native array types. Witnessing how some of these operations have to be expressed with list operations is lamentable. Even if your professor doesn't want you to implement with NumPy because she or he is trying to teach you pure fundamentals, I say just ignore them and go learn NumPy. It will help you with jobs, internships, and practical skill with these kinds of manipulations in Python vastly more than fighting with the native data types to do something they were not designed for (array computing).
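To make the NumPy suggestion concrete, here is a minimal vectorized PLA sketch; it is not the course's reference solution, the function name is mine, and the toy data set at the bottom is made up but linearly separable:
import numpy as np

def train_perceptron(X, y, max_iter=100000):
    # X: (n_samples, n_features) with a leading column of ones for the bias term
    # y: (n_samples,) array of +1/-1 labels
    w = np.zeros(X.shape[1])
    for it in range(max_iter):
        preds = np.sign(X.dot(w))
        preds[preds == 0] = -1        # count "on the boundary" as misclassified
        wrong = np.where(preds != y)[0]
        if wrong.size == 0:
            return w, it              # converged: no misclassified points left
        i = np.random.choice(wrong)   # classic PLA: pick one misclassified point
        w += y[i] * X[i]              # and update the weights
    return w, max_iter

X = np.array([[1.0,  2.0,  3.0],
              [1.0, -1.0, -2.0],
              [1.0,  3.0,  1.0],
              [1.0, -2.0, -1.0]])
y = np.array([1, -1, 1, -1])
w, iterations = train_perceptron(X, y)
print(w, iterations)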
I'm writing a Python script that takes in a (potentially large) file. Here is an example of a way that input file could be formatted:
class1 1:v1 2:v2 3:v3 4:v4 5:v5
class2 1:v6 4:v7 5:v8 6:v9
class1 3:v10 4:v11 5:v12 6:v13 8:v14
class2 1:v15 2:v16 3:v17 5:v18 7:v19
Where class1 and class2 are some number, e.g. 1 and -1. (A curious user may notice that this is a LIBSVM-related file, but knowing the software isn't necessary in this case.) The values v1, v2, ..., v19 represent any integer or float value. Obviously, my files would be much larger than this, in terms of total lines and length per line, which is why I'm concerned about efficiency here.
I am trying to check what is the greatest value to the left of a colon. In LIBSVM, these are called "features" and are always integers here. For instance, in the example I outlined above, line 1 has 5 as its largest feature. Line 2 has 6 as its largest feature, line 3 has 8 as its largest feature, and finally, line 4 has 7 as its largest feature. Since 8 is the largest of these values, that is my desired value. I'm looking at a file with possibly thousands of features per line, and many hundreds of thousands of lines.
The file satisfies the following properties:
The features must be strictly increasing. I.e. "3:v1 4:v2" is allowed, but not "3:v1 3:v2."
The features are not necessarily consecutive and can be skipped. In the first example I gave, the first line has its features in consecutive order (1,2,3,4,5) and skips features 6, 7, and 8. The other 3 lines do not have their features in consecutive order. That's okay, as long as those features are strictly increasing.
Right now, my approach is to check each line, split up each line by a space, split up the final term by a colon, and then check the feature value. Following that, I do a procedure to check the maximum such featureNum.
file1 = open(...)
max = 0
for line in file1:
    linesplit = line.rstrip('\n').split(' ')
    val = linesplit[len(linesplit) - 1]
    valsplit = val.split(':')
    featureNum = valsplit[0]
    if (featureNum > max):
        max = featureNum
print max
file1.close()
But I'm hoping there is a better or more efficient way of doing this, e.g. some way of analyzing the file by only getting those terms that directly precede a newline character (maybe to avoid reading all the lines?). I'm new to Python so it wouldn't surprise me if I missed something obvious.
Possible reference: http://docs.python.org/library/stdtypes.html
Since you don't care about all the features in a line but just the last one, you don't need to split the whole line. I don't know if this is actually faster, though; you need to time it and see. It definitely isn't as Pythonic as splitting the entire line.
def last_feature(line):
    start = line.rfind(' ') + 1
    end = line.rfind(':')
    return int(line[start:end])

with open(...) as file1:
    largest = max(last_feature(line) for line in file1)