Create name value pairs in python - python

I have a python script that has the following output stored in a variable called jvmData:
Stats name=jvmRuntimeModule, type=jvmRuntimeModule#
{
name=HeapSize, ID=1, description=The total memory (in KBytes) in the Java virtual machine run time., unit=KILOBYTE, type=BoundedRangeStatistic, lowWaterMark=1048576, highWaterMark=1048576, current=1048576, integral=0.0, lowerBound=1048576, upperBound=2097152
name=FreeMemory, ID=2, description=The free memory (in KBytes) in the Java virtual machine run time., unit=KILOBYTE, type=CountStatistic, count=348466
name=UsedMemory, ID=3, description=The amount of used memory (in KBytes) in the Java virtual machine run time., unit=KILOBYTE, type=CountStatistic, count=700109
name=UpTime, ID=4, description=The amount of time (in seconds) that the Java virtual machine has been running., unit=SECOND, type=CountStatistic, count=3706565
name=ProcessCpuUsage, ID=5, description=The CPU Usage (in percent) of the Java virtual machine., unit=N/A, type=CountStatistic, count=0
}
What I would like to do is simply print out name/value pairs for the important parts, which in this case would simply be:
HeapSize=1048576
FreeMemory=348466
UsedMemory=700109
UpTime=3706565
ProcessCpuUsage=0
Im not at all good with python :) The only solution in my head seems very long-winded? Split the lines, throw away first, second and last lines, then loop through each line with different cases (sometimes current, sometimes count) for finding the length of string, etc etc
Perhaps (well definitely) I am missing something some nice function I can use to put these into the equivalent of a java hashmap or something?

The "equivalent of a java HashMap" would be called a dictionary in python. As for how to parse this, just iterate over the lines that contain the data, make a dict of all key/value pairs in the line and have a special case for the HeapSize:
jvmData = "..." #the string holding the data
jvmLines = jvmData.split("\n") #a list of the lines in the string
lines = [l.strip() for l in jvmLines if "name=" in l] #filter all data lines
result = {}
for line in lines:
data = dict(s.split("=") for s in line.split(", "))
#the value is in "current" for HeapSize or in "count" otherwise
value = data["current"] if data["name"] == "HeapSize" else data["count"]
result[data["name"]] = value
As you seem to be stuck on Jython2.1, here's a version that should work with it (obviously untested). Basically the same as above, but with the list comprehension and generator expression replaced by filter and map respectively, and without using the ternary if/else operator:
jvmData = "..." #the string holding the data
jvmLines = jvmData.split("\n") #a list of the lines in the string
lines = filter(lambda x: "name=" in x, jvmLines) #filter all data lines
result = {}
for line in lines:
data = dict(map(lambda x: x.split("="), line.split(", ")))
if data["name"] == "HeapSize":
result[data["name"]] = data["current"]
else:
result[data["name"]] = data["count"]

Try something using find function and small re:
import re
final_map = {}
NAME= 'name='
COUNT= 'count='
HIGHWATERMARK= "highWaterMark="
def main():
with open(r'<file_location>','r') as file:
lines = [line for line in file if re.search(r'^name', line)]
for line in lines:
sub = COUNT if line.find(COUNT) != -1 else HIGHWATERMARK
final_map[line[line.find(NAME)+5:line.find(',')]] = line[line.find(sub)+len(sub):].split(',')[0].strip()
print line[line.find(NAME)+5:line.find(',')]+'='+final_map[line[line.find(NAME)+5:line.find(',')]]
if __name__ == '__main__':
main()
Output:
HeapSize=1048576
FreeMemory=348466
UsedMemory=700109
UpTime=3706565
ProcessCpuUsage=0

Related

How to iterate over two files effectively (25000+ Lines)

So, I am trying to make a combined list inside of Python for matching data of about 25,000 lines.
The first list data came from file mac.uid and looks like this
Mac|ID
The second list data came serial.uid and looks like this:
Serial|Mac
Mac from list 1 must equal the Mac from list 2 before it's joined.
This is what I am currently doing, I believe there is too much repetition going on.
combined = [];
def combineData():
lines = open('mac.uid', 'r+')
for line in lines:
with open('serial.uid', 'r+') as serial:
for each in serial:
a, b = line.strip().split('|')
a = a.lower()
x, y = each.strip().split('|')
y = y.lower()
if a == y:
combined.append(a+""+b+""+x)
The final list is supposed to look like this:
Mac(List1), ID(List1), Serial(List2)
So that I can import it into an excel sheet.
Thanks for any and all help!
Instead of your nested loops (which cause quadratic complexity) you should use dictionaries which will give you roughly O(n log(n)) complexity. To do so, first read serial.uid once and store the mapping of MAC addresses to serial numbers in a dict.
serial = dict()
with open('serial.uid') as istr:
for line in istr:
(ser, mac) = split_fields(line)
serial[mac] = ser
Then you can close the file again and process mac.uid looking up the serial number for each MAC address in the dictionary you've just created.
combined = list()
with open('mac.uid') as istr:
for line in istr:
(mac, uid) = split_fields(line)
combined.append((mac, uid, serial[mac]))
Note that I've changed combined from a list of strings to a list of tuples. I've also factored the splitting logic out into a separate function. (You'll have to put its definition before its use, of course.)
def split_fields(line):
return line.strip().lower().split('|')
Finally, I recommend that you start using more descriptive names for your variables.
For files of 25k lines, you should have no issues storing everything in memory. If your data sets become too large for that, you might want to start looking into using a database. Note that the Python standard library ships with an SQLite module.

Fast extraction of chunks of lines from large CSV file

I have a large CSV file full of stock-related data formatted as such:
Ticker Symbol, Date, [some variables...]
So each line starts of with the symbol (like "AMZN"), then has the date, then has 12 variables related to price or volume on the selected date. There are about 10,000 different securities represented in this file and I have a line for each day that the stock has been publicly traded for each of them. The file is ordered first alphabetically by ticker symbol and second chronologically by date. The entire file is about 3.3 GB.
The sort of task I want to solve would be to be able to extract the most recent n lines of data for a given ticker symbol with respect to the current date. I have code that does this, but based on my observations it seems to take, on average, around 8-10 seconds per retrieval (all tests have been extracting 100 lines).
I have functions I'd like to run that require me to grab such chunks for hundreds or thousands of symbols, and I would really like to reduce the time. My code is inefficient, but I am not sure how to make it run faster.
First, I have a function called getData:
def getData(symbol, filename):
out = ["Symbol","Date","Open","High","Low","Close","Volume","Dividend",
"Split","Adj_Open","Adj_High","Adj_Low","Adj_Close","Adj_Volume"]
l = len(symbol)
beforeMatch = True
with open(filename, 'r') as f:
for line in f:
match = checkMatch(symbol, l, line)
if beforeMatch and match:
beforeMatch = False
out.append(formatLineData(line[:-1].split(",")))
elif not beforeMatch and match:
out.append(formatLineData(line[:-1].split(",")))
elif not beforeMatch and not match:
break
return out
(This code has a couple of helper functions, checkMatch and formatLineData, which I will show below.) Then, there is another function called getDataColumn that gets the column I want with the correct number of days represented:
def getDataColumn(symbol, col=12, numDays=100, changeRateTransform=False):
dataset = getData(symbol)
if not changeRateTransform:
column = [day[col] for day in dataset[-numDays:]]
else:
n = len(dataset)
column = [(dataset[i][col] - dataset[i-1][col])/dataset[i-1][col] for i in range(n - numDays, n)]
return column
(changeRateTransform converts raw numbers into daily change rate numbers if True.) The helper functions:
def checkMatch(symbol, symbolLength, line):
out = False
if line[:symbolLength+1] == symbol + ",":
out = True
return out
def formatLineData(lineData):
out = [lineData[0]]
out.append(datetime.strptime(lineData[1], '%Y-%m-%d').date())
out += [float(d) for d in lineData[2:6]]
out += [int(float(d)) for d in lineData[6:9]]
out += [float(d) for d in lineData[9:13]]
out.append(int(float(lineData[13])))
return out
Does anyone have any insight on what parts of my code run slow and how I can make this perform better? I can't do the sort of analysis I want to do without speeding this up.
EDIT:
In response to the comments, I made some changes to the code in order to utilize the existing methods in the csv module:
def getData(symbol, database):
out = ["Symbol","Date","Open","High","Low","Close","Volume","Dividend",
"Split","Adj_Open","Adj_High","Adj_Low","Adj_Close","Adj_Volume"]
l = len(symbol)
beforeMatch = True
with open(database, 'r') as f:
databaseReader = csv.reader(f, delimiter=",")
for row in databaseReader:
match = (row[0] == symbol)
if beforeMatch and match:
beforeMatch = False
out.append(formatLineData(row))
elif not beforeMatch and match:
out.append(formatLineData(row))
elif not beforeMatch and not match:
break
return out
def getDataColumn(dataset, col=12, numDays=100, changeRateTransform=False):
if not changeRateTransform:
out = [day[col] for day in dataset[-numDays:]]
else:
n = len(dataset)
out = [(dataset[i][col] - dataset[i-1][col])/dataset[i-1][col] for i in range(n - numDays, n)]
return out
Performance was worse using the csv.reader class. I tested on two stocks, AMZN (near top of file) and ZNGA (near bottom of file). With the original method, the run times were 0.99 seconds and 18.37 seconds, respectively. With the new method leveraging the csv module, the run times were 3.04 seconds and 64.94 seconds, respectively. Both return the correct results.
My thought is that the time is being taken up more from finding the stock than from the parsing. If I try these methods on the first stock in the file, A, the methods both run in about 0.12 seconds.
When you're going to do lots of analysis on the same dataset, the pragmatic approach would be to read it all into a database. It is made for fast querying; CSV isn't. Use the sqlite command line tools, for example, which can directly import from CSV. Then add a single index on (Symbol, Date) and lookups will be practically instantaneous.
If for some reason that is not feasible, for example because new files can come in at any moment and you cannot afford the preparation time before starting your analysis of them, you'll have to make the best of dealing with CSV directly, which is what the rest of my answer will focus on. Remember that it's a balancing act, though. Either you pay a lot upfront, or a bit extra for every lookup. Eventually, for some amount of lookups it would have been cheaper to pay upfront.
Optimization is about maximizing the amount of work not done. Using generators and the built-in csv module aren't going to help much with that in this case. You'd still be reading the whole file and parsing all of it, at least for line breaks. With that amount of data, it's a no-go.
Parsing requires reading, so you'll have to find a way around it first. Best practices of leaving all intricacies of the CSV format to the specialized module bear no meaning when they can't give you the performance you want. Some cheating must be done, but as little as possible. In this case, I suppose it is safe to assume that the start of a new line can be identified as b'\n"AMZN",' (sticking with your example). Yes, binary here, because remember: no parsing yet. You could scan the file as binary from the beginning until you find the first line. From there read the amount of lines you need, decode and parse them the proper way, etc. No need for optimization there, because a 100 lines are nothing to worry about compared to the hundreds of thousands of irrelevant lines you're not doing that work for.
Dropping all that parsing buys you a lot, but the reading needs to be optimized as well. Don't load the whole file into memory first and skip as many layers of Python as you can. Using mmap lets the OS decide what to load into memory transparently and lets you work with the data directly.
Still you're potentially reading the whole file, if the symbol is near the end. It's a linear search, which means the time it takes is linearly proportional to the number of lines in the file. You can do better though. Because the file is sorted, you could improve the function to instead perform a kind of binary search. The number of steps that will take (where a step is reading a line) is close to the binary logarithm of the number of lines. In other words: the number of times you can divide your file into two (almost) equally sized parts. When there are one million lines, that's a difference of five orders of magnitude!
Here's what I came up with, based on Python's own bisect_left with some measures to account for the fact that your "values" span more than one index:
import csv
from itertools import islice
import mmap
def iter_symbol_lines(f, symbol):
# How to recognize the start of a line of interest
ident = b'"' + symbol.encode() + b'",'
# The memory-mapped file
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
# Skip the header
mm.readline()
# The inclusive lower bound of the byte range we're still interested in
lo = mm.tell()
# The exclusive upper bound of the byte range we're still interested in
hi = mm.size()
# As long as the range isn't empty
while lo < hi:
# Find the position of the beginning of a line near the middle of the range
mid = mm.rfind(b'\n', 0, (lo+hi)//2) + 1
# Go to that position
mm.seek(mid)
# Is it a line that comes before lines we're interested in?
if mm.readline() < ident:
# If so, ignore everything up to right after this line
lo = mm.tell()
else:
# Otherwise, ignore everything from right before this line
hi = mid
# We found where the first line of interest would be expected; go there
mm.seek(lo)
while True:
line = mm.readline()
if not line.startswith(ident):
break
yield line.decode()
with open(filename) as f:
r = csv.reader(islice(iter_symbol_lines(f, 'AMZN'), 10))
for line in r:
print(line)
No guarantees about this code; I didn't pay much attention to edge cases, and I couldn't test with (any of) your file(s), so consider it a proof of concept. It is plenty fast, however – think tens of milliseconds on an SSD!
So I have an alternative solution which I ran and tested on my own as well with a sample data set that I got on Quandl that appears to have all the same headers and similar data. (Assuming that I havent misunderstood the end result that your trying to achieve).
I have this command line tool that one of our engineers built for us for parsing massive csvs - since I deal with absurd amount of data on a day to day basis - it is open sourced and you can get it here: https://github.com/DataFoxCo/gocsv
I also already wrote the short bash script for it in case you don't want to pipeline the commands but it does also support pipelining.
The command to run the following short script follows a super simple convention:
bash tickers.sh wikiprices.csv 'AMZN' '2016-12-\d+|2016-11-\d+'
#!/bin/bash
dates="$3"
cat "$1" \
| gocsv filter --columns 'ticker' --regex "$2" \
| gocsv filter --columns 'date' --regex "$dates" > "$2"'-out.csv'
both arguments for ticker and for dates are regexes
You can add as many variations as your want into that one regex, separating them by |.
So if you wanted AMZN and MSFT then you would simply modify it to this: AMZN|MSFT
I did something very similar with the dates - but i only limited my sample run to any dates from this month or last month.
End Result
Starting data:
myusername$ gocsv dims wikiprices.csv
Dimensions:
Rows: 23946
Columns: 14
myusername$ bash tickers.sh wikiprices.csv 'AMZN|MSFT' '2016-12-\d+'
myusername$ gocsv dims AMZN|MSFT-out.csv
Dimensions:
Rows: 24
Columns: 14
Here is a sample where I limited to only those 2 tickers and then to december only:
Voila - in a matter of seconds you have a second file saved with out the data you care about.
The gocsv program has great documentation by the way - and a ton of other functions e.g. running a vlookup basically at any scale (which is what inspired the creator to make the tool)
in addition to using csv.reader I think using itertools.groupby would speed up looking for the wanted sections, so the actual iteration could look something like this:
import csv
from itertools import groupby
from operator import itemgetter #for the keyfunc for groupby
def getData(wanted_symbol, filename):
with open(filename) as file:
reader = csv.reader(file)
#so each line in reader is basically line[:-1].split(",") from the plain file
for symb, lines in groupby(reader, itemgetter(0)):
#so here symb is the symbol at the start of each line of lines
#and lines is the lines that all have that symbol in common
if symb != wanted_symbol:
continue #skip this whole section if it has a different symbol
for line in lines:
#here we have each line as a list of fields
#for only the lines that have `wanted_symbol` as the first element
<DO STUFF HERE>
so in the space of <DO STUFF HERE> you could have the out.append(formatLineData(line)) to do what your current code does but the code for that function has a lot of unnecessary slicing and += operators which I think are pretty expensive for lists (might be wrong), another way you could apply the conversions is to have a list of all the conversions:
def conv_date(date_str):
return datetime.strptime(date_str, '%Y-%m-%d').date()
#the conversions applied to each element (taken from original formatLineData)
castings = [str, conv_date, #0, 1
float, float, float, float, #2:6
int, int, int, #6:9
float, float, float, float, #9:13
int] #13
then use zip to apply these to each field in a line in a list comprehension:
[conv(val) for conv, val in zip(castings, line)]
so you would replace <DO STUFF HERE> with out.append with that comprehension.
I'd also wonder if switching the order of groupby and reader would be better since you don't need to parse most of the file as csv, just the parts you are actually iterating over so you could use a keyfunc that seperates just the first field of the string
def getData(wanted_symbol, filename):
out = [] #why are you starting this with strings in it?
def checkMatch(line): #define the function to only take the line
#this would be the keyfunc for groupby in this example
return line.split(",",1)[0] #only split once, return the first element
with open(filename) as file:
for symb, lines in groupby(file,checkMatch):
#so here symb is the symbol at the start of each line of lines
if symb != wanted_symbol:
continue #skip this whole section if it has a different symbol
for line in csv.reader(lines):
out.append( [typ(val) for typ,val in zip(castings,line)] )
return out

Python compare every line in file with all others

I am implementing a statistical program and have created a performance bottleneck and was hoping that I could obtain some help from the community to possibly point me in the direction of optimization.
I am creating a set for each row in a file and finding the intersection of that set by comparing the set data of each row in the same file. I then use the size of that intersection to filter certain sets from the output. The problem is that I have a nested for loop (O(n2)) and the standard size of the files incoming into the program are just over 20,000 lines long. I have timed the algorithm and for under 500 lines it runs in about 20 minutes but for the big files it takes about 8 hours to finish.
I have 16GB of RAM at disposal and a significantly quick 4-core Intel i7 processor. I have noticed no significant difference in memory use by copying the list1 and using a second list for comparison instead of opening the file again(maybe this is because I have an SSD?). I thought the 'with open' mechanism reads/writes directly to the HDD which is slower but noticed no difference when using two lists. In fact, the program rarely uses more than 1GB of RAM during operation.
I am hoping that other people have used a certain datatype or maybe better understands multiprocessing in Python and that they might be able to help me speed things up. I appreciate any help and I hope my code isn't too poorly written.
import ast, sys, os, shutil
list1 = []
end = 0
filterValue = 3
# creates output file with filterValue appended to name
with open(arg2 + arg1 + "/filteredSets" + str(filterValue) , "w") as outfile:
with open(arg2 + arg1 + "/file", "r") as infile:
# create a list of sets of rows in file
for row in infile:
list1.append(set(ast.literal_eval(row)))
infile.seek(0)
for row in infile:
# if file only has one row, no comparisons need to be made
if not(len(list1) == 1):
# get the first set from the list and...
set1 = set(ast.literal_eval(row))
# ...find the intersection of every other set in the file
for i in range(0, len(list1)):
# don't compare the set with itself
if not(pos == i):
set2 = list1[i]
set3 = set1.intersection(set2)
# if the two sets have less than 3 items in common
if(len(set3) < filterValue):
# and you've reached the end of the file
if(i == len(list1)):
# append the row in outfile
outfile.write(row)
# increase position in infile
pos += 1
else:
break
else:
outfile.write(row)
Sample input would be a file with this format:
[userID1, userID2, userID3]
[userID5, userID3, userID9]
[userID10, userID2, userID3, userID1]
[userID8, userID20, userID11, userID1]
The output file if this were the input file would be:
[userID5, userID3, userID9]
[userID8, userID20, userID11, userID1]
...because the two sets removed contained three or more of the same user id's.
This answer is not about how to split code in functions, name variables etc. It's about faster algorithm in terms of complexity.
I'd use a dictionary. Will not write exact code, you can do it yourself.
Sets = dict()
for rowID, row in enumerate(Rows):
for userID in row:
if Sets.get(userID) is None:
Sets[userID] = set()
Sets[userID].add(rowID)
So, now we have a dictionary, which can be used to quickly obtain rownumbers of rows containing given userID.
BadRows = set()
for rowID, row in enumerate(Rows):
Intersections = dict()
for userID in row:
for rowID_cmp in Sets[userID]:
if rowID_cmp != rowID:
Intersections[rowID_cmp] = Intersections.get(rowID_cmp, 0) + 1
# Now Intersections contains info about how many "times"
# row numbered rowID_cmp intersectcs current row
filteredOut = False
for rowID_cmp in Intersections:
if Intersections[rowID_cmp] >= filterValue:
BadRows.add(rowID_cmp)
filteredOut = True
if filteredOut:
BadRows.add(rowID)
Having rownumbers of all filtered out rows saved to BadRows, now we do iteration one last time:
for rowID, row in enumerate(Rows):
if rowID not in BadRows:
# output row
This works in 3 scans and in O(nlogn) time. Maybe you'd have to rework iterating Rows array, because it's a file in your case, but doesn't really change much.
Not sure about python syntax and details, but you get the idea behind my code.
First of all, please pack your the code into functions which do one thing well.
def get_data(*args):
# get the data.
def find_intersections_sets(list1, list2):
# do the intersections part.
def loop_over_some_result(result):
# insert assertions so that you don't end up looping in infinity:
assert result is not None
...
def myfunc(*args):
source1, source2 = args
L1, L2 = get_data(source1), get_data(source2)
intersects = find_intersections_sets(L1,L2)
...
if __name__ == "__main__":
myfunc()
then you can easily profile the code using:
if __name__ == "__main__":
import cProfile
cProfile.run('myfunc()')
which gives you invaluable insight into your code behaviour and allows you to track down logical bugs. For more on cProfile, see How can you profile a python script?
An option to track down a logical flaw (we're all humans, right?) is to user a timeout function in a decorate like this (python2) or this (python3):
Hereby myfunc can be changed to:
def get_data(*args):
# get the data.
def find_intersections_sets(list1, list2):
# do the intersections part.
def myfunc(*args):
source1, source2 = args
L1, L2 = get_data(source1), get_data(source2)
#timeout(10) # seconds <---- the clever bit!
intersects = find_intersections_sets(L1,L2)
...
...where the timeout operation will raise an error if it takes too long.
Here is my best guess:
import ast
def get_data(filename):
with open(filename, 'r') as fi:
data = fi.readlines()
return data
def get_ast_set(line):
return set(ast.literal_eval(line))
def less_than_x_in_common(set1, set2, limit=3):
if len(set1.intersection(set2)) < limit:
return True
else:
return False
def check_infile(datafile, savefile, filtervalue=3):
list1 = [get_ast_set(row) for row in get_data(datafile)]
outlist = []
for row in list1:
if any([less_than_x_in_common(set(row), set(i), limit=filtervalue) for i in outlist]):
outlist.append(row)
with open(savefile, 'w') as fo:
fo.writelines(outlist)
if __name__ == "__main__":
datafile = str(arg2 + arg1 + "/file")
savefile = str(arg2 + arg1 + "/filteredSets" + str(filterValue))
check_infile(datafile, savefile)

parsing sdf file, performance issue

I've wrtien a script which read different files and search molecular ID in big sdf databases (about 4.0 GB each).
the idea of this script is to copy every molecules from a list of id (about 287212 molecules) from my original databases to a new one in a way to have only one single copy of each molecule (in this case, the first copy encountered)
I've writen this script:
import re
import sys
import os
def sdf_grep (molname,files):
filin = open(files, 'r')
filine= filin.readlines()
for i in range(0,len(filine)):
if filine[i][0:-1] == molname and filine[i][0:-1] not in past_mol:
past_mol.append(filine[i][0:-1])
iterate = 1
while iterate == 1:
if filine[i] == "$$$$\n":
filout.write(filine[i])
iterate = 0
break
else:
filout.write(filine[i])
i = i+1
else:
continue
filin.close()
mol_dock = os.listdir("test")
listmol = []
past_mol = []
imp_listmol = open("consensus_sorted_surflex.txt", 'r')
filout = open('test_ini.sdf','wa')
for line in imp_listmol:
listmol.append(line.split('\t')[0])
print 'list ready... reading files'
imp_listmol.close()
for f in mol_dock:
print 'reading '+f
for molecule in listmol:
if molecule not in past_mol:
sdf_grep(molecule , 'test/'+f)
print len(past_mol)
filout.close()
it works perfectely, but it is very slow... too slow for what I need. Is there a way to rewrite this script in a way that can reduce the computation time?
thank you very much.
The main problem is that you have three nested loops: molecular documents, molecules and file parsing in the inner loop. That smells like trouble - I mean, quadratic complexity. You should move huge files parsing outside of inner loop and use set or dictionary for molecules.
Something like this:
For each sdf file
For each line, if it is molecule definition
Check in dictionary of unfound molecules
If present, process it and remove from dictionary of unfound molecules
This way, you will parse each sdf file exactly once, and with each found molecule, speed will further increase.
Let past_mol be a set, rather than a list. That will speed up
filine[i][0:-1] not in past_mol
since checking membership in a set is O(1), while checking membership in a list is O(n).
Try not to write to a file one line at a time. Instead, save up lines in a list, join them into a single string, and then write it out with one call to filout.write.
It is generally better not to allow functions to modify global variables. sdf_grep modifies the global variable past_mol.
By adding past_mol to the arguments of sdf_grep you make it explicit that sdf_grep depends on the existence of past_mol (otherwise, sdf_grep is not really a standalone function).
If you pass past_mol in as a third argument to sdf_grep, then Python will make a new local variable named past_mol which will point to the same object as the global variable past_mol. Since that object is a set and a set is a mutable object, past_mol.add(sline) will affect the global variable past_mol as well.
As an added bonus, Python looks up local variables faster than global variables:
def using_local():
x = set()
for i in range(10**6):
x
y = set
def using_global():
for i in range(10**6):
y
In [5]: %timeit using_local()
10 loops, best of 3: 33.1 ms per loop
In [6]: %timeit using_global()
10 loops, best of 3: 41 ms per loop
sdf_grep can be simplified greatly if you use a variable (let's call it found) which keeps track of whether or not we are inside one of the chunks of lines we want to keep. (By "chunk of lines" I mean one that begins with molname and ends with "$$$$"):
import re
import sys
import os
def sdf_grep(molname, files, past_mol):
chunk = []
found = False
with open(files, 'r') as filin:
for line in filin:
sline = line.rstrip()
if sline == molname and sline not in past_mol:
found = True
past_mol.add(sline)
elif sline == '$$$$':
chunk.append(line)
found = False
if found:
chunk.append(line)
return '\n'.join(chunk)
def main():
past_mol = set()
with open("consensus_sorted_surflex.txt", 'r') as imp_listmol:
listmol = [line.split('\t')[0] for line in imp_listmol]
print 'list ready... reading files'
with open('test_ini.sdf', 'wa') as filout:
for f in os.listdir("test"):
print 'reading ' + f
for molecule in listmol:
if molecule not in past_mol:
filout.write(sdf_grep(molecule, os.path.join('test/', f), past_mol))
print len(past_mol)
if __name__ == '__main__':
main()

Quickly find differences between two large text files

I have two 3GB text files, each file has around 80 million lines. And they share 99.9% identical lines (file A has 60,000 unique lines, file B has 80,000 unique lines).
How can I quickly find those unique lines in two files? Is there any ready-to-use command line tools for this? I'm using Python but I guess it's less possible to find a efficient Pythonic method to load the files and compare.
Any suggestions are appreciated.
If order matters, try the comm utility. If order doesn't matter, sort file1 file2 | uniq -u.
I think this is the fastest method (whether it's in Python or another language shouldn't matter too much IMO).
Notes:
1.I only store each line's hash to save space (and time if paging might occur)
2.Because of the above, I only print out line numbers; if you need actual lines, you'd just need to read the files in again
3.I assume that the hash function results in no conflicts. This is nearly, but not perfectly, certain.
4.I import hashlib because the built-in hash() function is too short to avoid conflicts.
import sys
import hashlib
file = []
lines = []
for i in range(2):
# open the files named in the command line
file.append(open(sys.argv[1+i], 'r'))
# stores the hash value and the line number for each line in file i
lines.append({})
# assuming you like counting lines starting with 1
counter = 1
while 1:
# assuming default encoding is sufficient to handle the input file
line = file[i].readline().encode()
if not line: break
hashcode = hashlib.sha512(line).hexdigest()
lines[i][hashcode] = sys.argv[1+i]+': '+str(counter)
counter += 1
unique0 = lines[0].keys() - lines[1].keys()
unique1 = lines[1].keys() - lines[0].keys()
result = [lines[0][x] for x in unique0] + [lines[1][x] for x in unique1]
With 60,000 or 80,000 unique lines you could just create a dictionary for each unique line, mapping it to a number. mydict["hello world"] => 1, etc. If your average line is around 40-80 characters this will be in the neighborhood of 10 MB of memory.
Then read each file, converting it to an array of numbers via the dictionary. Those will fit easily in memory (2 files of 8 bytes * 3GB / 60k lines is less than 1 MB of memory). Then diff the lists. You could invert the dictionary and use it to print out the text of the lines that differ.
EDIT:
In response to your comment, here's a sample script that assigns numbers to unique lines as it reads from a file.
#!/usr/bin/python
class Reader:
def __init__(self, file):
self.count = 0
self.dict = {}
self.file = file
def readline(self):
line = self.file.readline()
if not line:
return None
if self.dict.has_key(line):
return self.dict[line]
else:
self.count = self.count + 1
self.dict[line] = self.count
return self.count
if __name__ == '__main__':
print "Type Ctrl-D to quit."
import sys
r = Reader(sys.stdin)
result = 'ignore'
while result:
result = r.readline()
print result
If I understand correctly, you want the lines of these files without duplicates. This does the job:
uniqA = set(open('fileA', 'r'))
Python has difflib which claims to be quite competitive with other diff utilities see:
http://docs.python.org/library/difflib.html

Categories