Reading data from a large text file with strings and floats in Python

I'm having trouble reading large amounts of data from a text file and splitting out certain values to get a more refined list. For example, say I have a text file, we'll call it 'data.txt', that has this data in it:
Some Header Here
Object Number = 1
Object Symbol = A
Mass of Object = 1
Weight of Object = 1.2040
Height of Object = 0.394
Width of Object = 4.2304
Object Number = 2
Object Symbol = B
Mass Number = 2
Weight of Object = 1.596
Height of Object = 3.293
Width of Object = 4.654
... (same format continuing down the file)
My problem is pulling the data I need out of this file. Let's say I'm only interested in the Object Number and Mass of Object fields, which repeat through the file with different numerical values. I need a list of this data. Example:
Object Number    Mass of Object
1                1
2                2
...              ...
etc.
With the headers excluded, of course, as this data will be applied to an equation. I'm very new to Python and don't have any knowledge of OOP. What would be the easiest way to do this? I know the basics of opening and writing to text files, and a little about using the split and strip functions. I've researched quite a bit on this site about sorting data, but I haven't been able to get anything to work for me.

Try this:
object_number = []   # list of Object Number values
mass_of_object = []  # list of Mass of Object values

with open('data.txt') as f:
    for line in f:
        if line.startswith('Object Number'):
            object_number.append(int(line.split('=')[1]))
        elif line.startswith('Mass of Object'):
            mass_of_object.append(int(line.split('=')[1]))
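If you then want the two-column listing from the question, you can pair the lists up. A minimal sketch, assuming both fields appear exactly once per record so the lists stay aligned:

for number, mass in zip(object_number, mass_of_object):
    print(number, mass)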

In my opinion, a dictionary (or one of its subclasses) is more efficient than a group of lists for large data inputs.
Moreover, this code doesn't need any modification if you need to extract an additional field from your file; just add it to the checklist.
from collections import defaultdict

checklist = ["Object Number", "Mass of Object"]
data = defaultdict(list)

with open("data.txt") as f:
    # iterating over the file allows
    # you to read it automatically one line at a time
    for line in f:
        for key in checklist:
            if line.startswith(key):
                # rstrip() erases the trailing newline character
                val = line.rstrip()
                val = val.split(" = ")[1]
                data[key].append(val)

print(data)
This is the output (note that the second record in the sample says "Mass Number", not "Mass of Object", so only one mass is captured):
defaultdict(<class 'list'>, {'Object Number': ['1', '2'], 'Mass of Object': ['1']})
There is plenty of material on the time complexity of Python's built-in types, on performance optimization, and on how the choice of data type affects implementation efficiency.
Last, some examples about re (regular expressions):
https://docs.python.org/2/howto/regex.html
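As a sketch of that regex route (not taken from the answer above; the field names come from the question's sample file):

import re
from collections import defaultdict

# Matches lines such as "Object Number = 1", capturing the field name and value
pattern = re.compile(r"^(Object Number|Mass of Object) = (\S+)")

data = defaultdict(list)
with open("data.txt") as f:
    for line in f:
        m = pattern.match(line)
        if m:
            data[m.group(1)].append(m.group(2))
print(data)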

Related

lmdb stores the data inefficiently?

I'm looking for a data file structure that enables fast reading of random data samples for deep learning, and have been experimenting with lmdb today. However, one thing that surprises me is how inefficiently it seems to store the data.
I have an ASCII file that is around 120 GB with gene sequences.
Now I would have expected to be able to fit this data in an lmdb database of roughly the same size, or perhaps even a bit smaller, since ASCII is a highly inefficient storage method.
However, what I'm seeing suggests that I need around 350 GB to store this data in an lmdb file, and I simply don't understand that.
Am I not utilizing some setting correctly, or what exactly am I doing wrong here?
import time
import lmdb
import pyarrow as pa

def dumps_pyarrow(obj):
    """
    Serialize an object.
    Returns:
        Implementation-dependent bytes-like object
    """
    return pa.serialize(obj).to_buffer()

t0 = time.time()
filepath = './../../Uniparc/uniparc_active/uniparc_active.fasta'
output_file = './../data/out_lmdb.fasta'
write_freq = 100000
start_line = 2
nprot = 0

db = lmdb.open(output_file, map_size=1e9, readonly=False,
               meminit=False, map_async=True)
txn = db.begin(write=True)
with open(filepath) as fp:
    line = fp.readline()
    cnt = 1
    protein = ''
    while line:
        if cnt >= start_line:
            if line[0] == '>':  # old protein finished, new protein starting on next line
                txn.put(u'{}'.format(nprot).encode('ascii'), dumps_pyarrow(protein))
                nprot += 1
                if nprot % write_freq == 0:
                    t1 = time.time()
                    print("committing... nprot={}, time={:2.2f}".format(nprot, t1 - t0))
                    txn.commit()
                    txn = db.begin(write=True)
                    line_checkpoint = cnt
                protein = ''
            else:
                protein += line.strip()
        line = fp.readline()
        cnt += 1
txn.commit()

keys = [u'{}'.format(k).encode('ascii') for k in range(nprot + 1)]
with db.begin(write=True) as txn:
    txn.put(b'__keys__', dumps_pyarrow(keys))
    txn.put(b'__len__', dumps_pyarrow(len(keys)))

print("Flushing database ...")
db.sync()
db.close()
t2 = time.time()
print("All done, time taken {:2.2f}s".format(t2 - t0))
Edit:
Some additional information about the data:
In the 120 GB file the data is structured like this (here I am showing the first 2 proteins):
>UPI00001E0F7B status=inactive
YPRSRSQQQGHHNAAQQAHHPYQLQHSASTVSHHPHAHGPPSQGGPGGPGPPHGGHPHHP
HHGGAGSGGGSGPGSHGGQPHHQKPRRTASQRIRAATAARKLHFVFDPAGRLCYYWSMVV
SMAFLYNFWVIIYRFAFQEINRRTIAIWFCLDYLSDFLYLIDILFHFRTGYLEDGVLQTD
ALKLRTHYMNSTIFYIDCLCLLPLDFLYLSIGFNSILRSFRLVKIYRFWAFMDRTERHTN
YPNLFRSTALIHYLLVIFHWNGCLYHIIHKNNGFGSRNWVYHDSESADVVKQYLQSYYWC
TLALTTIGDLPKPRSKGEYVFVILQLLFGLMLFATVLGHVANIVTSVSAARKEFQGESNL
RRQWVKVVWSAPASG
>UPI00001E0FBF status=active
MWRAQPSLWIWWIFLILVPSIRAVYEDYRLPRSVEPLHYNLRILTHLNSTDQRFEGSVTI
DLLARETTKNITLHAAYLKIDENRTSVVSGQEKFGVNRIEVNEVHNFYILHLGRELVKDQ
IYKLEMHFKAGLNDSQSGYYKSNYTDIVTKEVHHLAVTQFSPTFARQAFPCFDEPSWKAT
FNITLGYHKKYMGLSGMPVLRCQDHDSLTNYVWCDHDTLLRTSTYLVAFAVHDLENAATE
ESKTSNRVIFRNWMQPKLLGQEMISMEIAPKLLSFYENLFQINFPLAKVDQLTVPTHRFT
AMENWGLVTYNEERLPQNQGDYPQKQKDSTAFTVAHEYAHQWFGNLVTMNWWNDLWLKEG
PSTYFGYLALDSLQPEWRRGERFISRDLANFFSKDSNATVPAISKDVKNPAEVLGQFTEY
VYEKGSLTIRMLHKLVGEEAFFHGIRSFLERFSFGNVAQADLWNSLQMAALKNQVISSDF
NLSRAMDSWTLQGGYPLVTLIRNYKTGEVTLNQSRFFQEHGIEKASSCWWVPLRFVRQNL
PDFNQTTPQFWLECPLNTKVLKLPDHLSTDEWVILNPQVATIFRVNYDEHNWRLIIESLR
NDPNSGGIHKLNKAQLLDDLMALAAVRLHKYDKAFDLLEYLKKEQDFLPWQRAIGILNRL
GALLNVAEANKFKNYMQKLLLPLYNRFPKLSGIREAKPAIKDIPFAHFAYSQACRYHVAD
CTDQAKILAITHRTEGQLELPSDFQKVAYCSLLDEGGDAEFLEVFGLFQNSTNGSQRRIL
ASALGCVRNFGNFEQFLNYTLESDEKLLGDCYMLAVKSALNREPLVSPTANYIISHAKKL
GEKFKKKELTGLLLSLAQNLRSTEEIDRLKAQLEDLKEFEEPLKKALYQGKMNQKWQKDC
SSDFIEAIEKHL
When I store the data in the database I concatenate all the lines making up each protein and store those as a single data point. I ignore the header line (the line starting with >).
The reason I believe the data should take less space when stored in the database is that I expect it to be stored in some binary form, which I would expect to be more compact, though I admit I don't know whether that is how it actually works (for comparison, the data is only 70 GB when compressed/zipped).
I would be okay with the data taking up a similar amount of space in lmdb format, but I don't understand why it should take up almost three times the space it does in ASCII format.
LMDB does not implement any compression on data: one byte in memory is one byte on disk.
But its internals can amplify the space required:
data is handled in pages (generally 4 KB)
each "record" needs to store additional structures for the B-tree (for key and data) plus the count of pages occupied by key and data
Bottom line: LMDB is designed for FAST data access, not to save space.
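One way to see that amplification for yourself is to ask the environment for its B-tree statistics. A small sketch, assuming the py-lmdb binding and the output path from the question:

import lmdb

env = lmdb.open('./../data/out_lmdb.fasta', readonly=True)
stat = env.stat()  # page size plus branch/leaf/overflow page counts
pages = stat['branch_pages'] + stat['leaf_pages'] + stat['overflow_pages']
print("entries:", stat['entries'])
print("approximate on-disk size:", pages * stat['psize'], "bytes")
env.close()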

Fast extraction of chunks of lines from large CSV file

I have a large CSV file full of stock-related data formatted as such:
Ticker Symbol, Date, [some variables...]
So each line starts off with the symbol (like "AMZN"), then has the date, then has 12 variables related to price or volume on the selected date. There are about 10,000 different securities represented in this file, and I have a line for each day that the stock has been publicly traded for each of them. The file is ordered first alphabetically by ticker symbol and second chronologically by date. The entire file is about 3.3 GB.
The task I want to solve is extracting the most recent n lines of data for a given ticker symbol relative to the current date. I have code that does this, but based on my observations it seems to take, on average, around 8-10 seconds per retrieval (all tests have been extracting 100 lines).
I have functions I'd like to run that require me to grab such chunks for hundreds or thousands of symbols, and I would really like to reduce the time. My code is inefficient, but I am not sure how to make it run faster.
First, I have a function called getData:
def getData(symbol, filename):
    out = ["Symbol","Date","Open","High","Low","Close","Volume","Dividend",
           "Split","Adj_Open","Adj_High","Adj_Low","Adj_Close","Adj_Volume"]
    l = len(symbol)
    beforeMatch = True
    with open(filename, 'r') as f:
        for line in f:
            match = checkMatch(symbol, l, line)
            if beforeMatch and match:
                beforeMatch = False
                out.append(formatLineData(line[:-1].split(",")))
            elif not beforeMatch and match:
                out.append(formatLineData(line[:-1].split(",")))
            elif not beforeMatch and not match:
                break
    return out
(This code has a couple of helper functions, checkMatch and formatLineData, which I will show below.) Then, there is another function called getDataColumn that gets the column I want with the correct number of days represented:
def getDataColumn(symbol, col=12, numDays=100, changeRateTransform=False):
    dataset = getData(symbol)
    if not changeRateTransform:
        column = [day[col] for day in dataset[-numDays:]]
    else:
        n = len(dataset)
        column = [(dataset[i][col] - dataset[i-1][col])/dataset[i-1][col]
                  for i in range(n - numDays, n)]
    return column
(changeRateTransform converts raw numbers into daily change rate numbers if True.) The helper functions:
def checkMatch(symbol, symbolLength, line):
    out = False
    if line[:symbolLength+1] == symbol + ",":
        out = True
    return out

def formatLineData(lineData):
    out = [lineData[0]]
    out.append(datetime.strptime(lineData[1], '%Y-%m-%d').date())
    out += [float(d) for d in lineData[2:6]]
    out += [int(float(d)) for d in lineData[6:9]]
    out += [float(d) for d in lineData[9:13]]
    out.append(int(float(lineData[13])))
    return out
Does anyone have any insight into which parts of my code run slowly and how I can make this perform better? I can't do the sort of analysis I want to do without speeding this up.
EDIT:
In response to the comments, I made some changes to the code in order to utilize the existing methods in the csv module:
def getData(symbol, database):
    out = ["Symbol","Date","Open","High","Low","Close","Volume","Dividend",
           "Split","Adj_Open","Adj_High","Adj_Low","Adj_Close","Adj_Volume"]
    l = len(symbol)
    beforeMatch = True
    with open(database, 'r') as f:
        databaseReader = csv.reader(f, delimiter=",")
        for row in databaseReader:
            match = (row[0] == symbol)
            if beforeMatch and match:
                beforeMatch = False
                out.append(formatLineData(row))
            elif not beforeMatch and match:
                out.append(formatLineData(row))
            elif not beforeMatch and not match:
                break
    return out
def getDataColumn(dataset, col=12, numDays=100, changeRateTransform=False):
    if not changeRateTransform:
        out = [day[col] for day in dataset[-numDays:]]
    else:
        n = len(dataset)
        out = [(dataset[i][col] - dataset[i-1][col])/dataset[i-1][col]
               for i in range(n - numDays, n)]
    return out
Performance was worse using the csv.reader class. I tested on two stocks, AMZN (near top of file) and ZNGA (near bottom of file). With the original method, the run times were 0.99 seconds and 18.37 seconds, respectively. With the new method leveraging the csv module, the run times were 3.04 seconds and 64.94 seconds, respectively. Both return the correct results.
My thought is that the time is being taken up more by finding the stock than by the parsing. If I try these methods on the first stock in the file, A, both run in about 0.12 seconds.
When you're going to do lots of analysis on the same dataset, the pragmatic approach would be to read it all into a database. It is made for fast querying; CSV isn't. Use the sqlite command line tools, for example, which can directly import from CSV. Then add a single index on (Symbol, Date) and lookups will be practically instantaneous.
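For illustration, a minimal sketch of that route with the standard library's sqlite3 module (the file names are placeholders; the column names are taken from the question):

import csv
import sqlite3

conn = sqlite3.connect('stocks.db')
cur = conn.cursor()
cur.execute("""CREATE TABLE IF NOT EXISTS prices (
    Symbol TEXT, Date TEXT, Open REAL, High REAL, Low REAL, Close REAL,
    Volume INTEGER, Dividend REAL, Split REAL, Adj_Open REAL, Adj_High REAL,
    Adj_Low REAL, Adj_Close REAL, Adj_Volume INTEGER)""")

with open('prices.csv') as f:  # placeholder for your 3.3 GB file
    reader = csv.reader(f)
    next(reader)  # skip the header line
    cur.executemany("INSERT INTO prices VALUES (" + ",".join("?" * 14) + ")", reader)

# The single index that makes (Symbol, Date) lookups practically instantaneous
cur.execute("CREATE INDEX IF NOT EXISTS idx_symbol_date ON prices(Symbol, Date)")
conn.commit()

# The most recent 100 rows for one symbol
rows = cur.execute("SELECT * FROM prices WHERE Symbol = ? "
                   "ORDER BY Date DESC LIMIT 100", ('AMZN',)).fetchall()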
If for some reason that is not feasible, for example because new files can come in at any moment and you cannot afford the preparation time before starting your analysis of them, you'll have to make the best of dealing with CSV directly, which is what the rest of my answer will focus on. Remember that it's a balancing act, though. Either you pay a lot upfront, or a bit extra for every lookup. Eventually, for some amount of lookups it would have been cheaper to pay upfront.
Optimization is about maximizing the amount of work not done. Using generators and the built-in csv module aren't going to help much with that in this case. You'd still be reading the whole file and parsing all of it, at least for line breaks. With that amount of data, it's a no-go.
Parsing requires reading, so you'll have to find a way around it first. Best practices of leaving all intricacies of the CSV format to the specialized module bear no meaning when they can't give you the performance you want. Some cheating must be done, but as little as possible. In this case, I suppose it is safe to assume that the start of a new line can be identified as b'\n"AMZN",' (sticking with your example). Yes, binary here, because remember: no parsing yet. You could scan the file as binary from the beginning until you find the first such line. From there, read the amount of lines you need, decode and parse them the proper way, etc. No need for optimization there, because 100 lines are nothing to worry about compared to the hundreds of thousands of irrelevant lines you're not doing that work for.
Dropping all that parsing buys you a lot, but the reading needs to be optimized as well. Don't load the whole file into memory first and skip as many layers of Python as you can. Using mmap lets the OS decide what to load into memory transparently and lets you work with the data directly.
Still you're potentially reading the whole file, if the symbol is near the end. It's a linear search, which means the time it takes is linearly proportional to the number of lines in the file. You can do better though. Because the file is sorted, you could improve the function to instead perform a kind of binary search. The number of steps that will take (where a step is reading a line) is close to the binary logarithm of the number of lines. In other words: the number of times you can divide your file into two (almost) equally sized parts. When there are one million lines, that's a difference of five orders of magnitude!
Here's what I came up with, based on Python's own bisect_left with some measures to account for the fact that your "values" span more than one index:
import csv
from itertools import islice
import mmap

def iter_symbol_lines(f, symbol):
    # How to recognize the start of a line of interest
    ident = b'"' + symbol.encode() + b'",'
    # The memory-mapped file
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Skip the header
    mm.readline()
    # The inclusive lower bound of the byte range we're still interested in
    lo = mm.tell()
    # The exclusive upper bound of the byte range we're still interested in
    hi = mm.size()
    # As long as the range isn't empty
    while lo < hi:
        # Find the position of the beginning of a line near the middle of the range
        mid = mm.rfind(b'\n', 0, (lo+hi)//2) + 1
        # Go to that position
        mm.seek(mid)
        # Is it a line that comes before lines we're interested in?
        if mm.readline() < ident:
            # If so, ignore everything up to right after this line
            lo = mm.tell()
        else:
            # Otherwise, ignore everything from right before this line
            hi = mid
    # We found where the first line of interest would be expected; go there
    mm.seek(lo)
    while True:
        line = mm.readline()
        if not line.startswith(ident):
            break
        yield line.decode()

with open(filename) as f:
    r = csv.reader(islice(iter_symbol_lines(f, 'AMZN'), 10))
    for line in r:
        print(line)
No guarantees about this code; I didn't pay much attention to edge cases, and I couldn't test with (any of) your file(s), so consider it a proof of concept. It is plenty fast, however – think tens of milliseconds on an SSD!
So I have an alternative solution, which I ran and tested on my own with a sample data set from Quandl that appears to have all the same headers and similar data (assuming I haven't misunderstood the end result you're trying to achieve).
I have a command line tool that one of our engineers built for us for parsing massive CSVs, since I deal with absurd amounts of data on a day-to-day basis. It is open sourced and you can get it here: https://github.com/DataFoxCo/gocsv
I also wrote a short bash script for it in case you don't want to pipeline the commands yourself, though the tool does support pipelining.
The command to run the following short script follows a super simple convention:
bash tickers.sh wikiprices.csv 'AMZN' '2016-12-\d+|2016-11-\d+'
#!/bin/bash
dates="$3"
cat "$1" \
| gocsv filter --columns 'ticker' --regex "$2" \
| gocsv filter --columns 'date' --regex "$dates" > "$2"'-out.csv'
Both the ticker and date arguments are regexes.
You can add as many variations as you want into that one regex, separating them by |.
So if you wanted AMZN and MSFT then you would simply modify it to this: AMZN|MSFT
I did something very similar with the dates, but I limited my sample run to dates from this month or last month.
End Result
Starting data:
myusername$ gocsv dims wikiprices.csv
Dimensions:
Rows: 23946
Columns: 14
myusername$ bash tickers.sh wikiprices.csv 'AMZN|MSFT' '2016-12-\d+'
myusername$ gocsv dims 'AMZN|MSFT-out.csv'
Dimensions:
Rows: 24
Columns: 14
That run limited the output to just those two tickers and to December. Voila: in a matter of seconds you have a second file saved with just the data you care about.
The gocsv program has great documentation, by the way, and a ton of other functions, e.g. running a vlookup at basically any scale (which is what inspired the creator to make the tool).
In addition to using csv.reader, I think itertools.groupby would speed up looking for the wanted sections, so the actual iteration could look something like this:
import csv
from itertools import groupby
from operator import itemgetter  # for the keyfunc for groupby

def getData(wanted_symbol, filename):
    with open(filename) as file:
        reader = csv.reader(file)
        # each row in reader is basically line[:-1].split(",") from the plain file
        for symb, lines in groupby(reader, itemgetter(0)):
            # symb is the symbol at the start of each line of `lines`,
            # and `lines` is the group of rows that all share that symbol
            if symb != wanted_symbol:
                continue  # skip this whole section if it has a different symbol
            for line in lines:
                # here we have each line as a list of fields, for only the
                # lines that have `wanted_symbol` as the first element
                <DO STUFF HERE>
In the space of <DO STUFF HERE> you could have the out.append(formatLineData(line)) to do what your current code does, but the code for that function has a lot of unnecessary slicing and += operators, which I think are pretty expensive for lists (I might be wrong). Another way you could apply the conversions is to have a list of all the conversions:
from datetime import datetime  # needed for conv_date

def conv_date(date_str):
    return datetime.strptime(date_str, '%Y-%m-%d').date()

# the conversions applied to each element (taken from the original formatLineData)
castings = [str, conv_date,              # 0, 1
            float, float, float, float,  # 2:6
            int, int, int,               # 6:9
            float, float, float, float,  # 9:13
            int]                         # 13
Then use zip to apply these to each field in a line in a list comprehension:
[conv(val) for conv, val in zip(castings, line)]
so you would replace <DO STUFF HERE> with an out.append of that comprehension.
I'd also wonder if switching the order of groupby and csv.reader would be better, since you don't need to parse most of the file as CSV, just the parts you are actually iterating over. You could use a keyfunc that separates out just the first field of the string:
def getData(wanted_symbol, filename):
    out = []  # why are you starting this with strings in it?
    def checkMatch(line):  # define the function to only take the line
        # this would be the keyfunc for groupby in this example
        return line.split(",", 1)[0]  # only split once, return the first element
    with open(filename) as file:
        for symb, lines in groupby(file, checkMatch):
            # symb is the symbol at the start of each line of `lines`
            if symb != wanted_symbol:
                continue  # skip this whole section if it has a different symbol
            for line in csv.reader(lines):
                out.append([typ(val) for typ, val in zip(castings, line)])
    return out

Efficiently find partial string matches (values starting with any of a list of prefixes) in a 5 GB file with Python

I have a 5 GB file of businesses and I'm trying to extract all the businesses whose business type codes (SNACODE) start with the SNACODEs corresponding to grocery stores. For example, SNACODEs for some businesses could be 42443013, 44511003, 44419041, 44512001, 44522004, and I want all businesses whose codes start with my list of grocery SNACODEs, codes = [4451,4452,447,772,45299,45291,45212]. In this case, I'd get the rows for 44511003, 44512001, and 44522004.
Based on what I googled, the most efficient way to read in the file seemed to be one row at a time (if not the SQL route). I then used a for loop and checked whether my SNACODE column started with any of my codes (which probably was a bad idea, but it was the only way I could get anything to work).
I have no idea how many rows are in the file, but there are 84 columns. My computer was running for so long that I asked a friend, who said it should only take 10-20 minutes to complete this task. My friend edited the code, but I think he misunderstood what I was trying to do, because his version returns nothing.
I am now trying to find a more efficient method than re-doing my 9.5-hour run and having my laptop run for an unknown amount of time. The closest thing I've been able to find is "most efficient way to find partial string matches in large file of strings (python)", but it doesn't seem like what I was looking for.
Questions:
What's the best way to do this? How long should this take?
Is there any way I can start where I stopped? (I have no idea how many rows of my 5 GB file I read, but I have the last saved line of data; is there a fast/easy way to find the line corresponding to a unique ID in the file without having to read each line?)
This is what I tried; in 9.5 hours it output a 72 MB file (200k+ rows) of grocery stores:
codes = [4451,4452,447,772,45299,45291,45212]  # codes for grocery stores
for df in pd.read_csv('infogroup_bus_2010.csv', sep=',', chunksize=1):
    data = np.asarray(df)
    data = pd.DataFrame(data, columns=headers)
    for code in codes:
        if np.char.startswith(str(data["SNACODE"][0]), str(code)):
            with open("grocery.csv", "a") as myfile:
                data.to_csv(myfile, header=False)
            print(code)
            break  # break code for loop if match
grocery.to_csv("grocery.csv", sep='\t')
This is what my friend edited it to. I'm pretty sure the x = df[df.SNACODE.isin(codes)] is only matching perfect matches, and thus returning nothing.
codes = [4451,4452,447,772,45299,45291,45212]
matched = []
for df in pd.read_csv('infogroup_bus_2010.csv', sep=',', chunksize=1024*1024,
                      dtype=str, low_memory=False):
    x = df[df.SNACODE.isin(codes)]
    if len(x):
        matched.append(x)
        print("Processed chunk and found {} matches".format(len(x)))
output = pd.concat(matched, axis=0)
output.to_csv("grocery.csv", index=False)
Thanks!
To increase speed you could pre-build a single regexp matching the lines you need, then read the raw file lines (no CSV parsing) and check them with the regexp:
import re

codes = [4451, 4452, 447, 772, 45299, 45291, 45212]
col_num = 4  # column number of SNACODE
# group the alternation so that every code, not just the first,
# is anchored after the skipped columns
expr = re.compile("[^,]*," * col_num +
                  "(?:" + "|".join(map(str, codes)) + ")" +
                  ".*")
for L in open('infogroup_bus_2010.csv'):
    if expr.match(L):
        print(L)
Note that this is just a simple sketch, as no escaping is considered... if the SNACODE column is not the first one and preceding fields may contain commas, you need a more sophisticated regexp like:
...
'([^"][^,]*,|"([^"]|"")*",)' * col_num +
...
that ignores commas inside double-quotes
You can probably make your pandas solution much faster:
import pandas as pd

codes = [4451, 4452, 447, 772, 45299, 45291, 45212]
codes = [str(code) for code in codes]

sna = pd.read_csv('infogroup_bus_2010.csv', usecols=['SNACODE'],
                  chunksize=int(1e6), dtype={'SNACODE': str})
with open('grocery.csv', 'w') as fout:
    for chunk in sna:
        for code in chunk['SNACODE']:
            for target_code in codes:
                if code.startswith(target_code):
                    fout.write('{}\n'.format(code))
Read only the needed column with usecols=['SNACODE']. You can adjust the chunk size with chunksize=int(1e6). Depending on your RAM you can likely make it much bigger.

Python: How to extract string from text file to use as data

This is my first time writing a Python script and I'm having some trouble getting started. Let's say I have a txt file named Test.txt that contains this information:
x y z Type of atom
ATOM 1 C1 GLN D 10 26.395 3.904 4.923 C
ATOM 2 O1 GLN D 10 26.431 2.638 5.002 O
ATOM 3 O2 GLN D 10 26.085 4.471 3.796 O
ATOM 4 C2 GLN D 10 26.642 4.743 6.148 C
What I want to do is eventually write a script that will find the center of mass of these atoms. So basically I want to sum up all of the x values in that txt file, with each number multiplied by a given value depending on the type of atom.
I know I need to define the positions of each x value, but I'm having trouble figuring out how to make these x values be represented as numbers instead of text from a string. I also have to keep in mind that I'll need to multiply these numbers by the mass of each type of atom, so I need a way to keep them associated with the atom type. Can anyone push me in the right direction?
mass_dictionary = {'C': 12.0107,
                   'O': 15.999
                   # Others...?
                   }

# If your files are this structured, you can just
# hardcode some column assumptions.
coords_idxs = [6, 7, 8]
type_idx = 9

# Open file, get lines, close file.
# Probably prudent to add try-except here for bad file names.
f_open = open("Test.txt", 'r')
lines = f_open.readlines()
f_open.close()

# Initialize an array to hold needed intermediate data.
output_coms = []
total_mass = 0.0

# Loop through the lines of the file.
for line in lines:
    # Split the line on white space.
    line_stuff = line.split()
    # If the line is empty or fails to start with 'ATOM', skip it.
    if (not line_stuff) or (not line_stuff[0] == 'ATOM'):
        pass
    # Otherwise, append the mass-weighted coordinates to a list and increment total mass.
    else:
        output_coms.append([mass_dictionary[line_stuff[type_idx]] * float(line_stuff[i])
                            for i in coords_idxs])
        total_mass = total_mass + mass_dictionary[line_stuff[type_idx]]

# After getting all the data, finish off the averages.
avg_x, avg_y, avg_z = tuple(map(lambda x: (1.0/total_mass)*sum(x),
                                [[elem[i] for elem in output_coms] for i in [0, 1, 2]]))

# A lot of this will be better with NumPy arrays if you'll be using this often or on
# larger files. Python Pandas might be an even better option if you want to just
# store the file data and play with it in Python.
Basically, using the open function in Python you can open any file. So you can do something as follows (the following snippet is not a solution to the whole problem, but an approach):
def read_file():
    f = open("filename", 'r')
    for line in f:
        line_list = line.split()
        ....
        ....
    f.close()
From this point on you have a nice setup for what you can do with these values. Basically, the second line just opens the file for reading. The third line defines a for loop that reads the file one line at a time, with each line going into the line variable.
The last line in that snippet breaks the string, at every whitespace, into a list. So line_list[0] will be the value in your first column, and so forth. From this point, if you have any programming experience you can just use if statements and such to get the logic that you want.
Also keep in mind that the values stored in that list will all be strings, so if you want to perform any arithmetic operations, such as adding, you have to convert them first.
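For example, to read the coordinates numerically (using the column layout of Test.txt above, where the x, y and z values sit at indices 6-8 after split()):

x, y, z = (float(v) for v in line_list[6:9])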
If you have pandas installed, check out the read_fwf function, which imports a fixed-width file and creates a DataFrame (a 2-d tabular data structure). It'll save you lines of code on import and also give you a lot of data-munging functionality if you want to do any additional data manipulation.
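A minimal sketch of that approach; the column names here are made up, and read_fwf is left to infer the column boundaries from the data:

import pandas as pd

# Hypothetical names for the ATOM records shown in the question
names = ["record", "serial", "atom", "residue", "chain", "resnum",
         "x", "y", "z", "element"]
df = pd.read_fwf("Test.txt", skiprows=1, header=None, names=names)

# x, y and z come back as floats, ready for the center-of-mass arithmetic
print(df[["x", "y", "z", "element"]])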

Filter large file using python, using contents of another

I have a ~1 GB text file of data entries and another list of names that I would like to use to filter them. Running through every name for each entry would be terribly slow. What's the most efficient way of doing this in Python? Is it possible to use a hash table if the name is embedded in the entry? Can I make use of the fact that the name part is consistently placed?
Example files:
Entries file -- each part of the entry is separated by a tab, until the names
246 lalala name="Jack";surname="Smith"
1357 dedada name="Mary";surname="White"
123456 lala name="Dan";surname="Brown"
555555 lalala name="Jack";surname="Joe"
Names file -- each on a newline
Jack
Dan
Ryan
Desired output -- only entries with a name in the names file
246 lalala name="Jack";surname="Smith"
123456 lala name="Dan";surname="Brown"
555555 lalala name="Jack";surname="Joe"
You can use the set data structure to store the names: it offers efficient lookup, but if the names list is very large then you may run into memory trouble.
The general idea is to iterate through all the names, adding them to a set, then checking if each name from each line from the data file is contained in the set. As the format of the entries doesn't vary, you should be able to extract the names with a simple regular expression.
If you run into troubles with the size of the names set, you can read n lines from the names file and repeat the process for each set of names, unless you require sorting.
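A short sketch of that idea (the file names are placeholders; the regular expression assumes the name="..." format shown in the question):

import re

# Matches e.g. name="Jack" and captures the name
name_re = re.compile(r'name="([^"]*)"')

with open('names.txt') as f:
    names = set(line.strip() for line in f)

with open('entries.txt') as entries, open('filtered.txt', 'w') as out:
    for entry in entries:
        m = name_re.search(entry)
        if m and m.group(1) in names:
            out.write(entry)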
My first instinct was to make a dictionary with names as keys, assuming that it was most efficient to look up the names using the hash of the keys in the dictionary.
Given the answer by @rfw using a set of names, I edited the code as below and tested it against the two methods, using a dict of names and a set.
I built a dummy dataset of over 40 M records and over 5400 names. Using this dataset, the set method consistently had the edge on my machine.
import re
from collections import Counter
import time

# names file downloaded from http://www.tucows.com/preview/520007
# the set contains over 5400 names
f = open('./names.txt', 'r')
names = [name.rstrip() for name in f.read().split(',')]
name_set = set(names)        # set of unique names
names_dict = Counter(names)  # Counter ~= dict of names with counts

# Expect: 246 lalala name="Jack";surname="Smith"
pattern = re.compile(r'.*\sname="([^"]*)"')

def select_rows_set():
    f = open('./data.txt', 'r')
    out_f = open('./data_out_set.txt', 'a')
    for record in f.readlines():
        name = pattern.match(record).groups()[0]
        if name in name_set:
            out_f.write(record)
    out_f.close()
    f.close()

def select_rows_dict():
    f = open('./data.txt', 'r')
    out_f = open('./data_out_dict.txt', 'a')
    for record in f.readlines():
        name = pattern.match(record).groups()[0]
        if name in names_dict:
            out_f.write(record)
    out_f.close()
    f.close()

if __name__ == '__main__':
    # One round to time the use of name_set
    t0 = time.time()
    select_rows_set()
    t1 = time.time()
    time_for_set = t1 - t0
    print 'Total set: ', time_for_set

    # One round to time the use of names_dict
    t0 = time.time()
    select_rows_dict()
    t1 = time.time()
    time_for_dict = t1 - t0
    print 'Total dict: ', time_for_dict
I assumed that a Counter, being at heart a dictionary and easier to build from the dataset, does not add any overhead to the access time. Happy to be corrected if I am missing something.
Your data is clearly structured as a table, so this may be applicable:
Data structure for maintaining tabular data in memory?
You could create a custom data structure with its own "search by name" function; see the sketch below. That'd be a list of dictionaries of some sort. It should take less memory than the size of your text file, as it removes duplicate information you have on each line, such as "name" and "surname", which would become dictionary keys. If you know a bit of SQL (very little is required here) then go with Filter large file using python, using contents of another.
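For illustration, a minimal sketch of such a structure (the name="...";surname="..." layout is taken from the question; the file name is a placeholder):

import re
from collections import defaultdict

entry_re = re.compile(r'name="([^"]*)";surname="([^"]*)"')

index = defaultdict(list)  # name -> entries that mention it
with open('entries.txt') as f:
    for entry in f:
        m = entry_re.search(entry)
        if m:
            index[m.group(1)].append(entry.rstrip('\n'))

# "search by name" is then a single dictionary lookup
print(index.get('Jack', []))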
