I have a bunch of files (almost 100) which contain data of the format:
(number of people) \t (average age)
These files were generated from a random walk conducted on a population of a certain demographic. Each file has 100,000 lines, corresponding to the average age of populations of sizes from 1 to 100,000. Each file corresponds to a different locality in a third world country. We will be comparing these values to the average ages of similar sized localities in a developed country.
What I want to do is,
for each i (i ranges from 1 to 100,000):
Read in the first 'i' values of average-age
perform some statistics on these values
That means, for each run i (where i ranges from 1 to 100,000), read in the first i values of average-age, add them to a list and run a few tests (like Kolmogorov-Smirnov or chi-square)
In order to open all these files in parallel, I figured the best way would be a dictionary of file objects. But I am stuck on how to perform the above operations.
Is my method the best possible one (complexity-wise)?
Is there a better method?
Actually, it would be possible to hold 10,000,000 lines in memory.
Make a dictionary where the keys are number of people and the values are lists of average age, where each element of the list comes from a different file. Therefore, if there are 100 files, each of your lists will have 100 elements.
This way, you don't need to store the file objects in a dict.
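A minimal sketch of that structure, assuming a hypothetical filename pattern and the tab-separated format described in the question:

import glob

# population size -> list of average ages, one entry per locality file
ages_by_population = {}

for path in glob.glob('locality_*.txt'):   # hypothetical naming scheme
    with open(path) as f:
        for line in f:
            population, average_age = line.split('\t')
            ages_by_population.setdefault(int(population), []).append(float(average_age))

# with 100 files, ages_by_population[50000] is a list of 100 average ages,
# ready to be handed to whatever test (KS, chi-square, ...) you want to run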
Hope this helps
Why not take a simple approach:
Open each file sequentially and read its lines to fill an in-memory data structure
Perform statistics on the in-memory data structure
Here is a self-contained example with 3 "files", each containing 3 lines. It uses StringIO for convenience instead of actual files:
#!/usr/bin/env python
# coding: utf-8
from StringIO import StringIO
# for this example, each "file" has 3 lines instead of 100000
f1 = '1\t10\n2\t11\n3\t12'
f2 = '1\t13\n2\t14\n3\t15'
f3 = '1\t16\n2\t17\n3\t18'
files = [f1, f2, f3]
# data is a list of dictionaries mapping population to average age
# i.e. data[0][10000] contains the average age in location 0 (files[0]) with
# population of 10000.
data = []
for i, filename in enumerate(files):
    f = StringIO(filename)
    # f = open(filename, 'r')
    data.append(dict())
    for line in f:
        population, average_age = (int(s) for s in line.split('\t'))
        data[i][population] = average_age
print data
# gather custom statistics on the data
# i.e. here's how to calculate the average age across all locations where
# population is 2:
num_locations = len(data)
pop2_avg = sum((data[loc][2] for loc in xrange(num_locations)))/num_locations
print 'Average age with population 2 is', pop2_avg, 'years old'
The output is:
[{1: 10, 2: 11, 3: 12}, {1: 13, 2: 14, 3: 15}, {1: 16, 2: 17, 3: 18}]
Average age with population 2 is 14 years old
I ... don't know if I like this approach, but it's possible that it could work for you. It has the potential to consume large amounts of memory, but may do what you need it to. I make the assumption that your data files are numbered. If that's not the case this may need adaptation.
# open the files.
handles = [open('file-%d.txt' % i) for i in range(1, 101)]
# loop for the number of lines.
for line in range(100000):
    lines = [fh.readline() for fh in handles]
    # Some sort of processing for the list of lines.
That may get close to what you need, but again, I don't know that I like it. If you have any files that don't have the same number of lines this could run into trouble.
I have this block of code that does several things; it loops through files I have saved in a folder that are labeled 1-100. These files are all forecast files for a specific month (for example, June 2016). What this function does is read each file and go back to previous files to evaluate the forecasts. I store all the values by month. I want to see the totals for how accurate predictions were "one month ago", "two months ago", etc. I am able to do this with the code, however I am having trouble extracting the exact values that contribute to each total using arrays/lists. The portion that does not involve arrays or lists works, but I am wondering how I can extract these specific numbers. I want to use the appended numbers (the list) for graphing purposes later, which is why I am extracting them; the lines marked with #???? indicate the list portion that does not seem to work.
import pandas as pd
import csv
def nmonthaccuracy(basefilenumber, n):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='Latin-1')
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    nmonthread = pd.read_csv(str(basefilenumber-n)+'.csv', encoding='Latin-1')
    nmonthvalue = nmonthread.loc[nmonthread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    return int(nmonthvalue)/int(basefilevalue)
N = 12
total_by_month = [0] * N
total_by_month_list [] * N #????
for basefilenumber in range(24,36):
    for n in range(N):
        total_by_month[n] += nmonthaccuracy(basefilenumber, n)
        total_by_month_list[n].append(nmonthaccuracy(basefilenumber,n)) #????
onetotal = total_by_month[1]
twototal = total_by_month[2]
#etc
Try running your code by initializing total_by_month_list as
total_by_month_list = [[] for _ in range(N)]
Without your data this is somewhat speculative, but my understanding is that total_by_month_list should be a list of 12 sublists.
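For what it's worth, the reason the comprehension matters is that [[]] * N repeats a reference to one and the same list, so an append to one sublist shows up in all of them; a quick illustration:

N = 12

shared = [[]] * N                        # N references to the *same* empty list
shared[0].append(1)
print(shared[1])                         # [1] -- the append shows up everywhere

independent = [[] for _ in range(N)]     # N distinct empty lists
independent[0].append(1)
print(independent[1])                    # [] -- only sublist 0 changed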
Because of a memory error, I have to split my csv files. I researched it and found a solution from a Stack Overflow user, Aziz Alto. This is his code:
csvfile = open('#', 'r').readlines()
filename = 1
for i in range(len(csvfile)):
    if i % 10000000 == 0:
        open(str(filename) + '.csv', 'w+').writelines(csvfile[i:i+10000000])
        filename += 1
It works well, but for the second file the code does not add the header, which is very important for me. My question is: how can I add the header to the second file?
import pandas as pd
rows = pd.read_csv("csvfile.csv", chunksize=5000000)
for i, chunk in enumerate(rows):
    chunk.to_csv('out{}.csv'.format(i))  # i is the chunk number for each iteration
With chunksize you specify how many rows you want per output file; in Excel you can have up to 1,048,576 rows.
This will save each file with 5,000,000 rows and with the header.
Hope this helps!
On the 2nd through the last file you always have to add the 1st line of your original file (the one containing the header):
# this loads the first file fully into memory
with open('#', 'r') as f:
    csvfile = f.readlines()

linesPerFile = 1000000
filename = 1
# this is better than your former loop: it steps 1000000 lines at a time,
# instead of incrementing 1000000 times and only writing on the millionth one
for i in range(0, len(csvfile), linesPerFile):
    with open(str(filename) + '.csv', 'w+') as f:
        if filename > 1:  # this is the second or later file, we need to write the
            f.write(csvfile[0])  # header again if 2nd.... file
        f.writelines(csvfile[i:i+linesPerFile])
    filename += 1
Fast csv file splitting
If you have a very big file and you have to try different partitions (say to find the best way to split it) the above solutions are too slow to try.
Another way to solve this (and a very fast one) is to create an index file by record number. It takes about six minutes to create an index file of a csv file of 6867839 rows and 9 Gb, and an additional 2 minutes for joblib to store it on disk.
This method is particularly impressive if you are dealing with huge files, like 3 Gb or more.
Here's the code for creating the index file:
# Usage:
# creaidx.py filename.csv
# indexes a csv file by record number. This can be used to
# access any record directly or to split a file without the
# need of reading it all. The index file is joblib-stored as
# filename.index
# filename.csv is the file to create index for
import os,sys,joblib

BLKSIZE=512

def checkopen(s,m='r',bz=None):
    if os.access(s,os.F_OK):
        if bz==None:
            return open(s,m)     # returns open file
        else:
            return open(s,m,bz)  # returns open file with buffer size
    else:
        return None

def get_blk():
    global ix,off,blk,buff
    while True:  # dealing with special cases
        if ix==0:
            n=0
            break
        if buff[0]==b'\r':
            n=2
            off=0
            break
        if off==BLKSIZE-2:
            n=0
            off=0
            break
        if off==BLKSIZE-1:
            n=0
            off=1
            break
        n=2
        off=buff.find(b'\r')
        break
    while (off>=0 and off<BLKSIZE-2):
        idx.append([ix,blk,off+n])
        # g.write('{},{},{}\n'.format(ix,blk,off+n))
        print(ix,end='\r')
        n=2
        ix+=1
        off= buff.find(b'\r',off+2)

def crea_idx():
    global buff,blk
    buff=f.read(BLKSIZE)
    while len(buff)==BLKSIZE:
        get_blk()
        buff=f.read(BLKSIZE)
        blk+=1
    get_blk()
    idx[-1][2]=-1
    return

if len(sys.argv)==1:
    sys.exit("Need to provide a csv filename!")

ix=0
blk=0
off=0
idx=[]
buff=b'0'

s=sys.argv[1]
f=checkopen(s,'rb')
idxfile=s.replace('.csv','.index')
if checkopen(idxfile)==None:
    with open(idxfile,'w') as g:
        crea_idx()
    joblib.dump(idx,idxfile)
else:
    if os.path.getctime(idxfile)<os.path.getctime(s):
        with open(idxfile,'w') as g:
            crea_idx()
        joblib.dump(idx,idxfile)
f.close()
Let's use a toy example:
strings,numbers,colors
string1,1,blue
string2,2,red
string3,3,green
string4,4,yellow
The index file will be:
[[0, 0, 0],
[1, 0, 24],
[2, 0, 40],
[3, 0, 55],
[4, 0, 72],
[5, 0, -1]]
Note the -1 at the last index element to indicate end of index file in case of a sequential access. You can use a tool like this to access any individual row of the csv file:
def get_rec(n=1,binary=False):
    n=1 if n<0 else n+1
    s=b'' if binary else ''
    if len(idx)==0:return ''
    if idx[n-1][2]==-1:return ''
    f.seek(idx[n-1][1]*BLKSIZE+idx[n-1][2])
    buff=f.read(BLKSIZE)
    x=buff.find(b'\r')
    while x==-1:
        s=s+buff if binary else s+buff.decode()
        buff=f.read(BLKSIZE)
        x=buff.find(b'\r')
    return s+buff[:x]+b'\r\n' if binary else s+buff[:x].decode()
The first field of the index record is obviously unnecessary; it is kept there for debugging purposes. As a side note, if you replace this field with any field of the csv record and sort the index file by that field, you effectively get the csv file sorted by that field whenever you use the index to access it.
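For instance, a rough sketch of that idea (the keyed index entries and the loop body are hypothetical; it reuses idx, BLKSIZE and the open binary file f from the code above):

# hypothetical: suppose each index entry were [key_field, blk, off] instead of
# [record_number, blk, off]; sorting the entries yields the csv rows in key order
sorted_idx = sorted(idx[:-1], key=lambda rec: rec[0])   # drop the -1 sentinel entry

for key, blk, off in sorted_idx:
    f.seek(blk * BLKSIZE + off)   # jump straight to the indexed record
    line = f.readline()           # read that single record
    # ...process `line`; records are visited in sorted-key order...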
Now, once you have your index file created, you just call the following program with two command-line parameters: the filename (the one whose index was already created) and a number between 1 and 100, which is the percentage at which the file will be split:
import time  # needed for the timing calls below

start_time = time.time()
BLKSIZE=512
WSIZE=1048576  # pow(2,20) 1Mb for faster reading/writing
import sys
import joblib
from common import Drv,checkopen
# get_rec() is the record-access helper defined above

ix=0
blk=0
off=0
idx=[]
buff=b'0'

if len(sys.argv)<3:
    sys.exit('Argument missing!')
s=Drv+sys.argv[1]
if sys.argv[2].isnumeric():
    pct=int(sys.argv[2])/100
else:
    sys.exit('Bad percentage: '+sys.argv[2])
f=checkopen(s,'rb')
idxfile=s.replace('.csv','.index')
if checkopen(idxfile):
    print('Loading index...')
    idx=joblib.load(idxfile)
    print('Done loading index.')
else:
    sys.exit(idxfile+' does not exist.')

head=get_rec(0,True)
n=int(pct*(len(idx)-2))
off=idx[n+1][1]*BLKSIZE+idx[n+1][2]-len(head)-1
num=off//WSIZE
res=off%WSIZE
sout=s.replace('.csv','.part1.csv')
i=0
with open(sout,'wb') as g:
    g.write(head)
    f.seek(idx[1][1]*BLKSIZE+idx[1][2])
    for x in range(num):
        print(i,end='\r')
        i+=1
        buff=f.read(WSIZE)
        g.write(buff)
    buff=f.read(res)
    g.write(buff)
print()
i=0
sout=s.replace('.csv','.part2.csv')
with open(sout,'wb') as g:
    g.write(head)
    f.seek(idx[n+1][1]*BLKSIZE+idx[n+1][2])
    buff=f.read(WSIZE)
    while len(buff)==WSIZE:
        g.write(buff)
        print(i,end='\r')
        i+=1
        buff=f.read(WSIZE)
    g.write(buff)
end_time = time.time()
The files are created using blocks of 1048576 bytes. You can play with that figure to make file creation faster or to tailor it to machines with less memory.
The file is split into only two partitions, each of them carrying the header of the original file. It is not too difficult to change the code to split a file into more than two partitions.
Finally to put things in perspective, to split a csv file of 6867839 rows and 9 Gb by 50%, it took me roughly 6 minutes to create the index file and another 2 minutes for joblib to store it on disk. It took 3 additional minutes to split the file.
I have a large CSV file full of stock-related data formatted as such:
Ticker Symbol, Date, [some variables...]
So each line starts of with the symbol (like "AMZN"), then has the date, then has 12 variables related to price or volume on the selected date. There are about 10,000 different securities represented in this file and I have a line for each day that the stock has been publicly traded for each of them. The file is ordered first alphabetically by ticker symbol and second chronologically by date. The entire file is about 3.3 GB.
The task I want to solve is extracting the most recent n lines of data for a given ticker symbol with respect to the current date. I have code that does this, but based on my observations it seems to take, on average, around 8-10 seconds per retrieval (all tests have been extracting 100 lines).
I have functions I'd like to run that require me to grab such chunks for hundreds or thousands of symbols, and I would really like to reduce the time. My code is inefficient, but I am not sure how to make it run faster.
First, I have a function called getData:
def getData(symbol, filename):
    out = ["Symbol","Date","Open","High","Low","Close","Volume","Dividend",
           "Split","Adj_Open","Adj_High","Adj_Low","Adj_Close","Adj_Volume"]
    l = len(symbol)
    beforeMatch = True
    with open(filename, 'r') as f:
        for line in f:
            match = checkMatch(symbol, l, line)
            if beforeMatch and match:
                beforeMatch = False
                out.append(formatLineData(line[:-1].split(",")))
            elif not beforeMatch and match:
                out.append(formatLineData(line[:-1].split(",")))
            elif not beforeMatch and not match:
                break
    return out
(This code has a couple of helper functions, checkMatch and formatLineData, which I will show below.) Then, there is another function called getDataColumn that gets the column I want with the correct number of days represented:
def getDataColumn(symbol, col=12, numDays=100, changeRateTransform=False):
    dataset = getData(symbol)
    if not changeRateTransform:
        column = [day[col] for day in dataset[-numDays:]]
    else:
        n = len(dataset)
        column = [(dataset[i][col] - dataset[i-1][col])/dataset[i-1][col] for i in range(n - numDays, n)]
    return column
(changeRateTransform converts raw numbers into daily change rate numbers if True.) The helper functions:
def checkMatch(symbol, symbolLength, line):
    out = False
    if line[:symbolLength+1] == symbol + ",":
        out = True
    return out

def formatLineData(lineData):
    out = [lineData[0]]
    out.append(datetime.strptime(lineData[1], '%Y-%m-%d').date())
    out += [float(d) for d in lineData[2:6]]
    out += [int(float(d)) for d in lineData[6:9]]
    out += [float(d) for d in lineData[9:13]]
    out.append(int(float(lineData[13])))
    return out
Does anyone have any insight into which parts of my code are slow and how I can make this perform better? I can't do the sort of analysis I want to do without speeding this up.
EDIT:
In response to the comments, I made some changes to the code in order to utilize the existing methods in the csv module:
def getData(symbol, database):
    out = ["Symbol","Date","Open","High","Low","Close","Volume","Dividend",
           "Split","Adj_Open","Adj_High","Adj_Low","Adj_Close","Adj_Volume"]
    l = len(symbol)
    beforeMatch = True
    with open(database, 'r') as f:
        databaseReader = csv.reader(f, delimiter=",")
        for row in databaseReader:
            match = (row[0] == symbol)
            if beforeMatch and match:
                beforeMatch = False
                out.append(formatLineData(row))
            elif not beforeMatch and match:
                out.append(formatLineData(row))
            elif not beforeMatch and not match:
                break
    return out
def getDataColumn(dataset, col=12, numDays=100, changeRateTransform=False):
    if not changeRateTransform:
        out = [day[col] for day in dataset[-numDays:]]
    else:
        n = len(dataset)
        out = [(dataset[i][col] - dataset[i-1][col])/dataset[i-1][col] for i in range(n - numDays, n)]
    return out
Performance was worse using the csv.reader class. I tested on two stocks, AMZN (near top of file) and ZNGA (near bottom of file). With the original method, the run times were 0.99 seconds and 18.37 seconds, respectively. With the new method leveraging the csv module, the run times were 3.04 seconds and 64.94 seconds, respectively. Both return the correct results.
My thought is that the time is being taken up more from finding the stock than from the parsing. If I try these methods on the first stock in the file, A, the methods both run in about 0.12 seconds.
When you're going to do lots of analysis on the same dataset, the pragmatic approach would be to read it all into a database. It is made for fast querying; CSV isn't. Use the sqlite command line tools, for example, which can directly import from CSV. Then add a single index on (Symbol, Date) and lookups will be practically instantaneous.
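For illustration, a minimal sketch of that approach using Python's built-in sqlite3 module instead of the command-line tools; the file name, table name, and the choice to keep only the Close column are assumptions for brevity:

import csv
import sqlite3

conn = sqlite3.connect('prices.db')
conn.execute('CREATE TABLE IF NOT EXISTS prices (symbol TEXT, date TEXT, close REAL)')

# one-time import; column positions follow the layout described in the question
with open('prices.csv') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    conn.executemany('INSERT INTO prices VALUES (?, ?, ?)',
                     ((row[0], row[1], float(row[5])) for row in reader))

# the single index that makes per-symbol, per-date lookups near-instantaneous
conn.execute('CREATE INDEX IF NOT EXISTS idx_symbol_date ON prices (symbol, date)')
conn.commit()

# the most recent 100 rows for one ticker
rows = conn.execute('SELECT * FROM prices WHERE symbol = ? ORDER BY date DESC LIMIT 100',
                    ('AMZN',)).fetchall()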
If for some reason that is not feasible, for example because new files can come in at any moment and you cannot afford the preparation time before starting your analysis of them, you'll have to make the best of dealing with CSV directly, which is what the rest of my answer will focus on. Remember that it's a balancing act, though. Either you pay a lot upfront, or a bit extra for every lookup. Eventually, for some amount of lookups it would have been cheaper to pay upfront.
Optimization is about maximizing the amount of work not done. Using generators and the built-in csv module aren't going to help much with that in this case. You'd still be reading the whole file and parsing all of it, at least for line breaks. With that amount of data, it's a no-go.
Parsing requires reading, so you'll have to find a way around it first. Best practices of leaving all intricacies of the CSV format to the specialized module bear no meaning when they can't give you the performance you want. Some cheating must be done, but as little as possible. In this case, I suppose it is safe to assume that the start of a new line can be identified as b'\n"AMZN",' (sticking with your example). Yes, binary here, because remember: no parsing yet. You could scan the file as binary from the beginning until you find the first line. From there read the amount of lines you need, decode and parse them the proper way, etc. No need for optimization there, because 100 lines are nothing to worry about compared to the hundreds of thousands of irrelevant lines you're not doing that work for.
Dropping all that parsing buys you a lot, but the reading needs to be optimized as well. Don't load the whole file into memory first and skip as many layers of Python as you can. Using mmap lets the OS decide what to load into memory transparently and lets you work with the data directly.
Still you're potentially reading the whole file, if the symbol is near the end. It's a linear search, which means the time it takes is linearly proportional to the number of lines in the file. You can do better though. Because the file is sorted, you could improve the function to instead perform a kind of binary search. The number of steps that will take (where a step is reading a line) is close to the binary logarithm of the number of lines. In other words: the number of times you can divide your file into two (almost) equally sized parts. When there are one million lines, that's a difference of five orders of magnitude!
Here's what I came up with, based on Python's own bisect_left with some measures to account for the fact that your "values" span more than one index:
import csv
from itertools import islice
import mmap

def iter_symbol_lines(f, symbol):
    # How to recognize the start of a line of interest
    ident = b'"' + symbol.encode() + b'",'
    # The memory-mapped file
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Skip the header
    mm.readline()
    # The inclusive lower bound of the byte range we're still interested in
    lo = mm.tell()
    # The exclusive upper bound of the byte range we're still interested in
    hi = mm.size()
    # As long as the range isn't empty
    while lo < hi:
        # Find the position of the beginning of a line near the middle of the range
        mid = mm.rfind(b'\n', 0, (lo+hi)//2) + 1
        # Go to that position
        mm.seek(mid)
        # Is it a line that comes before lines we're interested in?
        if mm.readline() < ident:
            # If so, ignore everything up to right after this line
            lo = mm.tell()
        else:
            # Otherwise, ignore everything from right before this line
            hi = mid
    # We found where the first line of interest would be expected; go there
    mm.seek(lo)
    while True:
        line = mm.readline()
        if not line.startswith(ident):
            break
        yield line.decode()

with open(filename) as f:
    r = csv.reader(islice(iter_symbol_lines(f, 'AMZN'), 10))
    for line in r:
        print(line)
No guarantees about this code; I didn't pay much attention to edge cases, and I couldn't test with (any of) your file(s), so consider it a proof of concept. It is plenty fast, however – think tens of milliseconds on an SSD!
So I have an alternative solution which I ran and tested on my own, with a sample data set from Quandl that appears to have all the same headers and similar data (assuming I haven't misunderstood the end result you're trying to achieve).
I have a command-line tool that one of our engineers built for us for parsing massive csvs - since I deal with absurd amounts of data on a day-to-day basis - it is open sourced and you can get it here: https://github.com/DataFoxCo/gocsv
I also wrote a short bash script for it in case you don't want to pipeline the commands yourself, but it does also support pipelining.
The command to run the following short script follows a super simple convention:
bash tickers.sh wikiprices.csv 'AMZN' '2016-12-\d+|2016-11-\d+'
#!/bin/bash
dates="$3"
cat "$1" \
| gocsv filter --columns 'ticker' --regex "$2" \
| gocsv filter --columns 'date' --regex "$dates" > "$2"'-out.csv'
Both arguments, for ticker and for dates, are regexes.
You can add as many variations as you want into that one regex, separating them by |.
So if you wanted AMZN and MSFT then you would simply modify it to this: AMZN|MSFT
I did something very similar with the dates - but I only limited my sample run to dates from this month or last month.
End Result
Starting data:
myusername$ gocsv dims wikiprices.csv
Dimensions:
Rows: 23946
Columns: 14
myusername$ bash tickers.sh wikiprices.csv 'AMZN|MSFT' '2016-12-\d+'
myusername$ gocsv dims AMZN|MSFT-out.csv
Dimensions:
Rows: 24
Columns: 14
Here is a sample where I limited it to only those 2 tickers and then to December only:
Voila - in a matter of seconds you have a second file saved with only the data you care about.
The gocsv program has great documentation, by the way - and a ton of other functions, e.g. running a vlookup basically at any scale (which is what inspired the creator to make the tool).
In addition to using csv.reader, I think using itertools.groupby would speed up looking for the wanted sections, so the actual iteration could look something like this:
import csv
from itertools import groupby
from operator import itemgetter #for the keyfunc for groupby
def getData(wanted_symbol, filename):
    with open(filename) as file:
        reader = csv.reader(file)
        # so each line in reader is basically line[:-1].split(",") from the plain file
        for symb, lines in groupby(reader, itemgetter(0)):
            # so here symb is the symbol at the start of each line of lines
            # and lines is the lines that all have that symbol in common
            if symb != wanted_symbol:
                continue  # skip this whole section if it has a different symbol
            for line in lines:
                # here we have each line as a list of fields
                # for only the lines that have `wanted_symbol` as the first element
                <DO STUFF HERE>
So in the place of <DO STUFF HERE> you could have out.append(formatLineData(line)) to do what your current code does. However, the code for that function has a lot of unnecessary slicing and += operators, which I think are pretty expensive for lists (might be wrong). Another way you could apply the conversions is to have a list of all the conversions:
from datetime import datetime  # needed for conv_date below

def conv_date(date_str):
    return datetime.strptime(date_str, '%Y-%m-%d').date()

# the conversions applied to each element (taken from original formatLineData)
castings = [str, conv_date,                # 0, 1
            float, float, float, float,    # 2:6
            int, int, int,                 # 6:9
            float, float, float, float,    # 9:13
            int]                           # 13
then use zip to apply these to each field in a line in a list comprehension:
[conv(val) for conv, val in zip(castings, line)]
So you would replace <DO STUFF HERE> with an out.append of that comprehension.
I'd also wonder if switching the order of groupby and reader would be better, since you don't need to parse most of the file as csv, just the parts you are actually iterating over. You could use a keyfunc that separates just the first field of the string:
def getData(wanted_symbol, filename):
    out = []  # why are you starting this with strings in it?
    def checkMatch(line):  # define the function to only take the line
        # this would be the keyfunc for groupby in this example
        return line.split(",",1)[0]  # only split once, return the first element
    with open(filename) as file:
        for symb, lines in groupby(file, checkMatch):
            # so here symb is the symbol at the start of each line of lines
            if symb != wanted_symbol:
                continue  # skip this whole section if it has a different symbol
            for line in csv.reader(lines):
                out.append([typ(val) for typ, val in zip(castings, line)])
    return out
I'm trying to extract values from files containing 3-dimensional arrays, with the dimensions corresponding to [time][long][lat].
I have separate files for each years worth of data. I have a list of observed data points T_obs for a duration of time start_time_index to end_time_index that I want to compare to the mean value over that time period in the file data.
The data sets contained in the files are sufficiently large that my code is running very slowly and I want to optimize my execution time. The code I have currently is below. Are there any ways I could significantly save time?
import os
import numpy as np
from netCDF4 import Dataset  # assuming netCDF4 is the source of Dataset, since the files are .nc

T_obs = [1.5, 3.6, 4.5]
start_time_index = [20, 300, 10]
end_time_index = [40, 328, 200]
long_obs = [45, 54, 180]
lat_obs = [34, 65, 32]

LE = np.zeros(len(T_obs))
t = 1984
for filename in os.listdir("C:\\Directory"):
    if filename.endswith(".nc"):
        print(filename)
        fh = Dataset("C:\\Directory %s" % filename, 'r').variables['matrix']
        for i in range(0, len(long_obs)):
            if year[i] == t and start_time_index[i] > 0:
                LE_t = []
                for x in range(int(start_time_index[i]), int(end_time_index[i])):
                    LE_t = np.append(LE_t, float(fh[x][long_obs[i]+180][lat_obs[i]*-1+90])/10)
                LE[i] = np.mean(LE_t)
        t += 1
        continue
    else:
        continue
Are you going to do this kind of work many times (with different values), or just once?
If the former, you can try to put your file data in a database (such as MySQL, which is simple to set up) and create indexes on the start and end times. This would make your reads much faster, as you would not need to do a full table scan (which is pretty much what you are doing by reading the whole file).
If you are doing it just once, then I suggest you just wait it out. There is no trivial way to make I/O (which is your bottleneck) less expensive, and it seems that to run your checks and compare the data, you actually need to go through the whole dataset.
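If you do go the database route, a rough sketch of the idea with the standard-library sqlite3 module (the table layout, the one-time load of each yearly file, and the column names are all assumptions for illustration):

import sqlite3

conn = sqlite3.connect('observations.db')
conn.execute('CREATE TABLE IF NOT EXISTS obs (year INT, t INT, lon INT, lat INT, value REAL)')
# index so that a (year, lon, lat, time-range) lookup never scans the whole table
conn.execute('CREATE INDEX IF NOT EXISTS idx_obs ON obs (year, lon, lat, t)')

# ...one-time load of each yearly .nc file into `obs` goes here...

# the mean over one observation window then becomes a single indexed query
(mean_value,) = conn.execute(
    'SELECT AVG(value) FROM obs WHERE year=? AND lon=? AND lat=? AND t >= ? AND t < ?',
    (1984, 45 + 180, 34 * -1 + 90, 20, 40)).fetchone()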
I have a ~1GB text file of data entries and another list of names that I would like to use to filter them. Running through every name for each entry will be terribly slow. What's the most efficient way of doing this in Python? Is it possible to use a hash table if the name is embedded in the entry? Can I make use of the fact that the name part is consistently placed?
Example files:
Entries file -- each part of the entry is separated by a tab, until the names
246 lalala name="Jack";surname="Smith"
1357 dedada name="Mary";surname="White"
123456 lala name="Dan";surname="Brown"
555555 lalala name="Jack";surname="Joe"
Names file -- each on a newline
Jack
Dan
Ryan
Desired output -- only entries with a name in the names file
246 lalala name="Jack";surname="Smith"
123456 lala name="Dan";surname="Brown"
555555 lalala name="Jack";surname="Joe"
You can use the set data structure to store the names — it offers efficient lookup but if the names list is very large then you may run into memory troubles.
The general idea is to iterate through all the names, adding them to a set, then checking if each name from each line from the data file is contained in the set. As the format of the entries doesn't vary, you should be able to extract the names with a simple regular expression.
If you run into troubles with the size of the names set, you can read n lines from the names file and repeat the process for each set of names, unless you require sorting.
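A minimal sketch of that idea, assuming hypothetical file names and keying off the name="..." field shown in the example entries:

import re

# the name field sits in a fixed position in each entry, so a simple regex suffices
name_re = re.compile(r'name="([^"]*)"')

# build the set of wanted names once
with open('names.txt') as f:
    wanted = {line.strip() for line in f if line.strip()}

# stream the entries file and keep only matching lines
with open('entries.txt') as entries, open('filtered.txt', 'w') as out:
    for entry in entries:
        m = name_re.search(entry)
        if m and m.group(1) in wanted:
            out.write(entry)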
My first instinct was to make a dictionary of with names as keys, assuming that it was most efficient to look up the names using the hash of keys in the dictionary.
Given the answer by @rfw using a set of names, I edited the code as below and tested the two methods against each other, using a dict of names and a set.
I built a dummy dataset of over 40 M records and over 5400 names. Using this dataset, the set method consistently had the edge on my machine.
import re
from collections import Counter
import time

# names file downloaded from http://www.tucows.com/preview/520007
# the set contains over 5400 names
f = open('./names.txt', 'r')
names = [ name.rstrip() for name in f.read().split(',') ]
name_set = set(names)        # set of unique names
names_dict = Counter(names)  # Counter ~= dict of names with counts

# Expect: 246 lalala name="Jack";surname="Smith"
pattern = re.compile(r'.*\sname="([^"]*)"')

def select_rows_set():
    f = open('./data.txt', 'r')
    out_f = open('./data_out_set.txt', 'a')
    for record in f.readlines():
        name = pattern.match(record).groups()[0]
        if name in name_set:
            out_f.write(record)
    out_f.close()
    f.close()

def select_rows_dict():
    f = open('./data.txt', 'r')
    out_f = open('./data_out_dict.txt', 'a')
    for record in f.readlines():
        name = pattern.match(record).groups()[0]
        if name in names_dict:
            out_f.write(record)
    out_f.close()
    f.close()

if __name__ == '__main__':
    # One round to time the use of name_set
    t0 = time.time()
    select_rows_set()
    t1 = time.time()
    time_for_set = t1-t0
    print 'Total set: ', time_for_set

    # One round to time the use of names_dict
    t0 = time.time()
    select_rows_dict()
    t1 = time.time()
    time_for_dict = t1-t0
    print 'Total dict: ', time_for_dict
I assumed that a Counter, being at heart a dictionary, and easier to build from the dataset, does not add any overhead to the access time. Happy to be corrected if I am missing something.
Your data is clearly structured as a table so this may be applicable.
Data structure for maintaining tabular data in memory?
You could create a custom data structure with its own "search by name" function. That'd be a list of dictionaries of some sort. This should take less memory than the size of your text file, as it removes the duplicate information you have on each line such as "name" and "surname", which would become dictionary keys. If you know a bit of SQL (very little is required here), then go with "Filter large file using python, using contents of another".
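For what it's worth, a rough sketch of such a structure, assuming the tab-separated entry format from the question (the field names and helper functions are made up for illustration):

import re

# matches lines like: 246<TAB>lalala<TAB>name="Jack";surname="Smith"
line_re = re.compile(r'^(\S+)\t(\S+)\tname="([^"]*)";surname="([^"]*)"')

def load_entries(path):
    """Parse each entry line into a small dict; a hypothetical helper."""
    entries = []
    with open(path) as f:
        for line in f:
            m = line_re.match(line)
            if m:
                entries.append({'id': m.group(1), 'tag': m.group(2),
                                'name': m.group(3), 'surname': m.group(4)})
    return entries

def search_by_name(entries, wanted_names):
    # set lookup keeps the per-entry check O(1)
    wanted = set(wanted_names)
    return [e for e in entries if e['name'] in wanted]

# entries = load_entries('entries.txt')
# matches = search_by_name(entries, ['Jack', 'Dan', 'Ryan'])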