Problem:
I have 50 text files, each with thousands of lines of text, and each line holds a value. I am only interested in a small section near the middle (lines 757-827 - it is actually lines 745-805 I'm interested in, but the first 12 lines of every file are irrelevant). I would like to read each file in and total the values between those lines. In the end I would like it to print a pair of numbers in the format (((n+1)*18), total count), where n is the number of the file (since they are numbered starting at zero), then repeat for all 50 files, giving 50 pairs of numbers, looking something like:
(18,77),(36,63),(54,50),(72,42),...
Code:
import numpy as np
%matplotlib inline
from numpy import loadtxt, linspace
import glob, os
fileToRun = 'Run0'
location = 'ControlRoom6'
DeadTime = 3
LiveTime = 15
folderId = '\\'
baseFolder = 'C:'+folderId+'Users'+folderId+location+folderId+'Documents'+folderId+'PhD'+folderId+'Ubuntu-Analysis-DCF'+folderId+'DCF-an-b+decay'+folderId+'dcp-ap-27Al'+folderId+''
prefix = 'DECAY_COINC'
folderToAnalyze = baseFolder + fileToRun + '\\'
MaestroT = LiveTime + DeadTime
## Gets number of files
files = []
os.chdir(folderToAnalyze)
for file in glob.glob(prefix + "*.Spe"):
    files.append(file)
numfiles = len(files)
if numfiles <= 1:
    print('numfiles is {0}, minimum of 2 is required'.format(numfiles))
    raise SystemExit(0)
xmin = 745
xmax = 815
skips = 12
n = []
count = []
for n in range(0, numfiles):
    x = np.linspace(0, 8191, 8192)
    finalprefix = str(n).zfill(3)
    fullprefix = folderToAnalyze + prefix + finalprefix
    y = loadtxt(fullprefix + ".Spe", skiprows=12, max_rows=8192)
    for x in range(xmin + skips, xmax + skips):
        count = count + y
    time = MaestroT * (n + 1)
    print(time, count)
Current output is:
ValueError                                Traceback (most recent call last)
in
     84
     85 for x in range(xmin+skips,xmax+skips):
---> 86     count = count + y
     87 time = MaestroT*(n+1)
     88
ValueError: operands could not be broadcast together with shapes (0,) (8192,)
However, I did previously have this running; it just printed out thousands of seemingly unconnected numbers. Does anyone know how I can alter the code to achieve the desired result?
EDIT: Data Set
In order to make the example easier to use, I've made a Dropbox folder with some dummy data. The files are named the same as the ones the code reads in, and are written in the same format (the first 12 rows filled with unusable information). Link is Here. I haven't written 8192 dummy numbers, as I thought it would be easier and produce a nearer facsimile to just use the actual files with a few numbers changed.
Solution was to edit code as shown starting from 'xmin = 745':
xmin = 745
xmax = 815
skip = 12
for n in range(0, numfiles):
    total = 0
    x = np.linspace(0, 8191, 8192)
    finalprefix = str(n).zfill(3)
    fullprefix = folderToAnalyze + prefix + finalprefix
    y = loadtxt(fullprefix + ".Spe", skiprows=xmin + skip, max_rows=xmax - xmin)
    for x in y:
        val = int(x)
        total = total + val
    print(((n + 1) * MaestroT), total)
Prints out as
18 74
36 64
54 62
72 54
90 47
108 39
126 40
144 35
etc.
Which fit my needs.
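As an aside: since loadtxt already returns a NumPy array, the inner loop can be replaced with a slice and a single sum. A minimal sketch of that variant, reusing the same variables as above (untested against the real files):

for n in range(numfiles):
    fullprefix = folderToAnalyze + prefix + str(n).zfill(3)
    # skip only the 12-line header, then sum just the window of interest
    y = loadtxt(fullprefix + ".Spe", skiprows=skip, max_rows=8192)
    total = int(y[xmin:xmax].sum())
    print(((n + 1) * MaestroT), total)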
Related
I'm trying to solve the knapsack problem using Python, implementing a greedy algorithm. The result I'm getting back makes no sense to me.
Knapsack:
The first line gives the number of items, in this case 20. The last line gives the capacity of the knapsack, in this case 524. The remaining lines give the index, value and weight of each item.
20
1 91 29
2 60 65
3 61 71
4 9 60
5 79 45
6 46 71
7 19 22
8 57 97
9 8 6
10 84 91
11 20 57
12 72 60
13 32 49
14 31 89
15 28 2
16 81 30
17 55 90
18 43 25
19 100 82
20 27 19
524
Python code:
import os

def constructive():
    knapsack = []
    Weight = 0
    while(Weight <= cap):
        best = max(values)
        i = values.index(best)
        knapsack.append(i)
        Weight = Weight + weights[i]
        del values[i]
        del weights[i]
    return knapsack, Weight

def read_kfile(fname):
    with open(fname, 'rU') as kfile:
        lines = kfile.readlines()  # reads the whole file
    n = int(lines[0])
    c = int(lines[n+1])
    vs = []
    ws = []
    lines = lines[1:n+1]  # Removes the first and last line
    for l in lines:
        numbers = l.split()  # Converts the string into a list
        vs.append(int(numbers[1]))  # Appends value, need to convert to int
        ws.append(int(numbers[2]))  # Appends weight, need to convert to int
    return n, c, vs, ws

dir_path = os.path.dirname(os.path.realpath(__file__))  # Get the directory where the file is located
os.chdir(dir_path)  # Change the working directory so we can read the file
knapfile = 'knap20.txt'
nitems, cap, values, weights = read_kfile(knapfile)
val1, val2 = constructive()
print('knapsack', val1)
print('weight', val2)
print('cap', cap)
Result:
knapsack [18, 0, 8, 13, 3, 8, 1, 0, 3]
weight 570
cap 524
Welcome. The reason your program gives a weight over the cap limit is that, for the final item you put in the knapsack, you aren't checking whether it can actually fit. To fix this, just add an if statement. You should also check whether the list of values is empty. Note that I append (i+1), since your text file's indexing starts at 1 but Python starts its list indices at 0:
def constructive():
    knapsack = []
    Weight = 0
    while(Weight <= cap and values):
        best = max(values)
        i = values.index(best)
        if weights[i] <= cap - Weight:
            knapsack.append(i + 1)
            Weight = Weight + weights[i]
        del values[i]
        del weights[i]
    return knapsack, Weight
The problem is that in the last step, the best item you find will exceed the maximum weight. But since you have already entered the loop, you add it anyway.
In the next iteration you recognize that you are over the cap and stop.
I am not sure how you want to proceed once the next best item is too heavy. In case you simply want to stop and not add anything more, you can modify your constructive to look as follows:
def constructive():
    knapsack = []
    Weight = 0
    while(True):
        best = max(values)
        i = values.index(best)
        if Weight + weights[i] > cap:
            break
        knapsack.append(i)
        Weight = Weight + weights[i]
        del values[i]
        del weights[i]
    return knapsack, Weight
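For illustration, here is how this version behaves on a hypothetical three-item instance (toy numbers, not the knap20.txt data): it stops as soon as the best remaining item would overshoot the cap.

values = [60, 100, 120]   # hypothetical item values
weights = [10, 20, 30]    # hypothetical item weights
cap = 50                  # hypothetical capacity

print(constructive())     # -> ([2, 1], 50): takes 120, then 100, stops before 60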
So, I have this challenge on CodeEval, but I don't seem to know where to start, so I need some pointers (and answers, if you can) to help me figure out this challenge.
DESCRIPTION:
There is a board (matrix). Every cell of the board contains one integer, which is 0 initially.
The next operations can be applied to the Query Board:
SetRow i x: it means that all values in the cells on row "i" have been changed to value "x" after this operation.
SetCol j x: it means that all values in the cells on column "j" have been changed to value "x" after this operation.
QueryRow i: it means that you should output the sum of values on row "i".
QueryCol j: it means that you should output the sum of values on column "j".
The board's dimensions are 256x256
i and j are integers from 0 to 255
x is an integer from 0 to 31
INPUT SAMPLE:
Your program should accept as its first argument a path to a filename. Each line in this file contains an operation of a query. E.g.
SetCol 32 20
SetRow 15 7
SetRow 16 31
QueryCol 32
SetCol 2 14
QueryRow 10
SetCol 14 0
QueryRow 15
SetRow 10 1
QueryCol 2
OUTPUT SAMPLE:
For each query, output the answer of the query. E.g.
5118
34
1792
3571
I'm not that great at Python, but this challenge is pretty interesting, although I didn't have any clue how to solve it. So, I need some help from you guys.
Thanks!
You could use a sparse matrix for this; addressed by (col, row) tuples as keys in a dictionary, to save memory. 64k cells is a big list otherwise (2MB+ on a 64-bit system):
matrix = {}
This is way more efficient, as the challenge is unlikely to set values for all rows and columns on the board.
Setting a column or row is then:
def set_col(col, x):
    for i in range(256):
        matrix[i, col] = x

def set_row(row, x):
    for i in range(256):
        matrix[row, i] = x

and summing a row or column is then:

def get_col(col):
    return sum(matrix.get((i, col), 0) for i in range(256))

def get_row(row):
    return sum(matrix.get((row, i), 0) for i in range(256))
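To tie these together, a minimal driver (my sketch, not part of the original answer) could read the operations file named on the command line and dispatch on the first word:

import sys

# assumes matrix, set_row, set_col, get_row, get_col as defined above
def process(lines):
    for line in lines:
        parts = line.split()
        if parts[0] == "SetRow":
            set_row(int(parts[1]), int(parts[2]))
        elif parts[0] == "SetCol":
            set_col(int(parts[1]), int(parts[2]))
        elif parts[0] == "QueryRow":
            print(get_row(int(parts[1])))
        elif parts[0] == "QueryCol":
            print(get_col(int(parts[1])))

with open(sys.argv[1]) as f:
    process(f)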
WIDTH, HEIGHT = 256, 256
board = [[0] * WIDTH for i in range(HEIGHT)]
def set_row(i, x):
    global board
    board[i] = [x] * WIDTH
... implement each function, then parse each line of input to decide which function to call,
for line in inf:
    dat = line.split()
    if dat[0] == "SetRow":
        set_row(int(dat[1]), int(dat[2]))
    elif ...
Edit: Per Martijn's comments:
total memory usage for board is about 2.1MB. By comparison, after 100 random row/column writes, matrix is 3.1MB (although it tops out there and doesn't get any bigger).
yes, global is unnecessary when modifying a global object (just don't try to assign to it).
while dispatching from a dict is good and efficient, I did not want to inflict it on someone who is "not that great on Python", especially for just four entries.
For sake of comparison, how about
time = 0
WIDTH, HEIGHT = 256, 256
INIT = 0
rows = [(time, INIT) for _ in range(WIDTH)]
cols = [(time, INIT) for _ in range(HEIGHT)]
def set_row(i, x):
    global time
    time += 1
    rows[int(i)] = (time, int(x))

def set_col(i, x):
    global time
    time += 1
    cols[int(i)] = (time, int(x))

def query_row(i):
    rt, rv = rows[int(i)]
    total = rv * WIDTH + sum(cv - rv for ct, cv in cols if ct > rt)
    print(total)

def query_col(j):
    ct, cv = cols[int(j)]
    total = cv * HEIGHT + sum(rv - cv for rt, rv in rows if rt > ct)
    print(total)

ops = {
    "SetRow": set_row,
    "SetCol": set_col,
    "QueryRow": query_row,
    "QueryCol": query_col
}
inf = """SetCol 32 20
SetRow 15 7
SetRow 16 31
QueryCol 32
SetCol 2 14
QueryRow 10
SetCol 14 0
QueryRow 15
SetRow 10 1
QueryCol 2""".splitlines()
for line in inf:
    line = line.split()
    op = line.pop(0)
    ops[op](*line)
which only uses 4.3k of memory for rows[] and cols[].
Edit2:
using your code from above for matrix, set_row, set_col,
import sys
for n in range(256):
    set_row(n, 1)
    print("{}: {}".format(2*(n+1)-1, sys.getsizeof(matrix)))
    set_col(n, 1)
    print("{}: {}".format(2*(n+1), sys.getsizeof(matrix)))
which returns (condensed):
1: 12560
2: 49424
6: 196880
22: 786704
94: 3146000
... basically the allocated memory quadruples at each step. If I change the memory measure to include key-tuples,
def get_matrix_size():
    return sys.getsizeof(matrix) + sum(sys.getsizeof(key) for key in matrix)
it increases more smoothly, but still takes a big jump at the above points:
5 : 127.9k
6 : 287.7k
21 : 521.4k
22 : 1112.7k
60 : 1672.0k
61 : 1686.1k <-- approx expected size on your reported problem set
93 : 2121.1k
94 : 4438.2k
Hi, I'm very new to Python and trying to create a program that takes a random sample from a CSV file and makes a new file with some conditions. What I have done so far is probably highly over-complicated and not efficient (though it doesn't need to be).
I have 4 CSV files that contain 264 rows in total, where each full row is unique, though they all share common values in some columns.
csv1 = 72 rows, csv2 = 72 rows, csv3 = 60 rows, csv4 = 60 rows. I need to take a random sample of 160 rows which will make 4 blocks of 40, where in each block 10 must come from each CSV file. The tricky part is that no more than 2 or 3 rows from the same CSV file can appear consecutively in the final file.
So far I have managed to take a random sample of 40 from each CSV (just using random.sample) and output them to 4 new CSV files. Then I split each CSV into 4 new files, each containing 10 rows, so that I have each in a separate folder (1-4). So I now have 4 folders, each containing 4 CSV files. Now I need to combine these so that rows that came from the same original CSV file don't repeat more than 2 or 3 times, and the row order is as random as possible. This is where I'm completely lost. I presume that I should combine the 4 files in each folder (which I can do) and then re-sample or shuffle in a loop until the conditions are met, or something to that effect, but I'm not sure how to proceed, or whether I'm going about this in completely the wrong way. Any help anyone can give me would be greatly appreciated, and I can provide any further details that are necessary.
var_start = 1
total_condition_amount_start = 1

while (var_start < 5):
    with open("condition"+`var_start`+".csv", "rb") as population1:
        conditions1 = [line for line in population1]
    random_selection1 = random.sample(conditions1, 40)
    with open("./temp/40cond"+`var_start`+".csv", "wb") as temp_output:
        temp_output.write("".join(random_selection1))
    var_start = var_start + 1

while (total_condition_amount_start < total_condition_amount):
    folder_no = 1
    splitter.split(open("./temp/40cond"+`total_condition_amount_start`+".csv", 'rb'))
    shutil.move("./temp/output_1.csv", "./temp/block"+`folder_no`+"/output_"+`total_condition_amount_start`+".csv")
    folder_no = folder_no + 1
    shutil.move("./temp/output_2.csv", "./temp/block"+`folder_no`+"/output_"+`total_condition_amount_start`+".csv")
    folder_no = folder_no + 1
    shutil.move("./temp/output_3.csv", "./temp/block"+`folder_no`+"/output_"+`total_condition_amount_start`+".csv")
    folder_no = folder_no + 1
    shutil.move("./temp/output_4.csv", "./temp/block"+`folder_no`+"/output_"+`total_condition_amount_start`+".csv")
    total_condition_amount_start = total_condition_amount_start + 1
You should probably try using the built-in csv library: http://docs.python.org/3.3/library/csv.html
That way you can handle each file as a list of dictionaries, which will make your task a lot easier.
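For example, a minimal sketch of reading one file that way ('condition1.csv' is a hypothetical name; the keys come from the CSV's header row):

import csv

with open('condition1.csv') as f:
    rows = list(csv.DictReader(f))   # one dict per row, keyed by the header

# rows can now be sampled as whole-row dicts, e.g. random.sample(rows, 40)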
from random import randint, sample, choice

def create_random_list(length):
    return [randint(0, 100) for i in range(length)]

# This should be your list of four initial csv files
# with the 264 rows in total, read with the csv lib
lists = [create_random_list(264) for i in range(4)]

# Take a randomized sample from the lists
lists = map(lambda x: sample(x, 40), lists)

# Add some bookkeeping variables to each list
lists = map(lambda x: {'data': x, 'full_count': 0}, lists)

final = [[] for i in range(4)]

for l in final:
    prev = None
    count = 0
    while len(l) < 40:
        current = choice(lists)
        # Take an item from the chosen list only if that list hasn't already
        # been used 10 times or 3 times in a row
        if current['full_count'] == 10 or (current is prev and count == 3):
            continue
        total_left = 40 - len(l)
        maxx = 0
        for i in lists:
            if i is not current and 10 - i['full_count'] > maxx:
                maxx = 10 - i['full_count']
        current_left = 10 - current['full_count']
        max_left = maxx + maxx / 3.0
        if maxx > 3 and total_left <= max_left:
            # Make sure that in the future it can still be split into sets of
            # max 3
            continue
        l.append(current['data'].pop())
        count += 1
        current['full_count'] += 1
        if current is not prev:
            count = 0
        prev = current
    for li in lists:
        li['full_count'] = 0
I have been trying to smooth a plot which is noisy due to the sampling rate I'm using and what it's counting. I've been using the help on here - mainly Plot smooth line with PyPlot (although I couldn't find the "spline" function, so I am using UnivariateSpline instead).
However, whatever I do I keep getting errors: either the pyplot error that "x and y are not of the same length", or that scipy.UnivariateSpline has an incorrect value for w. I am not quite sure how to fix this (not really a Python person!). I've attached the code, although it's just the plotting bit at the end that is causing problems. Thanks
import os.path
import matplotlib.pyplot as plt
import scipy.interpolate as sci
import numpy as np
def main():
    jcc = "0050"
    dj = "005"
    l = "060"
    D = 20
    hT = 4 * D
    wT1 = 2 * D
    wT2 = 5 * D
    for jcm in ["025","030","035","040","045","050","055","060"]:
        characteristic = "LeadersOnly/Jcm" + jcm + "/Jcc" + jcc + "/dJ" + dj + "/lambda" + l + "/Seed000"
        fingertime1 = []
        fingertime2 = []
        stamp = []
        finger = []
        for x in range(0, 2500, 50):
            if x < 10000:
                z = ("00" + str(x))
            if x < 1000:
                z = ("000" + str(x))
            if x < 100:
                z = ("0000" + str(x))
            if x < 10:
                z = ("00000" + str(x))
            stamp.append(x)
            path = "LeadersOnly/Jcm" + jcm + "/Jcc" + jcc + "/dJ" + dj + "/lambda" + l + "/Seed000/profile_" + str(z) + ".txt"
            if os.path.exists(path):
                f = open(path, 'r')
                pr1, pr2 = np.genfromtxt(path, delimiter='\t', unpack=True)
                p1 = []
                p2 = []
                h1 = []
                h2 = []
                a1 = []
                a2 = []
                finger1 = 0
                finger2 = 0
                for b in range(len(pr1)):
                    p1.append(pr1[b])
                    p2.append(pr2[b])
                for elem in range(len(pr1) - 80):
                    h1.append((p1[elem + (2*D)] - 0.5*(p1[elem] + p1[elem + (4*D)])))
                    h2.append((p2[elem + (2*D)] - 0.5*(p2[elem] + p2[elem + (4*D)])))
                    if h1[elem] >= hT:
                        a1.append(1)
                    else:
                        a1.append(0)
                    if h2[elem] >= hT:
                        a2.append(1)
                    else:
                        a2.append(0)
                for elem in range(len(a1) - 1):
                    if (a1[elem] - a1[elem + 1]) != 0:
                        finger1 = finger1 + 1
                finger1 = finger1 / 2
                for elem in range(len(a2) - 1):
                    if (a2[elem] - a2[elem + 1]) != 0:
                        finger2 = finger2 + 1
                finger2 = finger2 / 2
                fingertime1.append(finger1)
                fingertime2.append(finger2)
                finger.append((finger1 + finger2) / 2)
        namegraph = jcm
        stampnew = np.linspace(stamp[0], stamp[-1], 300)
        fingernew = sci.UnivariateSpline(stamp, finger, stampnew)
        plt.plot(stampnew, fingernew, label=namegraph)
    plt.show()

main()
For information, the data input files are simply lists of integers (two columns separated by tabs, as the code suggests).
Here is one of the error messages that I get:
0-th dimension must be fixed to 50 but got 300
error                                     Traceback (most recent call last)
/group/data/Cara/JCMMOTFingers/fingercount_jcm_smooth.py in <module>()
    116
    117 if __name__ == '__main__':
--> 118     main()
    119
    120
/group/data/Cara/JCMMOTFingers/fingercount_jcm_smooth.py in main()
     93     #print(len(stamp))
     94     stampnew = np.linspace(stamp[0],stamp[-1],300)
---> 95     fingernew = sci.UnivariateSpline(stamp, finger, stampnew)
     96     #print(len(stampnew))
     97     #print(len(fingernew))
/usr/lib/python2.6/dist-packages/scipy/interpolate/fitpack2.pyc in __init__(self, x, y, w, bbox, k, s)
     86     #_data == x,y,w,xb,xe,k,s,n,t,c,fp,fpint,nrdata,ier
     87     data = dfitpack.fpcurf0(x,y,k,w=w,
---> 88         xb=bbox[0],xe=bbox[1],s=s)
     89     if data[-1]==1:
     90         # nest too small, setting to maximum bound
error: failed in converting 1st keyword `w' of dfitpack.fpcurf0 to C/Fortran array
Let's analyze your code a bit, starting from the for x in range(0, 2500, 50) loop:

You define z as a string of 6 digits padded with 0s. You should really use some string formatting like z = "{0:06d}".format(x) or z = "%06d" % x instead of these multiple tests of yours.

At the end of your loop, stamp will have (2500//50) = 50 elements.

You check for the existence of your file path, then open it and read it, but you never close it. A more Pythonic way is to do:

try:
    with open(path, "r") as f:
        do...
except IOError:
    do something else

With the with syntax, your file is automatically closed.

pr1 and pr2 are likely to be 1D arrays, right? You can really simplify the construction of your p1 and p2 lists as:

p1 = pr1.tolist()
p2 = pr2.tolist()

Your lists a1 and a2 have the same size: you could combine your for elem in range(len(a..)-1) loops into a single one. You could also use the np.diff function, as sketched below.
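For instance, a sketch of that counting step with np.diff (my illustration; a1 holds the 0/1 threshold flags as in your code):

import numpy as np

a1 = [0, 0, 1, 1, 0, 1]                        # example threshold flags
finger1 = np.count_nonzero(np.diff(a1)) / 2.0  # transitions / 2, as in your loops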
At the end of the for x in range(...) loop, finger will have 50 elements minus the number of missing files. As you don't say what to do in case of a missing file, your stamp and finger lists may not have the same number of elements, which will crash your scipy.UnivariateSpline. An easy fix is to update your stamp list only if the path file exists (that way, it always has the same number of elements as finger).

Your stampnew array has 300 elements, while your stamp and finger can have at most 50. That's a second problem: the third argument to UnivariateSpline is the weight array w, and its size must match the size of the inputs x and y.

You're eventually trying to plot fingernew. The problem is that fingernew is not an array; it's an instance of UnivariateSpline. You still need to calculate some actual points, for example with fingernew(stampnew), then use that in your plot function.
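Putting the last two points together, a minimal sketch of the corrected smoothing step (assuming stamp and finger now have the same length):

spline = sci.UnivariateSpline(stamp, finger)   # fit; no weight array needed
fingernew = spline(stampnew)                   # evaluate on the 300-point grid
plt.plot(stampnew, fingernew, label=namegraph)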
I have a bunch of files, and every file has a header of 5 lines. In the rest of the file, each pair of lines forms an entry. I need to randomly select entries from these files.
How can I select random files and a random entry (a pair of lines, excluding the header)?
If the file is small enough, read the pairs of lines into memory and select randomly from that data structure. If the file is too large, Eugene Y provides the right answer: use reservoir sampling.
Here's an intuitive explanation for the algorithm.
Process the file line by line.
pick = line, with probability 1/N, where N = line number
In other words, on line 1, we will pick line 1 with 1/1 probability. On line 2, we will change the pick to line 2, with 1/2 probability. On line 3, we will change the pick to line 3, with 1/3 probability. Etc.
For an intuitive proof, imagine a file with 3 lines:
            1                 Pick line 1.
          /   \
        .5     .5
        /       \
       2         1            Switch to line 2?
      / \       / \
   .67   .33 .33   .67
    /     \   /     \
   2       3 3       1        Switch to line 3?
The probability for each outcome:
Line 1: .5 * .67 = 1/3
Line 2: .5 * .67 = 1/3
Line 3: .5 * .33 * 2 = 1/3
From there, the rest is induction. For example, suppose the file has 4 lines. We've already convinced ourselves that as of line 3, every line so far (1, 2, 3) will have an equal chance of being our current selection. When we advance to line 4, it will have a 1/4 chance of being picked -- exactly what it should be, thus reducing the probabilities on the previous 3 lines by exactly the right amount (1/3 * 3/4 = 1/4).
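Here is a minimal Python sketch of the same idea for single lines (my addition; the Perl answer below extends it to pairs):

import random

def pick_random_line(f):
    pick = None
    for n, line in enumerate(f, start=1):
        if random.randrange(n) == 0:   # replace the pick with probability 1/n
            pick = line
    return pick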
Here's the Perl FAQ answer, adapted to your problem.
use strict;
use warnings;

# Ignore 5 lines.
<> for 1 .. 5;

# Use reservoir sampling to select pairs from remaining lines.
my (@picks, $n);
until (eof){
    my @lines;
    $lines[$_] = <> for 0 .. 1;
    $n ++;
    @picks = @lines if rand($n) < 1;
}

print @picks;
You may find perlfaq5 useful.
sed "1,5d" < FILENAME | sort -R | head -2
Python solution - reads the file only once and requires little memory.
Invoke like so: getRandomItems(file('myHuge.log'), 5, 2) - it will return a list of 2 lines.
from random import randrange
def getRandomItems(f, skipFirst=0, numItems=1):
    for _ in xrange(skipFirst):
        f.next()
    n = 0; r = []
    while True:
        try:
            nxt = [f.next() for _ in range(numItems)]
        except StopIteration:
            break
        n += 1
        if not randrange(n):
            r = nxt
    return r
Returns an empty list if it could not get a first full group of items from f. The code's only requirement is that the argument f is an iterator (supports the next() method). Hence we can pass something other than a file; say we want to see the distribution:
>>> s = {}
>>> for i in xrange(5000):
...     r = getRandomItems(iter(xrange(50)))[0]
...     s[r] = 1 + s.get(r, 0)
...
>>> for i in s:
...     print i, '*' * s[i]
...
0 ***********************************************************************************************
1 **************************************************************************************************************
2 ******************************************************************************************************
3 ***************************************************************************
4 *************************************************************************************************************************
5 ********************************************************************************
6 **********************************************************************************************
7 ***************************************************************************************
8 ********************************************************************************************
9 ********************************************************************************************
10 ***********************************************************************************************
11 ************************************************************************************************
12 *******************************************************************************************************************
13 *************************************************************************************************************
14 ***************************************************************************************************************
15 *****************************************************************************************************
16 ********************************************************************************************************
17 ****************************************************************************************************
18 ************************************************************************************************
19 **********************************************************************************
20 ******************************************************************************************
21 ********************************************************************************************************
22 ******************************************************************************************************
23 **********************************************************************************************************
24 *******************************************************************************************************
25 ******************************************************************************************
26 ***************************************************************************************************************
27 ***********************************************************************************************************
28 *****************************************************************************************************
29 ****************************************************************************************************************
30 ********************************************************************************************************
31 ********************************************************************************************
32 ****************************************************************************************************
33 **********************************************************************************************
34 ****************************************************************************************************
35 **************************************************************************************************
36 *********************************************************************************************
37 ***************************************************************************************
38 *******************************************************************************************************
39 **********************************************************************************************************
40 ******************************************************************************************************
41 ********************************************************************************************************
42 ************************************************************************************
43 ****************************************************************************************************************************
44 ****************************************************************************************************************************
45 ***********************************************************************************************
46 *****************************************************************************************************
47 ***************************************************************************************
48 ***********************************************************************************************************
49 ****************************************************************************************************************
The answer is in Python, assuming you can read the whole file into memory.
# using python 2.6
import sys
import os
import itertools
import random

def main(directory, num_files=5, num_entries=5):
    file_paths = os.listdir(directory)
    # get a random sampling of the available paths
    chosen_paths = random.sample(file_paths, num_files)
    for path in chosen_paths:
        chosen_entries = get_random_entries(path, num_entries)
        for entry in chosen_entries:
            # do something with your chosen entries
            print entry

def get_random_entries(file_path, num_entries):
    with open(file_path, 'r') as file:
        # read the lines and slice off the headers
        lines = file.readlines()[5:]
        # group the lines into pairs (i.e. entries)
        entries = list(itertools.izip_longest(*[iter(lines)]*2))
        # return a random sampling of entries
        return random.sample(entries, num_entries)

if __name__ == '__main__':
    # use optparse here to do fancy things with the command line args
    main(sys.argv[1])  # pass the directory as the first argument
Links: itertools, random, optparse
Another Python option; reading the contents of all files into memory:
import random
import fileinput
def openhook(filename, mode):
    f = open(filename, mode)
    headers = [f.readline() for _ in range(5)]
    return f

num_entries = 3
lines = list(fileinput.input(openhook=openhook))
print random.sample(lines, num_entries)
Two other ways to do it:
1. with generators (may still require a lot of memory):
http://www.usrsb.in/Picking-Random-Items--Take-Two--Hacking-Python-s-Generators-.html
2. with clever seeking (the best method, actually):
http://www.regexprn.com/2008/11/read-random-line-in-large-file-in.html
Here I copy the code of the clever Jonathan Kupferman:
#!/usr/bin/python
import os, random

filename = "averylargefile"
file = open(filename, 'r')

# Get the total file size
file_size = os.stat(filename)[6]

while 1:
    # Seek to a place in the file which is a random distance away.
    # Mod by file size so that it wraps around to the beginning.
    file.seek((file.tell() + random.randint(0, file_size - 1)) % file_size)

    # don't use the first readline since it may fall in the middle of a line
    file.readline()

    # this will return the next (complete) line from the file
    line = file.readline()

    # here is your random line in the file
    print line