Randomly selecting lines from files - python

I have bunch of files and very file has a header of 5 lines. In the rest of the file, pair of line form an entry. I need to randomly select entry from these files.
How can i select random files and random entry(pair of line, excluding header) ?

If the file is small enough, read the pairs of lines into memory and select randomly from that data structure. If the file is too large, Eugene Y provides the right answer: use reservoir sampling.
Here's an intuitive explanation for the algorithm.
Process the file line by line.
pick = line, with probability 1/N, where N = line number
In other words, on line 1, we will pick line 1 with 1/1 probability. On line 2, we will change the pick to line 2, with 1/2 probability. On line 3, we will change the pick to line 3, with 1/3 probability. Etc.
For an intuitive proof, imagine a file with 3 lines:
1 Pick line 1.
/ \
.5 .5
/ \
2 1 Switch to line 2?
/ \ / \
.67 .33 .33 .67
/ \ / \
2 3 1 Switch to line 3?
The probability for each outcome:
Line 1: .5 * .67 = 1/3
Line 2: .5 * .67 = 1/3
Line 3: .5 * .33 * 2 = 1/3
From there, the rest is induction. For example, suppose the file has 4 lines. We've already convinced ourselves that as of line 3, every line so far (1, 2, 3) will have an equal chance of being our current selection. When we advance to line 4, it will have a 1/4 chance of being picked -- exactly what it should be, thus reducing the probabilities on the previous 3 lines by exactly the right amount (1/3 * 3/4 = 1/4).
Here's the Perl FAQ answer, adapted to your problem.
use strict;
use warnings;
# Ignore 5 lines.
<> for 1 .. 5;
# Use reservoir sampling to select pairs from remaining lines.
my (#picks, $n);
until (eof){
my #lines;
$lines[$_] = <> for 0 .. 1;
$n ++;
#picks = #lines if rand($n) < 1;
}
print #picks;

You may find perlfaq5 useful.

sed "1,5d" < FILENAME | sort -R | head -2

Python solution - reads file only once and requires little memory
Invoke like so getRandomItems(file('myHuge.log'), 5, 2) - will return list of 2 lines
from random import randrange
def getRandomItems(f, skipFirst=0, numItems=1):
for _ in xrange(skipFirst):
f.next()
n = 0; r = []
while True:
try:
nxt = [f.next() for _ in range(numItems)]
except StopIteration: break
n += 1
if not randrange(n):
r = nxt
return r
Returns empty list if it could not get the first passable items from f. The code's only requirement is that argument f is an iterator (supports next() method). Hence we can pass something different than file, say we want to see the distribution:
>>> s={}
>>> for i in xrange(5000):
... r = getRandomItems(iter(xrange(50)))[0]
... s[r] = 1 + s.get(r,0)
...
>>> for i in s:
... print i, '*' * s[i]
...
0 ***********************************************************************************************
1 **************************************************************************************************************
2 ******************************************************************************************************
3 ***************************************************************************
4 *************************************************************************************************************************
5 ********************************************************************************
6 **********************************************************************************************
7 ***************************************************************************************
8 ********************************************************************************************
9 ********************************************************************************************
10 ***********************************************************************************************
11 ************************************************************************************************
12 *******************************************************************************************************************
13 *************************************************************************************************************
14 ***************************************************************************************************************
15 *****************************************************************************************************
16 ********************************************************************************************************
17 ****************************************************************************************************
18 ************************************************************************************************
19 **********************************************************************************
20 ******************************************************************************************
21 ********************************************************************************************************
22 ******************************************************************************************************
23 **********************************************************************************************************
24 *******************************************************************************************************
25 ******************************************************************************************
26 ***************************************************************************************************************
27 ***********************************************************************************************************
28 *****************************************************************************************************
29 ****************************************************************************************************************
30 ********************************************************************************************************
31 ********************************************************************************************
32 ****************************************************************************************************
33 **********************************************************************************************
34 ****************************************************************************************************
35 **************************************************************************************************
36 *********************************************************************************************
37 ***************************************************************************************
38 *******************************************************************************************************
39 **********************************************************************************************************
40 ******************************************************************************************************
41 ********************************************************************************************************
42 ************************************************************************************
43 ****************************************************************************************************************************
44 ****************************************************************************************************************************
45 ***********************************************************************************************
46 *****************************************************************************************************
47 ***************************************************************************************
48 ***********************************************************************************************************
49 ****************************************************************************************************************

Answer is in Python. Assuming you can read a whole file into memory.
#using python 2.6
import sys
import os
import itertools
import random
def main(directory, num_files=5, num_entries=5):
file_paths = os.listdir(directory)
# get a random sampling of the available paths
chosen_paths = random.sample(file_paths, num_files)
for path in chosen_paths:
chosen_entries = get_random_entries(path, num_entries)
for entry in chosen_entries:
# do something with your chosen entries
print entry
def get_random_entries(file_path, num_entries):
with open(file_path, 'r') as file:
# read the lines and slice off the headers
lines = file.readlines()[5:]
# group the lines into pairs (i.e. entries)
entries = list(itertools.izip_longest(*[iter(lines)]*2))
# return a random sampling of entries
return random.sample(entries, num_entries)
if __name__ == '__main__':
#use optparse here to do fancy things with the command line args
main(sys.argv[1:])
Links: itertools, random, optparse

Another Python option; reading the contents of all files into memory:
import random
import fileinput
def openhook(filename, mode):
f = open(filename, mode)
headers = [f.readline() for _ in range(5)]
return f
num_entries = 3
lines = list(fileinput.input(openhook=openhook))
print random.sample(lines, num_entries)

Two other means to do so:
1- by generators (may still require a lot of memory):
http://www.usrsb.in/Picking-Random-Items--Take-Two--Hacking-Python-s-Generators-.html
2- by a clever seeking (best method actually):
http://www.regexprn.com/2008/11/read-random-line-in-large-file-in.html
I here copy the code of the clever Jonathan Kupferman:
#!/usr/bin/python
import os,random
filename="averylargefile"
file = open(filename,'r')
#Get the total file size
file_size = os.stat(filename)[6]
while 1:
#Seek to a place in the file which is a random distance away
#Mod by file size so that it wraps around to the beginning
file.seek((file.tell()+random.randint(0,file_size-1))%file_size)
#dont use the first readline since it may fall in the middle of a line
file.readline()
#this will return the next (complete) line from the file
line = file.readline()
#here is your random line in the file
print line

Related

Read in Values from external file, add them, print results

Problem:
I have 50 text files, each with thousands of lines of text, each line has a value on it. I am only interesting in a small section near the middle (lines 757-827 - it is actually lines 745-805 I'm interested in, but the first 12 lines of every file is irrelevant stuff). I would like to read each file in. And then total the values between those lines. In the end I would like it to print off a pair of numbers in the format (((n+1)*18),total count), where n is the number of the file (since they are numbered starting at zero). Then repeat for all 50 files, giving 50 pairs of numbers, looking something like:
(18,77),(36,63),(54,50),(72,42),...
Code:
import numpy as np
%matplotlib inline
from numpy import loadtxt, linspace
import glob, os
fileToRun = 'Run0'
location = 'ControlRoom6'
DeadTime = 3
LiveTime = 15
folderId = '\\'
baseFolder = 'C:'+folderId+'Users'+folderId+location+folderId+'Documents'+folderId+'PhD'+folderId+'Ubuntu-Analysis-DCF'+folderId+'DCF-an-b+decay'+folderId+'dcp-ap-27Al'+folderId+''
prefix = 'DECAY_COINC'
folderToAnalyze = baseFolder + fileToRun + '\\'
MaestroT = LiveTime + DeadTime
## Gets number of files
files = []
os.chdir(folderToAnalyze)
for file in glob.glob(prefix + "*.Spe"):
files.append(file)
numfiles = len(files)
if numfiles<=1:
print('numfiles is {0}, minimum of 2 is required'.format(numfiles))
raise SystemExit(0)
xmin = 745
xmax = 815
skips = 12
n=[]
count=[]
for n in range(0, numfiles):
x = np.linspace(0, 8191, 8192)
finalprefix = str(n).zfill(3)
fullprefix = folderToAnalyze + prefix + finalprefix
y = loadtxt(fullprefix + ".Spe", skiprows = 12, max_rows = 8192)
for x in range(xmin+skips,xmax+skips):
count = count + y
time = MaestroT*(n+1)
print(time, count)
Current output is:
'ValueError Traceback (most recent call last)
in
84
85 for x in range(xmin+skips,xmax+skips):
---> 86 count = count + y
87 time = MaestroT*(n+1)
88
ValueError: operands could not be broadcast together with shapes (0,) (8192,)'
However I did previously have this running, it just printing out thousands of seemingly unconnected numbers. Does anyone know how I can alter the code to acheive the desired result?
EDIT: Data Set
In order to make the example easier to use, I've made a dropbox with some dummy data. The files are named the same as it would be reading in, and are written in the same format (the first 12 rows with unuseful information). Link is Here. I haven't written 8192 dummy numbers as I thought it would probably be easier and produce a nearer facsilime to just use the actual files with a few numbers changed.
Solution was to edit code as shown starting from 'xmin = 745':
xmin = 745
xmax = 815
skip = 12
for n in range(0, numfiles):
total = 0
x = np.linspace(0, 8191, 8192)
finalprefix = str(n).zfill(3)
fullprefix = folderToAnalyze + prefix + finalprefix
y = loadtxt(fullprefix + ".Spe", skiprows= xmin + skip, max_rows = xmax - xmin)
for x in y:
val = int(x)
total = total + val
print(((n+1)*MaestroT), total)
Prints out as
18 74
36 64
54 62
72 54
90 47
108 39
126 40
144 35
etc.
Which fit my needs.

Solving knapsack problem using a greedy python algorithm

I'm trying to solve the knapsack problem using Python, implementing a greedy algorithm. The result I'm getting back makes no sense to me.
Knapsack:
The first line gives the number of items, in this case 20. The last line gives the capacity of the knapsack, in this case 524. The remaining lines give the index, value and weight of each item.
20
1 91 29
2 60 65
3 61 71
4 9 60
5 79 45
6 46 71
7 19 22
8 57 97
9 8 6
10 84 91
11 20 57
12 72 60
13 32 49
14 31 89
15 28 2
16 81 30
17 55 90
18 43 25
19 100 82
20 27 19
524
Python code:
import os
def constructive():
knapsack = []
Weight = 0
while(Weight <= cap):
best = max(values)
i = values.index(best)
knapsack.append(i)
Weight = Weight + weights[i]
del values[i]
del weights[i]
return knapsack, Weight
def read_kfile(fname):
with open(fname, 'rU') as kfile:
lines = kfile.readlines() # reads the whole file
n = int(lines[0])
c = int(lines[n+1])
vs = []
ws = []
lines = lines[1:n+1] # Removes the first and last line
for l in lines:
numbers = l.split() # Converts the string into a list
vs.append(int(numbers[1])) # Appends value, need to convert to int
ws.append(int(numbers[2])) # Appends weigth, need to convert to int
return n, c, vs, ws
dir_path = os.path.dirname(os.path.realpath(__file__)) # Get the directory where the file is located
os.chdir(dir_path) # Change the working directory so we can read the file
knapfile = 'knap20.txt'
nitems, cap, values, weights = read_kfile(knapfile)
val1,val2 =constructive()
print ('knapsack',val1)
print('weight', val2)
print('cap', cap)
Result:
knapsack [18, 0, 8, 13, 3, 8, 1, 0, 3]
weight 570
cap 524
Welcome. the reason why your program is giving a weights over the cap limit is because on the final item you are putting in the knapsack, you aren't checking if it can fit in it. To do this just add an if statement, Also you should check if the list of values is empty. Do note that I have append (i+1) since your text file's index is starting at 1 but Python starts it's list index at 0:
def constructive():
knapsack = []
Weight = 0
while(Weight <= cap and values):
best = max(values)
i = values.index(best)
if weights[i] <= cap-Weight:
knapsack.append(i+1)
Weight = Weight + weights[i]
del values[i]
del weights[i]
return knapsack, Weight
The problem is -- in the last step -- the best item you find will exceed the maximum weight. But since you already entered the loop you add it anyway.
In the next iteration you recognize that you are over the cap and stop.
I am not sure how you want to proceed once the next best is too heavy. In case you simple want to stop and not add anything more you can simply modify your constructive to look as follows:
def constructive():
knapsack = []
Weight = 0
while(True):
best = max(values)
i = values.index(best)
if Weight + weights[i] > cap:
break
knapsack.append(i)
Weight = Weight + weights[i]
del values[i]
del weights[i]
return knapsack, Weight

Query Board challenge on Python, need some pointers

So, I have this challenge on CodeEval, but I seem don't know where to start, so I need some pointers (and answers if you can) to help me figure out this challenge.
DESCRIPTION:
There is a board (matrix). Every cell of the board contains one integer, which is 0 initially.
The next operations can be applied to the Query Board:
SetRow i x: it means that all values in the cells on row "i" have been change value to "x" after this operation.
SetCol j x: it means that all values in the cells on column "j" have been changed to value "x" after this operation.
QueryRow i: it means that you should output the sum of values on row "i".
QueryCol j: it means that you should output the sum of values on column "j".
The board's dimensions are 256x256
i and j are integers from 0 to 255
x is an integer from 0 to 31
INPUT SAMPLE:
Your program should accept as its first argument a path to a filename. Each line in this file contains an operation of a query. E.g.
SetCol 32 20
SetRow 15 7
SetRow 16 31
QueryCol 32
SetCol 2 14
QueryRow 10
SetCol 14 0
QueryRow 15
SetRow 10 1
QueryCol 2
OUTPUT SAMPLE:
For each query, output the answer of the query. E.g.
5118
34
1792
3571
I'm not that great on Python, but this challenge is pretty interesting, although I didn't have any clues on how to solve it. So, I need some help from you guys.
Thanks!
You could use a sparse matrix for this; addressed by (col, row) tuples as keys in a dictionary, to save memory. 64k cells is a big list otherwise (2MB+ on a 64-bit system):
matrix = {}
This is way more efficient, as the challenge is unlikely to set values for all rows and columns on the board.
Setting a column or row is then:
def set_col(col, x):
for i in range(256):
matrix[i, col] = x
def set_row(row, x):
for i in range(256):
matrix[row, i] = x
and summing a row or column is then:
def get_col(col):
return sum(matrix.get((i, col), 0) for i in range(256))
def get_row(row):
return sum(matrix.get((row, i), 0) for i in range(256))
WIDTH, HEIGHT = 256, 256
board = [[0] * WIDTH for i in range(HEIGHT)]
def set_row(i, x):
global board
board[i] = [x]*WIDTH
... implement each function, then parse each line of input to decide which function to call,
for line in inf:
dat = line.split()
if dat[0] == "SetRow":
set_row(int(dat[1]), int(dat[2]))
elif ...
Edit: Per Martijn's comments:
total memory usage for board is about 2.1MB. By comparison, after 100 random row/column writes, matrix is 3.1MB (although it tops out there and doesn't get any bigger).
yes, global is unnecessary when modifying a global object (just don't try to assign to it).
while dispatching from a dict is good and efficient, I did not want to inflict it on someone who is "not that great on Python", especially for just four entries.
For sake of comparison, how about
time = 0
WIDTH, HEIGHT = 256, 256
INIT = 0
rows = [(time, INIT) for _ in range(WIDTH)]
cols = [(time, INIT) for _ in range(HEIGHT)]
def set_row(i, x):
global time
time += 1
rows[int(i)] = (time, int(x))
def set_col(i, x):
global time
time += 1
cols[int(i)] = (time, int(x))
def query_row(i):
rt, rv = rows[int(i)]
total = rv * WIDTH + sum(cv - rv for ct, cv in cols if ct > rt)
print(total)
def query_col(j):
ct, cv = cols[int(j)]
total = cv * HEIGHT + sum(rv - cv for rt, rv in rows if rt > ct)
print(total)
ops = {
"SetRow": set_row,
"SetCol": set_col,
"QueryRow": query_row,
"QueryCol": query_col
}
inf = """SetCol 32 20
SetRow 15 7
SetRow 16 31
QueryCol 32
SetCol 2 14
QueryRow 10
SetCol 14 0
QueryRow 15
SetRow 10 1
QueryCol 2""".splitlines()
for line in inf:
line = line.split()
op = line.pop(0)
ops[op](*line)
which only uses 4.3k of memory for rows[] and cols[].
Edit2:
using your code from above for matrix, set_row, set_col,
import sys
for n in range(256):
set_row(n, 1)
print("{}: {}".format(2*(n+1)-1, sys.getsizeof(matrix)))
set_col(n, 1)
print("{}: {}".format(2*(n+1), sys.getsizeof(matrix)))
which returns (condensed:)
1: 12560
2: 49424
6: 196880
22: 786704
94: 3146000
... basically the allocated memory quadruples at each step. If I change the memory measure to include key-tuples,
def get_matrix_size():
return sys.getsizeof(matrix) + sum(sys.getsizeof(key) for key in matrix)
it increases more smoothly, but still takes a bit jump at the above points:
5 : 127.9k
6 : 287.7k
21 : 521.4k
22 : 1112.7k
60 : 1672.0k
61 : 1686.1k <-- approx expected size on your reported problem set
93 : 2121.1k
94 : 4438.2k

Read n test cases from file in Python

I've been trying to use Python for a few sample programming competition questions, but I've been stumped on file reading.
I'm reading from stdin, the first line is the number of test cases that follow, each subsequent line contains two integers that I need to process. E.g.
3
4 -10
0 5
6 20
2
0 -1
20 10
etc...
I've found a C++ solution that looks like this:
int main()
{
int runs,a,b ;
cin >> runs ;
while(runs--)
{
cin >> a >> b ;
long long ret = solve(a,b) ;
cout << ret << endl ;
}
return 0 ;
}
The closest I've come up with in Python is:
t = int(raw_input())
answer = 0
while t :
n, m = map(int, raw_input().split())
answer = solve(n,m)
print answer
I've seen similar questions on Stack Overflow but I'm still having a tricky time wrapping my head around the Python way to do this.
3
4 -10
0 5
6 20
2
0 -1
20 10
You would do it like this.
num_of_testcases = int(raw_input()) # this corresponds to 3 and 2
for each in range(number_of_testcases):
x, y = map(int, raw_input().split()) # this would give the pair of numbers
In the contests, usually, you will have the total number of test cases. You have not mentioned it here. It is taken upfront
total_test_cases = int(raw_input())
and then you iterate the above input gathering routine over the total_test_cases If the total test cases is not present, then you can iterate over while True and then cancel at EOF.
for tc in range(total_test_cases):
num_of_testcases = int(raw_input()) # this corresponds to 3 and 2
for each in range(number_of_testcases):
x, y = map(int, raw_input().split()) # this would give the pair of numbers
Try this:
import sys
for l in sys.stdin.readlines()[1:]:
a,b = map(int,l.split())
#now process your test cases
Also according to your input file description there should only be one set of test cases.Like so:
3
4 -10
0 5
4 20
If you don't want to use raw_input you can use fileinput instead:
import fileinput
input = fileinput.input()
for line in input:
for j in range(int(line)):
solve(*[int(i) for i in input.next().split()])
or with sys.stdin
import sys
for line in sys.stdin:
for j in range(int(line)):
solve(*[int(i) for i in sys.stdin.next().split()])

How to randomly delete a number of lines from a big file?

I have a big text file of 13 GB with 158,609,739 lines and I want to randomly select 155,000,000 lines.
I have tried to scramble the file and then cut the 155000000 first lines, but it's seem that my ram memory (16GB) isn't enough big to do this. The pipelines i have tried are:
shuf file | head -n 155000000
sort -R file | head -n 155000000
Now instead of selecting lines, I think is more memory efficient delete 3,609,739 random lines from the file to get a final file of 155000000 lines.
As you copy each line of the file to the output, assess its probability that it should be deleted. The first line should have a 3,609,739/158,609,739 chance of being deleted. If you generate a random number between 0 and 1 and that number is less than that ratio, don't copy it to the output. Now the odds for the second line are 3,609,738/158,609,738; if that line is not deleted, the odds for the third line are 3,609,738/158,609,737. Repeat until done.
Because the odds change with each line processed, this algorithm guarantees the exact line count. Once you've deleted 3,609,739 the odds go to zero; if at any time you would need to delete every remaining line in the file, the odds go to one.
You could always pre-generate which line numbers (a list of 3,609,739 random numbers selected without replacement) you plan on deleting, then just iterate through the file and copy to another, skipping lines as necessary. As long as you have space for a new file this would work.
You could select the random numbers with random.sample
E.g.,
random.sample(xrange(158609739), 3609739)
Proof of Mark Ransom's Answer
Let's use numbers easier to think about (at least for me!):
10 items
delete 3 of them
First time through the loop we will assume that the first three items get deleted -- here's what the probabilities look like:
first item: 3 / 10 = 30%
second item: 2 / 9 = 22%
third item: 1 / 8 = 12%
fourth item: 0 / 7 = 0 %
fifth item: 0 / 6 = 0 %
sixth item: 0 / 5 = 0 %
seventh item: 0 / 4 = 0 %
eighth item: 0 / 3 = 0 %
ninth item: 0 / 2 = 0 %
tenth item: 0 / 1 = 0 %
As you can see, once it hits zero, it stays at zero. But what if nothing is getting deleted?
first item: 3 / 10 = 30%
second item: 3 / 9 = 33%
third item: 3 / 8 = 38%
fourth item: 3 / 7 = 43%
fifth item: 3 / 6 = 50%
sixth item: 3 / 5 = 60%
seventh item: 3 / 4 = 75%
eighth item: 3 / 3 = 100%
ninth item: 2 / 2 = 100%
tenth item: 1 / 1 = 100%
So even though the probability varies per line, overall you get the results you are looking for. I went a step further and coded a test in Python for one million iterations as a final proof to myself -- remove seven items from a list of 100:
# python 3.2
from __future__ import division
from stats import mean # http://pypi.python.org/pypi/stats
import random
counts = dict()
for i in range(100):
counts[i] = 0
removed_failed = 0
for _ in range(1000000):
to_remove = 7
from_list = list(range(100))
removed = 0
while from_list:
current = from_list.pop()
probability = to_remove / (len(from_list) + 1)
if random.random() < probability:
removed += 1
to_remove -= 1
counts[current] += 1
if removed != 7:
removed_failed += 1
print(counts[0], counts[1], counts[2], '...',
counts[49], counts[50], counts[51], '...',
counts[97], counts[98], counts[99])
print("remove failed: ", removed_failed)
print("min: ", min(counts.values()))
print("max: ", max(counts.values()))
print("mean: ", mean(counts.values()))
and here's the results from one of the several times I ran it (they were all similar):
70125 69667 70081 ... 70038 70085 70121 ... 70047 70040 70170
remove failed: 0
min: 69332
max: 70599
mean: 70000.0
A final note: Python's random.random() is [0.0, 1.0) (doesn't include 1.0 as a possibility).
I believe you're looking for "Algorithm S" from section 3.4.2 of Knuth (D. E. Knuth, The Art of Computer Programming. Volume 2: Seminumerical Algorithms, second edition. Addison-Wesley, 1981).
You can see several implementations at http://rosettacode.org/wiki/Knuth%27s_algorithm_S
The Perlmonks list has some Perl implementations of Algorithm S and Algorithm R that might also prove useful.
These algorithms rely on there being a meaningful interpretation of floating point numbers like 3609739/158609739, 3609738/158609738, etc. which might not have sufficient resolution with a standard Float datatype, unless the Float datatype is implemented using numbers of double precision or larger.
Here's a possible solution using Python:
import random
skipping = random.sample(range(158609739), 3609739)
input = open(input)
output = open(output, 'w')
for i, line in enumerate(input):
if i in skipping:
continue
output.write(line)
input.close()
output.close()
Here's another using Mark's method:
import random
lines_in_file = 158609739
lines_left_in_file = lines_in_file
lines_to_delete = lines_in_file - 155000000
input = open(input)
output = open(output, 'w')
try:
for line in input:
current_probability = lines_to_delete / lines_left_in_file
lines_left_in_file -= 1
if random.random < current_probability:
lines_to_delete -= 1
continue
output.write(line)
except ZeroDivisionError:
print("More than %d lines in the file" % lines_in_file)
finally:
input.close()
output.close()
I wrote this code before seeing that Darren Yin has expressed its principle.
I've modified my code to take the use of name skipping (I didn't dare to choose kangaroo ...) and of keyword continue from Ethan Furman whose code's principle is the same too.
I defined default arguments for the parameters of the function in order that the function can be used several times without having to make re-assignement at each call.
import random
import os.path
def spurt(ff,skipping):
for i,line in enumerate(ff):
if i in skipping:
print 'line %d excluded : %r' % (i,line)
continue
yield line
def randomly_reduce_file(filepath,nk = None,
d = {0:'st',1:'nd',2:'rd',3:'th'},spurt = spurt,
sample = random.sample,splitext = os.path.splitext):
# count of the lines of the original file
with open(filepath) as f: nl = sum(1 for _ in f)
# asking for the number of lines to keep, if not given as argument
if nk is None:
nk = int(raw_input(' The file has %d lines.'
' How many of them do you '
'want to randomly keep ? : ' % nl))
# transfer of the lines to keep,
# from one file to another file with different name
if nk<nl:
with open(filepath,'rb') as f,\
open('COPY'.join(splitext(filepath)),'wb') as g:
g.writelines( spurt(f,sample(xrange(0,nl),nl-nk) ) )
# sample(xrange(0,nl),nl-nk) is the list
# of the counting numbers of the lines to be excluded
else:
print ' %d is %s than the number of lines (%d) in the file\n'\
' no operation has been performed'\
% (nk,'the same' if nk==nl else 'greater',nl)
With the $RANDOM variable you can get a random number between 0 and 32,767.
With this, you could read in each line, and see if $RANDOM is less than 155,000,000 / 158,609,739 * 32,767 (which is 32,021), and if so, let the line through.
Of course, this wouldn't give you exactly 150,000,000 lines, but pretty close to it depending on the normality of the random number generator.
EDIT: Here is some code to get you started:
#!/bin/bash
while read line; do
if (( $RANDOM < 32021 ))
then
echo $line
fi
done
Call it like so:
thatScript.sh <inFile.txt >outFile.txt

Categories