pandas read_csv specific row [duplicate] - python

The CSV file that I want to read does not fit into main memory. How can I read a few (~10K) random lines of it and do some simple statistics on the selected data frame?

Assuming no header in the CSV file:
import pandas
import random
n = 1000000 #number of records in file
s = 10000 #desired sample size
filename = "data.txt"
skip = sorted(random.sample(range(n),n-s))
df = pandas.read_csv(filename, skiprows=skip)
It would be better if read_csv had a keeprows option, or if skiprows took a callback function instead of a list.
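Since newer pandas versions do accept a callable for skiprows (as a later answer notes), a "keeprows" effect can be approximated by precomputing the set of rows to keep; a minimal sketch (my addition), again assuming no header:
import pandas
import random
n = 1000000  # number of records in file
s = 10000    # desired sample size
keep = set(random.sample(range(n), s))  # rows to keep
# skip every row whose index is not in the precomputed keep set
df = pandas.read_csv("data.txt", header=None, skiprows=lambda i: i not in keep)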
With header and unknown file length:
import pandas
import random
filename = "data.txt"
n = sum(1 for line in open(filename)) - 1 #number of records in file (excludes header)
s = 10000 #desired sample size
skip = sorted(random.sample(range(1,n+1),n-s)) #the 0-indexed header will not be included in the skip list
df = pandas.read_csv(filename, skiprows=skip)

@dlm's answer is great, but since v0.20.0, skiprows does accept a callable. The callable receives the row number as its argument.
Note also that their answer for unknown file length relies on iterating through the file twice -- once to get the length, and then another time to read the csv. I have three solutions here which only rely on iterating through the file once, though they all have tradeoffs.
Solution 1: Approximate Percentage
If you can specify what percent of lines you want, rather than how many lines, you don't even need to count the lines in the file, and you only need to read through it once. Assuming a header on the first row:
import pandas as pd
import random

filename = "data.txt"
p = 0.01  # 1% of the lines
# keep the header, then take only 1% of lines
# if random from [0,1] interval is greater than 0.01 the row will be skipped
df = pd.read_csv(
    filename,
    header=0,
    skiprows=lambda i: i > 0 and random.random() > p
)
As pointed out in the comments, this only gives approximately the right number of lines, but I think it satisfies the desired use case.
Solution 2: Every Nth line
This isn't actually a random sample, but depending on how your input is sorted and what you're trying to achieve, this may meet your needs.
n = 100 # every 100th line = 1% of the lines
df = pd.read_csv(filename, header=0, skiprows=lambda i: i % n != 0)
Solution 3: Reservoir Sampling
(Added July 2021)
Reservoir sampling is an elegant algorithm for selecting k items randomly from a stream whose length is unknown, but that you only see once.
The big advantage is that you can use this without having the full dataset on disk, and that it gives you an exactly-sized sample without knowing the full dataset size. The disadvantage is that I don't see a way to implement it in pure pandas, I think you need to drop into python to read the file and then construct the dataframe afterwards. So you may lose some functionality from read_csv or need to reimplement it, since we're not using pandas to actually read the file.
Taking an implementation of the algorithm from Oscar Benjamin here:
from math import exp, log, floor
from random import random, randrange
from itertools import islice
from io import StringIO

import pandas as pd

def reservoir_sample(iterable, k=1):
    """Select k items uniformly from iterable.

    Returns the whole population if there are k or fewer items.

    from https://bugs.python.org/issue41311#msg373733
    """
    iterator = iter(iterable)
    values = list(islice(iterator, k))

    W = exp(log(random())/k)
    while True:
        # skip is geometrically distributed
        skip = floor(log(random())/log(1-W))
        selection = list(islice(iterator, skip, skip+1))
        if selection:
            values[randrange(k)] = selection[0]
            W *= exp(log(random())/k)
        else:
            return values

def sample_file(filepath, k):
    with open(filepath, 'r') as f:
        header = next(f)
        result = [header] + reservoir_sample(f, k)
        df = pd.read_csv(StringIO(''.join(result)))
    return df
The reservoir_sample function returns a list of strings, each of which is a single row, so we just need to turn it into a dataframe at the end. This assumes there is exactly one header row, I haven't thought about how to extend it to other situations.
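For example (my usage sketch; the file name and sample size are assumptions):
# hypothetical usage: draw 10,000 random rows from data.csv (header preserved)
df = sample_file("data.csv", 10000)
print(df.describe())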
I tested this locally and it is much faster than the other two solutions. Using a 550 MB csv (January 2020 "Yellow Taxi Trip Records" from the NYC TLC), solution 3 runs in about 1 second, while the other two take ~3-4 seconds.
In my test this is even slightly (~10-20%) faster than @Bar's answer using shuf, which surprises me.

This is not in Pandas, but it achieves the same result much faster through bash, while not reading the entire file into memory:
shuf -n 100000 data/original.tsv > data/sample.tsv
The shuf command shuffles the input, and the -n argument indicates how many lines we want in the output.
Relevant question: https://unix.stackexchange.com/q/108581
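The sampled file can then be loaded into pandas as usual; a minimal sketch (my addition), assuming a tab-separated sample with no header row:
import pandas as pd
# the sample has no header row, so let pandas assign default column numbers
sample = pd.read_csv("data/sample.tsv", sep="\t", header=None)
print(sample.describe())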
Benchmark on a 7M-line csv available here (2008):
Top answer:
import pandas
import random

def pd_read():
    filename = "2008.csv"
    n = sum(1 for line in open(filename)) - 1 #number of records in file (excludes header)
    s = 100000 #desired sample size
    skip = sorted(random.sample(range(1,n+1),n-s)) #the 0-indexed header will not be included in the skip list
    df = pandas.read_csv(filename, skiprows=skip)
    df.to_csv("temp.csv")
Timing for pandas:
%time pd_read()
CPU times: user 18.4 s, sys: 448 ms, total: 18.9 s
Wall time: 18.9 s
While using shuf:
time shuf -n 100000 2008.csv > temp.csv
real 0m1.583s
user 0m1.445s
sys 0m0.136s
So shuf is about 12x faster and importantly does not read the whole file into memory.

Here is an algorithm that doesn't require counting the number of lines in the file beforehand, so you only need to read the file once.
Say you want m samples. First, the algorithm keeps the first m samples. When it sees the i-th sample (i > m), with probability m/i, the algorithm uses the sample to randomly replace an already selected sample.
By doing so, for any i > m, we always have a subset of m samples randomly selected from the first i samples.
See code below:
import random

n_samples = 10
samples = []

with open("data.csv") as f:  # the large csv file
    for i, line in enumerate(f):
        if i < n_samples:
            samples.append(line)
        elif random.random() < n_samples * 1. / (i+1):
            samples[random.randint(0, n_samples-1)] = line
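The sampled lines can then be handed to pandas (my addition, a minimal sketch assuming the file has no header row):
from io import StringIO
import pandas as pd
# join the sampled raw lines back into a csv-shaped buffer and parse it
df = pd.read_csv(StringIO(''.join(samples)), header=None)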

The following code reads first the header, and then a random sample on the other lines:
import pandas as pd
import numpy as np
filename = 'hugedatafile.csv'
nlinesfile = 10000000
nlinesrandomsample = 10000
lines2skip = np.random.choice(np.arange(1,nlinesfile+1), (nlinesfile-nlinesrandomsample), replace=False)
df = pd.read_csv(filename, skiprows=lines2skip)

from random import randint

class magic_checker:
    def __init__(self, target_count):
        self.target = target_count
        self.count = 0
    def __eq__(self, x):
        self.count += 1
        return self.count >= self.target

min_target = 100000
max_target = min_target*2
nlines = randint(100, 1000)
seek_target = randint(min_target, max_target)
with open("big.csv") as f:
    f.seek(seek_target)
    f.readline()  #discard this line
    rand_lines = list(iter(lambda: f.readline(), magic_checker(nlines)))

#do something to process the lines you got returned .. perhaps just a split
print(rand_lines)
print(rand_lines[0].split(","))
Something like that should work, I think.

No pandas!
import random
from os import fstat
from sys import exit

f = open('/usr/share/dict/words')

# Number of lines to be read
lines_to_read = 100

# Minimum and maximum bytes that will be randomly skipped
min_bytes_to_skip = 10000
max_bytes_to_skip = 1000000

def is_EOF():
    return f.tell() >= fstat(f.fileno()).st_size

# To accumulate the read lines
sampled_lines = []

for n in range(lines_to_read):
    bytes_to_skip = random.randint(min_bytes_to_skip, max_bytes_to_skip)
    f.seek(bytes_to_skip, 1)
    # After skipping "bytes_to_skip" bytes, we can stop in the middle of a line
    # Skip current entire line
    f.readline()
    if not is_EOF():
        sampled_lines.append(f.readline())
    else:
        # Go to the beginning of the file ...
        f.seek(0, 0)
        # ... and skip lines again
        f.seek(bytes_to_skip, 1)
        # If it has reached the EOF again
        if is_EOF():
            print("You have skipped more lines than your file has")
            print("Reduce the values of:")
            print("    min_bytes_to_skip")
            print("    max_bytes_to_skip")
            exit(1)
        else:
            f.readline()
            sampled_lines.append(f.readline())

print(sampled_lines)
You'll end up with a sampled_lines list. What kind of statistics do you mean?

use subsample
pip install subsample
subsample -n 1000 file.csv > file_1000_sample.csv

You can also create a sample of the 10,000 records before bringing it into the Python environment.
Using Git Bash (Windows 10) I just ran the following command to produce the sample
shuf -n 10000 BIGFILE.csv > SAMPLEFILE.csv
Note: if your CSV has a header, this is not the best solution as-is (see the sketch below).
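One way around that (my own sketch, not part of the original answer; it assumes GNU tail and shuf are available on the PATH, as they are in Git Bash) is to copy the header separately and sample only the remaining lines:
import subprocess

with open("SAMPLEFILE.csv", "w") as out:
    with open("BIGFILE.csv") as f:
        out.write(f.readline())  # copy the header line unchanged
    # tail skips the header, shuf samples 10,000 of the remaining lines
    tail = subprocess.Popen(["tail", "-n", "+2", "BIGFILE.csv"], stdout=subprocess.PIPE)
    subprocess.run(["shuf", "-n", "10000"], stdin=tail.stdout, stdout=out, check=True)
    tail.stdout.close()
    tail.wait()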

TL;DR
If you know the size of the sample you want, but not the size of the input file, you can efficiently load a random sample out of it with the following pandas code:
import pandas as pd
import numpy as np

filename = "data.csv"
sample_size = 10000
batch_size = 200

rng = np.random.default_rng()

sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)
sample = sample_reader.get_chunk(sample_size)

for chunk in sample_reader:
    chunk.index = rng.integers(sample_size, size=len(chunk))
    sample.loc[chunk.index] = chunk
Explanation
It's not always trivial to know the size of the input CSV file.
If there are embedded line breaks, tools like wc or shuf will give you the wrong answer or just make a mess out of your data.
So, based on desktable's answer, we can treat the first sample_size lines of the file as the initial sample and then, for each subsequent line in the file, randomly replace a line in the initial sample.
To do that efficiently, we load the CSV file using a TextFileReader by passing the chunksize= parameter:
sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)
First, we get the initial sample:
sample = sample_reader.get_chunk(sample_size)
Then, we iterate over the remaining chunks of the file, replacing the index of each chunk with a sequence of random integers as long as the size of the chunk, where each integer is in the range of the index of the initial sample (which happens to be the same as range(sample_size)):
for chunk in sample_reader:
    chunk.index = rng.integers(sample_size, size=len(chunk))
And use this reindexed chunk to replace (some of the) lines in the sample:
    sample.loc[chunk.index] = chunk
After the for loop, you'll have a dataframe at most sample_size rows long, but with random lines selected from the big CSV file.
To make the loop more efficient, you can make batch_size as large as your memory allows (and yes, even larger than sample_size if you can).
Notice that, while creating the new chunk index with np.random.default_rng().integers(), we use len(chunk) as the new chunk index size instead of simply batch_size because the last chunk in the loop could be smaller.
On the other hand, we use sample_size instead of len(sample) as the "range" of the random integers, even though there could be fewer lines in the file than sample_size. This is not a problem, because in that case there won't be any chunks left to loop over.

Read the data file:
import pandas as pd
df = pd.read_csv('data.csv')
First check the shape of df:
df.shape
Create a small sample of 1000 rows from df:
sample_data = df.sample(n=1000, replace=False)
# check the shape of sample_data
sample_data.shape

For example, if you have loan.csv, you can use this script to easily load the specified number of random items.
data = pd.read_csv('loan.csv').sample(10000, random_state=44)

Let's say that you want to load a 20% sample of the dataset:
import pandas as pd
df = pd.read_csv(filepath).sample(frac = 0.20)

Related

Calculate averages over subgroups of data in extremely large (100GB+) CSV file

I have a large semicolon-delimited text file that weighs in at a little over 100GB. It comprises ~18,000,000 rows of data and 772 columns.
The columns are: 'sc16' (int), 'cpid' (int), 'type' (str), 'pubyr' (int), and then 768 columns labeled 'dim_0', 'dim_1', 'dim_2' ... 'dim_767', which are all ints.
The file is already arranged/sorted by sc16 and pubyr so that each combination of sc16+pubyr are grouped together in ascending order.
What I'm trying to do is get the average of each 'dim_' column for each unique combination of sc16 & pubyr, then output the row to a new dataframe and save the final result to a new text file.
The problem is that in my script below, the processing gradually gets slower and slower until it's just creeping along by row 5,000,000. I'm working on a machine with 96GB of RAM, and I'm not used to working with a file so large I can't simply load it into memory. This is my first attempt trying to work with something like itertools, so no doubt I'm being really inefficient. Any help you can provide would be much appreciated!
import itertools
import pandas as pd

# Step 1: create an empty dataframe to store the mean values
mean_df = pd.DataFrame(columns=['sc16', 'pubyr'] + [f"dim_{i}" for i in range(768)])

# Step 2: open the file and iterate through the rows
with open('C:\Python_scratch\scibert_embeddings_sorted.txt') as f:
    counter = 0
    total_lines = sum(1 for line in f)
    f.seek(0)
    # group by the first (sc16) and fourth (pubyr) column
    for key, group in itertools.groupby(f, key=lambda x: (x.split(';')[0], x.split(';')[3])):
        sc16, pubyr = key
        rows = [row.strip().split(';') for row in group]
        columns = rows[0]
        rows = rows[1:]

        # Step 3: convert the group of rows to a dataframe
        group_df = pd.DataFrame(rows, columns=columns)

        # Step 4: calculate the mean for the group
        mean_row = {'sc16': sc16, 'pubyr': pubyr}
        for col in group_df.columns:
            if col.startswith('dim_'):
                mean_row[col] = group_df[col].astype(float).mean()

        # Step 5: append the mean row to the mean dataframe
        mean_df = pd.concat([mean_df, pd.DataFrame([mean_row])], ignore_index=True)
        counter += len(rows)
        print(f"{counter} of {total_lines}")

# Step 6: save the mean dataframe to a new file
mean_df.to_csv('C:\Python_scratch\scibert_embeddings_mean.txt', sep=';', index=False)
You might not want to use Pandas at all, since your data is already neatly pre-sorted and all.
Try something like this; it uses numpy to make dim-wise averaging fast, but is plain Python otherwise. It processes a 43,000-line example file I generated in about 7.6 seconds on my machine, and I don't see a reason why this should slow down over time. (If you know your file won't have a header line or empty lines, you could get rid of those checks.)
Your original code also spent extra time parsing the read lines over and over again; this uses a generator that does that only once.
import itertools
import operator
import numpy as np

def read_embeddings_file(filename):
    # Read the (pre-sorted) embeddings file,
    # yielding tuples of ((sc16, pubyr) and a list of dimensions).
    with open(filename) as in_file:
        for line in in_file:
            if not line or line.startswith("sc16"):  # Header or empty line
                continue
            line = line.split(";")
            sc16, cpid, type, pubyr, *dims = line
            # list(map(... is faster than the equivalent listcomp
            yield (sc16, pubyr), list(map(int, dims))

def main():
    output_name = "scibert_embeddings_mean.txt"
    input_name = "scibert_embeddings_sorted.txt"
    with open(output_name, "w") as out_f:
        print("sc16", "pubyr", *[f"dim_{i}" for i in range(768)], sep=";", file=out_f)
        counter = 0
        for group, group_contents in itertools.groupby(
            read_embeddings_file(input_name),
            key=operator.itemgetter(0),  # Group by (sc16, pubyr)
        ):
            dims = [d[1] for d in group_contents]
            # Calculate the mean of each dimension
            mean_dims = np.mean(np.array(dims).astype(float), axis=0)
            # Write group to output
            print(*group, *mean_dims, sep=";", file=out_f)
            # Print progress
            counter += len(dims)
            print(f"Processed: {counter}; group: {group}, entries in group: {len(dims)}")

if __name__ == "__main__":
    main()

reading csv file line by line and save lines which are satisfying certain conditions

I have an issue which was already discussed in several topics; nevertheless, I would like to go a bit deeper and maybe find a better solution.
So the idea is to go through "huge" (50 to 60GB) .csv files with Python, find the lines which satisfy some conditions, extract them, and finally store them in a second variable for further analysis.
Initially the problem was handled with R scripts, which I manage with a sparklyr connection, or eventually some gawk code in bash (see awk, or gawk), to extract the data I need, then analyse it with R/Python.
I would like to resolve this issue exclusively with Python; the idea is to avoid mixing languages like bash/Python, or bash/R (unix). So far I use open(...) as f and go through the file line by line, and it kind of works, but it's awfully slow. For example, going through the file is pretty fast (~500,000 lines per second, which is fine even for 58M lines), but when I try to store the data, the speed drops to ~10 lines per second. For an extraction with ~300,000 lines, that's unacceptable.
I tried several solutions and I'm guessing that it's not optimal (poor Python code? :( ) and that better solutions exist.
Solution 1: go through the file, split the line into a list, check the conditions, and if they are met, put the line in a numpy matrix and vstack for each iteration which satisfies the condition (very slow)
import csv
import numpy
import pandas
from tqdm import tqdm

date_first = '2008-11-01'
date_last = '2008-11-10'
a = numpy.array(['colnames']*35) #data is 35 columns
index = list()

with open("data.csv", "r") as f:
    for line in tqdm(f, unit = " lines per"):
        line = line.split(sep = ";") # csv with ";" ...
        date_file = line[1][0:10] # date stored in the 2nd column
        if date_file >= date_first and date_file <= date_last: #data extraction concerns a time period (one month for example)
            line = numpy.array(line) #go to numpy
            a = numpy.vstack((a, line)) #stack it
Solution 2 : the same but store the line in a pandas data.frame with a row index if conditions ok (very slow)
import csv
import numpy
import pandas
from tqdm import tqdm

date_first = '2008-11-01'
date_last = '2008-11-10'
row = 0 #row index
a = pandas.DataFrame(numpy.zeros((0,35))) #data is 35 columns

with open("data.csv", "r") as f:
    for line in tqdm(f, unit = " lines per"):
        line = line.split(sep = ";")
        date_file = line[1][0:10]
        if date_file >= date_first and date_file <= date_last:
            a.loc[row] = line #store the line in the pd.DataFrame at the position row
            row = row + 1 #go to next row
Solution 3: the same, but instead of storing the line somewhere, which is the main issue for me, keep an index of the satisfying rows and then read only those rows from the csv (even slower; going through the file to find the indexes is fast enough, but reading the indexed rows afterwards is awfully slow)
import csv
import numpy
import pandas
from tqdm import tqdm

date_first = '2008-11-01'
date_last = '2008-11-10'
row = 0
index = list()

with open("data.csv", "r") as f:
    reader = csv.reader(f, delimiter = ";")
    for line in tqdm(reader, unit = " lines per"):
        date_file = line[1][0:10]
        row = row + 1
        if date_file >= date_first and date_file <= date_last:
            index.append(row)

with open("data.csv") as f:
    reader = csv.reader(f)
    interestingrows = [row for idx, row in enumerate(reader) if idx in index]
The idea would be to keep only the data which satisfies the condition, here an extraction for a specific month. I do not understand where the problem is coming from; saving the data somewhere (vstack, or writing into a pd.DataFrame) is definitely the issue. I'm pretty sure I'm doing something wrong, but I'm not sure where/what.
The data is a csv with 35 columns and over 57M rows.
Thanks for the reading
O.
Appends to dataframes and numpy arrays are very expensive because each append must copy the entire data to a new memory location. Instead, you can try reading the file in chunks, filtering each chunk, and writing the results back out. Here I've picked a chunk size of 100,000, but you can obviously change this.
I don't know the column names of your CSV so I guessed at 'date_file'. This should get you close:
import pandas as pd

date_first = '2008-11-01'
date_last = '2008-11-10'

df = pd.read_csv("data.csv", chunksize=100000)

for chunk in df:
    chunk = chunk[(chunk['date_file'].str[:10] >= date_first)
                  & (chunk['date_file'].str[:10] <= date_last)]
    chunk.to_csv('output.csv', mode='a')
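One caveat (my addition, not part of the original answer): with mode='a' the header row is written once per chunk, and nothing sets the ';' separator the question mentions. A hedged variant that handles both ('date_file' is still a guessed column name):
import pandas as pd

date_first = '2008-11-01'
date_last = '2008-11-10'

first_chunk = True
for chunk in pd.read_csv("data.csv", sep=";", chunksize=100000):
    mask = ((chunk['date_file'].str[:10] >= date_first)
            & (chunk['date_file'].str[:10] <= date_last))
    # write the header only for the first chunk, then append
    chunk[mask].to_csv('output.csv', mode='w' if first_chunk else 'a',
                       header=first_chunk, index=False)
    first_chunk = False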

Why is file processing in python taking more time for chunks which come later in the file?

I have a very simple code which parses a JSON file. The file contains each line as a JSON object. For some reason, the processing time for each row increases as I run the code.
Can someone explain to me why this is happening and how to stop this?
Here is the code snippet:
from ast import literal_eval as le
import re
import string
from pandas import DataFrame
import pandas
import time

f = open('file.json')
df = DataFrame(columns=(column_names))  # column_names, column_name1, column_name2 defined elsewhere
row_num = 0
while True:
    t = time.time()
    for line in f:
        line = line.strip()
        d = le(line)
        df.loc[row_num] = [d[column_name1], d[column_name2]]
        row_num += 1
        if(row_num % 5000 == 0):
            print(row_num, 'done', time.time() - t)
            break
df.to_csv('MetaAnalysis', encoding='utf-8')
Part of the output is as follows:
5000 done 11.4549999237
10000 done 16.5380001068
15000 done 24.2339999676
20000 done 36.3680000305
25000 done 50.0610001087
30000 done 57.0130000114
35000 done 65.9800000191
40000 done 74.4649999142
As you can see, the time per 5000-row batch keeps increasing.
Pandas is notoriously slow about appending rows - it maintains hierarchical indices on the data, and every time you append a row it has to update all the indices.
This means it is much faster to add a thousand rows (then update) instead of adding a row (and updating) a thousand times.
Sample code to follow; I am still downloading the mozilla.tar.gz file (453 MB).
Edit: apparently the file I downloaded and extracted (/dump/mozilla/mozall.bson, 890 MB) is a MongoDB dump in bson with TenGen extensions, containing 769k rows. For testing purposes I took the first 50k rows and re-saved as json (result is 54 MB - average line is about 1200 characters), then used Notepad++ to break it to one record per line.
Most of the complexity here is for reading the file in chunks:
from itertools import islice
import pandas as pd
from time import time

LINES_PER_BLOCK = 5000

# read object chunks from json file
with open("mozilla.json") as inf:
    chunks = []
    while True:
        start = time()
        block = list(islice(inf, LINES_PER_BLOCK))
        if not block:
            # reached EOF
            break
        json = "[" + ",".join(block) + "]"
        chunk = pd.read_json(json, "records")
        chunks.append(chunk)
        done = time()
        print(LINES_PER_BLOCK * len(chunks), "done", done - start)
        start = done

# now combine chunks
start = time()
df = pd.concat(chunks)
done = time()
print("Concat done", done - start)
which gives
5000 done 0.12293195724487305
10000 done 0.12034845352172852
15000 done 0.12239885330200195
20000 done 0.11942410469055176
25000 done 0.12282919883728027
30000 done 0.11931681632995605
35000 done 0.1278700828552246
40000 done 0.12238287925720215
45000 done 0.12096738815307617
50000 done 0.20111417770385742
Concat done 0.04361534118652344
for a total time of 1.355s; if you don't need to chunk the file, it simplifies to
import pandas as pd
from time import time

start = time()
with open("mozilla.json") as inf:
    json = "[" + ",".join(inf) + "]"
df = pd.read_json(json, "records")
done = time()
print("Total time", done - start)
which gives
Total time 1.247551441192627
You are monotonically increasing the data structure df.loc by inserting new elements in the line
df.loc[row_num] = [d[column_name1], d[column_name2]].
The variable df.loc seems to be a dictionary (e.g. here). Inserting into a python dictionary is getting slower the more elements it already holds. This has already been discussed in this stackoverflow response. Hence, the increasing size of your dictionary will eventually slow down the inner code of the loop.
So, based on the answer by mayercn and the comment by Hugh Bowell, I was able to identify the issue with the code.
I modified the code as follows, which reduced the time to roughly 1/12th of the original (on average).
TL;DR: I appended rows to a list and only then appended the list to the final dataframe.
from ast import literal_eval as le
import re
import string
from pandas import DataFrame
import pandas
import time

f = open('Filename')
df = DataFrame(columns=cols)  # cols defined elsewhere
row_num = 0
while True:
    t = time.time()
    l = []
    for line in f:
        line = line.strip()
        bug = le(line)
        l.append([values])  # 'values' stands for the fields extracted from bug
        row_num += 1
        if(row_num % 5000 == 0):
            print(row_num, 'done', time.time() - t)
            df = df.append(pandas.DataFrame(l), ignore_index=True)
            break
df.to_csv('File', index='id', encoding='utf-8')
Output time:
5000 done 0.998000144958
10000 done 1.01800012589
15000 done 1.01699995995
20000 done 0.999000072479
25000 done 1.04600000381
30000 done 1.09200000763
35000 done 1.06200003624
40000 done 1.14300012589
45000 done 1.00900006294
50000 done 1.06600022316

Adding run number when appending data frame to CSV

I am running a simulation in python, writing results to Pandas DataFrame and appending data to a CSV file. The code will be run multiple times with possible variation of parameters. Is there a smart way to record run number of the simulation to the CSV file for future data analysis?
import pandas as pd
import random

# Create a data frame with random values of random length, append
# to a data frame and write to file.
df = pd.DataFrame()
for i in range(3):
    length = random.randint(3,20)
    aa = [random.randint(0,25) for i in range(length)]
    bb = [random.randint(0,25) for i in range(length)]
    run_n = [i] * length
    aabb = list(zip(aa, bb, run_n))
    aabb_df = pd.DataFrame(data=aabb, columns=['aa', 'bb', 'run_N'])
    df = df.append(aabb_df)

with open(myfile, 'a') as csvfile:  # myfile: path to the output CSV
    df.to_csv(csvfile, index=False, header=False)
Recording the number of the run from the for-loop is straightforward, but I suspect it breaks across multiple invocations of the script. Is it possible to check the last run number and continue counting from there without reading the whole file in?
Thank you in advance!
You could always make the run number an integer drawn from a uniform random distribution such that it's highly unlikely two of the same values will ever be drawn:
run_n = np.random.randint(1e9)
Or, you can use a counter strategy and increment each run number, guaranteeing that no two runs will have the same run_n (a sketch follows).
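A minimal sketch of that counter strategy (my addition; it assumes run_N is the last column of the CSV, as in the question's code, and reads only the tail of the file so nothing large is loaded):
import os

def next_run_number(path, tail_bytes=4096):
    # Returns 0 for a missing/empty file, otherwise last recorded run_N + 1
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return 0
    with open(path, 'rb') as f:
        f.seek(-min(tail_bytes, os.path.getsize(path)), os.SEEK_END)
        last_line = f.read().splitlines()[-1].decode()
    return int(last_line.split(',')[-1]) + 1

run_number = next_run_number(myfile)  # myfile as in the question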

Python: How to sample data into Test and Train datasets?

I have been using CSV data to implement my scripts and want to sample the data into two datasets:
Test Data
Train Data
I want to sample the data into 85% and 15% divisions and output two CSV files, Train.csv and Test.csv.
I want to do it in base Python and do not want to use any other external module like NumPy, SciPy, Pandas or scikit-learn. Can anyone help me out with random sampling of data by percentage? Moreover, I will be provided with datasets that may have a random number of observations. So far I have just read about Pandas and various other modules to sample the data on a percentage basis and have not found any concrete solution for my problem.
Moreover, I want to retain the headers of the CSV in both files, because the headers make each row accessible and can be used in further analysis.
Use the random function in the random module to get a uniformly distributed random number between 0 and 1.
If it's > .85, write to the test data; otherwise write to the training data (giving roughly an 85/15 train/test split). See How do I simulate flip of biased coin in python?.
import random

with open(input_file) as data:
    with open(test_output, 'w') as test:
        with open(train_output, 'w') as train:
            header = next(data)
            test.write(header)
            train.write(header)
            for line in data:
                if random.random() > 0.85:
                    test.write(line)
                else:
                    train.write(line)
Use random.shuffle to create a random permutation of your dataset and slice it as you wish:
import random
random.shuffle(data)
train = data[:int(len(data)*0.85)]
test = data[len(train):]
Since you requested a specific solution to partition a potentially large CSV file into two files for training and test data, I'll also show how that could be done using a similar approach like the general method described above:
import random

# Count lines
with open('data.csv','r') as csvf:
    linecount = sum(1 for line in csvf if line.strip() != '')

# Create index sets for training and test data
indices = list(range(linecount))
random.shuffle(indices)
ind_test = set(indices[:int(linecount*0.15)])
del indices

# Partition CSV file
with open('data.csv','r') as csvf, open('train.csv','w') as trainf, open('test.csv','w') as testf:
    i = 0
    for line in csvf:
        if line.strip() != '':
            if i in ind_test:
                testf.write(line.strip() + '\n')
            else:
                trainf.write(line.strip() + '\n')
            i += 1
Thereby, I assume that the CSV file contains one observation per row.
This will create an accurate 85:15 split. If less accurate partitions are okay for you, the solution of Peter Wood would be much more efficient.
