Python: How to sample data into Test and Train datasets? - python

I have been working with a CSV file in my scripts and want to sample the data into two datasets:
Test Data
Train Data
I want to split the data 85%/15% and output two CSV files, Train.csv and Test.csv.
I want to do this in base Python, without external modules like NumPy, SciPy, Pandas or scikit-learn. Can anyone help me with random sampling of data by percentage? Note that the datasets I am given may have any number of observations. So far I have only read about Pandas and various other modules for sampling data by percentage, and I have not found a concrete solution to my problem.
I also want to retain the CSV header in both files, because the header makes each row accessible and usable in further analysis.

Use the random function in the random module to get a uniformly distributed random number between 0 and 1.
If it's less than 0.85, write the line to the training file, otherwise to the test file. This is a biased coin flip; see How do I simulate flip of biased coin in python?.
import random

with open(input_file) as data:
    with open(test_output, 'w') as test:
        with open(train_output, 'w') as train:
            # copy the header to both output files
            header = next(data)
            test.write(header)
            train.write(header)
            for line in data:
                # ~85% of lines go to the training file, the rest to the test file
                if random.random() < 0.85:
                    train.write(line)
                else:
                    test.write(line)

Use random.shuffle to create a random permutation of your dataset and slice it as you wish:
import random
random.shuffle(data)
train = data[:int(len(data)*0.85)]
test = data[len(train):]
Since you requested a specific solution to partition a potentially large CSV file into two files for training and test data, I'll also show how that could be done with an approach similar to the general method described above:
import random

# Count non-empty lines
with open('data.csv', 'r') as csvf:
    linecount = sum(1 for line in csvf if line.strip() != '')

# Create index set for the test data (15% of the rows)
indices = list(range(linecount))
random.shuffle(indices)
ind_test = set(indices[:int(linecount * 0.15)])
del indices

# Partition CSV file
with open('data.csv', 'r') as csvf, open('train.csv', 'w') as trainf, open('test.csv', 'w') as testf:
    i = 0
    for line in csvf:
        if line.strip() != '':
            if i in ind_test:
                testf.write(line.strip() + '\n')
            else:
                trainf.write(line.strip() + '\n')
            i += 1
Here I assume that the CSV file contains one observation per row.
This will create an exact 85:15 split. If a less exact partition is okay for you, the solution by Peter Wood is much more efficient.
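If you also need the header row in both output files (as the question asks), a minimal variation of the above works: read and write the header first, then count, shuffle and split only the data rows. A sketch, assuming the first line of data.csv is the header:
import random

# Read the header and count the non-empty data rows
with open('data.csv', 'r') as csvf:
    header = next(csvf)
    linecount = sum(1 for line in csvf if line.strip() != '')

# Pick 15% of the data-row indices for the test set
indices = list(range(linecount))
random.shuffle(indices)
ind_test = set(indices[:int(linecount * 0.15)])

with open('data.csv', 'r') as csvf, open('train.csv', 'w') as trainf, open('test.csv', 'w') as testf:
    next(csvf)            # skip the header in the input
    trainf.write(header)  # write it to both output files
    testf.write(header)
    i = 0
    for line in csvf:
        if line.strip() != '':
            (testf if i in ind_test else trainf).write(line)
            i += 1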

Related

Sorting a big file using its chunks

Suppose we want to sort a file that has 40000 rows by a column X. Let us also assume that equal values are spread across the table, so rows with the same value in column X are found not only in the top 1000 rows. If we read the file in chunks of 1000 rows and sort each chunk on its own, rows that share a value in column X end up scattered across chunks. So how can we solve this issue? No code is needed since no data is available; I am just looking for your opinion on the matter. Should we feed each chunk to a merge sort in parallel and then recombine the results? I don't see a way to do that with pandas, but I am not sure.
import pandas as pd

chunk_size = 1000
batch_no = 1
for chunk in pd.read_csv('data.csv', chunksize=chunk_size):
    chunk.sort_values(by='X', inplace=True)
    chunk.to_csv('data' + str(batch_no) + '.csv', index=False)
    batch_no += 1
You need to merge the sorted csv files, and luckily Python provides a function for it. Use it as below:
import pandas as pd
import numpy as np
import csv
import heapq

# generate test data
test_data = pd.DataFrame(data=[[f"label{i}", val] for i, val in enumerate(np.random.uniform(size=40000))],
                         columns=["label", "X"])
test_data.to_csv("data.csv", index=False)

# read and sort each chunk
chunk_size = 1000
file_names = []
for batch_no, chunk in enumerate(pd.read_csv("data.csv", chunksize=chunk_size), 1):
    chunk.sort_values(by="X", inplace=True)
    file_name = f"data_{batch_no}.csv"
    chunk.to_csv(file_name, index=False)
    file_names.append(file_name)

# merge the chunks
chunks = [csv.DictReader(open(file_name)) for file_name in file_names]
with open("data_sorted.csv", "w", newline="") as outfile:
    field_names = ["label", "X"]
    writer = csv.DictWriter(outfile, fieldnames=field_names)
    writer.writeheader()
    # compare X as a float (not as a string) so the merge order matches the numeric sort of the chunks
    for row in heapq.merge(*chunks, key=lambda row: float(row["X"])):
        writer.writerow(row)
From the documentation on heapq.merge:
Merge multiple sorted inputs into a single sorted output (for example,
merge timestamped entries from multiple log files). Returns an
iterator over the sorted values.
Similar to sorted(itertools.chain(*iterables)) but returns an
iterable, does not pull the data into memory all at once, and assumes
that each of the input streams is already sorted (smallest to
largest).
So, as you can read in the quote above, heapq.merge won't load all the data into memory. It is also worth noting that the complexity of the merge step is O(n log k), where n is the size of the whole dataset and k is the number of chunks, so the overall sorting algorithm is O(n log n).
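As a quick standalone illustration of heapq.merge on already-sorted inputs:
import heapq

# each input list is already sorted; merge lazily yields one combined sorted stream
print(list(heapq.merge([1, 3, 5], [2, 4, 6], [0, 7])))
# [0, 1, 2, 3, 4, 5, 6, 7]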

Data from multiple sensors saved to txt file imported to pandas

Good day everyone.
I was hoping someone here could help me with a bit of a problem. I've run an experiment where data was gathered from 6 separate sensors simultaneously, and the data was then exported to a single shared txt file. Now I need to import the data into Python to analyze it.
I know I could take the lines for each sensor, copy & paste them into separate documents, and then import those in a loop, but that is a lot of work and carries a high risk of human error.
Is there a way of using readline with specific line numbers and porting that to a pandas DataFrame? There is a fixed header spacing, and a fixed line spacing between each sensor.
I tried:
import pandas as pd

f = open('OR0024622_auto3200.txt')
lines = f.readlines()
base = 83
sensorlines = 6400
Sensor = lines[base:sensorlines + base]
df_sens = pd.DataFrame(Sensor)
df_sens
but the output isn't very useful:
[screenshot of the output omitted]
Here's the file I am importing: link.
Any suggestions?
Looks like tab-separated data.
Use:
>>> df = pd.read_csv('OR0024622_auto3200.txt', delimiter=r'\t', skiprows=83, header=None, nrows=38955-84)
>>> df.tail()
0 1 2
38686 6397 3.1980000000e+003 9.28819e-009
38687 6398 3.1985000000e+003 9.41507e-009
38688 6399 3.1990000000e+003 1.11703e-008
38689 6400 3.1995000000e+003 9.64276e-009
38690 6401 3.2000000000e+003 8.92203e-009
>>> df.head()
0 1 2
0 1 0.0000000000e+000 6.62579e+000
1 2 5.0000000000e-001 3.31289e+000
2 3 1.0000000000e+000 2.62362e-011
3 4 1.5000000000e+000 1.51130e-011
4 5 2.0000000000e+000 8.35723e-012
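Since the question mentions six sensors with a fixed header spacing and a fixed line spacing between blocks, one way to extend this is to read each block with its own skiprows/nrows and tag it with a sensor number. A sketch, where block_stride (the number of lines from the start of one sensor block to the start of the next) is an assumed placeholder you would read off the file layout:
import pandas as pd

base = 83            # lines before the first sensor block (from the question)
sensorlines = 6400   # data lines per sensor (from the question)
block_stride = 6410  # hypothetical: sensorlines plus the header/gap lines between blocks

sensors = []
for k in range(6):
    start = base + k * block_stride
    df_k = pd.read_csv('OR0024622_auto3200.txt', delimiter=r'\t',
                       skiprows=start, nrows=sensorlines, header=None)
    df_k['sensor'] = k + 1   # tag each row with its sensor number
    sensors.append(df_k)

all_sensors = pd.concat(sensors, ignore_index=True)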
abhilb's answer is to the point and correct, but there is a lot to be said regarding loading/reading files. A quick browser search will take you a long way (I encourage you to read up on this!), but I'll add a few details here:
If you want to load multiple files that match a pattern you can do so iteratively via glob:
import pandas as pd
from glob import glob as gg

filePattern = "/path/to/file/*.txt"
for fileName in gg(filePattern):
    df = pd.read_csv(fileName, delimiter=r'\t')
This will load each file one-by-one. What if you want to put all data into a single dataframe? Do this:
masterDF = pd.DataFrame()
for fileName in gg(filePattern):
    df = pd.read_csv(fileName, delimiter=r'\t')
    masterDF = pd.concat([masterDF, df], axis=0)
This works great for pandas, but what if you want to read into a numpy array?
import numpy as np
# using previous imports

base = 83
sensorlines = 6400

# create an empty array that has three columns
masterArray = np.full((0, 3), np.nan)
for fileName in gg(filePattern):
    # open the file (NOTE: this does not read the file, just puts it in a buffer)
    with open(fileName, "r") as tmp:
        # now read the file and split it into lines on the newline character (could be "\r\n")
        # you now have a list of strings
        data = tmp.read().split("\n")
    # keep only the "data" portion of the file
    data = data[base:sensorlines + base]
    # convert the list of strings to an array of floats
    # here, I use a "list comprehension" for speed and simplicity
    data = np.array([r.split("\t") for r in data]).astype(float)
    # stack your new data onto your master array
    masterArray = np.vstack([masterArray, data])
Opening a file via the "with open(fileName, "r")" syntax is handy because Python automatically closes the file when you are done. If you don't use "with" then you must manually close the file (e.g. tmp.close()).
These are just some starting points to get you on your way. Feel free to ask for clarification.

Adding run number when appending data frame to CSV

I am running a simulation in Python, writing the results to a pandas DataFrame and appending the data to a CSV file. The code will be run multiple times, possibly with variations of the parameters. Is there a smart way to record the run number of the simulation in the CSV file for future data analysis?
import pandas as pd
import random

# Create data frames with random values of random length, append
# them to one data frame and write it to file.
df = pd.DataFrame()
for i in range(3):
    length = random.randint(3, 20)
    aa = [random.randint(0, 25) for _ in range(length)]
    bb = [random.randint(0, 25) for _ in range(length)]
    run_n = [i] * length
    aabb = list(zip(aa, bb, run_n))
    aabb_df = pd.DataFrame(data=aabb, columns=['aa', 'bb', 'run_N'])
    df = df.append(aabb_df)

# myfile is the path of the output CSV
with open(myfile, 'a') as csvfile:
    df.to_csv(csvfile, index=False, header=False)
Recording the number of the run from the for-loop is straightforward, but I suspect it is not the right approach. Is it possible to check the number of the last run and continue counting from there without reading the whole file in?
Thank you in advance!
You could always make the run number an integer drawn from a uniform random distribution such that it's highly unlikely two of the same values will ever be drawn:
run_n = np.random.randint(1e9)
Or, you can use a counter strategy that increments the run number on every execution, which guarantees that no two runs will have the same run_n; see the sketch below.
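A minimal sketch of such a counter, assuming a small sidecar file (here called run_counter.txt, a made-up name) that persists the last run number between executions:
import os

def next_run_number(counter_file='run_counter.txt'):
    # read the last run number from the (hypothetical) sidecar file, add one, persist it
    if os.path.exists(counter_file):
        with open(counter_file) as f:
            last = int(f.read().strip() or 0)
    else:
        last = 0
    run_n = last + 1
    with open(counter_file, 'w') as f:
        f.write(str(run_n))
    return run_n

# call it once at the start of each simulation run
run_n = next_run_number()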

pandas read_csv specific row [duplicate]

The CSV file that I want to read does not fit into main memory. How can I read a few (~10K) random lines of it and do some simple statistics on the selected data frame?
Assuming no header in the CSV file:
import pandas
import random
n = 1000000 #number of records in file
s = 10000 #desired sample size
filename = "data.txt"
skip = sorted(random.sample(range(n),n-s))
df = pandas.read_csv(filename, skiprows=skip)
It would be better if read_csv had a keeprows option, or if skiprows took a callback function instead of a list.
With header and unknown file length:
import pandas
import random
filename = "data.txt"
n = sum(1 for line in open(filename)) - 1 #number of records in file (excludes header)
s = 10000 #desired sample size
skip = sorted(random.sample(range(1,n+1),n-s)) #the 0-indexed header will not be included in the skip list
df = pandas.read_csv(filename, skiprows=skip)
dlm's answer is great, but since v0.20.0 skiprows does accept a callable. The callable receives the row number as its argument.
Note also that their answer for unknown file length relies on iterating through the file twice -- once to get the length, and then another time to read the csv. I have three solutions here which only rely on iterating through the file once, though they all have tradeoffs.
Solution 1: Approximate Percentage
If you can specify what percent of lines you want, rather than how many lines, you don't even need to get the file size and you just need to read through the file once. Assuming a header on the first row:
import pandas as pd
import random
p = 0.01 # 1% of the lines
# keep the header, then take only 1% of lines
# if random from [0,1] interval is greater than 0.01 the row will be skipped
df = pd.read_csv(
    filename,
    header=0,
    skiprows=lambda i: i > 0 and random.random() > p
)
As pointed out in the comments, this only gives approximately the right number of lines, but I think it satisfies the desired usecase.
Solution 2: Every Nth line
This isn't actually a random sample, but depending on how your input is sorted and what you're trying to achieve, this may meet your needs.
n = 100 # every 100th line = 1% of the lines
df = pd.read_csv(filename, header=0, skiprows=lambda i: i % n != 0)
Solution 3: Reservoir Sampling
(Added July 2021)
Reservoir sampling is an elegant algorithm for selecting k items randomly from a stream whose length is unknown, but that you only see once.
The big advantage is that you can use this without having the full dataset on disk, and that it gives you an exactly-sized sample without knowing the full dataset size. The disadvantage is that I don't see a way to implement it in pure pandas, I think you need to drop into python to read the file and then construct the dataframe afterwards. So you may lose some functionality from read_csv or need to reimplement it, since we're not using pandas to actually read the file.
Taking an implementation of the algorithm from Oscar Benjamin here:
from math import exp, log, floor
from random import random, randrange
from itertools import islice
from io import StringIO
import pandas as pd

def reservoir_sample(iterable, k=1):
    """Select k items uniformly from iterable.

    Returns the whole population if there are k or fewer items
    from https://bugs.python.org/issue41311#msg373733
    """
    iterator = iter(iterable)
    values = list(islice(iterator, k))

    W = exp(log(random())/k)
    while True:
        # skip is geometrically distributed
        skip = floor(log(random())/log(1-W))
        selection = list(islice(iterator, skip, skip+1))
        if selection:
            values[randrange(k)] = selection[0]
            W *= exp(log(random())/k)
        else:
            return values

def sample_file(filepath, k):
    with open(filepath, 'r') as f:
        header = next(f)
        result = [header] + reservoir_sample(f, k)
    df = pd.read_csv(StringIO(''.join(result)))
    return df
The reservoir_sample function returns a list of strings, each of which is a single row, so we just need to turn it into a dataframe at the end. This assumes there is exactly one header row; I haven't thought about how to extend it to other situations.
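A usage sketch, with a hypothetical file name and sample size:
# draw 10,000 random rows from a large CSV without loading the whole file into memory
df = sample_file("data.csv", k=10000)
print(len(df), "rows sampled")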
I tested this locally and it is much faster than the other two solutions. Using a 550 MB csv (January 2020 "Yellow Taxi Trip Records" from the NYC TLC), solution 3 runs in about 1 second, while the other two take ~3-4 seconds.
In my test this is even slightly (~10-20%) faster than Bar's answer using shuf, which surprises me.
This is not in Pandas, but it achieves the same result much faster through bash, while not reading the entire file into memory:
shuf -n 100000 data/original.tsv > data/sample.tsv
The shuf command shuffles the input, and the -n argument indicates how many lines we want in the output.
Relevant question: https://unix.stackexchange.com/q/108581
Benchmark on a 7M lines csv available here (2008):
Top answer:
def pd_read():
    filename = "2008.csv"
    n = sum(1 for line in open(filename)) - 1  # number of records in file (excludes header)
    s = 100000  # desired sample size
    skip = sorted(random.sample(range(1, n+1), n-s))  # the 0-indexed header will not be included in the skip list
    df = pandas.read_csv(filename, skiprows=skip)
    df.to_csv("temp.csv")
Timing for pandas:
%time pd_read()
CPU times: user 18.4 s, sys: 448 ms, total: 18.9 s
Wall time: 18.9 s
While using shuf:
time shuf -n 100000 2008.csv > temp.csv
real 0m1.583s
user 0m1.445s
sys 0m0.136s
So shuf is about 12x faster and importantly does not read the whole file into memory.
Here is an algorithm that doesn't require counting the number of lines in the file beforehand, so you only need to read the file once.
Say you want m samples. First, the algorithm keeps the first m samples. When it sees the i-th sample (i > m), with probability m/i, the algorithm uses the sample to randomly replace an already selected sample.
By doing so, for any i > m, we always have a subset of m samples randomly selected from the first i samples.
See code below:
import random

n_samples = 10
samples = []
with open("data.csv") as f:  # any iterator over lines works here
    for i, line in enumerate(f):
        if i < n_samples:
            samples.append(line)
        elif random.random() < n_samples * 1. / (i + 1):
            samples[random.randint(0, n_samples - 1)] = line
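If you then want the sampled lines as a DataFrame (assuming the file has a single header row), a minimal sketch is to keep the header aside and feed the sampled lines back through read_csv:
import random
from io import StringIO
import pandas as pd

n_samples = 10
samples = []
with open("data.csv") as f:
    header = f.readline()   # keep the header out of the sample
    for i, line in enumerate(f):
        if i < n_samples:
            samples.append(line)
        elif random.random() < n_samples * 1. / (i + 1):
            samples[random.randint(0, n_samples - 1)] = line

df = pd.read_csv(StringIO(header + ''.join(samples)))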
The following code reads the header first, and then takes a random sample of the other lines:
import pandas as pd
import numpy as np
filename = 'hugedatafile.csv'
nlinesfile = 10000000
nlinesrandomsample = 10000
lines2skip = np.random.choice(np.arange(1,nlinesfile+1), (nlinesfile-nlinesrandomsample), replace=False)
df = pd.read_csv(filename, skiprows=lines2skip)
from random import randint

class magic_checker:
    def __init__(self, target_count):
        self.target = target_count
        self.count = 0
    def __eq__(self, x):
        self.count += 1
        return self.count >= self.target

min_target = 100000
max_target = min_target*2
nlines = randint(100, 1000)
seek_target = randint(min_target, max_target)

with open("big.csv") as f:
    f.seek(seek_target)
    f.readline()  #discard this (possibly partial) line
    rand_lines = list(iter(lambda: f.readline(), magic_checker(nlines)))

#do something to process the lines you got returned .. perhaps just a split
print rand_lines
print rand_lines[0].split(",")
something like that should work I think
No pandas!
import random
from os import fstat
from sys import exit

f = open('/usr/share/dict/words')

# Number of lines to be read
lines_to_read = 100

# Minimum and maximum bytes that will be randomly skipped
min_bytes_to_skip = 10000
max_bytes_to_skip = 1000000

def is_EOF():
    return f.tell() >= fstat(f.fileno()).st_size

# To accumulate the read lines
sampled_lines = []

for n in xrange(lines_to_read):
    bytes_to_skip = random.randint(min_bytes_to_skip, max_bytes_to_skip)
    f.seek(bytes_to_skip, 1)
    # After skipping "bytes_to_skip" bytes, we can stop in the middle of a line
    # Skip the current entire line
    f.readline()
    if not is_EOF():
        sampled_lines.append(f.readline())
    else:
        # Go to the beginning of the file ...
        f.seek(0, 0)
        # ... and skip lines again
        f.seek(bytes_to_skip, 1)
        # If it has reached the EOF again
        if is_EOF():
            print "You have skipped more lines than your file has"
            print "Reduce the values of:"
            print "    min_bytes_to_skip"
            print "    max_bytes_to_skip"
            exit(1)
        else:
            f.readline()
            sampled_lines.append(f.readline())

print sampled_lines
You'll end up with a sampled_lines list. What kind of statistics do you mean?
use subsample
pip install subsample
subsample -n 1000 file.csv > file_1000_sample.csv
You can also create a sample with the 10000 records before bringing it into the Python environment.
Using Git Bash (Windows 10) I just ran the following command to produce the sample
shuf -n 10000 BIGFILE.csv > SAMPLEFILE.csv
To note: If your CSV has headers this is not the best solution.
TL;DR
If you know the size of the sample you want, but not the size of the input file, you can efficiently load a random sample out of it with the following pandas code:
import pandas as pd
import numpy as np
filename = "data.csv"
sample_size = 10000
batch_size = 200
rng = np.random.default_rng()
sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)
sample = sample_reader.get_chunk(sample_size)
for chunk in sample_reader:
    chunk.index = rng.integers(sample_size, size=len(chunk))
    sample.loc[chunk.index] = chunk
Explanation
It's not always trivial to know the size of the input CSV file.
If there are embedded line breaks, tools like wc or shuf will give you the wrong answer or just make a mess out of your data.
So, based on desktable's answer, we can treat the first sample_size lines of the file as the initial sample and then, for each subsequent line in the file, randomly replace a line in the initial sample.
To do that efficiently, we load the CSV file using a TextFileReader by passing the chunksize= parameter:
sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)
First, we get the initial sample:
sample = sample_reader.get_chunk(sample_size)
Then, we iterate over the remaining chunks of the file, replacing the index of each chunk with a sequence of random integers as long as the chunk itself, where each integer lies in the range of the initial sample's index (which happens to be the same as range(sample_size)):
for chunk in sample_reader:
    chunk.index = rng.integers(sample_size, size=len(chunk))
And use this reindexed chunk to replace (some of the) lines in the sample:
sample.loc[chunk.index] = chunk
After the for loop, you'll have a dataframe at most sample_size rows long, but with random lines selected from the big CSV file.
To make the loop more efficient, you can make batch_size as large as your memory allows (and yes, even larger than sample_size if you can).
Notice that, while creating the new chunk index with np.random.default_rng().integers(), we use len(chunk) as the new chunk index size instead of simply batch_size because the last chunk in the loop could be smaller.
On the other hand, we use sample_size instead of len(sample) as the "range" of the random integers, even though there could be less lines in the file than sample_size. This is because there won't be any chunks left to loop over in this case so that will never be a problem.
Read the data file:
import pandas as pd
df = pd.read_csv('data.csv')
First check the shape of df:
df.shape
Create a small sample of 1000 rows from df:
sample_data = df.sample(n=1000, replace=False)
# check the shape of sample_data
sample_data.shape
For example, you have the loan.csv, you can use this script to easily load the specified number of random items.
data = pd.read_csv('loan.csv').sample(10000, random_state=44)
Let's say that you want to load a 20% sample of the dataset:
import pandas as pd
df = pd.read_csv(filepath).sample(frac = 0.20)

pandas HDF select does not recognise column name

I'm trying to process a large (2GB) csv file on a machine with only 4GB of RAM (don't ask) to produce a different, formatted csv containing a subset of data that needs some processing. I'm reading the file and creating an HDFStore that I query later for the data I need for output. Everything works except that I can't retrieve data from the store using Term: the error message says that PLOT is not a column name. The individual variables look fine and the store is what I expect; I just can't see where the error is. (NB pandas 0.14 and numpy 1.9.0.) I'm very new to this, so apologies for the clunky code.
#wibble wobble -*- coding: utf-8 -*-
# short version
def filesport():
    import pandas as pd
    import numpy as np
    from pandas.io.pytables import Term

    Location = r"CL_short.csv"
    store = pd.HDFStore('blarg.h5')
    maxlines = sum(1 for line in open(Location))
    print maxlines
    #set chunk small for test file
    chunky = 4
    plotty = pd.DataFrame(columns=['PLOT'])
    dfdum = pd.DataFrame(columns=['PLOT', 'mDate', 'D100'])
    #read file in chunks to avoid RAM blowing up
    bucket = pd.read_csv(Location, iterator=True, chunksize=chunky, usecols=['PLOT', 'mDate', 'D100'])
    for chunk in bucket:
        store.append('wibble', chunk, format='table', data_columns=['PLOT', 'mDate', 'D100'], ignore_index=True)
    #retrieve plot numbers and select unique items
    plotty = store.select('wibble', "columns = ['PLOT']")
    plotty.drop_duplicates(inplace=True)
    #iterate through unique plots to retrieve data and put in dataframe for output
    for index, row in plotty.iterrows():
        dfdum = store.select('wibble', [Term('PLOT', '=', plotty.iloc[index]['PLOT'])])
        #process dfdum for output to new csv

    print("successful completion")

filesport()
Final listing for those who fight through the tumbleweed to reach here and are similarly bemused by processing large .csv files and the various methods of retrieving/processing the data. The biggest problem was getting the syntax of the pytables Term right. Despite several examples indicating that it was possible to use 'A > 20' etc., this never worked for me. I set up a string condition containing the Term query and this worked (it is in the documentation, TBF).
Also found it easier to query the HDF to retrieve unique items direct from the store in a list which could then be sorted and iterated through to retrieve data plot by plot. Note that I wanted the final csv file to have plot and then all the D100 data in date order, hence the pivot at the end.
Reading the csv file in chunks meant that each plot retrieved from the store had a header and this got written to the final csv which messed things up. I'm sure there's a more elegant way of only writing one header than the one I've shown here.
It works, takes about 2 hours to process the data and produce the final csv file (initial file 2GB, 30+million lines, data for 100,000+ unique plots, machine has 4GB of RAM but running 32-bit which means that only 2.5GB of RAM was available).
Good luck if you have a similar problem, and I hope you find this useful
#wibble wobble -*- coding: utf-8 -*-
def filesport():
    import pandas as pd
    import numpy as np
    from pandas.io.pytables import Term

    print (pd.__version__)
    print (np.__version__)

    Location = r"conliq_med.csv"
    store = pd.HDFStore('blarg.h5')
    maxlines = sum(1 for line in open(Location))
    print maxlines
    chunky = 100000
    #read file in chunks to avoid RAM blowing up, select only needed columns
    bucket = pd.read_csv(Location, iterator=True, chunksize=chunky, usecols=['PLOT', 'mDate', 'D100'])
    for chunk in bucket:
        store.append('wibble', chunk, format='table', data_columns=['PLOT', 'mDate', 'D100'], ignore_index=True)
    #retrieve unique plots and sort
    plotty = store.select_column('wibble', 'PLOT').unique()
    plotty.sort()
    #set flag for writing file header
    i = 0
    #iterate through unique plots to retrieve data and put in dataframe for output
    for item in plotty:
        condition = 'PLOT =' + str(item)
        dfdum = store.select('wibble', [Term(condition)])
        dfdum["mDate"] = pd.to_datetime(dfdum["mDate"], dayfirst=True)
        dfdum.sort(columns=["PLOT", "mDate"], inplace=True)
        dfdum["mDate"] = dfdum["mDate"].map(lambda x: x.strftime("%Y - %m"))
        dfdum = dfdum.pivot("PLOT", "mDate", "D100")
        #only print one header to file
        if i == 0:
            dfdum.to_csv("CL_OP.csv", mode='a')
            i = 1
        else:
            dfdum.to_csv("CL_OP.csv", mode='a', header=False)

    print("successful completion")

filesport()
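On the point above about writing the header only once: a slightly tidier variant (an untested sketch against this older pandas version) is to enumerate the plots and pass the header flag straight to to_csv, replacing the loop and the manual flag:
for i, item in enumerate(plotty):
    condition = 'PLOT =' + str(item)
    dfdum = store.select('wibble', [Term(condition)])
    dfdum["mDate"] = pd.to_datetime(dfdum["mDate"], dayfirst=True)
    dfdum.sort(columns=["PLOT", "mDate"], inplace=True)
    dfdum["mDate"] = dfdum["mDate"].map(lambda x: x.strftime("%Y - %m"))
    dfdum = dfdum.pivot("PLOT", "mDate", "D100")
    # the header is written only for the first plot (i == 0)
    dfdum.to_csv("CL_OP.csv", mode='a', header=(i == 0))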
