Calculate averages over subgroups of data in extremely large (100GB+) CSV file - python

I have a large semicolon-delimited text file that weighs in at a little over 100GB. It comprises ~18,000,000 rows of data and 772 columns.
The columns are: 'sc16' (int), 'cpid' (int), 'type' (str), 'pubyr' (int) and then 767 columns labeled 'dim_0', 'dim_1', 'dim_2' ... 'dim_767', that are all ints.
The file is already arranged/sorted by sc16 and pubyr so that each combination of sc16+pubyr are grouped together in ascending order.
What I'm trying to do is get the average of each 'dim_' column for each unique combination of sc16 & pubyr, then output the row to a new dataframe and save the final result to a new text file.
The problem is that in my script below, the processing gradually gets slower and slower until it's just creeping along by row 5,000,000. I'm working on a machine with 96GB of RAM, and I'm not used to working with a file so large I can't simply load it into memory. This is my first attempt trying to work with something like itertools, so no doubt I'm being really inefficient. Any help you can provide would be much appreciated!
import itertools
import pandas as pd
# Step 1: create an empty dataframe to store the mean values
mean_df = pd.DataFrame(columns=['sc16', 'pubyr'] + [f"dim_{i}" for i in range(768)])
# Step 2: open the file and iterate through the rows
with open('C:\Python_scratch\scibert_embeddings_sorted.txt') as f:
counter = 0
total_lines = sum(1 for line in f)
f.seek(0)
for key, group in itertools.groupby(f, key=lambda x: (x.split(';')[0], x.split(';')[3])): # group by the first (sc16) and fourth (pubyr) column
sc16, pubyr = key
rows = [row.strip().split(';') for row in group]
columns = rows[0]
rows = rows[1:]
# Step 3: convert the group of rows to a dataframe
group_df = pd.DataFrame(rows, columns=columns)
# Step 4: calculate the mean for the group
mean_row = {'sc16': sc16, 'pubyr': pubyr}
for col in group_df.columns:
if col.startswith('dim_'):
mean_row[col] = group_df[col].astype(float).mean()
# Step 5: append the mean row to the mean dataframe
mean_df = pd.concat([mean_df, pd.DataFrame([mean_row])], ignore_index=True)
counter += len(rows)
print(f"{counter} of {total_lines}")
# Step 6: save the mean dataframe to a new file
mean_df.to_csv('C:\Python_scratch\scibert_embeddings_mean.txt', sep=';', index=False)

You might not want to use Pandas at all, since your data is already neatly pre-sorted and all.
Try something like this; it's using numpy to make dim-wise averaging fast, but is plain Python otherwise. It processes a 43,000 line example file I generated in about 9 7.6 seconds on my machine and I don't see a reason why this should slow down over time. (If you know your file won't have a header line or empty lines, you could get rid of those checks.)
Your original code also spent extra time parsing the read lines over and over again; this uses a generator that does that only once.
import itertools
import operator
import numpy as np
def read_embeddings_file(filename):
# Read the (pre-sorted) embeddings file,
# yielding tuples of ((sc16, pubyr) and a list of dimensions).
with open(filename) as in_file:
for line in in_file:
if not line or line.startswith("sc16"): # Header or empty line
continue
line = line.split(";")
sc16, cpid, type, pubyr, *dims = line
# list(map(... is faster than the equivalent listcomp
yield (sc16, pubyr), list(map(int, dims))
def main():
output_name = "scibert_embeddings_mean.txt"
input_name = "scibert_embeddings_sorted.txt"
with open(output_name, "w") as out_f:
print("sc16", "pubyr", *[f"dim_{i}" for i in range(768)], sep=";", file=out_f)
counter = 0
for group, group_contents in itertools.groupby(
read_embeddings_file(input_name),
key=operator.itemgetter(0), # Group by (sc16, pubyr)
):
dims = [d[1] for d in group_contents]
# Calculate the mean of each dimension
mean_dims = np.mean(np.array(dims).astype(float), axis=0)
# Write group to output
print(*group, *mean_dims, sep=";", file=out_f)
# Print progress
counter += len(dims)
print(f"Processed: {counter}; group: {group}, entries in group: {len(dims)}")
if __name__ == "__main__":
main()

Related

sort one giant string into 7 columns

I have a file which I read in as a string. In sublime the file looks like this:
Filename
Dataset
Level
Duration
Accuracy
Speed Ratio
Completed
file_001.mp3
datasetname_here
value
00:09:29
0.00%
7.36x
2019-07-18
file_002.mp3
datasetname_here
value
00:22:01
...etc.
in Bash:
['Filename\n', 'Dataset\n', 'Level\n', 'Duration\n', 'Accuracy\n', 'Speed Ratio\n', 'Completed\n', 'file_001.mp3\n', 'datasetname_here\n', 'value\n', '00:09:29\n', '0.00%\n', '7.36x\n', '2019-07-18\n', 'file_002.mp3\n', 'datasetname_here\n', 'L1\n', '00:20:01\n', ...etc.
I want to split this into a 7 column csv. As you can see, the values repeat every 8th line. I know I can use a for loop and modulus to read each line. I have done this successfully before.
How can I use pandas to read things into columns?
I don't know how to approach the Pandas library. I have looked at other examples and all seem to start with csv.
import sys
parser = argparse.ArgumentParser()
parser.add_argument('file' , help = "this is the file you want to open")
args = parser.parse_args()
print("file name:" , args.file)
with open(args.file , 'r') as word:
print(word.readlines()) ###here is where i was making sure it read in properly
###here is where I will start to manipulate the data
This is the Bash output:
['Filename\n', 'Dataset\n', 'Level\n', 'Duration\n', 'Accuracy\n', 'Speed Ratio\n', 'Completed\n', 'file_001.mp3\n', 'datasetname_here\n', 'value\n', '00:09:29\n', '0.00%\n', '7.36x\n', '2019-07-18\n', 'file_002.mp3\n', 'datasetname_here\n', 'L1\n', '00:20:01\n', ...]
First remove '\n':
raw_data = ['Filename\n', 'Dataset\n', 'Level\n', 'Duration\n', 'Accuracy\n', 'Speed Ratio\n', 'Completed\n', 'file_001.mp3\n', 'datasetname_here\n', 'value\n', '00:09:29\n', '0.00%\n', '7.36x\n', '2019-07-18\n', 'file_002.mp3\n', 'datasetname_here\n', 'L1\n', '00:20:01\n', '0.01%\n', '7.39x\n', '2019-07-20\n']
raw_data = [string.replace('\n', '') for string in raw_data]
Then pack your data in 7-length arrays inside a big array:
data = [raw_data[x:x+7] for x in range(0, len(raw_data),7)]
Finally read your data as a DataFrame, the first row contains the name of the columns:
df = pd.DataFrame(data[1:], columns=data[0])
print(df.to_string())
Filename Dataset Level Duration Accuracy Speed Ratio Completed
0 file_001.mp3 datasetname_here value 00:09:29 0.00% 7.36x 2019-07-18
1 file_002.mp3 datasetname_here L1 00:20:01 0.01% 7.39x 2019-07-20
Try This
import numpy as np
import pandas as pd
with open ("data.txt") as f:
list_str = f.readlines()
list_str = map(lambda s: s.strip(), list_str) #Remove \n
n=7
list_str = [list_str[k:k+n] for k in range(0, len(list_str), n)]
df = pd.DataFrame(list_str[1:])
df.columns = list_str[0]
df.to_csv("Data_generated.csv",index=False)
Pandas is not a library to read into columns. It supports many formats to read and write (One of them is comma separated values) and mainly used as python based data analysis tool.
Best place to learn is see their documentation and practice.
Output of above code
I think you don't have to use pandas or any other library. My approach:
data = []
row = []
with open(args.file , 'r') as file:
for line in file:
row.append(line)
if len(row) == 7:
data.append(row)
row = []
How does it work?
The for loop reads the file line by line.
Add the line to row
When row's length is 7, it's completed and you can add the row to data
Create a new list for row
Repeat

How to print max and min value from a long file?

So Im having problem to print the max and min value from a file, the file has over 3000 lines and look like this:
3968 #number of lines
2000-01-03, 3.738314
2000-01-04, 3.423135
2000-01-05, 3.473229
...
...
2015-10-07, 110.779999
2015-10-08, 109.50
2015-10-09, 112.120003
So this is my current code, I have no idea how to make it work, because now it only prints 3968 value because obviously it is the largest but I want the largest and smallest value from the second column (all the stock prices).
def apple():
stock_file = open('apple_USD.txt', 'r')
data = stock_file.readlines()
data = data[0:]
stock_file.close()
print(max(data))
Your current code outputs the "correct" output by chance, since it is using string comparison.
Consider this:
with open('test.txt') as f:
lines = [line.split(', ') for line in f.readlines()[1:]]
# lines is a list of lists, each sub-list represents a line in a format [date, value]
max_value_date, max_value = max(lines, key=lambda line: float(line[-1].strip()))
print(max_value_date, max_value)
# '2015-10-09' '112.120003'
Your current code is reading each line as a string and then finding max and min lines for your list. You can use pandas to read the file as CSV and load it as data frame and then do your min, max operations on data frame
Hope following answers your question
stocks = []
data=data[1:]
for d in data:
stocks.append(float(d.split(',')[1]))
print(max(stocks))
print( min(stocks))
I recommend Pandas module to work with tabular data and use read_csv function. Is very well documented, optimized and very popular for this purposes. You can install it with pip using pip install pandas.
I created a dumb file with your format and stored in a file called test.csv:
3968 #number of lines
2000-01-03, 3.738314
2000-01-04, 3.423135
2000-01-05, 3.473229
2015-10-07, 110.779999
2015-10-08, 109.50
2015-10-09, 112.120003
Then, to parse the file you can do as follows. Names parameter defines the name of the columns. Skiprows allows you to skip the first line.
#import module
import pandas as pd
#load file
df = pd.read_csv('test.csv', names=['date', 'value'], skiprows=[0])
#get max and min values
max_value = df['value'].max()
min_value = df['value'].min()
You want to extract the second column into a float using float(datum.split(', ')[1].strip()), and ignore the first line.
def apple():
stock_file = open('apple_USD.txt', 'r')
data = stock_file.readlines()
data = data[1:] #ignore first line
stock_file.close()
data = [datum.split(', ') for datum in data]
max_value_date, max_value = max(data, key=lambda data: float(data[-1].strip()))
print(max_value_date, max_value)
or you can use do it in a simpler way: make a list of prices and then get the maximum and minimum. like this:
#as the first line in your txt is not data
datanew=data[1:]
prices=[]
line_after=[]
for line in datanew:
line_after=line.split(',')
price=line_after[1]
prices.append(float(price))
maxprice=max(prices)
minprice=min(prices)

reading csv file line by line and save lines which are satisfying certain conditions

I have an issue which was already discussed in several topics, nevertheless i would like to go a bit deeper and maybe find a better solution.
So the idea is to go through "huge" (50 to 60GB) .csv files with python, find the lines which satisfy some conditions, extract them and finally store them in a second variable for further analysis.
Initially the problem was for r scripts, which i manage with sparklyr connection, or eventually some gawk code in bash (see awk, or gawk), to extract the data I need, then analyse it with R/python.
I would like to resolve this issue exclusively with python, the idea would be to avoid mixing languages like bash/python, or bash/R (unix). So far i use the open as x, and go through file line by line, and it kinda works, but it's awfully slow. For example, going through the file is pretty fast (~500.000 lines per second, even for a 58M lines is ok), but when I try to store the data, the speed drops to ~10 lines per second. For an extraction with ~300.000 lines, it's unacceptable.
I tried several solutions and I'm guessing that it's not optimal (poor python code ? :( ) and better solutions eventually exist.
Solution 1: go through file, split the line in a list, check the conditions, if ok put the line in numpy matrix and vstack for each iteration which is satisfying the condition (very slow)
import csv
import numpy
import pandas
from tqdm import tqdm
date_first = '2008-11-01'
date_last = '2008-11-10'
a = numpy.array(['colnames']*35) #data is 35 columns
index = list()
with open("data.csv", "r") as f:
for line in tqdm(f, unit = " lines per"):
line = line.split(sep = ";") # csv with ";" ...
date_file = line[1][0:10] # date stored in the 2nd column
if date_file >= date_first and date_file <= date_last : #data extraction concern a time period (one month for example)
line=numpy.array(line) #go to numpy
a=numpy.vstack((a, line)) #stack it
Solution 2 : the same but store the line in a pandas data.frame with a row index if conditions ok (very slow)
import csv
import numpy
import pandas
from tqdm import tqdm
date_first = '2008-11-01'
date_last = '2008-11-10'
row = 0 #row index
a = pandas.DataFrame(numpy.zeros((0,35)))#data is 35 columns
with open("data.csv", "r") as f:
for line in tqdm(f, unit = " lines per"):
line = line.split(sep = ";")
date_file = line[1][0:10]
if date_file>=date_first and date_file<=date_last :
a.loc[row] = line #store the line in the pd.data.frame at the position row
row = row + 1 #go to next row
Solution 3 : the same, but instead of storing the line somewhere, which is the main issue for me, keep an index for satisfying rows, and then open the csv with the rows i need (even slower, actually going through file to find the indexes is fast enough, the opening index's row is awfully slow)
import csv
import numpy
import pandas
from tqdm import tqdm
date_first = '2008-11-01'
date_last = '2008-11-10'
row = 0
index = list()
with open("data.csv", "r") as f:
f = csv.reader(f, delimiter = ";")
for line in tqdm(f, unit = " lines per"):
line = line.split(sep = ";")
date_file = line[1][0:10]
row = row + 1
if date_file>=date_first and date_file<=date_last :
index.append(row)
with open("data.csv") as f:
reader=csv.reader(f)
interestingrows=[row for idx, row in enumerate(reader) if idx in index]
The idea would be to keep only the data which satisfy the condition, here an extraction for a specific month. I do not understand where the problem is coming from, saving the data somewhere (vstack, or writing with in a pd.DF) is definitively an issue. I'm pretty sure i do something wrong but i'm not sure where/what.
The data is a csv with 35 columns and over 57M rows.
Thanks for the reading
O.
Appends to dataframes and numpy arrays are very expensive because each append must copy the entire data to a new memory location. Instead, you can try reading the file in chunks, processing the data, and appending back out. Here I've picked a chunk size of 100,000 but you can obviously change this.
I don't know the column names of your CSV so I guessed at 'date_file'. This should get you close:
import pandas as pd
date_first = '2008-11-01'
date_last = '2008-11-10'
df = pd.read_csv("data.csv", chunksize=100000)
for chunk in df:
chunk = chunk[(chunk['date_file'].str[:10] >= date_first)
& (chunk['date_file'].str[:10] <= date_last)]
chunk.to_csv('output.csv', mode='a')

Column statistics from given input file?

I am given a .txt file of data:
1,2,3,0,0
1,0,4,5,0
1,1,1,1,1
3,4,5,6,0
1,0,1,0,3
3,3,4,0,0
My objective is to calculate the min,max,avg,range,median of the columns of given data and write it to an output .txt file.
My logic in approaching this question is as follows
Step 1) Read the data
infile = open("Data.txt", "r")
tempLine = infile.readline()
while tempLine:
print(tempLine.split(','))
tempLine = infile.readline()
Obviously it's not perfect but the idea is that the data can be read by this...
Step 2) Store the data into corresponding list variables? row1, row2,... row6
Step 3) Combine above lists all into one, giving a final list like this...
flist =[[1,2,3,0,0],[1,0,4,5,0],[1,1,1,1,1],[3,4,5,6,0],[1,0,1,0,3],[3,3,4,0,0]]
Step 4) Using nested for loop, access elements individually and store them into list variables
col1, col2, col3, ... , col5
Step 5) Calculate min, max etc and write to output file
My question is, with my rather beginner knowledge of computer science and python, is this logic inefficient, and could there possibly be an easier, and better logic towards solving this problem?
My main problem is probably steps 2 through 5. The rest I know how to do for sure.
Any advice would be helpful!
Try numpy. Numpy library provides a fast options when dealing with nested lists in a list, or simply, matrices.
To use numpy, you must import numpy at the beginning of your code.
numpy.matrix(1,2,3,0,0;1,0,4,5,0;....;3,3,4,0,0)
will give you
flist =[[1,2,3,0,0],[1,0,4,5,0],[1,1,1,1,1],[3,4,5,6,0],[1,0,1,0,3],[3,3,4,0,0]] straight off the bat.
Also, you may look through the axis(in this case, rows) and get mean, min, max easily using
max([axis, out]) Return the maximum value along an axis.
mean([axis, dtype, out]) Returns the average of the matrix elements along the given axis.
min([axis, out]) Return the minimum value along an axis.
This is from https://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.html, a numpy document, so for more information, please read the numpy document.
To get the data I would to something like this:
from statistics import median
infile = open("Data.txt", "r")
rows = [line.split(',') for line in infile.readlines()]
for row in rows:
minRow = min(row)
maxRow = max(row)
avgRow = sum(row) / len(row)
rangeRow = maxRow - minRow
medianRow = median(row)
#then write the data to the output file
You can use the pandas library for this (http://pandas.pydata.org/)
The code below worked for me:
import pandas as pd
df = pd.read_csv('data.txt',header=None)
somestats = df.describe()
somestats.to_csv('dataOut.txt')
This is how I ended up doing it if anyone is curious
import numpy
infile = open("Data1.txt", "r")
outfile = open("ColStats.txt", "w")
oMat = numpy.loadtxt(infile)
tMat = numpy.transpose(oMat) #Create new matrix where Columns of oMat becomes rows and rows become columns
#print(tMat)
for x in range (5):
tempM = tMat[x]
mn = min(tempM)
mx = max(tempM)
avg = sum(tempM)/6.0
rng = mx - mn
median = numpy.median(tempM)
out = ("[{} {} {} {} {}]".format(mn, mx, avg, rng, median))
outfile.write(out + '\n')
infile.close()
outfile.close()
#print(tMat)

pandas read_csv specific row [duplicate]

The CSV file that I want to read does not fit into main memory. How can I read a few (~10K) random lines of it and do some simple statistics on the selected data frame?
Assuming no header in the CSV file:
import pandas
import random
n = 1000000 #number of records in file
s = 10000 #desired sample size
filename = "data.txt"
skip = sorted(random.sample(range(n),n-s))
df = pandas.read_csv(filename, skiprows=skip)
would be better if read_csv had a keeprows, or if skiprows took a callback func instead of a list.
With header and unknown file length:
import pandas
import random
filename = "data.txt"
n = sum(1 for line in open(filename)) - 1 #number of records in file (excludes header)
s = 10000 #desired sample size
skip = sorted(random.sample(range(1,n+1),n-s)) #the 0-indexed header will not be included in the skip list
df = pandas.read_csv(filename, skiprows=skip)
#dlm's answer is great but since v0.20.0, skiprows does accept a callable. The callable receives as an argument the row number.
Note also that their answer for unknown file length relies on iterating through the file twice -- once to get the length, and then another time to read the csv. I have three solutions here which only rely on iterating through the file once, though they all have tradeoffs.
Solution 1: Approximate Percentage
If you can specify what percent of lines you want, rather than how many lines, you don't even need to get the file size and you just need to read through the file once. Assuming a header on the first row:
import pandas as pd
import random
p = 0.01 # 1% of the lines
# keep the header, then take only 1% of lines
# if random from [0,1] interval is greater than 0.01 the row will be skipped
df = pd.read_csv(
filename,
header=0,
skiprows=lambda i: i>0 and random.random() > p
)
As pointed out in the comments, this only gives approximately the right number of lines, but I think it satisfies the desired usecase.
Solution 2: Every Nth line
This isn't actually a random sample, but depending on how your input is sorted and what you're trying to achieve, this may meet your needs.
n = 100 # every 100th line = 1% of the lines
df = pd.read_csv(filename, header=0, skiprows=lambda i: i % n != 0)
Solution 3: Reservoir Sampling
(Added July 2021)
Reservoir sampling is an elegant algorithm for selecting k items randomly from a stream whose length is unknown, but that you only see once.
The big advantage is that you can use this without having the full dataset on disk, and that it gives you an exactly-sized sample without knowing the full dataset size. The disadvantage is that I don't see a way to implement it in pure pandas, I think you need to drop into python to read the file and then construct the dataframe afterwards. So you may lose some functionality from read_csv or need to reimplement it, since we're not using pandas to actually read the file.
Taking an implementation of the algorithm from Oscar Benjamin here:
from math import exp, log, floor
from random import random, randrange
from itertools import islice
from io import StringIO
def reservoir_sample(iterable, k=1):
"""Select k items uniformly from iterable.
Returns the whole population if there are k or fewer items
from https://bugs.python.org/issue41311#msg373733
"""
iterator = iter(iterable)
values = list(islice(iterator, k))
W = exp(log(random())/k)
while True:
# skip is geometrically distributed
skip = floor( log(random())/log(1-W) )
selection = list(islice(iterator, skip, skip+1))
if selection:
values[randrange(k)] = selection[0]
W *= exp(log(random())/k)
else:
return values
def sample_file(filepath, k):
with open(filepath, 'r') as f:
header = next(f)
result = [header] + sample_iter(f, k)
df = pd.read_csv(StringIO(''.join(result)))
The reservoir_sample function returns a list of strings, each of which is a single row, so we just need to turn it into a dataframe at the end. This assumes there is exactly one header row, I haven't thought about how to extend it to other situations.
I tested this locally and it is much faster than the other two solutions. Using a 550 MB csv (January 2020 "Yellow Taxi Trip Records" from the NYC TLC), solution 3 runs in about 1 second, while the other two take ~3-4 seconds.
In my test this is even slightly (~10-20%) faster than #Bar's answer using shuf, which surprises me.
This is not in Pandas, but it achieves the same result much faster through bash, while not reading the entire file into memory:
shuf -n 100000 data/original.tsv > data/sample.tsv
The shuf command will shuffle the input and the and the -n argument indicates how many lines we want in the output.
Relevant question: https://unix.stackexchange.com/q/108581
Benchmark on a 7M lines csv available here (2008):
Top answer:
def pd_read():
filename = "2008.csv"
n = sum(1 for line in open(filename)) - 1 #number of records in file (excludes header)
s = 100000 #desired sample size
skip = sorted(random.sample(range(1,n+1),n-s)) #the 0-indexed header will not be included in the skip list
df = pandas.read_csv(filename, skiprows=skip)
df.to_csv("temp.csv")
Timing for pandas:
%time pd_read()
CPU times: user 18.4 s, sys: 448 ms, total: 18.9 s
Wall time: 18.9 s
While using shuf:
time shuf -n 100000 2008.csv > temp.csv
real 0m1.583s
user 0m1.445s
sys 0m0.136s
So shuf is about 12x faster and importantly does not read the whole file into memory.
Here is an algorithm that doesn't require counting the number of lines in the file beforehand, so you only need to read the file once.
Say you want m samples. First, the algorithm keeps the first m samples. When it sees the i-th sample (i > m), with probability m/i, the algorithm uses the sample to randomly replace an already selected sample.
By doing so, for any i > m, we always have a subset of m samples randomly selected from the first i samples.
See code below:
import random
n_samples = 10
samples = []
for i, line in enumerate(f):
if i < n_samples:
samples.append(line)
elif random.random() < n_samples * 1. / (i+1):
samples[random.randint(0, n_samples-1)] = line
The following code reads first the header, and then a random sample on the other lines:
import pandas as pd
import numpy as np
filename = 'hugedatafile.csv'
nlinesfile = 10000000
nlinesrandomsample = 10000
lines2skip = np.random.choice(np.arange(1,nlinesfile+1), (nlinesfile-nlinesrandomsample), replace=False)
df = pd.read_csv(filename, skiprows=lines2skip)
class magic_checker:
def __init__(self,target_count):
self.target = target_count
self.count = 0
def __eq__(self,x):
self.count += 1
return self.count >= self.target
min_target=100000
max_target = min_target*2
nlines = randint(100,1000)
seek_target = randint(min_target,max_target)
with open("big.csv") as f:
f.seek(seek_target)
f.readline() #discard this line
rand_lines = list(iter(lambda:f.readline(),magic_checker(nlines)))
#do something to process the lines you got returned .. perhaps just a split
print rand_lines
print rand_lines[0].split(",")
something like that should work I think
No pandas!
import random
from os import fstat
from sys import exit
f = open('/usr/share/dict/words')
# Number of lines to be read
lines_to_read = 100
# Minimum and maximum bytes that will be randomly skipped
min_bytes_to_skip = 10000
max_bytes_to_skip = 1000000
def is_EOF():
return f.tell() >= fstat(f.fileno()).st_size
# To accumulate the read lines
sampled_lines = []
for n in xrange(lines_to_read):
bytes_to_skip = random.randint(min_bytes_to_skip, max_bytes_to_skip)
f.seek(bytes_to_skip, 1)
# After skipping "bytes_to_skip" bytes, we can stop in the middle of a line
# Skip current entire line
f.readline()
if not is_EOF():
sampled_lines.append(f.readline())
else:
# Go to the begginig of the file ...
f.seek(0, 0)
# ... and skip lines again
f.seek(bytes_to_skip, 1)
# If it has reached the EOF again
if is_EOF():
print "You have skipped more lines than your file has"
print "Reduce the values of:"
print " min_bytes_to_skip"
print " max_bytes_to_skip"
exit(1)
else:
f.readline()
sampled_lines.append(f.readline())
print sampled_lines
You'll end up with a sampled_lines list. What kind of statistics do you mean?
use subsample
pip install subsample
subsample -n 1000 file.csv > file_1000_sample.csv
You can also create a sample with the 10000 records before bringing it into the Python environment.
Using Git Bash (Windows 10) I just ran the following command to produce the sample
shuf -n 10000 BIGFILE.csv > SAMPLEFILE.csv
To note: If your CSV has headers this is not the best solution.
TL;DR
If you know the size of the sample you want, but not the size of the input file, you can efficiently load a random sample out of it with the following pandas code:
import pandas as pd
import numpy as np
filename = "data.csv"
sample_size = 10000
batch_size = 200
rng = np.random.default_rng()
sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)
sample = sample_reader.get_chunk(sample_size)
for chunk in sample_reader:
chunk.index = rng.integers(sample_size, size=len(chunk))
sample.loc[chunk.index] = chunk
Explanation
It's not always trivial to know the size of the input CSV file.
If there are embedded line breaks, tools like wc or shuf will give you the wrong answer or just make a mess out of your data.
So, based on desktable's answer, we can treat the first sample_size lines of the file as the initial sample and then, for each subsequent line in the file, randomly replace a line in the initial sample.
To do that efficiently, we load the CSV file using a TextFileReader by passing the chunksize= parameter:
sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)
First, we get the initial sample:
sample = sample_reader.get_chunk(sample_size)
Then, we iterate over the remaining chunks of the file, replacing the index of each chunk with a sequence of random integers as long as size of the chunk, but where each integer is in the range of the index of the initial sample (which happens to be the same as range(sample_size)):
for chunk in sample_reader:
chunk.index = rng.integers(sample_size, size=len(chunk))
And use this reindexed chunk to replace (some of the) lines in the sample:
sample.loc[chunk.index] = chunk
After the for loop, you'll have a dataframe at most sample_size rows long, but with random lines selected from the big CSV file.
To make the loop more efficient, you can make batch_size as large as your memory allows (and yes, even larger than sample_size if you can).
Notice that, while creating the new chunk index with np.random.default_rng().integers(), we use len(chunk) as the new chunk index size instead of simply batch_size because the last chunk in the loop could be smaller.
On the other hand, we use sample_size instead of len(sample) as the "range" of the random integers, even though there could be less lines in the file than sample_size. This is because there won't be any chunks left to loop over in this case so that will never be a problem.
read the data file
import pandas as pd
df = pd.read_csv('data.csv', 'r')
First check the shape of df
df.shape()
create the small sample of 1000 raws from df
sample_data = df.sample(n=1000, replace='False')
#check the shape of sample_data
sample_data.shape()
For example, you have the loan.csv, you can use this script to easily load the specified number of random items.
data = pd.read_csv('loan.csv').sample(10000, random_state=44)
Let's say that you want to load a 20% sample of the dataset:
import pandas as pd
df = pd.read_csv(filepath).sample(frac = 0.20)

Categories