comparing large csv files - python

Which method is more efficient for comparing two large (8GB & 5GB) csv files? The output should contain every id that is not in file1.
The data is a single column with GUIDs.
Method 1:
df = pd.read_csv(file)
df1 = pd.read_csv(file1)
df = df.merge(df1, on=['id'], how="outer", indicator=True).query('_merge=="left_only"')
df['id'].to_csv(output_path, index=False)
Method 2:
with open(file1, 'r') as t1:
    file1_ids = set(t1)          # every GUID line in file1
with open(file, 'r') as t2, open(output_path, 'w') as outFile:
    for line in t2:
        if line not in file1_ids:
            outFile.write(line)

What do you mean by efficiency? Certainly two major differences are as follows:
The first method, which uses pandas, needs to have all the data in memory. So you will need enough available memory to hold the data from the two csv files (note: 5 + 8 GB of RAM may not be enough, but it will depend on the type of data in the csv files).
The second method takes advantage of python's generators, and reads
the file line by line, loading into memory one line at a time.
So if you have enough memory available to load all the data, it will certainly be faster to do the operations entirely in memory.
If you don't have enough memory, the second method works but is definitely slower. A good compromise is to read the file in chunks, loading into memory an amount of data that your hardware can handle.
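For illustration, here is a minimal sketch of that chunked compromise, reusing the question's file, file1 and output_path names (the chunksize of one million rows and the assumption that file1's IDs fit in a set are mine, not part of the original question):
import pandas as pd

# Load the smaller file's IDs into a set (assumes they fit in memory).
file1_ids = set(pd.read_csv(file1)['id'])

# Stream the larger file in chunks and keep only the IDs missing from file1.
first_chunk = True
for chunk in pd.read_csv(file, chunksize=1_000_000):
    missing = chunk.loc[~chunk['id'].isin(file1_ids), 'id']
    missing.to_csv(output_path, index=False,
                   mode='w' if first_chunk else 'a',
                   header=first_chunk)
    first_chunk = False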
Extras
To estimate the memory space used by your DataFrame you can read this nice post:
How to estimate how much memory a Pandas' DataFrame will need?
Here you can find further reading explaining how to read a file in chunks, with or without pandas:
How do I read a large csv file with pandas?
Lazy Method for Reading Big File in Python?

If this is something you'll have to run multiple times, you can just wrap them with start = time.time() at the beginning and execution_time = time.time() - start at the end to compare speed. To compare memory, you can check out this package, memory_profiler
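A minimal sketch of both measurements (compare_files is a placeholder for whichever method you are timing):
import time
from memory_profiler import profile   # pip install memory_profiler

@profile                 # prints a line-by-line memory report when the function runs
def compare_files():
    ...                  # put method 1 or method 2 here

start = time.time()
compare_files()
execution_time = time.time() - start
print(f"took {execution_time:.1f} s")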

Related

Read in large text file (~20m rows), apply function to rows, write to new text file

I have a very large text file, and a function that does what I want it to do to each line. However, when reading line by line and applying the function, it takes roughly three hours. I'm wondering if there isn't a way to speed this up with chunking or multiprocessing.
My code looks like this:
with open('f.txt', 'r') as f:
    function(f, w)
Where the function takes in the large text file and an empty text file and applies the function and writes to the empty file.
I have tried:
def multiprocess(f, w):
    cores = multiprocessing.cpu_count()
    with Pool(cores) as p:
        pieces = p.map(function, f, w)
    f.close()
    w.close()

multiprocess(f, w)
But when I do this, I get a TypeError: unsupported operand with type 'io.TextIOWrapper' and 'int'. This could also be the wrong approach, or I may be doing this wrong entirely. Any advice would be much appreciated.
Even if you could successfully pass open file objects to child OS processes in your Pool as arguments f and w (which I don't think you can on any OS), trying to read from and write to files concurrently is a bad idea, to say the least.
In general, I recommend using the Process class rather than Pool, assuming that the output end result needs to maintain the same order as the input 20m lines file.
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Process
The slowest solution, but most efficient RAM usage
Your initial solution to execute and process the file line by line
For maximum speed, but most RAM consumption
Read the entire file into RAM as a list via f.readlines(), if your entire dataset can fit in memory comfortably
Figure out the number of cores (say 8 cores for example)
Split the list evenly into 8 lists
Pass each list to the function to be executed by a Process instance (at this point your RAM usage will roughly double, which is the trade-off for max speed), but you should del the original big list right after to free some RAM
Each Process handles its entire chunk in order, line by line, and writes it into its own output file (out_file1.txt, out_file2.txt, etc.)
Have your OS concatenate your output files in order into one big output file. You can use subprocess.run('cat out_file* > big_output.txt', shell=True) if you are running a UNIX system, or the equivalent Windows command on Windows. A sketch of this approach follows the list.
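A minimal sketch of this readlines-and-split approach (process_lines stands in for the real per-line work, and the file names are assumptions):
import multiprocessing
import subprocess

def process_lines(lines, out_path):
    # Stand-in for the real per-line work; preserves input order within the chunk.
    with open(out_path, 'w') as out:
        for line in lines:
            out.write(line)   # replace with the real function applied to line

if __name__ == '__main__':
    with open('f.txt', 'r') as f:
        all_lines = f.readlines()               # entire file in RAM

    cores = multiprocessing.cpu_count()
    chunk = (len(all_lines) + cores - 1) // cores
    pieces = [all_lines[i * chunk:(i + 1) * chunk] for i in range(cores)]
    del all_lines                               # free the original big list

    procs = []
    for i, piece in enumerate(pieces, start=1):
        p = multiprocessing.Process(target=process_lines,
                                    args=(piece, f'out_file{i}.txt'))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()

    # Concatenate per-process outputs (UNIX; note the glob sorts names lexically).
    subprocess.run('cat out_file* > big_output.txt', shell=True)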
For an intermediate trade-off between speed and RAM, but the most complex option, we will have to use the Queue class
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Queue
Figure out the number of cores in a variable cores (say 8)
Initialize 8 Queues and 8 Processes, and pass one Queue to each Process. At this point each Process should open its own output file (outfile1.txt, outfile2.txt, etc.)
Each Process shall poll (and block) for a chunk of 10_000 rows, process them, and write them to its respective output file sequentially
In a loop in the parent Process, read 10_000 * 8 lines from your input 20m-row file
Split that into several lists (10K chunks) to push to your respective Processes' Queues
When you're done with the 20m rows, exit the loop and pass a special value into each Process's Queue that signals the end of input data
When each Process detects that special End of Data value in its own Queue, it should close its output file and exit
Have your OS concatenate your output files in order into one big output file. You can use subprocess.run('cat out_file* > big_output.txt', shell=True) if you are running a UNIX system, or the equivalent Windows command on Windows. A sketch of this Queue approach follows the list.
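And a rough sketch of the Queue variant under the same assumptions (worker stands in for the real per-line work, and None is used as the end-of-data sentinel):
import multiprocessing
import subprocess
from itertools import islice

def worker(q, out_path):
    # Each worker owns one output file and drains its own Queue.
    with open(out_path, 'w') as out:
        while True:
            chunk = q.get()            # blocks until the parent sends a chunk
            if chunk is None:          # sentinel: end of input data
                break
            for line in chunk:
                out.write(line)        # replace with the real processing

if __name__ == '__main__':
    cores = multiprocessing.cpu_count()
    queues = [multiprocessing.Queue(maxsize=4) for _ in range(cores)]
    procs = [multiprocessing.Process(target=worker,
                                     args=(q, f'out_file{i + 1}.txt'))
             for i, q in enumerate(queues)]
    for p in procs:
        p.start()

    with open('f.txt', 'r') as f:
        i = 0
        while True:
            chunk = list(islice(f, 10_000))
            if not chunk:
                break
            queues[i % cores].put(chunk)   # round-robin the 10k-line chunks
            i += 1

    for q in queues:
        q.put(None)                        # tell each worker there is no more data
    for p in procs:
        p.join()

    # Note: with round-robin distribution, concatenating the files in name order
    # preserves order only within each worker's file, not the global line order.
    subprocess.run('cat out_file* > big_output.txt', shell=True)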
Convoluted? Well, it is usually a trade-off between speed, RAM and complexity. Also, for a 20m-row task, one needs to make sure that data processing is as optimal as possible - inline as many functions as you can, avoid a lot of math, use pandas/numpy in the child processes if possible, etc.
Iterating with in one line at a time is not the only way; you can read more than one line at a time (for example with readlines()), and reading several lines per call will make the program read faster.
Look this snippet.
# Python code to demonstrate readlines()
L = ["Geeks\n", "for\n", "Geeks\n"]

# writing to file
file1 = open('myfile.txt', 'w')
file1.writelines(L)
file1.close()

# Using readlines()
file1 = open('myfile.txt', 'r')
Lines = file1.readlines()

count = 0
# Strips the newline character
for line in Lines:
    count += 1
    print("Line{}: {}".format(count, line.strip()))
I got it from: https://www.geeksforgeeks.org/read-a-file-line-by-line-in-python/.

What's the most efficient use of computing resources when reading from one file and writing to another? line by line or in a batch to a list?

I'm reading line by line from a text file and manipulating the string to then be written to a csv file.
I can think of two best ways to do this (and I welcome other ideas or modifications):
Read, process single line into a list, and go straight to writing the line.
linelist = []
with open('dirty.txt', 'r') as dirty_text:
    with open('clean.csv', 'w') as clean_csv:
        cleancsv_writer = csv.writer(clean_csv, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for line in dirty_text:
            # Parse fields into list, replacing the previous list item with a new string that is a comma-separated row.
            # Write list item into clean.csv.
Read and process the lines into a list (until reaching the memory limit of the list), then write the list to the csv in one big batch. Repeat until the end of the file (but I'm leaving out the loop for this example).
linelist = []
seekpos = 0
with open('dirty.txt', 'r') as dirty_text:
    for line in dirty_text:
        # Parse fields into list until the end of the file or the end of the list's memory space, such that each list item is a string that is a comma-separated row.
        # Update seek position to come back to after this batch, if looping through multiple batches.

with open('clean.csv', 'a') as clean_csv:
    cleancsv_writer = csv.writer(clean_csv, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    # Write list into clean.csv, each list item becoming a comma-separated row.
    # This would likely be a loop for bigger files, but for my project and for simplicity, it's not necessary.
Which process is the most efficient use of resources?
In this case, I'm assuming nobody (human or otherwise) needs access to either file during this process (though I would gladly hear discussion about efficiency in that case).
I'm also assuming a list demands less resources than a dictionary.
Memory use is my primary concern. My hunch is that the first process uses the least memory because the list never gets longer than one item, so the maximum memory it uses at any given moment is less than that of the second process which maxes out the list memory. But, I'm not sure how dynamic memory allocation works in Python, and you have two file objects open at the same time in the first process.
As for power usage and total time it takes, I'm not sure which process is more efficient. My hunch is that with multiple batches, the second option would use more power and take more time because it opens and closes the files at each batch.
As for code complexity and length, the first option seems like it will turn out simpler and shorter.
Other considerations?
Which process is best?
Is there a better way? Ten better ways?
Thanks in advance!
Reading all the data into memory is inefficient because it uses more memory than necessary.
You can trade some CPU for memory; the program to read everything into memory will have a single, very simple main loop; but the main bottleneck will be the I/O channel, so it really won't be faster. Regardless of how fast the code runs, any reasonable implementation will spend most of its running time waiting for the disk.
If you have enough memory, reading the entire file into memory will work fine. Once the data is bigger than your available memory, performance will degrade ungracefully (i.e. the OS will start swapping regions of memory out to disk and then swap them back in when they are needed again; in the worst case, this will basically grind the system to a halt, a situation called thrashing). The main reason to prefer reading and writing a line at a time is that the program will perform without degradation even when you scale up to larger amounts of data.
I/O is already buffered; just write what looks natural, and let the file-like objects and the operating system take care of the actual disk reads and writes.
with open('dirty.txt', 'r') as dirty_text:
    with open('clean.csv', 'w') as clean_csv:
        cleancsv_writer = csv.writer(clean_csv, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for line in dirty_text:
            row = some_function(line)
            cleancsv_writer.writerow(row)
If all the work of cleaning up a line is abstracted away by some_function, you don't even need the for loop.
with open('dirty.txt', 'r') as dirty_text, \
     open('clean.csv', 'w') as clean_csv:
    cleancsv_writer = csv.writer(clean_csv, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    cleancsv_writer.writerows(some_function(line) for line in dirty_text)

pandas - read file only to a certain limit

I have a file (in GBs) and want to read out only (let's say) 500MB of it. Is there a way I can do this?
PS: I thought of reading in first few lines of the dataset. See how much memory it uses and then accordingly get the number of lines. I'm looking for a way that can avoid this approach.
You can use a generator here to read lines from a file in a memory-efficient way; see Lazy Method for Reading Big File in Python?
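A small sketch of such a generator, capped at the question's 500 MB (the name read_up_to and the byte budget are mine):
def read_up_to(path, max_bytes=500 * 1024 * 1024):
    # Yield lines lazily until roughly max_bytes have been read.
    read_so_far = 0
    with open(path) as f:
        for line in f:
            read_so_far += len(line)
            if read_so_far > max_bytes:
                break
            yield line

for line in read_up_to('your file name'):
    pass  # process the line here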
or
you can read just the first n lines of the file; let's suppose you want to read the first 100 lines (note that f.read() takes a number of characters, not lines, so islice is used here instead):
from itertools import islice

fname = 'your file name'
lines = 100
with open(fname) as f:
    content = list(islice(f, lines))
print(content)
or
by using pandas nrows (number of rows):
import pandas as pd
myfile = pd.read_csv('your file name', nrows=1000)

Python CSV parsing fills up memory

I have a CSV file which has over a million rows and I am trying to parse this file and insert the rows into the DB.
with open(file, "rb") as csvfile:
re = csv.DictReader(csvfile)
for row in re:
//insert row['column_name'] into DB
For csv files below 2 MB this works well, but anything more than that ends up eating my memory. It is probably because I store the DictReader's contents in a list called "re" and it is not able to loop over such a huge list. I definitely need to access the csv file with its column names, which is why I chose DictReader, since it easily provides column-level access to my csv files. Can anyone tell me why this is happening and how it can be avoided?
The DictReader does not load the whole file into memory but reads it by chunks as explained in this answer suggested by DhruvPathak.
But depending on your database engine, the actual write on disk may only happen at commit. That means that the database (and not the csv reader) keeps all data in memory and at end exhausts it.
So you should try to commit every n records, with n typically between 10 and 1000 depending on the size of your lines and the available memory.
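A rough sketch of that batching, using sqlite3 purely for illustration (the table and column names are placeholders, not from the question):
import csv
import sqlite3

conn = sqlite3.connect("rows.db")
conn.execute("CREATE TABLE IF NOT EXISTS my_table (column_name TEXT)")

n = 1000                                    # commit every n rows
with open(file, "r", newline="") as csvfile:
    reader = csv.DictReader(csvfile)
    for i, row in enumerate(reader, start=1):
        conn.execute("INSERT INTO my_table (column_name) VALUES (?)",
                     (row['column_name'],))
        if i % n == 0:
            conn.commit()                   # flush this batch so it doesn't pile up in memory
    conn.commit()                           # commit the final partial batch
conn.close()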
If you don't need the entire columns at once, you can simply read the file line by line like you would with a text file and parse each row. The exact parsing depends on your data format but you could do something like:
delimiter = ','
with open(filename, 'r') as fil:
    headers = next(fil)    # fil.next() in Python 2
    headers = headers.strip().split(delimiter)
    dic_headers = {hdr: headers.index(hdr) for hdr in headers}
    for line in fil:
        row = line.strip().split(delimiter)
        ## do something with row[dic_headers['column_name']]
This is a very simple example but it can be made more elaborate. For example, this does not work if your data itself contains commas (e.g. inside quoted fields).

Python fast way to read several rows of csv text?

I wish to do the following as fast as possible with Python:
read rows i to j of a csv file
create the concatenation of all the strings in csv[row=(loop i to j)][column=3]
My first code was a loop (i to j) of the following:
with open('Train.csv', 'rt') as f:
    row = next(itertools.islice(csv.reader(f), row_number, row_number + 1))
    tags = (row[3].decode('utf8'))
    return tags
but my code above reads the csv one row at a time and is slow.
How can I read all rows in one call and concatenate fast?
Edit for additional information:
the csv file size is 7GB; I have only 4GB of RAM, on windows XP; but I don't need to read all columns (only 1% of the 7GB would be good I think).
Since I know which data you are interested in, I can speak from experience:
import csv
with open('Train.csv', 'rt') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in reader:
        row[0]  # ID
        row[1]  # title
        row[2]  # body
        row[3]  # tags
You can of course per row select anything you want, and store it as you like.
By using an iterator variable, you can decide which rows to collect:
import csv
with open('Train.csv', 'rt') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    linenum = 0
    tags = []  # you can preallocate memory to this list if you want though.
    for row in reader:
        if linenum > 1000 and linenum < 2000:
            tags.append(row[3])  # tags
        if linenum == 2000:
            break  # so it won't read the next 3 million rows
        linenum += 1
The good thing about this is also that it uses very little memory, since you read line by line.
As mentioned, if you want the later rows, it still has to parse the data to get there (this is inevitable since there are newlines in the text, so you can't skip straight to a certain row). Personally, I just roughly used linux's split to split the file into chunks, and then edited them, making sure they start at an ID (and end with a tag).
Then I used:
train = pandas.io.parsers.read_csv(file, quotechar="\"")
To quickly read in the split files.
If the file is not HUGE (hundreds of megabytes) and you actually need to read a lot of rows then probably just
tags = " ".join(x.split("\t")[3]
                for x in open("Train.csv").readlines()[from_row:to_row+1])
is going to be the fastest way.
If the file is instead very big, the only thing you can do is iterate over all lines, because CSV unfortunately uses (in general) variable-sized records.
If by chance the specific CSV uses a fixed-size record format (not uncommon for large files) then directly seeking into the file may be an option.
If the file uses variable-sized records and the search must be done several times with different ranges then creating a simple external index just once (e.g. line->file offset for all line numbers that are a multiple of 1000) can be good idea.
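A rough sketch of building such an index once (the step of 1000 rows and the pickle file name are assumptions):
import pickle

def build_line_index(path, step=1000):
    # Map line number -> byte offset for every step-th line.
    index = {}
    offset = 0
    with open(path, 'rb') as f:
        for lineno, line in enumerate(f):
            if lineno % step == 0:
                index[lineno] = offset
            offset += len(line)
    return index

index = build_line_index('Train.csv')
with open('train_index.pkl', 'wb') as out:
    pickle.dump(index, out)

# Later, to start reading near row 150_000:
with open('Train.csv', 'rb') as f:
    f.seek(index[150_000])
    # read forward from here
Note this indexes physical lines, which matches CSV rows only when no record spans multiple lines.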
Your question does not contain enough information, probably because you don't see some existing complexity: Most CSV files contain one record per line. In that case it's simple to skip the rows you're not interested in. But in CSV records can span lines, so a general solution (like the CSV reader from the standard library) has to parse the records to skip lines. It's up to you to decide what optimization is ok in your use case.
The next problem is that you don't know which part of the code you posted is too slow. Measure it. Your code will never run faster than the time you need to read the file from disk. Have you checked that? Or have you guessed which part is too slow?
If you want to do fast transformations of CSV data which fits in memory, I would propose to use/learn Pandas. So it would probably be a good idea to split your code into two steps:
Reduce file to the required data.
Transform the remaining data.
sed is designed for the task 'read rows i to j of a csv file'.
If the solution does not have to be pure Python, I think preprocessing the csv file with sed -n 'i,jp', then parsing the output with Python, would be simple and quick.
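For example, a sketch of driving sed from Python (the row numbers 1000 and 2000, and the default comma delimiter, are placeholders):
import csv
import io
import subprocess

# Let sed extract rows 1000..2000, then parse only that slice with the csv module.
result = subprocess.run(['sed', '-n', '1000,2000p', 'Train.csv'],
                        capture_output=True, text=True, check=True)
tags = " ".join(row[3] for row in csv.reader(io.StringIO(result.stdout)))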
