Write to files with dynamic file names? - python

I need to split a large txt file (about 100GB, 1 billion rows) by DATE. The file looks like this
ID*DATE*company
1111*201101*geico
1234*201402*travelers
3214*201003*statefarm
...
Basically there are 60 months so I should be getting 60 sub-files. My Python script is
with open("myBigFile.txt") as f:
    for line in f:
        claim = line.split("*")
        with open("DATE-"+str(claim[1])+".txt", "a") as fy:
            fy.write(claim[0]+"*"+claim[2]+"\n")
Now since the number of records is huge, this runs too slowly because it has to open and close a file for every row. So I'm thinking about first opening the 60 sub-files, then scanning the big file once and writing each row to the corresponding sub-file. The sub-files are not closed until all the rows have been scanned. However, since Python automatically closes a file whenever its reference is removed (http://blog.lerner.co.il/dont-use-python-close-files-answer-depends/), I have to use some kind of dynamic file name, something like
claim[1].write(claim[0]+"*"+claim[2]+"\n")
Note that you can't just assign the handle to a single name fy and call fy.write(claim[0]+"*"+claim[2]+"\n"), because the file would be closed whenever fy is reassigned. Is that possible in Python? Thanks!

How about something like this:
with open("myBigFile.txt") as f:
    subfiles = {}
    for line in f:
        claim = line.split("*")
        if not str(claim[1]) in subfiles:
            subfiles[str(claim[1])] = open("DATE-" + str(claim[1]) + ".txt", "a")
        subfiles[str(claim[1])].write(claim[0]+"*"+claim[2]+"\n")
I believe this should do it.
Just to mention, I have currently not placed a limit on the number of files open at a given moment. To implement that, simply check the size of the dictionary with len() and close all of the files, or just a few of them, before opening new ones.
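For example, a rough sketch of such a limit could look like this (MAX_OPEN is a cap you would pick yourself, e.g. below your OS file-descriptor limit; this is an illustration, not tested against the 100GB file):
import sys

MAX_OPEN = 50  # assumed cap, keep it below your OS file-descriptor limit

with open("myBigFile.txt") as f:
    subfiles = {}
    for line in f:
        claim = line.split("*")
        date = claim[1]
        if date not in subfiles:
            if len(subfiles) >= MAX_OPEN:
                # close everything; the files are opened in "a" mode,
                # so reopening one later simply appends to it
                for fh in subfiles.values():
                    fh.close()
                subfiles.clear()
            subfiles[date] = open("DATE-" + date + ".txt", "a")
        subfiles[date].write(claim[0] + "*" + claim[2] + "\n")
    for fh in subfiles.values():
        fh.close()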

Here's a solution that closes the file handles as part of a context manager. Unlike the other answers, this also closes the sub-files when an error happens :-)
from contextlib import contextmanager

@contextmanager
def file_writer():
    fp = {}

    def write(line):
        id, date, company = line.split('*')
        outdata = "{}*{}\n".format(id, company)
        try:
            fp[date].write(outdata)
        except KeyError:
            fname = 'DATE-{}.txt'.format(date)
            fp[date] = open(fname, 'a')  # should it be a+?
            fp[date].write(outdata)

    try:
        yield write
    finally:
        # close the sub-files even if an error happens
        for f in fp.values():
            f.close()
def process():
    with open("myBigFile.txt") as f:
        with file_writer() as write:
            for i, line in enumerate(f):
                try:
                    write(line)
                except:
                    print('the error happened on line %d [%s]' % (i, line))
I don't know if there is anything more that can be done speed-wise on a single processor/disk. You can always split the file into n chunks and use n processes to process a chunk each (where n is the number of separate disks you have available..)
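If you do go that route, a very rough multiprocessing sketch might look like the following (the chunk file names and the per-chunk output naming are assumptions for illustration; each process writes its own per-date files so two processes never share a file handle):
from multiprocessing import Pool

def process_chunk(chunk_path):
    handles = {}
    suffix = chunk_path.rsplit('.', 1)[0]
    with open(chunk_path) as f:
        for line in f:
            id_, date, company = line.rstrip('\n').split('*')
            if date not in handles:
                # one output file per (date, chunk) so processes never write to the same file
                handles[date] = open('DATE-{}-{}.txt'.format(date, suffix), 'a')
            handles[date].write('{}*{}\n'.format(id_, company))
    for h in handles.values():
        h.close()

if __name__ == '__main__':
    chunks = ['chunk_0.txt', 'chunk_1.txt', 'chunk_2.txt', 'chunk_3.txt']  # assumed pre-split chunk files
    with Pool(len(chunks)) as pool:
        pool.map(process_chunk, chunks)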

You can use the csv module to simplify slightly, and use a dictionary to store the file objects:
import csv

with open("myBigFile.txt") as big_file:
    reader = csv.reader(big_file, delimiter='*')
    subfiles = {}
    for id, date, company in reader:
        try:
            subfile = subfiles[date]
        except KeyError:
            subfile = open('DATE-{}.txt'.format(date), 'a')
            subfiles[date] = subfile
        subfile.write('{}*{}\n'.format(id, company))
    for subfile in subfiles.values():
        subfile.close()


Is there a way to read a file in reverse using with open in Python

I'm trying to read a file.out server file, but I only need the latest data within a datetime range.
Is it possible to read the file in reverse using with open() and one of its modes?
The a+ mode gives access to the end of the file:
``a+'' Open for reading and writing. The file is created if it does not
exist. The stream is positioned at the end of the file. Subsequent writes
to the file will always end up at the then current end of the file,
irrespective of any intervening fseek(3) or similar.
Is there a way to use a+ or another mode to access the end of the file and read a specific range?
Since the regular r mode reads the file from the beginning:
with open('file.out','r') as file:
I have tried using reversed():
for line in reversed(list(open('file.out').readlines())):
but it returns no rows for me.
Or are there other ways to read a file in reverse... any help appreciated.
EDIT
What I got so far:
import os
import time
from datetime import datetime as dt

start_0 = dt.strptime('2019-01-27','%Y-%m-%d')
stop_0 = dt.strptime('2019-01-27','%Y-%m-%d')
start_1 = dt.strptime('09:34:11.057','%H:%M:%S.%f')
stop_1 = dt.strptime('09:59:43.534','%H:%M:%S.%f')

os.system("touch temp_file.txt")
process_start = time.clock()
count = 0
print("reading data...")
for line in reversed(list(open('file.out'))):
    try:
        th = dt.strptime(line.split()[0],'%Y-%m-%d')
        tm = dt.strptime(line.split()[1],'%H:%M:%S.%f')
        if (th == start_0) and (th <= stop_0):
            if (tm > start_1) and (tm < stop_1):
                count += 1
                print("%d occurancies" % (count))
                os.system("echo '"+line.rstrip()+"' >> temp_file.txt")
        if (th == start_0) and (tm < start_1):
            break
    except KeyboardInterrupt:
        print("\nLast line before interrupt:%s" % (str(line)))
        break
    except IndexError as err:
        continue
    except ValueError as err:
        continue
process_finish = time.clock()
print("Done:" + str(process_finish - process_start) + " seconds.")
I'm adding these limitations so that when I find the rows it can at least print that the occurrences appeared, and then just stop reading the file.
The problem is that it's reading, but it's way too slow.
EDIT 2
(2019-04-29 9.34am)
All the answers I received work well for reading logs in reverse, but in my case (and maybe for other people), with a log that is n GB in size, Rocky's answer below suited me best.
The code that works for me:
(I only added a for loop to Rocky's code):
import collections

log_lines = collections.deque()
for line in open("file.out", "r"):
    log_lines.appendleft(line)
    if len(log_lines) > number_of_rows:
        log_lines.pop()
log_lines = list(log_lines)
for line in log_lines:
    print(str(line).split("\n"))
Thanks people, all the answers work.
-lpkej
There's no way to do it with open's parameters, but if you want to read the last part of a large file without loading the whole file into memory (which is what reversed(list(fp)) will do), you can use a two-pass solution.
LINES_FROM_END = 1000
with open(FILEPATH, "r") as fin:
    s = 0
    while fin.readline():  # fixed typo, readlines() will read everything...
        s += 1
    fin.seek(0)
    mylines = []
    for i, e in enumerate(fin):
        if i >= s - LINES_FROM_END:
            mylines.append(e)
This won't keep your whole file in memory. You can also reduce it to one pass by using collections.deque:
import collections

# one pass (a lot faster):
mylines = collections.deque()
for line in open(FILEPATH, "r"):
    mylines.appendleft(line)
    if len(mylines) > LINES_FROM_END:
        mylines.pop()
mylines = list(mylines)
# mylines will contain LINES_FROM_END lines from the end of the file.
Sure there is:
filename = 'data.txt'
for line in reversed(list(open(filename))):
    print(line.rstrip())
EDIT:
As mentioned in comments this will read the whole file into memory. This solution should not be used with large files.
Another option is to mmap.mmap the file, then use rfind from the end to search for the newlines, and then slice out the lines.
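A minimal sketch of that mmap idea (the function name and line count are mine, and this assumes the file can be memory-mapped on your platform):
import mmap

def tail_lines(path, n=10):
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            end = len(mm)
            pos = end
            for _ in range(n):
                # find the newline that precedes the current position
                new_pos = mm.rfind(b'\n', 0, pos - 1) if pos > 1 else -1
                if new_pos == -1:
                    pos = 0
                    break
                pos = new_pos + 1
            return mm[pos:end].decode().splitlines()

# e.g. print the last 10 lines of file.out
for line in tail_lines('file.out', 10):
    print(line)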
Hey m8, I have made this code, it works for me and I can read my file in reversed order. Hope it helps :)
I start by creating a new text file, so I don't know how much that is important for you.
def main():
    f = open("Textfile.txt", "w+")
    for i in range(10):
        f.write("line number %d\r\n" % (i+1))
    f.close()

def readReversed():
    for line in reversed(list(open("Textfile.txt"))):
        print(line.rstrip())

main()
readReversed()

Delete row from a huge csv file in python

I have a huge (240 MB) csv file in which the top 2 rows are junk data. I want to remove this junk data and use the data that starts after it.
I would like to know what the best options are. Since it's a large file, creating a copy of the file and editing it would be a time-consuming process.
Below is an example of the csv:
junk,,,
,,,,
No,name,place,destination
1,abx,India,SA
What I would like to have is
No,name,place,destination
1,abx,India,SA
You can do this with tail quite easily
tail -n+3 foo > result.data
You said the top 3 rows, but the example has removed only the top 2?
tail -n+2 foo > result.data
You can find more ways here
https://unix.stackexchange.com/questions/37790/how-do-i-delete-the-first-n-lines-of-an-ascii-file-using-shell-commands
Just throw those lines away.
Use DictReader to parse the header:
import csv

with open("filename") as fp:
    fp.readline()
    fp.readline()
    csvreader = csv.DictReader(fp, delimiter=',')
    for row in csvreader:
        pass  # your code here
Due to the way file systems work, you cannot simply delete the lines from the file directly. Any method to do so will necessarily involve rewriting the entire file with the offending lines removed.
To be safe, before deleting your old file, you'll want to store the new file temporarily until you are sure it has been successfully created. And if you want to avoid reading the entire large file into memory, you'll want to use a generator.
Here's a generator that returns every item from an iterable (such as a file-like object) after a certain number of items have already been returned:
def gen_after_x(iterable, x):
    # Python 3:
    yield from (item for index, item in enumerate(iterable) if index >= x)
    # Python 2 equivalent:
    # for index, item in enumerate(iterable):
    #     if index >= x:
    #         yield item
To make things simpler, we'll create a function to write the temporary file:
def write_file(fname, lines):
    with open(fname, 'w') as f:
        for line in lines:
            f.write(line + '\n')
We will also need the os.remove and os.rename functions from the os module to delete the source file and rename the temp file. And we'll need copyfile from shutil to make a copy, so we can safely delete the source file.
Now to put it all together:
from os import remove, rename
from shutil import copyfile

src_file = 'big_file'
tmp_file = 'big_file_temp'
skip = 2

with open(src_file) as fin:
    olines = gen_after_x(fin, skip)
    write_file(tmp_file, olines)

src_file_copy = src_file + '_copy'
copyfile(src_file, src_file_copy)

try:
    remove(src_file)
    rename(tmp_file, src_file)
    remove(src_file_copy)
except Exception:
    try:
        copyfile(src_file_copy, src_file)
        remove(src_file_copy)
        remove(tmp_file)
    except Exception:
        pass
    raise
However, I would note that 240 MB isn't such a huge file these days; you may find it faster to do this the usual way since it cuts down on repetitive disk writes:
src_file = 'big_file'
tmp_file = 'big_file_temp'
skip = 2

with open(src_file) as f:
    lines = f.readlines()
for _ in range(skip):
    lines.pop(0)
with open(tmp_file, 'w') as f:
    f.write('\n'.join(lines))

src_file_copy = src_file + '_copy'
copyfile(src_file, src_file_copy)

try:
    remove(src_file)
    rename(tmp_file, src_file)
    remove(src_file_copy)
except Exception:
    try:
        copyfile(src_file_copy, src_file)
        remove(src_file_copy)
        remove(tmp_file)
    except Exception:
        pass
    raise
...or if you prefer the more risky way:
with open(src_file) as f:
    lines = f.readlines()
for _ in range(skip):
    lines.pop(0)
with open(src_file, 'w') as f:
    f.write('\n'.join(lines))

Python - Script that appends rows; checks for duplicates before writing

I'm writing a script that has a for loop to extract a list of variables from each 'data_i.csv' file in a folder, then appends that list as a new row in a single 'output.csv' file.
My objective is to define the headers of the file once and then append data to the 'output.csv' container-file so it will function as a backlog for a standard measurement.
The first time I run the script it will add all the files in the folder. The next time I run it, I want it to only append files that have been added since. I thought one way of doing this would be to check for duplicates, but the code I found for that so far only searched for consecutive duplicates.
Do you have suggestions?
Here's what I have made so far:
import csv, os

# Find csv files
for csvFilename in os.listdir('.'):
    if not csvFilename.endswith('.csv'):
        continue

    # Read in csv file and choose certain cells
    csvRows = []
    csvFileObj = open(csvFilename)
    csvData = csv.reader(csvFileObj, delimiter=' ', skipinitialspace='True')
    csvLines = list(csvData)
    cellID = csvLines[4][3]
    # Read in several variables...
    csvRows = [cellID]
    csvFileObj.close()

    resultFile = open("Output.csv", 'a')  # open in 'append' mode
    wr = csv.writer(resultFile)
    wr.writerows([csvRows])
    csvFileObj.close()
    resultFile.close()
This is the final script after mgc's answer:
import csv, os

f = open('Output.csv', 'r+')
merged_files = csv.reader(f)
merged_files = list()
for csvFilename in os.listdir('.'):
    if not csvFilename.endswith('_spm.txt'):
        continue
    if csvFilename in merged_files:
        continue
    csvRows = []
    csvFileObj = open(csvFilename)
    csvData = csv.reader(csvFileObj, delimiter=' ', skipinitialspace='True')
    csvLines = list(csvData)
    waferID = csvLines[4][3]
    temperature = csvLines[21][2]
    csvRows = [waferID, temperature]
    merged_files.append(csvRows)
    csvFileObj.close()
wr = csv.writer(f)
wr.writerows(merged_files)
f.close()
You can keep track of the name of each file already handled. If this log file doesn't need to be human-readable, you can use pickle. At the start of your script, you can do:
import pickle

try:
    with open('merged_log', 'rb') as f:
        merged_files = pickle.load(f)
except FileNotFoundError:
    merged_files = set()
Then you can add a condition to skip files that were previously processed:
if filename in merged_files: continue
Then, when you are processing a file, you can do:
merged_files.add(filename)
And save your variable at the end of your script (so it will be used on the next run):
with open('merged_log', 'wb') as f:
    pickle.dump(merged_files, f)
(However, there are other options for your problem. For example, you can slightly change the name of each file once it has been processed, like changing the extension from .csv to .csv_, or move processed files into a subfolder, etc.)
Also, in the example in your question, I don't think you need to open (and close) your output file on each iteration of your for loop. Open it once before your loop, write what you have to write, then close it once you have left the loop.
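A minimal sketch of that structure, reusing the names from the question (the cell coordinates are kept from the question; newline='' is the usual Python 3 csv convention and is an assumption about your Python version):
import csv, os

with open('Output.csv', 'a', newline='') as resultFile:
    wr = csv.writer(resultFile)
    for csvFilename in os.listdir('.'):
        if not csvFilename.endswith('.csv'):
            continue
        with open(csvFilename) as csvFileObj:
            csvLines = list(csv.reader(csvFileObj, delimiter=' ', skipinitialspace=True))
        wr.writerow([csvLines[4][3]])  # cellID, as in the question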

How can I split a large csv file (7GB) in Python

I have a 7GB csv file which I'd like to split into smaller chunks, so it is readable and faster for analysis in Python on a notebook. I would like to grab a small set from it, maybe 250MB, so how can I do this?
You don't need Python to split a csv file. Using your shell:
$ split -l 100 data.csv
This would split data.csv into chunks of 100 lines.
I had to do a similar task, and used the pandas package:
import pandas as pd

for i, chunk in enumerate(pd.read_csv('bigfile.csv', chunksize=500000)):
    chunk.to_csv('chunk{}.csv'.format(i), index=False)
Here is a little python script I used to split a file data.csv into several CSV part files. The number of part files can be controlled with chunk_size (number of lines per part file).
The header line (column names) of the original file is copied into every part CSV file.
It works for big files because it reads one line at a time with readline() instead of loading the complete file into memory at once.
#!/usr/bin/env python3

def main():
    chunk_size = 9998  # lines

    def write_chunk(part, lines):
        with open('data_part_' + str(part) + '.csv', 'w') as f_out:
            f_out.write(header)
            f_out.writelines(lines)

    with open('data.csv', 'r') as f:
        count = 0
        header = f.readline()
        lines = []
        for line in f:
            count += 1
            lines.append(line)
            if count % chunk_size == 0:
                write_chunk(count // chunk_size, lines)
                lines = []
        # write remainder
        if len(lines) > 0:
            write_chunk((count // chunk_size) + 1, lines)

if __name__ == '__main__':
    main()
This graph shows the runtime difference of the different approaches outlined by other posters (on an 8 core machine when splitting a 2.9 GB file with 11.8 million rows of data into ~290 files).
The shell approach is from Thomas Orozco, the Python approach is from Roberto, the Pandas approach is from Quentin Febvre, and here's the Dask snippet:
import dask.dataframe as dd

# dtypes is a dict of column dtype overrides from the original post (not shown here)
ddf = dd.read_csv("../nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2015.csv", blocksize=10000000, dtype=dtypes)
ddf.to_csv("../tmp/split_csv_dask")
I'd recommend Dask for splitting files, even though it's not the fastest, because it's the most flexible solution (you can write out different file formats, perform processing operations before writing, easily modify compression formats, etc.). The Pandas approach is almost as flexible, but cannot perform processing on the entire dataset (like sorting the entire dataset before writing).
Bash / native Python filesystem operations are clearly quicker, but that's not what I'm typically looking for when I have a large CSV. I'm typically interested in splitting large CSVs into smaller Parquet files, for performant, production data analyses. I don't usually care if the actual splitting takes a couple of minutes more. I'm more interested in splitting accurately.
I wrote a blog post that discusses this in more detail. You can probably Google around and find the post.
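For the record, the Parquet variant of the Dask snippet above is essentially one extra call (this assumes pyarrow or fastparquet is installed; the paths and dtypes dict are the same placeholders as above):
import dask.dataframe as dd

ddf = dd.read_csv("../nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2015.csv", blocksize=10000000, dtype=dtypes)
ddf.to_parquet("../tmp/split_parquet")  # writes one Parquet file per partition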
See the Python docs on file objects (the object returned by open(filename)): you can choose to read a specified number of bytes, or use readline to work through one line at a time.
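For example, a rough sketch that copies about the first 250 MB (the target from the question) into a smaller file, stopping at a line boundary (file names are assumptions):
TARGET_BYTES = 250 * 1024 * 1024  # ~250 MB

with open('big.csv') as src, open('sample.csv', 'w') as dst:
    written = 0
    for line in src:
        dst.write(line)
        written += len(line)  # counts characters, close enough for a rough cut
        if written >= TARGET_BYTES:
            break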
Maybe something like this?
#!/usr/local/cpython-3.3/bin/python

import csv

divisor = 10
outfileno = 1
outfile = None

with open('big.csv', 'r') as infile:
    for index, row in enumerate(csv.reader(infile)):
        if index % divisor == 0:
            if outfile is not None:
                outfile.close()
            outfilename = 'big-{}.csv'.format(outfileno)
            outfile = open(outfilename, 'w')
            outfileno += 1
            writer = csv.writer(outfile)
        writer.writerow(row)
I agree with @jonrsharpe; readline should be able to read one line at a time even for big files.
If you are dealing with big csv files, might I suggest using pandas.read_csv. I often use it for the same purpose and always find it awesome (and fast). It takes a bit of time to get used to the idea of DataFrames, but once you get over that, it speeds up large operations like yours massively.
Hope it helps.
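If all you want is the small sample the question asks for, a sketch like this may already be enough (the file name and row count are placeholders):
import pandas as pd

# grab roughly the first million rows instead of reading the whole 7 GB file
sample = pd.read_csv('bigfile.csv', nrows=1000000)
sample.to_csv('sample.csv', index=False)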
Here is my code, which might help:
import os
import pandas as pd
import uuid

class FileSettings(object):
    def __init__(self, file_name, row_size=100):
        self.file_name = file_name
        self.row_size = row_size

class FileSplitter(object):
    def __init__(self, file_settings):
        self.file_settings = file_settings
        if type(self.file_settings).__name__ != "FileSettings":
            raise Exception("Please pass correct instance ")
        self.df = pd.read_csv(self.file_settings.file_name,
                              chunksize=self.file_settings.row_size)

    def run(self, directory="temp"):
        try:
            os.makedirs(directory)
        except Exception as e:
            pass
        counter = 0
        while True:
            try:
                file_name = "{}/{}_{}_row_{}_{}.csv".format(
                    directory, self.file_settings.file_name.split(".")[0],
                    counter, self.file_settings.row_size, uuid.uuid4().__str__()
                )
                df = next(self.df).to_csv(file_name)
                counter = counter + 1
            except StopIteration:
                break
            except Exception as e:
                print("Error:", e)
                break
        return True

def main():
    helper = FileSplitter(FileSettings(
        file_name='sample1.csv',
        row_size=10
    ))
    helper.run()

main()
In the case of wanting to split by rough boundaries in bytes, where the newest datapoints are the bottom-most ones and you want to put the newest datapoints in the first file:
from pathlib import Path

TEN_MB = 10000000
FIVE_MB = 5000000

def split_file_into_chunks(path, chunk_size=TEN_MB):
    path = str(path)
    output_prefix = path.rpartition('.')[0]
    output_ext = path.rpartition('.')[-1]

    with open(path, 'rb') as f:
        seek_positions = []
        for x, line in enumerate(f):
            if not x:
                header = line
            seek_positions.append(f.tell())

        part = 0
        last_seek_pos = seek_positions[-1]
        for seek_pos in reversed(seek_positions):
            if last_seek_pos - seek_pos >= chunk_size:
                with open(f'{output_prefix}.arch.{part}.{output_ext}', 'wb') as f_out:
                    f.seek(seek_pos)
                    f_out.write(header)
                    f_out.write(f.read(last_seek_pos - seek_pos))
                last_seek_pos = seek_pos
                part += 1

        with open(f'{output_prefix}.arch.{part}.{output_ext}', 'wb') as f_out:
            f.seek(0)
            f_out.write(f.read(last_seek_pos))

    Path(path).rename(path + '~')
    Path(f'{output_prefix}.arch.0.{output_ext}').rename(path)
    Path(path + '~').unlink()

fast method in Python to split a large text file using number of lines as input variable

I am splitting a text file using the number of lines as a variable. I wrote this function in order to save the split files in a temporary directory. Each file has 4 million lines except the last file.
import os
import tempfile
from itertools import groupby, count

temp_dir = tempfile.mkdtemp()

def tempfile_split(filename, temp_dir, chunk=4000000):
    with open(filename, 'r') as datafile:
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            output_name = os.path.normpath(os.path.join(temp_dir + os.sep, "tempfile_%s.tmp" % k))
            for line in group:
                with open(output_name, 'a') as outfile:
                    outfile.write(line)
The main problem is the speed of this function. Splitting one file of 8 million lines into two files of 4 million lines takes more than 30 minutes on my Windows OS with Python 2.7.
for line in group:
    with open(output_name, 'a') as outfile:
        outfile.write(line)
is opening the file, and writing one line, for each line in group.
This is slow.
Instead, write once per group.
with open(output_name, 'a') as outfile:
    outfile.write(''.join(group))
Just did a quick test with an 8 million line file (uptime lines) to run the length of the file and split the file in half. Basically, one pass to get the line count, a second pass to do the split write.
On my system, the time it took to perform the first pass was about 2-3 seconds. To complete the run and the write of the split file(s), the total time was under 21 seconds.
Did not implement the lambda functions in the OP's post. Code used below:
#!/usr/bin/env python
import sys
import math

infile = open("input", "r")
linecount = 0
for line in infile:
    linecount = linecount + 1
splitpoint = linecount / 2
infile.close()

infile = open("input", "r")
outfile1 = open("output1", "w")
outfile2 = open("output2", "w")

print linecount, splitpoint

linecount = 0
for line in infile:
    linecount = linecount + 1
    if (linecount <= splitpoint):
        outfile1.write(line)
    else:
        outfile2.write(line)

infile.close()
outfile1.close()
outfile2.close()
No, it's not going to win any performance or code elegance tests. :) But short of something else being a performance bottleneck, the lambda functions causing the file to be cached in memory and forcing a swap issue, or the lines in the file being extremely long, I don't see why it would take 30 minutes to read/split the 8 million line file.
EDIT:
My environment: Mac OS X, storage was a single FW800 connected hard drive. File was created fresh to avoid filesystem caching benefits.
You can use tempfile.NamedTemporaryFile directly in the context manager:
import tempfile
import time
from itertools import groupby, count

def tempfile_split(filename, temp_dir, chunk=4*10**6):
    fns = {}
    with open(filename, 'r') as datafile:
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            with tempfile.NamedTemporaryFile(delete=False,
                    dir=temp_dir, prefix='{}_'.format(str(k))) as outfile:
                outfile.write(''.join(group))
                fns[k] = outfile.name
    return fns

def make_test(size=8*10**6+1000):
    with tempfile.NamedTemporaryFile(delete=False) as fn:
        for i in xrange(size):
            fn.write('Line {}\n'.format(i))
    return fn.name

fn = make_test()
t0 = time.time()
print tempfile_split(fn, tempfile.mkdtemp()), time.time() - t0
On my computer (OS X), the tempfile_split part runs in 3.6 seconds.
If you're in a Linux or Unix environment you could cheat a little and use the split command from inside Python. It does the trick for me, and very fast too:
import os
import subprocess

def split_file(file_path, chunk=4000):
    p = subprocess.Popen(['split', '-a', '2', '-l', str(chunk), file_path,
                          os.path.dirname(file_path) + '/'],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    p.communicate()
    # Remove the original file if required
    try:
        os.remove(file_path)
    except OSError:
        pass
    return True
