I'm trying to split up a very large text file into multiple smaller ones. When I run the code below, the first created file is correct. Everything after that just contains the 'INSERT INTO ...' string and nothing else. Thanks in advance
import math

interval = 100000

# count the lines in the file first
with open('my-big-file', 'r') as c:
    for i, l in enumerate(c):
        pass
length = i + 1

numOfFiles = int(math.ceil(length / interval))

with open('my-big-file', 'r') as c:
    for j in range(0, numOfFiles):
        with open('my-smaller-file_{}.sql'.format(j), 'w') as n:
            print >> n, 'INSERT INTO codes (code, some-field, some-other-field) VALUES'
            for i, line in enumerate(c):
                if i >= j * interval and i < (j + 1) * interval:
                    line = line.rstrip()
                    if not line: continue
                    print >> n, "(%s,'something','something else')," % (line)
                else:
                    break
You don't need to count the number of lines before iterating the file; you can write to a new file whenever you reach the given number of lines:
#!/usr/bin/env python
def split(fn, num=1000, suffix="_%03d"):
    import os
    full, ext = os.path.splitext(fn)
    with open(fn, 'r') as f:
        for i, l in enumerate(f):
            if i % num == 0:
                # close the previous chunk (if any) and start a new output file
                try:
                    out.close()
                except UnboundLocalError:
                    pass
                out = open(full + suffix % (i // num) + ext, 'w')
            out.write(l)
        else:
            out.close()

if __name__ == '__main__':
    import sys
    split(sys.argv[1])
You can run this from the command line, though the Unix split command is probably more useful since it supports a multitude of options.
It's also possible to rewrite this code to use with for the files being written to, but that's another topic.
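For completeness, here is a rough sketch of what that with-based rewrite could look like (not part of the original answer); it uses itertools.islice to pull out one chunk of lines at a time so each output file can live in its own with block:
import os
from itertools import islice

def split_with(fn, num=1000, suffix="_%03d"):
    # Sketch: same splitting behaviour, but every output file is opened in a with block.
    full, ext = os.path.splitext(fn)
    with open(fn) as f:
        chunk = 0
        while True:
            lines = list(islice(f, num))  # next block of at most num lines
            if not lines:
                break  # end of input reached
            with open(full + suffix % chunk + ext, 'w') as out:
                out.writelines(lines)
            chunk += 1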
Related
We have a large raw data file that we would like to trim to a specified size.
How would I go about getting the first N lines of a text file in python? Will the OS being used have any effect on the implementation?
Python 3:
with open(path_to_file) as input_file:
    head = [next(input_file) for _ in range(lines_number)]
    print(head)
Python 2:
with open(path_to_file) as input_file:
    head = [next(input_file) for _ in xrange(lines_number)]
    print head
Here's another way (both Python 2 & 3):
from itertools import islice
with open(path_to_file) as input_file:
    head = list(islice(input_file, lines_number))
    print(head)
N = 10
with open("file.txt", "r") as file:  # open for reading; append mode ("a") would not be readable
    for i in range(N):
        line = next(file).strip()
        print(line)
If you want to get the first lines quickly and you don't care too much about performance, you can use .readlines(), which returns a list object, and then slice the list.
E.g. for the first 5 lines:
with open("pathofmyfileandfileandname") as myfile:
    firstNlines = myfile.readlines()[0:5]  # put here the interval you want
print firstNlines
Note: the whole file is read, so this is not the best option from a performance point of view, but it is easy to use, fast to write and easy to remember, so it is very convenient for one-off calculations.
One advantage compared to the other answers is that you can easily select a range of lines, e.g. skipping the first 10 lines with [10:30], dropping the last 10 with [:-10], or taking only every other line with [::2].
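For instance, a quick sketch of those slicing variants (the file name is the same placeholder as above):
with open("pathofmyfileandfileandname") as myfile:
    all_lines = myfile.readlines()
middle = all_lines[10:30]     # skip the first 10 lines, keep up to line 30
trimmed = all_lines[:-10]     # drop the last 10 lines
every_other = all_lines[::2]  # every other line, starting with the first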
What I do is read the first N lines using pandas. The performance is probably not the best, but for example if N=1000:
import pandas as pd
yourfile = pd.read_csv('path/to/your/file.csv',nrows=1000)
File objects don't expose a method for reading a specific number of lines.
I guess the easiest way would be the following:
lines = []
with open(file_name) as f:
    lines.extend(f.readline() for i in xrange(N))
The two most intuitive ways of doing this would be:
Iterate on the file line-by-line, and break after N lines.
Iterate on the file line-by-line using the next() method N times. (This is essentially just a different syntax for what the top answer does.)
Here is the code:
# Method 1:
with open("fileName", "r") as f:
    counter = 0
    for line in f:
        print line
        counter += 1
        if counter == N: break

# Method 2:
with open("fileName", "r") as f:
    for i in xrange(N):
        line = f.next()
        print line
The bottom line is, as long as you don't use readlines() or otherwise load the whole file into memory, you have plenty of options.
Based on gnibbler's top-voted answer (Nov 20 '09 at 0:27): this class adds head() and tail() methods to file objects (Python 2 only, since it subclasses the built-in file type).
class File(file):
    def head(self, lines_2find=1):
        self.seek(0)  # rewind file
        return [self.next() for x in xrange(lines_2find)]

    def tail(self, lines_2find=1):
        self.seek(0, 2)  # go to end of file
        bytes_in_file = self.tell()
        lines_found, total_bytes_scanned = 0, 0
        while (lines_2find + 1 > lines_found and
               bytes_in_file > total_bytes_scanned):
            byte_block = min(1024, bytes_in_file - total_bytes_scanned)
            self.seek(-(byte_block + total_bytes_scanned), 2)
            total_bytes_scanned += byte_block
            lines_found += self.read(byte_block).count('\n')
        self.seek(-total_bytes_scanned, 2)
        line_list = list(self.readlines())
        return line_list[-lines_2find:]
Usage:
f = File('path/to/file', 'r')
f.head(3)
f.tail(3)
The most convenient way, in my opinion:
LINE_COUNT = 3
print [s for (i, s) in enumerate(open('test.txt')) if i < LINE_COUNT]
Solution based on List Comprehension
The file object returned by open() supports an iteration interface. enumerate() wraps it and returns (index, item) tuples; we then check that we're inside the accepted range (if i < LINE_COUNT) and simply print the result.
Enjoy the Python. ;)
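If you also want the file handle closed promptly, the same comprehension works inside a with block (a small sketch, not part of the original answer; note that it still iterates over the whole file):
LINE_COUNT = 3
with open('test.txt') as f:
    head = [s for i, s in enumerate(f) if i < LINE_COUNT]
print(head)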
For the first 5 lines, simply do:
N = 5
with open("data_file", "r") as file:
    for i in range(N):
        print file.next()
If you want something that obviously works (without looking up esoteric stuff in manuals), needs no imports or try/except, and runs on a fair range of Python 2.x versions (2.2 to 2.6):
def headn(file_name, n):
    """Like *x head -N command"""
    result = []
    nlines = 0
    assert n >= 1
    for line in open(file_name):
        result.append(line)
        nlines += 1
        if nlines >= n:
            break
    return result

if __name__ == "__main__":
    import sys
    rval = headn(sys.argv[1], int(sys.argv[2]))
    print rval
    print len(rval)
If you have a really big file, and assuming you want the output to be a numpy array, using np.genfromtxt will freeze your computer. This is so much better in my experience:
import numpy as np

def load_big_file(fname, maxrows):
    '''Only works for a well-formed text file of space-separated doubles.'''
    rows = []  # unknown number of lines, so use a list
    with open(fname) as f:
        j = 0
        for line in f:
            if j == maxrows:
                break
            else:
                line = [float(s) for s in line.split()]
                rows.append(np.array(line, dtype=np.double))
                j += 1
    return np.vstack(rows)  # convert the list of vectors to a 2-D array
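A possible call, using a purely illustrative file name for a file of whitespace-separated numbers:
data = load_big_file('measurements.txt', maxrows=1000)  # 'measurements.txt' is hypothetical
print(data.shape)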
This worked for me
f = open("history_export.csv", "r")
line = 5
for x in range(line):
    a = f.readline()
    print(a)
I would also like to handle files with fewer than n lines, by reading the whole file in that case:
def head(filename: str, n: int):
    try:
        with open(filename) as f:
            head_lines = [next(f).rstrip() for x in range(n)]
    except StopIteration:
        with open(filename) as f:
            head_lines = f.read().splitlines()
    return head_lines
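A possible usage, with an illustrative file name:
first_five = head('example.txt', 5)  # 'example.txt' is just a placeholder
print(first_five)  # at most 5 lines; fewer if the file is shorter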
Credit goes to John La Rooy and Ilian Iliev. Use the function for the best performance, with exception handling included.
Revision 1: Thanks to FrankM for the feedback; to handle file existence and read permission we can further add:
import errno
import os

def head(filename: str, n: int):
    if not os.path.isfile(filename):
        raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), filename)
    if not os.access(filename, os.R_OK):
        raise PermissionError(errno.EACCES, os.strerror(errno.EACCES), filename)
    try:
        with open(filename) as f:
            head_lines = [next(f).rstrip() for x in range(n)]
    except StopIteration:
        with open(filename) as f:
            head_lines = f.read().splitlines()
    return head_lines
You can either go with the second version, or go with the first one and handle the file exceptions later. The check is quick and mostly free from a performance standpoint.
Starting with Python 2.6, you can take advantage of more sophisticated functions in the IO base class. So the top-rated answer above can be rewritten as:
with open("datafile") as myfile:
    head = myfile.readlines(N)
print head
(You don't have to worry about your file having fewer than N lines, since no StopIteration exception is thrown. Note that the argument to readlines() is a size hint in bytes rather than an exact line count, so this returns complete lines totalling roughly N bytes.)
This works for Python 2 & 3:
from itertools import islice
with open('/tmp/filename.txt') as inf:
    for line in islice(inf, N, N+M):
        print(line)
fname = input("Enter file name: ")
num_lines = 0

with open(fname, 'r') as f:  # count the lines
    for line in f:
        num_lines += 1

num_lines_input = int(input("Enter line numbers: "))

if num_lines_input <= num_lines:
    f = open(fname, "r")
    for x in range(num_lines_input):
        a = f.readline()
        print(a)
else:
    f = open(fname, "r")
    for x in range(num_lines):  # the file is shorter, so print as much as possible
        a = f.readline()
        print(a)
    print("Don't have", num_lines_input, "lines; printed as much as possible")
    print("Total lines in the text:", num_lines)
Here's another decent solution with a list comprehension:
file = open('file.txt', 'r')
lines = [next(file) for x in range(3)] # first 3 lines will be in this list
file.close()
An easy way to get the first 10 lines:
with open('fileName.txt', mode='r') as file:
    lines = [line.rstrip('\n') for line in file][:10]
    print(lines)
#!/usr/bin/python
import subprocess
p = subprocess.Popen(["tail", "-n 3", "passlist"], stdout=subprocess.PIPE)
output, err = p.communicate()
print output
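Note that tail -n 3 returns the last three lines; for the first N lines the same pattern works with the head command (a sketch assuming a Unix-like system and the same passlist file):
import subprocess
p = subprocess.Popen(["head", "-n", "3", "passlist"], stdout=subprocess.PIPE)
output, err = p.communicate()
print(output)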
This method worked for me.
Simply convert your CSV reader object to a list using list(file_data):
import csv

with open('your_csv_file.csv') as file_obj:
    file_data = csv.reader(file_obj)
    file_list = list(file_data)
    for row in file_list[:4]:
        print(row)
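If the CSV file is large, you can avoid building the full list by slicing the reader directly with itertools.islice (a sketch, not part of the original answer):
import csv
from itertools import islice

with open('your_csv_file.csv') as file_obj:
    reader = csv.reader(file_obj)
    for row in islice(reader, 4):  # only the first 4 rows are read
        print(row)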
I have 1000 files in a folder, named md_1.mdp, md_2.mdp, ..., md_1000.mdp, and the 186th line of every file reads:
gen_seed = 35086
This value is different in every file and it is what I want to extract and print as the output.
I have written the following code but it is not displaying any output.
import numpy as np

idx = np.arange(1, 1000)
for i in idx:
    f = open('/home/abc/xyz/mdp_200/md_' + str(i) + '.mdp', 'r')
    l = f.readlines()
    l = l[185].split(" ")
    flag = 0
    for k in l:
        if flag == 1:
            if k != '':
                print(k)
                flag = 0
        if k == "t=":
            flag = 1
    f.close()
What should I add to this program so that it prints the required value for each file one by one in the order of md_1.mdp, md_2.mdp and so on?
You can use:
for i in range(1, 1001):
    with open('/home/abc/xyz/mdp_200/md_' + str(i) + '.mdp') as fp:
        l = fp.readlines()
    print(l[185].split('=')[-1].strip())
Or you can use linecache.getline:
import linecache

for i in range(1, 1001):
    file = f'/home/abc/xyz/mdp_200/md_{i}.mdp'
    line = linecache.getline(file, 186)  # getline is 1-indexed, so 186 is the 186th line
    print(line.split('=')[-1].strip())
After you get your line, the split is done on the '=' character and the last piece is stripped of whitespace.
I know it's not completely finished, but I'm very confused about how to format the save_inventory function so that it writes the file in the same layout as the original text file. During the add_item function, it shows that the item has been added to the lists, but when it comes to writing, nothing is there or updated.
Example of how the text file needs to look
def save_inventory(inventoryFile, descriptionArray, quantityArray, priceArray, intrecords):
    outfile = open(inventoryFile, "w")
    with open('inventory1.txt', 'r') as f:
        count = -1
        for line in f:
            count += 1
            if count % 3 == 0:  # this is the remainder operator
                outfile.write(descriptionArray)
                print(descriptionArray)
    with open('inventory1.txt', 'r') as f:
        count = -2
        for line in f:
            count += 1
            if count % 3 == 0:  # this is the remainder operator
                outfile.write(str(quantityArray))
                print(quantityArray)
    with open('inventory1.txt', 'r') as f:
        count = -3
        for line in f:
            count += 1
            if count % 3 == 0:  # this is the remainder operator
                outfile.write(str(priceArray))
                print(priceArray)
    outfile.close()
You are only writing to the file when you have read a line. If your text file is empty you will never write to the file.
What I would do is zip the lists together and loop through them, then write three lines to the file for each pass through the loop. You can write a newline with '\n':
with open(inventoryFile, 'w') as f:
    for d, q, p in zip(descriptionArray, quantityArray, priceArray):
        f.write('%s\n%s\n%s\n' % (d, q, p))
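For example, with three short parallel lists (the sample values below are made up for illustration), the loop writes each record as three consecutive lines, matching the original layout:
descriptionArray = ['Widget', 'Gadget']  # sample data, not from the question
quantityArray = [10, 4]
priceArray = [2.50, 13.99]

with open('inventory1.txt', 'w') as f:
    for d, q, p in zip(descriptionArray, quantityArray, priceArray):
        f.write('%s\n%s\n%s\n' % (d, q, p))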
I wrote a Python script to process text files.
The input is a file with several lines. At the beginning of each line there is a number (1, 2, 3, ..., n), then an empty line, and finally a last line with some text on it.
I need to read through this file to delete some lines at the beginning and some at the end (say numbers 1 to 5 and then 78 to the end). I want to write the remaining lines to a new file (in a new directory) and renumber the numbers at the start of these lines (in my example, 6 would become 1, 7 would become 2, etc.).
I wrote the following:
def treatFiles(oldFile, newFile, firstF, startF, lastF):
    # firstF is simply an index
    # startF corresponds to the first line I want to keep
    # lastF corresponds to the last line I want to keep
    numberFToDeleteBeginning = int(startF) - int(firstF)
    with open(oldFile) as old, open(newFile, 'w') as new:
        countLine = 0
        for line in old:
            countLine += 1
            if countLine <= numberFToDeleteBeginning:
                pass
            elif countLine > int(lastF) - int(firstF):
                pass
            elif line.split(',')[0] == '\n':
                newLineList = line.split(',')
                new.write(line)
            else:
                newLineList = [str(countLine - numberFToDeleteBeginning)] + line.split(',')
                del newLineList[1]
                newLine = str(newLineList[0])
                for k in range(1, len(newLineList)):
                    newLine = newLine + ',' + str(newLineList[k])
                new.write(newLine)

if __name__ == '__main__':
    from sys import argv
    import os
    os.makedirs('treatedFiles')
    newFile = 'treatedFiles/' + argv[1]
    treatFiles(argv[1], newFile, argv[2], argv[3], argv[4])
My code works correctly but is far too slow (I have files of about 10Gb to treat and it's been running for hours).
Does anyone know how I can improve it?
I would get rid of the for loop in the middle and the expensive .split():
from itertools import islice

def treatFiles(old_file, new_file, index, start, end):
    with open(old_file, 'r') as old, open(new_file, 'w') as new:
        sliced_file = islice(old, start - index, end - index)
        for line_number, line in enumerate(sliced_file, start=1):
            number, rest = line.split(',', 1)
            if number == '\n':
                new.write(line)
            else:
                new.write(str(line_number) + ',' + rest)
Also, convert your three numerical arguments to integers before passing them into the function:
treatFiles(argv[1], newFile, int(argv[2]), int(argv[3]), int(argv[4]))