I have a file which I want to split into many other parts. I want to use Python code.
E.g. the data in my file is like this:
>2165320 21411 200802 8894-,...,765644-
TTCGGAGCTTACTAATTTTAAATATGAAGAATGCCAATATAAGTTTTGATTTCGAAAATACTTTTTTACTAGTTAAAAATTCATGATTTTCTACATCTATAACAATTTGTGTTTTTTTTAAACATCTTCCAGTGTCCTAAGTGTATATTTTTTAACGCAATGTTTGAATACTTTTAGGGTTTACCTTATTTAATTTGATTTTTAATGTGAGTTGTAATCACTGGTGAGCATACTGTTTTTCTTTTGTTCAGTAATATTGCATTTGTAGCTTTTGTATTGCTTAGATATATCACATTAAATCCTTTGTTCAGAAACCCATCCGACAGGGAGTCATAGGTGCCACACTAGTGGTCGAGGATCTAGGATGTCGGAAGGTCAACAATGGGGTAAAACACTAATTTTTTAATTTCTTGTATTTACCAAATTTACTGATTTTGCATTTAGTAGATGGTATATATACTCTTCTACCTTGTACAGTTGATGGTACCTGACTAAATATGTTTTATTTCCTTCTCCAGGATCTTTATGTAGTACGATTCTACAGTCGTCAAGAGGAGGGTAGAAAAGGAGAAGTAAGTTATAATATTTCTGAGCTTTTTTCTTTTTAATTGTTGTTGATAGAAAGTTGTGCCATATACATGTTTTAAGGTGGTGTA
>2165799 14641 135356 16580+,...,680341-
AAGGTAGGAGGTACTCGTGCTAATGGAGGAGCTAATGGTACACCAAACCGACGGCTGTCACTTAATGCTCATCAAAACGGAAGCAGGTCCACAACAAAAGATGGAAAAAAAGACATCAGACCAGTTGCTCCTGTGAATTATGTGGCCATATCAAAAGAAGATGCTGCTTCCCATGTTTCTGGTACCGAACCAATCCCGGCATCACCCTAATAATGAGATCTTCATTATCAACCCTACAATTTCATCTTTGTAGCATGATCAAATACTAGTTACTGCTTTAGGAATTATAATATGGAGTGACAAGTAATTAGAGAGGAACTGTTTTGAGCTGTGTATGTTCAATTTGCCATTTGGAGGTTTTCTCAATACATGTGCCCTTTAATATGAAAATATAGTGCTATTCTTGCCTTTCTCCAAACCCTGGCTCCTCCTATTCATCGGTTTCTT
>2169677 23891 1928391 1298391,…..,739483-
CTAGCTGATCGAGCTGATCGTAGTGAGCTATCGAGCTGACTACTAGCTAGTCGTGATAGCTGATCGAGCTGACTGATGTGCTAGTAGTAGTTTCATGATTTTCTACATCTATAACAATTTGTGTTTTTTTTAAACATCTTCCAGTGTCCTAAGTGTATATTTTTTAACGCAATGTTTGAATACTTTTAGGGTTTACCTTATTTAATTTGATTTTTAATGTGAGTTGTAATCACTGGTGAGCATACTGTTTTTCTTTTGTTCAGTAATATTGCATTTGTAGCTTTTGTATTGCTTAGATATATCACATTAAATCCTTTGTTCAGAAACCCATCCGACAGGGAGTCATAGGTGCCACACTAGTGGTCGAGGATCTAGGATGTCGGAAGGTCAACAATGGGGTAAAACACTAATTTTTTAATTTCTTGTATTTACCAAATTTACTGATTTTGCATTTAGTAGATGGTATATATACTCTTCTACCTTGTACAGTTGATGGTACCTGACTAAATATGTTTTATTTCCTTCTCCAGGATCTTTATGTAGTACGATTCTACAGTCGTCAAGAGGAGGGTAGAAAAGGAGAAGTAAGTTATAATATTTCTGAGCTTTTTTCTTTTTAATTGTTGTTGATAGAAAGTTGTGCCATATACATGTTTTA
And so on.
So now I want to split the file from one '>' sign to the next and store each part in a separate file.
Like 1st file will have
>2165320 21411 200802 8894-,...,765644-
TTCG…..GTA
data.
2nd file will have
>2165799 14641 135356 16580+,...,680341-
AAGG….GTTTCTT
data and so on.
To write each blank-line-separated group of lines to a separate file you could use itertools.groupby():
#!/usr/bin/env python
import sys
from itertools import groupby

def blank(line, mark=[0]):
    if not line.strip():  # blank line
        mark[0] ^= 1      # mark the start of a new group
    return mark[0]

for i, (_, group) in enumerate(groupby(sys.stdin, blank), start=1):
    with open("group-%03d.txt" % (i,), "w") as outfile:
        outfile.writelines(group)
Usage:
$ python split-on-blank.py < input_file.txt
If you work with such formats often, consider using a proper parser, such as the Bio.SeqIO.parse() function from Biopython.
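For example, a minimal sketch with Biopython (assuming the input really is FASTA-formatted and that the biopython package is installed) that writes each record to its own numbered file:

from Bio import SeqIO

# Each FASTA record (header line plus sequence) is written to its own file.
for i, record in enumerate(SeqIO.parse("input_file.txt", "fasta"), start=1):
    SeqIO.write(record, "group-%03d.fasta" % i, "fasta")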
It seems your data is just newline separated, so all you would need to do is loop over the lines and write the non-blank ones to incrementing files:
with open("source.txt") as f:
counter = 1
for line in f:
if not line.strip():
continue
with open("out_%03d.txt" % counter, 'w') as out:
out.write(line)
counter += 1
This assumes that each group is really one long line (the real format wasn't clear to me).
Because you haven't given us much explanation about the real format of this file, here is another option, in case those really are newline characters between lines that should end up in the same file. If ">" is a solid indicator of a new group, we can just use it to start a new file:
with open("source.txt") as f:
counter = 1
out = None
for line in f:
if line.lstrip().startswith("#"):
if out is not None:
out.close()
out_name = "out_%03d.txt" % counter
counter += 1
out = open(out_name, 'w')
out.write(line)
if out is not None:
out.close()
with open("source.txt") as f:
counter = 1
for line in f:
if counter % 3 == 0:
continue
with open("out_%03d.txt" % counter, 'w') as out:
out.write(line)
counter += 1
Python noob here. I've been smashing my head trying to do this, tried several Unix tools and I'm convinced that python is the way to go.
I have two files, File1 has headers and numbers like this:
>id1
77
>id2
2
>id3
2
>id4
22
...
Note that the id is unique, but the number assigned to it may repeat. I have several files like this, all with the same number of headers (~500).
File2 has all numbers of File1 and an appended sequence
1
ATCGTCATA
2
ATCGTCGTA
...
22
CCCGTCGTA
...
77
ATCGTCATA
...
Note that the sequence id is unique, as are all the sequences after it. I have the same number of files as for File1, but the number of sequences within each File2 may vary (~150).
My desired output is File1 with the sequences from File2; it is important that File1 maintains its original order.
>id1
ATCGTCATA
>id2
ATCGTCGTA
>id3
ATCGTCGTA
>id4
CCCGTCGTA
My approach is to extract the numbers from File1 and use them as a pattern to match in File2. First I am trying to make this work with only a pair of files. Here is what I have achieved:
#!/usr/bin/env python
import re

datafile = 'protein2683.fasta.txt.named'
schemaseqs = 'protein2683.fasta'

with open(datafile, 'r') as f:
    datafile_lines = set([line.strip() for line in f])  # maybe I could use regex to get only lines with number as pattern?

print(datafile_lines)

outputlist = []
with open(schemaseqs, 'r') as f:
    for line in f:
        seqs = line.split(',')[0]
        if seqs[1:-1] in datafile_lines:
            outputlist.append(line)

print(outputlist)
This outputs a mix of patterns from File1 and the sequences from File2. Any help is appreciated.
PS: I am open to modifications in the file structure; I tried substituting the \n in File2 with "," to no avail.
datafile = 'protein2683.fasta.txt.named'
schemaseqs = 'protein2683.fasta'

d = {}
prev = None
with open(datafile, 'r') as f:
    i = 0
    for line in f:
        if i % 2 == 0:
            d[line.strip()] = 0
            prev = line.strip()
        else:
            d[prev] = line.strip()
        i += 1

new_d = {}
with open(schemaseqs, 'r') as f:
    i = 0
    prev = None
    for line in f:
        if i % 2 == 0:
            new_d[line.strip()] = 0
            prev = line.strip()
        else:
            new_d[prev] = line.strip()
        i += 1

for key, value in d.items():
    if value in new_d:
        d[key] = new_d[value]

print(d)

with open(datafile, 'w') as filee:
    for k, v in d.items():
        filee.writelines(k)
        filee.writelines('\n')
        filee.writelines(v)
        filee.writelines('\n')
Creating the two dictionaries is easy, and then you just map the values from one onto the other.
Since the files are so neatly organized, I wouldn't use a set to store the lines. Sets don't enforce order, and the order of these lines conveys a lot of information. I also wouldn't use Regex; it's probably overkill for the task of parsing individual lines, but not powerful enough to keep track of which ID corresponds to each gene sequence.
Instead, I would read the files in the opposite order. First, read the file with the gene sequences and build a mapping of IDs to genes. Then read in the first file and replace each id with the corresponding value in that mapping.
If the IDs are a continuous sequence (1, 2, 3... n, n+1), then a list is probably the easiest way to store them. If the file is already in order, you don't even have to pay attention to the ID numbers; you can just skip every other row and append each gene sequence to an array in order. If they aren't continuous, you can use a dictionary with the IDs as keys. I'll use the dictionary approach for this example:
id_to_gene_map = {}

with open(file2, 'r') as id_to_gene_file:
    for line_number, line in enumerate(id_to_gene_file, start=1):
        if line_number % 2 == 1:  # Update ID on odd numbered lines, including line 1
            current_id = line
        else:
            id_to_gene_map[current_id] = line  # Map previous line's ID to this line's value

with open(file1, 'r') as input_file, open('output.txt', 'w') as output_file:
    for line in input_file:
        if not line.startswith(">"):  # Keep ">id1" lines unchanged
            line = id_to_gene_map[line]  # Otherwise, replace with the corresponding gene
        output_file.write(line)
In this case, the IDs and values both have trailing newlines. You can strip them out, but since you'll want to add them back in for writing the output file, it's probably easiest to leave them alone.
I have a similar question to delete multiple line
I want to delete the line and the next 4 lines. This is my code:
bind = open('/etc/bind/named.conf.local', 'r')
a = dict['name']
for line in bind:
    if a in line:
        print('line exist')
        ''' and delete this line and 4 line after it'''
    else:
        print('line does not exist')
I want to save the modified text in /etc/bind/named.conf.local in place, without fileinput. I do not want to skip the 4 lines; I want to delete them from the file. I do not want to read it and write it again while skipping 4 lines.
What should I do?
I think the following code does what you're looking for. You will have to adjust the settings filename, keyword and delete to your needs. The code removes delete lines from the file filename every time keyword is found in a line (including the keyword line itself).
# Settings
filename = "test.txt"
keyword = "def"
delete = 2

# Read lines from file
with open(filename) as f:
    lines = f.readlines()

# Process lines: every time the keyword is found, delete that line and
# the following ones (delete lines in total)
i = 0
while i < len(lines):
    if keyword in lines[i]:
        del lines[i:i + delete]
    else:
        i += 1

# Save modified lines to file
with open(filename, "w") as f:
    f.writelines(lines)
Example test.txt before:
abc
def
ghi
jkl
Example test.txt afterwards:
abc
jkl
If you don't want to use fileinput, you can also read all the lines of your file, write them (except for the lines you skip using next(f)) to a tempfile.NamedTemporaryFile and replace the original file with the temporary file.
from pathlib import Path
from tempfile import NamedTemporaryFile

named_conf = Path('/etc/bind/named.conf.local')

with open(named_conf) as infile:
    with NamedTemporaryFile("w", delete=False) as outfile:
        for line in infile:
            if line_should_be_deleted(line):
                # skip this line and the 4 lines after it
                for _ in range(4):
                    next(infile)
            else:
                outfile.write(line)

Path(outfile.name).replace(named_conf)
But you should just use fileinput, like the answer to the question you linked to says, since it does the tempfile stuff for you.
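For reference, here is a minimal sketch of that fileinput approach (assumptions: 'some text' stands in for the dict['name'] value from the question, and Python 3 is used; with inplace=True, whatever you print inside the loop replaces the file's contents):

import fileinput

match = 'some text'  # assumed stand-in for dict['name'] from the question
skip = 0
with fileinput.input('/etc/bind/named.conf.local', inplace=True) as f:
    for line in f:
        if skip:
            skip -= 1        # still inside a block being deleted; drop this line
            continue
        if match in line:
            skip = 4         # drop this line and the 4 lines after it
            continue
        print(line, end='')  # printed lines are written back to the file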
It all boils down to keeping a skip count that you initialize with the first occurrence of the matching line and increase afterward:
match = "text line to match"
with open('input.txt','r') as lines:
with open('output.txt','w') as output:
skip = -1
for line in lines:
skip += skip >= 0 or skip < 0 and line.strip("\n") == match
if skip not in range(5):
output.write(line)
If what you're trying to avoid is reading lines one by one, you could write it like this (but you still need to open the files)
match = "text line to match"
lines = open('input.txt','r').read().split("\n")
matchPos = lines.index(match)
del lines[matchPos:matchPos+5]
open('output.txt','w').write("\n".join(lines))
bind = open('text.txt', 'r')
a = dict['name']
lines = bind.readlines()
i = 0
while i < len(lines):
    if a in lines[i]:
        del lines[i:i+5]
    i += 1
print(lines)
bind = open('/etc/bind/named.conf.local', 'r')
textfile = bind.readlines()
a = 'some text'
for line_num in range(len(textfile)):
    try:
        if a in textfile[line_num]:
            print('line exists')
            del textfile[line_num:line_num+5]
    except IndexError:
        break

writer = open("/etc/bind/named.conf.local", "w")
writer.write(''.join(textfile))
writer.close()
I'm trying to count the number of lines in a gz archive. There is only 1 json format text file per gz. But when I open the archive and count the lines the count is way off what I'd expect. The file contains 522 lines, but my code is returning 668480 lines.
import gzip
f = gzip.open(myfile, 'rb')
file_content = f.read()
for i, l in enumerate(file_content):
    pass
i += 1
print("File {1} contain {0} lines".format(i, myfile))
You are iterating over all characters, not the lines. You can iterate over the lines the following way:
import gzip
with gzip.open(myfile, 'rb') as f:
    for i, l in enumerate(f):
        pass
print("File {1} contain {0} lines".format(i + 1, myfile))
For a performant way to count the lines in a gzip file you can use the pragzip package:
import pragzip
result = 0
with pragzip.open(myfile) as file:
    while chunk := file.read(1024 * 1024):
        result += chunk.count(b'\n')

print(f"Number of lines: {result}")
Comparing the timing of the above with #DmitryKovriga's answer:
Number of lines: 33468793
Elapsed time is 22.373915 seconds.
File datasets/binance-futures_incremental_book_L2_2020-07-01_BTCUSDT.csv.gz contain 33468793 lines
Elapsed time is 31.278056 seconds.
A speed up of more like 10x should be possible with a suitable setup. See https://unix.stackexchange.com/a/713093/163459 for more info.
I have 2 simple questions about Python:
1. How to get the number of lines of a file in Python?
2. How to move the position in a file object to the last line easily?
lines are just data delimited by the newline char '\n'.
1) Since lines are variable length, you have to read the entire file to know where the newline chars are, so you can count how many lines:
count = 0
for line in open('myfile'):
    count += 1
print count, line  # it will be the last line
2) reading a chunk from the end of the file is the fastest method to find the last newline char.
import os

def seek_newline_backwards(file_obj, eol_char='\n', buffer_size=200):
    if not file_obj.tell():
        return  # already at the beginning of the file
    # All lines end with \n, including the last one, so we assume we are just
    # after an end-of-line char
    file_obj.seek(-1, os.SEEK_CUR)
    while file_obj.tell():
        amount = min(buffer_size, file_obj.tell())
        file_obj.seek(-amount, os.SEEK_CUR)
        data = file_obj.read(amount)
        eol_pos = data.rfind(eol_char)
        if eol_pos != -1:
            file_obj.seek(eol_pos - len(data) + 1, os.SEEK_CUR)
            break
        file_obj.seek(-len(data), os.SEEK_CUR)
You can use that like this:
f = open('some_file.txt')
f.seek(0, os.SEEK_END)
seek_newline_backwards(f)
print f.tell(), repr(f.readline())
Let's not forget
f = open("myfile.txt")
lines = f.readlines()
numlines = len(lines)
lastline = lines[-1]
NOTE: this reads the whole file in memory as a list. Keep that in mind in the case that the file is very large.
The easiest way is simply to read the file into memory. eg:
f = open('filename.txt')
lines = f.readlines()
num_lines = len(lines)
last_line = lines[-1]
However for big files, this may use up a lot of memory, as the whole file is loaded into RAM. An alternative is to iterate through the file line by line. eg:
f = open('filename.txt')
num_lines = sum(1 for line in f)
This is more efficient, since it won't load the entire file into memory, but only look at a line at a time. If you want the last line as well, you can keep track of the lines as you iterate and get both answers by:
f = open('filename.txt')
num_lines = 0
last_line = None
for line in f:
    num_lines += 1
    last_line = line
print "There were %d lines. The last was: %s" % (num_lines, last_line)
One final possible improvement, if you need only the last line, is to start at the end of the file and seek backwards until you find a newline character. Here's a question which has some code doing this. If you need the line count as well, though, there's no alternative except to iterate through all the lines in the file.
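A rough sketch of that backwards scan (an illustration only, assuming a UTF-8 text file; read the tail in binary chunks until a newline turns up):

import os

def last_line(path, chunk_size=1024):
    # Scan backwards from the end in fixed-size chunks until a newline is
    # found, then read everything after it.
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        end = f.tell()
        pos = end - 1 if end else 0  # ignore a possible trailing newline
        while pos > 0:
            start = max(0, pos - chunk_size)
            f.seek(start)
            chunk = f.read(pos - start)
            newline = chunk.rfind(b'\n')
            if newline != -1:
                f.seek(start + newline + 1)
                return f.read().decode()
            pos = start
        f.seek(0)
        return f.read().decode()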
For small files that fit in memory, how about using str.count() to get the number of lines of a file:
line_count = open("myfile.txt").read().count('\n')
I'd like to add to the other solutions that some of them (those that look for \n) will not work with files with OS 9-style line endings (\r only). Also, such files may contain an extra blank line at the end, because lots of text editors append one for some curious reason, so you might or might not want to add a check for it.
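As an aside, in Python 3 text mode reads with universal newlines by default, so '\r'-only (old Mac) and '\r\n' endings are both translated to '\n' and simple line iteration handles them (a minimal sketch):

# Counting lines in text mode also works for old Mac-style ('\r' only) files,
# since Python 3 translates '\r', '\n' and '\r\n' to '\n' when reading text.
with open("myfile.txt") as f:
    line_count = sum(1 for _ in f)
print(line_count)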
The only way to count lines [that I know of] is to read all lines, like this:
count = 0
for line in open("file.txt"): count = count + 1
After the loop, count will have the number of lines read.
For the first question there are already a few good answers; I'll suggest #Brian's one as the best (most Pythonic, line-ending-character proof, and memory efficient):
f = open('filename.txt')
num_lines = sum(1 for line in f)
For the second one, I like #nosklo's one, but modified to be more general it would be:
import os

f = open('myfile')
to = f.seek(0, os.SEEK_END)
found = -1
while found == -1 and to > 0:
    fro = max(0, to - 1024)
    f.seek(fro)
    chunk = f.read(to - fro)
    found = chunk.rfind("\n")
    to -= 1024
if found != -1:
    found += fro
It searches in chunks of 1 KB from the end of the file until it finds a newline character or reaches the beginning of the file. At the end of the code, found is the index of the last newline character.
Answer to the first question (beware of poor performance on large files when using this method):
f = open("myfile.txt").readlines()
print len(f) - 1
Answer to the second question:
f = open("myfile.txt").read()
print f.rfind("\n")
P.S. Yes I do understand that this only suits for small files and simple programs. I think I will not delete this answer however useless for real use-cases it may seem.
Answer 1:
x = open("file.txt")
opens the file (x is now associated with file.txt)
y = x.readlines()
returns all the lines as a list
length = len(y)
assigns the length of the list to length
Or in one line:
length = len(open("file.txt").readlines())
Answer 2:
last = y[-1]
returns the last element of the list
Approach:
Open the file in read mode and assign it to a file object named "file".
Assign 0 to the counter variable.
Read the content of the file using the read function and assign it to a variable named "Content".
Create a list of the content, split wherever an "\n" is encountered.
Traverse the list using a for loop, incrementing the counter variable for each non-empty element.
Finally, the value in the counter variable is displayed, which is the required result of this program.
Python program to count the number of lines in a text file
# Opening a file
file = open("filename", "file mode")  # file mode like r, w, a...
Counter = 0

# Reading from file
Content = file.read()
CoList = Content.split("\n")

for i in CoList:
    if i:
        Counter += 1

print("This is the number of lines in the file")
print(Counter)
The above code will print the number of lines present in a file. Replace filename with the file with extension and file mode with read - 'r'.
Is it possible to split a file? For example, if you have a huge wordlist, I want to split it so that it becomes more than one file. How is this possible?
This one splits a file up by newlines and writes it back out. You can change the delimiter easily. This can also handle uneven amounts as well, if you don't have a multiple of splitLen lines (20 in this example) in your input file.
splitLen = 20          # 20 lines per file
outputBase = 'output'  # output.1.txt, output.2.txt, etc.

# This is shorthand and not friendly with memory
# on very large files (Sean Cavanagh), but it works.
input = open('input.txt', 'r').read().split('\n')

at = 1
for lines in range(0, len(input), splitLen):
    # First, get the list slice
    outputData = input[lines:lines+splitLen]

    # Now open the output file, join the new slice with newlines
    # and write it out. Then close the file.
    output = open(outputBase + str(at) + '.txt', 'w')
    output.write('\n'.join(outputData))
    output.close()

    # Increment the counter
    at += 1
A better loop for sli's example, not hogging memory:
splitLen = 20          # 20 lines per file
outputBase = 'output'  # output.1.txt, output.2.txt, etc.

input = open('input.txt', 'r')

count = 0
at = 0
dest = None
for line in input:
    if count % splitLen == 0:
        if dest: dest.close()
        dest = open(outputBase + str(at) + '.txt', 'w')
        at += 1
    dest.write(line)
    count += 1
Solution to split binary files into chapters .000, .001, etc.:
FILE = 'scons-conversion.7z'

MAX = 500*1024*1024      # 500Mb - max chapter size
BUF = 50*1024*1024*1024  # 50GB  - memory buffer size

chapters = 0
uglybuf = ''
with open(FILE, 'rb') as src:
    while True:
        tgt = open(FILE + '.%03d' % chapters, 'wb')
        written = 0
        while written < MAX:
            if len(uglybuf) > 0:
                tgt.write(uglybuf)
            tgt.write(src.read(min(BUF, MAX - written)))
            written += min(BUF, MAX - written)
            uglybuf = src.read(1)
            if len(uglybuf) == 0:
                break
        tgt.close()
        if len(uglybuf) == 0:
            break
        chapters += 1
def split_file(file, prefix, max_size, buffer=1024):
    """
    file: the input file
    prefix: prefix of the output files that will be created
    max_size: maximum size of each created file in bytes
    buffer: buffer size in bytes

    Returns the number of parts created.
    """
    with open(file, 'r+b') as src:
        suffix = 0
        while True:
            with open(prefix + '.%s' % suffix, 'w+b') as tgt:
                written = 0
                while written < max_size:
                    data = src.read(buffer)
                    if data:
                        tgt.write(data)
                        written += buffer
                    else:
                        return suffix
            suffix += 1

def cat_files(infiles, outfile, buffer=1024):
    """
    infiles: a list of files
    outfile: the file that will be created
    buffer: buffer size in bytes
    """
    with open(outfile, 'w+b') as tgt:
        for infile in sorted(infiles):
            with open(infile, 'r+b') as src:
                while True:
                    data = src.read(buffer)
                    if data:
                        tgt.write(data)
                    else:
                        break
Sure it's possible:
maxlines = 1000  # lines per output file

count = 0
part = 1
infile = open('input.txt')            # open input file
outfile = open('output_1.txt', 'w')   # open output file 1
for line in infile:
    outfile.write(line)               # write to output file
    count = count + 1
    if count > maxlines:
        outfile.close()               # close output file
        part += 1
        outfile = open('output_%d.txt' % part, 'w')  # open next output file
        count = 0
outfile.close()
infile.close()
import re

PATENTS = 'patent.data'

def split_file(filename):
    # Open file to read
    with open(filename, "r") as r:
        # Counter
        n = 0
        # Start reading the file line by line
        for i, line in enumerate(r):
            # If the line matches the template -- <?xml -- increase counter n
            if re.match(r'\<\?xml', line):
                n += 1
                # This "if" can be deleted; without it, naming will start from 1.
                # It depends on where "re" finds the template for the first time.
                # In my case it was the first line.
                if i == 0:
                    n = 0
            # Write the line to the current part file
            with open("{}-{}".format(PATENTS, n), "a") as f:
                f.write(line)

split_file(PATENTS)
As a result you will get:
patent.data-0
patent.data-1
patent.data-N
You can use the filesplit module from PyPI.
This is a late answer, but a new question was linked here and none of the answers mentioned itertools.groupby.
Assuming you have a (huge) file file.txt that you want to split into chunks of MAXLINES lines, file_part1.txt, ..., file_partn.txt, you could do:
import itertools

MAXLINES = 3  # lines per chunk

with open("file.txt") as fdin:
    for i, sub in itertools.groupby(enumerate(fdin), lambda x: 1 + x[0] // MAXLINES):
        with open("file_part{}.txt".format(i), "w") as fdout:
            for _, line in sub:
                fdout.write(line)
import subprocess
subprocess.run('split -l number_of_lines file_path', shell = True)
For example, if you want 50000 lines in one file and the path is /home/data, then you can run the command below:
subprocess.run('split -l 50000 /home/data', shell=True)
If you are not sure how many lines to keep in each split file but know how many splits you want, then in a Jupyter Notebook/shell you can check the total number of lines using the command below and then divide by the number of splits you want:
! wc -l file_path
in this case:
! wc -l /home/data
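As a sketch of that approach in Python (assumptions: the Unix wc and split commands are available, and desired_parts is a hypothetical value you would choose yourself):

import subprocess

file_path = '/home/data'  # input file from the example above
desired_parts = 4         # hypothetical number of splits you want

# wc -l prints "<count> <filename>"; take the first field
total_lines = int(subprocess.check_output(['wc', '-l', file_path]).split()[0])
lines_per_part = -(-total_lines // desired_parts)  # ceiling division
subprocess.run(f'split -l {lines_per_part} {file_path}', shell=True)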
And just so you know, the output files will not have a file extension, even though their content is in the same format as the input file. You can rename them manually if needed (for example, on Windows).
All the provided answers are good and (probably) work. However, they need to load the file into memory (in whole or in part). We know Python is not very efficient at this kind of task (or at least not as efficient as OS-level commands).
I found the following is the most efficient way to do it:
import os

MAX_NUM_LINES = 1000
FILE_NAME = "input_file.txt"
SPLIT_PARAM = "-d"
PREFIX = "__"

if os.system(f"split -l {MAX_NUM_LINES} {SPLIT_PARAM} {FILE_NAME} {PREFIX}") == 0:
    print("Done:")
    print(os.system(f"ls {PREFIX}??"))
else:
    print("Failed!")
Read more about split here: https://linoxide.com/linux-how-to/split-large-text-file-smaller-files-linux/