Python: Splitting a txt file and keeping the header

Say I have an input file like this (splitfile.txt):
INPUT
HEADER
OF A TXT FILE
line 1
line 2
line 3
line 4
line 5
line 6
I want to split this file and keep the three header lines in each piece, like this:
INPUT
HEADER
OF A TXT FILE
line 1
line 2
INPUT
HEADER
OF A TXT FILE
line 3
line 4
INPUT
HEADER
OF A TXT FILE
line 5
line 6
My Python code so far only splits up this text file:
lines_per_file = 2
s = None
with open('splitfile.txt') as split:
    for lineno, line in enumerate(split):
        if lineno % lines_per_file == 0:
            if s:
                s.close()
            sfilename = 'step_{}.txt'.format(lineno + lines_per_file)
            s = open(sfilename, "w")
        s.write(line)
if s:
    s.close()
How can I do this?

You can read the header and save it in a variable, then write it to each new file you create.
lines_per_file = 2
s = None
with open('a.txt') as f:
    lines = f.readlines()
headers, lines = lines[:3], lines[3:]
for lineno, line in enumerate(lines):
    if lineno % lines_per_file == 0:
        if s:
            s.close()
        sfilename = f'step_{lineno + lines_per_file}.txt'
        s = open(sfilename, "w")
        s.writelines(headers)
    s.write(line)
if s:
    s.close()
A cleaner answer:
LINES_PER_FILE = 2

def writer_to_file(name, headers, lines):
    with open(name, "w") as f:
        f.writelines(headers + lines)

with open('a.txt') as f:
    lines = f.readlines()
headers, lines = lines[:3], lines[3:]
for i in range(0, len(lines), LINES_PER_FILE):
    writer_to_file(f'step_{i + LINES_PER_FILE}.txt', headers, lines[i:i + LINES_PER_FILE])
I prefer this one because there is no global variable s, and with the with statement there is no need to worry about closing the file.
It is also better to use UPPER_CASE names for constants.
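For contrast, here is a streaming sketch of my own (not from the answers above): read the three header lines once with next, then pull fixed-size chunks with itertools.islice, so the whole file never has to sit in memory. The sample input from the question is recreated first so the snippet is self-contained.

```python
from itertools import islice

# recreate the sample input from the question so this sketch is runnable
with open('splitfile.txt', 'w') as f:
    f.write('INPUT\nHEADER\nOF A TXT FILE\n'
            + ''.join(f'line {i}\n' for i in range(1, 7)))

LINES_PER_FILE = 2

with open('splitfile.txt') as f:
    headers = [next(f) for _ in range(3)]          # the three header lines
    chunk_no = 0
    while True:
        chunk = list(islice(f, LINES_PER_FILE))    # next block of data lines
        if not chunk:
            break
        chunk_no += 1
        with open(f'step_{chunk_no * LINES_PER_FILE}.txt', 'w') as out:
            out.writelines(headers + chunk)
```

This produces step_2.txt, step_4.txt and step_6.txt, each starting with the header, without ever calling readlines() on the input.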

Related

Lines missing in Python

I am writing code in Python that removes all the text after a specific word, but lines are missing in the output. I have a Unicode text file with 3 lines:
my name is test1
my name is
my name is test 2
What I want is to remove the text after the word "test", so I get the output below:
my name is test
my name is
my name is test
I have written code that does the task, but it also removes the second line, "my name is".
My code is below:
txt = ""
with open(r"test.txt", 'r') as fp:
    for line in fp.readlines():
        splitStr = "test"
        index = line.find(splitStr)
        if index > 0:
            txt += line[:index + len(splitStr)] + "\n"
with open(r"test.txt", "w") as fp:
    fp.write(txt)
It looks like the index becomes -1 if the keyword is not found, so you are skipping the lines without the keyword.
I would modify your if by adding a condition, as follows:
txt = ""
with open(r"test.txt", 'r') as fp:
    for line in fp.readlines():
        splitStr = "test"
        index = line.find(splitStr)
        if index > 0:
            txt += line[:index + len(splitStr)] + "\n"
        elif index < 0:
            txt += line
with open(r"test.txt", "w") as fp:
    fp.write(txt)
No need to add \n because the line already contains it.
Your code does not append the line if splitStr is not found in it.
txt = ""
with open(r"test.txt", 'r') as fp:
    for line in fp.readlines():
        splitStr = "test"
        index = line.find(splitStr)
        if index != -1:
            txt += line[:index + len(splitStr)] + "\n"
        else:
            txt += line
with open(r"test.txt", "w") as fp:
    fp.write(txt)
In my solution I simulate the input file via io.StringIO. Compared to your code, my solution removes the else branch and uses only one += operator. Also, splitStr is set only once rather than on each iteration. This makes the code clearer and reduces possible error sources.
import io

# simulates a file for this example
the_file = io.StringIO("""my name is test1
my name is
my name is test 2""")

txt = ""
splitStr = "test"

with the_file as fp:
    # each line
    for line in fp.readlines():
        # cut something?
        if splitStr in line:
            # find index
            index = line.find(splitStr)
            # cut after 'splitStr' and add newline
            line = line[:index + len(splitStr)] + "\n"
        # append line to output
        txt += line

print(txt)
When handling files in Python 3, it is recommended to use pathlib, like this:
import pathlib

file_path = pathlib.Path("test.txt")

# read from the file
with file_path.open('r') as fp:
    ...  # do something

# write back to the file
with file_path.open('w') as fp:
    ...  # do something
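A concrete round-trip in that pathlib style, applied to this question's task (my example; the file name test_pathlib.txt and the sample content are stand-ins):

```python
import pathlib

# hypothetical file, seeded with the question's sample lines
p = pathlib.Path("test_pathlib.txt")
p.write_text("my name is test1\nmy name is\nmy name is test 2\n")

splitStr = "test"
out_lines = []
for line in p.read_text().splitlines():
    index = line.find(splitStr)
    if index != -1:
        # cut everything after 'test'
        line = line[:index + len(splitStr)]
    out_lines.append(line)

# write the result back to the same file
p.write_text("\n".join(out_lines) + "\n")
```

Path.read_text / Path.write_text handle opening and closing the file, so no with block is needed for simple whole-file operations.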
Suggestion:
for line in fp.readlines():
    i = line.find('test')
    if i != -1:
        line = line[:i]

Reading a Python File to EOF while performing if statments

I am working on a program to concatenate rows within a file. Each file has a header, data rows labeled DAT001 to DAT113, and a trailer. Each line of concatenated rows will have DAT001 to DAT100; 102-113 are optional. I need to print the header, concatenate DAT001-113, and whenever the file finds a row with DAT001, start a new line and concatenate DAT001-113 again. After that is all done, I will print the trailer. I have an if statement started, but it only writes the header and skips all other logic. I apologize that this is very basic, but I am struggling with reading rows over and over without knowing how long the file might be.
I have tried the below code but it won't read or print after the header.
import pandas as pd

destinationFile = "./destination-file.csv"
sourceFile = "./TEST.txt"
header = "RHR"
data = "DPSPOS"
beg_data = "DAT001"
data2 = "DAT002"
data3 = "DAT003"
data4 = "DAT004"
data5 = "DAT005"
data6 = "DAT006"
data7 = "DAT007"
data8 = "DAT008"
data100 = "DAT100"
data101 = "DAT101"
data102 = "DAT102"
data103 = "DAT103"
data104 = "DAT104"
data105 = "DAT105"
data106 = "DAT106"
data107 = "DAT107"
data108 = "DAT108"
data109 = "DAT109"
data110 = "DAT110"
data111 = "DAT111"
data112 = "DAT112"
data113 = "DAT113"
req_data = ''
opt101 = ''
opt102 = ''

with open(sourceFile) as Tst:
    for line in Tst.read().split("\n"):
        if header in line:
            with open(destinationFile, "w+") as dst:
                dst.write(line)
        elif data in line:
            if beg_data in line:
                req_data = line+line+line+line+line+line+line+line+line
            if data101 in line:
                opt101 = line
            if data102 in line:
                opt102 = line
            new_line = pd.concat(req_data, opt101, opt102)
            with open(destinationFile, "w+") as dst:
                dst.write(new_line)
        else:
            if trailer in line:
                with open(destinationFile, "w+") as dst:
                    dst.write(line)
Just open the output file once for the whole loop, not every time through the loop.
Check whether the line begins with DAT001. If it does, write the trailer to end the current line and start a new one by writing the header.
Then write every line that begins with DAT to the current line of the file.
first_line = True
with open(sourceFile) as Tst, open(destinationFile, "w+") as dst:
    for line in Tst.read().split("\n"):
        # start a new line when reading DAT001
        if line.startswith(beg_data):
            if not first_line:  # need to end the current line
                dst.write(trailer + '\n')
            first_line = False
            dst.write(header)
        # copy all the lines that begin with `DAT`
        if line.startswith('DAT'):
            dst.write(line)
    # end the last line
    dst.write(trailer + '\n')
See if the following code helps you make progress. It was not tested because no minimal reproducible example was provided.
with open(destinationFile, "a") as dst:
    # the above keeps the file open until after all the indented code runs
    with open(sourceFile) as Tst:
        # the above keeps the file open until after all the indented code runs
        for line in Tst.read().split("\n"):
            if header in line:
                dst.write(line)
            elif data in line:
                if beg_data in line:
                    req_data = line + line + line + line + line + line + line + line + line
                if data101 in line:
                    opt101 = line
                if data102 in line:
                    opt102 = line
                new_line = pd.concat(req_data, opt101, opt102)
                dst.write(new_line)
            else:
                if trailer in line:
                    dst.write(line)
# "with" is a context manager, which will automatically close the files.

Efficient method to replace text in file

I have a bunch of large text files (more than 5 million lines each). I have to add a prefix to a line if it contains any of the keywords from a list (6000 keywords).
This is the code I have written:
import os

mypath = "D:\\Temp"
files = os.listdir(mypath)
fcounter = 0
for file in files:
    fl = open(mypath + file, 'r')
    lines = fl.readlines()
    fl.close
    fl2 = open("D:\\Temp\\Keys.txt", 'r')
    keys = fl2.readlines()
    for i, line in enumerate(lines):
        if i % 100000 == 0:
            print(i)
        for key in keys:
            if line.find(key.strip()) > 3:
                lines[i] = '$' + line
                print(line)
    fl = open(mypath + file, 'w')
    for line in lines:
        fl.write("".join(line))
    fl.close()
    fl2.close()
    fcounter += 1
    print(f'Processed {fcounter}')
This is extremely slow. It takes several hours to process a single text file on my system.
Is there a better way of doing this?
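One common speed-up (my suggestion, using stand-in data, not code from this thread) is to load the keyword list once, outside the file loop, and collapse the thousands of per-line substring searches into a single compiled regular expression, so each line is scanned once:

```python
import re

# stand-ins for Keys.txt and one input file (hypothetical values)
keywords = ["apple", "banana", "cherry"]
lines = ["no fruit here\n", "a ripe banana\n"]

# build one alternation pattern from all keywords, escaping them
# so they are matched literally
pattern = re.compile("|".join(re.escape(k.strip()) for k in keywords))

out = []
for line in lines:
    # prefix the line if any keyword occurs anywhere in it
    if pattern.search(line):
        out.append("$" + line)
    else:
        out.append(line)
```

Note the original code only prefixes a line when the match starts after character position 3 (line.find(...) > 3); if that offset matters, test pattern.search(line).start() > 3 instead of just a truthy match.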

Can't properly remove line from text file

I have a text file called test.txt, with the following content:
This is a test
I want this line removed
I'm trying to write an algorithm in Python 2 that removes the second line ("I want this line removed") as well as the line break on the first line. I'm trying to output this to a second file called test_2.txt; however, the resulting test_2.txt file is empty, and the first line is not there. Why? Here is my code:
#coding: iso-8859-1
Fil = open("test.txt", "wb")
Fil.write("This is a test" + "\n" + "I want this line removed")
Fil.close()

Fil = open("test.txt", "rb")
Fil_2 = open("test_2.txt", "wb")

number_of_lines = 0
for line in Fil:
    if line.find("I want") != 0:
        number_of_lines += 1

line_number = 1
for line in Fil:
    if line.find("I want") != 0:
        if line_number == number_of_lines:
            for g in range(0, len(line)):
                if g == 0:
                    a = line[0]
                elif g < len(line) - 1:
                    a += line[g]
            Fil_2.write(a)
        else:
            Fil_2.write(line)
        line_number += 1

Fil.close()
Fil_2.close()
You are overly complicating your algorithm. Try this instead:
with open('test.txt') as infile, open('test_2.txt', 'w') as outfile:
    for line in infile:
        if not line.startswith("I want"):
            outfile.write(line.strip())
Remembering that open returns an iterator, you can simplify, as well as generalise, the solution by writing it like this:
with open('test.txt') as infile:
    first_line = next(infile)
with open('test_2.txt', 'w') as outfile:
    outfile.write(first_line.strip())
# both files will be automatically closed at this point

Python find, replace or add string in file

Hi and thanks for looking :)
I have a text file that is over 2500 lines; each line contains information about a video file. One of the tags (so to speak) is for watched status. I am looking for a way of changing it from one value to another, or adding a new value if it is not set. The code below works, but it has to open and close the file for each searchValue, which makes it very slow. Can anyone suggest a way of opening the file once and doing all the searches in one pass?
Thanks
import fileinput
import sys

for x in y:
    print ' --> ' + x['title'].encode('utf-8')
    searchValue = x['movieid']
    addValue = "\t_w\t1\t"
    checkvalue = "\t_w\t0\t"
    for line in fileinput.input(file, inplace=1):
        if searchValue in line:
            if checkvalue in line:
                line = line.replace(checkvalue, addValue)
            elif not addValue in line:
                line = line + addValue
        sys.stdout.write(line)
This is what I ended up with; thanks to everyone for your input.
myfile_list = open(file).readlines()
newList = []
for line in myfile_list:
    for x in y:
        if x['movieid'] in line:
            print ' --> ' + x['title'].encode('utf-8')
            if checkvalue in line:
                line = line.replace(checkvalue, addValue)
            elif not addValue in line:
                line = line.replace('\n', addValue+'\n')
    newList.append(line)
outref = open(file, 'w')
outref.writelines(newList)
outref.close()
Edit
I have come across an issue with the encoding. The file is encoded in UTF-8, but it errors out or just does not find a match when the search value is
'Hannibal - S01E01 - Ap\xe9ritif.mkv'
the matching line in the file looks like
_F /share/Storage/NAS/Videos/Tv/Hannibal/Season 01/Hannibal - S01E01 - Apéritif.mkv _rt 43 _r 8.4 _s 1 _v c0=h264,f0=24,h0=720,w0=1280 _IT 717ac9d _id 1671 _et Apéritif _DT 7142d53 _FT 7142d53 _A 4212,4211,2533,4216 _C T _G j|d|h|t _R GB:TV-MA _T Hannibal _U thetvdb:259063 imdb:tt2243973 _V HDTV _W 4210 _Y 71 _ad 2013-04-04 _e 1 _ai Apéritif _m 1117
I have tried codecs.open and decode().encode() options, but it always errors out. I believe the accented letters in the line are the issue, since `if searchValue in line:` works when the line has no accented letters. This is what I am currently trying, but I am open to other methods.
if os.path.isfile("/share/Apps/oversight/index.db"):
    newfile = ""
    #searchValueFix = searchValue.decode('latin-1', 'replace').encode('utf8', 'replace')
    #print searchValueFix
    #print searchValue
    addValue = "\t_w\t1\t"
    replacevalue = "\t_w\t0\t"
    file = open("/share/Apps/oversight/index.db", "r")
    for line in file:
        if searchValue in line:
            if replacevalue in line:
                line = line.replace(replacevalue, addValue)
            elif not addValue in line:
                line = line.replace(searchValue+"\t", searchValue+addValue)
        newfile = newfile + line
    file.close()
    file = open("/share/Apps/oversight/index.db", "w")
    file.write(newfile)
    file.close()
    newfile = ""
Similar to the method proposed by PyNEwbie, you can write lines one by one:
myfile_list = open(file).readlines()
outref = open(myfile, 'w')
for line in myfile_list:
    # do something to line
    outref.write(line)
outref.close()
Yes, read your file into a list:
myfile_list = open(file).readlines()
newList = []
for line in myfile_list:
    .
    .
    newList.append(line)  # this is the line after you have made your changes
outref = open(myfile, 'w')
outref.writelines(newList)
outref.close()
