Python -- How to split headers/chapters into separate files automatically

I'm converting text directly to epub and I'm having a problem automatically splitting the HTML book file into separate header/chapter files. At the moment, the code below partially works but only creates every other chapter file. So half the header/chapter files are missing from the output. Here is the code:
def splitHeaderstoFiles(fpath):
    infp = open(fpath, 'rt', encoding=('utf-8'))
    for line in infp:
        # format and split headers to files
        if '<h1' in line:
            #-----------format header file names and other stuff ------------#
            # create a new file for the header/chapter section
            path = os.getcwd() + os.sep + header
            with open(path, 'wt', encoding=('utf-8')) as outfp:
                # write html top meta headers
                outfp = addMetaHeaders(outfp)
                # add the header
                outfp = outfp.write(line)
                # add the chapter/header bodytext
                for line in infp:
                    if '<h1' not in line:
                        outfp.write(line)
                    else:
                        outfp.write('</body>\n</html>')
                        break
        else:
            continue
    infp.close()
The problem occurs in the second 'for loop' at the bottom of the code, when I look for the next h1 tag to stop the split. I cannot use seek() or tell() to rewind or move back one line so the program can find the next header/chapter on the next iteration. Apparently you cannot use these in python in a for loop containing an implicit iter or next object in operation. Just gives a 'can't do non-zero cur-relative seeks' error.
I've also tried the while line != ' ' + readline() combination in the code which also gives the same error as above.
Does anyone know an easy way to split HTML headers/chapters of varying lengths into separate files in python? Are there any special python modules(such as pickles) that could help make this task easier?
I'm using Python 3.4
My grateful thanks in advance for any solutions to this problem...

I ran into a similar problem a while ago; here is a simplified solution:
from itertools import count
chapter_number = count(1)
# text mode ('w'), because the lines read from the input file are str, not bytes
output_file = open('000-intro.html', 'w')
with open('index.html', 'rt') as input_file:
    for line in input_file:
        if '<h1' in line:
            # a new chapter starts: close the previous file and open the next one
            output_file.close()
            output_file = open('{:03}-chapter.html'.format(next(chapter_number)), 'w')
        output_file.write(line)
output_file.close()
In this approach, the first block of text leading to the first h1 block is written into 000-intro.html, the first chapter will be written into 001-chapter.html and so on. Please modify it to taste.
The solution is a simple one: Upon encountering the h1 tag, close the last output file and open a new one.
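The same idea can be wrapped in a function so it is easier to reuse and test. This is only a sketch: the function name split_chapters, the out_dir parameter and the marker default are assumptions made for illustration, and the output file names follow the 000-intro.html / 001-chapter.html pattern described above.

```python
import os


def split_chapters(input_path, out_dir, marker='<h1'):
    """Write everything before the first marker to 000-intro.html and each
    chapter to NNN-chapter.html. Returns the list of files written."""
    written = []
    out = open(os.path.join(out_dir, '000-intro.html'), 'w', encoding='utf-8')
    written.append(out.name)
    chapter = 0
    try:
        with open(input_path, 'r', encoding='utf-8') as src:
            for line in src:
                if marker in line:
                    # A header line starts a new chapter: switch output files.
                    out.close()
                    chapter += 1
                    name = '{:03}-chapter.html'.format(chapter)
                    out = open(os.path.join(out_dir, name), 'w', encoding='utf-8')
                    written.append(out.name)
                # The header line itself goes into the new chapter file.
                out.write(line)
    finally:
        out.close()
    return written
```

Because there is only one loop over the input, no line is ever consumed without being written, so no chapter can be skipped.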

You are looping over your input file twice, which is likely causing your problems:
for line in infp:
    ...
    with open(path, 'wt', encoding=('utf-8')) as outfp:
        ...
        for line in infp:
            ...
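Both for loops draw lines from the same file iterator, so the inner loop consumes the next <h1 line before it breaks and the outer loop never sees it. That is why every other chapter goes missing.
You might try transforming your for loops into while loops that share one current line, so the line that ends a chapter is still available to start the next one:
line = infp.readline()
while line:
    if '<h1' in line:
        with open(...) as outfp:
            outfp.write(line)
            line = infp.readline()
            while line and '<h1' not in line:
                outfp.write(line)
                line = infp.readline()
        # line is now the next <h1 line (or '' at end of file)
    else:
        line = infp.readline()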
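To see why the nested for loops in the question lose every other chapter, here is a small self-contained demonstration. The function name and the sample text are made up for illustration; an io.StringIO object stands in for the opened file, since it iterates line by line in the same way.

```python
import io


def nested_loop_headers(stream):
    """Return the header lines the outer loop sees when an inner loop
    iterates over the same stream."""
    seen = []
    for line in stream:
        if '<h1' in line:
            seen.append(line.strip())
            for line in stream:      # continues from where the outer loop is
                if '<h1' in line:
                    break            # this header line has now been consumed
    return seen


sample = io.StringIO(
    '<h1>A</h1>\ntext\n<h1>B</h1>\ntext\n<h1>C</h1>\ntext\n<h1>D</h1>\n'
)
print(nested_loop_headers(sample))  # ['<h1>A</h1>', '<h1>C</h1>']
```

Headers B and D are read by the inner loop, which then breaks, so the outer loop resumes on the line after them.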
Alternatively, you may wish to use an HTML parser (e.g., BeautifulSoup). Then you can do something like what is described here: https://stackoverflow.com/a/8735688/65295.
Update from comment - essentially, read the entire file all at once so you can freely move back or forward as necessary. This probably won't be a performance issue unless you have a really really big file (or very little memory).
lines = infp.readlines() # read the entire file
i = 0
while i < len(lines):
    if '<h1' in lines[i]:
        with open(...) as outfp:
            outfp.write(lines[i]) # the header line itself
            j = i + 1
            while j < len(lines):
                if '<h1' in lines[j]:
                    break
                outfp.write(lines[j])
                j += 1
        # line j has an <h1> (or j is past the end); set i to j so we
        # detect it at the top of the next loop iteration.
        i = j
    else:
        i += 1

I eventually found the answer to the above problem. The code below does a lot more than just get the file header. It also loads two parallel lists, one with the formatted file names (with extension) and one with the pure header names, so I can use them to fill in the title and the formatted file name in these html files within a single while loop. The code now works well and is shown below.
def splitHeaderstoFiles(dir, inpath):
    count = 1
    t_count = 0
    out_path = ''
    header = ''
    write_bodytext = False
    file_path_names = []
    pure_header_names = []
    inpath = dir + os.sep + inpath
    with open(inpath, 'rt', encoding=('utf-8')) as infp:
        for line in infp:
            if '<h1' in line:
                #strip html tags, convert to start caps
                p = re.compile(r'<.*?>')
                header = p.sub('', line)
                header = capwords(header)
                line_save = header
                # Add 0 for count below 10
                if count < 10:
                    header = '0' + str(count) + '_' + header
                else:
                    header = str(count) + '_' + header
                # remove all spaces + add extension in header
                header = header.replace(' ', '_')
                header = header + '.xhtml'
                count = count + 1
                #create two parallel lists used later
                out_path = dir + os.sep + header
                outfp = open(out_path, 'wt', encoding=('utf-8'))
                file_path_names.insert(t_count, out_path)
                pure_header_names.insert(t_count, line_save)
                t_count = t_count + 1
                # Add html meta headers and write it
                outfp = addMainHeaders(outfp)
                outfp.write(line)
                write_bodytext = True
            # add header bodytext
            elif write_bodytext == True:
                outfp.write(line)
    # now add html titles and close the html tails on all files
    max_num_files = len(file_path_names)
    tmp = dir + os.sep + 'temp1.tmp'
    i = 0
    while i < max_num_files:
        outfp = open(tmp, 'wt', encoding=('utf-8'))
        infp = open(file_path_names[i], 'rt', encoding=('utf-8'))
        for line in infp:
            if '<title>' in line:
                line = line.strip(' ')
                line = line.replace('<title></title>', '<title>' + pure_header_names[i] + '</title>')
                outfp.write(line)
            else:
                outfp.write(line)
        # add the html tail
        if '</body>' in line or '</html>' in line:
            pass
        else:
            outfp.write(' </body>' + '\n</html>')
        # clean up
        infp.close()
        outfp.close()
        shutil.copy2(tmp, file_path_names[i])
        os.remove(tmp)
        i = i + 1
    # now rename just the title page
    if os.path.isfile(file_path_names[0]):
        title_page_name = file_path_names[0]
        new_title_page_name = dir + os.sep + '01_Title.xhtml'
        os.rename(title_page_name, new_title_page_name)
        file_path_names[0] = '01_Title.xhtml'
    else:
        logmsg27(DEBUG_FLAG)
        os._exit(0)
    # xhtml file is no longer needed
    if os.path.isfile(inpath):
        os.remove(inpath)
    # returned list values are also used
    # later to create epub opf and ncx files
    return(file_path_names, pure_header_names)
@Hai Vu and @Seth -- Thanks for all your help.

Related

How to manipulate txt line by line and export to new destination

I have a folder with source txt files and a destination folder. The source txt files could look like these two examples:
File1:
0;122214;stringvalue1;10;string;value;1012;1014
0;1222155;stringvalue20;10;anotherstring;v;value;10000015;0
0;1222155;stringvalue20;10;anotherstring;with;;;;;;;;;;;;;;a lot of ;;;;;;;;;;;;;;;;;;;;value;10000016;0
0;1222155;stringvalue20;10;;value;7;0
File2:
0;122214;stringvalue1;10;string;value;1012;1014
0;1222155;stringvalue20;10;anotherstring;v;value;10000015;0
0;1222155;stringvalue20;10;anotherstring;with;;;;;;;;;;;;;;a lot of ;;;;;;;;;;;;;;;;;;;;value;10000016;0
0;1222155;stringvalue20;10;;value;7;0
1;122214;stringvalue1;10;string;value;1012;1014
1;1222155;stringvalue20;10;another;"string;v;value;10000015;0
1;1222155;stringvalue20;10;anoth";erstring;with;;;;;;;;;;;;;;a lot of ;;;;;;;;;;;;;;;;;;;;value;10000016;0
1;1222155;stringvalue20;10;--;value;7;0
I have a code now which inserts quote characters into a specific column. My current code is as follows:
import glob
import os
def findnth(string, substring, n):
    parts = string.split(substring, n + 1)
    if len(parts) <= n + 1:
        return -1
    return len(string) - len(parts[-1]) - len(substring)
path = "D:\source\*.txt"
path2 = "D:\destination"
for fname in glob.glob(path):
    with open(fname) as f:
        content = f.readline()
        content2 = content[:findnth(content, ";", 3)+1]+'"'+content[findnth(content, ";", 3)+1:(len(content)-findnth(content[::-1], ";", 2))-1]+'"'+content[(len(content)-findnth(content[::-1], ";", 2))-1:]
        print(content2)
    with open(os.path.join(path2,os.path.basename(fname)), "w") as output:
        output.write(content2)
The code works and no errors result. However, only the first line of each file is written to a new file:
0;122214;stringvalue1;10;"string";value;1012;1014
0;122214;stringvalue1;10;"string";value;1012;1014
So inserting the '"' works, but I have problems doing it line by line and exporting the result to a new file. I tried read, readline and readlines, but did not get it working. How can I make the code run for all lines and not just the first line of each file? Furthermore, I do not want empty lines between the lines in the final file.
Update:
Desired output:
File1:
0;122214;stringvalue1;10;"string";value;1012;1014
0;1222155;stringvalue20;10;"anotherstring;v";value;10000015;0
0;1222155;stringvalue20;10;"anotherstring;with;;;;;;;;;;;;;;a lot of ;;;;;;;;;;;;;;;;;;;";value;10000016;0
0;1222155;stringvalue20;10;"";value;7;0
File2:
0;122214;stringvalue1;10;"string";value;1012;1014
0;1222155;stringvalue20;10;"anotherstring;v";value;10000015;0
0;1222155;stringvalue20;10;"anotherstring;with;;;;;;;;;;;;;;a lot of ;;;;;;;;;;;;;;;;;;;";value;10000016;0
0;1222155;stringvalue20;10;"";value;7;0
1;122214;stringvalue1;10;"string";value;1012;1014
1;1222155;stringvalue20;10;"another;"string;v";value;10000015;0
1;1222155;stringvalue20;10;"anoth";erstring;with;;;;;;;;;;;;;;a lot of ;;;;;;;;;;;;;;;;;;;";value;10000016;0
1;1222155;stringvalue20;10;"--";value;7;0

I think the problem is solved now. I've tried this on my system and it works:
import glob
import os
def findnth(string, substring, n):
    parts = string.split(substring, n + 1)
    if len(parts) <= n + 1:
        return -1
    return len(string) - len(parts[-1]) - len(substring)
# raw strings, so the backslashes in the Windows paths are taken literally
path = r"D:\source\*.txt"
path2 = r"D:\destination"
for fname in glob.glob(path):
    newcontent = ""
    with open(fname) as f:
        content = f.read().splitlines()
    for line in content :
        content2 = line[:findnth(line, ";", 3)+1]+'"'+line[findnth(line, ";", 3)+1:(len(line)-findnth(line[::-1], ";", 2))-1]+'"'+line[(len(line)-findnth(line[::-1], ";", 2))-1:]
        print(content2)
        newcontent = newcontent + content2 + "\n"
    with open(os.path.join(path2,os.path.basename(fname)), "w") as output:
        output.write(newcontent)
Explanation:
The variable content holds the lines of the text file as a list.
We then iterate over the lines and insert the quotation marks at the correct places; the result for each line is stored in content2.
The newcontent variable accumulates the contents of the text file with the quotation marks added.
At the beginning newcontent is set to "", an empty string. Each processed line is then appended to it: newcontent = newcontent + content2 + "\n" adds the new line plus a "\n" line break to what has been collected so far.
After the whole text file has been processed, the result is written to a new file in a separate directory.
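The slicing expression is hard to read, so here is an alternative sketch of the same per-line transformation based on splitting the fields. The names quote_middle and quote_file are made up for illustration, and the sketch assumes (as in the sample data) that the part to quote lies between the 4th semicolon from the left and the 3rd semicolon from the right.

```python
def quote_middle(line, head=4, tail=3, sep=';'):
    """Wrap everything between the first `head` fields and the last `tail`
    fields in double quotes. Lines with too few fields are returned as-is."""
    fields = line.split(sep)
    if len(fields) < head + tail + 1:
        return line
    middle = sep.join(fields[head:-tail])
    return sep.join(fields[:head] + ['"' + middle + '"'] + fields[-tail:])


def quote_file(src_path, dst_path):
    """Apply quote_middle to every line of src_path and write to dst_path."""
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for line in src:
            # Strip the line break before processing and add exactly one
            # back, so no empty lines appear in the output.
            dst.write(quote_middle(line.rstrip('\n')) + '\n')
```

Writing each line as soon as it is processed avoids building the whole output in one string, which matters only for large files.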

replace line if found or append - python

I have text that is key-value pairs separated by '='. I would like to replace the line if the key matches. If not, I would like to append it at the bottom. I've tried several ways, including:
def split_command_key_and_value(command):
    if '=' in command:
        command2 = command.split('=')
        return command2
def test(command, path):
    command2 = split_command_key_and_value(command)
    pattern = command2[0]
    myfile = open(path,'r') # open file handle for read
    # use r'', you don't need to replace '\' with '/'
    result = open(path, 'w') # open file handle for write
    for line in myfile:
        line = line.strip() # it's always a good habit to strip what you read from files
        if pattern in line:
            line = command # if match, replace line
        result.write(line) # write every line
    myfile.close() # don't forget to close file handle
    result.close()
I know the above is just to replace text, but it deletes the text in the file, and I can't see why. Could someone point me in the right direction?
Thanks
Update:
I'm almost there, but some of my lines have similar keys, so multiple lines are matching when only one should. I've tried to incorporate a regex boundary in my loop with no luck. My code is below. Does anyone have a suggestion?
There is some text in the file that isn't key-value, so I would like to skip that.
def modify(self, name, value):
    comb = name + ' ' + '=' + ' ' + value + '\n'
    with open('/file/', 'w') as tmpstream:
        with open('/file/', 'r') as stream:
            for line in stream:
                if setting_name in line:
                    tmpstream.write(comb)
                else:
                    tmpstream.write(line)
I think I got it. See code below.
def modify(self, name, value):
    comb = name + ' ' + '=' + ' ' + value + '\n'
    mylist = []
    with open('/file/', 'w') as tmpstream:
        with open('/file/', 'r') as stream:
            for line in stream:
                a = line.split()
                b = re.compile('\\b'+name+'\\b')
                if len(a) > 0:
                    if b.search(a[0]):
                        tmpstream.write(comb)
                    else:
                        tmpstream.write(line)
I spoke too soon. It stops at the key-value I provide. So, it only writes one line, and doesn't write the lines that don't match.
def modify(name, value):
    comb = name + ' ' + '=' + ' ' + value + '\n'
    mylist = []
    with open('/file1', 'w') as tmpstream:
        with open('/file2', 'r') as stream:
            for line in stream:
                a = line.split()
                b = re.compile('\\b'+name+'\\b')
                if len(a) > 0:
                    if b.search(a[0]):
                        tmpstream.write(comb)
                    else:
                        tmpstream.write(line)
Can anyone see the issue?

Because when you open the file for writing
result = open(path, 'w') # open file handle for write
you erase its content. Try writing to a different file and, once all the work is done, replace the old file with the new one. Or read all the data into memory first, then process it and write it back to the file.
with open(path) as f:
    data = f.read()
with open(path, 'w') as f:
    for l in data.splitlines():
        # do the work on l here, then write it back
        f.write(l + '\n')

First of all, you are reading and writing the same file ...
You could first read it all and then write it line by line:
with open(path,'r') as f:
    myfile = f.read() # read everything in the variable "myfile"
result = open(path, 'w') # open file handle for write
for line in myfile.splitlines(): # process the original file content 1 line at a time
    # as before

I strongly recommend reading python's documentation on how to read and write files.
If you open an existing file in write-mode open(path, 'w'), its content will be erased:
mode can be (...) 'w' for only writing (an existing file with the same name will be erased)
To replace a line in python you can have a look at this: Search and replace a line in a file in Python
Here is one of the solutions provided there, adapted to your context (tested for python3):
from tempfile import mkstemp
from shutil import move
from os import close
def test(filepath, command):
    # Split command into key/value (only at the first '=')
    key, _ = command.split('=', 1)
    matched_key = False
    # Create a temporary file
    fh, tmp_absolute_path = mkstemp()
    with open(tmp_absolute_path, 'w') as tmp_stream:
        with open(filepath, 'r') as stream:
            for line in stream:
                if key in line:
                    matched_key = True
                    tmp_stream.write(command + '\n')
                else:
                    tmp_stream.write(line)
        if not matched_key:
            tmp_stream.write(command + '\n')
    close(fh)
    move(tmp_absolute_path, filepath)
Note that with the code above every line that matches key (key=blob or blob=key) will be replaced.
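If only an exact key should match (so that port does not also replace port_ssl), the key on each line can be compared after splitting at the first '='. The following is a sketch under stated assumptions: the function name set_key and the "key = value" output format are illustrative choices, and lines without '=' are treated as non key-value text and left untouched.

```python
import os
import tempfile


def set_key(path, key, value):
    """Replace the line whose key equals `key`, or append one at the bottom.
    Returns True if an existing line was replaced."""
    new_line = '{} = {}\n'.format(key, value)
    with open(path, 'r') as stream:
        lines = stream.readlines()
    found = False
    for i, line in enumerate(lines):
        name, sep, _ = line.partition('=')
        # sep is empty for lines without '=', which are skipped
        if sep and name.strip() == key:
            lines[i] = new_line
            found = True
    if not found:
        if lines and not lines[-1].endswith('\n'):
            lines[-1] += '\n'
        lines.append(new_line)
    # Write to a temporary file in the same directory, then swap it in,
    # so the original is never truncated before the new content is ready.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, 'w') as tmp:
        tmp.writelines(lines)
    os.replace(tmp_path, path)
    return found
```

Reading the whole file first is what makes it safe to write back to the same path; opening it with 'w' while still reading is what emptied the file in the question.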

Cleaning up a messy data file to a more readable format in Python?

I have a text file (heavily modified for this example) which has some data that I want to extract and do some calculations with it. However the text file is extremely messy, so I'm trying to clean it up and write it out to new files first.
Here is the .txt file I'm working with: http://textuploader.com/5elql
I am trying to extract the data which is under the titles (called “Important title”). The only reliable way to do that is to first locate a string which always occurs in the file, “DATASET”, because the mess above and below the important data covers an arbitrary number of lines and is difficult to remove manually. Once that’s done I want to store the data in separate files so that it is easier to analyse, like this:
http://textuploader.com/5elqw
The file names will be concatenated with the title + the date.
Here is what I have tried so far
with open("example.txt") as file:
    for line in file:
        if line.startswith('DATASET:'):
            fileTitle = line[9:]
        if line.startswith("DATE:"):
            fileDate = line[:]
            print(fileTitle+fileDate)
OUTPUT
IMPORTANT TITLE 1
DATE: 12/30/2015
IMPORTANT TITLE 2
DATE: 01/03/2016
So it appears my loop manages to locate the lines where the titles inside the file are and print them out. But this is where I run out of steam. I have no idea on how to extract the data under those titles from there onwards. I have tried using file.readlines() but it outputs all the mess that is in between Important Title 1 and Important Title 2.
Any advice on how I can read all the data under the titles and output them into separate files? Thanks for your time.

You could use regex.
import re
pattern = r"(\s+X\s+Y\s*)|(\s*\d+\s+\d+\s*)"
prog = re.compile(pattern)
with open("example.txt") as file:
    cur_filename = ''
    content = ""
    for line in file:
        if line.startswith('DATASET:'):
            fileTitle = line[9:]
        elif line.startswith("DATE:"):
            fileDate = line[6:]
            cur_filename = (fileTitle.strip() + fileDate.strip()).replace('/', '-')
            print(cur_filename)
            content_title = fileTitle + line
        elif prog.match(line):
            content += line
        elif cur_filename and content:
            with open(cur_filename, 'w') as fp:
                fp.write(content_title)
                fp.write(content)
            cur_filename = ''
            content = ''

I don't know exactly how you want to store your data but assuming you want a dictionary you could use regex to check if the incoming line matched the pattern, then because fileTitle isn't global you could use that as the key and add the values. I also added rstrip('\r\n') to remove the newline characters after fileTitle.
import re
#if you don't want to store the X and Y, just use re.compile(r'\d\s+\d+')
p = re.compile(r'(\d\s+\d+)|(X\s+Y)')
data={}
with open("input.txt") as file:
    for line in file:
        if line.startswith('DATASET:'):
            fileTitle = line[9:].rstrip('\r\n')
        if line.startswith("DATE:"):
            fileDate = line[:]
            print(fileTitle+fileDate)
        if p.match(line):
            if fileTitle not in data:
                data[fileTitle]=[]
            line=line.rstrip('\r\n')
            data[fileTitle].append(line.split('\t'))
            if len(data[fileTitle][len(data[fileTitle])-1]) == 3:
                data[fileTitle][len(data[fileTitle])-1].pop()
print(data)

Yet another regex solution:
import re
sep = '*************************\n'
pattern = r'DATASET[^%]*'
good_stuff = re.compile(pattern)
pattern = r'^DATASET: (.*?)$'
title = re.compile(pattern, flags = re.MULTILINE)
pattern = r'^DATE: (.*?)$'
date = re.compile(pattern, flags = re.MULTILINE)
with open(r'foo.txt') as f:
    data = f.read()
for match in good_stuff.finditer(data):
    data = match.group()
    important_title = title.search(data).group(1)
    important_date = date.search(data).group(1)
    important_date = important_date.replace(r'/', '-')
    fname = important_title + important_date + '.txt'
    print(sep, fname)
    print(data)
    ##with open(fname, 'w') as f:
    ##    f.write(data)
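The answers above share one idea: walk through the lines, remember the current title and date, and collect only the rows that look like data. Here is a sketch of that idea as a function that returns the collected rows per output file name instead of writing files. The input format is an assumption inferred from the answers (a "DATASET:" line, later a "DATE:" line, then rows that are either "X Y" or two integers); the linked sample file is not reproduced here, and the name split_datasets is made up.

```python
import re

# A data row is either the "X Y" column header or two integers.
ROW = re.compile(r'\s*(X\s+Y|\d+\s+\d+)\s*$')


def split_datasets(lines):
    """Group data rows by '<title> <date>.txt'. Lines that are neither
    markers nor data rows are ignored."""
    datasets = {}
    title = None
    current = None
    for line in lines:
        if line.startswith('DATASET:'):
            title = line[len('DATASET:'):].strip()
            current = None  # wait for the DATE line before collecting rows
        elif line.startswith('DATE:') and title is not None:
            date = line[len('DATE:'):].strip().replace('/', '-')
            current = datasets.setdefault('{} {}.txt'.format(title, date), [])
        elif current is not None and ROW.match(line):
            current.append(line.strip())
    return datasets
```

Each key of the returned dictionary can then be used as a file name and its rows written out with '\n'.join(rows).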

Double quotes keep coming in csv when merging file in python, how to remove?

I have many files with URL links and I want to merge them all into one big file. The links in the individual files do not have double quotes around them. The merged file somehow ended up with double quotes around each link in the final csv (MergedURLs.csv). I read the csv module documentation for python and added the line "writerFile = csv.QUOTE_NONE", but it made no difference.
import csv
def mergeFile(a, b, x, y):
    for loop1 in range(a, b):
        for loop2 in range(x, y):
            try:
                fileName1 = "FoundValidURLs_"
                fileName2 = "_"
                fileName3 = ".csv"
                fileNameComplete = fileName1 + str(loop1) + fileName2 + str(loop2) + fileName3
                with open(fileNameComplete, "rb") as f:
                    for URLrecords in f:
                        with open("MergedURLs.csv", "ab") as fi:
                            writeFile = csv.writer(fi)
                            writerFile = csv.QUOTE_NONE
                            writeFile.writerow([URLrecords])
            except IOError:
                continue
            loop2 += 1
        loop1 += 1
mergeFile(1, 2, 1, 3)

This seems to be working now. One extra line was needed to remove the unwanted double quote (hidden, and only at the end of each URL).
with open(fileNameComplete, "rb") as f:
    for URLrecords in f:
        with open("MergedURLs.csv", "ab") as fi:
            writeFile = csv.writer(fi)
            # strip the last character (the trailing newline); a newline inside
            # a field is what makes the csv writer add the quotes
            URLrecords_strip = URLrecords[0:-1]
            writeFile.writerow([str(URLrecords_strip)])

You need to pass csv.QUOTE_NONE as an argument to your writer, not assign to a separate variable:
writeFile = csv.writer(fi, quoting=csv.QUOTE_NONE)
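Putting the two answers together, a Python 3 sketch of the merge could look like the following. The function name merge_url_files and its parameters are illustrative. Note that with csv.QUOTE_NONE the writer raises an error if a field contains a character that would need quoting, unless an escapechar is given; the backslash used here is an assumption and would show up in the output before any comma inside a URL.

```python
import csv


def merge_url_files(paths, out_path):
    """Append every non-empty line of each input file to out_path as one
    unquoted single-column row. Returns the number of rows written."""
    count = 0
    # newline='' is what the csv module expects for files it writes to
    with open(out_path, 'w', newline='') as out:
        writer = csv.writer(out, quoting=csv.QUOTE_NONE, escapechar='\\')
        for path in paths:
            try:
                with open(path, 'r') as src:
                    for line in src:
                        # drop the line break so it never ends up in the field
                        url = line.rstrip('\r\n')
                        if url:
                            writer.writerow([url])
                            count += 1
            except IOError:
                # skip input files that do not exist, as in the question
                continue
    return count
```

Opening the output file once, outside the loops, also avoids reopening MergedURLs.csv for every single URL.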

How to use Python to find a string in a line and change the text n lines after the string

I need to find every instance of "translate" in a text file and replace a value 4 lines after finding the text:
"(many lines)
}
}
translateX xtran
{
keys
{
k 0 0.5678
}
}
(many lines)"
The value 0.5678 needs to be 0. It will always be 4 lines below the "translate" string
The file has up to about 10,000 lines.
example text file name: 01F.pz2.
I'd also like to cycle through the folder and repeat the process for every file with the pz2 extension (up to 40).
Any help would be appreciated!
Thanks.

I'm not quite sure about the logic for replacing 0.5678 in your file, so I put it in a function: change it to whatever you need, or explain in more detail what you want. The last number in the line? Only a floating-point number?
Try:
import os
dirname = "14432826"
lines_distance= 4
def replace_whatever(line):
    # Put your logic for replacing here
    return line.replace("0.5678", "0")
for filename in filter(lambda x:x.endswith(".pz2") and not x.startswith("m_"), os.listdir(dirname)):
    print(filename)
    with open(os.path.join(dirname, filename), "r") as f_in, open(os.path.join(dirname,"m_%s" % filename), "w") as f_out:
        replace_tasks = []
        for line in f_in:
            # search marker in line
            if line.strip().startswith("translate"):
                print("Found marker in", line, end="")
                replace_tasks.append(lines_distance)
            # replace if necessary
            if len(replace_tasks)>0 and replace_tasks[0] == 0:
                del replace_tasks[0]
                print("line to change is", line, end="")
                line_to_write = replace_whatever(line)
            else:
                line_to_write = line
            # Write to output
            f_out.write(line_to_write)
            # decrease counters
            for i, task in enumerate(replace_tasks):
                replace_tasks[i] -= 1
The comments within the code should help with understanding it. The main concept is the list replace_tasks, which keeps a record of when the next line to modify will come.
Remarks: Your code sample suggests that the data in your file is structured. It will definitely be safer to read this structure and work on it instead of using a search-and-replace approach on a plain text file.

Thorsten, I renamed my original files to have the .old extension and the following code works:
import os
target_dir = "."
# cycle through files
for path, dirs, files in os.walk(target_dir):
    # file is the current file name
    for file in files:
        # get the filename and extension
        filename, ext = os.path.splitext(file)
        # see if the file is a renamed original (.old)
        if ext.endswith('.old') :
            # build the old and new file names
            oldfilename = filename + ".old"
            newfilename = filename + ".pz2"
            old_filepath = os.path.join(path, oldfilename)
            new_filepath = os.path.join(path, newfilename)
            # open the old file for reading
            oldpz2 = open (old_filepath,"r")
            # open the new file for writing
            newpz2 = open (new_filepath,"w")
            # reset changeline
            changeline = 0
            currentline = 0
            # cycle through old lines
            for line in oldpz2 :
                currentline = currentline + 1
                if line.strip().startswith("translate"):
                    changeline = currentline + 4
                if currentline == changeline :
                    newpz2.write(" k 0 0\n")
                else :
                    # line already ends with a newline, so write it unchanged
                    newpz2.write(line)
            oldpz2.close()
            newpz2.close()
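The core of both answers, "remember where the marker was and change the line a fixed distance below it", can be isolated in a small function that works on a list of lines. This is a sketch: the name zero_after_marker is made up, and it assumes the value to reset is the last whitespace-separated token on the target line, which holds for the "k 0 0.5678" example but may not for other data.

```python
import re


def zero_after_marker(lines, marker='translate', distance=4):
    """Return a copy of lines where, `distance` lines after each line that
    starts with `marker`, the last token is replaced by 0."""
    result = []
    pending = set()  # indexes of lines that still have to be changed
    for index, line in enumerate(lines):
        if line.strip().startswith(marker):
            pending.add(index + distance)
        if index in pending:
            # keep indentation and the trailing newline, swap the last token
            line = re.sub(r'\S+(\s*)$', r'0\1', line, count=1)
        result.append(line)
    return result
```

Unlike the hard-coded " k 0 0" replacement, this keeps the original indentation and the other tokens on the line.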
