I'm working on a small content-analysis program that I'd like to run over several PDF files, returning the summed frequencies with which specific words are mentioned in the text. The words to search for are specified in a separate text file (list.txt) and can be altered. The program runs just fine on .txt files, but the result is completely different when it runs on a .pdf file. To illustrate, the test text I am running the program through is the following:
"Hello
This is a product development notice
We’re working with innovative measures
A nice Innovation
The world that we live in is innovative
We are currently working on a new process
And in the fall, you will experience our new product development introduction"
The list of words, grouped into categories (marked in the .txt file with ">>"), is the following:
innovation: innovat
product: Product, development, introduction
organization: Process
The output from running the code with a .txt file is the following:
Whereas the output from running it with a .pdf is the following:
As you can see, my issue pertains to the splitting of the words: in the .pdf output a string like "world" can be split into 'w', 'o', 'rld'. I have searched tirelessly for why this happens, without success. As I am rather new to Python programming, I would appreciate any answer, or a pointer to a source where I can find one, should you know any.
Thanks
The code for the .txt is as follows:
import string, re, os
import PyPDF2

dictfile = open('list.txt')
lines = dictfile.readlines()
dictfile.close()

dic = {}
scores = {}

i = 2011
while i < 2012:
    f = 'annual_report_' + str(i) + '.txt'
    textfile = open(f)
    text = textfile.read().split()  # lowercase the text
    print (text)
    textfile.close()
    i = i + 1

    # a default category for simple word lists
    current_category = "Default"
    scores[current_category] = 0

    # import the dictionary
    for line in lines:
        if line[0:2] == '>>':
            current_category = line[2:].strip()
            scores[current_category] = 0
        else:
            line = line.strip()
            if len(line) > 0:
                pattern = re.compile(line, re.IGNORECASE)
                dic[pattern] = current_category

    # examine the text
    for token in text:
        for pattern in dic.keys():
            if pattern.match(token):
                categ = dic[pattern]
                scores[categ] = scores[categ] + 1

    print (os.path.basename(f))
    for key in scores.keys():
        print (key, ":", scores[key])
While the code for the .pdf is as follows:
import string, re, os
import PyPDF2

dictfile = open('list.txt')
lines = dictfile.readlines()
dictfile.close()

dic = {}
scores = {}

i = 2011
while i < 2012:
    f = 'annual_report_' + str(i) + '.pdf'
    textfile = open(f, 'rb')
    text = PyPDF2.PdfFileReader(textfile)  # lowercase the text
    for pageNum in range(0, text.numPages):
        texts = text.getPage(pageNum)
        textfile = texts.extractText().split()
        print (textfile)
    i = i + 1

    # a default category for simple word lists
    current_category = "Default"
    scores[current_category] = 0

    # import the dictionary
    for line in lines:
        if line[0:2] == '>>':
            current_category = line[2:].strip()
            scores[current_category] = 0
        else:
            line = line.strip()
            if len(line) > 0:
                pattern = re.compile(line, re.IGNORECASE)
                dic[pattern] = current_category

    # examine the text
    for token in textfile:
        for pattern in dic.keys():
            if pattern.match(token):
                categ = dic[pattern]
                scores[categ] = scores[categ] + 1

    print (os.path.basename(f))
    for key in scores.keys():
        print (key, ":", scores[key])
I'm trying to create a function that accepts a file as input and prints the number of lines that are full-line comments (i.e. the line begins with # followed by some comment).
For example a file that contains say the following lines should print the result 2:
abc
#some random comment
cde
fgh
#another random comment
So far I have tried something along these lines, but it just isn't picking up the hash symbol:
infile = open("code.py", "r")
line = infile.readline()

def countHashedLines(filename):
    while line != "":
        hashes = '#'
        value = line
        print(value)  # here you will get all
        # if (value == hashes): tried this but just wasn't working
        #     print("hi")
        for line in value:
            line = line.split('#', 1)[1]
            line = line.rstrip()
            print(value)
        line = infile.readline()
    return()
Thanks in advance,
Jemma
I re-worded a few statements for readability (subjective), but this will give you the desired output.
def countHashedLines(lines):
    tally = 0
    for line in lines:
        if line.startswith('#'): tally += 1
    return tally

infile = open('code.py', 'r')
all_lines = infile.readlines()
num_hash_nums = countHashedLines(all_lines)  # <- 2
infile.close()
...or if you want a compact and clean version of the function...
def countHashedLines(lines):
    return len([line for line in lines if line.startswith('#')])
I would pass the file through standard input
import sys

count = 0
for line in sys.stdin:   # Note: you could also open the file and iterate through it
    if line[0] == '#':   # every time a line begins with #
        count += 1       # increment
print(count)
Here is another solution that uses regular expressions and will detect comments that have white space in front.
import re

def countFullLineComments(infile):
    count = 0
    p = re.compile(r"^\s*#.*$")
    for line in infile.readlines():
        m = p.match(line)
        if m:
            count += 1
            print(m.group(0))
    return count

infile = open("code.py", "r")
print(countFullLineComments(infile))
I'm fairly new to python programming and I'm trying to learn File I/O as best I can.
I am currently in the process of making a simple program to read from a text document and print out the result. So far I've been able to create this program with the help of many resources and questions on this website.
However I'm curious on how I can read from a text document for multiple individual strings and save the resulting strings to a text document.
The program below is one I've made that allows me to search a text document for a keyword and print the results between those keywords into another text file. However, I can only do one set of starting and ending keywords per search:
from Tkinter import *
import tkSimpleDialog
import tkMessageBox
from tkFileDialog import askopenfilename
import re  # needed for re.search below

root = Tk()
w = Label(root, text="Configuration Inspector")
w.pack()
tkMessageBox.showinfo("Welcome", "This is version 1.00 of Configuration Inspector Text")
filename = askopenfilename()        # Data Search Text File
outputfilename = askopenfilename()  # Output Text File
with open(filename, "rb") as f_input:
    start_token = tkSimpleDialog.askstring("Serial Number", "What is the device serial number?")
    end_token = tkSimpleDialog.askstring("End Keyword", "What is the end keyword")
    reText = re.search("%s(.*?)%s" % (re.escape(start_token + ",SHOWALL"), re.escape(end_token)), f_input.read(), re.S)
    if reText:
        output = reText.group(1)
        fo = open(outputfilename, "wb")
        fo.write(output)
        fo.close()
        print output
    else:
        tkMessageBox.showinfo("Output", "Sorry that input was not found in the file")
        print "not found"
So what this program does is: it allows a user to select a text document, search that document for a beginning keyword and an end keyword, and then print out everything between those two keywords into a new text document.
What I am trying to achieve is to allow a user to select a text document, search it for multiple sets of keywords, and print the results to the same output text file.
In other words let's say I have the following Text Document:
something something something something
something something something something STARTkeyword1 something
data1
data2
data3
data4
data5
ENDkeyword1
something something something something
something something something something STARTkeyword2 something
data1
data2
data3
data4
data5
Data6
ENDkeyword2
something something something something
something something something something STARTkeyword3 something
data1
data2
data3
data4
data5
data6
data7
data8
ENDkeyword3
I want to be able to search this text document with 3 different starting keywords and 3 different ending keywords, then print what's in between to the same output text file.
So for example my output text document would look something like:
something
data1
data2
data3
data4
data5
ENDkeyword1
something
data1
data2
data3
data4
data5
Data6
ENDkeyword2
something
data1
data2
data3
data4
data5
data6
data7
data8
ENDkeyword3
One brute-force method I've tried is a loop that has the user input a new keyword one at a time; however, whenever I try to write to the same output file, it overwrites the previous entry even when using append. Is there any way to let a user search a text document for multiple strings and print out the multiple results, with or without a loop?
----------------- EDIT:
Many thanks to all of you; with your tips I'm getting closer to a nice finalized version. This is my current code:
def process(infile, outfile, keywords):
    keys = [[k[0], k[1], 0] for k in keywords]
    endk = None
    with open(infile, "rb") as fdin:
        with open(outfile, "wb") as fdout:
            for line in fdin:
                if endk is not None:
                    fdout.write(line)
                    if line.find(endk) >= 0:
                        fdout.write("\n")
                        endk = None
                else:
                    for k in keys:
                        index = line.find(k[0])
                        if index >= 0:
                            fdout.write(line[index + len(k[0]):].lstrip())
                            endk = k[1]
                            k[2] += 1
    if endk is not None:
        raise Exception(endk + " not found before end of file")
    return keys

from Tkinter import *
import tkSimpleDialog
import tkMessageBox
from tkFileDialog import askopenfilename

root = Tk()
w = Label(root, text="Configuration Inspector")
w.pack()
tkMessageBox.showinfo("Welcome", "This is version 1.00 of Configuration Inspector ")
infile = askopenfilename()   # input file
outfile = askopenfilename()  # output file
start_token = tkSimpleDialog.askstring("Serial Number", "What is the device serial number?")
end_token = tkSimpleDialog.askstring("End Keyword", "What is the end keyword")
process(infile, outfile, ((start_token + ",SHOWALL", end_token),))
So far it works. However, now comes the part I'm getting lost on: multiple string inputs separated by a delimiter. So if I had inputted
STARTKeyword1, STARTKeyword2, STARTKeyword3, STARTKeyword4
into the program prompt, I want to be able to separate those keywords and place them into the
process(infile, outfile, keywords)
function, so that the user is only prompted once while still allowing multiple strings to be searched for in the files. I was thinking of using a loop, or splitting the separated inputs into an array.
If this question strays too far from the original, I will close this one and open another so I can give credit where credit is due.
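One possible sketch for the delimiter part (the helper name and prompt strings are illustrative, not from the thread): prompt once for the comma-separated start keywords and once for the end keywords, split both, and zip them into the pairs that process() expects.

```python
# Hypothetical helper: turn two comma-separated prompt strings into
# (start, end) pairs for process(). The ",SHOWALL" suffix mirrors how the
# start token is built elsewhere in this question.
def build_keyword_pairs(starts_raw, ends_raw, suffix=",SHOWALL"):
    starts = [s.strip() + suffix for s in starts_raw.split(",") if s.strip()]
    ends = [e.strip() for e in ends_raw.split(",") if e.strip()]
    if len(starts) != len(ends):
        raise ValueError("need exactly one end keyword per start keyword")
    return list(zip(starts, ends))

print(build_keyword_pairs("SN1, SN2", "ENDkeyword1, ENDkeyword2"))
# [('SN1,SHOWALL', 'ENDkeyword1'), ('SN2,SHOWALL', 'ENDkeyword2')]
```

The resulting list can be passed straight to process(infile, outfile, pairs).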
I would use a separate function that takes:
the path of the input file
the path of the output file
an iterable containing the (startkeyword, endkeyword) pairs
Then I would process the file line by line, copying each line if it lies between a start and an end keyword, and counting how many times each pair has been found. That way the caller knows which pairs were found and how many times each.
Here is a possible implementation:
def process(infile, outfile, keywords):
    '''Search through infile for whatever lies between a pair startkeyword
    (excluded) and endkeyword (included). Each chunk is copied to outfile and
    followed by an empty line.

    infile and outfile are strings representing file paths.
    keywords is an iterable containing pairs (startkeyword, endkeyword).
    Raises an exception if an endkeyword is not found before end of file.
    Returns a list of lists [startkeyword, endkeyword, nb of occurrences].'''
    keys = [[k[0], k[1], 0] for k in keywords]
    endk = None
    with open(infile, "r") as fdin:
        with open(outfile, "w") as fdout:
            for line in fdin:
                if endk is not None:
                    fdout.write(line)
                    if line.find(endk) >= 0:
                        fdout.write("\n")
                        endk = None
                else:
                    for k in keys:
                        index = line.find(k[0])
                        if index >= 0:
                            fdout.write(line[index + len(k[0]):].lstrip())
                            endk = k[1]
                            k[2] += 1
    if endk is not None:
        raise Exception(endk + " not found before end of file")
    return keys
I try to write to the same Output File in the Text document it will
over write the previous entry.
Have you tried using append rather than write?
f = open('filename', 'a')
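A minimal sketch of the difference (demo.txt is a throwaway name): 'w' truncates the file on each open, while 'a' keeps what is already there.

```python
# 'w' erases any existing content; 'a' appends to it.
with open("demo.txt", "w") as f:
    f.write("first\n")
with open("demo.txt", "a") as f:   # re-open in append mode
    f.write("second\n")
with open("demo.txt") as f:
    print(f.read())                # both lines survive
```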
This could be the beginning of something
import os

directory = os.getcwd()
path = directory + "/" + "textdoc.txt"
newFileName = directory + '/' + "data" + '.txt'
newFob = open(newFileName, 'w')
keywords = [["STARTkeyword1", "ENDkeyword1"], ["STARTkeyword2", "ENDkeyword2"]]
fob = open(path, 'r')
objectLines = fob.readlines()
startfound = False
for keyword in keywords:
    for line in objectLines:
        if startfound and keyword[1] in line:
            startfound = False
        if startfound:
            newFob.write(line)
        if keyword[0] in line:
            startfound = True
newFob.close()
If a text called textdoc.txt with the data you have provided is in the current directory, the script will create a file in the current directory called data.txt with the following output:
data1
data2
data3
data4
data5
data1
data2
data3
data4
data5
Data6
Ideally you would input more keywords, or allow the user to input some, and fill that up in the keywords list.
For each start/end pair, create an instance of a class, say Pair, with a feed(line) method, its start and end keys, and a buffer or list to store the data of interest.
Put all the Pair instances in a list and feed each of them every line as you scan the file. If an instance is not currently between its start and end keywords, it is inactive and forgets the line; otherwise feed adds it to its own StringBuffer or line list, or even directly to your output file.
Write your output file at the end.
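A minimal sketch of that Pair idea (the names and the exact feed() behavior are one possible reading of the description above):

```python
class Pair:
    def __init__(self, start, end):
        self.start = start
        self.end = end
        self.active = False
        self.lines = []               # data collected between start and end

    def feed(self, line):
        if self.active:
            if self.end in line:
                self.active = False   # end keyword reached; stop collecting
            else:
                self.lines.append(line)
        elif self.start in line:
            self.active = True        # start collecting from the next line

pairs = [Pair("STARTkeyword1", "ENDkeyword1")]
for line in ["x", "STARTkeyword1", "data1", "data2", "ENDkeyword1", "y"]:
    for p in pairs:
        p.feed(line)
print(pairs[0].lines)
# ['data1', 'data2']
```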
I am not 100% sure I understand the problem, but if I do, you can just keep a list of lists, one inner list per start/end keyword couple. For each word in the document, check whether it equals the first element of one of the inner lists (the start keyword). If it does, pop that keyword (making the end keyword the first element) and start saving all subsequent words into a string (a different string for each start/end couple). Once you hit the end keyword, pop it as well and remove the inner list that used to hold the couple from the surrounding list. At the end you should have 3 strings containing all the words between the different start/end keywords; just print all 3 of them to the file.
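A rough sketch of this word-by-word approach, assuming the document has already been split into words (the keyword names below are made up):

```python
def extract_between(words, pairs):
    pairs = [list(p) for p in pairs]   # mutable working copies
    collected = []                     # one joined string per completed pair
    active = None                      # the pair currently being filled
    buffer = []
    for word in words:
        if active is None:
            for p in pairs:
                if p and word == p[0]:     # start keyword found
                    p.pop(0)               # end keyword is now p[0]
                    active, buffer = p, []
                    break
        elif word == active[0]:            # end keyword found
            collected.append(" ".join(buffer))
            pairs.remove(active)           # this couple is done
            active = None
        else:
            buffer.append(word)
    return collected

words = "a START1 d1 d2 END1 b START2 d3 END2".split()
print(extract_between(words, [["START1", "END1"], ["START2", "END2"]]))
# ['d1 d2', 'd3']
```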
EDIT:
If the main problem is that you're not able to append to the file, but you're actually rewriting the file every time, try opening the file this way instead of the way you're doing it today:
fo = open(outputfilename, "ab")
From the python docs:
"The first argument is a string containing the filename. The second argument is another string containing a few characters describing the way in which the file will be used. mode can be 'r' when the file will only be read, 'w' for only writing (an existing file with the same name will be erased), and 'a' opens the file for appending; any data written to the file is automatically added to the end. 'r+' opens the file for both reading and writing. The mode argument is optional; 'r' will be assumed if it’s omitted."
I tested the code using the following example text file.
something something something something
something something something something
something something something something
something something something something
something something something something
something something something something STARTkeyword3 something
data3
data3
data3
data3
data3
ENDkeyword3
something something something something
something something something something
something something something something
something something something something STARTkeyword1 something
data1
data1
data1
ENDkeyword1
something something something something
something something something something
something something something something
something something something something
something something something something STARTkeyword2 something
data2
data2
data2
Data2
ENDkeyword2
something something something something
something something something something
something something something something
something something something something
something something something something
something something something something STARTkeyword3 something
data3
data3
data3
data3
data3
ENDkeyword3
Visit https://www.youtube.com/watch?v=D6LYa6X2Otg&feature=youtu.be
Here is the code:
## ---------------------------------------------------------------------------
## Created by: James P. Lopez
## Created date: 9/21/2020
## Modified date: 9/29/2020, 10/20/2020
## https://stackoverflow.com/questions/32097118/search-text-file-for-multiple-strings-and-print-out-results-to-a-new-text-file
## Search text file for multiple strings and print out results to a new text file
## ---------------------------------------------------------------------------
import time, datetime, string
############################################################################################
############ Adding date to end of output file name
start_c = datetime.datetime.now();##print start_c
c1 = (("%(start_c)s" % vars()).partition(' ')[0]);##print c1
new_str = string.replace(c1, '-', '_');new_str = "_"+new_str;##print(new_str)
#### Path
path = "S:\\PythonProjects\\TextFileReadWrite"
Bslash = "\\"
#### Text files
#### Read file
##filename1 = "openfilename1" ## This is the read file
#### Below sample with exceptions
filename1 = "openfilenameexception1" ## This is the read file
#### Write file
outputfilename1 = "writefilename2" ## This is the write file
#### Text file extension
extT = ".txt"
#### Full Text file name
filename = ("%(path)s%(Bslash)s%(filename1)s%(extT)s" % vars());print filename
outputfilename = ("%(path)s%(Bslash)s%(outputfilename1)s%(new_str)s%(extT)s" % vars())
print outputfilename
#### Sum rows in text file
with open(filename) as filename1:
    SumReadFile = sum(1 for _ in filename1)
filename1.close()
##print SumReadFile
############################################################################################
############ Create text file if it does not exist
############ or truncate if it does exist
outputfilename2=open('%(outputfilename)s' % vars(),'w')
outputfilename2.close()
#### Counters
foundstart = 0
CountAllStartKeys = 0
CountAllEndKeys = 0
CountAllBetweenData = 0
CountAllRecordsProcessed = 0
#### Set of keys
s1 = 'STARTkeyword1';s2 = 'STARTkeyword2';s3 = 'STARTkeyword3'
e1 = 'ENDkeyword1';e2 = 'ENDkeyword2';e3 = 'ENDkeyword3'
#### Opening output file append
outputFiles=open('%(outputfilename)s' % vars(),'a')
SetOfKeys = [(s1,e1),(s2,e2),(s3,e3)]
for keys in SetOfKeys:
    print(keys)
    search1 = keys[0]  ##print search1 ## This is the start key
    search2 = keys[1]  ##print search2 ## This is the end key
    with open("%(filename)s" % vars(), "r") as readFile1:
        for line in readFile1:
            print line
            CountAllRecordsProcessed += 1
            if foundstart == 0:
                if search1 in line:
                    #### It will write the start key
                    print ("Yes found start key %(search1)s within line = %(line)s" % vars())
                    outputFiles.write(line)
                    foundstart += 1; CountAllStartKeys += 1
                    ## time.sleep(2)
                    continue
            if foundstart >= 1:
                if search2 in line:
                    #### It will write the end key
                    print ("Yes found end key %(search2)s within line = %(line)s\n" % vars())
                    outputFiles.write(line)
                    foundstart = 0; CountAllEndKeys += 1
                    ## time.sleep(2)
                elif search1 in line:
                    #### It will append to output text file and write no end key found
                    print ("\nATTENTION! No matching end key within line = %(line)s\n" % vars())
                    print ("\nHowever, found start key %(search1)s within line = %(line)s\n" % vars())
                    outputFiles.write("\nNo matching end key within line = %(line)s\n" % vars())
                    outputFiles.write("\nHowever, found start key %(search1)s within line = %(line)s\n" % vars())
                    outputFiles.write(line)
                    CountAllStartKeys += 1
                    ## time.sleep(5)
                    continue
                else:
                    #### It will write the rows between start and end key
                    print ("Yes, found between data = %(line)s" % vars())
                    outputFiles.write(line)
                    CountAllBetweenData += 1
                    ## time.sleep(2)
    readFile1.close()
outputFiles.close()
print "\nFinished Text File Read and Write"
print "\nTotal Number of Start Key Words Processed = " + str(CountAllStartKeys)
print "Total Number of End Key Words Processed = " + str(CountAllEndKeys)
print "\nTotal Number of Between Data Processed = " + str(CountAllBetweenData)
print "\nTotal Sum of Lines in Read File = " + str(SumReadFile)
NumberOfSetOfKeys = CountAllRecordsProcessed / SumReadFile; ##print NumberOfSetOfKeys
print "Total Number of Set of Keys = " + str(NumberOfSetOfKeys)
print "Total Number of Records Processed = " + str(CountAllRecordsProcessed)
print ("\n%(SumReadFile)s multiplied by %(NumberOfSetOfKeys)s = %(CountAllRecordsProcessed)s" % vars())
This program will process the read file only once instead of my previous post which processed the start and end keys one at a time for a total of three full iterations. It has been tested with the sample data in my previous post.
Visit: https://youtu.be/PJjBftGhSNc
## ---------------------------------------------------------------------------
## Created by: James P. Lopez
## Created date: 9/21/2020
## Modified date: 10/1/2020, 10/20/2020
## https://stackoverflow.com/questions/32097118/search-text-file-for-multiple-strings-and-print-out-results-to-a-new-text-file
## Search text file for multiple strings and print out results to a new text file
## ---------------------------------------------------------------------------
import time, datetime, string
############################################################################################
############ Adding date to end of file name
start_c = datetime.datetime.now(); ##print start_c
c1 = (("%(start_c)s" % vars()).partition(' ')[0]); ##print c1
new_str = string.replace(c1, '-', '_');new_str = "_"+new_str;##print(new_str)
#### Path
path = "S:\\PythonProjects\\TextFileReadWrite"
Bslash = "\\"
#### Text files
#### Read file
##filename1 = "openfilename1" ## This is the read file
#### Below sample with exceptions
filename1 = "openfilenameexception1" ## This is the read file
#### Write file
outputfilename1 = "writefilename2" ## This is the write file
#### Full Text file name
filename = ("%(path)s%(Bslash)s%(filename1)s.txt" % vars());print filename
outputfilename = ("%(path)s%(Bslash)s%(outputfilename1)s%(new_str)s.txt" % vars())
print outputfilename
#### Counters
foundstart = 0
CountAllStartKeys = 0
CountAllEndKeys = 0
CountAllBetweenData = 0
CountAllRecordsProcessed = 0
#### Start Key Not Found
SKNF = 0
#### Sum number or rows in text file
with open(filename) as filename1:
    SumReadFile = sum(1 for _ in filename1)
filename1.close()
print SumReadFile
#### Total set of keys
start1 = 'STARTkeyword1';start2 = 'STARTkeyword2';start3 = 'STARTkeyword3'
end1 = 'ENDkeyword1';end2 = 'ENDkeyword2';end3 = 'ENDkeyword3'
#### Count number of unique start and end keys
SK1 = 0;SK2 = 0;SK3 = 0;EK1 = 0;EK2 = 0;EK3 = 0
Keys = [start1,start2,start3,end1,end2,end3]
##print Keys
with open(filename) as filename1:
    for line in filename1:
        ## print line
        if any(word in line for word in Keys):
            if start1 in line:
                SK1 += 1
            elif start2 in line:
                SK2 += 1
            elif start3 in line:
                SK3 += 1
            elif end1 in line:
                EK1 += 1
            elif end2 in line:
                EK2 += 1
            elif end3 in line:
                EK3 += 1
filename1.close()
############################################################################################
############ Create if it does not exist or truncate if it does exist
outputfilename2=open('%(outputfilename)s' % vars(),'w')
outputfilename2.close()
### Opening output file to append data
outputFiles=open('%(outputfilename)s' % vars(),'a')
#### We are only checking for the start keys
StartKeys = [start1,start2,start3]
#### Opening and reading the first line in read text file
with open("%(filename)s" % vars(), "r") as readFile1:
    for line in readFile1:
        CountAllRecordsProcessed += 1
        #### We are checking if one of the StartKeys is in line
        if any(word in line for word in StartKeys):
            #### Setting the variables (s1 and e1) for the start and end keys
            if start1 in line:
                s1 = start1; e1 = end1; SKNF = 1  ##print ('%(s1)s , %(e1)s' % vars())
            elif start2 in line:
                s1 = start2; e1 = end2; SKNF = 1  ##print ('%(s1)s , %(e1)s' % vars())
            elif start3 in line:
                s1 = start3; e1 = end3; SKNF = 1  ##print ('%(s1)s , %(e1)s' % vars())
            ## time.sleep(2)
        if foundstart == 0 and SKNF <> 0:
            if s1 in line:
                #### It will append to output text file and write the start key
                print ("Yes found start key %(s1)s within line = %(line)s" % vars())
                outputFiles.write(line)
                foundstart += 1; CountAllStartKeys += 1
                ## time.sleep(2)
                continue
        if foundstart >= 1 and SKNF <> 0:
            if e1 in line:
                #### It will append to output text file and write the end key
                print ("Yes found end key %(e1)s within line = %(line)s" % vars())
                outputFiles.write(line)
                foundstart = 0; SKNF = 0; CountAllEndKeys += 1
                ## time.sleep(2)
            elif s1 in line:
                #### It will append to output text file and write no end key found
                print ("\nATTENTION! No matching end key within line = %(line)s\n" % vars())
                print ("\nHowever, found start key %(s1)s within line = %(line)s\n" % vars())
                outputFiles.write("\nNo matching end key within line = %(line)s\n" % vars())
                outputFiles.write("\nHowever, found start key %(s1)s within line = %(line)s\n" % vars())
                outputFiles.write(line)
                CountAllStartKeys += 1
                ## time.sleep(2)
                continue
            else:
                #### It will append to output text file and write the rows between start and end key
                print ("Yes found between data = %(line)s" % vars())
                outputFiles.write(line)
                CountAllBetweenData += 1
                ## time.sleep(2)
#### Closing read and write text files
readFile1.close()
outputFiles.close()
print "\nFinished Text File Read and Write"
print '\nTotal Number of Unique Start Keys'
print ("%(start1)s = " % vars())+str(SK1)
print ("%(start2)s = " % vars())+str(SK2)
print ("%(start3)s = " % vars())+str(SK3)
print "Total Number of Start Key Words Processed = " + str(CountAllStartKeys)
print '\nTotal Number of Unique End Keys'
print ("%(end1)s = " % vars())+str(EK1)
print ("%(end2)s = " % vars())+str(EK2)
print ("%(end3)s = " % vars())+str(EK3)
print "Total Number of End Key Words Processed = " + str(CountAllEndKeys)
print "\nTotal Number of Between Data Processed = " + str(CountAllBetweenData)
print "\nTotal Sum of Lines in Read File = " + str(SumReadFile)
print "Total Number of Records Processed = " + str(CountAllRecordsProcessed)
I have a config file:
$ cat ../secure/test.property
#<TITLE>Connection setting
#MAIN DEV
jdbc.main.url=
jdbc.main.username=
jdbc.main.password=
#<TITLE>Mail settings
mail.smtp.host=127.0.0.1
mail.smtp.port=25
mail.smtp.on=false
email.subject.prefix=[DEV]
#<TITLE>Batch size for package processing
exposureImportService.batchSize=10
exposureImportService.waitTimeInSecs=10
ImportService.batchSize=400
ImportService.waitTimeInSecs=10
#<TITLE>Other settings
usePrecalculatedAggregation=true
###################### Datasource wrappers, which allow to log additional information
bean.datasource.query_log_wrapper=mainDataSourceWrapper
bean.gpc_datasource.query_log_wrapper=gpcDataSourceWrapper
time.to.keep.domain=7*12
time.to.keep.uncompress=1
#oracle max batch size
dao.batch.size.max=30
And I have a function which returns a line such as "#<TITLE>Other settings", to select a "config section".
Next, I need to print all lines between the selected "section" and the next line starting with #<TITLE>.
How can this be realized?
P.S.
def select_section(property_file):
    while True:
        with open(os.path.join(CONF_DIR, property_file), 'r+') as file:
            text = file.readlines()
        list = []
        print()
        for i in text:
            if '<TITLE>' in i:
                line = i.lstrip('#<TITLE>').rstrip('\n')
                list.append(line)
                print((list.index(line)), line)
        res_section = int(raw_input('\nPlease, select section to edit: '))
        print('You selected: %s' % list[res_section])
        if answer('Is it OK? '):
            return(list[res_section])
            break
And it works like:
...
0 Connection setting
1 Mail settings
2 Batch size for package processing
3 Other settings
Please, select section to edit:
...
And the expected output, if I select Connection setting:
...
0 jdbc.main.url
1 jdbc.main.username
2 jdbc.main.password
Please, select line to edit:
...
If I understand the problem correctly, here's a solution that assembles the requested section as it reads the file:
def get_section(section):
    marker_line = '#<TITLE>{}'.format(section)
    in_section = False
    section_lines = []
    with open('test.property') as f:
        while True:
            line = f.readline()
            if not line:
                break
            line = line.rstrip()
            if line == marker_line:
                in_section = True
            elif in_section and line.startswith('#<TITLE>'):
                break
            if in_section:
                if not line or line.startswith('#'):
                    continue
                section_lines.append(line)
    return '\n'.join(['{} {}'.format(i, line)
                      for i, line in enumerate(section_lines)])

print get_section('Connection setting')
Output:
0 jdbc.main.url=
1 jdbc.main.username=
2 jdbc.main.password=
Perhaps this will get you started.
Here's a quick solution:
def get_section(section):
    results = ''
    with open('../secure/test.property') as f:
        lines = [l.strip() for l in f.readlines()]
        indices = [i for i in range(len(lines)) if lines[i].startswith('#<TITLE>')]
        for i in xrange(len(indices)):
            if lines[indices[i]] == '#<TITLE>' + section:
                for j in xrange(indices[i], indices[i+1] if i < len(indices)-1 else len(lines) - 1):
                    results += lines[j] + '\n'
                break
    return results
You can use it like:
print get_section('Connection setting')
Not very elegant but it works!
This is what I am doing:
import csv

output = open('output.txt', 'wb')

# this function returns the min for num.txt
def get_min(num):
    return int(open('%s.txt' % num, 'r+').readlines()[0])

# temporary variables
last_line = ''
input_list = []

# iterate over input.txt and sort the input into a list of tuples
for i, line in enumerate(open('input.txt', 'r+').readlines()):
    if i % 2 == 0:
        last_line = line
    else:
        input_list.append((last_line, line))

filtered = [(header, data[:get_min(header[-2])] + '\n') for (header, data) in input_list]
[output.write(''.join(data)) for data in filtered]
output.close()
In this code input.txt is something like this
>012|013|0|3|M
AFDSFASDFASDFA
>005|5|67|0|6
ACCTCTGACC
>029|032|4|5|S
GGCAGGGAGCAGGCCTGTA
and num.txt is something like this
M 4
P 10
I want to take each record in the above input.txt, look at the last column of its header line (which matches a key in num.txt), and cut the record's characters according to that value.
I think the error in my code is that it only accepts an integer from the text file, where it should also accept keys that contain alphabetic characters.
The totally revised version, after a long chat with the OP;
import os
import re

# Fetch all hashes and counts
file_c = open('num.txt')
file_c = file_c.read()
lines = re.findall(r'\w+\.txt \d+', file_c)
numbers = {}
for line in lines:
    line_split = line.split('.txt ')
    hash_name = line_split[0]
    count = line_split[1]
    numbers[hash_name] = count
#print(numbers)

# The input file
file_i = open('input.txt')
file_i = file_i.read()
for hash_name, count in numbers.iteritems():
    regex = '(' + hash_name.strip() + ')'
    result = re.findall(r'>.*\|(' + regex + ')(.*?)>', file_i, re.S)
    if len(result) > 0:
        data_original = result[0][2]
        stripped_data = result[0][2][int(count):]
        file_i = file_i.replace(data_original, '\n' + stripped_data)
        #print(data_original)
        #print(stripped_data)
#print(file_i)

# Write the input file to new input_new.txt
f = open('input_new.txt', 'wt')
f.write(file_i)
You can do it like so;
import re

min_count = 4                # the count from which to start removing
str_to_match = 'EOG6CC67M'   # the filename you read
input = ''                   # The file input (input.txt) goes in here
counter = 0

def callback_f(e):
    global min_count
    global counter
    counter += 1
    # Check your input
    print(str(counter) + ' >>> ' + e.group())
    # Only replace the value with nothing (remove it) after a certain count
    if counter > min_count:
        return ''        # replace with nothing
    return e.group()     # otherwise keep the match (re.sub requires a string)

result = re.sub(r'' + str_to_match, callback_f, input)
With this tactic you can keep count with a global counter and there's no need to do hard line-loops with complex structures.
Update
More detailed version with file access;
import os
import re

def callback_f(e):
    global counter
    counter += 1
    # Check your input
    print(str(counter) + ' >>> ' + e.group())
    return e.group()  # keep the match; return '' here instead to remove it

# Fetch all hash-file names and their content (count)
num_files = os.listdir('./num_files')
numbers = {}
for file in num_files:
    if file[0] != '.':
        file_c = open('./num_files/' + file)
        file_c = file_c.read()
        numbers[file.split('.')[0]] = file_c

# Now the CSV files
csv_files = os.listdir('./csv_files')
for file in csv_files:
    if file[0] != '.':
        for hash_name, min_count in numbers.iteritems():
            file_c = open('./csv_files/' + file)
            file_c = file_c.read()
            counter = 0
            result = re.sub(r'' + hash_name, callback_f, file_c)
            # Write the replaced content back to the file here
Considered directory/file structure;
+ Projects
  + Project_folder
    + csv_files
      - input1.csv
      - input2.csv
      ~ etc.
    + num_files
      - EOG6CC67M.txt
      - EOG62JQZP.txt
      ~ etc.
    - python_file.py
The CSV files contain the big chunks of text you state in your original question.
The Num files contain the hash-files with an Integer in them
What happens in this script:
Collect all hash files (in a dictionary) with their inner count numbers
Loop through all CSV files
Sub-loop through the collected numbers for each CSV file
Replace/remove (based on what you do in callback_f()) hashes after a certain count
Write the output back (it's the last comment in the script; it would contain the file.write() functionality)