Python text processing/finding data

Python text processing/finding data - python

I am trying to parse/process some information from a text file using Python. This file contains names, employee numbers and other data. I do not know the names or employee numbers ahead of time. I do know that after the names there is the text: "Per End" and before the employee number there is the text: "File:". I can find these items using the .find() method. But, how do I ask Python to look at the information that comes before or after "Per End" and "File:"? In this specific case the output should be the name and employee number.
The text looks like this:
SMITH, John
Per End: 12/10/2016
File:
002013
Dept:
000400
Rate:10384 60
My code is thus:
file = open("Register.txt", "rt")
lines = file.readlines()
file.close()
countPer = 0
for line in lines:
line = line.strip()
print (line)
if line.find('Per End') != -1:
countPer += 1
print ("Per End #'s: ", countPer)

file = open("Register.txt", "rt")
lines = file.readlines()
file.close()
for indx, line in enumerate(lines):
line = line.strip()
print (line)
if line.find('Per End') != -1:
print lines[indx-1].strip()
if line.find('File:') != -1:
print lines[indx+1].strip()
enumerate(lines) gives access to indices and line as well, there by you can access previous and next lines as well
here is my stdout directly ran in python shell:
>>> file = open("r.txt", "rt")
>>> lines = file.readlines()
>>> file.close()
>>> lines
['SMITH, John\n', 'Per End: 12/10/2016\n', 'File:\n', '002013\n', 'Dept:\n', '000400\n', 'Rate:10384 60\n']
>>> for indx, line in enumerate(lines):
... line = line.strip()
... if line.find('Per End') != -1:
... print lines[indx-1].strip()
... if line.find('File:') != -1:
... print lines[indx+1].strip()
SMITH, John
002013

Here is how I would do it.
First, some test data.
test = """SMITH, John\n
Per End: 12/10/2016\n
File:\n
002013\n
Dept:\n
000400\n
Rate:10384 60\n"""
text = [line for line in test.splitlines(keepends=False) if line != ""]
Now for the real answer.
count_per, count_num = 0, 0
Using enumerate on an iterable gives you an index automagically.
for idx, line in enumerate(text):
# Just test whether what you're looking for is in the `str`
if 'Per End' in line:
print(text[idx - 1]) # access the full set of lines with idx
count_per += 1
if 'File:' in line:
print(text[idx + 1])
count_num += 1
print("Per Ends = {}".format(count_per))
print("Files = {}".format(count_num))
yields for me:
SMITH, John
002013
Per Ends = 1
Files = 1

Related

Lines missing in python

I am writing a code in python where I am removing all the text after a specific word but in output lines are missing. I have a text file in unicode which have 3 lines:
my name is test1
my name is
my name is test 2
What I want is to remove text after word "test" so I could get the output as below
my name is test
my name is
my name is test
I have written a code but it does the task but also removes the second line "my name is"
My code is below
txt = ""
with open(r"test.txt", 'r') as fp:
for line in fp.readlines():
splitStr = "test"
index = line.find(splitStr)
if index > 0:
txt += line[:index + len(splitStr)] + "\n"
with open(r"test.txt", "w") as fp:
fp.write(txt)

It looks like if there is no keyword found the index become -1.
So you are avoiding the lines w/o keyword.
I would modify your if by adding the condition as follows:
txt = ""
with open(r"test.txt", 'r') as fp:
for line in fp.readlines():
splitStr = "test"
index = line.find(splitStr)
if index > 0:
txt += line[:index + len(splitStr)] + "\n"
elif index < 0:
txt += line
with open(r"test.txt", "w") as fp:
fp.write(txt)
No need to add \n because the line already contains it.

Your code does not append the line if the splitStr is not defined.
txt = ""
with open(r"test.txt", 'r') as fp:
for line in fp.readlines():
splitStr = "test"
index = line.find(splitStr)
if index != -1:
txt += line[:index + len(splitStr)] + "\n"
else:
txt += line
with open(r"test.txt", "w") as fp:
fp.write(txt)

In my solution I simulate the input file via io.StringIO. Compared to your code my solution remove the else branch and only use one += operater. Also splitStr is set only one time and not on each iteration. This makes the code more clear and reduces possible errore sources.
import io
# simulates a file for this example
the_file = io.StringIO("""my name is test1
my name is
my name is test 2""")
txt = ""
splitStr = "test"
with the_file as fp:
# each line
for line in fp.readlines():
# cut somoething?
if splitStr in line:
# find index
index = line.find(splitStr)
# cut after 'splitStr' and add newline
line = line[:index + len(splitStr)] + "\n"
# append line to output
txt += line
print(txt)
When handling with files in Python 3 it is recommended to use pathlib for that like this.
import pathlib
file_path = pathlib.Path("test.txt")
# read from wile
with file_path.open('r') as fp:
# do something
# write back to the file
with file_path.open('w') as fp:
# do something

Suggestion:
for line in fp.readlines():
i = line.find('test')
if i != -1:
line = line[:i]

Reading a specific portion of data from a .txt file in Python

>gene1
ATGATGATGGCG
>gene2
GGCATATC
CGGATACC
>gene3
TAGCTAGCCCGC
This is the text file which I am trying to read.
I want to read every gene in a different string and then add it in a list
There are header lines starting with ’>’ character to recognize if this is a start or end of a gene
with open('sequences1.txt') as input_data:
for line in input_data:
while line != ">":
list.append(line)
print(list)
When printed the list should display list should be
list =["ATGATGATGGCG","GGCATATCCGGATACC","TAGCTAGCCCGC"]

with open('sequences1.txt') as input_data:
sequences = []
gene = []
for line in input_data:
if line.startswith('>gene'):
if gene:
sequences.append(''.join(gene))
gene = []
else:
gene.append(line.strip())
sequences.append(''.join(gene)) # append last gene
print(sequences)
output:
['ATGATGATGGCG', 'GGCATATCCGGATACC', 'TAGCTAGCCCGC']

You have multiple mistakes in your code, look here:
with open('sequences1.txt', 'r') as file:
list = []
for line in file.read().split('\n'):
if not line.startswith(">") and len(line$
list.append(line)
print(list)

Try this:
$ cat genes.txt
>gene1
ATGATGATGGCG
>gene2
GGCATATC
CGGATACC
>gene3
TAGCTAGCCCGC
$ python
>>> genes = []
>>> with open('genes.txt') as file_:
... for line in f:
... if not line.startswith('>'):
... genes.append(line.strip())
...
>>> print(genes)
['ATGATGATGGCG', 'GGCATATC', 'CGGATACC', 'TAGCTAGCCCGC']

sequences1.txt:
>gene1
ATGATGATGGCG
>gene2
GGCATATC
CGGATACC
>gene3
TAGCTAGCCCGC
and then:
desired_text = []
with open('sequences1.txt') as input_data:
content = input_data.readlines()
content = [l.strip() for l in content if l.strip()]
for line in content:
if not line.startswith('>'):
desired_text.append(line)
print(desired_text)
OUTPUT:
['ATGATGATGGCG', 'GGCATATC', 'CGGATACC', 'TAGCTAGCCCGC']
EDIT:
Sped-read it, fixed it with the desired output
with open('sequences1.txt') as input_data:
content = input_data.readlines()
# you may also want to remove empty lines
content = [l.strip() for l in content if l.strip()]
# flag
nextLine = False
# list to save the lines
textList = []
concatenated = ''
for line in content:
find_TC = line.find('gene')
if find_TC > 0:
nextLine = not nextLine
else:
if nextLine:
textList.append(line)
else:
if find_TC < 0:
if concatenated != '':
concatenated = concatenated + line
textList.append(concatenated)
else:
concatenated = line
print(textList)
OUTPUT:
['ATGATGATGGCG', 'GGCATATCCGGATACC', 'TAGCTAGCCCGC']

reading and deleting lines in python

I am trying to read all the lines in a specific file, and it prints the number of the line as an index.
What I am trying to do is to delete the line by inputting the number of the line by the user.
As far as it is now, it prints all the lines with the number of that line, but when I enter the number of the line to be deleted, it's not deleted.
This is the code of the delete function:
def deleteorders ():
index = 0
fh = open ('orders.txt', 'r')
lines = fh.readlines()
for line in lines:
lines = fh.readlines()
index = index+1
print (str(index) + ' ' + line)
try:
indexinp = int(input('Enter the number of the order to be deleted, or "B" to go back: '))
if indexinp == 'B':
return
else:
del line[indexinp]
print (line)
fh = open ('orders.txt', 'w')
fh.writelines(line)
fh.close()
except:
print ('The entered number is not in the range')
return

This should work (you'll need to add the error handling back in):
lines = enumerate(open('orders.txt'))
for i, line in lines:
print i, line
i = int(input(">"))
open('orders.txt', 'w').write(''.join((v for k, v in lines if k != i)))

How do I count the number of lines that are full-line comments in python?

I'm trying to create a function that accepts a file as input and prints the number of lines that are full-line comments (i.e. the line begins with #followed by some comments).
For example a file that contains say the following lines should print the result 2:
abc
#some random comment
cde
fgh
#another random comment
So far I tried along the lines of but just not picking up the hash symbol:
infile = open("code.py", "r")
line = infile.readline()
def countHashedLines(filename) :
while line != "" :
hashes = '#'
value = line
print(value) #here you will get all
#if(value == hashes): tried this but just wasn't working
# print("hi")
for line in value:
line = line.split('#', 1)[1]
line = line.rstrip()
print(value)
line = infile.readline()
return()
Thanks in advance,
Jemma

I re-worded a few statements for ease of use (subjective) but this will give you the desired output.
def countHashedLines(lines):
tally = 0
for line in lines:
if line.startswith('#'): tally += 1
return tally
infile = open('code.py', 'r')
all_lines = infile.readlines()
num_hash_nums = countHashedLines(all_lines) # <- 2
infile.close()
...or if you want a compact and clean version of the function...
def countHashedLines(lines):
return len([line for line in lines if line.startswith('#')])

I would pass the file through standard input
import sys
count = 0
for line in sys.stdin: """ Note: you could also open the file and iterate through it"""
if line[0] == '#': """ Every time a line begins with # """
count += 1 """ Increment """
print(count)

Here is another solution that uses regular expressions and will detect comments that have white space in front.
import re
def countFullLineComments(infile) :
count = 0
p = re.compile(r"^\s*#.*$")
for line in infile.readlines():
m = p.match(line)
if m:
count += 1
print(m.group(0))
return count
infile = open("code.py", "r")
print(countFullLineComments(infile))

Print text between two separators?

I have config file:
$ cat ../secure/test.property
#<TITLE>Connection setting
#MAIN DEV
jdbc.main.url=
jdbc.main.username=
jdbc.main.password=
#<TITLE>Mail settings
mail.smtp.host=127.0.0.1
mail.smtp.port=25
mail.smtp.on=false
email.subject.prefix=[DEV]
#<TITLE>Batch size for package processing
exposureImportService.batchSize=10
exposureImportService.waitTimeInSecs=10
ImportService.batchSize=400
ImportService.waitTimeInSecs=10
#<TITLE>Other settings
usePrecalculatedAggregation=true
###################### Datasource wrappers, which allow to log additional information
bean.datasource.query_log_wrapper=mainDataSourceWrapper
bean.gpc_datasource.query_log_wrapper=gpcDataSourceWrapper
time.to.keep.domain=7*12
time.to.keep.uncompress=1
#oracle max batch size
dao.batch.size.max=30
And function, which return line "#<TITLE>Other settings" (for example), to select "config section".
Next, need to print all lines between selected "section", and next line, startwith #<TITLE>.
How it can be realized?
P.S.
def select_section(property_file):
while True:
with open(os.path.join(CONF_DIR, property_file), 'r+') as file:
text = file.readlines()
list = []
print()
for i in text:
if '<TITLE>' in i:
line = i.lstrip('#<TITLE>').rstrip('\n')
list.append(line)
print((list.index(line)), line)
res_section = int(raw_input('\nPlease, select section to edit: '))
print('You selected: %s' % list[res_section])
if answer('Is it OK? '):
return(list[res_section])
break
And it's work like:
...
0 Connection setting
1 Mail settings
2 Batch size for package processing
3 Other settings
Please, select section to edit:
...
And expected output, if select Connection setting:
...
0 jdbc.main.url
1 jdbc.main.username
2 jdbc.main.password
Please, select line to edit:
...

If I understand the problem correctly, here's a solution that assembles the requested section as it reads the file:
def get_section(section):
marker_line = '#<TITLE>{}'.format(section)
in_section = False
section_lines = []
with open('test.property') as f:
while True:
line = f.readline()
if not line:
break
line = line.rstrip()
if line == marker_line:
in_section = True
elif in_section and line.startswith('#<TITLE>'):
break
if in_section:
if not line or line.startswith('#'):
continue
section_lines.append(line)
return '\n'.join(['{} {}'.format(i, line)
for i, line in enumerate(section_lines)])
print get_section('Connection setting')
Output:
0 jdbc.main.url=
1 jdbc.main.username=
2 jdbc.main.password=
Perhaps this will get you started.

Here's a quick solution:
def get_section(section):
results = ''
with open('../secure/test.property') as f:
lines = [l.strip() for l in f.readlines()]
indices = [i for i in range(len(lines)) if lines[i].startswith('#<TITLE>')]
for i in xrange(len(indices)):
if lines[indices[i]] == '#<TITLE>' + section:
for j in xrange(indices[i], indices[i+1] if i < len(indices)-1 else len(lines) - 1):
results += lines[j] + '\n'
break
return results
You can use it like:
print get_section('Connection setting')
Not very elegant but it works!

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python text processing/finding data - python

Related

Lines missing in python

Reading a specific portion of data from a .txt file in Python

reading and deleting lines in python

How do I count the number of lines that are full-line comments in python?

Print text between two separators?

Categories

Resources