Read file and get certain value from each line of file - python

I'm stuck on a particular point, and I'm hoping you guys could suggest a better method.
For each line of file I'm reading I want to get the nth word of the line, store this and print them on a single line.
I have the following code:
import os
p = './output.txt'
word_line = ' '
myfile = open(p, 'r')
for words in myfile.readlines()[1:]:  # I remove the first line because I don't want it
    current_word = words.strip().split(' ')[4]
    word_line += current_word
print word_line
myfile.close()
The file it reads looks like this:
1 abc-abc.abc (1235456) [AS100] bla 123 etc
2 abc-abc.abc (1235456) [AS10] bla 123 etc
3 abc-abc.abc (1235456) [AS1] bla 123 etc
4 abc-abc.abc (1235456) [AS56] bla 123 etc
5 abc-abc.abc (1235456) [AS8] bla 123 etc
6 abc-abc.abc (1235456) [AS200] bla 123 etc
etc
My current code outputs the following:
[AS100][AS10][AS1][AS56][AS8][AS200]
The only problem is that the value is not always the 4th item on the line: sometimes it appears as the 5th, and so on, or not at all.
I'm currently trying out:
if re.match("[AS", words):
    f_word = re.match(".*[(.*)",words)
This isn't working out. I'm trying to check whether the current line contains an opening "[" and, if it does, to display the content before the closing "]", then move on to the next line and keep doing this.
Eventually have the following desired output:
AS100 AS10 AS1 AS56 AS8 AS200
I could really use some advice on this. Thanks
EDIT:
m = re.search(r'\[AS(.*?)]', words)
if m:
    f_word += ' ' + m.group(1)
Thanks

[ is a special character in regular expressions and denotes the start of a character class. Escape it.
m = re.search(r'\[AS(.*?)]', words)
if m:
    f_word = m.group(1)
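Putting the pieces together, here is a minimal Python 3 sketch of the whole loop. It assumes every tag looks like "[AS" followed by digits and a closing "]", which is a guess based on the sample file. The names TAG, collect_tags and sample are made up for illustration. Note that the capture group here includes the "AS" prefix, so the result matches the desired output (the pattern above captures only the part after "AS").

```python
import re

# Assumption: a tag is "[AS<digits>]"; the group captures "AS<digits>" without brackets.
TAG = re.compile(r'\[(AS\d+)\]')

def collect_tags(lines):
    """Return the first AS tag of every line that has one, separated by spaces."""
    found = []
    for line in lines:
        m = TAG.search(line)
        if m:  # lines without a tag are simply skipped
            found.append(m.group(1))
    return ' '.join(found)

# Hypothetical input: the tag sits in a different column on each line, or is missing.
sample = [
    "header line to skip",
    "1 abc-abc.abc (1235456) [AS100] bla 123 etc",
    "2 abc-abc.abc (1235456) bla [AS10] 123 etc",
    "3 abc-abc.abc (1235456) bla 123 etc",
]
print(collect_tags(sample[1:]))  # the [1:] slice drops the unwanted first line
```

Because the regex searches the whole line, it does not matter whether the tag is the 4th or 5th item.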

Related

how to get a sequence after a word with whitespace

For school I have to parse the sequence that follows a keyword and contains a lot of whitespace, but I just can't get it.
The file is in GenBank format.
So for example:
BLA
1 sjafhkashfjhsjfhkjsfkjakshfkjsjkf
2 isfshkdfhjksfkhksfhjkshkfhkjsakjfhk
3 kahsfkjshakjfhksjhfkskjfkaskfksj
//
What I have tried is this.
if line.startswith("BLA"):
    start = line.find("BLA")
    end = line.find("//")
    line = line[:end]
    s_string = ""
    string = list()
    if s_string:
        string.append(line)
    else:
        line = line.strip()
        my_seq += line
But what I get is:
**output**
BLA
That is the only thing it returns. I want the output to look like this:
**output**
BLA 1 sjafhkashfjhsjfhkjsfkjakshfkjsjkf
2 isfshkdfhjksfkhksfhjkshkfhkjsakjfhk
3 kahsfkjshakjfhksjhfkskjfkaskfksj
So I don't know what to do. I tried to get that last output, but without success. My teacher told me to do it like this: once BLA has been seen (True), keep iterating over the lines, and stop when you see "//". But when I tried it with that True statement I got nothing.
I searched online and found suggestions to use Bio.SeqIO (Biopython), but the teacher said we can't use that.
Here is my solution:
lines = """BLA
1 sjafhkashfjhsjfhkjsfkjakshfkjsjkf
2 isfshkdfhjksfkhksfhjkshkfhkjsakjfhk
3 kahsfkjshakjfhksjhfkskjfkaskfksj
//"""
lines = lines.strip().split("//")
lines = lines[0].split("BLA")
lines = [i.strip() for i in lines]
print("BLA", " ", lines[1])
Output:
BLA 1 sjafhkashfjhsjfhkjsfkjakshfkjsjkf
2 isfshkdfhjksfkhksfhjkshkfhkjsakjfhk
3 kahsfkjshakjfhksjhfkskjfkaskfksj
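The flag approach the teacher described can also be written line by line. This is only a sketch in Python 3: the function name extract_block and its parameters are invented here, and it assumes the block starts on a line beginning with "BLA" and ends at a line beginning with "//".

```python
def extract_block(lines, start="BLA", stop="//"):
    """Collect lines from the one starting with `start` up to (not including) `stop`."""
    collecting = False  # becomes True once the start keyword has been seen
    block = []
    for raw in lines:
        line = raw.strip()
        if not collecting:
            if line.startswith(start):
                collecting = True
                block.append(line)
        elif line.startswith(stop):
            break  # end marker reached: stop reading
        else:
            block.append(line)
    return block

text = """BLA
   1 sjafhkashfjhsjfhkjsfkjakshfkjsjkf
   2 isfshkdfhjksfkhksfhjkshkfhkjsakjfhk
   3 kahsfkjshakjfhksjhfkskjfkaskfksj
//"""
block = extract_block(text.splitlines())
print("\n".join(block))
```

The same function works on an open file object, since iterating over a file also yields one line at a time.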

Finding data in-between two strings in python

I have a text file which contain some format like :
PAGE(leave) 'Data1'
line 1
line 2
line 2
...
...
...
PAGE(enter) 'Data1'
I need to get all the lines in between the two keywords and save them to a text file. I have come up with the following so far, but I have an issue with the single quotes: they are treated as string delimiters in the expression rather than as part of the keyword.
My code so far:
import re
log_file = open('messages','r')
data = log_file.read()
block = re.compile(ur'PAGE\(leave\) \'Data1\'[\S ]+\s((?:(?![^\n]+PAGE\(enter\) \'Data1\').)*)', re.IGNORECASE | re.DOTALL)
data_in_home_block=re.findall(block, data)
file = 0
make_directory("home_to_home_data",1)
for line in data_in_home_block:
    file = file + 1
    with open("home_to_home_" + str(file) , "a") as data_in_home_to_home:
        data_in_home_to_home.write(str(line))
It would be great if someone could guide me how to implement it..
As pointed out by @JoanCharmant, it is not necessary to use regex for this task, because the records are delimited by fixed strings.
Something like this should be enough:
messages = open('messages').read()
# str.split and str.rpartition take literal text, so the parentheses need no escaping
blocks = [block.rpartition("PAGE(enter) 'Data1'")[0]
          for block in messages.split("PAGE(leave) 'Data1'")
          if block and not block.isspace()]
for count, block in enumerate(blocks, 1):
    with open('home_to_home_%d' % count, 'a') as stream:
        stream.write(block)
If it's the single quotes that worry you, you can delimit the regular expression string with double quotes instead:
'hello "howdy"' # Correct
"hello 'howdy'" # Correct
Now, there are more issues here. Even when the pattern is declared as a raw string (the r prefix), a backslash that must be matched literally in the text still has to be escaped for the regular expression itself in the .compile call (see What does the "r" in pythons re.compile(r' pattern flags') mean?). It's just that without the r you would probably need a lot more backslashes.
I've created a test file with two "sections":
PAGE\(leave\) 'Data1'
line 1
line 2
line 3
PAGE\(enter\) 'Data1'
PAGE\(leave\) 'Data1'
line 4
line 5
line 6
PAGE\(enter\) 'Data1'
The code below will do what you want (I think)
import re
log_file = open('test.txt', 'r')
data = log_file.read()
log_file.close()
# each fragment needs its own raw prefix; the prefix does not carry over to adjacent literals
block = re.compile(
    ur"(PAGE\\\(leave\\\) 'Data1'\n)"
    ur"(.*?)"
    ur"(PAGE\\\(enter\\\) 'Data1')",
    re.IGNORECASE | re.DOTALL | re.MULTILINE
)
data_in_home_block = [result[1] for result in re.findall(block, data)]
for data_block in data_in_home_block:
    print "Found data_block: %s" % (data_block,)
Outputs:
Found data_block: line 1
line 2
line 3
Found data_block: line 4
line 5
line 6
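If the file contains the markers exactly as in the question (no backslashes), one way to avoid hand-escaping is re.escape. The following is a Python 3 sketch under that assumption. The names LEAVE, ENTER, block_re and sample are made up for illustration.

```python
import re

# Double-quoted literals, so the single quotes need no escaping in Python.
LEAVE = "PAGE(leave) 'Data1'"
ENTER = "PAGE(enter) 'Data1'"

# re.escape() backslash-escapes the parentheses so the markers match literally.
# The single group captures everything between the two marker lines, non-greedily.
block_re = re.compile(
    re.escape(LEAVE) + r"\n(.*?)" + re.escape(ENTER),
    re.DOTALL,
)

sample = (
    "PAGE(leave) 'Data1'\n"
    "line 1\nline 2\n"
    "PAGE(enter) 'Data1'\n"
    "PAGE(leave) 'Data1'\n"
    "line 3\n"
    "PAGE(enter) 'Data1'\n"
)
blocks = block_re.findall(sample)
for number, block in enumerate(blocks, 1):
    print("block %d: %r" % (number, block))
```

With a single capturing group, findall returns a plain list of strings, one per block, which can then be written to separate files.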

Merging Words into a Line

I am currently using Python v2.6 and trying to merge words into a line. My code is supposed to read data from a text file in which every line holds two strings (row_1_data and row_2_data). It takes the second one every time; these are the words of sentences, and the sentences are separated by delimiter strings, like this:
Inside the .txt:
"delimiter_string"
"row_1_data" "row_2_data"
"row_1_data" "row_2_data"
"row_1_data" "row_2_data"
"row_1_data" "row_2_data"
"row_1_data" "row_2_data"
"delimiter_string"
"row_1_data" "row_2_data"
"row_1_data" "row_2_data"
...
Those "row_2_data" will add-up to a sentence later. Sorry for the long introduction btw.
Here is my code:
import sys
import re
newLine = ''
for line in sys.stdin:
    word = line.split(' ')[1]
    if word == '<S>+BSTag':
        continue
    elif word == '</S>+ESTag':
        print newLine
        newLine = ''
        continue
    else:
        w = re.sub('\[.*?]', '', word)
        if newLine == '':
            newLine += w
        else:
            newLine += ' ' + w
"BSTag" is the tag for "Sentence Begins" and "ESTag" is for "Sentence Ends": the so called "delimiters". "re.sub" is used for a special purpose and it works as far as I checked.
The problem is that, when I execute this python script from the command line in linux with the following command: $ cat file.txt | script.py | less, I can not see any output, but just a blank file.
For those who are not familiar with linux, I guess the problem has nothing to do with terminal execution, thus you can neglect that part. Simply, the code does not work as intended and I can not find a single mistake.
Any help will be appreciated, and thanks for reading the long post :)
Ok, the problem is solved, which was actually a corpus error instead of a coding one. A very odd entry was detected in the text file, which was causing problems. Removing it solved it. You can use both of these approaches: mine and the one presented by "snurre" if you want a similar text processing.
Cheers.
import sys

def foo(lines):
    output = []
    for line in lines:
        words = line.split()
        if not words:
            continue  # skip blank lines
        if len(words) < 2:
            word = words[0]
        else:
            word = words[1]
        if word == '</S>+ESTag':
            yield ' '.join(output)
            output = []
        elif word != '<S>+BSTag':
            output.append(word)  # words[1] would fail on one-word lines

for sentence in foo(sys.stdin):
    print sentence
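To try the generator idea without a corpus file, here is a Python 3 sketch that runs on a small in-memory list. The sample lines are invented, since the real file format is only outlined in the question. The function name sentences is also hypothetical. Malformed lines with fewer than two fields are skipped, which is an assumption about how odd corpus entries should be handled.

```python
import re

def sentences(lines):
    """Yield one joined sentence per block, using the second field of each line."""
    words = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # malformed or blank line: skip instead of raising IndexError
        word = fields[1]
        if word == '<S>+BSTag':
            words = []  # sentence begins
        elif word == '</S>+ESTag':
            yield ' '.join(words)  # sentence ends
            words = []
        else:
            # strip bracketed annotations, as the re.sub call in the question does
            words.append(re.sub(r'\[.*?\]', '', word))

# Made-up sample data, including one odd single-field entry.
sample = [
    "<S> <S>+BSTag",
    "Hello hello[Noun]",
    "oddline",
    "World world[Noun]",
    "</S> </S>+ESTag",
]
result = list(sentences(sample))
print(result)
```

Using line.split() with no argument also discards the trailing newline, so the tag comparisons are not thrown off by it.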
Your regex is a little funky. From what I can tell, it's replacing anything between (and including) a pair of [ and ] with '', so it ends up printing empty strings.
I think the problem is that the script isn't being executed (unless you just excluded the shebang in the code you posted)
Try this
cat file.txt | python script.py | less

delete only lines after match1 up to match2

I have checked and played with various examples and it appears that my problem is a bit more complex than what I have been able to find. What I need to do is search for a particular string and then delete the following line and keep deleting lines until another string is found. So an example would be the following:
a
b
color [
0 0 0,
1 1 1,
3 3 3,
] #color
y
z
Here, "color [" is match1, and "] #color" is match2. So then what is desired is the following:
a
b
color [
] #color
y
z
This "simple to follow" code example will get you started .. you can tweak it as needed. Note that it processes the file line-by-line, so this will work with any size file.
start_marker = 'startdel'
end_marker = 'enddel'
with open('data.txt') as inf:
    ignoreLines = False
    for line in inf:
        if start_marker in line:
            print line,
            ignoreLines = True
        if end_marker in line:
            ignoreLines = False
        if not ignoreLines:
            print line,
It uses startdel and enddel as "markers" for starting and ending the ignoring of data.
Update:
Modified code based on a request in the comments, this will now include/print the lines that contain the "markers".
Given this input data (borrowed from @drewk):
Beginning of the file...
stuff
startdel
delete this line
delete this line also
enddel
stuff as well
the rest of the file...
it yields:
Beginning of the file...
stuff
startdel
enddel
stuff as well
the rest of the file...
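The same line-by-line idea, applied to the markers from the question, might look like this in Python 3. This is a sketch only: the function name drop_between is made up, and it assumes the markers appear as substrings of whole lines and never both on the same line.

```python
def drop_between(lines, start_marker, end_marker):
    """Return the lines with everything strictly between the marker lines removed."""
    kept = []
    skipping = False
    for line in lines:
        if skipping:
            if end_marker in line:
                skipping = False
                kept.append(line)  # the end marker line itself is kept
        else:
            kept.append(line)      # this also keeps the start marker line
            if start_marker in line:
                skipping = True
    return kept

sample = ["a", "b", "color [", "0 0 0,", "1 1 1,", "3 3 3,", "] #color", "y", "z"]
cleaned = drop_between(sample, "color [", "] #color")
print("\n".join(cleaned))
```

Passing an open file object instead of the list keeps memory use low for large files, because only one line is examined at a time.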
You can do this with a single regex by using the non-greedy *?. E.g., assuming you want to keep both the "look for this line" and the "until this line is found" lines, and discard only the lines in between, you could do:
>>> import re
>>> my_regex = re.compile("(look for this line)"+
...                       ".*?"+ # match as few chars as possible
...                       "(until this line is found)",
...                       re.DOTALL)
>>> new_str = my_regex.sub(r"\1\n\2", old_str)
A few notes:
The re.DOTALL flag tells Python that "." can match newlines; by default it matches any character except a newline.
The parentheses define "numbered match groups", which are then used later when I say r"\1\n\2" to make sure that we don't discard the first and last line (the \n restores the line break between them). The replacement must be a raw string; otherwise Python reads "\1" as the character with code 1 rather than as a group reference. If you did want to discard either or both of those lines, then just get rid of the \1 and/or the \2. E.g., to keep the first but not the last use my_regex.sub(r"\1", old_str); or to get rid of both use my_regex.sub("", old_str).
For further explanation, see: http://docs.python.org/library/re.html or search for "non-greedy regular expression" in your favorite search engine.
This works:
s="""Beginning of the file...
stuff
look for this line
delete this line
delete this line also
until this line is found
stuff as well
the rest of the file... """
import re
print re.sub(r'(^look for this line$).*?(^until this line is found$)',
r'\1\n\2',s,count=1,flags=re.DOTALL | re.MULTILINE)
prints:
Beginning of the file...
stuff
look for this line
until this line is found
stuff as well
the rest of the file...
You can also use list slices to do this:
mStart='look for this line'
mStop='until this line is found'
li=s.split('\n')
print '\n'.join(li[0:li.index(mStart)+1]+li[li.index(mStop):])
Same output.
I like re for this (being a Perl guy at heart...)

Python - go to two lines above match

In a text file like this:
First Name last name #
secone name
Address Line 1
Address Line 2
Work Phone:
Home Phone:
Status:
First Name last name #
....same as above...
I need to match the string 'Work Phone:', then go two lines up and insert the character '|' at the beginning of that line. So the pseudocode would be:
if "Work Phone:" in line:
    go up two lines
    write | + line
write rest of the lines.
The file is about 10 MB and there are about 1000 paragraphs like this.
Then I need to write it to another file. So the desired result would be:
First Name last name #
secone name
|Address Line 1
Address Line 2
Work Phone:
Home Phone:
Status:
thanks for any help.
This solution doesn't read the whole file into memory
p=""
q=""
for line in open("file"):
    line=line.rstrip()
    if "Work Phone" in line:
        p="|"+p
    if p: print p
    p,q=q,line
print p
print q
output
$ python test.py
First Name last name #
secone name
|Address Line 1
Address Line 2
Work Phone:
Home Phone:
Status:
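You can use this regex
((?:.*\n){2})(Work Phone:)
and replace the matches with
|\1\2
The two preceding lines are wrapped in a single capturing group because a repeated group such as (.*\n){2} keeps only its last repetition, so \1 would otherwise lose the first of the two lines.
You don't even need Python; you can do such a thing in any modern text editor, like Vim.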
Something like this?
lines = text.splitlines()
for i, line in enumerate(lines):
    if 'Work Phone:' in line:
        lines[i-2] = '|' + lines[i-2]
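The list-based snippet can be completed into a small function that returns the modified text. This is a Python 3 sketch: the name mark_address_lines and its parameters are invented for illustration, and the guard on i is an added assumption so that a match in the first two lines does not wrap around to the end of the list.

```python
def mark_address_lines(text, needle="Work Phone:", offset=2, prefix="|"):
    """Prefix the line `offset` lines above every line containing `needle`."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        # i >= offset avoids negative indexes when the match is near the top
        if needle in line and i >= offset:
            lines[i - offset] = prefix + lines[i - offset]
    return "\n".join(lines)

sample = """First Name last name #
secone name
Address Line 1
Address Line 2
Work Phone:
Home Phone:
Status:"""
marked = mark_address_lines(sample)
print(marked)
```

This variant holds the whole text in memory, which should be fine for a file of about 10 MB. The returned string can then be written to the second file with an ordinary open(..., 'w') and write call.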
