Merging Words into a Line - python

I am currently using Python v2.6 and trying to merge words into a line. My code is supposed to read data from a text file that holds two rows of data, both of which are strings. It then takes the second row's data each time, which are the words of sentences; the sentences are separated by delimiter strings, such that:
Inside the .txt:
"delimiter_string"
"row_1_data" "row_2_data"
"row_1_data" "row_2_data"
"row_1_data" "row_2_data"
"row_1_data" "row_2_data"
"row_1_data" "row_2_data"
"delimiter_string"
"row_1_data" "row_2_data"
"row_1_data" "row_2_data"
...
The "row_2_data" values will add up to a sentence later. Sorry for the long introduction, by the way.
Here is my code:
import sys
import re
newLine = ''
for line in sys.stdin:
    word = line.split(' ')[1]
    if word == '<S>+BSTag':
        continue
    elif word == '</S>+ESTag':
        print newLine
        newLine = ''
        continue
    else:
        w = re.sub('\[.*?]', '', word)
        if newLine == '':
            newLine += w
        else:
            newLine += ' ' + w
"BSTag" is the tag for "Sentence Begins" and "ESTag" is for "Sentence Ends": the so called "delimiters". "re.sub" is used for a special purpose and it works as far as I checked.
The problem is that when I execute this Python script from the command line in Linux with the following command: $ cat file.txt | script.py | less, I cannot see any output, just a blank file.
For those who are not familiar with Linux, I guess the problem has nothing to do with terminal execution, so you can ignore that part. Simply put, the code does not work as intended and I cannot find a single mistake.
Any help will be appreciated, and thanks for reading the long post :)
OK, the problem is solved; it was actually a corpus error rather than a coding one. A very odd entry in the text file was causing problems, and removing it solved the issue. You can use either approach, mine or the one presented by "snurre", if you need similar text processing.
Cheers.

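import sys
def foo(lines):
    output = []
    for line in lines:
        words = line.split()
        if not words:
            # skip blank lines
            continue
        # use the second column when present, otherwise the only word on the line
        if len(words) < 2:
            word = words[0]
        else:
            word = words[1]
        if word == '</S>+ESTag':
            yield ' '.join(output)
            output = []
        elif word != '<S>+BSTag':
            # append word (not words[1]) so one-column lines do not raise IndexError
            output.append(word)
for sentence in foo(sys.stdin):
    print sentence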
Your regex is a little funky. From what I can tell, it's replacing anything between (and including) a pair of [ and ] with '', so it ends up printing empty strings.
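To see what the substitution actually does, here is a minimal Python 3 sketch of the same merge logic (the thread itself uses Python 2.6). The function name `merge_sentences` and the sample tokens are made up for illustration; only the two tag strings and the regex come from the question. The regex empties a word only when the whole word sits inside one pair of brackets. Also note that `line.split()` without arguments drops the trailing newline, whereas `line.split(' ')` leaves it attached to the last column, which can stop a tag comparison from ever matching.

```python
import re

BEGIN_TAG = '<S>+BSTag'    # sentence-begin delimiter from the question
END_TAG = '</S>+ESTag'     # sentence-end delimiter from the question

def merge_sentences(lines):
    """Yield one sentence per delimiter-bounded group of lines."""
    words = []
    for line in lines:
        columns = line.split()
        if len(columns) < 2:
            continue  # skip blank or malformed lines instead of raising IndexError
        word = columns[1]
        if word == BEGIN_TAG:
            continue
        if word == END_TAG:
            yield ' '.join(words)
            words = []
            continue
        # drop every bracketed annotation, e.g. "hello[Tag]" -> "hello"
        words.append(re.sub(r'\[.*?\]', '', word))

# hypothetical sample input: first column is ignored, second column is the word
sample = [
    '<S> <S>+BSTag\n',
    'x hello[Tag]\n',
    '\n',
    'y world\n',
    '</S> </S>+ESTag\n',
]
print(list(merge_sentences(sample)))
```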

I think the problem is that the script isn't being executed (unless you just left the shebang out of the code you posted).
Try this:
cat file.txt | python script.py | less

Related

Printing words with the same first and last character and words with the same second character and one before the last character

#!/usr/bin/python3
import sys
def main():
    for line in sys.stdin:
        line = line.split()
        x = -1
        for word in line:
            if word[-1]==word[0] or word[x-1]==word[1]:
                print(word)
main()
It also prints dots at the end of the sentences, why?
And words like 'cat' and 'moon' should also be out of the question. But it also prints these words.
Can someone point me in the right direction please?
I think your problem is because the second and second last characters of 'cat' are the same.
def main():
    for line in sys.stdin:
        line = line.split()
        x = -1
        for word in line:
            # require a minimum length so that 'cat' no longer matches but 'moon' still does
            if (word[-1]==word[0] and len(word)>=2) or (word[x-1]==word[1] and len(word)>=4):
                print(word)
or something like that, depending on your preference.
This should get rid of that pesky cat, although moon stays.
It will also include words that use upper and lower case characters, so sadly not only will moon print but also Moon, MOon, mooN and moOn.
Edit: Forgot to test for one character words (a, I etc)
import sys
def main():
    for line in sys.stdin:
        line = line.split()
        for word in line:
            uword = word.lower()
            if len(uword) > 1:
                if uword[0:1]==uword[-1] or (uword[1:2]==uword[-2] and len(uword) > 3):
                    print(word)
main()
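The same check can be written as a small predicate, which makes it easy to try individual words. This is only a sketch of the condition above; the name `is_match` is hypothetical. It also shows why a lone "." was printed by the original script: a one-character token trivially has the same first and last character, so it has to be excluded by length.

```python
def is_match(word):
    """True if first == last letter, or (for words longer than 3) second == second-to-last."""
    w = word.lower()
    if len(w) < 2:
        return False  # single characters such as "a", "I" or "." never qualify
    return w[0] == w[-1] or (len(w) > 3 and w[1] == w[-2])

for candidate in ['cat', 'moon', 'Anna', 'a', '.']:
    print(candidate, is_match(candidate))
```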
I got it, guys; I had understood the question wrong. This prints the right words, which I had worked out beforehand. That cleared things up for me. This is the right code, but it still gives "sys.excepthook is missing". I run this code together with another script that replaces each space with a newline, so every space between words becomes a newline:
cat cdb.sentences| python3 newline.py| python3 word.py |head -n 5
# word.py
import sys
def main():
    for line in sys.stdin:
        line = line.split()
        for word in line:
            letterword = lw = word.lower()
            if len(lw) > 1:
                if lw[0:1]==lw[-1] and (lw[1:2]==lw[-2]):
                    print(word)
main()
# newline.py
import sys
def main():
    for line in sys.stdin:
        line = line.rstrip()
        text = ""
        for word in line:
            if word in ' ':
                text=text + '\n'
            else:
                text=text + word
        print(text)
main()
It should give the first 5 words whose first and last letters are the same and whose second and second-to-last letters are the same, with a blank line between each of them. First I want to solve that excepthook error.
Thx
You are not helping yourself by answering your own question with what is essentially a completely different question in an answer.
You should have closed your original off by accepting one of the answers, if one of them helped, which it looked like they did and then asked a new question.
However, the answer to your 2nd question/answer can be found here:
http://python.developermemo.com/7757_12807216/ and it is a brilliant answer
Synopsis:
The reason this is happening is that you're piping a nonzero amount of output from your Python script to something which never reads from standard input. You can get the same result by piping to any command which doesn't read standard input, such as
python testscript.py | cd .
Or for a simpler example, consider a script printer.py containing nothing more than
print 'abcde'
Then
python printer.py | python printer.py
will produce the same error.
The following however will trap the sys.excepthook error:
import sys
import logging
import traceback  # needed for traceback.format_tb below
def log_uncaught_exceptions(exception_type, exception, tb):
    logging.critical(''.join(traceback.format_tb(tb)))
    logging.critical('{0}: {1}'.format(exception_type, exception))
sys.excepthook = log_uncaught_exceptions
print "abcdfe"
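In Python 3 the closed pipe shows up as a BrokenPipeError raised when writing to standard output, so another option is to catch it where the output is written. A minimal sketch, assuming Python 3.3 or later; the helper name `emit` is made up for illustration and is not part of the linked answer:

```python
import sys

def emit(lines, out=sys.stdout):
    """Write lines to out; stop quietly if the reader has closed the pipe.

    Returns the number of lines written before the pipe closed.
    """
    written = 0
    try:
        for line in lines:
            out.write(line + '\n')
            written += 1
        # flush inside the try block so a closed pipe is detected here
        out.flush()
    except BrokenPipeError:
        # the downstream command (e.g. head) exited early; nothing more to do
        pass
    return written

if __name__ == '__main__':
    emit(['abcde'])
```

The Python documentation for the signal module (its note on SIGPIPE) additionally suggests redirecting sys.stdout to os.devnull after catching the error, so that the automatic flush at interpreter exit does not fail a second time.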

Python - How to make sure that a line being read from a file contains only a given string and nothing else

To make sure I start and stop reading a text file exactly where I want to, I put tags such as 'start1'<->'end1' and 'start2'<->'end2' inside the text file and pass the file to my Python script. In my script I read it as:
start_end = ['start1','end1']
line_num = []
with open(file_path) as fp1:
    for num, line in enumerate(fp1, 1):
        for i in start_end:
            if i in line:
                line_num.append(num)
fp1.close()
print '\nLine number: ', line_num
fp2 = open(file_path)
for k, line2 in enumerate(fp2):
    for x in range(line_num[0], line_num[1] - 1):
        if k == x:
            header.append(line2)
fp2.close()
This works well until I reach start10 <-> end10 and beyond. E.g., it checks whether "start2" is in the line, so it also matches lines containing "start21", and the same happens for the end tag. So providing "start1, end1" as input also reads "start10, end10". If I replace the line:
if i in line:
with
if i == line:
it throws an error.
How can I make sure that the script reads the line that contains ONLY "start1" and not "start10"?
import re
prog = re.compile('start1$')
if prog.match(line):
    print line
That should return None if there is no match and return a regex match object if the line matches the compiled regex. The '$' at the end of the regex says that's the end of the line, so 'start1' works but 'start10' doesn't.
or another way..
def test(line):
    import re
    prog = re.compile('start1$')
    return prog.match(line) is not None
> test('start1')
True
> test('start10')
False
Since your markers are always at the end of the line, change:
start_end = ['start1','end1']
to:
start_end = ['start1\n','end1\n']
You probably want to look into regular expressions. The Python re library has some good regex tools. It would let you define a string to compare your line to and it has the ability to check for start and end of lines.
If you can control the input file, consider adding an underscore (or any non-number character) to the end of each tag.
'start1_'<->'end1_'
'start10_'<->'end10_'
The regular expression solution presented in other answers is more elegant, but requires using regular expressions.
You can do this with find():
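for num, line in enumerate(fp1, 1):
    for i in start_end:
        if i in line:
            # make sure the tag is not followed by another digit
            # (slicing avoids an IndexError when the tag ends the line)
            pos = line.find(i) + len(i)
            if not line[pos:pos+1].isdigit():
                line_num.append(num)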
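Another way to avoid the prefix problem, sketched here in Python 3 under the assumption that each tag sits alone on its line: strip the surrounding whitespace (including the trailing newline) and compare the whole line for equality. The trailing newline is probably also why a plain `i == line` never matches, which leaves `line_num` empty and makes the later `line_num[0]` fail. The function name `find_tag_lines` is hypothetical.

```python
def find_tag_lines(lines, tags):
    """Return 1-based numbers of lines that consist of exactly one of the tags."""
    wanted = set(tags)
    return [num for num, line in enumerate(lines, 1) if line.strip() in wanted]

# made-up sample: 'start10' and 'end10' must not be picked up by 'start1' / 'end1'
sample = ['start1\n', 'keep me\n', 'end1\n', 'start10\n', 'other\n', 'end10\n']
print(find_tag_lines(sample, ['start1', 'end1']))
```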

How to obtain overlapping patterns with Python finditer?

I'm using a Python script that searches for a pattern in a fasta file. It works very well, but it does not return overlapping strings. Unfortunately, I'm interested in potential overlapping strings. Since I'm not a programmer (I'm just trying to learn Python), I was wondering if someone could modify the script so that it finds overlapping strings. I think that the regex module could do it, but I tried to install it on my computer (Windows) without success. I got this:
C:\Python33>regex-2014.02.19>python setup.py install
running install
running build
running build_py
runnning built_ext
building'_regex' extension
error:Unable to find vcvarsall.bat
For me it would be easier to work with a modified script. So here is my script:
import re
import sys
psq_re_f= re.compile('G{3,}.{1,7}?G{3,}.{1,7}?G{3,}.{1,7}?G{3,}') #((?<=G)[^G]|(?<!G).)
psq_re_r= re.compile('C{3,}.{1,7}?C{3,}.{1,7}?C{3,}.{1,7}?C{3,}') #((?<=C)[^C]|(?<!C).)
filename = input('Enter the name of the input fasta file: ')
ref_seq_fh = open(filename)
outputfileg = open("strelkaindels_quadg.txt",'wt')
outputfilec = open("strelkaindels_quadc.txt",'wt')
outputfileg.write('#\tID\tEntry Length\tStart\tEnd\tLength\tStrand\tSequence\n')
outputfilec.write('#\tID\tEntry Length\tStart\tEnd\tLength\tStrand\tSequence\n')
count = 0
ref_seq = []
line = (ref_seq_fh.readline()).strip()
chr = re.sub('^>', '', line)
chr1 = chr.split (":")
#line = (ref_seq_fh.readline()).strip()
while True:
    while line.startswith('>') is False:
        ref_seq.append(line)
        line = (ref_seq_fh.readline()).strip()
        if line == '':
            break
    ref_seq = ''.join(ref_seq)
    for m in re.finditer(psq_re_f, ref_seq):
        count=count+1
        outputfileg.write('%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s' %(count, chr1[0], len(ref_seq), m.start(), m.end(), len(m.group(0)), '+', m.group(0))+'\n')
    for m in re.finditer(psq_re_r, ref_seq):
        count=count+1
        outputfilec.write('%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s' %(count, chr1[0], len(ref_seq), m.start(), m.end(), len(m.group(0)), '-', m.group(0))+'\n')
    chr = re.sub('^>', '', line)
    chr1 = chr.split (":")
    ref_seq = []
    line= (ref_seq_fh.readline()).strip()
    if line == '':
        break
outputfileg.close()
outputfilec.close()
Finally, here is an example of a fasta file (a text-based format for representing nucleotide sequences, widely used in biology):
>id_1
agatagatgatagatatagagagcgcgctagatcgatcgatcgagtcgatcgcgcggggggcccctctctctctatagggacatacga
>id_2
agacatcagatacagagatatttacataacaagagatacag
>id_3
cgctctagctcctcctctcgcgtagctagctctctctaacatgattagaattcagatcgatcgatcgatggttttttttctctct
and so on...
For example, let's imagine the following sequence:
GGGTGGGTGGGCGGGAGGG
The script will return only this string:
GGGTGGGTGGGCGGG
But I would also like to get this one:
GGGTGGGCGGGAGGG
You could try using a positive lookahead:
(?=(G{3,}.{1,7}?G{3,}.{1,7}?G{3,}.{1,7}?G{3,}))
regex101 demo
And in your code, you will have to change your groups to .group(1). Because the lookahead itself matches nothing, m.end() will be the same as m.start(), so you will have to work around that, maybe by using len():
for m in re.finditer(psq_re_f, ref_seq):
    count=count+1
    outputfileg.write('%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s' % (count,
        chr1[0], len(ref_seq), m.start(),
        m.start() + len(m.group(1)), len(m.group(1)),
        '+',m.group(1))+'\n')
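As a self-contained illustration of the lookahead approach, here is a short Python 3 sketch run on the example sequence from the question. The names `quad_g` and `overlapping_matches` are made up for this example. It reads the span from group 1 with m.start(1) and m.end(1), which is an alternative to computing the end with len().

```python
import re

# The whole pattern sits inside a lookahead, so each match is zero-width and
# the scan restarts one character later; group 1 holds the captured text.
quad_g = re.compile(r'(?=(G{3,}.{1,7}?G{3,}.{1,7}?G{3,}.{1,7}?G{3,}))')

def overlapping_matches(pattern, sequence):
    """Return (start, end, text) for every match, including overlapping ones."""
    return [(m.start(1), m.end(1), m.group(1)) for m in pattern.finditer(sequence)]

for hit in overlapping_matches(quad_g, 'GGGTGGGTGGGCGGGAGGG'):
    print(hit)
```

On this input the sketch reports both GGGTGGGTGGGCGGG (starting at offset 0) and GGGTGGGCGGGAGGG (starting at offset 4).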

delete only lines after match1 up to match2

I have checked and played with various examples and it appears that my problem is a bit more complex than what I have been able to find. What I need to do is search for a particular string and then delete the following line and keep deleting lines until another string is found. So an example would be the following:
a
b
color [
0 0 0,
1 1 1,
3 3 3,
] #color
y
z
Here, "color [" is match1, and "] #color" is match2. So then what is desired is the following:
a
b
color [
] #color
y
z
This "simple to follow" code example will get you started .. you can tweak it as needed. Note that it processes the file line-by-line, so this will work with any size file.
start_marker = 'startdel'
end_marker = 'enddel'
with open('data.txt') as inf:
    ignoreLines = False
    for line in inf:
        if start_marker in line:
            print line,
            ignoreLines = True
        if end_marker in line:
            ignoreLines = False
        if not ignoreLines:
            print line,
It uses startdel and enddel as "markers" for starting and ending the ignoring of data.
Update:
Modified code based on a request in the comments, this will now include/print the lines that contain the "markers".
Given this input data (borrowed from @drewk):
Beginning of the file...
stuff
startdel
delete this line
delete this line also
enddel
stuff as well
the rest of the file...
it yields:
Beginning of the file...
stuff
startdel
enddel
stuff as well
the rest of the file...
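The same line-by-line idea, sketched in Python 3 as a generator and applied to the markers from the question ("color [" and "] #color"). The function name `drop_between` is hypothetical, and the sketch assumes the start and end markers never appear on the same line.

```python
def drop_between(lines, start_marker, end_marker):
    """Yield lines, skipping those strictly between the two marker lines."""
    skipping = False
    for line in lines:
        if skipping and end_marker in line:
            skipping = False      # the end marker line itself is kept
        if not skipping:
            yield line
        if start_marker in line:
            skipping = True       # start skipping after the start marker line

text = 'a\nb\ncolor [\n0 0 0,\n1 1 1,\n3 3 3,\n] #color\ny\nz\n'
print(''.join(drop_between(text.splitlines(True), 'color [', '] #color')), end='')
```

Because it is a generator, it can be fed an open file object directly and never holds the whole file in memory.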
You can do this with a single regex by using nongreedy *. E.g., assuming you want to keep both the "look for this line" and the "until this line is found" lines, and discard only the lines in between, you could do:
>>> my_regex = re.compile("(look for this line)"+
...                       ".*?"+ # match as few chars as possible
...                       "(until this line is found)",
...                       re.DOTALL)
>>> new_str = my_regex.sub(r"\1\n\2", old_str)
A few notes:
The re.DOTALL flag tells Python that "." can match newlines -- by default it matches any character except a newline
The parentheses define "numbered match groups", which are then used later when I say r"\1\n\2" to make sure that we don't discard the first and last line (the "\n" keeps them on separate lines). The replacement has to be a raw string; in a normal string literal "\1" is a control character, not a group reference. If you did want to discard either or both of those lines, then just get rid of the \1 and/or the \2. E.g., to keep the first but not the last use my_regex.sub(r"\1", old_str); or to get rid of both use my_regex.sub("", old_str)
For further explanation, see: http://docs.python.org/library/re.html or search for "non-greedy regular expression" in your favorite search engine.
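A short Python 3 sketch of the substitution, using the marker strings from the question. The replacement is written as a raw string because in an ordinary string literal "\1" is the control character chr(1) rather than a backreference, and the "\n" between the groups keeps the two marker lines on separate lines. The variable names are illustrative only.

```python
import re

text = 'a\nb\ncolor [\n0 0 0,\n1 1 1,\n3 3 3,\n] #color\ny\nz\n'

# re.escape protects the literal "[" and "]" in the markers
block = re.compile('(' + re.escape('color [') + ').*?(' + re.escape('] #color') + ')',
                   re.DOTALL)

# keep both marker lines, drop everything between them
cleaned = block.sub(r'\1\n\2', text)
print(cleaned, end='')
```

Unlike the line-by-line version, this needs the whole file contents in one string.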
This works:
s="""Beginning of the file...
stuff
look for this line
delete this line
delete this line also
until this line is found
stuff as well
the rest of the file... """
import re
print re.sub(r'(^look for this line$).*?(^until this line is found$)',
r'\1\n\2',s,count=1,flags=re.DOTALL | re.MULTILINE)
prints:
Beginning of the file...
stuff
look for this line
until this line is found
stuff as well
the rest of the file...
You can also use list slices to do this:
mStart='look for this line'
mStop='until this line is found'
li=s.split('\n')
print '\n'.join(li[0:li.index(mStart)+1]+li[li.index(mStop):])
Same output.
I like re for this (being a Perl guy at heart...)

str.startswith() not working as I intended

I'm trying to test for a tab (\t) or a space character and I can't understand why this bit of code won't work. What I am doing is reading in a file, counting the lines of code (loc) for the file, and then recording the names of each function present within the file along with their individual lines of code. The bit of code below is where I attempt to count the loc for the functions.
import re
...
    else:
        loc += 1
        for line in infile:
            line_t = line.lstrip()
            if len(line_t) > 0 \
                    and not line_t.startswith('#') \
                    and not line_t.startswith('"""'):
                if not line.startswith('\s'):
                    print ('line = ' + repr(line))
                    loc += 1
                    return (loc, name)
                else:
                    loc += 1
            elif line_t.startswith('"""'):
                while True:
                    if line_t.rstrip().endswith('"""'):
                        break
                    line_t = infile.readline().rstrip()
        return(loc,name)
Output:
Enter the file name: test.txt
line = '\tloc = 0\n'
There were 19 lines of code in "test.txt"
Function names:
count_loc -- 2 lines of code
As you can see, my test print for the line shows a \t, but the if statement explicitly says (or so I thought) that it should only execute with no whitespace characters present.
Here is my full test file I have been using:
def count_loc(infile):
	""" Receives a file and then returns the amount
	of actual lines of code by not counting commented
	or blank lines """
	loc = 0
	for line in infile:
		line = line.strip()
		if len(line) > 0 \
		and not line.startswith('//') \
		and not line.startswith('/*'):
			loc += 1
			func_loc, func_name = checkForFunction(line);
		elif line.startswith('/*'):
			while True:
				if line.endswith('*/'):
					break
				line = infile.readline().rstrip()
	return loc
if __name__ == "__main__":
	print ("Hi")
Function LOC = 15
File LOC = 19
\s is only whitespace to the re package when doing pattern matching.
For startswith, an ordinary method of ordinary strings, \s is nothing special. Not a pattern, just characters.
Your question has already been answered and this is slightly off-topic, but...
If you want to parse code, it is often easier and less error-prone to use a parser. If your code is Python code, Python comes with a couple of parsers (tokenize, ast, parser). For other languages, you can find a lot of parsers on the internet. ANTLR is a well-known one with Python bindings.
As an example, the following couple of lines of code print all lines of a Python module that are not comments and not doc-strings:
import tokenize
ignored_tokens = [tokenize.NEWLINE,tokenize.COMMENT,tokenize.N_TOKENS
                 ,tokenize.STRING,tokenize.ENDMARKER,tokenize.INDENT
                 ,tokenize.DEDENT,tokenize.NL]
with open('test.py', 'r') as f:
    g = tokenize.generate_tokens(f.readline)
    line_num = 0
    for a_token in g:
        if a_token[2][0] != line_num and a_token[0] not in ignored_tokens:
            line_num = a_token[2][0]
            print(a_token)
As a_token above is already parsed, you can easily check for function definition, too. You can also keep track where the function ends by looking at the current column start a_token[2][1]. If you want to do more complex things, you should use ast.
Your string literals aren't what you think they are.
You can specify a space or TAB like so:
space = ' '
tab = '\t'
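Putting the two literals together, a minimal Python 3 sketch of the whitespace test the question was after. The helper name `starts_with_whitespace` is made up for illustration. str.startswith accepts a tuple of prefixes and compares each one literally, so no regex syntax is involved.

```python
def starts_with_whitespace(line):
    """True if the line begins with a space or a tab."""
    # str.startswith accepts a tuple of prefixes and compares them literally
    return line.startswith((' ', '\t'))

print(starts_with_whitespace('\tloc = 0\n'))               # indented line from the question
print(starts_with_whitespace('def count_loc(infile):\n'))  # unindented line
print(len(r'\s'))  # to startswith, \s is just a backslash followed by an "s"
```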
