Faster file-reading operation in Python

I have to process a 15MB txt file (nucleic acid sequence) and find all the different substrings (size 5). For instance:
ABCDEF
would return 2, as we have both ABCDE and BCDEF, but
AAAAAA
would return 1. My code:
control_var = 0
f = open("input.txt", "r")
list_of_substrings = []
while f.read(5) != "":
    f.seek(control_var)
    aux = f.read(5)
    if aux not in list_of_substrings:
        list_of_substrings.append(aux)
    control_var += 1
f.close()
print len(list_of_substrings)
Would another approach be faster (instead of comparing the strings directly from the file)?

Depending on what your definition of a legal substring is, here is a possible solution:
import re

regex = re.compile(r'(?=(\w{5}))')
with open('input.txt', 'r') as fh:
    input = fh.read()
print len(set(re.findall(regex, input)))
Of course, you may replace \w with whatever you see fit to qualify as a legal character in your substring. [A-Za-z0-9], for example, will match all alphanumeric characters.
Here is an execution example:
>>> import re
>>> regex = re.compile(r'(?=(\w{5}))')
>>> input = "ABCDEF GABCDEF"
>>> set(re.findall(regex, input))
set(['GABCD', 'ABCDE', 'BCDEF'])
EDIT: Following your comment above that all characters in the file are valid, excluding the last one (which is \n), it seems that there is no real need for regular expressions here and the iteration approach is much faster. You can benchmark it yourself with this code (note that I slightly modified the functions to reflect your update regarding the definition of a valid substring):
import timeit
import re

FILE_NAME = r'input.txt'

def re_approach():
    return len(set(re.findall(r'(?=(.{5}))', input[:-1])))

def iter_approach():
    return len(set([input[i:i+5] for i in xrange(len(input) - 5)]))

with open(FILE_NAME, 'r') as fh:
    input = fh.read()

# verify that the output of both approaches is identical
assert set(re.findall(r'(?=(.{5}))', input[:-1])) == set([input[i:i+5] for i in xrange(len(input) - 5)])

print timeit.repeat(stmt=re_approach, number=500)
print timeit.repeat(stmt=iter_approach, number=500)

15MB doesn't sound like a lot. Something like this probably would work fine:
from collections import Counter
import re

contents = open('input.txt', 'r').read()
# the lookahead captures overlapping 5-character windows
counter = Counter(re.findall(r'(?=(.{5}))', contents))
print len(counter)
Update
I think user590028 gave a great solution, but here is another option:
contents = open('input.txt', 'r').read()
print set(contents[start:start+5] for start in range(0, len(contents) - 4))
# Or using a dictionary
# dict([(contents[start:start+5],True) for start in range(0, len(contents) - 4)]).keys()

You could use a dictionary, where each key is a substring. It will take care of duplicates, and you can just count the keys at the end.
So: read through the file once, storing each substring in the dictionary, which will handle finding duplicate substrings & counting the distinct ones.
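A minimal sketch of that idea (shown here with a set, which deduplicates exactly like a dict's keys; it assumes the file's only invalid character is a trailing newline, per the question's comments):
seen = set()
with open('input.txt') as f:
    data = f.read().rstrip('\n')
# slide a 5-character window over the data; the set keeps one copy of each
for i in xrange(len(data) - 4):
    seen.add(data[i:i+5])
print len(seen)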

Reading all at once is more I/O efficient, and using a dict() is going to be faster than testing for existence in a list. Something like:
fives = {}
buf = open('input.txt').read()
for x in xrange(len(buf) - 4):
    key = buf[x:x+5]
    fives[key] = 1
for key in fives.keys():
    print key

Related

Searching text file for string in python

I'm using Python to search a large text file for a certain string; below the string is the data that I am interested in analyzing.
def my_function(filename, variable2, variable3, variable4):
    array1 = []
    with open(filename) as a:
        special_string = str('info %d info =*' % variable3)
        for line in a:
            if special_string == array1:
                array1 = [next(a) for i in range(9)]
                line = next(a)
                break
            elif special_string != c:
                c = line.strip()
In the special_string variable, whatever comes after info = can vary, so I am trying to put a wildcard operator as seen above. The only way I can get the function to run though is if I put in the exact string I want to search for, including everything after the equals sign as follows:
special_string = str('info %d info = more_stuff' %variable3)
How can I assign a wildcard operator to the rest of the string to make my function more robust?
If your special string always occurs at the start of a line, then you can use the below check (where special_string does not have the * at the end):
line.startswith(special_string)
Otherwise, please do look at the module re in the standard library for working with regular expressions.
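For example, a minimal sketch of that check inside your reading loop (the names mirror the question's code; the trailing * is dropped since startswith matches a prefix):
special_string = 'info %d info =' % variable3  # prefix only, no wildcard
with open(filename) as a:
    for line in a:
        if line.startswith(special_string):
            # everything after the '=' is the "wildcard" part
            rest = line[len(special_string):].strip()
            break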
Have you thought about using something like this?
Based on your input, I'm assuming the following:
variable3 = 100000
special_string = str('info %d info = more_stuff' % variable3)

import re

pattern = re.compile(r'(info\s*\d+\s*info\s*=)(.*)')
output = pattern.findall(special_string)
print(output[0][1].strip())
Which would return:
more_stuff

Replace "*" (asterics) in HTML file with increasing number with python

I have an HTML file that contains a series of * (asterisks), and I would like to replace each one with a number, counting up from 0 until all asterisks are replaced.
I am unsure if this is possible in Python or if another method would be better.
Edit 2
Here is a short snippet from the TXT file that I am working on:
<td nowrap>4/29/2011 14.42</td>
<td align="center">*</td></tr>
I made a file just containing those lines to test out the code.
And here is the code that I am attempting to use to change the asterisks:
number = 0
with open('index.txt', 'r+') as inf:
    text = inf.read()
while "*" in text:
    print "I am in the loop"
    text = text.replace("*", str(number), 1)
    number += 1
I think that is as much detail as I can go into. Please let me know if I should just add this edit as another comment or keep it as an edit.
And thanks for all the quick responses so far~!
Use the re.sub() function; this lets you produce a new value for each replacement by using a function for the repl argument:
import re
from itertools import count

with open('index.txt', 'r') as inf:
    text = inf.read()
text = re.sub(r'\*', lambda m, c=count(): str(next(c)), text)
with open('index.txt', 'w') as outf:
    outf.write(text)
The count is taken care of by itertools.count(); each time you call next() on such an object the next value in the series is produced:
>>> import re
>>> from itertools import count
>>> sample = '''\
... foo*bar
... bar**foo
... *hello*world
... '''
>>> print(re.sub(r'\*', lambda m, c=count(): str(next(c)), sample))
foo0bar
bar12foo
3hello4world
Huapito's approach would work too, albeit slowly, provided you limit the number of replacements and actually store the result of the replacement:
number = 0
with open('index.txt', 'r') as inf:
    text = inf.read()
while "*" in text:
    text = text.replace("*", str(number), 1)
    number += 1
Note the third argument to str.replace(); that tells the method to only replace the first instance of the character.
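For instance:
>>> "a*b*c".replace("*", "0", 1)
'a0b*c'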
html = 'some string containing html'
new_html = list(html)
count = 0
for char in range(0, len(new_html)):
    if new_html[char] == '*':
        new_html[char] = str(count)  # must be a string for join() below
        count += 1
new_html = ''.join(new_html)
This would replace each asterisk, in order, with the numbers 0 through one less than the number of asterisks.
You need to iterate over each char. You can write to a tempfile and then replace the original with shutil.move, using itertools.count to assign a number incrementally each time you find an asterisk:
from tempfile import NamedTemporaryFile
from shutil import move
from itertools import count

cn = count()
with open("in.html") as f, NamedTemporaryFile("w+", dir="", delete=False) as out:
    out.writelines(ch if ch != "*" else str(next(cn))
                   for line in f for ch in line)
move(out.name, "in.html")
Using a test file with:
foo*bar
bar**foo
*hello*world
the output will be:
foo0bar
bar12foo
3hello4world
It is possible. Have a look at the docs. You should use something like a 'while' loop and 'replace'.
Example:
number = 0  # the first number
while "*" in text:  # repeats until no '*' is left
    text = text.replace("*", str(number), 1)  # replace only the first '*' with 'number'
    number += 1  # increase number
Use fileinput:
import fileinput

number = 0
with fileinput.FileInput(fileToSearch, inplace=True) as file:
    for line in file:
        # end='' avoids doubling the newline each line already carries
        print(line.replace("*", str(number)), end='')
        number += 1
Note that, as written, this replaces every asterisk on a line with the same number and increments once per line rather than per asterisk.

Python: Counting a specific set of character occurrences in lines of a file

I am struggling with a small program in Python which aims at counting the occurrences of a specific set of characters in the lines of a text file.
As an example, if I want to count '!' and '@' from the following lines
hi!
hello@gmail.com
collection!
I'd expect the following output:
!;2
@;1
So far I have functional code, but it's inefficient and does not use the potential of Python's libraries.
I have tried using collections.Counter, with limited success. The efficiency blocker I found is that I couldn't select specific sets of characters with counter.update(); all the other characters found were counted as well. Then I would have to filter out the characters I am not interested in, which adds another loop...
I also considered regular expressions, but I can't see an advantage in this case.
This is the functional code I have right now (the simplest idea I could imagine), which looks for special characters in the file's lines. I'd like to see if someone can come up with a neater, more Pythonic idea:
def count_special_chars(filename):
    special_chars = list('!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ ')
    dict_count = dict(zip(special_chars, [0] * len(special_chars)))
    with open(filename) as f:
        for passw in f:
            for c in passw:
                if c in special_chars:
                    dict_count[c] += 1
    return dict_count
thanks for checking
Why not count the whole file all at once? You should avoid looping through the string for each line of the file. Use str.count instead.
from pprint import pprint

# Better coding style: put constants outside the function
SPECIAL_CHARS = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ '

def count_special_chars(filename):
    with open(filename) as f:
        content = f.read()
    return dict([(i, content.count(i)) for i in SPECIAL_CHARS])

pprint(count_special_chars('example.txt'))
example output:
{' ': 0,
 '!': 2,
 '.': 1,
 '@': 1,
 '[': 0,
 '~': 0,
 # the remaining keys with a value of zero are ignored
 ...}
Eliminating the extra counts from collections.Counter is probably not significant either way, but if it bothers you, do it during the initial iteration:
from collections import Counter
special_chars = '''!"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~ '''
found_chars = [c for c in open(yourfile).read() if c in special_chars]
counted_chars = Counter(found_chars)
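If you then want the question's char;count output format, a small usage sketch:
for c in special_chars:
    if counted_chars[c]:  # skip characters that never occurred
        print('{0};{1}'.format(c, counted_chars[c]))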
There is no need to process the file contents line by line; reading it all at once avoids nested loops, which increase the complexity of your program.
If you want to count character occurrences in a string, first loop over the entire string once to construct an occurrence dict. Then you can look up the count of any character in the dict. This reduces the complexity of the program.
When constructing the occurrence dict, a defaultdict helps initialize the count values.
A refactored version of the program is below:
from collections import defaultdict

special_chars = list('!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ ')
dict_count = defaultdict(int)
with open(filename) as f:
    for c in f.read():
        dict_count[c] += 1
for c in special_chars:
    print('{0};{1}'.format(c, dict_count[c]))
ref. defaultdict Examples: https://docs.python.org/3.4/library/collections.html#defaultdict-examples
I did something like this, where you do not need to use the Counter library. I used it to count all the special chars, but you can adapt it to put the counts in a dict.
import re

def countSpecial(passwd):
    specialcount = 0
    for special in special_chars:
        # count occurrences of this special character in passwd
        # (re.escape(special) would be a more robust way to escape it)
        length = len(re.findall(r'(\%s)' % special, passwd))
        if length > 0:
            specialcount += length
    return specialcount
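A usage example; note the function relies on a global special_chars, as in the question:
special_chars = list('!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ ')
print countSpecial('hi! bye!')  # -> 3: two '!' plus one space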

Python: print strings from a list that occur as substrings

I am a newbie to Python. Consider I have the list ['python','java','ruby'].
I have a text file:
jrubyk
knwdjavawe
weqkpythonqwe
1ruby.e
Expected output:
ruby
java
python
ruby
I need to print the strings from the list that are hidden inside each line as substrings.
Is there a way to do that?
I tend to use regular expressions when I want to strip certain substrings from larger strings. Here is an inelegant but readable way to do this.
import re

python_matcher = re.compile('python')
java_matcher = re.compile('java')
ruby_matcher = re.compile('ruby')

hidden_text_list = open('hidden.txt', 'r').readlines()

for line in hidden_text_list:
    python_matched = python_matcher.search(line)
    java_matched = java_matcher.search(line)
    ruby_matched = ruby_matcher.search(line)
    if python_matched:
        print python_matched.group()
    elif java_matched:
        print java_matched.group()
    elif ruby_matched:
        print ruby_matched.group()
The brute force approach is:
hidden_strings = ['python', 'java', 'ruby']

with open('path/to/textfile/as/in/example.txt') as infile:
    for line in infile:
        for hidden_string in hidden_strings:
            if hidden_string in line:
                print(hidden_string)

Sort sequences in order from a FASTA file with a Python program

I was trying to create a Python program that reads the FASTA file "seqs.fa"
and sorts the sequences in order by name.
The FASTA file looks like this:
>seqA - human
GCTGACGTGGTGAAGTCAC
>seqC - gorilla
GATGACAA
GATGAAGTCAG
>seqB - chimp
GATGACATGGTGAAGTAAC
My program looks like this:
import sys
inFile = open(sys.argv[1], 'r')
a = inFile.readlines()
a.sort()
seq = ''.join(a[0:])
seq = seq.replace('\n', "\n")
print seq
The expected result:
>seqA - human
GCTGACGTGGTGAAGTCAC
>seqB - chimp
GATGACATGGTGAAGTAAC
>seqC - gorilla
GATGACAAGATGAAGTCAG
My result:
>seqA - human
>seqB - chimp
>seqC - gorilla
GATGACAA
GATGAAGTCAG
GATGACATGGTGAAGTAAC
GCTGACGTGGTGAAGTCAC
The last four lines are the gorilla, chimp, and human sequences, with the gorilla sequence split over the first two lines.
Can anyone give me some tips on how to sort it or a way to fix the problem?
Don't implement a FASTA reader yourself! As in most cases, some smart people have already done this for you. Use, for example, BioPython instead. Like this:
from Bio import SeqIO

handle = open("seqs.fa", "rU")
l = SeqIO.parse(handle, "fasta")
sortedList = [f for f in sorted(l, key=lambda x: x.id)]
for s in sortedList:
    print s.description
    print str(s.seq)
There are some problems with your code. The main one is that in the list returned by readlines() your descriptions and sequences are all separate lines, so when you sort the list, they are detached from each other. Also, all descriptions go before the sequences because they begin with '>'.
Second, a[0:] is the same as a.
Third, seq.replace('\n', "\n") won't do anything. Single and double quotes mean the same thing. You replace a newline character with itself.
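If you'd rather fix it by hand, here is a minimal sketch of that idea: pair each '>' header with its (possibly multi-line) sequence before sorting. It assumes the file layout from the question.
records = []
with open('seqs.fa') as f:
    header, seq_parts = None, []
    for line in f:
        line = line.rstrip('\n')
        if line.startswith('>'):
            # a new record begins; store the previous one first
            if header is not None:
                records.append((header, ''.join(seq_parts)))
            header, seq_parts = line, []
        else:
            seq_parts.append(line)
    if header is not None:
        records.append((header, ''.join(seq_parts)))

# sorting the (header, sequence) pairs keeps each sequence with its name
for header, seq in sorted(records):
    print header
    print seq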
Reading fasta files is not a very complex task for Python, but still I hope I'll be excused for offering to use the package I work on - pyteomics.
Here's the code I'd use:
In [1]: from pyteomics import fasta

In [2]: with fasta.read('/tmp/seqs.fa') as f:
   ...:     fasta.write(sorted(f))
   ...:
>seqA - human
GCTGACGTGGTGAAGTCAC
>seqB - chimp
GATGACATGGTGAAGTAAC
>seqC - gorilla
GATGACAAGATGAAGTCAG
To save this to a new file, give its name to fasta.write as an argument:
fasta.write(sorted(f), 'newfile.fa')
Generally, pyteomics.fasta is for protein sequences, not DNA, but it does the job. Maybe you can use the fact that it returns descriptions and sequences in tuples.
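For instance, a small sketch under the assumption that iterating the reader yields (description, sequence) tuples, as described above:
from pyteomics import fasta

with fasta.read('seqs.fa') as f:
    # tuples sort by their first element, i.e. by description
    for description, sequence in sorted(f):
        print '>' + description
        print sequence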
f = open("seqs.fa")
ar = []
# note: this assumes each record spans exactly two lines
while True:
    l1 = f.readline()
    l2 = f.readline()
    if not (l1 and l2):
        break
    # join the header line and its sequence so they sort as one unit
    l = l1.strip('\n') + '////////' + l2
    ar.append(l)
ar.sort()
for l in ar:
    l1 = l.split('////////')[0]
    print l1
    l2 = l.split('////////')[1].strip('\n')
    print l2
