Running python mapreduce for multiple files - python

I am trying to implement a Python MapReduce job over multiple files in a directory: it should take a folder and a string as arguments and list each file with the frequency of that string within that file. The output should look like this:
Filename Output
-------- --------------
x.txt 8
y.txt 12
I have tried to implement it, but when I run it with the command below:
cat /home/habil/Downloads/hadoop_test/*.txt | python mapper.py "AA" | python reducer.py
it gives me "AA 479", which is the total frequency across all 5 files.
This is my mapper.py:
#!/usr/bin/env python
import sys
import textwrap
from os import listdir
from os.path import isfile, join
#Argument of the path
#folderPath = sys.argv[2]
#onlyfiles = [f for f in listdir(sys.argv[2]) if isfile(join(sys.argv[2], f))]
# Get the string sequence from the user
searchWord = sys.argv[1]
# Length of the word
searchWordLength = len(sys.argv[1])
# helper Function
def locations_of_substring(string, substring):
    """Return a list of locations of a substring."""
    substring_length = len(substring)
    def recurse(locations_found, start):
        location = string.find(substring, start)
        if location != -1:
            return recurse(locations_found + [location], location+substring_length)
        else:
            return locations_found
    return recurse([], 0)
#--- get all lines from stdin ---
for line in sys.stdin:
    #--- remove leading and trailing whitespace---
    line = line.strip()
    temp = locations_of_substring(line, searchWord)
    if len(temp) != 0:
        for count in temp:
            print '%s\t%s' % (line[count:count+searchWordLength], "1")
And below is my reducer:
#!/usr/bin/env python
import sys
# maps words to their counts
word2count = {}
# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        continue
    try:
        word2count[word] = word2count[word]+count
    except:
        word2count[word] = count
# write the tuples to stdout
# Note: they are unsorted
for word in word2count.keys():
    print '%s\t%s'% ( word, word2count[word])
How can I get the desired result, so that it runs once for each file in the directory and prints separate results? Any help or hint is appreciated. Thanks in advance.
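Piping everything through cat merges the files into one stream, so the mapper can no longer tell which file a line came from. Below is a minimal sketch (Python 3, not a real MapReduce job) that keeps the file name attached to each count. The function names count_in_file, count_per_file and print_report are made up for illustration, and the ".txt" filter is an assumption taken from the command in the question. Under Hadoop Streaming itself the mapper would instead have to emit the input file name as part of the key; streaming usually exposes that name through an environment variable (mapreduce_map_input_file on recent versions), which this sketch neither uses nor tests.

```python
import os


def count_in_file(path, word):
    """Count non-overlapping occurrences of `word` in one file, line by line."""
    total = 0
    with open(path, "r") as handle:
        for line in handle:
            # str.count skips past each match, like the mapper's helper does
            total += line.count(word)
    return total


def count_per_file(folder, word, suffix=".txt"):
    """Return a list of (filename, count) pairs, sorted by file name."""
    results = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path) and name.endswith(suffix):
            results.append((name, count_in_file(path, word)))
    return results


def print_report(folder, word):
    """Print one 'filename<TAB>count' line per file."""
    for name, count in count_per_file(folder, word):
        print("%s\t%d" % (name, count))
```

Calling print_report("/home/habil/Downloads/hadoop_test", "AA") would then print one line per .txt file instead of a single total.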

Related

Counting reads and bases from a list of fastq files

I trimmed my Illumina short reads, forward and reverse, using Trimmomatic. Trimmomatic's outputs were paired_1 - unpaired_1, and paired_2 - unpaired_2.fastq.gz files. I want to know how big the impact of trimming was by counting the number of reads and bases in each file in my directory. I wrote a script to count the bases and reads for each file; however, I have problems in the if __name__=='__main__' block. When I run the for loop, I don't know the order in which the files will be processed. How can I make it process the files in the order I see on screen? Additionally, I need help correcting the script, as I don't get any stdout.
Thank you in advance for your help.
#!/usr/bin/env python
from sys import argv
import os
def get_num_bases(file_content):
    total = []
    for linenumber, line in enumerate(file_content):
        mod=linenumber%4
        if mod==0:
            ID = line.strip()[1:]
            #print(ID)
        if mod==1:
            seq = line.strip()
            counting = 0
            counting += seq.count("T")+ seq.count("A") + seq.count("C") + seq.count("G")
            total.append(counting)
    allbases = sum(total)
    print("Number of bases are: " , allbases)
def get_num_reads(file_content):
    total_1 = []
    for line in file_content:
        num_reads = 0
        num_reads += content.count(line)
        total_1.append(num_reads)
    print("Number of reads are: ", sum(total_1)/int(4))
if __name__=='__main__':
    path = os.getcwd()
    dir_files = os.listdir(path)
    list_files = []
    for file in dir_files:
        if file.endswith("fastq.gz"):
            if file not in list_files:
                file_content = open(file, "r").readlines()
                list_files.append(file)
                print("This is the filename: ", file, get_num_bases(file_content), get_num_reads(file_content))
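A few things in the script above stand out: the .fastq.gz files are opened with plain open() although they are gzip-compressed, get_num_reads refers to a name content that is never defined, and os.listdir gives no ordering guarantee. The following is a minimal sketch of one way to address these, assuming standard 4-line FASTQ records. The function names are made up for illustration, and it counts only A, C, G and T, as the original does.

```python
import gzip
import os


def count_reads_and_bases(path):
    """Return (reads, bases) for one gzipped FASTQ file with 4-line records."""
    reads = 0
    bases = 0
    # gzip.open in text mode ("rt") decompresses while reading
    with gzip.open(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # second line of a record is the sequence
                seq = line.strip()
                reads += 1
                # count only A, C, G and T, as in the question (N is skipped)
                bases += sum(seq.count(base) for base in "ACGT")
    return reads, bases


def summarize_directory(folder, suffix="fastq.gz"):
    """Return [(filename, reads, bases), ...] in alphabetical file-name order."""
    rows = []
    for name in sorted(os.listdir(folder)):
        if name.endswith(suffix):
            reads, bases = count_reads_and_bases(os.path.join(folder, name))
            rows.append((name, reads, bases))
    return rows
```

Sorting the names makes the order predictable; if the order "seen on screen" is not alphabetical, a different sort key would be needed. Returning the numbers instead of printing inside the functions also avoids printing None in the final report.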

Read user inputted .fasta file and parse using Biopython?

I am trying to create a python script where the user can type in their FASTA file and that file will then be parsed using Biopython. I am struggling to get this to work. The script I have thus far is this:
#!/usr/bin/python3
file_name = input("Insert full file name including the fasta extension: ")
with open(file_name, "r") as inf:
    seq = inf.read()
from Bio.SeqIO.FastaIO import SimpleFastaParser
count = 0
total_len = 0
with open(inf) as in_file:
    for title, seq in SimpleFastaParser(in_file):
        count += 1
        total_len += len(seq)
print("%i records with total sequence length %i" % (count, total_len))
I would like the user to be prompted to type in their file name and its extension, and that file should then be parsed with Biopython so that the output is printed. I also want to be able to send the printed output to a log file. Any help would be appreciated.
The purpose of the script is to take a FASTA file, parse it and trim primers. I know there is an easy way to do this entirely with Biopython, but as per the instructions Biopython can only be used to parse, not to trim. Any insight into this would be appreciated as well.
Firstly, you open the FASTA file in two places.
In the first, you store the contents in seq.
In the second, you call open(inf), but inf is the (already closed) file handle from the first with block, not a file name.
You may want to include a check to make sure a valid file path was used.
Also, this is a good case for using argparse:
#!/usr/bin/python3
import argparse
from Bio import SeqIO
import os
import sys
def main(infile):
    # check that the file exists
    if not os.path.isfile(infile):
        print("file not found")
        sys.exit()
    count = 0
    total_len = 0
    for seq_record in SeqIO.parse(open(infile), "fasta"):
        count += 1
        total_len += len(seq_record.seq)
    print("%i records with total sequence length %i" % (count, total_len))
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='some script to do something with fasta files',
                                     formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument('-in', '--infile', type=str, default=None, required=True,
                        help='The path and file name of the fasta file')
    args = parser.parse_args()
    infile = args.infile
    main(infile)
If you need to use input:
#!/usr/bin/python3
from Bio import SeqIO
import os
import sys
infile = input("Insert full file name including the fasta extension: ")
# remove any white space
infile = infile.strip()
# check that the file exists
if not os.path.isfile(infile):
    print("file not found")
    sys.exit()
count = 0
total_len = 0
for seq_record in SeqIO.parse(open(infile), "fasta"):
    count += 1
    total_len += len(seq_record.seq)
print("%i records with total sequence length %i" % (count, total_len))
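The question also asks about sending the printed output to a log file, which the scripts above do not cover. One simple option, sketched here with only the standard library, is a small helper that prints a line and appends the same line to a file. The names report and summary_line, and the idea of passing a log path, are assumptions made for illustration.

```python
def summary_line(count, total_len):
    """Format the same summary line that the scripts above print."""
    return "%i records with total sequence length %i" % (count, total_len)


def report(message, log_path):
    """Print `message` and append it to the log file at `log_path`."""
    print(message)
    # "a" appends, so repeated runs add lines instead of overwriting the log
    with open(log_path, "a") as log_file:
        log_file.write(message + "\n")
```

In either script, the final print(...) could then be replaced with report(summary_line(count, total_len), "fasta_summary.log"), where the log file name is just a placeholder.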

Changing the orientation of a reverse sequence in a FASTA file doesn't work

I am trying to get the reverse sequences orientated correctly in a file. This is the code:
import os
import sys
import pysam
from Bio import SeqIO, Seq, SeqRecord
def main(in_file):
    out_file = "%s.fa" % os.path.splitext(in_file)[0]
    with open(out_file, "w") as out_handle:
        # Write records from the BAM file one at a time to the output file.
        # Works lazily as BAM sequences are read so will handle large files.
        SeqIO.write(bam_to_rec(in_file), out_handle, "fasta")
def bam_to_rec(in_file):
    """Generator to convert BAM files into Biopython SeqRecords.
    """
    bam_file = pysam.Samfile(in_file, "rb")
    for read in bam_file:
        seq = Seq.Seq(read.seq)
        if read.is_reverse:
            seq = seq.reverse_complement()
        rec = SeqRecord.SeqRecord(seq, read.qname, "", "")
        yield rec
if __name__ == "__main__":
    main(*sys.argv[1:])
When I print out the reverse sequences, the code works, but in the output file they are still written as reverse sequences. Can anyone help me find out what is going wrong?
Here is the link to my infile:
https://www.dropbox.com/sh/68ui8l7nh5fxatm/AABUr82l01qT1nL8I_XgJaeTa?dl=0
Note that the ugly counter is only there to print 10000 sequences and no more.
I am comparing one output that never reverses with one that reverses if needed.
Here's the output for a couple of sequences; feel free to test it. I think your issue is that yield returns an iterator but you are not iterating over it, unless I am misunderstanding what you are doing:
Original:
SOLEXA-1GA-2:2:93:1281:961#0
GGGTTAGGTTAGGGTTAGGGTTAGGGTTAGGGTTAG
Becomes:
SOLEXA-1GA-2:2:93:1281:961#0
CTAACCCTAACCCTAACCCTAACCCTAACCTAACCC
And if not reverse:
Original:
SOLEXA-1GA-2:2:12:96:1547#0
ACACACAAACACACACACACACACACACACACCCCC
Becomes:
SOLEXA-1GA-2:2:12:96:1547#0
ACACACAAACACACACACACACACACACACACCCCC
Here's my code:
import os
import sys
import pysam
from Bio import SeqIO, Seq, SeqRecord
def main(in_file):
    out_file = "%s.fa" % os.path.splitext(in_file)[0]
    with open('test_non_reverse.txt', 'w') as non_reverse:
        with open(out_file, "w") as out_handle:
            # Write records from the BAM file one at a time to the output file.
            # Works lazily as BAM sequences are read so will handle large files.
            i = 0
            for s in bam_to_rec(in_file):
                if i == 10000:
                    break
                i +=1
                SeqIO.write(s, out_handle, "fasta")
        i = 0
        for s in convert_to_seq(in_file):
            if i == 10000:
                break
            i +=1
            SeqIO.write(s, non_reverse, 'fasta')
def convert_to_seq(in_file):
    bam_file = pysam.Samfile(in_file, "rb")
    for read in bam_file:
        seq = Seq.Seq(read.seq)
        rec = SeqRecord.SeqRecord(seq, read.qname, "", "")
        yield rec
def bam_to_rec(in_file):
    """Generator to convert BAM files into Biopython SeqRecords.
    """
    bam_file = pysam.Samfile(in_file, "rb")
    for read in bam_file:
        seq = Seq.Seq(read.seq)
        if read.is_reverse:
            seq = seq.reverse_complement()
        rec = SeqRecord.SeqRecord(seq, read.qname, "", "")
        yield rec
if __name__ == "__main__":
    main(*sys.argv[1:])
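The "Original" / "Becomes" pair shown above can be checked without pysam or Biopython. This is a small pure-Python sketch of a reverse complement, assuming upper-case sequences that contain only A, C, G and T; the function name is made up for illustration and it is not how Biopython implements reverse_complement().

```python
# Translation table for the four unambiguous DNA bases (upper case only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement of an upper-case DNA string."""
    # complement every base, then reverse the whole string
    return sequence.translate(_COMPLEMENT)[::-1]
```

Applied to the first read above, reverse_complement("GGGTTAGGTTAGGGTTAGGGTTAGGGTTAGGGTTAG") gives the sequence listed under "Becomes", which is a quick way to confirm that the records written to the file really were flipped.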

Python select a file from a list

I have a folder that contains several log files that I will parse with Python.
I would like to show the list of files contained in the folder, like this:
[1] FileName1.log
[2] FileName2.log
The user can then choose the right file by typing its number in the list.
For instance, to parse the file "FileName2.log", the user presses 2.
In my script I can show the list of files, but I don't know how to pick a file from the list by index.
This is my script
import os
import sys
items = os.listdir("D:/Logs")
fileList = []
for names in items:
    if names.endswith(".log"):
        fileList.append(names)
cnt = 0
for fileName in fileList:
    sys.stdout.write( "[%d] %s\n\r" %(cnt, fileName) )
    cnt = cnt + 1
fileName = raw_input("\n\rSelect log file [0 -" + str(cnt) + " ]: ")
Thanks for the help!
import os
import sys
items = os.listdir("D:/Logs")
fileList = [name for name in items if name.endswith(".log")]
for cnt, fileName in enumerate(fileList, 1):
    sys.stdout.write("[%d] %s\n\r" % (cnt, fileName))
choice = int(input("Select log file[1-%s]: " % cnt))
# the menu is numbered from 1, but list indices start at 0
print(fileList[choice - 1])
This is your own version of the code with a few modifications; I hope it serves your purpose.
If you have the names in an array like this:
fileList = ['FileName1.log','FileName2.log']
you can pull them out by index (remember that arrays are 0-indexed), so fileList[0] would be 'FileName1.log'.
When you ask the user to input a number (e.g. 0, 1, 2), you then use that number, converted to an integer, to get the file you want, like this:
fileToRead=fileList[userInput]
If you asked for 1, 2, 3, you would need to use userInput-1 to make sure it is correctly 0-indexed.
Then you open the file you now have:
f=open(fileToRead, 'r')
you can read more about open here
If fileList is a list of files and fileName is the user input, you can reference the file the user chose with the following (the input is a string, so it has to be converted to an integer first):
fileList[int(fileName)]
import glob
import os
dirpath = r"D:\Logs" # the directory that contains the log files
prefix = "FileName"
fpaths = glob.glob(os.path.join(dirpath, "{}*.log".format(prefix))) # get all the log files
# sort the log files by number, using the base name rather than the full path
fpaths.sort(key=lambda fpath: int(os.path.basename(fpath).split('.',1)[0][len(prefix):]))
print("Select a file to view:")
for i,fpath in enumerate(fpaths, 1):
    print("[{}]: {}".format(i, os.path.basename(fpath)))
choice = int(input("Enter a selection number: ")) # assuming valid inputs
choice -= 1 # correcting for python's 0-indexing
print("You have chosen", os.path.basename(fpaths[choice]))
Just add in the end something like this...
sys.stdout.write(fileList[int(fileName)])
Indexing in Python, as in many other languages, starts from 0. Try this:
import os
import sys
items = os.listdir("D:/Logs")
fileList = []
for names in items:
    if names.endswith(".log"):
        fileList.append(names)
cnt = 0
for fileName in fileList:
    sys.stdout.write( "[%d] %s\n\r" %(cnt, fileName) )
    cnt = cnt + 1
fileName = int(raw_input("\n\rSelect log file [0 - " + str(cnt - 1) + "]: "))
print(fileList[fileName])
You need to cast input from raw_input() to int. And then you can use the obtained number as index for your list. 0 is the first file, 1 is the second file etc.
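Most of the snippets above assume the user types a valid number. As a hedged sketch, the conversion and the range check can be wrapped in one small function; choose_file is a made-up name, and it assumes the menu is numbered from 1 as in the question's example.

```python
def choose_file(file_list, user_input):
    """Map a 1-based menu selection (as typed text) to an entry of file_list.

    Returns None when the text is not a number or is out of range.
    """
    try:
        number = int(user_input.strip())
    except ValueError:
        return None
    if 1 <= number <= len(file_list):
        return file_list[number - 1]  # menu starts at 1, list indices at 0
    return None
```

The caller can keep asking until the result is not None, passing in whatever raw_input() (Python 2) or input() (Python 3) returned.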

python equivalent to sed

Is there a way, without a double loop, to accomplish what the following sed command does?
Input:
Time
Banana
spinach
turkey
sed -i "/Banana/ s/$/Toothpaste/" file
Output:
Time
BananaToothpaste
spinach
turkey
What I have so far is a double loop over two lists, which would take a long time to go through.
List A has a bunch of numbers.
List B has the same bunch of numbers, but in a different order.
For each entry in A, I want to find the line in B with that same number and add value C to the end of it.
Hope this makes sense, even if my example doesn't.
I was doing the following in Bash and it was working; however, it was super slow...
for line in $(cat DATSRCLN.txt.utf8); do
    srch=$(echo $line | awk -F'^' '{print $1}');
    rep=$(echo $line | awk -F'^' '{print $2}');
    sed -i "/$(echo $srch)/ s/$/^$(echo $rep)/" tmp.1;
done
Thanks!
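For the underlying task described here (for each number in A, find its line in B and append a value), a dictionary lookup avoids the nested loop: build the table once, then make a single pass over B. The sketch below assumes the "key^value" layout implied by the awk calls in the Bash loop, and additionally assumes that each line of the target file starts with the key as its first ^-separated field. That second point is a guess about the data, since the sed address in the question matches the key anywhere in the line. The function name append_values is made up.

```python
def append_values(pairs_lines, target_lines, sep="^"):
    """Append the value for each key to the target line that starts with that key.

    pairs_lines:  lines shaped "key^value" (like DATSRCLN.txt.utf8 in the question)
    target_lines: lines whose first sep-separated field is the key (an assumption)
    """
    # One pass over the pairs builds the lookup table.
    lookup = {}
    for line in pairs_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key, _, value = line.partition(sep)
        lookup[key] = value

    # One pass over the target lines; each dict lookup is constant time on average.
    result = []
    for line in target_lines:
        line = line.rstrip("\n")
        key = line.split(sep, 1)[0]
        if key in lookup:
            line = line + sep + lookup[key]
        result.append(line)
    return result
```

Both arguments can be open file objects, so the work is two linear passes instead of one sed run over the whole file per key.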
Using re.sub():
newstring = re.sub('(Banana)', r'\1Toothpaste', oldstring)
This captures one group (between the first pair of parentheses) and replaces it with itself (the \1 part) followed by the desired suffix. The replacement needs to be a raw string (r'') so that the backslash escape is interpreted correctly.
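Note that this puts the suffix directly after the match, whereas the sed command in the question appends it at the end of every line that contains the match; the two only coincide when Banana is the last thing on the line. A minimal sketch that is closer to the sed behaviour uses re.MULTILINE, so that ^ and $ anchor at each line. The function name is made up, and unlike sed's address the search text is treated as a literal string here.

```python
import re


def append_to_matching_lines(text, needle, suffix):
    """Append `suffix` to every line of `text` that contains `needle`.

    `needle` is treated as a literal string (re.escape), unlike sed's regex address.
    """
    pattern = re.compile(r"^(.*" + re.escape(needle) + r".*)$", re.MULTILINE)
    # A function replacement avoids backslash surprises if `suffix` contains any.
    return pattern.sub(lambda match: match.group(1) + suffix, text)
```

With the input from the question, append_to_matching_lines(text, "Banana", "Toothpaste") changes only the Banana line and leaves the other lines as they were.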
A latecomer to the race: here is my implementation of sed in Python:
import re
import shutil
from tempfile import mkstemp
def sed(pattern, replace, source, dest=None, count=0):
    """Reads a source file and writes the destination file.
    In each line, replaces pattern with replace.
    Args:
        pattern (str): pattern to match (can be re.pattern)
        replace (str): replacement str
        source (str): input filename
        count (int): number of occurrences to replace
        dest (str): destination filename, if not given, source will be over written.
    """
    fin = open(source, 'r')
    # start from zero so that count limits how many lines get changed
    num_replaced = 0
    if dest:
        fout = open(dest, 'w')
    else:
        fd, name = mkstemp()
        fout = open(name, 'w')
    for line in fin:
        out = re.sub(pattern, replace, line)
        fout.write(out)
        if out != line:
            num_replaced += 1
        if count and num_replaced >= count:
            break
    # copy whatever is left of the source unchanged
    try:
        fout.writelines(fin.readlines())
    except Exception as E:
        raise E
    fin.close()
    fout.close()
    if not dest:
        shutil.move(name, source)
examples:
sed('foo', 'bar', "foo.txt")
will replace all 'foo' with 'bar' in foo.txt
sed('foo', 'bar', "foo.txt", "foo.updated.txt")
will replace all 'foo' with 'bar' in 'foo.txt' and save the result in "foo.updated.txt".
sed('foo', 'bar', "foo.txt", count=1)
will replace only the first occurrence of 'foo' with 'bar' and save the result in the original file 'foo.txt'
If you are using Python3 the following module will help you:
https://github.com/mahmoudadel2/pysed
wget https://raw.githubusercontent.com/mahmoudadel2/pysed/master/pysed.py
Place the module file into your Python3 modules path, then:
import pysed
pysed.replace(<Old string>, <Replacement String>, <Text File>)
pysed.rmlinematch(<Unwanted string>, <Text File>)
pysed.rmlinenumber(<Unwanted Line Number>, <Text File>)
You can actually call sed from Python. There are many ways to do this, but I like to use the sh module. (yum -y install python-sh)
The output of my example program is as follows.
[me@localhost sh]$ cat input
Time
Banana
spinich
turkey
[me@localhost sh]$ python test_sh.py
[me@localhost sh]$ cat input
Time
Toothpaste
spinich
turkey
[me@localhost sh]$
Here is test_sh.py
import sh
sh.sed('-i', 's/Banana/Toothpaste/', 'input')
This will probably only work under Linux.
It's possible to do this with a temporary file, with low system requirements and in a single pass, without copying the whole file into memory:
#!/usr/bin/python
import tempfile
import shutil
import os
# mkstemp() creates a temporary file (mkdtemp() would create a directory)
fd, newfile = tempfile.mkstemp()
os.close(fd)
oldfile = 'stack.txt'
f = open(oldfile)
n = open(newfile,'w')
for i in f:
    if i.find('Banana') == -1:
        n.write(i)
        continue
    # Last row
    if i.find('\n') == -1:
        i += 'ToothPaste'
    else:
        i = i.rstrip('\n')
        i += 'ToothPaste\n'
    n.write(i)
f.close()
n.close()
os.remove(oldfile)
shutil.move(newfile,oldfile)
I found the answer supplied by Oz123 to be great, but it didn't seem to work 100%. I'm new to Python, but I modified it and wrapped it up to run from a bash script. This works on OS X, using Python 2.7.
# Replace 1 occurrence in file /tmp/1
$ search_replace "Banana" "BananaToothpaste" /tmp/1
# Replace 5 occurrences and save in /tmp/2
$ search_replace "Banana" "BananaToothpaste" /tmp/1 /tmp/2 5
search_replace
#!/usr/bin/env python
import sys
import re
import shutil
from tempfile import mkstemp
total = len(sys.argv)-1
cmdargs = str(sys.argv)
if (total < 3):
    print ("Usage: SEARCH_FOR REPLACE_WITH IN_FILE {OUT_FILE} {COUNT}")
    print ("by default, the input file is replaced")
    print ("and the number of times to replace is 1")
    sys.exit(1)
# Parsing args one by one
search_for = str(sys.argv[1])
replace_with = str(sys.argv[2])
file_name = str(sys.argv[3])
if (total < 4):
    file_name_dest=file_name
else:
    file_name_dest = str(sys.argv[4])
if (total < 5):
    count = 1
else:
    count = int(sys.argv[5])
def sed(pattern, replace, source, dest=None, count=0):
    """Reads a source file and writes the destination file.
    In each line, replaces pattern with replace.
    Args:
        pattern (str): pattern to match (can be re.pattern)
        replace (str): replacement str
        source (str): input filename
        count (int): number of occurrences to replace
        dest (str): destination filename, if not given, source will be over written.
    """
    fin = open(source, 'r')
    num_replaced = 0
    fd, name = mkstemp()
    fout = open(name, 'w')
    for line in fin:
        if count and num_replaced < count:
            out = re.sub(pattern, replace, line)
            fout.write(out)
            if out != line:
                num_replaced += 1
        else:
            fout.write(line)
    fin.close()
    fout.close()
    if file_name == file_name_dest:
        shutil.move(name, file_name)
    else:
        shutil.move(name, file_name_dest)
sed(search_for, replace_with, file_name, file_name_dest, count)
With thanks to Oz123 above, here is a sed that does not work line by line, so your replacement can span newlines. Larger files could be a problem, since the whole file is read into memory.
import re
import shutil
from tempfile import mkstemp
def sed(pattern, replace, source, dest=None):
    """Reads a source file and writes the destination file.
    Replaces pattern with replace globally through the file.
    This is not line-by-line so the pattern can span newlines.
    Args:
        pattern (str): pattern to match (can be re.pattern)
        replace (str): replacement str
        source (str): input filename
        dest (str): destination filename, if not given, source will be over written.
    """
    if dest:
        fout = open(dest, 'w')
    else:
        fd, name = mkstemp()
        fout = open(name, 'w')
    with open(source, 'r') as file:
        data = file.read()
        p = re.compile(pattern)
        new_data = p.sub(replace, data)
        fout.write(new_data)
    fout.close()
    if not dest:
        shutil.move(name, source)
massedit
You could use it as a command-line tool:
# Will change all test*.py in subdirectories of tests.
massedit.py -e "re.sub('failIf', 'assertFalse', line)" -s tests test*.py
You could also use it as a library:
import massedit
filenames = ['massedit.py']
massedit.edit_files(filenames, ["re.sub('Jerome', 'J.', line)"])
You can use sed, awk or grep from Python (with some restrictions). Here is a very simple example that changes Banana to BananaToothpaste in the file; you can edit it and use it. (I tested it and it worked. Note: if you are testing under Windows, you need to install the sed command and add it to the path first.)
import os
file="a.txt"
oldtext="Banana"
newtext=" BananaToothpaste"
os.system('sed -i "s/{}/{}/g" {}'.format(oldtext,newtext,file))
#print(f'sed -i "s/{oldtext}/{newtext}/g" {file}')
print('This command was applied: sed -i "s/{}/{}/g" {}'.format(oldtext,newtext,file))
If you want to see the result in the file directly, use "type" on Windows or "cat" on Linux:
####FOR WINDOWS:
os.popen("type " + file).read()
####FOR LINUX:
os.popen("cat " + file).read()
