How to handle MapReduce keys in Python Hadoop streaming

I have two key values from my map function: NY and Other. So the output of my mapper is either NY 1 or Other 1; only these two cases.
My map function:
#!/usr/bin/env python
import sys
import csv
import string

reader = csv.reader(sys.stdin, delimiter=',')
for entry in reader:
    if len(entry) == 22:
        registration_state = entry[16]
        print('{0}\t{1}'.format(registration_state, 1))
Now I need a reducer to process the map output. My reducer:
#!/usr/bin/env python
import sys
import string

ny = 0
other = 0

# Input comes from STDIN (the stream of mapper output)
for line in sys.stdin:
    # Remove leading and trailing whitespace
    line = line.strip()
    # Split into key/value
    key, value = line.split('\t', 1)
    value = int(value)
    # Tally the key into one of the two counters
    if key == 'NY':
        ny = ny + 1
    else:
        other = other + 1

# Output the result for both keys
print('{0}\t{1}'.format('NY', ny))
print('{0}\t{1}'.format('Other', other))
From these, MapReduce gives two output files, each containing both NY and Other counts, e.g. one contains NY 1248, Other 4677 and the other contains NY 0, Other 1000. This is because two reducers split the map output between them, so combining (merging) the two files gives the final result.
However, I would like to change my reduce or map functions so that each reducer processes only one key, i.e. one reducer deals only with NY as the key and the other works only on Other. I expect results like one file containing
NY 1258, Other 0; and the other: NY 0, Other 5677.
How can I adjust my functions to achieve the results I expect?

You probably need to use Python iterators and generators.
An excellent example is given at this link. I have tried rewriting your code along the same lines (not tested).
Mapper:
#!/usr/bin/env python
"""A more advanced Mapper, using Python iterators and generators."""
import csv
import sys

def main(separator='\t'):
    reader = csv.reader(sys.stdin, delimiter=',')
    for entry in reader:
        if len(entry) == 22:
            registration_state = entry[16]
            print '%s%s%d' % (registration_state, separator, 1)

if __name__ == "__main__":
    main()
Reducer:
#!/usr/bin/env python
"""A more advanced Reducer, using Python iterators and generators."""
from itertools import groupby
from operator import itemgetter
import sys

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    data = read_mapper_output(sys.stdin, separator=separator)
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for current_word, count in group)
            print "%s%s%d" % (current_word, separator, total_count)
        except ValueError:
            # count was not a number, so silently discard this item
            pass

if __name__ == "__main__":
    main()
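To sanity-check the pair locally before running a streaming job, you can emulate Hadoop's shuffle, which sorts mapper output by key before it reaches the reducers. A minimal Python 3 sketch, assuming the scripts above are saved as mapper.py and reducer.py and the input is data.csv (all three filenames are hypothetical):
# Minimal local harness: map -> sort (simulated shuffle) -> reduce.
# mapper.py, reducer.py and data.csv are hypothetical filenames.
import subprocess

with open('data.csv') as src:
    mapped = subprocess.run(['python', 'mapper.py'], stdin=src,
                            capture_output=True, text=True).stdout

# Hadoop sorts mapper output by key between the map and reduce phases;
# sorting whole lines approximates that here.
shuffled = ''.join(sorted(mapped.splitlines(True)))

reduced = subprocess.run(['python', 'reducer.py'], input=shuffled,
                         capture_output=True, text=True).stdout
print(reduced)
If the totals look right here, the same scripts should behave the same way under Hadoop streaming.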

Related

How do I quickly extract data from this massive csv file?

I have genomic data from 16 nuclei. The first column represents the nucleus, the next two columns represent the scaffold (section of genome) and the position on the scaffold respectively, and the last two columns represent the nucleotide and coverage respectively. There can be equal scaffolds and positions in different nuclei.
Using input for start and end positions (scaffold and position of each), I'm supposed to output a csv file which shows the data (nucleotide and coverage) of each nucleus within the range from start to end. I was thinking of doing this by having 16 columns (one for each nucleus), and then showing the data from top to bottom. The leftmost column would be the reference genome in that range, which I access by creating a dictionary for each of its scaffolds.
In my code, I have a defaultdict of lists: the key is a string which combines the scaffold and the location, and the value is a list of lists, so that for each nucleus the data can be appended to the same location, and in the end each location has data from every nucleus.
Of course, this is very slow. How should I be doing it instead?
Code:
# let's plan this
# input is start and finish - when you hit first, add it and keep going until you hit next or larger
# dictionary of arrays
# loop through everything, output data for each nucleus
import csv
from collections import defaultdict

inrange = 0
start = 'scaffold_41,51335'
end = 'scaffold_41,51457'
locations = defaultdict(list)
count = 0
genome = defaultdict(lambda: defaultdict(dict))
scaffold = ''
for line in open('Allpaths_SL1_corrected.fasta', 'r'):
    if line[0] == '>':
        scaffold = line[1:].rstrip()
    else:
        genome[scaffold] = line.rstrip()
print('Genome dictionary done.')
with open('automated.csv', 'rt') as read:
    for line in csv.reader(read, delimiter=','):
        if line[1] + ',' + line[2] == start:
            inrange = 1
        if inrange == 1:
            locations[line[1] + ',' + line[2]].append([line[3], line[4]])
        if line[1] + ',' + line[2] == end:
            inrange = 0
        count += 1
        if count % 1000000 == 0:
            print('Checkpoint ' + str(count) + '!')
with open('region.csv', 'w') as fp:
    wr = csv.writer(fp, delimiter=',', lineterminator='\n')
    for key in locations:
        nuclei = []
        for i in range(0, 16):
            try:
                nuclei.append(locations[key][i])
            except IndexError:
                nuclei.append(['', ''])
        wr.writerow([genome[key[0:key.index(',')]][int(key[key.index(',') + 1:]) - 1], key, nuclei])
print('Done!')
Files:
https://drive.google.com/file/d/0Bz7WGValdVR-bTdOcmdfRXpUYUE/view?usp=sharing
https://drive.google.com/file/d/0Bz7WGValdVR-aFdVVUtTbnI2WHM/view?usp=sharing
(Only focusing on the CSV section in the middle of your code)
The example csv file you supplied is over 2GB and 77,822,354 lines. Of those lines, you seem to only be focused on 26,804,253 lines or about 1/3.
As a general suggestion, you can speed things up by:
Avoiding processing the data you are not interested in (2/3 of the file);
Speeding up identification of the data you are interested in;
Avoiding operations repeated millions of times that tend to be slow (parsing each line as csv, reassembling a string, etc.);
Avoiding reading all the data at once when you can break it up into blocks or lines (memory will get tight);
Using faster tools like numpy, pandas and pypy (see the sketch after this list).
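For instance, a hedged sketch of the pandas route (the column names and chunk size are assumptions, not taken from your file):
# Sketch only: stream the big csv in bounded-memory chunks with pandas.
# Column names are assumptions based on the data description above.
import pandas as pd

cols = ['nucleus', 'scaffold', 'pos', 'nt', 'cov']
for chunk in pd.read_csv('automated.csv', header=None, names=cols,
                         chunksize=1000000):
    block = chunk[(chunk['scaffold'] == 'scaffold_41') &
                  chunk['pos'].between(51335, 51457)]
    # ... append 'block' to your results here ...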
Your data is block oriented, so you can use a FlipFlop type object to sense whether you are in a block or not.
The first column of your csv is numeric, so rather than splitting the line apart and reassembling two columns, you can use the faster Python 'in' operator to find the start and end of the blocks:
start = ',scaffold_41,51335,'
end = ',scaffold_41,51457,'

class FlipFlop:
    def __init__(self, start_pattern, end_pattern):
        self.patterns = start_pattern, end_pattern
        self.state = False

    def __call__(self, st):
        # True for every line inside a block, inclusive of its first and last lines
        rtr = self.state
        if self.patterns[self.state] in st:
            self.state = not self.state
        return self.state or rtr

lines_in_block = 0
with open('automated.csv') as f:
    ff = FlipFlop(start, end)
    for lc, line in enumerate(f):
        if ff(line):
            lines_in_block += 1

print lines_in_block, lc
Prints:
26804256 77822354
That runs in about 9 seconds in PyPy and 46 seconds in Python 2.7.
You can then take the portion that reads the source csv file and turn that into a generator so you only need to deal with one block of data at a time.
(Certainly not correct as-is, since I spent no time trying to understand your files overall...):
def csv_bloc(fn, start_pat, end_pat):
    from itertools import ifilter
    with open(fn) as csv_f:
        ff = FlipFlop(start_pat, end_pat)
        for block in ifilter(ff, csv_f):
            yield block
Or, if you need to combine all the blocks into one dict:
def csv_line(fn, start, end):
    with open(fn) as csv_in:
        ff = FlipFlop(start, end)
        for line in csv_in:
            if ff(line):
                yield line.rstrip().split(",")

di = {}
for row in csv_line('/tmp/automated.csv', start, end):
    di.setdefault((row[2], row[3]), []).append([row[3], row[4]])
That executes in about 1 minute on my (oldish) Mac in PyPy and about 3 minutes in cPython 2.7.

Categorising data according to one column in python

Hi, I have a dataset as follows, e.g.:
sample pos mutation
2fec2 40 TC
1f3c 40 TC
19b0 40 TC
tld3 60 CG
I want a Python way to, for example, find every instance where 2fec2 and 1f3c have the same mutation, and print it. So far I have tried the following, but it simply returns everything. I am completely new to Python and trying to wean myself off awk - please help!
from sys import argv
script, vcf_file = argv
import vcf

vcf_reader = vcf.Reader(open(vcf_file, 'r'))
for record.affected_start in vcf_reader:  # .affected_start is this module's way of accessing the parsed pos column from a particular type of bioinformatics file
    if record.sample == 2fec2 and 1f3c != 19b0 != t1d3:  # ditto .sample
        print record.affected_start
I'm assuming your data is in the format you describe and not VCF.
You can try to simply parse the file with standard python techniques and for each (pos, mutation) pair, build the set of samples having it as follows:
from sys import argv
from collections import defaultdict

# More convenient than a normal dict: an empty set will be
# automatically created whenever a new key is accessed.
# Keys will be (pos, mutation) pairs;
# values will be sets of sample names.
mutation_dict = defaultdict(set)

# This "with" syntax ("context manager") is recommended
# because file closing will be handled automatically.
with open(argv[1], "r") as data_file:
    # Read the first line and check the headers.
    # (assert <something False>, "message"
    # will make the program exit and display "message")
    assert data_file.readline().strip().split() == ["sample", "pos", "mutation"], "Unexpected column names"
    # .strip() removes the end-of-line character;
    # .split() splits into a list of words
    # (by default using "blanks" as separator).
    # .readline() has "consumed" the first line;
    # now we can loop over the rest of the lines,
    # which should contain the data.
    for line in data_file:
        # Extract the fields
        [sample, pos, mutation] = line.strip().split()
        # Add the sample to the set of samples having
        # this (pos, mutation) combination
        mutation_dict[(pos, mutation)].add(sample)

# Now loop over the (key, value) pairs in our dict:
for (pos, mutation), samples in mutation_dict.items():
    # True if the set intersection (&) is not empty
    if samples & {"2fec2", "1f3c"}:
        print("2fec2 and 1f3c share mutation %s at position %s" % (mutation, pos))
With your example data as first argument of the script, this outputs:
2fec2 and 1f3c share mutation TC at position 40
How about this:
from sys import argv
script, vcf_file = argv
import vcf

vcf_reader = vcf.Reader(open(vcf_file, 'r'))
# Store our results outside of the loop
fecResult = ""
f3cResult = ""
# For each record
for record in vcf_reader:
    if record.sample == "2fec2":
        fecResult = record.mutation
    if record.sample == "1f3c":
        f3cResult = record.mutation
# Outside of the loop, compare the results and if they match print the record
if fecResult == f3cResult:
    print record.affected_start

UPDATE: Calculate vector length according to str value in specific column in Python

I am trying to measure the length of vectors based on a value of the first column of my input data.
For instance: my input data is as follows:
dog nmod+n+-n 4
dog nmod+n+n-a-commitment-n 6
child into+ns-j+vn-pass-rb-divide-v 3
child nmod+n+ns-commitment-n 5
child nmod+n+n-pledge-n 3
hello nmod+n+ns 2
The value that I want to calculate is based on an identical value in the first column. For instance, I would calculate a value based on all rows in which dog is in the first column, then I would calculate a value based on all rows in which child is in the first column... and so on.
I have worked out the mathematics to calculate the vector length (Euclidean norm). However, I am unsure how to group rows by the identical values in the first column and base the calculation on those groups.
So far, this is the code that I have written:
#!/usr/bin/python
import os
import sys
import getopt
import datetime
import math

print "starting:",
print datetime.datetime.now()

def countVectorLength(infile, outfile):
    with open(infile, 'rb') as inputfile:
        flem, _, fw = next(inputfile).split()
        current_lem = flem
        weights = [float(fw)]
        for line in inputfile:
            lem, _, w = line.split()
            if lem == current_lem:
                weights.append(float(w))
            else:
                print current_lem,
                print math.sqrt(sum([math.pow(weight, 2) for weight in weights]))
                current_lem = lem
                weights = [float(w)]
        # Manually flush the last lemma of the file
        print current_lem,
        print math.sqrt(sum([math.pow(weight, 2) for weight in weights]))

print "Finish:",
print datetime.datetime.now()

path = '/Path/to/Input/'
pathout = '/Path/to/Output'
listing = os.listdir(path)
for infile in listing:
    outfile = 'output' + infile
    print "current file is:" + infile
    countVectorLength(path + infile, pathout + outfile)
This code outputs the vector length of each individual lemma. The above data gives me the following output:
dog 7.211102550927978
child 6.48074069840786
hello 2
UPDATE
I have been working on it and have managed to get the working code shown in the sample above. However, as you can see, the code has a problem with outputting the very last lemma of each file, which I have solved rather rudimentarily by manually repeating the print after the loop. Because of this problem, it does not permit a clean iteration through the directory: the results of all files end up appended to one output document. Is there a cleaner, more Pythonic way to output each result directly to its corresponding file in the outpath directory?
First, you need to transform the input into something like
dog => [4, 6]
child => [3, 5, 3]
etc.
It goes like this:
from collections import defaultdict

data = defaultdict(list)
for line in file:  # 'file' stands for your open input file object
    line = line.split('\t')
    data[line[0]].append(line[2])
Once this is done, the rest is obvious:
def vector_len(vec):
    ...  # you already got that

vector_lens = {name: vector_len(values) for name, values in data.items()}
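Put together, a minimal sketch of the whole pipeline (the filename and whitespace splitting are assumptions based on the sample data above):
import math
from collections import defaultdict

def vector_len(vec):
    # Euclidean norm: square root of the sum of squared weights
    return math.sqrt(sum(w ** 2 for w in vec))

data = defaultdict(list)
with open('input.txt') as f:  # hypothetical filename
    for line in f:
        lemma, _, weight = line.split()
        data[lemma].append(float(weight))

vector_lens = {name: vector_len(values) for name, values in data.items()}
print(vector_lens)  # one Euclidean norm per lemma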

Filtering a CSV file in python

I have downloaded this csv file, which creates a spreadsheet of gene information. What is important is that the HLA-* columns contain gene information. If a gene is given at too low a resolution, e.g. DQB1*03, then the row should be deleted. If the data is at too high a resolution, e.g. DQB1*03:02:01, then the :01 tag at the end needs to be removed. Ideally, I want the proteins to be in the format DQB1*03:02, with two levels of resolution after DQB1*. How can I tell Python to look for these formats and handle the data stored in them?
e.g.
if (csvCell is of format DQB1*03:02:01):
    delete the :01  # but do this in a general way
elif (csvCell is of format DQB1*03):
    delete row
else:
    go to next line
UPDATE: Edited code I referenced
import csv
import re
import sys

csvdictreader = csv.DictReader(open('mhc.csv', 'r+b'), delimiter=',')
csvdictwriter = csv.DictWriter(file('mhc_fixed.csv', 'wb'), fieldnames=csvdictreader.fieldnames, delimiter=',')
csvdictwriter.writeheader()
targets = [name for name in csvdictreader.fieldnames if name.startswith('HLA-D')]
for rowfields in csvdictreader:
    keep = True
    for field in targets:
        value = rowfields[field]
        if re.match(r'^\w+\*\d\d$', value):
            keep = False
            break  # quit processing target fields
        elif re.match(r'^(\w+)\*(\d+):(\d+):(\d+):(\d+)$', value):
            rowfields[field] = re.sub(r'^(\w+)\*(\d+):(\d+):(\d+):(\d+)$', r'\1*\2:\3', value)
        else:
            # reduce gene resolution if too high,
            # by only keeping the first two alleles if three are present
            rowfields[field] = re.sub(r'^(\w+)\*(\d+):(\d+):(\d+)$', r'\1*\2:\3', value)
    if keep:
        csvdictwriter.writerow(rowfields)
Here's something that I think will do what you want. It's not as simple as Peter's answer because it uses Python's csv module to process the file. It could probably be rewritten and simplified to treat the file as plain text as his does, but that should be easy.
import csv
import re
import sys

csvdictreader = csv.DictReader(sys.stdin, delimiter=',')
csvdictwriter = csv.DictWriter(sys.stdout, fieldnames=csvdictreader.fieldnames, delimiter=',')
csvdictwriter.writeheader()
targets = [name for name in csvdictreader.fieldnames if name.startswith('HLA-')]
for rowfields in csvdictreader:
    keep = True
    for field in targets:
        value = rowfields[field]
        if re.match(r'^DQB1\*\d\d$', value):  # gene resolution too low?
            keep = False
            break  # quit processing target fields
        else:
            # reduce gene resolution if too high,
            # by only keeping the first two alleles if three are present
            rowfields[field] = re.sub(r'^DQB1\*(\d\d):(\d\d):(\d\d)$',
                                      r'DQB1*\1:\2', value)
    if keep:
        csvdictwriter.writerow(rowfields)
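Since this version reads from sys.stdin and writes to sys.stdout, you can run it as a filter, e.g. python script.py < mhc.csv > mhc_fixed.csv (the filenames here are just placeholders).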
The hardest part for me was determining what you wanted to do.
Here's an ultra-simple filter:
import sys

for line in sys.stdin:
    line = line.replace(',DQB1*03:02:01,', ',DQB1*03:02,')
    if line.find(',DQB1*03,') == -1:
        sys.stdout.write(line)
Or, if you want to use regular expressions
import re
import sys

for line in sys.stdin:
    line = re.sub(r',DQB1\*03:02:01,', ',DQB1*03:02,', line)
    if re.search(r',DQB1\*03,', line) is None:
        sys.stdout.write(line)
Run it as
python script.py < data.csv

Python: Use variable as a lookup reference in CSV and return value

Couldn't find an answer to this, so once I figured it out I thought I'd post my solution...
I was looking for a way of taking a user input (sys.argv[1]) and using this value to perform a lookup in a CSV file for another value in column x (e.g. 5). This would be part of a larger script, and I would use the looked-up value as a test.
My example csv would be:
col0,col1,col2,col3,col4
a,foo,bar,blah,1
b,foo,bar,blah,2
c,foo,bar,blah,3
d,foo,bar,blah,4
e,foo,bar,blah,5
f,foo,bar,blah,6
g,foo,bar,blah,7
h,foo,bar,blah,8
i,foo,bar,blah,9
j,foo,bar,blah,10
k,foo,bar,blah,11
l,foo,bar,blah,12
m,foo,bar,blah,13
n,foo,bar,blah,14
o,foo,bar,blah,15
p,foo,bar,blah,16
q,foo,bar,blah,17
r,foo,bar,blah,18
s,foo,bar,blah,19
t,foo,bar,blah,20
u,foo,bar,blah,21
v,foo,bar,blah,22
w,foo,bar,blah,23
x,foo,bar,blah,24
y,foo,bar,blah,25
z,foo,bar,blah,26
I had a quick look at your Javascript question, and if you're only ever mapping char->position in alphabet, what about the following?
def char_to_pos(char):
    from string import ascii_lowercase
    try:
        return ascii_lowercase.index(char) + 1
    except ValueError as e:
        pass  # no match - do what's sensible here
And if you wanted to pre-generate a lookup table, then something like:
from string import ascii_lowercase
from itertools import count
lookup = dict(zip(ascii_lowercase, count(1)))
# or depending on taste
lookup = {letter: idx for idx, letter in enumerate(ascii_lowercase, start=1)}
Otherwise, since CSV files are generally relatively small, you could just load the entire file into RAM to avoid repeated sequential lookups later (as long as the CSV's not so large it'll put your machine in a coma):
import csv

with open('test1.csv') as fin:
    csvin = csv.reader(fin)
    lookup = {row[0]: row for row in csvin}

to_find = 'x'
try:
    print '{} = {[4]}'.format(to_find, lookup[to_find])
except (KeyError, IndexError) as e:
    pass  # KeyError = no lookup match; IndexError = the CSV file didn't have a 5th column...
The way I did this was as follows:
import sys
import csv

# define user input as a variable
input = sys.argv[1]
# read csv file into "fooReader"
fooReader = csv.reader(open('test1.csv', 'rb'), delimiter=',', quotechar='"')
# read each row in "fooReader"
for row in fooReader:
    # define the first column of the row as "value" for testing
    value = row[0]
    # test if value (1st column) is the same as input (user input)
    if value == input:
        # ...if it is, then print the 5th column in a certain way
        print value + " = " + row[4]
The looked-up row[4] can then be assigned to a variable for use in whatever further test is required.
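For reuse in a larger script, the same lookup can be wrapped in a small function (a sketch; the function name and return convention are assumptions):
import csv

def lookup_value(csv_path, key, column=4):
    # Scan the CSV and return the requested column of the first row whose
    # first column matches 'key'; None means no match was found.
    with open(csv_path) as fin:
        for row in csv.reader(fin):
            if row and row[0] == key:
                return row[column]
    return None

value = lookup_value('test1.csv', 'x')  # -> '24' with the sample data above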
