Python MapReduce Hadoop Streaming job that requires 3 input files?

I have 3 small sample input files (the actual files are much larger):
# File Name: books.txt
# File Format: BookID|Title
1|The Hunger Games
2|To Kill a Mockingbird
3|Pride and Prejudice
4|Animal Farm
# File Name: ratings.txt
# File Format: ReaderID|BookID|Rating
101|1|1
102|2|2
103|3|3
104|4|4
105|1|5
106|2|1
107|3|2
108|4|3
# File Name: readers.txt
# File Format: ReaderID|Gender|PostCode|PreferComms
101|M|1000|email
102|F|1001|mobile
103|M|1002|email
104|F|1003|mobile
105|M|1004|email
106|F|1005|mobile
107|M|1006|email
108|F|1007|mobile
I want to create a Python MapReduce Hadoop Streaming job to get the following output, which is the average rating by Title and Gender:
Animal Farm F 3.5
Pride and Prejudice M 2.5
The Hunger Games M 3
To Kill a Mockingbird F 1.5
I searched this forum and someone pointed out a solution, but it is for 2 input files instead of 3. I gave it a go but am stuck at the mapper part, because I am not able to sort the output correctly so that the reducer can recognise the first record for each Title and Gender and then start aggregating. My mapper code is below:
#!/usr/bin/env python
import sys

for line in sys.stdin:
    try:
        ReaderID = "-1"
        BookID = "-1"
        Title = "-1"
        Gender = "-1"
        Rating = "-1"
        line = line.strip()
        splits = line.split("|")
        if len(splits) == 2:
            BookID = splits[0]
            Title = splits[1]
        elif len(splits) == 3:
            ReaderID = splits[0]
            BookID = splits[1]
            Rating = splits[2]
        else:
            ReaderID = splits[0]
            Gender = splits[1]
        print('%s\t%s\t%s\t%s\t%s' % (BookID, Title, ReaderID, Rating, Gender))
    except:
        pass
PS: I need to use Python and Hadoop Streaming only. I am not allowed to install Python packages like Dumbo, mrjob, etc.
Appreciate your help in advance.
Thanks,
Lobbie

I went through some core Java MapReduce examples, and they all suggest that the three files cannot be merged in a single map job: you have to join the first two, then join the result with the third. Applying your logic to all three at once does not give a good result. Hence, I tried pandas, and it seems to give promising results. If using pandas is not a constraint for you, please try my code; otherwise, we can try to join these three files with plain Python dictionaries and lists.
Here is my suggested code. I have just concatenated all the inputs to test it. In your code, just comment out my for loop (line #36) and un-comment your for loop (line #35).
import pandas as pd
import sys

input_string_book = [
    "1|The Hunger Games",
    "2|To Kill a Mockingbird",
    "3|Pride and Prejudice",
    "4|Animal Farm"]
input_string_book_df = pd.DataFrame(columns=('BookID', 'Title'))

input_string_rating = [
    "101|1|1",
    "102|2|2",
    "103|3|3",
    "104|4|4",
    "105|1|5",
    "106|2|1",
    "107|3|2",
    "108|4|3"]
input_string_rating_df = pd.DataFrame(columns=('ReaderID', 'BookID', 'Rating'))

input_string_reader = [
    "101|M|1000|email",
    "102|F|1001|mobile",
    "103|M|1002|email",
    "104|F|1003|mobile",
    "105|M|1004|email",
    "106|F|1005|mobile",
    "107|M|1006|email",
    "108|F|1007|mobile"]
input_string_reader_df = pd.DataFrame(columns=('ReaderID', 'Gender', 'PostCode', 'PreferComms'))

#for line in sys.stdin:
for line in input_string_book + input_string_rating + input_string_reader:
    try:
        line = line.strip()
        splits = line.split("|")
        if len(splits) == 2:
            input_string_book_df = input_string_book_df.append(
                pd.DataFrame([[splits[0], splits[1]]], columns=('BookID', 'Title')))
        elif len(splits) == 3:
            input_string_rating_df = input_string_rating_df.append(
                pd.DataFrame([[splits[0], splits[1], splits[2]]],
                             columns=('ReaderID', 'BookID', 'Rating')))
        else:
            input_string_reader_df = input_string_reader_df.append(
                pd.DataFrame([[splits[0], splits[1], splits[2], splits[3]]],
                             columns=('ReaderID', 'Gender', 'PostCode', 'PreferComms')))
    except:
        raise

l_concat_1 = input_string_book_df.merge(input_string_rating_df, on='BookID', how='inner')
l_concat_2 = l_concat_1.merge(input_string_reader_df, on='ReaderID', how='inner')

for each_iter in l_concat_2[['BookID', 'Title', 'ReaderID', 'Rating', 'Gender']].iterrows():
    print('%s\t%s\t%s\t%s\t%s' % (each_iter[1][0], each_iter[1][1], each_iter[1][2],
                                  each_iter[1][3], each_iter[1][4]))
Output
1 The Hunger Games 101 1 M
1 The Hunger Games 105 5 M
2 To Kill a Mockingbird 102 2 F
2 To Kill a Mockingbird 106 1 F
3 Pride and Prejudice 103 3 M
3 Pride and Prejudice 107 2 M
4 Animal Farm 104 4 F
4 Animal Farm 108 3 F
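To go from these joined rows to the final averages, you still need a reducer. Here is a minimal reducer sketch (my addition, not part of the original answer); it assumes the mapper emits the five tab-separated fields shown above and that all rows for a given (Title, Gender) pair reach a single reducer:

#!/usr/bin/env python
# Minimal reducer sketch: consumes joined rows of the form
# BookID \t Title \t ReaderID \t Rating \t Gender
# and prints the average rating per (Title, Gender). It buffers sums in
# a dict, so it assumes one reducer (or a partitioner that keeps each
# Title together); for truly large data you would rely on Hadoop's sort
# order and stream group by group instead.
import sys

totals = {}  # (Title, Gender) -> [sum_of_ratings, count]
for line in sys.stdin:
    fields = line.strip().split('\t')
    if len(fields) != 5:
        continue
    book_id, title, reader_id, rating, gender = fields
    s = totals.setdefault((title, gender), [0.0, 0])
    s[0] += float(rating)
    s[1] += 1

for (title, gender) in sorted(totals):
    total, count = totals[(title, gender)]
    print('%s\t%s\t%g' % (title, gender, total / count))

Run against the eight joined rows above, this prints exactly the four averages requested in the question.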

Related

Unable to search and compile regex code from each line in Python

I am trying to write a program to match a regex in a file. The initial lines of my file look as shown below:
Alternate Take with Liz Copeland (Day 1) (12am-1am)
Saturday March 31, 2007
No. Artist Song Album (Label) Comment
buy 1 Tones on Tail Go! (club mix) Everything! (Beggars Banquet)
buy 2 Devo (I Can't Get No) Satisfaction Anthology: Pioneers Who Got Scalped (Warner Archives/Rhino)
My code to match the first line of the file is as follows:
with open("data.csv") as my_file:
for line in my_file:
re_show = re.compile(r'(Alternate Take with Liz Copeland) \((.*?)\)\s\((.*?)\)')
num_showtitle_lines_matched = 0
m_show = re.match(re_show, line)
bool(m_show) == 1
if m_show:
num_showtitle_lines_matched += 1
show_title = m_show.group()
print("Num show lines matched --> {}".format(num_showtitle_lines_matched))
print(show_title)
It should give me the result below:
Alternate Take with Liz Copeland (Day 1) (12am-1am)
num_showtitle_lines_matched -->1
But my code doesn't produce any output.
Please let me know how to accomplish this. Thanks in advance.
As in the comment:
just put the num_showtitle_lines_matched = 0 above the loop:
with open("data.csv") as my_file:
num_showtitle_lines_matched = 0
for line in my_file:
re_show = re.compile(r'(Alternate Take with Liz Copeland) \((.*?)\)\s\((.*?)\)')
m_show = re.match(re_show, line)
bool(m_show) == 1
if m_show:
num_showtitle_lines_matched += 1
show_title = m_show.group()
print("Num show lines matched --> {}".format(num_showtitle_lines_matched))
print(show_title)
Output:
Num show lines matched --> 1
Alternate Take with Liz Copeland (Day 1) (12am-1am)
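As a side note, the pattern only needs compiling once, and the bool(m_show) == 1 line in the original has no effect, so the loop can be tidied a little further (same behaviour, just a suggested cleanup):

import re

# compile once, outside the loop, instead of on every line
re_show = re.compile(r'(Alternate Take with Liz Copeland) \((.*?)\)\s\((.*?)\)')

with open("data.csv") as my_file:
    num_showtitle_lines_matched = 0
    for line in my_file:
        m_show = re_show.match(line)
        if m_show:
            num_showtitle_lines_matched += 1
            print("Num show lines matched --> {}".format(num_showtitle_lines_matched))
            print(m_show.group())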

What's wrong with this pandas read_excel problem?

I want to read a large number of coordinates (about 14,000) from an Excel file and transform them into specific addresses through the Baidu Maps API. But the program only processes the last coordinate. Is there a problem in my code? Here it is:
import requests
import pandas as pd
import json
df = pd.read_excel(r'C:\Users\YDC\Desktop\JW.xlsx')
fw = open(r'C:\Users\YDC\Desktop\result.txt', "w", encoding="utf-8")
for i in range(0, len(df)):
    t1 = df.iloc[i]['lat']
    t2 = df.iloc[i]['lng']
    baiduUrl = "http://api.map.baidu.com/geocoder/v2/?ak=21q0bMSgjdDVe0gLmjClrsuyUA1mvsRx&callback=renderReverse&location=%s,%s&output=json&pois=0" % (t1, t2)
    req = requests.get(baiduUrl)
    content = req.text
content = content.replace("renderReverse&&renderReverse(", "")
content = content[:-1]
baiduAddr = json.loads(content)
country = baiduAddr["result"]["addressComponent"]["country"]
city = baiduAddr["result"]["addressComponent"]["city"]
province = baiduAddr["result"]["addressComponent"]["province"]
new_line = country + "|" + city + "|" + province
fw.write(new_line)
fw.write("\n")
print(new_line)
It can only print the address of the last coordinate:
Czech Republic|Olomouc|Olomouc
How can I get the addresses for the rest of the coordinates?
(The question included a screenshot of the Excel data here, with lat and lng columns.)
This looks like a classic Python loop gotcha.
Consider this:
for i in range(0, 10):
    foo = i
print(foo)  # notice the indentation
Outputs
9
That's because of how variable scope works in Python: you can still reference a variable defined inside the loop after the loop has finished, at which point it holds only the last value.
A very simple fix like such:
for i in range(0, 10):
    foo = i
    print(foo)
Gives the expected result
0
1
2
3
4
5
6
7
8
9
In your case, just make sure that line 12 onwards is indented to the right by one more level, so that it runs inside the loop, as in the sketch below.
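Concretely, the loop from the question would look like this with the indentation fixed (same code as in the question, body fully indented):

for i in range(0, len(df)):
    t1 = df.iloc[i]['lat']
    t2 = df.iloc[i]['lng']
    baiduUrl = "http://api.map.baidu.com/geocoder/v2/?ak=21q0bMSgjdDVe0gLmjClrsuyUA1mvsRx&callback=renderReverse&location=%s,%s&output=json&pois=0" % (t1, t2)
    req = requests.get(baiduUrl)
    content = req.text
    # everything below now runs once per coordinate, inside the loop
    content = content.replace("renderReverse&&renderReverse(", "")
    content = content[:-1]
    baiduAddr = json.loads(content)
    country = baiduAddr["result"]["addressComponent"]["country"]
    city = baiduAddr["result"]["addressComponent"]["city"]
    province = baiduAddr["result"]["addressComponent"]["province"]
    new_line = country + "|" + city + "|" + province
    fw.write(new_line)
    fw.write("\n")
    print(new_line)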
Related: Scoping in Python 'for' loops

Iterating Over Two Large Lists using Python

I have two tab-delimited files. One is almost 800k lines (an exonic coordinates file) and the other is almost 200k lines (a VCF file).
I am writing Python code to find the positions in the VCF that fall within an exonic coordinate range (exon start and end from the coordinates file) and write them to a file.
However, because the files are big, it takes a couple of days to get the filtered output file.
The code below partially solves the speed issue, but the remaining problem is how to speed up the filtering itself, which is why I used a break to exit the inner loop; I want to start again from the beginning of the outer loop instead of just taking its next element.
Here is my code:
import csv
import os
import sys

list_coord = []
with open('ref_ordered.txt', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter='\t')
    for row in reader:
        list_coord.append((row[0], row[1], row[2]))

def parseVcf(vcf, src):
    done = False
    with open(vcf, 'r') as f:
        reader = csv.reader(f, delimiter='\t')
        vcf_out_split = vcf.split('.')
        vcf_out_split.insert(2, "output_CORRECT2")
        outpt = open('.'.join(vcf_out_split), 'a')
        for coord in list_coord:
            for row in reader:
                if '#' not in row[0]:
                    coor_genom = int(row[1])
                    coor_exon1 = int(coord[1]) + 1
                    coor_exon2 = int(coord[2])
                    coor_genom_chr = row[0]
                    coor_exon_chr = coord[0]
                    ComH = row[7].split(';')
                    for x in ComH:
                        if 'DP4=' in x:
                            DP4_split = x[4:].split(',')
                            if (coor_exon1 <= coor_genom <= coor_exon2):
                                if (coor_genom_chr == coor_exon_chr):
                                    if (int(DP4_split[2]) >= 1 and int(DP4_split[3]) >= 1):
                                        done = True
                                        outpt.write('\t'.join(row) + '\n')
                if done:
                    break
        outpt.close()

for root, dirs, files in os.walk("."):
    for file in files:
        pathname = os.path.join(root, file)
        if file.find("1_1") == 0:
            print "Parsing " + file
            parseVcf(pathname, "1_1")
ref_ordered.txt:
1 69090 70008
1 367658 368597
1 621095 622034
1 861321 861393
1 865534 865716
1 866418 866469
1 871151 871276
1 874419 874509
1_1 Input File:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT directory
1 14907 rs79585140 A G 20 . DP=10;VDB=5.226464e-02;RPB=-6.206015e-01;AF1=0.5;AC1=1;DP4=1,2,5,2;MQ=32;FQ=20.5;PV4=0.5,0.07,0.16,0.33;DN=131;DA=A/G;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.001;CG=-0.312;CADD=1.415;AA=A;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-frequency,by-cluster;DSP=61 GT:PL:GQ 0/1:50,0,51:50
1 14930 rs75454623 A G 44 . DP=9;VDB=7.907652e-02;RPB=3.960091e-01;AF1=0.5;AC1=1;DP4=1,2,6,0;MQ=41;FQ=30.9;PV4=0.083,1,0.085,1;DN=131;DA=A/G;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.000;CG=-1.440;CADD=1.241;AA=A;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-frequency,by-cluster;DSP=38 GT:PL:GQ 0/1:74,0,58:61
1 15211 rs78601809 T G 9.33 . DP=6;VDB=9.014600e-02;RPB=-8.217058e-01;AF1=1;AC1=2;DP4=1,0,3,2;MQ=21;FQ=-37;PV4=1,0.35,1,1;DN=131;DA=T/G;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.001;CG=-0.145;CADD=1.611;AA=T;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-frequency,by-cluster;DSP=171 GT:PL:GQ 1/1:41,10,0:13
1 16146 . A C 25 . DP=10;VDB=2.063840e-02;RPB=-2.186229e+00;AF1=0.5;AC1=1;DP4=7,0,3,0;MQ=39;FQ=27.8;PV4=1,0.0029,1,0.0086;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=unknown;CP=0.001;CG=-0.555;CADD=2.158;AA=A;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DSP=197 GT:PL:GQ 0/1:55,0,68:58
1 16257 rs78588380 G C 40 . DP=18;VDB=9.421102e-03;RPB=-1.327486e+00;AF1=0.5;AC1=1;DP4=3,11,4,0;MQ=50;FQ=43;PV4=0.011,1,1,1;DN=131;DA=G/C;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.001;CG=-2.500;CADD=0.359;AA=G;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DSP=308 GT:PL:GQ 0/1:70,0,249:73
1 16378 rs148220436 T C 39 . DP=7;VDB=2.063840e-02;RPB=-9.980746e-01;AF1=0.5;AC1=1;DP4=0,4,0,3;MQ=50;FQ=42;PV4=1,0.45,1,1;DN=134;DA=T/C;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.016;CG=-2.880;CADD=0.699;AA=T;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-cluster;DSP=227 GT:PL:GQ 0/1:69,0,90:72
OUTPUT File:
1 877831 rs6672356 T C 44.8 . DP=2;VDB=6.720000e-02;AF1=1;AC1=2;DP4=0,0,1,1;MQ=50;FQ=-33;DN=116;DA=T/C;GM=NM_152486.2,XM_005244723.1,XM_005244724.1,XM_005244725.1,XM_005244726.1,XM_005244727.1;GL=SAMD11;FG=missense,missense,missense,missense,missense,intron;FD=unknown;AAC=TRP/ARG,TRP/ARG,TRP/ARG,TRP/ARG,TRP/ARG,none;PP=343/682,343/715,328/667,327/666,234/573,NA;CDP=1027,1027,982,979,700,NA;GS=101,101,101,101,101,NA;PH=0;CP=0.994;CG=2.510;CADD=0.132;AA=C;CN=dgv10n71,dgv2n67,dgv3e1,dgv8n71,dgv9n71,essv2408,essv4734,nsv10161,nsv428334,nsv509035,nsv517709,nsv832980,nsv871547,nsv871883;DG;DV=by-cluster,by-1000G;DSP=38;CPG=875731-878363;GESP=C:8470/T:0;PAC=NP_689699.2,XP_005244780.1,XP_005244781.1,XP_005244782.1,XP_005244783.1,NA GT:PL:GQ 1/1:76,6,0:10
1 878000 . C T 44.8 . DP=2;VDB=7.520000e-02;AF1=1;AC1=2;DP4=0,0,1,1;MQ=50;FQ=-33;GM=NM_152486.2,XM_005244723.1,XM_005244724.1,XM_005244725.1,XM_005244726.1,XM_005244727.1;GL=SAMD11;FG=synonymous,synonymous,synonymous,synonymous,synonymous,intron;FD=unknown;AAC=LEU,LEU,LEU,LEU,LEU,none;PP=376/682,376/715,361/667,360/666,267/573,NA;CDP=1126,1126,1081,1078,799,NA;CP=0.986;CG=3.890;CADD=2.735;AA=C;CN=dgv10n71,dgv2n67,dgv3e1,dgv8n71,dgv9n71,essv2408,essv4734,nsv10161,nsv428334,nsv509035,nsv517709,nsv832980,nsv871547,nsv871883;DSP=62;CPG=875731-878363;PAC=NP_689699.2,XP_005244780.1,XP_005244781.1,XP_005244782.1,XP_005244783.1,NA GT:PL:GQ 1/1:76,6,0:10
1 881627 rs2272757 G A 205 . DP=9;VDB=1.301207e-01;AF1=1;AC1=2;DP4=0,0,5,4;MQ=50;FQ=-54;DN=100;DA=G/A;GM=NM_015658.3,XM_005244739.1;GL=NOC2L;FG=synonymous;FD=synonymous-codon,unknown;AAC=LEU;PP=615/750,615/755;CDP=1843;CP=0.082;CG=5.170;CADD=0.335;AA=G;CN=dgv10n71,dgv2n67,dgv3e1,dgv8n71,dgv9n71,essv2408,essv4734,nsv10161,nsv428334,nsv509035,nsv517709,nsv832980,nsv871547,nsv871883;DG;DV=by-frequency,by-cluster,by-1000G;DSP=40;GESP=A:6174/G:6830;PAC=NP_056473.2,XP_005244796.1 GT:PL:GQ 1/1:238,27,0:51
First of all, I did not include any code because it looks like homework to me (I have had homework like this). I will however try to explain the steps I took to improve my scripts, even though I know my solutions are far from perfect.
Your script could be slow because you open the output file in the middle of the read loop and write to it line by line. Try building a list of the lines you want to add to the output file, and start writing only after you are done reading and filtering.
You might also consider writing one function per filter and calling these functions with the line as an argument; that way you can easily add filters later on. I use a counter to keep track of how many filters succeeded, and if in the end counter == len(filters), I add the line to the list. A sketch of both ideas follows below.
Also, why do you use outpt = open('.'.join(vcf_out_split), 'a') in one place but with open(vcf, 'r') as f: in another? Try to be consistent and smart in your choices.
Bioinformatics for the win!
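Here is a rough sketch of the buffered-write, one-function-per-filter idea (my illustration; the function and output file names are made up, not from the original post):

import csv

def in_exon(row, coord):
    # chromosome matches and the position falls inside the exon
    return (row[0] == coord[0]
            and int(coord[1]) + 1 <= int(row[1]) <= int(coord[2]))

def dp4_ok(row):
    # at least one forward and one reverse read supporting the variant
    for field in row[7].split(';'):
        if field.startswith('DP4='):
            dp4 = field[4:].split(',')
            return int(dp4[2]) >= 1 and int(dp4[3]) >= 1
    return False

list_coord = []
with open('ref_ordered.txt') as coords_f:
    for row in csv.reader(coords_f, delimiter='\t'):
        list_coord.append((row[0], row[1], row[2]))

kept = []  # buffer the matching lines instead of writing one by one
with open('1_1.vcf') as vcf_f:
    for row in csv.reader(vcf_f, delimiter='\t'):
        if row[0].startswith('#'):
            continue
        checks = [any(in_exon(row, c) for c in list_coord), dp4_ok(row)]
        if sum(checks) == len(checks):  # every filter succeeded
            kept.append('\t'.join(row) + '\n')

with open('1_1.output.vcf', 'w') as out_f:  # write once, at the end
    out_f.writelines(kept)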
If both of your files are ordered, you can save a lot of time by iterating over them in parallel, always advancing the one with lowest coordinates. This way you will only handle each line once, not many times.
Here's a basic version of your code that only does the coordinate checking (I don't fully understand your DP4 condition, so I'll leave it to you to add that part back in):
import csv

with open(coords_fn) as coords_f, open(vcf_fn) as vcf_f, open(out_fn, "w") as out_f:
    coords = csv.reader(coords_f, delimiter="\t")
    vcf = csv.reader(vcf_f, delimiter="\t")
    out = csv.writer(out_f, delimiter="\t")
    next(vcf)  # discard header row, or use out.writerow(next(vcf)) to preserve it!
    try:
        c = next(coords)
        r = next(vcf)
        while True:
            if int(c[1]) >= int(r[1]):  # vcf file is behind
                r = next(vcf)
            elif int(c[2]) < int(r[1]):  # coords file is behind
                c = next(coords)
            else:  # int(c[1]) < int(r[1]) <= int(c[2])
                out.writerow(r)  # add DP4 check here, and indent this line under it
                r = next(vcf)  # don't indent this line
    except StopIteration:  # one of the files has ended
        pass
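One caveat worth adding to this sketch (my note): it compares positions only, so it assumes a single chromosome or files grouped by chromosome. For multi-chromosome data you would compare (chromosome, position) pairs instead, and both files must already be sorted that way for the parallel scan to be valid.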

Importing a file and putting it inside a tuple in a particular format (Python)

I am really stuck on this code; I've been working on it for 9 hours straight and cannot get it to work. Basically I am importing a file and splitting it to read the lines one by one, and one of the tasks is to rearrange each line. For example, the first line 34543 2 g5000 Joseph James Collindale should come out as ('Collindale, Joseph James', '34543', 'g5000', '2'). So essentially it should loop over each line in the file and rearrange it into that format. I created a function that checks whether the line has 5 or 6 fields, because the two cases need different handling.
def tupleT(myLine):
    myLine = myLine.split()
    if len(myLine) == "5":
        tuple1 = (myLine[4], myLine[3], myLine[0], myLine[2], myLine[1])
        return tuple1
    elif len(myLine) == "6":
        tuple1 = (myLine[5], myLine[3] + myLine[4], myLine[0], myLine[2], myLine[1])
        return tuple1

mylist = []
x = input("Enter filename: ")
try:
    f = open(x)
    myLine = f.readline()
    while (len(myLine) > 0):
        print(myLine[:-1])
        myLine = f.readline()
        tupleT(myLine)
    f.close()
except IOError as e:
    print("Problem opening file")
This is what the original file looks like in TextPad, and it's called studs.txt:
12345 2 G400 Bart Simpson
12346 1 GH46 Lisa Simpson
12347 2 G401 Homer J Simpson
12348 4 H610 Hermione Grainger
12349 3 G400 Harry Potter
12350 1 G402 Herschel Shmoikel Krustofski
13123 3 G612 Wayne Rooney
13124 2 I100 Alex Ferguson
13125 3 GH4P Manuel Pellegrini
13126 2 G400A Mike T Sanderson
13127 1 G401 Amy Pond
13128 2 G402 Matt Smith
13129 2 G400 River Storm
13130 1 H610 Rose Tyler
Here is some commented code to get you started. Your code was a bit hard to read.
Consider renaming first, second and third since I have no idea what they are...
#!/usr/bin/env python

# this is more readable since there are actual names rather than array locations
def order_name(first, second, third, first_name, middle_name, last_name=None):
    if not last_name:
        # if there is no last name, we got the last name in middle_name (5 arguments)
        last_name = middle_name
    else:
        # if there is a middle name, add it to the first name to format as needed
        first_name = "%s %s" % (first_name, middle_name)
    return ("%s, %s" % (last_name, first_name), first, third, second)

with open('studs.txt') as o:
    # this is a more standard way of iterating file rows
    for line in o.readlines():
        # strip line of \n
        line = line.strip()
        print "parsing " + line
        # you can unpack a list to function arguments using the star operator
        print "ordered: " + str(order_name(*line.split()))

Search user input from a string and print

I have a string called new_file that I read from a file with these contents:
;ASP718I
;AspA2I
;AspBHI 0 6 9 15 ...
;AspCNI
;AsuI 37 116 272 348
...
I am using name = raw_input("enter the enzyme ") to get input from the user, and I am trying to print the corresponding line from the above file (new_file).
For the input ;AspBHI I'd like the program to print the corresponding line from the file:
;AspBHI 0 6 9 15 ...
How can I achieve this?
This is a start:
db = dict((x.split(" ")[0], x) for x in new_file.split("\n"))
name = raw_input("enter the enzyme ")
print db[name]
Also try to be nice next time, people might help you with more enthusiasm and even explain their approach.
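One caveat to add (my note, not the original answerer's): if the user enters a name that is not in the file, db[name] raises a KeyError. Using .get gives a friendlier fallback:

print db.get(name, "enzyme not found")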
