Python: fetch sequence from DAS by coordinates

I'm using the UCSC DAS server, which returns DNA sequences by coordinate.
URL: http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr20:30037432,30038060
Sample response:
<DASDNA>
<SEQUENCE id="chr20" start="30037832" stop="30038060" version="1.00">
<DNA length="229">
gtggcacccaaagatgctggaatctttatggcaaatgccgttacagatgc
tccaagaaggaaagagtctatgtttactgcataaataataaaatgtgctg
cgtgaagcccaagtaccagccaaaagaaaggtggtggccattttaactgc
tttgaagcctgaagccatgaaaatgcagatgaagctcccagtggattccc
acactctatcaataaacacctctggctga
</DNA>
</SEQUENCE>
</DASDNA>
What I want is this part:
gtggcacccaaagatgctggaatctttatggcaaatgccgttacagatgc
tccaagaaggaaagagtctatgtttactgcataaataataaaatgtgctg
cgtgaagcccaagtaccagccaaaagaaaggtggtggccattttaactgc
tttgaagcctgaagccatgaaaatgcagatgaagctcccagtggattccc
acactctatcaataaacacctctggctga
I want to extract the sequence part from thousands of URLs like this one; how should I do it?
I tried writing the data to a file and parsing the file, which worked, but is there any way to parse the XML string directly? I tried some examples from other posts, but they didn't work.
Edit: I've added my solutions below, thanks to the two answers.
Solution 1:
def getSequence2(chromosome, start, end):
    base = 'http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment='
    url = base + chromosome + ':' + str(start) + ',' + str(end)
    doc = etree.parse(url, parser=etree.XMLParser())
    # xpath returns a (possibly empty) list of text nodes
    hits = doc.xpath('SEQUENCE/DNA/text()')
    if hits:
        sequence = hits[0].replace('\n', '')
    else:
        sequence = 'THE SEQUENCE DOES NOT EXIST FOR GIVEN COORDINATES'
    return sequence
Solution 2:
def getSequence1(chromosome, start, end):
    base = 'http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment='
    url = base + chromosome + ':' + str(start) + ',' + str(end)
    xml = urllib2.urlopen(url).read()
    if xml != '':
        w = open('temp.xml', 'w')
        w.write(xml)
        w.close()
        dom = parse('temp.xml')
        data = dom.getElementsByTagName('DNA')
        sequence = data[0].firstChild.nodeValue.replace('\n', '')
    else:
        sequence = 'THE SEQUENCE DOES NOT EXIST FOR GIVEN COORDINATES'
    return sequence
Both need the appropriate imports: from lxml import etree for Solution 1, and import urllib2 plus from xml.dom.minidom import parse for Solution 2.
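To answer the "parse the string directly" part of the question: xml.dom.minidom can parse the response string in memory via parseString, so no temp file is needed. A minimal sketch:

import urllib2
from xml.dom.minidom import parseString

url = 'http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr20:30037432,30038060'
xml = urllib2.urlopen(url).read()
dom = parseString(xml)  # parse the in-memory string directly
sequence = dom.getElementsByTagName('DNA')[0].firstChild.nodeValue.replace('\n', '')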

>>> from lxml import etree
>>> doc = etree.parse("http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr20:30037432,30038060",parser=etree.XMLParser())
>>> doc.xpath('SEQUENCE/DNA/text()')
['\natagtggcacatgtctgttgtcctagctcctcggggaaactcaggtggga\ngagtcccttgaactgggaggaggaggtttgcagtgagccagaatcattcc\nactgtactccagcctaggtgacagagcaagactcatctcaaaaaaaaaaa\naaaaaaaaaaaaaagacaatccgcacacataaaggctttattcagctgat\ngtaccaaggtcactctctcagtcaaaggtgggaagcaaaaaaacagagta\naaggaaaaacagtgatagatgaaaagagtcaaaggcaagggaaacaaggg\naccttctatctcatctgtttccattcttttacagacctttcaaatccgga\ngcctacttgttaggactgatactgtctcccttctttctgctttgtgtcag\ngtggcacccaaagatgctggaatctttatggcaaatgccgttacagatgc\ntccaagaaggaaagagtctatgtttactgcataaataataaaatgtgctg\ncgtgaagcccaagtaccagccaaaagaaaggtggtggccattttaactgc\ntttgaagcctgaagccatgaaaatgcagatgaagctcccagtggattccc\nacactctatcaataaacacctctggctga\n']

Use a Python XML parsing library like lxml, load the document with that parser, and then use a selector (e.g. an XPath expression) to grab the node/element that you need.
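For example, a minimal sketch that applies the lxml/XPath approach to many regions; the coordinate list and the pause between requests are assumptions, not part of the original question:

from lxml import etree
import time

def fetch_sequence(chromosome, start, end):
    url = ('http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=%s:%d,%d'
           % (chromosome, start, end))
    doc = etree.parse(url, parser=etree.XMLParser())
    hits = doc.xpath('SEQUENCE/DNA/text()')
    return hits[0].replace('\n', '') if hits else None

# Hypothetical coordinate list; replace with your thousands of regions.
regions = [('chr20', 30037432, 30038060), ('chr20', 30038000, 30038500)]
for chrom, start, end in regions:
    print fetch_sequence(chrom, start, end)
    time.sleep(1)  # be polite to the UCSC server when issuing many requests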

Related

How do I hierarchically sort URLs in python?

Given an initial list of URLs crawled from a site:
https://somesite.com/
https://somesite.com/advertise
https://somesite.com/articles
https://somesite.com/articles/read
https://somesite.com/articles/read/1154
https://somesite.com/articles/read/1155
https://somesite.com/articles/read/1156
https://somesite.com/articles/read/1157
https://somesite.com/articles/read/1158
https://somesite.com/blogs
I am trying to turn the list into a tab-organized tree hierarchy:
https://somesite.com
    /advertise
    /articles
        /read
            /1154
            /1155
            /1156
            /1157
            /1158
    /blogs
I've tried using lists, tuples, and dictionaries. So far I have figured out two flawed ways to output the content.
Method 1 will miss elements if they have the same name and position in the hierarchy:
Input:
https://somesite.com
https://somesite.com/missions
https://somesite.com/missions/playit
https://somesite.com/missions/playit/extbasic
https://somesite.com/missions/playit/extbasic/0
https://somesite.com/missions/playit/stego
https://somesite.com/missions/playit/stego/0
Output:
https://somesite.com/
    /missions
        /playit
            /extbasic
                /0
            /stego
                ^ Missing expected output "/0"
Method 2 will not miss any elements, but it will print redundant content:
Input:
https://somesite.com
https://somesite.com/missions
https://somesite.com/missions/playit
https://somesite.com/missions/playit/extbasic
https://somesite.com/missions/playit/extbasic/0
https://somesite.com/missions/playit/stego
https://somesite.com/missions/playit/stego/0
Output:
https://somesite.com/
    /missions
        /playit
            /extbasic
                /0
    /missions        <- Redundant content
        /playit      <- Redundant content
            /stego
                /0
I'm not sure how to properly do this, and my googling has only turned up references to urllib that don't seem to be what I need. Perhaps there is a much better approach, but I have been unable to find it.
My code for getting the content into a usable list:
#!/usr/bin/python3
import re

# Read the original list of URLs from file
with open("sitelist.raw", "r") as f:
    raw_site_list = f.readlines()

# Extract the prefix and domain from the first line
first_line = raw_site_list[0]
prefix, domain = re.match("(http[s]://)(.*)[/]", first_line).group(1, 2)

# Remove instances of prefix and domain, and trailing newlines; drop any lines that are only a slash
clean_site_list = []
for line in raw_site_list:
    clean_line = line.strip(prefix).strip(domain).strip()
    if not clean_line == "/":
        if not clean_line[len(clean_line) - 1] == "/":
            clean_site_list += [clean_line]

# Split the resulting relative paths into their component parts and filter out empty strings
split_site_list = []
for site in clean_site_list:
    split_site_list += [list(filter(None, site.split("/")))]
This gives a list to manipulate, but I've run out of ideas on how to output it without losing elements or outputting redundant elements.
Thanks
Edit: This is the final working code I put together based on the answer chosen below:
# Read list of URLs from file
with open("sitelist.raw", "r") as f:
    urls = f.readlines()

# Remove trailing newlines
for url in urls:
    urls[urls.index(url)] = url[:-1]

# Remove any trailing slashes
for url in urls:
    if url[-1:] == "/":
        urls[urls.index(url)] = url[:-1]

# Remove duplicate lines
unique_urls = []
for url in urls:
    if url not in unique_urls:
        unique_urls += [url]

# Do the actual work (modified to use unique_urls, to use tabs instead of 4x spaces, and to write to file)
base = unique_urls[0]
tabdepth = 0
tlen = len(base.split('/'))
final_urls = []
for url in unique_urls[1:]:
    t = url.split('/')
    lt = len(t)
    if lt != tlen:
        tabdepth += 1 if lt > tlen else -1
        tlen = lt
    pad = ''.join(['\t' for _ in range(tabdepth)])
    final_urls += [f'{pad}/{t[-1]}']

with open("sitelist.new", "wt") as f:
    f.write(base + "\n")
    for url in final_urls:
        f.write(url + "\n")
This works with your sample data:
urls = ['https://somesite.com',
        'https://somesite.com/missions',
        'https://somesite.com/missions/playit',
        'https://somesite.com/missions/playit/extbasic',
        'https://somesite.com/missions/playit/extbasic/0',
        'https://somesite.com/missions/playit/stego',
        'https://somesite.com/missions/playit/stego/0']

base = urls[0]
print(base)

tabdepth = 0
tlen = len(base.split('/'))
for url in urls[1:]:
    t = url.split('/')
    lt = len(t)
    if lt != tlen:
        tabdepth += 1 if lt > tlen else -1
        tlen = lt
    pad = ''.join([' ' for _ in range(tabdepth)])
    print(f'{pad}/{t[-1]}')
This code will help you with your task. I agree it might be a bit large and contain some redundant checks, but it builds a dictionary containing the hierarchy of the URLs, which you can then use however you like: print it, store it, and so on.
Moreover, it will also parse URLs from different domains and create a separate tree for each (see the code and output).
EDIT: It will also take care of redundant URLs.
Code:
def process_urls(urls: list):
    tree = {}
    for url in urls:
        url_components = url.split("/")
        # The first three components are the protocol, an empty entry,
        # and the base domain.
        base_domain = url_components[:3]
        base_domain = base_domain[0] + "//" + "".join(base_domain[1:])
        # Add the base domain to the tree if it is not there yet.
        if base_domain not in tree:
            tree[base_domain] = {}
        structure = url_components[3:]
        for i in range(len(structure)):
            if i == 0:
                # Add the first path component directly under the domain.
                if "/" + structure[i] not in tree[base_domain]:
                    tree[base_domain]["/" + structure[i]] = {}
            else:
                # Walk down to the parent node, then add this component if it
                # is not already there (this also handles redundant URLs).
                base = tree[base_domain]["/" + structure[0]]
                for j in range(1, i):
                    base = base["/" + structure[j]]
                if "/" + structure[i] not in base:
                    base["/" + structure[i]] = {}
    return tree

def print_tree(tree: dict, depth=0):
    for key in tree.keys():
        print("\t" * depth + key)
        if type(tree[key]) == dict:
            # If the dictionary is empty do nothing; otherwise call this
            # function recursively with the depth increased by 1.
            if tree[key]:
                print_tree(tree[key], depth + 1)

if __name__ == "__main__":
    urls = [
        'https://somesite.com',
        'https://somesite.com/missions',
        'https://somesite.com/missions/playit',
        'https://somesite.com/missions/playit/extbasic',
        'https://somesite.com/missions/playit/extbasic/0',
        'https://somesite.com/missions/playit/extbasic/0',
        'https://somesite.com/missions/playit/extbasic/0',
        'https://somesite.com/missions/playit/extbasic/0',
        'https://somesite.com/missions/playit/stego',
        'https://somesite.com/missions/playit/stego/0',
        'https://somesite2.com/missions/playit',
        'https://somesite2.com/missions/playit/extbasic',
        'https://somesite2.com/missions/playit/extbasic/0',
        'https://somesite2.com/missions/playit/stego',
        'https://somesite2.com/missions/playit/stego/0'
    ]
    tree = process_urls(urls)
    print_tree(tree)
Output:
https://somesite.com
    /missions
        /playit
            /extbasic
                /0
            /stego
                /0
https://somesite2.com
    /missions
        /playit
            /extbasic
                /0
            /stego
                /0
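For reference, the same nested-dict idea can be written more compactly with dict.setdefault; this is just a sketch of the design choice, not part of either answer above:

def build_tree(urls):
    tree = {}
    for url in urls:
        parts = url.split('/')
        base = parts[0] + '//' + parts[2]  # e.g. 'https://somesite.com'
        node = tree.setdefault(base, {})
        for part in parts[3:]:
            node = node.setdefault('/' + part, {})
    return tree

def show_tree(tree, depth=0):
    for key in sorted(tree):
        print('\t' * depth + key)
        show_tree(tree[key], depth + 1)

Because setdefault returns the existing child when one is already present, duplicate and redundant URLs collapse into the same nodes without any explicit checks.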

How to generate sequential numbers in xml? (Python 2.7)

I have a Python script (2.7) that is generating XML data. However, for each object in my array I need it to generate a new sequential number beginning with 1.
Example below (see the sequenceOrder tag):
<objectArray>
  <object>
    <title>Title</title>
    <sequenceOrder>1</sequenceOrder>
  </object>
  <object>
    <title>Title</title>
    <sequenceOrder>2</sequenceOrder>
  </object>
  <object>
    <title>Title</title>
    <sequenceOrder>3</sequenceOrder>
  </object>
</objectArray>
With Python 2.7, how can I have my script generate a new number (one more than the number preceding it) for the sequenceOrder part of each object in my XML array?
Please note that there will be hundreds of thousands of objects in my array, so that's something to consider.
I'm totally new to Python/coding in general, so any help is appreciated! Glad to provide additional information as necessary.
If you are creating objects to be serialized yourself, you may use itertools.count to get consecutive unique integers.
In a very abstract way it'll look like:
import itertools

counter = itertools.count(1)  # start counting at 1, as the question requires
o = create_object()
o.sequentialNumber = next(counter)
o2 = create_another_object()
o2.sequentialNumber = next(counter)
create_xml_doc(my_objects)
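Equivalently, if the objects already sit in a list, enumerate yields the 1-based sequence numbers directly (my_objects is assumed to be your list):

for i, obj in enumerate(my_objects, start=1):
    obj.sequentialNumber = i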
Yes, you can generate a new sequence number for each XML element. Here is how to produce your sample output using lxml:
import lxml.etree as et

root = et.Element('objectArray')
for i in range(1, 4):
    obj = et.Element('object')
    title = et.Element('title')
    title.text = 'Title'
    obj.append(title)
    sequenceOrder = et.Element('sequenceOrder')
    sequenceOrder.text = str(i)
    obj.append(sequenceOrder)
    root.append(obj)
print et.tostring(root, pretty_print=True)
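If memory becomes a concern with hundreds of thousands of objects, lxml can also write the document incrementally instead of building the whole tree first. A sketch, assuming lxml 3.1+ (the titles generator is a stand-in for your real data):

import lxml.etree as et

# Stand-in data source; replace with your real objects.
titles = ('Title' for _ in xrange(300000))

with et.xmlfile('objects.xml') as xf:
    with xf.element('objectArray'):
        for i, t in enumerate(titles, start=1):
            obj = et.Element('object')
            et.SubElement(obj, 'title').text = t
            et.SubElement(obj, 'sequenceOrder').text = str(i)
            xf.write(obj)  # serialize this object now instead of keeping it in a tree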
A colleague provided a solution. First set
    sequence_order = 1
then, in the XML template, use
    <sequenceOrder>""" + str(sequence_order) + """</sequenceOrder>
and then, later in the loop:
if test is False:
    with open('uniquetest.csv', 'a') as write:
        writelog = csv.writer(write, delimiter='\t', quoting=csv.QUOTE_ALL)
        writelog.writerow((title,))
    try:
        f = open(file + '_output.xml', 'r')
        f = open(file + '_output.xml', 'a')
        f.write(DASxml_bottom + DASxml_top + digital_objects)
        f.close()
        sequence_order = sequence_order + 1
        # f = open('log.txt', 'a')
        # f.write(title + str(roll) + label_flag + str(id) + str(file_size) + file_path + """
        # """)
        # f.close()
That might not make sense to everyone, since I didn't provide the whole script, but it worked for my purposes. Thanks all for the suggestions!

python: getting multiple results for getElementsByTagName

I'm trying to get each instance of an XML tag but I can only seem to return one or none.
#!/usr/software/bin/python
# import libraries
import urllib
from xml.dom.minidom import parseString
# variables
startdate = "2014-01-01"
enddate = "2014-05-01"
rest_client = "test"
rest_host = "restprd.test.com"
rest_port = "80"
rest_base_url = "asup-rest-interface/ASUP_DATA"
rest_date = "/start_date/%s/end_date/%s/limit/5000" % (startdate,enddate)
rest_api = "http://" + rest_host + ":" + rest_port + "/" + rest_base_url + "/" + "client_id" + "/" + rest_client
response = urllib.urlopen(rest_api + rest_date + '/sys_serial_no/700000667725')
data = response.read()
response.close()
dom = parseString(data)
xmlVer = dom.getElementsByTagName('sys_version').toxml()
xmlDate = dom.getElementsByTagName('asup_gen_date').toxml()
xmlVerTag=xmlVer.replace('<sys_version>','').replace('</sys_version>','')
xmlDateTag=xmlDate.replace('<asup_gen_date>','').replace('</asup_gen_date>','').replace('T',' ')[0:-6]
print xmlDateTag , xmlVerTag
The above code generates the following error:
Traceback (most recent call last):
File "./test.py", line 23, in <module>
xmlVer = dom.getElementsByTagName('sys_version').toxml()
AttributeError: 'NodeList' object has no attribute 'toxml'
If I change the .toxml() to [0].toxml() I can get the first element, but I need to get all the elements. Any ideas?
Also, if I try something like this I get no output at all:
response = urllib.urlopen(rest_api + rest_date + '/sys_serial_no/700000667725')
DOMTree = xml.dom.minidom.parse(response)
collection = DOMTree.documentElement
if collection.hasAttribute("results"):
    print collection.getAttribute("sys_version")
The original data looks like this.
There are repeating sections of XML like this:
<xml><status request_id="58f39198-2c76-4e87-8e00-f7dd7e69519f1416354337206" response_time="00:00:00:833"></status><results start="1" limit="1000" total_results_count="1" results_count="1"><br/><system><tests start="1" limit="50" total_results_count="18" results_count="18"><test> <biz_key>C|BF02F1A3-3C4E-11DC-8AAE-0015171BBD90|8594169899|700000667725</biz_key><test_id>2014071922090465</test_id><test_subject>HA Group Notification (WEEKLY_LOG) INFO</test_subject><test_type>DOT-REGULAR</test_type><asup_gen_date>2014-07-20T00:21:40-04:00</asup_gen_date><test_received_date>Sat Jul 19 22:09:19 PDT 2014</test_received_date><test_gen_zone>EDT</test_gen_zone><test_is_minimal>false</test_is_minimal><sys_version>9.2.2X22</sys_version><sys_operating_mode>Cluster-Mode</sys_operating_mode><hostname>rerfdsgt</hostname><sys_domain>test.com</sys_domain><cluster_name>bbrtp</cluster_name> ... etc
<xml>
  <results>
    <system>
      <sys_version>
      <asup>
        <asup_gen_date>
9.2.2X22 2014-07-20 00:21:40
9.2.2X21 2014-06-31 12:51:40
8.5.2X1 2014-07-20 04:33:22
You need to loop over the results of getElementsByTagName():
for version in dom.getElementsByTagName('sys_version'):
    version = version.toxml()
    version = version.replace('<sys_version>', '').replace('</sys_version>', '')
    print version
Also, instead of replacing opening and closing tags, you probably want to use getText():
def getText(nodelist):
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)

for version in dom.getElementsByTagName('sys_version'):
    print getText(version.childNodes)
Another point: it would be much easier and more pleasant to parse the XML with xml.etree.ElementTree. Example:
import xml.etree.ElementTree as ET

tree = ET.parse(response)
root = tree.getroot()
# './/sys_version' searches the whole tree; a bare 'sys_version' would
# only match direct children of the root.
for version in root.findall('.//sys_version'):
    print version.text
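To produce the version/date pairs the question asks for, a sketch along the same ElementTree lines; it assumes each pair lives under a common <test> element, as in the sample data above, and reuses rest_api and rest_date from the question:

import urllib
import xml.etree.ElementTree as ET

response = urllib.urlopen(rest_api + rest_date + '/sys_serial_no/700000667725')
root = ET.parse(response).getroot()
for test in root.iter('test'):
    version = test.findtext('sys_version')
    gen_date = test.findtext('asup_gen_date')
    if version and gen_date:
        # '2014-07-20T00:21:40-04:00' -> '2014-07-20 00:21:40'
        print gen_date.replace('T', ' ')[:-6], version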

How to Modify Python Code in Order to Print Multiple Adjacent "Location" Tokens to Single Line of Output

I am new to python, and I am trying to print all of the tokens that are identified as locations in an .xml file to a .txt file using the following code:
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('exercise-ner.xml', 'r'))
tokenlist = soup.find_all('token')
output = ''
for x in tokenlist:
    readeachtoken = x.ner.encode_contents()
    checktoseeifthetokenisalocation = x.ner.encode_contents().find("LOCATION")
    if checktoseeifthetokenisalocation != -1:
        output += "\n%s" % x.word.encode_contents()
z = open('exercise-places.txt', 'w')
z.write(output)
z.close()
The program works, and spits out a list of all of the tokens that are locations, each of which is printed on its own line in the output file. What I would like to do, however, is to modify my program so that any time beautiful soup finds two or more adjacent tokens that are identified as locations, it can print those tokens to the same line in the output file. Does anyone know how I might modify my code to accomplish this? I would be entirely grateful for any suggestions you might be able to offer.
This question is very old, but I just got your note @Amanda, so I thought I'd post my approach to the task in case it helps others:
import glob, codecs
from bs4 import BeautifulSoup

inside_location = 0
location_string = ''

with codecs.open("washington_locations.txt", "w", "utf-8") as out:
    for i in glob.glob("/afs/crc.nd.edu/user/d/dduhaime/java/stanford-corenlp-full-2015-01-29/processed_washington_correspondence/*.xml"):
        locations = []
        with codecs.open(i, 'r', 'utf-8') as f:
            soup = BeautifulSoup(f.read())
            tokens = soup.findAll('token')
            for token in tokens:
                if token.ner.string == "LOCATION":
                    inside_location = 1
                    location_string += token.word.string + u" "
                else:
                    if location_string:
                        locations.append(location_string)
                        location_string = ''
        out.write(i + "\t" + "\t".join(l for l in locations) + "\n")
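For the original single-file version, the same adjacency grouping can be done with itertools.groupby, which batches consecutive items that share a key. A sketch, assuming the <token>/<ner>/<word> structure from the question:

from itertools import groupby
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('exercise-ner.xml', 'r'))
tokens = soup.find_all('token')
with open('exercise-places.txt', 'w') as out:
    # groupby yields runs of consecutive tokens with the same key, so each
    # run of LOCATION tokens becomes one line of output.
    for is_location, group in groupby(tokens, key=lambda t: t.ner.string == "LOCATION"):
        if is_location:
            out.write(' '.join(t.word.string for t in group) + '\n')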

How do I replace a specific part of a string in Python

As of now I am trying to scrape Good.is. The code below gives me the regular image (turn the if statement to True), but I want the higher-res picture. I was wondering how I would replace a certain part of the URL so that I could download the high-res picture. I want to change http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html to http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flat.html (only the end is different). My code is:
import os, urllib, urllib2
from BeautifulSoup import BeautifulSoup
import HTMLParser

parser = HTMLParser.HTMLParser()

# make folder.
folderName = 'Good.is'
if not os.path.exists(folderName):
    os.makedirs(folderName)

list = []
# Python ranges start from the first argument and iterate up to one
# less than the second argument, so we need 36 + 1 = 37
for i in range(1, 37):
    list.append("http://www.good.is/infographics/page:" + str(i) + "/sort:recent/range:all")

listIterator1 = []
listIterator1[:] = range(0, 37)

counter = 0
for x in listIterator1:
    soup = BeautifulSoup(urllib2.urlopen(list[x]).read())
    body = soup.findAll("ul", attrs={'id': 'gallery_list_elements'})
    number = len(body[0].findAll("p"))
    listIterator = []
    listIterator[:] = range(0, number)
    for i in listIterator:
        paragraphs = body[0].findAll("p")
        nextArticle = body[0].findAll("a")[2]
        text = body[0].findAll("p")[i]
        if len(paragraphs) > 0:
            #print image['src']
            counter += 1
            print counter
            print parser.unescape(text.getText())
            print "http://www.good.is" + nextArticle['href']
            originalArticle = "http://www.good.is" + nextArticle['href']
            article = BeautifulSoup(urllib2.urlopen(originalArticle).read())
            title = article.findAll("div", attrs={'class': 'title_and_image'})
            getTitle = title[0].findAll("h1")
            article1 = article.findAll("div", attrs={'class': 'body'})
            articleImage = article1[0].find("p")
            betterImage = articleImage.find("a")
            articleImage1 = articleImage.find("img")
            paragraphsWithinSection = article1[0].findAll("p")
            print betterImage['href']
            if len(paragraphsWithinSection) > 1:
                articleText = article1[0].findAll("p")[1]
            else:
                articleText = article1[0].findAll("p")[0]
            print articleImage1['src']
            print parser.unescape(getTitle)
            if not articleText is None:
                print parser.unescape(articleText.getText())
            print '\n'
            link = articleImage1['src']
            x += 1
            actually_download = False
            if actually_download:
                filename = link.split('/')[-1]
                urllib.urlretrieve(link, filename)
Have a look at str.replace. If that isn't general enough to get the job done, you'll need to use a regular expression (re, probably re.sub).
>>> str1="http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html"
>>> str1.replace("flash","flat")
'http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flat.html'
I think the safest and easiest way is to use a regular expression:
import re
url = 'http://www.google.com/this/is/sample/url/flash.html'
newUrl = re.sub(r'flash\.html$', 'flat.html', url)
The "$" means only match the end of the string. This solution will behave correctly even in the (admittedly unlikely) event that your url includes the substring "flash.html" somewhere other than the end, and also leaves the string unchanged (which I assume is the correct behavior) if it does not end with 'flash.html'.
See: http://docs.python.org/library/re.html#re.sub
@mgilson has a good solution, but the problem is that it will replace all occurrences of the string; so if you have the word "flash" as part of the URL (and not just in the trailing file name), you'll get multiple replacements:
>>> str = 'hello there hello'
>>> str.replace('hello','world')
'world there world'
An alternate solution is to replace the last part after / with flat.html:
>>> url = 'http://www.google.com/this/is/sample/url/flash.html'
>>> url[:url.rfind('/')+1]+'flat.html'
'http://www.google.com/this/is/sample/url/flat.html'
Using urlparse you can do a few bits and bobs:
from urlparse import urlsplit, urlunsplit

s = 'http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html'
url = urlsplit(s)
head, tail = url.path.rsplit('/', 1)  # split off the trailing file name
new_path = head + '/flat.html'
print urlunsplit(url._replace(path=new_path))
