I've got these two code snippets from a webinar (slides 7 and 8 respectively).
The first one builds the desired URL; I have tested it and it works:
def SECdownload(year, month):
    import os
    from urllib.request import urlopen
    from urllib.error import HTTPError  # needed for the except clause below
    root = None
    feedFile = None
    feedData = None
    good_read = False
    itemIndex = 0
    edgarFilingsFeed = 'http://www.sec.gov/Archives/edgar/monthly/xbrlrss-' + str(year) + '-' + str(month).zfill(2) + '.xml'
    print( edgarFilingsFeed )
    if not os.path.exists( "sec/" + str(year) ):
        os.makedirs( "sec/" + str(year) )
    if not os.path.exists( "sec/" + str(year) + '/' + str(month).zfill(2) ):
        os.makedirs( "sec/" + str(year) + '/' + str(month).zfill(2) )
    target_dir = "sec/" + str(year) + '/' + str(month).zfill(2) + '/'
    try:
        feedFile = urlopen( edgarFilingsFeed )  # in Python 3, urlopen comes from urllib.request (imported above)
        try:
            feedData = feedFile.read()
            good_read = True
        finally:
            feedFile.close()
    except HTTPError as e:
        print( "HTTP Error:", e.code )
And the second one is supposed to parse the RSS feed to find the ZIP filenames:
# Downloading the data - parsing the RSS feed to extract the ZIP file enclosure filename
# Process RSS feed and walk through all items contained
for item in feed.entries:
    print( item[ "summary" ], item[ "title" ], item[ "published" ] )
    try:
        # Identify ZIP file enclosure, if available
        enclosures = [ l for l in item[ "links" ] if l[ "rel" ] == "enclosure" ]
        if ( len( enclosures ) > 0 ):
            # ZIP file enclosure exists, so we can just download the ZIP file
            enclosure = enclosures[0]
            sourceurl = enclosure[ "href" ]
            cik = item[ "edgar_ciknumber" ]
            targetfname = target_dir + cik + ' - ' + sourceurl.split('/')[-1]
            retry_counter = 3
            while retry_counter > 0:
                good_read = downloadfile( sourceurl, targetfname )
                if good_read:
                    break
                else:
                    print( "Retrying:", retry_counter )
                    retry_counter -= 1
    except:
        pass
However, whenever I try to run the second module I get:
Traceback (most recent call last):
File "E:\Py_env\df2.py", line 3, in <module>
for item in feed.entries:
NameError: name 'feed' is not defined
What am I not understanding in the webinar? And if I must define feed, I have literally no idea how to do it while keeping a logical linkage to the data the first code snippet provides!
(On a side note, this is a webinar from a reputable software vendor, so how could it possibly have mistakes? There must be something I am doing wrong...)
The problem is just what the error message implies: you haven't defined any variable named feed that is in scope when the second snippet executes. Either their code omitted something, or you missed a part that was crucial.
That aside, the formatting of this code is really dodgy and not at all idiomatic Python. You're probably better off looking for a different example.
As your output is showing you (and as you noticed), feed was never defined, nor was it shown to you in the slides. It looks as though the slide deck expects you to make a logical jump, and they do point out in the right column that feedparser is an easy way to parse... feeds (RSS feeds).
So they are expecting that you can take the feedData you read in your first function and feed it into a method from feedparser.
As you can see in various examples online (such as the docs), this can be done from the string you got:
>>> import feedparser
>>> rawdata = """<rss version="2.0">
<channel>
<title>Sample Feed</title>
</channel>
</rss>"""
>>> d = feedparser.parse(rawdata)
>>> d['feed']['title']
u'Sample Feed'
Using that, I bet you can see where it goes (rather than me telling you).
As @PatrickCollins pointed out, these are kind of crappy examples for Python, but that shouldn't get in your way as you're learning it.
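If you want the bridge spelled out anyway, here is a minimal sketch: rawdata above plays the role of your feedData. Note that feedData is currently a local variable, so this assumes you change SECdownload to return it:
import feedparser

feedData = SECdownload(2014, 6)  # assumes SECdownload is changed to end with: return feedData
feed = feedparser.parse(feedData)  # feedparser.parse accepts a URL, a file, or a raw string

for item in feed.entries:
    print(item["title"])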
I am trying to get each entity to draw a certain type of graph using the Understand Python API. All the inputs are good and the database is opened, but the single item is not drawn: there is no error and no output file. The code is listed below. The Python code is called from a C# application, which invokes upython.exe with the associated arguments.
The Python file receives the SciTools directory and opens the Understand database. The entities are also an argument, which is loaded via a temp file. The outputPath references the directory where the SVG file will be placed. I can't seem to figure out why the item.draw method isn't working.
import os
import sys
import argparse
import re
import json

# parse arguments
parser = argparse.ArgumentParser(description='Creates butterfly or call by graphs from an Understand DB')
parser.add_argument('PathToSci', type=str, help="Path to Sci Understand libraries")
parser.add_argument('PathToDB', type=str, help="Path to the Database you want to create graphs from")
parser.add_argument('PathToOutput', type=str, help='Path to where the graphs should be outputted')
parser.add_argument('TypeOfGraph', type=str,
                    help="The type of graph you want to generate. Same names as in Understand GUI. IE 'Butterfly' 'Called By' 'Control Flow' ")
parser.add_argument("entities", help='Path to json list file of Entity long names you wish to create graphs for')
args, unknown = parser.parse_known_args()

# they may have entered a path with a space broken into multiple strings
if len(unknown) > 0:
    print("Unknown argument entered.\n Note: Individual arguments must be passed as a single string.")
    quit()

pathToSci = args.PathToSci
pathToDB = args.PathToDB
graphType = args.TypeOfGraph
entities = json.load(open(args.entities))
pathToOutput = args.PathToOutput

pathToSci = os.path.join(pathToSci, "Python")
sys.path.append(pathToSci)
import understand

db = understand.open(pathToDB)
count = 0
for name in entities:
    count += 1
    print("Completed: " + str(count) + "/" + str(len(entities)))
    # if it is an empty name don't make a graph
    if len(name) == 0:
        break
    pattern = re.compile((name + '$').replace("\\", "/"))
    print("pattern: " + str(pattern))
    sys.stdout.flush()
    ent = db.lookup(pattern)
    print("ent: " + str(ent))
    sys.stdout.flush()
    print("Type: " + str(type(ent[0])))
    sys.stdout.flush()
    for item in ent:
        try:
            filename = os.path.join(pathToOutput, item.longname() + ".svg")
            print("Graph Type: " + graphType)
            sys.stdout.flush()
            print("filename: " + filename)
            sys.stdout.flush()
            print("Item Kind: " + str(ent[0].kind()))
            sys.stdout.flush()
            item.draw(graphType, filename)
        except understand.UnderstandError:
            print("error creating graph")
            sys.stdout.flush()
        except Exception as e:
            print("Could not create graph for " + item.kind().longname() + ": " + item.longname())
            sys.stdout.flush()
            print(e)
            sys.stdout.flush()
db.close()
The output is below:
Completed: 1/1
pattern: re.compile('CSC03.SIN_COS$')
ent: [#lCSC03.SIN_COS#kacsc03.sin_cos(long_float,csc03.sctype)long_float#f./../../../IOSSP/Source_Files/OGP/OGP_71/csc03/csc03.ada]
Type: <class 'understand.Ent'>
Graph Type: Butterfly
filename: C:\Users\M73720\Documents\DFS\DFS-OGP-25-Aug-2022-11-24\SVGs\Entities\CSC03.SIN_COS.svg
Item Kind: Function
It turns out that it was a problem in the Understand API; the latest build corrected it. This was found by talking with the SciTools support group.
I am trying to run this script that grabs RSS feeds in the environment Thonny, but I just keep receiving an "IndexError: list index out of range" error:
Traceback (most recent call last):
File "C:\Users\uri\rssfeedfour.py", line 11, in <module>
url = sys.argv[1]
IndexError: list index out of range
How do I resolve this so I stop getting this error over and over again? I'm not sure how to solve it, as I am a beginner. Do I need to define it, and if so, how? Or could I take it out and go in a different direction? Here is the code.
import feedparser
import time
from subprocess import check_output
import sys

#feed_name = 'TRIBUNE'
#url = 'http://chicagotribune.feedsportal.com/c/34253/f/622872/index.rss'
feed_name = sys.argv[1]
url = sys.argv[2]

db = 'http://feeds.feedburner.com/TheHackersNews'  # N.B. this URL is opened as if it were a local file below
limit = 12 * 3600 * 1000

current_time_millis = lambda: int(round(time.time() * 1000))
current_timestamp = current_time_millis()

def post_is_in_db(title):
    with open(db, 'r') as database:
        for line in database:
            if title in line:
                return True
    return False

def post_is_in_db_with_old_timestamp(title):
    with open(db, 'r') as database:
        for line in database:
            if title in line:
                ts_as_string = line.split('|', 1)[1]
                ts = int(ts_as_string)  # long() does not exist in Python 3
                if current_timestamp - ts > limit:
                    return True
    return False

#
# get the feed data from the url
#
feed = feedparser.parse(url)

#
# figure out which posts to print
#
posts_to_print = []
posts_to_skip = []
for post in feed.entries:
    # if post is already in the database, skip it
    # TODO check the time
    title = post.title
    if post_is_in_db_with_old_timestamp(title):
        posts_to_skip.append(title)
    else:
        posts_to_print.append(title)

#
# add all the posts we're going to print to the database with the current timestamp
# (but only if they're not already in there)
#
f = open(db, 'a')
for title in posts_to_print:
    if not post_is_in_db(title):
        f.write(title + "|" + str(current_timestamp) + "\n")
f.close()  # close() needs parentheses to actually run

#
# output all of the new posts
#
count = 1
blockcount = 1
for title in posts_to_print:
    if count % 5 == 1:
        print("\n" + time.strftime("%a, %b %d %I:%M %p") + ' ((( ' + feed_name + ' - ' + str(blockcount) + ' )))')
        print("-----------------------------------------\n")
        blockcount += 1
    print(title + "\n")
    count += 1
sys.argv is a list in Python, which contains the command-line arguments passed to the script. sys.argv[0] contains the name of the script, sys.argv[1] contains the first argument and so on.
To prevent this error, you need to pass command-line arguments when starting the script. For example, you can start this script without any errors with:
python rssfeedfour.py TRIBUNE http://chicagotribune.feedsportal.com/c/34253/f/622872/index.rss
You can also modify the script so that it works using the default arguments if you don't provide any command line arguments.
try:
    feed_name = sys.argv[1]
except IndexError:
    feed_name = 'TRIBUNE'

try:
    url = sys.argv[2]
except IndexError:
    url = 'http://chicagotribune.feedsportal.com/c/34253/f/622872/index.rss'
You can learn more about handling errors here.
Although it is much more convenient to use the argparse library.
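For illustration, a minimal argparse sketch of the same fallback behaviour (the defaults are just the ones commented out in your script):
import argparse

parser = argparse.ArgumentParser(description='Print new posts from an RSS feed')
# nargs='?' makes a positional argument optional; the default is used when it is omitted
parser.add_argument('feed_name', nargs='?', default='TRIBUNE')
parser.add_argument('url', nargs='?',
                    default='http://chicagotribune.feedsportal.com/c/34253/f/622872/index.rss')
args = parser.parse_args()

feed_name = args.feed_name
url = args.url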
I am trying to access the summaries of NYT articles using the NewsWire API and Python 2.7. Here is the code:
from urllib2 import urlopen
import urllib2
from json import loads
import codecs
import time
import newspaper
from lxml.etree import XMLSyntaxError  # needed for the except clause below

posts = list()
articles = list()
i = 30
keys = dict()
count = 0
offset = 0
while(offset < 40000):
    if(len(posts) >= 30000): break
    if(700 < offset < 800):
        offset = offset + 100
    #for p in xrange(100):
    try:
        url = "http://api.nytimes.com/svc/news/v3/content/nyt/all.json?offset=" + str(offset) + "&api-key=ACCESSKEY"
        data = loads(urlopen(url).read())
        print str(len(posts)) + " offset=" + str(offset)
        if posts and articles and keys:
            outfile = open("articles_next.tsv", "w")
            for s in articles:
                outfile.write(s.encode("utf-8") + "\n")
            outfile.close()
            outfile = open("summary_next.tsv", "w")
            for s in posts:
                outfile.write(s.encode("utf-8") + "\n")
            outfile.close()
            indexfile = open("ind2_next.tsv", "w")
            for x in keys.keys():
                indexfile.write('\n' + str(x) + " " + str(keys[x]))
            indexfile.close()
        for item in data["results"]:
            if(('url' in item) & ('abstract' in item)):
                url = item["url"]
                abst = item["abstract"]
                if(url not in keys.values()):
                    keys[count] = url
                    article = newspaper.Article(url)
                    article.download()
                    article.parse()
                    try:
                        el_post = article.text.replace('\n\n', ' ').replace("Advertisement Continue reading the main story", '')
                    except XMLSyntaxError, e:
                        continue
                    articles.append(el_post)
                    count = count + 1
                    res = abst  # url + " " + abst
                    # print res.encode("utf-8")
                    posts.append(res)  # Here is the appending statement.
                    if(len(posts) >= 30000):
                        break
    except urllib2.HTTPError, e:
        print e
        time.sleep(1)
        offset = offset + 21
        continue
    except urllib2.URLError, e:
        print e
        time.sleep(1)
        offset = offset + 21
        continue
    offset = offset + 19

print str(len(posts))
print str(len(keys))
I was getting good summaries, but sometimes I came across weird sentences as part of the summary. Here are examples:
Here’s what you need to know to start your day.
Corrections appearing in print on Monday, August 28, 2017.
These are considered to be the summary of some article. Kindly help me extract a proper summary of each article from the NYT news. I thought of using the titles when such a case arises, but the titles are weird too.
So, I have taken a look through the summary results.
It is possible to remove repeated statements such as Corrections appearing in print on Monday, August 28, 2017., where only the date differs.
The simplest way to do this is to check whether any such statement is present in the variable itself.
Example,
# declare at the top
# create a list that consists of repetitive statements. I found 'quotation of the day' being repeated as well
REMOVE_STATEMENTS = ["Corrections appearing in print on", "Quotation of the Day for"]
And then,
# all() is needed here: a bare generator expression in an if is always truthy
if all(statement not in res for statement in REMOVE_STATEMENTS):
    posts.append(res)
As for the remaining unwanted statements, there is NO way to differentiate them unless they are repetitive, or you search for keywords within res that you want to ignore. If you find any, simply add them to the list I created.
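A tiny self-contained demonstration of that filter (the sample abstracts here are made up):
REMOVE_STATEMENTS = ["Corrections appearing in print on", "Quotation of the Day for"]

def is_real_summary(abstract):
    # keep the abstract only if none of the known boilerplate phrases occurs in it
    return all(statement not in abstract for statement in REMOVE_STATEMENTS)

samples = [
    "A deep dive into the city budget negotiations.",
    "Corrections appearing in print on Monday, August 28, 2017.",
]
print([s for s in samples if is_real_summary(s)])
# ['A deep dive into the city budget negotiations.']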
I'm trying to get each instance of an XML tag but I can only seem to return one or none.
#!/usr/software/bin/python
# import libraries
import urllib
from xml.dom.minidom import parseString
# variables
startdate = "2014-01-01"
enddate = "2014-05-01"
rest_client = "test"
rest_host = "restprd.test.com"
rest_port = "80"
rest_base_url = "asup-rest-interface/ASUP_DATA"
rest_date = "/start_date/%s/end_date/%s/limit/5000" % (startdate,enddate)
rest_api = "http://" + rest_host + ":" + rest_port + "/" + rest_base_url + "/" + "client_id" + "/" + rest_client
response = urllib.urlopen(rest_api + rest_date + '/sys_serial_no/700000667725')
data = response.read()
response.close()
dom = parseString(data)
xmlVer = dom.getElementsByTagName('sys_version').toxml()
xmlDate = dom.getElementsByTagName('asup_gen_date').toxml()
xmlVerTag=xmlVer.replace('<sys_version>','').replace('</sys_version>','')
xmlDateTag=xmlDate.replace('<asup_gen_date>','').replace('</asup_gen_date>','').replace('T',' ')[0:-6]
print xmlDateTag , xmlVerTag
The above code generates the following error:
Traceback (most recent call last):
File "./test.py", line 23, in <module>
xmlVer = dom.getElementsByTagName('sys_version').toxml()
AttributeError: 'NodeList' object has no attribute 'toxml'
If I change the .toxml() to [0].toxml() I can get the first element, but I need to get all the elements. Any ideas?
Also, if I try something like this I get no output at all:
response = urllib.urlopen(rest_api + rest_date + '/sys_serial_no/700000667725')
DOMTree = xml.dom.minidom.parse(response)
collection = DOMTree.documentElement
if collection.hasAttribute("results"):
    print collection.getAttribute("sys_version")
The original data looks like this.
There are repeating sections of XML like this:
<xml><status request_id="58f39198-2c76-4e87-8e00-f7dd7e69519f1416354337206" response_time="00:00:00:833"></status><results start="1" limit="1000" total_results_count="1" results_count="1"><br/><system><tests start="1" limit="50" total_results_count="18" results_count="18"><test> <biz_key>C|BF02F1A3-3C4E-11DC-8AAE-0015171BBD90|8594169899|700000667725</biz_key><test_id>2014071922090465</test_id><test_subject>HA Group Notification (WEEKLY_LOG) INFO</test_subject><test_type>DOT-REGULAR</test_type><asup_gen_date>2014-07-20T00:21:40-04:00</asup_gen_date><test_received_date>Sat Jul 19 22:09:19 PDT 2014</test_received_date><test_gen_zone>EDT</test_gen_zone><test_is_minimal>false</test_is_minimal><sys_version>9.2.2X22</sys_version><sys_operating_mode>Cluster-Mode</sys_operating_mode><hostname>rerfdsgt</hostname><sys_domain>test.com</sys_domain><cluster_name>bbrtp</cluster_name> ... etc
<xml>
  <results>
    <system>
      <sys_version>
      <asup>
        <asup_gen_date>
9.2.2X22 2014-07-20 00:21:40
9.2.2X21 2014-06-31 12:51:40
8.5.2X1 2014-07-20 04:33:22
You need to loop over the results of getElementsByTagName():
for version in dom.getElementsByTagName('sys_version'):
    version = version.toxml()
    version = version.replace('<sys_version>','').replace('</sys_version>','')
    print version
Also, instead of replacing opening and closing tags, you probably want to extract the text nodes with a getText() helper:
def getText(nodelist):
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)

for version in dom.getElementsByTagName('sys_version'):
    print getText(version.childNodes)
Another point is that it would be much easier and more pleasant to parse the XML with xml.etree.ElementTree, for example:
import xml.etree.ElementTree as ET

tree = ET.parse(response)
root = tree.getroot()
# iter() searches the whole subtree; findall('sys_version') would only
# look at direct children of the root, which don't include <sys_version>
for version in root.iter('sys_version'):
    print version.text
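And to produce the version/date pairs the question asks for, a sketch along these lines should work (assuming each record sits in a <test> element, as in the sample data above):
import xml.etree.ElementTree as ET

# 'data' is the XML string read from the REST response, as in the question
root = ET.fromstring(data)
for test in root.iter('test'):
    version = test.findtext('sys_version')
    gen_date = test.findtext('asup_gen_date')
    if version is not None and gen_date is not None:
        # replace the 'T' separator and strip the '-04:00' timezone suffix
        print version, gen_date.replace('T', ' ')[0:-6]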
There is a UCSC DAS server which gets DNA sequences by coordinate.
URL: http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr20:30037432,30038060
sample file:
<DASDNA>
<SEQUENCE id="chr20" start="30037832" stop="30038060" version="1.00">
<DNA length="229">
gtggcacccaaagatgctggaatctttatggcaaatgccgttacagatgc
tccaagaaggaaagagtctatgtttactgcataaataataaaatgtgctg
cgtgaagcccaagtaccagccaaaagaaaggtggtggccattttaactgc
tttgaagcctgaagccatgaaaatgcagatgaagctcccagtggattccc
acactctatcaataaacacctctggctga
</DNA>
</SEQUENCE>
</DASDNA>
What I want is this part:
gtggcacccaaagatgctggaatctttatggcaaatgccgttacagatgc
tccaagaaggaaagagtctatgtttactgcataaataataaaatgtgctg
cgtgaagcccaagtaccagccaaaagaaaggtggtggccattttaactgc
tttgaagcctgaagccatgaaaatgcagatgaagctcccagtggattccc
acactctatcaataaacacctctggctga
I want to get the sequence part from thousands of URLs of this kind; how should I do it?
I tried writing the data to a file and parsing the file, which worked OK, but is there any way to parse the XML-like string directly? I tried some examples from other posts, but they didn't work.
Here I have added my solutions, thanks to the 2 answers below.
Solution 1:
def getSequence2(chromosome, start, end):
    base = 'http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment='
    url = base + chromosome + ':' + str(start) + ',' + str(end)
    doc = etree.parse(url, parser=etree.XMLParser())
    if doc != '':
        sequence = doc.xpath('SEQUENCE/DNA/text()')[0].replace('\n', '')
    else:
        sequence = 'THE SEQUENCE DOES NOT EXIST FOR GIVEN COORDINATES'
    return sequence
Solution 2:
def getSequence1(chromosome, start, end):
    base = 'http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment='
    url = base + chromosome + ':' + str(start) + ',' + str(end)
    xml = urllib2.urlopen(url).read()
    if xml != '':
        w = open('temp.xml', 'w')
        w.write(xml)
        w.close()
        dom = parse('temp.xml')
        data = dom.getElementsByTagName('DNA')
        sequence = data[0].firstChild.nodeValue.replace('\n', '')
    else:
        sequence = 'THE SEQUENCE DOES NOT EXIST FOR GIVEN COORDINATES'
    return sequence
Of course, these need the corresponding imports (from lxml import etree for the first; urllib2 and from xml.dom.minidom import parse for the second).
>>> from lxml import etree
>>> doc = etree.parse("http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr20:30037432,30038060",parser=etree.XMLParser())
>>> doc.xpath('SEQUENCE/DNA/text()')
['\natagtggcacatgtctgttgtcctagctcctcggggaaactcaggtggga\ngagtcccttgaactgggaggaggaggtttgcagtgagccagaatcattcc\nactgtactccagcctaggtgacagagcaagactcatctcaaaaaaaaaaa\naaaaaaaaaaaaaagacaatccgcacacataaaggctttattcagctgat\ngtaccaaggtcactctctcagtcaaaggtgggaagcaaaaaaacagagta\naaggaaaaacagtgatagatgaaaagagtcaaaggcaagggaaacaaggg\naccttctatctcatctgtttccattcttttacagacctttcaaatccgga\ngcctacttgttaggactgatactgtctcccttctttctgctttgtgtcag\ngtggcacccaaagatgctggaatctttatggcaaatgccgttacagatgc\ntccaagaaggaaagagtctatgtttactgcataaataataaaatgtgctg\ncgtgaagcccaagtaccagccaaaagaaaggtggtggccattttaactgc\ntttgaagcctgaagccatgaaaatgcagatgaagctcccagtggattccc\nacactctatcaataaacacctctggctga\n']
Use a Python XML parsing library like lxml, load the XML file with that parser, and then use a selector (e.g. using XPath) to grab the node/element that you need.
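As for the sub-question of parsing the string directly instead of going through a temp file: lxml's etree.fromstring() does exactly that. A minimal sketch (Python 2, to match the code above):
from lxml import etree
import urllib2

url = 'http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr20:30037832,30038060'
xml = urllib2.urlopen(url).read()
doc = etree.fromstring(xml)  # parse the XML string directly; no temp file needed
sequence = doc.xpath('SEQUENCE/DNA/text()')[0].replace('\n', '')
print sequence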