If statement to only select certain XML attributes - python

I'm attempting to get 2 different elements from an XML file; I'm trying to print them as the x and y on a scatter plot. I can manage to get both elements, but one list is 155 items long and the other only 50.
So I need to add an if statement to just select from elements that have an associated windSpeed element.
import requests
import bs4

url = "http://api.met.no/weatherapi/locationforecast/1.9/?lat=52.41616;lon=-4.064598"
response = requests.get(url)
xml_text = response.text
weather = bs4.BeautifulSoup(xml_text, "xml")

f = open('file.xml', "w")
f.write(weather.prettify())
f.close()
I'm then trying to get the time (from) element and the (windSpeed > mps) element and attribute. I'd like to use BeautifulSoup if possible, or a straight if loop would be great.
import bs4
import datetime
import matplotlib.pyplot as plt

with open('file.xml') as file:
    soup = bs4.BeautifulSoup(file, "xml")

times = soup.find_all("time")
windspeed = soup.select("windSpeed")
form = "%Y-%m-%dT%H:%M:%SZ"
x = []
y = []

for element in times:
    time = element.get("from")
    t = datetime.datetime.strptime(time, form)
    x.append(t)

for mps in windspeed:
    speed = mps.get("mps")
    y.append(speed)

plt.scatter(x, y)
plt.show()
When I run it, it raises the following error:
raise ValueError("x and y must be the same size")
ValueError: x and y must be the same size
I'm assuming it's because the lists are different lengths.
I know there's probably a simple way of fixing it; any ideas would be great.

Just modify your code snippet as follows. It will solve the length problem.
....
for element in times:
    time = element.get("from")
    t = datetime.datetime.strptime(time, form)
    if element.find('windSpeed'):
        x.append(t)
....
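For completeness, here is how the whole loop could look with that check folded in, building x and y in lockstep so they can't drift apart (a sketch assuming the same file.xml and imports as above):

import bs4
import datetime
import matplotlib.pyplot as plt

form = "%Y-%m-%dT%H:%M:%SZ"
x, y = [], []

with open('file.xml') as file:
    soup = bs4.BeautifulSoup(file, "xml")

for element in soup.find_all("time"):
    wind = element.find("windSpeed")
    if wind:  # skip time entries that have no windSpeed child
        x.append(datetime.datetime.strptime(element.get("from"), form))
        y.append(float(wind.get("mps")))

plt.scatter(x, y)
plt.show()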

Related

Item in list found but when I ask for location by index it says that the item can't be found

I am writing some code to get a list of certain counties in Florida for a database. These counties are listed on a website, but each is on an individual webpage. To make the collection process less tedious, I am writing a webscraper. I have gotten the links to all of the pages with the counties. I have written code that inspects a page and finds the line that says "COUNTY:"; I then want to get that line's position so I can grab the county name on the next line. The only problem is that when I ask for the position, it says the item can't be found. I know it is in there, because when I ask my code to find it and return the line (not its position), it doesn't come back empty. I will give some of the code below for reference.
Broken code:
import requests

links = ['https://www.ghosttowns.com/states/fl/acron.html', 'https://www.ghosttowns.com/states/fl/acton.html']
r = requests.get(links[1])
r = str(r.text)
r = r.split("\n")
county_before_location = [x for x in r if 'COUNTY' in x]
print(county_before_location)
print(r.index(county_before_location))
Returns:
[' </font><b><font color="#80ff80">COUNTY:</font><font color="#ffffff">'] is not in list
Code that shows the item:
import requests

links = ['https://www.ghosttowns.com/states/fl/acron.html', 'https://www.ghosttowns.com/states/fl/acton.html']
r = requests.get(links[1])
r = str(r.text)
r = r.split("\n")
county_before_location = [x for x in r if 'COUNTY' in x]
print(county_before_location)
Returns:
[' </font><b><font color="#80ff80">COUNTY:</font><font color="#ffffff">']
county_before_location is a list and you are asking for the index of said list, which is not in r. Instead you would need to ask for r.index(county_before_location[0]).
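Applied to your snippet, the fix could look like this (a sketch assuming the county name really is on the line after the match, as you describe):

idx = r.index(county_before_location[0])
print(r[idx + 1])  # the line following "COUNTY:", which should hold the county name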

Python Webscraping: How do I loop many URL requests?

import requests
from bs4 import BeautifulSoup

LURL = "https://www.erkunde-die-welt.de/laender-hauptstaedte-welt/"
Lpage = requests.get(LURL)
Lsoup = BeautifulSoup(Lpage.content, 'html.parser')
Lx = Lsoup.find_all(class_="column-2")
a = []
for Lx in Lx:
    a.append(Lx.text)
a.remove("Land")
j = 0
for i in range(len(a)):
    b = a[j]
    URL = "https://de.wikipedia.org/wiki/" + b
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    l = soup.find(class_="firstHeading")
    zr = soup.find(class_="wikitable infobox infoboxstaat float-right")
    z = zr.findAll("tr")
    a = ""
    for z in z:
        a = a + z.text
    h = a.find("Hauptstadt")
    lol = a[h:-1]
    lol = lol.replace("Hauptstadt", "")
    lol = lol.strip()
    fg = lol.find("\n")
    lol = lol[0:fg]
    lol = lol.strip()
    j = j + 1
    print(lol)
    print(l.text)
This is the code. It gets the name of every country and packs the names into a list. After that, the program loops through the Wikipedia page of each country, gets its capital, and prints it. It works fine for the first country, but once that country is finished and the loop starts again, it stops working with this error:
Traceback (most recent call last):
  File "main.py", line 19, in <module>
    z = zr.findAll("tr")
AttributeError: 'NoneType' object has no attribute 'findAll'
You stored the list of countries in a variable called a, which you then overwrote later in the script with some other value. That messes up your iteration. Two good ways to prevent problems like this:
Use more meaningful variable names.
Use mypy on your Python code.
I spent a little time trying to do some basic cleanup on your code to at least get you past that first bug; the list of countries is now called countries instead of a, which prevents you from overwriting it, and I replaced the extremely confusing i/j/a/b iteration with a very simple for country in countries loop. I also got rid of all the variables that were only used once so I wouldn't have to try to come up with better names for them. I think there's more work to be done, but I don't have enough of an idea what that inner loop is doing to want to even try to fix it. Good luck!
import requests
from bs4 import BeautifulSoup

countries = [x.text for x in BeautifulSoup(
    requests.get(
        "https://www.erkunde-die-welt.de/laender-hauptstaedte-welt/"
    ).content,
    'html.parser'
).find_all(class_="column-2")]
countries.remove("Land")

for country in countries:
    soup = BeautifulSoup(
        requests.get(
            "https://de.wikipedia.org/wiki/" + country
        ).content,
        'html.parser'
    )
    heading = soup.find(class_="firstHeading")
    rows = soup.find(
        class_="wikitable infobox infoboxstaat float-right"
    ).findAll("tr")
    a = ""
    for row in rows:
        a += row.text
    h = a.find("Hauptstadt")
    lol = a[h:-1]
    lol = lol.replace("Hauptstadt", "")
    lol = lol.strip()
    fg = lol.find("\n")
    lol = lol[0:fg]
    lol = lol.strip()
    print(lol)
    print(heading.text)
The error message is actually telling you what's happening. The line of code
z = zr.findAll("tr")
is throwing an attribute error because the NoneType object does not have a findAll attribute. You are trying to call findAll on zr, assuming that variable will always be a BeautifulSoup object, but it won't. If this line:
zr = soup.find(class_="wikitable infobox infoboxstaat float-right")
finds no objects in the HTML matching those classes, zr will be set to None. So, on one of the pages you are trying to scrape, that's what's happening. You can code around it with a try/except statement, like this:
for i in range(len(a)):
    b = a[j]
    URL = "https://de.wikipedia.org/wiki/" + b
    page = requests.get(URL)
    try:
        soup = BeautifulSoup(page.content, 'html.parser')
        l = soup.find(class_="firstHeading")
        zr = soup.find(class_="wikitable infobox infoboxstaat float-right")
        z = zr.findAll("tr")
        a = ""
        # don't do this! should be 'for i in z' or some other variable name
        for z in z:
            a = a + z.text
        h = a.find("Hauptstadt")
        lol = a[h:-1]
        lol = lol.replace("Hauptstadt", "")
        lol = lol.strip()
        fg = lol.find("\n")
        lol = lol[0:fg]
        lol = lol.strip()
        j = j + 1
        print(lol)
        print(l.text)
    except:
        pass
In this example, any page that doesn't have the right html tags will be skipped.
The 'NoneType' means your line zr = soup.find(class_="wikitable infobox infoboxstaat float-right") has returned nothing.
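As a narrower alternative to the broad try/except, you could also guard just that lookup with an explicit None check; a sketch of the relevant lines inside the loop:

zr = soup.find(class_="wikitable infobox infoboxstaat float-right")
if zr is None:
    continue  # no matching infobox on this page; skip to the next URL
z = zr.findAll("tr")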
The error is in this loop:
for Lx in Lx:
    a.append(Lx.text)
You can't use the same name for the loop variable and the iterable there. Please try this loop instead and let me know how it goes:
for L in Lx:
    a.append(L.text)

Removing datetime.datetime from a list in Python

I'm attempting to get 2 different elements from an XML file and print them as the x and y on a scatter plot. I can manage to get both elements, but when I plot them it only uses one of the dates to plot the other elements. I'm using the below code to get the weather data and save it as an XML file.
import requests
import bs4

url = "http://api.met.no/weatherapi/locationforecast/1.9/?lat=52.41616;lon=-4.064598"
response = requests.get(url)
xml_text = response.text
weather = bs4.BeautifulSoup(xml_text, "xml")

f = open('file.xml', "w")
f.write(weather.prettify())
f.close()
I'm then trying to get the time ('from') element and the ('windSpeed' > 'mps') element and attribute, and plot them as x and y on a scatter plot.
import bs4
import datetime
import matplotlib.pyplot as plt

with open('file.xml') as file:
    soup = bs4.BeautifulSoup(file, "xml")

times = soup.find_all("time")
windspeed = soup.select("windSpeed")
form = "%Y-%m-%dT%H:%M:%SZ"
x = []
y = []

for element in times:
    time = element.get("from")
    t = datetime.datetime.strptime(time, form)
    x.append(t)

for mps in windspeed:
    speed = mps.get("mps")
    y.append(speed)

plt.scatter(x, y)
plt.show()
I'm trying to make 2 lists from 2 loops and then read them as the x and y, but when I run it, it gives this error:
raise ValueError("x and y must be the same size")
ValueError: x and y must be the same size
I'm assuming it's because it prints the list as datetime.datetime(2016, 12, 22, 21, 0). How do I remove the datetime.datetime from the list?
I know there's probably a simple way of fixing it; any ideas would be great. You people here on Stack are helping me a lot with learning to code. Thanks!
Simply make two lists, one containing the x-axis values and the other the y-axis values, and pass them to the scatter function:
plt.scatter(list1, list2)
I suggest that you use lxml for analysing XML, because it gives you the ability to use XPath expressions, which can make life much easier. In this case, not every time entry contains a windSpeed entry; therefore, it's essential to identify the windSpeed entries first and then get the associated times. This code does that. There are two little problems I usually encounter: (1) I still need to 'play' with XPath to get it right; (2) sometimes I get a list when I expect a singleton, which is why there's a '[0]' in the code. I find it's better to build the code interactively.
>>> from lxml import etree
>>> XML = open('file.xml')
>>> tree = etree.parse(XML)
>>> for count, windSpeeds in enumerate(tree.xpath('//windSpeed')):
...     windSpeeds.attrib['mps'], windSpeeds.xpath('../..')[0].attrib['from']
...     if count > 5:
...         break
...
('3.9', '2016-12-29T18:00:00Z')
('4.8', '2016-12-29T21:00:00Z')
('5.0', '2016-12-30T00:00:00Z')
('4.5', '2016-12-30T03:00:00Z')
('4.1', '2016-12-30T06:00:00Z')
('3.8', '2016-12-30T09:00:00Z')
('4.4', '2016-12-30T12:00:00Z')
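Building on that, a sketch that turns those pairs into the two lists the scatter plot needs (assuming the same file.xml, plus matplotlib as in the question):

import datetime
from lxml import etree
import matplotlib.pyplot as plt

tree = etree.parse(open('file.xml'))
form = "%Y-%m-%dT%H:%M:%SZ"
x, y = [], []
for windSpeed in tree.xpath('//windSpeed'):
    # '../..' climbs from windSpeed up to the enclosing time element
    when = windSpeed.xpath('../..')[0].attrib['from']
    x.append(datetime.datetime.strptime(when, form))
    y.append(float(windSpeed.attrib['mps']))

plt.scatter(x, y)
plt.show()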

Save images from URLs stored in a list - Python

Using a list, I am able to get all the URLs from a webpage into the list imgs_urls. I now need to know how to save all the images from that webpage, given that the number of images changes. Within the imgs_urls list, depending on what report I run, there can be any number of URLs. The code below currently works by calling just one list item.
import lxml.html
import urllib2

html = lxml.html.fromstring(data)
imgs = html.cssselect('img.graph')
imgs_urls = []
for x in imgs:
    imgs_urls.append('http://statseeker%s' % (x.attrib['src']))

lnum = len(imgs_urls)
link = urllib2.Request(imgs_urls[0])
output = open('sla1.jpg', 'wb')
response = urllib2.urlopen(link)
output.write(response.read())
output.close()
The URLs in the list are full URLs. The list would read back something like this if printed:
img_urls = ['http://site/2C2302.png','http://site/2C22101.png','http://site/2C2234.png']
Here's the basic premise of what I think it would look like, though I know the syntax is not correct:
lnum = len(imgs_urls)
link = urllib2.Request(imgs_urls[0-(lnum)])
output = open('sla' + (0-(lnum)).jpg','wb')
response = urllib2.urlopen(link)
output.write(response.read())
output.close()
It would then save all images, and the file would look something like this:
sla1.png, sla2.png, sla3.png, sla4.png
Any ideas? I think a loop would probably fix this, but I don't know how to increment the sla.jpg filename lnum times while incrementing the list index the same way.
I like to use Python's enumerate to get the index of the iterable in addition to the value. You can use this to auto-increment the value you give to the outputted filenames. Something like this should work:
import urllib2

img_urls = ['http://site/2C2302.png', 'http://site/2C22101.png', 'http://site/2C2234.png']

for index, url in enumerate(img_urls):
    link = urllib2.urlopen(url)
    try:
        name = "sla%s.jpg" % (index + 1)
        with open(name, "wb") as output:
            output.write(link.read())
    except IOError:
        print "Unable to create %s" % name
You may need to catch other exceptions too, such as permission errors, but that should get you started. Note that I incremented the index by 1 as it is zero-based.
See also:
http://www.blog.pythonlibrary.org/2012/06/07/python-101-how-to-download-a-file/
How do I download a file over HTTP using Python?
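For anyone on Python 3, where urllib2 was folded into urllib.request, a roughly equivalent sketch (img_urls is the same placeholder list as above):

import urllib.request

img_urls = ['http://site/2C2302.png', 'http://site/2C22101.png', 'http://site/2C2234.png']

for index, url in enumerate(img_urls):
    name = "sla%s.jpg" % (index + 1)
    try:
        # urlopen responses are context managers in Python 3
        with urllib.request.urlopen(url) as link, open(name, "wb") as output:
            output.write(link.read())
    except IOError:
        print("Unable to create %s" % name)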

How to apply a function on each item in a list

I have a sitemap with about 21 urls on it and each of those urls contains about 2000 more urls. I'm trying to write something that will allow me to parse each of the original 21 urls and grab their containing 2000 urls then append it to a list.
I've been bashing my head against a wall for a few days now trying to get this to work, but it keeps returning a list of None. I've only been working with Python for about 3 weeks, so I might be missing something really obvious. Any help would be great!
storage = []
storage1 = []

for x in range(21):
    url = 'first part of the url' + str(x) + '.xml'
    storage.append(url)

def parser(any):
    tree = ET.parse(urlopen(any))
    root = tree.getroot()
    for i in range(len(storage)):
        x = (root[i][0]).text
        storage1.append(x)

storage2 = [parser(x) for x in storage]
I also tried using a while loop with a counter, but it always stopped after the first 2000 urls.
parser() never returns anything, so it defaults to returning None, hence why storage2 contains a list of Nones. Perhaps you want to look at what's in storage1?
If you don't declare a return for a function in python, it automatically returns None. Inside parser you're adding elements to storage1, but aren't returning anything. I would give this a shot instead.
storage = []
for x in range(21):
    url = 'first part of the url' + str(x) + '.xml'
    storage.append(url)

def parser(any):
    storage1 = []
    tree = ET.parse(urlopen(any))
    root = tree.getroot()
    for i in range(len(storage)):
        x = (root[i][0]).text
        storage1.append(x)
    return storage1

storage2 = [parser(x) for x in storage]
EDIT: As Amber said, you should also see that all your elements were actually being stored in storage1.
If I understand your problem correctly, you have two stages in your program:
You generate initial list of the 21 URLs
You fetch the page at each of those URLs, and extract additional URLs from the page.
Your first step could look like this:
initial_urls = [('http://...%s...' % x) for x in range(21)]
Then, to populate the large list of URLs from the pages, you could do something like this:
big_list = []

def extract_urls(source):
    tree = ET.parse(urlopen(source))  # note: 'source', not 'any'
    for link in get_links(tree):
        big_list.append(link.attrib['href'])

def get_links(tree):
    # define the logic for link extraction here
    ...

for url in initial_urls:
    extract_urls(url)

print(big_list)
Note that you'll have to write the procedure that extracts the links from the document yourself.
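For instance, if each page is a standard XML sitemap, the URL text lives in <loc> children rather than href attributes, so get_links could be sketched like this (the namespace is an assumption based on the sitemaps.org schema), with extract_urls appending link.text instead of link.attrib['href']:

# hypothetical sketch for sitemap-style XML
SITEMAP_NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

def get_links(tree):
    # yields every <loc> element in the sitemap; .text holds the URL
    return tree.iter(SITEMAP_NS + 'loc')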
Hope this helps!
You have to return storage1 in the parser function
def parser(any):
tree = ET.parse(urlopen(any))
root = tree.getroot()
for i in range(len(storage)):
x = (root[i][0]).text
storage1.append(x)
return storage1
I think this is what you want.
