Python Webscraping: How do I loop many URL requests? - python

import requests
from bs4 import BeautifulSoup

LURL = "https://www.erkunde-die-welt.de/laender-hauptstaedte-welt/"
Lpage = requests.get(LURL)
Lsoup = BeautifulSoup(Lpage.content, 'html.parser')
Lx = Lsoup.find_all(class_="column-2")
a = []
for Lx in Lx:
    a.append(Lx.text)
a.remove("Land")
j = 0
for i in range(len(a)):
    b = a[j]
    URL = "https://de.wikipedia.org/wiki/" + b
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    l = soup.find(class_="firstHeading")
    zr = soup.find(class_="wikitable infobox infoboxstaat float-right")
    z = zr.findAll("tr")
    a = ""
    for z in z:
        a = a + z.text
    h = a.find("Hauptstadt")
    lol = a[h:-1]
    lol = lol.replace("Hauptstadt", "")
    lol = lol.strip()
    fg = lol.find("\n")
    lol = lol[0:fg]
    lol = lol.strip()
    j = j + 1
    print(lol)
    print(l.text)
This is the code. It gets the name of every country and packs it into a list. After that, the program loops through the Wikipedia pages of the countries, gets the capital of each country, and prints it. It works fine for every country, but after one country is finished and the code starts again, it stops working with the error:
Traceback (most recent call last):
  File "main.py", line 19, in <module>
    z = zr.findAll("tr")
AttributeError: 'NoneType' object has no attribute 'findAll'

You stored the list of countries in a variable called a, which you then overwrote later in the script with some other value. That messes up your iteration. Two good ways to prevent problems like this:
Use more meaningful variable names.
Use mypy on your Python code.
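For example, if the list gets a type annotation, mypy flags the overwrite the moment it happens; a minimal sketch (the error shown is roughly what mypy reports, not an exact quote):

from typing import List

a: List[str] = ["Deutschland", "Frankreich"]
# ... later, inside the scraping loop ...
a = ""  # mypy: error: Incompatible types in assignment
        # (expression has type "str", variable has type "List[str]")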
I spent a little time trying to do some basic cleanup on your code to at least get you past that first bug; the list of countries is now called countries instead of a, which prevents you from overwriting it, and I replaced the extremely confusing i/j/a/b iteration with a very simple for country in countries loop. I also got rid of all the variables that were only used once so I wouldn't have to try to come up with better names for them. I think there's more work to be done, but I don't have enough of an idea what that inner loop is doing to want to even try to fix it. Good luck!
import requests
from bs4 import BeautifulSoup

countries = [x.text for x in BeautifulSoup(
    requests.get(
        "https://www.erkunde-die-welt.de/laender-hauptstaedte-welt/"
    ).content,
    'html.parser'
).find_all(class_="column-2")]
countries.remove("Land")

for country in countries:
    soup = BeautifulSoup(
        requests.get(
            "https://de.wikipedia.org/wiki/" + country
        ).content,
        'html.parser'
    )
    heading = soup.find(class_="firstHeading")
    rows = soup.find(
        class_="wikitable infobox infoboxstaat float-right"
    ).findAll("tr")
    a = ""
    for row in rows:
        a += row.text
    h = a.find("Hauptstadt")
    lol = a[h:-1]
    lol = lol.replace("Hauptstadt", "")
    lol = lol.strip()
    fg = lol.find("\n")
    lol = lol[0:fg]
    lol = lol.strip()
    print(lol)
    print(heading.text)

The error message is actually telling you what's happening. The line of code
z = zr.findAll("tr")
is throwing an AttributeError because the NoneType object does not have a findAll attribute. You are trying to call findAll on zr, assuming that variable will always be a BeautifulSoup object, but it won't always be. If this line:
zr = soup.find(class_="wikitable infobox infoboxstaat float-right")
finds no objects in the html matching those classes, zr will be set to None. So, on one of the pages you are trying to scrape, that's what's happening. You can code around it with a try/except statement, like this:
for i in range(len(a)):
    b = a[j]
    URL = "https://de.wikipedia.org/wiki/" + b
    page = requests.get(URL)
    try:
        soup = BeautifulSoup(page.content, 'html.parser')
        l = soup.find(class_="firstHeading")
        zr = soup.find(class_="wikitable infobox infoboxstaat float-right")
        z = zr.findAll("tr")
        a = ""
        # don't do this! should be 'for row in z' or some other variable name
        for z in z:
            a = a + z.text
        h = a.find("Hauptstadt")
        lol = a[h:-1]
        lol = lol.replace("Hauptstadt", "")
        lol = lol.strip()
        fg = lol.find("\n")
        lol = lol[0:fg]
        lol = lol.strip()
        j = j + 1
        print(lol)
        print(l.text)
    except:
        pass
In this example, any page that doesn't have the right html tags will be skipped.
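A bare except will also swallow unrelated bugs, though. A narrower alternative (a sketch, not part of the original answer) is to test zr explicitly inside the loop and skip the page:

zr = soup.find(class_="wikitable infobox infoboxstaat float-right")
if zr is None:
    continue  # this page has no matching infobox; skip it instead of crashing
z = zr.findAll("tr")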

The 'NoneType' means your line zr = soup.find(class_="wikitable infobox infoboxstaat float-right") has returned nothing.
The error is in this loop:
for Lx in Lx:
    a.append(Lx.text)
You can't reuse the same name there. Please try this loop instead and let me know how it goes:
for L in Lx:
    a.append(L.text)

Related

Item in list found but when I ask for location by index it says that the item can't be found

I am writing some code to get a list of certain counties in Florida for a database. These counties are listed on a website, but each is on an individual webpage. To make the collection process less tedious I am writing a web scraper. I have gotten the links to all of the websites with the counties. I have written code that will then inspect the website and find the line that says "COUNTY:", and then I want to get its location so I can actually get the county on the next line. The only problem is that when I ask for the location, it says the item can't be found. I know it is in there, because when I ask my code to find it and then return the line (not the placement) it doesn't return empty. I will give some of the code for reference and an image of the problem.
Broken code:
links = ['https://www.ghosttowns.com/states/fl/acron.html', 'https://www.ghosttowns.com/states/fl/acton.html']
import requests
r = requests.get(links[1])
r = str(r.text)
r = r.split("\n")
county_before_location = [x for x in r if 'COUNTY' in x]
print(county_before_location)
print(r.index(county_before_location))
Returns:
[' </font><b><font color="#80ff80">COUNTY:</font><font color="#ffffff">'] is not in list
Code that shows the item:
links = ['https://www.ghosttowns.com/states/fl/acron.html', 'https://www.ghosttowns.com/states/fl/acton.html']
import requests
r = requests.get(links[1])
r = str(r.text)
r = r.split("\n")
county_before_location = [x for x in r if 'COUNTY' in x]
print(county_before_location)
Returns:
[' </font><b><font color="#80ff80">COUNTY:</font><font color="#ffffff">']
county_before_location is a list, and you are asking for the index of that list itself, which is not in r. Instead you need to ask for r.index(county_before_location[0]).
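A minimal illustration of the difference, using made-up strings rather than the actual page content:

r = ["alpha", "beta two", "gamma"]
matches = [x for x in r if "beta" in x]  # matches == ["beta two"]
# r.index(matches) raises ValueError: ['beta two'] is not in list
print(r.index(matches[0]))  # 1 -- the index of the matching string itself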

How can I find specific text in a website's HTML code with Python and BeautifulSoup?

Completely new to HTML and Python here. I am looking to scrape a website with Python to find auction data. I want to find all listings with the text "lb, lbs., pound" et cetera. Here is an example of a listing HTML code I'm interested in:
<a class="product" href="/Item/91150404">
<div class="title">
30.00 LB Lego Mini Figures Lego People Grab Bag
<br>Bids: 7 </div> </a>
I figured out how to get a ResultSet of all the "title" tags with the title_all variable, but I would like to further filter all the auction listings to only show those with "LB" in the name. I have read the BeautifulSoup documentation and the best I've been able to do is return a blank list []. Here is my Python code:
import requests
import re
from bs4 import BeautifulSoup
url = 'https://www.shopgoodwill.com/Listings?st=&sg=&c=388&s=&lp=0&hp=999999&sbn=false&spo=false&snpo=false&socs=false&sd=false&sca=false&caed=4/18/2020&cadb=7&scs=false&sis=false&col=0&p=1&ps=40&desc=false&ss=0&UseBuyerPrefs=true'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
title_all = soup.findAll(True, class_=['title'])
result = soup.findAll('div', text=re.compile('LB'), attrs={'class': 'title'})
print(result)
# does not work
I've also tried reading similar questions here and implementing the answers but am getting stuck. Any help would be greatly appreciated! I am using Python 3.7.3 and BeautifulSoup 4. Thank you!
Instead of:
text=re.compile('LB')
Try:
string=re.compile('LB')
Documentation
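Put together, the suggested call would look something like this (a sketch; BeautifulSoup 4.4 renamed the text argument to string, so this assumes BS4 4.4 or newer):

import re
result = soup.findAll('div', attrs={'class': 'title'}, string=re.compile('LB'))
print(result)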
Another solution.
from simplified_scrapy import SimplifiedDoc, req, utils

# url = 'https://www.shopgoodwill.com/Listings?st=&sg=&c=388&s=&lp=0&hp=999999&sbn=false&spo=false&snpo=false&socs=false&sd=false&sca=false&caed=4/18/2020&cadb=7&scs=false&sis=false&col=0&p=1&ps=40&desc=false&ss=0&UseBuyerPrefs=true'
# html = req.get(url)  # or: html = requests.get(url).text
html = '''
<a class="product" href="/Item/91150404">
  <div class="title">
    30.00 LB Lego Mini Figures Lego People Grab Bag
    <br>Bids: 7
  </div>
</a>
'''
doc = SimplifiedDoc(html)
title_all = doc.getElementsByReg('( LB | LBS )', tag="div").text
print(title_all)
Result:
['30.00 LB Lego Mini Figures Lego People Grab Bag Bids: 7']
Here are more examples. https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
from bs4 import BeautifulSoup
import datetime as dt
import requests

url = 'https://www.shopgoodwill.com/Listings?st=&sg=&c=388&s=&lp=0&hp=999999&sbn=false&spo=false&snpo=false&socs=false&sd=false&sca=false&caed=4/18/2020&cadb=7&scs=false&sis=false&col=0&p=1&ps=40&desc=false&ss=0&UseBuyerPrefs=true'
r = requests.get(url)
bs = BeautifulSoup(r.text, "html.parser")

# Gathering products.
bs_products = bs.findAll("a", {"class": "product"})

# Gathering listing information for each product.
products = []
for product in bs_products:
    price_str = product.find("div", {"class": "price"}).text.strip()
    price_int = int(''.join(filter(lambda i: i.isdigit(), price_str)))
    product = {"img": product.find("img", {"class": "lazy-load"}).get("data-src"),
               "num": int(product.find("div", {"class": "product-number"}).text.split(":")[1]),
               "title": product.find("div", {"class": "title"}).next_element.strip(),
               "time_left": dt.datetime.strptime(product.find("div", {"class": "timer"}).get("data-countdown"), "%m/%d/%Y %I:%M:%S %p"),
               "price": price_int}
    products.append(product)

filter_LB = list(filter(lambda product: "LB" in product['title'], products))
print(filter_LB)

Outputs:
[{'img': 'https://sgwproductimages.azureedge.net/109/4-16-2020/56981071672752ssdt-thumb.jpg',
  'num': 91150404,
  'title': '30.00 LB Lego Mini Figures Lego People Grab Bag',
  'time_left': datetime.datetime(2020, 4, 21, 19, 20),
  'price': 444500},
 {'img': 'https://sgwproductimages.azureedge.net/5/4-14-2020/814151314749m.er-thumb.jpg',
  'num': 91000111,
  'title': '20 LBS of Bulk Loose Lego Pieces',
  'time_left': datetime.datetime(2020, 4, 19, 18, 6),
  'price': 4600}]
What I would recommend is to utilize BS4 for what it is meant to be used for -- scraping -- and then utilize Python to filter your objects. I won't object to the claim that BS4 can filter, but I've always found it best to first implement a general solution and then deal with specifics in need-be cases.
In case you are not familiar with filter, check out the documentation here. If you don't know what lambda is, it's a function written in one line. All filter does is loop through your objects and apply the given lambda function. Whatever object returns True inside the lambda, filter returns it.
def func(a):
    return a + 2

func(4)  # >>> 6

func = lambda a: a + 2
func(4)  # >>> 6
Happy programming! :)
References:
filter
lambda
datetime
Edit: for discussion's sake, some examples below. Say we want to keep only the numbers greater than or equal to 5. We can do it in multiple ways:

l = [1, 2, 3, 4, 5, 6, 7]

# Traditional filtering way. Makes sense.
filtered_l = []
for i in l:
    if i >= 5:
        filtered_l.append(i)

# Lambda + filter way
filtered_l = list(filter(lambda i: i >= 5, l))

# Function + filter way
def filtering(i):  # Notice this function returns either True or False.
    return i >= 5

filtered_l = list(filter(filtering, l))
You might be asking why we do list(filter()) as opposed to simply filter(). That is because filter returns an iterable, which is not itself a list; it's an object you can iterate through. Therefore, we extract the results of the filter by converting it into a list. Similarly, you can convert a list into an iterator (which gives you extra functionality and control):
l = [1, 2, 3, 4, 5]
iter_l = iter(l)  # >>> <list_iterator object at 0x10aa9ee50>
next(iter_l)  # >>> 1
next(iter_l)  # >>> 2
next(iter_l)  # >>> 3
next(iter_l)  # >>> 4
next(iter_l)  # >>> 5
next(iter_l)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
You might be asking "why bother using iter as oppose to simply using a list?" The answer is because you can overload the __iter__ and __next__ functionality in classes to make them into iterable classes (classes you can call for-loops in):
import random

class RandomIterable:
    def __iter__(self):
        return self
    def __next__(self):
        if random.choice(["go", "go", "stop"]) == "stop":
            raise StopIteration  # signals "the end"
        return 1
Which allows us to iterate through the class itself:
for eggs in RandomIterable():
    print(eggs)
Or, as you used in filter, simply get the list:
list(RandomIterable())
>>> [1]
Which in this case returns one 1 for every time "go" was drawn before the first "stop". If the return was [1, 1], then "go" came up twice before "stop" ended the iteration. Of course this is a dumb example, but hopefully now you see how list, filter, and lambda work together to filter lists (a.k.a. iterables) in Python.

Python Appending DataFrame, weird for loop error

I'm working on some NFL statistics web scraping; honestly, the specific activity doesn't matter much. I spent a ton of time debugging because I couldn't believe what it was doing - either I'm going crazy or there is some sort of bug in a package or Python itself. Here's the code I'm working with:
import pandas as pd
from bs4 import BeautifulSoup as bs
import requests
import string
import numpy as np

# get player list
players = pd.DataFrame({"name": [], "url": [], "positions": [], "startYear": [], "endYear": []})
letters = list(string.ascii_uppercase)
for letter in letters:
    print(letter)
    players_html = requests.get("https://www.pro-football-reference.com/players/" + letter + "/")
    soup = bs(players_html.content, "html.parser")
    for player in soup.find("div", {"id": "div_players"}).find_all("p"):
        temp_row = {}
        temp_row["url"] = "https://www.pro-football-reference.com" + player.find("a")["href"]
        temp_row["name"] = player.text.split("(")[0].strip()
        years = player.text.split(")")[1].strip()
        temp_row["startYear"] = int(years.split("-")[0])
        temp_row["endYear"] = int(years.split("-")[1])
        temp_row["positions"] = player.text.split("(")[1].split(")")[0]
        players = players.append(temp_row, ignore_index=True)

players = players[players.endYear > 2000]
players.reset_index(inplace=True, drop=True)

game_df = pd.DataFrame()

def apply_test(row):
    # print(row)
    url = row['url']
    # print(list(range(int(row['startYear']), int(row['endYear']) + 1)))
    for yr in range(int(row['startYear']), int(row['endYear']) + 1):
        print(yr)
        content = requests.get(url.split(".htm")[0] + "/gamelog/" + str(yr)).content
        soup = bs(content, 'html.parser').find("div", {"id": "all_stats"})
        # overheader
        over_headers = []
        for over in soup.find("thead").find("tr").find_all("th"):
            if "colspan" in over.attrs.keys():
                for i in range(0, int(over['colspan'])):
                    over_headers = over_headers + [over.text]
            else:
                over_headers = over_headers + [over.text]
        # headers
        headers = []
        for header in soup.find("thead").find_all("tr")[1].find_all("th"):
            headers = headers + [header.text]
        all_headers = [a + "___" + b for a, b in zip(over_headers, headers)]
        # remove first column, it's meaningless
        all_headers = all_headers[1:len(all_headers)]
        for row in soup.find("tbody").find_all("tr"):
            temp_row = {}
            for i, col in enumerate(row.find_all("td")):
                temp_row[all_headers[i]] = col.text
            game_df = game_df.append(temp_row, ignore_index=True)

players.apply(apply_test, axis=1)
Now again, I could get into what I'm trying to do, but there seems to be a much higher-level issue here. startYear and endYear in the for loop are 2013 and 2014, so the loop should be setting the yr variable to 2013 and then 2014. But when you look at what prints out due to the print(yr), you realize it's printing 2013 twice. Yet if you simply comment out the game_df = game_df.append(temp_row, ignore_index=True) line, the printouts of yr are correct. There is an error shortly after the first two lines, but that is expected and one I am comfortable debugging. The fact that appending to a global dataframe causes a for loop to behave differently is blowing my mind right now. Can someone help with this?
Thanks.
I don't really follow what the overall aim is, but I do note two things:
You either need the local game_df to be declared as global game_df before game_df = game_df.append(temp_row, ignore_index=True), or better still pass it as an argument in the def signature, though you would then need to amend players.apply(apply_test, axis=1) accordingly (see the sketch after this list).
You need to handle the cases of find returning None, e.g. with soup.find("thead").find_all("tr")[1].find_all("th") for the page https://www.pro-football-reference.com/players/A/AaitIs00/gamelog/2014. Perhaps put in try/except blocks with appropriate default values to be supplied.
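A minimal sketch of the global fix, reduced to a toy example (it assumes a pandas version that still has DataFrame.append, which was removed in pandas 2.0):

import pandas as pd

game_df = pd.DataFrame()

def apply_test(row):
    global game_df  # without this, the assignment below makes game_df a
                    # local name and raises UnboundLocalError
    game_df = game_df.append({"doubled": row["value"] * 2}, ignore_index=True)

players = pd.DataFrame({"value": [1, 2, 3]})
players.apply(apply_test, axis=1)
print(game_df)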

If statement to only select certain XML attributes

I'm attempting to get 2 different elements from an XML file and print them as the x and y of a scatter plot. I can manage to get both elements, but one list is 155 long and the other only 50.
So I need to add an if statement to just select from elements that have an associated windSpeed element.
url = "http://api.met.no/weatherapi/locationforecast/1.9/?lat=52.41616;lon=-4.064598"
response = requests.get(url)
xml_text=response.text
weather= bs4.BeautifulSoup(xml_text, "xml")
f = open('file.xml', "w")
f.write(weather.prettify())
f.close()
I'm then trying to get the time (from) element and the (windSpeed > mps) element and attribute. I'd like to use BeautifulSoup if possible, or a plain if statement would be great.
import datetime
import bs4
import matplotlib.pyplot as plt

with open('file.xml') as file:
    soup = bs4.BeautifulSoup(file, "xml")

times = soup.find_all("time")
windspeed = soup.select("windSpeed")
form = "%Y-%m-%dT%H:%M:%SZ"
x = []
y = []
for element in times:
    time = element.get("from")
    t = datetime.datetime.strptime(time, form)
    x.append(t)
for mps in windspeed:
    speed = mps.get("mps")
    y.append(speed)
plt.scatter(x, y)
plt.show()
When I run it, it raises the following error:
    raise ValueError("x and y must be the same size")
ValueError: x and y must be the same size
I'm assuming it's because the lists are different lengths.
I know there's probably a simple way of fixing it, any ideas would be great.
Just modify your code snippet as follows. It will solve the length problem.
....
for element in times:
    time = element.get("from")
    t = datetime.datetime.strptime(time, form)
    if element.find('windSpeed'):
        x.append(t)
....
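Going a step further, you could collect both values in the same pass so x and y can never drift apart (a sketch; like the answer above, it assumes each windSpeed element is nested inside its time element):

x, y = [], []
for element in times:
    ws = element.find("windSpeed")
    if ws is None:
        continue  # skip time entries without a wind speed
    x.append(datetime.datetime.strptime(element.get("from"), form))
    y.append(float(ws.get("mps")))  # convert from string so the y axis is numeric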

json.loads is picking only 2 results

I am trying to use JSON to search through the Google Maps API. So, I give the location "Plymouth" - the Google Maps API shows 6 results, but when I try to parse the JSON, I am getting a length of only 2. I tried multiple cities too, but all I am getting is a result set of 2.
What is wrong below?
import urllib.request as UR
import urllib.parse as URP
import json

url = "http://maps.googleapis.com/maps/api/geocode/json?address=Plymouth&sensor=false"
uh = UR.urlopen(url)
data = uh.read()
count = 0
js1 = json.loads(data.decode('utf-8'))
print("Length: ", len(js1))
for result in js1:
    location = js1["results"][count]["formatted_address"]
    lat = js1["results"][count]["geometry"]["location"]["lat"]
    lng = js1["results"][count]["geometry"]["location"]["lng"]
    count = count + 1
    print('lat', lat, 'lng', lng)
    print(location)
Simply replace for result in js1: with for result in js1['results']:
By the way, as posted in a comment in the question, no need to use a counter. You can rewrite your for loop as:
for result in js1['results']:
    location = result["formatted_address"]
    lat = result["geometry"]["location"]["lat"]
    lng = result["geometry"]["location"]["lng"]
    print('lat', lat, 'lng', lng)
    print(location)
If you look at the JSON that comes in, you'll see that it's a single dict with two items ("results" and "status"). Add print('result:', result) to the top of your for loop and it will print result: status and result: results, because all you are iterating over is the keys of that outer dict. That's a general debugging trick in Python: if you aren't getting the stuff you want, put in a print statement to see what you got.
The results (not surprisingly) are in a list under js1["results"]. In your for loop, you ignore the variable you are iterating and go back to the original js1 for its data. This is unnecessary, and in your case it hid the error. Had you tried to reference cities off of result, you would have gotten an error, and it may have been easier to see that result was "status", not the array you were after.
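A two-line illustration of the point, with a stubbed-out response (the field values here are made up):

js1 = {"results": [{"formatted_address": "Plymouth, UK"}], "status": "OK"}
for result in js1:
    print('result:', result)  # iterating a dict yields its keys: "results", "status"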
Now a few tweaks fix the problem:
import urllib.request as UR
import urllib.parse as URP
import json

url = "http://maps.googleapis.com/maps/api/geocode/json?address=Plymouth&sensor=false"
uh = UR.urlopen(url)
data = uh.read()
count = 0
js1 = json.loads(data.decode('utf-8'))
print("Length: ", len(js1))
for result in js1["results"]:
    location = result["formatted_address"]
    lat = result["geometry"]["location"]["lat"]
    lng = result["geometry"]["location"]["lng"]
    count = count + 1
    print('lat', lat, 'lng', lng)
    print(location)
