I'm getting the above error with the code below; the error occurs at the last line. Please excuse the subject matter, I'm just practicing my Python skills. =)
from urllib.request import urlopen
from bs4 import BeautifulSoup
from pprint import pprint
from pickle import dump

moves = dict()
moves0 = set()
url = 'http://www.marriland.com/pokedex/1-bulbasaur'
print(url)
# Open url
with urlopen(url) as usock:
    # Get url data source
    data = usock.read().decode("latin-1")
# Soupify
soup = BeautifulSoup(data)
# Find move tables
for div_class1 in soup.find_all('div', {'class': 'listing-container listing-container-table'}):
    div_class2 = div_class1.find_all('div', {'class': 'listing-header'})
    if len(div_class2) > 1:
        header = div_class2[0].find_all(text=True)[1]
        # Take only moves from Level Up, TM / HM, and Tutor
        if header in ['Level Up', 'TM / HM', 'Tutor']:
            # Get rows
            for row in div_class1.find_all('tbody')[0].find_all('tr'):
                # Get cells
                cells = row.find_all('td')
                # Get move name
                move = cells[1].find_all(text=True)[0]
                # If move is new
                if not move in moves:
                    # Get type
                    typ = cells[2].find_all(text=True)[0]
                    # Get category
                    cat = cells[3].find_all(text=True)[0]
                    # Get power if not Status or Support
                    power = '--'
                    if cat != 'Status or Support':
                        try:
                            # not STAB
                            power = int(cells[4].find_all(text=True)[1].strip(' \t\r\n'))
                        except ValueError:
                            try:
                                # STAB
                                power = int(cells[4].find_all(text=True)[-2])
                            except ValueError:
                                # Moves like Return, Frustration, etc.
                                power = cells[4].find_all(text=True)[-2]
                    # Get accuracy
                    acc = cells[5].find_all(text=True)[0]
                    # Get pp
                    pp = cells[6].find_all(text=True)[0]
                    # Add move to dict
                    moves[move] = {'type': typ,
                                   'cat': cat,
                                   'power': power,
                                   'acc': acc,
                                   'pp': pp}
                # Add move to pokemon's move set
                moves0.add(move)

pprint(moves)
dump(moves, open('pkmn_moves.dump', 'wb'))
I have reduced the code as much as possible while still producing the error. The fault may be simple, but I just can't find it. In the meantime, I made a workaround by setting the recursion limit to 10000.
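The workaround is just raising the interpreter's recursion limit before the dump, roughly like this:

import sys

# Workaround only: raise the recursion limit so pickle can descend
# into the deeply nested objects stored in the moves dict.
sys.setrecursionlimit(10000)
dump(moves, open('pkmn_moves.dump', 'wb'))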
Just want to contribute an answer for anyone else who may have this issue. Specifically, I was having it with caching BeautifulSoup objects in a Django session from a remote API.
The short answer is that pickling BeautifulSoup nodes is not supported. I instead opted to store the original string data in my object and add an accessor method that parses it on the fly, so that only the original string data is pickled.
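A minimal sketch of that pattern (the class and attribute names here are illustrative, not from any library): keep only the raw HTML on the object and rebuild the soup lazily, so pickle never sees a BeautifulSoup node.

from bs4 import BeautifulSoup

class CachedPage:
    """Holds only the raw HTML string, so instances pickle cleanly."""

    def __init__(self, html):
        self.html = html       # plain str: safe to pickle
        self._soup = None      # parsed tree: rebuilt on demand, never pickled

    @property
    def soup(self):
        # Parse lazily and cache the tree for this process only
        if self._soup is None:
            self._soup = BeautifulSoup(self.html, "html.parser")
        return self._soup

    def __getstate__(self):
        # Drop the parsed tree before pickling
        return {"html": self.html}

    def __setstate__(self, state):
        self.html = state["html"]
        self._soup = None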
Related
I want to use the Elsevier Article Retrieval API (https://dev.elsevier.com/documentation/FullTextRetrievalAPI.wadl) to get the full text of a paper.
I use httpx to get the paper's information, but the response only contains part of it. My code is below:
import httpx
import time

def scopus_paper_date(paper_doi, apikey):
    apikey = apikey
    headers = {
        "X-ELS-APIKey": apikey,
        "Accept": 'text/xml'
    }
    timeout = httpx.Timeout(10.0, connect=60.0)
    client = httpx.Client(timeout=timeout, headers=headers)
    query = "&view=FULL"
    url = "https://api.elsevier.com/content/article/doi/" + paper_doi
    r = client.get(url)
    print(r)
    return r.text

y = scopus_paper_date('10.1016/j.solmat.2021.111326', myapikey)
y
The result is below:
<full-text-retrieval-response xmlns="http://www.elsevier.com/xml/svapi/article/dtd" xmlns:bk="http://www.elsevier.com/xml/bk/dtd" xmlns:cals="http://www.elsevier.com/xml/common/cals/dtd" xmlns:ce="http://www.elsevier.com/xml/common/dtd" xmlns:ja="http://www.elsevier.com/xml/ja/dtd" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:sa="http://www.elsevier.com/xml/common/struct-aff/dtd" xmlns:sb="http://www.elsevier.com/xml/common/struct-bib/dtd" xmlns:tb="http://www.elsevier.com/xml/common/table/dtd" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xocs="http://www.elsevier.com/xml/xocs/dtd" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><coredata><prism:url>https://api.elsevier.com/content/article/pii/S0927024821003688</prism:url>....
How can I get the full data of the paper? Many thanks!
That depends on the paper you want to download.
I modified the function you posted a bit. Now it gets the response as JSON instead of XML (this is just my personal preference; you can use whichever format you prefer).
import httpx
import time

def scopus_paper_date(paper_doi, apikey):
    apikey = apikey
    headers = {
        "X-ELS-APIKey": apikey,
        "Accept": 'application/json'
    }
    timeout = httpx.Timeout(10.0, connect=60.0)
    client = httpx.Client(timeout=timeout, headers=headers)
    query = "&view=FULL"
    url = "https://api.elsevier.com/content/article/doi/" + paper_doi
    r = client.get(url)
    print(r)
    return r
Now you can retrieve the document you want, and then you will have to parse it:
# Get document
y = scopus_paper_date('10.1016/j.solmat.2021.111326',my_api_key)
# Parse document
import json
json_acceptable_string = y.text
d = json.loads(json_acceptable_string)
# Print document
print(d['full-text-retrieval-response']['coredata']['dc:description'])
The result will be the dc:description of the document, i.e. the abstract:
The production of molecular hydrogen by photoelectrochemical
dissociation (PEC) of water is a promising technique, which allows ... The width of the forbidden
bands and the position of the valence and conduction bands of the
different materials were determined by Mott - Schottky type
measurements.
For this document, that is all you can get; there are no more options.
However, if you try to get a different document, for example:
# Get document
y = scopus_paper_date('10.1016/j.nicl.2021.102600',my_api_key)
# Parse document
import json
json_acceptable_string = y.text
d = json.loads(json_acceptable_string)
You can then print the originalText key of the full-text-retrieval-response:
# Print document
print(d['full-text-retrieval-response']['originalText'])
You will notice that this is a very long string containing a lot of text, probably more than you want; for example, it contains all the references as well.
As I said at the beginning, the information you can get depends on the individual paper. However, the full data will always be contained in the y variable defined in the code.
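As a minimal sketch of that, assuming d has been parsed with json.loads as above: prefer originalText when the paper exposes it, and fall back to the abstract otherwise.

# Assumes d = json.loads(y.text) as in the snippets above
full = d['full-text-retrieval-response']

# Use the full body text when available, otherwise fall back to the abstract
text = full.get('originalText') or full['coredata'].get('dc:description', '')
print(text[:500])  # preview only the first 500 characters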
I'm working on some NFL statistics web scraping; honestly, the activity doesn't matter much. I spent a ton of time debugging because I couldn't believe what it was doing: either I'm going crazy or there is some sort of bug in a package or in Python itself. Here's the code I'm working with:
import pandas as pd
from bs4 import BeautifulSoup as bs
import requests
import string
import numpy as np

# get player list
players = pd.DataFrame({"name": [], "url": [], "positions": [], "startYear": [], "endYear": []})
letters = list(string.ascii_uppercase)
for letter in letters:
    print(letter)
    players_html = requests.get("https://www.pro-football-reference.com/players/"+letter+"/")
    soup = bs(players_html.content, "html.parser")
    for player in soup.find("div", {"id": "div_players"}).find_all("p"):
        temp_row = {}
        temp_row["url"] = "https://www.pro-football-reference.com"+player.find("a")["href"]
        temp_row["name"] = player.text.split("(")[0].strip()
        years = player.text.split(")")[1].strip()
        temp_row["startYear"] = int(years.split("-")[0])
        temp_row["endYear"] = int(years.split("-")[1])
        temp_row["positions"] = player.text.split("(")[1].split(")")[0]
        players = players.append(temp_row, ignore_index=True)

players = players[players.endYear > 2000]
players.reset_index(inplace=True, drop=True)

game_df = pd.DataFrame()

def apply_test(row):
    #print(row)
    url = row['url']
    #print(list(range(int(row['startYear']),int(row['endYear'])+1)))
    for yr in range(int(row['startYear']), int(row['endYear'])+1):
        print(yr)
        content = requests.get(url.split(".htm")[0]+"/gamelog/"+str(yr)).content
        soup = bs(content, 'html.parser').find("div", {"id": "all_stats"})
        # overheader
        over_headers = []
        for over in soup.find("thead").find("tr").find_all("th"):
            if "colspan" in over.attrs.keys():
                for i in range(0, int(over['colspan'])):
                    over_headers = over_headers + [over.text]
            else:
                over_headers = over_headers + [over.text]
        # headers
        headers = []
        for header in soup.find("thead").find_all("tr")[1].find_all("th"):
            headers = headers + [header.text]
        all_headers = [a+"___"+b for a, b in zip(over_headers, headers)]
        # remove first column, it's meaningless
        all_headers = all_headers[1:len(all_headers)]
        for row in soup.find("tbody").find_all("tr"):
            temp_row = {}
            for i, col in enumerate(row.find_all("td")):
                temp_row[all_headers[i]] = col.text
            game_df = game_df.append(temp_row, ignore_index=True)

players.apply(apply_test, axis=1)
Now, again, I could get into what I'm trying to do, but there seems to be a much higher-level issue here. startYear and endYear in the for loop are 2013 and 2014, so the loop should set the yr variable to 2013 and then 2014. But when you look at what print(yr) outputs, you realize it prints 2013 twice. If you simply comment out the game_df = game_df.append(temp_row, ignore_index=True) line, the printouts of yr are correct. There is an error shortly after the first two lines, but that is expected and one I am comfortable debugging. The fact that appending to a global dataframe causes a for loop to behave differently is blowing my mind right now. Can someone help with this?
Thanks.
I don't really follow what the overall aim is, but I do note two things:
You either need game_df to be declared as global game_df inside apply_test, before game_df = game_df.append(temp_row, ignore_index=True), or better still passed as an argument in the def signature, though you would then need to amend players.apply(apply_test, axis=1) accordingly.
You need to handle the cases of find returning None, e.g. with soup.find("thead").find_all("tr")[1].find_all("th") for the page https://www.pro-football-reference.com/players/A/AaitIs00/gamelog/2014. Perhaps put in try/except blocks with appropriate default values to be supplied. A minimal sketch covering both points follows below.
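This sketch reuses the names and imports from your own script (requests, bs, game_df) and elides the header/row-building logic:

def apply_test(row):
    global game_df    # allow reassignment of the module-level dataframe
    url = row['url']
    for yr in range(int(row['startYear']), int(row['endYear']) + 1):
        content = requests.get(url.split(".htm")[0] + "/gamelog/" + str(yr)).content
        soup = bs(content, 'html.parser').find("div", {"id": "all_stats"})
        # Guard against find returning None for years with no stats table
        if soup is None or soup.find("thead") is None:
            continue
        # ... build all_headers and temp_row exactly as before ...
        # game_df = game_df.append(temp_row, ignore_index=True)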
I'm building a scraper that needs to perform pretty fast, over a large number of webpages. The result of the code below will be a CSV file with a list of links (and other things).
Basically, I create a list of webpages that each contain several links, and for each of these pages I collect those links.
Implementing multiprocessing leads to some weird results that I wasn't able to explain.
If I run this code with the pool size set to 1 (hence, without multithreading), I end up with 0.5% duplicated links (which is fair enough).
As soon as I speed it up by setting the size to 8, 12 or 24, I get around 25% duplicate links in the final results.
I suspect my mistake is in the way I write the results to the CSV file or in the way I use the imap() function (the same happens with imap_unordered, map, etc.), which somehow leads the threads to access the same elements of the iterable passed in. Any suggestions?
#!/usr/bin/env python
# coding: utf8
import sys
import requests, re, time
from bs4 import BeautifulSoup
from lxml import etree
from lxml import html
import random
import unicodecsv as csv
import progressbar
import multiprocessing
from multiprocessing.pool import ThreadPool

keyword = "keyword"

def openup():
    global crawl_list
    try:
        ### Generate list of URLs based on the number of results for the keyword, each of these contains other links. The list is subsequently randomized
        startpage = 1
        ## Get endpage
        url0 = myurl0
        r0 = requests.get(url0)
        print "First request: "+str(r0.status_code)
        tree = html.fromstring(r0.content)
        endpage = tree.xpath("//*[@id='habillagepub']/div[5]/div/div[1]/section/div/ul/li[@class='adroite']/a/text()")
        print str(endpage[0]) + " pages found"
        ### Generate random sequence for crawling
        crawl_list = random.sample(range(1, int(endpage[0])+1), int(endpage[0]))
        return crawl_list
    except Exception as e:
        ### Catches openup error and returns an empty crawl list, then breaks
        print e
        crawl_list = []
        return crawl_list

def worker_crawl(x):
    ### Open page
    url_base = myurlbase
    r = requests.get(url_base)
    print "Connecting to page " + str(x) + " ..." + str(r.status_code)
    while True:
        if r.status_code == 200:
            tree = html.fromstring(r.content)
            ### Get data
            titles = tree.xpath('//*[@id="habillagepub"]/div[5]/div/div[1]/section/article/div/div/h3/a/text()')
            links = tree.xpath('//*[@id="habillagepub"]/div[5]/div/div[1]/section/article/div/div/h3/a/@href')
            abstracts = tree.xpath('//*[@id="habillagepub"]/div[5]/div/div[1]/section/article/div/div/p/text()')
            footers = tree.xpath('//*[@id="habillagepub"]/div[5]/div/div[1]/section/article/div/div/span/text()')
            dates = []
            pagenums = []
            for f in footers:
                pagenums.append(x)
                match = re.search(r'\| .+$', f)
                if match:
                    date = match.group()
                    dates.append(date)
            pageindex = zip(titles, links, abstracts, footers, dates, pagenums)  # what if there is a missing value?
            return pageindex
        else:
            pageindex = [[str(r.status_code), "", "", "", "", str(x)]]
            return pageindex
        continue

def mp_handler():
    ### Write down:
    with open(keyword+'_results.csv', 'wb') as outcsv:
        wr = csv.DictWriter(outcsv, fieldnames=["title", "link", "abstract", "footer", "date", "pagenum"])
        wr.writeheader()
        results = p.imap(worker_crawl, crawl_list)
        for result in results:
            for x in result:
                wr.writerow({
                    #"keyword": str(keyword),
                    "title": x[0],
                    "link": x[1],
                    "abstract": x[2],
                    "footer": x[3],
                    "date": x[4],
                    "pagenum": x[5],
                })

if __name__ == '__main__':
    p = ThreadPool(4)
    openup()
    mp_handler()
    p.terminate()
    p.join()
Are you sure the page responds with the correct content to a fast sequence of requests? I have been in situations where the scraped site responded differently when requests were fast versus when they were spaced out in time. Meaning, everything went perfectly while debugging, but as soon as the requests were fast and in sequence, the website decided to give me a different response.
Besides this, I would ask whether the fact that you are writing in a non-thread-safe way might have an impact. To minimize interactions on the final CSV output and issues with the data, you might:
use wr.writerows with a chunk of rows to write
use a threading.Lock as in: Multiple threads writing to the same CSV in Python (see the sketch after these points)
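A minimal sketch of the lock option, reusing the field names from your mp_handler (the lock and helper function are the only additions; wr and result are assumed to exist as in your code):

import threading

write_lock = threading.Lock()

def write_result(wr, result):
    # Serialize access to the shared CSV writer so rows from different
    # worker results cannot interleave mid-write
    with write_lock:
        wr.writerows({
            "title": x[0],
            "link": x[1],
            "abstract": x[2],
            "footer": x[3],
            "date": x[4],
            "pagenum": x[5],
        } for x in result)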
I am trying to use JSON to search through the Google Maps API. So, I give the location "Plymouth": the Google Maps API shows a result set of 6, but when I try to parse the JSON, I get a length of only 2. I tried multiple cities too, but all I am getting is a result set of 2.
What is wrong below?
import urllib.request as UR
import urllib.parse as URP
import json
url = "http://maps.googleapis.com/maps/api/geocode/json?address=Plymouth&sensor=false"
uh = UR.urlopen(url)
data = uh.read()
count = 0
js1 = json.loads(data.decode('utf-8') )
print ("Length: ", len(js1))
for result in js1:
    location = js1["results"][count]["formatted_address"]
    lat = js1["results"][count]["geometry"]["location"]["lat"]
    lng = js1["results"][count]["geometry"]["location"]["lng"]
    count = count + 1
    print('lat', lat, 'lng', lng)
    print(location)
Simply replace for result in js1: with for result in js1['results']:
By the way, as posted in a comment on the question, there is no need to use a counter. You can rewrite your for loop as:
for result in js1['results']:
    location = result["formatted_address"]
    lat = result["geometry"]["location"]["lat"]
    lng = result["geometry"]["location"]["lng"]
    print('lat', lat, 'lng', lng)
    print(location)
If you look at the JSON that comes in, you'll see that it's a single dict with two items ("results" and "status"). Add print('result:', result) to the top of your for loop and it will print result: status and result: results, because all you are iterating over is the keys of that outer dict. That's a general debugging trick in Python: if you aren't getting the stuff you want, put in a print statement to see what you got.
The results are (not surprisingly) in a list under js1["results"]. In your for loop, you ignore the variable you are iterating over and go back to the original js1 for its data. This is unnecessary and, in your case, it hid the error. Had you tried to reference cities off of result, you would have gotten an error, and it might have been easier to see that result was "status", not the array you were after.
Now a few tweaks fix the problem:
import urllib.request as UR
import urllib.parse as URP
import json
url = "http://maps.googleapis.com/maps/api/geocode/json?address=Plymouth&sensor=false"
uh = UR.urlopen(url)
data = uh.read()
count = 0
js1 = json.loads(data.decode('utf-8') )
print ("Length: ", len(js1))
for result in js1["results"]:
location = result["formatted_address"]
lat = result["geometry"]["location"]["lat"]
lng = result["geometry"]["location"]["lng"]
count = count + 1
print ('lat',lat,'lng',lng)
print (location)
I am trying to print only non-null values, but I am not sure why even the null values are coming up in the output:
Input:
from lxml import html
import requests
import linecache

i = 1
read_url = linecache.getline('stocks_url', 1)
while read_url != '':
    page = requests.get(read_url)
    tree = html.fromstring(page.text)
    percentage = tree.xpath('//span[@class="grnb_20"]/text()')
    if percentage != None:
        print percentage
    i = i + 1
    read_url = linecache.getline('stocks_url', i)
Output:
$ python test_null.py
['76%']
['76%']
['80%']
['92%']
['77%']
['71%']
[]
['50%']
[]
['100%']
['67%']
You are getting empty lists, not None objects. You are testing for the wrong thing here; you see [], while if a Python null was being returned you'd see None instead. The Element.xpath() method will always return a list object, and it can be empty.
Use a boolean test:
percentage = tree.xpath('//span[@class="grnb_20"]/text()')
if percentage:
    print percentage[0]
Empty lists (and None) test as false in a boolean context. I opted to print out the first element from the XPath result; you appear to only ever have one.
Note that linecache is primarily aimed at caching Python source files; it is used to present tracebacks when an error occurs, and when you use inspect.getsource(). It isn't really meant to be used to read a file. You can just use open() and loop over the file without ever having to keep incrementing a counter:
with open('stocks_url') as urlfile:
    for url in urlfile:
        page = requests.get(url)
        tree = html.fromstring(page.content)
        percentage = tree.xpath('//span[@class="grnb_20"]/text()')
        if percentage:
            print percentage[0]
Change this in your code and it should work:
if percentage != []: