I have the following error:
ValueError at /scrape/
dictionary update sequence element #0 has length 1; 2 is required
Request Method: GET
Request URL: http://localhost:8000/scrape/
Django Version: 2.2.17
Exception Type: ValueError
Exception Value:
dictionary update sequence element #0 has length 1; 2 is required
Exception Location: c:\Users\toshiba\Desktop\newsscraper\news\views.py in scrape, line 32
Python Executable: C:\Users\toshiba\AppData\Local\Programs\Python\Python37\python.exe
Python Version: 3.7.2
Python Path:
['c:\\Users\\toshiba\\Desktop\\newsscraper',
'C:\\Users\\toshiba\\AppData\\Local\\Programs\\Python\\Python37\\python37.zip',
'C:\\Users\\toshiba\\AppData\\Local\\Programs\\Python\\Python37\\DLLs',
'C:\\Users\\toshiba\\AppData\\Local\\Programs\\Python\\Python37\\lib',
'C:\\Users\\toshiba\\AppData\\Local\\Programs\\Python\\Python37',
'C:\\Users\\toshiba\\AppData\\Local\\Programs\\Python\\Python37\\lib\\site-packages',
'C:\\Users\\toshiba\\AppData\\Local\\Programs\\Python\\Python37\\lib\\site-packages\\win32',
'C:\\Users\\toshiba\\AppData\\Local\\Programs\\Python\\Python37\\lib\\site-packages\\win32\\lib',
'C:\\Users\\toshiba\\AppData\\Local\\Programs\\Python\\Python37\\lib\\site-packages\\Pythonwin']
I am trying to scrape news articles from guardian.ng, but this view keeps raising the error above.
This is my views.py file:
# import all necessary modules
import requests
from django.shortcuts import render, redirect
from bs4 import BeautifulSoup as BSoup
from news.models import Headline

requests.packages.urllib3.disable_warnings()

# the view function news_list()
def news_list(request):
    headlines = Headline.objects.all()[::-1]
    context = {
        'object_list': headlines,
    }
    return render(request, "news/home.html", context)

# the view function scrape()
def scrape(request):
    session = requests.Session()
    session.headers = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"}
    # This HTTP header tells the server about the client. When our client requests
    # anything from the server, the request appears to come from a Google bot.
    url = "https://www.guardian.ng/"
    content = session.get(url, verify=False).content
    # We create a soup object from the HTML page, passing "html.parser" as the parser.
    soup = BSoup(content, "html.parser")
    # News holds the <div>s of a particular class, selected through webpage inspection.
    News = soup.find_all('div', {"class": "headline"})
    for article in News:
        main = dict(article.find_all('a'))  # we can iterate over soup objects
        # main should hold the anchor tag linking to the original webpage. Since the
        # returned <div>s only have one <a> tag, most of the work is done here.
        # The <a> tag contains the title and href of the original link.
        link = main['href']
        image_src = str(main.find('img')['srcset']).split(" ")[-4]  # srcset contains various image sizes
        title = main['title']
        # saving the data to the database
        new_headline = Headline()
        new_headline.title = title
        new_headline.url = link
        new_headline.image = image_src
        new_headline.save()
    return redirect("../")
What can I do to solve this error?
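For what it's worth, I can reproduce the same ValueError in a plain Python shell, which points at the dict() call: dict() expects an iterable of (key, value) pairs, but find_all('a') returns a list of Tag objects. A minimal sketch (assuming each headline <div> contains exactly one <a>, as the comments above say):

from bs4 import BeautifulSoup

html = '<div class="headline"><a href="/x" title="T"><img srcset="a.jpg 1w"></a></div>'
article = BeautifulSoup(html, "html.parser").div

# dict() tries to unpack each Tag as a (key, value) pair; a tag with one child
# has "length 1", hence the error:
# dict(article.find_all('a'))  # ValueError: dictionary update sequence element #0 ...

# Taking the first anchor tag instead lets the attribute lookups work:
main = article.find_all('a')[0]
print(main['href'], main['title'])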
Related
How do I get access to this API:
import requests
url = 'https://b2c-api-premiumlabel-production.azurewebsites.net/api/b2c/page/menu?id_loja=2691'
print(requests.get(url))
I'm trying to retrieve data from this site via its API. I found the URL above and I can see its data, but I can't seem to get it right because I keep running into a 403 status code.
This is the website url:
https://www.nagumo.com.br/osasco-lj46-osasco-ayrosa-rua-avestruz/departamentos
I'm trying to retrieve the item categories; they are visible to me, but I'm unable to pull them out.
Later I'll use these categories to iterate over the products API.
[screenshot: API Category]
Note: please be gentle, it's my first post here =]
To get the data shown in your image, the following headers and endpoint are needed:
import requests
headers = {
'sm-token': '{"IdLoja":2691,"IdRede":884}',
'User-Agent': 'Mozilla/5.0',
'Referer': 'https://www.nagumo.com.br/osasco-lj46-osasco-ayrosa-rua-avestruz/departamentos',
}
params = {
'id_loja': '2691',
}
r = requests.get('https://www.nagumo.com.br/api/b2c/page/menu', params=params, headers=headers)
r.json()
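If you want to sanity-check the response before digging into the payload, you can continue from the r above; the exact JSON structure is an assumption here, so inspect the top-level keys first:

r.raise_for_status()  # raises for 403/4xx instead of failing silently
data = r.json()
# inspect the shape before assuming a schema
print(type(data))
print(list(data) if isinstance(data, dict) else data[:2])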
Not sure exactly what your issue is here.
But if you want to see the content of the response, and not just the 200/400 status, you need to add `.content` to your print.
E.g.:

import requests

# Create session
s = requests.Session()

# Example connection variables, probably not required for your use case
setCookieUrl = 'https://www...'
headersJson = {'Accept-Language': 'en-us'}
bodyJson = {"__type": "xxx", "applicationName": "xxx", "userID": "User01", "password": "password2021"}

# GET request
p = s.get(setCookieUrl, json=bodyJson, headers=headersJson)
print(p)           # print the response status (200 etc.)
#print(p.headers)
#print(p.content)  # print the content of the response
#print(s.cookies)
I'm also new here haha, but besides the requests library, you'll also need another one like Beautiful Soup for what you're trying to do.
bs4 installation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
Once you install it and import it, it's just continuing what you were doing to actively get your data.
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
This gets the entire HTML content of the page, so you can then pull your data out of it based on CSS selectors like this:
site_data = soup.select('selector')
site_data is a list of all elements matching that selector, so a simple for loop appending each item to a list would suffice. For example, if I were trying to get the links for each book on a bookstore site:
import requests
from bs4 import BeautifulSoup

sites = []
URL = 'https://b2c-api-premiumlabel-production.azurewebsites.net/api/b2c/page/menu?id_loja=2691'
response = requests.get(URL)
soup = BeautifulSoup(response.text, "html.parser")

links = soup.select("a")  # list of all items with this selector
for link in links:
    sites.append(link.get("href"))
Also, a helpful tip: when you inspect the page (right click and press 'Inspect'), you can see the code for the page. Go to the HTML, find the data you want, right click it and select Copy -> Copy selector. This makes it really easy to get the data you want on that site.
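For instance, if the copied selector came out as something like "#content > div.book > a" (a made-up selector; yours will differ), you could plug it straight into the soup from above:

book_link = soup.select_one("#content > div.book > a")  # hypothetical copied selector
if book_link is not None:
    print(book_link.get("href"))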
helpful sites:
https://oxylabs.io/blog/python-web-scraping
https://realpython.com/beautiful-soup-web-scraper-python/
I've been trying to scrape Bandcamp fan pages to get a list of the albums they have purchased and I'm having trouble efficiently doing it. I wrote something with Selenium but it's mildly slow so I'd like to learn a solution that'd maybe send a POST request to the site and parse the JSON from there.
Here's a sample collection page: https://bandcamp.com/nhoward
Here's the Selenium code:
import time

from bs4 import BeautifulSoup, SoupStrainer

def scrapeFanCollection(url):
    browser = getBrowser()
    setattr(threadLocal, 'browser', browser)
    # Go to the url
    browser.get(url)
    try:
        # Click the "show more" button
        browser.find_element_by_class_name('show-more').click()
        # Wait two seconds
        time.sleep(2)
        # Scroll to the bottom, loading the full collection
        scroll(browser, 2)
    except Exception:
        pass
    # Parse only the <a class="item-link"> elements out of the page source
    soup_a = BeautifulSoup(browser.page_source, 'lxml', parse_only=SoupStrainer('a', {"class": "item-link"}))
    urls = []
    # Loop through all the matching <a> elements
    for item in soup_a.find_all('a', {"class": "item-link"}):
        url = item.get('href')
        if url is not None:
            urls.append(url)
    return urls
The API can be accessed as follows:
$ curl -X POST -H "Content-Type: Application/JSON" -d \
'{"fan_id":82985,"older_than_token":"1586531374:1498564527:a::","count":10000}' \
https://bandcamp.com/api/fancollection/1/collection_items
I didn't encounter a scenario where an "older_than_token" was stale, so the problem boils down to getting the "fan_id" given a URL.
This information is located in the data-blob attribute of the id="pagedata" element.
>>> import json
>>> import requests
>>> from bs4 import BeautifulSoup
>>> res = requests.get("https://www.bandcamp.com/ggorlen")
>>> soup = BeautifulSoup(res.text, "lxml")
>>> user = json.loads(soup.find(id="pagedata")["data-blob"])
>>> user["fan_data"]["fan_id"]
82985
Putting it all together (building upon this answer):
import json
import requests
from bs4 import BeautifulSoup
fan_page_url = "https://www.bandcamp.com/ggorlen"
collection_items_url = "https://bandcamp.com/api/fancollection/1/collection_items"
res = requests.get(fan_page_url)
soup = BeautifulSoup(res.text, "lxml")
user = json.loads(soup.find(id="pagedata")["data-blob"])
data = {
    "fan_id": user["fan_data"]["fan_id"],
    "older_than_token": user["wishlist_data"]["last_token"],
    "count": 10000,
}
res = requests.post(collection_items_url, json=data)
collection = res.json()
for item in collection["items"][:10]:
    print(item["album_title"], item["item_url"])
I'm using user["wishlist_data"]["last_token"], which has the same format as the "older_than_token", just in case this matters.
In order to get the entire collection, I changed the previous code from
"older_than_token": user["wishlist_data"]["last_token"]
to
"older_than_token": user["collection_data"]["last_token"]
which contained the right token.
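In other words, the request payload becomes:

data = {
    "fan_id": user["fan_data"]["fan_id"],
    "older_than_token": user["collection_data"]["last_token"],
    "count": 10000,
}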
Unfortunately for you, this particular Bandcamp page doesn't seem to make any HTTP API call to fetch the list of albums. You can check that in your browser's developer tools: open the Network tab and click the XHR filter. The only call being made seems to be the one fetching your collection details.
After some discussion of my problem on Unable to print links using beautifulsoup while automating through selenium,
I realized that the main problem is in the URL, which requests is not able to fetch. The URL of the page is actually https://society6.com/discover, but I am using Selenium to log into my account, so the URL becomes https://society6.com/society?show=2.
However, I can't use the second URL with requests since it shows an error. How do I scrape information from a URL like this?
You need to log in first!
To do that you can use a requests.Session together with the bs4.BeautifulSoup library (to pull the CSRF token out of the login form).
Here is an implementation that I have used:
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://society6.com/"

def log_in_and_get_session():
    """
    Get the session object with login details
    :return: requests.Session
    """
    ss = requests.Session()
    ss.verify = False  # optional, for sites with certificate problems
    text = ss.get(f"{BASE_URL}login").text
    csrf_token = BeautifulSoup(text, "html.parser").input["value"]
    data = {"username": "your_username", "password": "your_password", "csrfmiddlewaretoken": csrf_token}
    results = ss.post("{}login".format(BASE_URL), data=data)
    if results.ok:
        print("Login success", results.status_code)
        return ss
    else:
        print("Can't login", results.status_code)
Using the `post` method to log in...
Hope this helps you!
Edit: added the beginning of the function.
I'm trying to scrape data from the http://portal.uspto.gov/EmployeeSearch/ website.
I open the site in a browser, click on the Search button inside the "Search by Organisation" part of the site, and look for the request being sent to the server.
When I post the same request using the Python requests library in my program, I don't get the result page I am expecting; I get the same search page back, with no employee data on it.
I've tried all variants, nothing seems to work.
My question is: what URL should I use in my request, do I need to specify headers (tried that too, copying the headers viewed in the Firefox developer tools for the request), or something else?
Below is the code that sends the request:
import requests
from bs4 import BeautifulSoup

def scrape_employees():
    URL = 'http://portal.uspto.gov/EmployeeSearch/searchEm.do;jsessionid=98BC24BA630AA0AEB87F8109E2F95638.prod_portaljboss4_jvm1?action=displayResultPageByOrgShortNm&currentPage=1'
    response = requests.post(URL)
    site_data = response.content
    soup = BeautifulSoup(site_data, "html.parser")
    print(soup.prettify())

if __name__ == '__main__':
    scrape_employees()
All the data you need is in the form tag:
the action attribute is the URL you make the POST to,
and the input elements are the data you need to post to the server, as {name: value} pairs.
import re
import urllib.parse

import bs4
import requests

def make_soup(url):
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'lxml')
    return soup

def get_form(soup):
    form = soup.find(name='form', action=re.compile(r'OrgShortNm'))
    return form

def get_action(form, base_url):
    action = form['action']
    # action is a relative url; convert it to an absolute url
    abs_action = urllib.parse.urljoin(base_url, action)
    return abs_action

def get_form_data(form, org_code):
    data = {}
    for inp in form('input'):
        # if the value is empty, we put the org_code in this field
        data[inp['name']] = inp['value'] or org_code
    return data

if __name__ == '__main__':
    url = 'http://portal.uspto.gov/EmployeeSearch/'
    soup = make_soup(url)
    form = get_form(soup)
    action = get_action(form, url)
    data = get_form_data(form, '1634')
    # make the request to the action url using the form data
    r = requests.post(action, data=data)
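From there you can feed the response back into BeautifulSoup and pull out whatever the result page contains; the table layout below is an assumption, so inspect the actual markup first:

result_soup = bs4.BeautifulSoup(r.text, 'lxml')
for row in result_soup.find_all('tr'):
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    if cells:
        print(cells)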
I would like to parse a website with the urllib Python library. I wrote this:
from bs4 import BeautifulSoup
from urllib.request import HTTPCookieProcessor, build_opener
from http.cookiejar import FileCookieJar

def makeSoup(url):
    jar = FileCookieJar("cookies")
    opener = build_opener(HTTPCookieProcessor(jar))
    html = opener.open(url).read()
    return BeautifulSoup(html, "lxml")

def articlePage(url):
    return makeSoup(url)

Links = "http://collegeprozheh.ir/%d9%85%d9%82%d8%a7%d9%84%d9%87-%d9%85%d8%af%d9%84-%d8%b1%d9%82%d8%a7%d8%a8%d8%aa%db%8c-%d8%af%d8%b1-%d8%b5%d9%86%d8%b9%d8%aa-%d9%be%d9%86%d9%84-%d9%87%d8%a7%db%8c-%d8%ae%d9%88%d8%b1%d8%b4%db%8c%d8%af/"
print(articlePage(Links))
but the website does not return the content of the body tag.
This is the result of my program:
cURL = window.location.href;
var p = new Date();
second = p.getTime();
GetVars = getUrlVars();
setCookie("Human" , "15421469358743" , 10);
check_coockie = getCookie("Human");
if (check_coockie != "15421469358743")
document.write("Could not Set cookie!");
else
window.location.reload(true);
</script>
</head><body></body>
</html>
I think the cookie has caused this problem.
The page is using JavaScript to check the cookie and to generate the content. However, urllib does not process JavaScript and thus the page shows nothing.
You'll either need to use something like Selenium, which acts as a browser and executes JavaScript, or you'll need to set the cookie yourself before you request the page (from what I can see, that's all the JavaScript code does). You seem to be loading a file containing cookie definitions (using FileCookieJar), but you haven't included its contents.
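A sketch of the second option: the script in your output just sets a cookie named Human and reloads, so you can try sending that cookie yourself. The hard-coded value comes from your output and may well be generated per visit, so treat this as an assumption to verify:

from bs4 import BeautifulSoup
from urllib.request import build_opener

def makeSoupWithCookie(url):
    opener = build_opener()
    # send the cookie the page's JavaScript would have set
    opener.addheaders = [("Cookie", "Human=15421469358743")]
    html = opener.open(url).read()
    return BeautifulSoup(html, "lxml")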