Scraping the core-text of webpages in python

Scraping the core-text of webpages in python - python

Is there a module that can be used to get the core-text of a webpage?
Something the will remove the headers / menus / social links?
Thank you

I think that varies from site to site. Since every website has a different structure you could not come up with a standard extractor.
For extracting a particular part of the web page, you can approach as following:
from urllib2 import urlopen
from scrapy.http import HtmlResponse
url = "some_website_you_want_to_crawl"
url_response = urlopen(url)
resp = HtmlResponse(url=url, body=url_response.read())
core_text = resp.xpath('xpath_to_core_text').extract()[0]

Related

API - Web Scrape

how to get access to this API:
import requests
url = 'https://b2c-api-premiumlabel-production.azurewebsites.net/api/b2c/page/menu?id_loja=2691'
print(requests.get(url))
I'm trying to retrieve data from this site via API, I found the url above and I can see its data , however I can't seem to get it right because I'm running into code 403.
This is the website url:
https://www.nagumo.com.br/osasco-lj46-osasco-ayrosa-rua-avestruz/departamentos
I'm trying to retrieve items category, they are visible for me, but I'm unable to take them.
Later I'll use these categories to iterate over products API.
API Category
Obs: please be gentle it's my first post here =]

To get the data as you shown in your image the following headers and endpoint are needed:
import requests
headers = {
'sm-token': '{"IdLoja":2691,"IdRede":884}',
'User-Agent': 'Mozilla/5.0',
'Referer': 'https://www.nagumo.com.br/osasco-lj46-osasco-ayrosa-rua-avestruz/departamentos',
}
params = {
'id_loja': '2691',
}
r = requests.get('https://www.nagumo.com.br/api/b2c/page/menu', params=params, headers=headers)
r.json()

Not sure exactly what your issue is here.
Bu if you want to see the content of the response and not just the 200/400 reponses. You need to add '.content' to your print.
Eg.
#Create Session
s = requests.Session()
#Example Connection Variables, probably not required for your use case.
setCookieUrl = 'https://www...'
HeadersJson = {'Accept-Language':'en-us'}
bodyJson = {"__type":"xxx","applicationName":"xxx","userID":"User01","password":"password2021"}
#Get Request
p = s.get(otherUrl, json=otherBodyJson, headers=otherHeadersJson)
print(p) #Print response (200 etc)
#print(p.headers)
#print(p.content) #Print the content of the response.
#print(s.cookies)

I'm also new here haha, but besides this requests library, you'll also need another one like beautiful soup for what you're trying to do.
bs4 installation: https:https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
Once you install it and import it, it's just continuing what you were doing to actively get your data.
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
this gets the entire HTML content of the page, and so, you can get your data from this page based on their css selectors like this:
site_data = soup.select('selector')
site_data is an array of things with that 'selector', so a simple for loop and an array to add your items in would suffice (as an example, getting links for each book on a bookstore site)
For example, if i was trying to get links from a site:
import requests
from bs4 import BeautifulSoup
sites = []
URL = 'https://b2c-api-premiumlabel-production.azurewebsites.net/api/b2c/page/menu?id_loja=2691'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
links = soup.select("a") # list of all items with this selector
for link in links:
sites.append(link)
Also, a helpful tip is when you inspect the page (right click and at the bottom press 'inspect'), you can see the code for the page. Go to the HTML and find the data you want and right click it and select copy -> copy selector. This will make it really easy for you to get the data you want on that site.
helpful sites:
https://oxylabs.io/blog/python-web-scraping
https://realpython.com/beautiful-soup-web-scraper-python/

Web scraping lazy list (lazy loading) using python request (without selenium/scarpy)

I have written a simple script for myself as practice to find who had bought same tracks as me on bandcamp to ideally find accounts with similar tastes and so more same music on their accounts.
The problem is that fan list on a album/track page is lazy loading. Using python's requests and bs4 I am only getting 60 results out of potential 700.
I am trying to figure out how to send request i.e. here https://pitp.bandcamp.com/album/fragments-distancing to open more of the list. After finding what request is send when I click in finder, I used that json request to send it using requests, although without any result
res = requests.get(track_link)
open_more = {"tralbum_type":"a","tralbum_id":3542956135,"token":"1:1609185066:1714678:0:1:0","count":100}
for i in range(0,3):
requests.post(track_link, json=open_more)
Will appreciate any help!

i think that just typing a ridiculous number for count will do. i did some automation on your script too if you want to get data on other albums
from urllib.parse import urlsplit
import json
import requests
from bs4 import BeautifulSoup
# build the post link
get_link="https://pitp.bandcamp.com/album/fragments-distancing"
link=urlsplit(get_link)
base_link=f'{link.scheme}://{link.netloc}'
post_link=f"{base_link}/api/tralbumcollectors/2/thumbs"
with requests.session() as s:
res = s.get(get_link)
soup = BeautifulSoup(res.text, 'lxml')
# the data for tralbum_type and tralbum_id
# are stored in a script attribute
key="data-band-follow-info"
data=soup.select_one(f'script[{key}]')[key]
data=json.loads(data)
open_more = {
"tralbum_type":data["tralbum_type"],
"tralbum_id":data["tralbum_id"],
"count":1000}
r=s.post(post_link, json=open_more).json()
print(r['more_available']) # if not false put a bigger count

How do I get a list of redirect urls from Dell.com

I am working on a web scraping project and want to get a list of products from Dell's website. I found this link (https://www.dell.com/support/home/us/en/04/products/) which pulls up a box with a list of product categories (really just redirect urls. If it doesn't come up for you click the button which says "Browse all products"). I tried using Python Requests to GET the page and save the text to a file to parse through, but the response doesn't contain any of the categories/redirect urls. My code is as basic as it gets:
import requests
url = "https://www.dell.com/support/home/us/en/04/products/"
page = requests.get(url)
with open("laptops.txt", "w", encoding="utf-8") as outf:
outf.write(page.text)
outf.close()
Is there a way to get these redirect urls? I am essentially trying to make my own site map of their products so that I can scrape the details of each one. Thanks

This page uses JavaScript to get and display these links - but requests/urllib and BeautifulSoup/lxml can't run JavaScript.
Using DevTools in Firefox/Chrome (tab: Network) I found it reads it from url
https://www.dell.com/support/components/productselector/allproducts?category=all-products/esuprt_&country=pl&language=pl&region=emea&segment=bsd&customerset=plbsd1&openmodal=true&_=1589265310743
so I use it to get links.
You may have to to change country=pl&language=pl in url to get it in different language.
import requests
from bs4 import BeautifulSoup as BS
url = "https://www.dell.com/support/components/productselector/allproducts?category=all-products/esuprt_&country=pl&language=pl&region=emea&segment=bsd&customerset=plbsd1&openmodal=true&_=1589265310743"
response = requests.get(url)
soup = BS(response.text, 'html.parser')
all_items = soup.find_all('a')
for item in all_items:
print(item.text, item['href'])
BTW: Other method is it use Selenium to control real web browser which can run JavaScript.

try using selenium chrome driver it helps for handling dynamic data on website and also features like clicking buttons, handling page refresh etc.
Beginner guide to web scraping

Advice on how to scrape data from this website

I would like some advice on how to scrape data from this website.
I started with selenium, but got stuck at the beginning because, for example, I have no idea how to set the dates.
My code until now:
from bs4 import BeautifulSoup as soup
from openpyxl import load_workbook
from openpyxl.styles import PatternFill, Font
from selenium import webdriver
from selenium.webdriver.common.by import By
import datetime
import os
import time
import re
day = datetime.date.today().day
month = datetime.date.today().month
year = datetime.date.today().year
my_url = 'https://www.eex-transparency.com/homepage/power/germany/production/availability/non-usability-by-unit/non-usability-history'
cookieValue = '12-c12-cached|from:' +str(year)+ '-' +str(month)+ '-' +str(day-5)+ ','+'to:' +str(year)+ '-' +str(month)+ '-' + str(day) +',dateType:1,company:PreussenElektra,fuel:uranium,canceled:0,durationComparator:ge,durationValue:5,durationUnit:day'
#saving url
browser = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")
my_url = 'https://www.eex-transparency.com/homepage/power/germany/production/availability/non-usability-by-unit'
browser.add_cookie({'name': 'tem', 'value': cookieValue})
browser.get(my_url)
my_url = 'https://www.eex-transparency.com/homepage/power/germany/production/availability/non-usability-by-unit/non-usability-history'
browser.get(my_url)
Obviously I am not asking for code, just some suggestions on how to continue with Selenium (how to set dates and other data) or any idea on how to scrape this website
Thanks in advance.
EDIT: I am trying to follow the cookie way. That is my updated code, I read that the cookie need to be created before loading the page and so I did, any idea why it is not working?

Best approach for you will be changing cookies, because every filter data is saved in cookie.
Check cookies in chrome ( f12 -> application -> cookies ) and play with filters. If you will change it in programmers tools you have to refresh website :)
Check this post on how to change cookies in selenium python.
To get values from website you have to use classic way like u did here, but you will have to use classes:
radio = browser.find_elements_by_class_name('aaaaaa')
You can always use xPath to search elements ( chrome will generate them for you ).

Is there any particular reason why you have decided to use selenium over other web scraping tools (scrapy, urllib, etc.)? I personally have not used Selenium but I have used some of the other tools. Below is an example of a script to just pull all the html from a page.
import urllib
import urllib2
from bs4 import BeautifulSoup as soup
link = "https://ubuntu.com"
page = urllib2.urlopen(link)
data = soup(page, 'html.parser')
print (data)
This is just a short script to pull all the HTML off a page. I believe BeautifulSoup has additional tools for inputting data into fields, but the exact method slips my mind right now, if I can find my notes on it I will edit this post. I remember it being very straightforward, though.
Best of luck!
Edit: here's a discussion web scraping tools from reddit a while back that I had saved https://www.reddit.com/r/Python/comments/1qnbq3/webscraping_selenium_vs_conventional_tools/

Python to Save Web Pages

This is probably a very simple task, but I cannot find any help. I have a website that takes the form www.xyz.com/somestuff/ID. I have a list of the IDs I need information from. I was hoping to have a simple script to go one the site and download the (complete) web page for each ID in a simple form ID_whatever_the_default_save_name_is in a specific folder.
Can I run a simple python script to do this for me? I can do it by hand, it is only 75 different pages, but I was hoping to use this to learn how to do things like this in the future.

Mechanize is a great package for crawling the web with python. A simple example for your issue would be:
import mechanize
br = mechanize.Browser()
response = br.open("www.xyz.com/somestuff/ID")
print response
This simply grabs your url and prints the response from the server.

This can be done simply in python using the urllib module. Here is a simple example in Python 3:
import urllib.request
url = 'www.xyz.com/somestuff/ID'
req = urllib.request.Request(url)
page = urllib.request.urlopen(req)
src = page.readall()
print(src)
For more info on the urllib module -> http://docs.python.org/3.3/library/urllib.html

Do you want just the html code for the website? If so, just create a url variable with the host site and add the page number as you go. I'll do this for an example with http://www.notalwaysright.com
import urllib.request
url = "http://www.notalwaysright.com/page/"
for x in range(1, 71):
newurl = url + x
response = urllib.request.urlopen(newurl)
with open("Page/" + x, "a") as p:
p.writelines(reponse.read())

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Scraping the core-text of webpages in python - python

Is there a module that can be used to get the core-text of a webpage? Something the will remove the headers / menus / social links? Thank you

Related

API - Web Scrape

Web scraping lazy list (lazy loading) using python request (without selenium/scarpy)

How do I get a list of redirect urls from Dell.com

Advice on how to scrape data from this website

Python to Save Web Pages

Categories

Resources