I have a simple scraping task whose pagination handling I would like to make more efficient, and I would like to append the results to lists so that I can write everything out to a single file.
The current task is scraping municipal laws for the city of São Paulo, iterating over the first 10 pages. I would like to find a way to determine the total number of pages for pagination, and have the script automatically cycle through all pages, similar in spirit to this: Handling pagination in lxml.
The XPaths for the pagination links are too poorly defined at the moment for me to work out how to do this effectively. For instance, on the first or last page (1 or 1608) there are only three li nodes, while on page 1605 there are six.
/html/body/div/section/ul[2]/li/a
How can I account for this pagination efficiently, determining the number of pages automatically rather than manually, and how can I specify the XPaths properly so that the script cycles through all the appropriate pages without duplicates?
The existing code is as follows:
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import requests
from lxml import html
base_url = "http://www.leismunicipais.com.br/legislacao-municipal/5298/leis-de-sao-paulo?q=&page=%d&types=o"
for url in [base_url % i for i in xrange(10)]:
    page = requests.get(url)
    tree = html.fromstring(page.text)
    # This will create a list of titles:
    titles = tree.xpath('/html/body/div/section/ul/li/a/strong/text()')
    # This will create a list of descriptions:
    desc = tree.xpath('/html/body/div/section/ul/li/a/text()')
    # This will create a list of URLs:
    urls = tree.xpath('/html/body/div/section/ul/li/a/@href')
    print 'Titles: ', titles
    print 'Description: ', desc
    print 'URL: ', urls
Secondly, how can I collect these results and write them out to JSON, SQL, etc.? I prefer JSON because I'm familiar with it, but I'm open to suggestions on how best to do this.
You'll need to examine the layout of your page/site; each site is different. Look for a 'pagination' block, a 'next' link, or a page slider, extract the page count from it, and use that count in your loop.
For output, import the json library; it provides a json.dump function for writing your collected results to a file.
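A rough sketch of both ideas, written for Python 3 (the XPaths come from the question; the assumption that the pagination links carry the page number as their text, and the output file name, are mine and should be checked against the real markup):
import json
import requests
from lxml import html

base_url = "http://www.leismunicipais.com.br/legislacao-municipal/5298/leis-de-sao-paulo?q=&page=%d&types=o"

# Fetch page 1 and read the highest page number out of the pagination links.
first = html.fromstring(requests.get(base_url % 1).text)
page_numbers = [int(t) for t in first.xpath('/html/body/div/section/ul[2]/li/a/text()')
                if t.strip().isdigit()]
last_page = max(page_numbers) if page_numbers else 1

# Cycle through every page once and collect the results in a single list.
results = []
for i in range(1, last_page + 1):
    tree = html.fromstring(requests.get(base_url % i).text)
    titles = tree.xpath('/html/body/div/section/ul/li/a/strong/text()')
    descs = tree.xpath('/html/body/div/section/ul/li/a/text()')
    urls = tree.xpath('/html/body/div/section/ul/li/a/@href')
    for title, desc, link in zip(titles, descs, urls):
        results.append({'title': title.strip(),
                        'description': desc.strip(),
                        'url': link})

# Write everything out to one JSON file.
with open('leis_sao_paulo.json', 'w', encoding='utf-8') as fh:
    json.dump(results, fh, ensure_ascii=False, indent=2)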
Although I couldn't fully understand your problem, this code should give you a fresh starting point. It is compatible with Python 3.
import requests
from lxml import html

result = {}
base_url = "https://leismunicipais.com.br/legislacao-municipal/5298/leis-de-sao-paulo?q=&page={0}&types=28&types=5"
for url in [base_url.format(i) for i in range(1, 3)]:
    tree = html.fromstring(requests.get(url).text)
    for item in tree.cssselect(".item-result"):
        try:
            name = ' '.join(item.cssselect(".title a")[0].text.split())
        except Exception:
            name = ""
        try:
            link = ' '.join(item.cssselect(".domain")[0].text.split())
        except Exception:
            link = ""
        result[name] = link
print(result)
Partial output:
{'Decreto 57998/2017': 'http://leismunicipa.is', 'Decreto 58009/2017': 'http://leismunicipa.is'}
Related
I am learning web scraping as my first mini-project, currently working with Python. I want to extract weather data and use Python to show the weather where I live. I found the data I needed by inspecting the tags, but my code keeps giving me all the numbers in the weather forecast table instead of the specific one I need. I tried writing its specific index number, but that still did not work. This is the code I have written so far:
import requests
from bs4 import BeautifulSoup as bs

url = "http://kktcmeteor.org/tahminler/merkezler?m=ISKELE"
r = requests.get(url)
cast = bs(r.content, "lxml")
wthr = cast.findAll("div", {"class": "col-md-9"})
print(wthr)
Any help would be greatly appreciated. The data I want is the temperature data.
Also, can somebody explain the differences between using lxml and html.parser? I have seen both used widely and am curious how you would decide to use one over the other.
I think the element you want is <div class="temp">24.2 °c</div>.
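If that guess about the markup is right, a minimal check with BeautifulSoup could look like this (the "temp" class name is the guess above and has not been verified against the live page):
import requests
from bs4 import BeautifulSoup

url = "http://kktcmeteor.org/tahminler/merkezler?m=ISKELE"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Assumption: the temperature sits in <div class="temp">24.2 °c</div>
temp_div = soup.find("div", class_="temp")
if temp_div is not None:
    print(temp_div.get_text(strip=True))
else:
    print("No div with class 'temp' found - inspect the markup again.")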
If your primary focus is just temperature data, you can check out weather APIs. There are several public APIs you could find here.
The legality of scraping should be considered before you start. You can read more about it here: https://www.tutorialspoint.com/python_web_scraping/legality_of_python_web_scraping.htm
This site doesn't have a robots.txt file, so there are no crawling restrictions declared for it.
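If you want to check that programmatically rather than by hand, the standard library's urllib.robotparser can do it; note that a missing robots.txt simply means no rules are declared, so the parser reports everything as allowed:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://kktcmeteor.org/robots.txt")
rp.read()  # a missing robots.txt leaves every path allowed

# Check whether a generic user agent may fetch the forecast page.
print(rp.can_fetch("*", "http://kktcmeteor.org/tahminler/merkezler?m=ISKELE"))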
Here is a very simplified way to get the table data published at the URL you used in the question. It uses html.parser to extract the data:
import requests
from bs4 import BeautifulSoup

def get_soup(my_url):
    HTML = requests.get(my_url)
    my_soup = BeautifulSoup(HTML.text, 'html.parser')
    if 'None' not in str(type(my_soup)):
        return my_soup
    else:
        return None

url = "http://kktcmeteor.org/tahminler/merkezler?m=ISKELE"
# get the whole html document
soup = get_soup(url)

# get something from that soup
# here a table header and data are extracted from the soup
table_header = soup.find("table").findAll("th")
table_data = soup.find("table").findAll("td")

# header's and data's types are lists
# combine the lists
for x in range(len(table_header)):
    print(table_header[x].text + ' --> ' + table_data[x].text)
""" R e s u l t :
Tarih / Saat -->
Hava --> COK BULUTLU
Sıcaklık --> 27.5°C
İşba Sıcaklığı --> 17.9°C
Basınç --> 1003.5 hPa
Görüş --> 10 km
Rüzgar --> Batıdan (270) 5 kt.
12.06.2022 13:00 --> Genel Tablo Genel Harita
"""
This is just one way to do it, and it only gets the part shown in the transparent table on the site. Once more, take care to follow the instructions in the site's robots.txt file. Regards...
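If only the temperature is needed, the same table walk can be reduced to a single value by matching the header cell; a small sketch, assuming the header is labelled exactly "Sıcaklık" as in the output above:
import requests
from bs4 import BeautifulSoup

url = "http://kktcmeteor.org/tahminler/merkezler?m=ISKELE"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Walk the header/data pairs and keep only the temperature row.
for th, td in zip(soup.find("table").find_all("th"),
                  soup.find("table").find_all("td")):
    if th.get_text(strip=True) == "Sıcaklık":
        print("Temperature:", td.get_text(strip=True))
        break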
Did you check whether the web service has an API you can use? Many weather apps have APIs you can use for free if you stay under a certain request limit. If there is one, you could request only the data you need, so there would be no need to reformat it.
Ok so I'm working on a self-directed term project for my college programming course. My plan is to scrape different parts of the overwatch league website for stats etc, save them in a db and then pull from that db with a discord bot. However, I'm running into issues with the website itself. Here's a screenshot of the html for the standings page.
As you can see, it's quite convoluted and hard to navigate with the repeated div and body tags, and I'm pretty sure it's dynamically created. My prof recommended I find a way to isolate the rank title at the top of the table, access its parent, and then iterate through the siblings to pull data such as the team name, position, etc. into a dictionary for now. I haven't been able to find anything online that helps me; most websites don't provide enough information or are out of date.
Here's what I have so far:
from bs4 import BeautifulSoup
import requests
import link
import re
import pprint
url = 'https://overwatchleague.com/en-us/standings'
response = requests.get(url).text
page = BeautifulSoup(response, features='html.parser')
# for stat in page.find(string=re.compile("rank")):
# statObject = {
# 'standing' : stat.find(string=re.compile, attrs={'class' : 'standings-table-v2styles__TableCellContent-sc-3q1or9-6 jxEkss'}).text.encode('utf-8')
# }
# print(page.find_all('span', re.compile("rank")))
# for tag in page.find_all(re.compile("rank")):
# print(tag.name)
print(page.find(string=re.compile('rank')))
"""
# locate branch with the rank header,
# move up to the parent branch
# iterate through all the siblings and
# save the data to objects
"""
The comments are all failed attempts and all return nothing. The only line not commented out returns a massive JSON blob with a lot of unnecessary information, which does include what I want to parse out and use for my project. I've linked it as a Google Doc and highlighted what I'm looking to grab.
I'm not really sure how else to approach this at this point. I've considered using Selenium, but I lack knowledge of JavaScript, so I'm trying to avoid it if possible. Even if you could comment with some advice on how else to approach this, I would greatly appreciate it.
Thank you
As you have noticed, your data is in JSON format. It is embedded in a script tag directly in the page, so it's easy to get it using BeautifulSoup. Then you need to parse the JSON to extract all the tables (corresponding to the 3 tabs):
import requests
from bs4 import BeautifulSoup
import json

url = 'https://overwatchleague.com/en-us/standings'

r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

# The page data is embedded as JSON inside the __NEXT_DATA__ script tag.
script = soup.find("script", {"id": "__NEXT_DATA__"})
data = json.loads(script.text)

# Keep only the blocks that actually contain standings tabs.
tabs = [
    i.get("standings")["tabs"]
    for i in data["props"]["pageProps"]["blocks"]
    if i.get("standings") is not None
]

# Map each tab title to the team list of its first table.
result = [
    {i["title"]: i["tables"][0]["teams"]}
    for i in tabs[0]
]

print(json.dumps(result, indent=4, sort_keys=True))
The above code gives you a list of dictionaries: the keys are the titles of the 3 tabs and the values are the corresponding table data.
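Continuing from the result list built above, a quick way to see what you got might be this (the fields inside each team entry are not shown here, so only the counts are printed):
# result is a list of {tab_title: [team, team, ...]} dictionaries.
for tab in result:
    for title, teams in tab.items():
        print(title, '-', len(teams), 'teams')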
I am currently working on scraping HTML from Baka-Updates (mangaupdates.com).
However, the div class names are duplicated.
As my goal is CSV or JSON output, I would like to use the information in [sCat] as the column names and store the corresponding values from [sContent].
Is there a way to scrape this kind of website?
Thanks,
Sample
https://www.mangaupdates.com/series.html?id=75363
from lxml import html
import requests

page = requests.get('http://www.mangaupdates.com/series.html?id=153558?')
tree = html.fromstring(page.content)

# Get the name of the columns.... I hope
sCat = tree.xpath('//div[@class="sCat"]/text()')
# Get the actual data
sContent = tree.xpath('//div[@class="sContent"]/text()')

print('sCat: ', sCat)
print('sContent: ', sContent)
I tried, but I couldn't find anything that worked.
@Jasper Nichol M Fabella
I tried editing your code and got the following output. Maybe it will help.
from lxml import html
import requests

page = requests.get('http://www.mangaupdates.com/series.html?id=153558?')
tree = html.fromstring(page.content)
# print(page.content)

# Get the name of the columns.... I hope
sCat = tree.xpath('//div[@class="sCat"]')
# Get the actual data
sContent = tree.xpath('//div[@class="sContent"]')

print('sCat: ', len(sCat))
print('sContent: ', len(sContent))

json_dict = {}
for i in range(0, len(sCat)):
    # print(''.join(i.itertext()))
    sCat_text = ''.join(sCat[i].itertext())
    sContent_text = ''.join(sContent[i].itertext())
    json_dict[sCat_text] = sContent_text

print(json_dict)
I got the following output
Hope it helps.
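Since the goal is CSV or JSON, the json_dict built above can be written out directly; a minimal sketch, with placeholder file names:
import csv
import json

# json_dict maps each sCat label to its sContent text (built above).
with open('series.json', 'w', encoding='utf-8') as fh:
    json.dump(json_dict, fh, ensure_ascii=False, indent=2)

# One CSV row per series, with the sCat labels as column headers.
with open('series.csv', 'w', newline='', encoding='utf-8') as fh:
    writer = csv.DictWriter(fh, fieldnames=list(json_dict.keys()))
    writer.writeheader()
    writer.writerow(json_dict)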
You can use XPath expressions and build a precise path to what you want to scrape.
Here is an example with the requests and lxml libraries:
from lxml import html
import requests

r = requests.get('https://www.mangaupdates.com/series.html?id=75363')
tree = html.fromstring(r.content)

sCat = [i.text_content().strip() for i in tree.xpath('//div[@class="sCat"]')]
sContent = [i.text_content().strip() for i in tree.xpath('//div[@class="sContent"]')]
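Pairing the two lists then gives one record keyed by category, which maps directly onto the CSV/JSON layout you described (this assumes sCat and sContent appear in matching order, as they do on the sample page):
# Pair each category label with its content; strip any trailing colon.
record = {cat.rstrip(':'): content for cat, content in zip(sCat, sContent)}
print(record)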
What are you using to scrape?
If you are using BeautifulSoup, you can search for all matching content on the page with the find_all method and a class identifier, then iterate through the results. You can use the special class_ keyword argument.
Something like
import bs4

# html_source is the raw HTML of the page (e.g. fetched with requests)
soup = bs4.BeautifulSoup(html_source, 'html.parser')
soup.find_all('div', class_='sCat')
# do the rest of your logic work here
Edit: I was typing on my mobile against a cached page before you made the edits, so I didn't see the changes. I see you are using the raw lxml library to parse. Yes, that's faster, but I am not too familiar with it, as I've only used raw lxml for one project. I think you can chain two searches to get something equivalent.
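For what it's worth, a rough BeautifulSoup equivalent of the lxml answers above might look like this (it assumes the sCat and sContent divs appear in matching order, which the page layout suggests):
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.mangaupdates.com/series.html?id=75363')
soup = BeautifulSoup(r.content, 'html.parser')

# Pull the category labels and their contents, then pair them up.
cats = [d.get_text(strip=True) for d in soup.find_all('div', class_='sCat')]
contents = [d.get_text(strip=True) for d in soup.find_all('div', class_='sContent')]
print(dict(zip(cats, contents)))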
I have recently started to use python and scrapy.
I have been trying to use scrapy to start at either a movie or actor wiki page, save the name and cast or filmography and traverse through the links in the cast or filmography sections to other actor/movie wiki pages.
However, I have no idea how rules work (edit: ok, this was a bit of hyperbole) and the wiki links are extremely nested. I saw that you can limit by xpath and give id or class but most of the links I want don't seem to have a class or id. I also wasn't sure if xpath also includes the other siblings and children.
Therefore I would like to know what rules to use to limit the non-relevant links and only go to cast and filmography links.
Edit: Clearly, I should have explained my question better. It's not that I don't understand XPaths and rules at all (that was a bit of hyperbole because I was getting frustrated), but I'm not completely clear on how they work. First, let me show what I have so far and then clarify where I am having trouble.
import logging

from bs4 import BeautifulSoup
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor, re
from scrapy.exceptions import CloseSpider

from Assignment2_0.items import Assignment20Item

logging.basicConfig(filename='spider.log', level=logging.DEBUG)


class WikisoupSpiderSpider(CrawlSpider):
    name = 'wikisoup_spider'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Keira_Knightley']
    rules = (
        Rule(LinkExtractor(restrict_css='table.wikitable')),
        Rule(LinkExtractor(allow=('(/wiki/)',)),
             callback='parse_crawl', follow=True))

    actor_counter = 0
    actor_max = 250
    movie_counter = 0
    movie_max = 125

    def parse_crawl(self, response):
        items = []
        soup = BeautifulSoup(response.text, 'lxml')
        item = Assignment20Item()
        occupations = ['Actress', 'Actor']
        logging.debug(soup.title)
        tempoccu = soup.find('td', class_='role')
        logging.warning('tempoccu only works for pages of people')
        tempdir = soup.find('th', text='Directed by')
        logging.warning('tempdir only works for pages of movies')

        if (tempdir is not None) and self.movie_counter < self.movie_max:
            logging.info('Found movie and do not have enough yet')
            item['moviename'] = soup.h1.text
            logging.debug('name is ' + item['moviename'])
            finder = soup.find('th', text='Box office')
            gross = finder.next_sibling.next_sibling.text
            gross_float = re.findall(r"[-+]?\d*\.\d+|\d+", gross)
            item['netgross'] = float(gross_float[0])
            logging.debug('Net gross is ' + gross_float[0])
            finder = soup.find('div', text='Release date')
            date = finder.parent.next_sibling.next_sibling.contents[1].contents[1].contents[1].get_text(" ")
            date = date.replace(u'\xa0', u' ')
            item['releasedate'] = date
            logging.debug('released on ' + item['releasedate'])
            item['type'] = 'movie'
            items.append(item)

        elif (tempoccu is not None) and (any(occu in tempoccu for occu in occupations)) and self.actor_counter < self.actor_max:
            logging.info('Found actor and do not have enough yet')
            item['name'] = soup.h1.text
            logging.debug('name is ' + item['name'])
            temp = soup.find('span', class_='noprint ForceAgeToShow').text
            age = re.findall('\d+', temp)
            item['age'] = int(age[0])
            logging.debug('age is ' + age[0])
            filmo = []
            finder = soup.find('span', id='Filmography')
            for x in finder.parent.next_sibling.next_sibling.find_all('i'):
                filmo.append(x.text)
            item['filmography'] = filmo
            logging.debug('has done ' + filmo[0])
            item['type'] = 'actor'
            items.append(item)

        elif (self.movie_counter == self.movie_max and self.actor_counter == self.actor_max):
            logging.info('Found enough data')
            raise CloseSpider(reason='finished')

        else:
            logging.info('irrelevant data')
            pass

        return items
Now, my understanding of the rules in my code is that they should allow all wiki links but only take links from table tags and their children. This is clearly not what was happening, since it very quickly crawled away from movies.
I know what to do when each element has an identifier like an id or class, but when inspecting the page, the links are buried in multiple nests of id-less tags that don't seem to follow a single pattern. (I would use a plain XPath, but different pages have different paths to the filmography, and it didn't seem like finding the path to the table under the 'Filmography' h2 would include all the links in the tables below it.) Therefore I wanted to know more about how I could get Scrapy to use only filmography links (on actor pages, anyway).
I apologize if this was an obvious thing, I have started using both python and scrapy/xpath/css only 48 hours ago.
Firstly, you need to know what to look for, that is, which tags you have to filter, so you have to inspect the HTML code of your page. Regarding libraries, I would use:
import requests
to make the connections, and
from bs4 import BeautifulSoup as bs
to parse. Example:
soup = bs(html_code, "html.parser")  # instantiate the parser on the page's HTML
select_tags = soup('select')         # collect the tags you want to filter
Then you should wrap the list and add a condition like this:
for i in select_tags:
    print(i.get('class'), type(i.get('class')))
    if type(i.get('class')) is list and '... name you look for ...' in i.get('class'):
        ...
In this case you filter the select tags you want by their 'class' attribute.
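To make that concrete, here is a small self-contained sketch of the filtering pattern described above; the HTML snippet and the 'cast-list' class name are made up purely for illustration:
from bs4 import BeautifulSoup

# Hypothetical markup, just to show the class-filtering idea.
html_code = '<select class="cast-list"></select><select class="other"></select>'
soup = BeautifulSoup(html_code, 'html.parser')

for tag in soup('select'):
    classes = tag.get('class')  # None, or a list of class names
    if isinstance(classes, list) and 'cast-list' in classes:
        print('matched:', tag)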
If I understand correctly what you want, you will probably need to combine your two rules into one, using both allow and restrict_xpaths/restrict_css.
So, something like:
rules = [
    Rule(LinkExtractor(allow=['/wiki/'], restrict_xpaths=['xpath']),
         callback='parse_crawl',
         follow=True)
]
Scraping wikipedia is usually pretty complicated, especially if trying to access very specific data.
There are a few problems i see for this particular example:
The data lacks structure - it's just a bunch of text in sequence, meaning your xpaths are going to be pretty complicated. For example, to select the 3 tables you want, you might need to use:
//table[preceding-sibling::h2[1][contains(., "Filmography")]]
You only want to follow links from the Title column (the second one); however, due to the way HTML tables are defined, that might not always be the second td of a row.
This means you'll probably need some additional logic, either in your xpath or in your code.
IMO the biggest problem: the lack of consistency. For example, take a look at https://en.wikipedia.org/wiki/Gerard_Butler#Filmography: no tables there, just a list and a link to another article. Basically, you get no guarantee about naming, positioning, layout, or display of information.
Those notes might get you started, but getting this information is going to be a big task.
My recommendation and personal choice would be to obtain the data you want from a more specialized source, instead of trying to scrape a website as generalized as wikipedia.
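That said, if you do stick with Wikipedia, the Filmography XPath suggested above can be dropped straight into a link extractor. A rough sketch using the current Scrapy import paths (scrapy.contrib was renamed in later releases); the callback body is only a placeholder:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FilmographySketchSpider(CrawlSpider):
    name = 'filmography_sketch'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Keira_Knightley']

    # Only follow /wiki/ links that sit inside tables directly following
    # the "Filmography" heading, per the XPath suggested above.
    rules = (
        Rule(
            LinkExtractor(
                allow=(r'/wiki/',),
                restrict_xpaths=(
                    '//table[preceding-sibling::h2[1][contains(., "Filmography")]]',
                ),
            ),
            callback='parse_page',
            follow=True,
        ),
    )

    def parse_page(self, response):
        # Placeholder: real parsing logic would go here.
        yield {'url': response.url, 'title': response.css('h1::text').get()}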
I am a beginner in WebCrawling, and I have a question regarding crawling multiple urls.
I am using CNBC in my project. I want to extract news titles and urls from its home page, and I also want to crawl the contents of the news articles from each url.
This is what I've got so far:
import requests
from lxml import html
import pandas

url = "http://www.cnbc.com/"
response = requests.get(url)
doc = html.fromstring(response.text)
headlineNode = doc.xpath('//div[@class="headline"]')
len(headlineNode)

result_list = []
for node in headlineNode:
    url_node = node.xpath('./a/@href')
    title = node.xpath('./a/text()')
    soup = BeautifulSoup(url_node.content)
    text = [''.join(s.findAll(text=True)) for s in soup.findAll("div", {"class": "group"})]
    if (url_node and title and text):
        result_list.append({'URL': url + url_node[0].strip(),
                            'TITLE': title[0].strip(),
                            'TEXT': text[0].strip()})

print(result_list)
len(result_list)
I keep getting an error saying 'list' object has no attribute 'content'. I want to create a dictionary that contains the title, the URL, and the article content for each headline. Is there an easier way to approach this?
Great start on the script. However, soup = BeautifulSoup(url_node.content) is wrong: url_node is a list. You need to form the full news URL, use requests to get the HTML, and then pass that to BeautifulSoup.
Apart from that, there are a few things I would look at:
I see import issues, BeautifulSoup is not imported.
Add from bs4 import BeautifulSoup to the top. Are you using pandas? If not, remove it.
Some of the news divs on CNBC with the big banner picture will yield a zero-length list when you query url_node = node.xpath('./a/@href'). You need to find the appropriate logic and selectors to get those news URLs as well. I will leave that up to you.
Check this out:
import requests
from lxml import html
import pandas
from bs4 import BeautifulSoup

# Note: trailing slash removed
url = "http://www.cnbc.com"

response = requests.get(url)
doc = html.fromstring(response.text)
headlineNode = doc.xpath('//div[@class="headline"]')
print(len(headlineNode))

result_list = []
for node in headlineNode:
    url_node = node.xpath('./a/@href')
    title = node.xpath('./a/text()')
    # Figure out logic to get that pic banner news URL
    if len(url_node) == 0:
        continue
    else:
        news_html = requests.get(url + url_node[0])
        soup = BeautifulSoup(news_html.content)
        text = [''.join(s.findAll(text=True)) for s in soup.findAll("div", {"class": "group"})]
        if (url_node and title and text):
            result_list.append({'URL': url + url_node[0].strip(),
                                'TITLE': title[0].strip(),
                                'TEXT': text[0].strip()})

print(result_list)
len(result_list)
Bonus debugging tip:
Fire up an ipython3 shell and do %run -d yourfile.py. Look up ipdb and the debugging commands. It's quite helpful to check what your variables are and if you're calling the right methods.
Good luck.