Crawling a two-level website, need comments in new rows - Python

I am very new to web scraping, and I am trying to scrape this online forum: https://community.whattoexpect.com/forums/postpartum-depression.html
It is a two-level site: the main page is a list of discussion posts, and you can click on each post to get the full content and the reply comments. The main page also has pagination.
I want my final CSV to have the main post in one row and each reply in the following rows, using the same ID for the main post and its replies so they can be linked.
Here is my Scrapy spider so far:
import scrapy
import datetime


class PeripartumSpider(scrapy.Spider):
    name = 'peripartum'

    start_urls = ['http://www.community.whattoexpect.com/forums/postpartum-depression.html']

    def parse(self, response):
        for post_link in response.xpath('//*[@id="group-discussions"]/div[3]/div/div/a/@href').extract():
            link = response.urljoin(post_link)
            yield scrapy.Request(link, callback=self.parse_thread)

        # Checks if the main page has a link to next page if True keep parsing.
        next_page = response.xpath('(//a[@class="page-link"])[1]/@href').extract_first()
        if next_page:
            yield scrapy.Request(next_page, callback=self.parse)

    # Going into each post and extracting information.
    def parse_thread(self, response):
        original_post = response.xpath("//*[@class='__messageContent fr-element fr-view']/p/text()").extract()
        title = response.xpath("//*[@class='discussion-original-post__title']/text()").extract_first()
        author_name = response.xpath("//*[@class='discussion-original-post__author__name']/text()").extract_first()
        unixtime = response.xpath("//*[@class='discussion-original-post__author__updated']/@data-date").extract_first()
        unixtime = int(unixtime) / 1000  # Removing milliseconds
        timestamp = datetime.datetime.utcfromtimestamp(unixtime).strftime("%m/%d/%Y %H:%M")

        replies_list = response.xpath("//*[@class='discussion-replies__list']").getall()

        # Getting the comments and their information for each post
        reply_post = response.xpath(".//*[@class='wte-reply__content__message __messageContent fr-element fr-view']/p/text()").extract()
        reply_author = response.xpath("//*[@class='wte-reply__author__name']/text()").extract()
        reply_time = response.xpath("//*[@class='wte-reply__author__updated']/@data-date").extract()

        for reply in reply_time:
            reply_date = int(reply_time) / 1000  # Removing milliseconds
            reply_timestamp = datetime.datetime.utcfromtimestamp(reply_date).strftime("%m/%d/%Y %H:%M")

        yield {
            "title": title,
            "author_name": author_name,
            "time": timestamp,
            "post": original_post,
            "reply_author": reply_author,
            "reply_timestamp": reply_timestamp,
            "replies": reply_post
        }
When I try to run my spider, I get 0 crawls. I am not sure whether I am correctly following the links to each post. Also, should I use something like Python's csv library to write the comments into the following rows while keeping the original post's ID?

You have to take care of both the existing web page's document structure and the structure of the code that parses it. There may be a better approach than the following, for example first identifying the n comment nodes and then looping over them, in which case you would not need to zip the lists together. But you can use it as a starting point:
import scrapy
import datetime


class PeripartumSpider(scrapy.Spider):
    name = 'peripartum'

    start_urls = ['https://community.whattoexpect.com/forums/postpartum-depression.html']

    def parse(self, response):
        for post_link in response.xpath('//*[@id="group-discussions"]/div[3]/div/div/a/@href').extract():
            link = response.urljoin(post_link)
            yield scrapy.Request(link, callback=self.parse_thread)

        # Checks if the main page has a link to next page if True keep parsing.
        next_page = response.xpath('(//a[@class="page-link"])[1]/@href').extract_first()
        if next_page:
            yield scrapy.Request(next_page, callback=self.parse)

    # Going into each post and extracting information.
    def parse_thread(self, response):
        original_post = response.xpath("//*[@class='__messageContent fr-element fr-view']/p/text()").extract()
        title = response.xpath("//*[@class='discussion-original-post__title']/text()").extract_first()
        author_name = response.xpath("//*[@class='discussion-original-post__author__name']/text()").extract_first()
        unixtime = response.xpath("//*[@class='discussion-original-post__author__updated']/@data-date").extract_first()
        unixtime = int(unixtime) / 1000  # Removing milliseconds
        timestamp = datetime.datetime.utcfromtimestamp(unixtime).strftime("%m/%d/%Y %H:%M")

        replies_list = response.xpath("//*[@class='discussion-replies__list']").getall()

        # Getting the comments and their information for each post
        replies_post = response.xpath(".//*[@class='wte-reply__content__message __messageContent fr-element fr-view']/p/text()").extract()
        replies_author = response.xpath("//*[@class='wte-reply__author__name']/text()").extract()
        replies_time = response.xpath("//*[@class='wte-reply__author__updated']/@data-date").extract()

        replies = zip(replies_post, replies_author, replies_time)
        for reply_post, reply_author, reply_time in replies:
            reply_date = int(reply_time) / 1000  # Removing milliseconds
            reply_timestamp = datetime.datetime.utcfromtimestamp(reply_date).strftime("%m/%d/%Y %H:%M")

            yield {
                "title": title,
                "author_name": author_name,
                "time": timestamp,
                "post": original_post,
                "reply_author": reply_author,
                "reply_timestamp": reply_timestamp,
                "replies": reply_post
            }
You may also need to handle pagination within the comments.
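Building on that, here is a minimal sketch (untested against the live forum) of the alternative mentioned above: select the reply nodes first and loop over them, emitting one item per CSV row and reusing the thread URL as the shared ID that links a post to its replies. The "/*" child step under discussion-replies__list and the relative reply selectors are assumptions about the markup and may need adjusting. With Scrapy's feed export (scrapy crawl peripartum -o threads.csv) every yielded dict becomes its own row, so Python's csv module isn't needed:
# Drop-in replacement for parse_thread in the spider above (sketch).
def parse_thread(self, response):
    thread_id = response.url  # shared ID linking the post to its replies

    # One row for the original post.
    yield {
        "id": thread_id,
        "kind": "post",
        "title": response.xpath("//*[@class='discussion-original-post__title']/text()").get(),
        "author": response.xpath("//*[@class='discussion-original-post__author__name']/text()").get(),
        "text": " ".join(response.xpath("//*[@class='__messageContent fr-element fr-view']/p/text()").getall()),
    }

    # One row per reply, carrying the same ID. The "/*" step assumes each reply
    # is a direct child of the replies list; adjust it to the real markup.
    for reply in response.xpath("//*[@class='discussion-replies__list']/*"):
        yield {
            "id": thread_id,
            "kind": "reply",
            "title": None,
            "author": reply.xpath(".//*[@class='wte-reply__author__name']/text()").get(),
            "text": " ".join(reply.xpath(
                ".//*[@class='wte-reply__content__message __messageContent fr-element fr-view']/p/text()"
            ).getall()),
        }
Keeping the same keys in both dicts keeps the CSV columns aligned, and the "kind" field tells post rows and reply rows apart; the timestamp handling from the code above can be added to both dicts in the same way.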

Related

How to get video url from iframe?

I want to get the URL of a video (.mp4) from an iframe using Python (or Rust); it doesn't matter which library. For example, I have:
<iframe src="https://spinning.allohalive.com/?kp=1332827&token=b51bdfc8af17dee996d3eae53726df" />
I really have no idea how to do this. Please help! If you need more information, just ask.
The code that I use to parse iframes from a website:
import scrapy
from cimber.models.website import Website


class KinokradSpider(scrapy.Spider):
    name = "kinokrad"
    start_urls = [Website.Kinokrad.value]

    def __init__(self):
        self.pages_count = 1

    def parse(self, response):
        pages_count = self.get_pages_count(response)
        if self.pages_count <= pages_count:
            for film in response.css("div.shorposterbox"):
                film_url = film.css("div.postertitle").css("a").attrib["href"]
                yield scrapy.Request(film_url, callback=self.parse_film)
            next_page = f"{Website.Kinokrad.value}/page/{self.pages_count}"
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)
            self.pages_count += 1

    def parse_film(self, response):
        name = response.css("div.fallsttitle").css("h1::text").get().strip()
        players = []
        for player in response.css("iframe::attr(src)").extract():
            players.append(player)
        yield {
            "name": name,
            "players": players
        }

    def get_pages_count(self, response) -> int:
        links = response.css("div.navcent").css("a")
        last_link = links[len(links) - 1].attrib["href"]
        return int(last_link.split("/page/")[1].replace("/", "").strip())
I've been trying for two weeks, and I'm finally asking this question on Stack Overflow. First I used BS4, then Selenium, and now Scrapy. I have a lot of code for automatically parsing iframes, but I need the mp4 URL. I've already tried solutions from Stack Overflow, but they don't work, so please don't remove my question.
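The usual first step is to request the iframe's src like any other page and search the returned HTML for direct video URLs. Below is only a generic sketch under that assumption (the start URL is a placeholder, and if the player assembles the video URL with JavaScript, as many embed hosts do, a plain HTTP fetch will not reveal it and you would need a headless browser or the player's own API requests instead):
import re

import scrapy


class IframeVideoSpider(scrapy.Spider):
    name = "iframe_video"
    start_urls = ["https://example.com/some-film-page"]  # placeholder page with iframes

    def parse(self, response):
        # Follow every embedded player iframe found on the page.
        for src in response.css("iframe::attr(src)").getall():
            yield response.follow(src, callback=self.parse_player)

    def parse_player(self, response):
        # Direct <video>/<source> tags, if the player ships them in plain HTML.
        urls = response.css("video::attr(src)").getall()
        urls += response.css("video source::attr(src)").getall()
        # Fall back to any .mp4/.m3u8 URL mentioned in inline scripts.
        urls += re.findall(r"""https?://[^\s'"]+\.(?:mp4|m3u8)[^\s'"]*""", response.text)
        for url in dict.fromkeys(urls):  # de-duplicate while keeping order
            yield {"player": response.url, "video_url": url}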

Scrapy: Having trouble to get value from another page

I started using Scrapy yesterday, following this modified steam-scraper project: https://github.com/prncc/steam-scraper to get Steam review information. The existing code keeps scrolling until there is no review left to scrape. However, I need to modify it a bit to get values from another page; more specifically, on a page like https://steamcommunity.com/app/416600/reviews I would like to get the number of reviews of each reviewer, which is displayed only on their review page (like this one https://steamcommunity.com/profiles/76561197993023168/recommended/, who has 14 reviews).
The original code reads:
# Imports this excerpt relies on; get_page, get_product_id and load_review are
# helpers defined elsewhere in the steam-scraper project.
import scrapy
from scrapy import FormRequest, Request


class ReviewSpider(scrapy.Spider):
    name = 'reviews'
    test_urls = [
        # Full Metal Furies
        'http://steamcommunity.com/app/416600/reviews/?browsefilter=mostrecent&p=1',
    ]

    def __init__(self, url_file=None, steam_id=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.url_file = url_file
        self.steam_id = steam_id

    def read_urls(self):
        with open(self.url_file, 'r') as f:
            for url in f:
                url = url.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def start_requests(self):
        if self.steam_id:
            url = (
                f'http://steamcommunity.com/app/{self.steam_id}/reviews/'
                '?browsefilter=mostrecent&p=1'
            )
            yield Request(url, callback=self.parse)
        elif self.url_file:
            yield from self.read_urls()
        else:
            for url in self.test_urls:
                yield Request(url, callback=self.parse)

    def parse(self, response):
        page = get_page(response)
        product_id = get_product_id(response)

        # Load all reviews on current page.
        reviews = response.css('div .apphub_Card')
        for i, review in enumerate(reviews):
            yield load_review(review, product_id, page, i)

        # Navigate to next page.
        form = response.xpath('//form[contains(@id, "MoreContentForm")]')
        if form:
            yield self.process_pagination_form(form, page, product_id)

    def process_pagination_form(self, form, page=None, product_id=None):
        action = form.xpath('@action').extract_first()
        names = form.xpath('input/@name').extract()
        values = form.xpath('input/@value').extract()

        formdata = dict(zip(names, values))
        meta = dict(prev_page=page, product_id=product_id)

        return FormRequest(
            url=action,
            method='GET',
            formdata=formdata,
            callback=self.parse,
            meta=meta
        )
What I tried to do is to add this in the parse function, just to get the number of reviews for a given user:
def parse(self, response):
    page = get_page(response)
    product_id = get_product_id(response)

    # Load all reviews on current page.
    reviews = response.css('div .apphub_Card')
    for i, review in enumerate(reviews):
        yield load_review(review, product_id, page, i)

    # Get the path for each reviewer
    Reviewers = response.xpath("/html/body/div[1]/div[5]/div[5]/div/div[1]/div/div/a[1]")
    for IndividualReview in Reviewers:
        num_reviews = IndividualReview.xpath(".//@href").get()
        yield {
            'num_reviews': num_reviews
        }

    # Navigate to next page.
    form = response.xpath('//form[contains(@id, "MoreContentForm")]')
    if form:
        yield self.process_pagination_form(form, page, product_id)
But it did not work. The main issue is that I am not familiar with XPath in general, and I do not really understand how Scrapy is supposed to go to the other page, get the desired information, and then come back, iterating over each review of a given game. How can I tackle this?
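One common pattern for this is to build the review item in parse, follow the reviewer's page with cb_kwargs carrying the partial item, and only yield the finished item from the second callback. The sketch below illustrates that pattern; it is not taken from the steam-scraper project, and the CSS selectors (including the hypothetical span.review_count) are assumptions about Steam's markup that will need adjusting:
import scrapy


class ReviewerCountSpider(scrapy.Spider):
    name = "reviewer_counts"
    start_urls = ["https://steamcommunity.com/app/416600/reviews/?browsefilter=mostrecent&p=1"]

    def parse(self, response):
        for card in response.css("div .apphub_Card"):
            item = {
                "review_text": " ".join(card.css(".apphub_CardTextContent::text").getall()).strip(),
                "profile_url": card.css(".apphub_CardContentAuthorName a::attr(href)").get(),
            }
            if item["profile_url"]:
                # Follow the reviewer's /recommended/ page and carry the partial item along.
                yield response.follow(
                    item["profile_url"].rstrip("/") + "/recommended/",
                    callback=self.parse_profile,
                    cb_kwargs={"item": item},
                )

    def parse_profile(self, response, item):
        # Hypothetical selector: replace "span.review_count" with whatever element
        # on the reviewer's page actually shows the total number of reviews.
        item["num_reviews"] = response.css("span.review_count::text").get()
        yield item
Because the item is only yielded in parse_profile, each output row already contains both the review data and the reviewer's count, so there is no need to "go back" to the first page.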

speed up python scrapy crawler

I'm currently writing a vacancies scraper with Scrapy to parse about 3M vacancy items.
The spider works and successfully scrapes items and stores them in PostgreSQL, but it does so pretty slowly.
In one hour I stored only 12k vacancies, so I'm really far from 3M of them.
The thing is, in the end I will need to scrape and update the data once per day, and with the current performance I would need more than a day just to parse all the data.
I'm new to data scraping, so I may be doing some basic thing wrong, and I'll be very grateful if anybody can help me.
Code of my spider:
import scrapy
import urllib.request
from lxml import html
from ..items import JobItem


class AdzunaSpider(scrapy.Spider):
    name = "adzuna"
    start_urls = [
        'https://www.adzuna.ru/search?loc=136073&pp=10'
    ]

    def parse(self, response):
        job_items = JobItem()
        items = response.xpath("//div[@class='sr']/div[@class='a']")

        def get_redirect(url):
            response = urllib.request.urlopen(url)
            response_code = response.read()
            result = str(response_code, 'utf-8')
            root = html.fromstring(result)
            final_url = root.xpath('//p/a/@href')[0]
            final_final_url = final_url.split('?utm', 1)[0]
            return final_final_url

        for item in items:
            id = None
            data_aid = item.xpath(".//@data-aid").get()
            redirect = item.xpath(".//h2/a/@href").get()
            url = get_redirect(redirect)
            url_header = item.xpath(".//h2/a/strong/text()").get()
            if item.xpath(".//p[@class='as']/@data-company-name").get() == None:
                company = item.xpath(".//p[@class='as']/text()").get()
            else:
                company = item.xpath(".//p[@class='as']/@data-company-name").get()
            loc = item.xpath(".//p/span[@class='loc']/text()").get()
            text = item.xpath(".//p[@class='at']/span[@class='at_tr']/text()").get()
            salary = item.xpath(".//p[@class='at']/span[@class='at_sl']/text()").get()

            job_items['id'] = id
            job_items['data_aid'] = data_aid
            job_items['url'] = url
            job_items['url_header'] = url_header
            job_items['company'] = company
            job_items['loc'] = loc
            job_items['text'] = text
            job_items['salary'] = salary

            yield job_items

        next_page = response.css("table.pg td:last-child ::attr('href')").get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
Use indexes in your table
Insert in BULK instead of inserting one-by-one
Minimize the use of meta in your Request
Use tuple instead of list where possible
Set CONCURRENT_ITEMS=100; setting it higher decreases performance
Try to use fewer Middlewares and Pipelines
Set AUTOTHROTTLE_ENABLED=False in settings.py
Set TELNETCONSOLE_ENABLED=False in settings.py (see the settings.py sketch below)
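For reference, a minimal settings.py fragment with just the settings named above, everything else left at Scrapy's defaults:
# settings.py (only the settings mentioned above)
CONCURRENT_ITEMS = 100         # raising this further tends to hurt rather than help
AUTOTHROTTLE_ENABLED = False   # don't let AutoThrottle add extra delays
TELNETCONSOLE_ENABLED = False  # the telnet console isn't needed for a production crawl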

struggling with Scrapy

I'm new to Scrapy and I'm struggling a little with a special case.
Here is the scenario:
I want to scrape a website where there is a list of books.
httpx://...bookshop.../archive is the page where the first 10 books are listed.
Then I want to get the information (name, date, author) for each book in the list. I have to go to another page for each book:
httpx://...bookshop.../book/{random_string}
So there are two types of request:
One for refreshing the list of books.
Another one for getting the book information.
But books can be added to the list at any time, so I would like to refresh the list every minute, and I also want to delay all requests by 5 seconds.
Here is my basic solution, but it only works for one "loop".
First I set the delay in settings.py:
DOWNLOAD_DELAY = 5
Then the code of my spider:
import time

import scrapy
from scrapy.loader import ItemLoader


class bookshopScraper(scrapy.Spider):
    name = "bookshop"
    url = "httpx://...bookshop.../archive"
    history = []
    last_refresh = 0

    def start_requests(self):
        self.last_refresh = time.time()
        yield scrapy.Request(url=self.url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[3]
        if page == 'archive':
            return self.parse_archive(response)
        else:
            return self.parse_book(response)

    def parse_archive(self, response):
        links = response.css('SOME CSS ').extract()
        for link in links:
            if link not in self.history:
                self.history.append(link)
                yield scrapy.Request(url="httpx://...bookshop.../book/" + link, callback=self.parse)
        if len(self.history) > 10:
            n = len(self.history) - 10
            self.history = self.history[-n:]

    def parse_book(self, response):
        """
        Load Item
        """
Now I would like to do something like:
if time.time() > self.last_refresh + 80:
    self.last_refresh = time.time()
    return scrapy.Request(url=self.url, callback=self.parse, dont_filter=True)
But I really don't know how to implement this.
PS: I want the same instance of Scrapy to run all the time without stopping.
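One way to implement this is to keep the spider alive with the spider_idle signal and re-queue the archive page from the idle handler once a minute, while DOWNLOAD_DELAY = 5 keeps throttling the individual requests. The following is only a sketch under the question's assumptions (placeholder URLs and the 'SOME CSS' selector come from the code above), not a tested solution:
import time

import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider


class BookshopRefreshSpider(scrapy.Spider):
    name = "bookshop_refresh"
    archive_url = "httpx://...bookshop.../archive"  # placeholder from the question

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Called every time the spider runs out of pending requests.
        crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
        return spider

    def start_requests(self):
        self.last_refresh = time.time()
        yield scrapy.Request(self.archive_url, callback=self.parse_archive, dont_filter=True)

    def parse_archive(self, response):
        for link in response.css('SOME CSS ').getall():  # placeholder selector
            yield response.follow("httpx://...bookshop.../book/" + link, callback=self.parse_book)

    def parse_book(self, response):
        yield {"url": response.url}  # load the real item here

    def on_idle(self, spider):
        # Re-queue the archive page once a minute and keep the spider from closing.
        if time.time() - self.last_refresh >= 60:
            self.last_refresh = time.time()
            # On Scrapy < 2.10 this call also takes the spider: engine.crawl(request, self)
            self.crawler.engine.crawl(
                scrapy.Request(self.archive_url, callback=self.parse_archive, dont_filter=True)
            )
        raise DontCloseSpider
Raising DontCloseSpider keeps the crawl running indefinitely, so the same Scrapy instance stays up and refreshes the list on its own schedule.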

Scrapy spider get information that is inside of links

I have made a spider that can take the information from this page and follow the "Next page" links. Right now, the spider only takes the information shown in the following structure.
The structure of the page is something like this
Title 1
URL 1 ---------> If you click you go to one page with more information
Location 1
Title 2
URL 2 ---------> If you click you go to one page with more information
Location 2
Next page
What I want is for the spider to go into each URL link and get the full information. I suppose I must add another rule specifying that I want to do something like this.
The behaviour of the spider should be:
Go to URL1 (get info)
Go to URL2 (get info)
...
Next page
But I don't know how to implement it. Can someone guide me?
Code of my Spider:
class BcnSpider(CrawlSpider):
    name = 'bcn'
    allowed_domains = ['guia.bcn.cat']
    start_urls = ['http://guia.bcn.cat/index.php?pg=search&q=*:*']

    rules = (
        Rule(
            SgmlLinkExtractor(
                allow=(re.escape("index.php")),
                restrict_xpaths=("//div[@class='paginador']")),
            callback="parse_item",
            follow=True),
    )

    def parse_item(self, response):
        self.log("parse_item")
        sel = Selector(response)
        sites = sel.xpath("//div[@id='llista-resultats']/div")
        items = []
        cont = 0
        for site in sites:
            item = BcnItem()
            item['id'] = cont
            item['title'] = u''.join(site.xpath('h3/a/text()').extract())
            item['url'] = u''.join(site.xpath('h3/a/@href').extract())
            item['when'] = u''.join(site.xpath('div[@class="dades"]/dl/dd[1]/text()').extract())
            item['where'] = u''.join(site.xpath('div[@class="dades"]/dl/dd[2]/span/a/text()').extract())
            item['street'] = u''.join(site.xpath('div[@class="dades"]/dl/dd[3]/span/text()').extract())
            item['phone'] = u''.join(site.xpath('div[@class="dades"]/dl/dd[4]/text()').extract())
            items.append(item)
            cont = cont + 1
        return items
EDIT: After searching on the internet, I found code with which I can do that.
First of all, I have to get all the links, and then I have to call another parse method.
def parse(self, response):
    # Get all URL's
    yield Request(url=_url, callback=self.parse_details)

def parse_details(self, response):
    # Detailed information of each page
If you want to use Rules because the page has a paginator, you should change def parse to def parse_start_url and then call this method through the Rule. With this change you make sure that parsing begins at parse_start_url, and the code would be something like this:
rules = (
    Rule(
        SgmlLinkExtractor(
            allow=(re.escape("index.php")),
            restrict_xpaths=("//div[@class='paginador']")),
        callback="parse_start_url",
        follow=True),
)

def parse_start_url(self, response):
    # Get all URL's
    yield Request(url=_url, callback=self.parse_details)

def parse_details(self, response):
    # Detailed information of each page
That's all, folks.
There is an easier way of achieving this. Click next on your link, and read the new url carefully:
http://guia.bcn.cat/index.php?pg=search&from=10&q=*:*&nr=10
By looking at the GET data in the URL (everything after the question mark), and with a bit of testing, we find that these mean:
from=10 - Starting index
q=*:* - Search query
nr=10 - Number of items to display
This is how I would've done it:
Set nr=100 or higher. (1000 may do as well, just be sure that there is no timeout)
Loop from from=0 to 34300. This is above the number of entries currently. You may want to extract this value first.
Example code:
entries = 34246
step = 100
stop = entries - entries % step + step
for x in xrange(0, stop, step):
    url = 'http://guia.bcn.cat/index.php?pg=search&from={}&q=*:*&nr={}'.format(x, step)
    # Loop over all entries, and open links if needed
