I'm using bs4 to scrape Product Hunt.
Taking this post as an example, when I scrape it using the below code, the "discussion" section is entirely absent.
import bs4
import pprint
import requests

res = requests.get('https://producthunt.com/posts/weights-biases')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
pprint.pprint(soup.prettify())
I suspect this has something to do with lazy loading (when you open the page, the "discussion" section takes an extra second or two to appear).
How do I scrape lazy loaded components? Or is this something else entirely?
It seems that some elements of the page are dynamically loaded through JavaScript queries.
The requests library allows you to send those queries manually and then parse the content of the resulting responses with bs4.
However, from my experience with dynamic webpages, this approach can be really annoying if you have a lot of queries to send.
Generally in these cases it's preferable to use a library that integrates real-time browser simulation. This way, the simulator itself will handle client-server communication and update the page; you will just have to wait for the elements to be loaded and then analyze them safely.
So I suggest you take a look at selenium or even selenium-requests if you prefer to keep the requests 'philosophy'.
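For example, a minimal selenium sketch (assuming Chrome and a matching chromedriver are installed; the CSS selector used for the wait is a guess and may need adjusting to the real page markup):
import bs4
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.producthunt.com/posts/weights-biases')
# wait up to 10 seconds for the lazily loaded discussion area to show up
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '[class*="comment"]')))
soup = bs4.BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()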
This is how you can get the comments under the discussion section. You can always adapt the script to also pull the replies each thread has received (a sketch follows the output below).
import json
import requests
from pprint import pprint
url = 'https://www.producthunt.com/frontend/graphql'
payload = {"operationName":"PostPageCommentsSection","variables":{"commentsListSubjectThreadsCursor":"","commentsThreadRepliesCursor":"","slug":"weights-biases","includeThreadForCommentId":None,"commentsListSubjectThreadsLimit":10},"query":"query PostPageCommentsSection($slug:String$commentsListSubjectThreadsCursor:String=\"\"$commentsListSubjectThreadsLimit:Int!$commentsThreadRepliesCursor:String=\"\"$commentsListSubjectFilter:ThreadFilter$includeThreadForCommentId:ID$excludeThreadForCommentId:ID){post(slug:$slug){id canManage ...PostPageComments __typename}}fragment PostPageComments on Post{_id id slug name ...on Commentable{_id id canComment __typename}...CommentsSubject ...PostReviewable ...UserSubscribed meta{canonicalUrl __typename}__typename}fragment PostReviewable on Post{id slug name canManage featuredAt createdAt disabledWhenScheduled ...on Reviewable{_id id reviewsCount reviewsRating isHunter isMaker viewerReview{_id id sentiment comment{id body __typename}__typename}...on Commentable{canComment commentsCount __typename}__typename}meta{canonicalUrl __typename}__typename}fragment CommentsSubject on Commentable{_id id ...CommentsListSubject __typename}fragment CommentsListSubject on Commentable{_id id threads(first:$commentsListSubjectThreadsLimit after:$commentsListSubjectThreadsCursor filter:$commentsListSubjectFilter include_comment_id:$includeThreadForCommentId exclude_comment_id:$excludeThreadForCommentId){edges{node{_id id ...CommentThread __typename}__typename}pageInfo{endCursor hasNextPage __typename}__typename}__typename}fragment CommentThread on Comment{_id id isSticky replies(first:5 after:$commentsThreadRepliesCursor allForCommentId:$includeThreadForCommentId){edges{node{_id id ...Comment __typename}__typename}pageInfo{endCursor hasNextPage __typename}__typename}...Comment __typename}fragment Comment on Comment{_id id badges body bodyHtml canEdit canReply canDestroy createdAt isHidden path repliesCount subject{_id id ...on Commentable{_id id __typename}__typename}user{_id id headline name firstName username headline ...UserSpotlight __typename}poll{...PollFragment __typename}review{id sentiment __typename}...CommentVote ...FacebookShareButtonFragment __typename}fragment CommentVote on Comment{_id id ...on Votable{_id id hasVoted votesCount __typename}__typename}fragment FacebookShareButtonFragment on Shareable{id url __typename}fragment UserSpotlight on User{_id id headline name username ...UserImage __typename}fragment UserImage on User{_id id name username avatar headline isViewer ...KarmaBadge __typename}fragment KarmaBadge on User{karmaBadge{kind score __typename}__typename}fragment PollFragment on Poll{id answersCount hasAnswered options{id text imageUuid answersCount answersPercent hasAnswered __typename}__typename}fragment UserSubscribed on Subscribable{_id id isSubscribed __typename}"}
r = requests.post(url,json=payload)
for item in r.json()['data']['post']['threads']['edges']:
    pprint(item['node']['body'])
Output at this moment:
('Looks like such a powerful tool for extracting performance insights! '
'Absolutely love the documentation feature, awesome work!')
('This is awesome and so Any discounts or special pricing for '
'researchers/students/non-professionals?')
'Amazing. I think this is very helpful tools for us. Keep it up & go ahead.'
('<p>This simple system of record automatically saves logs from every '
'experiment, making it easy to look over the history of your progress and '
'compare new models with existing baselines.</p>\n'
'Pros: <p>Easy, fast, and lightweight experiment tracking</p>\n'
'Cons: <p>Only available for Python projects</p>')
('Very cool! I hacked together something similar but much more basic for '
"personal use and always wondered why TensorBoard didn't solve this problem. "
'I just wish this was open source! :) P.S. awesome use of the parallel '
'co-ordinates d3.js chart - great idea to apply it to experiment '
'configurations!')
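If you also want the replies each thread has received, they are in the same response; here is a sketch following the replies connection declared in the query above (field names taken from the query, but untested):
for item in r.json()['data']['post']['threads']['edges']:
    thread = item['node']
    pprint(thread['body'])
    for reply in thread['replies']['edges']:
        pprint(reply['node']['body'])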
Related
I'm trying to scrape some course data from the site https://bulletins.psu.edu/university-course-descriptions/undergraduate/ for a project.
# -*- coding: utf-8 -*-
"""
Created on Mon Nov 5 20:37:33 2018
#author: DazedFury
"""
# Here, we're just importing both Beautiful Soup and the Requests library
from bs4 import BeautifulSoup
import requests
# returns a CloudflareScraper instance
#scraper = cfscrape.create_scraper()
#URL and textfile
text_file = open("Output.txt", "w", encoding='UTF-8')
page_link = 'https://bulletins.psu.edu/university-course-descriptions/undergraduate/acctg/'
page_response = requests.get(page_link)
page_content = BeautifulSoup(page_response.content, "html.parser")
#Array for storing URL's
URLArray = []
#Find links
for link in page_content.find_all('a'):
    if('/university-course-descriptions/undergraduate' in link.get('href')):
        URLArray.append(link.get('href'))
k = 1
#Parse Loop
while(k != 242):
    print("Writing " + str(k))
    completeURL = 'https://bulletins.psu.edu' + URLArray[k]
    # this is the url that we've already determined is safe and legal to scrape from.
    page_link = completeURL
    # here, we fetch the content from the url, using the requests library
    page_response = requests.get(page_link)
    #we use the html parser to parse the url content and store it in a variable.
    page_content = BeautifulSoup(page_response.content, "html.parser")
    page_content.prettify
    #Find and print all text with tag p
    paragraphs = page_content.find_all('div', {'class' : 'course_codetitle'})
    paragraphs2 = page_content.find_all('div', {'class' : 'courseblockdesc'})
    j = 0
    for i in range(len(paragraphs)):
        if i % 2 == 0:
            text_file.write(paragraphs[i].get_text())
            text_file.write("\n")
            if j < len(paragraphs2):
                text_file.write(" ".join(paragraphs2[j].get_text().split()))
                text_file.write("\n")
                text_file.write("\n")
            if(paragraphs2[j].get_text() != ""):
                j += 1
    k += 1
#FORMAT
#text_file.write("<p style=\"page-break-after: always;\"> </p>")
#text_file.write("\n\n")
#Close Text File
text_file.close()
The specific info I need is the course title and the description. The problem is that some of the courses have blank descriptions, which messes up the order and gives bad data.
I thought about just checking if the course description is blank, but on the site the 'courseblockdesc' tag doesn't exist if the course has no description. Therefore, when I find_all courseblockdesc, the list doesn't actually add an element to the array, so the order ends up messed up. There are too many errors to fix manually, so I was hoping someone could help me find a solution to this.
The simplest solution would be to go through each item in one find_all for the parents of the items you are looking for.
for block in page_content.find_all('div', class_="courseblock"):
    title = block.find('div', {'class' : 'course_codetitle'})
    description = block.find('div', {'class' : 'courseblockdesc'})
    # do what you need with the navigable strings here.
    print(title.get_text())
    if description:
        print(description.get_text())
You may be over-complicating the procedure somewhat, but you're certainly on the right track. Instead of storing the information in an array and relying on all of the indexes to line up, write the text file as you traverse the courses, pulling title and description dynamically from each course block. If a block doesn't have a description, you can handle that on the spot. Here's a working example:
from bs4 import BeautifulSoup
import requests
url = "https://bulletins.psu.edu/university-course-descriptions/undergraduate/acctg/"
with open("out.txt", "w", encoding="UTF-8") as f:
for link in BeautifulSoup(requests.get(url).content, "html.parser").find_all("a"):
if "/university-course-descriptions/undergraduate" in link["href"]:
soup = BeautifulSoup(requests.get("https://bulletins.psu.edu" + link["href"]).content, "html.parser")
for course in soup.find_all("div", {"class": "courseblock"}):
title = course.find("div", {"class" : "course_title"}).get_text().strip()
try:
desc = course.find("div", {"class" : "courseblockdesc"}).get_text().strip()
except AttributeError:
desc = "No description available"
f.write(title + "\n" + desc + "\n\n")
Output snippet (from end of text file to validate alignment):
WLED 495: SPECIAL TOPICS
No description available
WLED 495B: Field Experience for World Languages Teacher Preparation in Grades 1-5
WL ED 495B Field Experience for World Languages Teacher Preparation in Grades 1-5 (3) Practicum situation where Prospective World Language teachers will demonstrate acquired knowledge on second language learning/teaching and educational theories. Prospective World Language teachers will have assigned school placements and will attend a weekly seminar where issues in World Language learning and teaching will be discussed. At their assigned school placement, prospective World Language teachers will have many opportunities to observe/work with children in grades 1-5 (1) focusing on second language learning/teaching and the socio/cultural issues associated to classroom practices while implementing and self-evaluated own designed activities and lessons; (2) weekly seminars will engage students in reflective activities that will enable them to analyze each week's events; (3) inquiry projects on teaching and learning of World Languages.
WLED 495C: Field Experience for World Languages Teacher Preparation in Grades 6-12
WL ED 495C Field Experience for World Languages Teacher Preparation in Grades 6-12 (3) Practicum situation where prospective World Language teachers will demonstrate acquired knowledge on second language learning/teaching and educational theories. Prospective World Language teachers will have assigned school placements in grades 6-12 and will attend a weekly seminar where issues in World Language learning and teaching will be discussed. At their assigned school placement, prospective World Language teachers will have many opportunities to observe/work with students in grades 6-12 (1) focusing on second language learning/teaching and the socio/cultural issues associated to classroom practices while implementing and self-evaluating their own designed activities and lessons, (2) weekly seminars will engage students in reflective activities that will enable them to analyze each week's events, and (3) inquiry projects on teaching and learning of World Languages.
Additional minor remarks:
It's a good idea to use the with keyword for file I/O. This will automatically close the file handle when done.
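For instance (illustration only, mirroring the file name used above):
with open("Output.txt", "w", encoding="UTF-8") as text_file:
    text_file.write("some text\n")
# no explicit text_file.close() needed; the handle closes when the block exits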
Verbose intermediate variables and comments that add noise like:
# Here, we're just importing both Beautiful Soup and the Requests library
from bs4 import BeautifulSoup
or
#Close Text File
text_file.close()
can always be removed, making the program logic easier to follow.
I am trying to get an item's title with the "GetSingleItem" method by providing the ItemID, but it does not work.
Here is the code:
from ebaysdk.shopping import Connection as Shopping
api = Shopping(appid='&',certid='&',devid='&',token='&')
ItemID=&
a = print (api.execute('GetSingleItem',{'ItemID':ItemID,'IncludeSelector':['Title']}))
print(a)
The response:
<ebaysdk.response.Response object at 0x003A3B10>
None
You don't need to specify the title in your GetSingleItem request; eBay's Shopping API provides that output field by default. You can check their documentation here.
It should be noted, however, that when using 'IncludeSelector' it should come before 'ItemID', as the order seems to matter. So your code should look like this.
api.execute('GetSingleItem', {'IncludeSelector':outputField,'ItemID':ItemID})
Where outputField could be
Compatibility,
Description, Details, ItemSpecifics, ShippingCosts, TextDescription, Variations
To answer your question simply execute:
response = api.execute('GetSingleItem', {'ItemID':ItemID})
title = response.dict()['Item']['Title']
print(title)
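Putting the pieces together, a minimal sketch (the credentials and item id below are placeholders, not real values):
from ebaysdk.shopping import Connection as Shopping

api = Shopping(appid='YOUR_APPID', certid='YOUR_CERTID',
               devid='YOUR_DEVID', token='YOUR_TOKEN')
# Title comes back by default, so no IncludeSelector is needed here
response = api.execute('GetSingleItem', {'ItemID': '000000000000'})
print(response.dict()['Item']['Title'])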
I think you need to put the ItemID like this:
{
"ItemID": "000000000000"
}
I am a beginner in web scraping and I have become very interested in the process.
I set myself a project that can keep me motivated until I complete it.
My Project
My aim is to write a Python program that goes to my university results page (which happens to be an "xx.asp" page) and enters my
EXAM NO
COURSE
SEMESTER, and submits them to the website.
Clicking the submit button leads to another "yy.asp" page in which my results are displayed. But I am having a lot of trouble doing this.
Some Sample Data to try it out
The Results Website: http://result.pondiuni.edu.in/candidate.asp
Register Number: 15te1218
Degree: BTHEE
Exam: Second
Could anyone give me directions of how I am to accomplish the task?
I have written a sample program that I am not really proud of, nor does it work as I wanted. The following is the code that I wrote. I am a beginner, so sorry if I did something terribly wrong. Please correct me, and it would be awesome if you could guide me to solving the problem.
The website is a .asp website not .aspx.
I have provided sample data so that you can see what's happening when we submit a request to the website.
The Code
import requests
with requests.Session() as c:
    url='http://result.pondiuni.edu.in/candidate.asp'
    url2='http://result.pondiuni.edu.in/ResultDisp.asp'
    TXTREGNO='15te1218'
    CMBDEGREE='BTHEE~\BTHEE\result.mdb'
    CMBEXAMNO='B'
    DPATH='\BTHEE\result.mdb'
    DNAME='BTHEE'
    TXTEXAMNO='B'
    c.get(url)
    payload = {
        'txtregno':TXTREGNO,
        'cmbdegree':CMBDEGREE,
        'cmbexamno':CMBEXAMNO,
        'dpath':DPATH,
        'dname':DNAME,
        'txtexamno':TXTEXAMNO
        }
    post_request = requests.post(url, data=payload)
    page=c.get(url2)
I have no idea what to do next so that I can retrieve my results page (displayed at url2 in the code). All the data is entered at url in the program (the starting link where all the info is entered), and submitting from there takes us to url2, the results page.
Please help me make this program.
I took all the post form parameters from Chrome's Network Tab.
You are way overcomplicating it, and you have carriage returns in your post data, so that could never work:
In [1]: s = "BTHEE~\BTHEE\result.mdb"
In [2]: print(s) # where did "\result.mdb" go?
esult.mdbHEE
In [3]: s = r"BTHEE~\BTHEE\result.mdb" # raw string
In [4]: print(s)
BTHEE~\BTHEE\result.mdb
So fix your form data and just post to get your results:
import requests
data = {"txtregno": "15te1218",
"cmbdegree": r"BTHEE~\BTHEE\result.mdb", # use raw strings
"cmbexamno": "B",
"dpath": r"\BTHEE\result.mdb",
"dname": "BTHEE",
"txtexamno": "B"}
results_page = requests.post("http://result.pondiuni.edu.in/ResultDisp.asp", data=data).content
To add to the answer already given, you can use bs4.BeautifulSoup to find the data you need in the result page afterwards.
#!/usr/bin/env python
import requests
from bs4 import BeautifulSoup
payload = {'txtregno': '15te1218',
           'cmbdegree': r'BTHEE~\BTHEE\result.mdb',
           'cmbexamno': 'B',
           'dpath': r'\BTHEE\result.mdb',
           'dname': 'BTHEE',
           'txtexamno': 'B'}
results_page = requests.post('http://result.pondiuni.edu.in/ResultDisp.asp', data = payload)
soup = BeautifulSoup(results_page.text, 'html.parser')
SubjectElem = soup.select("td[width='66%'] font")
MarkElem = soup.select("font[color='DarkGreen'] b")
Subject = []
Mark = []
for i in range(len(SubjectElem)):
    Subject.append(SubjectElem[i].text)
    Mark.append(MarkElem[i].text)
Transcript = dict(zip(Subject, Mark))
This will give a dictionary with the subject as a key and mark as a value.
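For example, to print the transcript out:
for subject, mark in Transcript.items():
    print(subject, ':', mark)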
I'm new to Python and figured that the best way to learn is by practice; this is my first project.
So there is this fantasy football website. My goal is to create a script which logs in to the site, automatically creates a preselected team and submits it.
I have managed to get to the team-submission part.
When I add a team member, this data gets sent to the server:
https://i.gyazo.com/e7e6f82ca91e19a08d1522b93a55719b.png
When I press "save this list", this data gets sent:
https://i.gyazo.com/546d49d1f132eabc5e6f659acf7c929e.png
Code:
import requests
with requests.Session() as c:
    gameurl = 'here is link where data is sent'
    BPL = ['5388', '5596', '5481', '5587',
           '5585', '5514', '5099', '5249', '5566', '5501', '5357']
    GID = '168'
    UDID = '0'
    ACT = 'draft'
    ACT2 = 'save_draft'
    SIGN = '18852c5f48a94bf3ee58057ff5c016af'
    # eleven of those with different BPL since 11 players needed:
    c.get(gameurl)
    game_data = dict(player_id = BPL[0], action = ACT, id = GID)
    c.post(gameurl, data = game_data)
    # now I need to submit my list of selected players:
    game_data_save = dict( action = ACT2, id = GID, user_draft_id = UDID, sign = SIGN)
    c.post(gameurl, data = game_data_save)
This code works fine, but the problem is that 'SIGN' is unique for each individual game and I have no idea how to get this data without using Chrome's inspect option.
How can I get this data simply running python code?
Because you said you can find it using devtools, I'm assuming SIGN is written somewhere in the DOM.
In that case you can use requests.get().text to get the HTML of the page and parse it with a tool like lxml or HTMLParser
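For example, a sketch with BeautifulSoup, assuming the token is rendered as a hidden form field; the field name 'sign' here is only a guess:
import requests
from bs4 import BeautifulSoup

with requests.Session() as c:
    html = c.get(gameurl).text                # gameurl as in the question
    soup = BeautifulSoup(html, 'html.parser')
    # assumption: the token sits in a hidden <input name="sign" value="...">
    sign_input = soup.find('input', {'name': 'sign'})
    SIGN = sign_input['value'] if sign_input else None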
Solved it by posting all the data without 'SIGN'; in return I got 'SIGN' back in the HTML.
My final goal is import some data from Google Site pages.
I'm trying to use gdata-python-client (v2.0.17) to download a specific Content Feed:
self.client = gdata.sites.client.SitesClient(source=SOURCE_APP_NAME)
self.client.client_login(USERNAME, PASSWORD, source=SOURCE_APP_NAME, service=self.client.auth_service)
self.client.site = SITE
self.client.domain = DOMAIN
uri = '%s?path=%s' % (self.client.MakeContentFeedUri(), '[PAGE PATH]')
feed = self.client.GetContentFeed(uri=uri)
entry = feed.entry[0]
...
The resulting entry.content has the page content in XHTML format. But this tree doesn't contain any plain text data from the page, only the HTML page structure and links.
For example my test page has
<div>Some text</div>
The ContentFeed entry has only the div node, with text=None.
I have debugged the gdata-python-client request/response and checked the raw data returned from the server: there is no plain text data in the content, so it looks like a Google API bug.
Maybe there is some workaround? Maybe I can use some common request parameter? What's going wrong here?
This code works for me against a Google Apps domain and gdata 2.0.17:
import atom.data
import gdata.sites.client
import gdata.sites.data
client = gdata.sites.client.SitesClient(source='yourCo-yourAppName-v1', site='examplesite', domain='example.com')
client.ClientLogin('admin@example.com', 'examplepassword', client.source)
uri = '%s?path=%s' % (client.MakeContentFeedUri(), '/home')
feed = client.GetContentFeed(uri=uri)
entry = feed.entry[0]
print entry
Granted, it's pretty much identical to yours, but it might help you prove or disprove something. Good luck!