BeautifulSoup table data extraction - data not showing up - python


As you yourself found out, the element is not present in the page source, and is loaded dynamically through an AJAX request. The urllib module (or requests) returns the page source, which is why you won't be able to get that value directly.
Go to Developer Tools > Network > XHR and refresh the page. You'll see an AJAX request made to this url:
https://ethplorer.io/service/service.php?data=0x8b353021189375591723e7384262f45709a3c3dc
This url returns the data as JSON. If you have a look at it, you can get the Holders number from it using the requests module and the built-in .json() method.
import requests

# The AJAX endpoint found under Network > XHR; it returns JSON.
r = requests.get('https://ethplorer.io/service/service.php?data=0x8b353021189375591723e7384262f45709a3c3dc')
data = r.json()
holders = data['pager']['holders']['total']
print(holders)
# 2346

Related

Twitter scraping using Python

I've been working on a project to reverse-engineer Twitter's app and scrape public posts from Twitter using an unofficial API, with Python. (I want to create an "alternative" app, which is simply a localhost site that can search for a user and get their posts.)
I've been searching and reading everything related to REST, AJAX, and the Python modules requests, requests-html, BeautifulSoup, and more.
Looking at twitter in the devtools (for example on Marvel's profile page), I can see that the only relevant requests being sent (by POST and GET) are client_event.json and UserTweets?variables=....
I figured out that these are the relevant messages by clearing the network tab and recording only while scrolling down to load new tweets - these were the only messages that came up which weren't random videos (I cleaned the search using -video -init -csp_report -config -ondemand -like -pageview -recommendations -prefetch -jot -key_live_kn -svg -jpg -jpeg -png -ico -analytics -loader -sharedCore -Hebrew).
I am new to this field, so I am probably doing something wrong. I can see in UserTweets the response I'm looking for - a beautiful JSON with all the data I need - but no matter how much I try, I am unable to access it.
I tried different modules and different headers, and I get nothing. I DON'T want to use Selenium since it's tiresome, and I know where the data I need is stored.
I've been trying to send a GET request to:
https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets?variables=%7B%22userId%22%3A%2215687962%22%2C%22count%22%3A20%2C%22cursor%22%3A%22HBaIgLLN%2BKGEryYAAA%3D%3D%22%2C%22withHighlightedLabel%22%3Atrue%2C%22withTweetQuoteCount%22%3Atrue%2C%22includePromotedContent%22%3Atrue%2C%22withTweetResult%22%3Afalse%2C%22withUserResults%22%3Afalse%2C%22withVoice%22%3Afalse%2C%22withNonLegacyCard%22%3Atrue%7D
by doing:
from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()  # create the session first
response = session.get('https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets?variables=%7B%22userId%22%3A%2215687962%22%2C%22count%22%3A20%2C%22cursor%22%3A%22HBaIgLLN%2BKGEryYAAA%3D%3D%22%2C%22withHighlightedLabel%22%3Atrue%2C%22withTweetQuoteCount%22%3Atrue%2C%22includePromotedContent%22%3Atrue%2C%22withTweetResult%22%3Afalse%2C%22withUserResults%22%3Afalse%2C%22withVoice%22%3Afalse%2C%22withNonLegacyCard%22%3Atrue%7D')
response.html.render()
s = BeautifulSoup(response.html.html, 'lxml')
but I get back HTML that either says Chromium is unsupported, or is just the static page without the javascript updating the DOM.
All help appreciated.
Thank you
P.S. I've posted the same question on reverseengineering.stackexchange, just to be safe (overflow has more appropriate tags :-)).
Before you deep-dive into the actual code, I would first start by building the correct request to twitter. I would use a 3rd-party tool focused on REST and APIs, such as Postman, to build and test the required request - and only then write the actual code.
From your question it seems that you'll be using an open API of twitter, which means you'll only need to send the x-guest-token and basic Bearer authorization in your request headers.
The Bearer is static - you can just browse to twitter and copy/paste it from the dev tools network monitor.
To get the x-guest-token you'll need something dynamic, because it expires. What I would suggest is to send a curl request to twitter, parse the token from the response, and put it in your header before sending the request. You can see something very similar in: Python Downloading twitter video using python (without using twitter api).
After you have both of the above, build the required GET request in Postman and test that you get back the correct response. Only after you have everything working in Postman - write the same in Python, or any other language.**
**You can use Postman snippets which automatically generates the code needed in many programming languages.
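For illustration, a minimal sketch of that token flow in requests - the activate.json endpoint and the guest_token response field match the curl example in the last answer below, and the Bearer placeholder is something you must replace with the value copied from the dev tools:
import requests

# Paste the static Bearer value copied from the dev tools network monitor.
BEARER = "Bearer AAAA...your-copied-token..."

# The x-guest-token expires, so request a fresh one each run.
resp = requests.post(
    "https://api.twitter.com/1.1/guest/activate.json",
    headers={"authorization": BEARER},
)
guest_token = resp.json()["guest_token"]

# These two headers should be all the open API needs.
headers = {"authorization": BEARER, "x-guest-token": guest_token}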
#TripleS, here is an example of how one may extract JSON data from __INITIAL_STATE__ and write it to a text file.
import requests
import re
import json

# get page
result = requests.get('https://twitter.com/ThePSF')

# Extract json from "window.__INITIAL_STATE__={....};"
json_string = re.search(r"window\.__INITIAL_STATE__\s?=\s?(\{.*?\});", result.text).group(1)

# convert text string to structured json data
twitter_json = json.loads(json_string)

# Save structured json data to a text file that may help
# you to orient yourself and possibly pick some parts you
# are interested in (if there are any)
with open('twitter_json_data.txt', 'w') as outfile:
    outfile.write(json.dumps(twitter_json, indent=4, sort_keys=True))
I've just tried the same, but with the requests module rather than requests_html. I could get all the site contents, but I would not call it "beautiful".
Also, I am now blocked from accessing the site without logging in.
Here is my small example.
Use the official Twitter API instead.
I also think that I will probably be blocked after a few runs of this script. I've tried it only 2 times.
import requests
import bs4

def example():
    result = requests.get("https://twitter.com/childrightscnct")
    soup = bs4.BeautifulSoup(result.text, "lxml")
    print(soup)

if __name__ == '__main__':
    example()
To select a single element with bs4, use select_one (select returns a list, so calling getText() on it directly would fail):
some_text = soup.select_one('locator').getText()
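For instance, a small snippet using the same page as above (the 'title' locator is just an illustration; select_one returns None when nothing matches, so check before calling getText()):
import requests
import bs4

result = requests.get("https://twitter.com/childrightscnct")
soup = bs4.BeautifulSoup(result.text, "lxml")

# 'title' is only an example locator; replace it with the element you need.
tag = soup.select_one("title")
if tag:
    print(tag.getText())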
I found one tool for scraping Twitter that has quite a lot of stars on GitHub: https://github.com/twintproject/twint. I did not try it myself and hope it is legal.
What you're missing is the bearer and guest token needed to make your request. If I just hit your endpoint with curl and no headers I get no response. However, if I add headers for the bearer token and guest token then I get that json you're looking for:
curl https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets?variables=%7B%22userId%22%3A%2215687962%22%2C%22count%22%3A20%2C%22cursor%22%3A%22HBaIgLLN%2BKGEryYAAA%3D%3D%22%2C%22withHighlightedLabel%22%3Atrue%2C%22withTweetQuoteCount%22%3Atrue%2C%22includePromotedContent%22%3Atrue%2C%22withTweetResult%22%3Afalse%2C%22withUserResults%22%3Afalse%2C%22withVoice%22%3Afalse%2C%22withNonLegacyCard%22%3Atrue%7D -H 'authorization: Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA' -H 'x-guest-token: 1452696114205847552'
You can get the bearer token (which may not expire that often) and the guest token (which does expire, I think) like this:
The html of the twitter page you go to links a file called main.some random numbers.js. Within that javascript file is the bearer token. You can recognize it because it is a long string starting with lots of A's.
Take the bearer token and call https://api.twitter.com/1.1/guest/activate.json using the bearer token as an authorization header
curl 'https://api.twitter.com/1.1/guest/activate.json' -X POST -H 'authorization: Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA'
In python this looks like:
import requests
import json
url = "https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets?variables=%7B%22userId%22%3A%2215687962%22%2C%22count%22%3A20%2C%22cursor%22%3A%22HBaIgLLN%2BKGEryYAAA%3D%3D%22%2C%22withHighlightedLabel%22%3Atrue%2C%22withTweetQuoteCount%22%3Atrue%2C%22includePromotedContent%22%3Atrue%2C%22withTweetResult%22%3Afalse%2C%22withUserResults%22%3Afalse%2C%22withVoice%22%3Afalse%2C%22withNonLegacyCard%22%3Atrue%7D"
headers = {"authorization": "Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA", "x-guest-token": "1452696114205847552"}
resp = requests.get(url, headers=headers)
j = json.loads(resp.text)
And now that variable, j, holds your beautiful json. One warning: sometimes the response can be so big that it doesn't fit into a single response. If this happens, you'll notice that resp.text isn't valid json, but just some portion of a big blob of json. To fix this, you'll just need to adapt the request to use stream=True and stream out the whole response before you try to parse it as json.
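A rough sketch of that adaptation, reusing the url and headers from the snippet above (the chunk size is arbitrary):
import json
import requests

resp = requests.get(url, headers=headers, stream=True)

# Collect the streamed body in full before parsing it as json.
body = b"".join(chunk for chunk in resp.iter_content(chunk_size=8192))
j = json.loads(body)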

send a post request to a website with multiple form tags using requests in python

Good evening,
I'm trying to write a program that extracts the sell price of certain stocks and shares on a website called hl.co.uk.
As you can imagine, you have to search for the stock you want to see the sale price of.
My code so far is as follows:
import requests
from bs4 import BeautifulSoup as soup
url = "https://www.hl.co.uk/shares"
page = requests.get(url)
parsed_html = soup(page.content, 'html.parser')
form = parsed_html.find('form', id="stock_search")
input_tag = form.find('input').get('name')
submit = form.find('input', id="stock_search_submit").get('alt')
post_data = {input_tag: "fgt", "alt": submit}
I have been able to extract the correct form tag and the input names I require, but the website has multiple forms on this page.
How can I submit a POST request to that specific form using the data I have in "post_data", so that it searches for the stock/share I desire and gives me the next page?
Thanks in advance
Actually, when you submit the form from the homepage, it redirects you to the target page with a url looking like this: "https://www.hl.co.uk/shares/search-for-investments?stock_search_input=abc&x=56&y=35&category_list=CEHGINOPW". So in my opinion, instead of submitting the homepage form, you should directly call the target page with your own GET parameters; the url you're supposed to call will look like this: https://www.hl.co.uk/shares/search-for-investments?stock_search_input=[your_keywords]
Hope this helps.
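In code, that direct call could look something like this (the keyword "fgt" is taken from your post_data; whether the extra x/y/category_list parameters are required is an assumption you'd have to test):
import requests
from bs4 import BeautifulSoup as soup

# Call the results page directly instead of submitting the homepage form.
params = {"stock_search_input": "fgt", "category_list": "CEHGINOPW"}
page = requests.get("https://www.hl.co.uk/shares/search-for-investments", params=params)
results = soup(page.content, "html.parser")
print(results.title)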
This is a pretty general problem which you can use Google Chrome's devtools to solve. Basically:
1- Navigate to the page with the form and its fields. In your case, that is the hl.co.uk shares page with the stock search form.
2- Choose the XHR tab under the Network tab, which filters out all Fetch and XHR requests. These requests are generally sent after a form submission, and most of the time they return JSON with the resulting data.
3- Make sure you enable the Preserve log checkbox at the top left so the list doesn't refresh when the form is submitted.
4- Submit the form; you'll then see a bunch of requests being made. Inspect them to hopefully find what you're looking for.
In this case I found this URL endpoint which gives out the results as the response:
https://www.hl.co.uk/ajax/funds/fund-search/search?investment=&companyid=1324&sectorid=132&wealth=&unitTypePref=&tracker=&payment_frequency=&payment_type=&yield=&standard_ocf=&perf12m=&perf36m=&perf60m=&fund_size=&num_holdings=&start=0&rpp=20&lo=0&sort=fd.full_description&sort_dir=asc&
You can see the query parameters here, such as companyid and sectorid; what you need to do is change those and make a request to that URL. Then you'll get the relevant information.
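As a sketch, here is how one might call that endpoint with requests, keeping only the parameters of interest (whether the server accepts a trimmed-down query string, and whether the response is JSON rather than html, are assumptions to verify in the XHR tab):
import requests

params = {
    "companyid": "1324",  # swap in the ids you are after
    "sectorid": "132",
    "start": "0",
    "rpp": "20",
    "sort": "fd.full_description",
    "sort_dir": "asc",
}
r = requests.get("https://www.hl.co.uk/ajax/funds/fund-search/search", params=params)
print(r.text)  # inspect first; parse with r.json() if the response is JSON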
To retrieve those companyid and sectorid values, you can send a GET request to the page https://www.hl.co.uk/shares/search-for-investments?stock_search_input=ftg&x=17&y=23&category_list=CEHGINOPW, which has those dropdowns, and filter the html to find the values.
You can check the BS4 documentation for finding tags inside the HTML source: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find
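A hedged sketch of that lookup - the assumption (to be verified in the page source) is that the filters are rendered as <select> elements whose name attributes match the companyid/sectorid query parameters:
import requests
from bs4 import BeautifulSoup

url = ("https://www.hl.co.uk/shares/search-for-investments"
       "?stock_search_input=ftg&x=17&y=23&category_list=CEHGINOPW")
html = BeautifulSoup(requests.get(url).text, "html.parser")

# Assumed markup: <select name="companyid"> / <select name="sectorid">.
for name in ("companyid", "sectorid"):
    dropdown = html.find("select", {"name": name})
    if dropdown is not None:
        for option in dropdown.find_all("option"):
            print(name, option.get("value"), option.get_text(strip=True))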

Python POST requests - how to extract html of request destination

I'm scraping mortgage data from the official mortgage registry. The problem is that I can't extract the html of a particular document. Everything happens via POST - I have all of the data required to build the precise POST request, but still, when I print request.url it shows me the welcome screen page. It should retrieve the html of the particular document. All the data, like the mortgage number or the current page, is listed in dev tools > Network > Form Data, so I bet it must be possible. I'm quite new to web Python, so I will appreciate any help.
My code:
import requests

data = {
    'kodWydzialu': 'PT1R',
    'nrKw': '00037314',
    'cyfraK': '9',
}

r = requests.post('https://przegladarka-ekw.ms.gov.pl/eukw_prz/KsiegiWieczyste/wyszukiwanieKW', data=data)
print(r.url)
print(r.content)
You are getting the welcome screen because you aren't sending all the requests required to view the next page.
Go to Chrome > Network tab, and you will see that when you click the submit/search button, a bunch of other GET requests are sent to different URLs after that first POST request.
You need to replicate that in your script. Depending on the website, it can be tough to get the response, so you should consider using Selenium.
That said, it's not impossible to do this with requests:
session = requests.Session()
You need to send the POST request, and all the other GET requests that follow, in the same session.
data = {
    'kodWydzialu': 'PT1R',
    'nrKw': '00037314',
    'cyfraK': '9',
}

# Send the form fields in the POST body (they appear under Form Data in dev tools)
session.post(URL, headers=headers, data=data)
# Start sending the GET requests
session.get(URL_1, headers=headers)
session.get(URL_2, headers=headers)

Bypassing intrusive cookie statement with requests library

I'm trying to crawl a website using the requests library. However, the particular website I am trying to access (http://www.vi.nl/matchcenter/vandaag.shtml) has a very intrusive cookie statement.
I am trying to access the website as follows:
from bs4 import BeautifulSoup as soup
import requests
website = r"http://www.vi.nl/matchcenter/vandaag.shtml"
html = requests.get(website, headers={"User-Agent": "Mozilla/5.0"})
htmlsoup = soup(html.text, "html.parser")
This returns a web page that consists of just the cookie statement with a big button to accept. If you try accessing this page in a browser, you find that pressing the button redirects you to the requested page. How can I do this using requests?
I considered using mechanize.Browser but that seems a pretty roundabout way of doing it.
Try setting:
cookies = dict(BCPermissionLevel='PERSONAL')
html = requests.get(website, headers={"User-Agent": "Mozilla/5.0"}, cookies=cookies)
This will bypass the cookie consent page and land you straight on the page.
Note: You can find the above by analyzing the javascript code that runs on the cookie consent page; it is a bit obfuscated, but it should not be difficult. If you run into the same type of problem again, take a look at what kind of cookies the javascript code that is executed on an event's handling sets.
I have found this SO question which asks how to send cookies in a post using requests. The accepted answer states that the latest build of Requests will build CookieJars for you from simple dictionaries. Below is the POC code included in the original answer.
import requests
cookie = {'enwiki_session': '17ab96bd8ffbe8ca58a78657a918558'}
r = requests.post('http://wikipedia.org', cookies=cookie)

Python web scraping requests follow redirect

I'm trying to scrape a web site with the requests module.
Using Chrome and inspect elements, I go to the url, fill in a form, and click the continue button. Chrome's inspect elements (Network > documents) shows what Chrome sent with the POST. It also shows multiple cookies. The site redirects to a url containing, among other things, a session ID.
To simulate this, I try using requests. I take the form data from inspect elements and reformat it into a dictionary. I use requests.session to include the cookies.
import requests

form_data = 'currentCalForm=dep&currentCodeForm=&tripType=oneWay&searchCategory=award&originAirport=JFK&flightParams.flightDateParams.travelMonth=5&flightParams.flightDateParams.travelDay=14&flightParams.flightDateParams.searchTime=040001&destinationAirport=LHR&returnDate.travelMonth=-1000&returnDate.travelDay=-1000&adultPassengerCount=2&adultPassengerCount=1&serviceclass=coach&searchTypeMode=matrix&awardDatesFlexible=true&originAlternateAirportDistance=0&destinationAlternateAirportDistance=0&discountCode=&flightSearch=award&dateChanged=false&fromSearchPage=true&advancedSearchOpened=false&numberOfFlightsToDisplay=10&searchCategory=&aairpassSearchType=false&moreOptionsIndicator=oneWay&seniorPassengerCount=0&youngAdultPassengerCount=0&childPassengerCount=0&infantPassengerCount=0&passengerCount=2'.split('&')

# Build the payload dict, skipping empty values.
payload = {}
for item in form_data:
    key, value = item.split('=')
    if value:
        payload[key] = value

with requests.session() as s:
    r = s.post('https://www.aa.com/homePage.do', params=payload, allow_redirects=True)
    print(r.headers)
    print(r.history)
    print(r.url)
    print(r.status_code)
    with open('x.htm', 'wb') as f:
        f.write(r.text.encode('utf8'))
requests, however, does not appear to follow the redirect. history is empty and the url appears to contain the data I sent rather than what the site returned. x.htm shows a web page, but does not contain the info I expected.
From http://docs.python-requests.org/en/latest/user/quickstart/#redirection-and-history I expected r.url to contain the redirected url and r.history to contain an http response code.
What am I doing wrong?
OK, what you do seems to be wrong. I am not sure how you decided to send a POST to https://www.aa.com/homePage.do, but that seems to be a GET and doesn't take the params you send. When you click search, your browser sends this POST: https://www.americanairlines.co.uk/reservation/searchFlightsSubmit.do;jsessionid=XXXXXXXXXXXXXXXXXXX with the following parameters:
currentCalForm=dep
currentCodeForm=
tripType=roundTrip
originAirport=LAX
flightParams.flightDateParams.travelMonth=10
flightParams.flightDateParams.travelDay=24
flightParams.flightDateParams.searchTime=040001
destinationAirport=JFK
returnDate.travelMonth=10
returnDate.travelDay=31
returnDate.searchTime=400001
adultPassengerCount=1
adultPassengerCount=1
childPassengerCount=0
hotelRoomCount=1
serviceclass=coach
searchTypeMode=matrix
awardDatesFlexible=true
originAlternateAirportDistance=0
destinationAlternateAirportDistance=0
discountCode=
flightSearch=revenue
dateChanged=false
fromSearchPage=true
advancedSearchOpened=false
numberOfFlightsToDisplay=10
searchCategory=
aairpassSearchType=false
moreOptionsIndicator=
seniorPassengerCount=0
youngAdultPassengerCount=0
infantPassengerCount=0
passengerCount=1
This will then give you html back. Pretty much, you have to send all the requests the browser sends; it might be easier for you to do it with Selenium.
I found this using HttpFox, which is probably similar to Chrome's Network tab.
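Putting that together, a hedged sketch of the POST with a session - the jsessionid path segment is dropped on the assumption that the session cookie carries it, and only a subset of the fields listed above is shown (add the rest the same way):
import requests

# Form fields copied from the list above.
params = {
    "currentCalForm": "dep",
    "tripType": "roundTrip",
    "originAirport": "LAX",
    "destinationAirport": "JFK",
    "flightParams.flightDateParams.travelMonth": "10",
    "flightParams.flightDateParams.travelDay": "24",
    "flightParams.flightDateParams.searchTime": "040001",
    "returnDate.travelMonth": "10",
    "returnDate.travelDay": "31",
    "returnDate.searchTime": "400001",
    "adultPassengerCount": "1",
    "serviceclass": "coach",
    "flightSearch": "revenue",
    "passengerCount": "1",
}

with requests.Session() as s:
    # Hit the homepage first so the session collects its cookies.
    s.get("https://www.aa.com/homePage.do")
    r = s.post(
        "https://www.americanairlines.co.uk/reservation/searchFlightsSubmit.do",
        data=params,
    )
    print(r.url, r.status_code)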
