I am learning web scraping as my first mini-project and am currently working with Python. I want to extract weather data and show the weather where I live. I have found the data I need by inspecting the tags, but my code keeps giving me all the numbers in the weather forecast table instead of the specific one I need. I also tried writing its specific index number, but that did not work either. This is the code I have written so far:
import requests
from bs4 import BeautifulSoup as bs

url = "http://kktcmeteor.org/tahminler/merkezler?m=ISKELE"
r = requests.get(url)
cast = bs(r.content, "lxml")
# grab every div with the class I found while inspecting the page
wthr = cast.findAll("div", {"class": "col-md-9"})
print(wthr)
Any help would be greatly appreciated. The data I want is the temperature.
Also, can somebody explain the difference between using lxml and html.parser? I have seen both parsers used widely and was curious how you would decide to use one over the other.
I think the element you want is <div class="temp">24.2 °c</div>.
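If that's the element, you can target it directly instead of grabbing the whole col-md-9 block. A minimal sketch, assuming the class name on the live page really is temp (check it with Inspect):

import requests
from bs4 import BeautifulSoup as bs

url = "http://kktcmeteor.org/tahminler/merkezler?m=ISKELE"
cast = bs(requests.get(url).content, "lxml")

temp_div = cast.find("div", {"class": "temp"})   # assumed class name
if temp_div:
    print(temp_div.text.strip())                 # e.g. "24.2 °c"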
If your primary focus is just the temperature, consider using a weather API instead of scraping. There are several public APIs you could find here.
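For example, Open-Meteo is a free public weather API. A minimal sketch; the coordinates below are rough values for İskele and the query parameters are assumptions from memory of the provider's docs, so verify them before relying on this:

import requests

# assumed coordinates for İskele; adjust to your location
params = {"latitude": 35.28, "longitude": 33.89, "current_weather": "true"}
resp = requests.get("https://api.open-meteo.com/v1/forecast", params=params)
print(resp.json()["current_weather"]["temperature"])   # temperature in °C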
The legality of scraping should be considered before you start. You can find an overview here: https://www.tutorialspoint.com/python_web_scraping/legality_of_python_web_scraping.htm
This site doesn't have a robots.txt file, so crawling it is not explicitly restricted.
Here is a very simplified way to get the table data published at the URL you used in the question. This uses html.parser to extract the data:
import requests
from bs4 import BeautifulSoup

def get_soup(my_url):
    HTML = requests.get(my_url)
    my_soup = BeautifulSoup(HTML.text, 'html.parser')
    if my_soup is not None:
        return my_soup
    return None

url = "http://kktcmeteor.org/tahminler/merkezler?m=ISKELE"

# get the whole html document
soup = get_soup(url)

# get something from that soup
# here a table header and data are extracted from the soup
table_header = soup.find("table").findAll("th")
table_data = soup.find("table").findAll("td")

# header's and data's type is list
# combine lists
for x in range(len(table_header)):
    print(table_header[x].text + ' --> ' + table_data[x].text)
""" R e s u l t :
Tarih / Saat -->
Hava --> COK BULUTLU
Sıcaklık --> 27.5°C
İşba Sıcaklığı --> 17.9°C
Basınç --> 1003.5 hPa
Görüş --> 10 km
Rüzgar --> Batıdan (270) 5 kt.
12.06.2022 13:00 --> Genel Tablo Genel Harita
"""
This is just one way to do it, and it gets only the part shown in the transparent table on the site. Once more, take care to follow any instructions stated in the site's robots.txt file. Regards...
Did you check whether the web service has an API you can use? Many weather apps have APIs you can use for free if you stay under a certain limit of requests. If there is one, you could easily request only the data you need, so there is no need to format it yourself.
So I am trying to scrape the following webpage: https://www.omscentral.com/
The main table there is my item of interest. I want to scrape the table and all of its content. When I inspect the content of the page, the table is in a table tag, so I figured it would be easy to access with the code below.
import requests
from bs4 import BeautifulSoup

url = 'https://www.omscentral.com/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
soup.find_all('table')
However, that code only returns the table header. I saw a similar example here, but the solution of switching the parser did not work.
When I look at the soup object itself, it seems that requests does not expand the table and only captures the header. Not too sure what to do here; any advice would be much appreciated!
The content is stored in a script tag and rendered dynamically, so you have to extract the data from there:
data = json.loads(soup.select_one('#__NEXT_DATA__').text)['props']['pageProps']['courses']
To display it in a DataFrame, simply use:
pd.DataFrame(data)
Example
import requests, json
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://www.omscentral.com/'

soup = BeautifulSoup(requests.get(url, headers=headers).text, 'html.parser')
# the page is a Next.js app: the course data sits in the __NEXT_DATA__ script tag
data = json.loads(soup.select_one('#__NEXT_DATA__').text)['props']['pageProps']['courses']

for item in data:
    print(item['name'], item.get('officialURL'))
Output
Introduction to Information Security https://omscs.gatech.edu/cs-6035-introduction-to-information-security
Computing for Good https://omscs.gatech.edu/cs-6150-computing-good
Introduction to Operating Systems https://omscs.gatech.edu/cs-6200-introduction-operating-systems
Advanced Operating Systems https://omscs.gatech.edu/cs-6210-advanced-operating-systems
Secure Computer Systems https://omscs.gatech.edu/cs-6238-secure-computer-systems
Computer Networks https://omscs.gatech.edu/cs-6250-computer-networks
...
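If you also want the DataFrame view mentioned above, the same data list can be fed to pandas. A sketch; the columns are whatever keys appear in the JSON, such as the name and officialURL fields seen in the loop above:

import pandas as pd

df = pd.DataFrame(data)                       # one row per course
print(df[['name', 'officialURL']].head())     # columns observed in the JSON above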
OK, so I'm working on a self-directed term project for my college programming course. My plan is to scrape different parts of the Overwatch League website for stats etc., save them in a DB, and then pull from that DB with a Discord bot. However, I'm running into issues with the website itself. Here's a screenshot of the HTML for the standings page.
As you can see, it's quite convoluted and hard to navigate, with repeated div and body tags, and I'm pretty sure it's dynamically created. My prof recommended that I find a way to isolate the rank title at the top of the table, access its parent, and then iterate through the siblings to pull data such as the team name, position, etc. into a dictionary for now. I haven't been able to find anything online that helps me; most websites don't provide enough information or are out of date.
Here's what I have so far:
from bs4 import BeautifulSoup
import requests
import re
import pprint

url = 'https://overwatchleague.com/en-us/standings'
response = requests.get(url).text
page = BeautifulSoup(response, features='html.parser')

# for stat in page.find(string=re.compile("rank")):
#     statObject = {
#         'standing' : stat.find(string=re.compile, attrs={'class' : 'standings-table-v2styles__TableCellContent-sc-3q1or9-6 jxEkss'}).text.encode('utf-8')
#     }
# print(page.find_all('span', re.compile("rank")))
# for tag in page.find_all(re.compile("rank")):
#     print(tag.name)

print(page.find(string=re.compile('rank')))

"""
# locate the branch with the rank header,
# move up to the parent branch,
# iterate through all the siblings and
# save the data to objects
"""
The comments are all failed attempts and all return nothing. The only line not commented out returns a massive JSON blob with a lot of unnecessary information, which does include what I want to parse out and use for my project. I've linked it as a Google Doc and highlighted what I'm looking to grab.
I'm not really sure how else to approach this at this point. I've considered using Selenium, but I lack knowledge of JavaScript, so I'm trying to avoid it if possible. Even if you could just comment with some advice on how else to approach this, I would greatly appreciate it.
Thank you
As you have noticed, your data is in JSON format. It is embedded in a script tag directly in the page, so it's easy to get with BeautifulSoup. Then you need to parse the JSON to extract all the tables (corresponding to the 3 tabs):
import requests
from bs4 import BeautifulSoup
import json

url = 'https://overwatchleague.com/en-us/standings'

r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

# the standings data is embedded in the Next.js __NEXT_DATA__ script tag
script = soup.find("script", {"id": "__NEXT_DATA__"})
data = json.loads(script.text)

tabs = [
    i.get("standings")["tabs"]
    for i in data["props"]["pageProps"]["blocks"]
    if i.get("standings") is not None
]

result = [
    {i["title"]: i["tables"][0]["teams"]}
    for i in tabs[0]
]

print(json.dumps(result, indent=4, sort_keys=True))
The above code gives you a list of dictionaries; the keys are the titles of the 3 tabs and the values are the table data.
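As a small usage sketch of the result list built above (assuming each tab's teams value is a list, as the indexing above suggests), you can print each tab title and how many team entries it holds:

for tab in result:
    for title, teams in tab.items():
        print(title, '-', len(teams), 'team entries')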
I am trying to web scrape and am currently stuck on how to continue with the code. I am trying to create a script that scrapes the first 80 Yelp reviews. Since there are only 20 reviews per page, I am also stuck on figuring out how to create a loop that changes the webpage to the next 20 reviews.
from bs4 import BeautifulSoup
import requests
import time

all_reviews = ''

def get_description(pullman):
    url = f'https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city'
    # get webpage data from url
    response = requests.get(url)
    # sleep for 2 seconds
    time.sleep(2)
    # get html document from web page data
    html_doc = response.text
    # parser
    soup = BeautifulSoup(html_doc, "lxml")
    page_title = soup.title.text
    # get a tag's content based on class
    p_tag = soup.find_all('p', class_='lemon--p__373c0__3Qnnj text__373c0__2pB8f comment__373c0__3EKjH text-color--normal__373c0__K_MKN text-align--left__373c0__2pnx_')[0]
    # print the text within the tag
    return p_tag.text
General notes/tips:
Use the "Inspect" tool on pages you want to scrape.
As for your question, it's also going to work much more nicely if you visit the website once, parse it with BeautifulSoup, and then pass the soup object to your functions: visit once, parse as many times as you want. You won't be blacklisted by websites as often this way. An example structure is below.
url = f'https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city'
# get webpage data from url
response = requests.get(url)
#sleep for 2 seconds
time.sleep(2)
# get html document from web page data
html_doc = response.text
# parser
soup = BeautifulSoup(html_doc, "lxml")
get_description(soup)
get_reviews(soup)
If you inspect the page, each review appears as a copy of a template. If you take each review as an individual object and parse it, you can get the reviews you are looking for. The review template has the class: lemon--li__373c0__1r9wz u-space-b3 u-padding-b3 border--bottom__373c0__uPbXS border-color--default__373c0__2oFDT
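A minimal sketch of that idea, reusing the class string quoted above (it is auto-generated and must match the full class attribute exactly, so it may have changed since):

review_class = ('lemon--li__373c0__1r9wz u-space-b3 u-padding-b3 '
                'border--bottom__373c0__uPbXS border-color--default__373c0__2oFDT')

# each <li> with this class is one review; its first <p> holds the comment text
for review in soup.find_all('li', class_=review_class):
    p = review.find('p')
    if p:
        print(p.text)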
As for pagination, the pagination numbers are contained in a template with class="lemon--div__373c0__1mboc pagination-links__373c0__2ZHo6 border-color--default__373c0__2oFDT nowrap__373c0__1_N1j"
The individual page-number links are contained within <a href> tags, so just write a for loop to iterate over the links.
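For example, again using the pagination class quoted above (which may also have changed):

pagination_class = ('lemon--div__373c0__1mboc pagination-links__373c0__2ZHo6 '
                    'border-color--default__373c0__2oFDT nowrap__373c0__1_N1j')

pagination = soup.find('div', class_=pagination_class)
if pagination:
    for a in pagination.find_all('a', href=True):
        print(a['href'])    # follow each page link in turn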
To get the next page, you're going to have to follow the "Next" link. The problem here is that the link is just the same URL as before plus #. Open the Inspector [Ctrl-Shift-I in Chrome, Firefox], switch to the Network tab, then click the Next button; you'll see a request to something like:
https://www.yelp.com/biz/U4mOl3TRbaJ9-bgTQ1d6fw/review_feed?rl=en&sort_by=relevance_desc&q=&start=40
whose response looks something like:
{"reviews": [{"comment": {"text": "Such a great experience every time you come into this place...
This is JSON. The only problem is that you'll need to fool Yelp's servers into thinking you're browsing the website by sending browser-like headers to them; otherwise you get different data that doesn't look like comments.
They look like this in Chrome
My usual approach is to copy-paste the headers not prefixed with a colon (ignore :authority, etc) directly into a triple-quoted string called raw_headers, then run
headers = dict([[h.partition(':')[0], h.partition(':')[2]] for h in raw_headers.split('\n')])
over them, and pass them as an argument to requests with:
requests.get(url, headers=headers)
Some of the headers won't be necessary, cookies might expire, and all sorts of other issues might arise but this at least gives you a fighting chance.
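Putting it together, a rough sketch of grabbing the first 80 reviews (4 pages of 20) from that review_feed endpoint. The business id in the URL and the headers are whatever you copy from your own Network tab (the two header lines below are placeholders), and the JSON layout assumed is the reviews/comment/text structure shown above:

import requests

# paste the request headers copied from the browser here (one "name:value" per line)
raw_headers = """user-agent:Mozilla/5.0
accept:application/json"""
headers = dict([[h.partition(':')[0], h.partition(':')[2]] for h in raw_headers.split('\n')])

base = ('https://www.yelp.com/biz/U4mOl3TRbaJ9-bgTQ1d6fw/review_feed'
        '?rl=en&sort_by=relevance_desc&q=&start={}')

reviews = []
for start in (0, 20, 40, 60):                      # 4 pages x 20 reviews = 80
    data = requests.get(base.format(start), headers=headers).json()
    reviews.extend(r['comment']['text'] for r in data.get('reviews', []))

print(len(reviews))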
I want to scrape the text from pages like this into a string: https://www.ncbi.nlm.nih.gov/protein/p22217. In particular, the block of text under DBSOURCE.
I've seen multiple suggestions for using soup.find_all(text=True) and the like, but they come up with nothing. Anything from before about 2018 also seems to be outdated (I'm using Python 3.7). I think the problem is that the content I want is not present in r.text or r.content; when I search with Ctrl+F, the part I'm looking for just isn't there.
from bs4 import BeautifulSoup
import requests
url = "https://www.ncbi.nlm.nih.gov/protein/P22217"
r = requests.get(url)
data = r.content
soup = BeautifulSoup(data, "html.parser")
PageInfo = soup.find("pre", attrs={"class":"genbank"})
print(PageInfo)
The result of this and other attempts is "None". No error message, it just doesn't return anything.
You can use the code below instead, since the page loads this content via XMLHttpRequest (XHR).
Code :
from bs4 import BeautifulSoup
import requests
import re

url = "https://www.ncbi.nlm.nih.gov/protein/P22217"
r = requests.get(url)
soup = BeautifulSoup(r.content, features='html.parser')

# the numeric record id is stored in a meta tag on the page
pageId = soup.find('meta', attrs={'name':'ncbi_uidlist'})['content']

# ask the sviewer API for the full record text using that id
api = requests.get('https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id={}'.format(pageId))

# grab everything between DBSOURCE and KEYWORD
data = re.search(r'DBSOURCE([\w\s\n\t.:,;()-_]*)KEYWORD', api.text)
print(data.group(1).strip())
Demo Code : Here
Explanation :
The first request to the URL gets the id of the record you are asking for, which exists in the page's meta tags.
With that id, the second request uses the website's API to get the description you are asking for. A regex pattern is then used to separate the wanted part from the unwanted part.
Regex :
DBSOURCE([\w\s\n\t.:,;()-_]*)KEYWORD
Demo Regex : Here
The page is doing an XHR call in order to get the information you are looking for.
The call is to https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=135747&db=protein&report=genpept&conwithfeat=on&show-cdd=on&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000
and it returns
<div class="sequence">
<a name="locus_P22217.3"></a><div class="localnav"><ul class="locals"><li>Comment</li><li>Features</li><li>Sequence</li></ul></div>
<pre class="genbank">LOCUS TRX1_YEAST 103 aa linear PLN 18-SEP-2019
DEFINITION RecName: Full=Thioredoxin-1; AltName: Full=Thioredoxin I;
Short=TR-I; AltName: Full=Thioredoxin-2.
ACCESSION P22217
VERSION P22217.3
**DBSOURCE** UniProtKB: locus TRX1_YEAST, accession P22217;
class: standard.
extra accessions:D6VY45
created: Aug 1, 1991.
...
So make that HTTP call from your code in order to get the data.
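A minimal sketch of doing exactly that, using the viewer URL shown above and the same pre.genbank selector the question already tried (the id 135747 is the one observed for P22217 in that request):

import requests
from bs4 import BeautifulSoup

xhr_url = ('https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi'
           '?id=135747&db=protein&report=genpept&conwithfeat=on&show-cdd=on'
           '&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000')

r = requests.get(xhr_url)
soup = BeautifulSoup(r.text, 'html.parser')

# the GenPept record sits in <pre class="genbank"> in this response
record = soup.find('pre', attrs={'class': 'genbank'})
print(record.text if record else 'record not found')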
I am trying to extract information about flight ticket prices with a Python script. Please take a look at the picture:
I would like to parse all the prices (such as "121" at the bottom of the tree). I have constructed a simple script, and my problem is that I am not sure how to get the right parts from the code behind the page's "Inspect element". My code is below:
import urllib3
from bs4 import BeautifulSoup as BS

http = urllib3.PoolManager()

URL = "https://greatescape.co/?datesType=oneway&dateRangeType=exact&departDate=2019-08-19&origin=EAP&originType=city&continent=europe&flightType=3&city=WAW"
response = http.request('GET', URL)
soup = BS(response.data, "html.parser")

# walk down the tree one container at a time
body = soup.find('body')
__next = body.find('div', {'id': '__next'})
ui_container = __next.find('div', {'class': 'ui-container'})
bottom_container_root = ui_container.find('div', {'class': 'bottom-container-root'})
print(bottom_container_root)
The problem is that I am stuck at the level of ui-container: bottom_container_root comes back empty, even though that div is a direct child of ui-container. Could someone please let me know how to parse this tree properly?
I have no experience with web scraping, but as it happens, it is one step in a bigger workflow I am building.
.find_next_siblings() and .next_element can be useful for navigating through containers. Here is some example usage:
from bs4 import BeautifulSoup

# parse a locally saved copy of a page
html = open("small.html").read()
soup = BeautifulSoup(html, "html.parser")

print(soup.head.next_element)
print(soup.head.next_element.next_element)
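And a small self-contained illustration of find_next_siblings(), which is usually the more convenient of the two for walking a row of values. The markup here is made-up toy HTML, not the real page:

from bs4 import BeautifulSoup

html = """
<div class="ui-container">
  <span class="label">Price</span>
  <span class="value">121</span>
  <span class="currency">EUR</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

label = soup.find('span', class_='label')
for sibling in label.find_next_siblings('span'):
    print(sibling.get('class'), sibling.text)   # ['value'] 121, then ['currency'] EUR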