Appending text to a string if it matches a condition - python

I am learning to scrape websites. I need to get document titles and the links to them. I already manage to do this, but the format of the resulting links is sometimes not what I need. Here is a snippet of the information I get:
['Плотность населения субъектов Российской Федерации', 'http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm']
Плотность населения субъектов Российской Федерации
http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm
['Численность мужчин и женщин', '/storage/mediabank/yKsfiyjR/demo13.xls']
Численность мужчин и женщин
/storage/mediabank/yKsfiyjR/demo13.xls
You can see that in the second case I get only part of the link, while in the first I get the whole link. To the second link I need to prepend a piece of text that I know in advance, but only on the condition that the link is detected to be in this incomplete format. That is, as output I want to receive the following:
['Плотность населения субъектов Российской Федерации', 'http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm']
Плотность населения субъектов Российской Федерации
http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm
['Численность мужчин и женщин', 'https://rosstat.gov.ru/storage/mediabank/yKsfiyjR/demo13.xls']
Численность мужчин и женщин
https://rosstat.gov.ru/storage/mediabank/yKsfiyjR/demo13.xls
How should I do this? Here is the code I have so far:
import requests
from bs4 import BeautifulSoup

URL = "https://rosstat.gov.ru/folder/12781"
responce = requests.get(URL).text
soup = BeautifulSoup(responce, 'lxml')

block = soup.find('div', class_="col-lg-8 order-1 order-lg-1")
list_info_block_row = block.find_all('div', class_='document-list__item document-list__item--row')
list_info_block_col = block.find_all('div', class_='document-list__item document-list__item--col')

sources = []

for text_block_row in list_info_block_row:
    new_list = []
    title_element_row = text_block_row.find('div', class_='document-list__item-title')
    preprocessing_title = title_element_row.text.strip()
    link_element_row = text_block_row.find('a').get('href')
    new_list.append(preprocessing_title)
    new_list.append(link_element_row)
    print(new_list)
    print(title_element_row.text.strip())
    print(link_element_row)
    print('\n\n')

You can check if the string has a scheme, and if not, add it along with the host:
if not link_element_row.startswith("http"):
    parsed_url = urlparse(URL)
    link_element_row = (
        parsed_url.scheme + "://" + parsed_url.netloc + link_element_row
    )
Full working code:
import requests
from urllib.parse import urlparse
from bs4 import BeautifulSoup

URL = "https://rosstat.gov.ru/folder/12781"
responce = requests.get(URL).text
soup = BeautifulSoup(responce, "lxml")

block = soup.find("div", class_="col-lg-8 order-1 order-lg-1")
list_info_block_row = block.find_all(
    "div", class_="document-list__item document-list__item--row"
)
list_info_block_col = block.find_all(
    "div", class_="document-list__item document-list__item--col"
)

for text_block_row in list_info_block_row:
    new_list = []
    title_element_row = text_block_row.find("div", class_="document-list__item-title")
    preprocessing_title = title_element_row.text.strip()
    link_element_row = text_block_row.find("a").get("href")
    new_list.append(preprocessing_title)
    if not link_element_row.startswith("http"):
        parsed_url = urlparse(URL)
        link_element_row = (
            parsed_url.scheme + "://" + parsed_url.netloc + link_element_row
        )
    new_list.append(link_element_row)
    print(new_list)
    print(title_element_row.text.strip())
    print(link_element_row)
    print("\n\n")
Research:
Get protocol + host name from URL
startswith

Related

How to search for predefined strings and returned the whole line if a match is found

The snippet works partially, as it can produce some results. I need help to make it work fully. I am searching for strings in a URL, and if a partial match is found, the whole line should be returned.
from bs4 import BeautifulSoup as bs
import requests

addrlist = ['0xe56842ed550ff2794f010738554db45e60730371',
            '0xe1fd7b4c9debac3c490d8a553c455da4979482e4',
            '0x88c20beda907dbc60c56b71b102a133c1b29b053']

queries = ["Website", "Telegram", "https://www.", "Twitter", "https://t.me"]

url = "https://bscscan.com/address/"

for i in addrlist:
    url = str(url) + str(i)
    r = requests.get(url)
    soup = bs(r.text, 'lxml')
    pre = soup.select_one('pre.js-sourcecopyarea.editor')
    ss = (list(pre.stripped_strings)[0]).split('*')
    for s in ss:
        for query in queries:
            if query in s:
                print(s)
Current Output:
Website: https://binemon.io
Telegram: https://t.me/binemonchat
Twitter: https://twitter.com/binemonnft
AttributeError: 'NoneType' object has no attribute 'stripped_strings'
Wanted Output:
Website: https://binemon.io
Telegram: https://t.me/binemonchat
Twitter: https://twitter.com/binemonnft
// Telegram : https://t.me/stackdogebsc
// Website : https://www.stack-doge.com
*Website: www.shibuttinu.com
*Telegram: https://t.me/Shibuttinu
The problem is the url variable. You concatenate each address from addrlist onto the previous url:
# 1st iteration:
https://bscscan.com/address/0xe56842ed550ff2794f010738554db45e60730371
# 2nd iteration:
https://bscscan.com/address/0xe56842ed550ff2794f010738554db45e607303710xe1fd7b4c9debac3c490d8a553c455da4979482e4
# 3rd iteration:
https://bscscan.com/address/0xe56842ed550ff2794f010738554db45e607303710xe1fd7b4c9debac3c490d8a553c455da4979482e40x88c20beda907dbc60c56b71b102a133c1b29b053
Change your code like this:
# url = "https://bscscan.com/address/"
baseurl = "https://bscscan.com/address/"
# url = str(url) + str(i)
url = str(baseurl) + str(i)
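With that change, each iteration builds the URL from scratch instead of appending to the previous one. A minimal sketch of the fixed loop (the parsing part stays as in the original):

baseurl = "https://bscscan.com/address/"

for i in addrlist:
    url = baseurl + str(i)  # built fresh each iteration, not appended to the previous url
    r = requests.get(url)
    # ... the rest of the original parsing loop stays the same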
Update
Use regex to extract information.
Full code:
from bs4 import BeautifulSoup as bs
import requests
import re

addrlist = ['0xe56842ed550ff2794f010738554db45e60730371',
            '0xe1fd7b4c9debac3c490d8a553c455da4979482e4',
            '0x88c20beda907dbc60c56b71b102a133c1b29b053']

baseurl = "https://bscscan.com/address/"
pattern = r'(Website|Telegram|Twitter)\s*:\s*([^\s]+)'

for i in addrlist:
    url = str(baseurl) + str(i)
    r = requests.get(url)
    soup = bs(r.text, 'lxml')
    pre = soup.select_one('pre.js-sourcecopyarea.editor')

    print(url)
    for match in re.findall(pattern, str(pre)):
        print(f"{match[0]}: {match[1]}")
    print()
Output:
https://bscscan.com/address/0xe56842ed550ff2794f010738554db45e60730371
Website: https://binemon.io
Telegram: https://t.me/binemonchat
Twitter: https://twitter.com/binemonnft
https://bscscan.com/address/0xe1fd7b4c9debac3c490d8a553c455da4979482e4
Telegram: https://t.me/stackdogebsc
Website: https://www.stack-doge.com
https://bscscan.com/address/0x88c20beda907dbc60c56b71b102a133c1b29b053
Website: www.shibuttinu.com
Telegram: https://t.me/Shibuttinu
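To see what the pattern captures, here is a tiny standalone check against a made-up string (the text is illustrative only, not real page content):

import re

pattern = r'(Website|Telegram|Twitter)\s*:\s*([^\s]+)'
sample = "* Website: https://binemon.io * Telegram: https://t.me/binemonchat"

print(re.findall(pattern, sample))
# [('Website', 'https://binemon.io'), ('Telegram', 'https://t.me/binemonchat')]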

Modifying the url parameter to download images from multiple web-sites

I was trying to download images from all the cases included in the CaseIDs array, but it doesn't work. I want the code to run for all cases.
from bs4 import BeautifulSoup
import requests as rq
from urllib.parse import urljoin
from tqdm import tqdm

CaseIDs = [100237, 99817, 100271]

with rq.session() as s:
    for caseid in tqdm(CaseIDs):
        url = 'https://crashviewer.nhtsa.dot.gov/nass-CIREN/CaseForm.aspx?xsl=main.xsl&CaseID= {caseid}'
        r = s.get(url)
        soup = BeautifulSoup(r.text, "html.parser")
        url = urljoin(url, soup.find('a', text='Text and Images Only')['href'])
        r = s.get(url)
        soup = BeautifulSoup(r.text, "html.parser")
        links = [urljoin(url, i['src']) for i in soup.select('img[src^="GetBinary.aspx"]')]
        count = 0
        for link in links:
            content = s.get(link).content
            with open("test_image" + str(count) + ".jpg", 'wb') as f:
                f.write(content)
            count += 1
Try using format() like this:
url = 'https://crashviewer.nhtsa.dot.gov/nass-CIREN/CaseForm.aspx?xsl=main.xsl&CaseID={}'.format(caseid)
You need to use an f-string to pass your caseId value in, as you're trying to do:
url = f'https://crashviewer.nhtsa.dot.gov/nass-CIREN/CaseForm.aspx?xsl=main.xsl&CaseID= {caseid}'
(You probably also need to remove the space between the = and the {)
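Putting both suggestions together, the line would presumably look like this:

url = f'https://crashviewer.nhtsa.dot.gov/nass-CIREN/CaseForm.aspx?xsl=main.xsl&CaseID={caseid}'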

Multiple Pages Web Scraping with Python and Beautiful Soup

I'm trying to write code to scrape some data from pages about hotels. The final information (name of the hotel and address) should be exported to CSV. The code works, but only on one page...
import requests
import pandas as pd
from bs4 import BeautifulSoup # HTML data structure
page_url = requests.get('https://e-turysta.pl/noclegi-krakow/')
soup = BeautifulSoup(page_url.content, 'html.parser')
list = soup.find(id='nav-lista-obiektow')
items = list.find_all(class_='et-list__details flex-grow-1 d-flex d-md-block flex-column')
nazwa_noclegu = [item.find(class_='h3 et-list__details__name').get_text() for item in items]
adres_noclegu = [item.find(class_='et-list__city').get_text() for item in items]
dane = pd.DataFrame(
    {
        'nazwa': nazwa_noclegu,
        'adres': adres_noclegu
    }
)
print(dane)
dane.to_csv('noclegi.csv')
I tried a loop, but it doesn't work:
for i in range(22):
    url = requests.get('https://e-turysta.pl/noclegi-krakow/'.format(i+1)).text
    soup = BeautifulSoup(url, 'html.parser')
Any ideas?
The URLs are different from the ones you use: you forgot ?page=.
And you have to use {} to insert the value into the string
url = 'https://e-turysta.pl/noclegi-krakow/?page={}'.format(i+1)
or concatenate it
url = 'https://e-turysta.pl/noclegi-krakow/?page=' + str(i+1)
or use f-string
url = f'https://e-turysta.pl/noclegi-krakow/?page={i+1}'
EDIT: working code
import requests
from bs4 import BeautifulSoup  # HTML data structure
import pandas as pd

def get_page_data(number):
    print('number:', number)
    url = 'https://e-turysta.pl/noclegi-krakow/?page={}'.format(number)

    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    container = soup.find(id='nav-lista-obiektow')
    items = container.find_all(class_='et-list__details flex-grow-1 d-flex d-md-block flex-column')

    # better group them - so you could add default value if there is no nazwa or adres
    dane = []
    for item in items:
        nazwa = item.find(class_='h3 et-list__details__name').get_text(strip=True)
        adres = item.find(class_='et-list__city').get_text(strip=True)
        dane.append([nazwa, adres])

    return dane

# --- main ---

wszystkie_dane = []
for number in range(1, 23):
    dane_na_stronie = get_page_data(number)
    wszystkie_dane.extend(dane_na_stronie)

dane = pd.DataFrame(wszystkie_dane, columns=['nazwa', 'adres'])
dane.to_csv('noclegi.csv', index=False)
In your loop you use the .format() function, but you also need to insert the brackets into the string you are formatting:
for i in range(22):
    url = requests.get('https://e-turysta.pl/noclegi-krakow/{}'.format(i+1)).text
    soup = BeautifulSoup(url, 'html.parser')

Extracting phonetic pronunciation from a crawler returns a blank []

I'm trying to extract the phonetic alphabet from a Spanish-English dictionary.
SpanishDict.com
For example, when búsqueda is searched, its phonetic alphabet will be (boos-keh-dah).
definition of búsqueda
But after I run the .py, it only shows me [] as a result.
Why is this? How can I fix it?
Here's the code I wrote:
import requests
from bs4 import BeautifulSoup
base_url = "https://www.spanishdict.com/translate/"
search_keyword = input("input the keyword : ")
url = base_url + search_keyword + "&start="
spanishdict_r = requests.get(url)
spanishdict_soup = BeautifulSoup(spanishdict_r.text, 'html.parser')
print(spanishdict_soup.findAll('dictionaryLink--369db'))
First, remove "&start="; it doesn't load the desired results, so the URL should be url = base_url + search_keyword.
Second, the translation is present in <span class="dictionaryLink--369db">, which is a span tag with class value dictionaryLink--369db.
Therefore, your search should be spanishdict_soup.find('span', {'class': 'dictionaryLink--369db'}).
Code:
import requests
from bs4 import BeautifulSoup
base_url = "https://www.spanishdict.com/translate/"
search_keyword = 'búsqueda'
url = base_url + search_keyword
spanishdict_r = requests.get(url)
spanishdict_soup = BeautifulSoup(spanishdict_r.text, 'html.parser')
print(spanishdict_soup.find('span', {'class': 'dictionaryLink--369db'}).text)
Output:
(boos-keh-dah)
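As a minor variation (same class, just a CSS selector instead of find), the lookup could presumably also be written as:

# Equivalent lookup using a CSS selector on the same class
print(spanishdict_soup.select_one('span.dictionaryLink--369db').text)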

How to obtain all the links in a domain using Python?

I want to use Python to obtain all the links in a domain given the 'root' URL (in a list). Suppose that, given the URL http://www.example.com, this should return all the links on this page that are in the same domain as the root URL, then recurse on each of these links, visiting them and extracting all the links in the same domain, and so on. What I mean by the same domain is that, given http://www.example.com, the only links I want back are http://www.example.com/something, http://www.example.com/somethingelse ... Anything external such as http://www.otherwebsite.com should be discarded. How can I do this using Python?
EDIT: I made an attempt using lxml. I don't think this works fully, and I am not sure how to take into account links to already processed pages (which would cause an infinite loop).
import urllib
import lxml.html

# given a url, returns a list of all sublinks within the same domain
def getLinks(url):
    urlList = []
    urlList.append(url)
    sublinks = getSubLinks(url)
    for link in sublinks:
        absolute = url + '/' + link
        urlList.extend(getLinks(absolute))
    return urlList

# determine whether two links are within the same domain
def sameDomain(url, dom):
    return url.startswith(dom)

# get tree of sublinks in same domain, url is root
def getSubLinks(url):
    sublinks = []
    connection = urllib.urlopen(url)
    dom = lxml.html.fromstring(connection.read())
    for link in dom.xpath('//a/@href'):
        if not (link.startswith('#') or link.startswith('http') or link.startswith('mailto:')):
            sublinks.append(link)
    return sublinks
import sys
import requests
import hashlib
from bs4 import BeautifulSoup
from datetime import datetime

def get_soup(link):
    """
    Return the BeautifulSoup object for input link
    """
    request_object = requests.get(link, auth=('user', 'pass'))
    soup = BeautifulSoup(request_object.content)
    return soup

def get_status_code(link):
    """
    Return the error code for any url
    param: link
    """
    try:
        error_code = requests.get(link).status_code
    except requests.exceptions.ConnectionError:
        error_code =
    return error_code

def find_internal_urls(lufthansa_url, depth=0, max_depth=2):
    all_urls_info = []
    status_dict = {}
    soup = get_soup(lufthansa_url)
    a_tags = soup.findAll("a", href=True)

    if depth > max_depth:
        return {}
    else:
        for a_tag in a_tags:
            if "http" not in a_tag["href"] and "/" in a_tag["href"]:
                url = "http://www.lufthansa.com" + a_tag['href']
            elif "http" in a_tag["href"]:
                url = a_tag["href"]
            else:
                continue
            status_dict["url"] = url
            status_dict["status_code"] = get_status_code(url)
            status_dict["timestamp"] = datetime.now()
            status_dict["depth"] = depth + 1
            all_urls_info.append(status_dict)
    return all_urls_info

if __name__ == "__main__":
    depth = 2  # suppose
    all_page_urls = find_internal_urls("someurl", 2, 2)
    if depth > 1:
        for status_dict in all_page_urls:
            find_internal_urls(status_dict['url'])
The above snippet contains the necessary modules for scraping URLs from the Lufthansa airlines website. The only additional thing here is that you can specify the depth to which you want to scrape recursively.
Here is what I've done, only following full URLs like http://domain[xxx]. Quick but a bit dirty.
import requests
import re

domain = u"stackoverflow.com"

http_re = re.compile(u"(http:\/\/" + domain + "[\/\w \.-]*\/?)")

visited = set([])

def visit(url):
    visited.add(url)
    extracted_body = requests.get(url).text
    matches = re.findall(http_re, extracted_body)
    for match in matches:
        if match not in visited:
            visit(match)

visit(u"http://" + domain)
print(visited)
There are some bugs in the code of @namita. I modified it and it works well now.
import sys
import requests
import hashlib
from bs4 import BeautifulSoup
from datetime import datetime

def get_soup(link):
    """
    Return the BeautifulSoup object for input link
    """
    request_object = requests.get(link, auth=('user', 'pass'))
    soup = BeautifulSoup(request_object.content, "lxml")
    return soup

def get_status_code(link):
    """
    Return the error code for any url
    param: link
    """
    try:
        error_code = requests.get(link).status_code
    except requests.exceptions.ConnectionError:
        error_code = -1
    return error_code

def find_internal_urls(main_url, depth=0, max_depth=2):
    all_urls_info = []
    soup = get_soup(main_url)
    a_tags = soup.findAll("a", href=True)

    if main_url.endswith("/"):
        domain = main_url
    else:
        domain = "/".join(main_url.split("/")[:-1])
    print(domain)

    if depth > max_depth:
        return {}
    else:
        for a_tag in a_tags:
            if "http://" not in a_tag["href"] and "https://" not in a_tag["href"] and "/" in a_tag["href"]:
                url = domain + a_tag['href']
            elif "http://" in a_tag["href"] or "https://" in a_tag["href"]:
                url = a_tag["href"]
            else:
                continue
            # print(url)
            status_dict = {}
            status_dict["url"] = url
            status_dict["status_code"] = get_status_code(url)
            status_dict["timestamp"] = datetime.now()
            status_dict["depth"] = depth + 1
            all_urls_info.append(status_dict)
    return all_urls_info

if __name__ == "__main__":
    url = # your domain here
    depth = 1
    all_page_urls = find_internal_urls(url, 0, 2)
    # print("\n\n", all_page_urls)
    if depth > 1:
        for status_dict in all_page_urls:
            find_internal_urls(status_dict['url'])
The code works, but I don't know if it's 100% correct. It extracts all the internal URLs in the website.
import requests
from bs4 import BeautifulSoup

def get_soup(link):
    """
    Return the BeautifulSoup object for input link
    """
    request_object = requests.get(link, auth=('user', 'pass'))
    soup = BeautifulSoup(request_object.content, "lxml")
    return soup

visited = set([])

def visit(url, domain):
    visited.add(url)
    soup = get_soup(url)
    a_tags = soup.findAll("a", href=True)
    for a_tag in a_tags:
        if "http://" not in a_tag["href"] and "https://" not in a_tag["href"] and "/" in a_tag["href"]:
            url = domain + a_tag['href']
        elif "http://" in a_tag["href"] or "https://" in a_tag["href"]:
            url = a_tag["href"]
        else:
            continue
        if url not in visited and domain in url:
            # print(url)
            visit(url, domain)

url = input("Url: ")
domain = input("domain: ")
visit(u"" + url, domain)
print(visited)
From the tags of your question, I assume you are using Beautiful Soup.
First, you obviously need to download the webpage, for example with urllib.request. Once you have the contents in a string, you pass it to Beautiful Soup. After that, you can find all links with soup.find_all('a'), assuming soup is your BeautifulSoup object. Then you simply need to check the hrefs:
The simplest version would be to just check whether "http://www.example.com" is in the href, but that won't catch relative links. I guess some wild regular expression would do (find everything containing "www.example.com", or starting with "/", or starting with "?" (PHP)), or you might look for everything that contains a www but is not www.example.com and discard it, etc. The correct strategy may depend on the website you are scraping and its coding style.
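As a minimal sketch of that idea, assuming the hypothetical root http://www.example.com and resolving relative hrefs against it with urljoin:

import urllib.request
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup

ROOT = "http://www.example.com"  # hypothetical root URL, for illustration only

# Download the page and parse it with Beautiful Soup
html = urllib.request.urlopen(ROOT).read()
soup = BeautifulSoup(html, "html.parser")

same_domain_links = []
for a in soup.find_all("a", href=True):
    # Resolve relative hrefs ("/something", "?page=2") against the root
    absolute = urljoin(ROOT, a["href"])
    # Keep only links whose host matches the root's host
    if urlparse(absolute).netloc == urlparse(ROOT).netloc:
        same_domain_links.append(absolute)

print(same_domain_links)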
You can use a regular expression to filter out such links, e.g.
<a\shref\=\"(http\:\/\/example\.com[^\"]*)\"
Take the above regex as a reference and start writing a script based on it.
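For illustration, a small sketch that applies a regex of this shape to a page's HTML; the pattern and the example.com domain are placeholders, not a recommendation for parsing arbitrary HTML:

import re

import requests

# Pattern in the spirit of the regex above: capture absolute links to
# example.com (http or https) from anchor tags.
link_re = re.compile(r'<a\s+href="(https?://(?:www\.)?example\.com[^"]*)"')

html = requests.get("http://www.example.com").text
print(link_re.findall(html))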