Webcrawler: extracting string out of array using Python3 on mac - python

I have a problem writing a web crawler to extract currency rates:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import re
url = "https://wechselkurse-euro.de/"
r = requests.get(url)
rates = []
status = r.status_code
if status != 200:
    print("Something went wrong while parsing the website " + url)
temp = BeautifulSoup(r.text, "html.parser")
current_date = temp.select(".ecb")[0].text.strip().split(" ")[5]
#rates_array = temp.select(".kurz_kurz2.center", limit= 20).string
rates_array = temp.select(".kurz_kurz2.center", limit= 20)
#for i in rates_array:
# rate = rates_array[i].string
# rates.append(rate)
rates = list( map( lambda x: re.search(">\d{1}\.\d{4}",x), rates_array))
print(rates)
#rate_1EUR_to_USD =
#rate_1EUR_to_GBP =
I tried several approaches, which are commented out; none of them work and I don't know why. The .string attribute not working is especially surprising to me, since rates_array seems to carry all the information of the bs4 objects, including a td tag like <td class="kurz_kurz2 center" title="Aktueller Wechselkurs am 3.4.2020">0.5554</td>, where I just want the string within the tag (the value 0.5554 in this example). This should be easy, but nothing works. What am I doing wrong?
The regular expression should not be the problem; I tested it on RegExr.
I tried using the map function, as currently active, but I can't convert the map object to a list as I am supposed to.
select().string returns an empty list, and the same happens when I use regular expressions to search through the strings saved in rates_array the old-school way, iterating over every item with a for loop.
String as attribute of bs4-object

Your rates_array contains Beautiful Soup tag objects, not strings. So you'll have to access their text property in order to get the values. For example:
rates = [o.text for o in rates_array]
Now rates contains:
['0.5554', '0.1758']
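Once you have the plain strings, converting them to numbers is a one-liner. A minimal sketch, using sample values since the live page's rates change:

```python
# Sketch: converting the extracted rate strings to floats.
# The sample values stand in for whatever the live page returns.
rates = ['0.5554', '0.1758']
numeric_rates = [float(r) for r in rates]
print(numeric_rates)  # [0.5554, 0.1758]
```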

I would recommend that you check the locator first.
Are you sure that rates_array is not empty?
Also, try:
rates_array[i].text

Related

Python get request is just returning []

I am trying to perform a get request; however, the only thing it seems to output is square brackets. I am trying to grab a value called the market cap from this website, www.coinmarketcap.com, but I cannot get it to work. This is my code:
import requests
import bs4
source = requests.get("https://www.coinmarketcap.com/charts/").text
soup = bs4.BeautifulSoup(source,"lxml")
coincap = soup.select(".sc-12ja2s9-0 dzHJPm")
def output(coincap):
    with open("/Users/user/Desktop/coincap.txt", mode="w+") as f:
        f.write(coincap)
        print(coincap)
Your query selector is incorrect. The element you are looking for has two CSS classes, so you need to join both selectors:
coincap = soup.select(".sc-12ja2s9-0.dzHJPm")
Cheers,
- J
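The difference between the two selectors can be seen on a made-up inline snippet (the class names follow the question; the element content is invented for illustration):

```python
from bs4 import BeautifulSoup

# Made-up snippet: one element carrying both CSS classes
html = '<div class="sc-12ja2s9-0 dzHJPm">$1,234</div>'
soup = BeautifulSoup(html, "html.parser")

# ".a b" with a space is a descendant selector (a <dzHJPm> tag
# inside an element of class sc-12ja2s9-0) -- matches nothing here
print(soup.select(".sc-12ja2s9-0 dzHJPm"))  # []

# ".a.b" with no space requires both classes on the same element
print(soup.select(".sc-12ja2s9-0.dzHJPm")[0].text)  # $1,234
```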

Beautiful Soup. Text extraction into a dataframe

I'm trying to extract the information from a single web-page that contains multiple similarly structured recordings. Information is contained within div tags with different classes (I'm interested in username, main text and date). Here is the code I use:
import bs4 as bs
import urllib
import pandas as pd
href = 'https://example.ru/'
sause = urllib.urlopen(href).read()
soup = bs.BeautifulSoup(sause, 'lxml')
user = pd.Series(soup.find_all('div', class_='Username'))
main_text = pd.Series(soup.find_all('div', class_='MainText'))
date = pd.Series(soup.find_all('div', class_='Date'))
result = pd.DataFrame()
result = pd.concat([user, main_text, date], axis=1)
The problem is that I receive the information with all the tags, while I want only the text. Surprisingly, the .text attribute doesn't work with the find_all method, so now I'm completely out of ideas.
Thank you for any help!
list comprehension is the way to go, to get all the text within MainText for example, try
[elem.text for elem in soup.find_all('div', class_='MainText')]
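Applied to all three fields, the same comprehension yields plain text lists that pd.Series and pd.concat can then combine. A sketch with made-up inline HTML (the class names follow the question):

```python
from bs4 import BeautifulSoup

# Made-up sample markup following the class names from the question
html = """
<div class="Username">alice</div><div class="MainText">hello</div><div class="Date">2020-04-03</div>
<div class="Username">bob</div><div class="MainText">hi</div><div class="Date">2020-04-04</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# .text strips the tags, leaving only the contained strings
users = [d.text for d in soup.find_all('div', class_='Username')]
texts = [d.text for d in soup.find_all('div', class_='MainText')]
dates = [d.text for d in soup.find_all('div', class_='Date')]
print(users)  # ['alice', 'bob']
```

Wrapping each of these lists in pd.Series and concatenating along axis=1, as in the question, then gives a DataFrame without the tags.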

Extracting a specific substring from a specific hyper-reference using Python

I'm new to Python, and for my second attempt at a project, I wanted to extract a substring – specifically, an identifying number – from a hyper-reference on a url.
For example, this url is the result of my search query, giving the hyper-reference http://www.chessgames.com/perl/chessgame?gid=1012809. From this I want to extract the identifying number "1012809" and append it to navigate to the url http://www.chessgames.com/perl/chessgame?gid=1012809, after which I plan to download the file at the url http://www.chessgames.com/pgn/alekhine_naegeli_1932.pgn?gid=1012809 . But I am currently stuck a few steps behind this because I can't figure out a way to extract the identifier.
Here is my MWE:
from bs4 import BeautifulSoup
import urllib2
url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
import re
y = str(soup)
x = re.findall("gid=[0-9]+",y)
print x
z = re.sub("gid=", "", x(1)) #At this point, things have completely broken down...
As Albin Paul commented, re.findall returns a list, so you need to extract elements from it. By the way, you don't need BeautifulSoup here; use urllib2.urlopen(url).read() to get the content as a string. The re.sub is also not needed: one regex pattern, (?:gid=)([0-9]+), is enough.
import re
import urllib2
url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url).read()
result = re.findall(r"(?:gid=)([0-9]+)",page)
print(result[0])
#'1012809'
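The capture group is what strips the gid= prefix. The pattern can be checked against a made-up snippet before running it on the live page:

```python
import re

# A made-up anchor tag standing in for the downloaded page content
page = '<a href="http://www.chessgames.com/perl/chessgame?gid=1012809">game</a>'

# The non-capturing (?:...) part anchors on "gid=", the capture
# group ([0-9]+) returns only the digits that follow it
result = re.findall(r"(?:gid=)([0-9]+)", page)
print(result[0])  # 1012809
```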
You don't need regex here at all. A CSS selector along with some string manipulation will point you in the right direction. Try the script below:
import requests
from bs4 import BeautifulSoup
page_link = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
soup = BeautifulSoup(requests.get(page_link).text, 'lxml')
item_num = soup.select_one("[href*='gid=']")['href'].split("gid=")[1]
print(item_num)
Output:
1012809

Beautiful Soup Nested Tag Search

I am trying to write a python program that will count the words on a web page. I use Beautiful Soup 4 to scrape the page but I have difficulties accessing nested HTML tags (for example: <p class="hello"> inside <div>).
Every time I try to find such a tag using the page.findAll() method (page is the Beautiful Soup object containing the whole page), it simply doesn't find any, although they are there. Is there a simple method or another way to do it?
I'm guessing what you are trying to do is first look in a specific div tag and then search all p tags in it and count them, or do whatever you want. For example:
soup = bs4.BeautifulSoup(content, 'html.parser')
# This will get the div
div_container = soup.find('div', class_='some_class')
# Then search in that div_container for all p tags with class "hello"
for ptag in div_container.find_all('p', class_='hello'):
    # prints the p tag content
    print(ptag.text)
Hope that helps
Try this one :
data = []
for nested_soup in soup.find_all('xyz'):
    data = data + nested_soup.find_all('abc')
Maybe you can turn in into lambda and make it cool, but this works. Thanks.
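A CSS selector expresses the same nesting in one call. A sketch with made-up inline HTML, using the div/p example from the question:

```python
import bs4

# Made-up markup: one p.hello inside a div, one outside
html = '<div><p class="hello">inside</p></div><p class="hello">outside</p>'
soup = bs4.BeautifulSoup(html, 'html.parser')

# "div p.hello" only matches p tags with class hello that sit inside a div
nested = soup.select('div p.hello')
print([p.text for p in nested])  # ['inside']
```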
UPDATE: I noticed that text does not always return the expected result. At the same time, I realized there is a built-in way to get the text; sure enough, reading the docs, we see there is a method called get_text(). Use it as:
from bs4 import BeautifulSoup

fd = open('index.html', 'r')
website = fd.read()
fd.close()
soup = BeautifulSoup(website, 'html.parser')
contents = soup.get_text(separator=" ")
print("number of words %d" % len(contents.split(" ")))
INCORRECT, please read above. Supposing that you have your html file locally in index.html, you can:
from bs4 import BeautifulSoup
import re

BLACKLIST = ["html", "head", "title", "script"]  # tags to be ignored
fd = open('index.html', 'r')
website = fd.read()
fd.close()
soup = BeautifulSoup(website, 'html.parser')
tags = soup.find_all(True)  # find everything
print("there are %d" % len(tags))
count = 0
matcher = re.compile(r"(?:\s|<br>)+")  # non-capturing group, so split() does not return the separators
for tag in tags:
    if tag.name.lower() in BLACKLIST:
        continue
    temp = matcher.split(tag.text)  # split on whitespace tokens such as \s and \n
    temp = list(filter(None, temp))  # remove empty elements in the list
    count += len(temp)
print("number of words in the document %d" % count)
Please note that the count may not be accurate, perhaps because of formatting errors, false positives (it counts any token, even code), text that is rendered dynamically using JavaScript or CSS, or other reasons.
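The split-and-filter step above can be checked on its own with the stdlib only (the sample text is made up):

```python
import re

# Made-up sample text with mixed whitespace
text = "Hello   world\nthis is  a test"

# Non-capturing group, so re.split does not return the separators themselves;
# the comprehension drops any empty strings left at the edges
words = [w for w in re.split(r"(?:\s|<br>)+", text) if w]
print(len(words))  # 6
```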
You can find all <p> tags using regular expressions (re module).
Note that r.text is a string which contains the whole html of the site.
For example:
r = requests.get(url, headers=headers)
p_tags = re.findall(r'<p>.*?</p>', r.text)
This should get you all the <p> tags, irrespective of whether they are nested or not. If you want only the tags inside a specific parent tag, you can pass that parent tag's html as the string in the second argument instead of r.text.
Alternatively if you just want just the text you can try this:
from readability import Document #pip install readability-lxml
import requests
r = requests.get(url,headers=headers)
doc = Document(r.content)
simplified_html = doc.summary()
this will get you a more bare bones form of the html from the site, and now proceed with the parsing.

Python BS4 not returning a Unicode value in find_all() function

For a school project I wanted to write a python program which extracts the current value of Bitcoin from this website: http://www.coindesk.com/price/. To do that I installed the BeautifulSoup4 and Requests libraries to fetch the HTML data and parse it, but when it came time to actually get the price, my program returned nothing. Here is a picture of what I am trying to get. This is my code:
import requests as r
from bs4 import BeautifulSoup as bs
doc = r.get("http://www.coindesk.com/price/")
soup = bs(doc.content, "html.parser")
price = soup.find_all("a", {"class":"bpiUSD"})
text = []
contents = []
for item in price:
    text.append(item.text)
for item in price:
    contents.append(item.contents)
print "text:", type(text[0])
print "contents:", type(contents[0])
print "text[0]:", text[0]
print "contents[0]", contents[0]
And this is the output:
text: <type 'unicode'>
contents: <type 'list'>
text[0]:
contents[0] []
I used this approach to get strings and numbers before and it worked, but for this particular number it returns nothing. Also, I know that the Bitcoin price is in Unicode (at least I assume this), and I tried to convert it into a string value, but nothing worked, despite the fact that type() does report that the element is Unicode.
You will either have to find a different website or use selenium webdriver. The price is generated by javascript that requests doesn't execute.
import requests as r
from bs4 import BeautifulSoup as bs
doc = r.get("http://www.coindesk.com/price/")
soup = bs(doc.content, "lxml")
price = soup.find_all(class_="currency-price")
print(price)
prints:
[<div class="currency-price">
<a class="bpiUSD" href="/price/" style="color:white;"></a>
</div>, <div class="currency-price">
<a class="bpiUSD" href="/price/" style="color:white;"></a>
</div>]
Which doesn't contain your number. If you inspect the html on your website it will have the number between the a tags. Using a library like selenium will allow you to run the javascript.
The website you are trying to parse with Beautiful soup is being rendered through javascript calls that are grabbing the data from an api publishing json rendition of the data, namely coindesk api. This is why your beautiful soup calls aren't working.
To get this data you need to make a request for the json using requests, then iterate to the data you need.
I went through the process for you in the script below. I added notes so you can understand what I did in each section. It could have been done with fewer lines of code, but I thought this would help you better understand how to loop through the json.
This is Python 3; on Python 2.7, remove the parentheses around the print statement for prettier output.
import requests
jsonurl = 'http://api2.coindesk.com/site/headerdata.json?currency=BTC'
json = requests.get(jsonurl).json()
for key, value in json.items():  # loop through the first branch of the json
    if type(value) == type(dict()):  # each branch that holds a dictionary contains the currency and rate
        for subKey, subValue in value.items():  # loop through those dictionaries
            if type(subValue) == type(dict()):  # if there is a dictionary in this key-value pair, loop through it too
                for subKey1, subValue1 in subValue.items():  # our final loop
                    if subKey1 == 'rate_float':  # the rates are held under the rate_float key
                        print('exchange: ' + subKey, 'rate: ' + str(subValue1))
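The nested loops amount to searching for every rate_float key; a recursive version handles any nesting depth in one function. The sample dict below is made up to mimic the shape those loops expect, since the live API response can change:

```python
# Made-up sample mimicking the nested json shape the loops above walk through
sample = {
    "bpi": {
        "USD": {"code": "USD", "rate_float": 6666.56},
        "EUR": {"code": "EUR", "rate_float": 6148.02},
    },
    "time": {"updated": "Apr 3, 2020"},
}

def find_rate_floats(node):
    """Yield (currency_key, rate) for every sub-dict holding a 'rate_float' key."""
    for key, value in node.items():
        if isinstance(value, dict):
            if 'rate_float' in value:
                yield key, value['rate_float']
            else:
                yield from find_rate_floats(value)

for currency, rate in find_rate_floats(sample):
    print('exchange: ' + currency, 'rate: ' + str(rate))
```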

Categories