How to give inputs to a website using Python

Hi all, I am new to Python. Please help me with this requirement.
http://www.example.com/ratings/ratings-rationales.jsp?date=true&result=true
At this link, I have to choose a date first; the rating company then lists its publications as links. I want to find the links that contain a word of interest, say "stable". I have tried the following using Python 3.4.2:
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests
url = "http://www.example.com/ratings/ratings-rationales.jsp?date=true&result=true"
r = requests.get(url)
soup = BeautifulSoup(r.content)
example_links = lambda tag: getattr(tag, 'name', None) == 'a' and 'stable' in tag.get_text().lower() and 'href' in tag.attrs
results = soup.find_all(example_links)
result_links = [urljoin(url, tag['href']) for tag in results]
print (result_links)
This is not printing anything. I am seeing the result below:
>>>
[]
Obviously I am not giving the date as input.
1. How do I input the from and to dates as today's date? (Eventually I want to check periodically for updates to the links containing a word of interest, but that is a question for later.)
For example, after giving from date: 31-12-2014 and to date: 31-12-2014 as inputs, the output I need is the matching hyperlink.
Any suggestion will be very useful. Thanks in advance.
Here is the updated code; I am still not able to get the result. >>> [] is the output.
from datetime import datetime
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests
#Getting the current date
today = datetime.today()
#For the sake of brevity some parameters are missing on the payload
payload = {
'selArchive': 1,
'selDay': 31,
'selMonth': 12,
'selYear': 2014,
'selDay1': 31,
'selMonth1': 12,
'selYear1': 2014,
'selSector': '',
'selIndustry': '',
'selCompany': ''
}
example_url = "http://www.example.com/
r = requests.post(example_url, data=payload)
rg = requests.get(example_url)
soup = BeautifulSoup(rg.content)
crisil_links = lambda tag: getattr(tag, 'name', None) == 'a' and 'stable' in tag.get_text().lower() and 'href' in tag.attrs
results = soup.find_all(example_links)
result_links = [urljoin(url, tag['href']) for tag in results]
print (result_links)

You should be doing a POST instead of a GET for this particular site (see the requests documentation on how to form a POST request with parameters).
Check this example:
from datetime import datetime
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests
#Getting the current date
today = datetime.today()
#Here I'm passing the from and to dates (hard-coded to 31-12-2014 in this example) along with the sector/industry parameters
payload = {
'selDay': 31,
'selMonth': 12,
'selYear': 2014,
'selDay1': 31,
'selMonth1': 12,
'selYear1': 2014,
'selIndustry': '',
'txtPhrase': '',
'txtInclude': '',
'txtExclude': '',
'selSubServices': 'ALL',
'selServices': 'all',
'maxresults': 10,
'pageno': 1,
'srchInSrchCol': '01',
'sortOptions': 'date',
'isSrchInSrch': '01',
'txtShowQuery': '01',
'tSearch': 'Find a Rating',
'txtSearch': '',
'selArchive': 1,
'selSector': 148,
'selCompany': '',
'x': 40,
'y': 11,
}
crisil_url = "http://www.crisil.com/ratings/ratings-rationales.jsp?result=true&Sector=true"
r = requests.post(crisil_url, data=payload)
soup = BeautifulSoup(r.content)
crisil_links = lambda tag: getattr(tag, 'name', None) == 'a' and 'stable' in tag.get_text().lower() and 'href' in tag.attrs
results = soup.find_all(crisil_links)
result_links = [urljoin(crisil_url, tag['href']) for tag in results]
print (result_links)
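Note that the code above defines today but still hard-codes 31-12-2014 in the payload. If you want the from/to dates to follow the current date, you can derive those fields from today; a minimal sketch, assuming the form expects the same selDay/selMonth/selYear keys used in the payload above:
payload.update({
    'selDay': today.day, 'selMonth': today.month, 'selYear': today.year,
    'selDay1': today.day, 'selMonth1': today.month, 'selYear1': today.year,
})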
You will need to check the ids of the industries you are filtering, so be sure to check them via Inspect Element, selecting the industries select box in the browser.
After that, you will get the response and do the parsing via BeautifulSoup, as you are doing now.
Checking periodically:
To check this periodically you should consider crontab if using Linux/Unix or a Scheduled task if using Windows.
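For example, on Linux a crontab entry along these lines would run the scraper once a day (the script name and paths here are assumptions for illustration):
# m h dom mon dow command -- run every day at 08:00
0 8 * * * /usr/bin/python3 /home/user/check_ratings.py >> /home/user/ratings.log 2>&1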

Related

Extract date from multiple webpages with Python

I want to extract the date when a news article was published on a website. For some websites I have the exact HTML element where the date/time is (div, p, time), but for some websites I do not:
These are the links for some websites (German websites):
(3 Nov 2020)
http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226
(Dec. 1, 2020) http://www.reutigen.ch/de/aktuelles/aktuellesinformationen/welcome.php?action=showinfo&info_id=1066837&ls=0&sq=&kategorie_id=&date_from=&date_to=
(10/22/2020) http://buchholterberg.ch/de/Gemeinde/Information/News/Newsmeldung?filterCategory=22&newsid=905
I have tried 3 different solutions with Python libs such as requests, htmldate and date_guesser, but I'm always getting None, or, in the case of the htmldate lib, I always get the same date (2020.1.1).
from bs4 import BeautifulSoup
import requests
from htmldate import find_date
from date_guesser import guess_date, Accuracy
# Lib find_date
url = "http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226"
response = requests.get(url)
my_date = find_date(response.content, extensive_search=True)
print(my_date, '\n')
# Lib guess_date
url = "http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226"
my_date = guess_date(url=url, html=requests.get(url).text)
print(my_date.date, '\n')
# Lib Requests # I DO NOT GET last modified TAG
my_date = requests.head('http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226')
print(my_date.headers, '\n')
Am I doing something wrong?
Can you please tell me whether there is a way to extract the date of publication from websites like these (where I do not have specific div, p, and datetime elements)?
IMPORTANT!
I want to make the date extraction universal, so that I can put these links in a for loop and run the same function on them.
I have never had much success with some of the date parsing libraries, so I usually go another route. I believe that the best method to extract the date strings from these sites in your question is with regular expressions.
website: linden.ch
import requests
import re as regex
from bs4 import BeautifulSoup
from datetime import datetime
url = "http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
page_body = soup.find('body')
find_date = regex.search(r'(Datum der Neuigkeit)\s(\d{1,2}\W\s\w+\W\s\d{4})', str(page_body))
reformatted_timestamp = datetime.strptime(find_date.groups()[1], '%d. %b. %Y').strftime('%d-%m-%Y')
print(reformatted_timestamp)
# print output
03-11-2020
website: buchholterberg.ch
import requests
import re as regex
from bs4 import BeautifulSoup
from datetime import datetime
url = "http://buchholterberg.ch/de/Gemeinde/Information/News/Newsmeldung?filterCategory=22&newsid=905"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
page_body = soup.find('body')
find_date = regex.search(r'(Veröffentlicht)\s\w+:\s(\d{1,2}:\d{1,2}:\d{1,2})\s(\d{1,2}.\d{1,2}.\d{4})', str(page_body))
reformatted_timestamp = datetime.strptime(find_date.groups()[2], '%d.%m.%Y').strftime('%d-%m-%Y')
print(reformatted_timestamp)
# print output
22-10-2020
Update 12-04-2020
I looked at the source code for the two Python libraries: htmldate and date_guesser that you mentioned. Neither of these libraries can currently extract the date from the 3 sources that you listed in your question. The primary reason for this lack of extraction is linked to the date formats and language (german) of these target sites.
I had some free time so I put this together for you. The answer below can easily be modified to extract from any website and can be refined as needed based on the format of your target sources. It currently extracts the date from all the links contained in the urls set below.
all urls
import requests
import re as regex
from bs4 import BeautifulSoup

def extract_date(can_of_soup):
    page_body = can_of_soup.find('body')
    clean_body = ''.join(str(page_body).replace('\n', ''))
    if 'Datum der Neuigkeit' in clean_body or 'Veröffentlicht' in clean_body:
        date_formats = r'(Datum der Neuigkeit)\s(\d{1,2}\W\s\w+\W\s\d{4})|(Veröffentlicht am: \d{2}:\d{2}:\d{2} )(\d{1,2}.\d{1,2}.\d{4})'
        find_date = regex.search(date_formats, clean_body, regex.IGNORECASE)
        if find_date:
            clean_tuples = [i for i in list(find_date.groups()) if i]
            return ''.join(clean_tuples[1])
    else:
        tags = ['extra', 'elementStandard elementText', 'icms-block icms-information-date icms-text-gemeinde-color']
        for tag in tags:
            date_tag = page_body.find('div', {'class': f'{tag}'})
            if date_tag is not None:
                children = date_tag.findChildren()
                if children:
                    find_date = regex.search(r'(\d{1,2}.\d{1,2}.\d{4})', str(children))
                    return ''.join(find_date.groups())
                else:
                    return ''.join(date_tag.contents)

def get_soup(target_url):
    response = requests.get(target_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    return soup
urls = {'http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226',
'http://www.reutigen.ch/de/aktuelles/aktuellesinformationen/welcome.php?action=showinfo&info_id=1066837&ls=0'
'&sq=&kategorie_id=&date_from=&date_to=',
'http://buchholterberg.ch/de/Gemeinde/Information/News/Newsmeldung?filterCategory=22&newsid=905',
'https://www.steffisburg.ch/de/aktuelles/meldungen/Hochwasserschutz-und-Laengsvernetzung-Zulg.php',
'https://www.wallisellen.ch/aktuellesinformationen/924227',
'http://www.winkel.ch/de/aktuellesre/aktuelles/aktuellesinformationen/welcome.php?action=showinfo&info_id'
'=1093910&ls=0&sq=&kategorie_id=&date_from=&date_to=',
'https://www.aeschi.ch/de/aktuelles/mitteilungen/artikel/?tx_news_pi1%5Bnews%5D=87&tx_news_pi1%5Bcontroller%5D=News&tx_news_pi1%5Baction%5D=detail&cHash=ab4d329e2f1529d6e3343094b416baed'}
for url in urls:
    html = get_soup(url)
    article_date = extract_date(html)
    print(article_date)

Scrape Text After Specific Text and Before Specific Text

<script type="text/javascript">
'sku': 'T3246B5',
'Name': 'TAS BLACKY',
'Price': '111930',
'categories': 'Tas,Wanita,Sling Bags,Di bawah Rp 200.000',
'brand': '',
'visibility': '4',
'instock': "1",
'stock': "73.0000"
</script>
I want to scrape the text between 'stock': " and .0000", so the desirable result is 73.
What I knew how to do was something like this:
# imports assumed from the original script
from urllib.request import Request, urlopen as uReq
from bs4 import BeautifulSoup as soup

for url2 in urls2:
    req2 = Request(url2, headers={'User-Agent': 'Chrome/39.0.2171.95'})
    html2 = uReq(req2).read()
    page_soup2 = soup(html2, "html.parser")
    # Grab text
    stock = page_soup2.findAll("p", {"class": "stock"})
    stocks = stock[0].text
I used something like this in my previous code; it worked before the website changed its code.
But now there is more than one ("script", {"type": "text/javascript"}) in the page I want to scrape, so I don't know how to find the right ("script", {"type": "text/javascript"}).
I also don't know how to get the specific text between the text before and after it.
I have googled all day but can't find the solution. Please help.
I found that the strings 'stock': " and .0000" are unique in the entire page: there is only one 'stock': and only one .0000".
So I think they could mark the location of the text I want to scrape.
Please help, thank you for your kindness.
I also apologize for my lack of English, and I am actually unfamiliar with programming. I'm just trying to learn from Google, but I can't find the answer. Thank you for your understanding.
The URL is view-source:sophieparis.com/blacky-bag.html
Since you are sure 'stock' only shows up in the script tag you want, you can pull out the script whose text contains 'stock'. Once you have that, it's a matter of trimming off the excess and changing the single quotes to double quotes to get it into valid JSON format, then simply reading it in using json.loads().
import requests
from bs4 import BeautifulSoup
import json
url2 = 'https://www.sophieparis.com/blacky-bag.html'
req2 = requests.get(url2, headers={'User-Agent': 'Chrome/39.0.2171.95'})
page_soup2 = BeautifulSoup(req2.text, "html.parser")
scripts = page_soup2.find_all('script')
for script in scripts:
    if 'stock' in script.text:
        jsonStr = script.text
        jsonStr = jsonStr.split('productObject = ')[-1].strip()
        jsonStr = jsonStr.rsplit('}',1)[0].strip() + '}'
        jsonData = json.loads(jsonStr.replace("'",'"'))
        print (jsonData['stock'].split('.')[0])
Output:
print (jsonData['stock'].split('.')[0])
71
You could also do this without the loop and just grab the script that has the string stock in it using 1 line:
jsonStr = page_soup2.find('script', text=re.compile(r'stock')).text
Full code would look something like:
import requests
from bs4 import BeautifulSoup
import json
import re
url2 = 'https://www.sophieparis.com/blacky-bag.html'
req2 = requests.get(url2, headers={'User-Agent': 'Chrome/39.0.2171.95'})
page_soup2 = BeautifulSoup(req2.text, "html.parser")
jsonStr = page_soup2.find('script', text=re.compile(r'stock')).text
jsonStr = jsonStr.split('productObject = ')[-1].strip()
jsonStr = jsonStr.rsplit('}',1)[0].strip() + '}'
jsonData = json.loads(jsonStr.replace("'",'"'))
print (jsonData['stock'].split('.')[0])
I would write a regex that targets the javascript dictionary variable that houses the values of interest. You can apply this directly to response.text with no need for bs4.
The dictionary variable is called productObject, and you want the non-empty dictionary, which is the second occurrence of productObject = {..}, i.e. not the one with 'var ' preceding it. You can use a negative lookbehind to specify this requirement.
Use hjson to handle property names enclosed in single quotes.
Py
import requests, re, hjson
r = requests.get('https://www.sophieparis.com/blacky-bag.html')
p = re.compile(r'(?<!var\s)productObject = ([\s\S]*?})')
data = hjson.loads(p.findall(r.text)[0])
print(data)
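From there, extracting the stock count works the same way as in the first answer, assuming the parsed dict has the same keys shown in the question:
print(data['stock'].split('.')[0])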
If you want to provide me with the webpage you wish to scrape the data from, I'll see if I can fix the code to pull the information.

Get value from web link

I have a URL from which I want to extract the line containing "Underlying Stock: NCC 96.70 As on Jun 06, 2019 10:12:20 IST" and pull the symbol ("NCC") and the underlying price ("96.70") into a list.
url = "https://nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp?symbolCode=917&symbol=NCC&symbol=ncc&instrument=OPTSTK&date=-&segmentLink=17&segmentLink=17"
You can make a request to the site and then parse the result with Beautiful Soup.
Try this:
from bs4 import BeautifulSoup
import requests
url = "https://nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp?symbolCode=917&symbol=NCC&symbol=ncc&instrument=OPTSTK&date=-&segmentLink=17&segmentLink=17"
res = requests.get(url)
soup = BeautifulSoup(res.text)
# hacky way of finding and parsing the stock data
soup.get_text().split("Underlying Stock")[1][2:10].split(" ")
This prints out:
['NCC', '96.9']
PS: If you get a warning about lxml, it is because lxml is used as the default parser when you have it installed. In that case change the line to soup = BeautifulSoup(res.text, features="lxml"). You need lxml installed, e.g. with pip install lxml in your environment.
Another version, less hacky.
url = "https://nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp?symbolCode=917&symbol=NCC&symbol=ncc&instrument=OPTSTK&date=-&segmentLink=17&segmentLink=17"
page_html = requests.get(url).text
page_soup = BeautifulSoup(page_html, "html.parser")
page_soup.find("b").next.split(' ')
A succinct way is to select for the first right-aligned table cell (td[align=right]), which you can actually simplify to just the attribute, [align=right]:
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp?symbolCode=917&symbol=NCC&symbol=ncc&instrument=OPTSTK&date=-&segmentLink=17&segmentLink=17')
soup = bs(r.content, 'lxml')
headline = soup.select_one('[align=right]').text.strip().replace('\xa0\n',' ')
print(headline)
You can also take the first row of the first table:
from bs4 import BeautifulSoup
import requests
r = requests.get('https://nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp?symbolCode=917&symbol=NCC&symbol=ncc&instrument=OPTSTK&date=-&segmentLink=17&segmentLink=17')
soup = BeautifulSoup(r.content, 'lxml')
table = soup.select_one('table')
headline = table.select_one('tr:nth-of-type(1)').text.replace('\n',' ').replace('\xa0', ' ').strip()
print(headline)
from bs4 import BeautifulSoup
import requests
url = "https://nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp?symbolCode=917&symbol=NCC&symbol=ncc&instrument=OPTSTK&date=-&segmentLink=17&segmentLink=17"
res = requests.get(url)
soup = BeautifulSoup(res.text, "lxml")
# hacky way of finding and parsing the stock data
mylist = soup.get_text().split("Underlying Stock")[1][2:10].split(" ")
print(mylist[:2])
=============================
import pandas as pd
dict1 = {'SYMBOL': ['ACC','ADANIENT','ADANIPORTS','ADANIPOWER','AJANTPHARM','ALBK','AMARAJABAT','AMBUJACEM','APOLLOHOSP','APOLLOTYRE','ARVIND','ASHOKLEY','ASIANPAINT','AUROPHARMA','AXISBANK','BAJAJ-AUTO','BAJAJFINSV','BAJFINANCE','BALKRISIND','BANKBARODA','BANKINDIA','BANKNIFTY','BATAINDIA','BEL','BEML','BERGEPAINT','BHARATFIN','BHARATFORG','BHARTIARTL','BHEL','BIOCON','BOSCHLTD','BPCL','BRITANNIA','BSOFT','CADILAHC','CANBK','CANFINHOME','CASTROLIND','CEATLTD','CENTURYTEX','CESC','CGPOWER','CHENNPETRO','CHOLAFIN','CIPLA','COALINDIA','COLPAL','CONCOR','CUMMINSIND','DABUR','DCBBANK','DHFL','DISHTV','DIVISLAB','DLF','DRREDDY','EICHERMOT','ENGINERSIN','EQUITAS','ESCORTS','EXIDEIND','FEDERALBNK','GAIL','GLENMARK','GMRINFRA','GODFRYPHLP','GODREJCP','GODREJIND','GRASIM','GSFC','HAVELLS','HCLTECH','HDFC','HDFCBANK','HEROMOTOCO','HEXAWARE','HINDALCO','HINDPETRO','HINDUNILVR','HINDZINC','IBULHSGFIN','ICICIBANK','ICICIPRULI','IDBI','IDEA','IDFC','IDFCFIRSTB','IFCI','IGL','INDIACEM','INDIANB','INDIGO','INDUSINDBK','INFIBEAM','INFRATEL','INFY','IOC','IRB','ITC','JETAIRWAYS','JINDALSTEL','JISLJALEQS','JSWSTEEL','JUBLFOOD','JUSTDIAL','KAJARIACER','KOTAKBANK','KSCL','KTKBANK','L&TFH','LICHSGFIN','LT','LUPIN','M&M','M&MFIN','MANAPPURAM','MARICO','MARUTI','MCDOWELL-N','MCX','MFSL','MGL','MINDTREE','MOTHERSUMI','MRF','MRPL','MUTHOOTFIN','NATIONALUM','NBCC','NCC','NESTLEIND','NHPC','NIFTY','NIFTYIT','NIITTECH','NMDC','NTPC','OFSS','OIL','ONGC','ORIENTBANK','PAGEIND','PCJEWELLER','PEL','PETRONET','PFC','PIDILITIND','PNB','POWERGRID','PVR','RAMCOCEM','RAYMOND','RBLBANK','RECLTD','RELCAPITAL','RELIANCE','RELINFRA','REPCOHOME','RPOWER','SAIL','SBIN','SHREECEM','SIEMENS','SOUTHBANK','SRF','SRTRANSFIN','STAR','SUNPHARMA','SUNTV','SUZLON','SYNDIBANK','TATACHEM','TATACOMM','TATAELXSI','TATAGLOBAL','TATAMOTORS','TATAMTRDVR','TATAPOWER','TATASTEEL','TCS','TECHM','TITAN','TORNTPHARM','TORNTPOWER','TV18BRDCST','TVSMOTOR','UBL','UJJIVAN','ULTRACEMCO','UNIONBANK','UPL','VEDL','VGUARD','VOLTAS','WIPRO','WOCKPHARMA','YESBANK','ZEEL'],
'LOT_SIZE': [400,4000,2500,20000,500,13000,700,2500,500,3000,2000,4000,600,1000,1200,250,125,250,800,4000,6000,20,550,6000,700,2200,500,1200,1851,7500,900,30,1800,200,2250,1600,2000,1800,3400,400,600,550,12000,1800,500,1000,2200,700,1563,700,1250,4500,1500,8000,400,2600,250,25,4100,4000,1100,2000,7000,2667,1000,45000,700,600,1500,750,4700,1000,700,500,250,200,1500,3500,2100,300,3200,500,1375,1500,10000,19868,13200,12000,35000,2750,4500,2000,600,300,4000,2000,1200,3500,3200,2400,2200,2250,9000,1500,500,1400,1300,400,1500,4700,4500,1100,375,700,1000,1250,6000,2600,75,1250,700,1200,600,600,2850,10,7000,1500,8000,8000,8000,50,27000,75,50,750,6000,4800,150,3399,3750,7000,25,6500,302,3000,6200,500,7000,4000,400,800,800,1200,6000,1500,500,1300,1100,16000,12000,3000,50,550,33141,250,600,1100,1100,1000,76000,15000,750,1000,400,2250,2000,3800,9000,1061,250,1200,750,500,3000,13000,1000,700,1600,200,7000,600,2300,3000,1000,3200,900,1750,1300]}
df1 = pd.DataFrame(dict1)
dict2 = {'SYMBOL': ['INFY', 'TATAMOTORS', 'IDBI', 'BHEL', 'LT'],
'LTP': ['55', '66', '77', '88', '99'],
'PRICE': ['0.25', '0.36', '0.12', '0.28', '0.85']}
df2 = pd.DataFrame(dict2)
print(df1,'\n\n')
print(df2,'\n\n')
# look up each symbol's lot size in df1 and attach it to df2
df2['LOT_SIZE']=df2[['SYMBOL']].merge(df1,how='left').LOT_SIZE
print(df2)

Python - How to retrieve certain text from a website

I have the following code:
import requests
from bs4 import BeautifulSoup
import urllib.request
import urllib.parse
import re
market = 'INDU:IND'
quote_page = 'http://www.bloomberg.com/quote/' + market
page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find('h1', attrs={'class': 'name'})
name = name_box.text.strip()
print('Market: ' + name)
This code works and lets me get the market name from the URL. I'm trying to do something similar with a Yahoo Finance page. Here is my code:
market = 'BTC-GBP'
quote_page = 'https://uk.finance.yahoo.com/quote/' + market
page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find('span', attrs={'class': 'Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)'})
name = name_box.text.strip()
print('Market: ' + name)
I'm not sure what to do. I want to retrieve the current rate, the amount it has increased/decreased by as a number and a percentage, and finally when the information was updated. How do I do this? I don't mind if you use a different method from the one I used previously, as long as you explain it. If my code is inefficient or unpythonic, could you also tell me how to fix it? I'm pretty new to web scraping and these modules. Thanks!
You can use BeautifulSoup and when searching for the desired data, use regex to match the dynamic span classnames generated by the site's backend script:
from bs4 import BeautifulSoup as soup
import requests
import re
data = requests.get('https://uk.finance.yahoo.com/quote/BTC-GBP').text
s = soup(data, 'lxml')
d = [i.text for i in s.find_all('span', {'class':re.compile('Trsdu\(0\.\d+s\) Trsdu\(0\.\d+s\) Fw\(\w+\) Fz\(\d+px\) Mb\(-\d+px\) D\(\w+\)|Trsdu\(0\.\d+s\) Fw\(\d+\) Fz\(\d+px\) C\(\$data\w+\)')})]
date_published = re.findall('As of\s+\d+:\d+PM GMT\.|As of\s+\d+:\d+AM GMT\.', data)
final_results = dict(zip(['current', 'change', 'published'], d+date_published))
Output:
{'current': u'6,785.02', 'change': u'-202.99 (-2.90%)', 'published': u'As of 3:55PM GMT.'}
Edit: given the new URL, you need to change the span classname:
data = requests.get('https://uk.finance.yahoo.com/quote/AAPL?p=AAPL').text
final_results = dict(zip(['current', 'change', 'published'], [i.text for i in soup(data, 'lxml').find_all('span', {'class':re.compile('Trsdu\(0\.\d+s\) Trsdu\(0\.\d+s\) Fw\(b\) Fz\(\d+px\) Mb\(-\d+px\) D\(b\)|Trsdu\(0\.\d+s\) Fw\(\d+\) Fz\(\d+px\) C\(\$data\w+\)')})] + re.findall('At close:\s+\d:\d+PM EST', data)))
Output:
{'current': u'175.50', 'change': u'+3.00 (+1.74%)', 'published': u'At close: 4:00PM EST'}
You can directly use the API provided by Yahoo Finance.
For reference, check this answer:
Yahoo finance webservice API
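As a rough sketch of that approach (this assumes the unofficial query1.finance.yahoo.com chart endpoint and its response layout are still available; Yahoo may change or restrict it at any time):
import requests

url = 'https://query1.finance.yahoo.com/v8/finance/chart/BTC-GBP'
# some Yahoo endpoints reject requests without a browser-like User-Agent
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
meta = r.json()['chart']['result'][0]['meta']
# current price and the timestamp of the last update
print(meta['regularMarketPrice'], meta.get('regularMarketTime'))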

Findall to div tag using beautiful soup yields blank return

<div class="columns small-5 medium-4 cell header">Ref No.</div>
<div class="columns small-7 medium-8 cell">110B60329</div>
Website is https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results
I would like to run a loop and return '110B60329'. I have run Beautiful Soup and done a find_all(div); I then define the two different tags as head and data based on their class. I then iterate through the 'head' tags, hoping it would return the info in the div tag I have defined as data.
Python returns a blank (the cmd prompt just reprints the file path).
Would anyone kindly know how I might fix this? My full code is below, thanks.
import requests
from bs4 import BeautifulSoup as soup
import csv
url = 'https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results'
baseurl = 'https://www.saa.gov.uk'
session = requests.session()
response = session.get(url)
# content of search page in soup
html= soup(response.content,"lxml")
properties_col = html.find_all('div')
for col in properties_col:
    ref = 'n/a'
    des = 'n/a'
    head = col.find_all("div",{"class": "columns small-5 medium-4 cell header"})
    data = col.find_all("div",{"class":"columns small-7 medium-8 cell"})
    for i,elem in enumerate(head):
        #for i in range(elems):
        if head[i].text == "Ref No.":
            ref = data[i].text
            print(ref)
You can do this in two ways.
1) If you are sure that the website you are scraping won't change its content, you can find all divs by that class and get the content by providing an index.
2) Find all left-side divs (the titles) and, if one of them matches what you want, get the next sibling to get the text.
Example:
import requests
from bs4 import BeautifulSoup as soup
url = 'https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results'
baseurl = 'https://www.saa.gov.uk'
session = requests.session()
response = session.get(url)
# content of search page in soup
html = soup(response.content,"lxml")
#Method 1
LeftBlockData = html.find_all("div", class_="columns small-7 medium-8 cell")
Reference = LeftBlockData[0].get_text().strip()
Description = LeftBlockData[2].get_text().strip()
print(Reference)
print(Description)
#Method 2
for column in html.find_all("div", class_="columns small-5 medium-4 cell header"):
    RightColumn = column.next_sibling.next_sibling.get_text().strip()
    if "Ref No." in column.get_text().strip():
        print (RightColumn)
    if "Description" in column.get_text().strip():
        print (RightColumn)
The prints will output (in order):
110B60329
STORE
110B60329
STORE
Your problem is that you are trying to match node text that contains a lot of tabs and whitespace against a non-spaced string.
For example, your head[i].text variable contains "Ref No." surrounded by whitespace, so if you compare it with the bare string "Ref No." it will give a false result. Stripping it will solve that.
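A one-line fix in your own loop would be to strip before comparing, e.g. (using your variable names):
if head[i].text.strip() == "Ref No.":
    ref = data[i].text.strip()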
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results")
soup = BeautifulSoup(r.text, 'lxml')
for row in soup.find_all(class_='table-row'):
    print(row.get_text(strip=True, separator='|').split('|'))
out:
['Ref No.', '110B60329']
['Office', 'LOTHIAN VJB']
['Description', 'STORE']
['Property Address', '29 BOSWALL PARKWAY', 'EDINBURGH', 'EH5 2BR']
['Proprietor', 'SCOTTISH MIDLAND CO-OP SOCIETY LTD.']
['Tenant', 'PROPRIETOR']
['Occupier']
['Net Annual Value', '£1,750']
['Marker']
['Rateable Value', '£1,750']
['Effective Date', '01-APR-10']
['Other Appeal', 'NO']
['Reval Appeal', 'NO']
get_text() is a very powerful tool: you can strip the whitespace and put a separator in the text.
You can use this method to get clean data and filter it.
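As a small self-contained illustration of the same idea (the markup here is invented for the example):
from bs4 import BeautifulSoup

html = '<div class="table-row"><div> Ref No. </div><div>110B60329</div></div>'
row = BeautifulSoup(html, 'html.parser').find(class_='table-row')
print(row.get_text(strip=True, separator='|').split('|'))
# ['Ref No.', '110B60329']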
