Webscrape stock data - python

I am trying to webscrape the sector weights and holdings from yahoo finance for a given etf/mutual fund. I am having trouble identifying what to find when using BeautifulSoup. For example:
import bs4 as bs
import urllib.request
ticker='SPY'
address=('https://finance.yahoo.com/quote/'+ticker+
'/holdings?p='+ticker)
source = urllib.request.urlopen(address).read()
soup = bs.BeautifulSoup(source,'lxml')
sector_weights = soup.find()
I can read the address fine and when I inspect the website, the section I want highlighted is:
<div class="MB(25px) " data-reactid="18">
But when I try soup.find_all('div', class_='MB(25px) ') it returns an empty list.
I would like to do the same thing for holdings, but the same issue came about.
P.S. If anybody knows of any good website to scrape region information from, that would be much appreciated; Morningstar does not work, sadly.

'MB(25px) ' should be 'Mb(25px)'
The class name is case-sensitive, and you need to remove the trailing space from the literal: BeautifulSoup splits the class attribute on whitespace and matches your string against each class token exactly. Your code works when I make those two changes.
I also had to remove your parser reference and let BeautifulSoup use the default parser, html.parser, because the code crashed when I used your parser reference (presumably lxml is not installed on my machine).
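To see why both the case and the trailing space matter, here is a small self-contained check against a made-up snippet in the shape of the Yahoo markup (the real page serves the class as Mb(25px)):

```python
import bs4 as bs

# Hypothetical stand-in for the relevant part of the Yahoo holdings page.
html = '<div class="Mb(25px) "><span>Basic Materials 2.50%</span></div>'
soup = bs.BeautifulSoup(html, 'html.parser')

# BeautifulSoup splits the class attribute on whitespace, so the element's
# only class token is 'Mb(25px)' -- the trailing space is not part of it.
print(soup.find_all('div', class_='MB(25px) '))     # [] -- wrong case and trailing space
print(soup.find_all('div', class_='Mb(25px) '))     # [] -- trailing space still wrong
print(len(soup.find_all('div', class_='Mb(25px)'))) # 1  -- exact token matches
```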

Related

Unable to extract text from website containing a filter

I'm trying to get all the locations out of the following website (www.mars.com/locations) using Python, with Requests and BeautifulSoup.
The website has a filter to select continent, country and region, so that it will display only the locations the company has in the selected area. They also include their headquarters at the bottom of the page, and this information is always there regardless of the filter applied.
I have no problem extracting the data for the headquarters using the code below:
import requests
from bs4 import BeautifulSoup
url = 'https://www.mars.com/locations'
page = requests.get(url)
soup = BeautifulSoup(page.text,'html.parser')
HQ = soup.find('div', class_='global-headquarter pr-5 pl-3').text.strip()
print(HQ)
The output of the code is:
Mars, Incorporated (Global Headquarters)
6885 Elm Street
McLean
Virginia
22101
+1(703) 821-4900
I want to do the same for all other locations, but I'm struggling to extract the data using the same approach (adjusting the path, of course). I've tried everything and I'm out of ideas. Would really appreciate someone giving me a hand or at least pointing me in the right direction.
Thanks a lot in advance!
All the location data can be retrieved from the page in text form, and decomposing that string is one way to do it. I'm not an expert in this field, so I can't help you much further than this.
content_json = soup.find('div', class_='location-container')
data = content_json['data-location']
I'm not an expert in BeautifulSoup, so I'll use parsel to get the data. All the locations are embedded in elements with the location-container CSS class, each carrying a data-location attribute.
import requests
from parsel import Selector
url = 'https://www.mars.com/locations'
response = requests.get(url).text
selector = Selector(text=response)
data = selector.css(".location-container").xpath("./@data-location").getall()
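If you want to stay with BeautifulSoup, the same idea works: find every location-container element, read its data-location attribute, and decode it with json.loads. The snippet below runs against a made-up fragment in the page's general shape; the real attribute contents may differ:

```python
import json
from bs4 import BeautifulSoup

# Hypothetical stand-in for one location block on www.mars.com/locations.
html = '''
<div class="location-container"
     data-location='{"name": "Mars Austria", "city": "Breitenbrunn", "country": "Austria"}'>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Each data-location attribute is a JSON string; decode them all into dicts.
locations = [json.loads(div['data-location'])
             for div in soup.find_all('div', class_='location-container')]
print(locations[0]['country'])  # Austria
```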

BeautifulSoup: Selector not extracting the right data - Yahoo Scrape

I'm trying to extract the text from an element whose class value contains compText. The problem is that it extracts everything but the text that I want.
The CSS selector identifies the element correctly when I use it in the developer tools.
I'm trying to scrape the text that appears in Yahoo SERP when the query entered doesn't have results.
If my query is (quotes included) "klsf gl glkjgsdn lkgsdg", nothing is displayed except the complementary text "We did not find results blabla", and the selector extracts the data correctly.
If my query is (quotes included) "based specialty. Blocks. Organosilicone. Reference", Yahoo adds ads because of the keyword "Organosilicone", and that triggers the behavior described in the first paragraph.
Here is the code:
import requests
from bs4 import BeautifulSoup
url = "http://search.yahoo.com/search?p="
query = '"based specialty chemicals. Blocks. Organosilicone. Reference"'
r = requests.get(url + query)
soup = BeautifulSoup(r.text, "html.parser")
for EachPart in soup.select('div[class*="compText"]'):
    print(EachPart.text)
What could be wrong?
Thx,
EDIT: The text extracted seems to be the definition of the word "Organosilicone", which I can find on the SERP.
EDIT2: This is a snippet of the text I get: "The products created and produced by ‘Specialty Chemicals’ member companies, many of which are Small and Medium Enterprises, stem from original and continuous innovation. They drive the low-carbon, resource-efficient and knowledge based economy of the future." and a screenshot of the SERP when I use my browser
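For reference, [class*="compText"] performs a substring match on the raw class attribute, so it will also pick up any ad or dictionary panels whose class happens to contain compText. A self-contained illustration (the class names below are invented, not Yahoo's real ones):

```python
from bs4 import BeautifulSoup

# Invented markup mimicking a SERP with a "no results" note and an ad panel.
html = '''
<div class="compText mb-10">We did not find results for your query.</div>
<div class="dd compText-ad">Organosilicone: a silicone-based compound...</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Both divs match: the substring selector cannot tell the two apart.
matches = soup.select('div[class*="compText"]')
for part in matches:
    print(part.text)
```

Narrowing the selector (an exact class name, or anchoring on a parent container) is usually needed to skip the ad blocks.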

Python Regex Match Line If Ends With?

This is what I'm trying to scrape:
<p>Some.Title.html<br />
https://www.somelink.com/yep.html<br />
Some.Title.txt<br />
https://www.somelink.com/yeppers.txt<br />
I have tried several variations of the following:
match = re.compile('^(.+?)<br \/><a href="https://www.somelink.com(.+?)">',re.DOTALL).findall(html)
I am looking to match lines both with and without the "p" tag; the "p" tag only occurs on the first instance. I'm terrible at Python and pretty rusty; I have searched through here and Google and nothing seemed to be quite the same. Thanks for any help. I really do appreciate the help I get here when I am stuck.
Desired output is an index:
http://www.SomeLink.com/yep.html
http://www.SomeLink.com/yeppers.txt
Using the Beautiful soup and requests module would be perfect for something like this instead of regex as the commenters noted above.
import requests
import bs4
html_site = 'https://www.google.com' # or whatever site you need scraped
site_data = requests.get(html_site) # downloads the page into a requests response
site_parsed = bs4.BeautifulSoup(site_data.text, 'html.parser') # parses the page text into a bs4 object
a_tags = site_parsed.select('a') # selects all 'a' tags and returns a list of them
This is just a simple bit of code that selects all the 'a' tags from the page and stores them in a list, in the format you illustrated above. I'd advise checking here for a nice tutorial on bs4 and here for the actual docs.
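Applied to the markup in the question, that looks roughly like the snippet below. Note the anchor tags are an assumption: the question shows the URLs as bare text, but the regex attempt references an `<a href=...>` tag, so the real page presumably wraps each link in an anchor:

```python
import bs4

# Reconstructed from the question: titles separated by <br /> with anchor links.
html = '''<p>Some.Title.html<br />
<a href="https://www.somelink.com/yep.html">https://www.somelink.com/yep.html</a><br />
Some.Title.txt<br />
<a href="https://www.somelink.com/yeppers.txt">https://www.somelink.com/yeppers.txt</a><br /></p>'''

site_parsed = bs4.BeautifulSoup(html, 'html.parser')

# Collect every href into a list -- the "index" the question asks for.
links = [a['href'] for a in site_parsed.select('a[href]')]
print(links)
```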

How to scrape Price from a site that has a changing structure?

I want to scrape the pricing data from an eCommerce site called Flipkart. I tried using BeautifulSoup with casperjs (a Node.js utility) and similar libraries, but none of them was good enough.
Here's the URL and the structure.
https://www.flipkart.com/redmi-note-4-gold-32-gb/p/itmer37fmekafqct?
The problem is the layout... What are some ways to get around this?
P.S.: Is there any way I could apply machine learning to get the pricing data without knowing complex math? Where do I even start?
You should probably construct your XPath in a way that does not rely on the class, but rather on the content (node()) of the element you want to match. Alternatively, you could match the data-reactid, if that doesn't change?
For matching the div by data-reactid:
//div[@data-reactid=220]
Or for matching the div based on its location:
//span[child::img[@src="//img1a.flixcart.com/www/linchpin/fk-cp-zion/img/fa_8b4b59.png"]]/preceding-sibling::div
Assuming the image path doesn't change, you're on the safe side.
Since you can't rely on XPath due to the dynamically changing class names, you could try a regex to find the price in the script tag on the page.
Something like this:
import requests
import re
url = "https://www.flipkart.com/redmi-note-4-gold-32-gb/p/itmer37fmekafqct"
r = requests.get(url)
pattern = re.compile(r'prexoAvailable":[\w]+,"price":(\d+)')
result = pattern.search(r.text)
print(result.group(1))
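You can check the pattern offline against a captured fragment of the page's inline JSON. The fragment below is a made-up stand-in for Flipkart's embedded state, just to show the regex mechanics:

```python
import re

# Hypothetical excerpt of the JSON blob embedded in a <script> tag on the page.
page_text = '{"prexoAvailable":false,"price":10999,"mrp":12999}'

# Same pattern as the answer above, written as a raw string.
pattern = re.compile(r'prexoAvailable":[\w]+,"price":(\d+)')
result = pattern.search(page_text)
print(result.group(1))  # 10999
```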
import requests
from bs4 import BeautifulSoup
url = 'https://www.flipkart.com/redmi-note-4-gold-32-gb/p/itmer37fmekafqct'
headers = {'User-Agent': 'Mozilla/5.0'} # a browser-like User-Agent; many sites block the default one
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
for a in soup.findAll('a', href=True, attrs={'class': '_31qSD5'}):
    price = a.find('div', attrs={'class': '_1vC4OE _2rQ-NK'})
    print(price.text)
E-commerce sites no longer allow you to scrape data like before: every part of the product page, such as the price, specification, and reviews, is now enclosed in a separate "dynamic" class name.
To scrape a given piece of data from the page you need its specific class name, and since that name changes dynamically, a plain requests.get() plus soup() won't work reliably.

Scrape web with a query

I am trying to scrape the impact factors of journals from a particular website, or from the web at large. I have been searching for something close, but with no luck.
This is the first time I am trying to web scrape with Python, and I am trying to find the simplest way.
I have a list of ISSN numbers belonging to journals, and I want to retrieve their impact factor values from the web or from a particular site. The list has more than 50K values, so manually searching for the values is impractical.
Input type
Index,JOURNALNAME,ISSN,Impact Factor 2015,URL,ABBV,SUBJECT
1,4OR-A Quarterly Journal of Operations Research,1619-4500,,,4OR Q J OPER RES,Management Science
2,Aaohn Journal,0891-0162,,,AAOHN J,
3,Aapg Bulletin,0149-1423,,,AAPG BULL,Engineering
4,AAPS Journal,1550-7416,,,AAPS J,Medicine
5,Aaps Pharmscitech,1530-9932,,,AAPS PHARMSCITECH,
6,Aatcc Review,1532-8813,,,AATCC REV,
7,Abdominal Imaging,0942-8925,,,ABDOM IMAGING,
8,Abhandlungen Aus Dem Mathematischen Seminar Der Universitat Hamburg,0025-5858,,,ABH MATH SEM HAMBURG,
9,Abstract and Applied Analysis,1085-3375,,,ABSTR APPL ANAL,Math
10,Academic Emergency Medicine,1069-6563,,,ACAD EMERG MED,Medicine
What is needed ?
The input above has a column of ISSN numbers. Read the ISSN numbers and search for each one on researchgate.net or on the web in general. When the individual page is found, search it for the Impact Factor 2015, put the value in the empty column beside the ISSN number, and place the retrieved URL next to it.
That way the web search can also be limited to one site and one keyword search per value; empty results can be kept as "NAN".
Thanks in advance for the suggestions and help
Try this code using Beautiful Soup and urllib. I search the h2 tags for 'Journal Impact:', but I will let you decide on the algorithm for extracting the data. The HTML content is held in soup, and soup provides an API to extract it. What I provide is an example, and it may work for you.
#!/usr/bin/env python
import urllib.request
from bs4 import BeautifulSoup

issn = '0219-5305'
url = 'https://www.researchgate.net/journal/%s_Analysis_and_Applications' % (issn)
htmlDoc = urllib.request.urlopen(url).read()
soup = BeautifulSoup(htmlDoc, 'html.parser')
for tag in soup.find_all('h2'):
    if 'Journal Impact:' in tag.text:
        value = tag.text
        value = value.replace('Journal Impact:', '')
        value = value.strip(' *')
        print(value)
Output:
1.13
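To scale this up to the 50K-row CSV, one approach is to read the file with the csv module and fill in the two empty columns row by row. In the sketch below, lookup_impact_factor is a hypothetical placeholder for whatever per-ISSN fetch you settle on (for example, the request above), and its fake_db values are made up purely for illustration; rows where the lookup fails get "NAN":

```python
import csv
import io

def lookup_impact_factor(issn):
    """Placeholder for a real per-ISSN lookup (e.g. the ResearchGate fetch above).

    Returns (impact_factor, url), or (None, None) when nothing is found.
    The values below are invented for illustration only.
    """
    fake_db = {'1550-7416': ('2.75', 'https://www.researchgate.net/journal/AAPS-Journal')}
    return fake_db.get(issn, (None, None))

# Two sample rows in the input format from the question.
raw = '''Index,JOURNALNAME,ISSN,Impact Factor 2015,URL,ABBV,SUBJECT
1,4OR-A Quarterly Journal of Operations Research,1619-4500,,,4OR Q J OPER RES,Management Science
4,AAPS Journal,1550-7416,,,AAPS J,Medicine'''

rows = []
for row in csv.DictReader(io.StringIO(raw)):
    impact, url = lookup_impact_factor(row['ISSN'])
    row['Impact Factor 2015'] = impact if impact else 'NAN'  # keep empties as "NAN"
    row['URL'] = url if url else 'NAN'
    rows.append(row)

print(rows[0]['Impact Factor 2015'])  # NAN
print(rows[1]['Impact Factor 2015'])  # 2.75
```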
I think the official documentation for Beautiful Soup is pretty good. If you are new to this, I suggest spending an hour on the documentation before even trying to write code. That hour of reading will save you many more hours later.
https://www.crummy.com/software/BeautifulSoup/
https://www.crummy.com/software/BeautifulSoup/bs4/doc/