I am trying to scrape the impact factors of journals from a particular website, or from the web in general, and I have been searching for something close without luck.
This is the first time I am trying to web scrape with Python, and I am looking for the simplest way.
I have a list of ISSN numbers belonging to journals, and I want to retrieve their impact factor values from the web or from a particular site. The list has more than 50K values, so searching for the values manually is impractical.
Input type
Index,JOURNALNAME,ISSN,Impact Factor 2015,URL,ABBV,SUBJECT
1,4OR-A Quarterly Journal of Operations Research,1619-4500,,,4OR Q J OPER RES,Management Science
2,Aaohn Journal,0891-0162,,,AAOHN J,
3,Aapg Bulletin,0149-1423,,,AAPG BULL,Engineering
4,AAPS Journal,1550-7416,,,AAPS J,Medicine
5,Aaps Pharmscitech,1530-9932,,,AAPS PHARMSCITECH,
6,Aatcc Review,1532-8813,,,AATCC REV,
7,Abdominal Imaging,0942-8925,,,ABDOM IMAGING,
8,Abhandlungen Aus Dem Mathematischen Seminar Der Universitat Hamburg,0025-5858,,,ABH MATH SEM HAMBURG,
9,Abstract and Applied Analysis,1085-3375,,,ABSTR APPL ANAL,Math
10,Academic Emergency Medicine,1069-6563,,,ACAD EMERG MED,Medicine
What is needed?
The input above has a column of ISSN numbers. Read each ISSN number and search for it on researchgate.net or on the web. When the individual journal page is found, look up the Impact Factor 2015 value, put it in the empty field beside the ISSN number, and place the retrieved URL next to it.
The web search can be limited to one site and one keyword per value; empty fields can be left as "NAN".
Thanks in advance for the suggestions and help
Try this code using Beautiful Soup and urllib2. I am using the h2 tag and searching for 'Journal Impact:', but I will let you decide on the algorithm to extract the data. The HTML content is held in soup, and soup provides an API to extract it. What I provide is an example, and it may work for you.
#!/usr/bin/env python
import urllib2
from bs4 import BeautifulSoup
issn = '0219-5305'
url = 'https://www.researchgate.net/journal/%s_Analysis_and_Applications' % (issn)
htmlDoc = urllib2.urlopen(url).read()
soup = BeautifulSoup(htmlDoc, 'html.parser')
for tag in soup.find_all('h2'):
    if 'Journal Impact:' in tag.text:
        value = tag.text
        value = value.replace('Journal Impact:', '')
        value = value.strip(' *')
        print value
Output:
1.13
I think the official documentation for Beautiful Soup is pretty good. If you are new to this, I suggest spending an hour on the documentation before you even try to write any code. That hour spent reading the documentation will save you many more hours later.
https://www.crummy.com/software/BeautifulSoup/
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
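To apply this to your whole CSV, here is a rough Python 3 sketch using requests and the csv module. The file names are placeholders, and note that researchgate journal URLs contain the journal name rather than just the ISSN, so building the URL automatically is an assumption you will have to adjust (for example by first doing a site-limited search per journal).
import csv
import requests
from bs4 import BeautifulSoup

def impact_factor(url):
    """Return the 'Journal Impact:' value from a researchgate journal page, or 'NAN'."""
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    for tag in soup.find_all('h2'):
        if 'Journal Impact:' in tag.text:
            return tag.text.replace('Journal Impact:', '').strip(' *')
    return 'NAN'

# 'journals.csv' / 'journals_filled.csv' are placeholder file names.
with open('journals.csv') as src, open('journals_filled.csv', 'w', newline='') as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # Assumed URL pattern: researchgate journal pages include the journal
        # name as a slug, so this guess will not resolve for every journal.
        slug = row['JOURNALNAME'].replace(' ', '_')
        url = 'https://www.researchgate.net/journal/%s_%s' % (row['ISSN'], slug)
        row['Impact Factor 2015'] = impact_factor(url)
        row['URL'] = url
        writer.writerow(row)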
Related
I am currently trying to develop an unbiased rating system for NBA players over the course of the season in R, and one very important piece of information I am missing is the "splits" section for each player, where I can see how many wins his team has been involved in. For example, Darius Garland played in 68 games last season, winning 37 of them.
What I need is a csv file with 2 columns: the number of wins and the "code" of the player (for example, Garland's code is garlada01). I need to join it with the other table I already have in a csv file, joining these 2 data frames by the same key in R, and this "code" is the perfect key for that.
Do you have any idea or guidance on how to do this? I have never done web scraping before and my Python knowledge is not that good yet.
This would best be done using BeautifulSoup, and would look something like this.
import requests
from bs4 import BeautifulSoup
url = '' #Use whatever URL you're scraping from
r = requests.get(url)
if r.status_code != 200:
    print("Could not connect to webpage")
    quit()
soup = BeautifulSoup(r.content, 'html.parser')
Now that you have the BeautifulSoup object, you can parse the HTML that you got from the webpage and look for the specific tags that contain the data you're looking for (I can't say what those are; you would have to figure them out).
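The player code format (garlada01) looks like a Basketball-Reference ID, so assuming you scrape a per-player table there, a rough sketch could look like the following. The URL, table id, and data-stat attribute names are assumptions; inspect the page HTML and adjust them.
import csv
import requests
from bs4 import BeautifulSoup

# Placeholder URL and selectors -- verify them against the real page.
url = 'https://www.basketball-reference.com/...'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

rows = []
table = soup.find('table', id='players_table')             # assumed table id
for tr in table.find('tbody').find_all('tr'):
    code_cell = tr.find('td', {'data-stat': 'player'})     # assumed cell attribute
    wins_cell = tr.find('td', {'data-stat': 'wins'})       # assumed cell attribute
    if code_cell is None or wins_cell is None:
        continue
    # The player code is often stored in an attribute of the player cell;
    # fall back to the visible text if that attribute does not exist.
    code = code_cell.get('data-append-csv') or code_cell.text
    rows.append({'code': code, 'wins': wins_cell.text})

with open('player_wins.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['code', 'wins'])
    writer.writeheader()
    writer.writerows(rows)
From there, read player_wins.csv into R and join it to your existing data frame on the code column.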
Some good references:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
How to find elements by class
I'm trying to get all the locations out of the following website (www.mars.com/locations) using Python, with Requests and BeautifulSoup.
The website has a filter to select continent, country and region, so that it will display only the locations the company has in the selected area. They also include their headquarters at the bottom of the page, and this information is always there regardless of the filter applied.
I have no problem extracting the data for the headquarters using the code below:
import requests
from bs4 import BeautifulSoup
url = 'https://www.mars.com/locations'
page = requests.get(url)
soup = BeautifulSoup(page.text,'html.parser')
HQ = soup.find('div', class_='global-headquarter pr-5 pl-3').text.strip()
print(HQ)
The output of the code is:
Mars, Incorporated (Global Headquarters)
6885 Elm Street
McLean
Virginia
22101
+1(703) 821-4900
I want to do the same for all other locations, but I'm struggling to extract the data using the same approach (adjusting the path, of course). I've tried everything and I'm out of ideas. Would really appreciate someone giving me a hand or at least pointing me in the right direction.
Thanks a lot in advance!
All the location data is available on the page in text form: it sits in a data-location attribute, so pulling that attribute out as a string is one way to do it. I'm not an expert in this field, so I can't help you beyond that.
content_json = soup.find('div', class_='location-container')
data = content_json['data-location']
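If the data-location attribute holds JSON (an assumption worth verifying in the page source), a minimal sketch to decode and loop over it could look like this:
import json
import requests
from bs4 import BeautifulSoup

url = 'https://www.mars.com/locations'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

content_json = soup.find('div', class_='location-container')
locations = json.loads(content_json['data-location'])   # assumes the attribute is valid JSON

# The structure of each entry is unknown here; print one to see the real keys.
for loc in locations:
    print(loc)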
I'm not an expert in BeautifulSoup, so I'll use parsel to get the data. All the locations are embedded in a location-container CSS class, with a data-location attribute.
import requests
from parsel import Selector
url = 'https://www.mars.com/locations'
response = requests.get(url).text
selector = Selector(text=response)
data = selector.css(".location-container").xpath("./@data-location").getall()
I am trying to webscrape the sector weights and holdings from yahoo finance for a given etf/mutual fund. I am having trouble identifying what to find when using BeautifulSoup. For example:
import bs4 as bs
import urllib.request
ticker='SPY'
address=('https://finance.yahoo.com/quote/'+ticker+
'/holdings?p='+ticker)
source = urllib.request.urlopen(address).read()
soup = bs.BeautifulSoup(source,'lxml')
sector_weights = soup.find()
I can read the address fine and when I inspect the website, the section I want highlighted is:
<div class="MB(25px) " data-reactid="18">
But when I try soup.find_all('div', class_='MB(25px) ') it returns an empty list.
I would like to do the same thing for holdings, but the same issue came about.
P.S. If anybody knows of a good website to scrape region information from, that would be much appreciated; morningstar does not work, sadly.
'MB(25px) ' should be 'Mb(25px)'.
The name is case-sensitive and you need to remove the trailing space in the literal. Your code works when I make those two changes.
I also had to remove your parser reference and let BeautifulSoup use the default parser, html.parser, because the code crashed when I used your parser reference.
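A minimal sketch of the corrected lookup, applying the two changes above (default parser and the case-sensitive class name without the trailing space):
import bs4 as bs
import urllib.request

ticker = 'SPY'
address = ('https://finance.yahoo.com/quote/' + ticker +
           '/holdings?p=' + ticker)
source = urllib.request.urlopen(address).read()
soup = bs.BeautifulSoup(source, 'html.parser')              # default parser
sector_weights = soup.find_all('div', class_='Mb(25px)')    # corrected class name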
I'm a newbie in Python programming and I am facing the following issue:
Objective: I need to scrape the Freelancer website and store the list of users, along with their attributes (score, ratings, reviews, details, rate, etc.), into a file.
I have the following code, but I am not able to get all the users.
Also, the output changes between runs of the program.
import requests
from bs4 import BeautifulSoup
pages = 1
fileWriter =open('freelancers.txt','w')
url = 'https://www.freelancer.com/freelancers/skills/all/'+str(pages)+'/'
r = requests.get(url)
#gets the html contents and stores them into soup object
soup = BeautifulSoup(r.content)
links = soup.findAll("a")
#Finds the freelancer-details nodes and stores the html content into c_data
c_data = soup.findAll("div", {"class":"freelancer-details"})
for item in c_data:
    print item.text
    #Writes the result into text file
    fileWriter.write('Freelancers Details:'+item.text+'\t')
I need the details grouped under each individual user, but so far the output looks dispersed.
Sample Output:
Freelancers Details:
thetechie13
507 Reviews
$20 USD/hr
Top Skills:
Website Design,
HTML,
PHP,
eCommerce,
Volusion
Dear Customer - We are a team of 75 Most Creative People and proud to be
Preferred Freelancer on Freelancer.com. We offer wide range of web
solutions and IT services that are bespoke in nature, can best fit our
clients' business needs and provide them cost benefits.
If you want each individual text component on its own (each assigned a different name), I would advise you to parse the text from the HTML separately. However, if you want it all grouped together, you could join the strings:
print ' '.join(item.text.split())
This will place a single space between each word.
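Applied to the original script, writing one normalized line per freelancer could look like this (a sketch of the same approach, not the only way to do it):
import requests
from bs4 import BeautifulSoup

pages = 1
url = 'https://www.freelancer.com/freelancers/skills/all/' + str(pages) + '/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

with open('freelancers.txt', 'w') as fileWriter:
    for item in soup.findAll("div", {"class": "freelancer-details"}):
        # collapse all whitespace so each freelancer ends up on one line
        normalized = ' '.join(item.text.split())
        fileWriter.write('Freelancers Details: ' + normalized + '\n')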
I'm pretty new at this and I'm trying to figure out a way to look up a list of websites automatically. I have a very large list of companies, and essentially I'd want the algorithm to type each company into Google, click the first link (most likely the company website), and figure out whether the company matches the target industry (ice cream distributors) or has anything to do with it. The way I'd want to check for this is by seeing whether the home page contains any of the key words in a given dictionary (say 'chocolate', 'vanilla', 'ice cream', etc.). I would really appreciate some help with this - thank you so much.
I recommend using a combination of requests and lxml. To accomplish this, you could do something like the following.
import requests
from lxml.cssselect import CSSSelector
from lxml import html
Use requests or grequests to get the HTML of the Google results page for each query.
queries = ['cats', 'dogs']
responses = [requests.get('https://www.google.com/search', params={'q': q}) for q in queries]
data = [x.text for x in responses]
Parse the HTML with lxml and extract the first result link on each page.
data = [html.document_fromstring(x) for x in data]
sel = CSSSelector('h3.r a')
links = [sel(x)[0] for x in data]
Finally, grab the HTML from each of those first results.
pages = [requests.get(a.attrib['href']) for a in links]
This will give you an HTML string for each of the pages you want. From there you should be able to simply search the page HTML for the words you want. You might find a Counter helpful.
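For the keyword check, a minimal sketch using collections.Counter (the keyword list is an example, and links and pages are the variables built above):
import re
from collections import Counter

keywords = ['chocolate', 'vanilla', 'ice cream']   # example dictionary of key words

for link, page in zip(links, pages):
    text = page.text.lower()
    # count how often each keyword (including multi-word phrases) appears in the page
    counts = Counter({kw: len(re.findall(re.escape(kw), text)) for kw in keywords})
    if any(counts.values()):
        print(link.attrib['href'], dict(counts))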