Extract data from a site using Python

I am making a program that will extract the data from http://www.gujarat.ngosindia.com/
I wrote the following code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

BASE_URL = 'http://www.gujarat.ngosindia.com/'

def split_line(text):
    words = text.split()
    i = 0
    details = ''
    while i < len(words) and words[i] != 'Contact':
        i = i + 1
        if i < len(words) and words[i] == 'Contact:':
            break
    while i < len(words) and words[i] != 'Purpose':
        if words[i] == 'Purpose:':
            break
        details = details + words[i] + ' '
        i = i + 1
    print(details)

def get_ngo_detail(ngo_url):
    html = urlopen(ngo_url).read()
    soup = BeautifulSoup(html)
    table = soup.find('table', {'class': 'border3'})
    td = soup.find('td', {'class': 'border'})
    split_line(td.text)

def get_ngo_names(gujrat_url):
    html = urlopen(gujrat_url).read()
    soup = BeautifulSoup(html)
    for link in soup.findAll('div', {'id': 'mainbox'}):
        for text in link.find_all('a'):
            print(text.get_text())
            ngo_link = 'http://www.gujarat.ngosindia.com/' + text.get('href')
            get_ngo_detail(ngo_link)
            #NGO_name = text2.get_text()

a = get_ngo_names(BASE_URL)
print(a)
But when I run this script I only get the names of the NGOs and the contact person.
I also want the email, telephone number, website and purpose.

Your split_line could be improved. Imagine you have this text:
s = """Add: 3rd Floor Khemha House
Drive in Road, Opp Drive in Cinema
Ahmedabad - 380 054
Gujarat
Tel: 91-79-7457611 , 79-7450378
Email: a.mitra1#lse.ac.uk
Website: http://www.aavishkaar.org
Contact: Angha Mitra
Purpose: Economics and Finance, Micro-enterprises
Aim/Objective/Mission: To provide timely financing, management support and professional expertise ..."""
Now we can turn this into lines using s.split("\n") (split on each new line), giving a list where each item is a line:
lines = s.split("\n")
lines == ['Add: 3rd Floor Khemha House',
          'Drive in Road, Opp Drive in Cinema',
          ...]
We can define a list of the elements we want to extract, and a dictionary to hold the results:
targets = ["Contact", "Purpose", "Email"]
results = {}
And work through each line, capturing the information we want:
for line in lines:
    l = line.split(":", 1)  # split on the first colon only, so values such as URLs stay intact
    if l[0] in targets:
        results[l[0]] = l[1]
This gives me:
results == {'Contact': ' Angha Mitra',
            'Purpose': ' Economics and Finance, Micro-enterprises',
            'Email': ' a.mitra1#lse.ac.uk'}
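Putting this together, get_ngo_detail could return a dict of fields instead of printing. A minimal sketch of the idea, assuming each field sits on its own line inside the td (the target list here is an assumption based on the sample text):

from urllib.request import urlopen
from bs4 import BeautifulSoup

def get_ngo_detail(ngo_url):
    soup = BeautifulSoup(urlopen(ngo_url).read())
    td = soup.find('td', {'class': 'border'})
    results = {}
    # get_text('\n') keeps each field on its own line
    for line in td.get_text('\n').split('\n'):
        parts = line.split(':', 1)
        if len(parts) == 2 and parts[0].strip() in ('Tel', 'Email', 'Website', 'Contact', 'Purpose'):
            results[parts[0].strip()] = parts[1].strip()
    return results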

Try to split the contents of the NGO site more carefully. Note that the plain string split method cannot split on a pattern, but re.split accepts a regular expression to split by, e.g. one matching the field labels Contact:, Email:, Tel:, Website: and Purpose:.
My regular expression could be wrong, but this is the direction you should head in.
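For example, a minimal sketch using re.split, where the label list is an assumption based on the sample text above:

import re

text = td.get_text(' ')
# the capture group keeps the labels in the result list
parts = re.split(r'(Add:|Tel:|Email:|Website:|Contact:|Purpose:)', text)
# pair each label (parts[1::2]) with the text that follows it (parts[2::2])
fields = dict(zip((label.rstrip(':') for label in parts[1::2]),
                  (value.strip() for value in parts[2::2])))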

Related

My scraping code skips new lines - Scrapy

I have this code to scrape review text from IMDB. I want to retrieve the entire text of each review, but it skips everything after a new line, for example:
Saw an early screening tonight in Denver.
I don't know where to begin. So I will start at the weakest link. The
acting. Still great, but any passable actor could have been given any
of the major roles and done a great job.
The code will only retrieve
Saw an early screening tonight in Denver.
Here is my code:
import numpy as np
import pandas as pd
from scrapy import Selector
from selenium import webdriver
from selenium.webdriver.common.by import By
from tqdm import tqdm

# driver is an already-initialized Selenium WebDriver, url is the page it has loaded
reviews = driver.find_elements(By.CSS_SELECTOR, 'div.review-container')
first_review = reviews[0]
sel2 = Selector(text = first_review.get_attribute('innerHTML'))

rating_list = []
review_date_list = []
review_title_list = []
author_list = []
review_list = []
error_url_list = []
error_msg_list = []

reviews = driver.find_elements(By.CSS_SELECTOR, 'div.review-container')
for d in tqdm(reviews):
    try:
        sel2 = Selector(text = d.get_attribute('innerHTML'))
        try:
            rating = sel2.css('.rating-other-user-rating span::text').extract_first()
        except:
            rating = np.NaN
        try:
            review = sel2.css('.text.show-more__control::text').get()
        except:
            review = np.NaN
        try:
            review_date = sel2.css('.review-date::text').extract_first()
        except:
            review_date = np.NaN
        try:
            author = sel2.css('.display-name-link a::text').extract_first()
        except:
            author = np.NaN
        try:
            review_title = sel2.css('a.title::text').extract_first()
        except:
            review_title = np.NaN
        rating_list.append(rating)
        review_date_list.append(review_date)
        review_title_list.append(review_title)
        author_list.append(author)
        review_list.append(review)
    except Exception as e:
        error_url_list.append(url)
        error_msg_list.append(e)

review_df = pd.DataFrame({
    'review_date': review_date_list,
    'author': author_list,
    'rating': rating_list,
    'review_title': review_title_list,
    'review': review_list
})
Use .extract() instead of .get() to extract all the matching text nodes as a list. Then you can use join() to concatenate them into a single string.
review = sel2.css('.text.show-more__control::text').extract()
review = ' '.join(review)
output:
'For a teenager today, Dunkirk must seem even more distant than the
Boer War did to my generation growing up just after WW2. For some,
Christopher Nolan's film may be the most they will know about the
event. But it's enough in some ways because even if it doesn't show
everything that happened, maybe it goes as close as a film could to
letting you know how it felt. "Dunkirk" focuses on a number of
characters who are inside the event, living it ....'
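In recent versions of Scrapy, .getall() is the preferred alias for .extract(), so the same fix can also be written as:

review = ' '.join(sel2.css('.text.show-more__control::text').getall())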

How to perform paging to scrape quotes over several pages?

I'm looking to scrape the website 'https://quotes.toscrape.com/' and retrieve, for each quote, the author's full name, date of birth, and location of birth. There are 10 pages of quotes. To retrieve the author's date of birth and location of birth, one must follow the <a href 'about'> link next to the author's name.
Functionally speaking, I need to scrape 10 pages of quotes, follow each quote author's 'about' link to retrieve the data mentioned above, and then compile this data into a list or dict, without duplicates.
I can complete some of these tasks separately, but I am new to BeautifulSoup and Python and am having trouble putting them all together. My success so far is limited to retrieving the authors' info from the quotes on page 1; I am unable to properly assign the function's returns to a variable (other than via an erroneous in-function print statement), and unable to implement the 10-page scan. Any help is greatly appreciated.
import requests
from bs4 import BeautifulSoup

def get_author_dob(url):
    response_auth = requests.get(url)
    html_auth = response_auth.content
    auth_soup = BeautifulSoup(html_auth)
    auth_tag = auth_soup.find("span", class_="author-born-date")
    return [auth_tag.text]

def get_author_bplace(url):
    response_auth2 = requests.get(url)
    html_auth2 = response_auth2.content
    auth_soup2 = BeautifulSoup(html_auth2)
    auth_tag2 = auth_soup2.find("span", class_="author-born-location")
    return [auth_tag2.text]

url = 'http://quotes.toscrape.com/'
html = requests.get(url).content
soup = BeautifulSoup(html)
tag = soup.find_all("div", class_="quote")

def auth_retrieval(url):
    for t in tag:
        a = t.find("small", class_="author")
        author = [a.text]
        hrefs = t.a
        link = hrefs.get('href')
        link_url = url + link
        dob = get_author_dob(link_url)
        b_place = get_author_bplace(link_url)
        authorss = author + dob + b_place
        print(authorss)
I need to use 'return' in the above function to be able to assign the results to a variable, but when I do, it only returns one value. I have tried the generator route with yield but am confused about how to implement the counter when I am already iterating over tag. I am also confused about where and how to insert the 10-page scan task. Thanks in advance.
You are on the right track, but you could simplify the process a bit:
Use a while loop and check whether the next button is available to perform the paging. This also works when the number of pages is not known. You could still stop after a specific number of pages if needed.
Reduce the number of requests and scrape all available and necessary information in one go.
Picking up a bit more data than you need is not bad; you can filter it easily afterwards to reach your goal: df[['author','dob','lob']].drop_duplicates()
Store the information in a structured way, like a dict, instead of in single variables.
Example
import pandas as pd
import requests
from bs4 import BeautifulSoup

def get_author(url):
    soup = BeautifulSoup(requests.get(url).text)
    author = {
        'dob': soup.select_one('.author-born-date').text,
        'lob': soup.select_one('.author-born-location').text,
        'url': url
    }
    return author

base_url = 'http://quotes.toscrape.com'
url = base_url
quotes = []

while True:
    soup = BeautifulSoup(requests.get(url).text)
    for e in soup.select('div.quote'):
        quote = {
            'author': e.select_one('small.author').text,
            'quote': e.select_one('span.text').text
        }
        quote.update(get_author(base_url + e.a.get('href')))
        quotes.append(quote)
    if soup.select_one('li.next a'):
        url = base_url + soup.select_one('li.next a').get('href')
        print(url)
    else:
        break

pd.DataFrame(quotes)
Output

   | author | quote | dob | lob | url
0  | Albert Einstein | “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” | March 14, 1879 | in Ulm, Germany | http://quotes.toscrape.com/author/Albert-Einstein
1  | J.K. Rowling | “It is our choices, Harry, that show what we truly are, far more than our abilities.” | July 31, 1965 | in Yate, South Gloucestershire, England, The United Kingdom | http://quotes.toscrape.com/author/J-K-Rowling
...| ... | ... | ... | ... | ...
98 | Dr. Seuss | “A person's a person, no matter how small.” | March 02, 1904 | in Springfield, MA, The United States | http://quotes.toscrape.com/author/Dr-Seuss
99 | George R.R. Martin | “... a mind needs books as a sword needs a whetstone, if it is to keep its edge.” | September 20, 1948 | in Bayonne, New Jersey, The United States | http://quotes.toscrape.com/author/George-R-R-Martin
Your code is almost working and just needs a bit of refactoring.
One thing I found out is that you can access the individual pages using this URL pattern:
https://quotes.toscrape.com/page/{page_number}/
Once you've figured that out, we can take advantage of the pattern in the code:
# refactored auth_retrieval into this function for reusability
def get_page_data(base_url, tags):
    all_authors = []
    for t in tags:
        a = t.find("small", class_="author")
        author = [a.text]
        hrefs = t.a
        link = hrefs.get('href')
        link_url = base_url + link
        dob = get_author_dob(link_url)
        b_place = get_author_bplace(link_url)
        authorss = author + dob + b_place
        print(authorss)
        all_authors.append(authorss)
    return all_authors

url = 'https://quotes.toscrape.com/'  # base url for the website
total_pages = 10
all_page_authors = []
for i in range(1, total_pages + 1):  # note the + 1, so pages 1 through 10 are all covered
    page_url = f'{url}page/{i}/'  # https://quotes.toscrape.com/page/1, 2, ... 10
    print(page_url)
    page = requests.get(page_url)
    soup = BeautifulSoup(page.content, 'html.parser')
    tags = soup.find_all("div", class_="quote")
    all_page_authors += get_page_data(url, tags)  # merge all authors into one list
print(all_page_authors)
get_author_dob and get_author_bplace remain the same.
The final output will be an array of authors where each author's info is an array.
[['Albert Einstein', 'March 14, 1879', 'in Ulm, Germany'],
 ['J.K. Rowling', 'July 31, 1965', 'in Yate, South Gloucestershire, England, The United Kingdom'],
 ['Albert Einstein', 'March 14, 1879', 'in Ulm, Germany'], ...]
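Note that this output still contains duplicates (Albert Einstein appears twice), because the same author can appear under several quotes. A minimal dedup pass over the final list, assuming the list-of-lists structure shown above:

# dict.fromkeys keeps the first occurrence of each entry and preserves order
unique_authors = [list(a) for a in dict.fromkeys(tuple(a) for a in all_page_authors)]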

Regex - Extracting PubMed publications via Beautiful Soup, identify authors from my list that appear in PubMed article, and add bold HTML tags

I'm working on a project where we web-scrape PubMed research abstracts and detect whether any researchers from our organization have authorship on new publications. When we detect a match, we want to add a bold HTML tag. For example, you might see something like this in PubMed: Sanjay Gupta 1 2 3, Mehmet Oz 3 4, Terry Smith 2 4 (the numbers denote academic affiliations, which correspond to a different field, but I've left this out for simplicity). If Mehmet Oz and Sanjay Gupta were in my list, I would add a bold tag before their first name and a closing tag at the end of their name.
One of my challenges with PubMed is that authors sometimes show only their first and last name, and other times include a middle initial (e.g., Sanjay K Gupta versus just Sanjay Gupta). In my list of people, I only have first and last names. What I tried to do is import my list of names, split first and last name, and then bold them in the list of authors. The problem is that my code bolds anyone with a matching first name or anyone with a matching last name (for example, in Sanjay Smith 1 2 3, Sanjay Gupta 1 3 4, Wendy Gupta 4 5 6, Linda Oz 4, Mehmet Jones 5, Mehmet Oz 1 4 6, every name gets bolded). I realize the flaw in my code, but I'm struggling with how to get around it. Any help is appreciated.
Bottom line: I have a list of people by first name and last name; I want to find their publications in PubMed and bold their names in the author credits. PubMed sometimes has their first and last name, but sometimes also their middle initial.
To make things easier, I marked the part of my code where I need help with comments in all caps.
import time
import requests
import re
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup

all_pmids = []
out = []
base_urls = ['https://pubmed.ncbi.nlm.nih.gov/?term=sanjay+gupta&filter=years.2021-2021','https://pubmed.ncbi.nlm.nih.gov/?term=AHRQ+Diabetes+telehealth&filter=years.2016-2016', 'https://pubmed.ncbi.nlm.nih.gov/?term=mehmet+oz&filter=years.2020-2020']
author_list = ['Mehmet Oz', 'Sanjay Gupta', 'Ken Jeong', 'Susie Bates', 'Vijay Singh', 'Cynthia Berg']

for search_url in base_urls:
    response = requests.get(search_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    pmids = soup.find_all('span', {'class' : 'docsum-pmid'})
    for p in pmids:
        p = p.get_text()
        all_pmids.append(p) if p not in all_pmids else print(p + ' already in list, skipping')

for pmid in all_pmids:
    url = 'https://pubmed.ncbi.nlm.nih.gov/' + pmid
    response2 = requests.get(url)
    soup2 = BeautifulSoup(response2.content, 'html.parser')
    title = soup2.select('h1.heading-title')[0].text.strip() if soup2.find(class_='item-list') is not None else ''
    #THIS IS THE START OF THE SECTION I NEED HELP WITH
    authors = soup2.find(class_='authors-list').get_text(' ') if soup2.find(class_='authors-list') is not None else ''
    authors = authors.rstrip() if soup2.find(class_='authors-list') is not None else ''
    authors = " ".join(authors.split()) if soup2.find(class_='authors-list') is not None else ''
    for au in author_list:
        au_l = au.split()[1] + ' '
        au_f = au.split()[0] + ' '
        authors = re.sub(au_f, '<b>' + au_f, authors) if '<b>' + au_f not in authors else authors
        authors = re.sub(au_l, au_l + '</b>', authors) if '</b>' + au_l not in authors else authors
    #THIS IS THE END OF THE SECTION I NEED HELP WITH
    data = {'title': title, 'authors': authors}
    time.sleep(5)
    out.append(data)

df = pd.DataFrame(out)
df.to_excel('my_output.xlsx')
Here is the modification needed for the section you want help with.
The algorithm:
Create a list of authors by splitting on ,
For each au in author_list, check whether au_l and au_f are both present in the same author entry.
If true, wrap that entry in <b> tags.
#THIS IS THE START OF THE SECTION I NEED HELP WITH
authors = None
if (authors_html := soup2.find(class_='authors-list')):
    authors = authors_html.get_text(' ')
if not authors:
    continue
authors = " ".join(authors.rstrip().split()).split(",")
for au in author_list:
    au_f, au_l = au.split()
    for i in range(len(authors)):
        if au_f in authors[i] and au_l in authors[i]:
            authors[i] = f"<b>{authors[i]}</b>"
#THIS IS THE END OF THE SECTION I NEED HELP WITH
data = {'title': title, 'authors': ",".join(authors)}
Also, made some minor updates to improve readability.
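If you want the tags around the name itself rather than around the whole author entry, a regular expression with an optional middle initial is another option. A sketch, assuming names follow a "First [M] Last" pattern (PubMed author formats can vary):

import re

def bold_name(authors_text, first, last):
    # allow an optional middle initial, e.g. "Sanjay K Gupta" or "Sanjay K. Gupta"
    pattern = re.compile(rf"\b{re.escape(first)}(?: [A-Z]\.?)? {re.escape(last)}\b")
    return pattern.sub(lambda m: f"<b>{m.group(0)}</b>", authors_text)

bold_name('Sanjay K Gupta 1 2, Wendy Gupta 3', 'Sanjay', 'Gupta')
# returns '<b>Sanjay K Gupta</b> 1 2, Wendy Gupta 3'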

How to scrape specific text from specific table elements

I am trying to scrape specific text from specific table elements on an Amazon product page.
URL_1 has all elements - https://www.amazon.com/dp/B008Q5LXIE/
URL_2 has only 'Sales Rank' - https://www.amazon.com/dp/B001V9X26S
URL_1:
The "Product Details" table has 9 items and I am only interested in 'Product Dimensions', 'Shipping Weight', Item Model Number, and all 'Seller's Rank'
I am not able to parse out the text on these items as some are in one block of code, where others are not.
I am using beautifulsoup and I have run a text.strip() on the table and got everything but very messy. I have tried soup.find('li') and text.strip() to find individual elements but with seller rank, it returns all 3 ranks jumbled in one return. I have also tried regex to clean text but it won't work for the 4 different seller ranks. I have had success using the Try, Except, Pass method for scraping and would have each of these in that format
A bad example of the code used, I was trying to get sales rank past the </b>
element in the HTML
#Sales Rank
sales_rank = 'NOT'
try:
    sr = soup.find('li', attrs={'id': 'SalesRank'})
    sales_rank = sr.find('/b').text.strip()
except:
    pass
I expect to be able to scrape the listed elements into a dictionary. I would like to see the results as
dimensions = 6x4x4
weight = 4.8 ounces
Item_No = IT-DER0-IQDU
R1_NO = 2,036
R1_CAT = Health & Household
R2_NO = 5
R2_CAT = Joint & Muscle Pain Relief Medications
R3_NO = 3
R3_CAT = Naproxen Sodium
R4_NO = 6
R4_CAT = Migraine Relief
my_dict = {'dimensions':'dimensions','weight':'weight','Item_No':'Item_No', 'R1_NO':R1_NO,'R1_CAT':'R1_CAT','R2_NO':R2_NO,'R2_CAT':'R2_CAT','R3_NO':R3_NO,'R3_CAT':'R3_CAT','R4_CAT':'R4_CAT'}
URL_2:
The only element of interest on this page is 'Sales Rank'; 'Product Dimensions', 'Shipping Weight', and 'Item model number' are not present. However, I would like a return similar to that of URL_1, with the missing elements given the value 'NA'. I have had success accomplishing this by setting a value prior to the try/except statement (e.g. shipping_weight = 'NA', then running the try/except: pass), so I get 'NA' and my dictionary is not empty.
You could use stripped_strings and :contains with bs4 4.7.1. This feels like a lot of jiggery-pokery to get the desired output format; I'm sure someone with more Python experience could reduce this and improve its efficiency. The dict-merging syntax is taken from @aaronhall.
import requests
from bs4 import BeautifulSoup as bs
import re

links = ['https://www.amazon.com/Professional-Dental-Guard-Remoldable-Customizable/dp/B07L4YHBQ4', 'https://www.amazon.com/dp/B0040ODFK4/?tag=stackoverfl08-20']

for link in links:
    r = requests.get(link, headers = {'User-Agent': 'Mozilla/5.0'})
    soup = bs(r.content, 'lxml')
    fields = ['Product Dimensions', 'Shipping Weight', 'Item model number', 'Amazon Best Sellers Rank']
    temp_dict = {}
    for field in fields:
        element = soup.select_one('li:contains("' + field + '")')
        if element is None:
            temp_dict[field] = 'N/A'
        else:
            if field == 'Amazon Best Sellers Rank':
                item = [re.sub(r'#|\(', '', string).strip() for string in element.stripped_strings][1].split(' in ')
                temp_dict[field] = item
            else:
                item = [string for string in element.stripped_strings][1]
                temp_dict[field] = item.replace('(', '').strip()
    ranks = soup.select('.zg_hrsr_rank')
    ladders = soup.select('.zg_hrsr_ladder')
    if ranks:
        cat_nos = [item.text.split('#')[1] for item in ranks]
    else:
        cat_nos = ['N/A']
    if ladders:
        cats = [item.text.split('\xa0')[1] for item in ladders]
    else:
        cats = ['N/A']
    rankings = dict(zip(cat_nos, cats))
    map_dict = {
        'Product Dimensions': 'dimensions',
        'Shipping Weight': 'weight',
        'Item model number': 'Item_No',
        'Amazon Best Sellers Rank': ['R1_NO', 'R1_CAT']
    }
    final_dict = {}
    for k, v in temp_dict.items():
        if k == 'Amazon Best Sellers Rank' and v != 'N/A':
            item = dict(zip(map_dict[k], v))
            final_dict = {**final_dict, **item}
        elif k == 'Amazon Best Sellers Rank' and v == 'N/A':
            item = dict(zip(map_dict[k], [v, v]))
            final_dict = {**final_dict, **item}
        else:
            final_dict[map_dict[k]] = v
    for k, v in enumerate(rankings):
        #print(k + 1, v, rankings[v])
        prefix = 'R' + str(k + 2) + '_'
        final_dict[prefix + 'NO'] = v
        final_dict[prefix + 'CAT'] = rankings[v]
    print(final_dict)
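One caveat: newer versions of soupsieve (the CSS selector engine behind bs4's select) deprecate :contains in favour of :-soup-contains, so on a current install the selector may need to be written as:

element = soup.select_one('li:-soup-contains("' + field + '")')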

Removing new line characters in web scrape

I'm trying to scrape baseball lineup data but would only like to return the player names. However, as of right now, it is giving me - position, newline character, name, newline character, and then batting side. For example I want
'D. Fletcher'
but instead I get
'LF\nD. Fletcher\nR'
Additionally, it is giving me all players on the page. It would be preferable to group them by team, which maybe requires a dictionary setup of some sort, but I am not sure what that code would look like.
I've tried using the strip function, but I believe that only removes leading or trailing characters, not ones in the middle. I've tried researching how to get just the title information from the anchor tag, but have not figured out how to do that.
from bs4 import BeautifulSoup
import requests

url = 'https://www.rotowire.com/baseball/daily_lineups.htm'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
players = soup.find_all('li', {'class': 'lineup__player'})
#for link in players.find('a'):
#    print(link.string)
awayPlayers = [player.text.strip() for player in players]
print(awayPlayers)
You should only get the .text for the a tag, not the whole li:
awayPlayers = [player.find('a').text.strip() for player in players]
That would result in something like the following:
['L. Martin', 'Jose Ramirez', 'J. Luplow', 'C. Santana', ...
Say you wanted to build that dict with team names and players; you could do something like the following. I don't know if you want the highlighted players (e.g. Trevor Bauer), so I have added variables to hold them in case they are needed.
Ad boxes and tools boxes are excluded via the :not pseudo-class, which is passed a list of classes to ignore.
from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://www.rotowire.com/baseball/daily-lineups.php')
soup = bs(r.content, 'lxml')
team_dict = {}
teams = [item.text for item in soup.select('.lineup__abbr')] #26
matches = {}
i = 0
for teambox in soup.select('.lineups > div:not(.is-ad, .is-tools)'):
    team_visit = teams[i]
    team_home = teams[i + 1]
    highlights = teambox.select('.lineup__player-highlight-name a')
    visit_highlight = highlights[0].text
    home_highlight = highlights[1].text
    match = team_visit + ' v ' + team_home
    visitors = [item['title'] for item in teambox.select('.is-visit .lineup__player [title]')]
    home = [item['title'] for item in teambox.select('.is-home .lineup__player [title]')]
    matches[match] = {'visitor': [{team_visit: visitors}],
                      'home': [{team_home: home}]}
    i += 2  # each box consumes two entries from teams (visitor and home)
I think you were almost there; you just needed to tweak it a little bit:
awayPlayers = [player.find('a').text for player in players]
This list comprehension grabs the anchor from each player item and pulls its text, so you get just a list of the names:
['L. Martin',
'Jose Ramirez',
'J. Luplow'...]
You have to find the a tag and get the title attribute from it; check the answer below.
awayPlayers = [player.find('a').get('title') for player in players]
print(awayPlayers)
Output is:
['Leonys Martin', 'Jose Ramirez', 'Jordan Luplow', 'Carlos Santana',
