Python, issue with regex. Trying to scrape some comic book titles

I'm trying to scrape comic book titles and their respective issue numbers from this site.
But I'm having an issue with regex, which I've never used before.
I don't want to bore you with my full code; suffice it to say I'm using Beautiful Soup, and what I need from regex is simply to pick out the title name and the episode number of each comic as I loop through the list.
As you can tell from the webpage, this should be simplicity itself: the publisher name comes in all caps, always followed by the title, always followed by a #-symbol, always followed by the episode number.
Here is my approach:
import re
text = "876876 PUBLISHER title #345 jklhljhljh"
texpat = re.compile(r"PUBLISHER(.*?)#")
thename = texpat.search(text)
name = thename.group()
numpat = re.compile(r"#(\d+)")
num = numpat.search(text)
print(name)
print(num.group())
The output is:
PUBLISHER title #
#345
But it should be:
title
345
I can use the string replace method to remove the stuff I don't want, but then I get stuck with this output (note the leading spaces):
   title
and name.strip() or name.lstrip() does NOT remove the extra three spaces.
It's late, I've never used regex before, I'm sure I'm doing something stupid.

I would utilize BeautifulSoup here to help with html parsing:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.comiclistdatabase.com/doku.php?id=comiclist_for_09_10_2014"
soup = BeautifulSoup(urllib2.urlopen(url))
for row in soup.select('div.table tr')[1:]:
    publisher = row.find('td', class_='col1').text
    title = row.find('td', class_='col2').text
    print {'publisher': publisher, 'title': title}
Prints:
{'publisher': u'AMIGO COMICS', 'title': u'Ghost Wolf #4 (Of 4)$3.99 '}
{'publisher': u'AMIGO COMICS', 'title': u'Rogues Volume 2 Cold Ship #4 (Of 5)'}
{'publisher': u'ARCHIE COMIC PUBLICATIONS', 'title': u'Archie Giant Comics Digest TP'}
{'publisher': u'ARCHIE COMIC PUBLICATIONS', 'title': u'Betty And Veronica #272 (Dan Parent Regular Cover)'}
...
Then, you can grab the number from the title if you want to extract it too. I'm using the #(\d+) regular expression, which matches a hash sign followed by one or more digits; the parentheses capture the number:
import re
import urllib2
from bs4 import BeautifulSoup
url = "http://www.comiclistdatabase.com/doku.php?id=comiclist_for_09_10_2014"
soup = BeautifulSoup(urllib2.urlopen(url))
NUMBER_RE = re.compile(r'#(\d+)')
for row in soup.select('div.table tr')[1:]:
    publisher = row.find('td', class_='col1').text
    title = row.find('td', class_='col2').text
    match = NUMBER_RE.search(title)
    number = match.group(1) if match else 'n/a'
    print {'publisher': publisher, 'title': title, 'number': number}
Prints:
{'publisher': u'AMIGO COMICS', 'number': u'4', 'title': u'Ghost Wolf #4 (Of 4)$3.99 '}
{'publisher': u'AMIGO COMICS', 'number': u'4', 'title': u'Rogues Volume 2 Cold Ship #4 (Of 5)'}
{'publisher': u'ARCHIE COMIC PUBLICATIONS', 'number': 'n/a', 'title': u'Archie Giant Comics Digest TP'}
...
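Note that urllib2 only exists on Python 2. If you are on Python 3, the same fetch could be done with requests; a minimal sketch (the row loop above stays the same, apart from print needing parentheses):
import requests
from bs4 import BeautifulSoup

url = "http://www.comiclistdatabase.com/doku.php?id=comiclist_for_09_10_2014"
soup = BeautifulSoup(requests.get(url).text, "html.parser")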

import re
text = "876876 PUBLISHER title #345 jklhljhljh"
texpat = re.compile(r"PUBLISHER\s*(\S.*?)#")
thename = texpat.search(text)
name = thename.groups()[0]
numpat = re.compile(r"#(\d+)")
num = numpat.search(text)
print(name)
print(num.groups()[0])
The output is:
title
345

Match this to capture the title (in group one) and the number (in group two) with one expression:
PUBLISHER\s*(.+?)\s*#(\d+)
Then use pattern.search(text).group(i) to get the capture group instead of the entire match:
import re
text = "876876 PUBLISHER title #345 jklhljhljh"
pattern = re.compile(r"PUBLISHER\s*(.+?)\s*#(\d+)")
results = pattern.search(text)
print(results.group(1))
print(results.group(2))
Output:
title
345
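If you find named groups more readable than numbered ones, the same idea can be written as follows (a sketch):
import re

text = "876876 PUBLISHER title #345 jklhljhljh"
# same pattern as above, but with named capture groups
pattern = re.compile(r"PUBLISHER\s*(?P<title>.+?)\s*#(?P<number>\d+)")
results = pattern.search(text)
print(results.group("title"))   # title
print(results.group("number"))  # 345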

Related

Problem with .Get href link using scraper?

So I am trying to follow a video tutorial that is just a bit outdated. In the video, href = links[idx].get('href') grabs the link; however, if I use it here, it won't work. It just returns None. If I just call .getText() it will grab the title.
The element containing both the href and the title is the anchor whose text is "Stop the proposal on mass surveillance of the EU".
Here's my code:
import requests
from bs4 import BeautifulSoup

res = requests.get('https://news.ycombinator.com/news')
soup = BeautifulSoup(res.text, 'html.parser')
links = soup.select('.titleline')
votes = soup.select('.score')

def create_custom_hn(links, votes):
    hn = []
    for idx, item in enumerate(links):
        title = links[idx].getText()
        href = links[idx].get('href')
        print(href)
        #hn.append({'title': title, 'link': href})
    return hn

print(create_custom_hn(links, votes))
I tried to grab the link using .get('href')
Try to select your elements more specifically, and avoid using separate lists; there is no need for that, and you would also have to ensure they have the same length.
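As an aside, the immediate reason .get('href') returns None in your code is that .titleline matches the span, while the href lives on the anchor inside it; a minimal sketch of that fix (kept separate from the approach below):
links = soup.select('.titleline a')   # the anchor inside the span, not the span itself
for link in links:
    print(link.get('href'))           # the anchor actually carries the href attribute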
You could get all information in one go, selecting the <tr> with class athing and its next sibling.
Example
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('https://news.ycombinator.com/news').text)
data = []
for i in soup.select('.athing'):
    data.append({
        'title' : i.select_one('span a').text,
        'link' : i.select_one('span a').get('href'),
        'score' : list(i.next_sibling.find('span').stripped_strings)[0]
    })
data
Output
[{'title': 'Stop the proposal on mass surveillance of the EU',
'link': 'https://mullvad.net/nl/blog/2023/2/2/stop-the-proposal-on-mass-surveillance-of-the-eu/',
'score': '287 points'},
{'title': 'Bay 12 Games has made $7M from the Steam release of Dwarf Fortress',
'link': 'http://www.bay12forums.com/smf/index.php?topic=181354.0',
'score': '416 points'},
{'title': "Google's OSS-Fuzz expands fuzz-reward program to $30000",
'link': 'https://security.googleblog.com/2023/02/taking-next-step-oss-fuzz-in-2023.html',
'score': '31 points'},
{'title': "Connecticut Parents Arrested for Letting Kids Walk to Dunkin' Donuts",
'link': 'https://reason.com/2023/01/30/dunkin-donuts-parents-arrested-kids-cops-freedom/',
'score': '225 points'},
{'title': 'Ronin 2.0 – open-source Ruby toolkit for security research and development',
'link': 'https://ronin-rb.dev/blog/2023/02/01/ronin-2-0-0-finally-released.html',
'score': '62 points'},...]

How to perform paging to scrape quotes over several pages?

I'm looking to scrape the website 'https://quotes.toscrape.com/' and retrieve for each quote, the author's full name, date of birth, and location of birth. There are 10 pages of quotes. To retrieve the author's date of birth and location of birth, one must follow the <a href 'about'> link next to the author's name.
Functionally speaking, I need to scrape 10 pages of quotes and follow each quote author's 'about' link to retrieve their data mentioned in the paragraph above ^, and then compile this data into a list or dict, without duplicates.
I can complete some of these tasks separately, but I am new to BeautifulSoup and Python and am having trouble implementing them all together. My success so far is limited to retrieving the author's info from quotes on page 1; I'm unable to properly assign the function's returns to a variable (without an erroneous in-function print statement), and unable to implement the 10-page scan. Any help is greatly appreciated.
import requests
from bs4 import BeautifulSoup

def get_author_dob(url):
    response_auth = requests.get(url)
    html_auth = response_auth.content
    auth_soup = BeautifulSoup(html_auth)
    auth_tag = auth_soup.find("span", class_="author-born-date")
    return [auth_tag.text]

def get_author_bplace(url):
    response_auth2 = requests.get(url)
    html_auth2 = response_auth2.content
    auth_soup2 = BeautifulSoup(html_auth2)
    auth_tag2 = auth_soup2.find("span", class_="author-born-location")
    return [auth_tag2.text]

url = 'http://quotes.toscrape.com/'
html = requests.get(url).content  # assumption: html was used below but not defined in the snippet
soup = BeautifulSoup(html)
tag = soup.find_all("div", class_="quote")

def auth_retrieval(url):
    for t in tag:
        a = t.find("small", class_="author")
        author = [a.text]
        hrefs = t.a
        link = hrefs.get('href')
        link_url = url + link
        dob = get_author_dob(link_url)
        b_place = get_author_bplace(link_url)
        authorss = author + dob + b_place
        print(authorss)
I need to use 'return' in the above function to be able to assign the results to a variable, but when I do, it only returns one value. I have tried the generator route with yield but am confused on how to implement the counter when I am already iterating over tag. Also confused with where and how to insert 10-page scan task. Thanks in advance
You are on the right track, but you could simplify the process a bit:
Use a while loop and check if the next button is available to perform the paging. This also works if the number of pages is not known, and you can still stop after a specific number of pages if needed.
Reduce the number of requests and scrape all the available and necessary information in one go.
If you pick up a bit more data than you need, that is not a problem; you can easily filter it down to your goal with df[['author','dob','lob']].drop_duplicates().
Store the information in a structured way, like a dict, instead of in single variables.
Example
import pandas as pd
import requests
from bs4 import BeautifulSoup
def get_author(url):
    soup = BeautifulSoup(requests.get(url).text)
    author = {
        'dob': soup.select_one('.author-born-date').text,
        'lob': soup.select_one('.author-born-location').text,
        'url': url
    }
    return author

base_url = 'http://quotes.toscrape.com'
url = base_url
quotes = []

while True:
    soup = BeautifulSoup(requests.get(url).text)
    for e in soup.select('div.quote'):
        quote = {
            'author': e.select_one('small.author').text,
            'quote': e.select_one('span.text').text
        }
        quote.update(get_author(base_url + e.a.get('href')))
        quotes.append(quote)
    if soup.select_one('li.next a'):
        url = base_url + soup.select_one('li.next a').get('href')
        print(url)
    else:
        break

pd.DataFrame(quotes)
Output
    | author | quote | dob | lob | url
0   | Albert Einstein | “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” | March 14, 1879 | in Ulm, Germany | http://quotes.toscrape.com/author/Albert-Einstein
1   | J.K. Rowling | “It is our choices, Harry, that show what we truly are, far more than our abilities.” | July 31, 1965 | in Yate, South Gloucestershire, England, The United Kingdom | http://quotes.toscrape.com/author/J-K-Rowling
... | ... | ... | ... | ... | ...
98  | Dr. Seuss | “A person's a person, no matter how small.” | March 02, 1904 | in Springfield, MA, The United States | http://quotes.toscrape.com/author/Dr-Seuss
99  | George R.R. Martin | “... a mind needs books as a sword needs a whetstone, if it is to keep its edge.” | September 20, 1948 | in Bayonne, New Jersey, The United States | http://quotes.toscrape.com/author/George-R-R-Martin
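As mentioned above, if you only need each author once, the resulting DataFrame can be filtered down; a small usage sketch based on the quotes list built in the example:
df = pd.DataFrame(quotes)
# one row per author, keeping only the author-level columns
authors_only = df[['author', 'dob', 'lob']].drop_duplicates()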
Your code is almost working and just needs a bit of refactoring.
One thing I found out was that you could access individual pages using this URL pattern,
https://quotes.toscrape.com/page/{page_number}/
Once you've figured that out, we can take advantage of this pattern in the code:
#refactored the auth_retrieval to this one for reusability
def get_page_data(base_url, tags):
    all_authors = []
    for t in tags:
        a = t.find("small", class_="author")
        author = [a.text]
        hrefs = t.a
        link = hrefs.get('href')
        link_url = base_url + link
        dob = get_author_dob(link_url)
        b_place = get_author_bplace(link_url)
        authorss = author + dob + b_place
        print(authorss)
        all_authors.append(authorss)
    return all_authors

url = 'https://quotes.toscrape.com/' #base url for the website
total_pages = 10

all_page_authors = []
for i in range(1, total_pages + 1):  # pages 1 through 10 inclusive
    page_url = f'{url}page/{i}/' #https://quotes.toscrape.com/page/1, 2, ... 10
    print(page_url)
    page = requests.get(page_url)
    soup = BeautifulSoup(page.content, 'html.parser')
    tags = soup.find_all("div", class_="quote")
    all_page_authors += get_page_data(url, tags) #merge all authors into one list
print(all_page_authors)
get_author_dob and get_author_bplace remain the same.
The final output will be an array of authors where each author's info is an array.
[['Albert Einstein', 'March 14, 1879', 'in Ulm, Germany'],
['J.K. Rowling', 'July 31, 1965', 'in Yate, South Gloucestershire, England, The United Kingdom'],
['Albert Einstein', 'March 14, 1879', 'in Ulm, Germany'],...]

Regex - Extracting PubMed publications via Beautiful Soup, identify authors from my list that appear in PubMed article, and add bold HTML tags

I'm working on a project where we are web-scraping PubMed research abstracts and detecting whether any researchers from our organization have authorship on any new publications. When we detect a match, we want to add bold HTML tags. For example, you might see something like this in PubMed: Sanjay Gupta 1 2 3, Mehmet Oz 3 4, Terry Smith 2 4 (the numbers denote their academic affiliations, which correspond to a different field, but I've left this out for simplicity). If Mehmet Oz and Sanjay Gupta were in my list, I would add a bold tag before their first name and a closing tag at the end of their name.
One of my challenges with PubMed is that the authors sometimes only show their first and last name, and other times a middle initial is included (e.g., Sanjay K Gupta versus just Sanjay Gupta). In my list of people, I only have first and last names. What I tried to do is import my list of names, split first and last name, and then bold them in the list of authors. The problem is that my code will bold anyone with a matching first name or a matching last name (for example, all of Sanjay Smith 1 2 3, Sanjay Gupta 1 3 4, Wendy Gupta 4 5 6, Linda Oz 4, Mehmet Jones 5, Mehmet Oz 1 4 6 get partially bolded). I realize the flaw in my code, but I'm struggling with how to get around this. Any help is appreciated.
Bottom Line: I have a list of people by first name and last name, I want to find their publications in PubMed and bold their name in the author credits. PubMed sometimes has their first and last name, but sometimes their middle initial.
To make things easier, I denoted the section in all caps for the part in my code where I need help.
import time
import requests
import re
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup

all_pmids = []
out = []
base_urls = ['https://pubmed.ncbi.nlm.nih.gov/?term=sanjay+gupta&filter=years.2021-2021','https://pubmed.ncbi.nlm.nih.gov/?term=AHRQ+Diabetes+telehealth&filter=years.2016-2016', 'https://pubmed.ncbi.nlm.nih.gov/?term=mehmet+oz&filter=years.2020-2020']
author_list = ['Mehmet Oz', 'Sanjay Gupta', 'Ken Jeong', 'Susie Bates', 'Vijay Singh', 'Cynthia Berg']

for search_url in base_urls:
    response = requests.get(search_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    pmids = soup.find_all('span', {'class' : 'docsum-pmid'})
    for p in pmids:
        p = p.get_text()
        all_pmids.append(p) if p not in all_pmids else print(p + ' already in list, skipping')

for pmid in all_pmids:
    url = 'https://pubmed.ncbi.nlm.nih.gov/' + pmid
    response2 = requests.get(url)
    soup2 = BeautifulSoup(response2.content, 'html.parser')
    title = soup2.select('h1.heading-title')[0].text.strip() if soup2.find(class_='item-list') is not None else ''
    #THIS IS THE START OF THE SECTION I NEED HELP WITH
    authors = soup2.find(class_='authors-list').get_text(' ') if soup2.find(class_='authors-list') is not None else ''
    authors = authors.rstrip() if soup2.find(class_='authors-list') is not None else ''
    authors = " ".join(authors.split()) if soup2.find(class_='authors-list') is not None else ''
    for au in author_list:
        au_l = au.split()[1] + ' '
        au_f = au.split()[0] + ' '
        authors = re.sub(au_f, '<b>'+au_f, authors) if '<b>' + au_f not in authors else authors
        authors = re.sub(au_l, au_l+'</b>', authors) if '</b>' + au_l not in authors else authors
    #THIS IS THE END OF THE SECTION I NEED HELP WITH
    data = {'title': title, 'authors': authors}
    time.sleep(5)
    out.append(data)

df = pd.DataFrame(out)
df.to_excel('my_output.xlsx')
Here is the modification that needs to be done in the section you want help with.
Here is the algorithm:
Create a list of authors by splitting on ','.
For each author in authors, check if au_l and au_f are both present in author.
If so, add the <b> tags.
#THIS IS THE START OF THE SECTION I NEED HELP WITH
authors = None
if (authors_html := soup2.find(class_='authors-list')):
    authors = authors_html.get_text(' ')
if not authors:
    continue
authors = " ".join(authors.rstrip().split()).split(",")
for au in author_list:
    au_f, au_l = au.split()
    for i in range(len(authors)):
        if au_f in authors[i] and au_l in authors[i]:
            authors[i] = f"<b> {authors[i]} </b>"
#THIS IS THE END OF THE SECTION I NEED HELP WITH
data = {'title': title, 'authors': ",".join(authors)}
Also, made some minor updates to improve readability.
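If you would rather stay with regex and tolerate an optional middle initial between the first and last name, a possible sketch (a different approach than the comma-splitting above):
import re

def bold_author(full_name, authors_text):
    # match "First", optionally a middle initial such as "K" or "K.", then "Last"
    first, last = full_name.split()
    pattern = re.compile(rf"\b{re.escape(first)}(?:\s+[A-Z]\.?)?\s+{re.escape(last)}\b")
    return pattern.sub(lambda m: f"<b>{m.group(0)}</b>", authors_text)

# bold_author('Sanjay Gupta', 'Sanjay K Gupta 1 2, Wendy Gupta 3')
# -> '<b>Sanjay K Gupta</b> 1 2, Wendy Gupta 3'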

Removing new line characters in web scrape

I'm trying to scrape baseball lineup data but would only like to return the player names. However, as of right now, it is giving me - position, newline character, name, newline character, and then batting side. For example I want
'D. Fletcher'
but instead I get
'LF\nD. Fletcher\nR'
Additionally, it is giving me all players on the page. It would be preferable that I group them by team, which maybe requires a dictionary set up of some sort but am not sure what that code would look like.
I've tried using the strip function but I believe that only removes leading or trailing issues as opposed to in the middle. I've tried researching how to just get the title information from the anchor tag but have not figured out how to do that.
from bs4 import BeautifulSoup
import requests
url = 'https://www.rotowire.com/baseball/daily_lineups.htm'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
players = soup.find_all('li', {'class': 'lineup__player'})
####for link in players.find('a'):
##### print (link.string)
awayPlayers = [player.text.strip() for player in players]
print(awayPlayers)
You should only get the .text for the a tag, not the whole li:
awayPlayers = [player.find('a').text.strip() for player in players]
That would result in something like the following:
['L. Martin', 'Jose Ramirez', 'J. Luplow', 'C. Santana', ...
Say you wanted to build that dict with team names and players, you could do something like the following. I don't know if you want the highlighted players, e.g. Trevor Bauer; I have added variables to hold them in case they are needed.
Ad boxes and tool boxes are excluded via the :not pseudo-class, which is passed a list of classes to ignore.
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://www.rotowire.com/baseball/daily-lineups.php')
soup = bs(r.content, 'lxml')
team_dict = {}
teams = [item.text for item in soup.select('.lineup__abbr')] #26
matches = {}
i = 0
for teambox in soup.select('.lineups > div:not(.is-ad, .is-tools)'):
    team_visit = teams[i]
    team_home = teams[i + 1]
    highlights = teambox.select('.lineup__player-highlight-name a')
    visit_highlight = highlights[0].text
    home_highlight = highlights[1].text
    match = team_visit + ' v ' + team_home
    visitors = [item['title'] for item in teambox.select('.is-visit .lineup__player [title]')]
    home = [item['title'] for item in teambox.select('.is-home .lineup__player [title]')]
    matches[match] = {'visitor' : [{team_visit : visitors}],
                      'home' : [{team_home : home}]
                      }
    i += 2  # move past both teams of this game
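The matches dict built above maps each fixture to its visitor and home lineups; an illustrative shape, with hypothetical teams and players:
matches == {'CLE v NYY': {'visitor': [{'CLE': ['Leonys Martin', 'Jose Ramirez']}],
                          'home': [{'NYY': ['DJ LeMahieu', 'Aaron Judge']}]}}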
I think you were almost there, you just needed to tweak it a little bit:
awayPlayers = [player.find('a').text for player in players]
This list comprehension grabs each player from the list and then pulls the text from the anchor, so you get just a list of the names:
['L. Martin',
'Jose Ramirez',
'J. Luplow'...]
You have to find the a tag and read its title attribute; see the code below.
awayPlayers = [player.find('a').get('title') for player in players]
print(awayPlayers)
Output is:
['Leonys Martin', 'Jose Ramirez', 'Jordan Luplow', 'Carlos Santana',

Extract data from a site using python

I am making a program that will extract the data from http://www.gujarat.ngosindia.com/
I wrote the following code:
def split_line(text):
    words = text.split()
    i = 0
    details = ''
    while ((words[i] != 'Contact')) and (i < len(words)):
        i = i + 1
        if (words[i] == 'Contact:'):
            break
    while ((words[i] != 'Purpose')) and (i < len(words)):
        if (words[i] == 'Purpose:'):
            break
        details = details + words[i] + ' '
        i = i + 1
    print(details)

def get_ngo_detail(ngo_url):
    html = urlopen(ngo_url).read()
    soup = BeautifulSoup(html)
    table = soup.find('table', {'class': 'border3'})
    td = soup.find('td', {'class': 'border'})
    split_line(td.text)

def get_ngo_names(gujrat_url):
    html = urlopen(gujrat_url).read()
    soup = BeautifulSoup(html)
    for link in soup.findAll('div', {'id': 'mainbox'}):
        for text in link.find_all('a'):
            print(text.get_text())
            ngo_link = 'http://www.gujarat.ngosindia.com/' + text.get('href')
            get_ngo_detail(ngo_link)
            #NGO_name = text2.get_text())

a = get_ngo_names(BASE_URL)
print a
But when I run this script I only get the names of the NGOs and the contact person.
I want the email, telephone number, website, purpose, and contact person.
Your split_line could be improved. Imagine you have this text:
s = """Add: 3rd Floor Khemha House
Drive in Road, Opp Drive in Cinema
Ahmedabad - 380 054
Gujarat
Tel: 91-79-7457611 , 79-7450378
Email: a.mitra1#lse.ac.uk
Website: http://www.aavishkaar.org
Contact: Angha Mitra
Purpose: Economics and Finance, Micro-enterprises
Aim/Objective/Mission: To provide timely financing, management support and professional expertise ..."""
Now we can turn this into lines using s.split("\n") (split on each new line), giving a list where each item is a line:
lines = s.split("\n")
lines == ['Add: 3rd Floor Khemha House',
'Drive in Road, Opp Drive in Cinema',
...]
We can define a list of the elements we want to extract, and a dictionary to hold the results:
targets = ["Contact", "Purpose", "Email"]
results = {}
And work through each line, capturing the information we want:
for line in lines:
    l = line.split(":")
    if l[0] in targets:
        results[l[0]] = l[1]
This gives me:
results == {'Contact': ' Angha Mitra',
'Purpose': ' Economics and Finance, Micro-enterprises',
'Email': ' a.mitra1#lse.ac.uk'}
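If you also want fields whose values themselves contain a colon (the Website URL, for example), a small variation that splits only on the first colon and strips the leading space (a sketch):
targets = ["Contact", "Purpose", "Email", "Website", "Tel"]
results = {}
for line in lines:
    key, _, value = line.partition(":")   # split on the first colon only
    if key in targets:
        results[key] = value.strip()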
Try to split the contents of the NGO site better; note that you can split on a regular expression (str.split does not accept one, but re.split does).
e.g. "[Contact]+[Email]+[telephone number]+[website]+[purpose]+[contact person]
My regular expression could be wrong but this is the direction you should head in.
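A minimal sketch of that direction, using re.split with the field labels from the sample string s in the previous answer (the exact label list is an assumption):
import re

# keep the labels in the result by wrapping them in a capture group
labels = r"(Add:|Tel:|Email:|Website:|Contact:|Purpose:|Aim/Objective/Mission:)"
parts = [p.strip() for p in re.split(labels, s) if p.strip()]
details = dict(zip(parts[0::2], parts[1::2]))   # e.g. {'Tel:': '91-79-7457611 , 79-7450378', ...}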
