I am having trouble displaying the right named capture group using regex. I already have a regex that captures that group; here is my regex link to show it. Looking at the link, I am trying to display the text highlighted in green.
The green part is the page titles from the JSON API at the link. They are labeled as 'article'. What I've done so far is parse through the JSON to get the list of articles and display it. Some articles have multiple pages, and I am just trying to display that very first page. That is why I used regex, since I am working with huge files here. I am trying to get that green part of the regex to display within my function. This is the link to my working code without the regex implementation. Here is my code so far:
import json
import requests
import re

link = "https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikiversity/all-access/2018/01/10"

def making_data(link):
    response = requests.get(link, [])
    data = response.json()
    json_data = data['items']
    articles_list = []
    whole_re = re.compile(r'^[^\/].*')
    rx = re.compile(r'(^[^\/]+)')
    for items in json_data:
        articles = items['articles']
        # Iterate over the list of articles
        for article in articles:
            m = whole_re.match(article)
            if m:
                articles_list.append(m)
            articles = article.get("article")
            search_match = rx.match(article)
            if search_match:
                print("Page: %s" % articles)
    return sorted(articles_list)

making_data(link)
I keep getting an error with regex. I think I am implementing this wrong with JSON and regex.
I want the output to just display what is highlighted in green from the regex link provided and not the following text after that.
Page: Psycholinguistics
Page: Java_Tutorial
Page: United_States_currency
I hope this all makes sense. I appreciate all the help.
If you print your article you will see it is a dictionary. Your regex isn't what is wrong here; it is how you are referencing article.
You intend to reference article_title = article.get("article") from the original code that you linked, I believe.
Another thing that will become an issue is rebinding the name articles in the middle of your loop. I made some edits to get you going, but they will need refinement based on your exact usage and the results that you want.
You can reference a match object group with .group(1):
import requests
import re

link = "https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikiversity/all-access/2018/01/10"

def making_data(link):
    response = requests.get(link)
    data = response.json()
    json_data = data['items']
    articles_list = []
    whole_re = re.compile(r'^[^\/].*')
    rx = re.compile(r'(^[^\/]+)')
    for items in json_data:
        articles = items['articles']
        # Iterate over the list of articles
        for article in articles:
            # Each article is a dict; pull out its title string first
            article_title = article.get("article")
            m = whole_re.match(article_title)
            if m:
                articles_list.append(m[0])  # m[0] is the full matched text
            search_match = rx.match(article_title)
            if search_match:
                # group(1) is the part of the title before the first slash
                print("Page: %s" % search_match.group(1))
    return sorted(articles_list)

making_data(link)
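As an aside, if all you need is the part of each title before the first slash, a plain string split does the same job as the regex; a minimal sketch of that alternative inside the same loop:

article_title = article.get("article")
# "Java_Tutorial/Page1" -> "Java_Tutorial"; a title without a slash
# comes back unchanged, so no match check is needed
print("Page: %s" % article_title.split('/')[0])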
Related
I am very new and need your help. I want to write a script I can generalize for web scraping. So far I have the code below, but it keeps giving me a blank output file. I would like to be able to easily modify this code to work on all websites and eventually make the search strings a little more complex. For now, I have CNN as a general page, and "mccarthy" because I figure there are certainly articles with him in them right now. Can you help me get this to work?
# Begin code
import requests
from bs4 import BeautifulSoup
import docx

# Set the search parameters
search_term = 'mccarthy'  # Set the search term
start_date = '2023-01-04'  # Set the start date (format: YYYY-MM-DD)
end_date = '2023-01-05'  # Set the end date (format: YYYY-MM-DD)
website = 'https://www.cnn.com'  # Set the website to search
document = open('testfile.docx', 'w')  # Open the existing Word document

# Initialize the list of articles and the page number
articles = []
page_number = 1

# Set the base URL for the search API
search_url = f'{website}/search'
# Set the base URL for the article page
article_base_url = f'{website}/article/'

while articles or page_number == 1:
    # Send a request to the search API
    response = requests.get(search_url, params={'q': search_term, 'from': start_date, 'to': end_date, 'page': page_number})
    # Check if the response is in JSON format
    if response.headers['Content-Type'] == 'application/json':
        # Load the JSON data
        data = response.json()
        # Get the list of articles from the JSON data
        articles = data['articles']
    else:
        # Parse the HTML content of the search results page
        soup = BeautifulSoup(response.text, 'html.parser')
        # Find all articles on the search results page
        articles = soup.find_all('article')
    # Loop through the articles
    for article in articles:
        # Find the link element
        link_element = article.find('a', class_='title')
        # Extract the link from the link element
        link = link_element['href']
        # Check if the link is a relative URL
        if link.startswith('/'):
            # If the link is relative, convert it to an absolute URL
            link = f'{website}{link}'
        # Add the link to the document
        document.add_paragraph(link)
    # Increment the page number
    page_number += 1

# Save the document
document.close()
I have tried numerous iterations, but I have deleted them all so I cannot really post any here. This keeps giving me a blank output file.
This won't solve the main issue but a couple of things to fix:
https://edition.cnn.com/search?q=&from=0&size=10&page=1&sort=newest&types=all&section=
Looking at the CNN search page URL, we can see that the from parameter is not referring to a date but a number instead, i.e. if from=5, it will only show the 5th article onwards. Therefore you can remove 'from' and 'to' from your request params.
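With that in mind, the request would look more like this (a sketch based only on the parameters visible in that URL):

response = requests.get(
    search_url,
    # 'from' is a result offset (0 = first result), not a date
    params={'q': search_term, 'size': 10, 'page': page_number, 'from': 0, 'sort': 'newest'},
)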
articles = soup.find_all('article')
This is returning an empty list because there are no <article> tags within the HTML page. Inspecting the CNN HTML we see that the URLs you are looking for are within <div class="card container__item container__item--type- __item __item--type- "> tags so I would change this line to soup.find_all('div', class_="card container__item container__item--type- __item __item--type- ")
document = open('testfile.docx','w') # Open the existing Word document
You've imported the docx module but are not using it. Word documents (which require extra data for formatting) should be opened like this: document = Document(). For reference, here are the python-docx docs: https://python-docx.readthedocs.io/en/latest/
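A minimal python-docx sketch of that pattern (the filename is just an example):

from docx import Document

document = Document()                # create a new, empty Word document
document.add_paragraph('some link')  # add paragraphs instead of writing raw text
document.save('testfile.docx')       # write the .docx to disk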
while articles or page_number == 1:
I don't think this line is needed.
The main issue seems to be that this page requires JavaScript to run in order to render the content. Using requests.get() by itself won't do this. You'll need to use a library such as Requests-HTML. I tried doing this but the articles still don't render, so I'm not sure.
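For reference, the basic Requests-HTML pattern is sketched below; as noted above, the CNN results still may not render, so treat it as a starting point only:

from requests_html import HTMLSession  # pip install requests-html

session = HTMLSession()
r = session.get('https://edition.cnn.com/search?q=mccarthy')
r.html.render()                # downloads Chromium on first use and runs the page's JS
links = r.html.absolute_links  # all absolute links found in the rendered page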
I intend to extract the article text from an NYT article, but I don't know how to extract by HTML5 attributes such as the section name.
import urllib.request
from bs4 import BeautifulSoup
html = urllib.request.urlopen('https://www.nytimes.com/2019/10/24/opinion/chuck-schumer-electric-car.html?action=click&module=Opinion&pgtype=Homepage')
soup = BeautifulSoup(html)
data = soup.findAll(text=True)
The main text is wrapped in a section named 'articleBody'. What kind of soup.find() syntax can I use to extract that?
The find method searches tags; it doesn't differentiate HTML5 tags from any other (X)HTML tag name:
article = soup.find("section",{"name":"articleBody"})
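From there, assuming the find call matched, the paragraph text can be pulled out the usual way:

article = soup.find("section", {"name": "articleBody"})
if article is not None:  # find returns None when nothing matches
    for p in article.find_all("p"):
        print(p.get_text())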
You can scrape the pre-loaded data from the script tag and parse it with the json library. The first code block below brings back a little more content than you wanted.
You can further restrict the output by looking up the ids of the paragraphs within the body and using those to filter content, as shown in the bottom block; you then get exactly the article content you describe.
import requests, re, json

r = requests.get('https://www.nytimes.com/2019/10/24/opinion/chuck-schumer-electric-car.html?action=click&module=Opinion&pgtype=Homepage')
p = re.compile(r'window\.__preloadedData = (.*})')
data = json.loads(p.findall(r.text)[0])
for k, v in data['initialState'].items():
    if k.startswith('$Article') and 'formats' in v:
        print(v['text#stripHtml'] if 'text#stripHtml' in v else v['text'])
You can explore the json here: https://jsoneditoronline.org/?id=f9ae1fb774af439d8e9b32247db9d853
The following shows how to use additional logic to limit the result to just the output you want:
ids = []
for k, v in data['initialState'].items():
    if k.startswith('$Article') and v['__typename'] == 'ParagraphBlock' and 'content' in v:
        ids += [v['content'][0]['id']]

for k, v in data['initialState'].items():
    if k in ids:
        print(v['text'])
I am trying to use re to pull a URL out of something I have scraped. I am using the code below, but it comes up empty. I am not very familiar with re. Could you show me how to pull out the URL?
match = ["http://www.stats.gov.cn/tjsj/zxfb/201811/t20181105_1631364.html';", "http://www.stats.gov.cn'+urlstr+'"]
url = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', match)
# print(url) just prints both. I only need the match = "http://www.stats.gov.cn/tjsj/zxfb/ANYTHINGHERE/ANYTHINGHERE.html"
print(url)
Expected Output = ["http://www.stats.gov.cn/tjsj/zxfb/201811/t20181105_1631364.html';"]
Okay, I found the solution. The .+ looks for any number of characters between http://www.stats.gov.cn/ and .html. Thanks for your help with this.
match = ["http://www.stats.gov.cn/tjsj/zxfb/201811/t20181105_1631364.html';", "http://www.stats.gov.cn'+urlstr+'"]
url = re.findall('http://www.stats.gov.cn/.+.html', str(match))
print(url)
Expected Output = ["http://www.stats.gov.cn/tjsj/zxfb/201811/t20181105_1631364.html"]
I'm trying to figure out how to get Python 3 to display a certain phrase from an HTML document. For example, I'll be using the search engine https://duckduckgo.com.
I'd like the code to search for var error=document.getElementById; and display what is inside the parentheses, in this case "error_homepage". Any help would be appreciated.
import urllib.request
u = input ('Please enter URL: ')
x = urllib.request.urlopen(u)
print(x.read())
You can simply read the website of interest, as you suggested, using urllib.request, and use regular expressions to search the retrieved HTML/JS/... code:
import re
import urllib.request

# the URL that data is read from
url = "http://..."

# the regex pattern for extracting element IDs
pattern = r"var error = document.getElementById\(['\"](?P<element_id>[a-zA-Z0-9_-]+)['\"]\);"

# fetch HTML code
with urllib.request.urlopen(url) as f:
    html = f.read().decode("utf8")

# extract element IDs
for m in re.findall(pattern, html):
    print(m)
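Since the pattern already defines a named group, re.finditer makes the intent a little more explicit:

# each m is a match object; the group was named (?P<element_id>...) above
for m in re.finditer(pattern, html):
    print(m.group("element_id"))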
I am trying to write a Python program that will count the words on a web page. I use Beautiful Soup 4 to scrape the page, but I have difficulty accessing nested HTML tags (for example, <p class="hello"> inside a <div>).
Every time I try finding such a tag using the page.findAll() method (page is the Beautiful Soup object containing the whole page), it simply doesn't find any, although they are there. Is there a simple method or another way to do it?
I'm guessing that what you are trying to do is first look in a specific div tag and then search all p tags in it and count them, or do whatever you want. For example:
import bs4

soup = bs4.BeautifulSoup(content, 'html.parser')

# This will get the div
div_container = soup.find('div', class_='some_class')

# Then search in that div_container for all p tags with class "hello"
for ptag in div_container.find_all('p', class_='hello'):
    # prints the p tag content
    print(ptag.text)
Hope that helps
Try this one:
data = []
for nested_soup in soup.find_all('xyz'):
    data = data + nested_soup.find_all('abc')
Maybe you can turn it into a lambda and make it cool, but this works. Thanks.
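A CSS selector does the same nested lookup in one call, if you prefer that style (the tag names here are placeholders, as above):

# 'xyz abc' matches every <abc> nested anywhere inside an <xyz>
data = soup.select('xyz abc')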
UPDATE: I noticed that text does not always return the expected result. At the same time, I realized there is a built-in way to get the text; sure enough, reading the docs, we see that there is a method called get_text(). Use it as:
from bs4 import BeautifulSoup

fd = open('index.html', 'r')
website = fd.read()
fd.close()

soup = BeautifulSoup(website, 'html.parser')
contents = soup.get_text(separator=" ")
print("number of words %d" % len(contents.split(" ")))
INCORRECT, please read above. Supposing that you have your HTML file locally in index.html, you can:
from bs4 import BeautifulSoup
import re

BLACKLIST = ["html", "head", "title", "script"]  # tags to be ignored

fd = open('index.html', 'r')
website = fd.read()
soup = BeautifulSoup(website, 'html.parser')
tags = soup.find_all(True)  # find everything
print("there are %d" % len(tags))

count = 0
matcher = re.compile(r"(\s|\n|<br>)+")
for tag in tags:
    if tag.name.lower() in BLACKLIST:
        continue
    temp = matcher.split(tag.text)   # split using tokens such as \s and \n
    temp = list(filter(None, temp))  # remove empty elements in the list
    count += len(temp)

print("number of words in the document %d" % count)
fd.close()
Please note that it may not be accurate, whether because of errors in formatting, false positives (it detects any word, even if it is code), text that is shown dynamically using JavaScript or CSS, or other reasons.
You can find all <p> tags using regular expressions (re module).
Note that r.text is a string which contains the whole HTML of the site (r.content is the raw bytes, which won't work with a str pattern).
for eg:
r = requests.get(url, headers=headers)
p_tags = re.findall(r'<p>.*?</p>', r.text)  # r.text, not r.content: a str pattern needs a str to search
this should get you all the <p> tags irrespective of whether they are nested or not. And if you want the <a> tags specifically inside the <p> tags, you can pass that whole tag as a string as the second argument instead of r.text.
Alternatively, if you just want the text, you can try this:
from readability import Document #pip install readability-lxml
import requests
r = requests.get(url,headers=headers)
doc = Document(r.content)
simplified_html = doc.summary()
This will get you a more bare-bones form of the HTML from the site; you can then proceed with the parsing.
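To finish the word count on that simplified HTML, one possible follow-up (reusing the get_text approach from the earlier answer):

from bs4 import BeautifulSoup

soup = BeautifulSoup(simplified_html, 'html.parser')
# split() with no argument splits on any whitespace and drops empty strings
words = soup.get_text(separator=' ').split()
print("number of words %d" % len(words))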