Using re to pull out url - python

I am trying to use re to pull a URL out of some text I have scraped. I am running the code below on the data shown, but the result seems to come up empty. I am not very familiar with re. How can I pull out just the URL?
import re
match = ["http://www.stats.gov.cn/tjsj/zxfb/201811/t20181105_1631364.html';", "http://www.stats.gov.cn'+urlstr+'"]
url = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', str(match))
# print(url) just prints both. I only need the one like "http://www.stats.gov.cn/tjsj/zxfb/ANYTHINGHERE/ANYTHINGHERE.html"
print(url)
Expected Output = ["http://www.stats.gov.cn/tjsj/zxfb/201811/t20181105_1631364.html';"]

Okay, I found the solution. The .+ matches any run of characters between http://www.stats.gov.cn/ and .html. The literal dots are escaped so that each one matches only a period. Thanks for your help with this.
import re
match = ["http://www.stats.gov.cn/tjsj/zxfb/201811/t20181105_1631364.html';", "http://www.stats.gov.cn'+urlstr+'"]
url = re.findall(r'http://www\.stats\.gov\.cn/.+\.html', str(match))
print(url)
Expected output: ["http://www.stats.gov.cn/tjsj/zxfb/201811/t20181105_1631364.html"]
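As a variation on the solution above, here is a minimal sketch that matches each scraped string on its own instead of converting the whole list with str(). The names ARTICLE_URL and extract_article_urls are made up for illustration, and the pattern assumes the URL always has the /tjsj/zxfb/SEGMENT/SEGMENT.html shape described in the question.

```python
import re

# Sample data copied from the question.
scraped = [
    "http://www.stats.gov.cn/tjsj/zxfb/201811/t20181105_1631364.html';",
    "http://www.stats.gov.cn'+urlstr+'",
]

# Dots are escaped so they match a literal "."; [^/'"]+ matches one path
# segment and stops at a slash or a quote character.
ARTICLE_URL = re.compile(
    r"""http://www\.stats\.gov\.cn/tjsj/zxfb/[^/'"]+/[^/'"]+\.html"""
)

def extract_article_urls(strings):
    """Collect every article-style URL found in a list of strings."""
    urls = []
    for text in strings:
        urls.extend(ARTICLE_URL.findall(text))
    return urls

print(extract_article_urls(scraped))
```

Matching per string avoids depending on how Python happens to quote the list when it is turned into text.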

Related

Python: Attempting to find a word in a string from HTTPRequest

I'm trying to find a way to search the source code of a web page to see if it contains a keyword. However, no matter what I search for on this page, the only result I get is -1, which I think means I'm doing something wrong. Otherwise, I think it should give me the position where the word starts. Can someone tell me what I'm doing wrong? Here's the code.
import urllib.request
page = urllib.request.urlopen("http://www.google.com")
print(page.read())
str_page = str(page)
substring = "content"
print(str_page.find("lang"))

import urllib.request
webUrl = urllib.request.urlopen('https://www.youtube.com/user/guru99com')
print("result code: " + str(webUrl.getcode()))
data = webUrl.read()
# read() returns bytes, so decode before searching for a str keyword
source_text = data.decode('utf-8', 'ignore')
keyword = 'your keyword'
if keyword in source_text:
    pass  # put whatever you want here

Please see if the code below helps:
import urllib.request
url = "http://www.google.com"
response = urllib.request.urlopen(url)
html = response.read().decode('utf-8', 'ignore')  # decode the html page here
if 'lang' in html:
    print(html.find("lang"))  # gives the position of lang
Decoding the HTML page with decode() should help.
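The -1 in the question most likely comes from calling str() on the response object itself, which gives a short description of the object rather than the page body. The sketch below shows the decode-then-search step without a network call. The bytes literal is a made-up stand-in for what response.read() returns, and find_keyword is a hypothetical helper name.

```python
# Made-up stand-in for the bytes that response.read() would return.
raw = b'<!doctype html><html lang="en"><head><title>Example</title></head></html>'

def find_keyword(raw_bytes, keyword, encoding="utf-8"):
    """Decode the page bytes and return the index of keyword, or -1 if absent."""
    text = raw_bytes.decode(encoding, errors="ignore")
    return text.find(keyword)

position = find_keyword(raw, "lang")
print(position)                       # index where "lang" starts
print(find_keyword(raw, "missing"))   # -1, because the word is not in the page
```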

Syntax error in web scraping program using beautifulsoup, requests and regex

As part of 'Automate the Boring Stuff' I am trying to learn how to code in Python. One of the exercises is to create a web scraper using beautifulsoup and requests.
I decided to try Amazon's stock price instead of the price of a product on Amazon. I managed to get it to work, but the output was several lines.
So I wanted to use regex to return just the stock price and not the gain/loss and time stamp as well.
However, it kept giving me syntax errors on line 1. I've tried removing the regex part, going back to just the bs4 and requests part I started with, but that still gave me the syntax error (I am using VSC to avoid parenthesis errors).
Where am I going wrong? And depending on how wrong, what would the correct code look like?
My code currently looks like this:
import bs4, requests, re
def extractedStockPrice(price):
    stockPriceRegex = re.compile(r'''
        [0-9]?
        ,?
        [0-9]+
        /.
        [0-9]*
        ''', re.VERBOSE)
    return stockPriceRegex.search(price)
def getStockPrice(stockUrl):
    res = requests.get(stockUrl)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    elems = soup.select('#quote-header-info > div.My\(6px\).Pos\(r\).smartphone_Mt\(6px\)')
    return elems[0].text.strip()
stockPrice = extractedStockPrice(getStockPrice('https://finance.yahoo.com/quote/AMZN?p=AMZN&.tsrc=fin-srch'))
print('The price is ' + stockPrice)

The issue seems to be with your regular expression in the function extractedStockPrice. It does not match the price, so search returns None, which causes the TypeError mentioned in the comment.
The price string, when it reaches the regex part, looks like this (example):
'2,042.76-0.24 (-0.01%)At close: 4:00PM EDT'
You can use a regex checker to confirm your pattern: https://www.regexpal.com/ (paste the above string as "Test String" and your pattern as "Regular Expression").
It looks like your forward slash should be a backslash. Also, you need to extract the matched text once a match is found; you can do this with group(0) (see the re.search entry at https://docs.python.org/3/library/re.html).
The code below should work (run with Python 3.7):
import bs4, requests, re
def extractedStockPrice(price):
    # fixes here:
    # 1) use backslash "\" instead of "/".
    # 2) use ".group(0)" to extract match.
    stockPriceRegex = re.compile(r'''[0-9]?,?[0-9]+\.[0-9]*''', re.VERBOSE)
    return stockPriceRegex.search(price).group(0)
def getStockPrice(stockUrl):
    res = requests.get(stockUrl)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    elems = soup.select('#quote-header-info > div.My\(6px\).Pos\(r\).smartphone_Mt\(6px\)')
    return elems[0].text.strip()
stockPrice = extractedStockPrice(getStockPrice('https://finance.yahoo.com/quote/AMZN?p=AMZN&.tsrc=fin-srch'))
print('The price is ' + stockPrice)
Result: "The price is 2,042.76".
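The regex step can be checked on its own, without fetching the page, by running it against the example string quoted above. This is only a sketch: the function name extract_price and the slightly looser pattern are assumptions for illustration, not part of the original answer, and the function returns None instead of raising when nothing matches.

```python
import re

# Example string taken from the answer above.
sample = '2,042.76-0.24 (-0.01%)At close: 4:00PM EDT'

# A digit, then any mix of digits and commas, a literal dot, then decimals.
PRICE_RE = re.compile(r'\d[\d,]*\.\d+')

def extract_price(text):
    """Return the first price-like number in text, or None if there is none."""
    match = PRICE_RE.search(text)
    return match.group(0) if match else None

price = extract_price(sample)
if price is None:
    print('No price found')
else:
    print('The price is ' + price)
```

Checking for None before concatenating avoids the TypeError described in the answer when the page layout changes and the pattern no longer matches.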

Python: Using JSON API link to display named capture group through regex

I am having trouble displaying the right named capture group using regex. I already have the regex to capture that group. Here is my regex link to show it. In that link, I am trying to display the text highlighted in green.
The green part is the page titles from the JSON API at the link. They are labeled 'article'. What I've done so far is parse the JSON to get the list of articles and display it. Some articles have multiple pages, and I am just trying to display the very first page. That is why I used regex, since I am working with huge files here. I am trying to get the green part of the regex to display within my function. This is the link to my working code without the regex implementation. Here is what I have tried so far:
import json
import requests
import re
link = "https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikiversity/all-access/2018/01/10"
def making_data(link):
    response = requests.get(link, [])
    data = response.json()
    json_data = data['items']
    articles_list = []
    whole_re= re.compile(r'^[^\/].*')
    rx = re.compile(r'(^[^\/]+)')
    for items in json_data:
        articles = items['articles']
        #Iterate over the list of articles
        for article in articles:
            m = whole_re.match(article)
            if m:
                articles_list.append(m)
            articles = article.get("article")
            search_match = rx.match(article)
            if search_match:
                print("Page: %s" % articles)
    return sorted(articles_list)
making_data(link)
I keep getting an error with regex. I think I am implementing this wrong with JSON and regex.
I want the output to just display what is highlighted in green from the regex link provided and not the following text after that.
Page: Psycholinguistics
Page: Java_Tutorial
Page: United_States_currency
I hope this all makes sense. I appreciate all the help.

If you print article, you will see it is a dictionary. Your regex isn't what is wrong here; the problem is how you are referencing article.
I believe you intended to use article_title = article.get("article"), as in the original code that you linked.
Another thing that will become an issue is reassigning articles in the middle of your loop. I made some edits to get you going, but it will need some refinement based on your exact usage and the results you want.
You can reference a match object's group with .group(1):
import json
import requests
import re
link = "https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikiversity/all-access/2018/01/10"
def making_data(link):
    response = requests.get(link, [])
    data = response.json()
    json_data = data['items']
    articles_list = []
    whole_re= re.compile(r'^[^\/].*')
    rx = re.compile(r'(^[^\/]+)')
    for items in json_data:
        articles = items['articles']
        #Iterate over the list of articles
        for article in articles:
            article_title = article.get("article")
            m = whole_re.match(article_title)
            if m:
                articles_list.append(m[0])
            search_match = rx.match(article_title)
            if search_match:
                print("Page: %s" % search_match.group(1))
    return sorted(articles_list)
making_data(link)
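The title extraction can be tried without calling the API. In the sketch below, sample_items is invented data that only imitates the nesting the code above relies on (a list of items, each holding a list of article dictionaries); the subpage names, view counts, and the helper name first_page_titles are assumptions for illustration.

```python
import re

# Invented sample that mimics the shape of data['items'] used above.
sample_items = [
    {"articles": [
        {"article": "Psycholinguistics/Introduction", "views": 10, "rank": 1},
        {"article": "Java_Tutorial", "views": 8, "rank": 2},
        {"article": "United_States_currency/$1", "views": 5, "rank": 3},
    ]}
]

# Capture everything up to (not including) the first slash.
TITLE_RE = re.compile(r'^([^/]+)')

def first_page_titles(items):
    """Return the part of each article title that precedes the first slash."""
    titles = []
    for item in items:
        for article in item["articles"]:
            # Each article is a dict; the title string lives under "article".
            title = article.get("article", "")
            match = TITLE_RE.match(title)
            if match:
                titles.append(match.group(1))
    return titles

for title in first_page_titles(sample_items):
    print("Page: %s" % title)
```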

Unique phrase in the source code of an HTML page in Python3

I'm trying to figure out how to get Python 3 to display a certain phrase from an HTML document. For example, I'll be using the search engine https://duckduckgo.com .
I'd like the code to search for the key var error=document.getElementById; and display what is in the parentheses; in this case, it would be "error_homepage". Any help would be appreciated.
import urllib.request
u = input ('Please enter URL: ')
x = urllib.request.urlopen(u)
print(x.read())

You can simply read the website of interest, as you suggested, using urllib.request, and use a regular expression to search the retrieved HTML/JS/... code:
import re
import urllib.request
# the URL that data is read from
url = "http://..."
# the regex pattern for extracting element IDs; \s* allows optional spaces around "="
pattern = r"var error\s*=\s*document\.getElementById\(['\"](?P<element_id>[a-zA-Z0-9_-]+)['\"]\);"
# fetch HTML code
with urllib.request.urlopen(url) as f:
    html = f.read().decode("utf8")
# extract element IDs (with a single group, findall returns the group's text)
for m in re.findall(pattern, html):
    print(m)
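To try the pattern without downloading anything, it can be run against a short string. The line assigned to source below is a made-up stand-in for the page's JavaScript, built from the phrase and ID mentioned in the question, and find_element_ids is a hypothetical helper name.

```python
import re

# Made-up stand-in for the JavaScript found in the downloaded page.
source = 'var error=document.getElementById("error_homepage");'

# Optional whitespace is allowed around "=" and inside the parentheses.
PATTERN = re.compile(
    r"""var\s+error\s*=\s*document\.getElementById\(\s*['"](?P<element_id>[\w-]+)['"]\s*\)"""
)

def find_element_ids(text):
    """Return every element ID passed to getElementById in a 'var error' assignment."""
    return [m.group("element_id") for m in PATTERN.finditer(text)]

print(find_element_ids(source))
```

Using finditer with the named group makes it explicit which part of the match is returned, which helps if more groups are added later.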

How to scrape data from a website using Python 2?

When I run this code I keep getting empty brackets instead of the actual data.
I am trying to figure out why, since I don't receive any error messages.
import urllib
import re
symbolslist = ["aapl","spy","goog","nflx"]
for symbol in symbolslist:
    url = "http://finance.yahoo.com/q?s=%s&ql=1"%(symbol)
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<span id="yfs_184_%s">(.+?)</span>'%(symbol.lower())
    pattern = re.compile(regex)
    price = re.findall(pattern, htmltext)
    print price

The empty brackets come up because the element id in the regex is wrong: it is not 184 but l84, with a lowercase L rather than the digit one.
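The effect of that one character can be seen on a small sample. The sketch below is written for Python 3 rather than the question's Python 2, and sample_html is an invented fragment, since the real page markup is not shown here. The helper find_price and its element_prefix parameter are hypothetical names.

```python
import re

# Invented HTML fragment; the id uses "l84" with a lowercase L.
sample_html = '<span id="yfs_l84_aapl">123.45</span>'

def find_price(symbol, html, element_prefix="yfs_l84"):
    """Return the text of every span whose id is <element_prefix>_<symbol>."""
    pattern = re.compile(
        r'<span id="%s_%s">(.+?)</span>'
        % (re.escape(element_prefix), re.escape(symbol.lower()))
    )
    return pattern.findall(html)

print(find_price("aapl", sample_html))                            # finds the price
print(find_price("aapl", sample_html, element_prefix="yfs_184"))  # [] with the digit 1
```

Because findall returns an empty list rather than raising when nothing matches, a mistyped id fails silently, which is why no error message appeared.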

There are a number of libraries that can help you scrape sites. Take a look at Scrapy or Beautiful Soup; they should support both Python 2 and 3 as far as I know.
