I'm trying to fetch the source of a 4chan board and get links to the threads.
I'm having a problem with the regexp (it isn't working as expected). Source:
import urllib2, re
req = urllib2.Request('http://boards.4chan.org/wg/')
resp = urllib2.urlopen(req)
html = resp.read()
print re.findall("res/[0-9]+", html)
#print re.findall("^res/[0-9]+$", html)
The problem is that:
print re.findall("res/[0-9]+", html)
is giving duplicates.
I can't use:
print re.findall("^res/[0-9]+$", html)
I have read the Python docs but they didn't help.
That's because there are multiple copies of the link in the source.
You can easily make them unique by putting them in a set.
>>> print set(re.findall("res/[0-9]+", html))
set(['res/3833795', 'res/3837945', 'res/3835377', 'res/3837941', 'res/3837942',
'res/3837950', 'res/3100203', 'res/3836997', 'res/3837643', 'res/3835174'])
But if you are going to do anything more complex than this, I'd recommend using a library that can parse HTML, such as BeautifulSoup or lxml.
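For example, a rough BeautifulSoup version of the same extraction might look like this (a sketch, assuming the bs4 package is installed and that thread links contain res/<number>):
import urllib2, re
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://boards.4chan.org/wg/').read()
soup = BeautifulSoup(html, 'html.parser')

# collect unique thread links from the anchor tags instead of the raw HTML
threads = set()
for a in soup.find_all('a', href=True):
    m = re.search(r'res/[0-9]+', a['href'])
    if m:
        threads.add(m.group(0))
print threads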
Related
I am trying to extract data from the website https://www.icra.in/Rationale/Index?CompanyName=20%20Microns%20Limited using Scrapy and Beautiful Soup. However, both scrapers return nothing when I use the class 'list-nw'.
I tried different parsers with BS but got the same result. On closer look, I noticed that the view-source contains the data I need, so I fetch the page content as text (which has the data) rather than going through the class.
How do I extract the entire array for the key "LstrationaleDetails" inside the variable var Model (line 793 of the page source) using regex?
I tried several regexes but was unable to get it to work. Is regex the only option, or can I use Scrapy or BS? I'm also confused about how to store the result after extracting it; if it were JSON I could deserialize it. I was thinking along the lines of split and eval.
I tried this with BS:
import urllib.request
from bs4 import BeautifulSoup

quote_page = 'https://www.icra.in/Rationale/Index?CompanyName=20%20Microns%20Limited'
page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, 'html5lib')
print(soup)
Thanks for the help.
Credit to @t.m.adam.
You can use the following regex to extract the data from the source HTML. Use the DOTALL flag to allow for newlines. A User-Agent is required in the headers.
import requests
import re
import json

url = 'https://www.icra.in/Rationale/Index?CompanyName=20%20Microns%20Limited'
headers = {
    'User-Agent': 'Mozilla/5.0'
}
r = requests.get(url, headers=headers)
data = re.search(r'var Model =(.*?);\s+Ratinoal', r.text, flags=re.DOTALL).group(1)
result = json.loads(data)

for item in result['LstrationaleDetails']:
    print(item)
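On the storage question from the OP: since the extracted value is parsed with json.loads, it can be written straight back out with json.dump. A minimal sketch (the output filename is just an example):
# reuses `result` and `json` from the snippet above; the filename is arbitrary
with open('rationale.json', 'w') as f:
    json.dump(result['LstrationaleDetails'], f, indent=2)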
How can we print the URLs that appear in Google's search results using Python 2.7?
Here is the code:
import sys
import requests

domain = sys.argv[1]
print domain

test = []
test.append("aa")
mainURL = "http://google.com/?q="
finalurl = mainURL + test[0]
req = requests.get(finalurl)
How can I print the URLs after firing off requests.get(finalurl)?
Normally it would be:
import requests
from bs4 import BeautifulSoup

r = requests.get(finalurl)
soup = BeautifulSoup(r.text, 'html.parser')
print soup.find_all("cite")
You can look at the website's source code (Ctrl+U, or the dev-tools Inspector) and find the tags/classes/ids where your data is. I see e.g.: <cite class="iUh30">link...
Now here is the problem: if you look at the output of
print(soup.prettify())
you see that they do not just give you back plain text; the results are generated with JS.
You could have a look at other search engines' responses.
You could dig deeper into JS execution from Python (e.g. Selenium or similar libraries); a rough sketch follows below.
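As an illustration of the JS-execution route, here is a minimal sketch, assuming Selenium and a matching Chrome driver are installed (the query is just the "aa" test value from the question):
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # drives a real browser, so JS-generated results are present
driver.get("https://www.google.com/search?q=aa")
soup = BeautifulSoup(driver.page_source, "html.parser")
for cite in soup.find_all("cite"):
    print(cite.text)
driver.quit()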
Please also bear in mind that some would not consider scraping data from other websites programmatically and in bulk without permission legitimate. I have no idea what Google's policy is regarding this.
See also: requests advanced usage, and the bs4 documentation.
EDIT:
@Rahul's link is the fastest solution for the OP:
https://www.geeksforgeeks.org/performing-google-search-using-python-code/
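If you go that route, here is a minimal sketch of the approach from the linked article, assuming the googlesearch package it uses (pip install google) is available:
from googlesearch import search

# search() yields result URLs; num/stop/pause mirror the linked article
for result_url in search("aa", num=10, stop=10, pause=2):
    print(result_url)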
I found code similar to this in a course I was taking. It gets all of the links of a certain format mentioned in the source code of the webpage. I understand everything except for the last line, which is:
print link.attrs.get('href', '')
This works; however, I'm unsure how the instructor figured out how to do this. I've looked through the documentation and I can't figure out what .get does. Could someone please let me know how I can find this information?
Documentation for Pattern Library: http://www.clips.ua.ac.be/pages/pattern-web
import requests
from pattern import web

xml = requests.get("https://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html").text
pattern = 'http://www.realclearpolitics.com/epolls/????/governor/??/*-*.html'  # glob-style pattern from the course; not used in this snippet
dom = web.Element(xml)
all_links = dom.by_tag('a')
for link in all_links:
    print link.attrs.get('href', '')
It gets all the href "hyperlinks" in that page. You can use the BeautifulSoup package, which is more convenient:
import requests
from bs4 import BeautifulSoup

xml = requests.get("https://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html").text
soup = BeautifulSoup(xml, "lxml")  # lxml is just the parser for reading the html
soup.find_all('a', href=True)  # this is the line that does what you want: every <a> tag that has an href
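As for the original question about .get: link.attrs behaves like a Python dictionary, and dict.get(key, default) returns the value for the key, or the default when the key is missing, instead of raising a KeyError. A tiny made-up illustration:
attrs = {'href': 'http://example.com/page.html'}
print attrs.get('href', '')   # 'http://example.com/page.html'
print attrs.get('title', '')  # '' -- missing key, so the default is returned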
I'm trying to open a webpage and return all the links as a dictionary that would look like this.
{"http://my.computer.com/some/file.html" : "link text"}
So the link would be after the href= and the text would be between the > and the </a>
I'm using https://www.yahoo.com/ as my test website
I keep getting this error:
'href=' in line:
TypeError: a bytes-like object is required, not 'str'
Here's my code:
def urlDict(myUrl):
    url = myUrl
    page = urllib.request.urlopen(url)
    pageText = page.readlines()
    urlList = {}
    for line in pageText:
        if '<a href=' in line:
            try:
                url = line.split('<a href="')[-1].split('">')[0]
                txt = line.split('<a href="')[-1].split('">')[-1].split('< /a>')[0]
                urlList[url] = txt
            except:
                pass
    return urlList
What am I doing wrong? I've looked around and people mostly suggest this BeautifulSoup parser thing. I'd use it, but I don't think that would fly with my teacher.
The issue is that you're comparing a regular string against a byte string. If you add print(line) as the first statement in your for loop, you'll see that each line of HTML prints with a b' prefix, indicating it is a bytes object rather than a str. This makes things difficult. The proper way to use urllib here is the following:
def url_dict(myUrl):
    with urllib.request.urlopen(myUrl) as f:
        s = f.read().decode('utf-8')
This will have the s variable hold the entire text of the page. You can then use a regular expression to parse out the links and the link target. Here is an example which will pull the link targets without the HTML.
import urllib.request
import re

def url_dict():
    # url = myUrl
    with urllib.request.urlopen('http://www.yahoo.com') as f:
        s = f.read().decode('utf-8')
    r = re.compile('(?<=href=").*?(?=")')
    print(r.findall(s))

url_dict()
Using regex to get both the html and the link itself in a dictionary is outside the scope of where you are in your class, so I would absolutely not recommend submitting it for the assignment, although I would recommend learning it for later use.
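For later reference, a rough sketch of that regex approach, assuming s holds the decoded page text as in the snippet above (fragile by nature, since a regex is not an HTML parser):
# capture the href and the link text in one pass, then build the dictionary
link_re = re.compile(r'<a [^>]*href="([^"]*)"[^>]*>(.*?)</a>', re.DOTALL)
url_dict = {href: text for href, text in link_re.findall(s)}
print(url_dict)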
You'll want to use BeautifulSoup as suggested, as it makes this entire thing extremely easy. There is an example in the docs that you can cut and paste to extract the URLs.
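That docs example boils down to roughly the following sketch (reusing the urllib approach from above):
import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen('http://www.yahoo.com') as f:
    soup = BeautifulSoup(f.read().decode('utf-8'), 'html.parser')

for link in soup.find_all('a'):
    print(link.get('href'))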
For what it's worth, here is a BeautifulSoup and requests approach.
Feel free to replace requests with urllib, but BeautifulSoup doesn't really have a nice replacement.
import requests
from bs4 import BeautifulSoup

def get_links(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    # only keep <a> tags that actually carry an href, to avoid KeyErrors
    return {a_tag['href']: a_tag.text for a_tag in soup.find_all('a', href=True)}

for link, text in get_links('https://www.yahoo.com/').items():
    print(text.strip(), link)
This is my first Python project so it is very basic and rudimentary.
I often have to clean off viruses for friends and the free programs that I use are updated often. Instead of manually downloading each program, I was trying to create a simple way to automate the process. Since I am also trying to learn python I thought it would be a good opportunity to practice.
Questions:
For some of the links I have to find the .exe file on the page first. I can find the correct URL, but I get an error when it tries to download.
Is there a way to add all of the links to a list, and then create a function that goes through the list and runs on each URL? I've Googled quite a bit and I just cannot seem to make it work. Maybe I am not thinking in the right direction?
import urllib, urllib2, re, os
from BeautifulSoup import BeautifulSoup

# Website List
sas = 'http://cdn.superantispyware.com/SUPERAntiSpyware.exe'
tds = 'http://support.kaspersky.com/downloads/utils/tdsskiller.exe'
mbam = 'http://www.bleepingcomputer.com/download/malwarebytes-anti-malware/dl/7/?1'
tr = 'http://www.simplysup.com/tremover/download.html'
urllist = [sas, tr, tds, tr]
urllist2 = []

# Find exe files to download
match = re.compile('\.exe')
data = urllib2.urlopen(urllist)
page = BeautifulSoup(data)

# Check links
#def findexe():
for link in page.findAll('a'):
    try:
        href = link['href']
        if re.search(match, href):
            urllist2.append(href)
    except KeyError:
        pass

os.chdir(r"C:\_VirusFixes")
urllib.urlretrieve(urllist2, os.path.basename(urllist2))
As you can see, I have left the function commented out as I cannot get it to work correctly.
Should I abandon the list and just download them individually? I was trying to be efficient.
Any suggestions or if you could point me in the right direction, it would be most appreciated.
In addition to mikez302's answer, here's a slightly more readable way to write your code:
import os
import re
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

websites = [
    'http://cdn.superantispyware.com/SUPERAntiSpyware.exe',
    'http://support.kaspersky.com/downloads/utils/tdsskiller.exe',
    'http://www.bleepingcomputer.com/download/malwarebytes-anti-malware/dl/7/?1',
    'http://www.simplysup.com/tremover/download.html',
]

download_links = []

for url in websites:
    connection = urllib2.urlopen(url)
    soup = BeautifulSoup(connection)
    connection.close()
    # collect every link whose href ends in .exe
    for link in soup.findAll('a', {'href': re.compile(r'\.exe$')}):
        download_links.append(link['href'])

for url in download_links:
    urllib.urlretrieve(url, os.path.join(r'C:\_VirusFixes', os.path.basename(url)))
urllib2.urlopen is a function for accessing a single URL. If you want to access multiple ones, you should loop over the list. You should do something like this:
for url in urllist:
    data = urllib2.urlopen(url)
    page = BeautifulSoup(data)

    # Check links
    for link in page.findAll('a'):
        try:
            href = link['href']
            if re.search(match, href):
                urllist2.append(href)
        except KeyError:
            pass

os.chdir(r"C:\_VirusFixes")
# download each collected link individually -- urlretrieve takes one URL at a time
for href in urllist2:
    urllib.urlretrieve(href, os.path.basename(href))
The code above didn't work for me; in my case that was because the pages assemble their links through a script instead of including them in the HTML. When I ran into that problem I used the following code, which is just a scraper:
import os
import re
import urllib
import urllib2
from bs4 import BeautifulSoup

url = ''

connection = urllib2.urlopen(url)
soup = BeautifulSoup(connection)  # Everything the same up to here

regex = '(.+?).zip'  # Here we insert the pattern we are looking for
pattern = re.compile(regex)
link = re.findall(pattern, str(soup))  # This finds all the .zip (.exe) in the text

# When it finds all the .zip, it usually comes back with a lot of undesirable
# text; luckily the file name is almost always separated by a space from the
# rest of the text, which is why we do the split.
x = 0
for i in link:
    link[x] = i.split(' ')[len(i.split(' ')) - 1]
    x += 1

os.chdir(r"F:\Documents")
# This is the filepath where I want to save everything I download

for i in link:
    urllib.urlretrieve(url, filename=i + ".zip")  # The text we found doesn't include the .zip (or .exe in your case), so we reestablish that here.
This is not as efficient as the code in the previous answers, but it will work for almost any site.