Currently trying to extract links from Pastebin using Python. What I have so far:
from bs4 import BeautifulSoup
import re
import requests
from random import randint
import time
from lxml import etree
from time import sleep
import random
a = requests.get('http://pastebin.com/JGM3p9c9')
scrape = BeautifulSoup(a.text, 'lxml')
linkz = scrape.find_all("textarea", {"id":"paste_code"})
rawlinks = str(linkz)
partition1 = rawlinks.partition('\\r')[0]
links = partition1.partition('">')[-1]
I can't seem to get Python to collect all of the http:// formatted links, only the first one. The regexes I found online didn't work either.
The end goal is to get the links into a list so that I can send requests to every link in that list.
Firstly, you do not have to extract the complete tag and convert it to str. A better way to achieve it is:
>>> # use `find` instead of `find_all`, and `.next` to extract the content within the tag
>>> my_links = scrape.find("textarea", {"id":"paste_code"}).next
where my_links will hold the value:
u'http://www.walmart.com\r\nhttp://www.target.com\r\nhttp://www.lowes.com\r\nhttp://www.sears.com'
In order to convert this string to your desired list of links, you may split the string on \r\n as:
>>> my_links.split('\r\n')
[u'http://www.walmart.com', u'http://www.target.com', u'http://www.lowes.com', u'http://www.sears.com']
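Since the end goal is to request each of these links, here is a minimal follow-up sketch; the timeout value and the status-code print are just illustrative assumptions:
import requests
from bs4 import BeautifulSoup

scrape = BeautifulSoup(requests.get('http://pastebin.com/JGM3p9c9').text, 'lxml')
my_links = scrape.find("textarea", {"id": "paste_code"}).next
urls = my_links.split('\r\n')

# Request each extracted URL in turn; a short timeout is assumed here.
for url in urls:
    response = requests.get(url, timeout=10)
    print(url + ' -> ' + str(response.status_code))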
You need to navigate through a couple of layers of HTML. I had a look at the Pastebin page, and I think this code will find what you want (sorry for switching a couple of modules; I just use these ones instead):
from bs4 import BeautifulSoup
import urllib.request

a = urllib.request.urlopen('http://pastebin.com/JGM3p9c9')
scrape = BeautifulSoup(a, 'html.parser')
x1 = scrape.find_all('div', id='selectable')
for x2 in x1:
    x3 = x2.find_all('li')
    for x4 in x3:
        x5 = x4.find_all('div')
        for x6 in x5:
            print(x6.string)
Next time you need to scrape a specific thing, I advise looking at the HTML of the website by right-clicking and selecting 'Inspect Element'. You can also do:
print(scrape.prettify())
To get a better idea of how the HTML is nested.
Forget using BS to parse the HTML - in this specific case, you can get the content of the paste directly and turn this into a one-liner.
import requests
links = [link.strip() for link in requests.get('http://pastebin.com/raw/JGM3p9c9').text.split('\n')]
You can also split on \r\n
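If you would rather not depend on the exact line endings, str.splitlines() handles both \n and \r\n; a minimal variant of the one-liner above:
import requests

# splitlines() copes with both \n and \r\n, and the filter drops any blank lines.
links = [line.strip() for line in requests.get('http://pastebin.com/raw/JGM3p9c9').text.splitlines() if line.strip()]
print(links)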
I am currently working on scraping HTML from Baka-Updates.
However, the div class names are duplicated.
Since my goal is CSV or JSON output, I would like to use the information in [sCat] as the column name and store the corresponding [sContent].
Is there a way to scrape this kind of website?
Thanks,
Sample
https://www.mangaupdates.com/series.html?id=75363
from lxml import html
import requests
page = requests.get('http://www.mangaupdates.com/series.html?id=153558?')
tree = html.fromstring(page.content)
#Get the name of the columns.... I hope
sCat = tree.xpath('//div[@class="sCat"]/text()')
#Get the actual data
sContent = tree.xpath('//div[@class="sContent"]/text()')
print('sCat: ', sCat)
print('sContent: ', sContent)
I tried, but I could not find anything that works.
@Jasper Nichol M Fabella, I tried to edit your code and got the following output. Maybe it will help.
from lxml import html
import requests

page = requests.get('http://www.mangaupdates.com/series.html?id=153558?')
tree = html.fromstring(page.content)
# print(page.content)

#Get the name of the columns.... I hope
sCat = tree.xpath('//div[@class="sCat"]')
#Get the actual data
sContent = tree.xpath('//div[@class="sContent"]')

print('sCat: ', len(sCat))
print('sContent: ', len(sContent))

json_dict = {}
for i in range(0, len(sCat)):
    # print(''.join(i.itertext()))
    sCat_text = ''.join(sCat[i].itertext())
    sContent_text = ''.join(sContent[i].itertext())
    json_dict[sCat_text] = sContent_text

print(json_dict)
I got the following output
Hope it helps.
You can use XPath expressions to create an absolute path to what you want to scrape. Here is an example with the requests and lxml libraries:
from lxml import html
import requests
r = requests.get('https://www.mangaupdates.com/series.html?id=75363')
tree = html.fromstring(r.content)
sCat = [i.text_content().strip() for i in tree.xpath('//div[@class="sCat"]')]
sContent = [i.text_content().strip() for i in tree.xpath('//div[@class="sContent"]')]
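Since the goal is CSV or JSON, pairing the two lists positionally gets you most of the way there. A sketch, assuming sCat and sContent line up one-to-one as above (the output filename is arbitrary):
import csv
import json

# Pair each category with its content; assumes the two lists have equal length.
record = dict(zip(sCat, sContent))

# JSON output
print(json.dumps(record, indent=2))

# CSV output, using the sCat values as column headers
with open('series.csv', 'w') as f:
    writer = csv.DictWriter(f, fieldnames=list(record.keys()))
    writer.writeheader()
    writer.writerow(record)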
What are you using to scrape?
If you are using BeautifulSoup, you can search for all matching content on the page with the find_all method and a class identifier, then iterate through the results. You can use the special class_ keyword argument.
Something like:
import bs4
soup = bs4.BeautifulSoup(html.source)
soup.find_all('div', class_='sCat')
# do rest of your logic work here
Edit: I was typing on my mobile on a cached page before you made the edits, so I didn't see the changes. I see you are using the raw lxml library to parse. Yes, that's faster, but I am not too familiar with it, as I've only used the raw lxml library for one project. I think you can chain two search methods to distill something equivalent.
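For reference, a rough BeautifulSoup equivalent of the lxml approach above, chaining two class_ searches and zipping the results (a sketch only, not tested against the live page):
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://www.mangaupdates.com/series.html?id=75363').text, 'html.parser')

# Grab both column divs and pair them up by position.
cats = [div.get_text(strip=True) for div in soup.find_all('div', class_='sCat')]
contents = [div.get_text(strip=True) for div in soup.find_all('div', class_='sContent')]
print(dict(zip(cats, contents)))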
I'm new to Python, and for my second attempt at a project, I wanted to extract a substring – specifically, an identifying number – from a hyper-reference on a url.
For example, this url is the result of my search query, giving the hyper-reference http://www.chessgames.com/perl/chessgame?gid=1012809. From this I want to extract the identifying number "1012809" and use it to navigate to the url http://www.chessgames.com/perl/chessgame?gid=1012809, after which I plan to download the file at the url http://www.chessgames.com/pgn/alekhine_naegeli_1932.pgn?gid=1012809. But I am currently stuck a few steps before this because I can't figure out a way to extract the identifier.
Here is my MWE:
from bs4 import BeautifulSoup
import urllib2
import re

url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')

y = str(soup)
x = re.findall("gid=[0-9]+", y)
print x
z = re.sub("gid=", "", x(1))  # At this point, things have completely broken down...
As Albin Paul commented, re.findall returns a list, so you need to extract elements from it. By the way, you don't need BeautifulSoup here; use urllib2.urlopen(url).read() to get the content as a string. The re.sub is also not needed: one regex pattern, (?:gid=)([0-9]+), is enough.
import re
import urllib2
url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url).read()
result = re.findall(r"(?:gid=)([0-9]+)",page)
print(result[0])
#'1012809'
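With the identifier in hand, building the game-page URL from the question is just string concatenation, continuing from the snippet above:
gid = result[0]
# Append the extracted identifier to get back to the game page.
game_url = 'http://www.chessgames.com/perl/chessgame?gid=' + gid
print(game_url)  # http://www.chessgames.com/perl/chessgame?gid=1012809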
You don't need regex here at all. A CSS selector along with some string manipulation will lead you in the right direction. Try the script below:
import requests
from bs4 import BeautifulSoup
page_link = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
soup = BeautifulSoup(requests.get(page_link).text, 'lxml')
item_num = soup.select_one("[href*='gid=']")['href'].split("gid=")[1]
print(item_num)
Output:
1012809
I am trying to access a table from the NIST website here:
http://physics.nist.gov/PhysRefData/XrayMassCoef/tab3.html
Assume that I click the element zinc. I would like to retrieve the information for Energy, μ/ρ and μen/ρ into 3 columns of a table using Python 2.7.
I am beginning to learn BeautifulSoup and Mechanize. However, I am finding it hard to identify a clear pattern in the HTML code relating to the table on this site.
What I am looking for is some way to do something like this:
import mechanize
from bs4 import BeautifulSoup

mech = mechanize.Browser()
page = mech.open("http://physics.nist.gov/PhysRefData/XrayMassCoef/tab3.html")
html = page.read()
soup = BeautifulSoup(html)
My thought was to try:
table = soup.find("table",...)
The ... above would be some identifier. I can't find a clear identifier on the NIST website above.
How would I be able to import this table using python 2.7?
EDIT: Is it possible to put these 3 columns in a table?
If I understood you correctly, try this:
from bs4 import BeautifulSoup
import requests

respond = requests.get("http://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z30.html")
soup = BeautifulSoup(respond.text)
l = soup.find('table').find('table').find_all('tr')
for i in range(3, len(l)):
    print l[i].get_text()
Edit:
Another way (getting the ASCII column), putting the rows into the list l:
from bs4 import BeautifulSoup
import requests
respond = requests.get("http://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z30.html")
soup = BeautifulSoup(respond.text)
l = soup.find('table').find('pre').get_text()[145:].split("\n")
print l
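To get the three columns asked about in the edit (Energy, μ/ρ, μen/ρ), you can split each line of that ASCII block on whitespace and keep only the lines whose last three fields parse as numbers. A rough sketch, continuing from the l list above; the assumption is that the three values are always the last three fields on a data line:
rows = []
for line in l:
    parts = line.split()
    if len(parts) < 3:
        continue
    try:
        # Assume the last three fields are Energy, mu/rho and mu_en/rho.
        energy, mu_rho, mu_en_rho = [float(x) for x in parts[-3:]]
    except ValueError:
        continue  # skip header or edge-label lines that are not pure numbers
    rows.append((energy, mu_rho, mu_en_rho))

for row in rows:
    print row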
I've been trying to scrape the names in a table from a website using a BeautifulSoup script, but the program returns nothing, or "[]". I would appreciate it if anyone could point out what I'm doing wrong.
Here is what I'm trying to run:
from bs4 import BeautifulSoup
import urllib2
url="http://www.trackinfo.com/entries-race.jsp?raceid=GBM$20140228E02"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
names=soup.findAll('a',{'href':'href="dog.jsp?runnername=[^.]*'})
for eachname in names:
print eachname.string
And here is one of the elements that I'm trying to get:
<a href="dog.jsp?runnername=PG+BAD+GRANDPA">
PG BAD GRANDPA
</a>
See the documentation for BeautifulSoup, which says that if you want to give a regular expression in a search, you need to pass in a compiled regular expression.
Taking your variables, this is what you want:
import re
names = soup.find_all("a",{"href":re.compile("dog")})
A different approach, this one using Requests instead of urllib2. Matter of preference, really. Main point is that you should clean up your code, especially the indentation on the last line.
from bs4 import BeautifulSoup as bs
import requests
import re
url = "http://www.trackinfo.com/entries-race.jsp?raceid=GBM$20140228E02"
r = requests.get(url).content
soup = bs(r)
soup.prettify()
names = soup.find_all("a", href=re.compile("dog"))
for name in names:
    print name.get_text().strip()
Let us know if this helps.
This is my first Python project so it is very basic and rudimentary.
I often have to clean off viruses for friends and the free programs that I use are updated often. Instead of manually downloading each program, I was trying to create a simple way to automate the process. Since I am also trying to learn python I thought it would be a good opportunity to practice.
Questions:
For some of the links, I have to find the .exe file on the page. I can find the correct URL, but I get an error when it tries to download.
Is there a way to add all of the links into a list, and then create a function to go through the list and run the function on each url? I've Googled quite a bit and I just cannot seem to make it work. Maybe I am not thinking in the right direction?
import urllib, urllib2, re, os
from BeautifulSoup import BeautifulSoup

# Website list
sas = 'http://cdn.superantispyware.com/SUPERAntiSpyware.exe'
tds = 'http://support.kaspersky.com/downloads/utils/tdsskiller.exe'
mbam = 'http://www.bleepingcomputer.com/download/malwarebytes-anti-malware/dl/7/?1'
tr = 'http://www.simplysup.com/tremover/download.html'
urllist = [sas, tds, mbam, tr]
urllist2 = []

# Find exe files to download
match = re.compile(r'\.exe')
data = urllib2.urlopen(urllist)
page = BeautifulSoup(data)

# Check links
#def findexe():
for link in page.findAll('a'):
    try:
        href = link['href']
        if re.search(match, href):
            urllist2.append(href)
    except KeyError:
        pass

os.chdir(r"C:\_VirusFixes")
urllib.urlretrieve(urllist2, os.path.basename(urllist2))
As you can see, I have left the function commented out as I cannot get it to work correctly.
Should I abandon the list and just download them individually? I was trying to be efficient.
Any suggestions or if you could point me in the right direction, it would be most appreciated.
In addition to mikez302's answer, here's a slightly more readable way to write your code:
import os
import re
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

websites = [
    'http://cdn.superantispyware.com/SUPERAntiSpyware.exe',
    'http://support.kaspersky.com/downloads/utils/tdsskiller.exe',
    'http://www.bleepingcomputer.com/download/malwarebytes-anti-malware/dl/7/?1',
    'http://www.simplysup.com/tremover/download.html',
]

download_links = []

for url in websites:
    connection = urllib2.urlopen(url)
    soup = BeautifulSoup(connection)
    connection.close()
    # Collect every link whose href ends in .exe
    for link in soup.findAll('a', {'href': re.compile(r'\.exe$')}):
        download_links.append(link['href'])

for url in download_links:
    urllib.urlretrieve(url, os.path.join(r'C:\_VirusFixes', os.path.basename(url)))
urllib2.urlopen is a function for accessing a single URL. If you want to access multiple ones, you should loop over the list. You should do something like this:
for url in urllist:
    data = urllib2.urlopen(url)
    page = BeautifulSoup(data)

    # Check links
    for link in page.findAll('a'):
        try:
            href = link['href']
            if re.search(match, href):
                urllist2.append(href)
        except KeyError:
            pass

os.chdir(r"C:\_VirusFixes")
for href in urllist2:
    urllib.urlretrieve(href, os.path.basename(href))
The code above didn't work for me; in my case it was because the pages assemble their links through a script instead of including them in the HTML. When I ran into that problem, I used the following code, which is just a scraper:
import os
import re
import urllib
import urllib2
from bs4 import BeautifulSoup

url = ''
connection = urllib2.urlopen(url)
soup = BeautifulSoup(connection)  # Everything the same up to here

regex = r'(.+?)\.zip'  # Here we insert the pattern we are looking for
pattern = re.compile(regex)
link = re.findall(pattern, str(soup))  # This finds all the .zip (.exe) in the text

x = 0
for i in link:
    # When it finds all the .zip, it usually comes back with a lot of undesirable
    # text; luckily the file name is almost always separated by a space from the
    # rest of the text, which is why we do the split.
    link[x] = i.split(' ')[len(i.split(' ')) - 1]
    x += 1

os.chdir(r"F:\Documents")
# This is the filepath where I want to save everything I download
for i in link:
    # Remember that the text we found doesn't include the .zip (or .exe in your
    # case), so we want to re-establish that.
    urllib.urlretrieve(url, filename=i + ".zip")
This is not as efficient as the code in the previous answers, but it will work for almost any site.
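If the download links do show up as ordinary href attributes (rather than being assembled by a script), an alternative is to collect the hrefs directly instead of running a regex over str(soup). This is only a sketch, assuming the same urllib2/bs4 setup, a placeholder page URL, and that the destination directory already exists:
import os
import re
import urllib
import urllib2
import urlparse
from bs4 import BeautifulSoup

url = ''  # the page that lists the installers
download_dir = r'C:\_VirusFixes'  # assumed destination directory

soup = BeautifulSoup(urllib2.urlopen(url))

# Collect absolute URLs for every link ending in .exe or .zip.
targets = [urlparse.urljoin(url, a['href'])
           for a in soup.find_all('a', href=re.compile(r'\.(exe|zip)$'))]

for target in targets:
    urllib.urlretrieve(target, os.path.join(download_dir, os.path.basename(target)))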