Scraping zip files from a website in Python

I was hoping someone could help me figure out how to scrape data from this page. I don't know where to start, as I've never worked with scraping or automating downloads in Python, but I'm just trying to find a way to automate downloading all the files on the linked page (and others like it -- just using this one as an example).
There is no discernible pattern in the file names linked; they appear to be random numbers that reference an ID-file name lookup table elsewhere.

For the URL you provided, you can download the zip files with the following code:
import re
import requests
from bs4 import BeautifulSoup

hostname = "http://mis.ercot.com"
r = requests.get(f'{hostname}/misapp/GetReports.do?reportTypeId=13060&reportTitle=Historical%20DAM%20Load%20Zone%20and%20Hub%20Prices&showHTMLView=&mimicKey')
soup = BeautifulSoup(r.text, 'html.parser')

# Every download link on the page points at the mirDownload servlet
regex = re.compile('.*misdownload/servlets/mirDownload.*')
atags = soup.find_all("a", {"href": regex})

for link in atags:
    data = requests.get(f"{hostname}{link['href']}")
    # Use the doclookupId query parameter as the filename (trimming the trailing character)
    filename = link["href"].split("doclookupId=")[1][:-1] + ".zip"
    with open(filename, "wb") as savezip:
        savezip.write(data.content)
    print(filename, "Saved")
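If you then want the contents rather than the archives themselves, Python's standard zipfile module can unpack each download; a small sketch (the extraction directory name is an assumption, pick whatever suits you):
import zipfile

# Unpack one downloaded archive into ./extracted (directory name is arbitrary)
with zipfile.ZipFile(filename) as zf:
    zf.extractall("extracted")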
Let me know if you have any questions :)

Related

Best way to scrape job details from job descriptions

I'm new to web scrapers and I prefer to use Python. Does anyone have ideas for the easiest way to scrape job descriptions and put them into an Excel file? Which scraper would you use?
It depends; for a dynamic website, Selenium is great. Selenium is a tool that automates web actions. Beautiful Soup is another option: it doesn't automate website actions, it just scrapes website data. In my opinion, Beautiful Soup is easier to learn; one basic introduction will be all you need. As for the Excel file, there are several libraries you could use; that is more a matter of preference.
However, for your project I would go with Beautiful Soup.
As for the process of learning, YouTube is a great place to find tutorials; there are several for both. It's also really easy to find help with issues with either on here.
To give you a hint as to the general structure of your program, I would suggest something like this (a rough sketch follows the list):
First step: open an Excel file; this file will remain open the whole time.
Second step: the web scraper locates the HTML tag holding the job descriptions.
Third step: use a for loop to cycle through each job description within this tag.
Fourth step: for each tag, retrieve the data and send it to the Excel sheet.
Fifth step: once you're done, close the Excel file.
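A minimal sketch of that structure, assuming openpyxl for the Excel side; the URL and the job-description class name are placeholders, not details from your site:
import requests
from bs4 import BeautifulSoup
from openpyxl import Workbook

# Step 1: open a workbook that stays open for the whole run
wb = Workbook()
ws = wb.active
ws.append(["Job description"])

# Step 2: fetch the page and locate the tags holding the descriptions
r = requests.get("https://example.com/jobs")
soup = BeautifulSoup(r.text, "html.parser")

# Steps 3 and 4: loop over each description and send it to the sheet
for tag in soup.find_all("div", class_="job-description"):
    ws.append([tag.get_text(strip=True)])

# Step 5: close out by saving the workbook
wb.save("jobs.xlsx")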
Libraries I personally use: here
This is generally the boilerplate code most people probably use to start web scraping:
import requests
from bs4 import BeautifulSoup
import re
from pprint import pprint
from os.path import dirname, join

# Directory of the running script, handy for building file paths later
current_dir = dirname(__file__)
print(current_dir)

code = 0  # placeholder status variable (unused in this snippet)
url_loop = "https://test.com"  # placeholder URL; requests needs the scheme included
r = requests.get(url_loop)

# Some sites return this text in the body instead of a proper HTTP error code
error = "The page cannot be displayed because an internal server error has occurred."

soup = BeautifulSoup(r.text, 'html.parser')
Requests is how you send HTTP requests.
BS4 is how you parse the page and extract specific info from it, such as all the h1 tags.
pprint just formats the result nicely.
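For instance, continuing from the soup object above, pulling out every h1 looks like this:
# Collect the text of every h1 tag on the page
h1_texts = [h.get_text(strip=True) for h in soup.find_all('h1')]
pprint(h1_texts)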
As for using the collected data in Excel: here
Good luck!

Scraping youtube to get dynamically loaded content

I'm trying to scrape YouTube, but most of the time I do it, it just gives back an empty result.
In this code snippet I'm trying to get the list of the video titles on the page. But when I run it I just get an empty result back; not even one title shows up.
I've searched around, and some results point out that it's because the website loads content dynamically with JavaScript. Is this the case? How do I go about solving this? Is it possible to do without Selenium?
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.youtube.com/user/schafer5/videos').text
soup = BeautifulSoup(source, 'lxml')
title = soup.findAll('a', attrs={'class': "yt-simple-endpoint style-scope ytd-grid-video-renderer"})
print(title)
Is it possible to do without selenium?
Often services have APIs which allow easier automation than scraping the site. YouTube has an API, and there are official client libraries for various languages; for Python there is google-api-python-client. You will need an API key to use it. To get running, I suggest following the YouTube Data API quickstart; note that you can ignore the OAuth 2.0 parts, as long as you only need access to public data.
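A minimal sketch of that approach with google-api-python-client, assuming you already have an API key (the key value below is a placeholder) and that the channel's uploads playlist is what you want:
from googleapiclient.discovery import build

# Build the client with an API key from the Google Cloud console
youtube = build('youtube', 'v3', developerKey='YOUR_API_KEY')

# Look up the channel's "uploads" playlist by its legacy username
channels = youtube.channels().list(part='contentDetails', forUsername='schafer5').execute()
uploads_id = channels['items'][0]['contentDetails']['relatedPlaylists']['uploads']

# Page through the playlist and print every video title
request = youtube.playlistItems().list(part='snippet', playlistId=uploads_id, maxResults=50)
while request is not None:
    response = request.execute()
    for item in response['items']:
        print(item['snippet']['title'])
    request = youtube.playlistItems().list_next(request, response)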
I totally agree with @Daweo, and that's the right way to scrape a website like YouTube. But if you want to use BeautifulSoup and not get an empty list at the end, your code should be changed as follows:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.youtube.com/user/schafer5/videos').text
soup = BeautifulSoup(source, 'html.parser')
# Keep only the anchor tags that carry video titles (they have an aria-describedby attribute)
titles = [i.text for i in soup.findAll('a') if i.get('aria-describedby')]
print(titles)
I also suggest that you use the API.

Problem with web scraping with BeautifulSoup

I am new to using BeautifulSoup and have a question; I appreciate your help:
from bs4 import BeautifulSoup as soup
import requests
URL = 'https://www.kbb.com/car-values/'
page = requests.get(URL)
soup1 = soup(page.content, 'html-parser')
print(soup1.prettify())
In parallel, I went to the URL in a separate browser and inspected the page to get the HTML version of the page and establish patterns.
I found two independent patterns that meet my need:
yyyy1
and
yyyy2
P.S. xxxx1, xxxx2, yyyy1 and yyyy2 are just strings.
I went back to the prettify() output and searched for the pattern xxxx1 and found it, but when I searched for the pattern xxxx2, I could not find it.
It seems like the soup object does not contain all the info in the HTML page? Or am I not looking at the right HTML page?
I cannot work out what I did wrong or how to do it right.
Thanks
Initially, a modification was needed to run your code: changing 'html-parser' to 'html.parser'. This fixed the error bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: html-parser. Do you need to install a parser library?
Locally, when I try your code, I get:
Access Denied
You don't have permission to access "http://www.kbb.com/" on this server.
Reference #18.afe17b5c.1587328194.c07350f
Are there restrictions on some countries?

Parse an entire website using Python BeautifulSoup

When I try to parse https://www.forbes.com/ for learning purposes and I run the code, it only parses one page, I mean, the home page.
How can I parse the entire website, I mean, all the pages of the site?
My attempted code is given below:
from bs4 import BeautifulSoup
import re
from urllib.request import urlopen
import pandas as pd

html_page = urlopen("http://www.bdjobs.com/")
soup = BeautifulSoup(html_page, "html.parser")

# Collect every absolute link on the page
links = []
for link in soup.findAll('a', attrs={'href': re.compile("^http")}):
    links.append(link.get('href'))

# Export the links to a csv file
df = pd.DataFrame(links)
df.to_csv('link.csv')
#print(df)
Can you please tell me how I can parse entire websites, not just one page?
You have a couple of alternatives; it depends what you want to achieve.
Write your own crawler
Similarly to what you are trying to do in your code snippet, fetch a page from the website, identify all the interesting links on this page (using XPath, regular expressions, ...) and iterate until you have visited the whole domain.
This is probably most suitable for learning the basics of crawling, or to get some information quickly as a one-off task.
You'll have to be careful about a couple of things, like not visiting the same links twice, limiting the domain(s) to avoid going to other websites, etc. A rough sketch of the idea follows.
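A minimal sketch of such a crawler with requests and BeautifulSoup; the start URL is a placeholder:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

start_url = "https://example.com/"   # placeholder starting point
domain = urlparse(start_url).netloc  # the domain we limit ourselves to

visited = set()
queue = [start_url]

while queue:
    url = queue.pop()
    if url in visited:
        continue  # never visit the same link twice
    visited.add(url)

    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")

    # ... extract whatever interests you from `soup` here ...

    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])  # resolve relative links
        # stay on the same domain and skip pages already seen
        if urlparse(link).netloc == domain and link not in visited:
            queue.append(link)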
Use a web scraping framework
If you are looking to perform some serious scraping, for a production application or for some large-scale scraping, consider using a framework such as Scrapy.
It solves a lot of common problems for you, and it is a great way to learn advanced techniques of web scraping, by reading the documentation and diving into the code.
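For comparison, a minimal Scrapy spider that does roughly the same link walk (the domain is again a placeholder):
import scrapy

class LinkSpider(scrapy.Spider):
    name = "links"
    allowed_domains = ["example.com"]      # Scrapy enforces the domain limit for you
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # yield whatever data you care about from this page
        yield {"url": response.url, "title": response.css("title::text").get()}
        # follow every link; Scrapy deduplicates visited URLs itself
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
Saved as spider.py, this runs with: scrapy runspider spider.py -o pages.json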

Scraping data from multiple links within a site

I would like to use ScraperWiki and Python to build a scraper that will scrape large amounts of information from different sites. I am wondering if it is possible to point to a single URL and then scrape the data from each of the links within that site.
For example: a site would contain information about different projects, each within its own individual link. I don't need a list of those links, but the actual data contained within them.
The scraper would be looking for the same attributes on each of the links.
Does anyone know how or if I could go about doing this?
Thanks!
Check out BeautifulSoup with urllib2.
http://www.crummy.com/software/BeautifulSoup/
A (very) rough example link scraper would look like this:
from bs4 import BeautifulSoup
import urllib2  # Python 2; on Python 3 use urllib.request instead

url = "http://example.com"  # placeholder: the page whose links you want
c = urllib2.urlopen(url)
contents = c.read()
soup = BeautifulSoup(contents, "html.parser")

# All the anchor tags on the page
links = soup.find_all('a')
Then just write a for loop to do that many times over and you're set!
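A rough sketch of that loop, staying with urllib2 from the snippet above; the h1 selector is a placeholder for whatever attribute your project pages actually share:
results = []
for link in links:
    href = link.get('href')
    if not href or not href.startswith('http'):
        continue  # skip anchors that aren't absolute URLs
    page_soup = BeautifulSoup(urllib2.urlopen(href).read(), "html.parser")
    # look for the same attribute on every project page
    heading = page_soup.find('h1')
    if heading:
        results.append((href, heading.get_text(strip=True)))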
