RSS scraping system - python

I am relatively new to Python, with only about 2 months of learning, mostly by myself, and loving it. I have been trying to design a program that will scrape text RSS feeds from the National Weather Service, but I have no idea where to start. I want something that will scan for severe weather (tornado watches, warnings, etc.) and send the alerts to my email. I have already scripted a simple email alert system that will even text my phone. I was wondering if any of you could point me in the right direction on how to go about building an RSS scraper and incorporating it with the email program to build a functional weather alert system. I am a huge weather nerd, if you can't tell, and this will end up being my senior year project and something to hopefully impress my meteorology professors next year. I would appreciate any help anybody could give.
Thanks,
Andrew :D

Don't reinvent the wheel; just use FeedParser. It knows how to handle all the corner cases and crazy markup better than you ever will.

You will need an RSS feed parser. Once you have parsed the feeds, you will have all the relevant information you need. Take a look at feedparser: http://code.google.com/p/feedparser/
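As a rough sketch of how the feedparser suggestion could fit the weather-alert idea: the snippet below parses a feed and keeps only the entries whose title mentions a severe-weather keyword. The keyword list and the function names are assumptions made for illustration, not something taken from the answers here, and the feed URL is left as a parameter so you can pass in whichever National Weather Service feed you want to watch.

```python
# Hypothetical keyword list; adjust it to the alert types you care about.
SEVERE_KEYWORDS = ("tornado", "severe thunderstorm", "flash flood")


def is_severe(title, keywords=SEVERE_KEYWORDS):
    """Return True if the title mentions any keyword (case-insensitive)."""
    lowered = title.lower()
    return any(word in lowered for word in keywords)


def severe_entries(entries, keywords=SEVERE_KEYWORDS):
    """Keep only the dict-like entries whose 'title' looks severe."""
    return [e for e in entries if is_severe(e.get("title", ""), keywords)]


def fetch_severe_alerts(feed_url):
    """Download and parse a feed, then filter it.

    Needs the third-party feedparser package (pip install feedparser).
    """
    import feedparser  # imported here so the helpers above work without it

    feed = feedparser.parse(feed_url)
    return severe_entries(feed.entries)
```

Each entry returned by `fetch_severe_alerts` is dict-like, so something like `entry.get("title")` and `entry.get("link")` could be handed to the email script you already have.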

You can use Scrapy. Scrapy is one of the latest and greatest crawling tools.
You can use it to scrape any web content. It's worth learning.
http://doc.scrapy.org/en/0.14/index.html

Related

Scan webpage for text changes and content

New to programming. If this is too basic I do apologize.
As a more complex project to hone my budding skills, I am trying to build a price scanner and reporting feature that waits for new text on a webpage (such as "sale", "discount", etc.), reads the price, and sends that information back to me.
Can you help me get pointed in the right direction?
I think you need to do some research on scraping.
Some courses and books that may help you go deeper:
Data Scraping and Data Mining from Beginner to Pro with Python
Python Web Scraping Cookbook
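To make the idea of "waiting for new text" a little more concrete, here is a minimal sketch of the comparison step only. It assumes you have already downloaded the page and extracted its text (for example with an HTTP library and an HTML parser). The keyword list, the US-style price pattern, and the function names are all assumptions for illustration.

```python
import hashlib
import re

KEYWORDS = ("sale", "discount")  # hypothetical trigger words
PRICE_RE = re.compile(r"\$\s?(\d+(?:\.\d{2})?)")  # assumes prices like $19.99


def fingerprint(text):
    """Hash the page text so two runs can be compared cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def scan(text, previous_fingerprint=None):
    """Summarise one page: did it change, which keywords, which prices."""
    current = fingerprint(text)
    lowered = text.lower()
    return {
        "fingerprint": current,  # store this for the next run
        "changed": current != previous_fingerprint,
        "keywords": [k for k in KEYWORDS if k in lowered],
        "prices": [float(p) for p in PRICE_RE.findall(text)],
    }
```

On each run you would pass in the fingerprint saved from the previous run and only send a notification when `changed` is true and `keywords` is non-empty.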

Can websites detect web scraping if I act like a human (Selenium, Python)?

I use Selenium in Python and I want to scrape a lot of websites from one company (many hundreds). That shouldn't burden their system under any circumstances, and because this is a very large website anyway, it shouldn't be a problem for them.
Now my question is whether the company can somehow discover that I'm web scraping if I act like a human. That means I stay on a website for an extra long time and allow extra time to pass.
I don't think they can recognize me by my IP, because I do this over a very long period of time and I think it looks like normal traffic.
Are there any other ways that websites can see that I am web scraping or generally running a script?
Many thanks
(P.S.: I know that a similar question has already been asked, but the answer was simply that the asker didn't behave like a human and visited the website too quickly. It's different for me ...)
When you scrape, make sure that you respect the robots.txt file, which is located at the root of the website. It sets the rules for crawling: which parts of the website should not be scraped and how frequently it can be scraped.
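A small sketch of checking robots.txt rules with Python's standard library follows. The rules are inlined so that it runs offline, and the user agent name and example.com URLs are placeholders; for a real site you would call `parser.set_url(...)` with the site's robots.txt address and then `parser.read()` instead of `parser.parse(...)`.

```python
from urllib.robotparser import RobotFileParser

# Example rules, inlined so the sketch runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

USER_AGENT = "my-scraper"  # hypothetical user agent name

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# True if the rules allow this user agent to fetch the URL.
allowed = parser.can_fetch(USER_AGENT, "https://example.com/products/1")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/data")

# Requested pause between requests in seconds, or None if not specified.
delay = parser.crawl_delay(USER_AGENT)
```

Not every site sets a crawl delay, so a scraper would need its own default pause when `crawl_delay` returns `None`.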
User navigation patterns are monitored by large companies to detect bots and scraping attempts. There are many anti-scraping tools on the market that use AI to monitor various patterns and differentiate between a human and a bot.
Some of the main techniques used to prevent scraping, apart from such software, are:
CAPTCHAs,
honey traps,
UA (user agent) monitoring,
IP monitoring,
JavaScript encryption, etc.
There are many more, so what I am saying is that yes, it can be detected.
One way they can tell is from your browser headers
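To illustrate the general point about headers, the sketch below uses the standard library's urllib rather than Selenium (a Selenium-driven browser sends that browser's own headers, so this is only an analogy). It shows that a plain Python client announces itself in the User-Agent header unless you set one explicitly. The URL and the custom header value are placeholders.

```python
import urllib.request

# The default opener advertises itself in the User-Agent header.
default_headers = dict(urllib.request.build_opener().addheaders)
default_user_agent = default_headers["User-agent"]  # "Python-urllib/<version>"

# A request can carry an explicit header instead (placeholder value).
request = urllib.request.Request(
    "https://example.com/",
    headers={"User-Agent": "my-scraper/0.1 (contact: me@example.com)"},
)
sent_user_agent = request.get_header("User-agent")
```

Nothing is sent over the network here; the snippet only inspects what would be sent.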

Web scraping CNN data

I have a question: does CNN permit you to scrape data if it's for your own personal use? For instance, if I wanted to write a quick program that would scrape the price of a certain stock, could I scrape CNN Money?
I've just started learning Python, so I apologize if this is a stupid question.
Obligatory: I am not a lawyer.
In CNN's terms of use page it states that
You may not modify, publish, transmit, participate in the transfer or sale, create derivative works, or in any way exploit, any of the content, in whole or in part.
You may download copyrighted material for your personal use only
So it looks like if you do it for personal use only and don't share any of the results of the work, you would be fine.
However, some sites can block scrapers automatically if they issue too many requests, so be sure to rate-limit your scraping and don't request too many pages.
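One simple way to rate-limit is to pause between consecutive requests. The sketch below assumes you supply your own `fetch` function (whatever you use to download a page); the function name, the five-second default, and the injectable `sleep` parameter are illustrative choices, not a recommendation from CNN or any site.

```python
import time


def fetch_all(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results
```

Passing `sleep` as a parameter makes the pause easy to replace in tests, so the function can be checked without actually waiting.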

Building comprehensive scraping program/database for real estate websites

I have a project I’m exploring where I want to scrape the real estate broker websites in my country (30-40 websites of listings) and keep the information about each property in a database.
I have experimented a bit with scraping in python using both BeautifulSoup and Scrapy.
What I would ideally like to achieve is a daily updated database that will find new properties and remove properties when they are sold.
Any pointers as to how to achieve this?
I am relatively new to programming and open to learning different languages and resources if python isn’t suitable.
Sorry if this forum isn’t intended for this kind of vague question :-)
Build a scraper and schedule a daily run. You can use Scrapy, and each daily run will bring the database up to date.
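The "find new properties and remove sold ones" step can be sketched as a set comparison between what is in the database and what today's scrape returned. The table layout, the use of SQLite, and the idea that each listing has a stable id are assumptions for illustration; note also that a listing disappearing from a site does not necessarily mean it was sold.

```python
import sqlite3


def sync_listings(conn, scraped):
    """Make the listings table match today's scrape.

    `scraped` maps a listing id to its price.
    Returns the sets of ids that were (added, removed).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings (id TEXT PRIMARY KEY, price REAL)"
    )
    known = {row[0] for row in conn.execute("SELECT id FROM listings")}
    current = set(scraped)

    added = current - known    # on the sites now, not yet in the database
    removed = known - current  # in the database, no longer on the sites

    # Insert new listings and refresh the price of existing ones.
    conn.executemany(
        "INSERT OR REPLACE INTO listings (id, price) VALUES (?, ?)",
        list(scraped.items()),
    )
    conn.executemany(
        "DELETE FROM listings WHERE id = ?", [(i,) for i in removed]
    )
    conn.commit()
    return added, removed
```

A daily scheduled job would run the scraper, build the `scraped` mapping from its output, and call `sync_listings` once.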

To redirect a twitter page through the Chinese firewall

My younger brother, who still lives in China, is a fan of Michael Phelps. He wants to see his Twitter posts. Since they can't access Twitter behind the GFW and setting up a VPN is too hard for my mom, I want to write something that grabs the tweets and sends them to my mom's email.
I use Python as my main language. I am familiar with tweepy / request / scrapy.
I have tried or thought about three ways of doing this:
Use the Twitter API and grab the user_timeline. However, this method loses all graphical data and returns a bunch of useless links that are only visible after proper rendering.
Scrape the web page and save the HTML content, then send the HTML file as an attachment. However, this method still loses some graphical content and is not that user-friendly for someone in her 40s. In addition, it will be kind of hard to tell how many tweets I have scraped and whether there are any updates.
Wrap the HTML content in the email and use HTML rendering within the email. I haven't worked with this before, so I am not exactly sure how it's going to work out.
I am aware that "what's the best way to do this" kinds of questions are always downvoted on SO, but I do believe this problem is specific enough to invite meaningful Q&A. Any suggestions will be appreciated.
Have you thought of using Selenium and taking screenshots of the browser window? Taking a screenshot with Selenium is as easy as
from selenium import webdriver
browser = webdriver.Firefox()  # assumes Firefox; any WebDriver works
browser.get('https://twitter.com')  # the URL needs a scheme
browser.get_screenshot_as_file('twitter_screenshot.png')
You'd have to figure out a way to automate both watching for new tweets and running the Selenium script when a new tweet is found. However, in terms of preserving graphical content, taking screenshots w/ Selenium would be simple to implement.
Docs: http://selenium-python.readthedocs.io/api.html#selenium.webdriver.remote.webdriver.WebDriver.get_screenshot_as_file
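Once a screenshot file exists, attaching it to an email can be done with the standard library. This is only a sketch of the message-building step: the subject line, function name, and addresses are placeholders, and actually sending the message (for example with `smtplib`) depends on your own mail server settings.

```python
from email.message import EmailMessage


def build_screenshot_email(png_bytes, sender, recipient):
    """Return an EmailMessage with the screenshot attached as a PNG."""
    msg = EmailMessage()
    msg["Subject"] = "New tweet screenshot"  # placeholder subject
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("The latest screenshot is attached.")
    msg.add_attachment(
        png_bytes,
        maintype="image",
        subtype="png",
        filename="twitter_screenshot.png",
    )
    return msg
```

The bytes would come from reading the saved file, for example `open('twitter_screenshot.png', 'rb').read()`.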
