Let's say I want to make a website that automatically scrapes specific websites in order to find, e.g., the bike model that my customer has typed in.
Customer: Wants to find this one specific bike model that is really hard to get
Customer: Finds the website www.EXAMPLE.com, which will notify him when there is an auction on, e.g., eBay or Amazon.
Customer: Creates free account, and makes a post.
Website: Runs an automated scrape and keeps looking for this bike on eBay and Amazon.
Website: As soon as a scrape succeeds and finds the bike, the website sends a notification to the customer.
Is that possible to make in Python? And will I be able to make such a website with little knowledge after learning a bit of Python?
Yes, it is possible. You can achieve that by using a package such as Requests for the scraping and Flask to build the website. It does, however, require a bit of knowledge.
Feel free to post a question after diving into the two links
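The workflow described in the question (saved search, repeated scraping, notification) can be sketched without committing to a particular site. This is only a rough outline under assumptions: `SavedSearch`, `matches`, `check_once`, `fetch_listings` and `notify` are hypothetical names, not part of Requests or Flask. In a real project `fetch_listings` might wrap `requests.get()` or a marketplace API, and a Flask view would create the `SavedSearch` objects.

```python
from dataclasses import dataclass, field


@dataclass
class SavedSearch:
    """One customer's post: who to notify and what they are looking for."""
    customer_email: str
    query: str
    seen_urls: set = field(default_factory=set)


def matches(title, query):
    """True if every word of the query appears as a word in the listing title."""
    title_words = title.lower().split()
    return all(word in title_words for word in query.lower().split())


def check_once(searches, fetch_listings, notify):
    """Run one scraping pass over all saved searches.

    fetch_listings(query) -> iterable of (title, url) pairs (hypothetical).
    notify(email, title, url) delivers the alert (hypothetical, e.g. email).
    """
    for search in searches:
        for title, url in fetch_listings(search.query):
            if url not in search.seen_urls and matches(title, search.query):
                search.seen_urls.add(url)  # remember it so we notify only once
                notify(search.customer_email, title, url)


if __name__ == "__main__":
    # Stand-in data instead of a real scrape, just to show the flow.
    def fake_fetch(query):
        return [("Vintage Trek 970 frame", "https://www.example.com/item/1")]

    def print_notify(email, title, url):
        print(f"notify {email}: {title} ({url})")

    check_once([SavedSearch("customer@example.com", "trek 970")], fake_fetch, print_notify)
```

Running `check_once` on a schedule (a cron job or a background worker) would give the "keeps looking" behaviour. The fetching and notifying functions are passed in, so they can be swapped without touching the matching logic.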
Related
I am currently trying to scrape Twitter for NLP research. I have already used snscrape to get tweets with the required filters; the issue is that we need tweets from a specific age range. I guess some profiles on Twitter have their birthdate public, so maybe we can fetch that, perhaps by scraping it from the profile page? Any ideas are welcome.
So far I have tried some web scraping methods but can't find anything concrete.
Twitter has a well-documented API that works very well with Python.
Try writing a simple crawler and look at one of the JSON objects that you get back for a Tweet or User.
You will need to sign up and get some Access Tokens/Keys to use in your script, but other than that you are ready to go:
https://developer.twitter.com/en/docs/twitter-api
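To look at the JSON for a user, a request along these lines could be a starting point. It is a sketch under assumptions: it targets the v2 user-lookup endpoint with bearer-token authentication, the environment variable name `TWITTER_BEARER_TOKEN` and the helper names are made up for illustration, and your access level may restrict what is returned.

```python
import json
import os
import urllib.parse
import urllib.request

API_ROOT = "https://api.twitter.com/2"


def build_user_lookup(username, bearer_token):
    """Build (but do not send) a request for one user's public profile fields."""
    params = urllib.parse.urlencode(
        {"user.fields": "created_at,description,location,public_metrics"}
    )
    url = f"{API_ROOT}/users/by/username/{urllib.parse.quote(username)}?{params}"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {bearer_token}"}
    )


def fetch_user(username, bearer_token):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_user_lookup(username, bearer_token)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    token = os.environ.get("TWITTER_BEARER_TOKEN")  # assumed variable name
    if token:
        print(json.dumps(fetch_user("TwitterDev", token), indent=2))
```

Printing the whole response is a quick way to see which profile fields are actually available before building anything on top of them.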
The age of Twitter users is not made available via the API (and the website may show birthday but not year). There are also a number of other factors you should read about in relation to analysis of Twitter user data.
I use Selenium in Python and I want to scrape a lot of pages from one company's website (many hundreds). This shouldn't burden their system under any circumstances, and because it is a very large website anyway, it shouldn't be a problem for them.
Now my question is whether the company can somehow discover that I'm web scraping if I act like a human. That means I stay on each page for an extra long time and let extra time pass between requests.
I don't think they can recognize me by my IP, because I do this over a very long period of time and I think it looks like normal traffic.
Are there any other ways that websites can see that I am web scraping or generally running a script?
Many Thanks
(P.S.: I know that a similar question has already been asked, but the answer was simply that the asker didn't behave like a human and visited the website too quickly. But it's different for me ...)
When you are scraping, make sure that you respect the robots.txt file, which is located at the root of the website. It sets the rules for crawling: which parts of the website should not be scraped and, where the site specifies it, how frequently it can be crawled.
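Python's standard library can read these rules. The sketch below parses an inline, made-up robots.txt rather than downloading one; the paths, the `my-bot` user agent and the helper name `load_rules` are assumptions for illustration. For a live site, `RobotFileParser.set_url()` followed by `read()` would fetch the real file instead.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, used here instead of fetching a real one.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def load_rules(robots_text):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Check individual URLs before requesting them.
print(rules.can_fetch("my-bot", "https://www.example.com/products"))
print(rules.can_fetch("my-bot", "https://www.example.com/private/page"))

# Crawl-delay is optional, so this can also be None.
print(rules.crawl_delay("my-bot"))
```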
User navigation patterns are monitored by large companies to detect bots and scraping attempts. There are many anti-scraping tools on the market that use AI to monitor various patterns and differentiate between a human and a bot.
Some of the main techniques used to prevent scraping, apart from such software, are:
Captcha,
Honey traps,
User-agent (UA) monitoring,
IP monitoring,
JavaScript encryption, etc.
There are many more, so what I am saying is that yes, it can be detected.
One way they can tell is from your browser headers
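As an illustration of what headers give away: a plain HTTP client identifies itself unless told otherwise. This sketch uses the standard library's `urllib` and sends nothing over the network; the bot name and contact address are made-up values. A browser driven by Selenium sends browser-like headers instead, so this is only one signal among the ones listed above.

```python
import urllib.request

# Headers that urllib's default opener attaches to every request.
default_headers = dict(urllib.request.build_opener().addheaders)
print(default_headers["User-agent"])  # identifies the client as Python-urllib

# A request that identifies itself explicitly (hypothetical values).
req = urllib.request.Request(
    "https://www.example.com/",
    headers={"User-Agent": "my-research-bot/0.1 (contact@example.com)"},
)
print(req.get_header("User-agent"))
```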
I have a question: does CNN permit you to scrape data if it's for your own personal use? For instance, if I wanted to write a quick program that would scrape the price of a certain stock, can I scrape CNN Money?
I've just started learning Python, so I apologize if this is a stupid question.
Obligatory: I am not a lawyer.
In CNN's terms of use page it states that
You may not modify, publish, transmit, participate in the transfer or sale, create derivative works, or in any way exploit, any of the content, in whole or in part.
You may download copyrighted material for your personal use only
So it looks like if you do it for personal use only and don't share any of the results of the work you would be fine.
However, some sites can block scrapers automatically if they issue too many requests, so be sure to rate-limit your scraping, and don't request too many pages.
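Rate-limiting can be as simple as enforcing a minimum gap between requests. A minimal sketch, assuming a fixed interval is acceptable; the `RateLimiter` class is a hypothetical helper, not part of any scraping library.

```python
import time


class RateLimiter:
    """Allow at most one call every `min_interval` seconds."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# Typical use: call wait() before every request.
# limiter = RateLimiter(5.0)
# for url in urls:
#     limiter.wait()
#     fetch(url)   # hypothetical download function
```

The first call returns immediately; later calls sleep only for whatever part of the interval has not already elapsed.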
I'm trying to scrape likes data from public Facebook pages using Python. My scraper uses the post number to scrape the likes data. However, some posts have more than 6000 likes and I can only scrape 6000 of them. I have been told that this is due to a Facebook restriction that doesn't allow scraping more than 6000 per day. How can I continue scraping the likes for a post from the point where the scraper stopped?
I think Facebook may be limiting scraping from the same address to 6000 requests. You can try Scrapy, a package for scraping web pages; it can route requests through proxies, which can be used like an IP pool for this.
In the tags I see facebook-graph-api, which has limitations. Why don't you use requests + lxml? It would be much easier, and since you want to scrape public pages, you don't even have to log in, so it could be solved easily.
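Whichever library does the fetching, continuing from the point where the scraper stopped usually comes down to saving progress between runs. A minimal sketch, assuming likes can be requested by numeric offset; `fetch_page`, the checkpoint file layout and the function names are hypothetical and not tied to any Facebook API.

```python
import json
from pathlib import Path


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text())
    return {"next_offset": 0, "likes": []}


def save_checkpoint(path, state):
    """Write the state to disk so a later run can pick up from it."""
    Path(path).write_text(json.dumps(state))


def scrape_likes(fetch_page, path, max_requests):
    """Fetch up to `max_requests` pages, resuming from the saved offset.

    fetch_page(offset) -> list of likes starting at `offset` (hypothetical);
    an empty list means there is nothing left to fetch.
    """
    state = load_checkpoint(path)
    for _ in range(max_requests):
        page = fetch_page(state["next_offset"])
        if not page:
            break
        state["likes"].extend(page)
        state["next_offset"] += len(page)
        save_checkpoint(path, state)  # persist after every page
    return state
```

Running the same call again the next day continues from `next_offset` instead of starting over.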
I have a task to complete: I need to make a web-crawler kind of application. I need to pass a URL to my application. This URL is the website of a government agency, and it also has links to other individual agencies that are approved by this government agency. I need to follow those links and get some information about each agency from its site. I hope I have made myself clear. I also have to make this application generic, which means I can't hard-code it for just one website (government agency). Given any URL, it should check it, get all the links, and proceed. On some websites these links are in PDFs, and on others they are on a page.
I have to use Python for this, and I don't know how to approach it. I spent time on this using BeautifulSoup, but that requires a lot of parsing. Other options are Scrapy or twill. Honestly, I am new to Python and don't know which one is better for this task. Can anyone help me select the right tool and the right approach to solve this problem? Thanks in advance.
There is plenty of information out there about building web scrapers with Python. Python is a great tool for the job.
There are also tons of posts about web scrapers on this website if you search for them.
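As a starting point for the "get all the links and proceed" step, here is a minimal sketch using only the standard library. The names `LinkCollector`, `extract_links` and `split_by_type` are hypothetical, and the HTML is assumed to be already downloaded. It separates links to PDF files from links to ordinary pages, since the question mentions both; reading links inside the PDFs would need a separate PDF library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Turn relative links into absolute URLs.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return all absolute link URLs found in an HTML document."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def split_by_type(links):
    """Separate links to PDF files from links to ordinary pages."""
    pdfs = [u for u in links if u.lower().split("?")[0].endswith(".pdf")]
    pages = [u for u in links if u not in pdfs]
    return pages, pdfs
```

A generic crawler would call `extract_links` on each fetched page, queue the page links it has not visited yet, and hand the PDF links to a separate step.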