(Python) Redirecting URLs when I click on them to a different link

I couldn’t really find any questions similar to mine, but I’m curious whether there’s a way to redirect a URL you click on and make it go to a sub-link or sub-URL of a different website. For example:
If you click on the website URL “chess.com”, it would redirect you to, say, “google.com/ a random sublink “ or “chess.dethgrr45dffrr/google.com”, or something like that. Basically, I want it to load the selected website, but under a different URL than its own. This may seem confusing, so my apologies. I’m wondering whether this could be done either in Python or simply in the web browser. I want to implement this in my script so it stays on one website rather than leaving and going to different websites. It doesn’t have to be Google; it could be a different website. I know this is not the best explanation of what I have in mind. If someone could help me out, that would be great, thanks!

Related

How to scrape a website and all its directories from one link?

Sorry if this is not a valid question; I personally feel it kind of borders on the edge.
Assuming the website involved has given full permission:
How could I download the ENTIRE contents (HTML) of that website using a Python data scraper? By entire contents I mean not only the current page you are on, but any other directory that branches off of that main website. E.g.:
Using the link:
https://www.dogs.com
could I pull info from:
https://www.dogs.com/about-us
and any other directory attached to "https://www.dogs.com/"?
(I have no idea if dogs.com is a real website or not; it's just an example.)
I have already made a scraper that will pull info from a single link (nothing further than that), but I want to improve it so I don't have to supply heaps of links. I understand I could use an API, but if this is possible I would rather do it this way. Cheers!
While there is Scrapy to do it professionally, you can use requests to get the URL data and bs4 (Beautiful Soup) to parse the HTML and look into it; that's also easier for a beginner, I guess.
Whichever way you go, you need to have a starting point; then you just follow the links in the page, and then the links within those pages.
You might need to check whether a URL links to another website or is still within the targeted website. Find the pages one by one and scrape them, as in the sketch below.
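Here is a minimal sketch of that idea using requests and Beautiful Soup. The starting URL is the example from the question, and the simple breadth-first loop is illustrative rather than production-ready (no politeness delays, error handling, or robots.txt checks):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

start_url = "https://www.dogs.com/"            # example site from the question
allowed_netloc = urlparse(start_url).netloc    # stay on this domain only

seen = set()
queue = [start_url]

while queue:
    url = queue.pop(0)
    if url in seen:
        continue
    seen.add(url)

    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    # ... pull whatever info you need out of `soup` here ...

    # Queue every link on the page, but only if it stays on the target site.
    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])
        if urlparse(link).netloc == allowed_netloc:
            queue.append(link)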

Is there an easy and fast way to generate JavaScript?

My problem begins when I try to crawl an app store, let's say Google Play.
For every app there are a lot of comments, and I want to crawl them FAST.
But the comment section in Google Play is generated by JavaScript.
Here is a link for example: https://play.google.com/store/apps/details?id=com.gameloft.android.ANMP.GloftAMHM On that page you can see that in order to generate more comments you need to click on a button several times; after approximately 5-6 clicks, the page generates more comments by executing JavaScript.
At first I solved this problem using a web driver (Firefox) and simulating a real person clicking on the button; it generates comments, and the driver keeps pressing until all comments are generated.
The problems with this are: 1) it takes too much time; 2) sometimes, after tons of clicks and JS generation, the web browser fails to respond.
What I need is a way to generate all comments per application in a better, faster way; maybe there's some kind of technique, or just anything else that would improve my solution.
I'm using a spider I've created in Scrapy.
Any kind of help will be much appreciated.
One of the reasons they generate/show additional comments on demand is exactly that they do not want someone to crawl them... The other is so that the initial page loads faster without them, showing a few more only if someone actually starts reading the comments.
Unless they provide an API where you can pull all the comments at once, I do not see another quick way of pulling them, apart from simulating clicks and scrolls (the slow way of doing it); a sketch of that approach is below.
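A minimal sketch of that click-simulation approach with Selenium; the button selector is hypothetical, and you would need to find the real one with your browser's inspector:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Firefox()
driver.get("https://play.google.com/store/apps/details?id=com.gameloft.android.ANMP.GloftAMHM")

while True:
    try:
        # Hypothetical selector -- inspect the page for the real "show more" button.
        button = driver.find_element(By.CSS_SELECTOR, "button.show-more")
    except NoSuchElementException:
        break                # button gone: all comments are loaded
    button.click()
    time.sleep(1)            # give the JavaScript time to append the next batch

html = driver.page_source    # now contains the generated comments
driver.quit()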
Are you respecting robots.txt? Why or why not?

Change website text with Python

This is my first StackOverflow post so please bear with me.
What I'm trying to accomplish is a simple program written in Python which will change the content of all of a certain HTML tag (e.g. all <h1> or all <p> tags) to something else. This should be done on an existing web page which is currently open in a web browser.
In other words, I want to automate the Inspect Element function in a browser, which would then let me change elements however I wish. I know these changes will only be on my side, but that will serve my larger purpose.
I looked at Beautiful Soup and couldn't find anything in the documentation that would let me change the website as seen in a browser. If someone could point me in the right direction, I would be greatly appreciative!
What you are talking about seems to be much more a job for a browser extension. JavaScript will be much more appropriate, as @brbcoding said. Beautiful Soup is for scraping web pages, not for modifying them on the client side in a browser. To be honest, I don't think you can use Python for that.
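For completeness: what Beautiful Soup can do is rewrite tags in a local copy of the page, which you could then open in your browser; this never touches the live page. A minimal sketch, with an illustrative URL and replacement text:

import os
import webbrowser
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com").text    # illustrative URL
soup = BeautifulSoup(html, "html.parser")

# Replace the text of every <h1> tag in the parsed copy.
for h1 in soup.find_all("h1"):
    h1.string = "something else"

# Save the modified copy and view it; the original site is untouched.
with open("modified.html", "w", encoding="utf-8") as f:
    f.write(str(soup))
webbrowser.open("file://" + os.path.abspath("modified.html"))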

Parsing from a website -- source code does not contain the info I need

I'm a little new to web crawlers and such, though I've been programming for a year already. So please bear with me as I try to explain my problem here.
I'm parsing info from Yahoo! News, and I've managed to get most of what I want, but there's a little portion that has stumped me.
For example: http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html
I want to get the numbers beside the thumbs-up and thumbs-down icons in the comments. When I use "Inspect Element" in my Chrome browser, I can clearly see the things I have to look for - namely, an em tag under the div class 'ugccmt-rate'. However, I'm not able to find this in my Python program. In trying to track down the root of the problem, I viewed the page source, and it seems that this tag is not there. Do you guys know how I should approach this problem? Does this have something to do with the JavaScript on the page displaying the info only after it runs? I'd appreciate some pointers in the right direction.
Thanks.
The page is being generated via JavaScript.
Check if there is a mobile version of the website first. If not, check for any APIs or RSS/Atom feeds. If there's nothing else, you'll either have to manually figure out what the JavaScript is loading and from where, or use Selenium to automate a browser that renders the JavaScript for you for parsing.
Using the Web Console in Firefox you can pretty easily see what requests the page is actually making as it runs its scripts, and figure out what URI returns the data you want. Then you can request that URI directly in your Python script and tease the data out of it. It is probably in a format that Python already has a library to parse, such as JSON.
Yahoo! may have some stuff on their server side to try to prevent you from accessing these data files in a script, such as checking the browser (user-agent header), cookies, or referrer. These can all be faked with enough perseverance, but you should take their existence as a sign that you should tread lightly. (They may also limit the number of requests you can make in a given time period, which is impossible to get around.)
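For example, once the Web Console has revealed the request behind the comment ratings, you could fetch it directly with requests. Everything here - the endpoint, the parameters, and the response shape - is hypothetical and would need to be taken from what the console actually shows:

import requests

# Hypothetical endpoint discovered via the browser's network/web console.
url = "https://news.yahoo.com/_api/comment-ratings"
headers = {
    "User-Agent": "Mozilla/5.0",            # mimic a normal browser
    "Referer": "https://news.yahoo.com/",   # some servers check this
}
params = {"article_id": "225730172"}        # hypothetical parameter

response = requests.get(url, headers=headers, params=params)
data = response.json()    # likely JSON, per the answer above
print(data)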

Using Python mechanize on websites that use DHTML, AJAX, etc.?

So, let's say I'm trying to create something that replies to tweets with a certain "hashtag keyword" on Twitter (for example "#FirstWorldProblems"). I have a script that looks like this:
# apply settings, create a mechanize.Browser, etc.
login() # log into twitter
# at this point we've logged into twitter; now navigate to their search page and run a search query:
br.open('http://twitter.com/search?q=' + hashtag)
print(br.response().read()) # print the response
So, what I have above is sort of an abbreviated version to quickly get to the spot giving me trouble.
I set up a browser, log into twitter, all done no problemo. But, then I run a search for the hashtag (using br.open) and then I print the response.
On Twitter, the "Reply" link only appears when you hover over a specific link, and it leads to "#" (because it opens a little pop-up where you can enter your reply). How would I click on the "Reply" link, given that it doesn't show up in the response?
If your problem is actually just accessing Twitter, dmedvinsky is probably right.
However, if you really want to be able to scrape websites (while allowing their JavaScript to run as it normally would), you'll probably want something a bit more robust.
While it's a lot of baggage, I strongly urge you to grab Qt, PySide, and get familiar with QWebKit. You can drive a 'real' web browser from Python and get all the benefits (and problems;) one would expect. But, so far it's the best and cleanest method I've found to do what you're asking about.
http://qt.nokia.com/
http://www.pyside.org/
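A minimal sketch of that approach with PySide's Qt 4-era QtWebKit bindings: load the page in a real WebKit view, let its JavaScript run, then read back the rendered DOM. The search URL is the one from the question:

import sys
from PySide.QtCore import QUrl
from PySide.QtGui import QApplication
from PySide.QtWebKit import QWebView

app = QApplication(sys.argv)
view = QWebView()

def on_load_finished(ok):
    # The DOM here reflects whatever JavaScript has run so far, so
    # dynamically inserted elements (like the "Reply" pop-up containers)
    # can appear even though they are absent from the raw HTTP response
    # that mechanize sees.
    html = view.page().mainFrame().toHtml()
    print(html)
    app.quit()

view.loadFinished.connect(on_load_finished)
view.load(QUrl("http://twitter.com/search?q=%23FirstWorldProblems"))
app.exec_()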
