I'm scraping AliExpress products such as this one using Python. Each product has multiple variations, each with its own price. When one is clicked, the price updates to reflect the choice.
Similarly, there are multiple buttons for choosing where you want the item shipped from, which updates the shipping cost accordingly.
I want to scrape each variation's price as shipped from each country. How can I do that without simulating clicks to change the prices? Where is the underlying logic that governs these price changes? I couldn't find it when inspecting elements. Is it easily decipherable?
Or do I just need to give up and simulate clicks? If so, would that be done with Selenium? The reason I would prefer to extract it without clicking is that, for products such as the one I linked to, there are 49 variations and 5 places the product ships from, so it would be a lot of clicking and a rather inelegant approach.
Thanks a lot!
Take a look in the browser; all the data is in the DOM.
Type window.runParams.data.skuModule.skuPriceList in your console and you will see it.
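If you would rather do this from Python than the browser console, here is a minimal sketch of the same idea: fetch the page and pull the runParams object out of the inline script that assigns it. The URL, headers, and regex below are assumptions about how the page is currently served; if the assigned object is not strict JSON, a JS-literal parser such as chompjs may be needed in place of json.loads.

import json
import re
import requests

# Hypothetical product URL; AliExpress may refuse requests without a
# browser-like User-Agent, so one is set here.
url = "https://www.aliexpress.com/item/xxxxxxxx.html"
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text

# window.runParams is assigned inline in a <script> tag; pull the object
# out with a regex (assumption: the assignment still looks like
# `window.runParams = {...};` in the page source).
match = re.search(r"window\.runParams\s*=\s*(\{.*?\});", html, re.DOTALL)
if match:
    run_params = json.loads(match.group(1))
    # Same path the answer above reads in the browser console. The field
    # names inside each entry vary, so just dump each one to inspect it.
    for sku in run_params["data"]["skuModule"]["skuPriceList"]:
        print(sku)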
I know that e-commerce companies apply this kind of logic in their backend APIs, and they protect those APIs from ordinary users. Some use Consul, a service-discovery tool, to resolve which backend instance receives the requests coming from the front end.
Now, coming to your question, there can be two cases.
First case: the frontend receives the data from the backend and applies its own logic. In that case the frontend has already received all the data about the variants and their prices, stores it in some data structure on its end, and only updates the view when you click an item. (You can tell this is the case if there is no delay after clicking and the result shows instantly.) You can inspect the response fetched from the backend; it is bound to contain all the data the frontend receives and stores. Check Chrome DevTools -> Network and filter (e.g. by gql).
Second case: the frontend fetches data from the backend each time you click. In that case it changes some parameters on the request URL. If you can work out the logic behind how the parameters change between similar variants, you may be able to fetch the information directly. (There will be a delay in showing results after clicking.)
I think it's a good idea to use Selenium or Cypress, as in the sketch below. I know it will take time, but it's the best option you've got.
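For completeness, a minimal Selenium sketch of that click-through approach; every CSS selector here is a placeholder to replace with whatever the page actually uses, and the URL is hypothetical.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://www.aliexpress.com/item/xxxxxxxx.html")  # hypothetical URL
wait = WebDriverWait(driver, 10)

# Placeholder selectors: inspect the real page and substitute the
# classes AliExpress currently renders for variations and ship-from.
variations = driver.find_elements(By.CSS_SELECTOR, ".sku-property-item")
origins = driver.find_elements(By.CSS_SELECTOR, ".ship-from-item")

for variation in variations:
    variation.click()
    for origin in origins:
        origin.click()
        # Read the price once it is visible; a production script would
        # instead wait for the text to actually change after the click.
        price = wait.until(EC.visibility_of_element_located(
            (By.CSS_SELECTOR, ".product-price-value")))
        print(variation.text, origin.text, price.text)

In practice you may also need to re-locate the buttons after each click to avoid stale-element errors.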
Related
I'm working on a project trying to autonomously monitor item prices on an Angular website.
Here's what a link to a particular item would look like:
https://www.<site-name>.com/categories/<sub-category>/products?prodNum=9999999
Using Selenium (in Python) on a page with product listings, I can get some useful information about the items, but what I really want is the prodNum parameter.
The onClick attribute for the items is clickOnItem(item, $index).
I do have some information for the items, including the presumable item and $index values, which are visible within the HTML, but I'm doubtful there is a way of seeing what actually happens inside clickOnItem.
I've tried looking around with dev-tools to find where clickOnItem is defined, but I haven't been successful.
Considering that I don't see any way of getting prodNum without clicking, I'm wondering: is there a way I could simulate a click to see where it would redirect to, but without actually loading the link? Loading each one would take far too much time to do for every item.
Note: I want the specific prodNum so that I can hit the item page directly without first going through the main listing page.
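One long shot I'm considering, in case the site is AngularJS 1.x with debug info enabled: the scope bound to each repeated element can sometimes be read directly, which might expose the item object without any click. The selector and property name below are guesses to adapt, not the site's real markup.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.<site-name>.com/categories/<sub-category>/products")

# If AngularJS debug info is enabled, angular.element(el).scope() returns
# the scope of that element, and the `item` bound to
# clickOnItem(item, $index) can be read without clicking.
element = driver.find_element(By.CSS_SELECTOR, ".product-listing-item")
item = driver.execute_script(
    "return angular.element(arguments[0]).scope().item;", element)
print(item)  # hopefully contains prodNum, or something that maps to it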
I'm trying to get some specific data from a website, but it's a little complicated to understand, so here are some images.
So, first, I'm on this page:
[Image 1]
Then I click on the icon in the middle and a window pops up:
[popup]
Then I have to click on this:
[almost there]
And finally I land here:
[arrival]
And I want to get all the names of the people shown here.
So, my question is: is there a way to get this list directly with requests?
If yes, how do I do it? I can't find the URL of this kind of pop-up, and I'm a complete beginner with requests and this kind of thing.
(To get the names, I have to be logged in to my account, by the way.)
So, since I don't know how to access the pop-up window, this is the only code I have:
import requests

# Fetch the profile page; the pop-up content never appears in this
# initial HTML, since it is loaded later by JavaScript.
x = requests.get('https://www.tiktok.com/#programm___r?lang=en',
                 headers={'User-Agent': 'test'})
print(x.text)
I checked what it prints, and I didn't see any sign of the pop-up window.
You can get a network interception tool like Burp Suite and watch the traffic that comes through each time you click each link along the way to your final destination; this should give you an endpoint you may be able to send your request to. This network information should also be available in the browser's dev tools, but I'm not sure. A potential issue here is that tokens and other information usually have to be passed down the chain along the way, which might make scripting something like this too hard.
So aside from that, with browser automation software like Selenium, you could automate the process of getting to that point on the page and pull out the list you want once you're there. I've used Selenium myself and it's really usable and well documented!
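A rough sketch of what that Selenium route could look like; every selector below is a placeholder for whatever the page actually renders, and you would need to be logged in (for example by loading a saved browser profile) before the list is visible.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.tiktok.com/#programm___r?lang=en')
wait = WebDriverWait(driver, 15)

# Placeholder selectors: inspect the page and swap in the real ones
# for the middle icon, the link inside the pop-up, and the name list.
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.middle-icon'))).click()
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.popup-link'))).click()
names = wait.until(EC.presence_of_all_elements_located(
    (By.CSS_SELECTOR, '.user-name')))
print([n.text for n in names])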
GOAL
Extract data from a web page, automatically.
The data is on this page... be careful, it's in French.
MY HARD WAY, manually
I choose the data I want by clicking the desired fields on the left side ('CHOISIR DES INDICATEURS' = choose indicators).
Then I select ('Tableau' = Table) to get a data table.
Then I click ('Action') on the right side, then ('Exporter' = Export).
I choose the format I want (e.g. CSV) and hit ('Exécuter' = Execute) to download the file.
WHAT I TRIED
I tried to automate this process, but it feels like an impossible task for me. I inspected the page's network exchanges to see if there is an underlying server I could send simple JSON requests to. If such an endpoint did turn up in the Network tab, I imagine replaying it would look roughly like the sketch below.
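(Everything in this sketch is a placeholder, not the site's real API; the point is only the shape of the approach: copy the URL, method, and body of the request the Network tab shows when the table loads.)

import requests

# Placeholder endpoint and payload: replace both with the request the
# browser's Network tab shows when the 'Tableau' view loads its data.
url = 'https://<site>/api/table-data'
payload = {'indicateurs': ['...'], 'format': 'json'}

resp = requests.post(url, json=payload)
resp.raise_for_status()  # fail loudly if the endpoint rejects the call
print(resp.json())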
I mainly work with Python and frameworks like BS4 or Scrapy.
I have only a little data to extract, so I can easily do it manually. This question is thus purely for my own knowledge, to see if it is possible to scrape a page like that.
I would appreciate it if you could share your skills!
Thank you,
It is possible. Check this website for details; it walks through scraping a site with a worked example:
https://realpython.com/beautiful-soup-web-scraper-python/#scraping-the-monster-job-site
Let me provide a little background.
An organization I am volunteering for delivers meals to people who are unable to come pick them up during the holidays.
They currently have a SQL Server DB that stores the information of all their clients along with the meal information for each year.
Currently, a Java desktop application connects to the SQL Server DB and supports several functions,
i.e. add a client, add meals, remove clients, print delivery sheets.
I am hoping to use Python Flask to rewrite the application as a web-based application. The one function I am interested in at the moment is the print-delivery-sheets function.
The way this works is that there is a setting for the current year. When you click the 'print deliveries for year' button, it batch-prints a document for each customer onto 8.5" x 11.5" paper. The sheet is split in two, with the exact same information on each side: the customer name, address, number of meals, and so forth.
What I am wondering is how best to set up this template so that I could batch-print it using Python. I was thinking of creating an HTML template for the page, but I am not sure how that would work.
Again, I need to pass every customer for that year into the template and batch-print to 8.5" x 11.5" sheets.
What I am asking is:
1. How could I create a print template that I can pass every customer to?
2. How would I print that template in a batch for every customer?
I was hoping to do this all in Python if possible.
Thank you for your time and help.
If you are already deploying this as a web app, it will probably be easier to design and generate a PDF. You can use an HTML-to-PDF converter, of which there are several on PyPI, and there are plenty of resources online, such as:
How to convert webpage into PDF by using Python
https://www.smallsurething.com/how-to-generate-pdf-reports-with-jinja2-and-pyqt/
Once you have found a way to generate PDFs, you can use them like any other PDF and either have the user download them or print them from the browser (this may require a little bit of JavaScript, but it shouldn't be hard since it's pretty much just a window.open call).
For instance, you can add a button:
<button onclick="getPDF()">Download PDF</button>
which will then call a function getPDF() that you define to locate the PDF's URL and open it:
function getPDF() {
// Find the uri for the pdf by some method
var urlToPdf = getUrlToPdf();
// Open PDF in new window
window.open(urlToPdf, "_blank");
}
Note: since you are using Flask, you can include the URL for the PDF in the page source, even in the JavaScript, using the {{ }} syntax. Then the PDFs are only generated when someone requests that route.
This way you will not have to worry about connecting to a printer yourself at all; just let the browser handle those kinds of tasks.
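On the server side, a minimal sketch of such a route, using WeasyPrint as one of those HTML-to-PDF options; the template name, data-access function, and customer fields below are all hypothetical stand-ins:

from flask import Flask, Response, render_template
from weasyprint import HTML  # one of several HTML-to-PDF converters on PyPI

app = Flask(__name__)

def get_customers_for_year(year):
    # Hypothetical stand-in: replace with the real SQL Server query.
    return [{'name': 'Jane Doe', 'address': '123 Main St', 'meals': 2}]

@app.route('/delivery-sheets/<int:year>')
def delivery_sheets(year):
    # delivery_sheets.html is a Jinja2 template that repeats the
    # half-page layout twice per customer; CSS @page rules inside it
    # would control the 8.5" x 11.5" size and the page breaks.
    html = render_template('delivery_sheets.html',
                           customers=get_customers_for_year(year))
    pdf = HTML(string=html).write_pdf()  # returns the PDF as bytes
    return Response(pdf, mimetype='application/pdf')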
My problem begins when I try to crawl an app store, let's say Google Play.
For every app there are a lot of comments, and I want to crawl them FAST.
But the comment section on Google Play is generated by JavaScript.
Here is a link for example: https://play.google.com/store/apps/details?id=com.gameloft.android.ANMP.GloftAMHM. In that link you can see that, in order to generate more comments, you need to click a button several times (after approximately 5-6 clicks, the page generates more comments by executing JavaScript).
At first I solved this problem using a web driver (Firefox) to simulate a real person clicking the button, which generates comments, and it keeps pressing until all comments are generated.
The problems with this are: 1. it takes too much time; 2. sometimes, after tons of clicks and JS generation, the web browser fails to respond.
What I need is a way to generate all comments per application in a better, faster way; maybe there is some technique, or anything else, that would improve my solution.
I'm using a spider I've created in Scrapy.
All kinds of help will be much appreciated.
One of the reasons they generate/show additional comments on demand is exactly that they do not want someone to crawl them; the other is so the initial page loads faster without them, showing a few more only if someone actually starts reading comments.
Unless they provide an API where you can pull all the comments at once, I do not see another quick way of pulling them, apart from simulating clicks and scrolls (the slow way of doing it).
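That said, the button's clicks usually translate into background requests you can watch in the browser's Network tab and then replay directly, skipping the browser. The endpoint and parameter names below are placeholders for whatever the Network tab actually shows, not a documented Google Play API:

import requests

# Placeholder endpoint/parameters: copy the real ones from the request
# that fires in DevTools -> Network when the 'show more' button is clicked.
url = 'https://play.google.com/store/getreviews'
page = 0
while True:
    resp = requests.post(url, data={
        'id': 'com.gameloft.android.ANMP.GloftAMHM',
        'pageNum': page,  # placeholder parameter name
    })
    if resp.status_code != 200 or not resp.text.strip():
        break  # no more pages, or the endpoint rejected the request
    print(resp.text)
    page += 1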
Are you respecting robots.txt? Why or why not?