It's been a month or so of searching the web for web scraping with Python. I have found BeautifulSoup and lots of other scraping tools such as Scrapy, and they all do broadly the same thing, with only small differences between them.
Most tutorials I watch or read are the same too.
Okay, what I am trying to do here is the following: instead of hard-coding the URL I want to scrape into the code, I want the USER to input the URL, and the scraper then scrapes whatever URL the user has pasted into the HTML field.
All the tutorials have code like this:
url = (http://......)
Instead, I want it somehow like this:
url = (USER INPUT)
Example video:
Link scraper
Funnily enough, he never actually does this in his tutorials, I think?
But yes, that is what I am trying to do. If you have any tutorial or documentation on doing this, please help me out!
Thank you!
If you're using Django, set up a form with a text input field for the URL on your HTML page. On submission, that URL will appear in the POST variables if you've set things up correctly. Then in your back end, where you handle the POSTed data, grab the user's input URL.
See https://tutorial.djangogirls.org/en/django_forms/ if you don't know how to set up a form.
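For example, the form itself could be as small as this (a minimal sketch; UrlForm and the url field name are placeholders, not anything from the question):
from django import forms

class UrlForm(forms.Form):
    # URLField makes Django validate that the user really pasted a URL
    url = forms.URLField(label="URL to scrape")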
In your view:
import requests
from bs4 import BeautifulSoup
Create a form through which the user will post the URL to scrape. Then, in the specific view function:
url = form.cleaned_data.get('name of the input field')
data = requests.get(url)
and then do whatever you need to do with your scraped data.
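Putting the two answers together, a rough sketch of the whole view might look like this (scrape_view, the scrape.html template and the title extraction are only examples of the idea, not anything from the question; UrlForm is the form sketched above):
import requests
from bs4 import BeautifulSoup
from django.shortcuts import render

from .forms import UrlForm   # a Form with a single URLField, as sketched above

def scrape_view(request):
    title = None
    if request.method == "POST":
        form = UrlForm(request.POST)
        if form.is_valid():
            url = form.cleaned_data["url"]        # the URL the user pasted in
            page = requests.get(url, timeout=10)  # fetch that page
            soup = BeautifulSoup(page.text, "html.parser")
            title = soup.title.string if soup.title else None
    else:
        form = UrlForm()
    return render(request, "scrape.html", {"form": form, "title": title})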
I don't understand: I'm trying to take data from the Steam Community Market for the price of a skin, and where the price should be I just get an empty result. Please help.
from bs4 import BeautifulSoup as bs44
import requests
url = "https://steamcommunity.com/market/listings/730/AWP%20%7C%20Hyper%20Beast%20%28Well-Worn%29"
info = requests.get(url)
soup = bs44(info.content, "html.parser")
name = soup.find(id='market_buyorder_info').find(id='market_commodity_buyrequests')
print(name)
This particular website is a real-time web app. When you open the page, JavaScript fires up in the background, keeps requesting sale details every few seconds, and updates the page.
If you open your browser's developer tools (usually the F12 key) and click the Network tab, you'll see these requests being made
to the URL: https://steamcommunity.com/market/itemordershistogram?country=US&language=english&currency=1&item_nameid=49399562&two_factor=0
If you click on it, you'll see it returns the sale information in JSON format.
All you have to do in your web scraper is request this URL instead of the one you're requesting now. The most important parameter here seems to be item_nameid, which is the ID of the item being sold - you can find it in the HTML of your original URL.
You can use a regex to search the HTML body for it:
re.findall(r"Market_LoadOrderSpread\( (\d+) \)", html)
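Put together, a sketch of that approach could look like this (the histogram URL parameters are copied from the request above; inspect the returned JSON to see which keys you actually need):
import re
import requests

listing_url = ("https://steamcommunity.com/market/listings/730/"
               "AWP%20%7C%20Hyper%20Beast%20%28Well-Worn%29")
html = requests.get(listing_url).text

# the page calls Market_LoadOrderSpread( <item_nameid> ) in its JavaScript
item_nameid = re.findall(r"Market_LoadOrderSpread\( (\d+) \)", html)[0]

histogram_url = (
    "https://steamcommunity.com/market/itemordershistogram"
    "?country=US&language=english&currency=1"
    "&item_nameid=" + item_nameid + "&two_factor=0"
)
data = requests.get(histogram_url).json()
print(data)   # pick the buy/sell order fields you need out of this JSON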
My goal is to use Python to scrape nytimes.com and find today's date.
I did some research and here is my code:
from bs4 import BeautifulSoup
import urllib.request, urllib.parse, urllib.error
import requests
link="https://www.nytimes.com/"
response=requests.get(link)
soup=BeautifulSoup(response.text,"html.parser")
time = soup.findAll("span",{"data-testid": "todays-date"})
print(time)
In the NYTimes HTML (screenshot omitted) the date sits in a span with data-testid="todays-date", but after running the code my terminal just shows an empty list - it could not find anything.
I think the element is rendered via JavaScript, so you don't find it when downloading the HTML via requests. For example, if you grab the masthead section from the soup:
masthead = soup.find('section', {'id': 'masthead-bar-one'})
what you get is
<section class="hasLinks css-1oajkic e1csuq9d3" id="masthead-bar-one"><div><div class="css-1jxco98 e1csuq9d0"></div><div class="css-bfvq22 e1csuq9d2"><a class="css-hnzl8o" href="https://www.nytimes.com/section/todayspaper">Today’s Paper</a></div></div><div class="css-103zufb" id="masthead-bar-one-widgets"><div class="css-i1s3vq e1csuq9d1"></div><div class="css-77hcv e1ll57lj2"></div></div><div class="css-9e9ivx"><a class="css-1k0lris" data-testid="login-link" href="https://myaccount.nytimes.com/auth/login?response_type=cookie&client_id=vi"></a></div></section>
There is no sign at all of the element you are looking for. I would suggest you look into the Selenium library for this - it drives a real browser, so you can also scrape data that is generated by JavaScript.
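A minimal Selenium sketch, assuming Selenium 4 and a Chrome driver installed (the CSS selector simply reuses the data-testid from your own code):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("https://www.nytimes.com/")
    # wait until the JS-rendered date element actually exists in the DOM
    date_span = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '[data-testid="todays-date"]'))
    )
    print(date_span.text)
finally:
    driver.quit()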
I need to scrape historical market rates for freight between different origins and destinations. Currently, all I have available are interactive graphs like this one:
Sample Graph
You have to click on the graph to get the numbers to appear (all of them appear at once).
I have some experience with HTML web scraping through the Scrapy library, but I was wondering if something like BeautifulSoup would be capable of handling this type of problem.
To put it shortly - yes, but it depends.
Most JavaScript graphs work by embedding JSON data in <script> tags or by making an AJAX request for it. So the graph data is somewhere in JSON format - you just need to find it.
To find it, first open the page source and Ctrl+F for some key values you see in the graph. In your case, start with £407 - it's very likely sitting in embedded JSON:
<script type="application/ld+json">
{'prices': ['£407',...]}
</script>
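As an illustration of that first approach, this sketch looks for the graph data inside <script type="application/ld+json"> tags (the page URL and the 'prices' key are placeholders matching the example above; your page may use a different tag or key):
import json
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/rates-page").text   # placeholder URL
soup = BeautifulSoup(html, "html.parser")

for script in soup.find_all("script", type="application/ld+json"):
    try:
        data = json.loads(script.string or "")
    except json.JSONDecodeError:
        continue
    if "prices" in data:            # hypothetical key, as in the snippet above
        print(data["prices"])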
Alternatively, the data could be retrieved via an AJAX request. For example, take this craft.co case: when you load the https://craft.co/netflix page, it makes an AJAX request for the graph data, which you can see in the browser's Network tab.
I am using Django 1.9 to build a link shortener. I have created a simple HTML page where the user can enter the long URL. I have also coded the methods for shortening this URL. The data is getting stored in the database and I am able to display the shortened URL to the user.
I want to know what I have to do next. What happens when a user visits the shorter URL? Should I use redirects or something else? I am totally clueless about this topic.
Normally, when you provide a URL shortener, a request to the short URL should be redirected to the original URL with a 301 Moved Permanently.
from django.http import HttpResponsePermanentRedirect

def resolve_url(request, url):
    origin_url = resolve(url)  # your own lookup: read the original URL from Redis, the DB, etc.
    return HttpResponsePermanentRedirect(origin_url)  # 301, as described above
EDIT: added the code above using @danny-cullen's hint.
You could just navigate to the URL via HttpResponseRedirect
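For instance, a minimal sketch of that idea, assuming a ShortURL model with slug and long_url fields (both names are placeholders for whatever you actually store):
from django.http import HttpResponseRedirect
from django.shortcuts import get_object_or_404

from .models import ShortURL   # hypothetical model holding slug + long_url

def follow_short_url(request, slug):
    entry = get_object_or_404(ShortURL, slug=slug)    # 404 if the code is unknown
    return HttpResponseRedirect(entry.long_url)       # send the visitor to the real URL

# urls.py (Django 1.9 style):
# url(r'^(?P<slug>\w+)/$', views.follow_short_url, name='follow_short_url'),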
Instead of writing the same code in every view, write a middleware: if the shortened URL is in the model where you stored it, redirect it to the long URL using HttpResponseRedirect.
from django.http import HttpResponseRedirect

class RedirectMiddleware(object):
    def process_request(self, request):
        # Get the current path from the request, filter your model with it,
        # and redirect to the stored long URL (ShortURL/long_url are example names).
        entry = ShortURL.objects.filter(slug=request.path.lstrip('/')).first()
        if entry is not None:
            return HttpResponseRedirect(entry.long_url)
        return None  # not a short URL, let normal URL resolution continue
If I want to give input to a website through a Python program and display the result in the terminal after the online computation, how can I do it using a Python wrapper? As I am new to Python, can anyone suggest a tutorial for this?
It all depends on the website you want to retrieve your result from and on how it accepts input. For example, if the webpage accepts GET or POST requests, you can send it an HTTP request and print the response to the terminal.
If, on the other hand, the website accepts input via a submit form, you would have to find the URL the form submits to and send your data to that page.
There is a Python library called Requests, which you can use to send HTTP requests to a webpage and get the response. I suggest you read its documentation; it has some good examples you can base your idea on. Another option is the built-in urllib2 (urllib.request in Python 3), which would also work for your purposes.
The response to your request is most likely an HTML page, so you may have to scrape your desired content out of it.
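A minimal sketch of the Requests approach described above (the URL and form field name are placeholders for whatever the target site expects):
import requests

payload = {"query": "my input"}   # the site's real field names will differ
response = requests.post("https://example.com/search", data=payload)

print(response.status_code)       # 200 if the request succeeded
print(response.text)              # the returned HTML, printed to the terminal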