I am a python newbie. I am currently doing basic web-scraping. On browsing through several GitHub projects, I found one that lets the user download an srt file.
Here's my question. Suppose the URL is like this:
http://www.opensubtitles.org/en/subtitles/6528547/silicon-valley-the-lady-bs
How do I get the "random hash" value 6528547? On a side note, I'd appreciate tips on how to get started working with APIs.
Assuming that you have the URL and just want to get the "hash" (which is really a numeric ID), the easiest way is to split the URL on '/' and take the 5th element of the resulting list:
url = "http://www.opensubtitles.org/en/subtitles/6528547/silicon-valley-the-lady-bs"
sub_id = url.split('/')[5]  # '6528547' -- avoid the name 'hash', which shadows the built-in
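If the URL structure might vary, a slightly more robust sketch (standard library only, using the URL from the question) parses out just the path component first:

```python
from urllib.parse import urlparse

url = "http://www.opensubtitles.org/en/subtitles/6528547/silicon-valley-the-lady-bs"

# Work on the path only, so the scheme and host don't affect the indices.
path_parts = urlparse(url).path.strip('/').split('/')
# path_parts == ['en', 'subtitles', '6528547', 'silicon-valley-the-lady-bs']
sub_id = path_parts[2]
print(sub_id)  # 6528547
```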
hope you are all doing well. This question is a bit more random than others I have asked. I am making a bot that extracts every username of the first 600,000,000 accounts on the platform Roblox, and loads it into a list.
This is my problem: I am using requests to get to the account page, but I can't figure out how to extract the username from that page. I have tried using headers and inspect element, but they don't work. If anyone has suggestions on how to complete this, please help. Also, I am extraordinarily bad at network programming, so I may have made a noob mistake somewhere. Code is attached below.
import requests

users = []
for i in range(1, 600000001):
    r = requests.get("https://web.roblox.com/users/{i}/profile".format(i=i))
    print(r.status_code)
    if r.status_code == 404:
        users.append('Deleted')
        continue
    print(r.headers.get('username'))
Before working on the scraping itself, note a couple of things about the code:
First, in the 4th line, the .format call can be simplified: with an empty {} placeholder you don't need the keyword argument, so you can write:
r = requests.get("https://web.roblox.com/users/{}/profile".format(i))
(Your original "{i}".format(i=i) also works; it is just more verbose.)
Later, you may want to remove continue from your code.
But before doing anything, check that the link actually works: copy it, paste it into your browser, replace the {i} placeholder with a real number, and see whether the page loads.
If it does, you can go on with the code; if not, you will have to find another URL that reaches the page you want.
Eventually, to get the elements out of the page, you have to look at r.content (the raw response body).
Before continuing with the coding, print(r.content).
You will see a long blob of HTML or JSON, but don't be afraid of it:
search for the value that interests you and see how it is labelled. If the response is JSON, you can load it and read the value directly:
data = r.json()
username = data['<name_of_the_value>']
(Note that r.content itself is raw bytes, so you cannot index it like a dictionary; if the page is HTML, you will need an HTML parser instead.)
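As a concrete sketch of the HTML case, Python's standard-library html.parser can pull a value out of a page. The markup below is invented for illustration; the real Roblox profile page's structure will differ, so inspect print(r.content) to find the actual tag and attribute names:

```python
from html.parser import HTMLParser

# Hypothetical profile page snippet -- the real page's markup will differ.
html = '<html><body><h1 class="profile-name">SomeUser</h1></body></html>'

class NameExtractor(HTMLParser):
    """Collects the text of any tag whose class attribute is 'profile-name'."""
    def __init__(self):
        super().__init__()
        self.capture = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get('class') == 'profile-name':
            self.capture = True

    def handle_endtag(self, tag):
        self.capture = False

    def handle_data(self, data):
        if self.capture:
            self.names.append(data)

parser = NameExtractor()
parser.feed(html)
print(parser.names)  # ['SomeUser']
```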
I am trying to access hover text found on graph points at this site (bottom):
http://matchhistory.na.leagueoflegends.com/en/#match-details/TRLH1/1002200043?gameHash=b98e62c1bcc887e4&tab=overview
I have the full site html but I am unable to find the values displayed in the hover text. All that can be seen when inspecting a point are x and y values that are transformed versions of these values. The mapping can be determined with manual input taken from the hovertext but this defeats the purpose of looking at the html. Additionally, the mapping changes with each match history so it is not feasible to do this for a large number of games.
Is there any way around this?
thank you
Explanation
Nearly everything on this webpage is loaded as JSON through JavaScript. We don't even have to request the original page. You will, however, have to piece the page back together via the ids of items, masteries, etc., which won't be too hard, because you can request masteries similarly to how we fetch items below.
So, I went through the network tab in inspect and I noticed that it loaded the following JSON formatted URL:
https://acs.leagueoflegends.com/v1/stats/game/TRLH1/1002200043?gameHash=b98e62c1bcc887e4
If you notice, there is a gameHash and the id (similar to that of the link you just sent me). This page contains everything you need to rebuild it, given that you fetch all reliant JSON files.
Dealing with JSON
You can use json.loads in Python to load it, but a great tool I would recommend is:
https://jsonformatter.curiousconcept.com/
You copy and paste JSON in there and it will help you understand the data structure.
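A minimal sketch of the json.loads step (the fragment below is made up; the real match-history payload from the stats URL is far larger):

```python
import json

# Invented sample -- the real JSON has many more fields.
raw = '{"gameHash": "b98e62c1bcc887e4", "participants": [{"championId": 1}]}'
data = json.loads(raw)

print(data["gameHash"])                       # b98e62c1bcc887e4
print(data["participants"][0]["championId"])  # 1
```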
Fetching items
The webpage loads all this information via a JSON file:
https://ddragon.leagueoflegends.com/cdn/7.10.1/data/en_US/item.json
It contains all of the information and tooltips about each item in the game. You can access your desired item via theirJson['data']['1001']. The file name of each item image on the page is the item's id (1001 in this example).
For instance, for 'Boots of Speed':
import requests, json
itemJson = json.loads(requests.get('https://ddragon.leagueoflegends.com/cdn/7.10.1/data/en_US/item.json').text)
print(itemJson['data']['1001'])
An alternative: Selenium
Selenium could be used for this; you should look it up. It has been ported to several programming languages, one being Python. It may work as you want it to here, but I sincerely think that the JSON method (described above), although a little more convoluted, will perform faster (since speed, based on your post, seems to be an important factor).
My goal is to write a small script that finds all the results of a Google search, but in "raw" form.
I don't speak English very well, so let me give an example of what I would like:
I type: elephant
The script returns:
www.elephant.com
www.bluelephant.com
www.ebay.com/elephant
.....
I was thinking about urllib.request, but its return value is not usable for that!
I found some tutorials, but none adapted to what I want!
As I said, my goal is to get a .txt file as output which contains all the websites that match my query!
Thanks all
One simple way is to make a request to Google search and then parse the HTML result. You can use a Python library such as Beautiful Soup to parse the HTML content easily and extract the URL links you need.
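As a sketch of the parsing step (standard library only here; Beautiful Soup would make it shorter, and note that the real Google results markup is more complex than this invented sample):

```python
from html.parser import HTMLParser

# Invented stand-in for a search results page.
html = '''
<div class="result"><a href="http://www.elephant.com">Elephant</a></div>
<div class="result"><a href="http://www.bluelephant.com">Blue Elephant</a></div>
'''

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)

parser = LinkExtractor()
parser.feed(html)

# Write the links to a text file, one per line, as the question asks.
with open('results.txt', 'w') as f:
    f.write('\n'.join(parser.links))

print(parser.links)
```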
These seem to change often, so hopefully this answer remains useful for a little while...
First, you'll need to create a Google Custom Search, by visiting their site or following the instructions provided here https://developers.google.com/custom-search/docs/tutorial/creatingcse.
This will provide you with both
Custom Search Engine ID
API Key
credentials which are needed to use the service.
In your python script, you'll want to import the following package:
from googleapiclient.discovery import build
which will enable you to create a build object:
service = build("customsearch", developerKey=my_api_key)
According to the docs, this constructs a resource for interacting with the API.
When you want to return search results, call execute() on service's cse().list() method:
res = service.cse().list(q=my_search_keyword, cx=my_cse_id, **kwargs).execute()
to return the search results, where the response res is a dictionary and the list of results lives under its "items" key. The i'th result's URL can be accessed with the "link" key:
ith_result = res['items'][i]['link']
Note that you can only return 10 results in a single call, so make use of the start keyword argument in .list(), and consider embedding this call in a loop to generate several links at a time.
You should be able to find plenty of SO answers about saving your search results to a text file.
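To sketch the pagination loop, here `fetch` is a self-contained stand-in for `service.cse().list(q=..., cx=..., start=start).execute()`, faked to return 25 "hits" in pages of 10; the real call accepts the same `start` keyword:

```python
def fetch(query, start):
    """Stand-in for service.cse().list(q=query, cx=cse_id, start=start).execute().
    Returns at most 10 fake results per page, out of 25 total 'hits'."""
    total = 25
    items = [{'link': 'http://example.com/%d' % n}
             for n in range(start, min(start + 10, total + 1))]
    return {'items': items}

links = []
for start in range(1, 31, 10):   # result indices start at 1, 11, 21, ...
    res = fetch('elephant', start)
    items = res.get('items', [])
    if not items:
        break
    links.extend(item['link'] for item in items)

print(len(links))  # 25
```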
N.B. One more thing that confused me at first - presumably you'll want to search the entire web, and not just a single site. But when creating your CSE you will be asked to specify a single site, or list of sites, to search. Don't worry, just enter any old thing, you can delete it later. Even Google endorses this hack:
Convert a search engine to search the entire web: On the Custom Search home page, click the search engine you want. Click Setup, and then click the Basics tab. Select Search the entire web but emphasize included sites. In the Sites to search section, delete the site you entered during the initial setup process.
I'll just add 2 points to 9th Dimension's answer:
Use this guide to find your Custom Search Engine ID.
A small modification is needed in the second line of the code: the API version should be added as an argument, as follows:
service = build('customsearch', 'v1', developerKey=my_api_key)
You have 2 options - use the API, or make a request the way a browser does and then parse the HTML.
The first option is rather tricky to set up and is limited - 100 free queries/day, then $5 per 1000.
The second option is easier, but it violates Google's ToS.
tl;dr: I am using the Amazon Product Advertising API with Python. How can I do a keyword search for a book and get XML results that contain TITLE, ISBN, and PRICE for each entry?
Verbose version:
I am working in Python on a web site that allows the user to search for textbooks from different sites such as eBay and Amazon. Basically, I need to obtain simple information such as titles, ISBNs, and prices for each item from a set of search results from one of those sites. Then, I can store and format that information as needed in my application (e.g., displaying HTML).
In eBay's case, getting the info I needed wasn't too hard. I used urllib2 to make a request based on a sample I found. All I needed was a special security key to add to the URL:
def ebaySearch(keywords): #keywords is a list of strings, e.g. ['moby', 'dick']
    #findItemsAdvanced allows category filter -- 267 is books
    #Of course, I replaced my security appname in the example below
    url = "http://svcs.ebay.com/services/search/FindingService/v1?OPERATION-NAME=findItemsAdvanced&SERVICE-NAME=FindingService&SERVICE-VERSION=1.0.0&SECURITY-APPNAME=[MY-APPNAME]&RESPONSE-DATA-FORMAT=XML&REST-PAYLOAD&categoryId=267&keywords="

    #Complete the url...
    numKeywords = len(keywords)
    for k in range(0, numKeywords-1):
        url += keywords[k]
        url += "%20"

    #There should not be %20 after last keyword
    url += keywords[numKeywords-1]

    request = urllib2.Request(url)
    response = urllib2.urlopen(request) #file like thing (due to library conversion)
    xml_response = response.read()
    ...
...Then I parsed this with minidom.
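As a sketch of that minidom step (the XML below is an invented stand-in for an eBay findItemsAdvanced response, trimmed to titles only):

```python
from xml.dom import minidom

# Invented sample response -- the real eBay XML has many more fields.
xml_response = '''<findItemsAdvancedResponse>
  <searchResult>
    <item><title>Moby Dick</title></item>
    <item><title>Moby Dick (Annotated)</title></item>
  </searchResult>
</findItemsAdvancedResponse>'''

dom = minidom.parseString(xml_response)
# Each <title> element holds a single text node with the book title.
titles = [node.firstChild.data for node in dom.getElementsByTagName('title')]
print(titles)  # ['Moby Dick', 'Moby Dick (Annotated)']
```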
In Amazon's case, it doesn't seem to be so easy. I thought I would start out by just looking for an easy wrapper. But their developer site doesn't seem to provide a python wrapper for what I am interested in (the Product Advertising API). One that I have tried, python-amazon-product-api 0.2.5 from https://pypi.python.org/pypi/python-amazon-product-api/, has been giving me some installation issues that may not be worth the time to look into (but maybe I'm just exasperated..). I also looked around and found pyaws and pyecs, but these seem to use deprecated authentication mechanisms.
I then figured I would just try to construct the URLs from scratch as I did for eBay. But Amazon requires a time stamp in the URLs, which I suppose I could programmatically construct (perhaps something like these folks, who go the whole 9 yards with the signature: https://forums.aws.amazon.com/thread.jspa?threadID=10048).
Even if that worked (which I doubt will happen, given the amount of frustration the logistics have given so far), the bottom line is that I want name, price, and ISBN for the books that I search for. I was able to generate a sample URL with the tutorial on the API website, and then see the XML result, which indeed contained titles and ISBNs. But no prices! Gah! After some desperate Google searching, a slight modification to the URL (adding &ResponseGroup=Offers and &MerchantID=All) did the trick, but then there were no titles. (I guess yet another question I would have, then, is where can I find an index of the possible ResponseGroup parameters?)
Overall, as you can see, I really just don't have a solid methodology for this. Is the construct-a-url approach a decent way to go, or will it be more trouble than it is worth? Perhaps the tl;dr at the top is a better representation of the overall question.
Another way could be amazon-simple-product-api:
from amazon.api import AmazonAPI

amazon = AmazonAPI(ACCESS_KEY, SECRET, ASSOC)
results = amazon.search(Keywords="book name", SearchIndex="Books")
for item in results:
    print item.title, item.isbn, item.price_and_currency
To install, just clone from github and run
sudo python setup.py install
Hope this helps!
If you have installation issues with python-amazon-product-api, send details to mailing list and you will be helped.
I'm currently looking to put together a quick script using Python 2.x to try and obtain the MD5 hash value of a number of images and movies on specific websites. I have noted on the w3.org website that the HTTP/1.1 protocol does offer an option within the content field to access the MD5 value but I'm wondering if this has to be set by the website admin? My script is as below:-
import httplib
c = httplib.HTTPConnection("www.samplesite.com")
c.request("HEAD", "/sampleimage.jpg")
r = c.getresponse()
res = r.getheaders()
print res
I have a feeling I need to edit 'HEAD' or possibly r.getheaders but I'm just not sure what to replace them with.
Any suggestions? As said, I'm just looking to point at an image and to then capture the MD5 hash value of the said image / movie. Ideally I don't want to have to download the image / movie to save bandwidth hence why I'm trying to do it this way.
Thanks in advance
Yes, it's rare that servers will actually respond to requests with an MD5 header. You can check for that, but in most cases, you'll actually need to download the video or image, unfortunately.
(At least hashlib is simple!)
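A sketch of both paths with hashlib: check for a Content-MD5 header first and hash the downloaded bytes otherwise. The header value and body here are made up, and the question's Python 2 httplib calls are replaced by plain variables to keep the example self-contained:

```python
import base64
import hashlib

# Made-up response: in practice these would come from your HTTP library,
# e.g. r.getheader('Content-MD5') and r.read().
headers = {'Content-MD5': None}   # most servers omit this header
body = b'fake image bytes'

advertised = headers.get('Content-MD5')
if advertised is not None:
    # Content-MD5 is the base64-encoded MD5 digest of the body (RFC 1864).
    digest = base64.b64decode(advertised)
else:
    # Fallback: download the body and hash it ourselves.
    digest = hashlib.md5(body).digest()

print(digest.hex())
```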