Use the search function on a website with Python Requests (eBay) - python

I'm trying to create a Python program using the Requests library that searches eBay for an item that the user enters. Rather than hard-coding the URL, is it possible to use the Requests library to perform an eBay search (or a search on any website)?

I believe what you want here is to enter text into a search element. According to Real Python:
The requests library is the de facto standard for making HTTP requests in Python.
I would recommend using Selenium to drive the website itself, for example typing text into an input element and pressing a button on the page.
However, if you still want to use requests, then try to find the API endpoint that handles the searching and call it directly, for example with a POST request:
resp = requests.post(url)
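For eBay specifically, the public search page takes the query as a plain GET parameter (the _nkw parameter used in the answers below), so a minimal requests sketch, with a browser-like User-Agent to reduce the chance of being blocked, could look like this:
import requests

# Minimal sketch: pass the user's search term as the _nkw query parameter of eBay's search page.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
params = {"_nkw": input("Search eBay for: ")}

resp = requests.get("https://www.ebay.com/sch/i.html", params=params, headers=headers, timeout=30)
print(resp.status_code, resp.url)  # resp.text holds the result page HTML, ready for parsing
The answers below show both how to parse that HTML and how to use eBay's official Finding API instead.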

I created an eBay developer account to access the API, then wrote a small script that searches eBay for historical pricing on an item. Save it as search.py and call it like this:
./search.py "ebay item you are looking for"
You can change the itemFilter to your liking; currently it is set to sold items only since 10-10-2019. The complete list of filters is here: https://developer.ebay.com/devzone/finding/callref/types/ItemFilterType.html
The comments at the bottom show the complete set of fields returned from eBay; you can pick and choose the fields you like and add them to a print statement.
Also, this script returns more than the first page of items, and each page costs one of your 5,000 developer queries for the day. I am unable to get it to work with the sandbox, no matter what I try; I believe the eBay sandbox is broken.
#!/usr/local/bin/python3
from ebaysdk.finding import Connection
import sys

DEBUG = False

#search_keywords = "2019 Hot Wheels Dumbo"
search_keywords = sys.argv[1]
print("Search Keywords: " + search_keywords)

# Function accepts keywords for the query and the pageNumber of the search to pull
# eBay will only return 100 items per page
def build_request(keywords, pageNumber):
    # Create a request structure
    # Item filter list: https://developer.ebay.com/devzone/finding/callref/types/ItemFilterType.html
    # Note: each filter must be its own dict; merging them into one dict keeps only the last name/value pair.
    request = {
        'keywords': keywords,
        'itemFilter': [
            {'name': 'condition', 'value': 'new'},
            {'name': 'SoldItemsOnly', 'value': True},
            {'name': 'EndTimeFrom', 'value': '2019-10-10T00:00:00.000Z'}
        ],
        'paginationInput': {
            'entriesPerPage': 100,  # eBay limits API calls to 100 items per page
            'pageNumber': pageNumber
        },
        'sortOrder': 'PricePlusShippingLowest',
    }
    return request
# Connect using a yaml file to the EBAY-US production site
# put in __main__ just in case we turn this into a module later
if __name__ == '__main__':
    api = Connection(config_file='ebay.yaml', debug=False, siteid="EBAY-US")
    #api = Connection(config_file='ebay.yaml', debug=False, domain="api.sandbox.ebay.com", siteid="EBAY-US")

    # Run the request
    query = build_request(search_keywords, 1)
    query['paginationInput']['pageNumber'] = 1
    response = api.execute('findCompletedItems', query)
    if DEBUG:
        print(response.dict())  # Use this to see the dictionary structure

    # Display how many entries and results are returned
    print("API Call: findCompletedItems")
    print("----------------------------")
    print(f"totalEntries: {response.reply.paginationOutput.totalEntries}, totalPages: {response.reply.paginationOutput.totalPages}")
    maxpage = int(str(response.reply.paginationOutput.totalPages))

    # Display item information fields from the request, see below for all possible fields
    for item in response.reply.searchResult.item:
        print(f"Date: {item.listingInfo.endTime} Title: {item.title}, Price: {item.sellingStatus.currentPrice.value} Shipping: {item.shippingInfo.shippingServiceCost.value}")

    # Now run the request for each remaining page, updating the page number in the request each time
    for page in range(2, maxpage + 1):
        print("**** PAGE: " + str(page) + " of " + str(maxpage) + " ****")
        query['paginationInput']['pageNumber'] = page
        response = api.execute('findCompletedItems', query)
        # Display item information fields from the request, see below for all possible fields
        for item in response.reply.searchResult.item:
            print(f"Date: {item.listingInfo.endTime} Title: {item.title}, Price: {item.sellingStatus.currentPrice.value} Shipping: {item.shippingInfo.shippingServiceCost.value}")
#{'ack': 'Success', 'version': '1.13.0', 'timestamp': '2019-10-16T01:28:25.891Z',
#
#searchResult': {'item': [{'itemId': '123719989207', 'title': '2019 HOT WHEELS 2 SET CORVETTE STINGRAY SUPER CHROMES 5/5 TREASURE HUNT PAIR', 'globalId': 'EBAY-US', 'primaryCategory': {'categoryId': '180506', 'categoryName': 'Contemporary Manufacture'}, 'galleryURL': 'https://thumbs4.ebaystatic.com/m/mFuyRQgYjSutGli33dqsqcA/140.jpg', 'viewItemURL': 'https://www.ebay.com/itm/2019-HOT-WHEELS-2-SET-CORVETTE-STINGRAY-SUPER-CHROMES-5-5-TREASURE-HUNT-PAIR-/123719989207', 'paymentMethod': 'PayPal', 'autoPay': 'false', 'postalCode': '54650', 'location': 'Onalaska,WI,USA', 'country': 'US', 'shippingInfo': {'shippingServiceCost': {'_currencyId': 'USD', 'value': '6.0'}, 'shippingType': 'Flat', 'shipToLocations': 'Worldwide', 'expeditedShipping': 'false', 'oneDayShippingAvailable': 'false', 'handlingTime': '2'}, 'sellingStatus': {'currentPrice': {'_currencyId': 'USD', 'value': '9.0'}, 'convertedCurrentPrice': {'_currencyId': 'USD', 'value': '9.0'}, 'sellingState': 'Ended'}, 'listingInfo': {'bestOfferEnabled': 'false', 'buyItNowAvailable': 'false', 'startTime': '2019-04-02T22:14:03.000Z', 'endTime': '2019-10-02T18:44:49.000Z', 'listingType': 'StoreInventory', 'gift': 'false', 'watchCount': '2'}, 'returnsAccepted': 'false', 'condition': {'conditionId': '1000', 'conditionDisplayName': 'New'}, 'isMultiVariationListing': 'false', 'topRatedListing': 'false'},
#
#
#{'itemId': '153679182310', 'title': "Hot Wheels 2019 Super Treasure Hunt '68 Mercury Cougar Loose 1/64 STH Green", 'globalId': 'EBAY-US', 'primaryCategory': {'categoryId': '73252', 'categoryName': 'Collections & Lots'}, 'galleryURL': 'https://thumbs3.ebaystatic.com/m/mEN9EsbCJY0wb6WzXjO8hNg/140.jpg', 'viewItemURL': 'https://www.ebay.com/itm/Hot-Wheels-2019-Super-Treasure-Hunt-68-Mercury-Cougar-Loose-1-64-STH-Green-/153679182310', 'paymentMethod': 'PayPal', 'autoPay': 'false', 'location': 'Malaysia', 'country': 'MY', 'shippingInfo': {'shippingServiceCost': {'_currencyId': 'USD', 'value': '9.0'}, 'shippingType': 'Flat', 'shipToLocations': 'Worldwide', 'expeditedShipping': 'false', 'oneDayShippingAvailable': 'false', 'handlingTime': '15'}, 'sellingStatus': {'currentPrice': {'_currencyId': 'USD', 'value': '9.9'}, 'convertedCurrentPrice': {'_currencyId': 'USD', 'value': '9.9'}, 'bidCount': '1', 'sellingState': 'Ended'}, 'listingInfo': {'bestOfferEnabled': 'false', 'buyItNowAvailable': 'false', 'startTime': '2019-10-10T04:13:32.000Z', 'endTime': '2019-10-15T04:13:32.000Z', 'listingType': 'Auction', 'gift': 'false', 'watchCount': '1'}, 'returnsAccepted': 'false', 'condition': {'conditionId': '3000', 'conditionDisplayName': 'Used'}, 'isMultiVariationListing': 'false', 'topRatedListing': 'false'}],
#
#'_count': '100'}, 'paginationOutput': {'pageNumber': '3', 'entriesPerPage': '100', 'totalPages': '40', 'totalEntries': '3966'}}

You can scrape eBay using the BeautifulSoup web scraping library.
Instead of hard-coding the full request URL, you can define a params dict holding the request parameters, with the search query itself taken from user input:
query = input('Your query is: ')

params = {
    '_nkw': query,   # search query
    '_pgn': 1        # page number
    #'LH_Sold': '1'  # shows sold items
}
If you use the requests library directly, the request might be blocked, because the default user-agent in requests is python-requests, so the website can tell that a bot or script is sending the request. Check what your user-agent is.
An additional step, besides providing a browser user-agent, is to rotate the user-agent, for example switching between PC, mobile, and tablet agents, as well as between browsers such as Chrome, Firefox, Safari, and Edge.
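A minimal sketch of that rotation idea, picking a User-Agent at random for each request (the strings below are only illustrative examples; swap in whichever desktop, mobile, and tablet agents you need):
import random
import requests

# Illustrative User-Agent strings only; replace with real desktop/mobile/tablet agents as needed.
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148",
]

headers = {"User-Agent": random.choice(user_agents)}  # a different agent on each run/request
page = requests.get("https://www.ebay.com/sch/i.html", params={"_nkw": "shirt"}, headers=headers, timeout=30)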
Full example:
from bs4 import BeautifulSoup
import requests, json, lxml

# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
}

query = input('Your query is: ')

params = {
    '_nkw': query,   # search query
    '_pgn': 1        # page number
    #'LH_Sold': '1'  # shows sold items
}

data = []

while True:
    page = requests.get('https://www.ebay.com/sch/i.html', params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(page.text, 'lxml')

    print(f"Extracting page: {params['_pgn']}")
    print("-" * 10)

    for products in soup.select(".s-item__info"):
        title = products.select_one(".s-item__title span").text
        price = products.select_one(".s-item__price").text
        link = products.select_one(".s-item__link")["href"]

        data.append({
            "title" : title,
            "price" : price,
            "link" : link
        })

    if soup.select_one(".pagination__next"):
        params['_pgn'] += 1
    else:
        break

print(json.dumps(data, indent=2, ensure_ascii=False))
Example output:
Your query is: shirt # query entry example
Extracting page: 1
----------
[
{
"title": "Men's Polo Shirt 100% Cotton Knockout Jeans NVY WHT 220 Stripe MEDIUM Free Ship",
"price": "$11.99",
"link": "https://www.ebay.com/itm/133992813518?hash=item1f329813ce:g:tWMAAOSwXBxhTP7Q&amdata=enc%3AAQAHAAAAwJ9%2BDbqKGCoZye6JelYY1tJHQWotUalKHQJ%2FixwyplnvOC60SofXkLVsNgRfoX09uOZLerjkBtwcW%2FQQa1wmJ6%2BYVEEagzH1GAK6Bx4rX%2BRNnj9g6SlvB2WagWETpbmrLdiFHGTIRvAL2EvfXDRqPFnEGWZ2nk%2BM0zEkiGzp%2F4ADUbPslGui3zTDJsIgVpXjAHzL2EUH3s7tiOxtd3qVTXxaE095evq5YrBgkJFJu4KB5o%2F%2BCiCURfy7xR%2FbTU7mnQ%3D%3D%7Ctkp%3ABlBMUJavlrOEYQ"
},
{
"title": "5 Pack Oroblu Micromodal Perfect Line Round Neck Short Sleeve T-Shirt",
"price": "$192.00",
"link": "https://www.ebay.com/itm/275287531865?hash=item40186a6159:g:OtUAAOSweKFiZr2S&amdata=enc%3AAQAHAAAAsMRLg1VeYAIKHTiXXdD8xv56DpaeH6jc3EhFP26RJ66bqmlzXHQrMMxuo78x6S2i8DfxvuzjbXrpmYYdyRLhzgQCoaauMNvRwVNuhx11qorNlPoHrig%2BdIGG2RB4xHmXdB2fjOciLCsdYkL23jaH23ehXakQu%2BrBzER%2F2v94Sdg%2BkchjwWmRidsv0kPfLRcpiy%2BOeDBHEas4i9EQY%2F0VAzLGj2U%2FwLdcqjqSjgngj%2BRr%7Ctkp%3ABlBMUJavlrOEYQ"
},
# ...
]
As an alternative, you can use the Ebay Organic Results API from SerpApi. It's a paid API with a free plan that handles blocks and parsing on its backend.
Example code that paginates through all pages with input query:
from serpapi import EbaySearch
import os, json

query = input('Your query is: ')

params = {
    "api_key": os.getenv("API_KEY"),  # serpapi api key
    "engine": "ebay",                 # search engine
    "ebay_domain": "ebay.com",        # ebay domain
    "_nkw": query,                    # search query
    "_pgn": 1                         # page number
    #"LH_Sold": "1"                   # shows sold items
}

search = EbaySearch(params)           # where data extraction happens

page_num = 0
data = []

while True:
    results = search.get_dict()       # JSON -> Python dict

    if "error" in results:
        print(results["error"])
        break

    for organic_result in results.get("organic_results", []):
        link = organic_result.get("link")
        price = organic_result.get("price")

        data.append({
            "price" : price,
            "link" : link
        })

    page_num += 1
    print(page_num)

    if "next" in results.get("pagination", {}):
        params['_pgn'] += 1
    else:
        break

print(json.dumps(data, indent=2))
Output:
[
{
"price": {
"raw": "$25.99",
"extracted": 25.99
},
"link": "https://www.ebay.com/itm/285018595898?hash=item425c6ea23a:g:mT0AAOSwBjljAFsl&amdata=enc%3AAQAHAAAAkI1P1C%2BE2boIutliCMWXCADm%2BXyUp2a6Q1qOjpifaAIo6%2FWD0yHCd8Mejyfc2jc%2BQ5zzVcITrcWM0XxIfiSUILMZFsMewB154skl5re5%2FS8W9kRrabjRdy%2BoC6aQoS%2FWGq%2F6A%2BZWQ1GQkcd5Tstamu%2FgzZKoL6VYfO4YpC4oO4Im23h0wiIfI0%2BxPG8uuFRMPw%3D%3D%7Ctkp%3ABk9SR_i1vbKEYQ"
},
{
"price": {
"raw": "$14.16",
"extracted": 14.16
},
"link": "https://www.ebay.com/itm/234347615312?hash=item369034d450:g:hvYAAOSwNspg0TAH&amdata=enc%3AAQAHAAAA0B1m3DPC4q0R4AQp6MO8rXnKt6qFIX2p%2BaypmySYXkIvi6XE3FHzpbtN%2B%2Bvd9P3TZPYu3fuQVl5kH0ZYDO5eqtnjh1EcZ%2Fb9rZMlMx6r6RcH%2B5wOY7X65bvRcmQ7OUmoaNGAMOZpOc4hg8vHj2afxCa%2FR7F3jDr1KjnHk%2BKnln3opoiqAVMFIoXv338f70KZw8CDd%2Fg9xU0jQlzgxDpDwSL6Y6OMz0oKxh4T%2BRUMKHj03VE5E9%2B8VKzPUMWAQ%2BZWuZyGMpWxwzn%2BomggywV5RhI%3D%7Ctkp%3ABk9SR_i1vbKEYQ"
},
# ...
]

Related

Can't get the correct data by calling an API using the Python requests library, while it works with Postman

After opening the page in the browser, I can see that an API gets called which returns some data in JSON format.
I am trying to call this API using my Python script as follows.
import requests

payload = {
    "sort": "tendersTotal.desc",
    "filter": {
        "keyword": {
            "searchedText": "",
            "searchedFields": []
        },
        "tenderLocationsOfActivity": [],
        "grantLocationsOfActivity": [],
        "grantSectorsOfActivity": [],
        "tenderSectorsOfActivity": [],
        "name": "",
        "types": [],
        "numberOfEmployees": [],
        "legalResidences": []
    },
    "pageSize": 1,
    "pageNr": 1
}

headers = {
    'Content-Type': 'application/json;charset=UTF-8'
}

r = requests.post("https://www.developmentaid.org/api/frontend/donors/search", data=payload, headers=headers)
print(r.json())
But unfortunately, it returns this message instead of the real data:
{"message":"Donor Search Request cannot be validated.","errors":{"pageNr":["The page nr field is required."],"pageSize":["The page size field is required."],"filter":["The filter field must be present."]}}
On the other hand, when I send the same request using Postman, it returns the data.
My question is: what changes do I need to make in my Python code so that the API call works correctly and returns the data I want?
Try using the json parameter instead of data:
r = requests.post("https://www.developmentaid.org/api/frontend/donors/search" , json=payload, headers=headers)
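The difference: data=payload sends the dict form-encoded, while json=payload serializes it to JSON and sets the Content-Type header for you, which is what this endpoint expects. The manual equivalent, reusing the payload and headers from the question, would be roughly:
import json
import requests

# Roughly what json=payload does under the hood: serialize the dict and send it as the request body.
r = requests.post(
    "https://www.developmentaid.org/api/frontend/donors/search",
    data=json.dumps(payload),
    headers={"Content-Type": "application/json;charset=UTF-8"},
)
print(r.json())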
Output:
print(r.json())
{'data': {'total': 11573, 'meta': {'page_title': 'Search for donors — ', 'page_description': 'Search for donors and filter by legal residence, organization type, tender sectors of activity, tender countries of activity, grant sectors of activity and grant countries of activity.'}, 'items': [{'id': 118363, 'name': 'World Bank (USA)', 'shortName': 'WB', 'description': "<p><strong>WB - World Bank</strong> - is an international financial institution that provides loans to developing countries for capital programs. The World Bank's official goal is the reduction of poverty.</p><p>\xa0</p><p>\xa0</p>", 'avatar': 'https://www.developmentaid.org/files/organizationLogos/world-bank-118363.jpg', 'types': 'Financial Institution', 'legalResidence': 'USA', 'activeMember': False, 'slug': 'wb', 'jobsTotal': 1503, 'tenders_total': 117040, 'grants_total': 132, 'countryProgrammingTotal': 172, 'contractorsTotal': 28915}]}}

Trello - Updating a card's position and color at the same time

I'm trying to have my board update based on some results, and I'm having a hard time finding the best way to update both a card's position and its color. The idea is to have a card update based on a result: red and moved to the top to catch my attention, but if everything is working correctly, green and moved to the bottom.
So far I have:
def updateCard():
    url = f"https://api.trello.com/1/cards/{CARD_ID}/cover"

    headers = {
        "Accept": "application/json"
    }

    query = {
        'key': API_KEY,
        'token': OAUTH_TOKEN,
        'name': 'New Title',
        'desc': 'New Description',
        'pos': 'bottom',
        'value': {'color': 'green'}
    }

    response = requests.request(
        "PUT",
        url,
        headers=headers,
        json=query
    )

    print(json.dumps(json.loads(response.text), sort_keys=True, indent=4, separators=(",", ": ")))
The pseudo code is from https://developer.atlassian.com/cloud/trello/rest/api-group-cards/#api-cards-id-put, with my own variables added.
I noticed that for changing the color, I need to pass the json argument in the request and have the URL end with '/cover'. However, this does not work when trying to update the position. If I take the /cover out of the URL, then the position gets updated. Is there a way to update both at the same time?
Thanks in advance!
I don't know Trello's API, and I can't test it without setting up a developer account, but my instinct is that you might be able to change your query like so:
. . .
url = f"https://api.trello.com/1/cards/{CARD_ID}"
. . .
query = {
    'key': API_KEY,
    'token': OAUTH_TOKEN,
    'value': {
        'name': 'New Title',
        'desc': 'New Description',
        'pos': 'bottom',
        'cover': {'color': 'green'}
    }
}
It looks like "cover" is a nested object under the "card" object, so you can just put an object in that field. When you're updating the card's position, you should be going for the card object directly, not the /cover field. You would use the /cover endpoint if you intended your request only to be scoped to the cover of the card.
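An untested sketch of how that single request might look, reusing the query dict above with a standard requests.put call (whether Trello actually accepts the cover nested like this is exactly the open question):
import requests

# Sketch only: one PUT against the card endpoint with the nested payload suggested above.
response = requests.put(
    f"https://api.trello.com/1/cards/{CARD_ID}",
    headers={"Accept": "application/json"},
    json=query,
)
print(response.status_code, response.text)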
EDIT: Trying a new request format
The workaround to this issue so far (and if anyone has a better syntax, feel free to add a comment):
url = f"https://api.trello.com/1/cards/{CARD_ID}"
url_cover = f"https://api.trello.com/1/cards/{CARD_ID}/cover"
headers = {
"Accept": "application/json"
}
query = {
'key': API_KEY,
'token': OAUTH_TOKEN,
'name': 'New Title 3',
'desc': 'New Description',
'pos': 'top'
}
json = {
'key': API_KEY,
'token': OAUTH_TOKEN,
'value': {
'brightness': 'dark',
'color': card_color,
'size': 'full'}}
response = requests.request(
"PUT",
url,
headers=headers,
params=query
)
response = requests.request(
"PUT",
url_cover,
headers=headers,
json=json
)

Scrape tables with python for player lists

I am trying to scrape the EA Sports football player tables from this website:
https://www.easports.com/fifa/ultimate-team/fut/database/results?position_secondary=LF,CF,RF,ST,LW,LM,CAM,CDM,CM,RM,RW,LWB,LB,CB,RB,RWB
I have run this simple code, however I am unable to get any output.
Code:
import requests, bs4
r = requests.get('https://www.easports.com/fifa/ultimate-team/fut/database/results?position_secondary=LF,CF,RF,ST,LW,LM,CAM,CDM,CM,RM,RW,LWB,LB,CB,RB,RWB')
soup = bs4.BeautifulSoup(r.text, 'lxml')
contents = soup.find(class_='contrast-white')
Can anybody help me with it please?
So the problem with that page is that those elements are dynamically generated by JavaScript.
Fortunately for us, most of the data comes through API calls, so we can use our browser cookies to bypass this limitation and make requests to the actual API.
This is what I came up with, I hope it fits your needs:
import requests
import json  # needed below to JSON-encode the jsonParamObject query parameter

def parse_item(item):
    attr_list = item['attributes']
    return {
        'name': item['name'],
        'type': item['playerType'],
        'OVR': item['composure'],
        'POS': item['position'],
        'PAC': get_attr_by_name(attr_list, 'PAC'),
        'DRI': get_attr_by_name(attr_list, 'DRI'),
        'SHO': get_attr_by_name(attr_list, 'SHO'),
        'DEF': get_attr_by_name(attr_list, 'DEF'),
        'PAS': get_attr_by_name(attr_list, 'PAS'),
        'PHY': get_attr_by_name(attr_list, 'PHY'),
    }

def get_attr_by_name(attr_list, attr_name):
    attr_name = attr_name.upper()
    try:
        return next(item['value'] for item in attr_list if item['name'].endswith(attr_name))
    except:
        return None
cookies = {
'hl': 'us',
'ak_bmsc': '2F856B67859A41FAFB7A62172F068FA7C99F9D14F555000037F4435B86E7E136~plcKkcciaz+3qtfstmojfDw6NLaOVQ0MD41+JJKpeGyyladBNwRB0lLcC8lVi+ELaolN0j0Yzs6HiXjknNAgxjejeFu1I32ZeiaXDNykNhtnNweIIWc26f6y1G6fcpEnkqc2shuFIGn0qSRkilVLfccdJ9pi6yVVjS09lvCSNsi8dNPeU8QUxup+jHmez3zlPebfRyk1zZ8bFb6DBiZ0Dyj6fJepQ89AJ6Kcaf5Ynd3FgefDstwDxcRbDKnssM14iLiSjwri5VWdNP4KtsmmP2as63Xxc5MaVBbTjyk2i5/o8Rj852VMkBWPlskrlkBkliBwOTM4rIFXxZhSSwO2+gog==',
'bm_sv': '830B3A15206003312D12E0B6FB4A2696~GupjwX5n1ZUaBybPwNV8B+/mIEouVASaWGBxPDg0p/S9lbZ98ziLYDEUArV6w2sGEn7NdWMub6mV5tEsGLoEgI48TmNE1/TUwtEyJcmtg2SlGBlGzFi64B2XdCR6oL2xy92x6zdNb6kOL3U+8YaBhQxd5nutL7sFddcENkQOb3E=',
'DOT_COM_PHPSESSID': 'e4r4ekoramipe1qvahf0fp2630',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0',
}
params = {
    # jsonParamObject is sent as a single JSON-encoded string; a nested dict here would not be
    # URL-encoded correctly by requests.
    'jsonParamObject': json.dumps({
        'page': 1,
        'position': 'LF,CF,RF,ST,LW,LM,CAM,CDM,CM,RM,RW,LWB,LB,CB,RB,RWB'
    })
}

r = requests.get(
    'https://www.easports.com/fifa/ultimate-team/api/fut/item',
    params=params,
    cookies=cookies
)

items = r.json()['items']
data = [parse_item(item) for item in items]
The json is quite big, so I wrote a couple functions to help extract the desired data out of it.
data is a list of dicts. This is what a single element looks like:
>>> data[0]
{'name': 'Cristiano Ronaldo', 'type': 'TEAM OF THE YEAR', 'OVR': 99, 'POS': 'LW', 'PAC': 98, 'DRI': 98, 'SHO': 99, 'DEF': 50, 'PAS': 94, 'PHY': 95}
You may need to change the values on cookies to the ones set by your browser.

How to use mwapi library to get a wikipedia page?

I have been trying to figure out the documentation of the mwapi library (MediaWiki API) and I cannot figure out how to simply request a page based on a search query or keyword. I know I should use get(), but filling in the parameters with keywords yields errors. Does anyone know how to use it to look up something like "Earth Wind and Fire"?
Documentation can be found here:
http://pythonhosted.org/mwapi
and here is the only example they have of get() being used
import mwapi
session = mwapi.Session('https://en.wikipedia.org')
print(session.get(action='query', meta='userinfo'))
{'query': {'userinfo': {'anon': '', 'name': '75.72.203.28', 'id': 0}}, 'batchcomplete': ''}
print(session.get(action='query', prop='revisions', revids=32423425))
{'query': {'pages': {'1429626': {'ns': 0, 'revisions': [{'user': 'Wknight94', 'parentid': 32276615, 'comment': '/* References */ Removing less-specific cat', 'revid': 32423425, 'timestamp': '2005-12-23T00:07:17Z'}], 'title': 'Grigol Ordzhonikidze', 'pageid': 1429626}}}, 'batchcomplete': ''}
Maybe this code will help you understand the API:
import json  # Used only to pretty-print dictionaries.
import mwapi

USER_AGENT = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.6) Gecko/2009011913 Firefox'

session = mwapi.Session('https://en.wikipedia.org', user_agent=USER_AGENT)

query = session.get(action='query', titles='Earth Wind and Fire')
print('query returned:')
print(json.dumps(query, indent=4))

pages = query['query']['pages']
if pages:
    print('\npages:')
    for pageid in pages:
        data = session.get(action='parse', pageid=pageid, prop='text')
        print(json.dumps(data, indent=4))
Output:
query returned:
{
"batchcomplete": "",
"query": {
"pages": {
"313370": {
"pageid": 313370,
"ns": 0,
"title": "Earth Wind and Fire"
}
}
}
}
pages:
{
"parse": {
"title": "Earth Wind and Fire",
"pageid": 313370,
"text": {
"*": "<div class=\"redirectMsg\"><p>Redirect to:</p><ul class=\"redirectText\"><li>Earth, Wind & Fire</li></ul></div><div class=\"mw-parser-output\">\n\n<!-- \nNewPP limit report\nParsed by mw1279\nCached time: 20171121014700\nCache expiry: 1900800\nDynamic content: false\nCPU time usage: 0.000 seconds\nReal time usage: 0.001 seconds\nPreprocessor visited node count: 0/1000000\nPreprocessor generated node count: 0/1500000\nPost\u2010expand include size: 0/2097152 bytes\nTemplate argument size: 0/2097152 bytes\nHighest expansion depth: 0/40\nExpensive parser function count: 0/500\n-->\n<!--\nTransclusion expansion time report (%,ms,calls,template)\n100.00% 0.000 1 -total\n-->\n</div>\n<!-- Saved in parser cache with key enwiki:pcache:idhash:313370-0!canonical and timestamp 20171121014700 and revision id 16182229\n -->\n"
}
}
}
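If the goal is a keyword search rather than an exact title lookup, the MediaWiki API also exposes a full-text search through the list='search' module; a minimal sketch using the same session (srsearch holds the keywords, srlimit caps the number of hits):
# Sketch: full-text search, then each hit can be fetched by its pageid as shown above.
results = session.get(action='query', list='search', srsearch='Earth Wind and Fire', srlimit=5)
for hit in results['query']['search']:
    print(hit['pageid'], hit['title'])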

Using mechanize bing search returns blank page

I am using mechanize to perform a bing search and then I will process the results with beautiful soup. I have successfully performed google and yahoo searches with this same method but when I do a bing search all I get is a blank page.
I am thoroughly confused why this is the case and if anyone can shed any light on the matter that would be greatly appreciated. Here is a sample of the code I'm using:
from BeautifulSoup import BeautifulSoup
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)
br.open("http://www.bing.com/search?count=100&q=cheese")
content = br.response()
content = content.read()
soup = BeautifulSoup(content, convertEntities=BeautifulSoup.ALL_ENTITIES)
print soup
The result is a blank line printed.
You probably got a response saying that the answer is already in your browser cache. Try changing your query string a little, for example decrease count to 50.
You can also add some debugging code to see the headers returned by the server:
br.open("http://www.bing.com/search?count=50&q=cheese")
response = br.response()
headers = response.info()
print headers
content = response.read()
EDIT:
I have tried this query with count=100 in the Firefox and Opera browsers and it seems that Bing does not like such a "big" count. When I decrease the count, it works. So this is not the fault of mechanize or another Python library; the query itself is problematic for Bing. It also seems that a browser can query Bing with count=100, but it must first query Bing with some smaller count. Strange!
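A sketch of that warm-up idea with mechanize, assuming the observation above holds (query once with a small count first, then repeat with count=100):
import mechanize
from BeautifulSoup import BeautifulSoup

br = mechanize.Browser()
br.set_handle_robots(False)

# Warm-up request with a small count first (based on the observation above, not documented Bing behaviour).
br.open("http://www.bing.com/search?count=10&q=cheese")
br.response().read()

# Then retry the large count.
br.open("http://www.bing.com/search?count=100&q=cheese")
content = br.response().read()
soup = BeautifulSoup(content, convertEntities=BeautifulSoup.ALL_ENTITIES)
print soup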
Another way to achieve this is by using requests with BeautifulSoup.
Code and example:
from bs4 import BeautifulSoup
import requests, lxml, json

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

def get_organic_results():
    html = requests.get('https://www.bing.com/search?q=nfs', headers=headers)
    soup = BeautifulSoup(html.text, 'lxml')

    bing_data = []

    for result in soup.find_all('li', class_='b_algo'):
        title = result.h2.text
        try:
            link = result.h2.a['href']
        except:
            link = None

        displayed_link = result.find('div', class_='b_attribution').text

        try:
            snippet = result.find('div', class_='b_caption').p.text
        except:
            snippet = None

        for inline in soup.find_all('div', class_='b_factrow'):
            try:
                inline_title = inline.a.text
            except:
                inline_title = None
            try:
                inline_link = inline.a['href']
            except:
                inline_link = None

            bing_data.append({
                'title': title,
                'link': link,
                'displayed_link': displayed_link,
                'snippet': snippet,
                'inline': [{'title': inline_title, 'link': inline_link}]
            })

    print(json.dumps(bing_data, indent = 2))

get_organic_results()
# part of the created json output:
'''
[
{
"title": "Need for Speed Video Games - Official EA Site",
"link": "https://www.ea.com/games/need-for-speed",
"displayed_link": "https://www.ea.com/games/need-for-speed",
"snippet": "Need for Speed Forums Buy Now All Games Forums Buy Now Learn More Buy Now Hit the gas and tear up the roads in this legendary action-driving series. Push your supercar to its limits and leave the competition in your rearview or shake off a full-scale police pursuit \u2013 it\u2019s all just a key-turn away.",
"inline": [
{
"title": null,
"link": null
}
]
}
]
'''
Alternatively, you can do the same thing using the Bing Organic Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.
Code to integrate:
from serpapi import GoogleSearch
import os

def get_organic_results():
    params = {
        "api_key": os.getenv('API_KEY'),
        "engine": "bing",
        "q": "nfs most wanted"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    for result in results['organic_results']:
        title = result['title']
        link = result['link']
        displayed_link = result['displayed_link']

        try:
            snippet = result['snippet']
        except:
            snippet = None

        try:
            inline = result['sitelinks']['inline']
        except:
            inline = None

        print(f'{title}\n{link}\n{displayed_link}\n{snippet}\n{inline}\n')

get_organic_results()
# part of the output:
'''
Need for Speed: Most Wanted - Car Racing Game - Official ...
https://www.ea.com/games/need-for-speed/need-for-speed-most-wanted
https://www.ea.com/games/need-for-speed/need-for-speed-most-wanted
Jun 01, 2017 · To be Most Wanted, you’ll need to outrun the cops, outdrive your friends, and outsmart your rivals. With a relentless police force gunning to take you down, you’ll need to make split-second decisions. Use the open world to …
[{'title': 'Need for Speed No Limits', 'link': 'https://www.ea.com/games/need-for-speed/need-for-speed-no-limits'}, {'title': 'Buy Now', 'link': 'https://www.ea.com/games/need-for-speed/need-for-speed-heat/buy'}, {'title': 'Need for Speed Undercover', 'link': 'https://www.ea.com/games/need-for-speed/need-for-speed-undercover'}, {'title': 'Need for Speed The Run', 'link': 'https://www.ea.com/games/need-for-speed/need-for-speed-the-run'}, {'title': 'News', 'link': 'https://www.ea.com/games/need-for-speed/need-for-speed-payback/news'}]
'''
Disclaimer, I work for SerpApi.
