Python 3 Urllib Posting Issues

I was hoping someone could help me with urllib posting. My goal for this program is to post an IP address and get back its approximate location. I know there are many APIs for this, but my school isn't keen on having any of their computers modified in any way (this is for a comp sci class). As it stands, the code below returns my own location, because the website already detects my computer's IP (I'm guessing from a header?). What I'd like to do is submit an arbitrary IP and have its location returned. ipStr is just the IP string (in this case it's Time Warner Cable's IP in NYC). I've tried setting the form values and submitting the data, but no matter what I set them to, it still returns my own computer's location. Any ideas?
ipStr = "72.229.28.185"
url = "https://www.iplocation.net/"
values = {'value': ipStr}
headers = {}
headers ['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1'
data = urllib.parse.urlencode(values)
data = data.encode('utf-8')
req = urllib.request.Request(url,data=data, headers = headers)
resp = urllib.request.urlopen(req)
page = str(resp.read())
npattern = "Google Map for"
nfound = re.search(npattern,page)
ns = nfound.start()
ne = nfound.end()
location = ""
while page[ne:ne +1] != "(":
location += page[ne:ne+1]
ne += 1

You just need to change the parameter name from value to query, for example:
values = {'query': ipStr}
If you look at the name of the input field on the page (https://www.iplocation.net/) you'll see the field's name is query.
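Putting that together, a minimal sketch of the corrected request (assuming the page still uses a form field named query and nothing else about the site has changed) could look like:
import urllib.parse
import urllib.request

ipStr = "72.229.28.185"
url = "https://www.iplocation.net/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1'}

# the form field is named 'query', not 'value'
data = urllib.parse.urlencode({'query': ipStr}).encode('utf-8')
req = urllib.request.Request(url, data=data, headers=headers)
page = str(urllib.request.urlopen(req).read())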

Related

Unable to receive desired results from POST in Python

I am attempting to get data from ITC TradeMap (the page was selected at random, so don't read too much into it) using Requests, then clean it (not done yet) and export it with Pandas. However, I am having difficulty getting the full dataset.
import pandas as pd
import requests as rq

# Pandas settings
pd.set_option("display.max_rows", 999)
pd.set_option("display.max_columns", 999)

# Request settings
url = 'https://www.trademap.org/Country_SelProductCountry_TS.aspx?nvpm=1%7c643%7c%7c%7c%7c36%7c%7c%7c2%7c1%7c1%7c2%7c2%7c1%7c2%7c1%7c1%7c1'
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
}
payload = {
    'ctl00$PageContent$GridViewPanelControl$HiddenField_Current_OutputMode': 'T',
    'ctl00$PageContent$GridViewPanelControl$HiddenField_Current_PageSizeTab': '300',
    'ctl00$PageContent$GridViewPanelControl$HiddenField_Current_TS_NumTimePeriod': '10',
    'ctl00$PageContent$GridViewPanelControl$HiddenField_Current_TS_ReferencePeriod': '2019'
}

# Output settings
output_file = 'ITC_Test.xlsx'

# Work
request = rq.post(url, verify=False, headers=headers, data=payload)
table = pd.read_html(request.content)
table[8].to_excel(output_file)
print(table[8])
So far I am in the testing stage and solving issues as they arise (e.g. if the request is made without verify=False, it throws a server-side ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)), but that's beside the point.
The real problem is that while most of the data needed for a successful query is contained in the URL itself (and I will simply loop over it when the time comes), the view settings are not, and without them I am limited to retrieving only 25 rows and 5 columns of data (the top 25 trade partners over the last 5 years).
Those settings live in dropdown menus which seem to be fed into an aspnetForm, so I tried to use the data parameter of post to supply those values:
payload = {
    'ctl00$PageContent$GridViewPanelControl$HiddenField_Current_OutputMode': 'T',
    'ctl00$PageContent$GridViewPanelControl$HiddenField_Current_PageSizeTab': '300',
    'ctl00$PageContent$GridViewPanelControl$HiddenField_Current_TS_NumTimePeriod': '10',
    'ctl00$PageContent$GridViewPanelControl$HiddenField_Current_TS_ReferencePeriod': '2019'
}
However, the output does not seem to be affected: it still returns only 25 rows and 5 columns of data instead of the 300 rows and 10 columns I would expect.
I have seen some questions here that seemed similar and tried to implement those ideas, but most likely because I haven't worked with these libraries and my knowledge of Python in general is rather basic, I was unable to resolve the issue, so any help would be much appreciated.
Thanks!
I found 3 problems:
it has to use Session() to send cookies
it has to send all form values in the payload - so first I GET the page to collect all the current values from <input> and <select>
it has to send the new settings in different variables.
You used variables that describe the current state, but the new settings are sent in variables with DropDownList in the name:
payload['ctl00$PageContent$GridViewPanelControl$DropDownList_NumTimePeriod'] = '20'
payload['ctl00$PageContent$GridViewPanelControl$DropDownList_PageSize'] = '300'
EDIT:
I saved the payload generated by the code and the payload sent by the browser, and used the program Meld to compare the files and see the differences.
I had to correct the code which gets values from <select>, because it needs to find the <option> marked selected.
For some addresses it also needed some values set manually, because normally JavaScript adds these values.
And it needed to skip DropDownList_Product_Group.
Full working code:
import pandas as pd
import requests as rq
from bs4 import BeautifulSoup as BS

# hide SSL warnings
from requests.packages.urllib3.exceptions import InsecureRequestWarning
rq.packages.urllib3.disable_warnings(InsecureRequestWarning)

# request settings
#url = 'https://www.trademap.org/Country_SelProduct_TS.aspx?nvpm=1%7c%7c%7c%7c%7c88%7c%7c%7c2%7c1%7c1%7c1%7c2%7c1%7c2%7c1%7c1%7c1'
url = 'https://www.trademap.org/Country_SelProductCountry_TS.aspx?nvpm=1%7c616%7c%7c%7c%7cTOTAL%7c%7c%7c2%7c1%7c1%7c1%7c2%7c1%7c2%7c1%7c%7c1'
print('url:', url.replace('%7c', '|'))

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
}

s = rq.Session()

# --- GET ---

print('sending GET ...')

response = s.get(url, verify=False, headers=headers)

soup = BS(response.content, 'html.parser')
form = soup.find('form')  #.find('div', {'id': 'div_nav_combo'})

payload = {}

# collect the current values of all <input> fields
inputs = form.find_all('input')
for item in inputs:
    name = item.get('name', '')
    value = item.get('value', '')
    if name:  # and name != 'pg_goal' and 'button' not in name.lower():
        payload[name] = value

# collect the current values of all <select> fields - take the <option> marked `selected`
selects = form.find_all('select')
for item in selects:
    name = item.get('name', '')
    if name:
        value = item.find('option', {'selected': True}) or ""
        if value:
            value = value['value']
        payload[name] = value

# --- POST ---

print('sending POST ...')

#payload['ctl00$PageContent$GridViewPanelControl$DropDownList_NumTimePeriod'] = '20'
#payload['__EVENTTARGET'] = 'ctl00$PageContent$GridViewPanelControl$DropDownList_NumTimePeriod'
payload['ctl00$PageContent$GridViewPanelControl$DropDownList_PageSize'] = '300'
#payload['__EVENTTARGET'] = 'ctl00$PageContent$GridViewPanelControl$DropDownList_PageSize'
#payload['ctl00$MenuControl$DDL_Language'] = 'en'

# normally added by JavaScript
payload['ctl00$NavigationControl$DropDownList_Country_Group'] = '-2'
payload['ctl00$NavigationControl$DropDownList_Partner'] = '-2'
payload['ctl00$NavigationControl$DropDownList_Partner_Group'] = '-2'

# has to be removed for `PageSize` to take effect (at least for some addresses)
del payload['ctl00$NavigationControl$DropDownList_Product_Group']

response = s.post(url, verify=False, headers=headers, data=payload)
#print(response.content[:1000])

# --- rest ---

# pandas settings
pd.set_option("display.max_rows", 999)
pd.set_option("display.max_columns", 999)

# output settings
output_file = 'ITC_Test.xlsx'

all_tables = pd.read_html(response.content)
table = all_tables[8]
table.to_excel(output_file)
print('len(table):', len(table))
#print(table)
Result: table with ~230 rows and ~20 columns

Why does an if comparison of two seemingly identical values give False?

This is my code; I can't get the data I want. I found that although my 'address' and the 'from' address obtained from the data look the same, they are not equal.
import requests
import json

address = '0xF5565F298D47C95DE222d0e242A69D2711fE3E89'
url = f'https://api.etherscan.io/api?module=account&action=txlist&address={address}&startblock=0&endblock=99999999&page=1&offset=10000&sort=asc'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36',
}

rsp = requests.get(url, headers=headers)
result = json.loads(rsp.text)
result = result['result']

gas_list = []
val_list = []
for item in result:
    if 'from' in item:
        index = result.index(item)
        if result[index]['from'] == address:
            wei = 10 ** 18
            # the API returns these fields as strings, so convert before doing arithmetic
            gasPrice = int(result[index]['gasPrice'])
            gasUsed = int(result[index]['gasUsed'])
            value = int(result[index]['value'])
            gas = gasPrice * gasUsed / wei
            v = value / wei
            gas_list.append(gas)
            val_list.append(v)

total_gas = sum(gas_list)
total_val = sum(val_list)
print(total_gas)
print(total_val)
I guess it may be a problem with id(), but I don't know how to solve it. I've been wondering about this for a long time. I'm a novice in Python, please help me.
if result[index]['from'] == address:
Your address has uppercase letters while the returned from addresses have lowercase letters, therefore the strings don't match.
Try testing them with
result[index]['from'].upper() == address.upper()
or switching yours to be lowercase.
I found this out by visiting the URL you gave and having a look at what was returned. This is a useful technique for services like this that return JSON, as it's all plain text and readable in a browser.
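As a minimal sketch, the comparison inside the loop could be made case-insensitive like this (keeping the rest of the script unchanged):
for item in result:
    if 'from' in item:
        # normalise both sides to lowercase so letter case cannot cause a mismatch
        if item['from'].lower() == address.lower():
            ...  # same gas/value calculations as before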

How do I fix the code to scrape the Zomato website?

I wrote this code but got the error "IndexError: list index out of range" after running the last line. How do I fix this?
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

response = requests.get("https://www.zomato.com/bangalore/top-restaurants", headers=headers)
content = response.content
soup = BeautifulSoup(content, "html.parser")

top_rest = soup.find_all("div", attrs={"class": "sc-bblaLu dOXFUL"})
list_tr = top_rest[0].find_all("div", attrs={"class": "sc-gTAwTn cKXlHE"})

list_rest = []
for tr in list_tr:
    dataframe = {}
    dataframe["rest_name"] = (tr.find("div", attrs={"class": "res_title zblack bold nowrap"})).text.replace('\n', ' ')
    dataframe["rest_address"] = (tr.find("div", attrs={"class": "nowrap grey-text fontsize5 ttupper"})).text.replace('\n', ' ')
    dataframe["cuisine_type"] = (tr.find("div", attrs={"class": "nowrap grey-text"})).text.replace('\n', ' ')
    list_rest.append(dataframe)

list_rest
You are receiving this error because top_rest is empty when you attempt to take its first element with top_rest[0]. The reason is that the first class you're attempting to reference is dynamically named: if you refresh the page, that same div will not carry the same class name. So when you attempt to scrape it, you get empty results.
An alternative would be to scrape ALL divs and then narrow in on the elements you want; be mindful of the dynamic naming scheme, since from one request to the next you may get different class names:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/bangalore/top-restaurants",headers=headers)
content = response.content
soup = BeautifulSoup(content,"html.parser")
top_rest = soup.find_all("div")
list_tr = top_rest[0].find_all("div",attrs={"class": "bke1zw-1 eMsYsc"})
list_tr
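From there, one hypothetical way to narrow in on the restaurant details might look like the sketch below; the child class name is only illustrative (carried over from the question's snippet) and needs to be checked against the live page, since Zomato's markup changes:
list_rest = []
for tr in list_tr:
    # class name is an assumption from the question - inspect the page to confirm the current one
    name_div = tr.find("div", attrs={"class": "res_title zblack bold nowrap"})
    if name_div:
        list_rest.append(name_div.text.replace('\n', ' '))
list_rest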
I recently did a project that involved scraping Zomato's website for Manila, Philippines. I used the geopy library to get the longitude and latitude of Manila City, then fetched the restaurants' details using that information.
Note: you can get your own API key on the Zomato website to make up to 1000 calls a day.
# Use the geopy library to get the latitude and longitude values of Manila City.
from geopy.geocoders import Nominatim

address = 'Manila City, Philippines'
geolocator = Nominatim(user_agent='Makati_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Manila City are {}, {}.'.format(latitude, longitude))

# Use Zomato's API to make calls
import numpy as np
import pandas as pd
import requests

headers = {'user-key': '617e6e315c6ec2ad5234e884957bfa4d'}

venues_information = []
# foursquare_venues is assumed to already exist as a DataFrame with 'name', 'lat' and 'lng' columns
for index, row in foursquare_venues.iterrows():
    print("Fetching data for venue: {}".format(index + 1))
    venue = []
    url = ('https://developers.zomato.com/api/v2.1/search?q={}' +
           '&start=0&count=1&lat={}&lon={}&sort=real_distance').format(row['name'], row['lat'], row['lng'])
    try:
        result = requests.get(url, headers=headers).json()
    except:
        print("There was an error...")
    try:
        if (len(result['restaurants']) > 0):
            venue.append(result['restaurants'][0]['restaurant']['name'])
            venue.append(result['restaurants'][0]['restaurant']['location']['latitude'])
            venue.append(result['restaurants'][0]['restaurant']['location']['longitude'])
            venue.append(result['restaurants'][0]['restaurant']['average_cost_for_two'])
            venue.append(result['restaurants'][0]['restaurant']['price_range'])
            venue.append(result['restaurants'][0]['restaurant']['user_rating']['aggregate_rating'])
            venue.append(result['restaurants'][0]['restaurant']['location']['address'])
            venues_information.append(venue)
        else:
            venues_information.append(np.zeros(6))
    except:
        pass

ZomatoVenues = pd.DataFrame(venues_information,
                            columns=['venue', 'latitude',
                                     'longitude', 'price_for_two',
                                     'price_range', 'rating', 'address'])
Using Web Scraping Language I was able to write this:
GOTO https://www.zomato.com/bangalore/top-restaurants
EXTRACT {'rest_name': '//div[@class="res_title zblack bold nowrap"]',
         'rest_address': '//div[@class="nowrap grey-text fontsize5 ttupper"]',
         'cusine_type': '//div[@class="nowrap grey-text"]'} IN //div[@class="bke1zw-1 eMsYsc"]
This will iterate over each record element with class bke1zw-1 eMsYsc and pull each restaurant's information.

Parsing Google contacts feed in Python

I'm trying to obtain the name of every contact in the Google Contacts feed using Python, based on Retrieve all contacts from gmail using python.
import urllib2
from xml.etree import ElementTree as etree  # lxml.etree works here as well

social = request.user.social_auth.get(provider='google-oauth2')
url = 'https://www.google.com/m8/feeds/contacts/default/full' + '?access_token=' + social.tokens + '&max-results=10000'
req = urllib2.Request(url, headers={'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) Ubuntu/11.04 Chromium/12.0.742.112 Chrome/12.0.742.112 Safari/534.30"})
contacts = urllib2.urlopen(req).read()
contacts_xml = etree.fromstring(contacts)

contacts_list = []
n = 0
for entry in contacts_xml.findall('{http://www.w3.org/2005/Atom}entry'):
    n = n + 1
    for name in entry.findall('{http://schemas.google.com/g/2005}name'):
        fullName = name.attrib.get('fullName')
        contacts_list.append(fullName)
I'm able to obtain the number of contacts n, but no luck obtaining the fullName. Any help is appreciated!
In case someone needs it, I found the solution to obtain the name from Google Contacts feed:
for entry in contacts_xml.findall('{http://www.w3.org/2005/Atom}entry'):
    for title in entry.findall('{http://www.w3.org/2005/Atom}title'):
        name = title.text
        contacts_list.append(name)
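For what it's worth, the original gd:name approach most likely failed because fullName is a child element of gd:name rather than an attribute; a rough sketch along those lines (assuming the same feed and namespaces as above) would be:
for entry in contacts_xml.findall('{http://www.w3.org/2005/Atom}entry'):
    for name in entry.findall('{http://schemas.google.com/g/2005}name'):
        # gd:fullName is a child element, so read its text instead of looking for an attribute
        full_name = name.find('{http://schemas.google.com/g/2005}fullName')
        if full_name is not None:
            contacts_list.append(full_name.text)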

Python3 - how to login to web form with hidden values?

I am trying to write a Python script to log in to the following site in order to automatically keep an eye on some of our merchant account details:
https://secure.worldpay.com/sso/public/auth/login.html?serviceIdentifier=merchantadmin
The credentials I am using are read-only, so cannot be used for anything nefarious, but something isn't quite working correctly.
My code so far:
import urllib
from requests import session

LOGIN_URL = "https://secure.worldpay.com/sso/public/auth/login.html?serviceIdentifier=merchantadmin"

_page = urllib.urlopen(LOGIN_URL)
_contents = _page.read()

# scrape the hidden form values out of the page source by string position
_jlbz_index = _contents.find("jlbz")
_jlbz_start_index = _jlbz_index + 5
_jlbz_end_index = _jlbz_start_index + 41
jlbz = _contents[_jlbz_start_index:_jlbz_end_index]

fdt = _contents.find("formDisplayTime")
fdt_start_index = fdt + 23
fdt_end_index = fdt_start_index + 13
form_display_time = _contents[fdt_start_index:fdt_end_index]

fsh = _contents.find("formSubmitHash")
fsh_start_index = fsh + 22
fsh_end_index = fsh_start_index + 41
form_submit_hash = _contents[fsh_start_index:fsh_end_index]

post_auth_url = "https://secure-test.worldpay.com/merchant/common/start.html?jlbz={0}".format(jlbz)

payload = {
    "action": "j_security_check",
    "username": "USERNAME",
    "password": "PASSWORD",
    "jlbz": jlbz,
    "maiversion": "version1",
    "formDisplayTime": form_display_time,
    "formSubmitHash": form_submit_hash
}

with session() as c:
    c.post(LOGIN_URL, data=payload)
    request = c.get(post_auth_url)
    print(request.headers)
    print(request.text)
I know it is currently a little long-winded, but I find it easier to write verbosely when first trying something and then refine later.
jlbz, formDisplayTime and formSubmitHash are all hidden input values from the page source. I am scraping them from the page, but obviously by the time I get to c.post I am opening the URL AGAIN, so those values have changed and are no longer valid. What I'm unsure about is how to rewrite the c.post line so that I submit the hidden values that actually belong to the request I'm posting from.
I don't think this is relevant only to this site, but to any site with hidden, randomised form values.
import requests
from bs4 import BeautifulSoup

user = 'xyzmohsin'
passwd = 'abcpasswd'

s = requests.Session()
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"}
s.headers.update(headers)

# GET the login page first so the hidden values belong to this session
r = s.get("https://secure.worldpay.com/sso/public/auth/login.html?serviceIdentifier=merchantadmin")
soup = BeautifulSoup(r.content, "html.parser")

jlbz = soup.find("input", {"name": "jlbz"})['value']
maiversion = soup.find(id="maiversion")['value']
formDisplayTime = soup.find("input", {"name": "formDisplayTime"})['value']
formSubmitHash = soup.find("input", {"name": "formSubmitHash"})['value']

data = {"jlbz": jlbz,
        "username": user,
        "password": passwd,
        "maiversion": maiversion,
        "formDisplayTime": formDisplayTime,
        "formSubmitHash": formSubmitHash}

headers = {"Content-Type": "application/x-www-form-urlencoded",
           "Host": "secure.worldpay.com",
           "Origin": "https://secure.worldpay.com",
           "Referer": "https://secure.worldpay.com/sso/public/auth/login.html?serviceIdentifier=merchantadmin"}

login_url = "https://secure.worldpay.com/sso/public/auth/j_security_check"
r = s.post(login_url, headers=headers, data=data)
I don't have the ID and password, so I can't tell which headers will work.
If this doesn't work, try removing Host, Origin and Referer from the headers of the last s.post request.
Hope that helps :-)
