Parsing Google contacts feed in Python

I'm trying to obtain the name of every contact in the Google contacts feed using Python, based on the approach from Retrieve all contacts from gmail using python:
import urllib2
from xml.etree import ElementTree as etree  # or: from lxml import etree

social = request.user.social_auth.get(provider='google-oauth2')
url = 'https://www.google.com/m8/feeds/contacts/default/full' + '?access_token=' + social.tokens + '&max-results=10000'
req = urllib2.Request(url, headers={'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) Ubuntu/11.04 Chromium/12.0.742.112 Chrome/12.0.742.112 Safari/534.30"})
contacts = urllib2.urlopen(req).read()
contacts_xml = etree.fromstring(contacts)

contacts_list = []
n = 0
for entry in contacts_xml.findall('{http://www.w3.org/2005/Atom}entry'):
    n = n + 1
    for name in entry.findall('{http://schemas.google.com/g/2005}name'):
        fullName = name.attrib.get('fullName')
        contacts_list.append(fullName)
I'm able to obtain the number of contacts n, but no luck obtaining the fullName. Any help is appreciated!

In case someone needs it, here is the solution I found for obtaining the name from the Google Contacts feed:
for entry in contacts_xml.findall('{http://www.w3.org/2005/Atom}entry'):
    for title in entry.findall('{http://www.w3.org/2005/Atom}title'):
        name = title.text
        contacts_list.append(name)
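For reference, the original attempt fails because in the GData schema gd:fullName is a child element of gd:name, not an attribute, so name.attrib.get('fullName') returns None. If you specifically want gd:fullName rather than the Atom title, a sketch like this should work (untested, and assuming the feed actually includes gd:name elements):

for entry in contacts_xml.findall('{http://www.w3.org/2005/Atom}entry'):
    name = entry.find('{http://schemas.google.com/g/2005}name')
    if name is not None:
        # gd:fullName is a child element; read its text instead of an attribute
        full_name = name.find('{http://schemas.google.com/g/2005}fullName')
        if full_name is not None:
            contacts_list.append(full_name.text)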


Why does an if comparison of two seemingly identical values give False?

This is my code. I can't get the data I want. I found that although my 'address' and the 'address' obtained from the data look the same, they are not equal:
import requests
import json

address = '0xF5565F298D47C95DE222d0e242A69D2711fE3E89'
url = f'https://api.etherscan.io/api?module=account&action=txlist&address={address}&startblock=0&endblock=99999999&page=1&offset=10000&sort=asc'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36',
}
rsp = requests.get(url, headers=headers)
result = json.loads(rsp.text)
result = result['result']
gas_list = []
val_list = []
for item in result:
    if 'from' in item:
        index = result.index(item)
        if result[index]['from'] == address:
            wei = 10 ** 18
            gasPrice = int(result[index]['gasPrice'])  # Etherscan returns numbers as strings
            gasUsed = int(result[index]['gasUsed'])
            value = int(result[index]['value'])
            gas = gasPrice * gasUsed / wei
            v = value / wei
            gas_list.append(gas)
            val_list.append(v)
total_gas = sum(gas_list)
total_val = sum(val_list)
print(total_gas)
print(total_val)
I guess it may be a problem with id(), but I don't know how to solve it. I've been puzzling over this for a long time. I'm a novice in Python, please help me.
if result[index]['from'] == address:
Your address contains uppercase letters, while the from addresses returned by the API are all lowercase, so the strings never match.
Try comparing them case-insensitively:
result[index]['from'].lower() == address.lower()
(Simply uppercasing one side is not enough here, because your address is mixed-case.)
I found this out by visiting the URL you gave and looking at what was returned. That's a useful technique for services like this that provide JSON responses, since it's all plain text and readable in a browser.
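For reference, a minimal sketch of the corrected filter (assuming result, gas_list and val_list are set up as in the question):

wei = 10 ** 18
for item in result:
    # Normalize case on both sides: Etherscan returns all-lowercase addresses,
    # while the query address uses a mixed-case checksum.
    if item.get('from', '').lower() == address.lower():
        gas_list.append(int(item['gasPrice']) * int(item['gasUsed']) / wei)
        val_list.append(int(item['value']) / wei)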

How do I fix this code to scrape the Zomato website?

I wrote this code but got the error "IndexError: list index out of range" after running the last line. How do I fix this?
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/bangalore/top-restaurants", headers=headers)
content = response.content
soup = BeautifulSoup(content, "html.parser")
top_rest = soup.find_all("div", attrs={"class": "sc-bblaLu dOXFUL"})
list_tr = top_rest[0].find_all("div", attrs={"class": "sc-gTAwTn cKXlHE"})
list_rest = []
for tr in list_tr:
    dataframe = {}
    dataframe["rest_name"] = (tr.find("div", attrs={"class": "res_title zblack bold nowrap"})).text.replace('\n', ' ')
    dataframe["rest_address"] = (tr.find("div", attrs={"class": "nowrap grey-text fontsize5 ttupper"})).text.replace('\n', ' ')
    dataframe["cuisine_type"] = (tr.find("div", attrs={"class": "nowrap grey-text"})).text.replace('\n', ' ')
    list_rest.append(dataframe)
list_rest
You are receiving this error because top_rest is empty when you attempt to get its first element with top_rest[0]. That happens because the first class you're referencing is dynamically generated: if you refresh the page, the same div will no longer carry the same class name, so your search comes back empty.
An alternative is to scrape all divs and then narrow down to the elements you want. Be mindful of the dynamic naming scheme, since from one request to another the generated class names can differ:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/bangalore/top-restaurants",headers=headers)
content = response.content
soup = BeautifulSoup(content,"html.parser")
top_rest = soup.find_all("div")
list_tr = top_rest[0].find_all("div",attrs={"class": "bke1zw-1 eMsYsc"})
list_tr
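If the generated class hash keeps changing between page loads, matching classes by pattern rather than by exact value can be a little more resilient. A sketch (the regex is my own guess at the shape of Zomato's generated names and may need tuning):

import re
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/bangalore/top-restaurants", headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
# class_ with a compiled regex matches any div carrying a class of the
# form "<hash>-<digit>", like "bke1zw-1" in the snippet above.
candidates = soup.find_all("div", class_=re.compile(r"^[a-z0-9]+-\d+$"))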
I recently did a project that had me scraping Zomato's listings for Manila, Philippines. I used the geopy library to get the latitude and longitude values of Manila City, then fetched the restaurants' details through Zomato's API using those coordinates.
Note: you can get your own API key on the Zomato website, good for up to 1000 calls a day.
# Use the geopy library to get the latitude and longitude values of Manila City.
import requests
import numpy as np
import pandas as pd
from geopy.geocoders import Nominatim

address = 'Manila City, Philippines'
geolocator = Nominatim(user_agent='Makati_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Manila City are {}, {}.'.format(latitude, longitude))

# Use Zomato's API to make calls.
# (foursquare_venues is a DataFrame with 'name', 'lat' and 'lng' columns.)
headers = {'user-key': '617e6e315c6ec2ad5234e884957bfa4d'}
venues_information = []
for index, row in foursquare_venues.iterrows():
    print("Fetching data for venue: {}".format(index + 1))
    venue = []
    url = ('https://developers.zomato.com/api/v2.1/search?q={}' +
           '&start=0&count=1&lat={}&lon={}&sort=real_distance').format(row['name'], row['lat'], row['lng'])
    try:
        result = requests.get(url, headers=headers).json()
    except:
        print("There was an error...")
    try:
        if (len(result['restaurants']) > 0):
            venue.append(result['restaurants'][0]['restaurant']['name'])
            venue.append(result['restaurants'][0]['restaurant']['location']['latitude'])
            venue.append(result['restaurants'][0]['restaurant']['location']['longitude'])
            venue.append(result['restaurants'][0]['restaurant']['average_cost_for_two'])
            venue.append(result['restaurants'][0]['restaurant']['price_range'])
            venue.append(result['restaurants'][0]['restaurant']['user_rating']['aggregate_rating'])
            venue.append(result['restaurants'][0]['restaurant']['location']['address'])
            venues_information.append(venue)
        else:
            venues_information.append(np.zeros(7))  # one placeholder per column
    except:
        pass

ZomatoVenues = pd.DataFrame(venues_information,
                            columns=['venue', 'latitude',
                                     'longitude', 'price_for_two',
                                     'price_range', 'rating', 'address'])
Using Web Scraping Language I was able to write this:

GOTO https://www.zomato.com/bangalore/top-restaurants
EXTRACT {'rest_name': '//div[@class="res_title zblack bold nowrap"]',
         'rest_address': '//div[@class="nowrap grey-text fontsize5 ttupper"]',
         'cusine_type': '//div[@class="nowrap grey-text"]'} IN //div[@class="bke1zw-1 eMsYsc"]

This will iterate over each record element with class bke1zw-1 eMsYsc and pull each restaurant's information.

Unstable Results when scraping Amazon

I am new to web scraping, so hopefully this question is clear.
I found a tutorial on the internet for scraping Amazon data based on a given ASIN (Amazon's unique product number). See: https://www.scrapehero.com/tutorial-how-to-scrape-amazon-product-details-using-python/
When running this code (I adjusted it a bit) I got different results every time, even when re-running five seconds later. In my example, one run finds the titles, and five seconds later the result is NULL.
I suspect the reason is that I looked up the XPath via Google Chrome, while the script identifies itself with the User-Agent header defined at the beginning of the code:
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
My question: how can I scrape the content in a stable way, i.e. reliably getting the real page results for the given ASIN numbers?
Below is the code to reproduce the issue. You can run the script via the command line:
python script_name.py
Thanks a lot for your help!
The script:
from lxml import html
import csv, os, json
import requests
#from exceptions import ValueError
from time import sleep

def AmzonParser(url):
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
    page = requests.get(url, headers=headers)
    while True:
        sleep(5)
        try:
            doc = html.fromstring(page.content)

            # Title
            XPATH_NAME = '//*[@id="productTitle"]/text()'
            XPATH_NAME1 = doc.xpath(XPATH_NAME)
            TITLE = ' '.join(''.join(XPATH_NAME1).split()) if XPATH_NAME1 else None

            #XPATH_SALE_PRICE = '//span[contains(@id,"ourprice") or contains(@id,"saleprice")]/text()'
            #XPATH_ORIGINAL_PRICE = '//td[contains(text(),"List Price") or contains(text(),"M.R.P") or contains(text(),"Price")]/following-sibling::td/text()'
            #XPATH_CATEGORY = '//a[@class="a-link-normal a-color-tertiary"]//text()'
            #XPATH_AVAILABILITY = '//div[@id="availability"]//text()'

            #RAW_NAME = doc.xpath(XPATH_NAME)
            #RAW_SALE_PRICE = doc.xpath(XPATH_SALE_PRICE)
            #RAW_CATEGORY = doc.xpath(XPATH_CATEGORY)
            #RAW_ORIGINAL_PRICE = doc.xpath(XPATH_ORIGINAL_PRICE)
            #RAw_AVAILABILITY = doc.xpath(XPATH_AVAILABILITY)

            #NAME = ' '.join(''.join(RAW_NAME).split()) if RAW_NAME else None
            #SALE_PRICE = ' '.join(''.join(RAW_SALE_PRICE).split()).strip() if RAW_SALE_PRICE else None
            #CATEGORY = ' > '.join([i.strip() for i in RAW_CATEGORY]) if RAW_CATEGORY else None
            #ORIGINAL_PRICE = ''.join(RAW_ORIGINAL_PRICE).strip() if RAW_ORIGINAL_PRICE else None
            #AVAILABILITY = ''.join(RAw_AVAILABILITY).strip() if RAw_AVAILABILITY else None

            #if not ORIGINAL_PRICE:
            #    ORIGINAL_PRICE = SALE_PRICE

            if page.status_code != 200:
                raise ValueError('captcha')

            data = {
                'TITLE': TITLE
                #'SALE_PRICE':SALE_PRICE,
                #'CATEGORY':CATEGORY,
                #'ORIGINAL_PRICE':ORIGINAL_PRICE,
                #'AVAILABILITY':AVAILABILITY,
                #'URL':url,
            }
            return data
        except Exception as e:
            print(e)

def ReadAsin():
    # AsinList = csv.DictReader(open(os.path.join(os.path.dirname(__file__),"Asinfeed.csv")))
    AsinList = [
        'B00AEINQ9K',
        'B00JWP8F3I']
    extracted_data = []
    for i in AsinList:
        url = "http://www.amazon.com/dp/" + i
        print("Processing: " + url)
        extracted_data.append(AmzonParser(url))
        sleep(5)
    f = open('data_scraped_data.json', 'w')
    json.dump(extracted_data, f, indent=4)

if __name__ == "__main__":
    ReadAsin()
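Not part of the tutorial, but one likely cause of the instability, given the code above: Amazon intermittently answers with a captcha or error page, and the script requests the page only once, outside the while True loop, so every retry re-parses the same bad response. A sketch of a retry that actually re-requests each time (fetch_with_retries, the retry count and the productTitle heuristic are my own illustration):

import requests
from time import sleep

def fetch_with_retries(url, retries=5):
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
    for attempt in range(retries):
        page = requests.get(url, headers=headers)
        # A 200 response containing a real product page is the success case;
        # anything else (503, captcha interstitial) is retried after a pause.
        if page.status_code == 200 and b'productTitle' in page.content:
            return page
        sleep(5)
    return None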

How to web scrape Instagram profile links with BeautifulSoup?

I'm just starting to learn web scraping with BeautifulSoup and want to write a simple program that gets the profile links (Instagram URLs) of my idols from their full names.
Example: I have a list of full names stored in the file fullname.txt as follows:
#cat fullname.txt
Cristiano Ronaldo
David Beckham
Michael Jackson
My desired result is:
https://www.instagram.com/cristiano/
https://www.instagram.com/davidbeckham/
https://www.instagram.com/michaeljackson/
Can you give me some suggestions?
This worked for all 3 names, and a few others I added to fullname.txt.
It uses the Requests library and a Bing search to find the correct link, then uses regular expressions to parse the link out of the returned page.
import requests, re

def bingsearch(searchfor):
    link = 'https://www.bing.com/search?q={}&ia=web'.format(searchfor)
    ua = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36'}
    payload = {'q': searchfor}
    response = requests.get(link, headers=ua, params=payload)
    try:
        found = re.search('Search Results(.+?)</a>', response.text).group(1)
        iglink = re.search('a href="(.+?)"', found).group(1)
    except AttributeError:
        iglink = "link not found"
    return iglink

with open("fullname.txt", "r") as f:
    names = f.readlines()

for name in names:
    name = name.strip().replace(" ", "+")
    searchterm = name + "+instagram"
    IGLink = bingsearch(searchterm)
    print(IGLink)
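Since the question mentions BeautifulSoup: regular expressions over raw HTML are brittle (Bing can change its markup at any time), so a variation that parses the result page instead might look like the sketch below. The li.b_algo h2 a selector is an assumption about Bing's current result markup and may need adjusting:

import requests
from bs4 import BeautifulSoup

def bingsearch_bs(searchfor):
    ua = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36'}
    response = requests.get('https://www.bing.com/search', headers=ua, params={'q': searchfor})
    soup = BeautifulSoup(response.text, 'html.parser')
    # "li.b_algo h2 a" targets the title link of each organic result.
    for a in soup.select('li.b_algo h2 a'):
        href = a.get('href', '')
        if 'instagram.com' in href:
            return href
    return "link not found"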

Python 3 Urllib Posting Issues

I was hoping someone could help me with posting via urllib. My goal is to post an IP address and obtain its approximate location. I know there are many APIs for this, but my school isn't keen on having any of their computers modified in any way (this is for a comp sci class). As of right now, the code below gets me my own location, since my computer's IP is already detected by the website (I'm guessing from a header?). What I'd like is to input an IP and have its location returned. ipStr is just the IP string (in this case it's Time Warner Cable's IP in NYC). I tried setting values and submitting the data, but no matter what I set the values to, it just returns my own computer's location. Any ideas?
import re
import urllib.parse
import urllib.request

ipStr = "72.229.28.185"
url = "https://www.iplocation.net/"
values = {'value': ipStr}
headers = {}
headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1'
data = urllib.parse.urlencode(values)
data = data.encode('utf-8')
req = urllib.request.Request(url, data=data, headers=headers)
resp = urllib.request.urlopen(req)
page = str(resp.read())
npattern = "Google Map for"
nfound = re.search(npattern, page)
ns = nfound.start()
ne = nfound.end()
location = ""
while page[ne:ne + 1] != "(":
    location += page[ne:ne + 1]
    ne += 1
You just need to change the parameter name from value to query, for example:
values = {'query': ipStr}
If you look at the name of the input field on the page (https://www.iplocation.net/) you'll see the field's name is query.
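Putting it together, a minimal sketch of the corrected request (same scraping approach as in the question, only the form field name changes):

import urllib.parse
import urllib.request

ipStr = "72.229.28.185"
url = "https://www.iplocation.net/"
values = {'query': ipStr}  # the form field on the page is named "query"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1'}
data = urllib.parse.urlencode(values).encode('utf-8')
req = urllib.request.Request(url, data=data, headers=headers)
page = str(urllib.request.urlopen(req).read())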
