I am new to Python programming and I am getting my hands dirty by working on a pet project.
I have tried hard to avoid these nested for loops, but without success.
import requests
import json

r = requests.get('https://api.coinmarketcap.com/v1/ticker/')
j = r.json()

for item in j:
    n = item['id']
    url = 'https://api.coinmarketcap.com/v1/ticker/%s' % n
    req = requests.get(url)
    js = req.json()
    for cool in js:
        print n
        print cool['rank']
Please let me know if more information is needed.
Question
I have too many loops in loops and want a Pythonic way of cleaning it up
Answer
Yes, there is a Pythonic way of cleaning up loops-in-loops to make them look better, but there will still be loops-in-loops under the covers.
import requests
import json

r = requests.get('https://api.coinmarketcap.com/v1/ticker/')
j = r.json()

id_list = [item['id'] for item in j]

for n in id_list:
    url = 'https://api.coinmarketcap.com/v1/ticker/%s' % n
    req = requests.get(url)
    js = req.json()
    print "\n".join([n + "\n" + item['rank'] for item in js])
Insight from running this
After running this specific code, I realize that you are actually first retrieving the list of tickers in order of rank using
r = requests.get('https://api.coinmarketcap.com/v1/ticker/')
and then using
url = 'https://api.coinmarketcap.com/v1/ticker/%s' %n
to get the rank.
So long as https://api.coinmarketcap.com/v1/ticker/ continues to return the items in order of rank, you could simplify your code like so:
import requests
import json

r = requests.get('https://api.coinmarketcap.com/v1/ticker/')
j = r.json()

id_list = [item['id'] for item in j]
result = zip(id_list, range(1, len(id_list) + 1))

for item in result:
    print item[0]
    print item[1]
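As a side note, enumerate expresses the same (id, rank) pairing a bit more directly than zip with a range; a small equivalent sketch:
for rank, coin_id in enumerate(id_list, 1):
    print coin_id
    print rank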
Answer to additional question
Additional question: What if I want one more parameter, say price_usd?
for cool in js:
    print n
    print cool['rank']
    print cool['price_usd']
Answer:
change the line
print "\n".join([n + "\n" + item['rank'] for item in js])
to
print "\n".join([n + "\n" + item['rank'] + "\n" + item['price_usd'] for item in js])
(note that the loop variable in the list comprehension is item, not cool, so use item['price_usd'] here).
Your first request already gets you everything you need.
import requests
import json

response = requests.get('https://api.coinmarketcap.com/v1/ticker/')
coin_data = response.json()

for coin in coin_data:
    print coin['id']         # "bitcoin", "ethereum", ...
    print coin['rank']       # "1", "2", ...
    print coin['price_usd']  # "2834.75", "276.495", ...
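If you later need to look up a single coin without making another request, you could also index that one response by id; a minimal sketch using the coin_data list from above:
coins_by_id = {coin['id']: coin for coin in coin_data}  # one dict keyed by coin id
print coins_by_id['bitcoin']['rank']       # "1"
print coins_by_id['bitcoin']['price_usd']  # "2834.75"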
Related
I'm scraping data from the World Bank for a paper, and I'm trying to build a loop over several indicators, but I can't get it to work past a certain part of the code. I hope someone can help.
# Single request for one indicator
import requests, json
import pandas as pd

indicator = 'SP.POP.TOTL?date=2000:2020'
url = "http://api.worldbank.org/v2/countries/all/indicators/%s&format=json&per_page=5000" % indicator
response = requests.get(url)
print(response)
result = response.content
result = json.loads(result)
pop_total_df = pd.DataFrame.from_dict(result[1])
This is the loop I'm trying to build, but I get an error in the last part of the code below:
# indicator list
indicator = {'FP.CPI.TOTL.ZG?date=2000:2020', 'SP.POP.TOTL?date=2000:2020'}

# list of urls with the indicators
url_list = []
for i in indicator:
    url = "http://api.worldbank.org/v2/countries/all/indicators/%s&format=json&per_page=5000" % i
    url_list.append(url)

result_list = []
for i in url_list:
    response = requests.get(i)
    print(response)
    result_list.append(response.content)

# Erroneous code
result_json = []
for i in range(3):
    result_json.append(json.loads(result_list[i]))
As you are making 2 requests (FP.CPI.TOTL.ZG?date=2000:2020 and SP.POP.TOTL?date=2000:2020), your result_list has length 2, so its valid indices are 0 and 1. Use range(2) or range(len(result_list)) instead:
import requests, json

# indicator list
indicator = {'FP.CPI.TOTL.ZG?date=2000:2020', 'SP.POP.TOTL?date=2000:2020'}

# list of urls with the indicators
url_list = []
for i in indicator:
    url = "http://api.worldbank.org/v2/countries/all/indicators/%s&format=json&per_page=5000" % i
    url_list.append(url)

result_list = []
for i in url_list:
    response = requests.get(i)
    print(response)
    result_list.append(response.content)

# Fixed code
result_json = []
for i in range(len(result_list)):
    result_json.append(json.loads(result_list[i]))
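If you then want DataFrames like in your single-indicator snippet, you could build one per response from result_json in the same way; a sketch, assuming each response keeps the [metadata, records] structure your original code relied on:
import pandas as pd

# one DataFrame per indicator, taking the records part of each response
df_list = [pd.DataFrame.from_dict(res[1]) for res in result_json]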
I'm requesting 590 pages from the Meetup API. I've iterated with a while loop to get the page URLs. Now that I have them, I need to request each page and format the results correctly in Python so I can place them into a Pandas DataFrame.
This is how it looks when you do it for one URL:
url = ('https://api.meetup.com/2/groups?offset=1&format=json&category_id=34&photo-host=public&page=100&radius=200.0&fields=&order=id&desc=false&sig_id=243750775&sig=768bcf78d9c73937fcf2f5d41fe6070424f8d0e3')
r = requests.get(url).json()
data = pd.io.json.json_normalize(r['results'])
But because I have so many pages I want to do this automatically and iterate through them all.
That's how nested while loops came to mind and this is what I tried:
urls = 0
offset = 0
url = 'https://api.meetup.com/2/groups?offset=%d&format=json&category_id=34&photo-host=public&page=100&radius=200.0&fields=&order=id&desc=false&sig_id=243750775&sig=768bcf78d9c73937fcf2f5d41fe6070424f8d0e3'
r = requests.get(urls%d = 'https://api.meetup.com/2/groups?offset=%d&format=json&category_id=34&photo-host=public&page=100&radius=200.0&fields=&order=id&desc=false&sig_id=243750775&sig=768bcf78d9c73937fcf2f5d41fe6070424f8d0e3').json()
while urlx < 591:
    new_url = r % urls % offset
    print(new_url)
    offset += 1
However, it isn't working and I'm receiving many errors including this one:
SyntaxError: keyword can't be an expression
Not sure what you're trying to do, and the code has lots of issues.
But if you just want to loop through 0 to 591 and fetch URLs, then here's the code:
import requests
import pandas as pd

dfs = []
base_url = 'https://api.meetup.com/2/groups?offset=%d&format=json&category_id=34&photo-host=public&page=100&radius=200.0&fields=&order=id&desc=false&sig_id=243750775&sig=768bcf78d9c73937fcf2f5d41fe6070424f8d0e3'

for i in range(0, 592):
    url = base_url % i
    r = requests.get(url).json()
    print("Fetching URL: %s\n" % url)
    # do something with r here
    # here I'll append it to a list of dfs
    dfs.append(pd.io.json.json_normalize(r['results']))
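If the goal is one combined table at the end, a small follow-up (not part of the original snippet) is to concatenate the collected frames:
all_groups = pd.concat(dfs, ignore_index=True)  # stack the per-page frames into one DataFrame
print(all_groups.shape)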
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on&otracker=start&page=1&q=laptop&sid=6bo%2Fb5g&viewType=list")
c = r.content
soup = BeautifulSoup(c, "html.parser")

all = soup.find_all("div", {"class": "col _2-gKeQ"})
page_nr = soup.find_all("a", {"class": "_33m_Yg"})[-1].text
print(page_nr, "number of pages were found")

#all[0].find("div",{"class":"_1vC4OE _2rQ-NK"}).text
l = []
base_url = "https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on&otracker=start&page=1&q=laptop&sid=6bo%2Fb5g&viewType=list"
for page in range(0, int(page_nr) * 10, 10):
    print()
    r = requests.get(base_url + str(page) + ".html")
    c = r.content
    #c=r.json()["list"]
    soup = BeautifulSoup(c, "html.parser")
    for item in all:
        d = {}
        #price
        d["Price"] = item.find("div", {"class": "_1vC4OE _2rQ-NK"}).text
        #Name
        d["Name"] = item.find("div", {"class": "_3wU53n"}).text
        for li in item.find_all("li", {"class": "_1ZRRx1"}):
            if " EMI" in li.text:
                d["EMI"] = li.text
            else:
                d["EMI"] = None
        for li1 in item.find_all("li", {"class": "_1ZRRx1"}):
            if "Special " in li1.text:
                d["Special Price"] = li1.text
            else:
                d["Special Price"] = None
        for val in item.find_all("li", {"class": "tVe95H"}):
            if "Display" in val.text:
                d["Display"] = val.text
            elif "Warranty" in val.text:
                d["Warrenty"] = val.text
            elif "RAM" in val.text:
                d["Ram"] = val.text
        l.append(d)

import pandas
df = pandas.DataFrame(l)
This might work on standard pagination
i = 1
items_parsed = set()
loop = True
base_url = "https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on&otracker=start&page={}&q=laptop&sid=6bo%2Fb5g&viewType=list"

while True:
    page = requests.get(base_url.format(i))
    items = requests.get(#yourelements#)
    if not items:
        break
    for item in items:
        # Scrape your item; once the scrape succeeds, store the URL of the
        # parsed item in url_parsed (details below the code), for example:
        url_parsed = your_stuff(item)
        if url_parsed in items_parsed:
            loop = False
        items_parsed.add(url_parsed)
    if not loop:
        break
    i += 1
I formatted your URL so that page=X is filled in with base_url.format(i); the loop can then keep going until no items are found on a page. On some sites, going past the last page (max_page + 1) just returns page 1 again instead of an empty result.
If that happens and you start getting items you have already parsed, you can declare a set(), put the URL of every item you parse into it, and then check whether an item has already been parsed before continuing.
Note that this is just an idea.
Since the page number in the URL is almost in the middle, I'd apply a similar change to your code:
base_url="https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on&otracker=start&page="
end_url ="&q=laptop&sid=6bo%2Fb5g&viewType=list"
for page in range(1, page_nr + 1):
r=requests.get(base_url+str(page)+end_url+".html")
You only have access to the first 10 pages from the initial URL.
You can make a loop from "&page=1" to "&page=26"; a sketch of that idea follows.
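A minimal sketch of that loop, assuming the same search URL (with the page parameter moved to the end of the query string; query-parameter order normally does not matter) and that page 26 really is the last one:
import requests
from bs4 import BeautifulSoup

base_url = ("https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on"
            "&otracker=start&q=laptop&sid=6bo%2Fb5g&viewType=list&page=")

for page in range(1, 27):  # pages 1 through 26
    soup = BeautifulSoup(requests.get(base_url + str(page)).content, "html.parser")
    # parse the laptops on this page exactly as in the code above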
I need help turning a for loop into a while loop that only prints/logs differences/changes to an XML file.
This is the code I have so far.
import requests
from bs4 import BeautifulSoup

url = "https://www.ruvilla.com/media/sitemaps/sitemap.xml"
r = requests.get(url)
soup = BeautifulSoup(r.content)

for url in soup.find_all("url"):
    titlenode = url.find("loc")
    if titlenode:
        title = titlenode.text
        loc = url.find("loc").text
        lastmod = url.find("lastmod").text
        print title + "\n" + lastmod
For your current use case, a for loop works best. However, if you really want to make it into a while loop, you can do that like so:
urls = soup.find_all("url")
counter = 0
while counter < len(urls):
    url = urls[counter]
    # Your code here
    counter += 1
If I understood your question properly, you are trying to log only the URLs that have a lastmod element. For this a for loop works better than a while loop, because it ends the iteration automatically when the end of the list is reached, whereas with a while loop you have to handle that explicitly with a check like i < len(urls). You can consider the below:
import time
import requests
from bs4 import BeautifulSoup

url = "https://www.ruvilla.com/media/sitemaps/sitemap.xml"

while True:  # loop infinitely, re-checking the sitemap on each pass
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    for entry in soup.find_all('url'):
        lastmod_node = entry.find("lastmod")
        if not lastmod_node:
            continue
        lastmod = lastmod_node.text
        titlenode = entry.find("loc")
        if titlenode:
            title = titlenode.text
            loc = titlenode.text
            print title + "\n" + lastmod
    time.sleep(1)
The lastmod check (or, equivalently, a try-except block) ensures that the details are printed only when lastmod exists; otherwise the entry is simply ignored and we move on to the next URL. Hope this helps. Cheers.
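For completeness, a minimal sketch of the try-except variant: if an entry has no lastmod, calling .text on the None returned by find() raises AttributeError and the entry is skipped.
for entry in soup.find_all("url"):
    try:
        loc = entry.find("loc").text
        lastmod = entry.find("lastmod").text
    except AttributeError:
        continue  # this entry is missing <loc> or <lastmod>
    print loc + "\n" + lastmod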
I am working on a web scraper for the first time, and I am using Beautiful Soup to parse a JSON file and return several attributes that I send to a CSV.
The status variable, in the JSON array, is a binary value (0/1). I'd like to return only arrays that have a 0 for status. Is it feasible to do that?
"""soup = BeautifulSoup(html)
table = soup.find()
print soup.prettify()"""
js_data = json.loads(html)
Attraction = []
event = []
status = []
for doc in js_data["response"]["docs"]:
Attraction.append(doc["Attraction"])
event.append(doc["PostProcessedData"]["Onsales"]["event"]["date"])
status.append(doc["PostProcessedData"]["Onsales"]["status"])
with open("out.csv","w") as f:
datas = zip(Attraction,event,status)
keys = ["Attraction","event","status"]
f.write(";".join(keys))
for data in datas:
f.write(",".join([str(k).replace(",",";").replace("<br>"," ") for k in data]))
f.write("\n")
I might be missing something, but maybe this helps:
for doc in js_data["response"]["docs"]:
    if doc["PostProcessedData"]["Onsales"]["status"] == "0":
        Attraction.append(doc["Attraction"])
        event.append(doc["PostProcessedData"]["Onsales"]["event"]["date"])
        status.append(doc["PostProcessedData"]["Onsales"]["status"])
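For the CSV step itself, the csv module keeps the delimiter consistent (the current code joins the header with ";" but the rows with ","); a minimal sketch, assuming the same three columns:
import csv

with open("out.csv", "w") as f:  # on Python 2 use mode "wb"; on Python 3 add newline=""
    writer = csv.writer(f)
    writer.writerow(["Attraction", "event", "status"])
    writer.writerows(zip(Attraction, event, status))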