In Python 3, I want to load this_file, which is in JSON format.
Basically, I want to do something like [pseudocode]:
>>> read_from_url = urllib.some_method_open(this_file)
>>> my_dict = json.load(read_from_url)
>>> print(my_dict['some_key'])
some value
You were close:
import requests
import json
response = json.loads(requests.get("your_url").text)
Just use the json and requests modules:
import requests, json
content = requests.get("http://example.com")
# Use a name other than json so the module is not shadowed
data = json.loads(content.content)
Or using the standard library:
from urllib.request import urlopen
import json
data = json.loads(urlopen(url).read().decode("utf-8"))
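A slightly fuller sketch of the standard-library route, in case it helps. It separates decoding from fetching so the parsing step can be tried without a network connection. The helper names decode_json and load_json_from_url, the timeout value, and the sample payload are illustrative assumptions, not part of the question.

```python
import json
from urllib.request import urlopen


def decode_json(raw_bytes, encoding="utf-8"):
    """Decode raw response bytes and parse them as JSON."""
    return json.loads(raw_bytes.decode(encoding))


def load_json_from_url(url, timeout=10):
    """Fetch url and parse the body as JSON (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset announced by the server, falling back to UTF-8.
        encoding = response.headers.get_content_charset() or "utf-8"
        return decode_json(response.read(), encoding)


# Offline demonstration with made-up bytes standing in for a response body.
sample = b'{"some_key": "some value"}'
my_dict = decode_json(sample)
print(my_dict["some_key"])
```

With a real endpoint, my_dict = load_json_from_url(this_file) should give the same kind of dictionary.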
So you want to look up specific values by key? If I understand what you want to do, this should help you get started. You will need urllib.request and json, which ship with Python 3 (urllib2 exists only in Python 2), plus bs4, which you can install with pip install beautifulsoup4.
from urllib.request import urlopen
import json
from bs4 import BeautifulSoup
url = urlopen("https://www.govtrack.us/data/congress/113/votes/2013/s11/data.json")
content = url.read()
soup = BeautifulSoup(content, "html.parser")
newDictionary=json.loads(str(soup))
I used a commonly used url to practice with.
So basically I am stuck: I don't know how to get the URL out of the data extracted from a website.
Here is my code:
import requests
from bs4 import BeautifulSoup
req = requests.get('https://api.randomtube.xyz/video.get?chan=2ch.hk&board=b&page=1')
soup = BeautifulSoup(req.content, "html.parser")
print(soup.prettify())
I get a lot of information on output, but the only thing I need is the url, I hope someone can help me.
P.S:
It gives me this information:
{"response":{"items":[{"url":"https:\/\/2ch.hk\/b\/src\/262671212\/16440825183970.webm","type":"video\/webm","filesize":"20259","width":1280,"height":720,"name":"1521967932778.webm","board":"b","thread":"262671212"},{"url":"https:\/\/2ch.hk\/b\/src\/261549765\/16424501976450.webm","type":"video\/webm","filesize":"12055","width":1280,"height":720,"name":"1526793203110.webm","board":"b","thread":"261549765"}...
But I only need this part out of everything:
https:\/\/2ch.hk\/b\/src\/261549765\/16424501976450.webm (Not exactly this url, but just as an example)
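You can do it this way:
# soup is not a dictionary, so take the parsed JSON from the response instead
data = req.json()
url_array = []
for item in data['response']['items']:
    url_array.append(item['url'])
Since the API returns JSON, it is better to parse it directly than to pass it through BeautifulSoup.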
The URL returns JSON data. BeautifulSoup is an HTML parser and cannot read JSON, so parse the response as JSON instead, as in this example:
import requests
data = requests.get('https://api.randomtube.xyz/video.get?chan=2ch.hk&board=b&page=1').json()
url = data['response']['items'][0]['url']
if url:
    url = url.replace('.webm', '.mp4')
print(url)
Output:
https://2ch.hk/b/src/263361969/16451225633240.mp4
The problem is that you are telling BeautifulSoup to parse JSON data as HTML. You can get the URL you need more directly with the following code:
import json
import requests
req = requests.get('https://api.randomtube.xyz/video.get?chan=2ch.hk&board=b&page=1')
data = json.loads(req.content)
my_url = data['response']['items'][0]['url']
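To collect every URL rather than only the first one, something along these lines may work. It is a sketch that runs offline on a sample copied from the output in the question, cut down to two items and closed by hand so that it is valid JSON; the function name extract_urls is made up for illustration. It also shows that the \/ sequences in the raw output are ordinary JSON escapes, which json.loads turns back into plain slashes.

```python
import json

# Sample copied from the output shown in the question, cut down to two items
# and closed by hand so that it is valid JSON (an assumption about the shape).
sample = r'''{"response":{"items":[
{"url":"https:\/\/2ch.hk\/b\/src\/262671212\/16440825183970.webm","type":"video\/webm","name":"1521967932778.webm"},
{"url":"https:\/\/2ch.hk\/b\/src\/261549765\/16424501976450.webm","type":"video\/webm","name":"1526793203110.webm"}
]}}'''


def extract_urls(payload):
    """Return the "url" of every item, skipping items that lack one."""
    items = payload.get("response", {}).get("items", [])
    return [item["url"] for item in items if "url" in item]


urls = extract_urls(json.loads(sample))
for url in urls:
    print(url)
```

Against the live API, passing requests.get(...).json() to extract_urls in place of json.loads(sample) should behave the same way, assuming the response keeps this shape.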
I have a web page and I want to get the <div class="password"> element using urllib2 in Python, without using Beautiful Soup.
My code so far:
import urllib.request as urllib2
link = "http://www.chiquitooenterprise.com/password"
response = urllib2.urlopen('http://www.chiquitooenterprise.com/')
contents = response.read('password')
It gives an error.
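read() takes an optional byte count, not a string, which is why read('password') fails. Call read() without arguments and decode() the result as UTF-8, the encoding reported in the browser's Network tab. Hence: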
import urllib.request as urllib2
link = "http://www.chiquitooenterprise.com/password"
response = urllib2.urlopen('http://www.chiquitooenterprise.com/')
output = response.read().decode('utf-8')
print(output)
OUTPUT:
YOIYEDGXPU
You say you don't want bs4, but you could use requests:
import requests
r = requests.get('http://www.chiquitooenterprise.com/password')
print(r.text)
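If the page does wrap the value in a <div class="password"> element, the standard library's html.parser can pull it out without Beautiful Soup. The following is only a sketch: the class name DivTextExtractor and the sample markup are assumptions made for illustration, since the real page source is not shown here.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text found inside <div> elements with a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # greater than 0 while inside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        # Count nested divs too, so the matching close tag is found.
        if self.depth or self.css_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


# Hypothetical markup standing in for the decoded response body.
sample_html = (
    '<html><body><div class="password">YOIYEDGXPU</div>'
    "<div>other</div></body></html>"
)
parser = DivTextExtractor("password")
parser.feed(sample_html)
password = "".join(parser.chunks).strip()
print(password)
```

In practice the string passed to feed() would be response.read().decode('utf-8') from urlopen.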
from urllib.request import urlopen
from bs4 import BeautifulSoup
apikey='*****d2deb67f650f022ae13d07*****'
first='http://api.ipstack.com/'
ip='134.201.250.155'
third='?access_key='
print(first+ip+third+apikey)
html=urlopen(first+ip+third+apikey)
soup=BeautifulSoup(html,"html.parser")
print(soup)
I had to hide the first and last 5 digits of my API key. Anyway, this gives:
{"ip":"134.201.250.155","type":"ipv4","continent_code":"NA","continent_name":"North America","country_code":"US","country_name":"United States","region_code":"CA","region_name":"California","city":"La Jolla","zip":"92037","latitude":32.8455,"longitude":-117.2521,"location":{"geoname_id":5363943,"capital":"Washington D.C.","languages":[{"code":"en","name":"English","native":"English"}],"country_flag":"http:\/\/assets.ipstack.com\/flags\/us.svg","country_flag_emoji":"\ud83c\uddfa\ud83c\uddf8","country_flag_emoji_unicode":"U+1F1FA U+1F1F8","calling_code":"1","is_eu":false}}
This gives me a soup object. What do I need to add to get country_name, geoname_id, and ip into a list so I can write them to a .json file later?
This looks like a JSON response, so you need to parse it with the json library:
import json
parsed_json = json.loads(str(soup))
geoname_id = parsed_json['location']['geoname_id']
country_name = parsed_json['country_name']
ip = parsed_json['ip']
A better solution when dealing with REST APIs that return JSON responses would be:
import requests
apikey='*****d2deb67f650f022ae13d07*****'
first='http://api.ipstack.com/'
ip='134.201.250.155'
query_string = {'access_key': apikey}
res = requests.get(first + ip, params=query_string)
res.raise_for_status()
ip = res.json()['ip']
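To get the three requested values into a list and write them to a .json file, a sketch like the following might do. It works on a trimmed copy of the response shown in the question so that it runs offline; the function name pick_fields and the output file name ip_info.json are assumptions made for illustration.

```python
import json
import os
import tempfile

# Trimmed copy of the response shown in the question.
sample = (
    '{"ip":"134.201.250.155","country_name":"United States",'
    '"location":{"geoname_id":5363943}}'
)


def pick_fields(payload):
    """Return [country_name, geoname_id, ip] from an ipstack-style payload."""
    return [
        payload["country_name"],
        payload["location"]["geoname_id"],
        payload["ip"],
    ]


record = pick_fields(json.loads(sample))

# Hypothetical output location; any writable path works.
out_path = os.path.join(tempfile.gettempdir(), "ip_info.json")
with open(out_path, "w", encoding="utf-8") as fh:
    json.dump(record, fh)

print(record)
```

With a live request, pick_fields(res.json()) would replace the json.loads(sample) call.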
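BeautifulSoup does not expose JSON keys as attributes: soup.ip looks for an <ip> tag and returns None for this response. Take the text out of the soup and parse it as JSON instead:
import json
soup = BeautifulSoup(html,"html.parser")
print(json.loads(soup.get_text())["ip"])
>>> 134.201.250.155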
Let me know if you need further help!
I am trying to import a list of URLs and grab pn2 and main1 from each. The code runs when I don't import the file, so I know it works, but I have no idea how to handle the import. Here is what I have tried most recently, and below it is a small portion of the URLs. Thanks in advance.
import urllib
import urllib.request
import csv
from bs4 import BeautifulSoup
csvfile = open("ecco1.csv")
csvfilelist = csvfile.read()
theurl="csvfilelist"
soup = BeautifulSoup(theurl,"html.parser")
for row in csvfilelist:
    for pn in soup.findAll('td',{"class":"productText"}):
        pn2.append(pn.text)
    for main in soup.find_all('div',{"class":"breadcrumb"}):
        main1 = main.text
    print (main1)
    print ('\n'.join(pn2))
Urls:
http://www.eccolink.com/products/productresults.aspx?catId=2458
http://www.eccolink.com/products/productresults.aspx?catId=2464
http://www.eccolink.com/products/productresults.aspx?catId=2435
http://www.eccolink.com/products/productresults.aspx?catId=2446
http://www.eccolink.com/products/productresults.aspx?catId=2463
From what I see, you are opening a CSV file and passing its contents to BeautifulSoup.
That will not work.
BeautifulSoup parses HTML, not CSV.
The rest of your code looks correct, provided you pass HTML to bs4.
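from bs4 import BeautifulSoup
import requests
links = []
# Pass the response text, not the Response object, to BeautifulSoup
html = requests.get('http://www.example.com').text
soup = BeautifulSoup(html, 'html.parser')
# Open the file for writing; the with block closes it automatically
with open('links.txt', 'w') as file:
    for x in soup.find_all('a', {"class": "abc"}):
        links.append(x)
        file.write(str(x) + '\n')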
Above is a very basic implementation of how to find a target element in the HTML and write it to a file or append it to a list. Use Requests rather than urllib; it is a more modern library with a simpler API.
If you want to read your input from a CSV file, my suggestion is to use csv.reader from the csv module.
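As a rough sketch of that suggestion, the snippet below reads the URL list with csv.reader. An in-memory file stands in for ecco1.csv so the example runs on its own, and it assumes one URL per row in the first column. The function name read_urls is made up for illustration.

```python
import csv
import io

# Stand-in for open("ecco1.csv", newline=""); the URLs are taken from the
# question and assumed to sit one per row in the first column.
csv_file = io.StringIO(
    "http://www.eccolink.com/products/productresults.aspx?catId=2458\n"
    "http://www.eccolink.com/products/productresults.aspx?catId=2464\n"
    "http://www.eccolink.com/products/productresults.aspx?catId=2435\n"
)


def read_urls(handle):
    """Return the first column of every non-empty CSV row."""
    return [
        row[0].strip()
        for row in csv.reader(handle)
        if row and row[0].strip()
    ]


urls = read_urls(csv_file)
for url in urls:
    # Each url can now be fetched and handed to BeautifulSoup in turn.
    print(url)
```

Inside the loop, the page for each URL would be downloaded first and only its HTML passed to BeautifulSoup, never the CSV text itself.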
Hope that helps.
New to Python, have a simple, situational question:
Trying to use BeautifulSoup to parse a series of pages.
from bs4 import BeautifulSoup
import urllib.request
BeautifulSoup(urllib.request.urlopen('http://bit.ly/'))
Traceback ...
html.parser.HTMLParseError: expected name token at '<!=KN\x01...
Working on Windows 7 64-bit with Python 3.2.
Do I need Mechanize? (which would entail Python 2.X)
If that URL is correct, you're asking why an HTML parser throws an error parsing an MP3 file. I believe the answer to this to be self-evident...
If you were trying to download that MP3, you could do something like this:
# urllib2 is Python 2 only; on Python 3 use urllib.request
from urllib.request import urlopen
BLOCK_SIZE = 16 * 1024
req = urlopen("http://bit.ly/xg7enD")
# Make sure to write as a binary file
fp = open("someMP3.mp3", 'wb')
try:
    while True:
        data = req.read(BLOCK_SIZE)
        if not data:
            break
        fp.write(data)
finally:
    fp.close()
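To avoid handing a non-HTML download to the parser in the first place, the Content-Type header can be checked before parsing. This is a minimal sketch under the assumption that the server reports a sensible Content-Type; the helper names looks_like_html and fetch_html and the timeout value are hypothetical.

```python
from urllib.request import urlopen


def looks_like_html(content_type):
    """Return True if a Content-Type header value names an HTML media type."""
    media_type = (content_type or "").split(";")[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


def fetch_html(url, timeout=10):
    """Return the body of url, refusing anything that is not HTML."""
    with urlopen(url, timeout=timeout) as response:
        content_type = response.headers.get("Content-Type")
        if not looks_like_html(content_type):
            raise ValueError(f"expected HTML, got {content_type!r}")
        return response.read()


# Offline checks of the header logic.
print(looks_like_html("text/html; charset=utf-8"))
print(looks_like_html("audio/mpeg"))
```

The bytes returned by fetch_html can then be passed to BeautifulSoup, while an MP3 link raises a ValueError instead of a parser traceback.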
If you want to download a file in Python, you can use this as well (on Python 3, urlretrieve lives in urllib.request):
import urllib.request
urllib.request.urlretrieve("http://bit.ly/xg7enD","myfile.mp3")
It will save the file in the current working directory under the name "myfile.mp3".
I have been able to download all types of files with it.
Hope it helps!
Instead of urllib.request, I suggest using requests and its get() function:
from requests import get
from bs4 import BeautifulSoup
soup = BeautifulSoup(
    get(url="http://www.google.com").content,
    'html.parser'
)