I have code like this:
import requests
user_agent_url = 'http://www.user-agents.org/allagents.xml'
xml_data = requests.get(user_agent_url).content
This parses an online XML file into xml_data. How can I parse it from a file on my local disk instead? I tried replacing the URL with a local path, but got an error:
raise InvalidSchema("No connection adapters were found for '%s'" % url)
InvalidSchema: No connection adapters were found
What has to be done?
Note that the code you quote does NOT parse the file - it simply puts the XML data into xml_data. The equivalent for a local file doesn't need to use requests at all: simply write
with open("/path/to/XML/file") as f:
    xml_data = f.read()
If you are determined to use requests then see this answer for how to write a file URL adapter.
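If all you need is a quick stand-in for such an adapter, the standard library's urllib.request already understands file:// URLs (the file name and XML content below are made up for the demonstration):

```python
import os
import tempfile
from urllib.request import pathname2url, urlopen

# Write a small XML file to fetch back (hypothetical content)
path = os.path.join(tempfile.gettempdir(), 'allagents_demo.xml')
with open(path, 'wb') as f:
    f.write(b'<agents><agent>demo</agent></agents>')

# urlopen() accepts file:// URLs out of the box, no adapter required
xml_data = urlopen('file:' + pathname2url(path)).read()
print(xml_data)
```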
You can read the file content using the built-in open() function and then parse it with the xml.etree.ElementTree module's XML() function.
It returns an Element object that you can loop over.
Example
from xml.etree.ElementTree import XML

content = open("file.xml").read()
etree = XML(content)
print(etree.text, etree.attrib, list(etree))
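As a variant on the above (the file name and XML content here are invented for the demo), ElementTree's parse() also accepts a filename directly, so the separate read() step is optional; note that getchildren() was removed in Python 3.9, so iterate the element instead:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Create a small demo XML file (hypothetical content)
path = os.path.join(tempfile.gettempdir(), 'demo.xml')
with open(path, 'w', encoding='utf-8') as f:
    f.write('<root><child>hi</child><child>there</child></root>')

# parse() reads straight from the file and returns an ElementTree
tree = ET.parse(path)
root = tree.getroot()
for child in root:  # iterating replaces the removed getchildren()
    print(child.tag, child.text)
```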
Related
I want to download text files using Python. How can I do so?
I used urlopen(url).read(), but it gives me the bytes representation of the file.
For me, I had to do the following (Python 3):
from urllib.request import urlopen
data = urlopen("[your url goes here]").read().decode('utf-8')
# Do what you need to do with the data.
You have multiple options.
For the simplest solution, you can use urllib:
import urllib.request

file_url = 'https://someurl.com/text_file.txt'
for line in urllib.request.urlopen(file_url):
    print(line.decode('utf-8'))
For a requests-based solution:
import requests

file_url = 'https://someurl.com/text_file.txt'
response = requests.get(file_url)
if response.status_code == 200:
    for line in response.text.split('\n'):
        print(line)
When downloading text files with Python, I like to use the wget module:
import wget
remote_url = 'https://www.google.com/test.txt'
local_file = 'local_copy.txt'
wget.download(remote_url, local_file)
If that doesn't work, try using urllib:
from urllib import request
remote_url = 'https://www.google.com/test.txt'
file = 'copy.txt'
request.urlretrieve(remote_url, file)
When you use the requests module, you read the file directly from the internet, which is why you see the text in byte format. Try writing the text to a file, then open it on your desktop to view it:
import requests

remote_url = 'https://test.com/test.txt'
local_file = 'local_file.txt'
data = requests.get(remote_url)
with open(local_file, 'wb') as file:
    file.write(data.content)
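The "bytes representation" from the question is simply undecoded data; a minimal local sketch (no network, made-up bytes) showing that .decode() is all that separates bytes from text:

```python
# Fake the raw bytes a download would return, then decode them to text
raw = b'first line\nsecond line\n'
text = raw.decode('utf-8')
print(type(raw).__name__, type(text).__name__)
print(text.splitlines())
```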
I've been reviewing examples of how to parse HTML from websites using XPath and lxml. For some reason, when I try it with a local file I keep running into this error.
AttributeError: 'str' object has no attribute 'content'
This is the code
with open(r'H:\Python\Project\File','r') as f:
    file = f.read()
    f.close()
tree = html.fromstring(file.content)
You have a few problems with your code. It looks like you are adapting code that parses HTML from an http/https request; in that case, .content extracts the bytes from the response object.
However, when reading from a file, your with block already reads in the contents of the file as a string. Also, you don't need to call .close(); the context manager takes care of that for you.
Try this:
with open(r'H:\Python\Project\File','r') as f:
    tree = html.fromstring(f.read())
Try encoding='utf-8'
f1 = open(new_file + '.html', 'r', encoding="utf-8")
With Python 3, I want to read an XML web page and save it to my local drive.
Also, if the file already exists, it must be overwritten.
I tried a script like this:
import urllib.request
xml = urllib.request.urlopen('URL')
data = xml.read()
file = open("file.xml","wb")
file.writelines(data)
file.close()
But I get this error:
TypeError: a bytes-like object is required, not 'int'
First suggestion: do what even the official urllib docs say and use requests instead.
Your problem is that you used .writelines(), which expects a list of lines, not a bytes object (for once in Python, the error message is not very helpful). Use .write() instead:
import requests
resp = requests.get('URL')
with open('file.xml', 'wb') as foutput:
    foutput.write(resp.content)
I found a solution :
from urllib.request import urlopen
xml = open("import.xml", "w")  # "w" creates the file if needed and truncates it, so an existing file is overwritten
xml.write(urlopen('URL').read().decode('utf-8'))
xml.close()
Thanks for your help.
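One caveat with this solution: the open mode matters for the overwrite requirement. "r+" writes in place without truncating, so a shorter replacement leaves stale bytes from the old file, while "w" truncates first. A quick sketch with a made-up file:

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'overwrite_demo.xml')
with open(path, 'w', encoding='utf-8') as f:
    f.write('<old>much longer original content</old>')

# "w" truncates before writing, so the old content is gone entirely
with open(path, 'w', encoding='utf-8') as f:
    f.write('<new/>')
print(open(path, encoding='utf-8').read())
```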
I need to read the content of a remote file using Python, but I am facing some challenges. My code is below:
import subprocess
path = 'http://securityxploded.com/remote-file-inclusion.php'
subprocess.Popen(["rsync", host-ip+path],stdout=subprocess.PIPE)
for line in ssh.stdout:
line
Here I get the error NameError: name 'host' is not defined. I do not know what the host-ip value should be, because I run my Python file from the terminal (python sub.py). I need to read the content of the remote file http://securityxploded.com/remote-file-inclusion.php.
You need the urllib library. Also, your code refers to names (host, ip, ssh) that are never defined.
Try something like this:
import urllib.request
fp = urllib.request.urlopen("http://www.stackoverflow.com")
mybytes = fp.read()
mystr = mybytes.decode("utf8")
fp.close()
print(mystr)
Note: this is for Python 3.
For Python 2.7, use this:
import urllib
fp = urllib.urlopen("http://www.stackoverflow.com")
myfile = fp.read()
print myfile
If you want to read remote content via HTTP, requests and urllib2 are both good choices.
For Python 2, using requests:
import requests
resp = requests.get('http://example.com/')
print resp.text
will work.
This is the code that results in an error message:
import urllib
import xml.etree.ElementTree as ET
url = raw_input('Enter URL:')
urlhandle = urllib.urlopen(url)
data = urlhandle.read()
tree = ET.parse(data)
The error:
I'm new to Python. I did read the documentation and a couple of tutorials, but clearly I have still done something wrong. I don't believe it is the XML file itself, because the same thing happens with two different XML files.
Consider using ElementTree's fromstring():
import urllib
import xml.etree.ElementTree as ET
url = raw_input('Enter URL:')
# http://feeds.bbci.co.uk/news/rss.xml?edition=int
urlhandle = urllib.urlopen(url)
data = urlhandle.read()
tree = ET.fromstring(data)
print ET.tostring(tree, encoding='utf8', method='xml')
data is a reference to the XML content as a string, but the parse() function expects a filename or file object as its argument. That's why there is an error.
urlhandle is a file object, so tree = ET.parse(urlhandle) should work for you.
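To make the filename-versus-content distinction concrete, here is a small sketch (Python 3 syntax, invented XML) using io.BytesIO to stand in for the urlopen() response object:

```python
import io
import xml.etree.ElementTree as ET

xml_bytes = b'<rss><channel><title>demo</title></channel></rss>'

# parse() wants a filename or file object; fromstring() wants the content itself
tree = ET.parse(io.BytesIO(xml_bytes))
root = tree.getroot()
print(root.find('channel/title').text)

same_root = ET.fromstring(xml_bytes)
print(same_root.tag)
```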
The error message indicates that your code is trying to open a file whose name is stored in the variable source.
It's failing to open that file (IOError) because the variable source contains a bunch of XML, not a file name.