.content in Python

I am new to Python development and Python requests.
I have this code:
import requests
from pattern import web
import re
import pandas as pd
def list_of_prices(url):
    html = requests.get(url).text
    dom = web.DOM(html)
    prices = []
    for person in dom('.freelancer-list-item .medium.price-tag'):
        currency = person('sup')
        amount = person('span')
        prices.append([currency[0].content if currency else 'na',
                       amount[0].content if amount else 'na'])
    return prices
list_of_prices('http://www.peopleperhour.com/freelance/data+analyst#page=2')
When I run this code I get an error saying the module pattern was not found, but that's not what I'm asking about here.
Where does .content come from? Is it used only with Python requests?

You need to install the module pattern:
pip install pattern
requests has a content property and so does pattern: in requests, response.content is the raw body of an HTTP response, while in pattern a DOM element's content is the HTML inside that element. The .content in your code comes from pattern.
If you don't have pip installed, download the zip here and run the setup.py file in that directory with python setup.py install.
Note Pattern is written for Python 2.5+ (no support for Python 3 yet).
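To see the requests side of it, here is a minimal sketch (assuming only that requests is installed; example.com is a placeholder URL) showing that response.content holds the raw bytes while response.text is the decoded string:

```python
import requests

# .content is the raw response body as bytes;
# .text is the same body decoded to a str using the detected encoding.
response = requests.get("http://example.com/")
print(type(response.content))  # <class 'bytes'>
print(type(response.text))     # <class 'str'>
```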

Related

How can I list the extra features of a Python package

Some Python packages have extra features that can be installed by specifying them in square brackets, such as the security extra for the requests package:
pip install requests[security]
Is there a way to list all the extras of a given package?
I cannot find anything like that in the pip documentation.
There are two open feature requests in pip about this:
#3797 - pip show doesn't handle extras_requires
#4824 - Add support for outputting a list of extras and their requirements.
In the meantime, jaraco has provided a workaround that uses importlib_metadata's API and works for already-installed packages.
Copy-pasting it below:
An even better alternative would be to use importlib_metadata, which has an API.
>>> import importlib_metadata
>>> importlib_metadata.metadata('xonsh').get_all('Provides-Extra')
['linux', 'mac', 'proctitle', 'ptk', 'pygments', 'win']
>>> importlib_metadata.metadata('xonsh').get_all('Requires-Dist')
["distro; extra == 'linux'", "gnureadline; extra == 'mac'", "setproctitle; extra == 'proctitle'", "prompt-toolkit; extra == 'ptk'", "pygments (>=2.2); extra == 'pygments'", "win-unicode-console; extra == 'win'"]
And use packaging to parse them:
>>> import packaging.requirements
>>> req = next(map(packaging.requirements.Requirement, importlib_metadata.metadata('xonsh').get_all('Requires-Dist')))
>>> req.name
'distro'
>>> req.specifier
<SpecifierSet('')>
>>> req.extras
set()
>>> req.marker
<Marker('extra == "linux"')>
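Putting the two together, here is a small sketch that maps each extra of an installed package to the requirements gated on it. It assumes Python 3.8+ (where importlib.metadata is in the stdlib) and that packaging is installed; the helper names are made up for illustration:

```python
from importlib.metadata import metadata
from packaging.requirements import Requirement

def _gated_on(req, extra):
    # A requirement belongs to `extra` if its marker holds with that extra
    # set but not with the extra unset (this filters out base requirements
    # whose markers don't mention `extra` at all).
    if req.marker is None:
        return False
    return (req.marker.evaluate({"extra": extra})
            and not req.marker.evaluate({"extra": ""}))

def extras_with_requirements(dist_name):
    """Map each extra of an installed distribution to the names of the
    requirements gated on it."""
    md = metadata(dist_name)
    extras = md.get_all("Provides-Extra") or []
    reqs = [Requirement(r) for r in md.get_all("Requires-Dist") or []]
    return {e: [r.name for r in reqs if _gated_on(r, e)] for e in extras}

print(extras_with_requirements("packaging"))
```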

Memory error when using androguard module in Yara Rules

I tried installing Yara 3.8.1 with the androguard module. During the installation I faced this issue, so I applied the patch given by #reox to the androguard.c file, which solved the problem. After that I tried a simple Yara rule with import "androguard" from the command line and it worked perfectly. Then I wanted to use Yara rules inside my Python app, so I installed yara-python and used it this way:
import yara
dex_path = './classes.dex'
my_rule = './rule.yar'
json_data = load_json_data()
rule = yara.compile(my_rule)
matches = rule.match(filepath=dex_path, modules_data={'androguard': json_data})
print(matches)
The match function works fine with Yara rules that don't import "androguard", but when I apply a rule that imports androguard, the match function gives an error:
yara.Error: could not map file "./classes.dex" into memory
I'm applying a simple rule to a small file, on the order of a few KB. I think the problem is with the androguard module, since everything works correctly when I remove the import "androguard". Any idea?
I had the same error with androguard; I solved the problem by installing yara-python version 3.8.0:
https://github.com/VirusTotal/yara-python/releases/tag/v3.8.0

Python lxml and xslt issue

I have a problem with lxml and Python.
I have this code:
import lxml.etree as ET
xml_dom = ET.parse(xml_path)
xslt_dom = ET.parse(xslt_path)
print('transforming...')
transform = ET.XSLT(xslt_dom)
print('transformed: ', transform)
parsed_xml = transform(xml_dom)
print('all good!')
In my local environment everything works fine (Python 3.6.5 in a virtualenv with lxml 3.6.0).
The problem is that I have this code on a CentOS 7 server with the exact same specs (Python 3.6.5 and lxml 3.6.0): if I execute it from the command line, all is good, but when I put it inside a Django (2.0) project, it "freezes" on this part:
transform = ET.XSLT(xslt_dom)
No exceptions, no errors, nothing. The print below that line never executes.
I changed the file permissions to the apache group and set read permissions, and nothing works.
The weird thing is that it works fine from the console, but not under Apache + Django.
Any suggestion?
Thanks.

`document.lastModified` in Python

In Python, using an HTML parser, is it possible to get the document.lastModified property of a web page? I'm trying to retrieve the date at which the webpage/document was last modified by its owner.
A somewhat related question "I am downloading a file using Python urllib2. How do I check how large the file size is?", suggests that the following (untested) code should work:
import urllib2
req = urllib2.urlopen("http://example.com/file.zip")
last_modified = req.info().getheader('last-modified')
You might want to add a default value as the second parameter to getheader(), in case it isn't set.
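Note that the Last-Modified value is an HTTP date string, not a number, so it needs to be parsed rather than converted with int(). A minimal sketch using Python 3's stdlib (the sample value below is made up for illustration):

```python
from email.utils import parsedate_to_datetime  # Python 3.3+

# A sample Last-Modified value, formatted the way servers send it:
last_modified = "Wed, 21 Oct 2015 07:28:00 GMT"
dt = parsedate_to_datetime(last_modified)
print(dt.year, dt.month, dt.day)  # 2015 10 21
```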
You can also look for a last-modified date in the HTML code, most notably in the meta-tags. The htmldate module does just that.
Here is how it could work:
1. Install the package:
pip install -U htmldate
(pip3 or pipenv work too, your choice)
2. Retrieve a web page, parse it and output the date:
from htmldate import find_date
find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')
(disclaimer: I'm the author)

urllib2 download HTML file

Using urllib2 in Python 2.7.4, I can readily download an Excel file:
output_file = 'excel.xls'
url = 'http://www.nbmg.unr.edu/geothermal/GEOTHERM-30Jun11.xls'
file(output_file, 'wb').write(urllib2.urlopen(url).read())
This results in the expected file that I can use as I wish.
However, trying to download just an HTML file gives me an empty file:
output_file = 'webpage.html'
url = 'http://www.nbmg.unr.edu/geothermal/mapfiles/nvgeowel.html'
file(output_file, 'wb').write(urllib2.urlopen(url).read())
I had the same results using urllib. There must be something simple I'm missing or don't understand. How do I download an HTML file from a URL? Why doesn't my code work?
If you want to download files or simply save a webpage, you can use urlretrieve (from the urllib library) instead of read and write.
import urllib
urllib.urlretrieve("http://www.nbmg.unr.edu/geothermal/mapfiles/nvgeowel.html","doc.html")
#urllib.urlretrieve("url","save as..")
If you need to set a timeout, put this at the start of your file:
import socket
socket.setdefaulttimeout(25)
#seconds
I'm also on Python 2.7.4, on OS X 10.9, and your code works well for me.
So I think some other problem may be preventing it from working. Can you open "http://www.nbmg.unr.edu/geothermal/GEOTHERM-30Jun11.xls" in your browser?
This may not directly answer the question, but if you're working with HTTP and have sufficient privileges to install Python packages, I'd really recommend doing this with requests. There's a related answer here - https://stackoverflow.com/a/13137873/45698
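For reference, the same download with requests looks roughly like this (a sketch; it assumes requests is installed and uses example.com as a stand-in URL, since the original link may no longer resolve):

```python
import requests

# Fetch the page and save the raw response bytes to disk.
response = requests.get("http://example.com/")
with open("webpage.html", "wb") as f:
    f.write(response.content)
```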
