I am trying to parse email addresses out of a web page.
My code:
import urllib2
import re

site = "http://www.traidnt.net/vb/traidnt207743"
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
req = urllib2.Request(site, headers=hdr)
page = urllib2.urlopen(req)
content = page.read()
# grab every mailto: link
links = re.findall('mailto:.+?#.+.', content)
for link in links:
    # strip the leading 'mailto:' and the trailing character
    print link[7:-1]
and the result comes out like:
email1#
email2#
email3#
...
but I need to get all the emails in their complete form.
How can I do that?
Thank you!
I just added the following code to your code and it works perfectly:
regexp = re.compile(r"mailto:([a-z0-9!#$%&'*+/=?^_`{|}~-]+#[a-z0-9]+\.[a-zA-Z0-9.-]+)")
links = re.findall(regexp, content)
print links
Output:
['njm-kwt#hotmail.com', 'fnan-ksa#hotmail.com', 'k-w-t7#hotmail.com', 'coool-uae#hotmail.com', 'qsd#hotmail.de', 'o1ooo#hotmail.de', 'm-p-3#hotmail.de', 'ya7oo#hotmail.de', 'g5x#hotmail.de', 'f7t#hotmail.de', 'm2y#hotmail.de', 's2udi#hotmail.de', 'q2tar#hotmail.de', 'kuw2it#hotmail.de', 's2udi#hotmail.fr', 'qxx#hotmail.de', 'y-e-s#hotmail.de', 'y-a#hotmail.de', 'qqj#hotmail.de', 'qjj#hotmail.de', 'admin_vb#hotmail.de', 'eng-vb#hotmail.com', 'a3lantk#hotmail.com', 'a3lnkm#hotmail.com', 't7t#hotmail.de', 'mohamed_fathy41#hotmail.com', 'ox-9#hotmail.com', 'ox-9#hotmail.com']
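For reference, urllib2 exists only in Python 2; on Python 3 the same approach works with urllib.request. A minimal sketch (same site and regex as above, with decoding added because urlopen returns bytes):

import re
import urllib.request

site = "http://www.traidnt.net/vb/traidnt207743"
req = urllib.request.Request(site, headers={'User-Agent': 'Mozilla/5.0'})
# read() returns bytes in Python 3, so decode before running a str regex over it
content = urllib.request.urlopen(req).read().decode('utf-8', errors='replace')

regexp = re.compile(r"mailto:([a-z0-9!#$%&'*+/=?^_`{|}~-]+#[a-z0-9]+\.[a-zA-Z0-9.-]+)")
print(regexp.findall(content))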
You should use a dedicated library for this, such as urlinfo:
https://pypi.python.org/pypi/urlinfo
You can also contribute and create issues to make Python better ;)
I'm trying to write a script to scrape some data off a Zillow page (https://www.zillow.com/homes/for_rent/38.358484,-77.27869,38.218627,-77.498417_rect/X1-SSc26hnbsm6l2u1000000000_8e08t_sse/). Obviously I'm just trying to gather data from every listing. However, I cannot grab the data from every listing: it only finds 9 instances of the class I'm searching for ('list-card-addr'), even though I've checked the HTML of the listings it misses and the class exists there. Anyone have any ideas why this is? Here's my simple code:
from bs4 import BeautifulSoup
import requests

req_headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}

url = "https://www.zillow.com/homes/for_rent/38.358484,-77.27869,38.218627,-77.498417_rect/X1-SSc26hnbsm6l2u1000000000_8e08t_sse/"
response = requests.get(url, headers=req_headers)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
address = soup.find_all(class_='list-card-addr')
print(len(address))
The data is stored within an HTML comment. You can regex it out easily as a string defining a JavaScript object, which you can then handle with json:
import requests, re, json

r = requests.get('https://www.zillow.com/homes/for_rent/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22savedSearchEnrollmentId%22%3A%22X1-SSc26hnbsm6l2u1000000000_8e08t%22%2C%22mapBounds%22%3A%7B%22west%22%3A-77.65840518457031%2C%22east%22%3A-77.11870181542969%2C%22south%22%3A38.13250414385234%2C%22north%22%3A38.444339281260426%7D%2C%22isMapVisible%22%3Afalse%2C%22filterState%22%3A%7B%22sort%22%3A%7B%22value%22%3A%22mostrecentchange%22%7D%2C%22fsba%22%3A%7B%22value%22%3Afalse%7D%2C%22fsbo%22%3A%7B%22value%22%3Afalse%7D%2C%22nc%22%3A%7B%22value%22%3Afalse%7D%2C%22fore%22%3A%7B%22value%22%3Afalse%7D%2C%22cmsn%22%3A%7B%22value%22%3Afalse%7D%2C%22auc%22%3A%7B%22value%22%3Afalse%7D%2C%22fr%22%3A%7B%22value%22%3Atrue%7D%2C%22ah%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A11%7D',
                 headers={'User-Agent': 'Mozilla/5.0'})
data = json.loads(re.search(r'!--(\{"queryState".*?)-->', r.text).group(1))
print(data['cat1'])
print(data['cat1']['searchList'].keys())
Within this are details on pagination and the next URL, if applicable, so you can get all the results. You have only asked for page 1 here.
For example, to print the addresses:
for i in data['cat1']['searchResults']['listResults']:
    print(i['address'])
I am trying to create a simple weather script. The idea is to look at this website and extract the current temperature and the icon to its left.
I know how to fetch the HTML using a request and extract the temperature, but I have no clue how to get the icon. I am not sure how to get the icon/image path (not even sure it is possible).
Here is my code:
# "import urllib" alone does not expose urllib.request; import the submodule explicitly
import urllib.request
from bs4 import BeautifulSoup

url = "https://www.meteored.com.ar/tiempo-en_La+Plata-America+Sur-Argentina-Provincia+de+Buenos+Aires-SADL-1-16930.html"
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
req = urllib.request.Request(url, headers=hdr)
response = urllib.request.urlopen(req).read()
soup = BeautifulSoup(response, "lxml")
data = soup.find_all(class_="dato-temperatura changeUnitT")
temperature = data[0].text
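In case it helps, here is a minimal sketch for the icon, continuing from the code above. I have not inspected this page, so the idea that the icon is an <img> inside the temperature element's parent is an assumption; adjust the navigation to the real markup:

from urllib.parse import urljoin

# assumption: the icon <img> sits in the same parent container as the temperature
icon = data[0].parent.find('img')
if icon is not None:
    # the src attribute may be relative, so resolve it against the page URL
    icon_url = urljoin(url, icon.get('src') or icon.get('data-src', ''))
    print(icon_url)
    # download the icon to a local file
    urllib.request.urlretrieve(icon_url, 'weather_icon.png')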
My requirement is to check the reputation of multiple file hashes on VirusTotal using Python. I do not want to use VirusTotal's public API, since there is a cap of 4 requests/min. I thought of using the requests module and Beautiful Soup to get this done.
Please check the link below:
https://www.virustotal.com/gui/file/f8ee4c00a3a53206d8d37abe5ed9f4bfc210a188cd5b819d3e1f77b34504061e/summary
I need to capture 54/69 for this file. I have a list of file hashes in an Excel sheet which I can loop over for detection status once I get it working for this one hash.
But I am not able to get the specific count of engines that detected the file hash as malicious. The CSS selector for the count gives me only an empty list. Please help. Please check the code I have written below:
import requests
from bs4 import BeautifulSoup
filehash='F8EE4C00A3A53206D8D37ABE5ED9F4BFC210A188CD5B819D3E1F77B34504061E'
filehash_lower = filehash.lower()
URL = 'https://www.virustotal.com/gui/file/' + filehash + '/detection'
response = requests.get(URL)
print(response)
soup = BeautifulSoup(response.content,'html.parser')
detection_details = soup.select('div.detections')
print(detection_details)
Here is an approach using the AJAX calls the page makes:
import requests
import json
headers = {
    'pragma': 'no-cache',
    'x-app-hostname': 'https://www.virustotal.com/gui/',
    'dnt': '1',
    'x-app-version': '20190611t171116',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7,la;q=0.6,mt;q=0.5',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'accept': 'application/json',
    'cache-control': 'no-cache',
    'authority': 'www.virustotal.com',
    'referer': 'https://www.virustotal.com/',
}
response = requests.get('https://www.virustotal.com/ui/files/f8ee4c00a3a53206d8d37abe5ed9f4bfc210a188cd5b819d3e1f77b34504061e', headers=headers)
data = json.loads(response.content)
malicious = data['data']['attributes']['last_analysis_stats']['malicious']
undetected = data['data']['attributes']['last_analysis_stats']['undetected']
print(malicious, 'malicious out of', malicious + undetected)
Output:
54 malicious out of 69
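Since you have a whole list of hashes, here is a minimal sketch of looping over them against the same endpoint, reusing the headers dict from above. The hash list is a placeholder for whatever you read from your Excel sheet, and keep in mind this UI endpoint is undocumented, so it may throttle you or change without notice:

import time

hashes = [
    'f8ee4c00a3a53206d8d37abe5ed9f4bfc210a188cd5b819d3e1f77b34504061e',
    # ...the rest of the hashes from your Excel sheet
]

for h in hashes:
    resp = requests.get('https://www.virustotal.com/ui/files/' + h.lower(), headers=headers)
    stats = resp.json()['data']['attributes']['last_analysis_stats']
    print(h[:12], '->', stats['malicious'], 'malicious out of',
          stats['malicious'] + stats['undetected'])
    time.sleep(1)  # be gentle with an endpoint that is not an official API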
As an exercise, I am trying to scrape data from a dynamic graph using Python. The graph can be found at this link (let's say I want the data from the first one).
Now, I was thinking of doing something like:
import json
import urllib.request

src = 'https://marketchameleon.com/Overview/WFT/IV/#_ABSTRACT_RENDERER_ID_11'
with urllib.request.urlopen(src) as url:
    data = url.read()
reply = json.loads(data)
However, I receive an error message on the last line of the code, saying:
JSONDecodeError: Expecting value
"data" is not empty, so I believe there is a problem with the format of the information within it. Does someone have an idea of how to solve this issue? Thanks!
I opened that link and saw that the site loads its data from another URL: https://marketchameleon.com/charts/histStockChartData?p=747&m=12&_=1534060722519
You can use the json.loads() function twice and do some hacks with the headers (urllib2.Request is your friend in case of Python 2), since the server returns HTTP 500 when you don't imitate a browser:
import json
import urllib.request

src = 'https://marketchameleon.com/charts/histStockChartData?p=747&m=12'
user_agent = {
    'Host': 'marketchameleon.com',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'Upgrade-Insecure-Requests': 1,
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7,kk;q=0.6'
}

request = urllib.request.Request(src, headers=user_agent)
data = urllib.request.urlopen(request).read()
print(data)
reply = json.loads(data)
table = json.loads(reply['GTable'])
print(table)
I'm trying to use urllib + bs4 to scrape naturebox's website.
I want to create a voting webapp so my office can vote for new snacks from naturebox each month.
Ideally I'd like to pull the snacks from naturebox's website as they change frequently.
Here's my code:
import urllib2

site = 'http://www.naturebox.com/browse/'
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
req = urllib2.Request(site, headers=hdr)
try:
    page = urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print e.fp.read()
content = page.read()
print content
but the output is just JavaScript and none of the data for the actual snacks (picture, name, description, etc.).
Obviously I'm a noob; how can I get at this data?
You can use Selenium WebDriver. This will allow you to load the site, including the JavaScript/AJAX, and access the rendered HTML from there. Also, try using PhantomJS so that it can run invisibly in the background.
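A minimal sketch of that approach (the '.snack-item' selector is hypothetical; inspect the rendered page to find the real class names):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.PhantomJS()  # runs headless; webdriver.Firefox() opens a visible browser
driver.get('http://www.naturebox.com/browse/')

# by now the JavaScript has run, so page_source holds the rendered HTML
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

for snack in soup.select('.snack-item'):  # hypothetical selector
    print(snack.get_text(strip=True))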