When I want to capture the following information:
<td>But<200g/M2</td>
name = fila.select('.//td[2]/text()').extract()
I capture only the following:
"But"
Apparently there is a conflict with the characters "<" and "/".
Escape the special characters with a backslash ('\'), like so:
But\<200g\/M2
Note that creating a file with those characters wouldn't be so easy.
Here is an approach that uses BeautifulSoup, in case you have more luck with a different library:
from bs4 import BeautifulSoup
soup = BeautifulSoup("""<html><head><title>StackOverflow-Question</title></head><body>
<table>
<tr>
<td>Ifs</td>
<td>Ands</td>
<td>But<200g/M2</td>
</tr>
</table>
</body></html>""", "html.parser")
print soup.find_all('td')[2].get_text()
The output of this is:
But<200g/M2
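If neither Scrapy's selector nor BeautifulSoup is an option, the standard library's html.parser shows the same behavior: a "<" that doesn't start a valid tag is handed back as literal text. A minimal sketch using the table snippet from the question:

```python
from html.parser import HTMLParser

class CellText(HTMLParser):
    """Collect the text of each <td>, treating the stray '<' as data."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        # the stray '<' before "200g" arrives here as ordinary data
        if self.in_td:
            self.cells[-1] += data

parser = CellText()
parser.feed("<table><tr><td>Ifs</td><td>Ands</td><td>But<200g/M2</td></tr></table>")
print(parser.cells[2])  # But<200g/M2
```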
If you wanted to use XPath, you could also use the ElementTree XML API. Here I'm using BeautifulSoup to parse the HTML and convert it to valid XML so I can run an XPath query against it:
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET
html = """<html><head><title>StackOverflow-Question</title></head><body>
<table>
<tr>
<td>Ifs / Ands / Or</td>
<td>But<200g/M2</td>
</tr>
</table>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")
root = ET.fromstring(soup.prettify())
print root.findall('.//td[2]')[0].text.strip()
The output of this is the same (note that the HTML snippet is slightly different here, and that XPath positions start at one while Python list indices start at zero).
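The one-based versus zero-based difference can be seen directly with ElementTree alone:

```python
import xml.etree.ElementTree as ET

row = ET.fromstring("<tr><td>Ifs</td><td>Ands</td><td>But</td></tr>")
# XPath positions start at 1, Python list indices at 0,
# so both of these select the second cell:
print(row.findall(".//td[2]")[0].text)  # Ands
print(row.findall(".//td")[1].text)     # Ands
```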
Here's the complete HTML code of the page I'm trying to scrape, so please take a look first: https://codepen.io/bendaggers/pen/LYpZMNv
As you can see, this is the page source of mbasic.facebook.com.
What I'm trying to do is scrape all the anchor tags that have a pattern like this:
Example
<a class="cf" href="/profile.php?id=100044454444312&fref=fr_tab">
Example with a wildcard:
<a class="cf" href="*">
So I decided to add a wildcard identifier in href since the values are dynamic.
Here's my (not working) Python Code.
driver.get('https://mbasic.facebook.com/cheska.cabral.796/friends')
pagex = re.compile(driver.page_source)
pattern = "<a class=\"cf\" href=\"*\">"
print(pagex.findall(pattern))
Note that in the page, there are several patterns like this so I need to capture all and print it.
<td class="w n" style="vertical-align: middle"><img src="https://scontent.fceb2-1.fna.fbcdn.net/v/t1.0-1/cp0/e15/q65/p50x50/79342209_112439723581175_5245034566049071104_o.jpg?_nc_cat=108&_nc_sid=dbb9e7&efg=eyJpIjoiYiJ9&_nc_ohc=lADKURnNsk4AX8WTS1F&_nc_ht=scontent.fceb2-1.fna&_nc_tp=3&oh=96f40cb2f95acbcfe9f6e4dc6cb31161&oe=5EC27AEB" class="bo s" alt="Natividad Cruz, profile picture" /></td>
<td class="w t" style="vertical-align: middle"><a class="cf" href="/profile.php?id=100044454444312&fref=fr_tab">Natividad Cruz</a>
<td class="w n" style="vertical-align: middle"><img src="https://scontent.fceb2-1.fna.fbcdn.net/v/t1.0-1/cp0/e15/q65/p50x50/10306248_10201945477974508_4213924286888352892_n.jpg?_nc_cat=109&_nc_sid=dbb9e7&efg=eyJpIjoiYiJ9&_nc_ohc=Z2daQ-qGgpsAX8BmLKr&_nc_ht=scontent.fceb2-1.fna&_nc_tp=3&oh=22f2b487166a7cd06e4ff650af4f7a7b&oe=5EC34325" class="bo s" alt="John Vinas, profile picture" /></td>
<td class="w t" style="vertical-align: middle"><a class="cf" href="/john.vinas?fref=fr_tab">John Vinas</a>
My goal is to find all the anchor tags and print them in the terminal. I appreciate your help on this. Thank you!
Tried another set of code but no luck :)
driver.get('https://mbasic.facebook.com/cheska.cabral.796/friends')
pagex = driver.page_source
pattern = "<td class=\".*\" style=\"vertical-align: middle\"><a class=\".*\">"
x = re.findall(pattern, pagex)
print(x)
I think your wildcard match needs a dot in front, like .*
I'd also recommend using a library like Beautiful Soup for this; it might make your life easier.
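To illustrate the suggestion: in a regex, * only repeats the previous token, so a bare href="*" matches a literal asterisk. A non-greedy .*? does what the wildcard was meant to do. A small sketch against a made-up snippet shaped like the question's markup:

```python
import re

# made-up snippet standing in for driver.page_source
page = ('<td class="w t" style="vertical-align: middle">'
        '<a class="cf" href="/profile.php?id=100044454444312&fref=fr_tab">Natividad Cruz</a></td>'
        '<td class="w t" style="vertical-align: middle">'
        '<a class="cf" href="/john.vinas?fref=fr_tab">John Vinas</a></td>')
# ".*?" = "any characters, as few as possible"; the group captures the href value
pattern = r'<a class="cf" href="(.*?)">'
print(re.findall(pattern, page))
```

This prints both href values, one per anchor.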
You should use a parsing library, such as BeautifulSoup or requests-html. If you want to do it manually, then build on the second attempt you posted. The first won't get you what you want because you are compiling the entire page as a regular expression.
import re
s = """<a class="cf" href="/profile.php?id=100044454444312&fref=fr_tab">\n\n<h1>\n<a class="cf" href="/profile.php?id=20004666644312&fref=fr_tab">"""
patt = r'<a.*?class="cf".*?href.*?profile.*?>'
matches = re.findall(patt, s)
Output
>>> matches
['<a class="cf" href="/profile.php?id=100044454444312&fref=fr_tab">',
'<a class="cf" href="/profile.php?id=20004666644312&fref=fr_tab">']
As mentioned by the previous respondent, BeautifulSoup is the best library available in Python for scraping web pages. To import Beautiful Soup and the other libraries, use the following commands:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
After that, the following set of commands should solve your problem:
req=Request(url,headers = {'User-Agent': 'Chrome/64.0.3282.140'})
result=urlopen(req).read()
soup = BeautifulSoup(result, "html.parser")
atags=soup('a')
url in the above command is the link you want to scrape, and the headers argument takes your browser specs/version.
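From there, pulling the href values out of atags is a list comprehension. A sketch using a made-up snippet in place of result, so the example is self-contained:

```python
from bs4 import BeautifulSoup

# made-up snippet standing in for the fetched page
result = ('<a class="cf" href="/profile.php?id=100044454444312&fref=fr_tab">Natividad Cruz</a>'
          '<a class="cf" href="/john.vinas?fref=fr_tab">John Vinas</a>')
soup = BeautifulSoup(result, "html.parser")
atags = soup('a')  # soup('a') is shorthand for soup.find_all('a')
hrefs = [a.get('href') for a in atags]
print(hrefs)
```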
I'm trying to parse a webpage in Python, an AJAX response which basically looks like this XML:
<table class="tab02">
<tr>
<th>Skrót</th>
<th>Pełna nazwa</th>
</tr>
<tr>
<td>1AT</td>
<td>ATAL SPÓŁKA AKCYJNA</td>
</tr>
</table>
Link: http://www.gpw.pl/ajaxindex.php?action=GPWCompanySearch&start=listForLetter&letter=A&listTemplateName=GPWCompanySearch%2FajaxList_PL
If I paste this XML into my Python file as a variable and use the lxml library with the simple code below, I successfully parse everything, and the whole result is well formatted:
from lxml import etree
root = etree.fromstring(xml)
print etree.tounicode(root) # print etree.tostring(root)
Problem happens while parsing data from webpage (see example code below)
magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
root = etree.parse(link2page, magical_parser)
print etree.tounicode(root)
As a result, all the < and > characters from the table are changed to &lt; and &gt;:
<response>
<html>
&lt;table class="tab02"&gt;
&lt;tr&gt;
&lt;th&gt;Skrót&lt;/th&gt;
&lt;th&gt;Pełna nazwa&lt;/th&gt;
&lt;/tr&gt;
etc.
I've also tried first fetching the link with urllib and parsing it as HTML, but I fail every time. Can anyone give me a hint, please?
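No answer was posted here, but one likely cause is that the endpoint returns the table entity-escaped inside the XML wrapper, so the parser treats it as plain text and the serializer re-escapes it on output. A sketch of one workaround against a hypothetical stand-in payload (not verified against that URL): take the wrapper element's text, where the entity references are already resolved, and parse it a second time.

```python
import xml.etree.ElementTree as ET

# hypothetical stand-in for the AJAX response: the table arrives
# entity-escaped inside the wrapper's text
payload = ("<response><html>"
           "&lt;table class=\"tab02\"&gt;&lt;tr&gt;"
           "&lt;td&gt;1AT&lt;/td&gt;"
           "&lt;td&gt;ATAL SPÓŁKA AKCYJNA&lt;/td&gt;"
           "&lt;/tr&gt;&lt;/table&gt;"
           "</html></response>")
outer = ET.fromstring(payload)
inner_markup = outer.find("html").text  # entities are resolved to real < and > here
table = ET.fromstring(inner_markup)     # parse the recovered markup again
print(table.findall(".//td")[0].text)   # 1AT
```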
I have the following (repeating) HTML text from which I need to extract some values using Python and regular expressions.
<tr>
<td width="35%">Demand No</td>
<td width="65%"><input type="text" name="T1" size="12" onFocus="this.blur()" value="876716001"></td>
</tr>
I can get the first value by using
match_det = re.compile(r'<td width="35.+?">(.+?)</td>').findall(html_source_det)
But that matches only on one line. However, I also need to get the second value, which is on the line following the first one, and I cannot get that to work. I have tried the following, but I won't get a match:
match_det = re.compile('<td width="35.+?">(.+?)</td>\n'
'<td width="65.+?value="(.+?)"></td>').findall(html_source_det)
Perhaps I am unable to get it to work because the text spans multiple lines, but I added "\n" at the end of the first line, which I thought would resolve it; it did not.
What am I doing wrong?
The html_source is retrieved by downloading it (it is not a static HTML file like the one outlined above; I only put it here so you could see the text). Maybe this is not the best way of getting the source.
I am obtaining the html_source like this:
new_url = "https://webaccess.site.int/curracc/" + url_details #not a real url
myresponse_det = urllib2.urlopen(new_url)
html_source_det = myresponse_det.read()
Please do not try to parse HTML with regex, as it is not regular. Instead use an HTML parsing library like BeautifulSoup. It will make your life a lot easier! Here is an example with BeautifulSoup:
from bs4 import BeautifulSoup
html = '''<tr>
<td width="35%">Demand No</td>
<td width="65%"><input type="text" name="T1" size="12" onFocus="this.blur()" value="876716001"></td>
</tr>'''
soup = BeautifulSoup(html, "html.parser")
print soup.find('td', attrs={'width': '65%'}).findNext('input')['value']
Or more simply:
print soup.find('input', attrs={'name': 'T1'})['value']
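For completeness, the regex in the question most likely fails because the real page separates the lines with "\r\n" or indentation rather than a bare "\n". Allowing arbitrary whitespace with \s* makes it match, though the parser approach above is still more robust. A sketch with a stand-in source using \r\n line endings:

```python
import re

# stand-in for html_source_det; real pages often separate lines
# with "\r\n" or indentation, which a literal "\n" will not match
src = ('<tr>\r\n'
       '  <td width="35%">Demand No</td>\r\n'
       '  <td width="65%"><input type="text" name="T1" size="12" value="876716001"></td>\r\n'
       '</tr>')
# \s* tolerates the \r\n and the leading spaces between the two cells
pattern = r'<td width="35.+?">(.+?)</td>\s*<td width="65.+?value="(.+?)">'
pairs = re.findall(pattern, src)
print(pairs)  # [('Demand No', '876716001')]
```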
I'm making a small Python script for auto-logon to a website, but I'm stuck.
I want to print to the terminal a small part of the HTML, located within this tag on the site:
<td class=h3 align='right'> John Appleseed</td><td> <img border=0 src="../tbs_v7_0/images/myaccount.gif" alt="My Account"></td>
But how do I extract and print just the name, John Appleseed?
I'm using Python's Mechanize on a Mac, by the way.
Mechanize is only good for fetching the HTML. Once you want to extract information from it, you could use, for example, BeautifulSoup. (See also my answer to a similar question: Web mining or scraping or crawling? What tool/library should I use?)
Depending on where the <td> is located in the HTML (it's unclear from your question), you could use the following code:
html = ... # this is the html you've fetched
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html)
# use this (gets all <td> elements)
cols = soup.findAll('td')
# or this (gets only <td> elements with class='h3')
cols = soup.findAll('td', attrs={"class" : 'h3'})
print cols[0].renderContents() # print content of first <td> element
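With the newer bs4 package the same idea works, and get_text() yields just the name. A sketch using the snippet from the question:

```python
from bs4 import BeautifulSoup

# snippet taken from the question
html = ("<td class=h3 align='right'> John Appleseed</td>"
        '<td> <img border=0 src="../tbs_v7_0/images/myaccount.gif" alt="My Account"></td>')
soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td", attrs={"class": "h3"})
name = cell.get_text().strip()  # strip the leading space inside the cell
print(name)  # John Appleseed
```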
As you have not provided the full HTML of the page, the only option right now is either using string.find() or regular expressions.
But the standard way of finding this is using XPath. See this question: How to use Xpath in Python?
You can obtain the XPath for an element using the "Inspect Element" feature of Firefox.
For example, if you want to find the XPath for the username on the Stack Overflow site:
Open Firefox, log in to the website, right-click on the username (shadyabhi in my case) and select Inspect Element.
Then hover over the tag, or right-click it and choose "Copy XPath".
You can use a parser to extract any information in a document. I suggest you use the lxml module.
Here is an example:
from lxml import etree
from StringIO import StringIO
parser = etree.HTMLParser()
tree = etree.parse(StringIO("""<td class=h3 align='right'> John Appleseed</td><td> <img border=0 src="../tbs_v7_0/images/myaccount.gif" alt="My Account"></td>"""),parser)
>>> tree.xpath("string()").strip()
u'John Appleseed'
More information about lxml here
I'm currently learning Python and trying to make a small scraper, but I'm running into problems with Beautiful Soup and regex.
I am trying to match all links in a site that has the following HTML:
<td>
Place Number 1
</td>
<td width="100">
California </td>
<td>
Place Number 2
</td>
<td width="100">
Florida </td>
I want to get all links of the following form: "/lxxxx/Place+Number+x"
I am using Python and BeautifulSoup for this:
from BeautifulSoup import BeautifulSoup
import urllib2
import re
address = 'http://www.example.com'
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)
for tag in soup.findAll('a', id = re.compile('l[0-9]*')):
print tag['href']
The regex part in soup.findAll I took from some example code, because I can't get the example from the BeautifulSoup documentation to work. Without the regex part, I got all the links on the page, but I only want the "lxxxx" ones.
What am I doing wrong with my regex? Maybe there's a way to do this without regexes, but I can't seem to find one.
Shouldn't you be trying to do the regex match on href and not id?
for tag in soup.findAll('a', href = re.compile('l[0-9]*')):
print tag['href']
I would suggest
for tag in soup.findAll('a', href = re.compile('^/l[0-9]+/.*$')):
print tag['href']
to avoid matching tags that look similar to, but are not exactly, the ones you want.
Apart from checking href rather than id, use:
re.compile(r'^\/l[0-9]{4}/Place\+Number\+[0-9]+')
match behaves as if your regex started with "^": it anchors at the beginning of the string.
>>> m = re.compile(r"abc")
>>> m.match("eabc")
>>> m.match("abcd")
<_sre.SRE_Match object at 0x7f23192318b8>
So adding the \/ allows the first slash to be matched. Also I'm using {4} to match four numbers rather than * which will match zero or more numbers.
>>> m = re.compile(r'\/l[0-9]*')
>>> m.match("/longurl/somewhere")
<_sre.SRE_Match object at 0x7f2319231850>