I am working with lxml to fetch an HTML page.
I want to fetch the HTML table whose class name is 'class1'.
I have done something like this:
for span in doc.xpath('//table[@class="class1"]'):
    print span
But after this I found that there are 4 tables in the HTML page which have the class name 'class1'.
For example:
table A
table B
table C
table D
All 4 of these tables have the same class name.
How can I fetch only table B?
You can just get the second item of the list:
result = doc.xpath('//table[@class="class1"]')
if len(result) > 1:
    print result[1]
Or, if your table has an id, you can get it via XPath:
print doc.xpath('//table[@id="your id"]')[0]
I think what you might want here is...
doc.xpath('//table[@class="class1"]')[1]
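If you prefer to do the selection in XPath itself, a positional predicate works too; note that XPath positions are 1-based and the parentheses are needed so the index applies to the whole result set:

# select the second matching table directly in XPath
second = doc.xpath('(//table[@class="class1"])[2]')
if second:
    print second[0]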
Need help writing a query in SQL or SQLAlchemy
First table, named Rows:

sid          unit_sid
ROW_UUID1    UNIT_UUID1
ROW_UUID2    UNIT_UUID1
ROW_UUID3    UNIT_UUID
Second table, named Records:

row_sid (== sid from Rows)    item_sid      content (str)
ROW_UUID1                     ITEM_UUID1    Description 1
ROW_UUID1                     ITEM_UUID2    Description 1
ROW_UUID2                     ITEM_UUID1    Description 3
ROW_UUID2                     ITEM_UUID2    Description 2
ROW_UUID3                     ITEM_UUID1    Description 5
ROW_UUID3                     ITEM_UUID2    Description 1
I need an example of an SQL query where I can specify a search on several content values for different item_sid values.
For example, I need all rows where:
item_sid == ITEM_UUID1 and content == Description 1
item_sid == ITEM_UUID2 and content == Description 1
A query like the one below will not work for me, because I need to search on two item_sid values at the same time in order to get unique rows:
select row_sid
from rows
left join record on rows.sid = record.row_sid
where (item_sid = '877aeeb4-c68e-4942-b259-288e7aa3c04b' and content like '%TEXT%')
  and (item_sid = 'cc22f239-db6c-4041-92c6-8705cb621525' and content like '%TEXT2%')
group by row_sid
I solved it like this:
select row_sid
from rows
left join record on rows.sid = record.row_sid
where (item_sid = '877aeeb4-c68e-4942-b259-288e7aa3c04b' and content like '%TEXT%')
   or (item_sid = 'cc22f239-db6c-4041-92c6-8705cb621525' and content like '%TEXT2%')
group by row_sid
having count(row_sid) = 2
But maybe there is a more elegant solution? I want to query for a varying number of item_sids (2-5) at the same time.
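If the number of (item_sid, content) pairs varies, one option is to build the OR conditions and the HAVING count dynamically. Below is a minimal sketch, assuming an sqlite3-style connection object conn and the table/column names from the query above (adjust the parameter placeholders for your driver):

def find_rows(conn, criteria):
    # criteria is a list of (item_sid, content_pattern) tuples
    conditions = ' or '.join(
        '(item_sid = ? and content like ?)' for _ in criteria)
    params = [value for pair in criteria for value in pair]
    sql = ('select row_sid from rows '
           'left join record on rows.sid = record.row_sid '
           'where ' + conditions + ' '
           'group by row_sid '
           # count distinct item_sids so two matches on the same item do not count twice
           'having count(distinct item_sid) = ?')
    return conn.execute(sql, params + [len(criteria)]).fetchall()

# e.g. find_rows(conn, [('ITEM_UUID1', '%Description 1%'),
#                       ('ITEM_UUID2', '%Description 1%')])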
I am new to Python and I am trying to get some values from a table in a web page. I need to get the values highlighted in yellow on the page.
I have this code; it gets all the values in the "Instruments" column, but I don't know how to get the specific values:
body = soup.find_all("tr")
for Rows in body:
    RowValue = Rows.find_all('th')
    if len(RowValue) > 0:
        CellValue = RowValue[0]
        ThisWeekValues.append(CellValue.text)
Any suggestions?
elements = driver.find_elements_by_xpath('//*[@id]')
ids = [element.get_attribute('id') for element in elements]
if 'your_element_id' in ids:
    # do something
    pass
This could be one way to do it, since only the id is different between the elements.
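Since the question itself uses BeautifulSoup, here is a minimal sketch of the same idea with the existing soup object instead of Selenium. The labels in wanted_headers are made up for illustration, so substitute the rows you actually need:

# keep only the rows whose <th> text matches one of the labels we care about
wanted_headers = set(['Some instrument', 'Another instrument'])  # hypothetical labels
for row in soup.find_all('tr'):
    header = row.find('th')
    if header and header.get_text(strip=True) in wanted_headers:
        # collect the cell values of that row
        values = [td.get_text(strip=True) for td in row.find_all('td')]
        ThisWeekValues.append(values)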
I am embedding links in one column of a Pandas dataframe (table, below) and writing the dataframe to HTML.
Links in the dataframe table are formatted as shown (indexing the first link in the table):
In: table.loc[0,'Links']
Out: u'I6'
If I view the dataframe (in a notebook) rather than indexing a specific row, the link text is truncated:
<a href="http://xxx.xx.xxx.xxx/browser/I6.html...
I write the dataframe to HTML:
table_1=table.to_html(classes='table',index=False,escape=False)
But the truncated link (rather than the full text) is written to the HTML table:
<td> <a href="http://xxx.xx.xxx.xxx/browser/I6.html...</td>\n
I probably need an additional parameter for to_html().
I am looking at the documentation now, but advice is appreciated:
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.to_html.html
Thanks!
So there is probably a pandas-specific explanation, but you could also work around the problem by (a) replacing the links with a key value, (b) writing the HTML table string, and then (c) replacing the keys with the appropriate links.
For example, replace each link with a key, storing the keys in a dict:
link_map = {}  # full link -> short placeholder key
counter = 0    # one distinct key per unique link
for i in df.index:
    if df.ix[i, 'Links'] in link_map:
        df.ix[i, 'Links'] = link_map[df.ix[i, 'Links']]
    else:
        link_map[df.ix[i, 'Links']] = 'href' + str(counter)
        counter += 1
        df.ix[i, 'Links'] = link_map[df.ix[i, 'Links']]
Write the table:
table_1 = df.to_html(classes='table',index=False,escape=False)
Re-write the links:
for link, placeholder in link_map.iteritems():
    table_1 = table_1.replace(placeholder, link)
I am trying to scrape form field IDs using Beautiful Soup, like this:
from BeautifulSoup import BeautifulSoup, SoupStrainer

for link in BeautifulSoup(content, parseOnlyThese=SoupStrainer('input')):
    if link.has_key('id'):
        print link['id']
Let us assume that it returns something like:
username
email
password
passwordagain
terms
button_register
I would like to write this into an SQLite3 database.
What I will be doing down the line in my application is use these form field IDs and maybe try to do a POST. The problem is that there are plenty of sites like this whose form field IDs I have scraped. So the relation is like this:
Domain1 - First list of Form Fields for this Domain1
Domain2 - Second list of Form Fields for this Domain2
.. and so on
What I am unsure about here is how I should design my columns for this purpose. Will it be OK if I just create a table with two columns, say:
COL 1 - Domain URL (as TEXT)
COL 2 - List of Form Field IDs (as TEXT)
One thing to remember is that, down the line in my application, I will need to do something like this:
Pseudocode:
If Domain is "http://somedomain.com":
    For every item in COL2 (which is a list of form field IDs):
        Assign some set of values to each of the form fields & then make a POST request
Can anyone guide me, please?
EDITED on 22/07/2011 - Is my database design below correct?
I have decided to have a solution like this. What do you guys think?
I will have three tables, like below.
Table 1
Key Column (Auto Generated Integer) - Primary Key
Domain as TEXT
Sample Data would be something like:
1 http://url1.com
2 http://url2.com
3 http://url3.com
Table 2
Domain (Here I will be using the Key Number from Table 1)
RegLink - This will have the registration link (as TEXT)
Form Fields (as Text)
Sample Data would be something like:
1 http://url1.com/register field1
1 http://url1.com/register field2
1 http://url1.com/register field3
2 http://url2.com/register field1
2 http://url2.com/register field2
2 http://url2.com/register field3
3 http://url3.com/register field1
3 http://url3.com/register field2
3 http://url3.com/register field3
Table 3
Domain (Here I will be using the Key Number from Table 1)
Status (as TEXT)
User (as TEXT)
Pass (as TEXT)
Sample Data would be something like:
1 Pass user1 pass1
2 Fail user2 pass2
3 Pass user3 pass3
Do you think this table design is good? Or are there any improvements that can be made?
There is a normalization problem in your table.
Using 2 tables is a better solution:

TABLE domains
    int id primary key
    text name

TABLE field_ids
    int id primary key
    int domain_id foreign key references domains
    text value
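A minimal sketch of that layout in SQLite (the file name is arbitrary; note that SQLite only enforces foreign keys when the pragma is switched on):

import sqlite3

conn = sqlite3.connect('fields.db')       # any file name will do
conn.execute('pragma foreign_keys = on')  # SQLite does not enforce foreign keys by default
conn.execute('''create table domains (
                    id integer primary key,
                    name text not null)''')
conn.execute('''create table field_ids (
                    id integer primary key,
                    domain_id integer not null references domains(id),
                    value text not null)''')
conn.commit()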
Proper database design would suggest that you have a table of URLs and a table of fields, each referencing a URL record. But depending on what you want to do with them, you could pack lists into a single column. See the docs for how to go about that.
Is SQLite a requirement? It might not be the best way to store the data. E.g. if you need random-access lookups by URL, the shelve module might be a better bet. If you just need to record them and iterate over the sites, it might be simpler to store them as CSV.
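For example, a small sketch of the shelve idea, with the domain URL as the key and the list of field IDs as the value:

import shelve

# store the scraped ids under the domain they came from
db = shelve.open('form_fields')
db['http://somedomain.com'] = ['username', 'email', 'password']
db.close()

# later: random-access lookup by URL
db = shelve.open('form_fields')
print db['http://somedomain.com']
db.close()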
Try this to get the ids:
ids = (link['id'] for link in
       BeautifulSoup(content, parseOnlyThese=SoupStrainer('input'))
       if link.has_key('id'))
And this should show you how to save them, load them, and do something to each. This uses a single table and just inserts one row for each field for each domain. It's the simplest solution, and perfectly adequate for a relatively small number of rows of data.
from itertools import izip, repeat
import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('''create table domains
             (domain text, linkid text)''')

domain_to_insert = 'domain_name'
ids = ['id1', 'id2']
c.executemany("""insert into domains
                 values (?, ?)""", izip(repeat(domain_to_insert), ids))
conn.commit()

domain_to_select = 'domain_name'
c.execute("""select * from domains where domain=?""", (domain_to_select,))

# this is just an example
def some_function_of_row(row):
    return row[1] + ' value'

fields = dict((row[1], some_function_of_row(row)) for row in c)
print fields
c.close()
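For the POST step described in the question's pseudocode, something along these lines could work. This is only a sketch: it assumes the requests library, and the field names and values are made up from the sample data above.

import requests

def register(reg_link, field_ids, values):
    # build the form payload from the scraped field ids;
    # `values` maps each field id to the value you want to submit
    payload = dict((field_id, values.get(field_id, '')) for field_id in field_ids)
    response = requests.post(reg_link, data=payload)
    return response.status_code

# hypothetical usage with the sample data from the question
# register('http://url1.com/register',
#          ['username', 'email', 'password'],
#          {'username': 'user1', 'email': 'user1@example.com', 'password': 'pass1'})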
I am learning Python and BeautifulSoup to scrape data from the web and read an HTML table. I can read it into OpenOffice, which says that it is Table #11.
It seems like BeautifulSoup is the preferred choice, but can anyone tell me how to grab a particular table and all the rows? I have looked at the module documentation, but can't get my head around it. Many of the examples that I have found online appear to do more than I need.
This should be pretty straightforward if you have a chunk of HTML to parse with BeautifulSoup. The general idea is to navigate to your table using the findChildren method; then you can get the text value inside a cell with the string property.
>>> from BeautifulSoup import BeautifulSoup
>>>
>>> html = """
... <html>
... <body>
... <table>
... <th><td>column 1</td><td>column 2</td></th>
... <tr><td>value 1</td><td>value 2</td></tr>
... </table>
... </body>
... </html>
... """
>>>
>>> soup = BeautifulSoup(html)
>>> tables = soup.findChildren('table')
>>>
>>> # This will get the first (and only) table. Your page may have more.
>>> my_table = tables[0]
>>>
>>> # You can find children with multiple tags by passing a list of strings
>>> rows = my_table.findChildren(['th', 'tr'])
>>>
>>> for row in rows:
... cells = row.findChildren('td')
... for cell in cells:
... value = cell.string
... print("The value in this cell is %s" % value)
...
The value in this cell is column 1
The value in this cell is column 2
The value in this cell is value 1
The value in this cell is value 2
>>>
If you ever have nested tables (as on old-school websites), the above approach might fail.
As a solution, you might want to extract non-nested tables first:
html = '''<table>
<tr>
<td>Top level table cell</td>
<td>
<table>
<tr><td>Nested table cell</td></tr>
<tr><td>...another nested cell</td></tr>
</table>
</td>
</tr>
</table>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
non_nested_tables = [t for t in soup.find_all('table') if not t.find_all('table')]
Alternatively, if you want to extract the content of all the tables, including those that nest other tables, you can extract only the top-level tr rows and th/td cells. For this, you need to turn off recursion when calling the find_all method:
soup = BeautifulSoup(html, 'lxml')
tables = soup.find_all('table')
cnt = 0
for my_table in tables:
    cnt += 1
    print ('=============== TABLE {} ==============='.format(cnt))
    rows = my_table.find_all('tr', recursive=False)  # <-- HERE
    for row in rows:
        cells = row.find_all(['th', 'td'], recursive=False)  # <-- HERE
        for cell in cells:
            # DO SOMETHING
            if cell.string: print (cell.string)
Output:
=============== TABLE 1 ===============
Top level table cell
=============== TABLE 2 ===============
Nested table cell
...another nested cell
Using recursive=False is a great trick if you don't have nested tables, but if you do, then you need to do things one level at a time.
The one HTML variation that could bite you is the following, where thead and/or tbody elements are also used:
html = '''
<table class="fancy">
  <thead>
    <tr><th>Nested table cell</th></tr>
  </thead>
  <tbody>
    <tr><td><table id="2">...another nested cell</table></td></tr>
  </tbody>
</table>
'''
In this situation, you will need to do the following:
soup = BeautifulSoup(html, 'lxml')
table = soup.find_all("table", {"class": "fancy"})[0]
thead = table.find_all('thead', recursive=False)
header = thead[0].findChildren('th')
tbody = table.find_all('tbody', recursive=False)
rows = tbody[0].find_all('tr', recursive=False)
Now you have the header and the rows.
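Continuing from that snippet, you could pull the text out of the header and row cells like this (what ends up in each cell depends on your actual page):

print([th.get_text(strip=True) for th in header])
for row in rows:
    print([td.get_text(strip=True) for td in row.find_all('td', recursive=False)])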