Use beautifulsoup to scrape a table within a webpage? - python

I am scraping a county website that posts emergency calls and their locations. I have had success web scraping basic elements, but I am having trouble scraping the rows of the table.
(Here is an example of what I am working with, code-wise:)
location = list.find('div', class_='listing-search-item__sub-title')
I'm not sure how to specifically scrape the rows of the table. Can anyone explain how to dig into the sublevels of the HTML to look for these records? I'm not sure if I need to dig into tr, table, tbody, td, etc. I could use some guidance on which division or class to target to dig into the data.

For extracting specific nested elements, I often prefer to use .select, which uses CSS selectors (bs4 doesn't seem to have any support for XPath, but you can also check out these solutions using the lxml library), so for your case you could use something like
soup.select_one('table[id="form1:tableEx1"]').select('tbody tr')
but the results might look a bit weird since the columns might not be separated; to get separated columns/cells, you could extract the rows as tuples instead with
tableRows = [
    tuple([c.text.strip() for c in r.find_all(['th', 'td'])]) for r
    in BeautifulSoup(tHtml).select_one(
        'table[id="form1:tableEx1"]'
    ).select('tbody tr')
]
(Note that you can't use the .select('#id') shorthand when the id contains a ":".)
As one of the comments mentioned, you can use pandas.read_html(htmlString) to get a list of tables in the HTML; if you want a specific table, use the attrs argument:
# import pandas
pandas.read_html(htmlString, attrs={'id': 'form1:tableEx1'})[0]
but you will get the whole table, not just what's in tbody, and it will flatten any tables that are nested inside (see the results with the table used in this example).
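If you go the pandas route and want plain row tuples afterwards, here is a minimal sketch (assuming htmlString holds the page source and the same table id as above):
import pandas as pd

# htmlString is assumed to hold the page's HTML source
df = pd.read_html(htmlString, attrs={'id': 'form1:tableEx1'})[0]
rows = list(df.itertuples(index=False, name=None))  # plain tuples; the header row becomes the column names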
And the single-statement method I showed at first with select cannot be used at all with nested tables since the output will be scrambled. Instead, if you want to preserve any nested inner tables without flattening, and if you are likely to be scraping tables often, I have the following set of functions which can be used in general:
First, define two helper functions that the main table extractor depends on:
# get a list of tagNames between a tag and its ancestor
def linkAncestor(t, a=None):
    # if a == t.parent: return []
    # if a is None, return tagNames of ALL ancestors
    # if a not in t.parents: return None
    aList = []
    while t.parent != a or a is None:
        t = t.parent
        if t is None:
            if a is not None: aList = None
            break
        aList.append(t.name)
    return aList

def getStrings_table(xSoup):
    # not perfect, but enough for me so far
    tableTags = ['table', 'tr', 'th', 'td']
    return "\n".join([
        c.get_text(' ', strip=True) for c in xSoup.children
        if c.get_text(' ', strip=True) and (c.name is None or (
            c.name not in tableTags and not c.find(tableTags)
        ))
    ])
Then, you can define the function for extracting the tables as Python dictionaries:
def tablesFromSoup(mSoup, mode='a', simpleOp=False):
    typeDict = {'t': 'table', 'r': 'row', 'c': 'cell'}
    finderDict = {'t': 'table', 'r': 'tr', 'c': ['th', 'td']}
    refDict = {
        'a': {'tables': 't', 'loose_rows': 'r', 'loose_cells': 'c'},
        't': {'inner_tables': 't', 'rows': 'r', 'loose_cells': 'c'},
        'r': {'inner_tables': 't', 'inner_rows': 'r', 'cells': 'c'},
        'c': {'inner_tables': 't', 'inner_rows': 'r', 'inner_cells': 'c'}
    }
    mode = mode if mode in refDict else 'a'

    # for when simpleOp = True
    nextModes = {'a': 't', 't': 'r', 'r': 'c', 'c': 'a'}
    mainCont = {
        'a': 'tables', 't': 'rows', 'r': 'cells', 'c': 'inner_tables'
    }

    innerContent = {}
    for k in refDict[mode]:
        if simpleOp and k != mainCont[mode]:
            continue
        fdKey = refDict[mode][k]  # also the mode for the recursive call

        # pair each found tag with the tagNames between it and mSoup,
        # then keep only tags that are not nested inside another table element
        innerSoups = [(
            s, linkAncestor(s, mSoup)
        ) for s in mSoup.find_all(finderDict[fdKey])]
        innerSoups = [(s, la) for s, la in innerSoups if not (
            'table' in la or 'tr' in la or 'td' in la or 'th' in la
        )]

        # recursive call
        kCont = [tablesFromSoup(s, fdKey, simpleOp) for s, la in innerSoups]
        if simpleOp:
            if kCont == [] and mode == 'c': break
            return tuple(kCont) if mode == 'r' else kCont

        # if not empty, check if header then add to output
        if kCont:
            if 'row' in k:
                for i in range(len(kCont)):
                    if 'isHeader' in kCont[i]: continue
                    kCont[i]['isHeader'] = 'thead' in innerSoups[i][1]
            if 'cell' in k:
                isH = [(c[0].name == 'th' or 'thead' in c[1]) for c in innerSoups]
                if sum(isH) > 0:
                    if mode == 'r':
                        innerContent['isHeader'] = True
                    else:
                        innerContent[f'isHeader_{k}'] = isH
            innerContent[k] = kCont

    if innerContent == {} and mode == 'c':
        innerContent = mSoup.get_text(' ', strip=True)
    elif mode in typeDict:
        if innerContent == {}:
            innerContent['innerText'] = mSoup.get_text(' ', strip=True)
        else:
            innerStrings = getStrings_table(mSoup)
            if innerStrings:
                innerContent['stringContent'] = innerStrings
        innerContent['type'] = typeDict[mode]
    return innerContent
With the same example as before, this function gives this output; if the simpleOp argument is set to True, it results in a simpler output, but then the headers are no longer differentiated and some other peripheral data is also excluded.
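For reference, a minimal usage sketch (the file name here is a placeholder, not something from the original question):
from bs4 import BeautifulSoup

# hypothetical saved copy of the page being scraped
with open('emergency_calls.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

full = tablesFromSoup(soup)                   # nested dicts with 'type', 'isHeader', etc.
simple = tablesFromSoup(soup, simpleOp=True)  # lists/tuples only, headers not marked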

Related

Python: set a dict value when the key may not exist, how can I handle it?

import SimpleITK as sitk

reader = sitk.ImageFileReader()
reader.SetFileName(filePath)
reader.ReadImageInformation()
img = reader.Execute()
meta = {
    "a": reader.GetMetaData('0'),  # <- if the key does not exist, return 'undefined'
    "b": reader.GetMetaData('1'),
    "c": reader.GetMetaData('2'),
}
I am a JavaScript developer.
I want to build the meta dict, but it throws the error 'Key '0' does not exist'.
The key may not exist; how can I set meta in this case?
From the docs, the ImageFileReader class has a HasMetaDataKey() boolean function. So you should be able to do something like this:
meta = {
    "a": reader.GetMetaData('0') if reader.HasMetaDataKey('0') else 'undefined',
    "b": reader.GetMetaData('1') if reader.HasMetaDataKey('1') else 'undefined',
    "c": reader.GetMetaData('2') if reader.HasMetaDataKey('2') else 'undefined',
}
And you could do it in one (long) line:
meta = {m: reader.GetMetaData(k) if reader.HasMetaDataKey(k) else 'undefined'
        for m, k in zip(['a', 'b', 'c'], ['0', '1', '2'])}
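If there are many keys, a small helper keeps the existence check in one place (just a sketch; get_meta_or_default is a made-up name and the 'undefined' fallback mirrors the question):
def get_meta_or_default(reader, key, default='undefined'):
    # HasMetaDataKey avoids the "Key ... does not exist" error from GetMetaData
    return reader.GetMetaData(key) if reader.HasMetaDataKey(key) else default

meta = {name: get_meta_or_default(reader, key)
        for name, key in zip(['a', 'b', 'c'], ['0', '1', '2'])}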
You can use a defaultdict:
from collections import defaultdict

d = defaultdict(lambda: 'xx')  # <- whatever default value you want
d[10]       # no value passed, so 'xx' is assigned automatically
d[11] = 12  # value 12 assigned
# to read an existing value you can also use d.get(key)
print(d[10])  # prints 'xx'
print(d)
Output:
defaultdict(<function <lambda> at 0x000001557B4B03A8>, {10: 'xx', 11: 12})
You get the idea; you can modify it according to your needs.

Prompting user to enter column names from a csv file (not using pandas framework)

I am trying to get the column names from a csv file with nearly 4000 rows. There are about 14 columns.
I am trying to get each column and store it into a list and then prompt the user to enter themselves at least 5 columns they want to look at.
The user should then be able to type how many results they want to see (they should be the smallest results from that column).
For example, if they choose clothing_brand, "8", the 8 least expensive brands are displayed.
So far, I have been able to use "with" and get a list that contains each column, but I am having trouble prompting the user to pick at least 5 of those columns.
You can use Python's input() to get input from the user; if you want to prompt a number of times, use a for loop to collect the inputs. Check the code below:
def get_user_val(no_of_entries=5):
    print('Enter {} inputs'.format(str(no_of_entries)))
    val_list = []
    for i in range(no_of_entries):
        val_list.append(input('Enter Input {}:'.format(str(i + 1))))
    return val_list

get_user_val()
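A hedged extension of the same idea, validating each entry against the header names read from the CSV (column_names is assumed to be the header list you already extracted with your "with" block):
def get_user_columns(column_names, no_of_entries=5):
    chosen = []
    while len(chosen) < no_of_entries:
        col = input('Pick column {} of {}: '.format(len(chosen) + 1, no_of_entries))
        if col in column_names:
            chosen.append(col)
        else:
            print('No such column; choose from: {}'.format(', '.join(column_names)))
    return chosen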
I hope I didn't misunderstand what you mean; is the code below what you want?
You can put the data into a dict and then sort it.
Solution 1
from io import StringIO
from collections import defaultdict
import csv
import random
import pprint

def random_price():
    return random.randint(1, 10000)

def create_test_data(n_row=4000, n_col=14, sep=','):
    columns = [chr(65 + i) for i in range(n_col)]  # A, B ...
    title = sep.join(columns)
    result_list = [title]
    for cur_row in range(n_row):
        result_list.append(sep.join([str(random_price()) for _ in range(n_col)]))
    return '\n'.join(result_list)

def main():
    if 'load CSV':
        test_content = create_test_data(n_row=10, n_col=5)
        dict_brand = defaultdict(list)
        with StringIO(test_content) as f:
            rows = csv.reader(f, delimiter=',')
            for idx, row in enumerate(rows):
                if idx == 0:  # title
                    columns = row
                    continue
                for i, value in enumerate(row):
                    dict_brand[columns[i]].append(int(value))
        pprint.pprint(dict_brand, indent=4, compact=True, width=120)

    user_choice = input('input columns (brand)')
    number_of_results = 5  # input('...')
    watch_columns = user_choice.split(' ')  # D E F
    for col_name in watch_columns:
        cur_brand_list = dict_brand[col_name]
        print(sorted(cur_brand_list, reverse=True)[:number_of_results])
        # print(f'{col_name} : {sorted(cur_brand_list)}')  # ASC
        # print(f'{col_name} : {sorted(cur_brand_list, reverse=True)}')  # DESC

if __name__ == '__main__':
    main()
defaultdict(<class 'list'>,
{ 'A': [9424, 6352, 5854, 5870, 912, 9664, 7280, 8306, 9508, 8230],
'B': [1539, 1559, 4461, 8039, 8541, 4540, 9447, 512, 7480, 5289],
'C': [7701, 6686, 1687, 3134, 5723, 6637, 6073, 1925, 4207, 9640],
'D': [4313, 3812, 157, 6674, 8264, 2636, 765, 2514, 9833, 1810],
'E': [139, 4462, 8005, 8560, 5710, 225, 5288, 6961, 6602, 4609]})
input columns (brand)C D
[9640, 7701, 6686, 6637, 6073]
[9833, 8264, 6674, 4313, 3812]
Solution 2: Using Pandas
import pandas as pd

def pandas_solution(test_content: str, watch_columns=['C', 'D'], number_of_results=5):
    with StringIO(test_content) as f:
        df = pd.read_csv(StringIO(f.read()), usecols=watch_columns,
                         na_filter=False)  # it can add performance (ignore na)
    dict_result = defaultdict(list)
    for col_name in watch_columns:
        dict_result[col_name].extend(
            df[col_name].sort_values(ascending=False).head(number_of_results).to_list())
    df = pd.DataFrame.from_dict(dict_result)
    print(df)
      C     D
0  9640  9833
1  7701  8264
2  6686  6674
3  6637  4313
4  6073  3812
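Since the question asks for the smallest values, note that both sketches above sort in descending order; switching to ascending order inside the same loops returns the least expensive entries instead:
print(sorted(cur_brand_list)[:number_of_results])           # plain-Python version, smallest first
print(df[col_name].sort_values().head(number_of_results))   # pandas version, smallest first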

Parsing Erlang data to Python dictionary

I have an Erlang script from which I would like to get some data and store it in a Python dictionary.
It is easy to parse the script to get a string like this:
{userdata,
    [{tags,
        [#dt{number=111},
         #mp{id='X23.W'}]},
     {log,
        'LG22'},
     {instruction,
        "String that can contain characters like -, _ or numbers"}
    ]
}.
desired result:
userdata = {"tags": {"dt": {"number": 111}, "mp": {"id": "X23.W"}},
"log": "LG22",
"instruction": "String that can contain characters like -, _ or numbers"}
# "#" mark for data in "tags" is not required in this structure.
# Also value for "tags" can be any iterable structure: tuple, list or dictionary.
But I am not sure how to transfer this data into a Python dictionary. My first idea was to use json.loads, but it requires many modifications (putting words into quotation marks, replacing "," with ":", and many more).
Moreover, the keys in userdata are not limited to a fixed pool. In this case, there are 'tags', 'log' and 'instruction', but there can be many more, e.g. 'slogan', 'ids', etc.
Also, I am not sure about the order. I assume that the keys can appear in random order.
My code (it is not working for id='X23.W' so I removed '.' from input):
import re
import json
in_ = """{userdata, [{tags, [#dt{number=111}, #mp{id='X23W'}]}, {log, 'LG22'}, {instruction, "String that can contain characters like -, _ or numbers"}]}"""
buff = in_.replace("{userdata, [", "")[:-2]
re_helper = re.compile(r"(#\w+)")
buff = re_helper.sub(r'\1:', buff)
partition = buff.partition("instruction")
section_to_replace = partition[0]
replacer = re.compile(r"(\w+)")
match = replacer.sub(r'"\1"', section_to_replace)
buff = ''.join([match, '"instruction"', partition[2]])
buff = buff.replace("#", "")
buff = buff.replace('",', '":')
buff = buff.replace("}, {", "}, \n{")
buff = buff.replace("=", ":")
buff = buff.replace("'", "")
temp = buff.split("\n")
userdata = {}
buff = temp[0][:-2]
buff = buff.replace("[", "{")
buff = buff.replace("]", "}")
userdata.update(json.loads(buff))
for i, v in enumerate(temp[1:]):
    v = v.strip()
    if v.endswith(","):
        v = v[:-1]
    userdata.update(json.loads(v))
print(userdata)
Output:
{'tags': {'dt': {'number': '111'}, 'mp': {'id': 'X23W'}}, 'instruction': 'String that can contain characters like -, _ or numbers', 'log': 'LG22'}
import json
import re
in_ = """{userdata, [{tags, [#dt{number=111}, #mp{id='X23.W'}]}, {log, 'LG22'}, {instruction, "String that can contain characters like -, _ or numbers"}]}"""
quoted_headers = re.sub(r"\{(\w+),", r'{"\1":', in_)
changed_hashed_list_to_dict = re.sub(r"\[(#.*?)\]", r'{\1}', quoted_headers)
hashed_variables = re.sub(r'#(\w+)', r'"\1":', changed_hashed_list_to_dict)
equality_signs_replaced_and_quoted = re.sub(r'{(\w+)=', r'{"\1":', hashed_variables)
single_quotes_replaced = equality_signs_replaced_and_quoted.replace('\'', '"')
result = json.loads(single_quotes_replaced)
print(result)
Produces:
{'userdata': [{'tags': {'dt': {'number': 111}, 'mp': {'id': 'X23.W'}}}, {'log': 'LG22'}, {'instruction': 'String that can contain characters like -, _ or numbers'}]}
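If you want the flat dictionary shape shown in the question rather than a list of one-key dicts, a small post-processing step over that output gets you there:
# collapse [{'tags': ...}, {'log': ...}, ...] into a single dict
userdata = {k: v for entry in result['userdata'] for k, v in entry.items()}
print(userdata['log'])  # 'LG22'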

BeautifulSoup fill missing information with "NA" in csv

I am working on a web scraper that creates a .csv file of all chemicals on the Sigma-Aldrich website. The .csv file would have the chemical name followed by variables such as product number, CAS number, molecular weight, and chemical formula: one chemical plus its info per row.
The issue I'm having is that not all chemicals have all their fields, many only have product and cas numbers. This results in my .csv file being offset and chemical rows having incorrect info associated with another chemical.
To right this wrong, I want to add 'N/A' if the field is empty.
Here is my scraping method:
def scraap(urlLi):
    for url in urlLi:
        content = requests.get(url).content
        soup = BeautifulSoup(content, 'lxml')
        containers = soup.find_all('div', {'class': 'productContainer-inner'})
        for c in containers:
            sub = c.find_all('div', {'class': 'productContainer-inner-content'})
            names = c.find_all('div', {'class': 'searchResultSubstanceBlock clearfix'})
            for n in names:
                hope = n.find("h2").text
                print(hope)
                nombres.append(hope.encode('utf-8'))
            for s in sub:
                info = s.find_all('ul', {'class': 'nonSynonymProperties'})
                proNum = s.find_all('div', {'class': 'product-listing-outer'})
                for p in proNum:
                    ping = p.find_all('div', {'class': 'row clearfix'})
                    for po in ping:
                        pro = p.find_all('li', {'class': 'productNumberValue'})
                        pnPp = []
                        for pri in pro:
                            potus = pri.get_text()
                            pnPp.append(potus.encode('utf-8'))
                        ProductNumber.append(pnPp)
                        print(pnPp)
                for i in info:
                    c = 1
                    for gling in i:
                        print(gling.get_text())
                        if c == 1:
                            formu.append(gling.get_text().encode('utf-8'))
                        elif c == 2:
                            molWei.append(gling.get_text().encode('utf-8'))
                        else:
                            casNum.append(gling.get_text().encode('utf-8'))
                        c += 1
                    c == 1
                print("---")
here is my writing method:
def pipeUp():
    with open('sigma_pipe_out.csv', mode='wb') as csv_file:
        fieldnames = ['chem_name', 'productNum', 'formula', 'molWei', 'casNum']
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        # writer.writeheader()
        # csv_file.write(' '.join(fieldnames))
        for n, p, f, w, c in zip(nombres, ProductNumber, formu, molWei, casNum):
            # writer.writerow([n, p, f, w, c])
            writer.writerow({'chem_name': n, 'productNum': p, 'formula': f, 'molWei': w, 'casNum': c})
The issue arises in the "for i in info:" section. The formu, molWei, and casNum lists get out of sync.
How can I add "N/A" if formu and molWei are missing information?
I'm assuming get_text() returns an empty string if there's no information on the formula and molecular weight etc. In that case you can just add:
if not molWei:
    molWei = "N/A"
Which updates molWei to be N/A if the string is empty.
You cannot rely on the index for value checking (if c == 1:); check the string content before adding to the list.
replace:
for i in info:
....
....
print("---")
with:
rowNames = ['formu', 'molWei', 'casNum']
for li in info[0].find_all('li'):
    textVal = li.text.encode('utf-8')
    # print(textVal)
    if b'Formula' in textVal:
        formu.append(textVal)
        rowNames.remove('formu')
    elif b'Molecular' in textVal:
        molWei.append(textVal)
        rowNames.remove('molWei')
    else:
        casNum.append(textVal)
        rowNames.remove('casNum')
# add any missing fields here
if len(rowNames) > 0:
    for item in rowNames:
        globals()[item].append('NA')
print("---")

Search of file is returning none even though value is in the file

I have a section of code that should search through a file to see if a search phrase is contained in it and then return the data assigned to it; however, it always returns None even though the value is in the file, and I cannot see why it fails.
r Goblin500 IspSUjBIQ/LJ0k18VbKIO6mS1oo gorgBf6uW8d6we7ARt8aA6kgiV4 2014-08-12 06:11:58 82.26.108.68 9001 9030
s Fast HSDir Running V2Dir Valid
v Tor 0.2.4.23
w Bandwidth=21
p reject 1-65535
This is the part of the file I want to read.
This is how I am trying to find the value:
def getRouter(nm):
    for r in router.itervalues():
        if r['nick'] == nm:
            return r
    return None
print getRouter("Goblin500")
and this is how the contents of the file are parsed into a dict:
# Parse the consensus into a dict
for l in consensus_txt.splitlines():
    q = l.strip().split(" ")
    if q[0] == 'r':  # router descriptor
        rfmt = ['nick', 'identity', 'digest', 'pubdate', 'pubtime', 'ip', 'orport', 'dirport']
        data = dict(zip(rfmt, q[1:]))
        idt = data['identity']
        idt += "=" * (4 - len(idt) % 4)  # pad b64 string
        ident = data['identity'] = base64.standard_b64decode(idt)
        data['identityhash'] = binascii.hexlify(ident)
        data['identityb32'] = base64.b32encode(ident).lower()
        router[ident] = data
        curRouter = ident
    if q[0] == 's':  # flags description - add to tally totals too
        router[curRouter]['flags'] = q[1:]
        for w in q[1:]:
            if flags.has_key(w):
                flags[w] += 1
            else:
                flags[w] = 1
            total += 1
    if q[0] == 'v':
        router[curRouter]['version'] = ' '.join(q[1:])
What have I missed?
Thanks
You have an error in your original string parsing. This prevents you from matching the values later. Sample code that proves the parsing is wrong:
q = 'r Goblin500 IspSUjBIQ/LJ0k18VbKIO6mS1oo gorgBf6uW8d6we7ARt8aA6kgiV4 2014-08-12 06:11:58 82.26.108.68 9001 9030'
rfmt = ['nick', 'identity', 'digest', 'pubdate', 'pubtime', 'ip', 'orport', 'dirport']
data = dict(zip(rfmt, q[1:]))
print(data)
# {'pubdate': 'b', 'dirport': '5', 'ip': 'i', 'orport': 'n', 'nick': ' ', 'identity': 'G', 'digest': 'o', 'pubtime': 'l'}
print(data['nick'])
# prints out a single space
Basically, when q is still the raw string, the q[1:] portion of the zip statement pairs the field names with individual characters rather than whitespace-separated fields. I think what you want is q.split()[1:] instead. This will split the string on spaces, convert it to a list, and then ignore the first element.
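In other words, a corrected version of the sample above (splitting the raw line before zipping) recovers the expected fields:
line = 'r Goblin500 IspSUjBIQ/LJ0k18VbKIO6mS1oo gorgBf6uW8d6we7ARt8aA6kgiV4 2014-08-12 06:11:58 82.26.108.68 9001 9030'
rfmt = ['nick', 'identity', 'digest', 'pubdate', 'pubtime', 'ip', 'orport', 'dirport']
data = dict(zip(rfmt, line.split()[1:]))
print(data['nick'])
# 'Goblin500'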
