I know there are similar questions, and I have been trying to follow them. I am trying to scrape the info on this page. Ideally I would like as much of the info as possible in a clean, easy-to-read TSV, but the essential fields to scrape are: ID, Name, Organism, Family, Classification, UniProt ID, Modifications, Sequence and PDB structure IDs (e.g. on this page there is a list of PDB structures, the first being 1BAS and the last 4OEG).
I wrote this in Python 3:
import urllib.request
import sys
import pandas as pd
import bs4

out = open('pdb.parsed.txt', 'a')

for i in range(1000, 1005):
    # try:
    url = 'http://isyslab.info/StraPep/show_detail.php?id=BP' + str(i)
    page = urllib.request.urlopen(url)
    soup = pd.read_html(page)
    print(soup)
I have attached my output here:
I have two questions:
You can see that some of the info that I require is missing (e.g. the sequence has NaN).
More importantly, I cannot see any field that corresponds to the list of PDB IDs.
I was hoping to use pd.read_html if possible, because I have struggled with urllib/bs4 in the past and have had more success with pd.read_html in recent scraping attempts. Can anyone explain how I could pull out the fields that I need?
I believe you were unable to scrape entries from certain rows, such as the 'Sequence' row, because those rows are populated by JavaScript. The approach that worked for me was to use Selenium with a Firefox driver to grab the page's HTML, and then Beautiful Soup to parse that HTML.
Here's how I was able to scrape the pertinent info for the ID, Name, Organism, Family, Classification, UniProt ID, Modifications, Sequence and PDB structure IDs, for each page:
import urllib.request
import sys
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
import csv

pages = []

for page in range(1000, 1005):
    # try:
    info_dict = {}
    url = 'http://isyslab.info/StraPep/show_detail.php?id=BP' + str(page)
    driver = webdriver.Firefox()
    driver.get(url)
    html = driver.page_source
    bs = BeautifulSoup(html, 'html.parser')
    main_table = bs.find('table', attrs={'class': 'main_table'})
    rows = main_table.findAll('tr')
    for row in rows:
        try:  # We only want rows from a page where both row title and text are not null
            row_header = row.find('th').text
            row_text = row.find('td').text
        except:
            pass
        else:
            if row_header and row_text:
                if row_header in ['ID', 'Name', 'Organism', 'Family', 'Classification', 'UniProt ID']:
                    info_dict[row_header] = row_text
                elif row_header == 'Modification':
                    try:  # Some pages have a null table entry for 'Modification'
                        mod_text = row.find('table').find('td').text
                    except:
                        pass
                    else:
                        if mod_text:
                            info_dict[row_header] = mod_text
                        else:
                            info_dict[row_header] = 'NA'
                # Pass 'Sequence' and 'Structure' as space separated strings
                elif row_header == 'Sequence':
                    seqs = ''
                    for i in row_text.split():
                        seqs += ' ' + i
                    info_dict[row_header] = seqs[1:]
                elif row_header == 'Structure':
                    pdb_ids = ''
                    a = row.find('tbody').find_all('a')
                    for i in a:
                        if i.text != '[x]': pdb_ids += ' ' + i.text
                    info_dict[row_header] = pdb_ids[1:]
    pages.append(info_dict)

keys = pages[0].keys()

with open('pdb.parsed.txt', 'a') as output_file:
    writer = csv.DictWriter(output_file, keys, delimiter='\t')
    writer.writeheader()
    writer.writerows(pages)  # Add a tab-delimited row for each page we scraped
I can then read in the .tsv file I just created as a dataframe if I want:
df = pd.read_csv('pdb.parsed.txt', delimiter='\t')
It looks like this:
Although the contents of columns containing longer strings (such as 'Sequence') are abbreviated, we can verify that the entire sequence is indeed present:
df.iloc[0]['Sequence']
'PALPEDGGSG AFPPGHFKDP KRLYCKNGGF FLRIHPDGRV DGVREKSDPH IKLQLQAEER GVVSIKGVCA NRYLAMKEDG RLLASKCVTD ECFFFERLES NNYNTYRSRK YTSWYVALKR TGQYKLGSKT GPGQKAILFL PMSAKS'
The contents of the saved tsv file look like this:
ID Name Organism Family Classification UniProt ID Modification Sequence Structure
BP1000 Fibroblast growth factor 2 Homo sapiens heparin-binding growth factors family Cytokine/Growth factor FGF2_HUMAN Phosphotyrosine; by TEC PALPEDGGSG AFPPGHFKDP KRLYCKNGGF FLRIHPDGRV DGVREKSDPH IKLQLQAEER GVVSIKGVCA NRYLAMKEDG RLLASKCVTD ECFFFERLES NNYNTYRSRK YTSWYVALKR TGQYKLGSKT GPGQKAILFL PMSAKS 1BAS 1BFB 1BFC 1BFF 1BFG 1BLA 1BLD 1CVS 1EV2 1FGA 1FQ9 1II4 1IIL 2BFH 2FGF 2M49 4FGF 4OEE 4OEF 4OEG
BP1001 Interleukin-2 Homo sapiens IL-2 family Cytokine/Growth factor IL2_HUMAN APTSSSTKKT QLQLEHLLLD LQMILNGINN YKNPKLTRML TFKFYMPKKA TELKHLQCLE EELKPLEEVL NLAQSKNFHL RPRDLISNIN VIVLELKGSE TTFMCEYADE TATIVEFLNR WITFCQSIIS TLT 1IRL 1M47 1M48 1M49 1M4A 1M4B 1M4C 1NBP 1PW6 1PY2 1QVN 1Z92 2B5I 2ERJ 3INK 3QAZ 3QB1 4NEJ 4NEM
BP1002 Insulin Bos taurus insulin family Hormone INS_BOVIN GIVEQCCASV CSLYQLENYC N 1APH 1BPH 1CPH 1DPH 1PID 2A3G 2BN1 2BN3 2INS 2ZP6 3W14 4BS3 4E7T 4E7U 4E7V 4I5Y 4I5Z 4IDW 4IHN 4M4F 4M4H 4M4I 4M4J 4M4L 4M4M
BP1003 Interleukin-1 beta Homo sapiens IL-1 family Cytokine/Growth factor IL1B_HUMAN APVRSLNCTL RDSQQKSLVM SGPYELKALH LQGQDMEQQV VFSMSFVQGE ESNDKIPVAL GLKEKNLYLS CVLKDDKPTL QLESVDPKNY PKKKMEKRFV FNKIEINNKL EFESAQFPNW YISTSQAENM PVFLGGTKGG QDITDFTMQF VSS 1HIB 1I1B 1IOB 1ITB 1L2H 1S0L 1T4Q 1TOO 1TP0 1TWE 1TWM 21BI 2I1B 2KH2 2NVH 31BI 3LTQ 3O4O 3POK 41BI 4DEP 4G6J 4G6M 4GAF 4GAI 4I1B 5BVP 5I1B 6I1B 7I1B 9ILB
BP1004 Lactoferricin-H Homo sapiens transferrin family Antimicrobial TRFL_HUMAN GRRRSVQWCA VSQPEATKCF QWQRNMRKVR GPPVSCIKRD SPIQCIQA 1Z6V 1XV4 1XV7 1Z6W 2GMC 2GMD
I used the following Anaconda commands to install Selenium, and then the Firefox driver:
conda install -c conda-forge selenium
conda install -c conda-forge geckodriver
I have several URLs which link to Hotel pages and I would like to scrape some data from it.
I'm using the following script, but I would like to update it:
data = []
for i in range(0, 10):
    url = final_list[i]
    driver2 = webdriver.Chrome()
    driver2.get(url)
    sleep(randint(10, 20))
    soup = BeautifulSoup(driver2.page_source, 'html.parser')
    my_table2 = soup.find_all(class_=['title-2', 'rating-score body-3'])
    review = soup.find_all(class_='reviews')[-1]
    try:
        price = soup.find_all('span', attrs={'class': 'price'})[-1]
    except:
        price = soup.find_all('span', attrs={'class': 'price'})
    for tag in my_table2:
        data.append(tag.text.strip())
    for p in price:
        data.append(p)
    for r in review:
        data.append(r)
But here's the problem: tag.text.strip() scrapes rating numbers like here:
It splits each rating number into its own value, but hotels don't all have the same number of ratings. Here's a hotel with 7 ratings, while the default number is 8; some have seven ratings, others six, and so on. So in the end my dataframe is quite messed up: if a hotel doesn't have 8 ratings, the values get shifted.
My question is: how do I tell the script "if there is a value for this tag.text.strip(i), append it, but if there isn't, append None", and of course do that for each of the eight values?
I tried several things like :
for tag in my_table2:
    for i in tag.text.strip()[i]:
        if i:
            data.append(i)
        else:
            data.append(None)
But unfortunately that goes nowhere, so if you could help me figure out the answer, it would be awesome :)
In case it helps, here is a link to a hotel that I'm scraping:
https://www.hostelworld.com/pwa/hosteldetails.php/Itaca-Hostel/Barcelona/1279?from=2020-11-21&to=2020-11-22&guests=1
The rating numbers are at the end of the page.
Thank you.
A few suggestions:
Put your data in a dictionary. That way you don't have to assume that all tags are present, and the order of the tags doesn't matter. You can get the labels and the corresponding ratings with
rating_labels = soup.find_all(class_=['rating-label body-3'])
rating_scores = soup.find_all(class_=['rating-score body-3'])
and then iterate over both lists with zip
Move your driver outside of the loop; opening it once is enough.
Don't use a fixed sleep; use Selenium's wait functions instead. You can wait for a particular element to be present or populated with WebDriverWait(driver, 10).until(EC.presence_of_element_located(your_element)); see the sketch after this list and the docs:
https://selenium-python.readthedocs.io/waits.html
Cache your scraped HTML code to a file. It's faster for you and politer to the website you are scraping.
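To illustrate the explicit wait, here is a minimal sketch; the CSS selector '.rating-score' is only an assumed example of an element to wait for, not something taken from your page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.hostelworld.com/pwa/hosteldetails.php/Itaca-Hostel/Barcelona/1279?from=2020-11-21&to=2020-11-22&guests=1')
# Block for up to 10 seconds until the element appears, instead of sleeping a fixed time
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.rating-score'))
)
source = driver.page_source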
import selenium
import selenium.webdriver
import time
import random
import os
from bs4 import BeautifulSoup

data = []
final_list = [
    'https://www.hostelworld.com/pwa/hosteldetails.php/Itaca-Hostel/Barcelona/1279?from=2020-11-21&to=2020-11-22&guests=1',
    'https://www.hostelworld.com/pwa/hosteldetails.php/Be-Ramblas-Hostel/Barcelona/435?from=2020-11-27&to=2020-11-28&guests=1'
]

# load your driver only once to save time
driver = selenium.webdriver.Chrome()

for url in final_list:
    data.append({})

    # cache the HTML code to the filesystem
    # generate a filename from the URL where all non-alphanumeric characters (e.g. :/) are replaced with underscores _
    filename = ''.join([s if s.isalnum() else '_' for s in url])
    if not os.path.isfile(filename):
        driver.get(url)
        # better use selenium's wait functions here
        time.sleep(random.randint(10, 20))
        source = driver.page_source
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(source)
    else:
        with open(filename, 'r', encoding='utf-8') as f:
            source = f.read()

    soup = BeautifulSoup(source, 'html.parser')
    review = soup.find_all(class_='reviews')[-1]
    try:
        price = soup.find_all('span', attrs={'class': 'price'})[-1]
    except:
        price = soup.find_all('span', attrs={'class': 'price'})

    data[-1]['name'] = soup.find_all(class_=['title-2'])[0].text.strip()

    rating_labels = soup.find_all(class_=['rating-label body-3'])
    rating_scores = soup.find_all(class_=['rating-score body-3'])
    assert len(rating_labels) == len(rating_scores)
    for label, score in zip(rating_labels, rating_scores):
        data[-1][label.text.strip()] = score.text.strip()

    data[-1]['price'] = price.text.strip()
    data[-1]['review'] = review.text.strip()
The data can then be easily put in a nicely formatted table using Pandas
import pandas as pd
df = pd.DataFrame(data)
df
If some data is missing or incomplete, Pandas will fill it in with NaN:
data.append(data[0].copy())
del(data[-1]['Staff'])
data[-1]['name'] = 'Incomplete Hostel'
pd.DataFrame(data)
I need to scrape information regarding "Elenco dei comuni per regione" (the list of comuni by region) on Wikipedia. I would like to create an array that allows me to associate each comune with the corresponding region, i.e. something like this:
'Abbateggio': 'Pescara' -> Abruzzo
I tried to get information using BeautifulSoup and requests as follows:
from bs4 import BeautifulSoup as bs
import requests

with requests.Session() as s:  # use session object for efficiency of tcp re-use
    s.headers = {'User-Agent': 'Mozilla/5.0'}
    r = s.get('https://it.wikipedia.org/wiki/Comuni_d%27Italia')
    soup = bs(r.text, 'html.parser')
    for ele in soup.find_all('h3')[:6]:
        tx = bs(str(ele), 'html.parser').find('span', attrs={'class': "mw-headline"})
        if tx is not None:
            print(tx['id'])
However, it does not work (it returns an empty list).
The information that I have looked at using Google Chrome's Inspect tool is the following:
<span class="mw-headline" id="Elenco_dei_comuni_per_regione">Elenco dei comuni per regione</span> (table)
Comuni dell'Abruzzo
(this field should change for each region)
then <table class="wikitable sortable query-tablesortes">
Could you please give me advice on how to get such results?
Any help and suggestion will be appreciated.
EDIT:
Example:
I have a word: comunediabbateggio. This word includes Abbateggio. I would like to know which region can be associated with that city, if it exists.
The information from Wikipedia is needed to create a dataset that allows me to check the field and associate a region with each comune/city.
What I should expect is:
WORD REGION/STATE
comunediabbateggio Pescara
I hope this can help you. Sorry if it was not clear.
Another example, which might be easier to understand for English speakers, is the following:
Instead of the Italian link above, you can also consider https://en.wikipedia.org/wiki/List_of_comuni_of_Italy . For each region (Lombardia, Veneto, Sicily, ...) I would need to collect information about the list of communes of its provinces.
If you click one of the "List of communes of ..." links, there is a table that lists the comuni, e.g. https://en.wikipedia.org/wiki/List_of_communes_of_the_Province_of_Agrigento.
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm

target = "https://en.wikipedia.org/wiki/List_of_comuni_of_Italy"


def main(url):
    with requests.Session() as req:
        r = req.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        # Section titles whose table-of-contents number looks like "N.N"
        provinces = [item.find_next("span").text for item in soup.findAll(
            "span", class_="tocnumber", text=re.compile(r"\d[.]\d"))]
        # The matching heading ids use underscores instead of spaces
        search = [item.replace(
            " ", "_") if " " in item else item for item in provinces]
        nested = []
        for item in search:
            for a in soup.findAll("span", id=item):
                # Link texts such as "List of communes of the Province of Agrigento" -> "Agrigento"
                goes = [b.text.split("of ")[-1]
                        for b in a.find_next("ul").findAll("a")]
                nested.append(goes)
        dictionary = dict(zip(provinces, nested))
        urls = [f'{url[:24]}{b.get("href")}' for item in search for a in soup.findAll(
            "span", id=item) for b in a.find_next("ul").findAll("a")]
        return urls, dictionary


def parser():
    links, dics = main(target)
    com = []
    for link in tqdm(links):
        try:
            # Second column of the first table on each "List of communes of ..." page
            df = pd.read_html(link)[0]
            com.append(df[df.columns[1]].to_list()[:-1])
        except ValueError:
            com.append(["N/A"])
    com = iter(com)
    for x in dics:
        b = dics[x]
        dics[x] = dict(zip(b, com))
    print(dics)


parser()
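The question ultimately asks for a comune -> region lookup. Assuming the outer keys of dics are the regions and the inner keys are the provinces (as the ToC-based scraping above suggests), and assuming parser() is changed to return dics instead of printing it, a sketch of inverting the nested dictionary could look like this:

dics = parser()  # requires parser() to end with `return dics` rather than `print(dics)`

# Invert the nested dict so each comune name maps to its (province, region) pair
comune_to_region = {}
for region, provinces_dict in dics.items():
    for province, comuni in provinces_dict.items():
        for comune in comuni:
            comune_to_region[comune] = (province, region)

print(comune_to_region.get('Abbateggio'))  # e.g. ('Pescara', 'Abruzzo'), if present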
I'm trying to scrape information about the datasets available on this website.
I want to collect the URLs to the resources and at least the title of the dataset.
Using this resource as an example, I want to capture the URL embedded in "Go to resource" and the title listed in the table:
I have created a basic scraper, but it doesn't seem to work:
import requests
import csv
from bs4 import BeautifulSoup

site = requests.get('https://data.nsw.gov.au/data/dataset');
data_list = []
if site.status_code is 200:
    content = BeautifulSoup(site.content, 'html.parser')
    internals = content.select('.resource-url-analytics')
    for url in internals:
        title = internals.select=('.resource-url-analytics')[0].get_text()
        link = internals.select=('.resource-url-analytics')[0].get('href')
        new_data = {"title": title, "link": link}
        data_list.append(new_data)

with open('selector.csv', 'w') as file:
    writer = csv.DictWriter(file, fieldnames=["dataset", "link"], delimiter=';')
    writer.writeheader()
    for row in data_list:
        writer.writerow(row)
I would like to write the output to a CSV with columns for the URLs and the titles.
This is an example of the desired output
Greatly appreciative for any assistance
Have a look at the API for the datasets; that will likely be the easiest way to do this.
In the meantime, here is how you can get the API links at the id level from those pages and store the entire package info for all packages in one list, data_sets, with just the info of interest in another variable (results). Be sure to review the API documentation in case there is a better method; for example, it would be nice if ids could be submitted in batches rather than per id.
The answer below takes advantage of the package_show endpoint detailed in the documentation, which is used to get a full JSON representation of a dataset, resource or other object.
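As a quick illustration, here is a minimal sketch of calling that endpoint directly for a single package; the package id and the exact base URL are assumptions (in practice you would use the API links scraped below):

import requests

# Hypothetical package id; real ids come from the scraped package_show links
api_url = 'https://data.nsw.gov.au/data/api/3/action/package_show?id=some-package-id'
r = requests.get(api_url).json()
if r.get('success'):
    result = r['result']
    print(result['title'])
    # Each resource entry carries its own URL
    for resource in result.get('resources', []):
        print(resource['url'])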
Taking the current first result on landing page of:
Vegetation of the Guyra 1:25000 map sheet VIS_ID 240.
We want the last a child of each h3 that sits under an element with class .dataset-item. In the selector below, the spaces between the parts are descendant combinators.
.dataset-item h3 a:last-child
You can shorten this to h3 a:last-child for a small efficiency gain.
This relationship reliably selects all relevant links on the page.
Continuing with this example and visiting the URL retrieved for the first listed item, we can find the link to the API endpoint (which retrieves the JSON for this package) via an attribute = value CSS selector with the contains (*) operator. We know this particular API endpoint contains a common string, so we substring-match on the href attribute value:
[href*="/api/3/action/package_show?id="]
The domain can vary, and some retrieved links are relative, so we have to test whether a link is relative and, if so, add the appropriate domain (see the sketch below).
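As a minimal sketch of those two selectors in combination (urljoin is used here to handle both relative and absolute hrefs; the full code below builds the base from urlparse instead):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

listing_url = 'https://data.nsw.gov.au/data/dataset'
soup = BeautifulSoup(requests.get(listing_url).content, 'lxml')

# Links to the individual dataset pages
dataset_links = [urljoin(listing_url, a['href'])
                 for a in soup.select('.dataset-item h3 a:last-child')]

# On one dataset page, find the package_show API link by substring-matching the href
page = BeautifulSoup(requests.get(dataset_links[0]).content, 'lxml')
api_links = [urljoin(dataset_links[0], a['href'])
             for a in page.select('[href*="/api/3/action/package_show?id="]')]
print(api_links)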
Notes:
data_sets is a list containing all the package data for each package and is extensive. I did this in case you are interested in looking at what is in those packages (besides reviewing the API documentation).
You can get the total number of pages from the soup object on a page via
num_pages = int(soup.select('[href^="/data/dataset?page="]')[-2].text)
You can alter the loop to cover fewer pages.
A Session object is used for the efficiency of re-using the connection. I'm sure there are other improvements to be made; in particular, I would look for any method that reduces the number of requests (which is why I mentioned looking for a batch id endpoint, for example).
A returned package can contain anywhere from zero to several resource URLs. See the example here. You can edit the code to handle this.
Python:
from bs4 import BeautifulSoup as bs
import requests
import csv
from urllib.parse import urlparse

json_api_links = []
data_sets = []

def get_links(s, url, css_selector):
    r = s.get(url)
    soup = bs(r.content, 'lxml')
    base = '{uri.scheme}://{uri.netloc}'.format(uri=urlparse(url))
    links = [base + item['href'] if item['href'][0] == '/' else item['href'] for item in soup.select(css_selector)]
    return links

results = []
#debug = []

with requests.Session() as s:
    for page in range(1, 2):  #you decide how many pages to loop
        links = get_links(s, 'https://data.nsw.gov.au/data/dataset?page={}'.format(page), '.dataset-item h3 a:last-child')
        for link in links:
            data = get_links(s, link, '[href*="/api/3/action/package_show?id="]')
            json_api_links.append(data)
            #debug.append((link, data))

    resources = list(set([item.replace('opendata','') for sublist in json_api_links for item in sublist]))  #can just leave as set

    for link in resources:
        try:
            r = s.get(link).json()  #entire package info
            data_sets.append(r)
            title = r['result']['title']  #certain items
            if 'resources' in r['result']:
                urls = ' , '.join([item['url'] for item in r['result']['resources']])
            else:
                urls = 'N/A'
        except:
            title = 'N/A'
            urls = 'N/A'
        results.append((title, urls))

with open('data.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['Title', 'Resource Url'])
    for row in results:
        w.writerow(row)
All pages
(very long running so consider threading/asyncio):
from bs4 import BeautifulSoup as bs
import requests
import csv
from urllib.parse import urlparse

json_api_links = []
data_sets = []

def get_links(s, url, css_selector):
    r = s.get(url)
    soup = bs(r.content, 'lxml')
    base = '{uri.scheme}://{uri.netloc}'.format(uri=urlparse(url))
    links = [base + item['href'] if item['href'][0] == '/' else item['href'] for item in soup.select(css_selector)]
    return links

results = []
#debug = []

with requests.Session() as s:
    r = s.get('https://data.nsw.gov.au/data/dataset')
    soup = bs(r.content, 'lxml')
    num_pages = int(soup.select('[href^="/data/dataset?page="]')[-2].text)
    links = [item['href'] for item in soup.select('.dataset-item h3 a:last-child')]

    for link in links:
        data = get_links(s, link, '[href*="/api/3/action/package_show?id="]')
        json_api_links.append(data)
        #debug.append((link, data))

    if num_pages > 1:
        for page in range(1, num_pages + 1):  #you decide how many pages to loop
            links = get_links(s, 'https://data.nsw.gov.au/data/dataset?page={}'.format(page), '.dataset-item h3 a:last-child')
            for link in links:
                data = get_links(s, link, '[href*="/api/3/action/package_show?id="]')
                json_api_links.append(data)
                #debug.append((link, data))

    resources = list(set([item.replace('opendata','') for sublist in json_api_links for item in sublist]))  #can just leave as set

    for link in resources:
        try:
            r = s.get(link).json()  #entire package info
            data_sets.append(r)
            title = r['result']['title']  #certain items
            if 'resources' in r['result']:
                urls = ' , '.join([item['url'] for item in r['result']['resources']])
            else:
                urls = 'N/A'
        except:
            title = 'N/A'
            urls = 'N/A'
        results.append((title, urls))

with open('data.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['Title', 'Resource Url'])
    for row in results:
        w.writerow(row)
For simplicity use selenium package:
from selenium import webdriver
import os

# initialise browser
browser = webdriver.Chrome(os.getcwd() + '/chromedriver')
browser.get('https://data.nsw.gov.au/data/dataset')

# find all elements by xpath
get_elements = browser.find_elements_by_xpath('//*[@id="content"]/div/div/section/div/ul/li/div/h3/a[2]')

# collect data
data = []
for item in get_elements:
    data.append((item.text, item.get_attribute('href')))
Output:
('Vegetation of the Guyra 1:25000 map sheet VIS_ID 240', 'https://datasets.seed.nsw.gov.au/dataset/vegetation-of-the-guyra-1-25000-map-sheet-vis_id-2401ee52')
('State Vegetation Type Map: Riverina Region Version v1.2 - VIS_ID 4469', 'https://datasets.seed.nsw.gov.au/dataset/riverina-regional-native-vegetation-map-version-v1-0-vis_id-4449')
('Temperate Highland Peat Swamps on Sandstone (THPSS) spatial distribution maps...', 'https://datasets.seed.nsw.gov.au/dataset/temperate-highland-peat-swamps-on-sandstone-thpss-vegetation-maps-vis-ids-4480-to-4485')
('Environmental Planning Instrument - Flood', 'https://www.planningportal.nsw.gov.au/opendata/dataset/epi-flood')
and so on
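Since the question asks for a CSV, here is a minimal sketch of writing those (title, href) pairs out with the built-in csv module (the filename 'datasets.csv' is an arbitrary placeholder):

import csv

with open('datasets.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Resource Url'])
    writer.writerows(data)  # one row per (text, href) tuple collected above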
I'm trying to export the results of this code to a CSV file. I copied 2 of the results below, after the code. There are 14 items for each stock, and I'd like to write a CSV file with a column for each of the 14 items and one row for each stock.
import requests
from bs4 import BeautifulSoup

base_url = 'http://finviz.com/screener.ashx?v=152&s=ta_topgainers&o=price&c=0,1,2,3,4,5,6,7,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
main_div = soup.find('div', attrs={'id': 'screener-content'})

table = main_div.find('table')
sub = table.findAll('tr')
rows = sub[5].findAll('td')
for row in rows:
    link = row.a
    if link is not None:
        print(link.get_text())
This is the format of the results, 14 items/columns for each stock.
PTN
Palatin Technologies, Inc.
Healthcare
Diagnostic Substances
USA
240.46M
9.22
193.43M
2.23M
0.76
1.19
7.21%
1,703,285
3
LKM
Link Motion Inc.
Technology
Application Software
China
128.95M
-
50.40M
616.76K
1.73
1.30
16.07%
1,068,798
4
I tried this, but couldn't get it to work:
TextWriter x = File.OpenWrite ("my.csv", ....);
x.WriteLine("Column1,Column2"); // header
x.WriteLine(coups.Cells[0].Text + "," + coups.Cells[1].Text);
This should work:
import requests
from bs4 import BeautifulSoup

base_url = 'http://finviz.com/screener.ashx?v=152&s=ta_topgainers&o=price&c=0,1,2,3,4,5,6,7,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
main_div = soup.find('div', attrs={'id': 'screener-content'})

light_rows = main_div.find_all('tr', class_="table-light-row-cp")
dark_rows = main_div.find_all('tr', class_="table-dark-row-cp")

data = []
for rows_set in (light_rows, dark_rows):
    for row in rows_set:
        row_data = []
        for cell in row.find_all('td'):
            val = cell.a.get_text()
            row_data.append(val)
        data.append(row_data)

# sort rows to maintain original order
data.sort(key=lambda x: int(x[0]))

import pandas
pandas.DataFrame(data).to_csv("AAA.csv", header=False)
A few things here:
I used "table-[light|dark]-row-cp" because all rows of interest had one of those classes (and no other rows had them).
There are two separate parts: one fetches the data in the correct structure, the other writes the CSV file.
I used the pandas CSV writer because I'm familiar with it, but once you have rectangular data (named "data" here) you may use any other CSV writer.
Avoid giving variables generic, easily confused names like 'sub' or 'link' : )
Hope that helps.
Why don't you use the built-in csv.writer?
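For example, a minimal sketch that reuses the rectangular data list built above and writes it with the standard library instead of pandas (the filename and the commented-out header names are placeholders):

import csv

with open('AAA.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    # writer.writerow(['Ticker', 'Company', ...])  # optional header row of the 14 column names
    writer.writerows(data)  # one row per stock, one column per scraped cell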
I have an excel file with many Chinese names in the first row like this:
And what I am doing is scraping some more Chinese names from a web table; the names are all in the 2nd column (td) of each row (tr). I want to check whether each scraped name is already in my Excel file, so I use a boolean have to keep track: it should be True if the name is found. I also want to know the exact position (column number) of a found name, so I use name_position to keep track of that.
from lxml import html
from bs4 import BeautifulSoup
import requests
import openpyxl
from openpyxl.workbook import Workbook

wb = openpyxl.load_workbook('hehe.xlsx')
ws1 = wb.get_sheet_by_name('Taocan')

page = requests.get(url)
tree = html.fromstring(page.text)
web = page.text
soup = BeautifulSoup(web, 'lxml')
table = soup.find('table', {'class': "tc_table"})
trs = table.find_all('tr')
for tr in trs:
    ls = []
    for td in tr.find_all('td'):
        ls.append(td.text)
    ls = [x.encode('utf-8') for x in ls]
    try:
        name = ls[1]
        have = False
        name_position = 1
        for cell in ws1[1]:
            if name == cell:
                have = True
                break
            else:
                name_position += 1
    except IndexError:
        print("there is an index error")
However, my code doesn't seem to work, and I think the problem comes from the comparison of the names:
if name == cell
I changed to:
if name == cell.value
it still doesn't work.
Can anyone help me with this? thanks/:
Just to add on: the web page I'm scraping is also in Chinese. So when I
print(ls)
it gives a list like this
['1', '\xe4\xb8\x80\xe8\x88\xac\xe6\xa3\x80\xe6\x9f\xa5', '\xe8\xba\xab\xe9\xab\x98\xe3\x80\x81\xe4\xbd\x93\xe9\x87\x8d\xe3\x80\x81\xe4\xbd\x93\xe9\x87\x8d\xe6\x8c\x87\xe6\x95\xb0\xe3\x80\x81\xe8\x85\xb0\xe5\x9b\xb4\xe3\x80\x81\xe8\x88\x92\xe5\xbc\xa0\xe5\x8e\x8b\xe3\x80\x81\xe6\x94\xb6\xe7\xbc\xa9\xe5\x8e\x8b\xe3\x80\x81\xe8\xa1\x80\xe5\x8e\x8b\xe6\x8c\x87\xe6\x95\xb0', '\xe9\x80\x9a\xe8\xbf\x87\xe4\xbb\xaa\xe5\x99\xa8\xe6\xb5\x8b\xe9\x87\x8f\xe4\xba\xba\xe4\xbd\x93\xe8\xba\xab\xe9\xab\x98\xe3\x80\x81\xe4\xbd\x93\xe9\x87\x8d\xe3\x80\x81\xe4\xbd\x93\xe8\x84\x82\xe8\x82\xaa\xe7\x8e\x87\xe5\x8f\x8a\xe8\xa1\x80\xe5\x8e\x8b\xef\xbc\x8c\xe7\xa7\x91\xe5\xad\xa6\xe5\x88\xa4\xe6\x96\xad\xe4\xbd\x93\xe9\x87\x8d\xe6\x98\xaf\xe5\x90\xa6\xe6\xa0\x87\xe5\x87\x86\xe3\x80\x81\xe8\xa1\x80\xe5\x8e\x8b\xe6\x98\xaf\xe5\x90\xa6\xe6\xad\xa3\xe5\xb8\xb8\xe3\x80\x81\xe4\xbd\x93\xe8\x84\x82\xe8\x82\xaa\xe6\x98\xaf\xe5\x90\xa6\xe8\xb6\x85\xe6\xa0\x87\xe3\x80\x82']
but if I
print(ls[1])
it gives a Chinese name like "广州".