I need to extract the second column of every table on this page:
https://zh.wikipedia.org/wiki/上海证券交易所上市公司列表
I don't need the last three tables, though.
However, my code only extracts the second column of the first table.
import bs4 as bs
import pickle
import requests

def save_china_tickers():
    resp = requests.get('https://zh.wikipedia.org/wiki/上海证券交易所上市公司列表')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', {'class': 'wikitable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[1].text
        tickers.append(ticker)
    with open('chinatickers.pickle', 'wb') as f:
        pickle.dump(tickers, f)
    return tickers

save_china_tickers()
I have an easy method:
Get the HTTP response
Find all tables using a regex
Parse each HTML table into a list of lists
Iterate over each list in the list
Requirements
dashtable
Code
from urllib.request import urlopen
from dashtable import html2data  # converts an HTML table into a list of lists
import re

url = "https://zh.wikipedia.org/wiki/%E4%B8%8A%E6%B5%B7%E8%AF%81%E5%88%B8%E4%BA%A4%E6%98%93%E6%89%80%E4%B8%8A%E5%B8%82%E5%85%AC%E5%8F%B8%E5%88%97%E8%A1%A8"

# Read the HTTP content
data = urlopen(url).read().decode()

# Now fetch all tables with the help of a regex
tables = ["<table>{}</table>".format(table) for table in re.findall(r"<table .*?>(.*?)</table>", data, re.M | re.S | re.I)]

# Parse the data; html2data returns a tuple whose 0th element is the list of lists
parsed_tables = [html2data(table)[0] for table in tables]

# Let's take the first table, i.e. 600000-600099
parsed = parsed_tables[0]

# Column names of the first table
print(parsed[0])

# Rows of the first table, 2nd column
for index in range(1, len(parsed)):
    print(parsed[index][1])

"""
Output: all the rows of table 1, column 2, excluding the header
"""
I'm extracting table data from this URL (https://www.tradingview.com/markets/stocks-india/market-movers-large-cap/). After making adjustments here and there, I finally got it to extract the data successfully, except for one problem: from the second row (just below the header row) down to the last row, the text in the first column (the ticker symbol and the company name) comes out merged. It turns out each piece of text has its own nested tag, one containing the ticker name and the other containing the company name, and both of those elements sit inside a single one of the row's <td> cells. So how do I split those two texts and move one of them into a new column named "company name"? That column doesn't exist at the link above, so I would also like to know how to create it and add the data to it.
Here's the code:
# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Create a URL object
url = 'https://www.tradingview.com/markets/stocks-india/market-movers-large-cap/'

# Create the page object
page = requests.get(url)

# Parse the page's HTML into a Python-friendly format with the lxml parser
soup = BeautifulSoup(page.text, 'lxml')

# Obtain information from the <table> tag
table1 = soup.find('table', class_='table-DR3mi0GH')

# Obtain every column title from the <th> tags
headers = []
for i in table1.find_all('th'):
    title = i.text
    headers.append(title)

# Create a dataframe
mydata = pd.DataFrame(columns=headers)

# Create a for loop to fill mydata
for j in table1.find_all('tr')[1:]:
    row_data = j.find_all('td')
    row = [i.text for i in row_data]
    length = len(mydata)
    mydata.loc[length] = row
Screenshot: the words in the ticker column appear merged. (By the way, I'm running this in the Python console in PyCharm.)
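A minimal sketch of one way to split them, assuming the first <td> of each data row contains exactly two text fragments (ticker, then company name); the "Company Name" column label is my own choice and does not exist on the site:

# Sketch: split the first cell into ticker and company name
# Assumes the first <td> holds two separate text fragments;
# the "Company Name" label is an assumption, not part of the page.
rows = []
for j in table1.find_all('tr')[1:]:
    cells = j.find_all('td')
    if not cells:
        continue
    first_cell_texts = list(cells[0].stripped_strings)  # e.g. ['TCS', 'Tata Consultancy Services']
    ticker = first_cell_texts[0] if first_cell_texts else ''
    company = first_cell_texts[1] if len(first_cell_texts) > 1 else ''
    rest = [c.text for c in cells[1:]]
    rows.append([ticker, company] + rest)

new_headers = [headers[0], 'Company Name'] + headers[1:]
mydata = pd.DataFrame(rows, columns=new_headers)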
I'm quite new to Python and BeautifulSoup, and I've been trying to work this out for several hours...
First, I want to extract the data from every table at the link below whose title contains "general election":
https://en.wikipedia.org/wiki/Carlow%E2%80%93Kilkenny_(D%C3%A1il_constituency)
I do have another dataframe with the name of each table (e.g. "1961 general election", "1965 general election"), but I'm hoping to get away with just searching each table for "general election" to confirm whether it's one I need.
I then want to get all the names that are in bold (which indicates they won), and finally I want another list of the "Count 1" (or sometimes "1st Pref") values in the original order, which I want to compare to the bold list. I haven't even looked at that piece yet, as I haven't got past the first hurdle.
url = "https://en.wikipedia.org/wiki/Carlow%E2%80%93Kilkenny_(D%C3%A1il_constituency)"
res = requests.get(url)
soup = BeautifulSoup(res.content,'lxml')
my_tables = soup.find_all("table", {"class":"wikitable"})
for table in my_tables:
rows = table.find_all('tr', text="general election")
print(rows)
Any help on this would be greatly appreciated...
This page requires some gymnastics, but it can be done:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

req = requests.get('https://en.wikipedia.org/wiki/Carlow%E2%80%93Kilkenny_(D%C3%A1il_constituency)')
soup = bs(req.text, 'lxml')

#first - select all the tables on the page
tables = soup.select('table.wikitable')
for table in tables:
    ttr = table.select('tbody tr')
    #next, filter out any table that doesn't involve general elections
    if "general election" in ttr[0].text:
        #clean up the rows
        s_ttr = ttr[1].text.replace('\n', 'xxx').strip()
        #find and clean up column headings
        columns = [col.strip() for col in s_ttr.split('xxx') if len(col.strip()) > 0]
        rows = []  #initialize a list to house the table rows
        for c in ttr[2:]:
            #from here, start processing each row and loading it into the list
            row = [a.text.strip() if len(a.text.strip()) > 0 else 'NA' for a in c.select('td')]
            if row and row[0] == "NA":
                row = row[1:]
                columns = [col.strip() for col in s_ttr.split('xxx') if len(col.strip()) > 0]
            if len(row) > 0:
                rows.append(row)
        #load the whole thing into a dataframe
        df = pd.DataFrame(rows, columns=columns)
        print(df)
The output should be all the general election tables on the page.
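For the follow-on requirement (the names in bold, i.e. the winners), a hedged sketch that reuses the tables selection above and assumes the winning candidates are wrapped in <b> tags inside each matching table:

# Sketch: collect bold names from each "general election" table
# Assumes winners are rendered inside <b> tags within the table.
for table in tables:
    ttr = table.select('tbody tr')
    if "general election" in ttr[0].text:
        winners = [b.get_text(strip=True) for b in table.find_all('b')]
        print(winners)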
I need some help here. I plan to extract all the statistical data from this site: https://lotostats.ro/toate-rezultatele-win-for-life-10-20
My issue is that I am not able to read the table; I can't do it even for the first page.
Can someone please help?
import requests
import lxml.html as lh
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://lotostats.ro/toate-rezultatele-win-for-life-10-20'

# Create a handle, page, to handle the contents of the website
page = requests.get(url)

# Store the contents of the website under doc
doc = lh.fromstring(page.content)

# Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')

# Create an empty list
col = []
i = 0

# For each header cell, store its name together with an empty list
for t in tr_elements[0]:
    i += 1
    name = t.text_content()
    print('%d:"%s"' % (i, name))
    col.append((name, []))

# Since our first row is the header, data is stored from the second row onwards
for j in range(1, len(tr_elements)):
    # T is our j'th row
    T = tr_elements[j]
    # If the row is not of size 10, the //tr data is not from our table
    # if len(T) != 10:
    #     break
    # i is the index of our column
    i = 0
    # Iterate through each element of the row
    for t in T.iterchildren():
        data = t.text_content()
        # Skip the first column when converting values
        if i > 0:
            # Convert any numerical value to an integer
            try:
                data = int(data)
            except:
                pass
        # Append the data to the list of the i'th column
        col[i][1].append(data)
        # Increment i for the next column
        i += 1

Dict = {title: column for (title, column) in col}
df = pd.DataFrame(Dict)
df.head()
print(df)
The data is added dynamically. You can find the source, which returns JSON, in the browser's network tab:
import requests
r = requests.get('https://lotostats.ro/all-rez/win_for_life_10_20?draw=1&columns%5B0%5D%5Bdata%5D=0&columns%5B0%5D%5Bname%5D=&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=false&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=1&columns%5B1%5D%5Bname%5D=&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=false&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&start=0&length=20&search%5Bvalue%5D=&search%5Bregex%5D=false&_=1564996040879').json()
You can URL-decode that query string and likely remove the timestamp part (worth verifying), or simply replace it with an arbitrary number:
import requests
r = requests.get('https://lotostats.ro/all-rez/win_for_life_10_20?draw=1&columns[0][data]=0&columns[0][name]=&columns[0][searchable]=true&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false&columns[1][data]=1&columns[1][name]=&columns[1][searchable]=true&columns[1][orderable]=false&columns[1][search][value]=&columns[1][search][regex]=false&start=0&length=20&search[value]=&search[regex]=false&_=1').json()
To see the lottery lines:
print(r['data'])
The draw parameter seems to correspond to the page of draws, e.g. the 2nd page:
https://lotostats.ro/all-rez/win_for_life_10_20?draw=2&columns[0][data]=0&columns[0][name]=&columns[0][searchable]=true&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false&columns[1][data]=1&columns[1][name]=&columns[1][searchable]=true&columns[1][orderable]=false&columns[1][search][value]=&columns[1][search][regex]=false&start=20&length=20&search[value]=&search[regex]=false&_=1564996040880
You can alter the length to retrieve more results. For example, I can deliberately oversize it to get all results
import requests
r = requests.get('https://lotostats.ro/all-rez/win_for_life_10_20?draw=1&columns[0][data]=0&columns[0][name]=&columns[0][searchable]=true&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false&columns[1][data]=1&columns[1][name]=&columns[1][searchable]=true&columns[1][orderable]=false&columns[1][search][value]=&columns[1][search][regex]=false&start=0&length=100000&search[value]=&search[regex]=false&_=1').json()
print(len(r['data']))
Otherwise, you can set the length param to a fixed number, do an initial request, and calculate the number of pages as the total record count (r['recordsFiltered']) divided by the results per page.
import math
total_results = r['recordsFiltered']
results_per_page = 20
num_pages = math.ceil(total_results/results_per_page)
Then loop to get all the results (remembering to alter the draw param along the way). Obviously, the fewer requests the better.
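A minimal sketch of that loop, assuming the start, length, and draw parameters behave as described above and the rest of the query string can stay fixed:

import math
import requests

# Sketch of the pagination loop; the placeholder values are assumptions.
base = ('https://lotostats.ro/all-rez/win_for_life_10_20?draw={draw}'
        '&columns[0][data]=0&columns[0][name]=&columns[0][searchable]=true'
        '&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false'
        '&columns[1][data]=1&columns[1][name]=&columns[1][searchable]=true'
        '&columns[1][orderable]=false&columns[1][search][value]=&columns[1][search][regex]=false'
        '&start={start}&length={length}&search[value]=&search[regex]=false&_=1')

results_per_page = 20
first = requests.get(base.format(draw=1, start=0, length=results_per_page)).json()
num_pages = math.ceil(int(first['recordsFiltered']) / results_per_page)

all_rows = list(first['data'])
for page in range(2, num_pages + 1):
    start = (page - 1) * results_per_page
    r = requests.get(base.format(draw=page, start=start, length=results_per_page)).json()
    all_rows.extend(r['data'])

print(len(all_rows))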
Problem: a website has c.80 pages, each of which contains a single table that is identically structured. I need to scrape each table and store the results in a single pandas dataframe. The table content is regularly updated, and therefore the exercise will need to be frequently repeated.
I can scrape the table from a single page but am struggling to do it for multiple pages. All of the examples I have found are for URLs that change iteratively, e.g. (www.example.com/page1, /page2 etc), rather than for a specified list of URLs.
I have tried the following for a subset of the URLs (ideally, I would like to read in the URLs from a csv list), but it only seems to scrape the final table into the dataframe (i.e. ZZ).
Apologies if this seems dim, I’m fairly new to Python and have mainly been using pandas for data analysis, reading in directly from csv. Any help would be gratefully appreciated.
How can I read the URLs from a CSV list? My current solution does not scrape every table as I expect.
from bs4 import BeautifulSoup
import requests
import pandas as pd

COLUMNS = ['ID', 'Serial', 'Aircraft', 'Notes']

urls = ['http://www.ukserials.com/results.php?serial=ZR',
        'http://www.ukserials.com/results.php?serial=ZT',
        'http://www.ukserials.com/results.php?serial=ZZ']

# Scrape elements
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    table = soup.find("table")       # Find the "table" tag in the page
    rows = table.find_all("tr")      # Find all the "tr" tags in the table
    cy_data = []
    for row in rows:
        cells = row.find_all("td")   # Find all the "td" tags in each row
        cells = cells[0:4]           # Select the correct columns
        cy_data.append([cell.text for cell in cells])  # For each "td" tag, get the text inside it
    data = pd.DataFrame(cy_data, columns=COLUMNS).drop(0, axis=0)
Can you not add each dataframe into a list, and then merge the elements of that list right at the end?
...
dataframes = []

for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    table = soup.find("table")       # Find the "table" tag in the page
    rows = table.find_all("tr")      # Find all the "tr" tags in the table
    cy_data = []
    for row in rows:
        cells = row.find_all("td")   # Find all the "td" tags in each row
        cells = cells[0:4]           # Select the correct columns
        cy_data.append([cell.text for cell in cells])  # For each "td" tag, get the text inside it
    dataframes.append(pd.DataFrame(cy_data, columns=COLUMNS).drop(0, axis=0))

data = pd.concat(dataframes)
Note: You might need to specify index offsets for each dataframe (before merging), as seen here: https://pandas.pydata.org/pandas-docs/stable/merging.html
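To read the URLs from a CSV list instead of hard-coding them (as the question asks), a small sketch; the file name urls.csv and the column name url are assumptions, not anything from the original post:

import pandas as pd

# Assumed file layout: a single column named "url" in urls.csv
urls = pd.read_csv("urls.csv")["url"].tolist()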
I am scraping data from a list of hundreds of URLs, each one containing a table with statistical baseball data. Within each unique URL in the list, there is a table for all of the seasons of a single baseball player's career, like this:
https://www.baseball-reference.com/players/k/killeha01.shtml
I have successfully created a script to append the data from a single URL into a single list/dataframe. However, here is my question:
How should I adjust my code to scrape a full list of hundreds of URLs from this domain and then append all of the table rows from all of the URLs into a single list/dataframe?
My general format for scraping a single URL is as follows:
import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup

url_baseball_players = ['https://www.baseball-reference.com/players/k/killeha01.shtml']

def scrape_baseball_data(url_parameter):
    html = urlopen(url_parameter)
    # create the BeautifulSoup object
    soup = BeautifulSoup(html, "lxml")
    column_headers = [SCRAPING COMMAND WITH CSS SELECTOR GADGET FOR GETTING COLUMN HEADERS]
    table_rows = soup.select(SCRAPING COMMAND WITH CSS SELECTOR GADGET FOR GETTING ALL OF THE DATA FROM THE TABLES INCLUDING HTML CHARACTERS)
    player_data = []
    for row in table_rows:
        player_list = [COMMANDS FOR SCRAPING HTML DATA FROM THE TABLES INTO AN ORGANIZED LIST]
        if not player_list:
            continue
        player_data.append(player_list)
    return player_data

list_baseball_player_data = scrape_baseball_data(url_baseball_players)
df_baseball_player_data = pd.DataFrame(list_baseball_player_data)
If url_baseball_players is a list of all the URLs you want to scrape, and your expected output is one data frame (where you append, row-wise, each new URL's data), then just keep adding with concat() as you iterate over URLs:
df = pd.DataFrame()

for url in url_baseball_players:
    df = pd.concat([df, pd.DataFrame(scrape_baseball_data(url))])
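A small variation on the same idea (mirroring the list-then-concat approach in the previous answer): collect one dataframe per URL and concatenate once at the end, which avoids rebuilding the dataframe on every iteration:

# Collect one dataframe per URL, then concatenate a single time
frames = [pd.DataFrame(scrape_baseball_data(url)) for url in url_baseball_players]
df = pd.concat(frames, ignore_index=True)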