Scraping table data after applying filter - python

I am trying to scrape data from this website: http://nepalstock.com/indices . I am unable to scrape the table data after changing the value of 'Select Indices or Sub Indices':
Screenshot: https://i.stack.imgur.com/J0WMn.png
Since the URL does not change, I do not know how to proceed. So far, this is my code:
import requests
import pandas
from bs4 import BeautifulSoup

html = requests.get("http://nepalstock.com/indices")
soup = BeautifulSoup(html.content, "html.parser")

lst = []
for row in soup.find_all("tr"):
    l = []
    for td in row.find_all("td"):
        l.append(td.get_text())
    lst.append(l)

df = pandas.DataFrame(lst[1:])
df.head()
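No answer is shown here, but since the filter is applied without the URL changing, the table is almost certainly reloaded by a request the page makes in the background. The usual approach is to open the browser's developer tools (Network tab), apply the filter, and replicate the request with requests. A minimal sketch, assuming the filter is submitted as a POST form field; the field name "indexId" and its value are guesses, not the site's documented API:

import requests
from bs4 import BeautifulSoup

# Hypothetical: check the Network tab for the real endpoint, field name,
# and value; "indexId" is an assumed parameter, not a documented one.
resp = requests.post("http://nepalstock.com/indices", data={"indexId": "58"})
soup = BeautifulSoup(resp.content, "html.parser")

rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in soup.find_all("tr")]
print(rows[:5])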

Related

Issues with extracting 1 column from https://www.sbstransit.com.sg/fares-and-concessions

I tried using web scraping to extract only one column from this website:

import pandas as pd

df = pd.read_html('https://www.sbstransit.com.sg/fares-and-concessions')
df
I also tried:

from urllib.request import urlopen
from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen('https://www.sbstransit.com.sg/fares-and-concessions').read())
for row in soup('table', {'class': 'spad'})[0].tbody('tr'):
    tds = row('td')
    print(tds[0].string, tds[1].string)
I seriously need help; I've been trying this for hours already. It's so hard just to extract one column :[
What you have to do is navigate through the site's markup. Try this:
from urllib.request import urlopen
from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen('https://www.sbstransit.com.sg/fares-and-concessions').read())
# get the first table body in the accordion
table = soup("ul", id="accordion")[0].li.table.tbody
for row in table("tr"):
    # get the 7th column of each row
    print(row("td")[6].text)
I prefer to use Scrapy (we use it at my job), but if you are going to start on web scraping I recommend learning XPath; it will help you navigate.
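For example, the same seventh-column selection could be written with lxml and XPath (a sketch, assuming the same ul#accordion markup as the answer above):

from urllib.request import urlopen
from lxml import html

tree = html.parse(urlopen('https://www.sbstransit.com.sg/fares-and-concessions'))
# the 7th <td> of each row in the first table inside the accordion
for cell in tree.xpath('//ul[@id="accordion"]/li[1]//table//tr/td[7]'):
    print(cell.text_content())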
What about using pandas.read_html and selecting the needed table by index from the list of tables:
pd.read_html('https://www.sbstransit.com.sg/fares-and-concessions', header=1)[1]
and to get only the results from one column:
pd.read_html('https://www.sbstransit.com.sg/fares-and-concessions', header=1)[1]['DTL/NEL']

Import table from website with BeautifulSoup

I am trying to import a table from a website and afterwards transform the data into a pandas dataframe.
The website is: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
That's my code so far:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

website_url = requests.get(
    'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(website_url, 'lxml')
My_table = soup.find('table', {'class': 'wikitable sortable'})
for x in soup.find_all('table', {'class': 'wikitable sortable'}):
    table = x.text
print(My_table)
print(table)
(Screenshots of the print(My_table) and print(table) output omitted.)
How do I convert this webpage table to a pandas DataFrame?
Have you tried pd.read_html? Also, since the table is very standard, why not copy it directly into Excel and import it as a DataFrame?
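For completeness, a minimal read_html sketch; the table index is an assumption, so verify which entry in the returned list is the postal-code table:

import pandas as pd

# read_html returns one DataFrame per <table> on the page
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
df = tables[0]  # assumed: the postal-code table is the first one
print(df.head())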

How can I pull webpage data into my DataFrame by referencing a specific HTML class or id using pandas read_html?

I'm trying to pull the data from the table at this site and save it in a CSV with the column 'ticker' included. Right now my code is this:
import requests
import pandas as pd

url = 'https://www.biopharmcatalyst.com/biotech-stocks/company-pipeline-database#marketCap=mid|stages=approved,crl'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[0]
print(df)
df.to_csv('my data.csv')
and it results in a file that looks like this.
I want to have the 'ticker' column in my CSV file with the corresponding ticker listed for each company. The ticker is in the HTML here (class="ticker--small"). The output should look like this.
I'm totally stuck on this. I've tried doing it in BeautifulSoup too but I can't get it working. Any help would be greatly appreciated.
The page has multiple tables; use BeautifulSoup to extract each one and loop to write the CSV.
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://www.biopharmcatalyst.com/biotech-stocks/company-pipeline-database#marketCap=mid|stages=approved,crl'
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')  # the 'lxml' parser needs lxml installed
tables = soup.findAll('table')

for table in tables:
    df = pd.read_html(str(table))[0]
    with open('my_data.csv', 'a+') as f:
        df.to_csv(f)
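That loop still drops the ticker, which lives in the markup rather than in the plain cell text. A sketch of one way to attach it, assuming each table row has exactly one element with the class="ticker--small" mentioned in the question:

from bs4 import BeautifulSoup
import pandas as pd
import requests

url = 'https://www.biopharmcatalyst.com/biotech-stocks/company-pipeline-database'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

frames = []
for table in soup.find_all('table'):
    df = pd.read_html(str(table))[0]
    # collect the ticker elements separately; one per row is an assumption
    tickers = [t.get_text(strip=True) for t in table.select('.ticker--small')]
    if len(tickers) == len(df):
        df['ticker'] = tickers
    frames.append(df)

pd.concat(frames).to_csv('my_data.csv', index=False)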

Beautiful Soup Scraping table

I have this small piece of code to scrape table data from a website and then display it in CSV format. The issue is that the for loop is printing the records multiple times. I am not sure if it is due to the <tr> tag. BTW, I am new to Python. Thanks for your help!
# import needed libraries
import urllib
from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv
import sys
import re

# read the data from a URL
url = requests.get("https://www.top500.org/list/2018/06/")

# parse the response using Beautiful Soup
soup = BeautifulSoup(url.content, 'html.parser')

newtxt = ""
for record in soup.find_all('tr'):
    tbltxt = ""
    for data in record.find_all('td'):
        tbltxt = tbltxt + "," + data.text
    newtxt = newtxt + "\n" + tbltxt[1:]
    print(newtxt)
from bs4 import BeautifulSoup
import requests

url = requests.get("https://www.top500.org/list/2018/06/")
soup = BeautifulSoup(url.content, 'html.parser')
table = soup.find_all('table', attrs={'class': 'table table-condensed table-striped'})
for i in table:
    tr = i.find_all('tr')
    for x in tr:
        print(x.text)
Or, the best way: parse the table using pandas.
import pandas as pd

table = pd.read_html('https://www.top500.org/list/2018/06/',
                     attrs={'class': 'table table-condensed table-striped'},
                     header=1)
print(table)
It's printing much of the data multiple times because the newtxt variable, which you print after collecting the text of each row's <td></td> cells, just keeps accumulating all the values. The easiest fix is to move the line print(newtxt) outside of both for loops, that is, leave it totally unindented. You should then see all the text, with each row on a new line and each individual cell in a row separated by commas.
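Concretely, the fixed version prints once after both loops finish:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://www.top500.org/list/2018/06/").content,
                     'html.parser')

newtxt = ""
for record in soup.find_all('tr'):
    tbltxt = ""
    for data in record.find_all('td'):
        tbltxt = tbltxt + "," + data.text
    newtxt = newtxt + "\n" + tbltxt[1:]

# moved out of the loops: print the accumulated text a single time
print(newtxt)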

How to get the contents under a particular column in a table from Wikipedia using soup & python

I need to get the href links that the contents under a particular column of a table on Wikipedia point to. The page is http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015. On this page there are a few tables with class "wikitable". I need the link of each entry under the column Title for every row, and I would like them copied onto an Excel sheet.
I do not know the exact code for searching under a particular column, but I got this far and I am getting a "NoneType object is not callable" error. I am using bs4. I wanted to extract at least some part of the table so I could narrow down to the href links under the Title column, but I am ending up with this error. The code is below:
from urllib.request import urlopen
from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen('http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015').read())
for row in soup('table', {'class': 'wikitable'})[1].tbody('tr'):
    tds = row('td')
    print(tds[0].string, tds[0].string)
A little guidance would be appreciated. Does anyone know?
Figured out that the NoneType error might be related to the table filtering. Corrected code is below:
import urllib2
from bs4 import BeautifulSoup, SoupStrainer

content = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015").read()
filter_tag = SoupStrainer("table", {"class": "wikitable"})
soup = BeautifulSoup(content, parse_only=filter_tag)
links = []
for sp in soup.find_all(align="center"):
    a_tag = sp('a')
    if a_tag:
        links.append(a_tag[0].get('href'))
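To target the Title column specifically and get the links into a spreadsheet, here is a sketch. It assumes each wikitable has a single header row and that the position of the "Title" header matches the td index in every data row; rowspan cells can break that alignment, so treat this as a starting point:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup(
    urlopen('http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015').read(),
    'html.parser')

records = []
for table in soup('table', {'class': 'wikitable'}):
    headers = [th.get_text(strip=True) for th in table.find_all('th')]
    if 'Title' not in headers:
        continue
    title_idx = headers.index('Title')
    for row in table.find_all('tr')[1:]:
        tds = row.find_all('td')
        if len(tds) > title_idx:
            a = tds[title_idx].find('a')
            if a and a.get('href'):
                records.append({'title': a.get_text(strip=True), 'href': a['href']})

# to_excel needs openpyxl installed; swap in to_csv if you prefer
pd.DataFrame(records).to_excel('telugu_films_2015.xlsx', index=False)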
