I am new to Python. I want to parse data from a table on the BSE site into Python.
I tried using the BeautifulSoup module, but I can't work out which reference to use to find the correct table. In fact, that particular table row isn't even displayed in Python.
The code that I tried was:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
page = 'https://www.bseindia.com/stock-share-price/itc-ltd/itc/500875/corp-actions/'
req = Request(page, headers = {'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")
containers = page_soup.findAll("table", id = "tblinsidertrd")
This is giving a blank [ ] result.
Then I tried
containers = page_soup.findAll('td')
containers = page_soup.findAll('tr')
In both cases I was unable to find the table or data I was looking for. I couldn't even find the table headings, viz. 'Ex Date' and 'Amount'.
The table that I want from the BSE site is the dividend table on the corporate actions page.
Please help me understand where I am going wrong and why I am unable to view the dividend table data.
The content is dynamically generated. You can pull it from the API:
import pandas as pd
import requests
url = 'https://api.bseindia.com/BseIndiaAPI/api/CorporateAction/w?scripcode=500875'
headers = {'User-Agent': 'Mozilla/5.0'}
jsonData = requests.get(url, headers=headers).json()
df = pd.DataFrame(jsonData['Table'])
Output:
print(df)
Amount BCRD_from purpose_name
0 10.15 06 Jul 2020 Dividend
1 5.75 22 May 2019 Dividend
2 5.15 25 May 2018 Dividend
3 4.75 05 Jun 2017 Dividend
4 8.50 30 May 2016 Dividend
5 6.25 03 Jun 2015 Dividend
6 6.00 03 Jun 2014 Dividend
7 5.25 31 May 2013 Dividend
8 4.50 11 Jun 2012 Dividend
9 2.80 10 Jun 2011 Dividend
10 1.65 10 Jun 2011 Special Dividend
11 10.00 09 Jun 2010 Dividend
12 3.70 13 Jul 2009 Dividend
13 3.50 16 Jul 2008 Dividend
14 3.10 16 Jul 2007 Dividend
15 10.00 03 Jul 2001 Dividend
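If you want the same 'Ex Date' / 'Amount' view you were looking for on the page, here is a small, hedged follow-up sketch. It assumes the column names printed above (Amount, BCRD_from, purpose_name) are stable and that BCRD_from holds the ex/record date, so verify that against the site before relying on it:
import pandas as pd
import requests
url = 'https://api.bseindia.com/BseIndiaAPI/api/CorporateAction/w?scripcode=500875'
headers = {'User-Agent': 'Mozilla/5.0'}
jsonData = requests.get(url, headers=headers).json()
df = pd.DataFrame(jsonData['Table'])
# Column names taken from the printed output above; adjust if the API changes.
div = df[['BCRD_from', 'Amount', 'purpose_name']].rename(
    columns={'BCRD_from': 'Ex Date', 'purpose_name': 'Purpose'})
# Convert the text columns to proper dtypes so you can sort and filter.
div['Ex Date'] = pd.to_datetime(div['Ex Date'], format='%d %b %Y', errors='coerce')
div['Amount'] = pd.to_numeric(div['Amount'], errors='coerce')
print(div.sort_values('Ex Date', ascending=False))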
I'm learning how to web scrape using Python since I'm a novice. I attempted to scrape Euro 2020 stats from this website: https://theanalyst.com/na/2021/06/euro-2020-player-stats. After running my initial code (see below) to gather the HTML from the web page, I cannot locate the table tag and its data-table class. I can see the table and its data-table class when I inspect the website, but they are not shown when I print out page_soup.
from urllib.request import urlopen as uReq # Web client
from bs4 import BeautifulSoup as soup # HTML data structure
url_page = 'https://theanalyst.com/na/2021/06/euro-2020-player-stats'
# Open connection & download the html from the url
uClient = uReq(url_page)
# Parses html into a soup data structure
page_soup = soup(uClient.read(), "html.parser")
uClient.close()
print(page_soup)
The table is loaded dynamically as JSON via a GET request to:
https://dataviz.theanalyst.com/euro-2020-hub/player_stats_3_2020.json
Since we're dealing with JSON data, it's easier to use the requests library to get the data.
Here is an example using the pandas library to load the table into a DataFrame and print it (you don't have to use pandas).
import pandas as pd
import requests
url = "https://dataviz.theanalyst.com/euro-2020-hub/player_stats_3_2020.json"
response = requests.get(url).json()
print(pd.json_normalize(response["data"]).to_string())
Output (truncated):
player_id team_id team_name player_first_name player_last_name player age position detailed_position mins_played np_shots np_sot np_goals np_xG op_chances_created op_assists op_xA op_passes op_pass_completion_rate tackles_won interceptions recoveries avg_carry_distance avg_carry_progress carry_w_shot carry_w_goal carry_w_chance_created carry_w_assist take_ons take_ons_success_rate goal_ending total_xG shot_ending team_badge
0 103955 114 England Raheem Sterling Raheem Sterling 26 Forward Second Striker 641 14 8 3 3.82 2 1 1.18 193 0.85 5 4 23 12.98 6.73 3 0 3 1 38 52.63 6 7.08 24 https://omo.akamai.opta.net/image.php?secure=true&h=omo.akamai.opta.net&sport=football&entity=team&description=badges&dimensions=150&id=114
1 56979 114 England Jordan Henderson Jordan Henderson 31 Midfielder Central Midfielder 150 1 1 1 0.32 0 0 0.06 111 0.88 0 1 11 7.83 0.49 0 0 0 0 3 66.67 0 0.00 0 https://omo.akamai.opta.net/image.php?secure=true&h=omo.akamai.opta.net&sport=football&entity=team&description=badges&dimensions=150&id=114
2 78830 114 England Harry Kane Harry Kane 27 Forward Striker 649 15 7 4 3.57 5 0 0.39 159 0.70 0 3 8 10.52 3.06 2 0 2 0 15 53.33 7 6.38 21 https://omo.akamai.opta.net/image.php?secure=true&h=omo.akamai.opta.net&sport=football&entity=team&description=badges&dimensions=150&id=114
3 58621 114 England Kyle Walker Kyle Walker 31 Defender Full Back 599 0 0 0 0.00 2 0 0.18 352 0.87 0 8 37 11.66 5.09 0 0 0 0 1 100.00 3 2.54 10 https://omo.akamai.opta.net/image.php?secure=true&h=omo.akamai.opta.net&sport=football&entity=team&description=badges&dimensions=150&id=114
The variable response is now a dictionary (dict) whose keys/values you can access. To view and prettify the data:
from pprint import pprint
print(type(response))
pprint(response)
Output (truncated):
<class 'dict'>
{'data': [{'age': 26,
'avg_carry_distance': 12.98,
'avg_carry_progress': 6.73,
'carry_w_assist': 1,
'carry_w_chance_created': 3,
'carry_w_goal': 0,
'carry_w_shot': 3,
'detailed_position': 'Second Striker',
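If you only need a handful of those columns, you can slice the normalized DataFrame. A minimal sketch, using column names taken from the output above (verify them against the live JSON before relying on them):
import pandas as pd
import requests
url = "https://dataviz.theanalyst.com/euro-2020-hub/player_stats_3_2020.json"
response = requests.get(url).json()
df = pd.json_normalize(response["data"])
# Column names copied from the truncated output above.
cols = ["player", "team_name", "position", "mins_played", "np_goals", "op_assists"]
print(df[cols].sort_values("np_goals", ascending=False).head(10))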
I ran the following code:
import win32clipboard
win32clipboard.OpenClipboard()
df = win32clipboard.GetClipboardData().rstrip()
print(df)
When I print df, it looks like this:
\tDeal Type\tDeal #\tTrade Date\tValue Date\tTr\tCustomer\tPay Ccy\tPay Amount\tRec Ccy\tRec Amount\tRate\tUser Comments\tTime Option Start\t\r\n15\tFX\t2021062306\t23 Jun 2021\t24 Jun 2021\tEDL\txxx xxx\tCAD\t18,341.45\tUSD\t14,950.28\t1.2268300\t\t\t\r\n116\tFX\t2021021111\t11 Feb 2021\t30 Jul 2021\tAIA\txxx xxx xxxx\tUSD\t250,000.00\tCAD\t318,400.00\t1.2736000\t\t\t\r\n138\t
How can I convert df into an actual DataFrame?
Try:
import pandas as pd
from io import StringIO
s = """\tDeal Type\tDeal #\tTrade Date\tValue Date\tTr\tCustomer\tPay Ccy\tPay Amount\tRec Ccy\tRec Amount\tRate\tUser Comments\tTime Option Start\t\r\n15\tFX\t2021062306\t23 Jun 2021\t24 Jun 2021\tEDL\txxx xxx\tCAD\t18,341.45\tUSD\t14,950.28\t1.2268300\t\t\t\r\n116\tFX\t2021021111\t11 Feb 2021\t30 Jul 2021\tAIA\txxx xxx xxxx\tUSD\t250,000.00\tCAD\t318,400.00\t1.2736000\t\t\t\r\n138\t"""
df = pd.read_csv(StringIO(s), sep=r"\t", engine="python", dtype=str)
print(df)
Prints:
Deal Type Deal # Trade Date Value Date Tr Customer Pay Ccy Pay Amount Rec Ccy Rec Amount Rate User Comments Time Option Start
0 15 FX 2021062306 23 Jun 2021 24 Jun 2021 EDL xxx xxx CAD 18,341.45 USD 14,950.28 1.2268300 None
1 116 FX 2021021111 11 Feb 2021 30 Jul 2021 AIA xxx xxx xxxx USD 250,000.00 CAD 318,400.00 1.2736000 None
2 138 None None None None None None None None None None None None
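Alternatively, assuming pandas' clipboard support works in your environment, you may be able to skip win32clipboard entirely and parse the clipboard contents directly. A minimal sketch:
import pandas as pd
# read_clipboard parses the current clipboard contents like read_csv;
# the data shown above is tab-separated, so pass sep='\t'.
df = pd.read_clipboard(sep='\t', dtype=str)
print(df)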
en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul
In the link above, there is un-tabulated data for Istanbul neighbourhoods.
I want to fetch these neighbourhoods into a DataFrame with this code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
wikiurl="https://en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul"
response=requests.get(wikiurl)
soup = BeautifulSoup(response.text, 'html.parser')
tocList=soup.findAll('a',{'class':"new"})
neighborhoods=[]
for item in tocList:
    text = item.get_text()
    neighborhoods.append(text)
df = pd.DataFrame(neighborhoods, columns=['Neighborhood'])
print(df)
and I got this output:
Neighborhood
0 Maden
1 Nizam
2 Anadolu
3 Arnavutköy İmrahor
4 Arnavutköy İslambey
... ...
705 Seyitnizam
706 Sümer
707 Telsiz
708 Veliefendi
709 Yeşiltepe
710 rows × 1 columns
but some data are not fetched; compare the list below to the output:
Adalar
Burgazada
Heybeliada
Kınalıada
Maden
Nizam
findAll() is not fetching the neighbourhoods that don't appear as links with that class, e.g.:
<ol><li>Burgazada</li>
<li>Heybeliada</li>
Also, can I extend the code to produce 2 columns, 'Neighborhood' and its 'District'?
Are you trying to fetch this list from the Table of Contents?
Please check if this solves your problem:
import pandas as pd
import requests
from bs4 import BeautifulSoup
wikiurl="https://en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul"
response=requests.get(wikiurl)
soup = BeautifulSoup(response.text, 'html.parser')
tocList=soup.findAll('span',{'class':"toctext"})
districts=[]
blocked_words = ['Neighbourhoods by districts','Further reading', 'External links']
for item in tocList:
    text = item.get_text()
    if text not in blocked_words:
        districts.append(text)
df = pd.DataFrame(districts, columns=['districts'])
print(df)
Output:
districts
0 Adalar
1 Arnavutköy
2 Ataşehir
3 Avcılar
4 Bağcılar
5 Bahçelievler
6 Bakırköy
7 Başakşehir
8 Bayrampaşa
9 Beşiktaş
10 Beykoz
11 Beylikdüzü
12 Beyoğlu
13 Büyükçekmece
14 Çatalca
15 Çekmeköy
16 Esenler
17 Esenyurt
18 Eyüp
19 Fatih
20 Gaziosmanpaşa
21 Güngören
22 Kadıköy
23 Kağıthane
24 Kartal
25 Küçükçekmece
26 Maltepe
27 Pendik
28 Sancaktepe
29 Sarıyer
30 Silivri
31 Sultanbeyli
32 Sultangazi
33 Şile
34 Şişli
35 Tuzla
36 Ümraniye
37 Üsküdar
38 Zeytinburnu
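To get the two-column 'Neighborhood' / 'District' output you asked about, here is a rough sketch. It assumes each district is a section heading (a span with class mw-headline) and that its neighbourhoods sit in the list(s) between that heading and the next one; Wikipedia's markup changes over time, so treat the selectors as assumptions to verify:
import pandas as pd
import requests
from bs4 import BeautifulSoup
wikiurl = "https://en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul"
soup = BeautifulSoup(requests.get(wikiurl).text, 'html.parser')
# Section titles that are headings but not districts (adjust as needed).
skip = {'Neighbourhoods by districts', 'See also', 'References',
        'Further reading', 'External links'}
rows = []
for headline in soup.find_all('span', class_='mw-headline'):
    district = headline.get_text()
    if district in skip:
        continue
    # Walk the siblings of the heading until the next heading, collecting list items.
    for sibling in headline.parent.find_next_siblings():
        if sibling.name in ('h2', 'h3'):
            break
        for li in sibling.find_all('li'):
            rows.append({'Neighborhood': li.get_text(strip=True),
                         'District': district})
df = pd.DataFrame(rows)
print(df)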
I am totally new to web scraping.
How can I scrape a website whose URL doesn't change with the page number?
For example, take this website: https://www.bseindia.com/corporates/Forth_Results.aspx
The URL doesn't change with the page number.
How can we do it using BeautifulSoup in Python?
This script will page through all the results and print the data from each page:
import requests
from bs4 import BeautifulSoup
url = 'https://www.bseindia.com/corporates/Forth_Results.aspx'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
page = 1
while True:
    print(page)

    rows = soup.select('.TTRow')
    if not rows:
        break

    # print some data to screen:
    for tr in rows:
        print(tr.get_text(strip=True, separator=' '))

    # to get the correct page, you have to do a POST request with the correct data
    # the data is located in <input name="..." value=".."> tags
    d = {}
    for i in soup.select('input'):
        d[i['name']] = i.get('value', '')

    # some data parameters need to be deleted:
    if 'ctl00$ContentPlaceHolder1$btnSubmit' in d:
        del d['ctl00$ContentPlaceHolder1$btnSubmit']

    # set the correct page:
    page += 1
    d['__EVENTTARGET'] = 'ctl00$ContentPlaceHolder1$gvData'
    d['__EVENTARGUMENT'] = 'Page${}'.format(page)

    soup = BeautifulSoup(requests.post(url, headers=headers, data=d).content, 'html.parser')
Prints:
1
500002 ABB 23 Jul 2020
531082 ALANKIT 23 Jul 2020
535916 ALSL 23 Jul 2020
526662 ARENTERP 23 Jul 2020
500215 ATFL 23 Jul 2020
540611 AUBANK 23 Jul 2020
532523 BIOCON 23 Jul 2020
533167 COROENGG 23 Jul 2020
532839 DISHTV 23 Jul 2020
500150 FOSECOIND 23 Jul 2020
507488 GMBREW 23 Jul 2020
532855 HARYNACAP 23 Jul 2020
541729 HDFCAMC 23 Jul 2020
524342 INDOBORAX 23 Jul 2020
522183 ITL 23 Jul 2020
534623 JUPITERIN 23 Jul 2020
533192 KCPSUGIND 23 Jul 2020
542753 MAHAANIMP 23 Jul 2020
532525 MAHABANK 23 Jul 2020
523754 MAHEPC 23 Jul 2020
531680 MAYUR 23 Jul 2020
526299 MPHASIS 23 Jul 2020
532416 NEXTMEDIA 23 Jul 2020
502294 NILACHAL 23 Jul 2020
538772 NIYOGIN 23 Jul 2020
2
530805 OIVL 23 Jul 2020
538742 PANACHE 23 Jul 2020
531879 PIONDIST 23 Jul 2020
540173 PNBHOUSING 23 Jul 2020
533178 PRADIP 23 Jul 2020
...and so on.
EDIT: To save it as CSV, you can use this:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.bseindia.com/corporates/Forth_Results.aspx'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
page = 1
all_data = []
while True:
    print(page)

    rows = soup.select('.TTRow')
    if not rows:
        break

    # collect the data from each row:
    for tr in rows:
        row = tr.get_text(strip=True, separator='|').split('|')
        all_data.append(row)

    # to get the correct page, you have to do a POST request with the correct data
    # the data is located in <input name="..." value=".."> tags
    d = {}
    for i in soup.select('input'):
        d[i['name']] = i.get('value', '')

    # some data parameters need to be deleted:
    if 'ctl00$ContentPlaceHolder1$btnSubmit' in d:
        del d['ctl00$ContentPlaceHolder1$btnSubmit']

    # set the correct page:
    page += 1
    d['__EVENTTARGET'] = 'ctl00$ContentPlaceHolder1$gvData'
    d['__EVENTARGUMENT'] = 'Page${}'.format(page)

    soup = BeautifulSoup(requests.post(url, headers=headers, data=d).content, 'html.parser')
df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv')
Produces data.csv (screenshot from LibreOffice not shown here).
I have a text file that is output from a command I ran with Netmiko to retrieve data from a Cisco WLC about things that are causing interference on our WiFi network. I stripped the original 600k lines down to just what I needed, a couple of thousand lines like this:
AP Name.......................................... 010-HIGH-FL4-AP04
Microwave Oven 11 10 -59 Mon Dec 18 08:21:23 2017
WiMax Mobile 11 0 -84 Fri Dec 15 17:09:45 2017
WiMax Fixed 11 0 -68 Tue Dec 12 09:29:30 2017
AP Name.......................................... 010-2nd-AP04
Microwave Oven 11 10 -61 Sat Dec 16 11:20:36 2017
WiMax Fixed 11 0 -78 Mon Dec 11 12:33:10 2017
AP Name.......................................... 139-FL1-AP03
Microwave Oven 6 18 -51 Fri Dec 15 12:26:56 2017
AP Name.......................................... 010-HIGH-FL3-AP04
Microwave Oven 11 10 -55 Mon Dec 18 07:51:23 2017
WiMax Mobile 11 0 -83 Wed Dec 13 16:16:26 2017
The goal is to end up with a CSV file that strips out the 'AP Name ...' prefix and puts what's left on the same line as the rest of the information from the following line(s). The problem is that some APs have two lines below the name, and some have one or none. I have been at it for 8 hours and cannot find the best way to make this happen.
This is the latest version of the code I was trying; any suggestions for making it work? I just want something I can load up in Excel and create a report from:
with open(outfile_name, 'w') as out_file:
    with open('wlc-interference_raw.txt', 'r') as in_file:
        #Variables
        _ap_name = ''
        _temp = ''
        _flag = False
        for i in in_file:
            if 'AP Name' in i:
                #write whatever was put in the temp file to disk because new ap now
                #add another temp variable in case an ap has more than 1 interferer and check if new AP name
                out_file.write(_temp)
                out_file.write('\n')
                #print(_temp)
                _ap_name = i.lstrip('AP Name.......................................... ')
                _ap_name = _ap_name.rstrip('\n')
                _temp = _ap_name
                #print(_temp)
            elif '----' in i:
                pass
            elif 'Class Type' in i:
                pass
            else:
                line_split = i.split()
                for x in line_split:
                    _temp += ','
                    _temp += x
                _temp += '\n'
I think your best option is to read all lines of the file, then split into sections starting with AP Name. Then you can work on parsing each section.
Example
s = """AP Name.......................................... 010-HIGH-FL4-AP04
Microwave Oven 11 10 -59 Mon Dec 18 08:21:23 2017
WiMax Mobile 11 0 -84 Fri Dec 15 17:09:45 2017
WiMax Fixed 11 0 -68 Tue Dec 12 09:29:30 2017
AP Name.......................................... 010-2nd-AP04
Microwave Oven 11 10 -61 Sat Dec 16 11:20:36 2017
WiMax Fixed 11 0 -78 Mon Dec 11 12:33:10 2017
AP Name.......................................... 139-FL1-AP03
Microwave Oven 6 18 -51 Fri Dec 15 12:26:56 2017
AP Name.......................................... 010-HIGH-FL3-AP04
Microwave Oven 11 10 -55 Mon Dec 18 07:51:23 2017
WiMax Mobile 11 0 -83 Wed Dec 13 16:16:26 2017"""
import re
class AP:
    """
    A class holding each section of the parsed file
    """
    def __init__(self):
        self.header = ""
        self.content = []

sections = []
section = None
for line in s.split('\n'):  # Or 'for line in file:'
    # Starting new section
    if line.startswith('AP Name'):
        # If previously had a section, add to list
        if section is not None:
            sections.append(section)
        section = AP()
        section.header = line
    else:
        if section is not None:
            section.content.append(line)
sections.append(section)  # Add last section outside of loop

for section in sections:
    ap_name = section.header.lstrip("AP Name.")  # lstrip takes all the characters given, not a literal string
    for line in section.content:
        print(ap_name + ",", end="")
        # You can extract the date separately, if needed
        # Splitting on more than one space using a regex
        line = ",".join(re.split(r'\s\s+', line))
        print(line.rstrip(','))  # Remove trailing comma from imperfect split
Output
010-HIGH-FL4-AP04,Microwave Oven,11,10,-59,Mon Dec 18 08:21:23 2017
010-HIGH-FL4-AP04,WiMax Mobile,11,0,-84,Fri Dec 15 17:09:45 2017
010-HIGH-FL4-AP04,WiMax Fixed,11,0,-68,Tue Dec 12 09:29:30 2017
010-2nd-AP04,Microwave Oven,11,10,-61,Sat Dec 16 11:20:36 2017
010-2nd-AP04,WiMax Fixed,11,0,-78,Mon Dec 11 12:33:10 2017
139-FL1-AP03,Microwave Oven,6,18,-51,Fri Dec 15 12:26:56 2017
010-HIGH-FL3-AP04,Microwave Oven,11,10,-55,Mon Dec 18 07:51:23 2017
010-HIGH-FL3-AP04,WiMax Mobile,11,0,-83,Wed Dec 13 16:16:26 2017
Tip:
You don't need Python to write the CSV; you can redirect the script's output to a file from the command line:
python script.py > output.csv
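If you would rather write the CSV from Python itself (so the file is ready to open in Excel), here is a minimal sketch using the standard csv module, continuing from the sections list built above; the filename and column names are placeholders to adjust:
import csv
import re

rows = []
for section in sections:  # 'sections' as built in the answer above
    ap_name = section.header.lstrip("AP Name.")
    for line in section.content:
        # Same "two or more spaces" split as above; strip() avoids empty edge fields.
        rows.append([ap_name] + re.split(r'\s\s+', line.strip()))

# newline='' prevents blank rows on Windows; header names are placeholders.
with open('interference.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['AP Name', 'Interferer', 'Channel', 'Duty', 'RSSI', 'Last Seen'])
    writer.writerows(rows)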