Unable to scrape data from JSON page using Python

I'm trying to pull information off of this web page (which loads its data via an AJAX call to this page).
I'm able to print out the whole page, but the find_all function just returns a blank list. What am I doing wrong?
from bs4 import BeautifulSoup
import requests
url = "http://financials.morningstar.com/finan/financials/getFinancePart.html?&callback=jsonp1653673850875&t=XNAS:AAPL&region=usa&culture=en-US&cur=&order=asc&_=1653673850919"
def pageText():
    result = requests.get(url)
    doc = BeautifulSoup(result.text, "html.parser")
    return doc

specialNum = pageText()
print(specialNum)

specialNum = pageText().find_all('literally anything I am trying to pull off of the page')
print(specialNum)  # This will always print a blank list
Apologies if this is a stupid question. I'm a bit of a beginner.

EDIT
As mentioned by @furas, if you remove the parameter and value callback=jsonp1653673850875 from the URL, the server will send pure JSON and you can get the HTML directly via r.json()['componentData'].
The simplest approach, in my opinion, is to unwrap the JSON string and convert it with json.loads() to access the HTML.
From there you can go with BeautifulSoup or pandas to scrape the content.
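A minimal sketch of that callback-free alternative (assuming the endpoint behaves as described above once the callback parameter is dropped):

import requests
from bs4 import BeautifulSoup

# Same endpoint, but without callback=..., so the server returns plain JSON
url = ('http://financials.morningstar.com/finan/financials/getFinancePart.html'
       '?t=XNAS:AAPL&region=usa&culture=en-US&cur=&order=asc')
r = requests.get(url)
soup = BeautifulSoup(r.json()['componentData'], 'html.parser')
print(soup.table is not None)  # the financials table lives inside componentData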
Example beautifulsoup
import json, requests
from bs4 import BeautifulSoup
r = requests.get('http://financials.morningstar.com/finan/financials/getFinancePart.html?&callback=jsonp1653673850875&t=XNAS:AAPL&region=usa&culture=en-US&cur=&order=asc&_=1653673850919')
soup = BeautifulSoup(
    json.loads(
        r.text.split('(', 1)[-1].rsplit(')', 1)[0]
    )['componentData'],
    'html.parser'
)

for row in soup.select('table tr'):
    ...
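For example, a minimal sketch of what could go in that loop (assuming you simply want the text of each cell):

for row in soup.select('table tr'):
    cells = [cell.get_text(strip=True) for cell in row.select('th, td')]
    if cells:
        print(cells)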
Example pandas
import json, requests
import pandas as pd
r = requests.get('http://financials.morningstar.com/finan/financials/getFinancePart.html?&callback=jsonp1653673850875&t=XNAS:AAPL&region=usa&culture=en-US&cur=&order=asc&_=1653673850919')
pd.read_html(
    json.loads(
        r.text.split('(', 1)[-1].rsplit(')', 1)[0]
    )['componentData']
)[0].dropna()
Output
| Unnamed: 0 | 2012-09 | 2013-09 | 2014-09 | 2015-09 | 2016-09 | 2017-09 | 2018-09 | 2019-09 | 2020-09 | 2021-09 | TTM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Revenue USD Mil | 156508 | 170910 | 182795 | 233715 | 215639 | 229234 | 265595 | 260174 | 274515 | 365817 | 386017 |
| Gross Margin % | 43.9 | 37.6 | 38.6 | 40.1 | 39.1 | 38.5 | 38.3 | 37.8 | 38.2 | 41.8 | 43.3 |
| Operating Income USD Mil | 55241 | 48999 | 52503 | 71230 | 60024 | 61344 | 70898 | 63930 | 66288 | 108949 | 119379 |
| Operating Margin % | 35.3 | 28.7 | 28.7 | 30.5 | 27.8 | 26.8 | 26.7 | 24.6 | 24.1 | 29.8 | 30.9 |
| Net Income USD Mil | 41733 | 37037 | 39510 | 53394 | 45687 | 48351 | 59531 | 55256 | 57411 | 94680 | 101935 |
| Earnings Per Share USD | 1.58 | 1.42 | 1.61 | 2.31 | 2.08 | 2.3 | 2.98 | 2.97 | 3.28 | 5.61 | 6.15 |
| Dividends USD | 0.09 | 0.41 | 0.45 | 0.49 | 0.55 | 0.6 | 0.68 | 0.75 | 0.8 | 0.85 | 0.88 |
| Payout Ratio % * | — | 27.4 | 28.5 | 22.3 | 24.8 | 26.5 | 23.7 | 25.1 | 23.7 | 16.3 | 14.3 |
| Shares Mil | 26470 | 26087 | 24491 | 23172 | 22001 | 21007 | 20000 | 18596 | 17528 | 16865 | 16585 |
| Book Value Per Share * USD | 4.25 | 4.9 | 5.15 | 5.63 | 5.93 | 6.46 | 6.04 | 5.43 | 4.26 | 3.91 | 4.16 |
| Operating Cash Flow USD Mil | 50856 | 53666 | 59713 | 81266 | 65824 | 63598 | 77434 | 69391 | 80674 | 104038 | 116426 |
| Cap Spending USD Mil | -9402 | -9076 | -9813 | -11488 | -13548 | -12795 | -13313 | -10495 | -7309 | -11085 | -10633 |
| Free Cash Flow USD Mil | 41454 | 44590 | 49900 | 69778 | 52276 | 50803 | 64121 | 58896 | 73365 | 92953 | 105793 |
| Free Cash Flow Per Share * USD | 1.58 | 1.61 | 1.93 | 2.96 | 2.24 | 2.41 | 2.88 | 3.07 | 4.04 | 5.57 | — |
| Working Capital USD Mil | 19111 | 29628 | 5083 | 8768 | 27863 | 27831 | 14473 | 57101 | 38321 | 9355 | — |
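If you want to keep working with the table, one optional cleanup (an assumption about what you need, not part of the original answer) is to rename the label column and use it as the index, continuing the pandas example above:

df = pd.read_html(
    json.loads(
        r.text.split('(', 1)[-1].rsplit(')', 1)[0]
    )['componentData']
)[0].dropna()

df = df.rename(columns={'Unnamed: 0': 'Metric'}).set_index('Metric')
print(df.loc['Revenue USD Mil'])  # one value per fiscal year plus TTM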

Related

Animate px.line line with plotly express

I'm learning plotly line animation and came across this question.
My df:
         Date   1Mo   2Mo   3Mo   6Mo   1Yr   2Yr
0  2023-02-12  4.66  4.77  4.79  4.89  4.50  4.19
1  2023-02-11  4.66  4.77  4.77  4.90  4.88  4.49
2  2023-02-10  4.64  4.69  4.72  4.88  4.88  4.79
3  2023-02-09  4.62  4.68  4.71  4.82  4.88  4.89
4  2023-02-08  4.60  4.61  4.72  4.83  4.89  4.89
How do I animate this dataframe so the frame has
x = [1Mo, 2Mo, 3Mo, 6Mo, 1Yr, 2Yr], and
y = the actual value on a date, eg y=df[df['Date']=="2023-02-08"], animation_frame = df['Date']?
I tried
plot = px.line(df, x=df.columns[1:], y=df['Date'], title="Treasury Yields", animation_frame=df_treasuries_yield['Date'])
No joy :(
I think the problem is that you cannot pass multiple columns to the animation_frame parameter. But we can get around this by converting your df from wide to long format using pd.melt. For your data, we want to take all of the values from [1Mo, 2Mo, 3Mo, 6Mo, 1Yr, 2Yr] and put them in a new column called "value", with a "variable" column telling us which original column each value came from.
df_long = pd.melt(df, id_vars=['Date'], value_vars=['1Mo', '2Mo', '3Mo', '6Mo', '1Yr', '2Yr'])
This will look like the following:
          Date variable  value
0   2023-02-12      1Mo   4.66
1   2023-02-11      1Mo   4.66
2   2023-02-10      1Mo   4.64
3   2023-02-09      1Mo   4.62
4   2023-02-08      1Mo   4.60
..         ...      ...    ...
28  2023-02-09      2Yr   4.89
29  2023-02-08      2Yr   4.89
Now we can pass the argument animation_frame='Date' to px.line:
fig = px.line(df_long, x="variable", y="value", animation_frame="Date", title="Yields")
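One extra note (an assumption about the desired look, not part of the original answer): Plotly takes the axis ranges from the first animation frame, so if the line jumps out of view between frames you can pin the y-axis with range_y:

fig = px.line(
    df_long, x="variable", y="value", animation_frame="Date", title="Yields",
    range_y=[df_long["value"].min() - 0.1, df_long["value"].max() + 0.1],  # fixed y-axis across frames
)
fig.show()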

Missing Tables with Beautifulsoup web scraping

I've been trying to web-scrape the evolving-hockey.com website for team data, but get no tables back when using:
from bs4 import BeautifulSoup as bs
from bs4 import Comment
import requests
site = 'https://evolving-hockey.com/stats/team_standard/?_inputs_&std_tm_str=%225v5%22&std_tm_table=%22On-Ice%22&std_tm_team=%22All%22&std_tm_range=%22Seasons%22&std_tm_adj=%22Score%20%26%20Venue%22&std_tm_span=%22Regular%22&dir_ttbl=%22Stats%22&std_tm_type=%22Rates%22&std_tm_group=%22Season%22'
r = requests.get(site)
soup = bs(r.content, 'html.parser')
data = soup.find_all('table')
returns nothing even though the html code suggests there are tables within.
Why can't beautifulsoup find the table data? Are they linked to somewhere else?
Thanks for the help
To retrieve the dynamically loaded data, I used Selenium:
import time

from bs4 import BeautifulSoup as bs
from selenium import webdriver

driver = webdriver.Chrome()
site = 'https://evolving-hockey.com/stats/team_standard/?_inputs_&std_tm_str=%225v5%22&std_tm_table=%22On-Ice%22&std_tm_team=%22All%22&std_tm_range=%22Seasons%22&std_tm_adj=%22Score%20%26%20Venue%22&std_tm_span=%22Regular%22&dir_ttbl=%22Stats%22&std_tm_type=%22Rates%22&std_tm_group=%22Season%22'
driver.get(site)
time.sleep(5)  # delay to let the page render its tables

soup = bs(driver.page_source, 'html.parser')

# second <tr>: the column headers
data = soup.find_all('tr')[1]
for d in data:
    print(d.get_text(strip=True), end=' ')

# rows 1-32 of the table
data2 = soup.find_all('tr')[1:33]
for x in data2:
    print(x.get_text(strip=True, separator=' '), end='\n')

driver.quit()
Output:
Name Team Season GP TOI GF% SF% FF% CF% xGF% GF/60 GA/60 SF/60 SA/60 FF/60 FA/60 CF/60 CA/60 xGF/60 xGA/60 G±/60 S±/60 F±/60 C±/60 xG±/60 Sh% Sv% Name Team Season GP TOI GF% SF% FF% CF% xGF% GF/60 GA/60 SF/60 SA/60 FF/60 FA/60 CF/60 CA/60 xGF/60 xGA/60 G±/60 S±/60 F±/60 C±/60 xG±/60 Sh% Sv%
1 Ducks ANA 19-20 71 3450.65 46.57 47.7 47.97 47.87 47.08 2.22 2.55 28.77 31.54 41.45 44.96 53.96 58.75 2.32 2.61 -0.33 -2.77 -3.51 -4.79 -0.29 7.73 91.91
2 Coyotes ARI 19-20 70 3405.93 50.08 49.72 48.57 48.6 49.61 2.24 2.24 31.12 31.47 42.56 45.07 56.02 59.24 2.33 2.36 0.01 -0.35 -2.51 -3.22 -0.04 7.21 92.9
3 Bruins BOS 19-20 70 3328.28 57.69 52.48 51.83 51.93 52.82 2.56 1.88 31.07 28.13 42.5 39.5 55.98 51.81 2.22 1.98 0.68 2.94 3 4.17 0.24 8.24 93.32
4 Sabres BUF 19-20 69 3393.5 49.11 47.9 48.37 48.81 47.54 2.29 2.37 27.93 30.38 38.75 41.36 50.16 52.62 2.05 2.26 -0.08 -2.45 -2.61 -2.45 -0.21 8.2 92.19
5 Hurricanes CAR 19-20 68 3217.15 50.97 52.96 53.66 54.42 52.37 2.63 2.53 32.38 28.75 45.33 39.15 60.05 50.29 2.76 2.51 0.1 3.62 6.18 9.76 0.25 8.12 91.2
6 Blue Jackets CBJ 19-20 70 3478.85 50.55 51.82 50.66 48.87 51.66 2.14 2.09 31.53 29.32 41.53 40.44 53.63 56.11 2.22 2.08 0.05 2.21 1.09 -2.48 0.14 6.78 92.87
7 Flames CGY 19-20 70 3429.05 47.34 48.91 49.42 49.91 50.74 2.31 2.57 30.24 31.58 43.03 44.05 57.22 57.41 2.47 2.4 -0.26 -1.35 -1.02 -0.2 0.07 7.64 91.86
etc.
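As a side note (not part of the original answer), an explicit wait is usually more robust than a fixed time.sleep(5); here is a minimal sketch using Selenium's WebDriverWait, assuming the rows appear as ordinary <tr> elements once the page has finished loading:

from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get(site)  # same URL as in the snippet above

# Wait up to 30 seconds for at least one table row instead of sleeping a fixed time
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'table tr'))
)

soup = bs(driver.page_source, 'html.parser')
driver.quit()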

How to read specific lines from text using a starting and ending condition?

I have a document.gca file that contains specific information I need to extract. One part of the text repeats the following pattern:
#Sta/Elev= xx
(here go the pair numbers)
#Mann
This block repeats several times. My goal is to capture the pair numbers that are in that interval, and to repeat this for every block in the text. How can I extract that? Say I have this:
Sta/Elev= 259
0 2186.31 .3 2186.14 .9 2185.83 1.4 2185.56 2.5 2185.23
3 2185.04 3.6 2184.83 4.7 2184.61 5.6 2184.4 6.4 2184.17
6.9 2183.95 7.5 2183.69 7.6 2183.59 8 2183.35 8.6 2182.92
10.2 2181.47 10.8 2181.03 11.3 2180.63 11.9 2180.27 12.4 2179.97
13 2179.72 13.6 2179.47 14.1 2179.3 14.3 2179.21 14.7 2179.11
15.7 2178.9 17.4 2178.74 17.9 2178.65 20.1 2178.17 20.4 2178.13
20.4 2178.12 21.5 2177.94 22.6 2177.81 22.6 2177.8 22.9 2177.79
24.1 2177.78 24.4 2177.75 24.6 2177.72 24.8 2177.68 25.2 2177.54
Mann= 3 , 0 , 0
0 .2 0 26.9 .2 0 46.1 .2 0
Bank Sta=26.9,46.1
XS Rating Curve= 0 ,0
XS HTab Starting El and Incr=2176.01,0.3, 56
XS HTab Horizontal Distribution= 0 , 0 , 0
Exp/Cntr(USF)=0,0
Exp/Cntr=0.3,0.1
Type RM Length L Ch R = 1 ,2655 ,11.2,11.1,10.5
XS GIS Cut Line=4
858341.2470677761196439.12427935858354.9998313071196457.53292637
858369.2753539641196470.40256485858387.8228168661196497.81690065
Node Last Edited Time=Aug/05/2019 11:42:02
Sta/Elev= 245
0 2191.01 .8 2190.54 2.5 2189.4 5 2187.76 7.2 2186.4
8.2 2185.73 9.5 2184.74 10.1 2184.22 10.3 2184.04 10.8 2183.55
12.8 2180.84 13.1 2180.55 13.3 2180.29 13.9 2179.56 14.2 2179.25
14.5 2179.03 15.8 2178.18 16.4 2177.81 16.7 2177.65 17 2177.54
17.1 2177.51 17.2 2177.48 17.5 2177.43 17.6 2177.4 17.8 2177.39
18.3 2177.37 18.8 2177.37 19.7 2177.44 20 2177.45 20.6 2177.45
20.7 2177.45 20.8 2177.44 21 2177.42 21.3 2177.41 21.4 2177.4
21.7 2177.32 22 2177.26 22.1 2177.21 22.2 2177.13 22.5 2176.94
22.6 2176.79 22.9 2176.54 23.2 2176.19 23.5 2175.88 23.9 2175.68
24.4 2175.55 24.6 2175.54 24.8 2175.53 24.9 2175.53 25.1 2175.54
25.7 2175.63 26 2175.71 26.3 2175.78 26.4 2175.8 26.4 2175.82
#Mann= 3 , 0 , 0
0 .2 0 22.9 .2 0 43 .2 0
Bank Sta=22.9,43
XS Rating Curve= 0 ,0
XS HTab Starting El and Incr=2175.68,0.3, 51
XS HTab Horizontal Distribution= 0 , 0 , 0
Exp/Cntr(USF)=0,0
Exp/Cntr=0.3,0.1
But I want to select the numbers between Sta/Elev and Mann and save them as pair vectors, one set for each Sta/Elev. Right now I have this:
import re

with open('a.g01', 'r') as file:
    file_contents = file.read()
    # print(file_contents)

try:
    found = re.search('#Sta/Elev(.+?)#Mann', file_contents).group(1)
except AttributeError:
    found = ''  # apply your error handling

print(found)
found is empty, and I want to catch all the numbers in the interval between '#Sta/Elev' and '#Mann'.
The problem is in your regex; try switching
found = re.search('#Sta/Elev(.+?)#Mann',file_contents).group(1)
to
found = re.search('Sta/Elev(.*)Mann',file_contents).group(1)
output:
>>> import re
>>> file_contents = 'Sta/ElevthisisatestMann'
>>> found = re.search('Sta/Elev(.*)Mann',file_contents).group(1)
>>> print(found)
thisisatest
Edit:
For multiline matching, try adding the re.DOTALL flag:
found = re.search('Sta/Elev=(.*)Mann',file_contents, re.DOTALL).group(1)
It was not clear to me what the separating string is, since it differs between your examples, but you can simply change it in the regex expression.
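If you need every Sta/Elev ... Mann block rather than just the first one, here is a sketch using re.findall with a non-greedy group (the optional '#' is an assumption to cover both separators seen in the excerpt above):

import re

with open('a.g01', 'r') as file:
    file_contents = file.read()

# Non-greedy (.*?) plus re.DOTALL makes each match stop at the next (#)Mann
blocks = re.findall(r'#?Sta/Elev=\s*\d+\s*(.*?)\s*#?Mann', file_contents, re.DOTALL)

for block in blocks:
    numbers = [float(x) for x in block.split()]
    pairs = list(zip(numbers[::2], numbers[1::2]))  # (station, elevation) pairs
    print(len(pairs), pairs[:3])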

Not performing calculation on blank field in dataframe

I have a data-frame (df) with the following structure:
date a b c d e f g
23/02/2009 577.9102
24/02/2009 579.1345
25/02/2009 583.2158
26/02/2009 629.7425
27/02/2009 553.8306
02/03/2009 6.15 5.31 5.80 223716 790.8724 5.7916
03/03/2009 6.16 6.2 6.26 818424 770.6165 6.0161
04/03/2009 6.6 6.485 6.57 636544 858.5754 1.4331 6.4149
05/03/2009 6.1 5.98 6.06 810584 816.5025 1.7475 6.242
06/03/2009 5.845 5.95 6.00 331079 796.7618 1.7144 6.0427
09/03/2009 5.4 5.2 5.28 504271 744.0833 1.6449 5.4076
10/03/2009 5.93 5.59 5.595 906742 814.2862 1.4128 5.8434
Where columns a and g both have data, I would like to multiply them together using the following:
df["h"] = df["a"]*df["g"]
However, as you can see from the time series above, there is not always data with which to perform the calculation, and I am returned the following error:
KeyError: 'g'
Is there a way to check whether the data exists before performing the calculation? I am trying to use:
df["h"] = np.where((df.a == blank)|(df.g == blank),"",df.a*df.g)
I would like to have returned:
date a b c d e f g h
23/02/2009 577.9102
24/02/2009 579.1345
25/02/2009 583.2158
26/02/2009 629.7425
27/02/2009 553.8306
02/03/2009 6.15 5.31 5.8 223716 790.8724 5.7916 1.0618
03/03/2009 6.16 6.2 6.26 818424 770.6165 6.0161 1.0239
04/03/2009 6.6 6.485 6.57 636544 858.5754 1.4331 6.4149 1.0288
05/03/2009 6.1 5.98 6.06 810584 816.5025 1.7475 6.242 0.9772
06/03/2009 5.845 5.95 6.00 331079 796.7618 1.7144 6.0427 0.9672
09/03/2009 5.4 5.2 5.28 504271 744.0833 1.6449 5.4076 0.9985
10/03/2009 5.93 5.59 5.595 906742 814.2862 1.4128 5.8434 1.0148
but I am unsure of the syntax for a blank data field. What should that be?
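For what it's worth, a blank cell in a pandas DataFrame is normally NaN, and NaN already propagates through arithmetic; a minimal sketch (assuming df is the frame shown above, so the column names are taken from the question):

import numpy as np

if 'g' in df.columns:                # the KeyError suggests checking this first
    df['h'] = df['a'] * df['g']      # rows with a blank a or g get NaN in h automatically
else:
    print('no column g; columns are:', list(df.columns))

If you prefer the np.where form from the question, the test for a blank field would be df['a'].isna() | df['g'].isna() rather than a comparison with a literal blank.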

Read a table according to a certain number of characters

I want to extract the names of comets from my table held in a text file. However, some comet names are one word, others are two words, and some are three. My table looks like this:
9P/Tempel 1 1.525 0.514 10.5 5.3 2.969
27P/Crommelin 0.748 0.919 29.0 27.9 1.484
126P/IRAS 1.713 0.697 45.8 13.4 1.963
177P/Barnard 1.107 0.954 31.2 119.6 1.317
P/2008 A3 (SOHO) 0.049 0.984 22.4 5.4 1.948
P/2008 Y11 (SOHO) 0.046 0.985 24.4 5.3 1.949
C/1991 L3 Levy 0.983 0.929 19.2 51.3 1.516
However, I know that the comet name runs from character 5 to character 37. How can I write code to tell Python that the first column spans characters 5 to 37?
data = """9P/Tempel 1 1.525 0.514 10.5 5.3 2.969
27P/Crommelin 0.748 0.919 29.0 27.9 1.484
126P/IRAS 1.713 0.697 45.8 13.4 1.963
177P/Barnard 1.107 0.954 31.2 119.6 1.317
P/2008 A3 (SOHO) 0.049 0.984 22.4 5.4 1.948
P/2008 Y11 (SOHO) 0.046 0.985 24.4 5.3 1.949
C/1991 L3 Levy 0.983 0.929 19.2 51.3 1.516""".split('\n')
To read the whole file you can use
f = open('data.txt', 'r').readlines()
It seems that you have fixed-width columns that you can use.
If you're only interested in the first column, measure its full width (the name plus its trailing padding):
len("9P/Tempel 1 ")
It gives 33.
So, extract the first column:
for line in data:
    print(line[:33].strip())
Here is what's printed:
9P/Tempel 1
27P/Crommelin
126P/IRAS
177P/Barnard
P/2008 A3 (SOHO)
P/2008 Y11 (SOHO)
C/1991 L3 Levy
If what you want is :
Tempel 1
Crommelin
IRAS
...
You have to use a regular expression.
Example :
import re

reg = r'.*?/[\d\s]*(.*)'
print(re.match(reg, '27P/Crommelin').group(1))
print(re.match(reg, 'C/1991 L3 Levy').group(1))
Here's the output :
Crommelin
L3 Levy
You can also take a look at read_fwf from the pandas library.
It allows you to parse your file by specifying the number of characters per column.
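A minimal read_fwf sketch (the column boundaries are assumptions based on the "character 5 till character 37" description; colspecs is zero-indexed and end-exclusive, so adjust by one if needed):

import pandas as pd

# Read only the fixed-width name column from the file used above
names = pd.read_fwf('data.txt', colspecs=[(4, 37)], header=None, names=['name'])
print(names['name'].str.strip())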
