Collecting data from Rotten Tomatoes using a Python API - python

I'm new to Python and APIs, and I'm trying to collect data from this link:
https://www.rottentomatoes.com/top/bestofrt/top_100_action__adventure_movies/
The data I want is the first 25 movies and their info, and I have to use an API.
The code I'm trying is this:
import requests

result = requests.get('https://www.rottentomatoes.com/top/bestofrt/top_100_action__adventure_movies/').text
print(result)
and that's as far as I could get... (I'm very new)
The result is very long, but here is an excerpt from it:
<td class="bold">8.</td>
<td>
<span class="tMeterIcon tiny">
<span class="icon tiny certified_fresh"></span>
<span class="tMeterScore"> 92%</span>
</span>
</td>
<td>
<a href="/m/dunkirk_2017" class="unstyled articleLink">
Dunkirk (2017)</a>
</td>
<td class="right hidden-xs">461</td>
The info I need from this is the rank (class="bold"), rating (class="tMeterScore"), title (class="unstyled articleLink"), and number of reviews (class="right hidden-xs").
So the problem is that I don't know how to get the data I need from the result, and I don't know if I'm even doing it the right way (or if there is a better way to get the data).

For tables on simple web pages, pandas.read_html is great.
import pandas as pd

# Read all the page tables with a single call
tables = pd.read_html('https://www.rottentomatoes.com/top/bestofrt/top_100_action__adventure_movies/')

# Display the table shapes to manually select the correct one
print("Tables")
print('\n'.join([f"{i} shape:{t.shape}" for i, t in enumerate(tables)]))

# Select the table ('2' comes from the manual observation above)
table = tables[2]

# Some data
print('\n'.join(["", "Table:", "",
                 "Columns types:", str(table.dtypes), "", "",
                 "5 first and last rows:", str(table), "", "",
                 "First row:", str(table.iloc[0])
                 ]))
Output:
Tables
0 shape:(11, 2)
1 shape:(12, 2)
2 shape:(100, 4)
3 shape:(10, 3)
4 shape:(10, 3)
Table:
Columns types:
Rank float64
RatingTomatometer object
Title object
No. of Reviews int64
dtype: object
5 first and last rows:
Rank ... No. of Reviews
0 1.0 ... 525
1 2.0 ... 547
2 3.0 ... 437
3 4.0 ... 434
4 5.0 ... 392
.. ... ... ...
95 96.0 ... 93
96 97.0 ... 130
97 98.0 ... 324
98 99.0 ... 203
99 100.0 ... 66
[100 rows x 4 columns]
First row:
Rank 1.0
RatingTomatometer 96%
Title Black Panther (2018)
No. of Reviews 525
Name: 0, dtype: object
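Since the question asks for only the first 25 movies, you can then slice the resulting DataFrame; a minimal sketch, assuming the table selected above:
# Keep only the first 25 ranked movies
top_25 = table.head(25)
print(top_25)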

Related

Create new column with multiple values in Python

I have a dataframe which has the names of stations and, for each station, a link to its measured values for the last 2 days:
Station Link
0 EITZE https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations/EITZE/W/measurements.json?start=P2D
1 RETHEM https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations/RETHEM/W/measurements.json?start=P2D
.......
685 BORGFELD https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations/BORGFELD/W/measurements.json?start=P2D
Taking data from the JSON isn't a big problem.
But then I realized that the JSON link of each station has multiple values from different times, so I don't know how to add these values from each time to a specific station.
I tried to get all the values from the JSON, but I can't tell which values belong to which station, because there are just too many.
Does anyone have a solution for me?
The dataframe I would like to have should look like this:
Station Timestamp Value
0 EITZE 2022-07-31T00:30:00+02:00 15
1 EITZE 2022-07-31T00:45:00+02:00 15
.......
100 RETHEM 2022-07-31T00:30:00+02:00 15
101 RETHEM 2022-07-31T00:45:00+02:00 20
.......
xxxx BORGFELD 2022-08-02T00:32:00+02:00 608
Starting with this example data frame:
Station Link
0 EITZE https://www.pegelonline.wsv.de/webservices/res...
1 RETHEM https://www.pegelonline.wsv.de/webservices/res...
You could leverage apply to populate an accumulation data frame.
import requests
import json
import pandas as pd

# Define the function to be used by apply
def get_link(x):
    global accum_df
    r = requests.get(x['Link'])
    if r.status_code == 200:
        ldf = pd.DataFrame(json.loads(r.text))
        ldf['station'] = x['Station']
        accum_df = pd.concat([accum_df, ldf])
    else:
        print(r.status_code)  # handle the error
    return None

# Apply it
accum_df = pd.DataFrame()
df.apply(get_link, axis=1)
print(accum_df)
Result
timestamp value station
0 2022-07-31T02:00:00+02:00 220.0 EITZE
1 2022-07-31T02:15:00+02:00 220.0 EITZE
2 2022-07-31T02:30:00+02:00 220.0 EITZE
3 2022-07-31T02:45:00+02:00 220.0 EITZE
4 2022-07-31T03:00:00+02:00 219.0 EITZE
.. ... ... ...
181 2022-08-02T00:00:00+02:00 23.0 RETHEM
182 2022-08-02T00:15:00+02:00 23.0 RETHEM
183 2022-08-02T00:30:00+02:00 23.0 RETHEM
184 2022-08-02T00:45:00+02:00 23.0 RETHEM
185 2022-08-02T01:00:00+02:00 23.0 RETHEM
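A variant without the global accumulator, collecting one frame per station and concatenating once at the end, may read more cleanly; a minimal sketch, assuming the same df with Station and Link columns:
import requests
import pandas as pd

def fetch_station(station, link):
    # Fetch one station's measurements and tag each row with the station name
    r = requests.get(link)
    r.raise_for_status()
    ldf = pd.DataFrame(r.json())  # expected columns: timestamp, value
    ldf['station'] = station
    return ldf

# Build one frame per station, then concatenate once
frames = [fetch_station(row['Station'], row['Link']) for _, row in df.iterrows()]
accum_df = pd.concat(frames, ignore_index=True)
print(accum_df)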

New to Beautiful Soup. Need to scrape tables from an online report

I want to scrape the following data using Beautiful Soup, but I can't figure it out. Please help.
<TABLE WIDTH=100%>
<TD VALIGN="TOP" WIDTH="30%">
<TABLE BORDER="1" WIDTH="100%">
<TR>
<TH COLSPAN="3"><CENTER><B>SUMMARY</B></CENTER></TH>
</TR>
<TR><TD>Alberta Total Net Generation</TD><TD>9299</TD></TR>
<TR><TD>Net Actual Interchange</TD><TD>-386</TD></TR>
<TR><TD>Alberta Internal Load (AIL)</TD><TD>9685</TD></TR>
<TR><TD>Net-To-Grid Generation</TD><TD>6897</TD></TR>
<TR><TD>Contingency Reserve Required</TD><TD>518</TD></TR>
<TR><TD>Dispatched Contingency Reserve (DCR)</TD><TD>552</TD></TR>
<TR><TD>Dispatched Contingency Reserve -Gen</TD><TD>374</TD></TR>
<TR><TD>Dispatched Contingency Reserve -Other</TD><TD>178</TD></TR>
<TR><TD>LSSi Armed Dispatch</TD><TD>73</TD></TR>
<TR><TD>LSSi Offered Volume</TD><TD>73</TD></TR>
</TABLE>
This is the link I want to scrape.
http://ets.aeso.ca/ets_web/ip/Market/Reports/CSDReportServlet
I need the Summary, Generation and Interchange tables separately. Any help would be great.
I'd use pd.read_html + BeautifulSoup to read the data. Also, use the html5lib parser when you parse the page (it contains malformed tags):
import requests
import pandas as pd
from bs4 import BeautifulSoup

def get_summary(soup):
    summary = soup.select_one(
        "table:has(b:-soup-contains(SUMMARY)):not(:has(table))"
    )
    summary.tr.extract()
    return pd.read_html(str(summary))[0]

def get_generation(soup):
    generation = soup.select_one(
        "table:has(b:-soup-contains(GENERATION)):not(:has(table))"
    )
    generation.tr.extract()
    for td in generation.tr.select("td"):
        td.name = "th"
    return pd.read_html(str(generation))[0]

def get_interchange(soup):
    interchange = soup.select_one(
        "table:has(b:-soup-contains(INTERCHANGE)):not(:has(table))"
    )
    interchange.tr.extract()
    for td in interchange.tr.select("td"):
        td.name = "th"
    return pd.read_html(str(interchange))[0]

url = "http://ets.aeso.ca/ets_web/ip/Market/Reports/CSDReportServlet"
soup = BeautifulSoup(requests.get(url).content, "html5lib")

print(get_summary(soup))
print(get_generation(soup))
print(get_interchange(soup))
Prints:
0 1
0 Alberta Total Net Generation 9359
1 Net Actual Interchange -343
2 Alberta Internal Load (AIL) 9702
3 Net-To-Grid Generation 6946
4 Contingency Reserve Required 514
5 Dispatched Contingency Reserve (DCR) 552
6 Dispatched Contingency Reserve -Gen 374
7 Dispatched Contingency Reserve -Other 178
8 LSSi Armed Dispatch 78
9 LSSi Offered Volume 82
GROUP MC TNG DCR
0 GAS 10836 6801 79
1 HYDRO 894 270 233
2 ENERGY STORAGE 50 0 50
3 SOLAR 936 303 0
4 WIND 2269 448 0
5 OTHER 424 273 12
6 DUAL FUEL 0 0 0
7 COAL 1266 1264 0
8 TOTAL 16675 9359 374
PATH ACTUAL FLOW
0 British Columbia -230
1 Montana -113
2 Saskatchewan 0
3 TOTAL -343
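Since get_generation and get_interchange differ only in the heading text they match, the three helpers could be folded into one; a minimal sketch, assuming the same soup object and CSS selectors as above (the promote_header flag is my own naming):
import pandas as pd

def get_report_table(soup, heading, promote_header=True):
    # Select the innermost table whose bold heading contains `heading`
    table = soup.select_one(
        f"table:has(b:-soup-contains({heading})):not(:has(table))"
    )
    table.tr.extract()  # drop the heading row
    if promote_header:
        # Promote the first data row's cells to header cells
        for td in table.tr.select("td"):
            td.name = "th"
    return pd.read_html(str(table))[0]

print(get_report_table(soup, "SUMMARY", promote_header=False))
print(get_report_table(soup, "GENERATION"))
print(get_report_table(soup, "INTERCHANGE"))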

How to extract only header names from table into a list

I'm trying to extract just the header values from a Wikipedia table into a list. The following code is what I have so far, but I can't get the output right.
import requests
from bs4 import BeautifulSoup
page = requests.get('https://en.wikipedia.org/wiki/List_of_chemical_elements')
soup = BeautifulSoup(page.text, "html.parser")
table = soup.find('table')
column_names = [item.get_text() for item in table.find_all('th')]
column_names[2:18]
# current output: ['Origin of name[2][3]\n', 'Group\n','Period\n', 'Block\n' ...]
# expected outout ['Atomic Number', 'Symbol', 'Name', 'Origin of name',
# 'Group', 'Period', 'Standard atomic weight', 'Density',
# 'Melting Point'...]
I believe you need to do some data cleaning based on how the HTML is structured. The table has a MultiIndex structure, so you won't get a flat list as columns. Remember pandas has the read_html() function, which lets you pass a raw HTML string and does the parsing for you, removing the need to use BeautifulSoup or do any HTML parsing.
Thinking pragmatically, I believe for this particular case it's better to do it manually; otherwise you will need a lot of string manipulation to get a clean list of column names. It is faster to write them out by hand.
Given you have already done most of the writing, for an easier and more time-efficient solution I recommend:
import pandas as pd

df = pd.read_html(page.text)[0]
column_names = ['Atomic Number', 'Symbol', 'Name', 'Origin of name', 'Group', 'Period',
                'Block', 'Standard atomic weight', 'Density', 'Melting Point',
                'Boiling Point', 'Specific heat capacity', 'Electro-negativity',
                "Abundance in Earth's crust", 'Origin', 'Phase at r.t.']
df.columns = column_names
Which outputs a nice and readable table:
Atomic Number Symbol ... Origin Phase at r.t.
0 1 H ... primordial gas
1 2 He ... primordial gas
2 3 Li ... primordial solid
3 4 Be ... primordial solid
4 5 B ... primordial solid
.. ... ... ... ... ...
113 114 Fl ... synthetic unknown phase
114 115 Mc ... synthetic unknown phase
115 116 Lv ... synthetic unknown phase
116 117 Ts ... synthetic unknown phase
117 118 Og ... synthetic unknown phase
Otherwise if you want to go for a fully-automated approach:
page = requests.get('https://en.wikipedia.org/wiki/List_of_chemical_elements')
df = pd.read_html(page.text)[0]
df.columns = df.columns.droplevel()
Outputs:
Element Origin of name[2][3] Group Period Block Standardatomicweight[a] Density[b][c] Melting point[d] Boiling point[e] Specificheatcapacity[f] Electro­negativity[g] Abundancein Earth'scrust[h] Origin[i] Phase at r.t.[j]
Atomic number.mw-parser-output .nobold{font-weight:normal}Z Symbol Name Unnamed: 3_level_2 Unnamed: 4_level_2 Unnamed: 5_level_2 Unnamed: 6_level_2 (Da) ('"`UNIQ--templatestyles-00000016-QINU`"'g/cm3) (K) (K) (J/g · K) Unnamed: 12_level_2 (mg/kg) Unnamed: 14_level_2 Unnamed: 15_level_2
0 1 H Hydrogen Greek elements hydro- and -gen, 'water-forming' 1.0 1 s-block 1.008 0.00008988 14.01 20.28 14.304 2.20 1400 primordial gas
1 2 He Helium Greek hḗlios, 'sun' 18.0 1 s-block 4.0026 0.0001785 –[k] 4.22 5.193 – 0.008 primordial gas
2 3 Li Lithium Greek líthos, 'stone' 1.0 2 s-block 6.94 0.534 453.69 1560 3.582 0.98 20 primordial solid
3 4 Be Beryllium Beryl, a mineral (ultimately from the name of ... 2.0 2 s-block 9.0122 1.85 1560 2742 1.825 1.57 2.8 primordial solid
4 5 B Boron Borax, a mineral (from Arabic bawraq) 13.0 2 p-block 10.81 2.34 2349 4200 1.026 2.04 10 primordial solid
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
113 114 Fl Flerovium Flerov Laboratory of Nuclear Reactions, part o... 14.0 7 p-block [289] (9.928) (200)[b] (380) – – – synthetic unknown phase
114 115 Mc Moscovium Moscow, Russia, where the element was first sy... 15.0 7 p-block [290] (13.5) (700) (1400) – – – synthetic unknown phase
115 116 Lv Livermorium Lawrence Livermore National Laboratory in Live... 16.0 7 p-block [293] (12.9) (700) (1100) – – – synthetic unknown phase
116 117 Ts Tennessine Tennessee, United States, where Oak Ridge Nati... 17.0 7 p-block [294] (7.2) (700) (883) – – – synthetic unknown phase
117 118 Og Oganesson Yuri Oganessian, Russian physicist 18.0 7 p-block [294] (7) (325) (450) – – – synthetic unknown phase
And the string cleaning needed to make it look nice and tidy is going to take a lot longer than writing a few column names.
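That said, if you do want to derive the header list programmatically, stripping the footnote markers with a regex gets you most of the way; a minimal sketch, assuming the table object from the question:
import re

# Strip footnote markers like [2][3] and surrounding whitespace from each header cell
column_names = [re.sub(r'\[.*?\]', '', th.get_text(strip=True))
                for th in table.find_all('th')]
print(column_names)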

How to grab a complete table hidden beyond 'Show all' by web scraping in Python

According to the reply I found in my previous question, I am able to grab the table by web scraping in Python from the URL: https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html But it only grabs part of the table, up to the row where "Show all" appears.
How can I grab the complete table in Python, including the rows hidden beyond "Show all"?
Here is the code I am using:
import pandas as pd
import requests
from bs4 import BeautifulSoup
#
vaccineDF = pd.read_html('https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html')[0]
vaccineDF = vaccineDF.reset_index(drop=True)
print(vaccineDF.head(100))
The output only grabs 15 rows (until Show All):
Unnamed: 0_level_0 Doses administered ... Unnamed: 8_level_0 Unnamed: 9_level_0
Unnamed: 0_level_1 Per 100 people ... Unnamed: 8_level_1 Unnamed: 9_level_1
0 World 11 ... NaN NaN
1 Israel 116 ... NaN NaN
2 Seychelles 116 ... NaN NaN
3 U.A.E. 99 ... NaN NaN
4 Chile 69 ... NaN NaN
5 Bahrain 66 ... NaN NaN
6 Bhutan 63 ... NaN NaN
7 U.K. 62 ... NaN NaN
8 United States 61 ... NaN NaN
9 San Marino 60 ... NaN NaN
10 Maldives 59 ... NaN NaN
11 Malta 55 ... NaN NaN
12 Monaco 53 ... NaN NaN
13 Hungary 45 ... NaN NaN
14 Serbia 44 ... NaN NaN
15 Show all Show all ... Show all Show all
Below is a screenshot of the partial table up to "Show all" on the web page (left) and the corresponding inspected elements (right):
You can't print the whole table directly because the full data only appears after clicking the Show all button. So, from this scenario, we can understand that we first have to trigger a click event on the Show all button; then we can fetch the whole table.
I have used the Selenium library for the click event on the Show all button. For this particular scenario, I have used Selenium's Firefox() webdriver to fetch all the data from the URL. Kindly refer to the code given below for fetching the whole table of the given COVID dataset URL:
# Import all the important libraries
from selenium import webdriver        # helps fetch the page and fire the on-click event
from pandas.io.html import read_html  # reads the 'html' source so we can scrape data from it
import pandas as pd                   # converts our data into a 'DataFrame'

# Create 'Firefox' webdriver object
driver = webdriver.Firefox()

# Get the website
driver.get("https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html")

# Find the 'Show all' button using 'XPath'
show_all_button = driver.find_element_by_xpath("/html/body/div[1]/main/article/section/div/div/div[4]/div[1]/div/table/tbody/tr[16]")

# Click the 'Show all' button
show_all_button.click()

# Get the 'HTML' content of the page
html_data = driver.page_source
After fetching the whole page, let's see how many tables there are at our COVID dataset URL:
covid_data_tables = read_html(html_data, attrs={"class": "g-summary-table svelte-2wimac"}, header=None)

# Print number of tables extracted
print("\nExtracted {num} COVID Data Table".format(num=len(covid_data_tables)), "\n")

# Output of the above cell:-
Extracted 1 COVID Data Table
Now, let's fetch the data table:
# Print Table Data
covid_data_tables[0].head(20)
# Output of above cell:-
Unnamed: 0_level_0 Doses administered Pct. of population
Unnamed: 0_level_1 Per 100 people Total Vaccinated Fully vaccinated
0 World 11 877933955 – –
1 Israel 116 10307583 60% 56%
2 Seychelles 116 112194 68% 47%
3 U.A.E. 99 9489684 – –
4 Chile 69 12934282 41% 28%
5 Bahrain 66 1042463 37% 29%
6 Bhutan 63 478219 63% –
7 U.K. 62 41505768 49% 13%
8 United States 61 202282923 38% 24%
9 San Marino 60 20424 35% 25%
10 Maldives 59 303752 53% 5.6%
11 Malta 55 264658 38% 17%
12 Monaco 53 20510 30% 23%
13 Hungary 45 4416581 32% 14%
14 Serbia 44 3041740 26% 17%
15 Qatar 43 1209648 – –
16 Uruguay 38 1310591 30% 8.3%
17 Singapore 30 1667522 20% 9.5%
18 Antigua and Barbuda 28 27032 28% –
19 Iceland 28 98672 20% 8.1%
As you can see, the "Show all" row no longer appears in our dataset. So now we can convert this data table to a DataFrame. For this task, we store the data in CSV format and then read it back into a DataFrame. The code for the same is given below:
# HTML table to CSV format conversion for the COVID dataset
covid_data_file = 'covid_data.csv'
covid_data_tables[0].to_csv(covid_data_file, sep=',')

# Read the CSV data back for further analysis
covid_data = pd.read_csv("covid_data.csv")
So, after storing all the data in CSV format, let's convert the data into a DataFrame and print the whole dataset:
# Store the 'CSV' data in 'DataFrame' format
vaccineDF = pd.DataFrame(covid_data)
vaccineDF = vaccineDF.drop(columns=["Unnamed: 0"])  # drop unnecessary columns from the dataset

# Print the whole dataset
vaccineDF
# Output of above cell:-
Unnamed: 0_level_0 Doses administered Doses administered.1 Pct. of population Pct. of population.1
0 Unnamed: 0_level_1 Per 100 people Total Vaccinated Fully vaccinated
1 World 11 877933955 – –
2 Israel 116 10307583 60% 56%
3 Seychelles 116 112194 68% 47%
4 U.A.E. 99 9489684 – –
... ... ... ... ... ...
154 Syria <0.1 2500 <0.1% –
155 Papua New Guinea <0.1 1081 <0.1% –
156 South Sudan <0.1 947 <0.1% –
157 Cameroon <0.1 400 <0.1% –
158 Zambia <0.1 106 <0.1% –
159 rows × 5 columns
From the above output, we can see that we have successfully fetched the whole data table. Hope this solution helps you.
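As a side note, read_html already returns a list of DataFrames, so the CSV round-trip above is optional; a minimal sketch reusing covid_data_tables:
# read_html already returns DataFrames, so the table can be used directly
vaccineDF = covid_data_tables[0].copy()
print(vaccineDF)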
OWID provides this data, which effectively comes from JHU. If you want the latest vaccination data by country, it's simple to use the CSV interface:
import requests, io
import pandas as pd

dfraw = pd.read_csv(io.StringIO(requests.get("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv").text))
dfraw["date"] = pd.to_datetime(dfraw["date"])
dfraw.sort_values(["iso_code", "date"]).groupby("iso_code", as_index=False).last()
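For a single country, a simple filter on the same frame works; a sketch assuming OWID's location, people_vaccinated, and people_fully_vaccinated columns:
# Filter the OWID frame to one country and show the latest rows
israel = dfraw[dfraw["location"] == "Israel"].sort_values("date")
print(israel[["date", "people_vaccinated", "people_fully_vaccinated"]].tail())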

Obtaining \r\n\r\n while scraping from web in Python

I am working on scraping text using Python from this link: tournament link
Here is my code to get the tabular data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://www.hubertiming.com/results/2017GPTR10K"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
rows = soup.find_all('tr') ## find the table rows
Now, the goal is to obtain the data as a dataframe.
listnew = []
for row in rows:
    row_td = row.find_all('td')
    str_cells = str(row_td)
    cleantext = BeautifulSoup(str_cells, "lxml").get_text()  # obtain the text part
    listnew.append(cleantext)  # append to list
df = pd.DataFrame(listnew)
df.head(10)
Then we get the following output:
0 []
1 [Finishers:, 577]
2 [Male:, 414]
3 [Female:, 163]
4 []
5 [1, 814, \r\n\r\n JARED WIL...
6 [2, 573, \r\n\r\n NATHAN A ...
7 [3, 687, \r\n\r\n FRANCISCO...
8 [4, 623, \r\n\r\n PAUL MORR...
9 [5, 569, \r\n\r\n DEREK G O..
I don't know why there are newline and carriage-return characters (\r\n\r\n). How can I remove them and get a dataframe in the proper format? Thanks in advance.
Pandas can parse HTML tables; give this a try:
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup
url = "http://www.hubertiming.com/results/2017GPTR10K"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
table_1_html = soup.find('table', attrs={'id': 'individualResults'})
t_1 = pd.read_html(table_1_html.prettify())[0]
print(t_1)
Output:
Place Bib Name ... Chip Pace Gun Time Team
0 1 814 JARED WILSON ... 5:51 36:24 NaN
1 2 573 NATHAN A SUSTERSIC ... 5:55 36:45 INTEL TEAM F
2 3 687 FRANCISCO MAYA ... 6:05 37:48 NaN
3 4 623 PAUL MORROW ... 6:13 38:37 NaN
4 5 569 DEREK G OSBORNE ... 6:20 39:24 INTEL TEAM F
.. ... ... ... ... ... ... ...
572 573 273 RACHEL L VANEY ... 15:51 1:38:34 NaN
573 574 467 ROHIT B DSOUZA ... 15:53 1:40:32 INTEL TEAM I
574 575 471 CENITA D'SOUZA ... 15:53 1:40:34 NaN
575 576 338 PRANAVI APPANA ... 16:15 1:42:01 NaN
576 577 443 LIBBY B MITCHELL ... 16:20 1:42:10 NaN
[577 rows x 10 columns]
It seems some cells in the HTML code have a lot of leading and trailing spaces and newlines:
<td>
JARED WILSON
</td>
Use str.strip to remove all leading and trailing whitespace, like this:
BeautifulSoup(str_cells, "lxml").get_text().strip().
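Applied to the loop from the question, that might look like this (a sketch keeping the same variable names):
listnew = []
for row in rows:
    # Strip leading/trailing whitespace (including \r\n) from each cell's text
    cells = [td.get_text().strip() for td in row.find_all('td')]
    listnew.append(cells)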
Well, looking at the URL you provided, you can see the newlines in the HTML:
...
<td>814</td>
<td>
JARED WILSON
</td>
...
so that's what you get when you scrape. These can easily be removed with the very convenient .strip() string method.
Your DataFrame is also not formatted correctly because you are giving it a list of lists which are not all of the same size (see the first 4 lines); those come from another table located at the top right. One easy fix is to remove the first 4 lines, though it would be far more robust to select the table you want based on its id ("individualResults"):
df = pd.DataFrame(listnew[4:])
df.head(10)
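A sketch of that id-based selection, assuming the results table's first row holds the th header cells:
import pandas as pd

table = soup.find('table', attrs={'id': 'individualResults'})
rows = table.find_all('tr')
# Build stripped header and data cells, then the DataFrame
header = [th.get_text().strip() for th in rows[0].find_all('th')]
data = [[td.get_text().strip() for td in row.find_all('td')] for row in rows[1:]]
df = pd.DataFrame(data, columns=header)
print(df.head(10))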
Have a look here: BeautifulSoup table to dataframe
