I'm trying to scrape the positions, the artists and the songs from a ranking list on kworb. For example: https://kworb.net/spotify/country/us_weekly.html
I used the following script:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://kworb.net/spotify/country/us_weekly.html")
content = response.content
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.get_text())
And here is the output:
ITUNES
WORLDWIDE
ARTISTS
CHARTS
DON'T PRAY
RADIO
SPOTIFY
YOUTUBE
TRENDING
HOME
CountriesArtistsListenersCities
Spotify Weekly Chart - United States - 2023/02/09 | Totals
PosP+Artist and TitleWksPk(x?)StreamsStreams+Total
1
+1
SZA - Kill Bill
9
1(x5)
15,560,813
+247,052
148,792,089
2
-1
Miley Cyrus - Flowers
4
1(x3)
13,934,413
-4,506,662
75,009,251
3
+20
Morgan Wallen - Last Night
2
3(x1)
11,560,741
+6,984,649
16,136,833
...
How do I get only the positions, the artists and the songs separately and store them in an Excel file first?
expected output:
Pos Artist Songs
1 SZA Kill Bill
2 Miley Cyrus Flowers
3 Morgan Wallen Last Night
...
Best practice for scraping tables is pandas.read_html(); it uses BeautifulSoup under the hood for you.
import pandas as pd
#find table by id and select first index from list of dfs
df = pd.read_html('https://kworb.net/spotify/country/us_weekly.html', attrs={'id':'spotifyweekly'})[0]
#split the column by delimiter and create your expected columns
df[['Artist','Song']] = df['Artist and Title'].str.split(' - ', n=1, expand=True)
#pick your columns and export to excel
df[['Pos','Artist','Song']].to_excel('yourfile.xlsx', index=False)
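A quick follow-up note in case the export fails: pd.read_html needs an HTML parser (lxml, html5lib or bs4) and DataFrame.to_excel needs an Excel writer such as openpyxl. Once those are installed, a minimal check that reads the exported file back (assuming the file name used above):

import pandas as pd

# assumes yourfile.xlsx was written by the snippet above (needs openpyxl installed)
check = pd.read_excel('yourfile.xlsx')
print(check.head())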
Alternative based on a direct approach:
import requests
from bs4 import BeautifulSoup
import pandas as pd
soup = BeautifulSoup(requests.get("https://kworb.net/spotify/country/us_weekly.html").content, 'html.parser')
data = []
for e in soup.select('#spotifyweekly tr:has(td)'):
    data.append({
        'Pos': e.td.text,
        'Artist': e.a.text,
        'Song': e.a.find_next_sibling('a').text
    })
pd.DataFrame(data).to_excel('yourfile.xlsx', index=False)
Outputs
Pos  Artist         Song
1    SZA            Kill Bill
2    Miley Cyrus    Flowers
3    Morgan Wallen  Last Night
4    Metro Boomin   Creepin'
5    Lil Uzi Vert   Just Wanna Rock
6    Drake          Rich Flex
7    Metro Boomin   Superhero (Heroes & Villains) [with Future & Chris Brown]
8    Sam Smith      Unholy
...
I am attempting to scrape some data from this website.
I would greatly appreciate any assistance with this.
There are 30 entries per page and I'm currently trying to scrape information from within each of the links on each page. Here is my code:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
driver = webdriver.Chrome()
print(driver.title)
driver.get("https://www.businesslist.com.ng/category/farming")
time.sleep(1)
select = driver.find_element_by_id("listings")
page_entries = [i.find_element_by_tag_name("a").get_attribute("href")
                for i in select.find_elements_by_tag_name("h4")]
columns = {"ESTABLISHMENT YEAR":[], "EMPLOYEES":[], "COMPANY MANAGER":[],
           "VAT REGISTRATION":[], "REGISTRATION CODE":[]}
for i in page_entries:
    print(i)
    driver.get(i)
    listify_subentries = [i.text.strip().replace("\n","") for i in
                          driver.find_elements_by_class_name("info")][:11]
Everything runs fine up to here. The problem is likely in the section below.
    for i in listify_subentries:
        for q in columns.keys():
            if q in i:
                item = i.replace(q, "")
                print(item)
                columns[q].append(item)
            else:
                columns[q].append("None given")
                print("None given")
Here's a picture of the layout for one entry. Sorry I can't yet embed images.
I'm trying to scrape some of the information under the "Working Hours" box (i.e. establishment year, company manager etc) from every business's page. You can find the exact information under the columns variable.
Because not all pages have the same amount of information under the "Working Hours" box (here is one with more details underneath it), I tried using dictionaries + text manipulation to look up the available sub-entries and obtain the relevant information to their right. That is to say, obtain the name of the company manager, the year of establishment and so on; and if a page did not have this, then it would simply be tagged as "None given" under the relevant sub-entry.
The idea is to collate all this information and export it to a dataframe later on. Inputting "None given" when a page is lacking a particular sub-entry allows me to preserve the integrity of the data structure so that the entries are sure to align.
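A minimal sketch of that per-page lookup (one append per key per page), assuming the columns dict and listify_subentries list defined above; this is not the code in the question, just the intended logic:

# look each wanted key up once across all of the page's sub-entries,
# so every key gets exactly one value appended per page
for q in columns.keys():
    match = next((s for s in listify_subentries if q in s), None)
    columns[q].append(match.replace(q, "").strip() if match else "None given")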
However, when I run the code the output I receive is completely off.
Here is the outer view of the columns dictionary once the code has run.
And if I click on the 'COMPANY MANAGER' section, you can see that there are multiple instances of it saying "None given" before it gives the name of the company manager on the page. This is repeated for every other sub-entry, as you'll see if you run the code and scroll down. I'm not sure what went wrong, but it seems that the size of each list has been inflated by roughly a factor of 10, with extra "None given"s littered here and there. The size of each list should be 30, but now it's 330.
I would greatly appreciate any assistance with this. Thank you.
You can use the following example to iterate over all enterprises on that page and save the various info into a dataframe:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.businesslist.com.ng/category/farming"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
all_data = []
for a in soup.select("h4 > a"):
    u = "https://www.businesslist.com.ng" + a["href"]
    print(u)
    data = {"URL": u}
    s = BeautifulSoup(requests.get(u).content, "html.parser")
    for info in s.select("div.info:has(.label)"):
        label = info.select_one(".label")
        label.extract()
        value = info.get_text(strip=True, separator=" ")
        data[label.get_text(strip=True)] = value
    all_data.append(data)
df = pd.DataFrame(all_data)
print(df)
df.to_csv("data.csv", index=None)
Prints:
URL Company name Address Phone Number Mobile phone Working hours Establishment year Employees Company manager Share this listing Location map Description Products & Services Listed in categories Keywords Website Registration code VAT registration E-mail Fax
0 https://www.businesslist.com.ng/company/198846/macmed-integrated-farms Macmed Integrated Farms 1, Gani Street, Ijegun Imore. Satellite Town. Lagos , Lagos , Lagos , Nigeria View Map 08033316905 09092245349 Monday: 8am-5pm Tuesday: 8am-5pm Wednesday: 8am-5pm Thursday: 8am-5pm Friday: 8am-5pm Saturday: 8am-5pm Sunday: 10am-4pm 2013 1-5 Engr. Marcus Awoh Show Map Expand Map 1, Gani Street, Ijegun Imore. Satellite Town. Lagos , Lagos , Lagos , Nigeria Get Directions Macmed Integrated Farms is into Poultry, Fish Farming (eggs, meat,day old chicks,fingerlings and grow-out) and animal Husbandry and sales of Farmlands land and facilities We also provide feasibility studies and business planning for all kind of businesses. Day old chicks WE are receiving large quantity of Day old Pullets, Broilers and cockerel in December 2016.\nInterested buyers are invited. PRICE: N 100 - N 350 Investors/ Partners We Macmed Integrated Farms a subsidiary of Macmed Cafe Limited RC (621444) are into poultry farming situated at Iponsinyi, behind (Nigerian National Petroleum Marketing Company)NNPMC at Mosimi, along... Commercial Hatchery Macmed Integrated Farms is setting up a Hatchery for chicken and other birds. We have 2 nos of fully automatic incubator imported from China with combined capacity of 1,500 eggs per setting.\nPlease book in advance.\nMarcus Awoh.\nfarm Operations Manager. PRICE: N100 - N250 Business Services Business Services / Consultants Business Services / Small Business Business Services / Small Business / Business Plans Business Services / Animal Shelters Manufacturing & Industry Manufacturing & Industry / Farming Manufacturing & Industry / Farming / Poultry Housing Suppliers Catfish Day old chicks Farming FINGERLINGS Fishery grow out and aquaculture Meat Poultry eggs spent Pol MORE +4 NaN NaN NaN NaN NaN
...
And saves data.csv (screenshot from LibreOffice):
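If you only need the fields from the original columns dict, you can pick them out of the scraped dataframe afterwards and fill in the gaps; a short follow-up, assuming the label names as they appear in the header above:

wanted = ["Establishment year", "Employees", "Company manager",
          "VAT registration", "Registration code"]
# keep only the columns of interest (reindex adds any that never appeared on a page)
subset = df.reindex(columns=["URL"] + wanted).fillna("None given")
subset.to_csv("subset.csv", index=False)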
I am a self-taught data science student, currently doing my first big Python portfolio project in several steps, the first of which is using pandas to work with IMDb's (the Internet Movie Database's) rather oddly structured .tsv files, in an effort to create a fully searchable big-data repository of all IMDb data (the officially supported searches, and even APIs like OMDB (the Open Movie Database), don't allow the kinds of detailed queries I need for the larger project).
The structure of IMDb's public files is that they include all data on movies, TV shows, episodes, actors, directors, crew, the whole business, scattered rather haphazardly across seven massive tsv files. I've confirmed that pandas can, in fact, read in all of this data and that my computer's memory can handle it, but what I want to do is merge the seven tsv files into a single DataFrame object which can then be exported to (preferably) a SQL database, or even one huge spreadsheet/TSV file.
Each thing in the database (movie, actor, individual TV episode) has a tconst identifier, which in one file is labelled "titleId", a string. In every other file it is labelled "tconst", also a string. I'm going to need to rename titleId to tconst when I read that file in; this is one of several challenges I haven't got to yet.
import pandas as pd

#set pandas formatting parameters
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 25)
#read in the data tables provided
showbiz_core = pd.read_table("name.basics.tsv", sep='\t')
#temporary hack - print the entire dataframe as test
print(showbiz_core)
This works, but I'm not sure exactly how to proceed next. I want to import each of the other tsv files to attempt to locally reconstruct the imdb database. This means that I don't want to have duplicate tconst strings, but rather to end up with new information about a tconst entry (like a film) appended to it as new columns.
Should I be looking to do a "for i in [new file]" type loop somehow? How would you go about this?
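A minimal sketch of the rename-then-merge step the question describes, assuming title.akas.tsv is the file that uses titleId and title.basics.tsv is one of the files that uses tconst (file names as published by IMDb); a left merge on many-row files like akas will produce multiple rows per title, so treat this as an illustration rather than the final design:

import pandas as pd

# IMDb marks missing values as \N
basics = pd.read_table("title.basics.tsv", sep="\t", na_values="\\N")
akas = pd.read_table("title.akas.tsv", sep="\t", na_values="\\N")

# bring the odd file in line with the others, then join on the shared key
akas = akas.rename(columns={"titleId": "tconst"})
combined = basics.merge(akas, on="tconst", how="left")
print(combined.head())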
The IMDb files are actually highly structured. Looping is generally the wrong structure for merging data; use joins/merges instead.
Structured data sourcing - I used wget rather than downloading the files manually.
The files are large, so work with a subset for modelling purposes. I used popular movies and actors as the driver.
The comma-separated columns in the tsv files are actually sub-tables; treat them as such. I build a reference entity nmi to do this.
There are other associative relationships there as well (primaryProfession, genres).
Finally, join (merge) everything together from OMDB and IMDb, taking the first row where many items associate to a title.
I have left the data as tsv for now; clearly it would be very simple to put it into a database using the to_sql() method (see the sketch below). The main point is sourcing and transformation, a.k.a. ETL, which has become an unfashionable term. This can be further supplemented with web scraping. I looked at Box Office Mojo, however this would require Selenium to scrape as it's dynamic HTML.
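A minimal sketch of the to_sql() step mentioned above, assuming the dfs dict built by the transform code below and a local SQLite file (the database file name is illustrative):

import sqlite3

con = sqlite3.connect("imdb_subset.db")
# write each prepared dataframe to its own table, replacing any previous run
for name, frame in dfs.items():
    frame.to_sql(name.replace(".", "_"), con, if_exists="replace", index=False)
con.close()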
IMDb sourcing
import requests, json, re, urllib.parse
from bs4 import BeautifulSoup
import pandas as pd
import wget,gzip
from pathlib import Path
import numpy as np
# find what IMdB has to give ...
resp = requests.get("https://datasets.imdbws.com")
soup = BeautifulSoup(resp.content.decode(), "html.parser")
files = {}
for f in soup.find_all("a", href=True):
    if f["href"].endswith('gz'):
        u = urllib.parse.urlparse(f["href"])
        fn = Path().cwd().joinpath(u.path.strip("/"))
        files[Path(fn.stem).stem] = fn.name
        if not fn.is_file():
            wget.download(f["href"])
IMDb transform
Set alldata=True on the first run to prep the data. On the second run set it to False and you have a manageable subset.
alldata = False
subsetdata = True
dfs={}
# work with a subset of data to speed up modelling and iterations. Take a few major actors and titles
# as criteria to build a manageable representative set of data
l = ["Tom Hanks","Will Smith","Clint Eastwood","Leonardo DiCaprio","Johnny Depp","Meryl Streep","Bruce Willis"]
tm = {'tconst': ['tt0111161','tt0468569','tt1375666','tt0137523','tt0110912','tt0109830','tt0944947','tt0133093','tt0120737','tt0167260','tt0068646'],
'averageRating': [9.3, 9.0, 8.8, 8.8, 8.9, 8.8, 9.3, 8.7, 8.8, 8.9, 9.2],
'numVotes': [2275837,2237966,1997918,1805137,1777920,1752954,1699318,1630083,1618100,1602417,1570167]}
# work with subset for modelling purpose
k = "name.basics"
if alldata:
    dfs[k] = pd.read_csv(gzip.open(files[k]), sep="\t").replace({"\\N": np.nan})
    if subsetdata:
        # manage down size of nmi
        dfs[k] = dfs[k].loc[(dfs[k]["primaryName"].isin(l)
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][0])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][1])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][2])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][3])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][4])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][5])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][6])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][7])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][8])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][9])
                             )
                            & dfs[k]["knownForTitles"].str.contains("tt")]
        dfs[k].to_csv(f"{files[k]}_subset.tsv", sep="\t", index=False)
else:
    dfs[k] = pd.read_csv(f"{files[k]}_subset.tsv", sep="\t")
dfs[k] = dfs[k].astype({c: "Int64" for c in dfs[k].columns}, errors="ignore")
# birth year is a lot but getting data issues...
# dfs[k] = dfs[k].dropna(subset=["primaryProfession","birthYear"])
# comma separated - not good for joins and merges. rename for consistency
dfs["nmi"] = (dfs["name.basics"].loc[:, ["nconst", "knownForTitles"]]
              .assign(knownForTitles=lambda x: x["knownForTitles"].str.split(","))
              .explode("knownForTitles")
              ).rename(columns={"knownForTitles": "tconst"}).drop_duplicates()
# already extracted known titles so can drop and de-dup - e.g. Tom Hanks
dfs[k] = dfs[k].drop(columns=["knownForTitles"]).drop_duplicates()
for k in [k for k in files.keys() if k not in ["name.basics", "omdb.titles"]]:
    if alldata:
        dfs[k] = pd.read_csv(gzip.open(files[k]), sep="\t").replace({"\\N": np.nan})
        if k == "title.akas": dfs[k] = dfs[k].rename(columns={"titleId": "tconst"})
        # subset titles to those we have names
        if subsetdata:
            c = "tconst" if k != "title.episode" else "parentTconst"
            try:
                (dfs[k].loc[dfs[k][c].isin(dfs["nmi"]["tconst"])]
                 .to_csv(f"{files[k]}_subset.tsv", sep="\t", index=False))
            except KeyError as e:
                print(k, dfs[k].columns, e)
    else:
        dfs[k] = pd.read_csv(f"{files[k]}_subset.tsv", sep="\t")
    dfs[k] = dfs[k].astype({c: "Int64" for c in dfs[k].columns}, errors="ignore")
dfs["name.and.titles"] = dfs["nmi"].merge(dfs["name.basics"], on="nconst").merge(dfs["title.basics"], on="tconst")
OMDB sourcing
omdbcols = ['Title', 'Year', 'Rated', 'Released', 'Runtime', 'Genre', 'Director', 'Writer', 'Actors', 'Plot', 'Language', 'Country', 'Awards', 'Poster', 'Ratings', 'Metascore', 'imdbRating', 'imdbVotes', 'imdbID', 'Type', 'DVD', 'BoxOffice', 'Production', 'Website', 'Response']
omdbk = "omdb.titles"
files[omdbk] = f"{omdbk}.tsz"
if not Path().cwd().joinpath(files[omdbk]).is_file():
    dfs[omdbk] = pd.DataFrame(columns=omdbcols)
else:
    dfs[omdbk] = pd.read_csv(files[omdbk], sep="\t", thousands=",")
dfs[omdbk] = dfs[omdbk].astype({c: "Int64" for c in dfs[omdbk].columns}, errors="ignore")
k = "title.basics"
# limited to 1000 API calls a day, so only fetch if have not done already
# apikey (your OMDB API key) must be defined before this loop runs
for tconst in dfs[k].loc[~(dfs[k]["tconst"].isin(dfs[omdbk]["imdbID"]))]["tconst"].values:
    # tt0109830 movie Forrest Gump
    # http://www.omdbapi.com/?i=tt3896198&apikey=xxx
    params = {"apikey": apikey, "i": tconst, "plot": "full"}
    res = requests.get("http://www.omdbapi.com/", params=params)
    if res.status_code != 200:
        print("breached API limit")
        break
    else:
        dfs[omdbk] = pd.concat([dfs[omdbk], pd.json_normalize(res.json())])
        dfs[omdbk].to_csv(files[omdbk], index=False, sep="\t")
sample analysis
# The Dark Knight tt0468569
# Game of Throne tt0944947
# for demo purpose - just pick first association when there are many
mask = dfs[omdbk]["imdbID"].isin(["tt0468569","tt0944947"])
demo = (dfs[omdbk].loc[mask]
        .rename(columns={c: f"OMDB{c}" for c in dfs[omdbk].columns})
        .rename(columns={"OMDBimdbID": "tconst"})
        .merge(dfs["title.basics"], on="tconst")
        .merge(dfs["title.ratings"], on="tconst")
        .merge(dfs["title.akas"].groupby("tconst", as_index=False).first(), on="tconst")
        .merge(dfs["title.crew"].groupby("tconst", as_index=False).first(), on="tconst")
        .merge(dfs["title.principals"].groupby("tconst", as_index=False).first(), on="tconst")
        .merge(dfs["title.episode"].groupby("parentTconst", as_index=False).first(),
               left_on="tconst", right_on="parentTconst", how="left", suffixes=("", "_ep"))
        .merge(dfs["nmi"]
               .merge(dfs["name.basics"], on="nconst")
               .groupby(["tconst"], as_index=False).first(), on="tconst", suffixes=("", "_name"))
        ).T
output
0 1
OMDBTitle The Dark Knight Game of Thrones
OMDBYear 2008 2011–2019
OMDBRated PG-13 TV-MA
OMDBReleased 18 Jul 2008 17 Apr 2011
OMDBRuntime 152 min 57 min
OMDBGenre Action, Crime, Drama, Thriller Action, Adventure, Drama, Fantasy, Romance
OMDBDirector Christopher Nolan NaN
OMDBWriter Jonathan Nolan (screenplay), Christopher Nolan (screenplay), Christopher Nolan (story), David S. Goyer (story), Bob Kane (characters) David Benioff, D.B. Weiss
OMDBActors Christian Bale, Heath Ledger, Aaron Eckhart, Michael Caine Peter Dinklage, Lena Headey, Emilia Clarke, Kit Harington
OMDBLanguage English, Mandarin English
OMDBCountry USA, UK USA, UK
OMDBAwards Won 2 Oscars. Another 153 wins & 159 nominations. Won 1 Golden Globe. Another 374 wins & 602 nominations.
OMDBPoster https://m.media-amazon.com/images/M/MV5BMTMxNTMwODM0NF5BMl5BanBnXkFtZTcwODAyMTk2Mw##._V1_SX300.jpg https://m.media-amazon.com/images/M/MV5BYTRiNDQwYzAtMzVlZS00NTI5LWJjYjUtMzkwNTUzMWMxZTllXkEyXkFqcGdeQXVyNDIzMzcwNjc#._V1_SX300.jpg
OMDBRatings [{'Source': 'Internet Movie Database', 'Value': '9.0/10'}, {'Source': 'Rotten Tomatoes', 'Value': '94%'}, {'Source': 'Metacritic', 'Value': '84/100'}] [{'Source': 'Internet Movie Database', 'Value': '9.3/10'}]
OMDBMetascore 84 <NA>
OMDBimdbRating 9 9.3
OMDBimdbVotes 2234169 1679892
tconst tt0468569 tt0944947
OMDBType movie series
OMDBDVD 09 Dec 2008 NaN
OMDBBoxOffice $533,316,061 NaN
OMDBProduction Warner Bros. Pictures/Legendary NaN
OMDBWebsite <NA> <NA>
OMDBResponse 1 1
OMDBtotalSeasons <NA> 8
titleType movie tvSeries
primaryTitle The Dark Knight Game of Thrones
originalTitle The Dark Knight Game of Thrones
isAdult 0 0
startYear 2008 2011
endYear <NA> 2019
runtimeMinutes 152 57
genres Action,Crime,Drama Action,Adventure,Drama
averageRating 9 9.3
numVotes 2237966 1699318
ordering_x 10 10
title The Dark Knight Taht Oyunları
region GB TR
language en tr
types imdbDisplay imdbDisplay
attributes fake working title literal title
isOriginalTitle 0 0
directors nm0634240 nm0851930,nm0551076,nm0533713,nm0336241,nm1888967,nm1047532,nm0764601,nm0007008,nm0617042,nm0787687,nm0687964,nm0070474,nm1125275,nm0638354,nm0002399,nm0806252,nm0755261,nm0887700,nm0590889
writers nm0634300,nm0634240,nm0333060,nm0004170 nm1125275,nm0552333,nm1888967,nm4984276,nm2643685,nm7260047,nm2977599,nm0961827,nm0260870
ordering_y 10 10
nconst nm0746273 nm0322513
category producer actor
job producer creator
characters ["Bruce Wayne"] ["Jorah Mormont"]
parentTconst NaN tt0944947
tconst_ep NaN tt1480055
seasonNumber <NA> 1
episodeNumber <NA> 1
nconst_name nm0000198 nm0000293
primaryName Gary Oldman Sean Bean
birthYear 1958 1959
deathYear 1998 2020
primaryProfession actor,soundtrack,producer actor,producer,animation_department
I am scraping data from espn.com for the upcoming NFL schedule. However, I am only able to get the first table and not the rest of the tables. I believe it is because of the structure of the HTML, where each date gets its own table. I can get Thursday's game data, but not the rest.
Thursday, September 5
MATCHUP TIME (ET) NAT TV TICKETS LOCATION
Green Bay
Chicago
8:20 PM NBC Tickets as low as $290 Soldier Field, Chicago
Sunday, September 8
MATCHUP TIME (ET) NAT TV TICKETS LOCATION
Tennessee
Cleveland
1:00 PM CBS Tickets as low as $121 FirstEnergy Stadium, Cleveland
Cincinnati
Seattle
4:05 PM CBS Tickets as low as $147 CenturyLink Field, Seattle
New York
Dallas
4:25 PM FOX Tickets as low as $50 AT&T Stadium, Arlington
Foxboro
Monday, September 9
MATCHUP TIME (ET) NAT TV TICKETS LOCATION
Houston
New Orleans
7:10 PM ESPN Tickets as low as $112 Mercedes-Benz Superdome, New Orleans
Denver
Oakland
10:20 PM ESPN Tickets as low as $72 Oakland Coliseum, Oakland
I have used BeautifulSoup and was easily able to get the data, but parsing it has been a challenge.
I have tried just continuing to use a for loop, but I get a StopIteration traceback. After reading a previous article about the traceback, I realize that I need to try a different solution to the problem.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse
import pandas as pd
main_url = 'http://www.espn.com/nfl/schedule'
response = requests.get(main_url)
soup = BeautifulSoup(response.text, 'lxml')
table = soup.find('table')
rows = table.find_all('tr')
rows = iter(rows)
df = [td.text for td in next(rows).find_all('td') if td.text]
df2 = [td.text for td in next(rows).find_all('td') if td.text]
I believe that the problem lies in this line :
table = soup.find('table')
The thing is, the above-mentioned page consists of 3 table elements that have the class="schedule" attribute. However, in your code you used the find() function only, instead of find_all(). That's the major reason you ended up with only the contents of the first table. So I believe that if you just handle that part correctly you'll be good to go. Now, I'm not much familiar with the comprehension notation used to fill up the lists, hence the code uses the good old for-loop style.
#List to store the rows
df = []
#Collect all the tables
tables = soup.find_all('table', class_ = "schedule")
for table in tables:
    #Collect the rows from this table (not from the whole soup)
    rows = table.find_all('tr')
    row_item = []
    for row in rows:
        #Collect all 'td' elements from the 'row' & append them to the list 'row_item'
        data_items = row.find_all('td')
        for data_item in data_items:
            row_item.append(data_item.text)
        #Append the list to 'df'
        df.append(row_item)
        row_item = []
print(df)
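A possible follow-up if you want these rows in pandas, assuming the df list built above (header rows contain no <td>, so they come through as empty lists and can be dropped):

import pandas as pd

# drop the empty header rows and load the remaining rows into a DataFrame;
# rows have varying lengths, so pandas pads the shorter ones with NaN
rows = [r for r in df if r]
schedule = pd.DataFrame(rows)
print(schedule.head())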
If you're trying to pull <table> tags, you can use Pandas .read_html() to do that. It'll return a list of dataframes. In this case, you can append them all together into 1 table:
import pandas as pd
url = 'http://www.espn.com/nfl/schedule'
tables = pd.read_html(url)
df = pd.DataFrame()
for table in tables:
    df = df.append(table)
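Note that DataFrame.append was removed in pandas 2.0; on newer versions the same result comes from pd.concat over the list of tables:

import pandas as pd

url = 'http://www.espn.com/nfl/schedule'
tables = pd.read_html(url)                   # list of DataFrames, one per <table>
df = pd.concat(tables, ignore_index=True)    # replaces the deprecated DataFrame.append loop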
Simple enough question, but I'm guessing the answer is "No":
I have an HTML table that I'm reading in with pandas.read_html just fine. But some cells (columns) have, say, images in them, or lists, or other formatting that read_html obviously throws away. I obviously don't expect pandas to parse any of that, but is there any way to get it to return the raw HTML as, say, a string in the cell of the DataFrame so I can parse it on my own?
EXAMPLE:
<table>
  <tr>
    <th>Column 1</th>
    <th>Column 2</th>
  </tr>
  <tr>
    <td>Cell1</td>
    <td>Cell2 <img src="http://www.link.com/image.jpg" /></td>
  </tr>
  <tr>
    <td>Cell3</td>
    <td>Cell4 <img src="http://www.website.com/picture.gif" /></td>
  </tr>
</table>
If pandas were to parse this, I'd probably just get "Cell2" and "Cell4" from Column 2. What I'd like to do is somehow get the entire contents of the cell, including the raw HTML of the <img> tag. I can then parse it on my own.
There are no options for the pd.read_html function that do what you want. For example, when I tried to get a list of business names from the California Secretary of State website, I got everything except the name when pandas automatically parsed the HTML:
>>> bizname = 'poss'
>>> url = f'https://businesssearch.sos.ca.gov/CBS/SearchResults?filing=&SearchType=CORP&SearchCriteria={bizname}&SearchSubType=Begins'
>>> df = pd.read_html(url)[0]
>>> df
Entity Number Registration Date Status Entity Name Jurisdiction Agent for Service of Process
0 C2645412 04/02/2004 ACTIVE View details for entity number 02645412 POSSU... GEORGIA ERESIDENTAGENT, INC. (C2702827)
1 C0786330 09/22/1976 DISSOLVED View details for entity number 00786330 POSSU... CALIFORNIA I. HALPERN
2 C2334141 03/01/2001 FTB SUSPENDED View details for entity number 02334141 POSSU... CALIFORNIA CLAIR G BURRILL
3 C0658630 11/08/1972 FTB SUSPENDED View details for entity number 00658630 POSSU... CALIFORNIA NaN
4 C1713121 09/23/1992 FTB SUSPENDED View details for entity number 01713121 POSSU... CALIFORNIA LAWRENCE J. TURNER
5 C1207820 08/05/1983 DISSOLVED View details for entity number 01207820 POSSU... CALIFORNIA R L CARL
6 C3921531 06/27/2016 ACTIVE View details for entity number 03921531 POSSU... CALIFORNIA REGISTERED AGENTS INC (C3365816)
The website hides business names behind a button.
But you can use requests to download the raw html.
Then you can use bs4 to extract the raw HTML table as well as any particular row (<tr>) or cell (<td>) that you want.
>>> import requests, bs4
>>> soup = bs4.BeautifulSoup(requests.get(url).text)
>>> table = soup.find('table').findAll('tr')
>>> names = []
>>> for row in table:
...     names.append(getattr(row.find('button'), 'contents', [''])[0].strip())
>>> names
['',
'POSSUM FILMS, INC',
'POSSUM INC.',
'POSSUM MEDIA, INC.',
'POSSUM POINT PRODUCTIONS, INC.',
'POSSUM PRODUCTIONS, INC.',
'POSSUM-BILITY EXPRESS, INCORPORATED',
]
>>> df['Entity Name'] = names[1:]
>>> df
Entity Number Registration Date Status Entity Name Jurisdiction Agent for Service of Process
0 C2645412 04/02/2004 ACTIVE POSSUM FILMS, INC GEORGIA ERESIDENTAGENT, INC. (C2702827)
1 C0786330 09/22/1976 DISSOLVED POSSUM INC. CALIFORNIA I. HALPERN
2 C2334141 03/01/2001 FTB SUSPENDED POSSUM MEDIA, INC. CALIFORNIA CLAIR G BURRILL
3 C0658630 11/08/1972 FTB SUSPENDED POSSUM POINT PRODUCTIONS, INC. CALIFORNIA NaN
4 C1713121 09/23/1992 FTB SUSPENDED POSSUM PRODUCTIONS, INC. CALIFORNIA LAWRENCE J. TURNER
5 C1207820 08/05/1983 DISSOLVED POSSUM-BILITY EXPRESS, INCORPORATED CALIFORNIA R L CARL
6 C3921531 06/27/2016 ACTIVE POSSUMS WELCOME CALIFORNIA REGISTERED AGENTS INC (C33658
Doing it this way doesn't process the header correctly, so don't forget to ignore the first row, if you need to.
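For the general case in the question (keeping every cell's raw HTML rather than just its text), a minimal sketch along the same lines, using the question's example table; str(td) keeps the full markup of each cell, including any <img> tags, so you can parse it later:

import bs4
import pandas as pd

html = """<table>
<tr><th>Column 1</th><th>Column 2</th></tr>
<tr><td>Cell1</td><td>Cell2 <img src="http://www.link.com/image.jpg" /></td></tr>
<tr><td>Cell3</td><td>Cell4 <img src="http://www.website.com/picture.gif" /></td></tr>
</table>"""

soup = bs4.BeautifulSoup(html, "html.parser")
rows = soup.find("table").find_all("tr")
header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
# str(td) preserves the raw HTML of each cell instead of just its text
data = [[str(td) for td in row.find_all("td")] for row in rows[1:]]
df = pd.DataFrame(data, columns=header)
print(df)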