I am scraping England's National Joint Registry data and have the results in the format I want when I do one hospital at a time. I eventually want to iterate over all hospitals, but first decided to make an array of three different hospitals and figure out the iteration.
The code below gives me the correct format of the final results in a pandas DataFrame when I have just one hospital:
import requests
from bs4 import BeautifulSoup
import pandas
import numpy as np

r = requests.get("http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Norfolk%20and%20Norwich%20Hospital")
c = r.content
soup = BeautifulSoup(c, "html.parser")
all = soup.find_all(["div"], {"class": "toggle_container"})[1]

i = 0
temp = []
for item in all.find_all("td"):
    if i % 4 == 0:
        temp.append(soup.find_all("span")[4].text)
        temp.append(soup.find_all("h5")[0].text)
    temp.append(all.find_all("td")[i].text.replace(" ", ""))
    i = i + 1

table = np.array(temp).reshape(12, 6)
final = pandas.DataFrame(table)
final
In my iterated version, I cannot figure out a way to append each result set into a final DataFrame:
hosplist = ["http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Norfolk%20and%20Norwich%20Hospital",
            "http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Barnet%20Hospital",
            "http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Altnagelvin%20Area%20Hospital"]

temp2 = []
df_final = pandas.DataFrame()

for item in hosplist:
    r = requests.get(item)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    all = soup.find_all(["div"], {"class": "toggle_container"})[1]
    i = 0
    temp = []
    for item in all.find_all("td"):
        if i % 4 == 0:
            temp.append(soup.find_all("span")[4].text)
            temp.append(soup.find_all("h5")[0].text)
        temp.append(all.find_all("td")[i].text)
        i = i + 1
    table = np.array(temp).reshape(int(len(temp) / 6), 6)
    temp2.append(table)

#df_final = pandas.DataFrame(df)
At the end, 'table' has all the data I want, but it's not easy to manipulate, so I want to put it in a DataFrame. However, I am getting a "ValueError: Must pass 2-d input" error.
I think this error is saying that I have 3 arrays, which would make the input 3-dimensional. This is just a practice iteration; there are over 400 hospitals whose data I plan to put into a DataFrame, but I am stuck here now.
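My guess is that I need to collapse temp2 (a list of 2-D arrays, one per hospital) back into a single 2-D array before handing it to pandas, perhaps something like this untested sketch, but I am not sure it is the right approach:
# untested guess: each entry of temp2 is an (n_rows, 6) array,
# so stacking them vertically should give one 2-D array for pandas
stacked = np.vstack(temp2)
df_final = pandas.DataFrame(stacked)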
The simple answer to your question would be HERE.
The tough part was taking your code and finding what wasn't right yet.
Using your full code, I modified it as shown below. Please copy it and diff with yours.
import requests
from bs4 import BeautifulSoup
import pandas
import numpy as np

hosplist = ["http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Norfolk%20and%20Norwich%20Hospital",
            "http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Barnet%20Hospital",
            "http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Altnagelvin%20Area%20Hospital"]

temp2 = []
df_final = pandas.DataFrame()

for item in hosplist:
    r = requests.get(item)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    all = soup.find_all(["div"], {"class": "toggle_container"})[1]
    i = 0
    temp = []
    for item in all.find_all("td"):
        if i % 4 == 0:
            temp.append(soup.find_all("span")[4].text)
            temp.append(soup.find_all("h5")[0].text)
        temp.append(all.find_all("td")[i].text)
        i = i + 1
    table = np.array(temp).reshape(int(len(temp) / 6), 6)
    for array in table:
        newArray = []
        for x in array:
            try:
                x = x.encode("ascii")
            except:
                x = 'cannot convert'
            newArray.append(x)
        temp2.append(newArray)

df_final = pandas.DataFrame(temp2, columns=['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
print(df_final)
I tried to use a list comprehension for the ascii conversion, which was absolutely necessary for the strings to show up in the DataFrame. The comprehension was throwing an error, so I built in an exception, and the exception never shows.
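For reference, a try/except cannot appear directly inside a list comprehension, which is probably why the comprehension threw an error; a sketch (untested against this data) that keeps the comprehension is to move the conversion into a small helper:
def to_ascii(x):
    # try/except cannot live inside a comprehension, so it goes in a helper
    try:
        return x.encode("ascii")
    except Exception:
        return 'cannot convert'

# inside the hospital loop, replacing the two nested for-loops over `table`:
temp2.extend([to_ascii(x) for x in row] for row in table)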
I reorganized the code a little and was able to create the dataframe without having to encode.
Solution:
hosplist = ["http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Norfolk%20and%20Norwich%20Hospital",
            "http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Barnet%20Hospital",
            "http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Altnagelvin%20Area%20Hospital"]

temp = []
temp2 = []
df_final = pandas.DataFrame()

for item in hosplist:
    r = requests.get(item)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    all = soup.find_all(["div"], {"class": "toggle_container"})[1]
    i = 0
    for item in all.find_all("td"):
        if i % 4 == 0:
            temp.append(soup.find_all("span")[4].text)
            temp.append(soup.find_all("h5")[0].text)
        temp.append(all.find_all("td")[i].text.replace("-", "NaN").replace("+", ""))
        i = i + 1
    temp2.append(temp)

table = np.array(temp2).reshape(int(len(temp2[0]) / 6), 6)
df_final = pandas.DataFrame(table, columns=['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
df_final
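For the full run over 400-plus hospitals, an alternative sketch (the scraping of temp per hospital is assumed to stay exactly as above) is to build one small DataFrame per hospital and concatenate them once at the end:
frames = []
for item in hosplist:
    # ... scrape `temp` for this hospital exactly as above ...
    table = np.array(temp).reshape(int(len(temp) / 6), 6)
    frames.append(pandas.DataFrame(table, columns=['h1', 'h2', 'h3', 'h4', 'h5', 'h6']))

df_final = pandas.concat(frames, ignore_index=True)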
Related
I recently pulled data from the YouTube API, and I'm trying to create a data frame using that information.
When I loop through each item with the print function, I get 25 rows of output for each variable (which is what I want in the data frame I create).
How can I create a new data frame that contains those 25 rows, instead of just one line in the data frame?
This is what I have when I loop through each item:
df = pd.DataFrame(columns=['video_title', 'video_id', 'date_created'])

# For loop to create columns for DataFrame
x = 0
while x < len(response['items']):
    video_title = response['items'][x]['snippet']['title']
    video_id = response['items'][x]['id']['videoId']
    date_created = response['items'][x]['snippet']['publishedAt']
    x = x + 1
    #print(video_title, video_id)

# note: this runs after the loop, so only the last item ends up appended
df = df.append({'video_title': video_title, 'video_id': video_id,
                'date_created': date_created}, ignore_index=True)
=========ANSWER UPDATE==========
THANK YOU TO EVERYONE THAT GAVE INPUT !!!
The code that created the Dataframe was:
import pandas as pd

x = 0
video_title = []
video_id = []
date_created = []

while x < len(response['items']):
    video_title.append(response['items'][x]['snippet']['title'])
    video_id.append(response['items'][x]['id']['videoId'])
    date_created.append(response['items'][x]['snippet']['publishedAt'])
    x = x + 1
    #print(video_title, video_id)

df = pd.DataFrame({'video_title': video_title, 'video_id': video_id,
                   'date_created': date_created})
Based on what I know about the YouTube API's return objects, the values of 'title', 'videoId' and 'publishedAt' are strings.
A strategy for making a single df from these strings is:
Store the strings in lists, so you will have three lists.
Convert the lists into a df.
You will get a df with x rows, based on the x values that are retrieved.
Example:
import pandas as pd

x = 0
video_title = []
video_id = []
date_created = []

while x < len(response['items']):
    video_title.append(response['items'][x]['snippet']['title'])
    video_id.append(response['items'][x]['id']['videoId'])
    date_created.append(response['items'][x]['snippet']['publishedAt'])
    x = x + 1
    #print(video_title, video_id)

df = pd.DataFrame({'video_title': video_title, 'video_id': video_id,
                   'date_created': date_created})
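An equivalent sketch, if you prefer building one record per video and letting pandas assemble the columns (assuming the same response structure):
import pandas as pd

records = []
for item in response['items']:
    records.append({
        'video_title': item['snippet']['title'],
        'video_id': item['id']['videoId'],
        'date_created': item['snippet']['publishedAt'],
    })

df = pd.DataFrame(records)  # one row per item in response['items']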
I am trying to iterate over a pandas DataFrame with close to a million entries, using a for loop. Consider the following code as an example:
import pandas as pd
import os
from requests_html import HTMLSession
from tqdm import tqdm
import time

df = pd.read_csv(os.getcwd() + '/test-urls.csv')
df = df.drop('Unnamed: 0', axis=1)

new_df = pd.DataFrame(columns=['pid', 'orig_url', 'hosted_url'])
refused_df = pd.DataFrame(columns=['pid', 'refused_url'])

tic = time.time()
for idx, row in df.iterrows():
    img_id = row['pid']
    url = row['image_url']

    # Let's do the scraping
    session = HTMLSession()
    r = session.get(url)
    r.html.render(sleep=1, keep_page=True, scrolldown=1)
    count = 0
    link_vals = r.html.find('.zoomable')
    if len(link_vals) != 0:
        attrs = link_vals[0].attrs
        # print(attrs['src'])
        embed_link = attrs['src']
    else:
        while count <= 7:
            link_vals = r.html.find('.zoomable')
            count += 1
        else:
            print('Link refused connection for 7 tries. Adding URL to Refused URLs Data Frame')
            ref_val = [img_id, url]
            len_ref = len(refused_df)
            refused_df.loc[len_ref] = ref_val
            print('Refused URL added')
            continue

    print('Got 1 link')
    # Append scraped data to new_df
    len_df = len(new_df)
    append_value = [img_id, url, embed_link]
    new_df.loc[len_df] = append_value
I wanted to know how I could use a progress bar to add a visual representation of how far along I am. I would appreciate any help. Please let me know if you need any clarification.
You could try out tqdm:
from tqdm import tqdm

for idx, row in tqdm(df.iterrows()):
    ...  # do something with row
This is primarily for a command-line progress bar. There are other solutions if you're looking for more of a GUI. PySimpleGUI comes to mind, but is definitely a little more complicated.
I would have left this as a comment, but the reason you want a progress bar in the first place is probably that iterrows() is a slow way to do operations in pandas.
I would suggest you use apply() and avoid iterrows().
If you want to continue using iterrows(), just include a counter that counts up to the number of rows, df.shape[0].
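For example, a sketch combining both suggestions: pass total= so tqdm can show a percentage with iterrows(), or register tqdm with pandas and use progress_apply in place of apply (the column name here is just taken from your example):
from tqdm import tqdm

# Option 1: keep iterrows(), but tell tqdm how many rows to expect
for idx, row in tqdm(df.iterrows(), total=df.shape[0]):
    pass  # replace with the per-row scraping work

# Option 2: switch to apply() with a built-in progress bar
tqdm.pandas()
results = df['image_url'].progress_apply(lambda url: url)  # replace the lambda with the real work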
PySimpleGUI makes this about as simple a problem to solve as possible, assuming you know ahead of time how many items you have in your list. Indeterminate progress meters are possible, but a little more complicated.
There is no setup required before your loop, and you don't need to make a special iterator. All you have to do is add one line of code inside your loop.
Inside your loop, add a call to one_line_progress_meter. The name sums up what it is. Add this call at the top of your loop or the bottom, it doesn't matter... just add it somewhere that's looped.
The 4 parameters you pass are:
A title to put on the meter (any string will do)
Where you are now - current counter
What the max counter value is
A "key" - a unique string, number, anything you want.
Here's a loop that iterates through a list of integers to demonstrate.
import PySimpleGUI as sg

items = list(range(1000))
total_items = len(items)
for index, item in enumerate(items):
    sg.one_line_progress_meter('My meter', index + 1, total_items, 'my meter')
The list iteration code will be whatever your loop code is. The line of code to focus on that you'll be adding is this one:
sg.one_line_progress_meter('My meter', index+1, total_items, 'my meter' )
This line of code will show you a progress-meter window. It contains statistical information like how long you've been running the loop and an estimate of how much longer you have to go.
How do you do that in pandas apply? I do this:
def some_func(a, b):
    global index
    c = a + b  # placeholder for the real work involving a and b
    index += 1
    sg.one_line_progress_meter('My meter', index, len(df), 'my meter')
    return c

index = 0
df['c'] = df[['a', 'b']].apply(lambda x: some_func(*x), axis=1)
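A sketch that avoids the global counter by drawing row numbers from an itertools counter instead (same one_line_progress_meter call, placeholder work):
import itertools

counter = itertools.count(1)

def some_func(a, b):
    c = a + b  # placeholder for the real work involving a and b
    sg.one_line_progress_meter('My meter', next(counter), len(df), 'my meter')
    return c

df['c'] = df[['a', 'b']].apply(lambda x: some_func(*x), axis=1)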
I'm having issues extracting a table from this page, and I really need this data for my paper. I came up with this code, but it got stuck on the second row.
from lxml import html, etree
import pandas as pd

# browser is assumed to be a Selenium WebDriver instance
browser.get('https://www.eex.com/en/market-data/power/futures/french-futures#!/2018/02/01')
table = browser.find_element_by_xpath('//*[@id="content"]/div/div/div/div[1]/div/div/div')
html_table = html.fromstring(table.get_attribute('innerHTML'))
html_code = etree.tostring(html_table)
df = pd.read_html(html_code)[0]
df.drop(['Unnamed: 12', 'Unnamed: 13'], axis=1, inplace=True)
Any advice?
You can always parse the table manually.
I prefer to use BeautifulSoup since I find it much easier to work with.
from bs4 import BeautifulSoup
soup = BeautifulSoup(browser.page_source, "html.parser")
Let's parse the first table, and get the column names:
table = soup.select("table.table-horizontal")[0]
columns = [i.get_text() for i in table.find_all("th")][:-2] ## We don't want the last 2 columns
Now, let's go through the table row by row:
rs = []
for r in table.find_all("tr"):
    ds = []
    for d in r.find_all("td"):
        ds.append(d.get_text().strip())
    rs.append(ds[:-2])
You can write the same code more concisely using list comprehensions:
rs = [[d.get_text().strip() for d in r.find_all("td")][:-2] for r in table.find_all("tr")]
Next, we filter rs to remove lists with length != 12 (since we have 12 columns):
rs = [i for i in rs if len(i)==12]
Finally, we can put this into a DataFrame:
df = pd.DataFrame({k:v for k, v in zip(columns, zip(*rs))})
You can follow a similar procedure for the second table. Hope this helps!
I am new to Beautiful Soup and nested tables, so I am trying to get some experience by scraping a Wikipedia table.
I have searched for a good example on the web, but unfortunately I have not found anything.
My goal is to parse, via pandas, the table "States of the United States of America" on this web page. As you can see from my code below, I have the following issues:
1) I cannot extract all the columns. Apparently my code does not import all the columns properly into a pandas DataFrame; it writes the entries of the third column of the HTML table below the first column.
2) I do not know how to deal with colspan="2", which appears in some lines of the table. In my pandas DataFrame, I would like to have the same entry when the capital and the largest city are the same.
Here is my code. Note that I got stuck trying to overcome my first issue.
Code:
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup

wiki = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
page = urlopen(wiki)
soup = BeautifulSoup(page)

right_table = soup.find_all('table')[0]  # First table
rows = right_table.find_all('tr')[2:]

A = []
B = []
C = []
D = []
F = []
for row in rows:
    cells = row.findAll('td')
    # print(len(cells))
    if len(cells) >= 11:  # Only extract table body, not heading
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))
        C.append(cells[2].find(text=True))
        D.append(cells[3].find(text=True))
        F.append(cells[4].find(text=True))

df = pd.DataFrame(A, columns=['State'])
df['Capital'] = B
df['Largest'] = C
df['Statehood'] = D
df['Population'] = F
df
print(df)
Do you have any suggestions?
Any help to understand BeautifulSoup better would be appreciated.
Thanks in advance.
Here's the strategy I would use.
I notice that each line in the table is complete but, as you say, some lines have two cities in the 'Cities' column and some have only one. This means that we can use the number of items in a line to determine whether we need to 'double' the city name mentioned in that line or not.
I begin the way you did.
>>> import requests
>>> import bs4
>>> page = requests.get('https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States').content
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> right_table=soup.find_all('table')[0]
Then I find all of the rows in the table and verify that it's at least approximately correct.
>>> trs = right_table('tr')
>>> len(trs)
52
I poke around until I find the lines for Alabama and Wyoming, the first and last rows, and display their texts. They're examples of the two types of rows.
>>> trs[2].text
'\n\xa0Alabama\nAL\nMontgomery\nBirmingham\n\nDec 14, 1819\n\n\n4,863,300\n\n52,420\n135,767\n50,645\n131,171\n1,775\n4,597\n\n7\n\n'
>>> trs[51].text
'\n\xa0Wyoming\nWY\nCheyenne\n\nJul 10, 1890\n\n\n585,501\n\n97,813\n253,335\n97,093\n251,470\n720\n1,864\n\n1\n\n'
I notice that I can split these strings on \n and \xa0. This can be done with a regex.
>>> import re
>>> [_ for _ in re.split(r'[\n\xa0]', trs[51].text) if _]
['Wyoming', 'WY', 'Cheyenne', 'Jul 10, 1890', '585,501', '97,813', '253,335', '97,093', '251,470', '720', '1,864', '1']
>>> [_ for _ in re.split(r'[\n\xa0]', trs[2].text) if _]
['Alabama', 'AL', 'Montgomery', 'Birmingham', 'Dec 14, 1819', '4,863,300', '52,420', '135,767', '50,645', '131,171', '1,775', '4,597', '7']
The if _ conditional in these list comprehensions is to discard empty strings.
The Wyoming string has a length of 12, Alabama's is 13. I would leave Alabama's string as it is for pandas. I would extend Wyoming's (and all the others of length 12) using:
>>> row = [_ for _ in re.split(r'[\n\xa0]', trs[51].text) if _]
>>> row[:3]+row[2:]
['Wyoming', 'WY', 'Cheyenne', 'Cheyenne', 'Jul 10, 1890', '585,501', '97,813', '253,335', '97,093', '251,470', '720', '1,864', '1']
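Putting those pieces together, a sketch along these lines (the column names are my own guesses, and rows of unexpected length are skipped) would build the whole frame:
import re
import pandas as pd

cols = ['State', 'Abbr', 'Capital', 'Largest', 'Statehood', 'Population',
        'TotalAreaMi', 'TotalAreaKm', 'LandAreaMi', 'LandAreaKm',
        'WaterAreaMi', 'WaterAreaKm', 'Reps']

rows = []
for tr in trs[2:]:
    row = [_ for _ in re.split(r'[\n\xa0]', tr.text) if _]
    if len(row) == 12:            # capital is also the largest city: duplicate it
        row = row[:3] + row[2:]
    if len(row) == 13:            # keep only rows matching the expected shape
        rows.append(row)

df = pd.DataFrame(rows, columns=cols)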
The solution below should fix both issues you have mentioned.
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup

wiki = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States?action=render'
page = urlopen(wiki)
soup = BeautifulSoup(page, 'html.parser')

right_table = soup.find_all('table')[0]  # First table
rows = right_table.find_all('tr')[2:]

A = []
B = []
C = []
D = []
F = []
for row in rows:
    cells = row.findAll('td')
    combine_cells = cells[1].get('colspan')  # Tells us whether columns for Capital and Established are the same
    cells = [cell.text.strip() for cell in cells]  # Extracts text and removes whitespace for each cell
    index = 0  # allows us to modify columns below
    A.append(cells[index])  # State Code
    B.append(cells[index + 1])  # Capital
    if combine_cells:  # Shift columns over by one if columns 2 and 3 are combined
        index -= 1
    C.append(cells[index + 2])  # Largest
    D.append(cells[index + 3])  # Established
    F.append(cells[index + 4])  # Population

df = pd.DataFrame(A, columns=['State'])
df['Capital'] = B
df['Largest'] = C
df['Statehood'] = D
df['Population'] = F
df
print(df)
Edit: Here's a cleaner version of the above code
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen

wiki = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
page = urlopen(wiki)
soup = BeautifulSoup(page, 'html.parser')

table_rows = soup.find('table')('tr')[2:]  # Get all table rows
cells = [row('td') for row in table_rows]  # Get all cells from rows

def get(cell):  # Get stripped string from tag
    return cell.text.strip()

def is_span(cell):  # Check if cell has the 'colspan' attribute <td colspan="2"></td>
    return cell.get('colspan')

df = pd.DataFrame()
df['State'] = [get(cell[0]) for cell in cells]
df['Capital'] = [get(cell[1]) for cell in cells]
df['Largest'] = [get(cell[2]) if not is_span(cell[1]) else get(cell[1]) for cell in cells]
df['Statehood'] = [get(cell[3]) if not is_span(cell[1]) else get(cell[2]) for cell in cells]
df['Population'] = [get(cell[4]) if not is_span(cell[1]) else get(cell[3]) for cell in cells]

print(df)
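For completeness, pandas can also read the HTML table directly; as far as I know, read_html repeats colspan cells across the spanned columns, which would address your second issue too (untested against the current page layout):
import pandas as pd

wiki = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
tables = pd.read_html(wiki)  # requires lxml or html5lib to be installed
df = tables[0]               # first table on the page
print(df.head())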
I am scraping some data from a website. I have lists of names and prices from different pages. I want to stack the two lists first, and then build an array that contains all the data in rows for each page. However, np.insert gives the following error:
'ValueError: could not convert string to float'.
Here is the code:
import numpy as np

finallist = []
...
for c in range(0, 10):
    narr = np.array(names)
    parr = np.array(prices)
    nparr = np.array(np.column_stack((narr, parr)))
    finallist = np.insert(finallist, c, nparr)
What I want to accomplish is the following:
finallist = [[[name1, price1], [name2, price2], ...],  # from page one
             [[name1, price1], [name2, price2], ...],  # from page two
             ...]                                       # from page x
so that I will be able to reach all the data.
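For reference, I suspect the problem is that np.insert first turns the empty finallist into a float array and then tries to coerce the name strings to floats; something like the untested sketch below, keeping finallist as a plain Python list and appending each page's stacked array, may be closer to what I am after:
import numpy as np

finallist = []                                # plain Python list, one entry per page
for c in range(0, 10):
    # names and prices are the lists scraped for page c
    narr = np.array(names)
    parr = np.array(prices)
    nparr = np.column_stack((narr, parr))     # shape (n_items, 2): one [name, price] pair per row
    finallist.append(nparr.tolist())          # finallist[c] -> [[name1, price1], [name2, price2], ...]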