Is it possible to scrape data from this page? - python

I'm having issues extracting a table from this page, and I really need this data for my paper. I came up with this code, but it gets stuck on the second row.
from selenium import webdriver   # imports added; the original snippet omitted them
from lxml import html, etree
import pandas as pd

browser = webdriver.Chrome()     # assumption: the original snippet does not show which driver was used
browser.get('https://www.eex.com/en/market-data/power/futures/french-futures#!/2018/02/01')
table = browser.find_element_by_xpath('//*[@id="content"]/div/div/div/div[1]/div/div/div')
html_table = html.fromstring(table.get_attribute('innerHTML'))
html_code = etree.tostring(html_table)
df = pd.read_html(html_code)[0]
df.drop(['Unnamed: 12', 'Unnamed: 13'], axis=1, inplace=True)
Any advice?

You can always parse the table manually.
I prefer to use BeautifulSoup since I find it much easier to work with.
from bs4 import BeautifulSoup
soup = BeautifulSoup(browser.page_source, "html.parser")
Let's parse the first table, and get the column names:
table = soup.select("table.table-horizontal")[0]
columns = [i.get_text() for i in table.find_all("th")][:-2] ## We don't want the last 2 columns
Now, let's go through the table row by row:
rs = []
for r in table.find_all("tr"):
    ds = []
    for d in r.find_all("td"):
        ds.append(d.get_text().strip())
    rs.append(ds[:-2])
You can write the same code more concisely using list comprehensions:
rs = [[d.get_text().strip() for d in r.find_all("td")][:-2] for r in table.find_all("tr")]
Next, we filter rs to remove lists with length != 12 (since we have 12 columns):
rs = [i for i in rs if len(i)==12]
Finally, we can put this into a DataFrame:
df = pd.DataFrame({k:v for k, v in zip(columns, zip(*rs))})
You can follow a similar procedure for the second table (a rough sketch is below). Hope this helps!
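For reference, here is a minimal sketch of that same procedure applied to the second table; the [1] index and the assumption that it uses the same table-horizontal class are guesses about the page's markup:
second_table = soup.select("table.table-horizontal")[1]  # assumed selector/index
columns2 = [th.get_text() for th in second_table.find_all("th")][:-2]
rows2 = [[td.get_text().strip() for td in tr.find_all("td")][:-2] for tr in second_table.find_all("tr")]
rows2 = [r for r in rows2 if len(r) == len(columns2)]  # keep only complete body rows
df2 = pd.DataFrame({k: v for k, v in zip(columns2, zip(*rows2))})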

Parsing a table from website (choosing correct HTML tag)

I need to make a dataframe from the following page: http://pitzavod.ru/products/upakovka/
from bs4 import BeautifulSoup
import pandas as pd
import requests
kre = requests.get(f'http://pitzavod.ru/products/upakovka/')
soup = BeautifulSoup(kre.text, 'lxml')
table1 = soup.find('table', id="tab3")
I chose "tab3" because I found <div class="tab-pane fade" id="tab3"> in the HTML text, but the variable table1 comes back empty. How can I get the table? Thank you.
NOTE: you can get the table as a DataFrame in one statement with .read_html, but the DataFrame returned by pd.read_html('http://pitzavod.ru/products/upakovka/')[0] will not retain line breaks.
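For instance, a quick sketch of that one-statement route (keeping in mind that the line breaks inside the cells are lost):
import pandas as pd
df = pd.read_html('http://pitzavod.ru/products/upakovka/')[0]  # line breaks inside cells are not retained
print(df.head())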
.find('table', id="tab3") searches for table tags with id="tab3", and there are no such elements in that page's HTML.
There's a div with id="tab3" (as you've noticed), but it does not contain any tables.
The only table on the page is contained in a div with id="tab4", so you might have used table1 = soup.find('div', id="tab4").table [although I prefer using .select with CSS selectors for targeting nested tags].
Suggested solution:
kre = requests.get('http://pitzavod.ru/products/upakovka/')
# print(kre.status_code, kre.reason, 'from', kre.url)
kre.raise_for_status()
soup = BeautifulSoup(kre.content, 'lxml')
# table = soup.select_one('div#tab4>div.table-responsive>table')
table = soup.find('table') # soup.select_one('table')
tData = [{
    1 if 'center' in c.get('style', '') else ci: '\n'.join([
        l.strip() for l in c.get_text('\n').splitlines() if l.strip()
    ]) for ci, c in enumerate(r.select('td'))
} for r in table.select('tr')]
df = pd.DataFrame(tData)
## combine the top 2 rows to form header ##
df.columns = ['\n'.join([
    f'{d}' for d in df[c][:2] if pd.notna(d)
]) for c in df.columns]
df = df.drop([0, 1], axis='rows').reset_index(drop=True)
# print(df.to_markdown(tablefmt="fancy_grid"))
(Normally, I would use this function if I wanted to specify the separator for tag-contents inside cells, but the middle cell in the 2nd header row would be shifted if I used .DataFrame(read_htmlTable(table, tSep='\n', asObj='dicts')) - the 1 if 'center' in c.get('style', '') else ci bit in the above code is for correcting that.)

How to store elements of a list of HTML tags fetched with BeautifulSoup within a dataframe separated in alphabetically columns with pandas?

I am completely new to Jupyter Notebook, Python, web scraping and all of that. I have looked at different answers, but none of them seems to cover the same problem (and I am not good at adapting "a similar" approach and changing it a bit so I can use it for my purpose).
I want to create a data grid with all existing HTML tags. As the source I am using the MDN docs. Getting all the tags with Beautiful Soup works fine, but I struggle to go any further with this data.
Here is the code for fetching the data with Beautiful Soup:
from bs4 import BeautifulSoup
import requests
url = "https://developer.mozilla.org/en-US/docs/Web/HTML/Element"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
get_nav_tag = soup.find("nav", class_="sidebar-inner")
get_second_div = get_nav_tag.find_all("div")[2]
get_ol = get_second_div.find("ol")
get_li = get_second_div.find_all("li", class_="toggle")[3]
tag_list = get_li.find_all("code")
print("There are currently", len(tag_list), "tags.")
for tags in tag_list:
    print(tags.text)
The list is already sorted.
Now I work with Pandas to create a dataframe
import pandas as pd
tag_data = []
for tag in tag_list:
    tag_data.append({"Tags": tag.text})
df = pd.DataFrame(tag_data)
df
The output is a single-column DataFrame with one tag per row under the column "Tags".
QUESTION
How do I create a dataframe where there are columns for each character and the elements are listed under each column?
Like:
A B C
1 <a> <b> <caption>
2 <abbr> <body> <code>
3 <article> .. ...
4 ... ... ...
How do I separate this list into more lists corresponding to each element's first letter? I guess I will need that for further interactions as well, like creating graphs, e.g. to show in a bar chart how many tags starting with "a", "b", etc. exist.
Thank you!
The code below should do the work.
df['first_letter'] = df.Tags.str[1]
tag_matrix = pd.DataFrame()
for letter in df.first_letter.unique():
    # Create a pandas Series whose name matches the first letter of the tag and contains tags starting with the letter
    matching_tags = pd.Series(df[df.first_letter==letter].reset_index(drop=True).Tags, name=letter)
    # Append the series to the tag_matrix
    tag_matrix = pd.concat([tag_matrix, matching_tags], axis=1)
tag_matrix
The output is a grid with one column per first letter and the matching tags listed underneath. Note that you might want to do some additional cleaning, such as dropping duplicate tags or converting to lower case.
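For example, a small sketch of that cleaning, applied to df before tag_matrix is built (whether duplicates actually occur depends on what was scraped):
df['Tags'] = df['Tags'].str.lower()        # normalise case
df = df.drop_duplicates(subset='Tags')     # drop duplicate tags, if any
df['first_letter'] = df.Tags.str[1]        # recompute the helper column after cleaning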
You can use the pivot and concat methods to achieve this:
df["letter"] = df["Tags"].str[1].str.upper()
df = df.pivot(columns="letter", values="Tags")
df = pd.concat([df[c].dropna().reset_index(drop=True) for c in df.columns], axis=1)
This gives a frame with one column per letter and the tags listed under each.
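As a follow-up to the bar-chart idea in the question, here is a minimal sketch that counts the tags per letter from that pivoted frame (it assumes matplotlib is installed):
import matplotlib.pyplot as plt
counts = df.count()  # number of non-null tags in each letter column
counts.plot(kind='bar', title='Number of HTML tags per first letter')
plt.show()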

Pandas: ValueError: arrays must all be same length - when orient=index doesn't work

Grateful for your help.
I'm trying to return multiple search results from Google based on two or more search terms. Example inputs:
digital economy gov.uk
digital economy gouv.fr
For about 50% of the search results I input, the script below works fine. However, for the remaining search terms, I receive:
ValueError: arrays must all be same length
Any ideas on how I can address this?
output_df1=pd.DataFrame()
for input in inputs:
    query = input
    #query = urllib.parse.quote_plus(query)
    number_result = 20
    ua = UserAgent()
    google_url = "https://www.google.com/search?q=" + query + "&num=" + str(number_result)
    response = requests.get(google_url, {"User-Agent": ua.random})
    soup = BeautifulSoup(response.text, "html.parser")
    result_div = soup.find_all('div', attrs = {'class': 'ZINbbc'})
    links = []
    titles = []
    descriptions = []
    for r in result_div:
        # Checks if each element is present, else, raise exception
        try:
            link = r.find('a', href = True)
            title = r.find('div', attrs={'class':'vvjwJb'}).get_text()
            description = r.find('div', attrs={'class':'s3v9rd'}).get_text()
            # Check to make sure everything is present before appending
            if link != '' and title != '' and description != '':
                links.append(link['href'])
                titles.append(title)
                descriptions.append(description)
        # Next loop if one element is not present
        except:
            continue
    to_remove = []
    clean_links = []
    for i, l in enumerate(links):
        clean = re.search('\/url\?q\=(.*)\&sa', l)
        # Anything that doesn't fit the above pattern will be removed
        if clean is None:
            to_remove.append(i)
            continue
        clean_links.append(clean.group(1))
    output_dict = {
        'Search_Term': input,
        'Title': titles,
        'Description': descriptions,
        'URL': clean_links,
    }
    search_df = pd.DataFrame(output_dict, columns = output_dict.keys())
    #merging the data frames
    output_df1 = pd.concat([output_df1, search_df])
Based on this answer: Python Pandas ValueError Arrays Must be All Same Length, I have also tried to use orient='index'. While this does not give me the array error, it only returns one response for each search result:
a = {
    'Search_Term': input,
    'Title': titles,
    'Description': descriptions,
    'URL': clean_links,
}
search_df = pd.DataFrame.from_dict(a, orient='index')
search_df = search_df.transpose()
#merging the data frames
output_df1 = pd.concat([output_df1, search_df])
Edit: based on @Hammurabi's answer, I was able to at least pull 20 results per input, but these appear to be duplicates. Any idea how I can get the unique results into separate rows?
df = pd.DataFrame()
cols = ['Search_Term', 'Title', 'Description', 'URL']
for i in range(20):
    df_this_row = pd.DataFrame([[input, titles, descriptions, clean_links]], columns=cols)
    df = df.append(df_this_row)
df = df.reset_index(drop=True)
##merging the data frames
output_df1 = pd.concat([output_df1, df])
Any thoughts on how I can address the array error so it works for all search terms, or on how I can make the orient='index' method work for multiple search results? In my script I am trying to pull 20 results per search term.
Thanks for your help!
You are having trouble with columns of different lengths, maybe because sometimes you get more or fewer than 20 results per term. You can put dataframes together even if they have different lengths. I think you want to append the dataframes, because you have different search terms so there is probably no merging to do to consolidate matching search terms. I don't think you want orient='index' because in the example you post, that puts lists into the df, rather than separating out the list items into different columns. Also, I don't think you want the built-in input as part of the df, looks like you want to repeat the query for each relevant row. Maybe something is going wrong in the dictionary creation.
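To see where the error comes from, here is a tiny reproduction (the lists are made up, not your data): a DataFrame built from a dict of lists requires every list to have the same length, and a single link dropped by the regex filter is enough to break that.
import pandas as pd
titles = ['title1', 'title2', 'title3']
clean_links = ['url1', 'url2']   # one link was dropped by the regex filter
pd.DataFrame({'Title': titles, 'URL': clean_links})  # raises ValueError because the lists differ in length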
You could consider appending 1 row at a time to your main dataframe, and skip the list and dictionary creation, after your line
if link != '' and title != '' and description != '':
Maybe simplifying the df creation will avoid the error. See this toy example:
df = pd.DataFrame()
cols = ['Search_Term', 'Title', 'Description', 'URL']
query = 'search_term1'
for i in range(2):
    link = 'url' + str(i)
    title = 'title' + str(i)
    description = 'des' + str(i)
    df_this_row = pd.DataFrame([[query, title, description, link]], columns=cols)
    df = df.append(df_this_row)
df = df.reset_index(drop=True) # originally, every row has index 0
print(df)
#     Search_Term   Title Description   URL
# 0  search_term1  title0        des0  url0
# 1  search_term1  title1        des1  url1
Update: you mentioned that you are getting the same result 20 times. I suspect that is because you are only getting number_result = 20, and you probably want to iterate instead.
Your code fixes number_result at 20, then uses it in the url:
number_result = 20
# ...
google_url = "https://www.google.com/search?q=" + query + "&num=" + str(number_result)
Try iterating instead:
for number_result in range(1, 21): # if results start at 1 rather than 0
    # ...
    google_url = "https://www.google.com/search?q=" + query + "&num=" + str(number_result)

Pandas DataFrame combine rows by column value, where Date Rows are NULL

Scenario:
Parse a PDF bank statement and transform it into a clean, formatted CSV file.
What I've tried:
I managed to parse the PDF file (tabular format) using the camelot library, but failed to produce the desired result in terms of formatting.
Code:
import camelot
import pandas as pd
tables = camelot.read_pdf('test.pdf', pages = '3')
for i, table in enumerate(tables):
    print(f'table_id:{i}')
    print(f'page:{table.page}')
    print(f'coordinates:{table._bbox}')
tables = camelot.read_pdf('test.pdf', flavor='stream', pages = '3')
df = tables[0].df  # presumably how df was obtained; each camelot table exposes its data as a .df DataFrame
columns = df.iloc[0]
df.columns = columns
df = df.drop(0)
df.head()
for c in df.select_dtypes('object').columns:
    df[c] = df[c].str.replace('$', '')
    df[c] = df[c].str.replace('-', '')

def convert_to_float(num):
    try:
        return float(num.replace(',', ''))
    except:
        return 0

for col in ['Deposits', 'Withdrawals', 'Balance']:
    df[col] = df[col].map(convert_to_float)
My result: (screenshot omitted; the parsed table still contains extra rows whose Date column is NaN, holding the continuation of the Description above them)
Desired output: (screenshot omitted; those NaN-date rows merged into the dated row above them)
The logic I came up with is to move those rows up (to row n-1, I guess) whenever the Date column is NaN, but I don't know whether this logic is right. Can anyone help me sort this out properly?
I tried pandas groupby and aggregation functions, but they only merge the whole data and remove the NaN and duplicate dates, which is not suitable because every entry is necessary.
Using transform: flag the rows that have a Date, build a running group id with cumsum so that each undated row falls into the same group as the dated row above it, then join the Descriptions within each group and keep only the dated rows.
df.loc[~df.Date.isna(), 'group'] = 1
g = df.group.fillna(0).cumsum()
df['Description'] = df.groupby(g)['Description'].transform(' '.join)
new_df = df.loc[~df['Date'].isna()]
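For illustration, here is a small self-contained example of how that behaves; the rows are made up and only a few columns are shown:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': ['01 Jan', np.nan, '02 Jan', np.nan],
    'Description': ['ATM WITHDRAWAL', 'MAIN ST BRANCH', 'CARD PAYMENT', 'GROCERY STORE'],
    'Withdrawals': [50.0, 0.0, 20.0, 0.0],
})
df.loc[~df.Date.isna(), 'group'] = 1
g = df.group.fillna(0).cumsum()
df['Description'] = df.groupby(g)['Description'].transform(' '.join)
new_df = df.loc[~df['Date'].isna()].drop(columns='group')
print(new_df)
# expected output (roughly):
#      Date                    Description  Withdrawals
# 0  01 Jan  ATM WITHDRAWAL MAIN ST BRANCH         50.0
# 2  02 Jan     CARD PAYMENT GROCERY STORE         20.0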

Beautiful Soup Wikipedia nested tables

I am new to Beautiful Soup and nested tables, so I am trying to get some experience by scraping a Wikipedia table.
I have searched for a good example on the web but unfortunately have not found anything.
My goal is to parse, via pandas, the table "States of the United States of America" on this web page. As you can see from my code below, I have the following issues:
1) I cannot extract all the columns. Apparently my code does not import all the columns properly into a pandas DataFrame and writes the entries of the third column of the HTML table below the first column.
2) I do not know how to deal with colspan="2", which appears in some lines of the table. In my pandas DataFrame I would like to have the same entry when the capital and the largest city are the same.
Here is my code. Note that I got stuck trying to overcome my first issue.
Code:
from urllib.request import urlopen
import pandas as pd
wiki='https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
page = urlopen(wiki)
from bs4 import BeautifulSoup
soup = BeautifulSoup(page)
right_table=soup.find_all('table')[0] # First table
rows = right_table.find_all('tr')[2:]
A=[]
B=[]
C=[]
D=[]
F=[]
for row in rows:
    cells = row.findAll('td')
    # print(len(cells))
    if len(cells)>=11: #Only extract table body not heading
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))
        C.append(cells[2].find(text=True))
        D.append(cells[3].find(text=True))
        F.append(cells[4].find(text=True))
df=pd.DataFrame(A,columns=['State'])
df['Capital']=B
df['Largest']=C
df['Statehood']=D
df['Population']=F
df
print(df)
Do you have any suggestions?
Any help to better understand BeautifulSoup would be appreciated.
Thanks in advance.
Here's the strategy I would use.
I notice that each line in the table is complete but, as you say, some lines have two cities in the 'Cities' column and some have only one. This means that we can use the number of items in a line to determine whether we need to 'double' the city name mentioned in that line or not.
I begin the way you did.
>>> import requests
>>> import bs4
>>> page = requests.get('https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States').content
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> right_table=soup.find_all('table')[0]
Then I find all of the rows in the table and verify that it's at least approximately correct.
>>> trs = right_table('tr')
>>> len(trs)
52
I poke around until I find the lines for Alabama and Wyoming, the first and last rows, and display their texts. They're examples of the two types of rows.
>>> trs[2].text
'\n\xa0Alabama\nAL\nMontgomery\nBirmingham\n\nDec 14, 1819\n\n\n4,863,300\n\n52,420\n135,767\n50,645\n131,171\n1,775\n4,597\n\n7\n\n'
>>> trs[51].text
'\n\xa0Wyoming\nWY\nCheyenne\n\nJul 10, 1890\n\n\n585,501\n\n97,813\n253,335\n97,093\n251,470\n720\n1,864\n\n1\n\n'
I notice that I can split these strings on \n and \xa0. This can be done with a regex.
>>> import re
>>> [_ for _ in re.split(r'[\n\xa0]', trs[51].text) if _]
['Wyoming', 'WY', 'Cheyenne', 'Jul 10, 1890', '585,501', '97,813', '253,335', '97,093', '251,470', '720', '1,864', '1']
>>> [_ for _ in re.split(r'[\n\xa0]', trs[2].text) if _]
['Alabama', 'AL', 'Montgomery', 'Birmingham', 'Dec 14, 1819', '4,863,300', '52,420', '135,767', '50,645', '131,171', '1,775', '4,597', '7']
The if _ conditional in these list comprehensions is to discard empty strings.
After splitting, the Wyoming row has 12 items and Alabama's has 13. I would leave Alabama's as it is for pandas. I would extend Wyoming's (and all the other rows of length 12) using:
>>> row = [_ for _ in re.split(r'[\n\xa0]', trs[51].text) if _]
>>> row[:3]+row[2:]
['Wyoming', 'WY', 'Cheyenne', 'Cheyenne', 'Jul 10, 1890', '585,501', '97,813', '253,335', '97,093', '251,470', '720', '1,864', '1']
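Continuing that idea, here is a rough sketch that applies the same split-and-pad step to every row and hands the result to pandas; the column labels below are my own, not taken from the page:
import re
import pandas as pd

parsed = []
for tr in trs[2:]:
    row = [_ for _ in re.split(r'[\n\xa0]', tr.text) if _]
    if len(row) == 12:                 # capital is also the largest city: duplicate it
        row = row[:3] + row[2:]
    if len(row) == 13:                 # skip anything that is not a regular state row
        parsed.append(row)

cols = ['State', 'Abbr', 'Capital', 'Largest city', 'Statehood', 'Population',
        'Total area (sq mi)', 'Total area (km2)', 'Land area (sq mi)', 'Land area (km2)',
        'Water area (sq mi)', 'Water area (km2)', 'Reps']
df = pd.DataFrame(parsed, columns=cols)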
The solution below should fix both issues you have mentioned.
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup
wiki='https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States?action=render'
page = urlopen(wiki)
soup = BeautifulSoup(page, 'html.parser')
right_table=soup.find_all('table')[0] # First table
rows = right_table.find_all('tr')[2:]
A=[]
B=[]
C=[]
D=[]
F=[]
for row in rows:
    cells = row.findAll('td')
    combine_cells = cells[1].get('colspan') # Tells us whether the Capital and Largest-city cells are combined
    cells = [cell.text.strip() for cell in cells] # Extracts text and removes whitespace for each cell
    index = 0 # allows us to modify columns below
    A.append(cells[index]) # State Code
    B.append(cells[index + 1]) # Capital
    if combine_cells: # Shift columns over by one if columns 2 and 3 are combined
        index -= 1
    C.append(cells[index + 2]) # Largest
    D.append(cells[index + 3]) # Established
    F.append(cells[index + 4]) # Population
df=pd.DataFrame(A,columns=['State'])
df['Capital']=B
df['Largest']=C
df['Statehood']=D
df['Population']=F
df
print(df)
Edit: Here's a cleaner version of the above code
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
wiki = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
page = urlopen(wiki)
soup = BeautifulSoup(page, 'html.parser')
table_rows = soup.find('table')('tr')[2:] # Get all table rows
cells = [row('td') for row in table_rows] # Get all cells from rows
def get(cell): # Get stripped string from tag
    return cell.text.strip()

def is_span(cell): # Check if cell has the 'colspan' attribute <td colspan="2"></td>
    return cell.get('colspan')
df = pd.DataFrame()
df['State'] = [get(cell[0]) for cell in cells]
df['Capital'] = [get(cell[1]) for cell in cells]
df['Largest'] = [get(cell[2]) if not is_span(cell[1]) else get(cell[1]) for cell in cells]
df['Statehood'] = [get(cell[3]) if not is_span(cell[1]) else get(cell[2]) for cell in cells]
df['Population'] = [get(cell[4]) if not is_span(cell[1]) else get(cell[3]) for cell in cells]
print(df)
