I am trying to scrape data from a webpage that contains a table, so that I can then put it into a pandas DataFrame.
I believe I have done everything correctly, but I get repeated columns...
Here is my code:
import requests
import pandas as pd
from bs4 import BeautifulSoup

html_content = requests.get('http://timetables.itsligo.ie:81/reporting/individual;student+set;id;SG_KAPPL_H08%2FF%2FY1%2F1%2F%28A%29%0D%0A?t=student+set+individual&days=1-5&weeks=29&periods=3-20&template=student+set+individual').text
soup = BeautifulSoup(html_content, 'html.parser')

all_tables = soup.find_all('table')
wanted_table = all_tables[6]
first_tr = wanted_table.find('tr')
following_tr = first_tr.find_next_siblings()

details = []
for tr in following_tr:
    prepare = []
    for td in tr.find_all('td'):
        prepare.append(td.text)
    details.append(prepare)

df = pd.DataFrame(details)
pd.set_option('display.max_columns', None)
display(df)
Which works great, but as you can see in the picture below (columns 1 and 2 in row 0), I'm getting repeated td values, and one copy always has \n repeated.
The thing I noticed is that the details list comes back with doubled entries for some reason; maybe there is a table nested inside a table?
I'm doing this in Jupyter, by the way.
Thank you in advance!
The reason your details list is nested is because you are constructing it that way; that is, if you append a list (prepare) to another list (details), you get a nested list. See here. And this is okay, since it works well to be read into your DataFrame.
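For instance, a tiny sketch with made-up values, just to show the list-of-rows shape that DataFrame accepts:
import pandas as pd

details = []
details.append(['Maths', 'Room 1'])    # one inner list per <tr>
details.append(['Physics', 'Room 2'])
# details is now [['Maths', 'Room 1'], ['Physics', 'Room 2']] -- a nested list,
# which pd.DataFrame turns into two rows of two columns.
print(pd.DataFrame(details))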
Still, you are correct that there is a nested table thing going on in the HTML. I won't try to format the HTML here, but each box in the schedule is a <td> within the overarching wanted_table. When there is a course in one of those cells, there is another <table> used to hold the details, so the class name, professor, etc. are more <td> elements within this nested <table>. When finding all the cells (tr.find_all('td')), you therefore encounter both the larger class box and its nested cells. And when you get the .text on the outermost <td>, you also get the text from the innermost cells, hence the repetition.
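If you want to confirm the nesting in your own notebook, a quick check could look like this (reusing wanted_table from your code):
# Count how many outer cells contain a nested <table>;
# a nonzero count confirms the table-within-a-table structure.
nested_count = sum(1 for td in wanted_table.find_all('td') if td.find('table'))
print(nested_count)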
I am not sure if this is the best way, but one option would be to prevent the search from entering the nested table, using the recursive parameter in find_all.
# all your other code above
for tr in following_tr:
    prepare = []
    for td in tr.find_all('td', recursive=False):
        prepare.append(td.text)
    details.append(prepare)

df = pd.DataFrame(details)
The above should prevent the repeated elements from appearing. However, there is still the problem of having many \n characters, as well as not accounting for the fact that some cells span multiple columns. You can start to fix the first by strip-ping the text. For the second, you can read the colspan attribute to pad the prepare list:
# all your other code above
for tr in following_tr:
    prepare = []
    for td in tr.find_all('td', recursive=False):
        text = td.text.strip()  # drop surrounding whitespace/newlines
        # pad with None for the extra columns a cell spans (a colspan of 1 adds nothing)
        prepare += [text] + [None] * (int(td.get('colspan', 1)) - 1)
    details.append(prepare)

df = pd.DataFrame(details)
It's a little too unwieldy to post the output. And there is still formatting you will likely want to do, but that is getting beyond the scope of your original post. Hopefully something in here helps!
import pandas as pd
url = 'https://en.wikipedia.org/wiki/The_World%27s_Billionaires'
df_list = pd.read_html(url)
len(df_list)
Output:
32
After specifying na_values, as below:
pd.read_html(
    url,
    na_values=["Forbes: The World's Billionaires website"]
)[0].tail()
I understand that vectorization or parallel programming is the way to go. But what if my program doesn't fit those use cases, say because NumPy doesn't work for that particular problem?
For demonstration purposes, I have written a simple program here:
import pandas as pd
import requests
from bs4 import BeautifulSoup
def extract_data(location_name, date):
    ex_data = []
    url = f'http://www.example.com/search.php?locationid={location_name}&date={date}'
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "html.parser")
    for tr in soup.find('table', class_='classTable').find_all('tr', attrs={'class': None}):
        text = [td.text for td in tr.find_all('td')]
        ex_data.append(text)
    return ex_data
# list of all dates in a year
dates = pd.read_csv('dates.csv')
# list of a number of locations
location_list = pd.read_csv('locations.csv')

master_data = []
for indexLoc, rowLoc in location_list.iterrows():
    data = []
    for index, row in dates.iterrows():
        _date = row['Date']
        location = rowLoc['Location']
        rows = extract_data(location, _date)
        data = data + rows
    master_data = master_data + data

master_df = pd.DataFrame(master_data)
print(master_df)
The program basically puts a list of dates and a list of locations into separate dataframes, then loops through each location and, nested inside, through each date to call a function. The function makes a request to a URL built from those parameters, pulls some information from a table using BeautifulSoup, and returns it. The program then stores those return values in a list and the loop continues.
Now, let's say there are 100 locations and 365 dates: the program will run through every date for every location, which makes 100*365 = 36,500 loop iterations. That number, on top of the temporary storage required for the list the function returns on each iteration, is definitely nowhere near an efficient solution.
Using NumPy may change the date into a datetime variable rather than keeping it as a string (at least, that's what happened in my case). Using multiprocessing/multithreading might break the sequence in which the final master_data should be displayed if a request inside the function takes too long to fulfill. For instance, the data for Feb 17,...,20 would be added to the list before Feb 16 because the request for Feb 16 took too long. I understand that it can be sorted later on, but what if the data is such that it can't be sorted?
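(For reference, a rough, untested sketch of the multithreaded variant I am describing, reusing the extract_data function and the dates/location_list dataframes from the code above; as far as I can tell, concurrent.futures' map-style API yields results in input order, though I am not sure this is the light-weight option I am after.)
import concurrent.futures

# Sketch only: build every (location, date) pair up front, then let a thread
# pool run the I/O-bound requests concurrently.
pairs = [(rowLoc['Location'], row['Date'])
         for _, rowLoc in location_list.iterrows()
         for _, row in dates.iterrows()]

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    # executor.map returns results in the same order as `pairs`,
    # even when individual requests finish out of order.
    results = list(executor.map(lambda p: extract_data(*p), pairs))

master_data = [row for table in results for row in table]
master_df = pd.DataFrame(master_data)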
What would be a simple, lightweight solution for these nested loops, and what is the best way to get maximum speed out of the program execution? I would also like to know why that would be the best option, with an example if you can provide one.
I am completely new to Jupyter Notebook, Python, web scraping and stuff. I looked at different answers but none seem to have the same problem (and I am not good at adapting "a similar" approach and changing it a bit so I can use it for my purpose).
I want to create a data grid with all existing HTML tags. As source I am using the MDN docs. It works fine to get all tags with Beautiful Soup, but I struggle to go any further with this data.
Here is the code for fetching the data with Beautiful Soup:
from bs4 import BeautifulSoup
import requests
url = "https://developer.mozilla.org/en-US/docs/Web/HTML/Element"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
get_nav_tag = soup.find("nav", class_="sidebar-inner")
get_second_div = get_nav_tag.find_all("div")[2]
get_ol = get_second_div.find("ol")
get_li = get_second_div.find_all("li", class_="toggle")[3]
tag_list = get_li.find_all("code")

print("There are currently", len(tag_list), "tags.")
for tag in tag_list:
    print(tag.text)
The list is already sorted.
Now I work with Pandas to create a dataframe
import pandas as pd
tag_data = []
for tag in tag_list:
    tag_data.append({"Tags": tag.text})
df = pd.DataFrame(tag_data)
df
The output is a one-column dataframe listing each tag under a "Tags" header.
QUESTION
How do I create a dataframe where there are columns for each character and the elements are listed under each column?
Like:
      A           B         C
1     <a>         <b>       <caption>
2     <abbr>      <body>    <code>
3     <article>   ...       ...
4     ...         ...       ...
How do I separate this list into more lists corresponding to each element's first letter? I guess I will need that for further interactions as well, like creating graphs, e.g. to show in a bar chart how many tags starting with "a", "b", etc. exist.
Thank you!
The code below should do the job.
df['first_letter'] = df.Tags.str[1]

tag_matrix = pd.DataFrame()
for letter in df.first_letter.unique():
    # Create a pandas series whose name matches the first letter of the tag and contains tags starting with the letter
    matching_tags = pd.Series(df[df.first_letter==letter].reset_index(drop=True).Tags, name=letter)
    # Append the series to the tag_matrix
    tag_matrix = pd.concat([tag_matrix, matching_tags], axis=1)

tag_matrix
Here's a sample of the output:
Note that you might want to do some additional cleaning, such as dropping duplicate tags or converting to lower case.
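For instance, a minimal sketch of that cleanup, plus the bar chart the question mentions (assuming the df built above, whose Tags column holds strings like "<a>"):
# Normalise and de-duplicate before splitting by first letter
df['Tags'] = df['Tags'].str.lower()
df = df.drop_duplicates(subset='Tags').reset_index(drop=True)
df['first_letter'] = df.Tags.str[1]

# Quick bar chart: how many tags start with each letter (needs matplotlib installed)
df['first_letter'].value_counts().sort_index().plot(kind='bar')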
You can use the pivot and concat methods to achieve this:
# First letter of each tag (the character after '<'), upper-cased to use as the column header
df["letter"] = df["Tags"].str[1].str.upper()
# Pivot so each letter becomes its own column (this leaves NaN gaps)
df = df.pivot(columns="letter", values="Tags")
# Drop the NaN gaps and re-stack each column from the top
df = pd.concat([df[c].dropna().reset_index(drop=True) for c in df.columns], axis=1)
This gives a dataframe with one column per first letter.
I am scraping data with Python. I get a CSV file and can split it into columns in Excel later. But I am encountering an issue I have not been able to solve. Sometimes the scraped items have two statuses and sometimes just one. The second status pushes the other values in that row to the right, and as a result the dates are not all in the same column, which would be useful for sorting the rows.
Do you have any idea how to merge the columns when there are two statuses, for example, or another solution?
Maybe it is also an issue that I still need to separate the values into columns manually with Excel.
Here is my code:
#call packages
import random
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import pandas as pd
# define driver etc.
service_obj = Service("C:\\Users\\joerg\\PycharmProjects\\dynamic2\\chromedriver.exe")
browser = webdriver.Chrome(service=service_obj)
# create loop
initiative_list = []
for i in range(0, 2):
    url = 'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page='+str(i)
    browser.get(url)
    time.sleep(random.randint(5, 10))
    initiative_item = browser.find_elements(By.CSS_SELECTOR, "initivative-item")
    initiatives = [item.text for item in initiative_item]
    initiative_list.extend(initiatives)
df = pd.DataFrame(initiative_list)
#create csv
print(df)
df.to_csv('Initiativen.csv')
df.columns = ['tosplit']
new_df = df['tosplit'].str.split('\n', expand=True)
print(new_df)
new_df.to_csv('Initiativennew.csv')
I tried to merge the columns if there are two statuses.
"make the columns merge if there are two statuses for example or other solutions"
[If by "statuses" you mean the yellow labels ending in OPEN/UPCOMING/etc, then] it should be taken care of by the following parts of the getDetails_iiaRow (below the dividing line):
labels = cssSelect(iiaEl, 'div.field span.label')
and then
'labels': ', '.join([l.text.strip() for l in labels])
So, multiple labels will be separated by commas (or any other separator you apply .join to).
initiative_item = browser.find_elements(By.CSS_SELECTOR, "initivative-item")
initiatives = [item.text for item in initiative_item]
Instead of doing it like this and then having to split and clean things, you should consider extracting each item in a more specific manner and have each "row" be represented as a dictionary (with the column-names as the keys, so nothing gets mis-aligned later). If you wrap it as a function:
def cssSelect(el, sel): return el.find_elements(By.CSS_SELECTOR, sel)

def getDetails_iiaRow(iiaEl):
    title = cssSelect(iiaEl, 'div.search-result-title')
    labels = cssSelect(iiaEl, 'div.field span.label')
    iiarDets = {
        'title': title[0].text.strip() if title else None,
        'labels': ', '.join([l.text.strip() for l in labels])
    }

    # each remaining column is a div holding a [translate] label div followed by its value div
    cvSel = 'div[translate]+div:last-child'
    for c in cssSelect(iiaEl, f'div:has(>{cvSel})'):
        colName = cssSelect(c, 'div[translate]')[0].text.strip()
        iiarDets[colName] = cssSelect(c, cvSel)[0].text.strip()

    # make relative links absolute
    link = iiaEl.get_attribute('href')
    if link[:1] == '/':
        link = f'https://ec.europa.eu{link}'
    iiarDets['link'] = link
    return iiarDets
then you can simply loop through the pages like:
initiative_list = []
for i in range(0, 2):
    url = f'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page={i}'
    browser.get(url)
    time.sleep(random.randint(5, 10))
    initiative_list += [
        getDetails_iiaRow(iia) for iia in
        cssSelect(browser, 'initivative-item>article>a ')
    ]
and then, since it's all cleaned already, you can directly save the data with
pd.DataFrame(initiative_list).to_csv('Initiativen.csv', index=False)
The output I got for the first 3 pages looks like:
I think it is worth working a little bit harder to get your data rationalised before putting it in the csv rather than trying to unpick the damage once ragged data has been exported.
A quick look at each record in the page suggests that there are five main items that you want to export and these correspond to the five top-level divs in the a element.
The complexity (as you note) comes because there are sometimes two statuses specified, and in that case there is sometimes a separate date range for each and sometimes a single date range.
I have therefore chosen to put the three ever present fields as the first three columns, followed next by the status + date range columns as pairs. Finally I have removed the field names (these should effectively become the column headings) to leave only the variable data in the rows.
initiatives = [processDiv(item) for item in initiative_item]
def processDiv(item):
    divs = item.find_elements(By.XPATH, "./article/a/div")
    if "\n" in divs[0].text:
        # two statuses in the first div
        statuses = divs[0].text.split("\n")
        if len(divs) > 5:
            # a separate date range for each status
            return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1],
                    statuses[0], divs[4].text.split("\n")[1],
                    statuses[1], divs[5].text.split("\n")[1]]
        else:
            # a single date range shared by both statuses
            return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1],
                    statuses[0], divs[4].text.split("\n")[1],
                    statuses[1], divs[4].text.split("\n")[1]]
    else:
        # a single status
        return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1],
                divs[0].text, divs[4].text.split("\n")[1]]
The above approach sticks as close to yours as I can. You will clearly need to rework the pandas code to reflect the slightly altered data structure.
Personally, I would invest even more time in clearly identifying the best definitions for the fields that represent each piece of data that you wish to retrieve (rather than as simply divs 0-5), and extract the text directly from them (rather than messing around with split). In this way you are far more likely to create robust code that can be maintained over time (perhaps not your goal).
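As a rough illustration of that last point, the extraction could be driven by the visible field labels instead of div positions; this is only a sketch, with the div layout assumed from the description above rather than verified against the live page:
def extract_fields(item):
    # One dictionary per initiative, keyed by the on-page field labels
    fields = {}
    for block in item.find_elements(By.XPATH, "./article/a/div[count(div) >= 2]"):
        inner = block.find_elements(By.XPATH, "./div")
        label, value = inner[0].text.strip(), inner[1].text.strip()
        if label:
            fields[label] = value
    return fields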
I am trying to retrieve the code as well as the title, but somehow I am not able to. The website is:
https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27
Here is what I have tried to get the values from the table:
import requests
from bs4 import BeautifulSoup

unspsc_link = "https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27"
link = requests.get(unspsc_link).text
soup = BeautifulSoup(link, 'lxml')
print(soup.prettify())

all_table = soup.find_all('table')
print(all_table)

right_table = soup.find_all('table',
                            id="dnn_ctr1535_UNSPSCSearch_gvDetailsSearchView")
tables = right_table.find_all('td')
print(tables)
The error is: AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I expect to save the code as well as the title in a list and save it in a dataframe later.
Is there any way to continue to the next page without manually providing values in the search code (like 51%), as there are more than 20 pages inside 51%?
From the documentation:
AttributeError: 'ResultSet' object has no attribute 'foo' - This usually happens because you expected find_all() to return a single tag or string. But find_all() returns a list of tags and strings – a ResultSet object. You need to iterate over the list and look at the .foo of each one. Or, if you really only want one result, you need to use find() instead of find_all()
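In other words, the iterate-over-the-ResultSet option the documentation mentions would look roughly like this (a sketch reusing the soup object from your code; the find() route is what the answer code below uses):
# find_all returns a ResultSet (a list), so loop over it and drill into each table
for table in soup.find_all('table', id="dnn_ctr1535_UNSPSCSearch_gvDetailsSearchView"):
    for td in table.find_all('td'):
        print(td.text)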
Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
unspsc_link = "https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27"
link = requests.get(unspsc_link).text
soup = BeautifulSoup(link, 'lxml')
right_table = soup.find('table', id="dnn_ctr1535_UNSPSCSearch_gvDetailsSearchView")
df = pd.read_html(str(right_table))[0]
# Clean up the DataFrame
df = df[[0, 1]]          # keep only the Code and Title columns
df.columns = df.iloc[0]  # promote the first row to column headers
df = df[1:]              # drop that header row from the data
print(df)
Output:
0 Code Title
1 51180000 Hormones and hormone antagonists
2 51280000 Antibacterials
3 51290000 Antidepressants
4 51390000 Sympathomimetic or adrenergic drugs
5 51460000 Herbal drugs
...
Notes:
- The row order may be a little different but the data seems to be the same.
- You will have to remove the last one or two rows from the DataFrame as they are not relevant.
- This is the data from the first page only. Look into selenium to get the data from all pages by clicking on the buttons [1] [2] .... You can also use requests to emulate the POST request but it is a bit difficult for this site (IMHO).
I am trying to combine the cell values (strings) in a dataframe column using the groupby method, separating the cell values in each grouped cell with commas. I ran into the following error:
TypeError: sequence item 0: expected str instance, float found
The error occurs on the following line of code, see the code block for complete codes:
toronto_df['Neighbourhood'] = toronto_df.groupby(['Postcode','Borough'])['Neighbourhood'].agg(lambda x: ','.join(x))
It seems that in the groupby function, the index corresponding to each row in the un-grouped dataframe is automatically added to the string before it is joined, and this causes the TypeError. However, I have no idea how to fix the issue. I browsed a lot of threads but didn't find a solution. I would appreciate any guidance or assistance!
# Import Necessary Libraries
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
# Use BeautifulSoup to scrape information in the table from the Wikipedia page, and set up the dataframe containing all the information in the table
wiki_html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(wiki_html, 'lxml')
# print(soup.prettify())
table = soup.find('table', class_='wikitable sortable')
table_columns = []
for th_txt in table.tbody.findAll('th'):
    table_columns.append(th_txt.text.rstrip('\n'))

toronto_df = pd.DataFrame(columns=table_columns)
for row in table.tbody.findAll('tr')[1:]:
    row_data = []
    for td_txt in row.findAll('td'):
        row_data.append(td_txt.text.rstrip('\n'))
    toronto_df = toronto_df.append({table_columns[0]: row_data[0],
                                    table_columns[1]: row_data[1],
                                    table_columns[2]: row_data[2]}, ignore_index=True)
toronto_df.head()
# Remove cells with a borough that is Not assigned
toronto_df.replace('Not assigned',np.nan, inplace=True)
toronto_df = toronto_df[toronto_df['Borough'].notnull()]
toronto_df.reset_index(drop=True, inplace=True)
toronto_df.head()
# If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough
toronto_df['Neighbourhood'] = toronto_df.groupby(['Postcode','Borough'])['Neighbourhood'].agg(lambda x: ','.join(x))
toronto_df.drop_duplicates(inplace=True)
toronto_df.head()
The expected result of the 'Neighbourhood' column should separate the cell values in the grouped cell using commas, showing something like this (I cannot post images yet, so I just provide the link):
https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/7JXaz3NNEeiMwApe4i-fLg_40e690ae0e927abda2d4bde7d94ed133_Screen-Shot-2018-06-18-at-7.17.57-PM.png?expiry=1557273600000&hmac=936wN3okNJ1UTDA6rOpQqwELESvqgScu08_Spai0aQQ
As mentioned in the comments, the NaN is a float, so trying to do string operations on it doesn't work (and this is the reason for the error message)
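A quick way to reproduce the same error in isolation (illustrative values only):
# Minimal reproduction: join() refuses non-string items such as NaN (a float)
','.join(['Parkwoods', float('nan')])
# TypeError: sequence item 1: expected str instance, float found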
Replace your last part of code with this:
The filling of the NaN values is done with np.where, following the logic you specified in your comment (where the Neighbourhood is missing, use the Borough):
# If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough
toronto_df.Neighbourhood = np.where(toronto_df.Neighbourhood.isnull(),toronto_df.Borough,toronto_df.Neighbourhood)
toronto_df['Neighbourhood'] = toronto_df.groupby(['Postcode','Borough'])['Neighbourhood'].agg(lambda x: ','.join(x))
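As a side note, the first of those two lines could also be written with fillna, which states the intent directly (an equivalent sketch under the same assumptions):
# Where Neighbourhood is NaN, fall back to the Borough value
toronto_df['Neighbourhood'] = toronto_df['Neighbourhood'].fillna(toronto_df['Borough'])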