I'm attempting to create a word cloud based on text scraped from a specific website. The issue I'm having is with the web scraping portion. I've attempted two different approaches, and in both I get stuck on how to proceed.
First method:
Scrape the data for each specific tag into its own data frame
main_content= soup.find("div", attrs= {"class" : "col-md-4"})
main_content2= soup.find("article", attrs= {"class" : "col-lg-7 mid_info"})
comp_service= soup.find("div", attrs= {"class" : "col-md-6 col-lg-4"})
Here I'm stuck on how to combine the three results in order to create the word cloud. This works fine if I use only one of them and assign it to 'lists', but I'm unsure how to combine the other two into a single object so I can run the rest of the code. The following is the rest of the code for the word cloud portion:
str = ""
for list in lists:
    info = list.text
    str += info
mask = np.array(Image.open("Desktop/big.png"))
color = ImageColorGenerator(mask)
wordcloud = WordCloud(width=1200, height=1000,
                      max_words=400, mask=mask,
                      stopwords=STOPWORDS,
                      background_color="white",
                      random_state=42).generate(str)
plt.imshow(wordcloud.recolor(color_func=color), interpolation="bilinear")
plt.axis("off")
plt.show()
Attempt 2
I found a piece of code that will extract all of the data from specific tags and print it as text:
for lists in soup.find_all(['article', 'div']):
    print(lists.text)
However, when I attempt to run the rest of the code,
mask = np.array(Image.open("Desktop/big.png"))
color = ImageColorGenerator(mask)
wordcloud = WordCloud(width=1200, height=1000,
                      max_words=400, mask=mask,
                      stopwords=STOPWORDS,
                      background_color="white",
                      random_state=42).generate(str)
plt.imshow(wordcloud.recolor(color_func=color), interpolation="bilinear")
plt.axis("off")
plt.show()
I get 'ValueError: We need at least 1 word to plot a word cloud, got 0.' after running the word cloud code.
I'm essentially just trying to pull all of the data from a website, store that information into a text file, then transform that data into a word cloud.
Please let me know any suggestions or clarifications I can provide.
Thank you.
This ended up working for me:
lists = soup.find_all(['article', 'div'])
str = ""
for list in lists:
    info = list.text
    str += info
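As an aside, if you only want the three specific sections from the first attempt rather than every article and div, a minimal sketch (reusing main_content, main_content2 and comp_service from the question) is to collect their text into one string and pass that to generate():

sections = [main_content, main_content2, comp_service]
text = ""
for section in sections:
    if section is not None:  # soup.find() returns None when a tag is not found
        text += section.text + " "
wordcloud = WordCloud(width=1200, height=1000, max_words=400, mask=mask,
                      stopwords=STOPWORDS, background_color="white",
                      random_state=42).generate(text)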
I am completely new to Jupyter Notebook, Python, web scraping and the like. I have looked at different answers, but none seem to have the same problem (and I am not good at taking "a similar" approach and changing it a bit so I can use it for my purpose).
I want to create a data grid with all existing HTML tags. As the source I am using the MDN docs. Getting all the tags with Beautiful Soup works fine, but I struggle to go any further with this data.
Here is the code for fetching the data with Beautiful Soup:
from bs4 import BeautifulSoup
import requests

url = "https://developer.mozilla.org/en-US/docs/Web/HTML/Element"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

get_nav_tag = soup.find("nav", class_="sidebar-inner")
get_second_div = get_nav_tag.find_all("div")[2]
get_ol = get_second_div.find("ol")
get_li = get_second_div.find_all("li", class_="toggle")[3]
tag_list = get_li.find_all("code")

print("There are currently", len(tag_list), "tags.")
for tags in tag_list:
    print(tags.text)
The list is already sorted.
Now I use pandas to create a dataframe:
import pandas as pd

tag_data = []
for tag in tag_list:
    tag_data.append({"Tags": tag.text})

df = pd.DataFrame(tag_data)
df
The output is a single-column dataframe with each tag listed in a "Tags" column.
QUESTION
How do I create a dataframe with a column for each starting letter, with the elements listed under the corresponding column?
Like:
   A          B        C
1  <a>        <b>      <caption>
2  <abbr>     <body>   <code>
3  <article>  ..       ...
4  ...        ...      ...
How do I split this list into several lists corresponding to each element's first letter? I guess I will need that for further interactions as well, such as creating graphs, e.g. a bar chart showing how many tags start with "a", "b" and so on.
Thank you!
The code below should do the job.
df['first_letter'] = df.Tags.str[1]

tag_matrix = pd.DataFrame()
for letter in df.first_letter.unique():
    # Create a pandas Series whose name matches the first letter and contains the tags starting with that letter
    matching_tags = pd.Series(df[df.first_letter == letter].reset_index(drop=True).Tags, name=letter)
    # Append the series to the tag_matrix
    tag_matrix = pd.concat([tag_matrix, matching_tags], axis=1)

tag_matrix
Here's a sample of the output:
Note that you might want to do some additional cleaning, such as dropping duplicate tags or converting to lower case.
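If you want that cleaning step, a minimal sketch (applied to df before building tag_matrix) could be:

df['Tags'] = df['Tags'].str.lower()        # normalise case
df = df.drop_duplicates(subset='Tags')     # drop duplicate tags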
You can use the pivot and concat methods to achieve this:
df["letter"] = df["Tags"].str[1].str.upper()
df = df.pivot(columns="letter", values="Tags")
df = pd.concat([df[c].dropna().reset_index(drop=True) for c in df.columns], axis=1)
This gives a dataframe with one column per letter and the matching tags listed under each column.
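For the bar chart mentioned in the question (how many tags start with each letter), a rough sketch along the same lines, assuming the original single-column df with its "Tags" column (before the pivot), could be:

import matplotlib.pyplot as plt

counts = df["Tags"].str[1].str.upper().value_counts().sort_index()
counts.plot(kind="bar", title="Number of HTML tags per first letter")
plt.show()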
I am scraping data with Python. I get a csv file and can split it into columns in Excel later. But I am encountering an issue I have not been able to solve: sometimes the scraped items have two statuses and sometimes just one. The second status shifts the other values in the columns to the right, so the dates do not all end up in the same column, which would be useful for sorting the rows.
Do you have any idea how to make the columns merge if there are two statuses, for example, or other solutions?
Maybe it is also an issue that I still need to separate the values into columns manually in Excel.
Here is my code:
# call packages
import random
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import pandas as pd

# define driver etc.
service_obj = Service("C:\\Users\\joerg\\PycharmProjects\\dynamic2\\chromedriver.exe")
browser = webdriver.Chrome(service=service_obj)

# create loop
initiative_list = []
for i in range(0, 2):
    url = 'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page=' + str(i)
    browser.get(url)
    time.sleep(random.randint(5, 10))
    initiative_item = browser.find_elements(By.CSS_SELECTOR, "initivative-item")
    initiatives = [item.text for item in initiative_item]
    initiative_list.extend(initiatives)

df = pd.DataFrame(initiative_list)

# create csv
print(df)
df.to_csv('Initiativen.csv')

df.columns = ['tosplit']
new_df = df['tosplit'].str.split('\n', expand=True)
print(new_df)
new_df.to_csv('Initiativennew.csv')
I tried to merge the columns if there are two statuses.
"make the columns merge if there are two statuses for example or other solutions"
If by "statuses" you mean the yellow labels ending in OPEN/UPCOMING/etc., then it should be taken care of by the following parts of getDetails_iiaRow (below the dividing line):
labels = cssSelect(iiaEl, 'div.field span.label')
and then
'labels': ', '.join([l.text.strip() for l in labels])
So, multiple labels will be separated by commas (or any other separator you use with .join).
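For illustration only (the label texts here are made up), two labels end up as a single comma-separated cell:

labels_text = ['OPEN', 'UPCOMING']  # hypothetical label texts
print(', '.join(labels_text))       # prints: OPEN, UPCOMING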
initiative_item = browser.find_elements(By.CSS_SELECTOR, "initivative-item")
initiatives = [item.text for item in initiative_item]
Instead of doing it like this and then having to split and clean things, you should consider extracting each item in a more specific manner and have each "row" be represented as a dictionary (with the column-names as the keys, so nothing gets mis-aligned later). If you wrap it as a function:
def cssSelect(el, sel): return el.find_elements(By.CSS_SELECTOR, sel)

def getDetails_iiaRow(iiaEl):
    title = cssSelect(iiaEl, 'div.search-result-title')
    labels = cssSelect(iiaEl, 'div.field span.label')
    iiarDets = {
        'title': title[0].text.strip() if title else None,
        'labels': ', '.join([l.text.strip() for l in labels])
    }

    cvSel = 'div[translate]+div:last-child'
    for c in cssSelect(iiaEl, f'div:has(>{cvSel})'):
        colName = cssSelect(c, 'div[translate]')[0].text.strip()
        iiarDets[colName] = cssSelect(c, cvSel)[0].text.strip()

    link = iiaEl.get_attribute('href')
    if link[:1] == '/':
        link = f'https://ec.europa.eu{link}'
    iiarDets['link'] = link
    return iiarDets
then you can simply loop through the pages like:
initiative_list = []
for i in range(0, 2):
    url = f'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page={i}'
    browser.get(url)
    time.sleep(random.randint(5, 10))
    initiative_list += [
        getDetails_iiaRow(iia) for iia in
        cssSelect(browser, 'initivative-item>article>a')
    ]
and since it's all cleaned already, you can directly save the data with:
pd.DataFrame(initiative_list).to_csv('Initiativen.csv', index=False)
The output I got for the first 3 pages looks like:
I think it is worth working a little bit harder to get your data rationalised before putting it in the csv rather than trying to unpick the damage once ragged data has been exported.
A quick look at each record in the page suggests that there are five main items that you want to export and these correspond to the five top-level divs in the a element.
The complexity (as you note) comes because there are sometimes two statuses specified, and in that case there is sometimes a separate date range for each and sometimes a single date range.
I have therefore chosen to put the three ever-present fields as the first three columns, followed by the status + date range columns as pairs. Finally I have removed the field names (these should effectively become the column headings) to leave only the variable data in the rows.
def processDiv(item):
    divs = item.find_elements(By.XPATH, "./article/a/div")
    if "\n" in divs[0].text:
        statuses = divs[0].text.split("\n")
        if len(divs) > 5:
            # two statuses, each with its own date range
            return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1],
                    statuses[0], divs[4].text.split("\n")[1],
                    statuses[1], divs[5].text.split("\n")[1]]
        else:
            # two statuses sharing a single date range
            return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1],
                    statuses[0], divs[4].text.split("\n")[1],
                    statuses[1], divs[4].text.split("\n")[1]]
    else:
        # single status with a single date range
        return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1],
                divs[0].text, divs[4].text.split("\n")[1]]

# inside your loop, replace the old list comprehension with:
initiatives = [processDiv(item) for item in initiative_item]
The above approach sticks as close to yours as I can. You will clearly need to rework the pandas code to reflect the slightly altered data structure.
Personally, I would invest even more time in clearly identifying the best definitions for the fields that represent each piece of data you wish to retrieve (rather than simply as divs 0-5), and extract the text directly from them (rather than messing around with split). In this way you are far more likely to create robust code that can be maintained over time (perhaps not your goal).
I am trying to scrape data from a webpage that contains a table to then put it in a pandas data frame.
I believe I have done everything correctly, but I get repeated columns...
Here is my code:
import requests
import pandas as pd
from bs4 import BeautifulSoup

html_content = requests.get('http://timetables.itsligo.ie:81/reporting/individual;student+set;id;SG_KAPPL_H08%2FF%2FY1%2F1%2F%28A%29%0D%0A?t=student+set+individual&days=1-5&weeks=29&periods=3-20&template=student+set+individual').text
soup = BeautifulSoup(html_content, 'html.parser')

all_tables = soup.find_all('table')
wanted_table = all_tables[6]

first_tr = wanted_table.find('tr')
following_tr = first_tr.find_next_siblings()

details = []
for tr in following_tr:
    prepare = []
    for td in tr.find_all('td'):
        prepare.append(td.text)
    details.append(prepare)

df = pd.DataFrame(details)
pd.set_option('display.max_columns', None)
display(df)
This works, but as you can see in the picture below (columns 1 and 2 in row 0), I'm getting repeated td's, and one of them always has \n repeated.
The thing I noticed is that the details list comes back doubled for some reason; maybe there is a table nested in a table?
I'm doing this in Jupyter, by the way.
Thank you in advance!
The reason your details list is nested is that you are constructing it that way: if you append a list (prepare) to another list (details), you get a nested list. And this is okay, since it works well when read into your DataFrame.
Still, you are correct that there is a nested table thing going on in the HTML. I won't try to format the HTML here, but each box in the schedule is a <td> within the overarching wanted_table. When there is a course in one of those cells, there is another <table> used to hold the details. So the class name, professor, etc. are more <td> elements within this nested <table>. So when finding all the cells (tr.find_all('td')), you encounter both the larger class box, as well as its nested elements. And when you get the .text on the outermost <td>, you also get the text from the innermost cells, hence the repetition.
I am not sure if this is the best way, but one option would be to prevent the search from entering the nested table, using the recursive parameter in find_all.
# all your other code above
for tr in following_tr:
    prepare = []
    for td in tr.find_all('td', recursive=False):
        prepare.append(td.text)
    details.append(prepare)

df = pd.DataFrame(details)
The above should prevent the repeated elements from appearing. However, there is still the problem of having many \n characters, as well as not accounting for the fact that some cells span multiple columns. You can start to fix the first by stripping the text. For the second, you can access the colspan attribute to pad the prepare list:
# all your other code above
for tr in following_tr:
    prepare = []
    for td in tr.find_all('td', recursive=False):
        text = td.text.strip()  # removes surrounding whitespace, including \n
        prepare += [text] + [None] * (int(td.get('colspan', 0)) - 1)
    details.append(prepare)

df = pd.DataFrame(details)
It's a little too unwieldy to post the output. And there is still formatting you will likely want to do, but that is getting beyond the scope of your original post. Hopefully something in here helps!
import pandas as pd
url = 'https://en.wikipedia.org/wiki/The_World%27s_Billionaires'
df_list = pd.read_html(url)
len(df_list)
Output:
32
After specifying na_values, as below:
pd.read_html(
    url,
    na_values=["Forbes: The World's Billionaires website"]
)[0].tail()
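For context, na_values just tells read_html to treat that exact footer string as missing data. Assuming the footer spans the whole last row of the first table, it can then be dropped, roughly like this:

import pandas as pd

url = 'https://en.wikipedia.org/wiki/The_World%27s_Billionaires'
df = pd.read_html(url, na_values=["Forbes: The World's Billionaires website"])[0]
df = df.dropna(how='all')  # drop rows where every cell is NaN, e.g. the footer row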
I am currently working on a project; the idea is to extract tweets (with geo enabled) from a hashtag and to plot a map (with Folium). On the map, I should have markers at the user locations, and when I click a marker I should see the text of the tweet. But currently I only have the map and the markers.
This is my code:
import pandas as pd
import folium, ast

locations = pd.read_csv('tweets.csv', usecols=[3]).dropna()
l_locations = []
for loc in locations.values:
    l_locations.append(ast.literal_eval(loc[0])['coordinates'])

print_tweet_map = folium.Map(location=[48.864716, 2.349014], zoom_start=8, tiles='Mapbox Bright')

for geo in l_locations:
    folium.Marker(location=geo).add_to(print_tweet_map)

print_tweet_map.save('index.html')
Can you guys help me print the markers together with the text details of the tweets?
Thanks in advance.
PS: here are some lines of the csv file:
created_at,user,text,geo
2017-09-30 15:28:56,yanceustie,"RT #ChelseaFC: .#AlvaroMorata and #marcosalonso03 have been checking out the pitch ahead of kick-off..., null
2017-09-30 15:48:18,trendinaliaVN,#CHEMCI just started trending with 17632 tweets. More trends at ... #trndnl,"{'type': 'Point', 'coordinates': [21.0285, 105.8048]}"
Examine read_csv closely and then use it to retrieve the tweet text as well. Reading the folium documentation, the popup parameter seems the most relevant place to put each tweet's text.
Also, you iterate over the same things twice. You can reduce that to a single loop that places each tweet on the map (think of the map as the empty list you were appending to). No need to be overly sequential.
import pandas as pd
import folium, ast

# the csv header already names the columns; column 2 is 'text' and column 3 is 'geo'
frame = pd.read_csv('tweets.csv', usecols=[2, 3]).dropna()

print_tweet_map = folium.Map(location=[48.864716, 2.349014], zoom_start=8, tiles='Mapbox Bright')

for index, item in frame.iterrows():
    geo = ast.literal_eval(item.get('geo'))['coordinates']
    text = item.get('text')
    folium.Marker(location=geo, popup=text).add_to(print_tweet_map)

print_tweet_map.save('index.html')
This should work, or very close to work, but I don't have a proper computer handy for testing.
I am trying to extract the data from a table on a webpage, but keep receiving an AttributeError saying that a ResultSet object has no attribute 'find_all'. I have looked at the examples on this site, as well as others, but none deal directly with my problem. Please see the code below:
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "lxml")

table = soup.find_all('table', class_='dataframe')
rows = table.find_all('tr')[2:]

data = {
    'RK': [],
    'PLAYER': [],
    'TEAM': [],
    'GP': [],
    'G': [],
    'A': [],
    'PTS': [],
    '+/-': [],
    'PIM': [],
    'PTS/G': [],
    'SOG': [],
    'PCT': [],
    'GWG': [],
    'G1': [],
    'A1': [],
    'G2': [],
    'A2': []
}

for row in rows:
    cols = row.find_all('td')
    data['RK'].append(cols[0].get_text())
    data['PLAYER'].append(cols[1].get_text())
    data['TEAM'].append(cols[2].get_text())
    data['GP'].append(cols[3].get_text())
    data['G'].append(cols[4].get_text())
    data['A'].append(cols[5].get_text())
    data['PTS'].append(cols[6].get_text())
    data['+/-'].append(cols[7].get_text())
    data['PIM'].append(cols[8].get_text())
    data['PTS/G'].append(cols[9].get_text())
    data['SOG'].append(cols[10].get_text())
    data['PCT'].append(cols[11].get_text())
    data['GWG'].append(cols[12].get_text())
    data['G1'].append(cols[13].get_text())
    data['A1'].append(cols[14].get_text())
    data['G2'].append(cols[15].get_text())
    data['A2'].append(cols[16].get_text())

df = pd.DataFrame(data)
df.to_csv("NHL_Players_Stats.csv")
I eradicated the error, once I saw that it referred to the table (i.e. the ResultSet) not having the method find_all, and got the code running by commenting out the following line:
#rows = table.find_all('tr')[2:]
and changing this:
for row in rows:
This, however, does not extract any data from the webpage and simply creates a .csv file with column headers.
I have tried to extract some data directly into rows using soup.find_all, but I get the following error:
data['GP'].append( cols[3].get_text() )
IndexError: list index out of range
which I have not been able to resolve.
Therefore, any help would be very much appreciated.
Also, out of curiosity, are there any ways to achieve the desired outcome using:
dataframe = pd.read_html('url')
because I have tried this also, but keep getting:
FeatureNotFound: Couldn't find a tree builder with the features you
requested: html5lib. Do you need to install a parser library?
Ideally this is the method I would prefer, but I can't find any examples online.
find_all returns a ResultSet, which is basically a list of elements. For this reason, it has no method find_all, as this is a method that belongs to an individual element.
If you only want one table, use find instead of find_all to look for it.
table = soup.find('table', class_='dataframe')
Then, getting its rows should work as you have already done:
rows = table.find_all('tr')[2:]
The second error you got is because, for some reason, one of the table's rows seems to have only 3 cells, thus your cols variable became a list with only indexes 0, 1 and 2. That's why cols[3] gives you an IndexError.
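A minimal way to guard against such short rows (a sketch, assuming you simply want to skip them) would be:

for row in rows:
    cols = row.find_all('td')
    if len(cols) < 17:  # skip header/separator rows that have fewer cells than the 17 stat columns
        continue
    data['RK'].append(cols[0].get_text())
    # ...and so on for the remaining columns, as in the original loop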
In terms of achieving the same outcome using:
dataframe = pd.read_html('url')
It is achieved using just that or similar:
dataframe = pd.read_html(url, header=1, index_col=None)
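Note that pd.read_html returns a list of DataFrames (one per table it finds on the page), so in practice you will usually index into the result, for example:

dataframe = pd.read_html(url, header=1, index_col=None)[0]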
The reason I was receiving errors previously is that I had not set Spyder's IPython console backend to 'automatic' in 'Preferences'.
I am still, however, trying to resolve this problem using BeautifulSoup. So any useful comments would be appreciated.