Trying to scrape .md file into a csv - python

I'm trying to scrape a GitHub .md file. I have made a Python scraper but I'm kind of stuck on how to actually get the data I want. The page has a long list of job listings; they are all in separate LI elements. I want to get the A elements. After each A element there is plain text separated by |, and I want to scrape that as well. I really want this to end up as a CSV file with the A tag text as a column, the location text before the | as a column, and the remaining description text as a column.
Here's my code:
from bs4 import BeautifulSoup
import requests
import json

def getLinkData(link):
    return requests.get(link).content

content = getLinkData('https://github.com/poteto/hiring-without-whiteboards/blob/master/README.md')
soup = BeautifulSoup(content, 'html.parser')

ul = soup.find_all('ul')
li = soup.find_all("li")
data = []
for uls in ul:
    rows = uls.find_all('a')
    data.append(rows)

print(data)
When I run this I get the A tags, but obviously not the rest yet. There seem to be a few other UL elements that are included. I just want the one with all the job LIs, but neither the LIs nor the UL have any ids or classes. Any suggestions on how to accomplish what I want? Maybe add pandas into this (not sure how)?

import requests
import pandas as pd
import numpy as np

url = 'https://raw.githubusercontent.com/poteto/hiring-without-whiteboards/master/README.md'
res = requests.get(url).text

# keep only the job list between the first letter heading and the "Also see" section
jobs = res.split('## A - C\n\n')[1].split('\n\n## Also see')[0]
# every job line starts with "- ["; strip that prefix
jobs = [j[3:] for j in jobs.split('\n') if j.startswith('- [')]

df = pd.DataFrame(columns=['Company', 'URL', 'Location', 'Info'])
for i, job in enumerate(jobs):
    company, rest = job.split(']', 1)    # "Company](url) | ..." -> company name
    url, rest = rest[1:].split(')', 1)   # the link target
    rest = rest.split(' | ')             # "", location, optional info
    if len(rest) == 3:
        _, location, info = rest
    else:
        _, location = rest
        info = np.NaN
    df.loc[i, :] = (company, url, location, info)

df.to_csv('file.csv')
print(df.head())
prints
index | Company | URL | Location | Info
0 | Able | https://able.co/careers | Lima, PE / Remote | Coding interview, Technical interview (Backlog Refinement + System Design), Leadership interview (Behavioural)
1 | Abstract | https://angel.co/abstract/jobs | San Francisco, CA | NaN
2 | Accenture | https://www.accenture.com/us-en/careers | San Francisco, CA / Los Angeles, CA / New York, NY / Kuala Lumpur, Malaysia | Technical phone discussion with architecture manager, followed by behavioral interview focusing on soft skills
3 | Accredible | https://www.accredible.com/careers | Cambridge, UK / San Francisco, CA / Remote | Take home project, then a pair-programming and discussion onsite / Skype round.
4 | Acko | https://acko.com | Mumbai, India | Phone interview, followed by a small take home problem. Finally a F2F or skype pair programming session

import requests
from bs4 import BeautifulSoup
import pandas as pd
from itertools import chain

def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    # take only the <ul> lists that directly follow the alphabetical <h2> headings
    goal = list(chain.from_iterable([[(
        i['href'],
        i.get_text(strip=True),
        *(i.next_sibling[3:].split(' | ', 1) if i.next_sibling else [''] * 2))
        for i in x.select('a')] for x in soup.select(
        'h2[dir=auto] + ul', limit=9)]))
    df = pd.DataFrame(goal)
    df.to_csv('data.csv', index=False)

main('https://github.com/poteto/hiring-without-whiteboards/blob/master/README.md')
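
If you want named columns in the CSV (Company, Location, etc., as asked), you can label the frame when it is built; a small tweak, assuming the tuple order produced above is (href, link text, location, description):

df = pd.DataFrame(goal, columns=['URL', 'Company', 'Location', 'Info'])
df.to_csv('data.csv', index=False)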

Related

InvalidSchema: No connection adapters were found for "link"?

I have a dataset with multiple links and I'm trying to get the text of all the links using the code below, but I'm getting an error message: "InvalidSchema: No connection adapters were found for "'https://en.wikipedia.org/wiki/Wagner_Group'"".
Dataset:
links
'https://en.wikipedia.org/wiki/Wagner_Group'
'https://en.wikipedia.org/wiki/Vladimir_Putin'
'https://en.wikipedia.org/wiki/Islam_in_Russia'
The code I'm using to web-scrape is:
def get_data(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    text = ""
    for paragraph in soup.find_all('p'):
        text += paragraph.text
    return text

# works fine
url = 'https://en.wikipedia.org/wiki/M142_HIMARS'
get_data(url)

# Doesn't work
df['links'].apply(get_data)
Error: InvalidSchema: No connection adapters were found for "'https://en.wikipedia.org/wiki/Wagner_Group'"
Thank you in advance
It works just fine when I apply it to a single URL, but it doesn't work when I apply it to the dataframe.
The error message shows the URL wrapped in an extra pair of quotes ("'https://...'"), which means the strings stored in the links column contain literal quote characters, so requests cannot find a connection adapter for them; the apply itself is fine.
You can try one of the following working approaches:
Example:
import requests
from bs4 import BeautifulSoup
import pandas as pd

links = [
    'https://en.wikipedia.org/wiki/Wagner_Group',
    'https://en.wikipedia.org/wiki/Vladimir_Putin',
    'https://en.wikipedia.org/wiki/Islam_in_Russia']

data = []
for url in links:
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'lxml')
    for pra in soup.select('div[class="mw-parser-output"] > table~p'):
        paragraph = pra.get_text(strip=True)
        data.append({
            'paragraph': paragraph
        })

# print(data)
df = pd.DataFrame(data)
print(df)
Output:
paragraph
0 TheWagner Group(Russian:Группа Вагнера,romaniz...
1 The group came to global prominence during the...
2 Because it often operates in support of Russia...
3 The Wagner Group first appeared in Ukraine in ...
4 The Wagner Group itself was first active in 20...
.. ...
440 A record 18,000 Russian Muslim pilgrims from a...
441 For centuries, theTatarsconstituted the only M...
442 A survey published in 2019 by thePew Research ...
443 Percentage of Muslims in Russia by region:
444 According to the 2010 Russian census, Moscow h...
[445 rows x 1 columns]
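
If you would rather keep the apply-based approach from the question, stripping the stray quote characters from the links column first should also work; a minimal sketch, assuming df['links'] holds the quoted strings shown in the dataset:

# strip the literal quotes around each URL, then fetch the text as before
df['text'] = df['links'].str.strip("'").apply(get_data)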

Why am I getting a block of unformatted text rather than individual columns?

I've been trying to format terminal output in line with this stack.
Yet for some reason, the only method that works is the use of columnar but that's limited to showing only 3 rows of text.
I've tried almost all of the methods and yet I almost always get an output that looks like this:
[Payne, Roberts and Davis, Vasquez-Davidson, Jackson, Chambers and Levy, Savage-Bradley, Ramirez Inc, Rogers-Yates, Kramer-Klein, Meyers-Johnson, Hughes-Williams, Jones, Williams and Villa, Garcia PLC, Gregory and Sons, Clark, Garcia and Sosa, Bush PLC, Salazar-Meyers, Parker, Murphy and Brooks, Cruz-Brown, Macdonald-Ferguson, Williams, Peterson and Rojas, Smith and Sons, Moss, Duncan]
I've been trying to learn how to scrape a website and display the output in a readable format.
import requests
from bs4 import BeautifulSoup
from columnar import columnar
import numpy as np

URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="ResultsContainer")
job_elements = results.find_all("div", class_="card-content")

title = []
company = []
location = []
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    formatted_title_element = title_element.text.strip()
    formatted_company_element = company_element.text.strip()
    formatted_location_element = location_element.text.strip()
    title.append(formatted_title_element)
    company.append(formatted_company_element)
    location.append(formatted_location_element)

data = []
data.append(title)
data.append(company)
data.append(location)

headers = ['Title', 'Company', 'Location']
table = columnar(data, headers, no_borders=True)
print(table)
The columnar solution above is the only one from that thread that doesn't automatically produce output like the example at the top of the question, but again, it only outputs 3 lines. Columnar does have a head=x parameter which is meant to show x rows, but when I use it I get the same output as in the example at the very top.
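
For what it's worth, columnar expects data to be a list of rows, one inner list per row. The code above appends three long lists (all titles, all companies, all locations), so columnar prints exactly three very wide rows, which matches the output described. Transposing the lists into one row per job should give the expected table; a minimal sketch reusing the title, company and location lists already built above:

# one [title, company, location] row per job instead of three huge column lists
data = [list(row) for row in zip(title, company, location)]

headers = ['Title', 'Company', 'Location']
table = columnar(data, headers, no_borders=True)
print(table)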

How do I get two DIV's text, so that it becomes a table using BeautifulSoup in Python?

How can I iterate through the links, access the specific divs' content on each page, and build something like a table, using Python?
I've come this far (only), but the output is not right:
from bs4 import BeautifulSoup
import urllib3

http = urllib3.PoolManager()
base_url = 'http://www.warrencountyschools.org'
url = 'https://www.warrencountyschools.org/district_staff.aspx?action=search&location=29&department=0'
response = http.request('GET', url)
soup = BeautifulSoup(response.data)

# the second tr in the table - index starts at 0
table = soup.find('table', {'class': 'content staff-table'})
rows = table.findAll('tr')

fieldContent = []
for tr in rows:
    cols = tr.findAll('td')
    if len(cols) >= 3:
        link = cols[2].find('a').get('href')
        abs_link = base_url + link
        profileURL = abs_link
        profilePagResp = http.request('GET', profileURL)
        soup2 = BeautifulSoup(profilePagResp.data)
        flDiv = soup2.findAll('div', {'class', 'field-label'})
        fcDiv = soup2.find('div', {'class', 'field-content'})
        for fl in flDiv:
            fieldContent.append(fcDiv.text)

print(fieldContent)
The output now consists of each name repeated as many times as the loop iterates, while it should be like this:
Name | Email | Website | Phone | Buildings
SomeName | email# | wwww. | 78978978 | SomeBuildin
@Antonio Santos, the profile data aren't all in the same order, so you can only grab the data as follows:
Script
from bs4 import BeautifulSoup
import requests
import pandas as pd

base_url = 'http://www.warrencountyschools.org'
url = 'https://www.warrencountyschools.org/district_staff.aspx?action=search&location=29&department=0'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# the second tr in the table - index starts at 0
table = soup.find('table', {'class': 'content staff-table'})
rows = table.findAll('tr')

for tr in rows:
    cols = tr.findAll('td')
    if len(cols) >= 3:
        link = cols[2].find('a').get('href')
        abs_link = base_url + link
        print(abs_link)
        final_page = requests.get(abs_link)
        soup2 = BeautifulSoup(final_page.text, 'html.parser')
        profile_data = [x.get_text(strip=True) for x in soup2.findAll("div", "field-content")]
        print(profile_data)
Output:
http://www.warrencountyschools.org/staff/13650
['Greg Blewett', 'Greg.blewett#warren.kyschools.us', 'Access Staff Website', '270-746-7205', 'Greg Blewett - Construction-Carpentry - Warren County Area Technology Center']
http://www.warrencountyschools.org/staff/25689
['Adrian Boggess', 'Staff', 'adrian.boggess#warren.kyschools.us', 'Tike Barton - Computerized Manufacturing and Machining - Warren County Area Technology Center']
http://www.warrencountyschools.org/staff/2403
['Kim Coomer', 'Teacher', 'kim.coomer#warren.kyschools.us', '270-746-7205', 'Kim Coomer - Career Specialist - Warren County Area Technology Center']
http://www.warrencountyschools.org/staff/13651
['Rex Cundiff', 'Rex.cundiff#warren.kyschools.us', 'Access Staff Website', '270-746-7205', 'Rex Cundiff - Welding - Warren County Area Technology Center']
http://www.warrencountyschools.org/staff/13652
['Susan Devore', 'Susan.devore#warren.kyschools.us', 'Access Staff Website', '270-746-7205', 'Susan Devore - Information Technology - Warren County Area Technology Center']
http://www.warrencountyschools.org/staff/13666
['Michael Emberton', 'michael.emberton#warren.kyschools.us', 'Access Staff Website', 'Micheal Emberton - Automotive - Warren County Area Technology Center']
http://www.warrencountyschools.org/staff/25684
['Jacob Hildebrant', 'Staff', 'jacob.hildebrant#warren.kyschools.us', 'Greg Blewett - Construction-Carpentry - Warren County Area Technology Center']
http://www.warrencountyschools.org/staff/25346
['Jeton Hyseni', 'Staff', 'Jeton.Hyseni#warren.kyschools.us', 'Administrative Assistant - Warren County Area Technology Center']
http://www.warrencountyschools.org/staff/25041
['Jesse Muse', 'Staff', 'jesse.muse#warren.kyschools.us', 'Tike Barton - Computerized Manufacturing and Machining - Warren County Area Technology Center']
http://www.warrencountyschools.org/staff/2560
['Chris Riggs', 'Staff', 'chris.riggs#warren.kyschools.us', '467-7500', 'Administrative Assistant - Warren County Area Technology Center']
http://www.warrencountyschools.org/staff/24757
['Allison Runner', 'Staff', 'allison.runner#warren.kyschools.us', 'Administrative Assistant - Warren County Area Technology Center']
http://www.warrencountyschools.org/staff/25881
['Jacob Thomas', 'Staff', 'jacob.thomas#warren.kyschools.us', 'Greg Blewett - Construction-Carpentry - Warren County Area Technology Center']
http://www.warrencountyschools.org/staff/25880
['Brooke Weakly', 'Staff', 'brooke.bruington#warren.kyschools.us', 'Administrative Assistant - Warren County Area Technology Center']
How do I get two DIV's text, so that it becomes a table ...
Your approach to collecting the data is already very good, but to form key/value pairs that can be transferred into a table (via pandas, for example), we have to consider the following points:
Prepare the headers
To extract the text we use a list comprehension and remove the trailing colons via slicing, to get clean results.
keys = [x.text[:-1] for x in soup2.find_all('div', {'class', 'field-label'})]
Note: In newer BeautifulSoup versions the method findAll() was renamed to find_all(), so it is better to use the current syntax in new code.
Prepare the content and glue it together
Extract the contents in the same way as the headers and combine them into a dict via zip().
profile = dict(tuple(zip(keys,[x.get_text(strip=True) for x in soup2.find_all('div', {'class', 'field-content'})])))
Note This is the crucial point in order to be able to map the contents in the correct columns.
Adjustments to the website value
Since the website URL is not visible text (you won't get it with the text or get_text() methods) but sits in the href of the <a>, we have to do a separate check and take the URL if it exists.
if (website := soup.select_one('a:-soup-contains("Access Staff Website")')):
    profile['Website'] = base_url + website['href']
else:
    profile['Website'] = ''
Store profiles and create the table
Last but not least, we add the dict to our result list and can transfer it via pandas into a data frame. Using fillna() we can determine what should be in all empty cells and to_csv() saves the data frame as a csv file.
fieldContent.append(profile)
pd.DataFrame(fieldContent).fillna('')#.to_csv('profile.csv', index=False)
Note: In pandas you can determine which columns should appear in the output, sort, manipulate the data, and so on.
Example
The complete example uses CSS selectors instead of the find() / find_all() methods because, in my opinion, you can select more precisely, but the result is the same.
from bs4 import BeautifulSoup
import pandas as pd
import urllib3

http = urllib3.PoolManager()
base_url = 'http://www.warrencountyschools.org'
url = 'https://www.warrencountyschools.org/district_staff.aspx?action=search&location=29&department=0'
response = http.request('GET', url)
soup = BeautifulSoup(response.data)

fieldContent = []

for a in soup.select('a:-soup-contains("[Profile]")'):
    profilePagResp = http.request('GET', base_url + a['href'])
    soup = BeautifulSoup(profilePagResp.data)
    keys = [x.text[:-1] for x in soup.select('.field-label')]
    profile = dict(tuple(zip(keys, [x.get_text(strip=True) for x in soup.select('.field-content')])))

    if (website := soup.select_one('a:-soup-contains("Access Staff Website")')):
        profile['Website'] = base_url + website['href']
    else:
        profile['Website'] = ''

    fieldContent.append(profile)

pd.DataFrame(fieldContent).fillna('')[['Name', 'Email', 'Website', 'Phone', 'Buildings']]#.to_csv('profile.csv', index=False)
Output
Name | Email | Website | Phone | Buildings
Greg Blewett | Greg.blewett#warren.kyschools.us | http://www.warrencountyschools.org/olc/13650 | 270-746-7205 | Greg Blewett - Construction-Carpentry - Warren County Area Technology Center
Adrian Boggess | adrian.boggess#warren.kyschools.us | | | Tike Barton - Computerized Manufacturing and Machining - Warren County Area Technology Center
Kim Coomer | kim.coomer#warren.kyschools.us | | 270-746-7205 | Kim Coomer - Career Specialist - Warren County Area Technology Center
Rex Cundiff | Rex.cundiff#warren.kyschools.us | http://www.warrencountyschools.org/olc/13651 | 270-746-7205 | Rex Cundiff - Welding - Warren County Area Technology Center
You could use an async library like trio, as this is more I/O bound since you will be awaiting responses for requests to individual staff pages. I have added a custom sort, based on last name, in an attempt to recreate the original result order. For larger result sets this might not match perfectly in case of ties; you might then extend it by adding a first-name sort. The additional sort column can be dropped.
There does seem to be a FIFO processing instruction within trio but I haven't explored that.
import pandas as pd
import httpx
import trio
from bs4 import BeautifulSoup

LINK = 'https://www.warrencountyschools.org/district_staff.aspx?action=search&location=29&department=0'
ALL_INFO = []

async def get_soup(content):
    return BeautifulSoup(content, 'lxml')

async def get_staff_info(link, nurse):
    async with httpx.AsyncClient(timeout=None) as client:
        r = await client.get(link)
        soup = await get_soup(r.text)
        info_items = ['Name', 'Email', 'Website', 'Phone', 'Buildings']
        staff_info = {}
        for key in info_items:
            try:
                if key == 'Website':
                    value = 'https://www.warrencountyschools.org' + soup.select_one(
                        f'.field-label:-soup-contains("{key}:") + .field-content > a')['href']
                else:
                    value = soup.select_one(
                        f'.field-label:-soup-contains("{key}:") + .field-content').text.strip()
            except:
                value = 'N/A'
            finally:
                staff_info[key.lower()] = value
        ALL_INFO.append(staff_info)

async def get_links(LINK, nurse):
    async with httpx.AsyncClient(timeout=None) as client:
        r = await client.get(LINK)
        soup = await get_soup(r.text)
        for x in soup.select('#ctl00_ctl00_MasterContent_ContentColumnRight_ctl01_dg_staff .staff-profile-button > a'):
            nurse.start_soon(
                get_staff_info, 'https://www.warrencountyschools.org' + x['href'], nurse)

async def main():
    async with trio.open_nursery() as nurse:
        nurse.start_soon(get_links, LINK, nurse)

if __name__ == "__main__":
    trio.run(main)
    df = pd.DataFrame(ALL_INFO)
    df['sort_value'] = [i.strip().split(' ')[-1] for i in df['name'].tolist()]
    df.sort_values(by=['sort_value'], ascending=True, inplace=True)
    # print(df)
    df.to_csv('staff.csv',
              encoding='utf-8-sig', index=False)
Output csv:
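
As noted above, the helper sort column can be dropped before writing; one way, assuming the df built in the snippet:

df.drop(columns=['sort_value']).to_csv('staff.csv', encoding='utf-8-sig', index=False)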

Avoid copying some content while scraping through pages

I have some difficulties in saving the results that I am scraping.
Please refer to this code (this code was slightly changed for my specific case):
import bs4, requests
import pandas as pd
import re
import time

headline = []
corpus = []
dates = []
tag = []

start = 1
url = "https://www.imolaoggi.it/category/cron/"

while True:
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'html')
    headlines = soup.find_all('h3')
    corpora = soup.find_all('p')
    dates = soup.find_all('time', attrs={'class': 'entry-date published updated'})
    tags = soup.find_all('span', attrs={'class': 'cat-links'})
    for t in headlines:
        headline.append(t.text)
    for s in corpora:
        corpus.append(s.text)
    for d in date:
        dates.append(d.text)
    for c in tags:
        tag.append(c.text)
    if soup.find_all('a', attrs={'class': 'page-numbers'}):
        url = f"https://www.imolaoggi.it/category/cron/page/{page}"
        page += 1
    else:
        break

Create dataframe
df = pd.DataFrame(list(zip(date, headline, tag, corpus)),
                  columns=['Date', 'Headlines', 'Tags', 'Corpus'])
I would like to save all the pages from this link. The code works, but it seems that on every page it writes two identical sentences for the corpus.
I think this is happening because of the tag I chose:
corpora=soup.find_all('p')
This causes a misalignment of rows in my dataframe, since the data are saved in lists and the corpus only starts being scraped correctly later compared to the others.
I hope you can help me understand how to fix it.
You were close, but your selectors were off, and you mis-named some of your variables.
I would use css selectors like this:
headline = []
corpus = []
date_list = []
tag_list = []

headlines = soup.select('h3.entry-title')
corpora = soup.select('div.entry-meta + p')
dates = soup.select('div.entry-meta span.posted-on')
tags = soup.select('span.cat-links')

for t in headlines:
    headline.append(t.text)
for s in corpora:
    corpus.append(s.text.strip())
for d in dates:
    date_list.append(d.text)
for c in tags:
    tag_list.append(c.text)

df = pd.DataFrame(list(zip(date_list, headline, tag_list, corpus)),
                  columns=['Date', 'Headlines', 'Tags', 'Corpus'])
df
Output:
Date Headlines Tags Corpus
0 30 Ottobre 2020 Roma: con spranga di ferro danneggia 50 auto i... CRONACA, NEWS Notte di vandalismi a Colli Albani dove un uom...
1 30 Ottobre 2020\n30 Ottobre 2020 Aggressione con machete: grave un 28enne, arre... CRONACA, NEWS Roma - Ha impugnato il suo machete e lo ha agi...
2 30 Ottobre 2020\n30 Ottobre 2020 Deep State e globalismo, Mons. Viganò scrive a... CRONACA, NEWS LETTERA APERTA\r\nAL PRESIDENTE DEGLI STATI UN...
3 30 Ottobre 2020 Meluzzi e Scandurra: “Sacrificare libertà per ... CRONACA, NEWS "Sacrificare la libertà per la sicurezza è un ...
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def main(req, num):
    r = req.get("https://www.imolaoggi.it/category/cron/page/{}/".format(num))
    soup = BeautifulSoup(r.content, 'html.parser')
    goal = [(x.time.text, x.h3.a.text, x.select_one("span.cat-links").get_text(strip=True), x.p.get_text(strip=True))
            for x in soup.select("div.entry-content")]
    return goal

with ThreadPoolExecutor(max_workers=30) as executor:
    with requests.Session() as req:
        fs = [executor.submit(main, req, num) for num in range(1, 2937)]
        allin = []
        for f in fs:
            allin.extend(f.result())
        df = pd.DataFrame.from_records(
            allin, columns=["Date", "Title", "Tags", "Content"])
        print(df)
        df.to_csv("result.csv", index=False)

Web scraping with Python/BeautifulSoup: Site with multiple links to profiles > needing profile contents

For my Master's thesis I want to send a questionnaire to as many people as possible in the field (Early Childhood Education), so my goal is to scrape emails of daycare centres (KiTa) from a public site. I am very new to Python, so while this seems trivial to most, it has proven to be quite a challenge for my level of knowledge. I'm also not familiar with the lingo, so I don't even know what I need to look for.
This is the site (German): https://www.kitanetz.de/
To get to the content I want, I first have to select a state ("Bundesland") and am directed to the next level, where I need to click "Kreise auflisten". That leads to the next level, where all the small counties within the state are listed. Every link opens another level of pages with postal codes and profile links. Some of those profiles have emails, some don't (I found tutorials, so that is no problem).
It took me two days to scrape postal codes and names of the centres from one of those pages. What do I need to do so Python can iterate through every state, every county and every profile to get to the links? If you know a resource or a keyword I should look for, that'd be a great next step. I also haven't tried to put the data from this code into a dataframe using pandas yet, but my other attempts didn't work.
This is my attempt so far. I added ## to my comments/questions in the code. # are comments from the tutorial:
import requests
from bs4 import BeautifulSoup

## Here's the tutorial I was following: https://www.pluralsight.com/guides/extracting-data-html-beautifulsoup

# Step 1: Sending a HTTP request to a URL
url = requests.get("https://www.kitanetz.de/bezirke/bezirke.php?land=Baden-W%C3%BCrttemberg&kreis=Alb-Donau-Kreis")

# Step 2: Parse the html content
soup = BeautifulSoup(url.text, 'lxml')
# print(soup.prettify()) # print the parsed data of html

# Step 3: Analyze the HTML tag, where your content lives
# Create a data dictionary to store the data.
data = {}  ## it says in the tutorial, but what does that actually do?

## Get the table inside the <div id="inhalt">
table = soup.find_all('table')[0]

## Get the data you want: PLZ, Name Kita (ids) and href to profiles
plz = table.find_all('td', attrs={"headers": "header2"})
ids = table.find_all('td', attrs={"headers": "header3"})
table_data = table.find_all("tr")  ## contains 101 rows. row [0] is header, using th tags. Rows [1]:[101] use td tags

for link in table.find_all("a"):
    print("Name: {}".format(link.text))
    print("href: {}".format(link.get("href")))

# Get the headers of the list
t_headers = []
for th in table.find_all("th"):
    # remove any newlines and extra spaces from left and right
    t_headers.append(th.text.replace('\n', ' ').strip())

# Get all the rows of table
table_data = []
for tr in table.find_all('tr'):  # find all tr's from table ## no, it doesn't
    t_row = {}
    # Each table row is stored in the form of
    ## t_row = {'.': '', 'PLZ': '', 'Name Kita': '', 'Alter': '', 'Profil': ''}
    ## we want: t_row = {'PLZ': '', 'Name Kita': '', 'EMail': ''}. Emails are stored in the hrefs -> next layer
    ## how do I get my plz, ids and hrefs in one dataframe? I'd know in R but this here works different.

    # find all td's(3) in tr and zip it with t_header
    for td, th in zip(tr.find_all("td"), t_headers):
        t_row[th] = td.text.replace('\n', '').strip()
    table_data.append(t_row)
You can use the site's sitemap.xml to get all links to profiles. When you have all links, then it's just simple parsing:
import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.kitanetz.de/sitemap.xml'
sitemap = BeautifulSoup(requests.get(url).content, 'html.parser')

r = re.compile(r'/\d+/[^/]+\.php')
for loc in sitemap.select('loc'):
    if r.search(loc.text):
        html_data = requests.get(loc.text).text
        soup = BeautifulSoup(html_data, 'html.parser')
        title = soup.h1.text
        email = re.search(r"ez='(.*?)'.*?ey='(.*?)'.*?ex='(.*?)'", html_data, flags=re.S)
        if email:
            email = email[1] + '#' + email[2] + '.' + email[3]
        else:
            email = '-'
        print('{:<60} {:<35} {}'.format(title, email, loc.text))
Prints:
Evangelisch-lutherische Kindertagessstätte Lemförde kts.lemfoerde#evlka.de https://www.kitanetz.de/niedersachsen/49448/stettiner-str-43b.php
Kindertagesstätte Stuhr I kiga.stuhr#stuhr.de https://www.kitanetz.de/niedersachsen/28816/stuhrer-landstrasse33a.php
Kita St. Bonifatius (Frankestraße) frankestr#kath-kita-wunstorf.de https://www.kitanetz.de/niedersachsen/31515/frankestrasse11.php
Ev. Kita Ketzin ektketzin.wagenschuetz#arcor.de https://www.kitanetz.de/brandenburg/14669/rathausstr17.php
Humanistische Kindertagesstätte `Die kleinen Strolche´ strolche#humanisten.de https://www.kitanetz.de/niedersachsen/30823/auf_der_horst115.php
Kindertagesstätte Idensen kita.idensen#wunstorf.de https://www.kitanetz.de/niedersachsen/31515/an_der_sigwardskirche2.php
Kindergroßtagespflege `Nesthäkchen´ nesthaekchen-isernhagen#gmx.de https://www.kitanetz.de/niedersachsen/30916/am_rathfeld4.php
Venhof Kindertagesstätte venhof#t-online.de https://www.kitanetz.de/niedersachsen/31515/schulstrasse14.php
Kindergarten Uetze `Buddelkiste´ buddelkiste#uetze.de https://www.kitanetz.de/niedersachsen/31311/eichendorffstrasse2b.php
Kita Lindenblüte m.herzog#lebenshilfe-dh.de https://www.kitanetz.de/niedersachsen/27232/lindern17.php
DRK Kita Luthe kita.luthe#drk-hannover.de https://www.kitanetz.de/niedersachsen/31515/an_der_boehmerke7.php
Freier Kindergarten Allerleirauh info#kindergarten-allerleirauh.de https://www.kitanetz.de/niedersachsen/31303/dachtmisser_weg3.php
Ev.-luth. Kindergarten St. Johannis johannis.bs.kita#lk-bs.de https://www.kitanetz.de/niedersachsen/38102/leonhardstr40.php
Kindertagesstätte Immensen-Arpke I kita.immensen#htp-tel.de https://www.kitanetz.de/niedersachsen/31275/am_schnittgraben15.php
SV Mörsen-Scharrendorf Mini-Club svms-mini-club#freenet.de https://www.kitanetz.de/niedersachsen/27239/am-sportheim6.php
Kindergarten Transvaal kiga-transvaal#awo-emden.de https://www.kitanetz.de/niedersachsen/26723/althusiusstr89.php
Städtische Kindertagesstätte Gartenstadt kita.gartenstadt#braunschweig.de https://www.kitanetz.de/niedersachsen/38122/wurmbergstr48.php
Kindergruppe Till Eulenspiegel e.V. - Bärenbande & Windelrocker tilleulenspiegel-bs#gmx.de https://www.kitanetz.de/niedersachsen/38102/kurt-schumacher-str7.php
Ev. luth. Kindertagesstätte der Versöhnun kts.versoehnung-garbsen#evlka.de https://www.kitanetz.de/niedersachsen/30823/im_alten_dorfe6.php
Kinderkrippe Ratzenspatz ratzenspatz#kila-ini.de https://www.kitanetz.de/niedersachsen/31535/am_goetheplatz5.php
Kinderkrippe Hemmingen-Westerfeld krippe-hw#stadthemmingen.de https://www.kitanetz.de/niedersachsen/30966/berliner_strasse16-22.php
... and so on.
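
Since the goal is a mailing list, the same loop can collect its results into a pandas DataFrame and write a CSV instead of printing; a minimal, self-contained sketch (the column names and the kitas.csv file name are illustrative, and the address is assembled with '@' from the three parts the answer above extracts):

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

# collect every profile page from the sitemap and store name, e-mail and URL
sitemap = BeautifulSoup(requests.get('https://www.kitanetz.de/sitemap.xml').content, 'html.parser')
profile_re = re.compile(r'/\d+/[^/]+\.php')

records = []
for loc in sitemap.select('loc'):
    if not profile_re.search(loc.text):
        continue
    html_data = requests.get(loc.text).text
    soup = BeautifulSoup(html_data, 'html.parser')
    m = re.search(r"ez='(.*?)'.*?ey='(.*?)'.*?ex='(.*?)'", html_data, flags=re.S)
    records.append({
        'Name': soup.h1.text,
        'Email': f"{m[1]}@{m[2]}.{m[3]}" if m else '-',  # local part, domain, TLD
        'Profil': loc.text,
    })

df = pd.DataFrame(records)
df.to_csv('kitas.csv', index=False)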
