Python/Selenium - how to extract text from modal fade content? - python
I want to extract bios from a list of people from the website:
https://blueprint.connectiv.com/speakers/
I want to extract their title, company, and bio. However, the bio is only available when you click a person's photo on the website.
Below is my code to extract the title and company:
driver.find_element_by_xpath("//*[@id='speakers']/div/div/div/div/div/div/div").text.split('\n')
Can anyone help me extract the bios for each person? Any advice is appreciated!
You do not have to click the images as all the modals for each speaker are fully populated in the source. You can extract the content from these modals by using driver.execute_script:
from selenium import webdriver
d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://blueprint.connectiv.com/speakers/')
results = d.execute_script("""
    var people = [];
    for (var i of document.querySelectorAll('.modal.speakerCard')){
        people.push({
            name:i.querySelector('.description h4').textContent,
            title:i.querySelector('p.title').textContent,
            company:i.querySelector('p.company').textContent,
            bio:i.querySelector('p.bio').textContent,
        });
    }
    return people;
""")
Output (first 20 results):
[{'bio': 'Andrew is a recovering consultant turned serial entrepreneur, startup mentor and angel investor. He is the Managing Director of Dreamit Urbantech, investing in Proptech and Construction Tech. Andrew has written for Fortune, Forbes, Propmodo, CREtech, Builders Online, Architect Magazine, Multifamily Executive, AlleyWatch, Edsurge, The 74 Million, et. al. Andrew founded two companies and has a keen appreciation for how hard it is to build a successful startup, even under the best of circumstances.', 'company': 'Dreamit Ventures', 'name': 'Andrew Ackerman', 'title': 'Venture Partner'}, {'bio': 'Salman Ahmad is the CEO and co-founder of Mosaic, a construction technology company focused on making homebuilding scalable. By standardizing the process (homebuilding) and not the product (homes), Mosaic is delivering places people love and creating better communities. Salman holds a PhD in Electrical Engineering and Computer Science from MIT, focusing on Programming Language Design for Service-Oriented Systems, an MS in Computer Science from Stanford University focusing on Human Computer Interaction, and a BSE in Computer Systems Engineering from Arizona State University.\u2028 He also has 20 technical publications and patents in the areas of software systems, programming languages, machine learning, human-computer interaction, and sensor hardware. With a passion for construction, software, and computer science, Salman co-founded Mosaic to build places people love and make them widely available. ', 'company': 'Mosaic', 'name': 'Salman Ahmad', 'title': 'CEO and Co-Founder '}, {'bio': 'Dafna Akiva is a 10+ year veteran in the real estate investment, development, management and construction industries. Before assuming the role of Chief Revenue Officer at Veev, Dafna oversaw day-to-day operations and drove a number of company-scaling initiatives as Chief Operating Officer. 
Now, as Chief Revenue Officer, Dafna leads the development of new Veev projects that redefine customers’ living experiences, and drive revenue growth for the company’s bottom line. She oversees all real estate acquisitions and operation strategies, the real estate developments and account management, as well as sales, marketing, legal and HR.', 'company': 'Veev', 'name': 'Dafna Akiva', 'title': 'CRO & Co-Founder'}, {'bio': 'Min Alexander serves as CEO of PunchListUSA, the real estate platform digitizing home inspections for online ordering of repairs and lifecycle services. For the past decade, Min has been driving digital disruption to democratize real estate. She has led two national B2B2C platforms, field operations and created a top-10 U.S. brokerage, transforming the industry to increase access, quality and transparency.\n\nPrior to joining PunchListUSA, Min served as COO for Auction.com, as CEO and President of REALHome Services and Solutions and as SVP of Real Estate Services at Altisource. Min holds a BA from Duke and MBA from MIT. ', 'company': 'PunchlistUSA', 'name': 'Min Alexander', 'title': 'CEO & Co-Founder'}, {'bio': 'Nora Apsel is the Co-founder and CEO of Morty, the online mortgage marketplace. Morty provides homebuyers a place to evaluate competitive offers from multiple lenders, then lock and close their loans through an automated platform. Founded and led by engineers, Morty uses technology to forge a new path in mortgage: fully digital, free of legacy infrastructure, and backed by the flexible, scalable capital base of traditional lenders. As CEO, Nora is leading the Morty team through rapid, product-driven growth and nationwide expansion. Morty is a venture-backed company whose investors include Thrive Capital, Lerer Hippeau, MetaProp, March Capital, Prudence Holdings, FJ Labs and Rethink Impact. Trained as a software engineer before becoming an operator, Nora holds a M.S. in Computer Science from the University of Pennsylvania and a B.S. 
from Emory University.', 'company': 'Morty', 'name': 'Nora Apsel', 'title': 'CEO & Co-Founder'}, {'bio': 'Carey Armstrong is the co-founder and chief revenue officer of Tomo, a fintech startup that will provide the most customer-centric way to buy a home. Tomo was founded in the fall of 2020, raising an initial seed round of $40 million led by Ribbit Capital, NFX and Zigg Capital.\n\nCarey’s focus is on defining and delivering a delightful home buying experience for Tomo customers. She leads the development of our core transactional product offering as well as the growth and evolution of the business units that support it, including mortgage and brokerage. \n\nBefore co-founding Tomo, Carey was Vice President, Premier Agent, at Zillow Group, where she led business strategy, product strategy, and core operations for the $1B buyer services business. In this capacity, she was responsible for major leaps forward with initiatives including Connections, Home Tours, and Flex Select teams. \n\nPrior to Zillow, Carey was a strategy consultant and industry analyst with Boston Consulting Group and Forrester Research, respectively. Carey has a B.A. from Harvard University and an M.B.A. from the Tuck School of Business at Dartmouth. She and her family reside in Seattle.', 'company': 'Tomo', 'name': 'Carey Armstrong', 'title': 'CRO & Co-Founder'}, {'bio': 'Arie is the founder and CEO of WiredScore, the pioneer behind the international WiredScore certification system that evaluates and distinguishes best-in-class Internet connectivity in commercial buildings. Prior to founding WiredScore, Arie worked as a consultant with the Boston Consulting Group in New York City where he focused on the technology and media industries. 
Arie holds an MBA from the Wharton School and a BA and BS in Business and Political Science from the University of California, Berkeley.', 'company': 'WiredScore', 'name': 'Arie Barendrecht', 'title': 'CEO & Founder'}, {'bio': 'Demetrios Barnes is the Chief Operating Officer of SmartRent, where he leads the client engagement, supply chain and field operations teams. With over a decade of experience in property management operations, he is passionate about helping owners and operators understand the innovations technology can produce, while forging strong interpersonal relationships and participating in thought leadership discussions. Prior to co-founding SmartRent, he was Vice President of Technology for Colony Starwood Homes, Previously, Mr. Barnes was Director of Property Management and Technology with Beazer Pre-Owned Rental Homes, and a Regional Manager for several multifamily companies. Mr. Barnes holds a Bachelor of Science in Business Administration from Arizona State University.', 'company': 'SmartRent', 'name': 'Demetrios Barnes', 'title': 'COO & Co-Founder'}, {'bio': "Ryan J. S. Baxter is PropTech Advisor to the New York State Energy Research and Development Authority (NYSERDA), Cofounder of the PropTech Challenge, NYC Community Growth Lead for MetaProp NYC, and the founder of PASSNYC. Previously, Ryan served as a Vice President at the Real Estate Board of New York (REBNY). He is a native New Yorker who works passionately to make the City's built environment more educational.\n", 'company': 'Proptech Challenge', 'name': 'Ryan Baxter', 'title': 'Co-Founder'}, {'bio': 'Gary is CEO of Roofstock, a leading real estate investment marketplace which he co-founded in 2015. Gary has spent most of his career building businesses in the real estate, hospitality and tech sectors. After earning his BA in economics from Northwestern, Gary ventured west to earn his MBA from Stanford, where he caught the entrepreneurial bug and still serves as a regular guest lecturer. 
Previously Gary was instrumental in acquiring and integrating more than $800 million of resort properties for KSL Resorts, and spent five years as CFO of online brokerage pioneer ZipRealty, which he led through its successful IPO in 2004. Gary also served as CEO of Joie de Vivre Hospitality, then the second largest boutique hotel management company in the country. Immediately before starting Roofstock, Gary led one of the largest single-family rental platforms in the U.S. through its IPO as co-CEO of Starwood Waypoint Residential Trust, now part of Invitation Homes.', 'company': 'Roofstock', 'name': 'Gary Beasley', 'title': 'CEO & Co-Founder '}, {'bio': "Robyn has a track record of taking sophisticated climate and clean energy-related technical concepts and transforming them into commercially-oriented strategies that lead to impact, scale and results. She began her career in 2004 at Google in Mt View, CA, reporting directly to the co-founders working on strategic initiatives as they took the company public. Robyn went on to found Google's first business unit focused on incorporating clean energy generation across the company's global operations. In this capacity, she oversaw and catalyzed Google’s first clean energy initiatives, including large-scale clean energy procurement for data centers and the development and installation of a 1.7MW rooftop solar installation at the Mountain View HQ. Since then she has built, invested in, and raised $50M+ for new ventures and programs for Vestas Wind A/S in Copenhagen, Dean Kamen at DEKA R&D, and NRG Energy. Most recently she was an executive at Lennar Corp, where she built the firm’s first corporate venture platform while incubating Blueprint Power Technologies. Today, Robyn Beavers is the CEO and co-founder of Blueprint Power, a NYC-based real estate tech company that turns buildings into revenue-generating clean power plants. Robyn was named EY’s NY Entrepreneur of the Year in 2020. Robyn holds both a B.S. 
in Civil Engineering and an MBA from Stanford University.", 'company': 'Blueprint Power', 'name': 'Robyn Beavers', 'title': 'CEO & Co-Founder'}, {'bio': 'Liza Benson is a Partner with Moderne Ventures and helps lead and manage investment activity with particular focus on high-growth technology companies that can achieve rapid adoption and scale. Moderne Ventures is an early stage investment fund and industry immersion program which is focused on investing in technology companies in and around the multi-trillion dollar industries of real estate, mortgage, finance, insurance and home services.\n\nPrior to Moderne, Liza was a Partner with StarVest Partners, a $400M venture fund focused on expansion stage B2B SaaS investments. Previously, Liza was a Managing Director in the growth equity group at Highbridge Principals Strategies, a multi-billion asset manager. Before her experience at Highbridge, Liza was a Managing Director with Bear Stearns’ Constellation Growth Capital and an investment banker at Patricof & Co and First Union where she started her career.', 'company': 'Moderne Ventures', 'name': 'Liza Benson', 'title': 'Partner'}, {'bio': 'Jeremy Bernard is the CEO, North America at essensys, the world’s leading provider of software and technology to the flexible real estate industry. He has over 25 years of experience in the real estate and technology sectors. Most recently, Jeremy was the Global Head of Real Estate for Knotel where he grew and oversaw a portfolio of 5.5MM sq ft of flexible office space around the world. In previous roles, he has held C-level positions at real estate investment firms and launched several proptech companies. Jeremy resides in Westport, CT with his wife Jamie, daughter Morgan and son Brody.', 'company': 'essensys', 'name': 'Jeremy Bernard', 'title': 'CEO, North America'}, {'bio': "Benjamin Birnbaum is a Partner at Keyframe – a NYC based investment firm. 
His focus is primarily on how technology is causing market change across a number of physical infrastructure categories, like transportation and energy, inspired by earlier career experiences as an operating leader for one of the world's largest passenger transportation companies. Ben is also a co-founder of TeraWatt Infrastructure, a specialized owner of electric vehicle charging infrastructure focused on fleet electrification. ", 'company': 'Keyframe Capital', 'name': 'Ben Birnbaum', 'title': 'Partner'}, {'bio': 'Sean is the Co-Founder & CEO of BLACK, a tech-powered and cloud based CRE brokerage platform based in NYC. Prior to founding BLACK, Sean served as EVP of Real Estate and Enterprise Sales at WeWork, He has been involved in millions of square feet of commercial real estate leasing transactions over his 20 year tenure, and has worked at many of the world’s largest commercial brokerage firms including Cushman & Wakefield, JLL, Newmark, and Grubb & Ellis. ', 'company': 'BlackRE', 'name': 'Sean Black', 'title': 'CEO & Co-Founder'}, {'bio': 'As chief operating officer of CA Student Living, Steve Boyack is responsible for driving the performance and growth of CASL’s property management platform, as well as overseeing its corporate operational functions including technology, human resources, communications and culture. Steve leverages his decades of experience in the industry to develop and advance the people, processes and technologies that form the foundation of the business.\n\nBoyack previously served as global head of property management for CA Ventures, a parent company of CA Student Living, where he laid the foundation for the firm’s European student operating platform (Novel Student), global sustainability initiative, wellness program and innovation department. 
Prior to joining CA, Steve was a senior managing director at Greystar where he was responsible for overseeing real estate operations and leading the expansion of the company’s footprint in key Midwest markets. In addition, he oversaw Greystar’s national construction and maintenance operations and worked with their global innovation team.\n\nSteve earned a BS in Economics from the University of Iowa and a CPM® designation from the Institute of Real Estate Management. As a\xa0member of several industry advisory boards and associations, Steve is a\xa0recognized subject matter expert and thought leader, with particular focus on integrated property technology.', 'company': 'CA Ventures', 'name': 'Steve Boyack', 'title': 'COO, Student Living'}, {'bio': 'Laura Cain is the CEO and co-founder of Willow Servicing, a technology company focused on streamlining mortgage servicing. Willow’s platform automates core workflows, enabling lenders to provide digital-first borrower experiences while reducing operational costs and ensuring compliance with industry policies & regulations. Prior to Willow, Laura was a product manager at Snapdocs, where she built out their initial eClose product offering to lenders, and a venture investor at Thomvest, where she focused on early stage fintech investments.', 'company': 'Willow Servicing', 'name': 'Laura Cain', 'title': 'CEO & Co-Founder'}, {'bio': 'Madhu Chamarty is the co-founder and CEO of BeyondHQ, a startup that helps companies plan and scale distributed teams. An engineer and math nerd at heart, he has 15+ yrs of startup experience in Silicon Valley, as an early employee and co-founder at 3 high-growth B2B startups in digital media (Adify - Cox acq. # $300MM), employee communities (Dynamic Signal), and geospatial analytics (Descartes Labs). He has scaled sales & support teams globally, in both colocated and remote formats. 
He grew up in a fully distributed family across 4 countries, so believes he was destined to build BeyondHQ even before he knew it.', 'company': 'BeyondHQ', 'name': 'Madhu Chamarty', 'title': 'CEO & Co-Founder'}, {'bio': 'Alex Chatzielftheriou is a Greek entrepreneur and CEO and co-founder of Blueground — a real estate tech company founded in 2013. Blueground provides a network of fully-furnished, move-in ready apartments in 14 cities across the globe for stays of a month, a year, or longer. Having lived and worked in more than 15 cities around the world, Alex sought to provide business and leisure travelers with a hassle-free way to find places that feel like home — to show up and start living from day one. Along the way, Alex disrupted the traditional lease model, enabling flexible living to encourage travel and exploration of the world and its cultures while providing a place to feel "grounded" and call home. ', 'company': 'Blueground', 'name': 'Alex Chatzieleftheriou', 'title': 'CEO & Co-Founder'}, {'bio': 'Jit Kee Chin is the Chief Data & Innovation Officer and Executive Vice President at Suffolk. Ms. Chin is responsible for leveraging big data and advanced analytics to improve the organization’s core business. Ms. Chin is also responsible for helping to position Suffolk to achieve its vision of transforming the construction experience while working closely with the company’s Innovation and Strategy teams to fundamentally reinvent the future of construction in the digital age. \n\nPrior to her role at Suffolk, Ms. Chin spent 10 years with management consulting firm McKinsey and Company where she counseled senior executives on strategic, commercial and advanced analytics topics. Most recently, she was a Senior Expert in Analytics in McKinsey’s Boston office where she specialized in the design and implementation of end- to-end analytics transformations. Prior to that role, Ms. 
Chin was an Associate Principal in McKinsey’s London office where she helped organizations drive multi-year business transformations and change programs and developing strategies for profitable growth.', 'company': 'Suffolk Construction', 'name': 'Jit Kee Chin', 'title': 'Chief Data & Innovation Officer'}]
In pandas:
import pandas as pd
df = pd.DataFrame(results)
print(df)
Output:
                                                   bio           company             name                        title
0    Andrew is a recovering consultant turned seria...  Dreamit Ventures  Andrew Ackerman              Venture Partner
1    Salman Ahmad is the CEO and co-founder of Mosa...            Mosaic     Salman Ahmad           CEO and Co-Founder
2    Dafna Akiva is a 10+ year veteran in the real ...              Veev      Dafna Akiva             CRO & Co-Founder
3    Min Alexander serves as CEO of PunchListUSA, t...      PunchlistUSA    Min Alexander             CEO & Co-Founder
4    Nora Apsel is the Co-founder and CEO of Morty,...             Morty       Nora Apsel             CEO & Co-Founder
..                                                 ...               ...              ...                          ...
128  Ms. Wong joined Tishman Speyer in 2015. Jenny ...    Tishman Speyer       Jenny Wong            Managing Director
129  Joseph is the Founder and CEO of Neighbor.com,...          Neighbor  Joseph Woodbury                CEO & Founder
130  Based in Palo Alto, Michael Yang is a Managing...    OMERS Ventures     Michael Yang             Managing Partner
131  Since joining RET Ventures as Partner in 2019,...      RET Ventures  Christopher Yip  Partner & Managing Director
132  Chris Zlocki, Global Head of Client Experience...          Colliers     Chris Zlocki       EVP, Occupier Services
[133 rows x 4 columns]
Instead of driver.execute_script, you can use BeautifulSoup:
from bs4 import BeautifulSoup as soup
from selenium import webdriver
d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://blueprint.connectiv.com/speakers/')
s = soup(d.page_source, 'html.parser').select('.modal.speakerCard')
r = [dict(zip(['name', 'title', 'company', 'bio'],
[b.text for b in i.select(':is(h4, p.title, p.company, p.bio)')])) for i in s]
If all the information you are looking for is within a paragraph tag <p> that has a class of bio (so <p class='bio'>), and all the modals are already present in the source code, then you can simply select all with:
bios = driver.find_elements_by_xpath('//p[@class="bio"]')
That selects every <p> element whose class attribute is exactly 'bio' and returns them as a list. If some of the p tags carry additional classes (e.g. <p class='bio someotherclass'>), then you will need to use the contains() function in your XPath, like so:
bios = driver.find_elements_by_xpath('//p[contains(@class, "bio")]')
You can then loop through the results like so:
for bio in bios:
    print(bio.text)
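One caveat worth keeping in mind: contains(@class, "bio") is a plain substring test, so it would also match a class such as "biography". The sketch below illustrates the three matching behaviours (exact attribute value, substring, and whole class token) using only the standard library. The markup and the helper name match_paragraphs are made up for illustration and are not taken from the speakers page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for illustration only (not the real speakers page).
SAMPLE = """
<div>
  <p class="bio">exact</p>
  <p class="bio highlighted">extra class</p>
  <p class="biography">different class</p>
  <p class="title">no match</p>
</div>
"""


def match_paragraphs(root, predicate):
    """Return the text of every <p> whose class attribute satisfies predicate."""
    return [p.text for p in root.iter("p") if predicate(p.get("class", ""))]


root = ET.fromstring(SAMPLE)

# Behaves like //p[@class="bio"]: the whole attribute value must equal "bio".
exact = match_paragraphs(root, lambda c: c == "bio")

# Behaves like //p[contains(@class, "bio")]: any substring match counts.
substring = match_paragraphs(root, lambda c: "bio" in c)

# Behaves like the CSS selector p.bio: "bio" must be a whole class token.
token = match_paragraphs(root, lambda c: "bio" in c.split())

print(exact)      # ['exact']
print(substring)  # ['exact', 'extra class', 'different class']
print(token)      # ['exact', 'extra class']
```

In XPath itself, the token-style match is usually written as //p[contains(concat(" ", normalize-space(@class), " "), " bio ")], which is what the CSS selector p.bio does implicitly.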
Related
How to remove url's from string
Currently I have many rows in one column similar to the string below. In Python I have run the following code to remove <p>, <a, href=, and the URL itself:
df["text"] = df["text"].str.replace(r'\s*https?://\S+(\s+|$)', ' ').str.strip()
df["text"] = df["text"].str.replace(r'\s*href=//\S+(\s+|$)', ' ').str.strip()
However, the output remains the same. Please advise.
<p>On 4 May 2019, The Financial Times (FT) reported that Huawei is planning to build a '400-person chip research and development factory' outside Cambridge. Planned to be operational by 2021, the factory will include an R&D centre and will be built on a 550-acre site reportedly purchased by Huawei in 2018 for £37.5 million. A Huawei spokesperson quoted in the FT article cited Huawei's long-term collaboration with Cambridge University, which includes a five-year, £25 million research partnership with BT, which launched a joint research group at the University of Cambridge. Read more about that partnership on this map.</p>
<p>In 2020 it was reported that the Huawei research and development center received approval by a local council despite the nation’s ongoing security concerns around the Chinese company.</p>
<p>Chinese state media later reported that Huawei's expansion in Cambridge 'is part of a five-year, £3 billion investment plan for the UK that [Huawei] announced alongside [then] British Prime Minister Theresa May' in February 2018.</p>
IIUC you want to replace the following HTML tags: <p>, <a, href=, and the URL.
Code
df['text'] = df.text.replace(regex = {r'<p>': ' ', r'</p>': '', r'<a.*?\/a>': '+'})
Explanation
The regex dictionary makes the following substitutions:
<p> replaced by ' '
<a href = .../a> replaced by '+'
</p> replaced by ''
Example
Create Data
import pandas as pd
s = '''<p>On 4 May 2019, The Financial Times (FT) reported that Huawei is planning to build a '400-person chip research and development factory' outside Cambridge. Planned to be operational by 2021, the factory will include an R&D centre and will be built on a 550-acre site reportedly purchased by Huawei in 2018 for £37.5 million. A Huawei spokesperson quoted in the FT article cited Huawei's long-term collaboration with Cambridge University, which includes a five-year, £25 million research partnership with BT, which launched a joint research group at the University of Cambridge. Read more about that partnership on this map.</p>
<p>In 2020 it was reported that the Huawei research and development center received approval by a local council despite the nation’s ongoing security concerns around the Chinese company.</p>
<p>Chinese state media later reported that Huawei's expansion in Cambridge 'is part of a five-year, £3 billion investment plan for the UK that [Huawei] announced alongside [then] British Prime Minister Theresa May' in February 2018.</p>'''
data = {'text':s.split('\n')}
df = pd.DataFrame(data)
print(df.text[0]) # show first row pre-replacement
# Perform replacements
df['text'] = df.text.replace(regex = {r'<p>': ' ', r'</p>': '', r'<a.*?\/a>': '+'})
print(df.text[0]) # show first row post replacement
Output
The first row only
Before replacement
On 4 May 2019, The Financial Times (FT) reported that Huawei is planning to build a '400-person chip research and development factory' outside Cambridge. Planned to be operational by 2021, the factory will include an R&D centre and will be built on a 550-acre site reportedly purchased by Huawei in 2018 for £37.5 million. A Huawei spokesperson quoted in the FT article cited Huawei's long-term collaboration with Cambridge University, which includes a five-year, £25 million research partnership with BT, which launched a joint research group at the University of Cambridge. Read more about that partnership on this map.
Post Replacement
On 4 May 2019, + (FT) reported that Huawei is planning to build a '400-person chip research and development factory' outside Cambridge. Planned to be operational by 2021, the factory will include an R&D centre and will be built on a 550-acre site reportedly purchased by Huawei in 2018 for £37.5 million. A Huawei spokesperson quoted in the FT article cited Huawei's long-term collaboration with Cambridge University, which includes a five-year, £25 million research partnership with BT, which launched a joint research group at the University of Cambridge. Read more about that partnership on this +
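You can use the following regex pattern instead:
<a href=(.*?)">
I successfully tested this using your test string on regex101. Full code (regex=True is passed explicitly because pandas 2.0 and later treat the pattern as a literal string by default):
df["text"] = df["text"].str.replace(r'<a href=(.*?)">', "", regex=True).str.strip()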
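I don't think your regex is quite right. Try (with regex=True, since pandas 2.0 and later treat the pattern as a literal string by default):
df["text"] = df["text"].str.replace(r'<a href=\".*?\">', ' ', regex=True).str.strip()
df["text"] = df["text"].str.replace(r'</a>', ' ', regex=True).str.strip()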
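As a rough illustration of the tag-stripping idea in these answers, here is a minimal sketch using only the standard library's re module. The sample string, the example.com URL, and the names TAG_PATTERN and strip_tags are assumptions made for the example. It removes opening and closing <p> and <a> tags (including the href attribute) while keeping the link text.

```python
import re

# Hypothetical sample row; the URL is a placeholder, not from the question.
sample = (
    '<p>Read more about that partnership on '
    '<a href="https://example.com/map">this map</a>.</p>'
)

# Matches <p>, </p>, <a ...> and </a>. The \b keeps tags such as <pre>
# or <abbr> from being caught by the "p" and "a" alternatives.
TAG_PATTERN = re.compile(r'</?(?:p|a)\b[^>]*>')


def strip_tags(text):
    """Remove <p> and <a> tags but keep the text between them."""
    return TAG_PATTERN.sub('', text).strip()


print(strip_tags(sample))  # Read more about that partnership on this map.
```

On a DataFrame column, the same pattern string could be passed to df["text"].str.replace(..., "", regex=True). For HTML that is more irregular than this, a parser such as BeautifulSoup is generally a safer choice than regular expressions.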
Beautifulsoup class not contain multiple 'strings'
Is there a way to scrape p tags whose class does not contain any of several strings? Here's my code so far (after compiling code and researching StackOverflow):
import requests
import bs4
import re
url = 'https://www.sp2.upenn.edu/person/amy-hillier/'
req = requests.get(url).text
soup = bs4.BeautifulSoup(req,'html.parser')
regex = re.compile('^((?!Header|header|button|Root|root|logo|Title|title|Foot|foot|Publish|Story|story|Stories|stories|Link|link|color|space|email|address|download|capital).)*$')
for texts in soup.find_all('div'):
    for i in texts.findAll('p',{'class': regex}):
        print(i)
So my thought process is that I've created a regex listing strings that, if present, mean the web scraper should not scrape the paragraph. To put it simply, if any of these words appears in the class attribute, then don't scrape that paragraph. Someone also recommended that I use CSS selector syntax with the :not() pseudo-class and the * contains operator, which I interpreted as:
for texts in soup.find_all('div'):
    for i in texts.select('p[class]:not([class*="Header|header|button|Root|root|logo|Title|title|Foot|foot|Publish|Story|story|Stories|stories|Link|link|color|space|email|address|download|capital"])'):
        print(i)
Unfortunately, neither of them works. Any help is greatly appreciated!
Edit
Adding examples of text:
<p class="sub has-white-color has-normal-font-size tw-pb-5"> The world needs leaders equipped with tools to make a difference. The School of Social Policy & Practice (SP2) will prepare you to become one of those leaders, as a policy maker, practitioner, educator, activist, and more. </p>
<p> Amy Hillier (she/her/her) currently teaches introductory-level GIS (mapping) courses for SP2 and Urban Studies program and chairs the MSW racism course sequence. Her doctoral and post-doctoral research focused on historical mortgage redlining. For more than a decade, her research focused on links between the built environment and public health. During that time, her primary faculty position was with the Department of City & Regional Planning in the Weitzman School of Design. She moved to SP2 in 2017 in order to pursue new research interests relating to LGBTQ communities, particularly trans youth. She is the founding director of the LGBTQ Certificate. </p>
<p class="Paragraph-sc-1mxv4ns-0 bGbcwt"> Dr. Sahingur joined Penn Dental Medicine in September 2019 as Associate Dean of Graduate Studies and Student Research, providing leadership, strategic vision, and oversight to support and expand the graduate studies and student research endeavors at the School. She will be overseeing the Summer Student Research Program for the summer of 2020. Originally from Istanbul, Turkey, she received her DDS from Istanbul University, Turkey, in 1994 and then moved to the U.S. for her postgraduate education. She completed all of her postgraduate training at State University of New York at Buffalo, receiving a Master of Science degree in Oral Sciences in 1999 and then a PhD in Oral Biology with a clinical certificate in Periodontics in 2004. </p>
I need to scrape the second and third paragraphs. My logic is that since the first paragraph's class has the word 'color' in it, I can exclude it. The rest of the words listed in the regex variable are the ones I have found across multiple URLs that need to be excluded. I hope that clarifies my question.
Perhaps you can use a custom function when searching for the right <p> tags. For example:
from bs4 import BeautifulSoup
html_doc = """\
<p class="sub has-white-color has-normal-font-size tw-pb-5"> The world needs leaders equipped with tools to make a difference. The School of Social Policy & Practice (SP2) will prepare you to become one of those leaders, as a policy maker, practitioner, educator, activist, and more. </p>
<p> Amy Hillier (she/her/her) currently teaches introductory-level GIS (mapping) courses for SP2 and Urban Studies program and chairs the MSW racism course sequence. Her doctoral and post-doctoral research focused on historical mortgage redlining. For more than a decade, her research focused on links between the built environment and public health. During that time, her primary faculty position was with the Department of City & Regional Planning in the Weitzman School of Design. She moved to SP2 in 2017 in order to pursue new research interests relating to LGBTQ communities, particularly trans youth. She is the founding director of the LGBTQ Certificate. </p>
<p class="Paragraph-sc-1mxv4ns-0 bGbcwt"> Dr. Sahingur joined Penn Dental Medicine in September 2019 as Associate Dean of Graduate Studies and Student Research, providing leadership, strategic vision, and oversight to support and expand the graduate studies and student research endeavors at the School. She will be overseeing the Summer Student Research Program for the summer of 2020. Originally from Istanbul, Turkey, she received her DDS from Istanbul University, Turkey, in 1994 and then moved to the U.S. for her postgraduate education. She completed all of her postgraduate training at State University of New York at Buffalo, receiving a Master of Science degree in Oral Sciences in 1999 and then a PhD in Oral Biology with a clinical certificate in Periodontics in 2004. </p>"""
soup = BeautifulSoup(html_doc, "html.parser")
words = ["color"]
for p in soup.find_all(
    lambda t: t.name == "p"
    and all(w not in c.lower() for c in t.get("class", []) for w in words)
):
    print(p)
    print("-" * 80)
Prints:
<p> Amy Hillier (she/her/her) currently teaches introductory-level GIS (mapping) courses for SP2 and Urban Studies program and chairs the MSW racism course sequence. Her doctoral and post-doctoral research focused on historical mortgage redlining. For more than a decade, her research focused on links between the built environment and public health. During that time, her primary faculty position was with the Department of City & Regional Planning in the Weitzman School of Design. She moved to SP2 in 2017 in order to pursue new research interests relating to LGBTQ communities, particularly trans youth. She is the founding director of the LGBTQ Certificate. </p>
--------------------------------------------------------------------------------
<p class="Paragraph-sc-1mxv4ns-0 bGbcwt"> Dr. Sahingur joined Penn Dental Medicine in September 2019 as Associate Dean of Graduate Studies and Student Research, providing leadership, strategic vision, and oversight to support and expand the graduate studies and student research endeavors at the School. She will be overseeing the Summer Student Research Program for the summer of 2020. Originally from Istanbul, Turkey, she received her DDS from Istanbul University, Turkey, in 1994 and then moved to the U.S. for her postgraduate education. She completed all of her postgraduate training at State University of New York at Buffalo, receiving a Master of Science degree in Oral Sciences in 1999 and then a PhD in Oral Biology with a clinical certificate in Periodontics in 2004. </p>
--------------------------------------------------------------------------------
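As for why the CSS attempt in the question matches nothing useful: an attribute selector such as [class*="Header|header|..."] looks for that entire string, pipes included, as one literal substring, because attribute selectors have no alternation. One possible workaround, sketched here with a hypothetical helper named build_exclusion_selector, is to chain one :not() per excluded word. The [class] part is dropped so that paragraphs with no class attribute at all are still selected.

```python
def build_exclusion_selector(tag, words):
    """Build a CSS selector for `tag` that skips elements whose class
    attribute contains any of `words` as a substring."""
    negations = "".join(f':not([class*="{word}"])' for word in words)
    return f"{tag}{negations}"


# Only two of the question's words are used here to keep the output short.
selector = build_exclusion_selector("p", ["color", "header"])
print(selector)  # p:not([class*="color"]):not([class*="header"])
```

The resulting string can then be passed to soup.select(selector). The *= operator compares case-sensitively by default, which is presumably why the question lists both "Header" and "header".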
How to remove element tags from results, Web Scraping Articles with Python
I've recently been teaching myself Python, and instead of diving right into courses I decided to think of some script ideas I could research and work through myself. The first one I decided to make, after seeing something similar referenced in a video, was a web scraper to grab articles from sites such as the New York Times. (I'd like to preface the post by stating that I understand some sites have varying TOS regarding this. I want to make it clear that I'm only doing this to learn the aspects of code and do not have any other motive. I also have an account with NYT and have not done this on websites where I do not have an account.) I've gained a bit of an understanding of the Python required to do this and have begun using some BeautifulSoup commands, and some of it works well! I've found the specific elements that refer to parts of the article in the F12 inspector and am able to successfully grab just the text from these parts. When it comes to the body of the article, however, the elements are set up in such a way that I'm having trouble grabbing all of the text without bringing some tags along with it. 
Where I'm at so far: from bs4 import BeautifulSoup import requests source = requests.get('https://www.nytimes.com/2022/01/08/us/teachers-unions-covid-schools.html').text soup = BeautifulSoup(source, 'lxml') print('----------') print(f'TITLE: {soup.title.string}') print('----------') print(f'H1: {soup.h1.string}') print(f'H2: {soup.h2.string}') print(f'H3: {soup.h3.string}') print('----------') article_summary = soup.find('p', class_='css-w6ymp8 e1wiw3jv0').text print(f'Article summary: {article_summary}') print('----------') image_summary = soup.find('span', class_='css-16f3y1r e13ogyst0').text print(f'Image summary: {image_summary}') print('----------') authors = soup.find('p', class_= 'css-aknsld e1jsehar1') author1 = authors.find('span', class_= 'css-1baulvz').text author2 = authors.find('span', class_= 'css-1baulvz last-byline').text print(f'Authors: {author1} and {author2}') print('----------') for item in soup.select('.StoryBodyCompanionColumn'): try: para = item.find_all('p') print(para) except Exception as e: print('f') The output I get from this is: ---------- TITLE: Teachers’ Unions Push for Remote Schooling, Worrying Democrats - The New York Times ---------- H1: As More Teachers’ Unions Push for Remote Schooling, Parents Worry. So Do Democrats. H2: The Coronavirus Pandemic: Latest Updates H3: None ---------- Article summary: Chicago teachers have voted to go remote. Other unions are agitating for change. For Democrats, who promised to keep schools open, the tensions are a distinctly unwelcome development. ---------- Image summary: Alex Brandenburg, an elementary school teacher, protested outside of the Oakland Unified School District headquarters on Friday as part of a sick out. 
---------- Authors: Dana Goldstein and Noam Scheiber ---------- [<p class="css-axufdj evys1bk0">Few American cities have labor politics as fraught as Chicago’s, where the nation’s third-largest school system shut down this week after teachers’ union members refused to work in person, arguing that classrooms were unsafe amid the Omicron surge.</p>, <p class="css-axufdj evys1bk0">But in a number of other places, the tenuous labor peace that has allowed most schools to operate normally this year is in danger of collapsing.</p>, <p class="css-axufdj evys1bk0">While not yet threatening to walk off the job, unions are back at negotiating tables, pushing in some cases for a return to remote learning. They frequently cite understaffing because of illness, and shortages of rapid tests and medical-grade masks. Some teachers, in a rear-guard action, have staged sick outs.</p>, <p class="css-axufdj evys1bk0">In Milwaukee, schools are remote until Jan. 18, because of staffing issues. But the teachers’ union president, Amy Mizialko, doubts that the situation will significantly improve<span class="css-8l6xbc evw5hdy0"> </span>and worries that the school board will resist extending online classes.</p>] [<p class="css-axufdj evys1bk0">“I anticipate it’ll be a fight,” Ms. Mizialko said.</p>, <p class="css-axufdj evys1bk0">She credited the district for at least delaying in-person schooling to start the year but criticized Democratic officials for placing unrealistic pressure on teachers and schools.</p>, <p class="css-axufdj evys1bk0">“I think that Joe Biden and Miguel Cardona and the newly elected mayor of New York City and Lori Lightfoot — they can all declare that schools will be open,” Ms. Mizialko added, referring to the U.S. education secretary and the mayor of Chicago. 
“But unless they have hundreds of thousands of people to step in for educators who are sick in this uncontrolled surge, they won’t be.”</p>] [<p class="css-axufdj evys1bk0">For many parents and teachers, the pandemic has become a slog of anxiety over the risk of infection, child care crises, the tedium of school-through-a-screen and, most of all, chronic instability.</p>, <p class="css-axufdj evys1bk0">And for Democrats, the revival of tensions over remote schooling is a distinctly unwelcome development.</p>] [<p class="css-axufdj evys1bk0">Because they have close ties to the unions, Democrats are concerned that additional closures like those in Chicago could lead to a possible replay of the party’s recent loss in Virginia’s governor race. <a class="css-1g7m0tk" href="https://dfer.org/press/poll-confirms-education-motivating-issue-for-va-voters-in-2021-election-likely-to-be-major-factor-in-midterms/" rel="noopener noreferrer" target="_blank" title="">Polling</a> showed that school disruptions were an important issue for swing voters who broke Republican — particularly suburban white women.</p>, <p class="css-axufdj evys1bk0">“It’s a big deal in most state polling we do,” said Brian Stryker, a partner at the polling firm ALG Research <a class="css-1g7m0tk" href="https://thirdway.imgix.net/pdfs/override/Qualitative-Research-Findings-%E2%80%93-Virginia-Post-Election-Research.pdf" rel="noopener noreferrer" target="_blank" title="">whose work</a> in Virginia indicated that school closures hurt Democrats.</p>, <p class="css-axufdj evys1bk0">“Anyone who thinks this is a political problem that stops at the Chicago city line is kidding themselves,” added Mr. Stryker, whose firm polled for President Biden’s 2020 campaign. 
“This is going to resonate all across Illinois, across the country.”</p>, <p class="css-axufdj evys1bk0">More than one million of the country’s 50 million public school students<span class="css-8l6xbc evw5hdy0"> </span>were affected by districtwide shutdowns in the first week of January, many of which were announced abruptly and triggered a wave of frustration among parents.</p>, <p class="css-axufdj evys1bk0">“The kids are not the ones that are seriously ill by and large, but we know kids are the ones suffering from remote learning,” said Dan Kirk, whose son attends Walter Payton College Preparatory High School in Chicago, which was closed amid the district’s standoff this week.</p>, <p class="css-axufdj evys1bk0">Several nonunion charter-school networks and districts temporarily transitioned to remote learning after the holidays. But as has been true throughout the pandemic, most of the temporary districtwide closures — including in Detroit, Cleveland, Milwaukee — are taking place in liberal-leaning areas with powerful unions and a more cautious approach to the coronavirus.</p>] [<p class="css-axufdj evys1bk0">The unions’ demands echo the ones they have made for nearly two years, despite all that has changed. There are now vaccines and <a class="css-1g7m0tk" href="https://www.cdc.gov/coronavirus/2019-ncov/science/science-briefs/transmission_k_12_schools.html#sars-cov-2" rel="noopener noreferrer" target="_blank" title="">the reassuring knowledge</a> that in-school transmission of the virus has been limited.<span class="css-8l6xbc evw5hdy0"> </span>The Omicron variant, while highly contagious, appears to cause less severe illness than previous iterations of Covid-19. </p>] [<p class="css-axufdj evys1bk0">Most district leaders and many educators say it is imperative for schools to remain open. 
They cite a large body of research showing that closures harm children, <a class="css-1g7m0tk" href="https://www.nytimes.com/2021/07/28/us/covid-schools-at-home-learning-study.html" title="">academically</a> and <a class="css-1g7m0tk" href="https://www.nytimes.com/2022/01/04/briefing/american-children-crisis-pandemic.html" title="">emotionally</a>, and widen income and racial disparities. </p>, <p class="css-axufdj evys1bk0">But some local union officials are far warier of packed classrooms. In Newark, schools began 2022 with an unexpected stretch of remote learning, set to end on Jan. 18. John Abeigon, the Newark Teachers Union president, said he was hopeful about the return to buildings but that he remained unsure if every school could operate safely.<span class="css-8l6xbc evw5hdy0"> </span>Student vaccination is far from universal, and most parents have not consented to their children taking regular virus tests.</p>, <p class="css-axufdj evys1bk0">Mr. Abeigon said that if tests remain scarce, he might ask for remote learning at specific schools with low vaccination rates and high case counts. He agreed that online learning was a burden to working parents but argued that educators should not be sacrificed for the good of the economy.</p>, <p class="css-axufdj evys1bk0">“I’d see the entire city of Newark unemployed before I allowed one single teacher’s aide to die needlessly,” he said.</p>, <p class="css-axufdj evys1bk0">In Los Angeles, the district has worked closely with the union to keep classrooms open after one of the longest pandemic shutdowns in the country last school year. The vaccination rate for students 12 and older is about 90 percent, with a student vaccine mandate set to <a class="css-1g7m0tk" href="https://www.nytimes.com/2021/12/18/us/los-angeles-vaccine-mandate-delayed.html" title="">kick in this fall</a>. 
All students and staff are tested for the virus weekly.</p>] [<p class="css-axufdj evys1bk0">Still, the president of the local union, Cecily Myart-Cruz, would not rule out pushing for a districtwide return to remote learning in the coming weeks. “You know, I want to be honest — I don’t know,” she said.</p>, <p class="css-axufdj evys1bk0">The tensions are not limited to liberal<span class="css-8l6xbc evw5hdy0"> </span>states. In Kentucky, teachers’ unions and at least <a class="css-1g7m0tk" href="https://www.wdrb.com/in-depth/remote-learning-probable-at-some-point-for-jcps-as-covid-19-cases-surge-pollio-says/article_aae24c48-6e40-11ec-846b-5bdbd3d76870.html" rel="noopener noreferrer" target="_blank" title="">one large school district</a> have said they need the flexibility to go remote amid escalating infection rates.</p>, <p class="css-axufdj evys1bk0">But the Republican-controlled state legislature has granted no more than 10 days for such instruction districtwide, and unions there worry that may be inadequate. Jeni Ward Bolander, a leader of a statewide union, said that teachers may have to walk off the job.</p>, <p class="css-axufdj evys1bk0">“Frustration is building on teachers,” Ms. Ward Bolander said. 
“I hate to say we’d walk out at that point, but it’s absolutely possible.”</p>, <p class="css-axufdj evys1bk0">National teachers’ unions continue to call for classrooms to remain open, but local affiliates hold the most power in negotiations over whether individual districts will close schools.</p>, <p class="css-axufdj evys1bk0">And over the last decade, some locals, including those in Los Angeles and Chicago, were taken over by activist leaders whose tactics can be more aggressive than those of national leaders like <a class="css-1g7m0tk" href="https://www.nytimes.com/2021/02/08/us/schools-reopening-teachers-unions.html" title="">Randi Weingarten</a> of the American Federation of Teachers and <a class="css-1g7m0tk" href="https://www.nytimes.com/2021/12/12/us/politics/teachers-union-becky-pringle.html" title="">Becky Pringle</a> of the National Education Association, both close allies of President Biden.</p>] [<p class="css-axufdj evys1bk0">Complicating matters, some local unions face internal pressure from their own members. 
<a class="css-1g7m0tk" href="https://sanfrancisco.cbslocal.com/2022/01/06/covid-oakland-unified-school-district-warns-potential-teacher-sickout/" rel="noopener noreferrer" target="_blank" title="">In the Bay Area</a>, splinter groups of teachers in both Oakland and San Francisco have planned sick outs, and demanded N95 masks, more virus testing and other safety measures.</p>, <p class="itemClass"><strong>The latest Covid data in the U.S.<!-- --> </strong><span>As the Omicron surge causes case counts to reach record highs and hospitalizations to surpass the height of the Delta wave, here’s how to think about the data and what it’s beginning to show about Omicron’s potential toll across the county.</span></p>, <p class="itemClass"><strong>Around the world.<!-- --> </strong><span>In Europe, Germany is bracing for major protests against restrictions after thousands took to the streets in France and Austria, and a tough new vaccine requirement came into force in Italy. In Uganda, schools reopened after the longest pandemic-prompted shutdown in the world.</span></p>, <p class="itemClass"><strong>Staying safe.<!-- --> </strong><span>Worried about spreading Covid? Keep yourself and others safe by following some basic guidance on when to test and how to use at-home virus tests (if you can find them). Here is what to do if you test positive for the coronavirus.</span></p>, <p class="css-axufdj evys1bk0">Rori Abernethy, a middle-school teacher in San Francisco, organized a sick out there on Thursday. She said the Chicago action had prompted some teachers to ask, “Why isn’t our union doing this?”</p>, <p class="css-axufdj evys1bk0">In Chicago and San Francisco, working-class parents of color disproportionately send their children to the public schools, and they have often supported strict safety measures during the pandemic, including periods of remote learning. 
And in New York, the nation’s largest school district, schools are operating in person with increased virus testing, with limited dissent from teachers.</p>, <p class="css-axufdj evys1bk0">But the politics become more complicated in suburbs, where union leaders may find themselves at odds with public officials at pains to<span class="css-8l6xbc evw5hdy0"> </span>preserve in-person schooling.</p>, <p class="css-axufdj evys1bk0">In Fairfax County, Virginia’s largest district, the superintendent has <a class="css-1g7m0tk" href="https://www.fcps.edu/return-school/return-school-safety/navigating-january-2022-covid-surge" rel="noopener noreferrer" target="_blank" title="">a plan</a> for switching individual schools to remote learning in the event of many absent teachers.</p>, <p class="css-axufdj evys1bk0">Kimberly Adams, the president of the<span class="css-8l6xbc evw5hdy0"> </span>local education association, said her union may want stricter measures. And she said that districts should be planning for virus surges by distributing devices for potential short bursts of online schooling. </p>, <p class="css-axufdj evys1bk0">But Dan Helmer, a Democratic state delegate whose swing district includes part of Fairfax County, said there was little support among his constituents for a return to online education.</p>] [<p class="css-axufdj evys1bk0">Deb Andraca, a Democratic state representative in Wisconsin whose district lies just north of Milwaukee, where schools went remote this past week, said that Republicans have targeted her seat and that she expected schools to be a line of attack.</p>, <p class="css-axufdj evys1bk0">“Everyone I know wants schools to stay open,” she said. 
“But there’s a lot of talk about how teachers’ unions don’t want schools to stay open.”</p>, <p class="css-axufdj evys1bk0">Jim Hobart, a partner at Public Opinion Strategies, a polling firm that counts several Republican senators and governors as clients, said the school closure issue created two advantages for G.O.P. candidates. It has helped narrow their margins among a demographic they’ve traditionally struggled with — white women between their mid-20s and mid-50s — and it has generally undermined Democrats’ claims to competence.</p>, <p class="css-axufdj evys1bk0">“A lot of people — Biden, Mayor Lightfoot in Chicago — have said schools should be open,” Mr. Hobart said. “If they’re not able to prevent schools from choosing to close, that shows a weakness on their part.”</p>, <p class="css-axufdj evys1bk0">Labor officials say that many of their critics are acting in bad faith, exploiting parents’ pandemic-related frustrations to advance longstanding political goals, like discrediting unions and expanding private-school vouchers.</p>, <p class="css-axufdj evys1bk0">Thus far, neither the critiques nor the broader pandemic challenges appear to have significantly hampered unions’ public standing, even according to <a class="css-1g7m0tk" href="https://www.educationnext.org/hunger-for-stability-quells-appetite-for-change-results-2021-education-next-survey-public-opinion-poll/" rel="noopener noreferrer" target="_blank" title="">polls</a> conducted by<span class="css-8l6xbc evw5hdy0"> </span>researchers skeptical of teachers’ unions.</p>] [<p class="css-axufdj evys1bk0">And if it turns out that Democratic candidates pay a political price for unions’ assertiveness, local labor officials do not consider it to be among their top concerns.</p>, <p class="css-axufdj evys1bk0">If periods of remote learning this winter hurt the Democratic Party, “that’s a question for the consultants and the brain trusts to figure out,” said Mr. Abeigon, the Newark union president. 
“But that it’s the right thing to do? There’s no question in my mind.”</p>, <p class="css-pncxxs etfikam0">Holly Secon<!-- --> contributed reporting from San Francisco.</p>] I feel so close to having it working, but I've hit a roadblock, so I'm hoping someone can tell me how to clean this up a bit. If I instead do: for item in soup.select('.StoryBodyCompanionColumn'): try: para = item.find('p').text print(para) except Exception as e: print('f') that is, if I use find('p').text or find('p').get_text() instead of find_all('p'), it either fails and prints f, or it gives me only the FIRST paragraph within each div. (I can upload the results I get from the other options, but there are a few more I could add and it would lengthen this post even further.) This article has 10 or 11 different divs for the body, each with 2-4 paragraphs, and they ALL have the same class. This is really where I've been running into an issue. I might be able to strip the tags with some nifty code that simply deletes the undesired tags from my results, but I'd rather have cleaner code that actually works, and I know I'm missing something. Any and all help is very much appreciated as I continue learning! (Apologies for the length of the post or if it's an inadequate question. I'm new to the site and to programming overall, so I'm at the stage where I don't know what I don't know yet. If there is a link that answers my question instead of you taking time to respond, that would be fantastic as well. Thanks!)
Make the selection more specific by adding p to your CSS selector. Then each item is a paragraph, and you can simply call .text or, if there is whitespace to strip, .text.strip() or .get_text(strip=True): for item in soup.select('.StoryBodyCompanionColumn p'): try: para = item.text print(para) except Exception as e: print('f') Alternatively, select the section tag that holds all the p tags and call get_text() to get all the human-readable text: soup.select_one('section[name="articleBody"]').get_text(strip=True) or iterate over the paragraphs and join() their texts into a single string: ' '.join([item.get_text(strip=True) for item in soup.select('.StoryBodyCompanionColumn p')]) Example from bs4 import BeautifulSoup import requests source = requests.get('https://www.nytimes.com/2022/01/08/us/teachers-unions-covid-schools.html').text soup = BeautifulSoup(source, 'lxml') soup.select_one('section[name="articleBody"]').get_text(strip=True)
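One detail worth seeing on a small input: get_text(strip=True) without a separator glues together the text on either side of inline tags such as <a> or <span>. Below is a minimal sketch on made-up markup that only mimics the structure described in the question (the HTML string and the variable names are assumptions, not the real NYT page). It collects one string per paragraph and passes a space as the separator.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure described in the question.
html = """
<section name="articleBody">
  <div class="StoryBodyCompanionColumn">
    <p>First paragraph.</p>
    <p>Second <a href="#">linked</a> paragraph.</p>
  </div>
  <div class="StoryBodyCompanionColumn">
    <p>Third paragraph.</p>
  </div>
</section>
"""

soup = BeautifulSoup(html, "html.parser")

# One string per <p>; the " " separator keeps words around inline tags apart.
paragraphs = [p.get_text(" ", strip=True)
              for p in soup.select(".StoryBodyCompanionColumn p")]

# Whole section in one call, with no separator: adjacent text nodes are glued.
glued = soup.select_one('section[name="articleBody"]').get_text(strip=True)

article = "\n\n".join(paragraphs)
print(article)
print(glued)
```

Joining the per-paragraph strings with a blank line keeps the paragraph boundaries, which the single get_text() call on the section does not.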
PyPDF2 won't extract all text from PDF
I'm trying to extract text from a PDF (https://www.sec.gov/litigation/admin/2015/34-76574.pdf) using PyPDF2, and the only result I'm getting is the following string: b'' Here is my code: import PyPDF2 import urllib.request import io url = 'https://www.sec.gov/litigation/admin/2015/34-76574.pdf' remote_file = urllib.request.urlopen(url).read() memory_file = io.BytesIO(remote_file) read_pdf = PyPDF2.PdfFileReader(memory_file) number_of_pages = read_pdf.getNumPages() page = read_pdf.getPage(1) page_content = page.extractText() print(page_content.encode('utf-8')) This code worked correctly on a few of the PDFs I'm working with (e.g. https://www.sec.gov/litigation/admin/2016/34-76837-proposed-amended-distribution-plan.pdf), but the others like the file above didn't work. Any idea what's wrong?
I don't know why pypdf2 can't extract the information from that PDF, but the package pdftotext can: import pdftotext from six.moves.urllib.request import urlopen import io url = 'https://www.sec.gov/litigation/admin/2015/34-76574.pdf' remote_file = urlopen(url).read() memory_file = io.BytesIO(remote_file) pdf = pdftotext.PDF(memory_file) # Iterate over all the pages for page in pdf: print(page) Extracted UNITED STATES OF AMERICA Before the SECURITIES AND EXCHANGE COMMISSION SECURITIES EXCHANGE ACT OF 1934 Release No. 76574 / December 7, 2015 ADMINISTRATIVE PROCEEDING File No. 3-16987 ORDER INSTITUTING CEASE-AND-DESIST In the Matter of PROCEEDINGS, PURSUANT TO SECTION 21C OF THE SECURITIES EXCHANGE ACT KEFEI WANG OF 1934, MAKING FINDINGS, AND IMPOSING REMEDIAL SANCTIONS AND A Respondent. CEASE-AND-DESIST ORDER I. The Securities and Exchange Commission (“Commission”) deems it appropriate and in the public interest that cease-and-desist proceedings be, and hereby are, instituted pursuant to 21C of the Securities Exchange Act of 1934 (“Exchange Act”) against Kefei Wang (“Respondent”). II. In anticipation of the institution of these proceedings, Respondent has submitted an Offer of Settlement (the “Offer”) which the Commission has determined to accept. Solely for the purpose of these proceedings and any other proceedings brought by or on behalf of the Commission, or to which the Commission is a party, and without admitting or denying the findings herein, except as to the Commission’s jurisdiction over him and the subject matter of these proceedings, which are admitted, and except as provided herein in Section V, Respondent consents to the entry of this Order Instituting Cease-and-Desist Proceedings, Pursuant to Section 21C of the Securities Exchange Act of 1934, Making Findings, and Imposing Remedial Sanctions and a Cease-and-Desist Order (“Order”), as set forth below. III. On the basis of this Order and Respondent’s Offer, the Commission finds1 that: Summary 1. 
Respondent violated Section 15(a)(1) of the Exchange Act by acting as an unregistered broker-dealer in connection with his representation of clients who were seeking U.S. residency through the Immigrant Investor Program. Respondent helped effect certain individuals’ securities purchases in an EB-5 Regional Center. Respondent received a commission from that Regional Center for each investment he facilitated. Respondent 2. Kefei Wang, age 39, is a resident of China. During the relevant time period, he was a U.S. resident and an owner of Nautilus Global Capital, LLC , a now defunct entity that was based in Fremont, California. Background 3. The United States Congress created the Immigrant Investor Program, also known as “EB-5,” in 1990 to stimulate the U.S. economy through job creation and capital investment by foreign investors. The Program offers EB-5 visas to individuals who invest $1 million in a new commercial enterprise that creates or preserves at least 10 full-time jobs for qualifying U.S. workers (or $500,000 in an enterprise located in a rural area or an area of high unemployment). A certain number of EB-5 visas are set aside for investors in approved Regional Centers. A Regional Center is defined as “any economic unit, public or private, which is involved with the promotion of economic growth, including increased export sales, improved regional productivity, job creation, and increased domestic capital investment.” 8 C.F.R. § 204.6(e) (2015). 4. Typical Regional Center investment vehicles are offered as limited partnership interests. The partnership interests are securities, usually offered pursuant to one or more exemptions from the registration requirements of the U.S. securities laws. The Regional Centers are often managed by a person or entity which acts as a general partner of the limited partnership. The Regional Centers, the investment vehicles, and the managers are collectively referred to herein as “EB-5 Investment Offerers.” 5. 
Various EB-5 Investment Offerers paid commissions to anyone who successfully sold limited partnership interests to new investors. 1 The findings herein are made pursuant to Respondent’s Offer of Settlement and are not binding on any other person or entity in this or any other proceeding. 2 Respondent Received Commissions for His Clients’ EB-5 Investments 6. From at least January 2010 through May 2014, Respondent received a portion of commissions from one EB-5 Investment Offerer totaling $40,000. The commissions constituted his portion of the commissions that were paid pursuant to a written Agency Agreement between Nautilus Global Capital and the EB-5 Investment Offerer. On one or more occasions the commission was paid to a foreign bank account identified by the Respondent despite the fact that the Respondent was U.S.-based during the relevant time period. 7. Respondent performed activities necessary to effectuate the transaction, including recommending the specific EB-5 Investment Offerer referenced in paragraph 6 to his clients; acting as a liaison between the EB-5 Investment Offerer and the investors; and facilitating the transfer and/or documentation of investment funds to the EB-5 Investment Offerer. Respondent received his portion of transaction-based commissions due to Nautilus Global Capital for its services from that EB-5 Investment Offerer. 8. As a result of the conduct described above, Respondent violated Section 15(a)(1) of the Exchange Act which makes it unlawful for any broker or dealer which is either a person other than a natural person or a natural person not associated with a broker or dealer to make use of the mails or any means or instrumentality of interstate commerce “to effect any transactions in, or to induce or attempt to induce the purchase or sale of, any security” unless such broker or dealer is registered in accordance with Section 15(b) of the Exchange Act. IV. 
In view of the foregoing, the Commission deems it appropriate to impose the sanctions agreed to in Respondent Kefei Wang’s Offer. Accordingly, pursuant to Section 21C of the Exchange Act, it is hereby ORDERED that: A. Respondent shall cease and desist from committing or causing any violations and any future violations of Section 15(a)(1) of the Exchange Act. B. Respondent shall, within ten (10) days of the entry of this Order, pay disgorgement of $40,000, prejudgment interest of $1,590, and a civil money penalty of $25,000 to the Securities and Exchange Commission for transfer to the general fund of the United States Treasury in accordance with Exchange Act Section 21F(g)(3). If timely payment of disgorgement and prejudgment interest is not made, additional interest shall accrue pursuant to SEC Rule of Practice 600 [17 C.F.R. § 201.600]. If timely payment of the civil money penalty is not made, additional interest shall accrue pursuant to 31 U.S.C. § 3717. Payment must be made in one of the following ways: (1) Respondent may transmit payment electronically to the Commission, which will provide detailed ACH transfer/Fedwire instructions upon request; 3 (2) Respondent may make direct payment from a bank account via Pay.gov through the SEC website at http://www.sec.gov/about/offices/ofm.htm; or (3) Respondent may pay by certified check, bank cashier’s check, or United States postal money order, made payable to the Securities and Exchange Commission and hand-delivered or mailed to: Enterprise Services Center Accounts Receivable Branch HQ Bldg., Room 181, AMZ-341 6500 South MacArthur Boulevard Oklahoma City, OK 73169 Payments by check or money order must be accompanied by a cover letter identifying Kefei Wang as a Respondent in these proceedings, and the file number of these proceedings; a copy of the cover letter and check or money order must be sent to Stephen L. 
Cohen, Associate Director, Division of Enforcement, Securities and Exchange Commission, 100 F St., NE, Washington, DC 20549-5553. V. It is further Ordered that, solely for purposes of exceptions to discharge set forth in Section 523 of the Bankruptcy Code, 11 U.S.C. § 523, the findings in this Order are true and admitted by Respondent, and further, any debt for disgorgement, prejudgment interest, civil penalty or other amounts due by Respondent under this Order or any other judgment, order, consent order, decree or settlement agreement entered in connection with this proceeding, is a debt for the violation by Respondent of the federal securities laws or any regulation or order issued under such laws, as set forth in Section 523(a)(19) of the Bankruptcy Code, 11 U.S.C. § 523(a)(19). By the Commission. Brent J. Fields Secretary 4 [Finished in 0.5s]
I think there might be an issue with how you are extracting the pages. Note that getPage(1) in your code reads only the second page, because page numbering starts at 0. Try looping over the pages and extracting each one separately, like so: for i in range(0, number_of_pages): pageObj = read_pdf.getPage(i) page = pageObj.extractText()
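A minimal sketch of that loop as a helper, assuming the legacy PyPDF2 method names used in the question (getNumPages, getPage, extractText). The names extract_all_pages, FakePage and FakeReader are hypothetical; the fake classes are stand-ins so the sketch runs without a PDF file, and in real use you would pass the read_pdf object instead. If I recall the renames correctly, newer PyPDF2 releases expose the same operations as len(reader.pages), reader.pages[i] and extract_text().

```python
def extract_all_pages(reader):
    """Return a list with the extracted text of every page, in order."""
    texts = []
    for i in range(reader.getNumPages()):
        page = reader.getPage(i)          # page indices start at 0
        texts.append(page.extractText())  # may be '' if nothing could be extracted
    return texts


# Hypothetical stand-ins so the sketch runs without a PDF file.
class FakePage:
    def __init__(self, text):
        self._text = text

    def extractText(self):
        return self._text


class FakeReader:
    def __init__(self, pages):
        self._pages = [FakePage(t) for t in pages]

    def getNumPages(self):
        return len(self._pages)

    def getPage(self, i):
        return self._pages[i]


pages = extract_all_pages(FakeReader(["first page", "", "third page"]))

# Indices of pages that produced no text at all.
empty = [i for i, text in enumerate(pages) if not text.strip()]
print(empty)
```

Listing the empty pages shows whether a single page yields nothing or the whole document does, which is useful to know before switching to a different extraction library.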
Regex that captures backslash
I know there are other posts about backslashes, but their suggestions do not work for me. I am trying to capture everything that comes after SUBJECT: and up to COMPANY (see below). I'm using the code below; notice the double backslashes (\\) at the end of the character class. My output for the regex stops at CHI Children because of the backslash in 'CHI Children\'s'. What do I do to deal with this backslash that doesn't want to be caught?
indextext = re.findall(r'SUBJECT:\s+[A-Z\s\(\w+\%\)\;\&\:\-\,\/\\]+', udoc2)[0]
indextext = re.sub(r'\r\n','\n', indextext)
UPDATE: I can't pre-specify 'COMPANY:' because each document has a different word there, and sometimes COMPANY doesn't exist at all. I would be forced to hard-code dozens of exceptions.
udoc = [SUBJECT: ENTREPRENEURSHIP (93%); PRESS RELEASES (91%); NUTRITION (90%); STUDENTS\r\n& STUDENT LIFE (90%); PREVENTION & WELLNESS (90%); EXERCISE & FITNESS (90%);\r\nVENTURE CAPITAL (90%); NONPROFIT ORGANIZATIONS (90%); COMPUTER SOFTWARE (85%);\r\nCHILDREN (78%); PUBLIC PRIVATE PARTNERSHIPS (78%); CHARITIES (78%); SPONSORSHIP\r\n(78%); FOUNDATIONS (78%); PHILANTHROPY (78%); EDUCATION SYSTEMS & INSTITUTIONS\r\n(78%); ALLIANCES & PARTNERSHIPS (77%); ENTERTAINMENT & ARTS (77%); PRODUCT\r\nINNOVATION (77%); WORKPLACE PROGRAMS (77%); SPORTS & RECREATION EVENTS (74%);\r\nSPORTS FANS (74%); AMERICAN FOOTBALL TOURNAMENTS (74%); LICENSING AGREEMENTS\r\n(74%); AMERICAN FOOTBALL (74%); SPORTS (74%); AGRICULTURE DEPARTMENTS (73%);\r\nLABOR FORCE (70%); EXECUTIVES (70%); BUSINESS ANALYTICS (67%); BUSINESS SOFTWARE\r\n(62%) NY-GENYOUth-SAP; CHI Children\'s Related News; LIC Licensing and Marketing\r\nAgreements\r\n\r\nCOMPANY:]
current output:
SUBJECT: ENTREPRENEURSHIP (93%); PRESS RELEASES (91%); NUTRITION (90%); STUDENTS & STUDENT LIFE (90%); PREVENTION & WELLNESS (90%); EXERCISE & FITNESS (90%); VENTURE CAPITAL (90%); NONPROFIT ORGANIZATIONS (90%); COMPUTER SOFTWARE (85%); CHILDREN (78%); PUBLIC PRIVATE PARTNERSHIPS (78%); CHARITIES (78%); SPONSORSHIP (78%); FOUNDATIONS (78%); PHILANTHROPY (78%); EDUCATION SYSTEMS & INSTITUTIONS (78%); ALLIANCES & PARTNERSHIPS (77%); ENTERTAINMENT & ARTS (77%); PRODUCT INNOVATION (77%); WORKPLACE PROGRAMS (77%); SPORTS & RECREATION EVENTS (74%); SPORTS FANS (74%); AMERICAN FOOTBALL TOURNAMENTS (74%); LICENSING AGREEMENTS (74%); AMERICAN FOOTBALL (74%); SPORTS (74%); AGRICULTURE DEPARTMENTS (73%); LABOR FORCE (70%); EXECUTIVES (70%); BUSINESS ANALYTICS (67%); BUSINESS SOFTWARE (62%) NY-GENYOUth-SAP; CHI Children
My Big Huge Caveat: I don't like your approach, so I'm throwing it out the window. The last thing you want to do is to use regular expressions to match HUGE NUMBERS OF THINGS while you wait to get to just a few things. That's the exact opposite of what a regex should do, so don't you do it either.
My Big Huge Assumption: I played with your code for quite a while, trying to figure out exactly what you were trying to do and why. It seems to me you're trying to index those values somehow, something like {"ENTREPRENEURSHIP":93,"PRESS RELEASES":91,...}, so that's what I built. Maybe that's not your end goal, in which case jeebus brother give us some feedback here....
My Itty Bitty Code:
text = """udoc = [SUBJECT: ENTREPRENEURSHIP (93%); PRESS RELEASES (91%); NUTRITION (90%); STUDENTS\r\n& STUDENT LIFE (90%); PREVENTION & WELLNESS (90%); EXERCISE & FITNESS (90%);\r\nVENTURE CAPITAL (90%); NONPROFIT ORGANIZATIONS (90%); COMPUTER SOFTWARE (85%);\r\nCHILDREN (78%); PUBLIC PRIVATE PARTNERSHIPS (78%); CHARITIES (78%); SPONSORSHIP\r\n(78%); FOUNDATIONS (78%); PHILANTHROPY (78%); EDUCATION SYSTEMS & INSTITUTIONS\r\n(78%); ALLIANCES & PARTNERSHIPS (77%); ENTERTAINMENT & ARTS (77%); PRODUCT\r\nINNOVATION (77%); WORKPLACE PROGRAMS (77%); SPORTS & RECREATION EVENTS (74%);\r\nSPORTS FANS (74%); AMERICAN FOOTBALL TOURNAMENTS (74%); LICENSING AGREEMENTS\r\n(74%); AMERICAN FOOTBALL (74%); SPORTS (74%); AGRICULTURE DEPARTMENTS (73%);\r\nLABOR FORCE (70%); EXECUTIVES (70%); BUSINESS ANALYTICS (67%); BUSINESS SOFTWARE\r\n(62%) NY-GENYOUth-SAP; CHI Children\'s Related News; LIC Licensing and Marketing\r\nAgreements\r\n\r\nCOMPANY:]"""
# sheesh that's a big string literal! Let's take a few lines to breathe.
# after all, we have to give the interpreter enough
# time
# to
# process all that
# data we just fed it
#
# ...right?
# take the text between the first and last colon, split it on ";",
# and map each label to its trailing "(NN%)" token with the "(%)" stripped
values = {' '.join(item[:-1]):item[-1].strip("(%)")
          for full_list in text.split(":")[1:-1]
          for element in full_list.split(";")[:-1]
          for item in [element.strip().split()]}
for key,value in values.items():
    print("{:35}: {}".format(key,value))
# AMERICAN FOOTBALL TOURNAMENTS : 74
# LABOR FORCE : 70
# BUSINESS ANALYTICS : 67
# COMPUTER SOFTWARE : 85
# CHARITIES : 78
# AMERICAN FOOTBALL : 74
# BUSINESS SOFTWARE (62%) : NY-GENYOUth-SAP
# FOUNDATIONS : 78
# CHILDREN : 78
# SPORTS : 74
# SPONSORSHIP : 78
# EDUCATION SYSTEMS & INSTITUTIONS : 78
# PUBLIC PRIVATE PARTNERSHIPS : 78
# SPORTS & RECREATION EVENTS : 74
# NUTRITION : 90
# ALLIANCES & PARTNERSHIPS : 77
# ENTERTAINMENT & ARTS : 77
# PRESS RELEASES : 91
# WORKPLACE PROGRAMS : 77
# VENTURE CAPITAL : 90
# CHI Children's Related : News
# STUDENTS & STUDENT LIFE : 90
# AGRICULTURE DEPARTMENTS : 73
# EXERCISE & FITNESS : 90
# ENTREPRENEURSHIP : 93
# NONPROFIT ORGANIZATIONS : 90
# PRODUCT INNOVATION : 77
# SPORTS FANS : 74
# PHILANTHROPY : 78
# LICENSING AGREEMENTS : 74
# PREVENTION & WELLNESS : 90
# EXECUTIVES : 70
Now I know what you're saying. "adsmith," you begin, "but look at the values in 'CHI Children's Related' and 'BUSINESS SOFTWARE (62%)', that's clearly wrong!!" I can't help your input being poorly formatted, no one can. CHI Children's Related has a value of News because that entry has no percentage at all; that's not your fault and it's not my fault. They neglected to put a ; between (62%) and NY-GENYOUth-SAP, and we don't take the blame for that either.
Conclusion: On second thought, let's not go to the re module. 'Tis a silly place.
You are not the first to bang your head here: http://docs.python.org/2/howto/regex.html#the-backslash-plague
You need four backslashes to match one literal backslash when the pattern is written as an ordinary string literal (two if you use a raw string, r'...'). That said, I like using an interactive tool to perfect the regex, such as Regex Coach: http://www.weitz.de/regex-coach/
If you don't want to do the silly four backslashes, copy the text from your external tool and use re.compile(re.escape(string)): http://docs.python.org/2/library/re.html#re.escape
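A minimal sketch of the backslash rules above, using only the standard re module. The short sample strings are made up for illustration. The second half rests on an assumption: if the document was typed as an ordinary Python string literal, \' stores just an apostrophe and no backslash, so a character class would need ' rather than \\ to get past it.

```python
import re

# A string holding one real backslash: a\b (3 characters).
target = "a\\b"

# Ordinary string pattern: four backslashes in source -> one literal backslash matched.
four = re.search("a\\\\b", target)
# Raw string pattern: two backslashes are enough.
raw = re.search(r"a\\b", target)
# re.escape builds the escaped pattern for you.
escaped = re.search(re.escape(target), target)

# Assumption: the sample is an ordinary literal, so \' is only an apostrophe.
sample = 'CHI Children\'s Related News'
has_backslash = "\\" in sample

# A class that allows a backslash but not ' stops at the apostrophe;
# adding ' to the class lets the match continue.
without_quote = re.match(r"[A-Za-z\s\\]+", sample).group()
with_quote = re.match(r"[A-Za-z\s\\']+", sample).group()

print(four is not None, raw is not None, escaped is not None)
print(has_backslash)
print(repr(without_quote))
print(repr(with_quote))
```

If the real documents do contain a literal backslash, the \\ already in the raw-string character class covers it; checking "\\" in the text first tells you which case you are in.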
Your question is a little vague, so I am not totally sure what you are looking for.
udoc = "SUBJECT: ENTREPRENEURSHIP (93%); PRESS RELEASES (91%); NUTRITION (90%); STUDENTS\r\n& STUDENT LIFE (90%); PREVENTION & WELLNESS (90%); EXERCISE & FITNESS (90%);\r\nVENTURE CAPITAL (90%); NONPROFIT ORGANIZATIONS (90%); COMPUTER SOFTWARE (85%);\r\nCHILDREN (78%); PUBLIC PRIVATE PARTNERSHIPS (78%); CHARITIES (78%); SPONSORSHIP\r\n(78%); FOUNDATIONS (78%); PHILANTHROPY (78%); EDUCATION SYSTEMS & INSTITUTIONS\r\n(78%); ALLIANCES & PARTNERSHIPS (77%); ENTERTAINMENT & ARTS (77%); PRODUCT\r\nINNOVATION (77%); WORKPLACE PROGRAMS (77%); SPORTS & RECREATION EVENTS (74%);\r\nSPORTS FANS (74%); AMERICAN FOOTBALL TOURNAMENTS (74%); LICENSING AGREEMENTS\r\n(74%); AMERICAN FOOTBALL (74%); SPORTS (74%); AGRICULTURE DEPARTMENTS (73%);\r\nLABOR FORCE (70%); EXECUTIVES (70%); BUSINESS ANALYTICS (67%); BUSINESS SOFTWARE\r\n(62%) NY-GENYOUth-SAP; CHI Children\'s Related News; LIC Licensing and Marketing\r\nAgreements\r\n\r\nCOMPANY:"
Notice the change from a list to a string. It seems to me you are looking for everything between the colons:
s = udoc.split(':')[1]
Then you might need to mess around with the individual items:
mylist = [item for item in s.split(';')]
To clean them up a little (this collapses the \r\n line breaks and extra spaces into single spaces):
newlist = []
for item in mylist:
    newlist.append(' '.join(item.split()))
You can get rid of the last word (COMPANY in this case) with some easy manipulation:
newlist[-1] = ' '.join(newlist[-1].split()[:-1])
Finally, if you want the results as a string, just join newlist with some separator.
Why not:
import re
# re.DOTALL lets . match across the \r\n line breaks in the text
re.search(r'SUBJECT:(.+)COMPANY:', udoc2, re.DOTALL)
You don't have to use regex for this. In this case it seems like there is a much easier solution. Why not get the index of "COMPANY:]" and then get everything up to that?
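The index-based idea can be sketched roughly as follows. The function text_before, its parameters, and the shortened sample string are hypothetical names made up for illustration. The end label is a parameter because the question says it is not always COMPANY:, and falling back to the end of the document when the label is missing is an assumption about the desired behaviour.

```python
# Shortened, made-up sample in the same shape as the question's document.
udoc2 = ("SUBJECT: NUTRITION (90%); STUDENTS\r\n& STUDENT LIFE (90%); "
         "CHI Children's Related News\r\n\r\nCOMPANY: ACME")


def text_before(doc, start_label="SUBJECT:", end_label="COMPANY:"):
    """Return the text from start_label up to (not including) end_label."""
    start = doc.find(start_label)
    if start == -1:
        return None  # no SUBJECT: section in this document
    end = doc.find(end_label, start)
    if end == -1:
        end = len(doc)  # assumption: no end label means "take the rest"
    return doc[start:end].replace("\r\n", "\n").strip()


indextext = text_before(udoc2)
print(indextext)
```

Because this is plain slicing, apostrophes, backslashes, and any other characters in the section are kept without needing to be listed anywhere.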
How about this?
(SUBJECT\:.*\:)
You can see how it works at http://regex101.com/r/aB7nJ2
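A rough sketch of trying this pattern in Python, assuming the standard re module. There, . does not match a newline unless re.DOTALL is set, so the flag is an addition needed for the multi-line text and is not part of the pattern as posted. The shortened sample string is made up for illustration.

```python
import re

# Shortened, made-up sample in the same shape as the question's document.
udoc2 = ("SUBJECT: NUTRITION (90%); STUDENTS\r\n& STUDENT LIFE (90%); "
         "CHI Children's Related News\r\n\r\nCOMPANY: ACME")

# Without DOTALL, . cannot cross the \n, and the first line has no later colon.
no_flag = re.search(r"(SUBJECT:.*:)", udoc2)

# With DOTALL, the greedy .* runs through to the last colon in the text.
match = re.search(r"(SUBJECT:.*:)", udoc2, re.DOTALL)
captured = match.group(1)

print(no_flag)
print(repr(captured))
```

Since .* is greedy, the capture ends at the last colon in the whole document, which is the COMPANY: label here but would be a later label if the document has more sections after it.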