How to remove URLs from a string - Python

Currently I have many rows in one column similar to the string below. In Python I have run the following code to remove <a, href=, and the URL itself:
df["text"] = df["text"].str.replace(r'\s*https?://\S+(\s+|$)', ' ').str.strip()
df["text"] = df["text"].str.replace(r'\s*href=//\S+(\s+|$)', ' ').str.strip()
However, the output remains unchanged. Please advise.
<p>On 4 May 2019, The Financial Times (FT) reported that Huawei is planning to build a '400-person chip research and development factory' outside Cambridge. Planned to be operational by 2021, the factory will include an R&D centre and will be built on a 550-acre site reportedly purchased by Huawei in 2018 for £37.5 million. A Huawei spokesperson quoted in the FT article cited Huawei's long-term collaboration with Cambridge University, which includes a five-year, £25 million research partnership with BT, which launched a joint research group at the University of Cambridge. Read more about that partnership on this map.</p>
<p>In 2020 it was reported that the Huawei research and development center received approval by a local council despite the nation’s ongoing security concerns around the Chinese company.</p>
<p>Chinese state media later reported that Huawei's expansion in Cambridge 'is part of a five-year, £3 billion investment plan for the UK that [Huawei] announced alongside [then] British Prime Minister Theresa May' in February 2018.</p>

IIUC you want to replace the following HTML tags:
<p>, <a, href=, and the url
Code
df['text'] = df.text.replace(regex = {r'<p>': ' ', r'</p>': '', r'<a.*?\/a>': '+'})
Explanation
The regex dictionary performs the following substitutions:
<p> is replaced by ' '
<a href= ... </a> is replaced by '+'
</p> is replaced by ''
Example
Create Data
s = '''<p>On 4 May 2019, The Financial Times (FT) reported that Huawei is planning to build a '400-person chip research and development factory' outside Cambridge. Planned to be operational by 2021, the factory will include an R&D centre and will be built on a 550-acre site reportedly purchased by Huawei in 2018 for £37.5 million. A Huawei spokesperson quoted in the FT article cited Huawei's long-term collaboration with Cambridge University, which includes a five-year, £25 million research partnership with BT, which launched a joint research group at the University of Cambridge. Read more about that partnership on this map.</p>
<p>In 2020 it was reported that the Huawei research and development center received approval by a local council despite the nation’s ongoing security concerns around the Chinese company.</p>
<p>Chinese state media later reported that Huawei's expansion in Cambridge 'is part of a five-year, £3 billion investment plan for the UK that [Huawei] announced alongside [then] British Prime Minister Theresa May' in February 2018.</p>'''
data = {'text':s.split('\n')}
df = pd.DataFrame(data)
print(df.text[0]) # show first row pre-replacement
# Perform replacements
df['text'] = df.text.replace(regex = {r'<p>': ' ', r'</p>': '', r'<a.*?\/a>': '+'})
print(df.text[0]) # show first row post replacement
Output
The first row only
Before replacement
On 4 May 2019, The Financial Times (FT) reported that Huawei is planning to build a
'400-person chip research and development factory' outside Cambridge.
Planned to be operational by 2021, the factory will include an R&D
centre and will be built on a 550-acre site reportedly purchased by
Huawei in 2018 for £37.5 million. A Huawei spokesperson quoted
in the FT article cited Huawei's long-term collaboration with
Cambridge University, which includes a five-year, £25 million
research partnership with BT, which launched a joint research group at
the University of Cambridge. Read more about that partnership on this
map.
After replacement
On 4 May 2019, + (FT) reported that Huawei is planning to build a
'400-person chip research and development factory' outside Cambridge.
Planned to be operational by 2021, the factory will include an R&D
centre and will be built on a 550-acre site reportedly purchased by
Huawei in 2018 for £37.5 million. A Huawei spokesperson quoted
in the FT article cited Huawei's long-term collaboration with
Cambridge University, which includes a five-year, £25 million
research partnership with BT, which launched a joint research group at
the University of Cambridge. Read more about that partnership on this +

You can use the following regex pattern instead:
<a href=(.*?)">
I successfully tested this using your test string on regex101.
Full code:
import re
df["text"] = df["text"].str.replace(r'<a href=(.*?)">', "", regex=True).str.strip()

I don't think your regex is quite right. Try:
df["text"] = df["text"].str.replace(r'<a href=\".*?\">', ' ', regex=True).str.strip()
df["text"] = df["text"].str.replace(r'</a>', ' ', regex=True).str.strip()
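One common reason the replacements appear to do nothing: in recent pandas (2.0+), Series.str.replace treats the pattern as a literal string unless regex=True is passed. A minimal sketch combining the patterns above (the sample string here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"text": [
    'See <a href="https://example.com/page">this map</a> for details.'
]})

# regex=True is required in pandas >= 2.0 for the pattern to be
# treated as a regular expression rather than a literal string.
df["text"] = (
    df["text"]
    .str.replace(r'<a\s+href="[^"]*">', '', regex=True)  # opening anchor tag + URL
    .str.replace(r'</a>', '', regex=True)                # closing anchor tag
    .str.replace(r'\s*https?://\S+', ' ', regex=True)    # any bare URLs
    .str.strip()
)
print(df["text"][0])  # -> See this map for details.
```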

Related

Beautifulsoup class not contain multiple 'strings'

Is there a way to scrape <p> tags whose classes do not contain certain strings? Here's my code so far (after combining code samples and researching Stack Overflow):
import requests
import bs4
import re
url = 'https://www.sp2.upenn.edu/person/amy-hillier/'
req = requests.get(url).text
soup = bs4.BeautifulSoup(req,'html.parser')
regex = re.compile('^((?!Header|header|button|Root|root|logo|Title|title|Foot|foot|Publish|Story|story|Stories|stories|Link|link|color|space|email|address|download|capital).)*$')
for texts in soup.find_all('div'):
    for i in texts.findAll('p', {'class': regex}):
        print(i)
My thought process is that I've created a regex listing strings that, if present in the class, mean the scraper should skip that paragraph. To put it simply, if any of these words appear in the class attribute, don't scrape it.
Someone also recommended that I use CSS selector syntax with the :not() pseudo-class and the * contains operator, which I interpreted as:
for texts in soup.find_all('div'):
    for i in texts.select('p[class]:not([class*="Header|header|button|Root|root|logo|Title|title|Foot|foot|Publish|Story|story|Stories|stories|Link|link|color|space|email|address|download|capital"])'):
        print(i)
Unfortunately, neither of them works. Any help is greatly appreciated!
Edit
Adding examples of text:
<p class="sub has-white-color has-normal-font-size tw-pb-5">
The world needs leaders equipped with tools to make a difference. The School of Social Policy & Practice (SP2) will prepare you to become one of those leaders, as a policy maker, practitioner, educator, activist, and more.
</p>
<p>
Amy Hillier (she/her/her) currently teaches introductory-level GIS (mapping) courses for SP2 and Urban Studies program and chairs the MSW racism course sequence. Her doctoral and post-doctoral research focused on historical mortgage redlining. For more than a decade, her research focused on links between the built environment and public health. During that time, her primary faculty position was with the Department of City & Regional Planning in the Weitzman School of Design. She moved to SP2 in 2017 in order to pursue new research interests relating to LGBTQ communities, particularly trans youth. She is the founding director of the LGBTQ Certificate.
</p>
<p class="Paragraph-sc-1mxv4ns-0 bGbcwt">
Dr. Sahingur joined Penn Dental Medicine in September 2019 as Associate Dean of Graduate Studies and Student Research, providing leadership, strategic vision, and oversight to support and expand the graduate studies and student research endeavors at the School. She will be overseeing the Summer Student Research Program for the summer of 2020. Originally from Istanbul, Turkey, she received her DDS from Istanbul University, Turkey, in 1994 and then moved to the U.S. for her postgraduate education. She completed all of her postgraduate training at State University of New York at Buffalo, receiving a Master of Science degree in Oral Sciences in 1999 and then a PhD in Oral Biology with a clinical certificate in Periodontics in 2004.
</p>
I need to scrape the second and third paragraphs. My logic is that since the first paragraph's class has the word 'color' in it, I can exclude it. The rest of the words listed in the regex variable are the words I have found that need to be excluded across multiple URLs. I hope that clarifies my question.
Perhaps you can use a custom function when searching for the right <p> tags. For example:
from bs4 import BeautifulSoup
html_doc = """\
<p class="sub has-white-color has-normal-font-size tw-pb-5">
The world needs leaders equipped with tools to make a difference. The School of Social Policy & Practice (SP2) will prepare you to become one of those leaders, as a policy maker, practitioner, educator, activist, and more.
</p>
<p>
Amy Hillier (she/her/her) currently teaches introductory-level GIS (mapping) courses for SP2 and Urban Studies program and chairs the MSW racism course sequence. Her doctoral and post-doctoral research focused on historical mortgage redlining. For more than a decade, her research focused on links between the built environment and public health. During that time, her primary faculty position was with the Department of City & Regional Planning in the Weitzman School of Design. She moved to SP2 in 2017 in order to pursue new research interests relating to LGBTQ communities, particularly trans youth. She is the founding director of the LGBTQ Certificate.
</p>
<p class="Paragraph-sc-1mxv4ns-0 bGbcwt">
Dr. Sahingur joined Penn Dental Medicine in September 2019 as Associate Dean of Graduate Studies and Student Research, providing leadership, strategic vision, and oversight to support and expand the graduate studies and student research endeavors at the School. She will be overseeing the Summer Student Research Program for the summer of 2020. Originally from Istanbul, Turkey, she received her DDS from Istanbul University, Turkey, in 1994 and then moved to the U.S. for her postgraduate education. She completed all of her postgraduate training at State University of New York at Buffalo, receiving a Master of Science degree in Oral Sciences in 1999 and then a PhD in Oral Biology with a clinical certificate in Periodontics in 2004.
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")
words = ["color"]
for p in soup.find_all(
    lambda t: t.name == "p"
    and all(w not in c.lower() for c in t.get("class", []) for w in words)
):
    print(p)
    print("-" * 80)
Prints:
<p>
Amy Hillier (she/her/her) currently teaches introductory-level GIS (mapping) courses for SP2 and Urban Studies program and chairs the MSW racism course sequence. Her doctoral and post-doctoral research focused on historical mortgage redlining. For more than a decade, her research focused on links between the built environment and public health. During that time, her primary faculty position was with the Department of City & Regional Planning in the Weitzman School of Design. She moved to SP2 in 2017 in order to pursue new research interests relating to LGBTQ communities, particularly trans youth. She is the founding director of the LGBTQ Certificate.
</p>
--------------------------------------------------------------------------------
<p class="Paragraph-sc-1mxv4ns-0 bGbcwt">
Dr. Sahingur joined Penn Dental Medicine in September 2019 as Associate Dean of Graduate Studies and Student Research, providing leadership, strategic vision, and oversight to support and expand the graduate studies and student research endeavors at the School. She will be overseeing the Summer Student Research Program for the summer of 2020. Originally from Istanbul, Turkey, she received her DDS from Istanbul University, Turkey, in 1994 and then moved to the U.S. for her postgraduate education. She completed all of her postgraduate training at State University of New York at Buffalo, receiving a Master of Science degree in Oral Sciences in 1999 and then a PhD in Oral Biology with a clinical certificate in Periodontics in 2004.
</p>
--------------------------------------------------------------------------------
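For completeness, the CSS-selector route the asker attempted can also work: the | alternation is not valid inside an attribute selector, but chaining one :not([class*=...]) per excluded word does what was intended, and it also keeps <p> tags that have no class at all. A minimal sketch (the short HTML below is a made-up stand-in for the real markup):

```python
from bs4 import BeautifulSoup

html_doc = """\
<p class="sub has-white-color has-normal-font-size">skip me</p>
<p>keep me</p>
<p class="Paragraph-sc-1mxv4ns-0 bGbcwt">keep me too</p>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Chain one :not() per excluded word; a <p> without a class attribute
# does not match [class*=...], so :not() keeps it.
words = ["color"]
selector = "p" + "".join(f':not([class*="{w}"])' for w in words)
for p in soup.select(selector):
    print(p)
```

Extending words to the full exclusion list from the question builds the longer selector automatically.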

How do I solve an AttributeError in Google Colab

I had written the following code in Google Colab. Earlier it was working fine, but now it raises an AttributeError ('Page' object has no attribute 'rotationMatrix') at the convert line.
Content of the Apollo 2019 file:
ORDINARY BUSINESS:
To consider and adopt:
a. the audited financial statement of the Company for the financial year ended March 31, 2019, the reports of the Board of
Directors and Auditors thereon; and
b. the audited consolidated financial statement of the Company for the financial year ended March 31, 2019 and report of
Auditors thereon.
To declare dividend of 3.25 per equity share, for the financial year ended March 31, 2019.
To appoint Mr.Robert Steinmetz (DIN: 00178792), who retires by rotation, and being eligible, offers himselfforre-appointment
and in this regard to consider and if thought fit, to pass the following resolution as a Special Resolution:-
“RESOLVED THAT pursuant to provisions of Section 152 and all other applicable provisions of the Companies Act, 2013
and Regulation 17(1A) of SEBI (Listing Obligations & Disclosure Requirements) Regulations, 2015, and other applicable
provisions, if any, (including any statutory modification(s) or re-enactment thereof, for the time being in force), consent of the
Members of the Company be and is hereby accorded to re-appoint, Mr. Robert Steinmetz (DIN: 00178792), Director, aged 79
years, who retires by rotation and being eligible offers himself for re-appointment, as a Director of the Company, liable to
retire by rotation.”
To appoint a Director in place of Mr. Francesco Gori (DIN: 07413105), who retires by rotation, and being eligible, offers
himself for re-appointment.
Please provide a solution if anyone knows one.
!pip install python-docx
!pip install pdf2docx
from pdf2docx import Converter
from docx import Document
from google.colab import drive
drive.mount('/content/drive/')
file_name='/content/drive/MyDrive/Colab Notebooks/PDF files/Apollo 2019.pdf'
word_name='Apollo 2019.docx'
cv=Converter(file_name)
cv.convert(word_name)
cv.close()
Element is a base class processing coordinates, so set rotation matrix globally
--> 279 Element.set_rotation_matrix(self.fitz_page.rotationMatrix)
280
281 return raw_layout
AttributeError: 'Page' object has no attribute 'rotationMatrix'
This is the error which I am getting.
This is not a fix per se, but I went in and commented out line 279 in the RawPage.py file. After that, pdf2docx still executed properly and returned a file. Good luck!
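This error usually points to a version mismatch rather than a bug in your code: newer PyMuPDF releases renamed the camelCase Page.rotationMatrix property to rotation_matrix, so an older pdf2docx that still calls the old name fails. Aligning the two packages is worth trying before patching library files; the exact pin below is an assumption, so adjust it to match your pdf2docx release:

```shell
# Option 1: upgrade pdf2docx so it uses the renamed PyMuPDF API
pip install --upgrade pdf2docx

# Option 2: pin an older PyMuPDF that still exposes Page.rotationMatrix
# (the version bound here is an assumption; check your pdf2docx changelog)
pip install "pymupdf<1.19"
```

In Colab, remember to restart the runtime after changing package versions.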

Python/Selenium - how to extract text from modal fade content?

I want to extract bios from a list of people from the website:
https://blueprint.connectiv.com/speakers/
I want to extract their title, company, and bio. However, bio is available only when you click each photo from the website.
Below is my coding to extract the title & company:
driver.find_element_by_xpath("//*[@id='speakers']/div/div/div/div/div/div/div").text.split('\n')
Can anyone help me extract the bios for each person? Any advice is appreciated!
You do not have to click the images as all the modals for each speaker are fully populated in the source. You can extract the content from these modals by using driver.execute_script:
from selenium import webdriver
d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://blueprint.connectiv.com/speakers/')
results = d.execute_script("""
  var people = [];
  for (var i of document.querySelectorAll('.modal.speakerCard')) {
    people.push({
      name: i.querySelector('.description h4').textContent,
      title: i.querySelector('p.title').textContent,
      company: i.querySelector('p.company').textContent,
      bio: i.querySelector('p.bio').textContent,
    });
  }
  return people;
""")
Output (first 20 results):
[{'bio': 'Andrew is a recovering consultant turned serial entrepreneur, startup mentor and angel investor. He is the Managing Director of Dreamit Urbantech, investing in Proptech and Construction Tech. Andrew has written for Fortune, Forbes, Propmodo, CREtech, Builders Online, Architect Magazine, Multifamily Executive, AlleyWatch, Edsurge, The 74 Million, et. al. Andrew founded two companies and has a keen appreciation for how hard it is to build a successful startup, even under the best of circumstances.', 'company': 'Dreamit Ventures', 'name': 'Andrew Ackerman', 'title': 'Venture Partner'}, {'bio': 'Salman Ahmad is the CEO and co-founder of Mosaic, a construction technology company focused on making homebuilding scalable. By standardizing the process (homebuilding) and not the product (homes), Mosaic is delivering places people love and creating better communities. Salman holds a PhD in Electrical Engineering and Computer Science from MIT, focusing on Programming Language Design for Service-Oriented Systems, an MS in Computer Science from Stanford University focusing on Human Computer Interaction, and a BSE in Computer Systems Engineering from Arizona State University.\u2028 He also has 20 technical publications and patents in the areas of software systems, programming languages, machine learning, human-computer interaction, and sensor hardware. With a passion for construction, software, and computer science, Salman co-founded Mosaic to build places people love and make them widely available. ', 'company': 'Mosaic', 'name': 'Salman Ahmad', 'title': 'CEO and Co-Founder '}, {'bio': 'Dafna Akiva is a 10+ year veteran in the real estate investment, development, management and construction industries. Before assuming the role of Chief Revenue Officer at Veev, Dafna oversaw day-to-day operations and drove a number of company-scaling initiatives as Chief Operating Officer. 
Now, as Chief Revenue Officer, Dafna leads the development of new Veev projects that redefine customers’ living experiences, and drive revenue growth for the company’s bottom line. She oversees all real estate acquisitions and operation strategies, the real estate developments and account management, as well as sales, marketing, legal and HR.', 'company': 'Veev', 'name': 'Dafna Akiva', 'title': 'CRO & Co-Founder'}, {'bio': 'Min Alexander serves as CEO of PunchListUSA, the real estate platform digitizing home inspections for online ordering of repairs and lifecycle services. For the past decade, Min has been driving digital disruption to democratize real estate. She has led two national B2B2C platforms, field operations and created a top-10 U.S. brokerage, transforming the industry to increase access, quality and transparency.\n\nPrior to joining PunchListUSA, Min served as COO for Auction.com, as CEO and President of REALHome Services and Solutions and as SVP of Real Estate Services at Altisource. Min holds a BA from Duke and MBA from MIT. ', 'company': 'PunchlistUSA', 'name': 'Min Alexander', 'title': 'CEO & Co-Founder'}, {'bio': 'Nora Apsel is the Co-founder and CEO of Morty, the online mortgage marketplace. Morty provides homebuyers a place to evaluate competitive offers from multiple lenders, then lock and close their loans through an automated platform. Founded and led by engineers, Morty uses technology to forge a new path in mortgage: fully digital, free of legacy infrastructure, and backed by the flexible, scalable capital base of traditional lenders. As CEO, Nora is leading the Morty team through rapid, product-driven growth and nationwide expansion. Morty is a venture-backed company whose investors include Thrive Capital, Lerer Hippeau, MetaProp, March Capital, Prudence Holdings, FJ Labs and Rethink Impact. Trained as a software engineer before becoming an operator, Nora holds a M.S. in Computer Science from the University of Pennsylvania and a B.S. 
from Emory University.', 'company': 'Morty', 'name': 'Nora Apsel', 'title': 'CEO & Co-Founder'}, {'bio': 'Carey Armstrong is the co-founder and chief revenue officer of Tomo, a fintech startup that will provide the most customer-centric way to buy a home. Tomo was founded in the fall of 2020, raising an initial seed round of $40 million led by Ribbit Capital, NFX and Zigg Capital.\n\nCarey’s focus is on defining and delivering a delightful home buying experience for Tomo customers. She leads the development of our core transactional product offering as well as the growth and evolution of the business units that support it, including mortgage and brokerage. \n\nBefore co-founding Tomo, Carey was Vice President, Premier Agent, at Zillow Group, where she led business strategy, product strategy, and core operations for the $1B buyer services business. In this capacity, she was responsible for major leaps forward with initiatives including Connections, Home Tours, and Flex Select teams. \n\nPrior to Zillow, Carey was a strategy consultant and industry analyst with Boston Consulting Group and Forrester Research, respectively. Carey has a B.A. from Harvard University and an M.B.A. from the Tuck School of Business at Dartmouth. She and her family reside in Seattle.', 'company': 'Tomo', 'name': 'Carey Armstrong', 'title': 'CRO & Co-Founder'}, {'bio': 'Arie is the founder and CEO of WiredScore, the pioneer behind the international WiredScore certification system that evaluates and distinguishes best-in-class Internet connectivity in commercial buildings. Prior to founding WiredScore, Arie worked as a consultant with the Boston Consulting Group in New York City where he focused on the technology and media industries. 
Arie holds an MBA from the Wharton School and a BA and BS in Business and Political Science from the University of California, Berkeley.', 'company': 'WiredScore', 'name': 'Arie Barendrecht', 'title': 'CEO & Founder'}, {'bio': 'Demetrios Barnes is the Chief Operating Officer of SmartRent, where he leads the client engagement, supply chain and field operations teams. With over a decade of experience in property management operations, he is passionate about helping owners and operators understand the innovations technology can produce, while forging strong interpersonal relationships and participating in thought leadership discussions. Prior to co-founding SmartRent, he was Vice President of Technology for Colony Starwood Homes, Previously, Mr. Barnes was Director of Property Management and Technology with Beazer Pre-Owned Rental Homes, and a Regional Manager for several multifamily companies. Mr. Barnes holds a Bachelor of Science in Business Administration from Arizona State University.', 'company': 'SmartRent', 'name': 'Demetrios Barnes', 'title': 'COO & Co-Founder'}, {'bio': "Ryan J. S. Baxter is PropTech Advisor to the New York State Energy Research and Development Authority (NYSERDA), Cofounder of the PropTech Challenge, NYC Community Growth Lead for MetaProp NYC, and the founder of PASSNYC. Previously, Ryan served as a Vice President at the Real Estate Board of New York (REBNY). He is a native New Yorker who works passionately to make the City's built environment more educational.\n", 'company': 'Proptech Challenge', 'name': 'Ryan Baxter', 'title': 'Co-Founder'}, {'bio': 'Gary is CEO of Roofstock, a leading real estate investment marketplace which he co-founded in 2015. Gary has spent most of his career building businesses in the real estate, hospitality and tech sectors. After earning his BA in economics from Northwestern, Gary ventured west to earn his MBA from Stanford, where he caught the entrepreneurial bug and still serves as a regular guest lecturer. 
Previously Gary was instrumental in acquiring and integrating more than $800 million of resort properties for KSL Resorts, and spent five years as CFO of online brokerage pioneer ZipRealty, which he led through its successful IPO in 2004. Gary also served as CEO of Joie de Vivre Hospitality, then the second largest boutique hotel management company in the country. Immediately before starting Roofstock, Gary led one of the largest single-family rental platforms in the U.S. through its IPO as co-CEO of Starwood Waypoint Residential Trust, now part of Invitation Homes.', 'company': 'Roofstock', 'name': 'Gary Beasley', 'title': 'CEO & Co-Founder '}, {'bio': "Robyn has a track record of taking sophisticated climate and clean energy-related technical concepts and transforming them into commercially-oriented strategies that lead to impact, scale and results. She began her career in 2004 at Google in Mt View, CA, reporting directly to the co-founders working on strategic initiatives as they took the company public. Robyn went on to found Google's first business unit focused on incorporating clean energy generation across the company's global operations. In this capacity, she oversaw and catalyzed Google’s first clean energy initiatives, including large-scale clean energy procurement for data centers and the development and installation of a 1.7MW rooftop solar installation at the Mountain View HQ. Since then she has built, invested in, and raised $50M+ for new ventures and programs for Vestas Wind A/S in Copenhagen, Dean Kamen at DEKA R&D, and NRG Energy. Most recently she was an executive at Lennar Corp, where she built the firm’s first corporate venture platform while incubating Blueprint Power Technologies. Today, Robyn Beavers is the CEO and co-founder of Blueprint Power, a NYC-based real estate tech company that turns buildings into revenue-generating clean power plants. Robyn was named EY’s NY Entrepreneur of the Year in 2020. Robyn holds both a B.S. 
in Civil Engineering and an MBA from Stanford University.", 'company': 'Blueprint Power', 'name': 'Robyn Beavers', 'title': 'CEO & Co-Founder'}, {'bio': 'Liza Benson is a Partner with Moderne Ventures and helps lead and manage investment activity with particular focus on high-growth technology companies that can achieve rapid adoption and scale. Moderne Ventures is an early stage investment fund and industry immersion program which is focused on investing in technology companies in and around the multi-trillion dollar industries of real estate, mortgage, finance, insurance and home services.\n\nPrior to Moderne, Liza was a Partner with StarVest Partners, a $400M venture fund focused on expansion stage B2B SaaS investments. Previously, Liza was a Managing Director in the growth equity group at Highbridge Principals Strategies, a multi-billion asset manager. Before her experience at Highbridge, Liza was a Managing Director with Bear Stearns’ Constellation Growth Capital and an investment banker at Patricof & Co and First Union where she started her career.', 'company': 'Moderne Ventures', 'name': 'Liza Benson', 'title': 'Partner'}, {'bio': 'Jeremy Bernard is the CEO, North America at essensys, the world’s leading provider of software and technology to the flexible real estate industry. He has over 25 years of experience in the real estate and technology sectors. Most recently, Jeremy was the Global Head of Real Estate for Knotel where he grew and oversaw a portfolio of 5.5MM sq ft of flexible office space around the world. In previous roles, he has held C-level positions at real estate investment firms and launched several proptech companies. Jeremy resides in Westport, CT with his wife Jamie, daughter Morgan and son Brody.', 'company': 'essensys', 'name': 'Jeremy Bernard', 'title': 'CEO, North America'}, {'bio': "Benjamin Birnbaum is a Partner at Keyframe – a NYC based investment firm. 
His focus is primarily on how technology is causing market change across a number of physical infrastructure categories, like transportation and energy, inspired by earlier career experiences as an operating leader for one of the world's largest passenger transportation companies. Ben is also a co-founder of TeraWatt Infrastructure, a specialized owner of electric vehicle charging infrastructure focused on fleet electrification. ", 'company': 'Keyframe Capital', 'name': 'Ben Birnbaum', 'title': 'Partner'}, {'bio': 'Sean is the Co-Founder & CEO of BLACK, a tech-powered and cloud based CRE brokerage platform based in NYC. Prior to founding BLACK, Sean served as EVP of Real Estate and Enterprise Sales at WeWork, He has been involved in millions of square feet of commercial real estate leasing transactions over his 20 year tenure, and has worked at many of the world’s largest commercial brokerage firms including Cushman & Wakefield, JLL, Newmark, and Grubb & Ellis. ', 'company': 'BlackRE', 'name': 'Sean Black', 'title': 'CEO & Co-Founder'}, {'bio': 'As chief operating officer of CA Student Living, Steve Boyack is responsible for driving the performance and growth of CASL’s property management platform, as well as overseeing its corporate operational functions including technology, human resources, communications and culture. Steve leverages his decades of experience in the industry to develop and advance the people, processes and technologies that form the foundation of the business.\n\nBoyack previously served as global head of property management for CA Ventures, a parent company of CA Student Living, where he laid the foundation for the firm’s European student operating platform (Novel Student), global sustainability initiative, wellness program and innovation department. 
Prior to joining CA, Steve was a senior managing director at Greystar where he was responsible for overseeing real estate operations and leading the expansion of the company’s footprint in key Midwest markets. In addition, he oversaw Greystar’s national construction and maintenance operations and worked with their global innovation team.\n\nSteve earned a BS in Economics from the University of Iowa and a CPM® designation from the Institute of Real Estate Management. As a\xa0member of several industry advisory boards and associations, Steve is a\xa0recognized subject matter expert and thought leader, with particular focus on integrated property technology.', 'company': 'CA Ventures', 'name': 'Steve Boyack', 'title': 'COO, Student Living'}, {'bio': 'Laura Cain is the CEO and co-founder of Willow Servicing, a technology company focused on streamlining mortgage servicing. Willow’s platform automates core workflows, enabling lenders to provide digital-first borrower experiences while reducing operational costs and ensuring compliance with industry policies & regulations. Prior to Willow, Laura was a product manager at Snapdocs, where she built out their initial eClose product offering to lenders, and a venture investor at Thomvest, where she focused on early stage fintech investments.', 'company': 'Willow Servicing', 'name': 'Laura Cain', 'title': 'CEO & Co-Founder'}, {'bio': 'Madhu Chamarty is the co-founder and CEO of BeyondHQ, a startup that helps companies plan and scale distributed teams. An engineer and math nerd at heart, he has 15+ yrs of startup experience in Silicon Valley, as an early employee and co-founder at 3 high-growth B2B startups in digital media (Adify - Cox acq. # $300MM), employee communities (Dynamic Signal), and geospatial analytics (Descartes Labs). He has scaled sales & support teams globally, in both colocated and remote formats. 
He grew up in a fully distributed family across 4 countries, so believes he was destined to build BeyondHQ even before he knew it.', 'company': 'BeyondHQ', 'name': 'Madhu Chamarty', 'title': 'CEO & Co-Founder'}, {'bio': 'Alex Chatzielftheriou is a Greek entrepreneur and CEO and co-founder of Blueground — a real estate tech company founded in 2013. Blueground provides a network of fully-furnished, move-in ready apartments in 14 cities across the globe for stays of a month, a year, or longer. Having lived and worked in more than 15 cities around the world, Alex sought to provide business and leisure travelers with a hassle-free way to find places that feel like home — to show up and start living from day one. Along the way, Alex disrupted the traditional lease model, enabling flexible living to encourage travel and exploration of the world and its cultures while providing a place to feel "grounded" and call home. ', 'company': 'Blueground', 'name': 'Alex Chatzieleftheriou', 'title': 'CEO & Co-Founder'}, {'bio': 'Jit Kee Chin is the Chief Data & Innovation Officer and Executive Vice President at Suffolk. Ms. Chin is responsible for leveraging big data and advanced analytics to improve the organization’s core business. Ms. Chin is also responsible for helping to position Suffolk to achieve its vision of transforming the construction experience while working closely with the company’s Innovation and Strategy teams to fundamentally reinvent the future of construction in the digital age. \n\nPrior to her role at Suffolk, Ms. Chin spent 10 years with management consulting firm McKinsey and Company where she counseled senior executives on strategic, commercial and advanced analytics topics. Most recently, she was a Senior Expert in Analytics in McKinsey’s Boston office where she specialized in the design and implementation of end- to-end analytics transformations. Prior to that role, Ms. 
Chin was an Associate Principal in McKinsey’s London office where she helped organizations drive multi-year business transformations and change programs and developing strategies for profitable growth.', 'company': 'Suffolk Construction', 'name': 'Jit Kee Chin', 'title': 'Chief Data & Innovation Officer'}]
In pandas:
import pandas as pd
df = pd.DataFrame(results)
print(df)
Output:
bio company name title
0 Andrew is a recovering consultant turned seria... Dreamit Ventures Andrew Ackerman Venture Partner
1 Salman Ahmad is the CEO and co-founder of Mosa... Mosaic Salman Ahmad CEO and Co-Founder
2 Dafna Akiva is a 10+ year veteran in the real ... Veev Dafna Akiva CRO & Co-Founder
3 Min Alexander serves as CEO of PunchListUSA, t... PunchlistUSA Min Alexander CEO & Co-Founder
4 Nora Apsel is the Co-founder and CEO of Morty,... Morty Nora Apsel CEO & Co-Founder
.. ... ... ... ...
128 Ms. Wong joined Tishman Speyer in 2015. Jenny ... Tishman Speyer Jenny Wong Managing Director
129 Joseph is the Founder and CEO of Neighbor.com,... Neighbor Joseph Woodbury CEO & Founder
130 Based in Palo Alto, Michael Yang is a Managing... OMERS Ventures Michael Yang Managing Partner
131 Since joining RET Ventures as Partner in 2019,... RET Ventures Christopher Yip Partner & Managing Director
132 Chris Zlocki, Global Head of Client Experience... Colliers Chris Zlocki EVP, Occupier Services
[133 rows x 4 columns]
Instead of driver.execute_script, you can use BeautifulSoup:
from bs4 import BeautifulSoup as soup
from selenium import webdriver
d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://blueprint.connectiv.com/speakers/')
s = soup(d.page_source, 'html.parser').select('.modal.speakerCard')
r = [dict(zip(['name', 'title', 'company', 'bio'],
              [b.text for b in i.select(':is(h4, p.title, p.company, p.bio)')]))
     for i in s]
If all the information you are looking for is within a paragraph tag <p> that has a class of bio (so <p class='bio'>), and all the modals are already present in the source code, then you can simply select all with:
bios = driver.find_elements_by_xpath('//p[@class="bio"]')
That will select all <p> elements whose class attribute equals 'bio' and return them in a list. If some of the p tags carry other classes as well (e.g. <p class='bio someotherclass'>), then you will need to use the contains() function in your XPath, like so:
bios = driver.find_elements_by_xpath('//p[contains(@class, "bio")]')
You can then loop through the results like so:
for bio in bios:
    print(bio.text)
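The exact-class XPath can be demonstrated outside the browser with the standard library's ElementTree, which supports this subset of XPath (the contains() predicate, however, needs a full XPath engine such as Selenium's or lxml's). Note also that newer Selenium releases (4+) removed find_elements_by_xpath in favour of driver.find_elements(By.XPATH, ...). The HTML snippet below is made up for illustration:

```python
import xml.etree.ElementTree as ET

# Toy markup standing in for the page source
root = ET.fromstring(
    '<div>'
    '<p class="bio">First bio</p>'
    '<p class="title">A title</p>'
    '<p class="bio">Second bio</p>'
    '</div>'
)

# Same predicate as the Selenium call: every <p> whose class equals "bio"
bios = root.findall('.//p[@class="bio"]')
print([p.text for p in bios])  # ['First bio', 'Second bio']
```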

can't commit data to the database due to unknown characters in python

I am scraping some websites and storing the data in my database. Sometimes I get a "character maps to <undefined>" error, which I think is due to non-ASCII characters. Since I am scraping many websites with text in different languages, I have not been able to solve the issue in a general, efficient way.
An example of the error:
Message: 'commit exception GRANTS.GOV'
Arguments: (UnicodeEncodeError('charmap', 'The Embassy of the United States in Nur-Sultan and the Consulate General of the United States in Almaty announces an open competition for past participants (“alumni”) of U.S. government-funded and U.S. government-sponsored exchange programs to submit applications to the 2021 Alumni Engagement Innovation Fund (AEIF) 2021.\xa0\xa0We seek proposals from teams of at least two alumni that meet all program eligibility requirements below. Exchange alumni interested in participating in AEIF 2021 should submit proposals to KazakhstanAlumni#state.gov\xa0by March 31, 2021, 18:00 Nur-Sultan time.\xa0\nAEIF provides alumni of U.S. sponsored and facilitated exchange programs with funding to expand on skills gained during their exchange experience to design and implement innovative solutions to global challenges facing their community. Since its inception in 2011, AEIF has funded nearly 500 alumni-led projects around the world through a competitive global competition.\n\nThis year, the U.S. Mission to Kazakhstan will accept proposals managed by teams of at least two (2) alumni that support the following theme:\n\u25cf\xa0\xa0\xa0\xa0\xa0\xa0Mental health awareness, promotion of mental wellbeing and resiliency.\nGoals. Projects may support one or more of the following goals:\nGoal 1: Increase in public understanding of mental health issues,\xa0its signs and strategies for providing timely help;\nGoal 2: Increase in public understanding of resources, methods, and tools that promote mental health and resiliency, especially among at-risk audiences; American best practices to promote mental health.\nGoal 3: Combatting stigma around mental health issues and dispelling common myths.\n\nFor full package of required forms please Related Documents section.', 1098, 1099, 'character maps to <undefined>'),)
My code:
title = '..............'
description = '......'
op = Op(
    website='',
    op_link='',
    title=title,              # may be a long text coming from websites
    description=description,  # may be a long text coming from websites
    organization_id=org_id,
    close_date='',
    checksum=singleData['checksum'],
    published_date='',
    language_id=lang_id,
    is_open=1)
try:
    session.add(op)
    session.commit()
    session.flush()
....
....
Please note: it should work on Linux; my MySQL database is on a Linux system.
I mostly face the issue with title and description, which can be in many languages and of any length. How can I encode them correctly so that I don't get an error while committing to the database?
Thank you
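One approach, sketched below under assumptions the question doesn't confirm (SQLAlchemy with a MySQL backend): make sure the connection charset is utf8mb4 so MySQL can store any Unicode text, and normalize the scraped strings before inserting as a defensive fallback. The clean_text helper and the connection URL are hypothetical, not part of the asker's code:

```python
import unicodedata

def clean_text(raw: str) -> str:
    # NFKC folds compatibility characters: the non-breaking space '\xa0'
    # (which appears in the error above) becomes a plain space, while
    # genuine non-ASCII letters in other languages are preserved.
    return unicodedata.normalize('NFKC', raw)

# Hypothetical SQLAlchemy URL -- the charset parameter is the key part:
# engine = create_engine('mysql+pymysql://user:pw@host/db?charset=utf8mb4')

print(clean_text('AEIF\xa02021'))  # 'AEIF 2021'
```

With utf8mb4 on both the connection and the column, title and description in any language should commit without a manual encode step.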

PyPDF2 won't extract all text from PDF

I'm trying to extract text from a PDF (https://www.sec.gov/litigation/admin/2015/34-76574.pdf) using PyPDF2, and the only result I'm getting is the following string:
b''
Here is my code:
import PyPDF2
import urllib.request
import io
url = 'https://www.sec.gov/litigation/admin/2015/34-76574.pdf'
remote_file = urllib.request.urlopen(url).read()
memory_file = io.BytesIO(remote_file)
read_pdf = PyPDF2.PdfFileReader(memory_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(1)
page_content = page.extractText()
print(page_content.encode('utf-8'))
This code worked correctly on a few of the PDFs I'm working with (e.g. https://www.sec.gov/litigation/admin/2016/34-76837-proposed-amended-distribution-plan.pdf), but the others like the file above didn't work. Any idea what's wrong?
I don't know why pypdf2 can't extract the information from that PDF, but the package pdftotext can:
import pdftotext
from six.moves.urllib.request import urlopen
import io
url = 'https://www.sec.gov/litigation/admin/2015/34-76574.pdf'
remote_file = urlopen(url).read()
memory_file = io.BytesIO(remote_file)
pdf = pdftotext.PDF(memory_file)
# Iterate over all the pages
for page in pdf:
    print(page)
Extracted
UNITED STATES OF AMERICA
Before the
SECURITIES AND EXCHANGE COMMISSION
SECURITIES EXCHANGE ACT OF 1934
Release No. 76574 / December 7, 2015
ADMINISTRATIVE PROCEEDING
File No. 3-16987
ORDER INSTITUTING CEASE-AND-DESIST
In the Matter of PROCEEDINGS, PURSUANT TO SECTION
21C OF THE SECURITIES EXCHANGE ACT
KEFEI WANG OF 1934, MAKING FINDINGS, AND
IMPOSING REMEDIAL SANCTIONS AND A
Respondent. CEASE-AND-DESIST ORDER
I.
The Securities and Exchange Commission (“Commission”) deems it appropriate and in the
public interest that cease-and-desist proceedings be, and hereby are, instituted pursuant to 21C of
the Securities Exchange Act of 1934 (“Exchange Act”) against Kefei Wang (“Respondent”).
II.
In anticipation of the institution of these proceedings, Respondent has submitted an Offer
of Settlement (the “Offer”) which the Commission has determined to accept. Solely for the
purpose of these proceedings and any other proceedings brought by or on behalf of the
Commission, or to which the Commission is a party, and without admitting or denying the findings
herein, except as to the Commission’s jurisdiction over him and the subject matter of these
proceedings, which are admitted, and except as provided herein in Section V, Respondent consents
to the entry of this Order Instituting Cease-and-Desist Proceedings, Pursuant to Section 21C of the
Securities Exchange Act of 1934, Making Findings, and Imposing Remedial Sanctions and a
Cease-and-Desist Order (“Order”), as set forth below.
III.
On the basis of this Order and Respondent’s Offer, the Commission finds1 that:
Summary
1. Respondent violated Section 15(a)(1) of the Exchange Act by acting as an
unregistered broker-dealer in connection with his representation of clients who were seeking U.S.
residency through the Immigrant Investor Program. Respondent helped effect certain individuals’
securities purchases in an EB-5 Regional Center. Respondent received a commission from that
Regional Center for each investment he facilitated.
Respondent
2. Kefei Wang, age 39, is a resident of China. During the relevant time period, he was
a U.S. resident and an owner of Nautilus Global Capital, LLC , a now defunct entity that was based
in Fremont, California.
Background
3. The United States Congress created the Immigrant Investor Program, also known as
“EB-5,” in 1990 to stimulate the U.S. economy through job creation and capital investment by
foreign investors. The Program offers EB-5 visas to individuals who invest $1 million in a new
commercial enterprise that creates or preserves at least 10 full-time jobs for qualifying U.S.
workers (or $500,000 in an enterprise located in a rural area or an area of high unemployment). A
certain number of EB-5 visas are set aside for investors in approved Regional Centers. A Regional
Center is defined as “any economic unit, public or private, which is involved with the promotion of
economic growth, including increased export sales, improved regional productivity, job creation,
and increased domestic capital investment.” 8 C.F.R. § 204.6(e) (2015).
4. Typical Regional Center investment vehicles are offered as limited partnership
interests. The partnership interests are securities, usually offered pursuant to one or more
exemptions from the registration requirements of the U.S. securities laws. The Regional Centers
are often managed by a person or entity which acts as a general partner of the limited partnership.
The Regional Centers, the investment vehicles, and the managers are collectively referred to herein
as “EB-5 Investment Offerers.”
5. Various EB-5 Investment Offerers paid commissions to anyone who successfully
sold limited partnership interests to new investors.
1
The findings herein are made pursuant to Respondent’s Offer of Settlement and are not
binding on any other person or entity in this or any other proceeding.
2
Respondent Received Commissions for His Clients’ EB-5 Investments
6. From at least January 2010 through May 2014, Respondent received a portion of
commissions from one EB-5 Investment Offerer totaling $40,000. The commissions constituted
his portion of the commissions that were paid pursuant to a written Agency Agreement between
Nautilus Global Capital and the EB-5 Investment Offerer. On one or more occasions the
commission was paid to a foreign bank account identified by the Respondent despite the fact that
the Respondent was U.S.-based during the relevant time period.
7. Respondent performed activities necessary to effectuate the transaction, including
recommending the specific EB-5 Investment Offerer referenced in paragraph 6 to his clients;
acting as a liaison between the EB-5 Investment Offerer and the investors; and facilitating the
transfer and/or documentation of investment funds to the EB-5 Investment Offerer. Respondent
received his portion of transaction-based commissions due to Nautilus Global Capital for its
services from that EB-5 Investment Offerer.
8. As a result of the conduct described above, Respondent violated Section 15(a)(1) of
the Exchange Act which makes it unlawful for any broker or dealer which is either a person other
than a natural person or a natural person not associated with a broker or dealer to make use of the
mails or any means or instrumentality of interstate commerce “to effect any transactions in, or to
induce or attempt to induce the purchase or sale of, any security” unless such broker or dealer is
registered in accordance with Section 15(b) of the Exchange Act.
IV.
In view of the foregoing, the Commission deems it appropriate to impose the sanctions
agreed to in Respondent Kefei Wang’s Offer.
Accordingly, pursuant to Section 21C of the Exchange Act, it is hereby ORDERED that:
A. Respondent shall cease and desist from committing or causing any violations and
any future violations of Section 15(a)(1) of the Exchange Act.
B. Respondent shall, within ten (10) days of the entry of this Order, pay disgorgement
of $40,000, prejudgment interest of $1,590, and a civil money penalty of $25,000 to the Securities
and Exchange Commission for transfer to the general fund of the United States Treasury in
accordance with Exchange Act Section 21F(g)(3). If timely payment of disgorgement and
prejudgment interest is not made, additional interest shall accrue pursuant to SEC Rule of Practice
600 [17 C.F.R. § 201.600]. If timely payment of the civil money penalty is not made, additional
interest shall accrue pursuant to 31 U.S.C. § 3717. Payment must be made in one of the following
ways:
(1) Respondent may transmit payment electronically to the Commission, which will
provide detailed ACH transfer/Fedwire instructions upon request;
3
(2) Respondent may make direct payment from a bank account via Pay.gov through the
SEC website at http://www.sec.gov/about/offices/ofm.htm; or
(3) Respondent may pay by certified check, bank cashier’s check, or United States
postal money order, made payable to the Securities and Exchange Commission and
hand-delivered or mailed to:
Enterprise Services Center
Accounts Receivable Branch
HQ Bldg., Room 181, AMZ-341
6500 South MacArthur Boulevard
Oklahoma City, OK 73169
Payments by check or money order must be accompanied by a cover letter identifying
Kefei Wang as a Respondent in these proceedings, and the file number of these proceedings; a
copy of the cover letter and check or money order must be sent to Stephen L. Cohen, Associate
Director, Division of Enforcement, Securities and Exchange Commission, 100 F St., NE,
Washington, DC 20549-5553.
V.
It is further Ordered that, solely for purposes of exceptions to discharge set forth in Section
523 of the Bankruptcy Code, 11 U.S.C. § 523, the findings in this Order are true and admitted by
Respondent, and further, any debt for disgorgement, prejudgment interest, civil penalty or other
amounts due by Respondent under this Order or any other judgment, order, consent order, decree
or settlement agreement entered in connection with this proceeding, is a debt for the violation by
Respondent of the federal securities laws or any regulation or order issued under such laws, as set
forth in Section 523(a)(19) of the Bankruptcy Code, 11 U.S.C. § 523(a)(19).
By the Commission.
Brent J. Fields
Secretary
4
I think there might be an issue with how you are extracting the pages. Try making a loop and calling each page separately, like so:
for i in range(number_of_pages):
    page_obj = read_pdf.getPage(i)
    page_content = page_obj.extractText()
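That page-by-page loop can also flag pages whose text layer is empty, which is often what an all-empty result like b'' indicates. A minimal sketch, run here against stub objects that mimic PyPDF2's old getPage/extractText API rather than a real PDF:

```python
def extract_all_pages(reader, page_count):
    # Collect text page by page; mark pages that yield nothing so a blank
    # result points at the PDF's text layer rather than at the loop.
    chunks = []
    for i in range(page_count):
        text = reader.getPage(i).extractText()
        chunks.append(text if text.strip() else '[page %d: no extractable text]' % i)
    return '\n'.join(chunks)
```

If every page comes back marked, the PDF likely has no extractable text layer for PyPDF2, and a different backend (such as pdftotext, as shown above) is the way forward.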
