I'm trying to extract text from a PDF (https://www.sec.gov/litigation/admin/2015/34-76574.pdf) using PyPDF2, and the only result I'm getting is the following string:
b''
Here is my code:
import PyPDF2
import urllib.request
import io
url = 'https://www.sec.gov/litigation/admin/2015/34-76574.pdf'
remote_file = urllib.request.urlopen(url).read()
memory_file = io.BytesIO(remote_file)
read_pdf = PyPDF2.PdfFileReader(memory_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(1)
page_content = page.extractText()
print(page_content.encode('utf-8'))
This code worked correctly on a few of the PDFs I'm working with (e.g. https://www.sec.gov/litigation/admin/2016/34-76837-proposed-amended-distribution-plan.pdf), but the others like the file above didn't work. Any idea what's wrong?
I don't know why PyPDF2 can't extract the text from that PDF, but the pdftotext package can:
import pdftotext
from six.moves.urllib.request import urlopen
import io
url = 'https://www.sec.gov/litigation/admin/2015/34-76574.pdf'
remote_file = urlopen(url).read()
memory_file = io.BytesIO(remote_file)
pdf = pdftotext.PDF(memory_file)
# Iterate over all the pages
for page in pdf:
    print(page)
Extracted
UNITED STATES OF AMERICA
Before the
SECURITIES AND EXCHANGE COMMISSION
SECURITIES EXCHANGE ACT OF 1934
Release No. 76574 / December 7, 2015
ADMINISTRATIVE PROCEEDING
File No. 3-16987
ORDER INSTITUTING CEASE-AND-DESIST
In the Matter of PROCEEDINGS, PURSUANT TO SECTION
21C OF THE SECURITIES EXCHANGE ACT
KEFEI WANG OF 1934, MAKING FINDINGS, AND
IMPOSING REMEDIAL SANCTIONS AND A
Respondent. CEASE-AND-DESIST ORDER
I.
The Securities and Exchange Commission (“Commission”) deems it appropriate and in the
public interest that cease-and-desist proceedings be, and hereby are, instituted pursuant to 21C of
the Securities Exchange Act of 1934 (“Exchange Act”) against Kefei Wang (“Respondent”).
II.
In anticipation of the institution of these proceedings, Respondent has submitted an Offer
of Settlement (the “Offer”) which the Commission has determined to accept. Solely for the
purpose of these proceedings and any other proceedings brought by or on behalf of the
Commission, or to which the Commission is a party, and without admitting or denying the findings
herein, except as to the Commission’s jurisdiction over him and the subject matter of these
proceedings, which are admitted, and except as provided herein in Section V, Respondent consents
to the entry of this Order Instituting Cease-and-Desist Proceedings, Pursuant to Section 21C of the
Securities Exchange Act of 1934, Making Findings, and Imposing Remedial Sanctions and a
Cease-and-Desist Order (“Order”), as set forth below.
III.
On the basis of this Order and Respondent’s Offer, the Commission finds1 that:
Summary
1. Respondent violated Section 15(a)(1) of the Exchange Act by acting as an
unregistered broker-dealer in connection with his representation of clients who were seeking U.S.
residency through the Immigrant Investor Program. Respondent helped effect certain individuals’
securities purchases in an EB-5 Regional Center. Respondent received a commission from that
Regional Center for each investment he facilitated.
Respondent
2. Kefei Wang, age 39, is a resident of China. During the relevant time period, he was
a U.S. resident and an owner of Nautilus Global Capital, LLC , a now defunct entity that was based
in Fremont, California.
Background
3. The United States Congress created the Immigrant Investor Program, also known as
“EB-5,” in 1990 to stimulate the U.S. economy through job creation and capital investment by
foreign investors. The Program offers EB-5 visas to individuals who invest $1 million in a new
commercial enterprise that creates or preserves at least 10 full-time jobs for qualifying U.S.
workers (or $500,000 in an enterprise located in a rural area or an area of high unemployment). A
certain number of EB-5 visas are set aside for investors in approved Regional Centers. A Regional
Center is defined as “any economic unit, public or private, which is involved with the promotion of
economic growth, including increased export sales, improved regional productivity, job creation,
and increased domestic capital investment.” 8 C.F.R. § 204.6(e) (2015).
4. Typical Regional Center investment vehicles are offered as limited partnership
interests. The partnership interests are securities, usually offered pursuant to one or more
exemptions from the registration requirements of the U.S. securities laws. The Regional Centers
are often managed by a person or entity which acts as a general partner of the limited partnership.
The Regional Centers, the investment vehicles, and the managers are collectively referred to herein
as “EB-5 Investment Offerers.”
5. Various EB-5 Investment Offerers paid commissions to anyone who successfully
sold limited partnership interests to new investors.
1
The findings herein are made pursuant to Respondent’s Offer of Settlement and are not
binding on any other person or entity in this or any other proceeding.
2
Respondent Received Commissions for His Clients’ EB-5 Investments
6. From at least January 2010 through May 2014, Respondent received a portion of
commissions from one EB-5 Investment Offerer totaling $40,000. The commissions constituted
his portion of the commissions that were paid pursuant to a written Agency Agreement between
Nautilus Global Capital and the EB-5 Investment Offerer. On one or more occasions the
commission was paid to a foreign bank account identified by the Respondent despite the fact that
the Respondent was U.S.-based during the relevant time period.
7. Respondent performed activities necessary to effectuate the transaction, including
recommending the specific EB-5 Investment Offerer referenced in paragraph 6 to his clients;
acting as a liaison between the EB-5 Investment Offerer and the investors; and facilitating the
transfer and/or documentation of investment funds to the EB-5 Investment Offerer. Respondent
received his portion of transaction-based commissions due to Nautilus Global Capital for its
services from that EB-5 Investment Offerer.
8. As a result of the conduct described above, Respondent violated Section 15(a)(1) of
the Exchange Act which makes it unlawful for any broker or dealer which is either a person other
than a natural person or a natural person not associated with a broker or dealer to make use of the
mails or any means or instrumentality of interstate commerce “to effect any transactions in, or to
induce or attempt to induce the purchase or sale of, any security” unless such broker or dealer is
registered in accordance with Section 15(b) of the Exchange Act.
IV.
In view of the foregoing, the Commission deems it appropriate to impose the sanctions
agreed to in Respondent Kefei Wang’s Offer.
Accordingly, pursuant to Section 21C of the Exchange Act, it is hereby ORDERED that:
A. Respondent shall cease and desist from committing or causing any violations and
any future violations of Section 15(a)(1) of the Exchange Act.
B. Respondent shall, within ten (10) days of the entry of this Order, pay disgorgement
of $40,000, prejudgment interest of $1,590, and a civil money penalty of $25,000 to the Securities
and Exchange Commission for transfer to the general fund of the United States Treasury in
accordance with Exchange Act Section 21F(g)(3). If timely payment of disgorgement and
prejudgment interest is not made, additional interest shall accrue pursuant to SEC Rule of Practice
600 [17 C.F.R. § 201.600]. If timely payment of the civil money penalty is not made, additional
interest shall accrue pursuant to 31 U.S.C. § 3717. Payment must be made in one of the following
ways:
(1) Respondent may transmit payment electronically to the Commission, which will
provide detailed ACH transfer/Fedwire instructions upon request;
3
(2) Respondent may make direct payment from a bank account via Pay.gov through the
SEC website at http://www.sec.gov/about/offices/ofm.htm; or
(3) Respondent may pay by certified check, bank cashier’s check, or United States
postal money order, made payable to the Securities and Exchange Commission and
hand-delivered or mailed to:
Enterprise Services Center
Accounts Receivable Branch
HQ Bldg., Room 181, AMZ-341
6500 South MacArthur Boulevard
Oklahoma City, OK 73169
Payments by check or money order must be accompanied by a cover letter identifying
Kefei Wang as a Respondent in these proceedings, and the file number of these proceedings; a
copy of the cover letter and check or money order must be sent to Stephen L. Cohen, Associate
Director, Division of Enforcement, Securities and Exchange Commission, 100 F St., NE,
Washington, DC 20549-5553.
V.
It is further Ordered that, solely for purposes of exceptions to discharge set forth in Section
523 of the Bankruptcy Code, 11 U.S.C. § 523, the findings in this Order are true and admitted by
Respondent, and further, any debt for disgorgement, prejudgment interest, civil penalty or other
amounts due by Respondent under this Order or any other judgment, order, consent order, decree
or settlement agreement entered in connection with this proceeding, is a debt for the violation by
Respondent of the federal securities laws or any regulation or order issued under such laws, as set
forth in Section 523(a)(19) of the Bankruptcy Code, 11 U.S.C. § 523(a)(19).
By the Commission.
Brent J. Fields
Secretary
4
I think there might be an issue with how you are extracting the pages. Try looping over the pages and extracting each one separately, like so:
for i in range(number_of_pages):
    # read_pdf and number_of_pages come from the question's code
    page_obj = read_pdf.getPage(i)
    page_text = page_obj.extractText()
    print(page_text)
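For what it's worth, here is a minimal sketch that combines the two suggestions in this thread: read every page, and fall back to a second library when the first one returns only blank text. The helper name first_nonempty_text and the extractor callables are hypothetical and not part of PyPDF2 or pdftotext; each extractor is assumed to take the raw PDF bytes and return a list of page strings.

```python
from typing import Callable, Iterable, List

# Hypothetical type: a function that turns PDF bytes into a list of page texts.
Extractor = Callable[[bytes], List[str]]


def first_nonempty_text(data: bytes, extractors: Iterable[Extractor]) -> List[str]:
    """Return the pages from the first extractor that yields any non-blank page."""
    for extract in extractors:
        try:
            pages = extract(data)
        except Exception:
            # One library failing on a file should not stop the fallback chain.
            continue
        if any(page.strip() for page in pages):
            return pages
    # No extractor produced usable text.
    return []


# Stand-in extractors for illustration only; real ones would wrap
# PyPDF2 (getPage(i).extractText() for each i) and pdftotext.PDF(...).
def blank_extractor(data: bytes) -> List[str]:
    return ["", "   "]


def working_extractor(data: bytes) -> List[str]:
    return ["UNITED STATES OF AMERICA", "page two"]


pages = first_nonempty_text(b"%PDF-", [blank_extractor, working_extractor])
print(len(pages))
```

Note that page indices in PyPDF2 are zero-based, so getPage(1) in the question reads the second page, not the first.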
Currently I have many rows in one column similar to the string below. In Python, I ran the following code to remove <p>, <a, href=, and the URL itself:
df["text"] = df["text"].str.replace(r'\s*https?://\S+(\s+|$)', ' ').str.strip()
df["text"] = df["text"].str.replace(r'\s*href=//\S+(\s+|$)', ' ').str.strip()
However, the output continues to remain the same. Please advise.
<p>On 4 May 2019, The Financial Times (FT) reported that Huawei is planning to build a '400-person chip research and development factory' outside Cambridge. Planned to be operational by 2021, the factory will include an R&D centre and will be built on a 550-acre site reportedly purchased by Huawei in 2018 for £37.5 million. A Huawei spokesperson quoted in the FT article cited Huawei's long-term collaboration with Cambridge University, which includes a five-year, £25 million research partnership with BT, which launched a joint research group at the University of Cambridge. Read more about that partnership on this map.</p>
<p>In 2020 it was reported that the Huawei research and development center received approval by a local council despite the nation’s ongoing security concerns around the Chinese company.</p>
<p>Chinese state media later reported that Huawei's expansion in Cambridge 'is part of a five-year, £3 billion investment plan for the UK that [Huawei] announced alongside [then] British Prime Minister Theresa May' in February 2018.</p>
IIUC you want to replace the following html tags:
<p>, <a, href=, and the url
Code
df['text'] = df.text.replace(regex = {r'<p>': ' ', r'</p>': '', r'<a.*?\/a>': '+'})
Explanation
Regex dictionary does the following substitutions
<p> replaced by ' '
<a href = .../a> replaced by '+'
</p> replaced by ''
Example
Create Data
import pandas as pd
s = '''<p>On 4 May 2019, The Financial Times (FT) reported that Huawei is planning to build a '400-person chip research and development factory' outside Cambridge. Planned to be operational by 2021, the factory will include an R&D centre and will be built on a 550-acre site reportedly purchased by Huawei in 2018 for £37.5 million. A Huawei spokesperson quoted in the FT article cited Huawei's long-term collaboration with Cambridge University, which includes a five-year, £25 million research partnership with BT, which launched a joint research group at the University of Cambridge. Read more about that partnership on this map.</p>
<p>In 2020 it was reported that the Huawei research and development center received approval by a local council despite the nation’s ongoing security concerns around the Chinese company.</p>
<p>Chinese state media later reported that Huawei's expansion in Cambridge 'is part of a five-year, £3 billion investment plan for the UK that [Huawei] announced alongside [then] British Prime Minister Theresa May' in February 2018.</p>'''
data = {'text':s.split('\n')}
df = pd.DataFrame(data)
print(df.text[0]) # show first row pre-replacement
# Perform replacements
df['text'] = df.text.replace(regex = {r'<p>': ' ', r'</p>': '', r'<a.*?\/a>': '+'})
print(df.text[0]) # show first row post replacement
Output
The first row only
Before replacement
On 4 May 2019, The Financial Times (FT) reported that Huawei is planning to build a
'400-person chip research and development factory' outside Cambridge.
Planned to be operational by 2021, the factory will include an R&D
centre and will be built on a 550-acre site reportedly purchased by
Huawei in 2018 for £37.5 million. A Huawei spokesperson quoted
in the FT article cited Huawei's long-term collaboration with
Cambridge University, which includes a five-year, £25 million
research partnership with BT, which launched a joint research group at
the University of Cambridge. Read more about that partnership on this
map.
Post Replacement
On 4 May 2019, + (FT) reported that Huawei is planning to build a
'400-person chip research and development factory' outside Cambridge.
Planned to be operational by 2021, the factory will include an R&D
centre and will be built on a 550-acre site reportedly purchased by
Huawei in 2018 for £37.5 million. A Huawei spokesperson quoted
in the FT article cited Huawei's long-term collaboration with
Cambridge University, which includes a five-year, £25 million
research partnership with BT, which launched a joint research group at
the University of Cambridge. Read more about that partnership on this +
You can use the following regex pattern instead:
<a href=(.*?)">
I successfully tested this using your test string on regex101.
Full code:
import re
df["text"] = df["text"].str.replace(r'<a href=(.*?)">', "", regex=True).str.strip()
I don't think your regex is quite right. Try:
df["text"] = df["text"].str.replace(r'<a href=\".*?\">', ' ', regex=True).str.strip()
df["text"] = df["text"].str.replace(r'</a>', ' ', regex=True).str.strip()
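In case it helps to test a pattern outside pandas, here is a standard-library sketch of the same idea. The sample string, its URL, and the helper name strip_tags are made up for illustration, since the anchor tags from the original rows are not shown above. Unlike the '+' replacement suggested earlier, this version removes only the tags and keeps the link text.

```python
import re

# Hypothetical sample row; the anchor tag and URL are invented for this sketch.
sample = (
    '<p>On 4 May 2019, <a href="https://example.com/ft">The Financial Times</a> '
    '(FT) reported ...</p>'
)

# Matches <p>, </p>, an opening <a ...> tag with any attributes, or </a>.
TAG_PATTERN = re.compile(r'</?p>|<a\s[^>]*>|</a>')


def strip_tags(text: str) -> str:
    """Remove paragraph and anchor tags, then collapse runs of whitespace."""
    without_tags = TAG_PATTERN.sub(' ', text)
    return re.sub(r'\s+', ' ', without_tags).strip()


cleaned = strip_tags(sample)
print(cleaned)
```

On a DataFrame, the same function could be applied row by row with df["text"].map(strip_tags).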
Is there a way to scrape only the p tags whose class does not contain any of several words? Here's my code so far (pieced together from other code and from researching Stack Overflow):
import requests
import bs4
import re
url = 'https://www.sp2.upenn.edu/person/amy-hillier/'
req = requests.get(url).text
soup = bs4.BeautifulSoup(req,'html.parser')
regex = re.compile('^((?!Header|header|button|Root|root|logo|Title|title|Foot|foot|Publish|Story|story|Stories|stories|Link|link|color|space|email|address|download|capital).)*$')
for texts in soup.find_all('div'):
    for i in texts.findAll('p',{'class': regex}):
        print(i)
My idea was to build a regex listing strings that, if present in the class attribute, cause the paragraph to be skipped. Put simply, if any of these words appears in a tag's class, don't scrape that tag.
Someone also recommended that I use CSS selector syntax with the :not() pseudo-class and the *= (contains) operator, which I interpreted as:
for texts in soup.find_all('div'):
    for i in texts.select('p[class]:not([class*="Header|header|button|Root|root|logo|Title|title|Foot|foot|Publish|Story|story|Stories|stories|Link|link|color|space|email|address|download|capital"])'):
        print(i)
Unfortunately, neither of them works. Any help is greatly appreciated!
Edit
Adding examples of text:
<p class="sub has-white-color has-normal-font-size tw-pb-5">
The world needs leaders equipped with tools to make a difference. The School of Social Policy & Practice (SP2) will prepare you to become one of those leaders, as a policy maker, practitioner, educator, activist, and more.
</p>
<p>
Amy Hillier (she/her/her) currently teaches introductory-level GIS (mapping) courses for SP2 and Urban Studies program and chairs the MSW racism course sequence. Her doctoral and post-doctoral research focused on historical mortgage redlining. For more than a decade, her research focused on links between the built environment and public health. During that time, her primary faculty position was with the Department of City & Regional Planning in the Weitzman School of Design. She moved to SP2 in 2017 in order to pursue new research interests relating to LGBTQ communities, particularly trans youth. She is the founding director of the LGBTQ Certificate.
</p>
<p class="Paragraph-sc-1mxv4ns-0 bGbcwt">
Dr. Sahingur joined Penn Dental Medicine in September 2019 as Associate Dean of Graduate Studies and Student Research, providing leadership, strategic vision, and oversight to support and expand the graduate studies and student research endeavors at the School. She will be overseeing the Summer Student Research Program for the summer of 2020. Originally from Istanbul, Turkey, she received her DDS from Istanbul University, Turkey, in 1994 and then moved to the U.S. for her postgraduate education. She completed all of her postgraduate training at State University of New York at Buffalo, receiving a Master of Science degree in Oral Sciences in 1999 and then a PhD in Oral Biology with a clinical certificate in Periodontics in 2004.
</p>
I need to scrape the second and third paragraphs. My logic is since the first paragraph's class has the word 'color' in it, I can exclude that. The rest of the words that I listed on the regex variable are pretty much the words that I have found and needed to be excluded across multiple URLs. I hope that clarifies my question.
Perhaps you can use a custom function when searching for the right <p> tags. For example:
from bs4 import BeautifulSoup
html_doc = """\
<p class="sub has-white-color has-normal-font-size tw-pb-5">
The world needs leaders equipped with tools to make a difference. The School of Social Policy & Practice (SP2) will prepare you to become one of those leaders, as a policy maker, practitioner, educator, activist, and more.
</p>
<p>
Amy Hillier (she/her/her) currently teaches introductory-level GIS (mapping) courses for SP2 and Urban Studies program and chairs the MSW racism course sequence. Her doctoral and post-doctoral research focused on historical mortgage redlining. For more than a decade, her research focused on links between the built environment and public health. During that time, her primary faculty position was with the Department of City & Regional Planning in the Weitzman School of Design. She moved to SP2 in 2017 in order to pursue new research interests relating to LGBTQ communities, particularly trans youth. She is the founding director of the LGBTQ Certificate.
</p>
<p class="Paragraph-sc-1mxv4ns-0 bGbcwt">
Dr. Sahingur joined Penn Dental Medicine in September 2019 as Associate Dean of Graduate Studies and Student Research, providing leadership, strategic vision, and oversight to support and expand the graduate studies and student research endeavors at the School. She will be overseeing the Summer Student Research Program for the summer of 2020. Originally from Istanbul, Turkey, she received her DDS from Istanbul University, Turkey, in 1994 and then moved to the U.S. for her postgraduate education. She completed all of her postgraduate training at State University of New York at Buffalo, receiving a Master of Science degree in Oral Sciences in 1999 and then a PhD in Oral Biology with a clinical certificate in Periodontics in 2004.
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")
words = ["color"]
for p in soup.find_all(
    lambda t: t.name == "p"
    and all(w not in c.lower() for c in t.get("class", []) for w in words)
):
    print(p)
    print("-" * 80)
Prints:
<p>
Amy Hillier (she/her/her) currently teaches introductory-level GIS (mapping) courses for SP2 and Urban Studies program and chairs the MSW racism course sequence. Her doctoral and post-doctoral research focused on historical mortgage redlining. For more than a decade, her research focused on links between the built environment and public health. During that time, her primary faculty position was with the Department of City & Regional Planning in the Weitzman School of Design. She moved to SP2 in 2017 in order to pursue new research interests relating to LGBTQ communities, particularly trans youth. She is the founding director of the LGBTQ Certificate.
</p>
--------------------------------------------------------------------------------
<p class="Paragraph-sc-1mxv4ns-0 bGbcwt">
Dr. Sahingur joined Penn Dental Medicine in September 2019 as Associate Dean of Graduate Studies and Student Research, providing leadership, strategic vision, and oversight to support and expand the graduate studies and student research endeavors at the School. She will be overseeing the Summer Student Research Program for the summer of 2020. Originally from Istanbul, Turkey, she received her DDS from Istanbul University, Turkey, in 1994 and then moved to the U.S. for her postgraduate education. She completed all of her postgraduate training at State University of New York at Buffalo, receiving a Master of Science degree in Oral Sciences in 1999 and then a PhD in Oral Biology with a clinical certificate in Periodontics in 2004.
</p>
--------------------------------------------------------------------------------
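As a side note on the CSS attempt in the question: the *= operator does a plain substring match, so [class*="Header|header|..."] looks for that literal text including the | characters; there is no alternation. One possible workaround, sketched below, is to chain one :not() per word. The helper name build_exclusion_selector is hypothetical, and the sketch starts from p rather than p[class] on the assumption that paragraphs without any class should still match.

```python
from typing import Iterable


def build_exclusion_selector(tag: str, words: Iterable[str]) -> str:
    """Build a selector such as 'p:not([class*="color"]):not([class*="link"])'."""
    # One :not(...) per word; an element matches only if none of the words
    # occurs as a substring of its class attribute.
    parts = [f':not([class*="{word}"])' for word in words]
    return tag + "".join(parts)


selector = build_exclusion_selector("p", ["color", "link"])
print(selector)

# With BeautifulSoup (assumed to be installed), it could be used as:
#     for p in soup.select(selector):
#         print(p)
```

The lambda version above lowercases each class before comparing, whereas this selector compares the words as written, so the word list would need both spellings where case matters.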
I want to extract bios from a list of people from the website:
https://blueprint.connectiv.com/speakers/
I want to extract their title, company, and bio. However, bio is available only when you click each photo from the website.
Below is my code to extract the title and company:
driver.find_element_by_xpath("//*[@id='speakers']/div/div/div/div/div/div/div").text.split('\n')
Can anyone help me extract the bios for each person? Any advice is appreciated!
You do not have to click the images as all the modals for each speaker are fully populated in the source. You can extract the content from these modals by using driver.execute_script:
from selenium import webdriver
d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://blueprint.connectiv.com/speakers/')
results = d.execute_script("""
    var people = [];
    for (var i of document.querySelectorAll('.modal.speakerCard')){
        people.push({
            name:i.querySelector('.description h4').textContent,
            title:i.querySelector('p.title').textContent,
            company:i.querySelector('p.company').textContent,
            bio:i.querySelector('p.bio').textContent,
        });
    }
    return people;
""")
Output (first 20 results):
[{'bio': 'Andrew is a recovering consultant turned serial entrepreneur, startup mentor and angel investor. He is the Managing Director of Dreamit Urbantech, investing in Proptech and Construction Tech. Andrew has written for Fortune, Forbes, Propmodo, CREtech, Builders Online, Architect Magazine, Multifamily Executive, AlleyWatch, Edsurge, The 74 Million, et. al. Andrew founded two companies and has a keen appreciation for how hard it is to build a successful startup, even under the best of circumstances.', 'company': 'Dreamit Ventures', 'name': 'Andrew Ackerman', 'title': 'Venture Partner'}, {'bio': 'Salman Ahmad is the CEO and co-founder of Mosaic, a construction technology company focused on making homebuilding scalable. By standardizing the process (homebuilding) and not the product (homes), Mosaic is delivering places people love and creating better communities. Salman holds a PhD in Electrical Engineering and Computer Science from MIT, focusing on Programming Language Design for Service-Oriented Systems, an MS in Computer Science from Stanford University focusing on Human Computer Interaction, and a BSE in Computer Systems Engineering from Arizona State University.\u2028 He also has 20 technical publications and patents in the areas of software systems, programming languages, machine learning, human-computer interaction, and sensor hardware. With a passion for construction, software, and computer science, Salman co-founded Mosaic to build places people love and make them widely available. ', 'company': 'Mosaic', 'name': 'Salman Ahmad', 'title': 'CEO and Co-Founder '}, {'bio': 'Dafna Akiva is a 10+ year veteran in the real estate investment, development, management and construction industries. Before assuming the role of Chief Revenue Officer at Veev, Dafna oversaw day-to-day operations and drove a number of company-scaling initiatives as Chief Operating Officer. 
Now, as Chief Revenue Officer, Dafna leads the development of new Veev projects that redefine customers’ living experiences, and drive revenue growth for the company’s bottom line. She oversees all real estate acquisitions and operation strategies, the real estate developments and account management, as well as sales, marketing, legal and HR.', 'company': 'Veev', 'name': 'Dafna Akiva', 'title': 'CRO & Co-Founder'}, {'bio': 'Min Alexander serves as CEO of PunchListUSA, the real estate platform digitizing home inspections for online ordering of repairs and lifecycle services. For the past decade, Min has been driving digital disruption to democratize real estate. She has led two national B2B2C platforms, field operations and created a top-10 U.S. brokerage, transforming the industry to increase access, quality and transparency.\n\nPrior to joining PunchListUSA, Min served as COO for Auction.com, as CEO and President of REALHome Services and Solutions and as SVP of Real Estate Services at Altisource. Min holds a BA from Duke and MBA from MIT. ', 'company': 'PunchlistUSA', 'name': 'Min Alexander', 'title': 'CEO & Co-Founder'}, {'bio': 'Nora Apsel is the Co-founder and CEO of Morty, the online mortgage marketplace. Morty provides homebuyers a place to evaluate competitive offers from multiple lenders, then lock and close their loans through an automated platform. Founded and led by engineers, Morty uses technology to forge a new path in mortgage: fully digital, free of legacy infrastructure, and backed by the flexible, scalable capital base of traditional lenders. As CEO, Nora is leading the Morty team through rapid, product-driven growth and nationwide expansion. Morty is a venture-backed company whose investors include Thrive Capital, Lerer Hippeau, MetaProp, March Capital, Prudence Holdings, FJ Labs and Rethink Impact. Trained as a software engineer before becoming an operator, Nora holds a M.S. in Computer Science from the University of Pennsylvania and a B.S. 
from Emory University.', 'company': 'Morty', 'name': 'Nora Apsel', 'title': 'CEO & Co-Founder'}, {'bio': 'Carey Armstrong is the co-founder and chief revenue officer of Tomo, a fintech startup that will provide the most customer-centric way to buy a home. Tomo was founded in the fall of 2020, raising an initial seed round of $40 million led by Ribbit Capital, NFX and Zigg Capital.\n\nCarey’s focus is on defining and delivering a delightful home buying experience for Tomo customers. She leads the development of our core transactional product offering as well as the growth and evolution of the business units that support it, including mortgage and brokerage. \n\nBefore co-founding Tomo, Carey was Vice President, Premier Agent, at Zillow Group, where she led business strategy, product strategy, and core operations for the $1B buyer services business. In this capacity, she was responsible for major leaps forward with initiatives including Connections, Home Tours, and Flex Select teams. \n\nPrior to Zillow, Carey was a strategy consultant and industry analyst with Boston Consulting Group and Forrester Research, respectively. Carey has a B.A. from Harvard University and an M.B.A. from the Tuck School of Business at Dartmouth. She and her family reside in Seattle.', 'company': 'Tomo', 'name': 'Carey Armstrong', 'title': 'CRO & Co-Founder'}, {'bio': 'Arie is the founder and CEO of WiredScore, the pioneer behind the international WiredScore certification system that evaluates and distinguishes best-in-class Internet connectivity in commercial buildings. Prior to founding WiredScore, Arie worked as a consultant with the Boston Consulting Group in New York City where he focused on the technology and media industries. 
Arie holds an MBA from the Wharton School and a BA and BS in Business and Political Science from the University of California, Berkeley.', 'company': 'WiredScore', 'name': 'Arie Barendrecht', 'title': 'CEO & Founder'}, {'bio': 'Demetrios Barnes is the Chief Operating Officer of SmartRent, where he leads the client engagement, supply chain and field operations teams. With over a decade of experience in property management operations, he is passionate about helping owners and operators understand the innovations technology can produce, while forging strong interpersonal relationships and participating in thought leadership discussions. Prior to co-founding SmartRent, he was Vice President of Technology for Colony Starwood Homes, Previously, Mr. Barnes was Director of Property Management and Technology with Beazer Pre-Owned Rental Homes, and a Regional Manager for several multifamily companies. Mr. Barnes holds a Bachelor of Science in Business Administration from Arizona State University.', 'company': 'SmartRent', 'name': 'Demetrios Barnes', 'title': 'COO & Co-Founder'}, {'bio': "Ryan J. S. Baxter is PropTech Advisor to the New York State Energy Research and Development Authority (NYSERDA), Cofounder of the PropTech Challenge, NYC Community Growth Lead for MetaProp NYC, and the founder of PASSNYC. Previously, Ryan served as a Vice President at the Real Estate Board of New York (REBNY). He is a native New Yorker who works passionately to make the City's built environment more educational.\n", 'company': 'Proptech Challenge', 'name': 'Ryan Baxter', 'title': 'Co-Founder'}, {'bio': 'Gary is CEO of Roofstock, a leading real estate investment marketplace which he co-founded in 2015. Gary has spent most of his career building businesses in the real estate, hospitality and tech sectors. After earning his BA in economics from Northwestern, Gary ventured west to earn his MBA from Stanford, where he caught the entrepreneurial bug and still serves as a regular guest lecturer. 
Previously Gary was instrumental in acquiring and integrating more than $800 million of resort properties for KSL Resorts, and spent five years as CFO of online brokerage pioneer ZipRealty, which he led through its successful IPO in 2004. Gary also served as CEO of Joie de Vivre Hospitality, then the second largest boutique hotel management company in the country. Immediately before starting Roofstock, Gary led one of the largest single-family rental platforms in the U.S. through its IPO as co-CEO of Starwood Waypoint Residential Trust, now part of Invitation Homes.', 'company': 'Roofstock', 'name': 'Gary Beasley', 'title': 'CEO & Co-Founder '}, {'bio': "Robyn has a track record of taking sophisticated climate and clean energy-related technical concepts and transforming them into commercially-oriented strategies that lead to impact, scale and results. She began her career in 2004 at Google in Mt View, CA, reporting directly to the co-founders working on strategic initiatives as they took the company public. Robyn went on to found Google's first business unit focused on incorporating clean energy generation across the company's global operations. In this capacity, she oversaw and catalyzed Google’s first clean energy initiatives, including large-scale clean energy procurement for data centers and the development and installation of a 1.7MW rooftop solar installation at the Mountain View HQ. Since then she has built, invested in, and raised $50M+ for new ventures and programs for Vestas Wind A/S in Copenhagen, Dean Kamen at DEKA R&D, and NRG Energy. Most recently she was an executive at Lennar Corp, where she built the firm’s first corporate venture platform while incubating Blueprint Power Technologies. Today, Robyn Beavers is the CEO and co-founder of Blueprint Power, a NYC-based real estate tech company that turns buildings into revenue-generating clean power plants. Robyn was named EY’s NY Entrepreneur of the Year in 2020. Robyn holds both a B.S. 
in Civil Engineering and an MBA from Stanford University.", 'company': 'Blueprint Power', 'name': 'Robyn Beavers', 'title': 'CEO & Co-Founder'}, {'bio': 'Liza Benson is a Partner with Moderne Ventures and helps lead and manage investment activity with particular focus on high-growth technology companies that can achieve rapid adoption and scale. Moderne Ventures is an early stage investment fund and industry immersion program which is focused on investing in technology companies in and around the multi-trillion dollar industries of real estate, mortgage, finance, insurance and home services.\n\nPrior to Moderne, Liza was a Partner with StarVest Partners, a $400M venture fund focused on expansion stage B2B SaaS investments. Previously, Liza was a Managing Director in the growth equity group at Highbridge Principals Strategies, a multi-billion asset manager. Before her experience at Highbridge, Liza was a Managing Director with Bear Stearns’ Constellation Growth Capital and an investment banker at Patricof & Co and First Union where she started her career.', 'company': 'Moderne Ventures', 'name': 'Liza Benson', 'title': 'Partner'}, {'bio': 'Jeremy Bernard is the CEO, North America at essensys, the world’s leading provider of software and technology to the flexible real estate industry. He has over 25 years of experience in the real estate and technology sectors. Most recently, Jeremy was the Global Head of Real Estate for Knotel where he grew and oversaw a portfolio of 5.5MM sq ft of flexible office space around the world. In previous roles, he has held C-level positions at real estate investment firms and launched several proptech companies. Jeremy resides in Westport, CT with his wife Jamie, daughter Morgan and son Brody.', 'company': 'essensys', 'name': 'Jeremy Bernard', 'title': 'CEO, North America'}, {'bio': "Benjamin Birnbaum is a Partner at Keyframe – a NYC based investment firm. 
His focus is primarily on how technology is causing market change across a number of physical infrastructure categories, like transportation and energy, inspired by earlier career experiences as an operating leader for one of the world's largest passenger transportation companies. Ben is also a co-founder of TeraWatt Infrastructure, a specialized owner of electric vehicle charging infrastructure focused on fleet electrification. ", 'company': 'Keyframe Capital', 'name': 'Ben Birnbaum', 'title': 'Partner'}, {'bio': 'Sean is the Co-Founder & CEO of BLACK, a tech-powered and cloud based CRE brokerage platform based in NYC. Prior to founding BLACK, Sean served as EVP of Real Estate and Enterprise Sales at WeWork, He has been involved in millions of square feet of commercial real estate leasing transactions over his 20 year tenure, and has worked at many of the world’s largest commercial brokerage firms including Cushman & Wakefield, JLL, Newmark, and Grubb & Ellis. ', 'company': 'BlackRE', 'name': 'Sean Black', 'title': 'CEO & Co-Founder'}, {'bio': 'As chief operating officer of CA Student Living, Steve Boyack is responsible for driving the performance and growth of CASL’s property management platform, as well as overseeing its corporate operational functions including technology, human resources, communications and culture. Steve leverages his decades of experience in the industry to develop and advance the people, processes and technologies that form the foundation of the business.\n\nBoyack previously served as global head of property management for CA Ventures, a parent company of CA Student Living, where he laid the foundation for the firm’s European student operating platform (Novel Student), global sustainability initiative, wellness program and innovation department. 
Prior to joining CA, Steve was a senior managing director at Greystar where he was responsible for overseeing real estate operations and leading the expansion of the company’s footprint in key Midwest markets. In addition, he oversaw Greystar’s national construction and maintenance operations and worked with their global innovation team.\n\nSteve earned a BS in Economics from the University of Iowa and a CPM® designation from the Institute of Real Estate Management. As a\xa0member of several industry advisory boards and associations, Steve is a\xa0recognized subject matter expert and thought leader, with particular focus on integrated property technology.', 'company': 'CA Ventures', 'name': 'Steve Boyack', 'title': 'COO, Student Living'}, {'bio': 'Laura Cain is the CEO and co-founder of Willow Servicing, a technology company focused on streamlining mortgage servicing. Willow’s platform automates core workflows, enabling lenders to provide digital-first borrower experiences while reducing operational costs and ensuring compliance with industry policies & regulations. Prior to Willow, Laura was a product manager at Snapdocs, where she built out their initial eClose product offering to lenders, and a venture investor at Thomvest, where she focused on early stage fintech investments.', 'company': 'Willow Servicing', 'name': 'Laura Cain', 'title': 'CEO & Co-Founder'}, {'bio': 'Madhu Chamarty is the co-founder and CEO of BeyondHQ, a startup that helps companies plan and scale distributed teams. An engineer and math nerd at heart, he has 15+ yrs of startup experience in Silicon Valley, as an early employee and co-founder at 3 high-growth B2B startups in digital media (Adify - Cox acq. # $300MM), employee communities (Dynamic Signal), and geospatial analytics (Descartes Labs). He has scaled sales & support teams globally, in both colocated and remote formats. 
He grew up in a fully distributed family across 4 countries, so believes he was destined to build BeyondHQ even before he knew it.', 'company': 'BeyondHQ', 'name': 'Madhu Chamarty', 'title': 'CEO & Co-Founder'}, {'bio': 'Alex Chatzielftheriou is a Greek entrepreneur and CEO and co-founder of Blueground — a real estate tech company founded in 2013. Blueground provides a network of fully-furnished, move-in ready apartments in 14 cities across the globe for stays of a month, a year, or longer. Having lived and worked in more than 15 cities around the world, Alex sought to provide business and leisure travelers with a hassle-free way to find places that feel like home — to show up and start living from day one. Along the way, Alex disrupted the traditional lease model, enabling flexible living to encourage travel and exploration of the world and its cultures while providing a place to feel "grounded" and call home. ', 'company': 'Blueground', 'name': 'Alex Chatzieleftheriou', 'title': 'CEO & Co-Founder'}, {'bio': 'Jit Kee Chin is the Chief Data & Innovation Officer and Executive Vice President at Suffolk. Ms. Chin is responsible for leveraging big data and advanced analytics to improve the organization’s core business. Ms. Chin is also responsible for helping to position Suffolk to achieve its vision of transforming the construction experience while working closely with the company’s Innovation and Strategy teams to fundamentally reinvent the future of construction in the digital age. \n\nPrior to her role at Suffolk, Ms. Chin spent 10 years with management consulting firm McKinsey and Company where she counseled senior executives on strategic, commercial and advanced analytics topics. Most recently, she was a Senior Expert in Analytics in McKinsey’s Boston office where she specialized in the design and implementation of end- to-end analytics transformations. Prior to that role, Ms. 
Chin was an Associate Principal in McKinsey’s London office where she helped organizations drive multi-year business transformations and change programs and developing strategies for profitable growth.', 'company': 'Suffolk Construction', 'name': 'Jit Kee Chin', 'title': 'Chief Data & Innovation Officer'}]
In pandas:
import pandas as pd
df = pd.DataFrame(results)
print(df)
Output:
                                                   bio           company             name                        title
0    Andrew is a recovering consultant turned seria...  Dreamit Ventures  Andrew Ackerman              Venture Partner
1    Salman Ahmad is the CEO and co-founder of Mosa...            Mosaic     Salman Ahmad           CEO and Co-Founder
2    Dafna Akiva is a 10+ year veteran in the real ...              Veev      Dafna Akiva             CRO & Co-Founder
3    Min Alexander serves as CEO of PunchListUSA, t...      PunchlistUSA    Min Alexander             CEO & Co-Founder
4    Nora Apsel is the Co-founder and CEO of Morty,...             Morty       Nora Apsel             CEO & Co-Founder
..                                                 ...               ...              ...                          ...
128  Ms. Wong joined Tishman Speyer in 2015. Jenny ...    Tishman Speyer       Jenny Wong            Managing Director
129  Joseph is the Founder and CEO of Neighbor.com,...          Neighbor  Joseph Woodbury                CEO & Founder
130  Based in Palo Alto, Michael Yang is a Managing...    OMERS Ventures     Michael Yang             Managing Partner
131  Since joining RET Ventures as Partner in 2019,...      RET Ventures  Christopher Yip  Partner & Managing Director
132  Chris Zlocki, Global Head of Client Experience...          Colliers     Chris Zlocki       EVP, Occupier Services
[133 rows x 4 columns]
Instead of driver.execute_script, you can use BeautifulSoup:
from bs4 import BeautifulSoup as soup
from selenium import webdriver
d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://blueprint.connectiv.com/speakers/')
s = soup(d.page_source, 'html.parser').select('.modal.speakerCard')
r = [dict(zip(['name', 'title', 'company', 'bio'],
[b.text for b in i.select(':is(h4, p.title, p.company, p.bio)')])) for i in s]
If all the information you are looking for is within a paragraph tag <p> that has a class of bio (so <p class='bio'>), and all the modals are already present in the source code, then you can simply select all with:
bios = driver.find_elements_by_xpath('//p[@class="bio"]')
That selects every <p> element whose class attribute is exactly 'bio' and returns them as a list. If some of the p tags have other classes as well (i.e. <p class='bio someotherclass'>), then you will need to use the contains() function in your XPath, like so:
bios = driver.find_elements_by_xpath('//p[contains(@class, "bio")]')
You can then loop through the results like so:
for bio in bios:
    print(bio.text)
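The difference between the exact match and the contains() match can be checked offline. The sketch below uses the standard library's xml.etree.ElementTree on a made-up snippet (the markup and class names are assumptions, not the real page). ElementTree supports only a subset of XPath and has no contains(), so the multi-class case is filtered in Python instead:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for the speaker modals on the real page
html = """
<div>
  <p class="bio">First bio</p>
  <p class="bio featured">Second bio</p>
  <p class="title">CEO</p>
</div>
"""
root = ET.fromstring(html)

# Exact attribute match: only elements whose class is exactly "bio"
exact = [p.text for p in root.findall(".//p[@class='bio']")]

# Token match: any <p> whose space-separated class list includes "bio"
by_token = [
    p.text for p in root.iter("p")
    if "bio" in p.get("class", "").split()
]

print(exact)     # ['First bio']
print(by_token)  # ['First bio', 'Second bio']
```

Note that contains(@class, "bio") is a substring test, so it would also match a class such as "biography"; the token comparison above avoids that.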
I'm trying to grab all the jobs available in the UK from this site: https://www.ubisoft.com/en-us/company/careers/search?countries=gb. In the browser's network tab there is a request that returns the data I need as JSON, https://avcvysejs1-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(4.8.4)%3B%20Browser%20(lite)%3B%20JS%20Helper%20(3.3.4)%3B%20react%20(16.12.0)%3B%20react-instantsearch%20(6.8.3)&x-algolia-api-key=1291fd5d5cd5a76a225fc6b00f7b296a&x-algolia-application-id=AVCVYSEJS1 and it uses Request Method: POST
However, when I wrote a script to get that data:
import requests
data = []
url = "https://avcvysejs1-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(4.8.4)%3B%20Browser%20(lite)%3B%20JS%20Helper%20(3.3.4)%3B%20react%20(16.12.0)%3B%20react-instantsearch%20(6.8.3)&x-algolia-api-key=1291fd5d5cd5a76a225fc6b00f7b296a&x-algolia-application-id=AVCVYSEJS1"
r = requests.post(url)
json = r.json()
print(json)
I got this result:
{'message': 'No content in POST request', 'status': 400}
and when I change it to r = requests.get(url) I get this result instead:
{'message': 'indexName is not valid', 'status': 400}
To get a correct response from the server, send a payload with the request:
import requests
from bs4 import BeautifulSoup
api_url = "https://avcvysejs1-dsn.algolia.net/1/indexes/*/queries"
params = {
"x-algolia-agent": "Algolia for JavaScript (4.8.4); Browser (lite); JS Helper (3.3.4); react (16.12.0); react-instantsearch (6.8.3)",
"x-algolia-api-key": "1291fd5d5cd5a76a225fc6b00f7b296a",
"x-algolia-application-id": "AVCVYSEJS1",
}
payload = """{"requests":[{"indexName":"jobs_en-us_default","params":"highlightPreTag=%3Cais-highlight-0000000000%3E&highlightPostTag=%3C%2Fais-highlight-0000000000%3E&query=&maxValuesPerFacet=100&page=0&facets=%5B%22jobFamily%22%2C%22team%22%2C%22countryCode%22%2C%22city%22%2C%22contractType%22%2C%22graduateProgram%22%5D&tagFilters=&facetFilters=%5B%5B%22countryCode%3Agb%22%5D%5D"},{"indexName":"jobs_en-us_default","params":"highlightPreTag=%3Cais-highlight-0000000000%3E&highlightPostTag=%3C%2Fais-highlight-0000000000%3E&query=&maxValuesPerFacet=100&page=0&hitsPerPage=1&attributesToRetrieve=%5B%5D&attributesToHighlight=%5B%5D&attributesToSnippet=%5B%5D&tagFilters=&analytics=false&clickAnalytics=false&facets=countryCode"}]}"""
data = requests.post(api_url, params=params, data=payload).json()
for h in data["results"][0]["hits"]:
    print(h["title"])
    # additionalInformation is HTML, so strip the tags before printing
    print(
        BeautifulSoup(h["additionalInformation"], "html.parser").get_text(
            strip=True, separator=" "
        )
    )
    print("-" * 80)
Prints:
Lead Concept Artist [New IP] (433)
Benefits & Relocation Flexible working, 22 days annual leave + Christmas shutdown, private healthcare (with option to add immediate family), life insurance & income protection, workplace pension scheme, paid volunteering days, annual fitness & well-being allowance, games, technology & merchandise, subsidised travel and many more... Relocation assistance is available to anyone currently living 50 miles or more from the studio location. Please contact a member of the talent acquisition team to find out what we have to offer and how we can support with your move here... relocation really doesn't have to be a daunting prospect. Find out more about Ubisoft Reflections: https://reflections.ubisoft.com/about/ubisoft-reflections/ Facebook: https://www.facebook.com/pg/Ubisoft.Reflections Twitter: https://twitter.com/UbiReflections Ubisoft offers the same job opportunities to all, without any distinction of gender, ethnicity, religion, sexual orientation, social status, disability or age. Ubisoft ensures the development of an inclusive work environment which mirrors the diversity of our gamers community.
--------------------------------------------------------------------------------
Player Support Product Lead
Benefits With Ubisoft CRC, you'll receive a competitive salary along with: Personal performance bonus Private Health Insurance (including eye care and dental) Life Assurance Long Term Disability Insurance Pension Significant discount on the world’s best video games Access to Ubisoft's back catalogue on PC Perks: We work in the heart of Newcastle city centre, right on top of Haymarket metro station in a lively, international and creative space. We have a kitchen stocked with cereals, fruits, unlimited filtered water, teas, coffee Regular professional and social events Monthly Ubidrinks Flexible working hours A casual dress code Fun, we like to work hard but have a laugh too! For the safety of all our teams we are currently working remotely. We hope to return to our CRC home very soon and anticipate a blended working pattern combining office and home based working in the future Ubisoft is committed to creating an inclusive work environment that reflects the diversity of our player community. We are an equal opportunity employer. Qualified applicants will receive consideration for employment without regard to their race, ethnicity, religion, gender, sexual orientation, age or disability status.
--------------------------------------------------------------------------------
...and so on.
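The payload string above was copied from the browser. If you need to vary the country or page, it may be easier to build it in code. The following is only a sketch: the helper name build_payload is hypothetical, and it assumes the server accepts a reduced set of parameters (query, page, facetFilters), which I have not verified against the API.

```python
import json
from urllib.parse import urlencode

def build_payload(index_name, country_code, page=0):
    """Return a JSON body for a single-query request (hypothetical helper)."""
    query_params = {
        "query": "",
        "page": page,
        # facetFilters is itself a JSON array, URL-encoded inside "params"
        "facetFilters": json.dumps([[f"countryCode:{country_code}"]]),
    }
    return json.dumps(
        {"requests": [{"indexName": index_name, "params": urlencode(query_params)}]}
    )

payload = build_payload("jobs_en-us_default", "gb")
params_string = json.loads(payload)["requests"][0]["params"]
print(params_string)
```

The encoded facetFilters value comes out the same as in the hand-written payload (%5B%5B%22countryCode%3Agb%22%5D%5D), and the resulting string can be passed as data=payload to requests.post() in the same way.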
I have a gzip (.gz) file. The file is very big, and the first line is as follows:
{"originaltitle":"Leasing Specialist - WPM Real Estate Management","workexperiences":[{"company":"Home Properties","country":"US","customizeddaterange":"","daterange":{"displaydaterange":"","startdate":null,"enddate":null},"description":"Responsibilities: Inspect tour routes, models and show apartments daily to ensure cleanliness. Greeting prospective residents; determining the needs and preferences of the prospect and professionally present specific apartments while providing information regarding features and benefits. Answering incoming calls in a cheerful and professional manner. Handle each call accordingly whether it is a prospect call or an irate resident that just moved in. Develop and maintain Resident relations through the courtesy of on-site personnel, promptness of maintenance calls, and knowledge of community policies. Learn to develop professional sales and closing techniques. Accompany prospects to model apartments and discusses size and layout of rooms, available facilities, such as swimming pool and saunas, location of shopping centers, services available, and terms of lease. Demonstrate thorough knowledge and use of lead tracking system. Make follow-up calls to prospective Residents who did not fill out an application. Compile and update listings of available rental units.","location":"Baltimore, MD","normalizedtitle":"leasing specialist","title":"Leasing Specialist"},{"company":"WPM Real Estate Management","country":"US","customizeddaterange":"1 year, 3 months","daterange":{"displaydaterange":"July 2017 to October 2018","startdate":{"displaydate":"July 2017","granularity":"MONTH","isodate":{"date":null}},"enddate":{"displaydate":"October 2018","granularity":"MONTH","isodate":{"date":null}}},"description":"Responsibilities: Inspect tour routes, models and show apartments daily to ensure cleanliness. 
Greeting prospective residents; determining the needs and preferences of the prospect and professionally present specific apartments while providing information regarding features and benefits. Answering incoming calls in a cheerful and professional manner. Handle each call accordingly whether it is a prospect call or an irate resident that just moved in. Develop and maintain Resident relations through the courtesy of on-site personnel, promptness of maintenance calls, and knowledge of community policies. Learn to develop professional sales and closing techniques. Accompany prospects to model apartments and discusses size and layout of rooms, available facilities, such as swimming pool and saunas, location of shopping centers, services available, and terms of lease. Demonstrate thorough knowledge and use of lead tracking system. Make follow-up calls to prospective Residents who did not fill out an application. Compile and update listings of available rental units.","location":"Baltimore, MD","normalizedtitle":"leasing specialist","title":"Leasing Specialist"},{"company":"Westminster Management","country":"US","customizeddaterange":"1 year","daterange":{"displaydaterange":"June 2016 to June 2017","startdate":{"displaydate":"June 2016","granularity":"MONTH","isodate":{"date":null}},"enddate":{"displaydate":"June 2017","granularity":"MONTH","isodate":{"date":null}}},"description":"Responsibilities: Tour vacant units and model with future prospects.Process applications. Answer emails and incoming phone calls. Prepare lease agreement for signing. Collect all monies that is due on dateof move-in. Enter resident repair orders for resident. Walk vacant units to ensure that the unit is ready for show. Complete residency and employment verifications. 
Income qualify all applicants.","location":"Baltimore, MD","normalizedtitle":"leasing consultant","title":"Leasing Consultant"},{"company":"MARYLAND MANAGEMENT COMPANY","country":"US","customizeddaterange":"1 year, 1 month","daterange":{"displaydaterange":"April 2015 to May 2016","startdate":{"displaydate":"April 2015","granularity":"MONTH","isodate":{"date":null}},"enddate":{"displaydate":"May 2016","granularity":"MONTH","isodate":{"date":null}}},"description":"Responsibilities: Lease apartments, sign lease agreements, complete residence maintenance repairrequest, answer phones, customer service, processed prospects applications, opened and closedinventory, responded to Level One emails Accomplishments: I was able to successfully finish FairHousing requirements. The first month I was able to properly and accurately process a application and move-in documents. Skills Used: The skills I used while at Americana were strong team work, strongcommunication, interpersonal, and leadership.","location":"Glen Burnie, MD","normalizedtitle":"leasing agent","title":"Leasing Agent"},{"company":"Amazon.com","country":"US","customizeddaterange":"1 year, 5 months","daterange":{"displaydaterange":"September 2014 to February 2016","startdate":{"displaydate":"September 2014","granularity":"MONTH","isodate":{"date":null}},"enddate":{"displaydate":"February 2016","granularity":"MONTH","isodate":{"date":null}}},"description":"Responsibilities: I assure customers are receiving the correct merchandise in a timely fashion.And evaluate inventoryAccomplishments:I exceeded Amazon expectations of receiving 2800 items per hour, which allowed me to train otherassociates, building confidence and skills.Skills Used:The skills i used while performing my task were strong leadership, strong communications, and beingdetailed orientated.","location":"Baltimore, MD","normalizedtitle":"customer service representative","title":"Customer Service Representative"},{"company":"Carmax 
Superstore","country":"US","customizeddaterange":"1 year, 2 months","daterange":{"displaydaterange":"February 2014 to April 2015","startdate":{"displaydate":"February 2014","granularity":"MONTH","isodate":{"date":null}},"enddate":{"displaydate":"April 2015","granularity":"MONTH","isodate":{"date":null}}},"description":"Responsibilities:Greet customersSearch for the right vehicle that best suits the customers needs and wantsSubmit financial applicationsAssist customer with the purchasing process and document signingEnter customers information for appraisal offerAssist customer with purchasing Car Max extended warrantiesConducted follow- up on a daily, weekly, and monthy basisAccomplishments:I was acknowledged by the district for having 100% in Car Max extended warranties. Also I wasacknowledged by the district for having one of the highest Voice Of Customer survey scores. I passedthe 6 week training, obtaining my sales licenseSkills Used:I demonstrate strong communication, interpersonal and listening skills. 
I also have strongorganizational skills.","location":"Nottingham, MD","normalizedtitle":"sales consultant","title":"Certified Sales Consultant"},{"company":"rue21","country":"US","customizeddaterange":"1 year, 8 months","daterange":{"displaydaterange":"June 2011 to February 2013","startdate":{"displaydate":"June 2011","granularity":"MONTH","isodate":{"date":null}},"enddate":{"displaydate":"February 2013","granularity":"MONTH","isodate":{"date":null}}},"description":"Responsibilities: Managed profit goals on a daily basisCustomer ServiceReceived Incoming shipmentDelivered daily bank depsoitsMaintained store appearanceOverlooked sales associates performanceCreated daily goals for each sales associateAccomplishments:The impact that I was able to have during my time at Rue21, I was able to build a strong team of individuals who were scored top in the region for Customer Service.Skills Used:I demonstrated strong leadership and verbal communication.","location":"Dundalk, MD","normalizedtitle":"assistant store manager","title":"Assistant Store Manager"},{"company":"Shaws Jewelers","country":"US","customizeddaterange":"1 year, 5 months","daterange":{"displaydaterange":"November 2009 to April 2011","startdate":{"displaydate":"November 2009","granularity":"MONTH","isodate":{"date":null}},"enddate":{"displaydate":"April 2011","granularity":"MONTH","isodate":{"date":null}}},"description":"Responsibilities: Customer serviceGeneral office( typing, faxing, )Made outgoing calls to valued customersCleaned and maintained show cases and lunch roomPrepared jewlery repair tickets for outgoing shipmentAccomplishments:During my time at Shaws Jewelers I was able to demonstrate excellent customer service.Also I wasable to achieve personal profit goals and credit application goals on a daily basis. I was acknowledged and rewarded by my DM for excellent team participation and over achieving the 6 standards on a dailybasis.Skills Used:I demonstrated strong verbal and listening skills. 
Also I have excellent interpersonal skills.","location":"Dundalk, MD","normalizedtitle":"sales associate","title":"Sales Associate"}],"skillslist":[{"monthsofexperience":0,"text":"yardi"},{"monthsofexperience":0,"text":"marketing"},{"monthsofexperience":0,"text":"outlook"},{"monthsofexperience":0,"text":"receptionist"},{"monthsofexperience":0,"text":"management"}],"url":"/r/Lashannon-Felton/1062d3b8cbb13886","additionalinfo":""}\n'
I am not familiar with gzip.GzipFile format.
Is there a way to make it a dictionary?
You will want to make use of the json module and the gzip module in Python, both of which are part of the Python Standard Library.
The gzip module provides the GzipFile class, as well as the open(),
compress() and decompress() convenience functions. The GzipFile class
reads and writes gzip-format files, automatically compressing or
decompressing the data so that it looks like an ordinary file object.
To read the compressed file, you can call gzip.open().
Opening the file with the default rb mode returns a gzip.GzipFile object, from which you can obtain a bytes object by calling read().
Then, using json.loads(), you can convert the raw data into a usable Python object -- a dictionary.
The snippet below is a simple demonstration of this in action:
import gzip
import json
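# Open in binary mode; the data is decompressed transparently on read
with gzip.open('gzipped_file.json.gz', 'rb') as f:
    raw_json = f.read()
# json.loads() accepts bytes as well as str
data = json.loads(raw_json)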
print(type(data))
# Prints <class 'dict'>
print(data)
# Prints {'originaltitle': 'Leasing Specialist - WPM Real Estate Management', 'workexperience ...
print(data['workexperiences'][0]['company'])
# Prints Home Properties
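Because the question says the file is very big and shows only its first line, it may contain one JSON object per line rather than a single document; that is an assumption about the data. In that case json.loads() on the whole file would raise an "Extra data" error, and reading line by line in text mode is a possible alternative. A minimal sketch, using a temporary sample file with made-up records and a hypothetical helper iter_records:

```python
import gzip
import json
import os
import tempfile

# Build a tiny gzip file with one JSON object per line (made-up records)
path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write('{"originaltitle": "Leasing Specialist", "skillslist": []}\n')
    f.write('{"originaltitle": "Sales Associate", "skillslist": []}\n')

def iter_records(gz_path):
    """Yield one dict per non-empty line without loading the whole file."""
    with gzip.open(gz_path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

records = list(iter_records(path))
titles = [r["originaltitle"] for r in records]
print(titles)  # ['Leasing Specialist', 'Sales Associate']
```

For a really large file you would loop over iter_records() directly instead of calling list(), so only one record is held in memory at a time.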