Reading excel file with line breaks and tabs preserved using xlrd

Reading excel file with line breaks and tabs preserved using xlrd - python

I am trying to read excel file cells having multi line text in it. I am using xlrd 1.2.0. But when I print or even write the text in cell to .txt file it doesn't preserve line breaks or tabs i.e \n or \t.
Input:
File URL:
Excel file
Code:
import xlrd
filenamedotxlsx = '16.xlsx'
gall_artists = xlrd.open_workbook(filenamedotxlsx)
sheet = gall_artists.sheet_by_index(0)
bio = sheet.cell_value(0,1)
print(bio)
Output:
"Biography 2018-2019 Manoeuvre Textiles Atelier, Gent, Belgium 2017-2018 Thalielab, Brussels, Belgium 2017 Laboratoires d'Aubervilliers, Paris 2014-2015 Galveston Artist Residency (GAR), Texas 2014 MACBA, Barcelona & L'appartment 22, Morocco - Residency 2013 International Residence Recollets, Paris 2007 Gulbenkian & RSA Residency, BBC Natural History Dept, UK 2004-2006 Delfina Studios, UK Studio Award, London 1998-2000 De Ateliers, Post-grad Residency, Amsterdam 1995-1998 BA (Hons) Textile Art, Winchester School of Art UK "
Expected Output:
1975 Born in Hangzhou, Zhejiang, China
1980 Started to learn Chinese ink painting
2000 BA, Major in Oil Painting, China Academy of Art, Hangzhou, China
Curator, Hangzhou group exhibition for 6 female artists Untitled, 2000 Present
2007 MA, New Media, China Academy of Art, Hangzhou, China, studied under Jiao Jian
Lecturer, Department of Art, Zhejiang University, Hangzhou, China
2015 PhD, Calligraphy, China Academy of Art, Hangzhou, China, studied under Wang Dongling
Jury, 25th National Photographic Art Exhibition, China Millennium Monument, Beijing, China
2016 Guest professor, Faculty of Humanities, Zhejiang University, Hangzhou, China
Associate professor, Research Centre of Modern Calligraphy, China Academy of Art, Hangzhou, China
Researcher, Lanting Calligraphy Commune, Zhejiang, China
2017 Christie's produced a video about Chu Chu's art
2018 Featured by Poetry Calligraphy Painting Quarterly No.2, Beijing, China
Present Vice Secretary, Lanting Calligraphy Society, Hangzhou, China
Vice President, Zhejiang Female Calligraphers Association, Hangzhou, China
I have also used repr() to see if there are \n characters or not, but there aren't any.

Related

Draw a Map of cities in python

I have a ranking of countries across the world in a variable called rank_2000 that looks like this:
Seoul
Tokyo
Paris
New_York_Greater
Shizuoka
Chicago
Minneapolis
Boston
Austin
Munich
Salt_Lake
Greater_Sydney
Houston
Dallas
London
San_Francisco_Greater
Berlin
Seattle
Toronto
Stockholm
Atlanta
Indianapolis
Fukuoka
San_Diego
Phoenix
Frankfurt_am_Main
Stuttgart
Grenoble
Albany
Singapore
Washington_Greater
Helsinki
Nuremberg
Detroit_Greater
TelAviv
Zurich
Hamburg
Pittsburgh
Philadelphia_Greater
Taipei
Los_Angeles_Greater
Miami_Greater
MannheimLudwigshafen
Brussels
Milan
Montreal
Dublin
Sacramento
Ottawa
Vancouver
Malmo
Karlsruhe
Columbus
Dusseldorf
Shenzen
Copenhagen
Milwaukee
Marseille
Greater_Melbourne
Toulouse
Beijing
Dresden
Manchester
Lyon
Vienna
Shanghai
Guangzhou
San_Antonio
Utrecht
New_Delhi
Basel
Oslo
Rome
Barcelona
Madrid
Geneva
Hong_Kong
Valencia
Edinburgh
Amsterdam
Taichung
The_Hague
Bucharest
Muenster
Greater_Adelaide
Chengdu
Greater_Brisbane
Budapest
Manila
Bologna
Quebec
Dubai
Monterrey
Wellington
Shenyang
Tunis
Johannesburg
Auckland
Hangzhou
Athens
Wuhan
Bangalore
Chennai
Istanbul
Cape_Town
Lima
Xian
Bangkok
Penang
Luxembourg
Buenos_Aires
Warsaw
Greater_Perth
Kuala_Lumpur
Santiago
Lisbon
Dalian
Zhengzhou
Prague
Changsha
Chongqing
Ankara
Fuzhou
Jinan
Xiamen
Sao_Paulo
Kunming
Jakarta
Cairo
Curitiba
Riyadh
Rio_de_Janeiro
Mexico_City
Hefei
Almaty
Beirut
Belgrade
Belo_Horizonte
Bogota_DC
Bratislava
Dhaka
Durban
Hanoi
Ho_Chi_Minh_City
Kampala
Karachi
Kuwait_City
Manama
Montevideo
Panama_City
Quito
San_Juan
What I would like to do is a map of the world where those cities are colored according to their position on the ranking above. I am opened to further solutions for the representation (such as bubbles of increasing dimension according to the position of the cities in the rank or, if necessary, representing only a sample of countries taken from the top rank, the middle and the bottom).
Thank you,
Federico

Your question has two parts; finding the location of each city and then drawing them on the map. Assuming you have the latitude and longitude of each city, here's how you'd tackle the latter part.
I like Folium (https://pypi.org/project/folium/) for drawing maps. Here's an example of how you might draw a circle for each city, with it's position in the list is used to determine the size of that circle.
import folium
cities = [
{'name':'Seoul', 'coodrs':[37.5639715, 126.9040468]},
{'name':'Tokyo', 'coodrs':[35.5090627, 139.2094007]},
{'name':'Paris', 'coodrs':[48.8588787,2.2035149]},
{'name':'New York', 'coodrs':[40.6976637,-74.1197631]},
# etc. etc.
]
m = folium.Map(zoom_start=15)
for counter, city in enumerate(cities):
circle_size = 5 + counter
folium.CircleMarker(
location=city['coodrs'],
radius=circle_size,
popup=city['name'],
color="crimson",
fill=True,
fill_color="crimson",
).add_to(m)
m.save('map.html')
Output:
You may need to adjust the circle_size calculation a little to work with the number of cities you want to include.

Columns must be same length as key Error python

I have 2 CSV files in file1 I have list of research groups names. in file2 I have list of the Research full name with location as wall. I want to join these 2 csv file if the have the words matches in them.
Pandas ValueError: "Columns must be same length as key" I am using Jupyter Labs for this.
"df1[["f2_research_groups_names", "f2_location"]] = df1.apply(fn, axis=1)"
cvs row size for file1.csv 5000 data, and for file2.csv I have about 15,000
file1.csv
research_groups_names_f1
Chinese Academy of Sciences (CAS)
CAS
U-M
UQ
University of California, Los Angeles
Harvard University
file2.csv
research_groups_names_f2
Locatio_f2
Chinese Academy of Sciences (CAS)
China
University of Michigan (U-M)
USA
The University of Queensland (UQ)
USA
University of California
USA
file_output.csv
research_groups_names_f1
research_groups_names_f2
Locatio_f2
Chinese Academy of Sciences
Chinese Academy of Sciences(CAS)
China
CAS
Chinese Academy of Sciences (CAS)
China
U-M
University of Michigan (U-M)
USA
UQ
The University of Queensland (UQ)
Australia
Harvard University
Not found
USA
University of California, Los Angeles
University of California
USA
import pandas as pd
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file1.csv')
df1 = df1.add_prefix('f1_')
df2 = df2.add_prefix('f2_')
def fn(row):
for _, n in df2.iterrows():
if (
n["research_groups_names_f1"] == row["research_groups_names_f2"]
or row["research_groups_names_f1"] in n["research_groups_names_f2"]
):
return n
df1[["f2_research_groups_names", "f2_location"]] = df1.apply(fn, axis=1)
df1 = df1.rename(columns={"research_groups_names": "f1_research_groups_names"})
print(df1)

The issue here is that you're trying to merge on some very different values. Fuzzy matching may not help because the distance between CAS and Chinese Academy of Sciences (CAS) is quite large. The two have very little in common. You'll have to develop some custom approach based on your understanding of what the possible group names could be. Here is on approach that gets you most of the way there.
The idea here is to match on the university name OR the abbreviation. So in df2 we can split off the abbreviation and explode into a new row, remove the parenthesis, and in df remove any abbreviation surrounded by parentehsis.
The only leftover value is UCLA, which is the only sample that doesn't follow the same structure as the others. In this case fuzzy matching like I mentioned in my first comment probably would help.
import pandas as pd
df = pd.DataFrame({'research_groups_names_f1':[
'Chinese Academy of Sciences (CAS)',
'CAS',
'U-M',
'UQ',
'University of California, Los Angeles',
'Harvard University']})
df2 = pd.DataFrame({'research_groups_names_f2': ['Chinese Academy of Sciences (CAS)',
'University of Michigan (U-M)',
'The University of Queensland (UQ)',
'University of California'],
'Locatio_f2': ['China', 'USA', 'USA', 'USA']})
df2['key'] = df2['research_groups_names_f2'].str.split('\(')
df2 = df2.explode('key')
df2['key'] = df2['key'].str.replace('\(|\)','', regex=True)
df['key'] = df['research_groups_names_f1'].str.replace('\(.*\)','',regex=True)
df.merge(df2, on='key', how='left').drop(columns='key')
Output
research_groups_names_f1 research_groups_names_f2 Locatio_f2
0 Chinese Academy of Sciences (CAS) Chinese Academy of Sciences (CAS) China
1 CAS Chinese Academy of Sciences (CAS) China
2 U-M University of Michigan (U-M) USA
3 UQ The University of Queensland (UQ) USA
4 University of California, Los Angeles NaN NaN
5 Harvard University NaN NaN

When can `re.finditer` not return anything but string.index can?

Simply,
In [9]: [m.start() for m in re.finditer(answer_text, context)]
Out[9]: []
In [10]: context.index(answer_text)
Out[10]: 384
As you can see, re.finditer does not return a match, but the index method does. Is this expected?
In [18]: context
Out[18]: 'Fight for My Way (; lit. "Third-Rate My Way") is a South Korean television series starring Park Seo-joon and Kim Ji-won, with Ahn Jae-hong and Song Ha-yoon. It premiered on May 22, 2017 every Monday and Tuesday at 22:00 (KST) on KBS2. Kim Ji-won (Hangul: 김지원 ; Hanja: 金智媛 ; born October 19, 1992) is a South Korean actress. She gained attention through her roles in television series "The Heirs" (2013), "Descendants of the Sun" (2016) and "Fight for My Way" (2017). Yellow Hair 2 () is a 2001 South Korean film, written, produced, and directed by Kim Yu-min. It is the sequel to Kim\'s 1999 film "Yellow Hair", though it does not continue the same story or feature any of the same characters. The original film gained attention when it was refused a rating due to its sexual content, requiring some footage to be cut before it was allowed a public release. "Yellow Hair 2" attracted no less attention from the casting of transsexual actress Harisu in her first major film role. Ko Joo-yeon (born February 22, 1994) is a South Korean actress who has gained attention in the Korean film industry for her roles in "Blue Swallow" (2005) and "The Fox Family" (2006). In 2007 she appeared in the horror film "Epitaph" as Asako, a young girl suffering from overbearing nightmares and aphasia, becoming so immersed in the role that she had to deal with sudden nosebleeds while on set. Kyu Hyun Kim of "Koreanfilm.org" highlighted her performance in the film, saying, "[The cast\'s] acting thunder is stolen by the ridiculously pretty Ko Joo-yeon, another Korean child actress who we dearly hope continues her film career." Kim Ji-won (Hangul:\xa0김지원 ; born December 21, 1995), better known by his stage name Bobby (Hangul:\xa0바비 ) is a Korean-American rapper and singer. He is known as a member of the popular South Korean boy group iKON, signed under YG Entertainment. Descendants of the Sun () is a 2016 South Korean television series starring Song Joong-ki, Song Hye-kyo, Jin Goo, and Kim Ji-won. It aired on KBS2 from February 24 to April 14, 2016, on Wednesdays and Thursdays at 22:00 for 16 episodes. KBS then aired three additional special episodes from April 20 to April 22, 2016 containing highlights and the best scenes from the series, the drama\'s production process, behind-the-scenes footage, commentaries from cast members and the final epilogue. What\'s Up () is a 2011 South Korean television series starring Lim Ju-hwan, Daesung, Lim Ju-eun, Oh Man-seok, Jang Hee-jin, Lee Soo-hyuk, Kim Ji-won and Jo Jung-suk. It aired on MBN on Saturdays to Sundays at 23:00 for 20 episodes beginning December 3, 2011. The 2016 KBS Drama Awards (), presented by Korean Broadcasting System (KBS), was held on December 31, 2016 at KBS Hall in Yeouido, Seoul. It was hosted by Jun Hyun-moo, Park Bo-gum and Kim Ji-won. Gap-dong () is a 2014 South Korean television series starring Yoon Sang-hyun, Sung Dong-il, Kim Min-jung, Kim Ji-won and Lee Joon. It aired on cable channel tvN from April 11 to June 14, 2014 on Fridays and Saturdays at 20:40 for 20 episodes. Kim Ji-won (Hangul: 김지원; born 26 February 1995) is a South Korean female badminton player. In 2013, Kim and her national teammates won the Suhadinata Cup after beat Indonesian junior team in the final round of the mixed team event. She also won the girls\' doubles title partnered with Chae Yoo-jung.'
In [19]: answer_text
Out[19]: '"The Heirs" (2013)'

Python question about reading two txt and output one txt

week5_worldarea.txt: week5_worldpop.txt:
China 9388211 China 1415045928
India 2973190 India 1354051854
U.S. 9147420 U.S. 326766748
Indonesia 1811570 Indonesia 266794980
Brazil 8358140 Brazil 210867954
Pakistan 770880 Pakistan 200813818
Nigeria 910770 Nigeria 195875237
Bangladesh 130170 Bangladesh 166368149
Russia 16376870 Russia 143964709
Mexico 1943950 Mexico 130759074
Japan 364555 Japan 127185332
Ethiopia 1000000 Ethiopia 107534882
Philippines 298170 Philippines 106512074
Egypt 995450 Egypt 99375741
Viet-Nam 310070 Viet-Nam 96491146
DR-Congo 2267050 DR-Congo 84004989
Germany 348560 Germany 82293457
Iran 1628550 Iran 82011735
Turkey 769630 Turkey 81916871
Thailand 510890 Thailand 69183173
U.K. 241930 U.K. 66573504
France 547557 France 65233271
Italy 294140 Italy 59290969
Hello, I have two text files as you can see. I want to create a third txt which contains Country Name
and population density. For example:
China 150.7258
India 455.420
.....
To do this, I code a python file called untitled2 which contains these functions :
def get_area(x):
pos1=x.find(' ')
area=x[pos1+1:len(x)]
return area
def get_country(x):
pos2=x.find(' ')
country=x[0:pos2]
return country
def get_pop(x):
pos3=x.find(' ')
pop=x[pos3+1:len(x)]
return pop
and another python file called untitled3 is :
import untitled2
f1=open('week5_worldarea.txt','r')
f2=open('week5_worldpop.txt','r')
f3=open('week5_worlddensity1.txt','w')
for line1 in f1:
float(untitled2.get_area(line1))
for line2 in f2:
float(untitled2.get_pop(line2))
density=float(untitled2.get_pop(line2))/float(untitled2.get_area(line1))
for line3 in f1:
untitled2.get_country(line3)
f3.write(str(line3)+str(density))
f1.close()
f2.close()
f3.close()
I think there is a problem with loops but I don't know how to correct it. Also I use some expressions like pos1=x.find(' ') but is there any way to express it by using tabs? I mean, if I write pos1=x.find('\t'), will it be wrong? Thanks so much.

You want to read each of the two files in lockstep, and you can write the density for each country as you compute it. (I'm assuming the two input files have the same countries on corresponding lines.)
with open('week5_worldarea.txt') as area_f, \
open('week5_worldpop.txt') as pop_f, \
open('week5_worlddensity1.txt', 'w') as dens_f:
for area_l, pop_l in zip(area_f, pop_f):
country1, area = area_l.split()
country2, pop = pop_l.split()
assert country1 == country2
density = int(pop)/int(area)
dens_f.write(str(density))

Scrape website to only show populated categories

I am in the process of scraping a website and it pulls the contents of the page, but there are categories with headers that are technically empty, but it still shows the header. I would like to only see categories with events in them. Ideally I could even have the components of each transactions so I can choose which elements I want displayed.
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
print('Scraping NH Dept of Banking...')
print()
NHurl = 'https://www.nh.gov/banking/corporate-activities/index.htm'
NHr = requests.get(NHurl, headers = headers)
NHsoup = BeautifulSoup(NHr.text, 'html.parser')
NHlist = []
for events in NHsoup.findAll('tr')[2:]:
print(events.text)
NHlist.append(events.text)
print(' '.join(NHlist))
Like I said, this works to get all of the information, but there are a lot of headers/empty space that doesn't need to be pulled. For example, at the time I'm writing this the 'acquisitions', 'conversions', and 'change in control' are empty, but the headers still come in and there's are relatively large blank space after the headers. I feel like a I need some sort of loop to go through each header ('td') and then get it's contents ('tr') but I'm just not quite sure how to do it.

You can use itertools.groupby to group elements and then filter out empty rows:
import requests
from itertools import groupby
from bs4 import BeautifulSoup
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
print('Scraping NH Dept of Banking...')
print()
NHurl = 'https://www.nh.gov/banking/corporate-activities/index.htm'
NHr = requests.get(NHurl, headers = headers)
NHsoup = BeautifulSoup(NHr.text, 'html.parser')
NHlist = []
for _, g in groupby(NHsoup.select('tr'), lambda k, d={'g':0}: (d.update(g=d['g']+1), d['g']) if k.select('th') else (None, d['g'])):
s = [tag.get_text(strip=True, separator=' ') for tag in g]
if any(i == '' for i in s):
continue
NHlist.append(s)
# This is just pretty printing, all the data are already in NHlist:
l = max(map(len,(j for i in NHlist for j in i))) + 5
for item in NHlist:
print('{: <4} {}'.format(' ', item[0]))
print('-' * l)
for i, ev in enumerate(item[1:], 1):
print('{: <4} {}'.format(i, ev))
print()
Prints:
Scraping NH Dept of Banking...
New Bank
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 12/11/18 The Millyard Bank
Interstate Bank Combination
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 01/16/19 Optima Bank & Trust Company with and into Cambridge Trust Company Portsmouth, NH 03/29/19
Amendment to Articles of Agreement or Incorporation; Business or Capital Plan
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 11/26/18 John Hancock Trust Company Boston, MA 01/14/19
2 12/04/18 Franklin Savings Bank Franklin, NH 01/28/19
3 12/12/18 MFS Heritage Trust Company Boston, MA 01/28/19
4 02/25/19 Ankura Trust Company, LLC Fairfield, CT 03/22/19
5 4/25/19 Woodsville Guaranty Savings Bank Woodsville, NH 06/04/19
6 5/10/19 AB Trust Company New York, NY 06/04/19
Reduction in Capital
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 03/07/19 Primary Bank Bedford, NH 04/10/19
Amendment to Bylaws
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 12/10/18 Northeast Credit Union Porstmouth, NH 02/25/19
2 2/25/19 Members First Credit Union Manchester, NH 04/05/19
3 4/24/19 St. Mary's Bank Manchester, NH 05/30/19
4 6/28/19 Bellwether Community Credit Union
Interstate Branch Office
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 01/23/19 Newburyport Five Cents Savings Bank 141 Portsmouth Ave Exeter, NH 02/01/19
2 03/08/19 One Credit Union Newport, NH 03/29/19
3 03/01/19 JPMorgan Chase Bank, NA Nashua, NH 04/04/19
4 03/26/19 Mascoma Bank Lebanon, NH 04/09/19
5 04/24/19 Newburyport Five Cents Savings Bank 321 Lafayette Rd Hampton NH 05/08/19
6 07/10/19 Mascoma Bank 242-244 North Winooski Avenue Burlington VT 07/18/19
7 07/10/19 Mascoma Bank 431 Pine Street Burlington VT 07/18/19
Interstate Branch Office Closure
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 02/15/19 The Provident Bank 321 Lafayette Rd Hampton, NH 02/25/19
New Branch Office
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 12/07/18 Bank of New Hampshire 16-18 South Main Street Concord NH 01/02/19
2 3/4/19 Triangle Credit Union 360 Daniel Webster Highway, Merrimack, NH 03/11/19
3 04/03/19 Bellwether Community Credit Union 425-453 Commercial Street Manchester, NH 04/17/19
4 06/11/19 Primary Bank 23 Crystal Avenue Derry NH 06/11/19
Branch Office Closure
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 5/15/19 Northeast Credit Union Merrimack, NH 05/21/19
New Loan Production Office
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 04/08/19 Community National Bank 367 Route 120, Unit B-5 Lebanon, NH
03766-1430 04/15/19
Loan Production Office Closure
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 07/22/19 The Provident Bank 20 Trafalgar Square, Suite 447 Nashua NH 03063 07/31/19
Trade Name Requests
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 04/16/19 John Hancock Trust Company To use trade name "Manulife Investment Management Trust Company" 04/24/19
New Trust Company
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 02/19/19 Janney Trust Co., LLC
2 02/25/19 Darwin Trust Company of New Hampshire, LLC
3 07/15/`9 Harbor Trust Company
Dissolution of Trust Company
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 09/19/17 Cambridge Associates Fiduciary Trust, LLC Boston, MA 02/05/19
Trust Office Closure
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 5/10/19 Charter Trust Company Rochester, NH 05/20/19
New Trust Office
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 02/25/19 Ankura Trust Company, LLC 140 Sherman Street, 4th Floor Fairfield, CT 06824 03/22/19
Relocation of Trust Office
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 01/23/19 Geode Capital Management Trust Company, LLC Relocate from: One Post Office Square, 20th Floor, Boston MA To: 100 Summer Street, 12th Flr, Boston, MA 02/01/19
2 03/15/19 Drivetrain Trust Company LLC Relocate from: 630 3rd Avenue, 21st Flr New York, NY 10017 To: 410 Park Avenue, Suite 900 New York, NY 10022 03/29/19
3 04/14/19 Boston Partners Trust Company Relocate from: 909 Third Avenue New York, NY 10022 To: One Grand Central Place 60 East 42nd Street, Ste 1550 New York, NY 10165 04/23/19

You could test which rows contain all '\xa0' (appear blank) and exclude. I append to list and convert to pandas dataframe but you could just print the row direct.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
r = requests.get('https://www.nh.gov/banking/corporate-activities/index.htm')
soup = bs(r.content, 'lxml')
results = []
for tr in soup.select('tr'):
row = [i.text for i in tr.select('th,td')]
if row.count('\xa0') != len(row):
results.append(row)
pd.set_option('display.width', 100)
df = pd.DataFrame(results)
df.style.set_properties(**{'text-align': 'left'})
df.columns = df.iloc[0]
df = df[1:]
df.fillna(value='', inplace=True)
print(df.head(20))

Not sure if this is how you want it, and there is probably a more elegant way, but I basically did was
Pandas to get the table
Pandas automatically assigns columns, so moved column to first row
Found were rows are all nulls
Dropped rows with all nulls and the previous row (it's sub header)
import pandas as pd
print('Scraping NH Dept of Banking...')
print()
NHurl = 'https://www.nh.gov/banking/corporate-activities/index.htm'
df = pd.read_html(NHurl)[0]
top_row = pd.DataFrame([df.columns], index=[-1])
df.columns = top_row.columns
df = df.append(top_row, sort=True).sort_index().reset_index(drop=True)
null_rows = df[df.isnull().values.all(axis=1)].index.tolist()
drop_hdr_rows = [x - 1 for x in null_rows ]
drop_rows = drop_hdr_rows + null_rows
new_df = df[~df.index.isin(drop_rows)]
Output:
print (new_df.to_string())
0 1 2 3
2 New Bank New Bank New Bank New Bank
3 12/11/18 The Millyard Bank NaN NaN
4 Interstate Bank Combination Interstate Bank Combination Interstate Bank Combination Interstate Bank Combination
5 01/16/19 Optima Bank & Trust Company with and into Camb... Portsmouth, NH 03/29/19
12 Amendment to Articles of Agreement or Incorpor... Amendment to Articles of Agreement or Incorpor... Amendment to Articles of Agreement or Incorpor... Amendment to Articles of Agreement or Incorpor...
13 11/26/18 John Hancock Trust Company Boston, MA 01/14/19
14 12/04/18 Franklin Savings Bank Franklin, NH 01/28/19
15 12/12/18 MFS Heritage Trust Company Boston, MA 01/28/19
16 02/25/19 Ankura Trust Company, LLC Fairfield, CT 03/22/19
17 4/25/19 Woodsville Guaranty Savings Bank Woodsville, NH 06/04/19
18 5/10/19 AB Trust Company New York, NY 06/04/19
19 Reduction in Capital Reduction in Capital Reduction in Capital Reduction in Capital
20 03/07/19 Primary Bank Bedford, NH 04/10/19
21 Amendment to Bylaws Amendment to Bylaws Amendment to Bylaws Amendment to Bylaws
22 12/10/18 Northeast Credit Union Porstmouth, NH 02/25/19
23 2/25/19 Members First Credit Union Manchester, NH 04/05/19
24 4/24/19 St. Mary's Bank Manchester, NH 05/30/19
25 6/28/19 Bellwether Community Credit Union NaN NaN
26 Interstate Branch Office Interstate Branch Office Interstate Branch Office Interstate Branch Office
27 01/23/19 Newburyport Five Cents Savings Bank 141 Portsmouth Ave Exeter, NH 02/01/19
28 03/08/19 One Credit Union Newport, NH 03/29/19
29 03/01/19 JPMorgan Chase Bank, NA Nashua, NH 04/04/19
30 03/26/19 Mascoma Bank Lebanon, NH 04/09/19
31 04/24/19 Newburyport Five Cents Savings Bank 321 Lafayette Rd Hampton NH 05/08/19
32 07/10/19 Mascoma Bank 242-244 North Winooski Avenue Burlington VT 07/18/19
33 07/10/19 Mascoma Bank 431 Pine Street Burlington VT 07/18/19
34 Interstate Branch Office Closure Interstate Branch Office Closure Interstate Branch Office Closure Interstate Branch Office Closure
35 02/15/19 The Provident Bank 321 Lafayette Rd Hampton, NH 02/25/19
36 New Branch Office New Branch Office New Branch Office New Branch Office
37 12/07/18 Bank of New Hampshire 16-18 South Main Street Concord NH 01/02/19
38 3/4/19 Triangle Credit Union 360 Daniel Webster Highway, Merrimack, NH 03/11/19
39 04/03/19 Bellwether Community Credit Union 425-453 Commercial Street Manchester, NH 04/17/19
40 06/11/19 Primary Bank 23 Crystal Avenue Derry NH 06/11/19
41 Branch Office Closure Branch Office Closure Branch Office Closure Branch Office Closure
42 5/15/19 Northeast Credit Union Merrimack, NH 05/21/19
43 New Loan Production Office New Loan Production Office New Loan Production Office New Loan Production Office
44 04/08/19 Community National Bank 367 Route 120, Unit B-5 Lebanon, NH 03766-1430 04/15/19
45 Loan Production Office Closure Loan Production Office Closure Loan Production Office Closure Loan Production Office Closure
46 07/22/19 The Provident Bank 20 Trafalgar Square, Suite 447 Nashua NH 03063 07/31/19
51 Trade Name Requests Trade Name Requests Trade Name Requests Trade Name Requests
52 04/16/19 John Hancock Trust Company To use trade name "Manulife Investment Managem... 04/24/19
53 New Trust Company New Trust Company New Trust Company New Trust Company
54 02/19/19 Janney Trust Co., LLC NaN NaN
55 02/25/19 Darwin Trust Company of New Hampshire, LLC NaN NaN
56 07/15/`9 Harbor Trust Company NaN NaN
57 Dissolution of Trust Company Dissolution of Trust Company Dissolution of Trust Company Dissolution of Trust Company
58 09/19/17 Cambridge Associates Fiduciary Trust, LLC Boston, MA 02/05/19
59 Trust Office Closure Trust Office Closure Trust Office Closure Trust Office Closure
60 5/10/19 Charter Trust Company Rochester, NH 05/20/19
61 New Trust Office New Trust Office New Trust Office New Trust Office
62 02/25/19 Ankura Trust Company, LLC 140 Sherman Street, 4th Floor Fairfield, CT 0... 03/22/19
63 Relocation of Trust Office Relocation of Trust Office Relocation of Trust Office Relocation of Trust Office
64 01/23/19 Geode Capital Management Trust Company, LLC Relocate from: One Post Office Square, 20th Fl... 02/01/19
65 03/15/19 Drivetrain Trust Company LLC Relocate from: 630 3rd Avenue, 21st Flr New Y... 03/29/19
66 04/14/19 Boston Partners Trust Company Relocate from: 909 Third Avenue New York, NY ... 04/23/19

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Reading excel file with line breaks and tabs preserved using xlrd - python

Related

Draw a Map of cities in python

Columns must be same length as key Error python

When can `re.finditer` not return anything but string.index can?

Python question about reading two txt and output one txt

Scrape website to only show populated categories

Categories

Resources