I'm trying to split a text into sentences whenever a terminal punctuation mark ('.', '!', '?') appears. For instance, if I have the following text:
Recognizing the rising opportunity Jerusalem Venture Partners opened
up their Cyber Labs incubator, giving a home to many of the city’s
promising young companies. International corporates like EMC have also
established major centers in the park, leading the way for others to
follow! On a visit last June, the park had already grown to two
buildings with the ground being broken for the construction of more in
the near future. this is really interesting! what do you think?
This should be split into 5 sentences (see the bold words above, as these words end with a punctuation mark).
Here's my code:
# split on: '.+'
splitted_article_content = []
# article_content contains all the article's paragraphs
for element in article_content:
    splitted_article_content = splitted_article_content + re.split(".(?='.'+)", element)
# split on: '?+'
splitted_article_content_2 = []
for element in splitted_article_content:
    splitted_article_content_2 = splitted_article_content_2 + re.split(".(?='?'+)", element)
# split on: '!+'
splitted_article_content_3 = []
for element in splitted_article_content_2:
    splitted_article_content_3 = splitted_article_content_3 + re.split(".(?='!'+)", element)
My question is: is there a more efficient way to do this, without using any external libraries?
Thanks for the help guys.
I guess I see this as more of a look behind than a look ahead:
import re
# article_content contains all the article's paragraphs
# in this case, a single paragraph.
article_content = ["""Recognizing the rising opportunity Jerusalem Venture Partners opened up their Cyber Labs incubator, giving a home to many of the city’s promising young companies. International corporates like EMC have also established major centers in the park, leading the way for others to follow! On a visit last June, the park had already grown to two buildings with the ground being broken for the construction of more in the near future. This is really interesting! What do you think?"""]
split_article_content = []
for element in article_content:
    # raw string so that \s is passed to the regex engine unchanged
    split_article_content += re.split(r"(?<=[.!?])\s+", element)
print(*split_article_content, sep='\n\n')
OUTPUT
% python3 test.py
Recognizing the rising opportunity Jerusalem Venture Partners opened up their Cyber Labs incubator, giving a home to many of the city’s promising young companies.
International corporates like EMC have also established major centers in the park, leading the way for others to follow!
On a visit last June, the park had already grown to two buildings with the ground being broken for the construction of more in the near future.
This is really interesting!
What do you think?
%
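As a minimal sketch of the same lookbehind idea wrapped in a reusable function (the name split_sentences is my own, not from the question), the split does not depend on capitalization, so the lowercase sentences from the original text are handled too. Note that it is only a heuristic: abbreviations such as "Mr. Smith" would also be split.

```python
import re

def split_sentences(paragraphs):
    """Split each paragraph after '.', '!' or '?' followed by whitespace."""
    sentences = []
    for paragraph in paragraphs:
        # The lookbehind keeps the punctuation attached to its sentence.
        parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
        # Drop empty strings that can appear for empty paragraphs.
        sentences.extend(part for part in parts if part)
    return sentences

for sentence in split_sentences(["Hi there. this is fun! what now?"]):
    print(sentence)
```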
I'm trying to clean the following data:
from sklearn import datasets
data = datasets.fetch_20newsgroups(categories=['rec.autos', 'rec.sport.baseball', 'soc.religion.christian'])
texts, targets = data['data'], data['target']
Here texts is a list of articles and targets is a vector containing the index of the category to which each article belongs.
I need to clean all articles. The cleaning task means:
Remove headers
Remove punctuation
Remove parentheses
Remove consecutive blank spaces
Remove emails and tokens with length 1
Remove line breaks
I'm quite new to Python, but I've tried to remove all punctuation and everything else using replace(). However, I think there must be an easier way to do this task.
def clean_articles(article):
    return ' '.join([x for x in article[article.find('\n\n'):].replace('.','').replace('[','')
clean_articles(data['data'][1])
For the following article:
print(data['data'][1])
Uncleaned Article:
'From: aas7#po.CWRU.Edu (Andrew A. Spencer)\nSubject: Re: Too fast\nOrganization: Case Western Reserve University, Cleveland, OH (USA)\nLines: 25\nReply-To: aas7#po.CWRU.Edu (Andrew A. Spencer)\nNNTP-Posting-Host: slc5.ins.cwru.edu\n\n\nIn a previous article, wrat#unisql.UUCP (wharfie) says:\n\n>In article <1qkon8$3re#armory.centerline.com> jimf#centerline.com (Jim Frost) writes:\n>>larger engine. That\'s what the SHO is -- a slightly modified family\n>>sedan with a powerful engine. They didn\'t even bother improving the\n>>brakes.\n>\n>\tThat shows how much you know about anything. The brakes on the\n>SHO are very different - 9 inch (or 9.5? I forget) discs all around,\n>vented in front. The normal Taurus setup is (smaller) discs front, \n>drums rear.\n\none i saw had vented rears too...it was on a lot.\nof course, the sales man was a fool..."titanium wheels"..yeah, right..\nthen later told me they were "magnesium"..more believable, but still\ncrap, since Al is so m uch cheaper, and just as good....\n\n\ni tend to agree, tho that this still doesn\'t take the SHO up to "standard"\nfor running 130 on a regular basis. The brakes should be bigger, like\n11" or so...take a look at the ones on the Corrados.(where they have\nbraking regulations).\n\nDREW\n'
Cleaned Article:
In previous article UUCP wharfie says In article centerline com com Jim Frost writes larger engine That's what the SHO is slightly modified family sedan with powerful engine They didn't even bother improving the *brakes That shows how much you know about anything The brakes on the SHO are very different inch or forget discs all around vented in front The normal Taurus setup is smaller discs front drums rear one saw had vented rears too it was on lot of course the sales man was fool titanium wheels yeah right then later told me they were magnesium more believable but still crap since Al is so uch cheaper and just as good tend to agree tho that this still doesn't take the SHO up to standard for running 130 on regular basis The brakes should be bigger like 11 or so take look at the ones on the Corrados where they have braking regulations DREW
Note: this is not a complete answer, but the following will at least get you halfway:
remove punctuation
remove line breaks
remove consecutive white space
remove parentheses
import re
s = ';\n(a b.,'
print('before:', s)
# raw strings avoid invalid-escape warnings for \[ and \s
s = re.sub(r'[.,;\n(){}\[\]]', '', s)
s = re.sub(r'\s+', ' ', s)
print('after:', s)
This will print:
before: ;
(a b.,
after: a b
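To cover the remaining items from the question (headers, emails, one-character tokens), a rough sketch could look like the following. It rests on two assumptions of mine: that the headers end at the first blank line, and that apostrophes inside words should be kept. The function name clean_article and the sample text are hypothetical.

```python
import re

def clean_article(article):
    """Strip headers, emails, punctuation, extra whitespace and 1-char tokens."""
    # Assumption: the header block ends at the first blank line.
    _, separator, body = article.partition("\n\n")
    if not separator:
        body = article  # no blank line found, treat everything as body
    body = re.sub(r"\S+@\S+", " ", body)    # drop email-like tokens
    body = re.sub(r"[^\w\s']", " ", body)   # drop punctuation, keep apostrophes
    # split() also collapses line breaks and runs of blanks
    tokens = [token for token in body.split() if len(token) > 1]
    return " ".join(tokens)

sample = ("From: a@b.com\nSubject: Re: Too fast\n\n"
          "In a previous article, joe@site.org (Joe) says:\n"
          ">larger engine. That's it.")
print(clean_article(sample))
```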
I've got a CSV which contains articles' text in different rows.
For example, column 1 contains:
Hello i am John
Tom has got a Dog
... more text.
I'm trying to extract the first names and surnames from those texts, and I was able to do that if I copy and paste a single text into the code.
But I don't know how to read the CSV in the code so that it processes the different texts in the rows, extracting the first name and surname from each.
Here is my code, working with the text embedded in it:
import operator,collections,heapq
import csv
import pandas
import json
import nltk
from nameparser.parser import HumanName
def get_human_names(text):
    tokens = nltk.tokenize.word_tokenize(text)
    pos = nltk.pos_tag(tokens)
    sentt = nltk.ne_chunk(pos, binary = False)
    person_list = []
    person = []
    name = ""
    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
        for leaf in subtree.leaves():
            person.append(leaf[0])
        if len(person) > 1: #avoid grabbing lone surnames
            for part in person:
                name += part + ' '
            if name[:-1] not in person_list:
                person_list.append(name[:-1])
            name = ''
        person = []
    return (person_list)
text = """
M.F. Husain, Untitled, 1973, oil on canvas, 182 x 122 cm. Courtesy the Pundole Family Collection
In her essay ‘Worlding Asia: A Conceptual Framework for the First Delhi Biennale’, Arshiya Lokhandwala explores Gayatri Spivak’s provocation of ‘worlding’, which has been defined as imperialism’s epistemic violence of inscribing meaning upon a colonized space to bring it into the world through a Eurocentric framework. Lokhandwala extends this concept of worlding to two anti-cartographical terms: ‘de-worlding’, rejecting or debunking categories that are no longer useful such as the binaries of East-West, North-South, Orient-Occidental, and ‘re-worlding’, re-inscribing new meanings into the spaces that have been de-worlded to create one’s own worlds. She offers de-worlding and re-worlding as strategies for active resistance against epistemic violence of all forms, including those that stem from ‘colonialist strategies of imperialism’ or from ‘globalization disguised within neo-imperialist practices’.
Lokhandwala writes: Fourth World. The presence of Arshiya is really the main thing here.
Re-worlding allows us to reach a space of unease performing the uncanny, thereby locating both the object of art and the postcolonial subject in the liminal space, which prevents these categorizations as such… It allows an introspected view of ourselves and makes us seek our own connections, and look at ourselves through our own eyes.
In a recent exhibition on the occasion of the seventieth anniversary of India’s Independence, Lokhandwala employed the term to seemingly interrogate this proposition: what does it mean to re-world a country through the agonistic intervention of art and activism? What does it mean for a country and its historiography to re-world? What does this re-worlded India, in active resistance and a state of introspection, look like to itself?
The exhibition ‘India Re-Worlded: Seventy Years of Investigating a Nation’ at Gallery Odyssey in Mumbai (11 September 2017–21 February 2018) invited artists to select a year from the seventy years since the country’s independence that had personal import or resonated with them because of the significance of the events that occurred at the time. The show featured works that responded to or engaged with these chosen years. It captured a unique history of post-independent India told through the perspective of seventy artists. The works came together to collectively reflect on the history and persistence of violence from pre-independence to the present day and made reference to the continued struggle for political agency through acts of resistance, artistic and otherwise. Through the inclusion of subaltern voices, imagined geographies, particular experiences, solidarities and critical dissent, the exhibition offered counter-narratives and multiple histories.
Anita Dube, Missing Since 1992, 2017, wood, electrical wire, holders, bulbs, voltage stabilizers, 223 x 223 cm. Courtesy the artist and Gallery Odyssey
Lokhandwala says she had been thinking hard about an appropriate response to the seventy years of independence. ‘I wanted to present a new curatorial paradigm, a postcolonial critique of the colonisation and an affirmation of India coming into her own’, she says. ‘I think the fact that I tried to include seventy artists to [each take up] one year in the lifetime of the nation was also a challenging task to take on curatorially.’
Her previous undertaking ‘After Midnight: Indian Modernism to Contemporary India: 1947/1997’ at the Queens Museum in New York in 2015 juxtaposed two historical periods in Indian art: Indian modern art that emerged in the post-independence period from 1947 through the 1970s, and contemporary art from 1997 onwards when the country experienced the effects of economic liberalization and globalization. The 'India Re-Worlded' exhibition similarly presented art practices that emerged from the framework of postcolonial Indian modernity. It attempted to explore the self-reflexivity of the Indian artist as a postcolonial subject and, as Lokhandwala described in the curatorial note, the artists’ resulting ‘sense of agency and renewed connection with the world at large’. The exhibition included works by Progressive Artists' Group core members F.N. Souza, S.H. Raza, M.F. Husain and their peers Krishen Khanna, Tyeb Mehta and V.S. Gaitonde, presented under the year in which they were produced. Other important and pioneering pieces included work from Somnath Hore’s paper pulp print series Wounds (1970); a blowtorch on plywood work by abstractionist Jeram Patel, who was one of the founding members of Group 1890 ; and a video documenting one of Rummana Husain’s last performances.
The methodology of their display removed the didactic, art historical preoccupation with chronology and classification, instead opting to intersperse them amongst contemporary works. This fits in with Lokhandwala’s curatorial impulses and vision: to disrupt and resist single narratives, to stage dialogues and interactions between the works, to offer overlaps, intersections and nuances in the stories, but also in the artistic impetuses.
Jeram Patel, Untitled, 1970, blowtorch Fourht World on plywood, 61 x 61 cm. Courtesy the artist and Gallery Odyssey
The show opened with Jitish Kallat’s Death of Distance (2006), then we have Arshiya, which through lenticular prints presented two overlaid found texts from 2005 and 2006. One was a harrowing news story of a twelve-year-old Indian girl committing suicide after her mother tells her she cannot afford one rupee – two US cents – for a school meal. The other one was a news clipping in which the head of the state-run telecommunications company announces a new one-rupee-per-minute tariff plan for interstate phone calls and declares the scheme as ‘the death of distance’. The images offer two realities that are distant from and at odds with each other. They highlight an economic disparity heightened by globalization. A rupee coin, enlarged to a human scale and covered in black lead, stood poised on the gallery floor in front of the prints.
Bose Krishnamachari chose 1962, the year of his birth, to discuss the relationship between memory and age. As a visual representation of the country’s past through a timeline, within which he situated his own identity-questioning experiences as an artist, his work epitomized the themes and intentions of the exhibition. In Shilpa Gupta’s single channel video projection 100 Hand drawn Maps of India (2007–8) ordinary Indian people sketch outlines of the country from memory. The subjective maps based on the author’s impression and perception of space show how each person sees the country and articulates its borders. The work seems to ask, what do these incongruent representations reveal about our collective identities and our ideas about nationhood?
The repetition of some of the years selected, or even the absence of certain years, suggested that the parameters set by the curatorial concept sought to guide rather than clamp down on. This allowed greater freedom for the artists and curator, and therefore more considered and wide responses.
Surekha’s photographic series To Embrace (2017) celebrated the Chipko tree-hugging movement that originated on 25 March 1974, when 27 women from Reni village in Uttar Pradesh in northern India staged a self-organised, non-violent resistance to the felling of trees by clinging to them and linking arms around them. The photographs showed women embracing the branches of the giant, 400-year-old Dodda Alada Mara (Big Banyan Tree) in rural Bengaluru – paying a homage to both the pioneering eco-feminist environmental movement and the grand old tree.
Anita Dube’s Missing Since 1992 (2017) hung from the ceiling like a ghost of a terrible, dark past. Its electrical wires and bulbs outlined a sombre dome to represent the demolition of the Babri Masjid on 6 December 1992, which Dube calls ‘the darkest day I have experienced as a citizen’. This piece was one of several works in the exhibition that dealt with this event and the many episodes of communal riots that followed. These works document a decade when the country witnessed economic reform and growth but also the rise of a religious right-wing.
Riyas Komu, Fourth World, 2017, rubber and metal, 244 x 45 cm each. Courtesy the artist and Gallery Odyssey
Near the end of the exhibition, Riyas Komu’s sculptural installation Fourth World (2017) alerted us to the divisive forces that are threatening to dismantle the ethical foundations of the Republic symbolized by its official emblem, the Lion Capital – a symbol seen also on the blackened rupee coin featured in Kallat’s work – and in a way rounded off the viewing experience.
The seventy works that attempted to represent seventy years of the country’s history built a dense and complicated network of voices and stories, and also formed a cross section of the art emerging during this period. Although the show’s juxtaposition of modern and contemporary art made it seem like an extension of the themes presented in the curator’s previous exhibition at the Queens Museum, here the curatorial concept made the process of staging the exhibition more democratic blurring the sequence of modern and contemporary Indian art. Furthermore, the multi-pronged curatorial intentions brought renewed criticality to the events of past and present, always underscoring the spirit of resistance and renegotiation as the viewer could actively de-world and re-world.
"""
names = get_human_names(text)
print ("LAST, FIRST")
namex=[]
for name in names:
    last_first = HumanName(name).last + ' ' + HumanName(name).first
    print (last_first)
    namex.append(last_first)
print (namex)
print('Saving the data to the json file named Names')
try:
    with open('Names.json', 'w') as outfile:
        json.dump(namex, outfile)
except Exception as e:
    print(e)
So I would like to remove all the text from the code and have the code process the text from my CSV instead.
Thanks a lot :)
CSV stands for Comma Separated Values and is a text format used to represent tabular data in plain text. Commas are used as column separators and line breaks as row separators. Your string does not look like a real CSV file. Never mind the extension; you can still read your text file like this:
with open('your_file.csv', 'r') as f:
    my_text = f.read()
Your text file is now available as my_text in the rest of your code.
Pandas also has a read_csv function, which returns a DataFrame rather than a plain string:
yourText= pandas.read_csv("csvFile.csv")
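If each row holds a separate text, one possible approach is to read the file row by row with the standard csv module and pass each cell to the extraction function. The sketch below is only an illustration: extract_from_rows and capitalized_words are hypothetical names, and capitalized_words is a trivial stand-in for get_human_names so that the example runs without NLTK.

```python
import csv
import io

def extract_from_rows(csv_file, column_index, extractor):
    """Apply `extractor` to one column of every row and collect the results."""
    results = []
    for row in csv.reader(csv_file):
        # Skip short or empty rows instead of failing on them.
        if len(row) > column_index and row[column_index].strip():
            results.extend(extractor(row[column_index]))
    return results

def capitalized_words(text):
    """Stand-in extractor (assumption): returns the capitalized words."""
    return [word for word in text.split() if word[:1].isupper()]

# In-memory sample standing in for the real CSV file.
sample = io.StringIO("Hello i am John\nTom has got a Dog\n")
found = extract_from_rows(sample, 0, capitalized_words)
print(found)
```

With a real file, the same function could be called as extract_from_rows(f, 0, get_human_names) inside a with open('your_file.csv', newline='') as f: block.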
I am a relative beginner in Python and I am looking for a solution to the following problem.
I have to "scan" texts looking for concepts.
A concept looks like this: (electrical 3d car)
I have to look for appearances of the word "car" where, within a proximity of 3 words ('3d'), there is another word, 'electrical' (for instance: electrical conventional car, electrical propelled car, electrical driven autonomous car, etc.)
I know that you can work with a text as if it were a list of words and punctuation.
I thought about the following solution:
with open(filepath) as fp:
    concept=['motor','3d','car']
    concept_appearences=0 ## counters
    concept_positions=[]
    concept_list=[]
    word1 = concept[0]
    word2 = concept[2]
    separator=concept[1]
    distance=[int(s) for s in separator if s.isdigit()]
    distance=int(distance[0])
    distanceright=distance
    if 'd' in separator: distanceleft=distance
    if 'w' in separator: distanceleft=0
    for line in fp:
        ## look for the concepts in every line which is like a paragraph
        for index,word in enumerate(line.split()):
            if word.upper()==word1.upper():
                ## i found the first concept-word
                for i in range(index-distanceleft,index+distanceright,1):
                    if line.split()[i]==word2:
                        ##print('thline.split()[i],word2)
                        print('i found the concept in postion', i )
                        start,end=i,index
                        if index<i:start,end=index,i
                        print('check:',line.split()[start:end+1])
                        concept_appearences +=1
                        concept_list.append(line.split()[start:end+1])
                        concept_positions.append(start)
print('the concept appeared {} times'.format(concept_appearences))
print('in positoins',concept_positions)
print('list of concepts',concept_list)
Note: not yet implemented is the case where there is a period between the two words, which should exclude the hit from being a concept (like: blah blah electrical. The car of my aunt blah blah... which should not be a hit, for obvious reasons).
Probably not a super-pythonic code, but it works so far.
The questions here are:
First:
This seems to me to be quite a common problem. Is there a library specifically for that?
I don't even know the "technical" name for such a thing beyond "proximity operators".
Note: I read quite a bit about NLTK (a NL library) but did not really find a solution for that.
Second:
Any idea how to make this code scalable? Meaning that (electrical 3d car) could itself become a concept within a concept, for instance when looking for (electrical 3d car) near "gasoline", with gasoline being no more than 10 words away:
((electrical 3d car) 10w gasoline)
Third:
If there is no library for such a thing, any comment on speed is welcome; I have to look for thousands of concepts within a 100-page text.
Thanks a lot.
Edit: adding an input and output file as requested by @Mathieu, thanks.
INPUT TEXT:
Referring to FIGS. 1 to 3, an electrical car 1 comprises bodywork 2, pairs of ground engaging wheels 3, 4 front and rear, an electric motor 5 driving the front wheels 3 through a suitable transmission (not shown) and electrical battery sets 6A, 6B for supplying electrical power to the motor 5. Suitable control gear (not shown) adapted for operation by the car driver serves to control operation of the electric motor 5 and hence motion of the car 1. An old electrical driven car found in the garage was painted yellow instead of blue.
output:
the concept appeared 2 times
in positoins [7, 82]
list [['electrical', 'car'], ['electrical', 'driven', 'car']]
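One way to sketch the proximity search without relying on list indices that may run past the ends of the line is to record the positions of both words first and then compare them. This is only an illustration under my own assumptions: the function name find_proximity, its parameters and the sample sentence are hypothetical, tokens are plain \w+ runs, and sentence boundaries are ignored, just as in the code above.

```python
import re

def find_proximity(text, first, second, max_distance, ordered=False):
    """Return (start, tokens) pairs where `first` and `second` lie within
    `max_distance` tokens of each other; `ordered` requires first before second."""
    tokens = re.findall(r"\w+", text.lower())
    first_positions = [i for i, tok in enumerate(tokens) if tok == first.lower()]
    second_positions = [i for i, tok in enumerate(tokens) if tok == second.lower()]
    hits = []
    for i in first_positions:
        for j in second_positions:
            if ordered and j < i:
                continue  # second word appears before the first one
            if 0 < abs(j - i) <= max_distance:
                start, end = min(i, j), max(i, j)
                hits.append((start, tokens[start:end + 1]))
    return hits

sample = ("An old electrical driven car found in the garage. "
          "The electrical car is blue.")
hits = find_proximity(sample, "electrical", "car", 3)
print(hits)
```

Because only the stored positions are compared, the text is tokenized once per call rather than once per candidate word, which should matter when many concepts are searched in a long text.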
I'm trying to extract text from the online version of The Wealth of Nations and create a data frame where each observation is a page of the book. I do it in a roundabout way, trying to imitate something similar I did in R, but I was wondering if there was a way to do this directly in BeautifulSoup.
What I do is first get the entire text from the page:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
r = requests.get('https://www.gutenberg.org/files/38194/38194-h/38194-h.htm')
soup = BeautifulSoup(r.text,'html.parser')
But from here on, I'm just working with regular expressions and the text. I find the beginning and end of the book text:
beginning = [a.start() for a in re.finditer(r"BOOK I\.",soup.text)]
beginning
end = [a.start() for a in re.finditer(r"FOOTNOTES",soup.text)]
book = soup.text[beginning[1]:end[0]]
Then I remove the carriage returns and new lines and split on strings of the form "[Pg digits]" and put everything into a pandas data frame.
book = book.replace('\r',' ').replace('\n',' ')
l = re.compile(r'\[[Pp]g\s?\d{1,3}\]').split(book)
df = pd.DataFrame(l,columns=['col1'])
df['page'] = range(2,df.shape[0]+2)
There are indicators in the HTML code for page numbers of the form <span class='pagenum'><a name="Page_vii" id="Page_vii">[Pg vii]</a></span>. Is there a way I can do the text extraction in BeautifulSoup by searching for text between these "spans"? I know how to search for the page markers using find_all, but I was wondering how I can extract text between those markers.
To get the page markers and the text associated with it, you can use bs4 with re. In order to match text between two markers, itertools.groupby can be used:
from bs4 import BeautifulSoup as soup
import requests
import re
import itertools
page_data = requests.get('https://www.gutenberg.org/files/38194/38194-h/38194-h.htm').text
final_data = [(i.find('a', {'name':re.compile(r'Page_\w+')}), i.text) for i in soup(page_data, 'html.parser').find_all('p')]
new_data = [list(b) for a, b in itertools.groupby(final_data, key=lambda x:bool(x[0]))][1:]
final_data = {new_data[i][0][0].text:'\n'.join(c for _, c in new_data[i+1]) for i in range(0, len(new_data), 2)}
Output (Sample, the actual result is too long for SO format):
{'[Pg vi]': "'In recompense for so many mortifying things, which nothing but truth\r\ncould have extorted from me, and which I could easily have multiplied to a\r\ngreater number, I doubt not but you are so good a christian as to return good\r\nfor evil, and to flatter my vanity, by telling me, that all the godly in Scotland\r\nabuse me for my account of John Knox and the reformation.'\nMr. Smith having completed, and given to the world his system of\r\nethics, that subject afterwards occupied but a small part of his lectures.\r\nHis attention was now chiefly directed to the illustration of\r\nthose other branches of science which he taught; and, accordingly, he\r\nseems to have taken up the resolution, even at that early period, of\r\npublishing an investigation into the principles of what he considered\r\nto be the only other branch of Moral Philosophy,—Jurisprudence, the\r\nsubject of which formed the third division of his lectures. At the\r\nconclusion of the Theory of Moral Sentiments, after treating of the\r\nimportance of a system of Natural Jurisprudence, and remarking that\r\nGrotius was the first, and perhaps the only writer, who had given any\r\nthing like a system of those principles which ought to run through,\r\nand be the foundation of the law of nations, Mr. Smith promised, in\r\nanother discourse, to give an account of the general principles of law\r\nand government, and of the different revolutions they have undergone\r\nin the different ages and periods of society, not only in what concerns\r\njustice, but in what concerns police, revenue, and arms, and whatever\r\nelse is the object of law.\nFour years after the publication of this work, and after a residence\r\nof thirteen years in Glasgow, Mr. Smith, in 1763, was induced to relinquish\r\nhis professorship, by an invitation from the Hon. Mr. Townsend,\r\nwho had married the Duchess of Buccleugh, to accompany the\r\nyoung Duke, her son, in his travels. 
Being indebted for this invitation\r\nto his own talents alone, it must have appeared peculiarly flattering\r\nto him. Such an appointment was, besides, the more acceptable,\r\nas it afforded him a better opportunity of becoming acquainted with\r\nthe internal policy of other states, and of completing that system of\r\npolitical economy, the principles of which he had previously delivered\r\nin his lectures, and which it was then the leading object of his studies\r\nto perfect.\nMr. Smith did not, however, resign his professorship till the day\r\nafter his arrival in Paris, in February 1764. He then addressed the\r\nfollowing letter to the Right Honourable Thomas Millar, lord advocate\r\nof Scotland, and then rector of the college of Glasgow:—", '[Pg vii]': "His lordship having transmitted the above to the professors, a meeting\r\nwas held; on which occasion the following honourable testimony\r\nof the sense they entertained of the worth of their former colleague\r\nwas entered in their minutes:—\n'The meeting accept of Dr. Smith's resignation in terms of the above letter;\r\nand the office of professor of moral philosophy in this university is therefore\r\nhereby declared to be vacant. The university at the same time, cannot\r\nhelp expressing their sincere regret at the removal of Dr. Smith, whose distinguished\r\nprobity and amiable qualities procured him the esteem and affection\r\nof his colleagues; whose uncommon genius, great abilities, and extensive\r\nlearning, did so much honour to this society. His elegant and ingenious\r\nTheory of Moral Sentiments having recommended him to the esteem of men\r\nof taste and literature throughout Europe, his happy talents in illustrating\r\nabstracted subjects, and faithful assiduity in communicating useful knowledge,\r\ndistinguished him as a professor, and at once afforded the greatest pleasure,\r\nand the most important instruction, to the youth under his care.'\nIn the first visit that Mr. 
Smith and his noble pupil made to Paris,\r\nthey only remained ten or twelve days; after which, they proceeded\r\nto Thoulouse, where, during a residence of eighteen months, Mr. Smith\r\nhad an opportunity of extending his information concerning the internal\r\npolicy of France, by the intimacy in which he lived with some of\r\nthe members of the parliament. After visiting several other places in\r\nthe south of France, and residing two months at Geneva, they returned\r\nabout Christmas to Paris. Here Mr. Smith ranked among his\r\nfriends many of the highest literary characters, among whom were\r\nseveral of the most distinguished of those political philosophers who\r\nwere denominated Economists."}
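The grouping step can be easier to follow on a toy list that needs no network access. The sketch below uses a hypothetical list of (marker, text) tuples in place of the parsed paragraphs and assumes the list starts with a page marker; itertools.groupby then alternates between runs of markers and runs of text, which are paired up two at a time.

```python
import itertools

# Hypothetical stand-in for the parsed paragraphs: (marker or None, text)
paragraphs = [
    ("[Pg 1]", ""),
    (None, "first paragraph"),
    (None, "second paragraph"),
    ("[Pg 2]", ""),
    (None, "third paragraph"),
]

# Consecutive items with the same key (marker present or not) form one group.
groups = [list(group) for _, group in
          itertools.groupby(paragraphs, key=lambda p: p[0] is not None)]

pages = {}
# Even-indexed groups hold markers, odd-indexed groups hold the text that follows.
for marker_group, text_group in zip(groups[0::2], groups[1::2]):
    pages[marker_group[-1][0]] = "\n".join(text for _, text in text_group)

print(pages)
```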
Here is an example substring from the text I'm trying to parse and a couple of the raw strings I'm trying to split this text with:
>>> test_string = "[shelter and transitional housing during shelter crisis - selection of sites;\nwaiver of certain requirements regarding contracting]\n\nsponsors: acting mayor breed; kim, ronen, sheehy and cohen\nordinance authorizing public works, the department of homelessness and supportive\nhousing, and the department of public health to enter into contracts without adhering to the\nadministrative code or environment code provisions regarding competitive bidding and other\nrequirements for construction work, procurement, and personal services relating to identified\nshelter crisis sites (1601 quesada avenue; 149-6th street; 125 bayshore boulevard; 13th\nstreet and south van ness avenue, southwest corner; 5th street and bryant street, northwest\ncorner; caltrans emergency shelter properties; and existing city navigation centers and\nshelters) that will provide emergency shelter or transitional housing to persons experiencing\nhomelessness; authorizing the director of property to enter into and amend leases or licenses\nfor the shelter crisis sites without adherence to certain provisions of the administrative code;\nauthorizing the director of public works to add sites to the list of shelter crisis sites subject to\nexpedited processing, procurement, and leasing upon written notice to the board of\nsupervisors, and compliance with conditions relating to environmental review and\nneighborhood notice; affirming the planning department’s determination under the californinenvironmental quality act; and making findings of consistency with the general plan, and the eight priority policies of planning code, section 101.1. assigned under 30 day rule to\nrules committee.\n[memorandum of understanding - service employees international union, local\n1021]\n\nsponsor: acting mayor breed"
>>> title = re.compile(r"\[([\s\S]*)\]")
>>> title = re.compile(r"\[.*\]")
What I want is to get a list of all strings enclosed in square brackets: []
>>> title.split(test_string)
['shelter and transitional housing during shelter crisis - selection of sites; waiver of certain requirements regarding contracting', 'memorandum of understanding - service employees international union, local 1021']
However, none of these raw strings split properly. It seems to me that re is including the closing criterion ] as part of the non-whitespace character set when it should be the character that the string is split on.
I tried modifying the raw string to split on to be like this:
title = re.compile(r"\[([\s\S^\]]*)\]")
but that doesn't work either. I'm interpreting this last string to split on substrings that have [ in them, followed by any number of characters except for ], and followed by ].
How am I misunderstanding this?
[\s\S^\]] means: whitespace, or non-whitespace, or a literal caret ^, or ]. A caret only negates a character class when it is the first character inside the brackets, so you cannot mix a negation into a regular class this way. I think it's enough to use the class "anything but a closing ]": [^]], see the example below.
You can also use findall instead of split.
re.findall(r'\[([^]]*)\]', test_string)[0]
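To get every bracketed title rather than only the first, the [0] index can be dropped. The sketch below uses a shortened, made-up sample instead of the full test_string and also collapses the embedded line breaks into single spaces, which seems to be what the desired output in the question shows. For comparison, the pattern \[.*\] fails here because . does not match a newline by default, and \[([\s\S]*)\] is greedy, so it runs from the first [ to the last ].

```python
import re

# Shortened, hypothetical sample with the same shape as test_string
sample = ("[first title;\nsecond line]\n\nsponsor: someone\n"
          "[another title - local\n1021]\n\nsponsor: x")

# [^\]]* matches any run of characters that are not a closing bracket,
# so each match stops at the first ] after its opening [.
raw_titles = re.findall(r"\[([^\]]*)\]", sample)

# Collapse line breaks and repeated blanks inside each title.
titles = [" ".join(title.split()) for title in raw_titles]
print(titles)
```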