Parsing XML file with many children and grandchildren - python

I am very new to python. I have this very large xml file and I want to extract some data from it. Here is an excerpt:
<program>
<id>38e072a7-8fc9-4f9a-8eac-3957905c0002</id>
<programID>3853</programID>
<orchestra>New York Philharmonic</orchestra>
<season>1842-43</season>
<concertInfo>
<eventType>Subscription Season</eventType>
<Location>Manhattan, NY</Location>
<Venue>Apollo Rooms</Venue>
<Date>1842-12-07T05:00:00Z</Date>
<Time>8:00PM</Time>
</concertInfo>
<worksInfo>
<work ID="52446*">
<composerName>Beethoven, Ludwig van</composerName>
<workTitle>SYMPHONY NO. 5 IN C MINOR, OP.67</workTitle>
<conductorName>Hill, Ureli Corelli</conductorName>
</work>
<work ID="8834*4">
<composerName>Weber, Carl Maria Von</composerName>
<workTitle>OBERON</workTitle>
<movement>"Ozean, du Ungeheuer" (Ocean, thou mighty monster), Reiza (Scene and Aria), Act II</movement>
<conductorName>Timm, Henry C.</conductorName>
<soloists>
<soloist>
<soloistName>Otto, Antoinette</soloistName>
<soloistInstrument>Soprano</soloistInstrument>
<soloistRoles>S</soloistRoles>
</soloist>
</soloists>
</work>
<work ID="3642*">
<composerName>Hummel, Johann</composerName>
<workTitle>QUINTET, PIANO, D MINOR, OP. 74</workTitle>
<soloists>
<soloist>
<soloistName>Scharfenberg, William</soloistName>
<soloistInstrument>Piano</soloistInstrument>
<soloistRoles>A</soloistRoles>
</soloist>
<soloist>
<soloistName>Hill, Ureli Corelli</soloistName>
<soloistInstrument>Violin</soloistInstrument>
<soloistRoles>A</soloistRoles>
</soloist>
<soloist>
<soloistName>Derwort, G. H.</soloistName>
<soloistInstrument>Viola</soloistInstrument>
<soloistRoles>A</soloistRoles>
</soloist>
<soloist>
<soloistName>Boucher, Alfred</soloistName>
<soloistInstrument>Cello</soloistInstrument>
<soloistRoles>A</soloistRoles>
</soloist>
<soloist>
<soloistName>Rosier, F. W.</soloistName>
<soloistInstrument>Contrabass</soloistInstrument>
<soloistRoles>A</soloistRoles>
</soloist>
</soloists>
</work>
<work ID="0*">
<interval>Intermission</interval>
</work>
<work ID="8834*3">
<composerName>Weber, Carl Maria Von</composerName>
<workTitle>OBERON</workTitle>
<movement>Overture</movement>
<conductorName>Etienne, Denis G.</conductorName>
</work>
<work ID="8835*1">
<composerName>Rossini, Gioachino</composerName>
<workTitle>ARMIDA</workTitle>
<movement>Duet</movement>
<conductorName>Timm, Henry C.</conductorName>
<soloists>
<soloist>
<soloistName>Otto, Antoinette</soloistName>
<soloistInstrument>Soprano</soloistInstrument>
<soloistRoles>S</soloistRoles>
</soloist>
<soloist>
<soloistName>Horn, Charles Edward</soloistName>
<soloistInstrument>Tenor</soloistInstrument>
<soloistRoles>S</soloistRoles>
</soloist>
</soloists>
</work>
<work ID="8837*6">
<composerName>Beethoven, Ludwig van</composerName>
<workTitle>FIDELIO, OP. 72</workTitle>
<movement>"In Des Lebens Fruhlingstagen...O spur ich nicht linde," Florestan (aria)</movement>
<conductorName>Timm, Henry C.</conductorName>
<soloists>
<soloist>
<soloistName>Horn, Charles Edward</soloistName>
<soloistInstrument>Tenor</soloistInstrument>
<soloistRoles>S</soloistRoles>
</soloist>
</soloists>
</work>
<work ID="8336*4">
<composerName>Mozart, Wolfgang Amadeus</composerName>
<workTitle>ABDUCTION FROM THE SERAGLIO,THE, K.384</workTitle>
<movement>"Ach Ich liebte," Konstanze (aria)</movement>
<conductorName>Timm, Henry C.</conductorName>
<soloists>
<soloist>
<soloistName>Otto, Antoinette</soloistName>
<soloistInstrument>Soprano</soloistInstrument>
<soloistRoles>S</soloistRoles>
</soloist>
</soloists>
</work>
<work ID="5543*">
<composerName>Kalliwoda, Johann W.</composerName>
<workTitle>OVERTURE NO. 1, D MINOR, OP. 38</workTitle>
<conductorName>Timm, Henry C.</conductorName>
</work>
</worksInfo>
</program>
<program>
What I would like to do is extract the following pieces of information: programID, orchestra, season, eventType, work ID, soloistName, solositInstrument, soloistRole
Here is the code I am using:
import csv
import xml.etree.cElementTree as ET
tree = ET.iterparse('complete.xml.txt')
#root = tree.getroot()
for program in root.iter('program'):
ID = program.findtext('id')
programID = program.findtext('programID')
orchestra = program.findtext('orchestra')
season = program.findtext('season')
for concert in program.findall('concertInfo'):
event = concert.findtext('eventType')
for worksInfo in program.findall('worksInfo'):
for work in worksInfo.iter('work'):
workid = work.get('ID')
for soloists in work.iter('soloists'):
for soloist in soloists.iter('soloist'):
soloname = soloist.findtext('soloistName')
soloinstrument = `soloist.findtext('soloistInstrument')`
solorole = soloist.findtext('soloistRoles')
#print(soloname, soloinstrument, solorole)
#print(workid)
#print(event)
#print(programID , " , " , orchestra , " , " , season)
with open("nyphil.txt","a") as nyphil:
nyphilwriter = csv.writer(nyphil)
nyphilwriter.writerow([programID, orchestra, season, event, workid, `soloname.encode('utf-8'), soloinstrument, solorole])
nyphil.close()
When I run this code I only get the last soloistName and soloistInstrumet. The outcome that I have in mind is sort of like a repeated observations for each program. So I'd have something like:
13918, New York Philharmonic, 1842-43, Subscription Season, 52446*, Otto, Antoinette, Soprano, S
13918,...., 3642*, Scharfenberg, William , Piano, A
13918,...., 3642*, Hill, Ureli Corelli , Violin, A
and so on until the last work ID:
13918,...., 8336*4 , Otto, Antoinette, Soprano, S
What I am getting is only the last work:
13918, New York Philharmonic, 1842-43, Subscription Season, 8336*, Otto, Antoinette, Soprano, S
In the file there are over 15,000 programs like the example I posted. I want to parse all of them and extract the information I mentioned above. I am not entirely sure how to go about doing this, I've scoured the internet for a way to do this, but everything I tried just doesn't work!!

Your problem here is that you misunderstand the way loops work. Specifically, the values only change while you're in the loop:
for x in range(10):
pass
print(x) # prints 9
vs
for x in range(10):
print(x)
Those are two different things. You're doing the former. What you need to do is something like this:
with open('nyphil.txt', 'w') as f:
nyphilwriter = csv.writer(f)
for program in root.iter('program'):
id_ = program.findtext('id')
program_id = program.findtext('programID')
orchestra = program.findtext('orchestra')
season = program.findtext('season')
for concert in program.findall('concertInfo'):
event = concert.findtext('eventType')
for info in program.findall('worksInfo'):
for work in info.iter('work'):
work_id = work.get('ID')
for soloists in work.iter('soloists'):
for soloist in soloists.iter('soloist'):
# Change this line to whatever you want to write out
nyphilwriter.writerow([id, program_id, orchestra, season, event, work_id, soloist.findtext('soloistName')])

The 13918 does not appear in your data. Leaving that aside, here's what I wrote, which appears to process your data successfully.
from lxml import etree
tree = etree.parse('test.xml')
programs = tree.xpath('.//program')
for program in programs:
programID, orchestra, season = [program.xpath(_)[0].text for _ in ['programID', 'orchestra', 'season']]
print (programID, orchestra, season)
works = program.xpath('worksInfo/work')
for work in works:
workID = work.attrib['ID']
soloistItems = work.xpath('soloists/soloist')
for soloistItem in soloistItems:
print (workID, soloistItem.find('soloistName').text, soloistItem.find('soloistInstrument').text, soloistItem.find('soloistRoles').text)
The script produces the following output.
3853 New York Philharmonic 1842-43
8834*4 Otto, Antoinette Soprano S
3642* Scharfenberg, William Piano A
3642* Hill, Ureli Corelli Violin A
3642* Derwort, G. H. Viola A
3642* Boucher, Alfred Cello A
3642* Rosier, F. W. Contrabass A
8835*1 Otto, Antoinette Soprano S
8835*1 Horn, Charles Edward Tenor S
8837*6 Horn, Charles Edward Tenor S
8336*4 Otto, Antoinette Soprano S
One other thing to note: I put a tag at the beginning of your XML and a at the end since the real data would contain multiple elements.

Related

Writing text to a txt file and it's adding an extra line

This is my code:
with open("recordsAudit.txt", "w") as databasefile:
for auditer, auditdate in zip(auditers, audit_Date):
databasefile.write("%s, %s\n" % (auditer, auditdate))
In the auditers and audit_Date list we can find:
auditers = [Frank Price, Robert Real]
audit_Date = [6/30/2022, 7/5/2022]
When I open "recordsAudit.txt" I find this:
William Silva, 6/30/2022
Robert Real, 7/5/2022
But I want this:
William Silva, 6/30/2022
Robert Real, 7/5/2022
If I take away the "\n" at the end, it turns to this:
William Silva, 6/30/2022Robert Real, 7/5/2022

Get information from a xml imdb response with no tags

I'm working on a movie data base, getting responses from imdb. I'm getting the response in a xml format, but it has no tags, just the information mixed. How can I get each of the data in there?
Here's how the respone shows up:
<?xml version="1.0" encoding="UTF-8"?><root response="True">
<movie title="Batman" year="1989" rated="PG-13" released="23 Jun 1989" runtime="126 min" genre="Action, Adventure" director="Tim Burton" writer="Bob Kane (Batman characters), Sam Hamm (story), Sam Hamm (screenplay), Warren Skaaren (screenplay)" actors="Michael Keaton, Jack Nicholson, Kim Basinger, Robert Wuhl" plot="Gotham City. Crime boss Carl Grissom (Jack Palance) effectively runs the town but there's a new crime fighter in town - Batman (Michael Keaton). Grissom's right-hand man is Jack Napier (Jack Nicholson), a brutal man who is not entirely sane... After falling out between the two Grissom has Napier set up with the Police and Napier falls to his apparent death in a vat of chemicals. However, he soon reappears as The Joker and starts a reign of terror in Gotham City. Meanwhile, reporter Vicki Vale (Kim Basinger) is in the city to do an article on Batman. She soon starts a relationship with Batman's everyday persona, billionaire Bruce Wayne." language="English, French, Spanish" country="USA, UK" awards="Won 1 Oscar. Another 8 wins & 26 nominations." poster="https://m.media-amazon.com/images/M/MV5BMTYwNjAyODIyMF5BMl5BanBnXkFtZTYwNDMwMDk2._V1_SX300.jpg" metascore="69" imdbRating="7.6" imdbVotes="302,842" imdbID="tt0096895" type="movie" />
</root>
Here is my answer to your question
xmlRaw="""< ?xml
version = "1.0"
encoding = "UTF-8"? > < root
response = "True" >
< movie
title = "Batman"
year = "1989"
rated = "PG-13"
released = "23 Jun 1989"
runtime = "126 min"
genre = "Action, Adventure"
director = "Tim Burton"
writer = "Bob Kane (Batman characters), Sam Hamm (story), Sam Hamm (screenplay), Warren Skaaren (screenplay)"
actors = "Michael Keaton, Jack Nicholson, Kim Basinger, Robert Wuhl"
plot = "Gotham City. Crime boss Carl Grissom (Jack Palance) effectively runs the town but there's a new crime fighter in town - Batman (Michael Keaton). Grissom's right-hand man is Jack Napier (Jack Nicholson), a brutal man who is not entirely sane... After falling out between the two Grissom has Napier set up with the Police and Napier falls to his apparent death in a vat of chemicals. However, he soon reappears as The Joker and starts a reign of terror in Gotham City. Meanwhile, reporter Vicki Vale (Kim Basinger) is in the city to do an article on Batman. She soon starts a relationship with Batman's everyday persona, billionaire Bruce Wayne."
language = "English, French, Spanish"
country = "USA, UK"
awards = "Won 1 Oscar. Another 8 wins & 26 nominations."
poster = "https://m.media-amazon.com/images/M/MV5BMTYwNjAyODIyMF5BMl5BanBnXkFtZTYwNDMwMDk2._V1_SX300.jpg"
metascore = "69"
imdbRating = "7.6"
imdbVotes = "302,842"
imdbID = "tt0096895"
type = "movie" / >
< / root >"""
def getValue(xml, value):
textString = xmlRaw.split('\n')
for line in textString:
if value in line:
returnData = line
return returnData
print (getValue(xmlRaw, 'title'))
print (getValue(xmlRaw, 'year'))
print (getValue(xmlRaw, 'rated'))
print (getValue(xmlRaw, 'released'))
#add more as you need the data

Python file parsing, can't catch strings in new line

So Parsing a large text file with 56,900 book titles with authors and a etext no.
Trying to find the authors. By parsing the file.
The file is a like this:
TITLE and AUTHOR ETEXT NO.
Aspects of plant life; with special reference to the British flora,      56900
by Robert Lloyd Praeger
The Vicar of Morwenstow, by Sabine Baring-Gould 56899
[Subtitle: Being a Life of Robert Stephen Hawker, M.A.]
Raamatun tutkisteluja IV, mennessä Charles T. Russell 56898
[Subtitle: Harmagedonin taistelu]
[Language: Finnish]
Raamatun tutkisteluja III, mennessä Charles T. Russell 56897
[Subtitle: Tulkoon valtakuntasi]
[Language: Finnish]
Tom Thatcher's Fortune, by Horatio Alger, Jr. 56896
A Yankee Flier in the Far East, by Al Avery 56895
and George Rutherford Montgomery
[Illustrator: Paul Laune]
Nancy Brandon's Mystery, by Lillian Garis 56894
Nervous Ills, by Boris Sidis 56893
[Subtitle: Their Cause and Cure]
Pensées sans langage, par Francis Picabia 56892
[Language: French]
Helon's Pilgrimage to Jerusalem, Volume 2 of 2, by Frederick Strauss 56891
[Subtitle: A picture of Judaism, in the century
which preceded the advent of our Savior]
Fra Tommaso Campanella, Vol. 1, di Luigi Amabile 56890
[Subtitle: la sua congiura, i suoi processi e la sua pazzia]
[Language: Italian]
The Blue Star, by Fletcher Pratt 56889
Importanza e risultati degli incrociamenti in avicoltura, 56888
di Teodoro Pascal
[Language: Italian]
The Junior Classics, Volume 3: Tales from Greece and Rome, by Various 56887
~ ~ ~ ~ Posting Dates for the below eBooks: 1 Mar 2018 to 31 Mar 2018 ~ ~ ~ ~
TITLE and AUTHOR ETEXT NO.
The American Missionary, Volume 41, No. 1, January, 1887, by Various 56886
Morganin miljoonat, mennessä Sven Elvestad 56885
[Author a.k.a. Stein Riverton]
[Subtitle: Salapoliisiromaani]
[Language: Finnish]
"Trip to the Sunny South" in March, 1885, by L. S. D 56884
Balaam and His Master, by Joel Chandler Harris 56883
[Subtitle: and Other Sketches and Stories]
Susien saaliina, mennessä Jack London 56882
[Language: Finnish]
Forged Egyptian Antiquities, by T. G. Wakeling 56881
The Secret Doctrine, Vol. 3 of 4, by Helena Petrovna Blavatsky 56880
[Subtitle: Third Edition]
No Posting 56879
Author name usually starts after "by" or when there is no "by" in line then author name starts after a comma ","...However the "," can be a part of the title if the line has a by.
So, I parsed it for by first then for comma.
Here is what I tried:
def search_by_author():
fhand = open('GUTINDEX.ALL')
print("Search by Author:")
for line in fhand:
if not line.startswith(" [") and not line.startswith("TITLE"):
if not line.startswith("~"):
words = line.rstrip()
words = line.lstrip()
words = words[:-6]
if ", by" in words:
words = words[words.find(', by'):]
words = words[5:]
print (words)
else:
words = words[words.find(', '):]
words = words[5:]
if "," in words:
words = words[words.find(', '):]
if words.startswith(','):
words =words[words.find(','):]
print (words)
else:
print (words)
else:
print (words)
if " by" in words:
words = words[words.find('by')]
print(words)
search_by_author()
However it can't seem to find the author name for lines like
Aspects of plant life; with special reference to the British flora,      56900
by Robert Lloyd Praeger
As per your file, info about a book can be spread across multiple lines. There is a blank line after each book info. I used that to gather all info about a book and then parse it to get the author info.
import re
def search_by_author():
fhand = open('GUTINDEX.ALL')
book_info = ''
for line in fhand:
line = line.rstrip()
if (line.startswith('TITLE') or line.startswith('~')):
continue
if (len(line) == 0):
# remove info in square bracket from book_info
book_info = re.sub(r'\[.*$', '', book_info)
if ('by ' in book_info):
tokens = book_info.split('by ')
else:
tokens = book_info.split(',')
if (len(tokens) > 1):
authors = tokens[-1].strip()
print(authors)
book_info = ''
else:
# remove ETEXT NO. from line
line = re.sub(r'\d+$', '', line)
book_info += ' ' + line.rstrip()
search_by_author()
Output:
Robert Lloyd Praeger
Sabine Baring-Gould
mennessä Charles T. Russell
mennessä Charles T. Russell
Horatio Alger, Jr.
Al Avery and George Rutherford Montgomery
Lillian Garis
Boris Sidis
par Francis Picabia
Frederick Strauss
di Luigi Amabile
Fletcher Pratt
di Teodoro Pascal
Various
Various
mennessä Sven Elvestad
L. S. D
Joel Chandler Harris
mennessä Jack London
T. G. Wakeling
Helena Petrovna Blavatsky

Python regex to capture a comma-delimited list of items

I have a list of weather forecasts that start with a similar prefix that I'd like to remove. I'd also like to capture the city names:
Some Examples:
If you have vacation or wedding plans in Phoenix, Tucson, Flagstaff,
Salt Lake City, Park City, Denver, Estes Park, Colorado Springs,
Pueblo, or Albuquerque, the week will...
If you have vacation or wedding plans for Miami, Jacksonville, Macon,
Charlotte, or Charleston, expect a couple systems...
If you have vacation or wedding plans in Pittsburgh, Philadelphia,
Atlantic City, Newark, Baltimore, D.C., Richmond, Charleston, or
Dover, expect the week...
The strings start with a common prefix "If you have vacation or wedding plans in" and the last city has "or" before it. The list of cities is of variable length.
I've tried this:
>>> text = 'If you have vacation or wedding plans in NYC, Boston, Manchester, Concord, Providence, or Portland'
>>> re.search(r'^If you have vacation or wedding plans in ((\b\w+\b), ?)+ or (\w+)', text).groups()
('Providence,', 'Providence', 'Portland')
>>>
I think I'm pretty close, but obviously it's not working. I've never tried to do something with a variable number of captured items; any guidance would be greatly appreciated.
Alternative solution here (probably just for sharing and educational purposes).
If you were to solve it with nltk, it would be called a Named Entity Recognition problem. Using the snippet based on nltk.chunk.ne_chunk_sents(), provided here:
import nltk
def extract_entity_names(t):
entity_names = []
if hasattr(t, 'label') and t.label:
if t.label() == 'NE':
entity_names.append(' '.join([child[0] for child in t]))
else:
for child in t:
entity_names.extend(extract_entity_names(child))
return entity_names
sample = "If you have vacation or wedding plans in Phoenix, Tucson, Flagstaff, Salt Lake City, Park City, Denver, Estes Park, Colorado Springs, Pueblo, or Albuquerque, the week will..."
sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)
entity_names = []
for tree in chunked_sentences:
entity_names.extend(extract_entity_names(tree))
print entity_names
It prints exactly the desired result:
['Phoenix', 'Tucson', 'Flagstaff', 'Salt Lake City', 'Park City', 'Denver', 'Estes Park', 'Colorado Springs', 'Pueblo', 'Albuquerque']
Here is my approach: use the csv module to parse the lines (I assume they are in a text file named data.csv, please change to suite your situation). After parsing each line:
Discard the last cell, it is not a city name
Remove 'If ...' from the first cell
Remove or 'or ' from the last cell (used to be next-to-last)
Here is the code:
import csv
def cleanup(row):
new_row = row[:-1]
new_row[0] = new_row[0].replace('If you have vacation or wedding plans in ', '')
new_row[0] = new_row[0].replace('If you have vacation or wedding plans for ', '')
new_row[-1] = new_row[-1].replace('or ', '')
return new_row
if __name__ == '__main__':
with open('data.csv') as f:
reader = csv.reader(f, skipinitialspace=True)
for row in reader:
row = cleanup(row)
print row
Output:
['Phoenix', 'Tucson', 'Flagstaff', 'Salt Lake City', 'Park City', 'Denver', 'Estes Park', 'Colorado Springs', 'Pueblo', 'Albuquerque']
['Miami', 'Jacksonville', 'Macon', 'Charlotte', 'Charleston']
['Pittsburgh', 'Philadelphia', 'Atlantic City', 'Newark', 'Baltimore', 'D.C.', 'Richmond', 'Charleston', 'Dover']
import re
s = "If you have vacation or wedding plans for Miami, Jacksonville, Macon, Charlotte, or Charleston, expect a couple systems"
p = re.compile(r"If you have vacation or wedding plans (in|for) ((\w+, )+)or (\w+)")
m = p.match(s)
print m.group(2) # output: Miami, Jacksonville, Macon, Charlotte,
cities = m.group(2).split(", ") # cities = ['Miami', 'Jacksonville', 'Macon', 'Charlotte', '']
cities[-1] = m.group(4) # add the city after or
print cities # cities = ['Miami', 'Jacksonville', 'Macon', 'Charlotte', 'Charleston']
the city can be matched by pattern (\w+, ) and or (\w+)
and split cities by pattern ,
btw, as the pattern is used to many data, it is preferred to work with the compiled object
PS: the word comes after plan can be for or in, according to examples you provide
How about this
>>> text = 'If you have vacation or wedding plans for Phoenix, Tucson, Flagstaff, Salt Lake City, Park City, Denver, Estes Park, Colorado Springs, Pueblo, or Albuquerque, the week will'
>>> match = re.search(r'^If you have vacation or wedding plans (in?|for?) ([\w+ ,]+)',text).groups()[1].split(", ")
Output
>>> match
['Phoenix', 'Tucson', 'Flagstaff', 'Salt Lake City', 'Park City', 'Denver', 'Estes Park', 'Colorado Springs', 'Pueblo', 'or Albuquerque', 'the week will']

How to apply a regex to each sublists of a list?

Let's say I have a list of lists like this:
lis_ = [['"Fun is the enjoyment of pleasure"\t\t',
'#username det fanns ett utvik med "sabrina without a stitch". acke nothing. #username\t\t','Report by #username - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware https://t.co/k9sOEpKjbg\t\t'],
['I just became the mayor of Porta Romana on #username! http://4sq.com/9QROVv\t\t', "RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated http://t.co/heyOhpb1\t\t", "#username Don't use my family surname for your app ???? http://t.co/1yYLXIO9\t\t"]
]
I would like to remove the links of each sublist, so I tried with this regular expression:
new_list = re.sub(r'^https?:\/\/.*[\r\n]*', '', tweets, flags=re.MULTILINE)
I used the MULTILINE flag since when I print list_ it looks like:
[]
[]
[]
...
[]
The problem with the above aproach is that I got an TypeError: expected string or buffer clearly I can not pass like this the sublists to the regex. How can I apply the above regex to the set of sublists in list_ ? in order to get something like this (i.e. the sublists without any type of link):
[['"Fun is the enjoyment of pleasure"\t\t',
'#username det fanns ett utvik med "sabrina without a stitch". acke nothing. #username\t\t','Report by #username - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware'],
['I just became the mayor of Porta Romana on #username! \t\t', "RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated \t\t", "#username Don't use my family surname for your app ????\t\t"]
]
Does this can be done with a map or is there any other efficient aproach?.
Thanks in advance guys
You need to use \b instead of start of the line anchor.
>>> lis_ = [['"Fun is the enjoyment of pleasure"\t\t',
'#username det fanns ett utvik med "sabrina without a stitch". acke nothing. #username\t\t','Report by #username - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware https://t.co/k9sOEpKjbg\t\t'],
['I just became the mayor of Porta Romana on #username! http://4sq.com/9QROVv\t\t', "RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated http://t.co/heyOhpb1\t\t", "#username Don't use my family surname for your app ???? http://t.co/1yYLXIO9\t\t"]
]
>>> [[re.sub(r'\bhttps?:\/\/.*[\r\n]*', '', i)] for x in lis_ for i in x]
[['"Fun is the enjoyment of pleasure"\t\t'], ['#username det fanns ett utvik med "sabrina without a stitch". acke nothing. #username\t\t'], ['Report by #username - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware '], ['I just became the mayor of Porta Romana on #username! '], ["RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated "], ["#username Don't use my family surname for your app ???? "]]
OR
>>> l = []
>>> for i in lis_:
m = []
for j in i:
m.append(re.sub(r'\bhttps?:\/\/.*[\r\n]*', '', j))
l.append(m)
>>> l
[['"Fun is the enjoyment of pleasure"\t\t', '#username det fanns ett utvik med "sabrina without a stitch". acke nothing. #username\t\t', 'Report by #username - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware '], ['I just became the mayor of Porta Romana on #username! ', "RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated ", "#username Don't use my family surname for your app ???? "]]
It seems that you have a list of lists of strings.
In that case, you simply need to iterate over these lists the proper way:
list_ = [['blablablalba', 'blabalbablbla', 'blablala', 'http://t.co/xSnsnlNyq5'], ['blababllba', 'blabalbla', 'blabalbal'],['http://t.co/xScsklNyq5'], ['blablabla', 'http://t.co/xScsnlNyq3']]
def remove_links(sublist):
return [s for s in sublist if not re.search(r'https?:\/\/.*[\r\n]*', s)]
final_list = map(remove_links, list_)
# [['blablablalba', 'blabalbablbla', 'blablala'], ['blababllba', 'blabalbla', 'blabalbal'], [], ['blablabla']]
If you want to remove any empty sub-lists afterwards:
final_final_list = [l for l in final_list if l]

Categories