Using regex to extract information from a string - python

This is a follow-up and complication to this question: Extracting contents of a string within parentheses.
In that question I had the following string --
"Will Farrell (Nick Hasley), Rebecca Hall (Samantha)"
And I wanted to get a list of tuples in the form of (actor, character) --
[('Will Farrell', 'Nick Hasley'), ('Rebecca Hall', 'Samantha')]
To generalize matters, I have a slightly more complicated string, and I need to extract the same information. The string I have is --
"Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary),
with Stephen Root and Laura Dern (Delilah)"
I need to format this as follows:
[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'),
('Stephen Root', ''), ('Laura Dern', 'Delilah')]
I know I can replace the filler words (with, and, &, etc.), but can't quite figure out how to add a blank entry -- '' -- if there is no character name for the actor (in this case Stephen Root). What would be the best way to go about doing this?
Finally, I need to take into account if an actor has multiple roles, and build a tuple for each role the actor has. The final string I have is:
"Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with
Stephen Root and Laura Dern (Delilah, Stacy)"
And I need to build a list of tuples as follows:
[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'),
('Glenn Howerton', 'Brad'), ('Stephen Root', ''), ('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')]
Thank you.

import re
credits = """Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with
Stephen Root and Laura Dern (Delilah, Stacy)"""
# split on commas (only if outside of parentheses), "with" or "and"
splitre = re.compile(r"\s*(?:,(?![^()]*\))|\bwith\b|\band\b)\s*")
# match the part before the parentheses (1) and what's inside the parens (2)
# (only if parentheses are present)
matchre = re.compile(r"([^(]*)(?:\(([^)]*)\))?")
# split the parts inside the parentheses on commas
splitparts = re.compile(r"\s*,\s*")
characters = splitre.split(credits)
pairs = []
for character in characters:
    if character:
        match = matchre.match(character)
        if match:
            actor = match.group(1).strip()
            if match.group(2):
                parts = splitparts.split(match.group(2))
                for part in parts:
                    pairs.append((actor, part))
            else:
                pairs.append((actor, ""))
print(pairs)
Output:
[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'),
('Glenn Howerton', 'Gary'), ('Glenn Howerton', 'Brad'), ('Stephen Root', ''),
('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')]
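To see what the negative lookahead in splitre does on its own, here is a minimal sketch that keeps only the comma branch of the pattern. The name outside_commas and the shortened sample string are assumptions made for illustration; they are not part of the answer above.

```python
import re

# Comma branch of splitre only: a comma is a separator unless the next
# bracket character after it is a closing parenthesis, which would mean
# the comma sits inside a (...) group.
outside_commas = re.compile(r"\s*,(?![^()]*\))\s*")

sample = "Glenn Howerton (Gary, Brad), Laura Dern (Delilah, Stacy)"
parts = outside_commas.split(sample)
print(parts)
# ['Glenn Howerton (Gary, Brad)', 'Laura Dern (Delilah, Stacy)']
```

The commas inside the parentheses survive the split, which is why a second, simpler split (splitparts) is applied to the parenthesised part afterwards.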

Tim Pietzcker's solution can be simplified to (note that patterns are modified too):
import re
credits = """ Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with
Stephen Root and Laura Dern (Delilah, Stacy)"""
# split on commas (only if outside of parentheses), "with" or "and"
splitre = re.compile(r"(?:,(?![^()]*\))(?:\s*with)*|\bwith\b|\band\b)\s*")
# match the part before the parentheses (1) and what's inside the parens (2)
# (only if parentheses are present)
matchre = re.compile(r"\s*([^(]*)(?<! )\s*(?:\(([^)]*)\))?")
# split the parts inside the parentheses on commas
splitparts = re.compile(r"\s*,\s*")
pairs = []
for character in splitre.split(credits):
    gr = matchre.match(character).groups('')
    for part in splitparts.split(gr[1]):
        pairs.append((gr[0], part))
print(pairs)
Then:
import re
credits = """ Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with
Stephen Root and Laura Dern (Delilah, Stacy)"""
# split on commas (only if outside of parentheses), "with" or "and"
splitre = re.compile(r"(?:,(?![^()]*\))(?:\s*with)*|\bwith\b|\band\b)\s*")
# match the part before the parentheses (1) and what's inside the parens (2)
# (only if parentheses are present)
matchre = re.compile(r"\s*([^(]*)(?<! )\s*(?:\(([^)]*)\))?")
# split the parts inside the parentheses on commas
splitparts = re.compile(r"\s*,\s*")
gen = (matchre.match(character).groups('') for character in splitre.split(credits))
pp = [ (gr[0], part) for gr in gen for part in splitparts.split(gr[1])]
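print(pp)
The trick is to call groups('') with the default argument '', so that a group that did not participate in the match comes back as '' instead of None.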
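A quick illustration of that default argument, reusing the matchre pattern from the snippet above on an entry that has no parenthesised role. The variable names without_default and with_default exist only in this sketch.

```python
import re

matchre = re.compile(r"\s*([^(]*)(?<! )\s*(?:\(([^)]*)\))?")

m = matchre.match("Stephen Root ")
without_default = m.groups()    # the unmatched group 2 is reported as None
with_default = m.groups('')     # the unmatched group 2 is replaced by ''
print(without_default)  # ('Stephen Root', None)
print(with_default)     # ('Stephen Root', '')
```

Since splitting an empty string with splitparts yields [''], the inner loop still runs once for such an entry and produces the (actor, '') tuple without a separate else branch.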

What you want is to identify sequences of words starting with a capital letter, plus some complications (IMHO you cannot assume each name has the form Name Surname; there are also Name Surname Jr., Name M. Surname, and other localized variations such as Jean-Claude van Damme or Louis da Silva).
Now, this is likely to be overkill for the sample input you posted, but as I wrote above I assume things will soon get messy, so I would tackle this using nltk.
Here's a very crude and not very well tested snippet, but it should do the job:
import nltk
from nltk.chunk.regexp import RegexpParser
_patterns = [
    (r'^[A-Z][a-zA-Z]*[A-Z]?[a-zA-Z]+.?$', 'NNP'), # proper nouns
    (r'^[(]$', 'O'),
    (r'[,]', 'COMMA'),
    (r'^[)]$', 'C'),
    (r'.+', 'NN') # nouns (default)
]
_grammar = """
NAME: {<NNP> <COMMA> <NNP>}
NAME: {<NNP>+}
ROLE: {<O> <NAME>+ <C>}
"""
text = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with Stephen Root and Laura Dern (Delilah, Stacy)"
tagger = nltk.RegexpTagger(_patterns)
chunker = RegexpParser(_grammar)
text = text.replace('(', '( ').replace(')', ' )').replace(',', ' , ')
tokens = text.split()
tagged_text = tagger.tag(tokens)
tree = chunker.parse(tagged_text)
for n in tree:
    # NLTK 3 and later expose the chunk name via label(); older releases used n.node
    if isinstance(n, nltk.tree.Tree) and n.label() in ['ROLE', 'NAME']:
        print(n)
# output is:
# (NAME Will/NNP Ferrell/NNP)
# (ROLE (/O (NAME Nick/NNP Halsey/NNP) )/C)
# (NAME Rebecca/NNP Hall/NNP)
# (ROLE (/O (NAME Samantha/NNP) )/C)
# (NAME Glenn/NNP Howerton/NNP)
# (ROLE (/O (NAME Gary/NNP ,/COMMA Brad/NNP) )/C)
# (NAME Stephen/NNP Root/NNP)
# (NAME Laura/NNP Dern/NNP)
# (ROLE (/O (NAME Delilah/NNP ,/COMMA Stacy/NNP) )/C)
You must then process the tagged output and put names and roles in a list instead of printing, but you get the picture.
We do a first pass that tags each token according to the regexes in _patterns, then a second pass that builds more complex chunks according to the simple grammar. You can extend the grammar and the patterns as much as you want, e.g. to catch variations of names, messy inputs, abbreviations, and so on.
I think doing this with a single regex pass is going to be a pain for non-trivial inputs.
Otherwise, Tim's solution is solving the issue nicely for the input you posted, and without the nltk dependency.

In case you want a non-regex solution ... (Assumes no nested parenthesis.)
in_string = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with Stephen Root and Laura Dern (Delilah, Stacy)"
in_list = []
is_in_paren = False
item = {}
next_string = ''
index = 0
while index < len(in_string):
    char = in_string[index]
    if in_string[index:].startswith(' and') and not is_in_paren:
        # " and" outside parentheses ends an actor who has no part
        actor = next_string
        if actor.startswith(' with '):
            actor = actor[6:]
        item['actor'] = actor
        in_list.append(item)
        item = {}
        next_string = ''
        index += 4
    elif char == '(':
        is_in_paren = True
        item['actor'] = next_string
        next_string = ''
    elif char == ')':
        is_in_paren = False
        item['part'] = next_string
        in_list.append(item)
        item = {}
        next_string = ''
    elif char == ',':
        if is_in_paren:
            # a comma inside parentheses separates multiple parts of one actor
            item['part'] = next_string
            next_string = ''
            in_list.append(item)
            item = item.copy()
            item.pop('part')
    else:
        next_string = "%s%s" % (next_string, char)
    index += 1
out_list = []
for entry in in_list:
    actor = entry.get('actor')
    part = entry.get('part')
    if part is None:
        part = ''
    out_list.append((actor.strip(), part.strip()))
print(out_list)
Output:
[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'), ('Glenn Howerton', 'Brad'), ('Stephen Root', ''), ('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')]

Related

remove content between tags in python using regex

I was trying to clean up wikitext. Specifically I was trying to remove all the {{.....}} and <..>...</..> in the wikitext. For example, for this wikitext:
"{{Infobox UK place\n|country = England\n|official_name =
Morcombelake\n|static_image_name = Morecombelake from Golden Cap -
geograph.org.uk - 1184424.jpg\n|static_image_caption = Morcombelake as
seen from Golden Cap\n|coordinates =
{{coord|50.74361|-2.85153|display=inline,title}}\n|map_type =
Dorset\n|population = \n|population_ref = \n|shire_district = [[West
Dorset]]\n|shire_county = [[Dorset]]\n|region = South West
England\n|constituency_westminster = West Dorset\n|post_town =
\n|postcode_district = \n|postcode_area = DT\n|os_grid_reference =
SY405938\n|website = \n}}\n'''Morcombelake''' (also spelled
'''Morecombelake''') is a small village near [[Bridport]] in
[[Dorset]], [[England]], within the ancient parish of [[Whitchurch
Canonicorum]]. [[Golden Cap]], part of the [[Jurassic Coast]] World
Heritage Site, is nearby.{{cite
web|url=http://www.nationaltrust.org.uk/golden-cap/|title=Golden
Cap|publisher=National Trust|accessdate=2014-05-04}}\n\n==
References ==\n{{reflist}}\n\n{{West
Dorset}}\n\n\n{{Dorset-geo-stub}}\n[[Category:Villages in
Dorset]]\n\n== External Links
==\n\n*[http://www.goldencapteamofchurches.org.uk/morcombelakechurch.html
Parish Church of St Gabriel]\n\n"
How can I use regular expressions in python to produce output like this:
\n'''Morcombelake''' (also spelled '''Morecombelake''') is a small
village near [[Bridport]] in [[Dorset]], [[England]], within the
ancient parish of [[Whitchurch Canonicorum]]. [[Golden Cap]], part of
the [[Jurassic Coast]] World Heritage Site, is nearby.\n\n==
References ==\n\n\n\n\n\n\n[[Category:Villages in Dorset]]\n\n==
External Links
==\n\n*[http://www.goldencapteamofchurches.org.uk/morcombelakechurch.html
Parish Church of St Gabriel]\n\n
As the tags are nested into each other, you can find and remove them in a loop:
import re
n = 1
while n > 0:
    s, n = re.subn('{{(?!{)(?:(?!{{).)*?}}|<[^<]*?>', '', s, flags=re.DOTALL)
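Here s is the string containing the wikitext. Each pass removes the innermost {{...}} templates, so the loop repeats until a pass makes no replacements.
Your example contains no <...> tags, but the pattern removes them as well.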
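The loop can be tried on a small input as follows. This is only a sketch: the function name strip_wikitext and the sample string are assumptions for illustration, and the braces are escaped here for readability, which does not change what the pattern matches.

```python
import re

# Innermost {{...}} template (one that contains no further "{{"), or a <...> tag.
TEMPLATE_OR_TAG = re.compile(r"\{\{(?!\{)(?:(?!\{\{).)*?\}\}|<[^<]*?>", re.DOTALL)

def strip_wikitext(s):
    """Remove nested {{...}} templates and <...> tags, innermost first."""
    n = 1
    while n > 0:
        # subn returns the new string and the number of replacements made
        s, n = TEMPLATE_OR_TAG.subn("", s)
    return s

sample = "{{Infobox|coords={{coord|1|2}}|x=1}}Text{{cite}} more<br>"
cleaned = strip_wikitext(sample)
print(cleaned)  # Text more
```

The first pass removes {{coord|1|2}}, {{cite}} and <br>; only then does the outer Infobox template become an innermost match for the second pass.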

File content into dictionary

I need to turn this file content into a dictionary in which every key is the name of a movie and every value is a set containing the names of the actors who play in it.
Example of file content:
Brad Pitt, Sleepers, Troy, Meet Joe Black, Oceans Eleven, Seven, Mr & Mrs Smith
Tom Hanks, You have got mail, Apollo 13, Sleepless in Seattle, Catch Me If You Can
Meg Ryan, You have got mail, Sleepless in Seattle
Diane Kruger, Troy, National Treasure
Dustin Hoffman, Sleepers, The Lost City
Anthony Hopkins, Hannibal, The Edge, Meet Joe Black, Proof
This should get you started:
line = "a, b, c, d"
result = {}
names = line.split(", ")
actor = names[0]
movies = names[1:]
result[actor] = movies
Try the following:
res_dict = {}
with open('my_file.txt', 'r') as f:
    for line in f:
        my_list = [item.strip() for item in line.split(',')]
        res_dict[my_list[0]] = my_list[1:] # To make it a set, use: set(my_list[1:])
Explanation:
split() splits each line into a list, using , as the separator.
strip() removes the whitespace around each element of that list.
When you use a with statement, you do not need to close the file explicitly.
[item.strip() for item in line.split(',')] is called a list comprehension.
Output:
>>> res_dict
{'Diane Kruger': ['Troy', 'National Treasure'], 'Brad Pitt': ['Sleepers', 'Troy', 'Meet Joe Black', 'Oceans Eleven', 'Seven', 'Mr & Mrs Smith'], 'Meg Ryan': ['You have got mail', 'Sleepless in Seattle'], 'Tom Hanks': ['You have got mail', 'Apollo 13', 'Sleepless in Seattle', 'Catch Me If You Can'], 'Dustin Hoffman': ['Sleepers', 'The Lost City'], 'Anthony Hopkins': ['Hannibal', 'The Edge', 'Meet Joe Black', 'Proof']}
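The snippets above key the dictionary by actor, whereas the question asks for movie titles as keys and sets of actors as values. A minimal sketch of that inversion follows; the function name movies_to_actors and the shortened sample lines are assumptions, not part of the answers.

```python
from collections import defaultdict

def movies_to_actors(lines):
    """Map each movie title to the set of actors listed for it."""
    result = defaultdict(set)
    for line in lines:
        fields = [item.strip() for item in line.split(',')]
        if not fields[0]:
            continue  # skip blank lines
        actor, movies = fields[0], fields[1:]
        for movie in movies:
            result[movie].add(actor)
    return dict(result)

sample = [
    "Brad Pitt, Sleepers, Troy, Meet Joe Black",
    "Diane Kruger, Troy, National Treasure",
    "Dustin Hoffman, Sleepers, The Lost City",
]
by_movie = movies_to_actors(sample)
print(by_movie["Troy"] == {"Brad Pitt", "Diane Kruger"})  # True
```

Because the function only iterates over lines, an open file object can be passed in place of the list, for example inside a with open('my_file.txt') as f: block.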

Python Re: Overwrite Issue

I am having an issue with replacing part of a string. My goal is, for every string that includes a key from this dictionary,
mapping = { "St": "Street",
            "St.": "Street",
            'Rd': 'Road',
            'Rd.': 'Road',
            'Ave': 'Avenue',
            'Ave.': 'Avenue',
            'Ln':'Lane',
            'Ln.':'Lane',
            'Dr':'Drive',
            'Dr.':'Drive',
            'Pl':'Place',
            'Pl.':'Place',
            'Pkwy':'Parkway',
            'Blvd.': 'Boulevard',
            'Blvd': 'Boulevard'
            }
to replace that part of the string with the corresponding value in the dictionary. This is my code right now:
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
def update_name(name, mapping):
    for key,value in mapping.iteritems():
        if key in name:
            newname = re.sub(street_type_re,value,name)
            print name,'==>',newname
    return name
Right now the code is doing stuff like this
National Rd SW ==> National Rd Road
I need to fix it so that it returns this
National Rd SW ==> National Road SW
newname = re.sub(key,value,name)
You can simply substitute the key itself instead of matching with the precompiled regex, or match the key as a whole word:
newname = re.sub(r"\b"+key+r"\b",value,name)
Your version replaces the last word because the pattern is anchored with $.
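Keys such as St. contain a regex metacharacter, and a trailing \b does not match between a dot and a space, so one possible variant escapes the keys and uses lookarounds instead. This is a sketch under those assumptions: the shortened mapping, the pattern name street_abbrev_re and the lambda replacement are illustrative, not the code from the answer.

```python
import re

mapping = {"St": "Street", "St.": "Street",
           "Rd": "Road", "Rd.": "Road",
           "Ave": "Avenue", "Ave.": "Avenue"}

# Longest keys first so that "Rd." is tried before "Rd".
keys = sorted(mapping, key=len, reverse=True)
# re.escape makes the dot literal; the lookarounds require that the
# abbreviation is not glued to other word characters on either side.
street_abbrev_re = re.compile(
    r"(?<!\w)(?:%s)(?!\w)" % "|".join(re.escape(k) for k in keys))

def update_name(name):
    """Replace every known abbreviation in name with its full form."""
    return street_abbrev_re.sub(lambda m: mapping[m.group(0)], name)

print(update_name("National Rd SW"))  # National Road SW
print(update_name("Stanley Ave"))     # Stanley Avenue
```

A single compiled alternation also avoids looping over the dictionary and running one substitution per key.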

Append items to dictionary Python

I am trying to write a function in python that opens a file and parses it into a dictionary. I am trying to make the first item in the list block the key for each item in the dictionary data. Then each item is supposed to be the rest of the list block less the first item. For some reason though, when I run the following function, it parses it incorrectly. I have provided the output below. How would I be able to parse it like I stated above? Any help would be greatly appreciated.
Function:
def parseData() :
    filename="testdata.txt"
    file=open(filename,"r+")
    block=[]
    for line in file:
        block.append(line)
        if line in ('\n', '\r\n'):
            album=block.pop(1)
            data[block[1]]=album
            block=[]
    print data
Input:
Bob Dylan
1966 Blonde on Blonde
-Rainy Day Women #12 & 35
-Pledging My Time
-Visions of Johanna
-One of Us Must Know (Sooner or Later)
-I Want You
-Stuck Inside of Mobile with the Memphis Blues Again
-Leopard-Skin Pill-Box Hat
-Just Like a Woman
-Most Likely You Go Your Way (And I'll Go Mine)
-Temporary Like Achilles
-Absolutely Sweet Marie
-4th Time Around
-Obviously 5 Believers
-Sad Eyed Lady of the Lowlands
Output:
{'-Rainy Day Women #12 & 35\n': '1966 Blonde on Blonde\n',
'-Whole Lotta Love\n': '1969 II\n', '-In the Evening\n': '1979 In Through the Outdoor\n'}
You can use groupby to group the lines, using the empty lines as delimiters. Use a defaultdict to handle repeated keys: after extracting the key (the first element of each group returned by groupby), extend that key's list with the remaining lines of the group.
from itertools import groupby
from collections import defaultdict
d = defaultdict(list)
with open("file.txt") as f:
    for k, val in groupby(f, lambda x: x.strip() != ""):
        # if k is True we have a section
        if k:
            # get key "k" which is the first line
            # from each section, val will be the remaining lines
            k,*v = val
            # add or add to the existing key/value pairing
            d[k].extend(map(str.rstrip,v))
from pprint import pprint as pp
pp(d)
Output:
{'Bob Dylan\n': ['1966 Blonde on Blonde',
'-Rainy Day Women #12 & 35',
'-Pledging My Time',
'-Visions of Johanna',
'-One of Us Must Know (Sooner or Later)',
'-I Want You',
'-Stuck Inside of Mobile with the Memphis Blues Again',
'-Leopard-Skin Pill-Box Hat',
'-Just Like a Woman',
"-Most Likely You Go Your Way (And I'll Go Mine)",
'-Temporary Like Achilles',
'-Absolutely Sweet Marie',
'-4th Time Around',
'-Obviously 5 Believers',
'-Sad Eyed Lady of the Lowlands'],
'Led Zeppelin\n': ['1979 In Through the Outdoor',
'-In the Evening',
'-South Bound Saurez',
'-Fool in the Rain',
'-Hot Dog',
'-Carouselambra',
'-All My Love',
"-I'm Gonna Crawl",
'1969 II',
'-Whole Lotta Love',
'-What Is and What Should Never Be',
'-The Lemon Song',
'-Thank You',
'-Heartbreaker',
"-Living Loving Maid (She's Just a Woman)",
'-Ramble On',
'-Moby Dick',
'-Bring It on Home']}
For python2 the unpack syntax is slightly different:
with open("file.txt") as f:
    for k, val in groupby(f, lambda x: x.strip() != ""):
        if k:
            k, v = next(val), val
            d[k].extend(map(str.rstrip, v))
If you want to keep the newlines, remove the map(str.rstrip, ...) call and extend with the lines directly.
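To see what groupby yields with this key function, here is a small sketch that runs on an in-memory list instead of a file. The shortened sample lines and the name blocks are assumptions made for illustration.

```python
from itertools import groupby

lines = ["Bob Dylan\n", "1966 Blonde on Blonde\n", "-I Want You\n", "\n",
         "Led Zeppelin\n", "1969 II\n", "-Thank You\n"]

# groupby emits one group per run of consecutive lines with the same key:
# True for runs of non-blank lines, False for runs of blank lines.
blocks = [
    [line.rstrip() for line in group]
    for is_content, group in groupby(lines, lambda x: x.strip() != "")
    if is_content
]
print(blocks)
# [['Bob Dylan', '1966 Blonde on Blonde', '-I Want You'],
#  ['Led Zeppelin', '1969 II', '-Thank You']]
```

Each inner list is one blank-line-delimited section, so its first element is the artist and the rest are the album line and the songs.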
If you want the album and songs separately for each artist:
from itertools import groupby
from collections import defaultdict
d = defaultdict(lambda: defaultdict(list))
with open("file.txt") as f:
    for k, val in groupby(f, lambda x: x.strip() != ""):
        if k:
            k, alb, songs = next(val),next(val), val
            d[k.rstrip()][alb.rstrip()] = list(map(str.rstrip, songs))
from pprint import pprint as pp
pp(d)
{'Bob Dylan': {'1966 Blonde on Blonde': ['-Rainy Day Women #12 & 35',
'-Pledging My Time',
'-Visions of Johanna',
'-One of Us Must Know (Sooner or '
'Later)',
'-I Want You',
'-Stuck Inside of Mobile with the '
'Memphis Blues Again',
'-Leopard-Skin Pill-Box Hat',
'-Just Like a Woman',
'-Most Likely You Go Your Way '
"(And I'll Go Mine)",
'-Temporary Like Achilles',
'-Absolutely Sweet Marie',
'-4th Time Around',
'-Obviously 5 Believers',
'-Sad Eyed Lady of the Lowlands']},
'Led Zeppelin': {'1969 II': ['-Whole Lotta Love',
'-What Is and What Should Never Be',
'-The Lemon Song',
'-Thank You',
'-Heartbreaker',
"-Living Loving Maid (She's Just a Woman)",
'-Ramble On',
'-Moby Dick',
'-Bring It on Home'],
'1979 In Through the Outdoor': ['-In the Evening',
'-South Bound Saurez',
'-Fool in the Rain',
'-Hot Dog',
'-Carouselambra',
'-All My Love',
"-I'm Gonna Crawl"]}}
I guess this is what you want?
Even if this is not the format you wanted, there are a few things you might learn from the answer:
use with for file handling
nice to have:
PEP8 compliant code, see http://pep8online.com/
a shebang
numpydoc
if __name__ == '__main__'
And SE does not like a list being continued by code...
#!/usr/bin/env python
"""Parse text files with songs, grouped by album and artist."""
def add_to_data(data, block):
    """
    Parameters
    ----------
    data : dict
    block : list
    Returns
    -------
    dict
    """
    artist = block[0]
    album = block[1]
    songs = block[2:]
    if artist in data:
        data[artist][album] = songs
    else:
        data[artist] = {album: songs}
    return data
def parseData(filename='testdata.txt'):
    """
    Parameters
    ----------
    filename : string
        Path to a text file.
    Returns
    -------
    dict
    """
    data = {}
    with open(filename) as f:
        block = []
        for line in f:
            line = line.strip()
            if line == '':
                if block:  # ignore repeated blank lines
                    data = add_to_data(data, block)
                block = []
            else:
                block.append(line)
        if block:  # last block when the file does not end with a blank line
            data = add_to_data(data, block)
    return data
if __name__ == '__main__':
    data = parseData()
    import pprint
    pp = pprint.PrettyPrinter(indent=4)
    pp.pprint(data)
which gives:
{ 'Bob Dylan': { '1966 Blonde on Blonde': [ '-Rainy Day Women #12 & 35',
'-Pledging My Time',
'-Visions of Johanna',
'-One of Us Must Know (Sooner or Later)',
'-I Want You',
'-Stuck Inside of Mobile with the Memphis Blues Again',
'-Leopard-Skin Pill-Box Hat',
'-Just Like a Woman',
"-Most Likely You Go Your Way (And I'll Go Mine)",
'-Temporary Like Achilles',
'-Absolutely Sweet Marie',
'-4th Time Around',
'-Obviously 5 Believers',
'-Sad Eyed Lady of the Lowlands']},
'Led Zeppelin': { '1969 II': [ '-Whole Lotta Love',
'-What Is and What Should Never Be',
'-The Lemon Song',
'-Thank You',
'-Heartbreaker',
"-Living Loving Maid (She's Just a Woman)",
'-Ramble On',
'-Moby Dick',
'-Bring It on Home'],
'1979 In Through the Outdoor': [ '-In the Evening',
'-South Bound Saurez',
'-Fool in the Rain',
'-Hot Dog',
'-Carouselambra',
'-All My Love',
"-I'm Gonna Crawl"]}}

How to apply a regex to each sublists of a list?

Let's say I have a list of lists like this:
lis_ = [['"Fun is the enjoyment of pleasure"\t\t',
'#username det fanns ett utvik med "sabrina without a stitch". acke nothing. #username\t\t','Report by #username - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware https://t.co/k9sOEpKjbg\t\t'],
['I just became the mayor of Porta Romana on #username! http://4sq.com/9QROVv\t\t', "RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated http://t.co/heyOhpb1\t\t", "#username Don't use my family surname for your app ???? http://t.co/1yYLXIO9\t\t"]
]
I would like to remove the links of each sublist, so I tried with this regular expression:
new_list = re.sub(r'^https?:\/\/.*[\r\n]*', '', tweets, flags=re.MULTILINE)
I used the MULTILINE flag because when I print lis_ it looks like:
[]
[]
[]
...
[]
The problem with the above approach is that I get TypeError: expected string or buffer; clearly I cannot pass the sublists to the regex like this. How can I apply the above regex to each sublist in lis_, in order to get something like this (i.e. the sublists without any kind of link):
[['"Fun is the enjoyment of pleasure"\t\t',
'#username det fanns ett utvik med "sabrina without a stitch". acke nothing. #username\t\t','Report by #username - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware'],
['I just became the mayor of Porta Romana on #username! \t\t', "RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated \t\t", "#username Don't use my family surname for your app ????\t\t"]
]
Can this be done with map, or is there another efficient approach?
Thanks in advance guys
You need to use \b instead of the start-of-line anchor ^, and apply re.sub to each string rather than to the list itself.
>>> lis_ = [['"Fun is the enjoyment of pleasure"\t\t',
'#username det fanns ett utvik med "sabrina without a stitch". acke nothing. #username\t\t','Report by #username - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware https://t.co/k9sOEpKjbg\t\t'],
['I just became the mayor of Porta Romana on #username! http://4sq.com/9QROVv\t\t', "RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated http://t.co/heyOhpb1\t\t", "#username Don't use my family surname for your app ???? http://t.co/1yYLXIO9\t\t"]
]
>>> [[re.sub(r'\bhttps?:\/\/.*[\r\n]*', '', i)] for x in lis_ for i in x]
[['"Fun is the enjoyment of pleasure"\t\t'], ['#username det fanns ett utvik med "sabrina without a stitch". acke nothing. #username\t\t'], ['Report by #username - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware '], ['I just became the mayor of Porta Romana on #username! '], ["RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated "], ["#username Don't use my family surname for your app ???? "]]
OR
>>> l = []
>>> for i in lis_:
...     m = []
...     for j in i:
...         m.append(re.sub(r'\bhttps?:\/\/.*[\r\n]*', '', j))
...     l.append(m)
...
>>> l
[['"Fun is the enjoyment of pleasure"\t\t', '#username det fanns ett utvik med "sabrina without a stitch". acke nothing. #username\t\t', 'Report by #username - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware '], ['I just became the mayor of Porta Romana on #username! ', "RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated ", "#username Don't use my family surname for your app ???? "]]
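The loop version keeps the sublist structure; the same thing can be written as a nested list comprehension. The sketch below also swaps .* for \S+ so that only the link itself is removed and the trailing tabs stay, as in the desired output of the question. The name URL and the shortened sample data are assumptions for illustration.

```python
import re

# \S+ stops at the first whitespace, so text after the link is preserved.
URL = re.compile(r"https?://\S+")

lis_ = [
    ['"Fun is the enjoyment of pleasure"\t\t',
     'Malware report https://t.co/k9sOEpKjbg\t\t'],
    ['I just became the mayor of Porta Romana on #username! http://4sq.com/9QROVv\t\t'],
]

# Outer comprehension walks the sublists, inner one cleans each string.
cleaned = [[URL.sub('', s) for s in sublist] for sublist in lis_]
print(cleaned)
```

Compiling the pattern once and reusing it for every string avoids repeating the pattern literal inside the comprehension.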
It seems that you have a list of lists of strings.
In that case, you simply need to iterate over these lists the proper way:
list_ = [['blablablalba', 'blabalbablbla', 'blablala', 'http://t.co/xSnsnlNyq5'], ['blababllba', 'blabalbla', 'blabalbal'],['http://t.co/xScsklNyq5'], ['blablabla', 'http://t.co/xScsnlNyq3']]
def remove_links(sublist):
    return [s for s in sublist if not re.search(r'https?:\/\/.*[\r\n]*', s)]
final_list = list(map(remove_links, list_))  # list() so the result is a list on Python 3 as well
# [['blablablalba', 'blabalbablbla', 'blablala'], ['blababllba', 'blabalbla', 'blabalbal'], [], ['blablabla']]
If you want to remove any empty sub-lists afterwards:
final_final_list = [l for l in final_list if l]
