How can I get the individual words from a string (URL) in Python?
From a URL like:
http://www.sample.com/level1/level2/index.html?id=1234
I want to get words like:
http, www, sample, com, level1, level2, index, html, id, 1234
Any solutions using Python are welcome.
Thanks.
This is how you can do it for any URL:
import re

def getWordsFromURL(url):
    # Split on URL separators; '.' is included so 'www.sample.com' breaks apart
    return re.compile(r'[.:/?=\-&]+', re.UNICODE).split(url)
Now you can use it like this:
url = "http://www.sample.com/level1/level2/index.html?id=1234"
words = getWordsFromURL(url)
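Printing words should then give exactly the tokens listed in the question (assuming the dot is included in the separator class, as above):

print(words)
# ['http', 'www', 'sample', 'com', 'level1', 'level2', 'index', 'html', 'id', '1234']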
Just regex-split on the longest runs of non-alphanumeric characters:
import re
l = re.split(r"\W+","http://www.sample.com/level1/level2/index.html?id=1234")
print(l)
yields:
['http', 'www', 'sample', 'com', 'level1', 'level2', 'index', 'html', 'id', '1234']
This is simple but, as someone noted, it doesn't work when there are _, -, ... in URL names. So the less fun solution would be to list all the possible tokens that can separate path parts:
l = re.split(r"[/:\.?=&]+","http://stackoverflow.com/questions/41935748/splitting-a-string-url-into-words-using-python")
(I admit that I may have forgotten some separator symbols.)
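For what it's worth, here is a sketch (not from the original answers) that leans on the standard library to isolate the URL components before splitting; urllib.parse.urlsplit is a standard function, and the helper name url_words is just illustrative:

import re
from urllib.parse import urlsplit

def url_words(url):
    # urlsplit separates scheme, netloc, path, query, and fragment
    parts = urlsplit(url)
    chunks = [parts.scheme, parts.netloc, parts.path, parts.query]
    # Split each component on non-word characters and drop empty strings
    return [w for chunk in chunks for w in re.split(r"\W+", chunk) if w]

print(url_words("http://www.sample.com/level1/level2/index.html?id=1234"))
# ['http', 'www', 'sample', 'com', 'level1', 'level2', 'index', 'html', 'id', '1234']

Note that this still splits on _ and -, the same caveat as \W+ above.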
I have a large list of strings and I want to check whether each occurs in a larger text. The list contains single-word strings as well as multi-word strings. To do so I have written the following code:
example_list = ['pain', 'chestpain', 'headache', 'sickness', 'morning sickness']
example_text = "The patient has kneepain as wel as a headache"
emptylist = []
for i in example_text:
res = [ele for ele in example_list if(ele in i)]
emptylist.append(res)
However, the problem here is that 'pain' is also added to emptylist, which it should not be, as I only want an entry from example_list to be added if it exactly matches the text. I also tried using sets:
word_set = set(example_list)
phrase_set = set(example_text.split())
word_set.intersection(phrase_set)
This, however, chops up 'morning sickness' into 'morning' and 'sickness'. Does anyone know the correct way to tackle this problem?
Nice examples have already been provided in this post by other members.
I made the text to match a little more challenging, so that pain occurs more than once, and I also aimed for a little more information about where each match starts. I ended up with the following code.
I worked on the following sentence:
"The patient has not only kneepain but headache and arm pain, stomach pain and sickness"
import re
from collections import defaultdict
example_list = ['pain', 'chestpain', 'headache', 'sickness', 'morning sickness']
example_text = "The patient has not only kneepain but headache and arm pain, stomach pain and sickness"
TruthFalseDict = defaultdict(list)
for i in example_list:
    # \b anchors the term at word boundaries, so 'pain' will not match inside 'kneepain'
    for j in re.finditer(r'\b%s\b' % i, example_text):
        TruthFalseDict[i].append(j.start())
print(dict(TruthFalseDict))
The above gives me the following output.
{'pain': [55, 69], 'headache': [38], 'sickness': [78]}
Using PyParsing:
import pyparsing as pp

example_list = ['pain', 'chestpain', 'headache', 'sickness', 'morning sickness']
example_text = "The patient has kneepain as well as a headache morning sickness"

list_of_matches = []
for word in example_list:
    rule = pp.OneOrMore(pp.Keyword(word))
    for t, s, e in rule.scanString(example_text):
        if t:
            list_of_matches.append(t[0])
print(list_of_matches)
Which yields:
['headache', 'sickness', 'morning sickness']
You should be able to use a regex with word boundaries:
>>> import re
>>> [word for word in example_list if re.search(r'\b{}\b'.format(word), example_text)]
['headache']
This will not match 'pain' in 'kneepain', since there it does not begin at a word boundary. But it will properly match list entries that contain whitespace, such as 'morning sickness'.
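One caveat (my addition, not from the original answer): if the list entries can contain regex metacharacters, escape them before building the pattern. A minimal sketch:

import re

# re.escape guards against entries containing metacharacters such as '.', '+', or '('
matches = [word for word in example_list
           if re.search(r'\b{}\b'.format(re.escape(word)), example_text)]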
I've got a series of malformed JSON data from which I need to use regex to extract the data I need; then I need to use regex again to pull out a specific part of the data, i.e. the main category, which in the example below is 'games'.
Part 1 works, the second part does not.
I have limited experience with Python, and next to no experience with regex.
Final Output: games
I'm getting the error:
ValueError: pattern contains no capture groups
The series of data contains information formatted like this:
{"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/games/playing%20cards"}},"color":51627,"parent_id":12,"name":"Playing Cards","id":273,"position":4,"slug":"games/playing cards"}
The Python calls I'm using are these. First I extract the slug from the JSON:
ksdata.cat_slug_raw = ksdata.category.str.extract('\"slug\"\:\"(.+?)\"', expand=False)
Then I try to keep only what comes before the /:
ksdata.cat_slug = ksdata.cat_slug_raw.str.extract('^[^/]+(?=/)', expand=False)
I'd really appreciate some help with where I'm going wrong... and if you think my solution as a whole sucks, please tell me :)
You can use ast.literal_eval:
s = '{"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/games/playing%20cards"}},"color":51627,"parent_id":12,"name":"Playing Cards","id":273,"position":4,"slug":"games/playing cards"}'
import ast
final_data = ast.literal_eval(s)
Output:
{'name': 'Playing Cards', 'color': 51627, 'slug': 'games/playing cards', 'parent_id': 12, 'urls': {'web': {'discover': 'http://www.kickstarter.com/discover/categories/games/playing%20cards'}}, 'position': 4, 'id': 273}
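As an aside, the snippet shown is actually valid JSON, so json.loads works too and gets you straight to the main category the question asked for. A minimal sketch:

import json

data = json.loads(s)
main_category = data["slug"].split("/")[0]
print(main_category)
# games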
Based on an amended suggestion from TomSitter, I used
ksdata.cat_slug_raw.str.split('/').str[0]
This was the simplest way to get around it. (The original extract failed because pandas' .str.extract requires a pattern with at least one capture group, and ^[^/]+(?=/) contains none, hence the ValueError.)
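For context, a minimal pandas sketch of that approach (the one-row ksdata frame here is a stand-in for the asker's real DataFrame, reusing the string s from the previous answer):

import pandas as pd

ksdata = pd.DataFrame({"category": [s]})
# extract with one capture group returns a Series when expand=False
ksdata["cat_slug_raw"] = ksdata["category"].str.extract(r'"slug":"(.+?)"', expand=False)
ksdata["cat_slug"] = ksdata["cat_slug_raw"].str.split("/").str[0]
print(ksdata["cat_slug"].iloc[0])
# games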
I am looking for a function to get the DF of a certain term (meaning how many documents in a corpus contain a certain word), but I can't seem to find it here. The page only has functions to get the values of tf, idf, and tf_idf. I am looking specifically for DF only. I copied the code below from the documentation,
matches = len([True for text in self._texts if term in text])
but I don't like the result it gives. For example, if I have a list of strings and I am looking for the word pete, it also includes the name peter, which is not what I want. For example:
texts = [['the', 'boy', 'peter'],['pete','the', 'boy'],['peter','rabbit']]
So I am looking for pete which appears TWICE, but the code I showed above will tell you that there are THREE pete's because it also counts peter. How do I solve this? Thanks.
Your description is incorrect. The expression you posted does indeed give 1, not 3, when you search for pete in texts:
>>> texts = [['the', 'boy', 'peter'],['pete','the', 'boy'],['peter','rabbit']]
>>> len([True for text in texts if 'pete' in text])
1
The only way you could have matched partial words is if your texts were not tokenized (i.e. if texts is a list of strings, not a list of token lists).
But the above code is terrible: it builds a list for no reason at all. A better (and more conventional) way to count hits is this:
>>> sum(1 for text in texts if 'pete' in text)
1
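Wrapped up as a small helper (a sketch mirroring the expression above, with the hypothetical name doc_frequency):

def doc_frequency(texts, term):
    # Number of documents (token lists) that contain the exact token
    return sum(1 for text in texts if term in text)

print(doc_frequency(texts, 'pete'))   # 1
print(doc_frequency(texts, 'peter'))  # 2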
As for the question that you pose (Why (...)?): I don't know.
As a solution to your example (noting that peter occurs twice and pete just once):
texts = [['the', 'boy', 'peter'], ['pete', 'the', 'boy'], ['peter', 'rabbit']]

def flatten(l):
    # Recursively flatten nested lists/tuples into a single token list
    out = []
    for item in l:
        if isinstance(item, (list, tuple)):
            out.extend(flatten(item))
        else:
            out.append(item)
    return out

flat = flatten(texts)
len([c for c in flat if c in ['pete']])   # 1
len([c for c in flat if c in ['peter']])  # 2

Compare the two results.
Edit:
import collections

def counts(listr, word):
    # Document frequency: in how many token lists does `word` occur?
    total = []
    for i in range(len(listr)):
        total.append(word in collections.Counter(listr[i]))
    return sum(total)

counts(texts, 'peter')
# 2
I have a list of names which I'm using to pull matching entries out of a target list of strings. For example:
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim','Christmas is here', 'CHRIS']
output = ['Chris Smith', 'Kim', 'CHRIS']
So the rules so far are:
Case insensitive
Cannot match a partial word (i.e. 'Christmas'/'hijacked' shouldn't match Chris/Jack)
Other words in the string are okay, as long as a name is found in the string per the above criteria.
To accomplish this, another SO user suggested this code in this thread:
[targ for targ in target if any(re.search(r'\b{}\b'.format(name), targ, re.I) for name in names)]
This works very accurately so far, but very slowly given the names list is ~5,000 long and the target list ranges from 20-100 lines long with some strings up to 30 characters long.
Any suggestions on how to improve performance here?
SOLUTION: Both of the regex-based solutions suffered from OverflowErrors, so unfortunately I could not test them. The solution that worked (from #mgilson's answer) was:
new_names = set(name.lower() for name in names)
[ t for t in target if any(map(new_names.__contains__,t.lower().split())) ]
This provided a tremendous increase in performance from 15 seconds to under 1 second.
Seems like you could combine them all into 1 super regex:
import re
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim','Christmas is here', 'CHRIS']
regex_string = '|'.join(r"(?:\b"+re.escape(x)+r"\b)" for x in names)
print(regex_string)
regex = re.compile(regex_string, re.I)
print([t for t in target if regex.search(t)])
A non-regex solution which will only work if the names are a single word (no whitespace):
new_names = set(name.lower() for name in names)
[ t for t in target if any(map(new_names.__contains__,t.lower().split())) ]
the any expression could also be written as:
any(x in new_names for x in t.lower().split())
or
any(x.lower() in new_names for x in t.split())
or, another variant which relies on set.intersection (suggested by #DSM below):
[ t for t in target if new_names.intersection(t.lower().split()) ]
You can profile to see which performs best if performance is really critical, otherwise choose the one that you find to be easiest to read/understand.
*If you're using Python 2.x, you'll probably want to use itertools.imap instead of map if you go that route above, to get it to evaluate lazily. It also makes me wonder whether Python provides a lazy str.split whose performance would be on par with the non-lazy version...
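If you do want to profile the variants above, a quick timeit sketch (illustrative, assuming target, regex, and new_names are defined at module level as in the snippets above):

import timeit

setup = "from __main__ import target, regex, new_names"
print(timeit.timeit("[t for t in target if regex.search(t)]", setup=setup, number=10000))
print(timeit.timeit("[t for t in target if new_names.intersection(t.lower().split())]", setup=setup, number=10000))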
This one is the simplest I can think of:
[item for item in target if re.search(r'\b(%s)\b' % '|'.join(names), item)]
All together:
import re
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim','Christmas is here', 'CHRIS']
results = [item for item in target if re.search(r'\b(%s)\b' % '|'.join(names), item)]
print(results)
>>>
['Chris Smith', 'Kim']
And to make it more efficient, you can compile the regex first:
regex = re.compile( r'\b(%s)\b' % '|'.join(names) )
[item for item in target if regex.search(item)]
Edit
After considering the question and looking at some comments, I have revised the 'solution' to the following:
import re
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim','Christmas is here', 'CHRIS']
regex = re.compile(r'\b((%s))\b' % ')|('.join([re.escape(name) for name in names]), re.I)
results = [item for item in target if regex.search(item)]
results:
>>>
['Chris Smith', 'Kim', 'CHRIS']
You're currently doing one loop inside another, iterating over two lists. That will always give you quadratic performance.
One local optimisation is to compile each name regex (which will make applying each regex faster). However, the big win is to combine all of your regexes into one regex which you apply to each item in your input. See #mgilson's answer for how to do that. After that, your code's performance should scale linearly as O(M+N), rather than O(M*N).
I am trying to parse the result output from a natural language parser (Stanford parser).
Some of the results are as below:
dep(Company-1, rent-5')
conj_or(rent-5, share-10)
amod(information-12, personal-11)
prep_about(rent-5, you-14)
amod(companies-20, non-affiliated-19)
aux(provide-23, to-22)
xcomp(you-14, provide-23)
dobj(provide-23, products-24)
aux(requested-29, 've-28)
The results I am trying to get are:
['dep', 'Company', 'rent']
['conj_or', 'rent', 'share']
['amod', 'information', 'personal']
...
['amod', 'companies', 'non-affiliated']
...
['aux', 'requested', "'ve"]
First I tried to pull these elements out directly, but failed.
Then I realized regex should be the right way forward.
However, I am totally unfamiliar with regex. With some exploration, I got:
m = re.search('(?<=())\w+', line)
m2 =re.search('(?<=-)\d', line)
and got stuck.
The first one can correctly get the first element, e.g. 'dep', 'amod', 'conj_or', but I actually have not totally figured out why it works...
The second line tries to get the second element, e.g. 'Company', 'rent', 'information', but I can only get the number after the hyphen. I cannot figure out how to match what comes before the hyphen rather than after it...
BTW, I also cannot figure out how to deal with exceptions such as 'non-affiliated' and "'ve".
Could anyone give some hints or help? Highly appreciated.
It is difficult to give an optimal answer without knowing the full range of possible outputs; however, here's a possible solution:
>>> [re.findall(r'[A-Za-z_\'-]+[^-\d\(\)\']', line) for line in s.split('\n')]
[['dep', 'Company', 'rent'],
['conj_or', 'rent', 'share'],
['amod', 'information', 'personal'],
['prep_about', 'rent', 'you'],
['amod', 'companies', 'non-affiliated'],
['aux', 'provide', 'to'],
['xcomp', 'you', 'provide'],
['dobj', 'provide', 'products'],
['aux', 'requested', "'ve"]]
It works by finding all maximal runs of letters ([A-Za-z] represents the range from capital A to Z and lowercase a to z), underscores, apostrophes, and hyphens in each line.
Furthermore, it enforces the rule that the matched string must not end with any of a given set of characters ([^...] is the syntax for "match any single character except those listed in place of ...").
The character \ escapes characters like "(" or ")" that would otherwise be interpreted by the regex engine as instructions.
Finally, s is the example string you gave in the question...
HTH!
Here is something you're looking for:
([\w-]*)\(([\w-]*)-\d*, ([\w-]*)-\d*\)
The parentheses around each [\w-]* form capture groups, so you can access the three fields as groups 1 to 3 (group 0 is the whole match):
ex = r'([\w-]*)\(([\w-]*)-\d*, ([\w-]*)-\d*\)'
m = re.match(ex, line)
if m:
    print(m.group(1), m.group(2), m.group(3))
Btw, I recommend the "Kodos" program, written in Python + PyQt, to learn and test regular expressions. It's my favourite tool for testing regexes.
If the results from the parser are as regular as suggested, regexes may not be necessary:
from pprint import pprint
source = """
dep(Company-1, rent-5')
conj_or(rent-5, share-10)
amod(information-12, personal-11)
prep_about(rent-5, you-14)
amod(companies-20, non-affiliated-19)
aux(provide-23, to-22)
xcomp(you-14, provide-23)
dobj(provide-23, products-24)
aux(requested-29, 've-28)
"""
items = []
for line in source.splitlines():
    head, sep, tail = line.partition('(')
    if head:
        item = [head]
        head, sep, tail = tail.strip('()').partition(', ')
        item.append(head.rpartition('-')[0])
        item.append(tail.rpartition('-')[0])
        items.append(item)
pprint(items)
Output:
[['dep', 'Company', 'rent'],
['conj_or', 'rent', 'share'],
['amod', 'information', 'personal'],
['prep_about', 'rent', 'you'],
['amod', 'companies', 'non-affiliated'],
['aux', 'provide', 'to'],
['xcomp', 'you', 'provide'],
['dobj', 'provide', 'products'],
['aux', 'requested', "'ve"]]