split by regex and add matches to dictionary


first time posting here.
I'd like to 1) parse the following text:"keyword: some keywords concept :some concepts"
and 2) store into the dictionary: ['keyword']=>'some keywords', ['concept']=>'some concepts'.
There may be 0 or 1 'space' before each 'colon'. The following is what I've tried so far.
sample_text = "keyword: some keywords concept :some concepts"
p_res = re.compile("(\S+\s?):").split(sample_text) # Task 1
d_inc = dict([(k, v) for k,v in zip (p_res[::2], p_res[1::2])]) # Task 2
However, the resulting list p_res is wrong: it has an empty entry at index 0, which consequently produces the wrong dict. Is there something wrong with my regex?

Use re.findall to capture a list of groups for each match, and then apply dict to convert the list of tuples to a dict.
>>> import re
>>> s = 'keyword: some keywords concept :some concepts'
>>> dict(re.findall(r'(\S+)\s*:\s*(.*?)\s*(?=\S+\s*:|$)', s))
{'concept': 'some concepts', 'keyword': 'some keywords'}
>>>
The regex above captures each key and its corresponding value in two separate groups.
I assume that the input string contains only key-value pairs and that the keys won't contain any whitespace characters.

Simply replace Task1 by this line:
p_res = re.compile(r"(\S+\s?):").split(sample_text)[1:] # Task 1
This will always ignore the (normally empty) element that is returned by re.split.
Background: Why does re.split return the empty first result?
What should the program do with this input:
sample_text = "Hello! keyword: some keywords concept :some concepts"
The text Hello! at the beginning of the input doesn't fit into the definition of your problem (which assumes that the input starts with a key).
Do you want to ignore it? Do you want to raise an exception if it appears? Do you want to add it to your dictionary with a special key?
re.split doesn't want to decide this for you: It returns whatever information appears and you make your decision. In our solution, we simply ignore whatever appears before the first key.
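One way to make that decision explicit is sketched below. The helper name parse_pairs and the choice to raise ValueError are assumptions for illustration, not part of the original code, and the regex is a slightly adjusted variant that keeps the optional space out of the captured key.

```python
import re

def parse_pairs(text):
    """Split 'key: value key2 :value2' into a dict (hypothetical helper)."""
    # The capturing group makes re.split keep the keys in the result list.
    parts = re.split(r"(\S+)\s?:", text)
    leading, rest = parts[0], parts[1:]
    # Assumption: any text before the first key is treated as an error.
    if leading.strip():
        raise ValueError(f"unexpected text before first key: {leading!r}")
    # Keys sit at even positions of `rest`, values at odd positions.
    return {key: value.strip() for key, value in zip(rest[::2], rest[1::2])}

result = parse_pairs("keyword: some keywords concept :some concepts")
print(result)
```

Replacing the raise with a plain slice would reproduce the "ignore whatever comes first" behaviour of the one-line fix.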

Related

How to understand the flaw in my simple three part python code?

My Python exercise in 'classes' is as follows:
You have been recruited by your friend, a linguistics enthusiast, to create a utility tool that can perform analysis on a given piece of text. Complete the class "analyzedText" with the following methods:
Constructor (__init__) - This method should take the argument text, make it lowercase and remove all punctuation. Assume only the following punctuation is used: period (.), exclamation mark (!), comma (,), and question mark (?). Assign this newly formatted text to a new attribute called fmtText.
freqAll - This method should create and return dictionary of all unique words in the text along with the number of times they occur in the text. Each key in the dictionary should be the unique word appearing in the text and the associated value should be the number of times it occurs in the text. Create this dictionary from the fmtText attribute.
This was my code:
class analysedText(object)
    def __init__ (self, text):
        formattedText = text.replace('.',' ').replace(',',' ').replace('!',' ').replace('?',' ')
        formattedText = formattedText.lower()
        self.fmtText = formattedText
    def freqAll(self):
        wordList = self.fmtText.split(' ')
        wordDict = {}
        for word in set(wordList):
            wordDict[word] = wordList(word)
        return wordDict
I get errors on both of these and I can't seem to figure it out after a lot of little adjustments. I suspect the issue in the first part is when I try to assign a value to the newly formatted text but I cannot think of a workable solution. As for the second part, I am at a complete loss - I was wrongfully confident my answer was correct but I received a fail error when I ran it through the classroom's code cell to test it.

On the assumption that by 'errors' you mean a TypeError, this is caused because of line 13, wordDict[word] = wordList(word).
wordList is a list, and by using the ()/brackets you're telling Python that you want to call that list as a function. Which it cannot do.
According to your task, you are instead meant to find the occurrences of words in the list, which you can do with the .count() method. This method returns the total number of occurrences of an element in a list. (See the Python documentation on list methods for more.)
With this modification, (this is assuming you want wordDict to contain a dictionary with the word as the key, and the occurrence as the value) your freqAll function would look something like this:
def freqAll(self):
    wordList = self.fmtText.split()
    wordDict = {}
    for word in set(wordList):
        wordDict[word] = wordList.count(word) # wordList.count(word) returns the number of times the string word appears as an element in wordList
    return wordDict
You could also achieve the same result with the collections.Counter class (which means you have to import collections); see the Python documentation for details.
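As a rough sketch of the Counter variant: the function name word_frequencies is made up for illustration, and the punctuation set is the one the exercise lists. It is written as a standalone function rather than as the method the exercise asks for.

```python
from collections import Counter

def word_frequencies(text):
    """Count words after lowercasing and removing . , ! ? (hypothetical helper)."""
    for mark in ".,!?":
        text = text.replace(mark, "")
    # split() with no argument collapses runs of whitespace,
    # so no empty strings end up being counted as words.
    return dict(Counter(text.lower().split()))

freqs = word_frequencies("Hello, hello world!")
print(freqs)
```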

best way to find substring using regex in python 3

I was trying to find out the best way to find the specific substring in key value pair using re for the following:
some_string-variable_length/some_no_variable_digit/some_no1_variable_digit/some_string1/some_string2
eg: aba/101/11111/cde/xyz or aaa/111/1119/cde/xzx or ada/21111/5/cxe/yyz
here everything is variable and what I was looking for is something like below in key value pair:
`cde: 2` as there are two entries for cde
cxe: 1 as there is only one cxe
Note: everything is variable here except /. ie cde or cxe or some string will be there exactly after two / in each case
input:aba/101/11111/cde/xyz/blabla
output: cde:xyz/blabla
input: aaa/111/1119/cde/xzx/blabla
output: cde:xzx/blabla
input: aahjdsga/11231/1119/gfts/sjhgdshg/blabla
output: gfts:sjhgdshg/blabla
If you notice here, my key is always the first string after 3rd / and value is always the substring after key

Here are a couple of solutions based on your description that "key is always the first string after 3rd / and value is always the substring after key". The first uses str.split with a maxsplit of 4 to collect everything after the fourth / into the value. The second uses regex to extract the two parts:
inp = ['aba/101/11111/cde/xyz/blabla',
       'aaa/111/1119/cde/xzx/blabla',
       'aahjdsga/11231/1119/gfts/sjhgdshg/blabla'
       ]
for s in inp:
    parts = s.split('/', 4)
    key = parts[3]
    value = parts[4]
    print(f'{key}:{value}')
import re
for s in inp:
    m = re.match(r'^(?:[^/]*/){3}([^/]*)/(.*)$', s)
    if m is not None:
        key = m.group(1)
        value = m.group(2)
        print(f'{key}:{value}')
For both pieces of code the output is
cde:xyz/blabla
cde:xzx/blabla
gfts:sjhgdshg/blabla

Others have already posted various regexes. A broader question: is this problem best solved using a regex? Depending on how the data is formatted overall, it may be better parsed using
the .split('/') method on the string; or
csv.reader(..., delimiter='/') or csv.DictReader(..., delimiter='/') in the csv module.
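For the counting part of the question (cde: 2, cxe: 1), a split-based sketch could look like this. It assumes every line has at least four "/"-separated fields; the variable names are illustrative.

```python
from collections import Counter

paths = [
    "aba/101/11111/cde/xyz",
    "aaa/111/1119/cde/xzx",
    "ada/21111/5/cxe/yyz",
]
# Index 3 is the field right after the third "/".
key_counts = Counter(path.split("/")[3] for path in paths)
for key, count in key_counts.items():
    print(f"{key}: {count}")
```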

Try (?<!\S)[^\s/]*(?:/[^\s/]*){2}/([^\s/]*)
Updated per the comment, to also capture the value after the key:
(?<!\S)[^\s/]*(?:/[^\s/]*){2}/([^\s/]*)(?:/(\S*))?
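A possible way to use the second pattern from Python, as a sketch (the helper name key_and_value is an assumption): group 1 is the key and group 2 is the rest of the path, which is None when nothing follows the key.

```python
import re

pattern = re.compile(r"(?<!\S)[^\s/]*(?:/[^\s/]*){2}/([^\s/]*)(?:/(\S*))?")

def key_and_value(line):
    """Return (key, value) for one path, or None if it does not match."""
    m = pattern.search(line)
    if m is None:
        return None
    return m.group(1), m.group(2)

pair = key_and_value("aba/101/11111/cde/xyz/blabla")
print(pair)
```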

duplicate item_01 and rename to item_02

I'm a complete beginner in Python, and I'm trying to use the language to make scripts in a program called Modo.
I'm attempting to make a script that duplicates an item which has a suffixed number, and adds 1 to the suffix of the new duplicated item.
So for example; duplicate 'item_01', which will create 'item_01 (2)', then rename 'item_01 (2)' to 'item_02'.
I'm having trouble finding out how to get Python to take the '_01' part of the previous item's name, then add 1 to it and using the sum of that as the suffix for the new item name.
Also, this is my first question on this great site, so if additional information is needed, please let me know.

I'm interpreting your question as "I have a string ending in a digit sequence, for example "item_01". I want to get a string with the same form as the original string, but with the digit incremented by one, for example "item_02"."
You could use re.sub to replace the digit sequence with a new one:
>>> import re
>>>
>>> s = "item_01"
>>> result = re.sub(
... r"\d+$", #find all digits at the end of the string,
... lambda m: str( #replacing them with a string
... int(m.group())+1 #equal to one plus the original value,
... ).zfill(len(m.group())), #with at least as much padding as the original value.
... s
... )
>>>
>>> print(result)
item_02
In one line, that would be
result = re.sub(r"\d+$", lambda m: str(int(m.group())+1).zfill(len(m.group())), s)
Note that the resulting string may be longer than the original string, if the original value is all nines:
>>> re.sub(r"\d+$", lambda m: str(int(m.group())+1).zfill(len(m.group())), "item_99")
'item_100'
And it will only increment the digit sequence at the very end of the string, and not any intermediary sequences.
>>> re.sub(r"\d+$", lambda m: str(int(m.group())+1).zfill(len(m.group())), "item_23_42")
'item_23_43'
And if the string has no suffix digit sequence, it will simply return the original value unaltered.
>>> re.sub(r"\d+$", lambda m: str(int(m.group())+1).zfill(len(m.group())), "item_foobar")
'item_foobar'
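To connect this with the duplicate naming described in the question, here is a sketch that first strips a trailing " (2)"-style marker and then increments the suffix. The function names are hypothetical, the marker format is assumed from the question's example, and no Modo API is involved.

```python
import re

def increment_suffix(name):
    """Add one to the trailing digit sequence, keeping zero padding."""
    return re.sub(
        r"\d+$",
        lambda m: str(int(m.group()) + 1).zfill(len(m.group())),
        name,
    )

def rename_duplicate(duplicate_name):
    """Turn 'item_01 (2)' into 'item_02' (assumed duplicate-marker format)."""
    # Drop a trailing marker such as ' (2)' before incrementing.
    original = re.sub(r"\s*\(\d+\)$", "", duplicate_name)
    return increment_suffix(original)

new_name = rename_duplicate("item_01 (2)")
print(new_name)
```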

Getting the name out of a variable is not something you do in Python.
What you want to achieve here, reading what you wrote,
I'm attempting to make a script that duplicates an item which has a suffixed number, and adds 1 to the suffix of the new duplicated item.
So for example; duplicate 'item_01', which will create 'item_01 (2)', then rename 'item_01 (2)' to 'item_02'.
could be written more simply like this:
some_var = 1
some_other_var = some_var + 1
You could also write a function that adds one to the parameter it receives and returns the result:
def add_one(var):
    return var + 1
some_var = 1
some_other_var = add_one(some_var)
If you want to "name" your variables, and be able to change them, even if I don't see why you would want to do this, I believe what you are looking for is a dict.
I'll let you look up the reference for dictionaries though. :)

How can I make multiple replacements in a string using a dictionary?

Suppose we have:
d = {
'Спорт':'Досуг',
'russianA':'englishA'
}
s = 'Спорт russianA'
How can I replace each appearance within s of any of d's keys, with the corresponding value (in this case, the result would be 'Досуг englishA')?

Using re:
import re
s = 'Спорт not russianA'
d = {
'Спорт':'Досуг',
'russianA':'englishA'
}
keys = (re.escape(k) for k in d.keys())
pattern = re.compile(r'\b(' + '|'.join(keys) + r')\b')
result = pattern.sub(lambda x: d[x.group()], s)
# Output: 'Досуг not englishA'
This will match whole words only. If you don't need that, use the pattern:
pattern = re.compile('|'.join(re.escape(k) for k in d.keys()))
Note that in this case you should sort the words descending by length if some of your dictionary entries are substrings of others.
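A small sketch of that length-sorting step, using made-up English keys where one key is a prefix of another:

```python
import re

# Illustrative dictionary: 'cat' is a substring of 'category'.
d = {"cat": "feline", "category": "group"}

# Longest keys first, so 'category' is tried before 'cat'.
keys = sorted(d, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(k) for k in keys))
result = pattern.sub(lambda m: d[m.group()], "cat category")
print(result)
```

Without the sort, the alternation would try 'cat' first and replace only the first three letters of 'category'.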

You could use the reduce function (in Python 3 it must be imported from functools):
from functools import reduce
reduce(lambda x, y: x.replace(y, d[y]), d, s)

Solution found here (I like its simplicity):
def multipleReplace(text, wordDict):
    for key in wordDict:
        text = text.replace(key, wordDict[key])
    return text

one way, without re
d = {
'Спорт':'Досуг',
'russianA':'englishA'
}
s = 'Спорт russianA'.split()
for n,i in enumerate(s):
    if i in d:
        s[n]=d[i]
print(' '.join(s))

Almost the same as ghostdog74's answer, though independently created. One difference:
using d.get() instead of d[] can handle items not in the dict.
>>> d = {'a':'b', 'c':'d'}
>>> s = "a c x"
>>> foo = s.split()
>>> ret = []
>>> for item in foo:
...     ret.append(d.get(item,item)) # Try to get from dict, otherwise keep value
...
>>> " ".join(ret)
'b d x'

With the warning that it fails if a key contains a space, this is a compressed solution similar to ghostdog74's and extaneon's answers:
d = {
'Спорт':'Досуг',
'russianA':'englishA'
}
s = 'Спорт russianA'
' '.join(d.get(i,i) for i in s.split())

I used this in a similar situation (my string was all in uppercase):
def translate(string, wdict):
    for key in wdict:
        string = string.replace(key, wdict[key].lower())
    return string.upper()
hope that helps in some way... :)

Using regex
We can build a regular expression that matches any of the lookup dictionary's keys, by creating regexes to match each individual key and combine them with |. We use re.sub to do the substitution, by giving it a function to do the replacement (this function, of course, will do the dict lookup). Putting it together:
import re
# assuming global `d` and `s` as in the question
# a function that does the dict lookup with the global `d`.
def lookup(match):
    return d[match.group()]
# Make the regex.
joined = '|'.join(re.escape(key) for key in d.keys())
pattern = re.compile(joined)
result = pattern.sub(lookup, s)
Here, re.escape is used to escape any characters with special meaning in the keys (so that they don't interfere with building the regex, and are matched literally).
This regex pattern will match the substrings anywhere they appear, even if they are part of a word or span across multiple words. To avoid this, modify the regex so that it checks for word boundaries:
# pattern = re.compile(joined)
pattern = re.compile(rf'\b({joined})\b')
Using str.replace iteratively
Simply iterate over the .items() of the lookup dictionary, and call .replace with each. Since this method returns a new string, and does not (cannot) modify the string in place, we must reassign the results inside the loop:
for to_replace, replacement in d.items():
    s = s.replace(to_replace, replacement)
This approach is simple to write and easy to understand, but it comes with multiple caveats.
First, it has the disadvantage that it works sequentially, in a specific order. That is, each replacement has the potential to interfere with other replacements. Consider:
s = 'one two'
s = s.replace('one', 'two')
s = s.replace('two', 'three')
This will produce 'three three', not 'two three', because the 'two' from the first replacement will itself be replaced in the second step. This is normally not desirable; however, in the rare case when it should work this way, this approach is the only practical one.
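For contrast, here is a sketch comparing the sequential loop with a single-pass regex substitution on the same input. The variable names are illustrative.

```python
import re

replacements = {"one": "two", "two": "three"}
text = "one two"

# Sequential: the 'two' produced by the first step is replaced again.
sequential = text
for old, new in replacements.items():
    sequential = sequential.replace(old, new)

# Single pass: each position in the original text is replaced at most once.
pattern = re.compile("|".join(re.escape(k) for k in replacements))
single_pass = pattern.sub(lambda m: replacements[m.group()], text)

print(sequential)
print(single_pass)
```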
This approach also cannot easily be fixed to respect word boundaries, because it must match literal text, and a "word boundary" can be marked in multiple different ways - by varying kinds of whitespace, but also without text at the beginning and end of the string.
Finally, keep in mind that a dict is not an ideal data structure for this approach. If we will iterate over the dict, then its ability to do key lookup is useless; and in Python 3.5 and below, the order of dicts is not guaranteed (making the sequential replacement problem worse). Instead, it would be better to specify a list of tuples for the replacements:
d = [('Спорт', 'Досуг'), ('russianA', 'englishA')]
s = 'Спорт russianA'
for to_replace, replacement in d: # no more `.items()` call
    s = s.replace(to_replace, replacement)
By tokenization
The problem becomes much simpler if the string is first cut into pieces (tokenized), in such a way that anything that should be replaced is now an exact match for a dict key. That would allow for using the dict's lookup directly, and processing the entire string in one go, while also not building a custom regex.
Suppose that we want to match complete words. We can use a simpler, hard-coded regex that will match whitespace, and which uses a capturing group; by passing this to re.split, we split the string into whitespace and non-whitespace sections. Thus:
import re
tokenizer = re.compile('([ \t\n]+)')
tokenized = tokenizer.split(s)
Now we look up each of the tokens in the dictionary: if present, it should be replaced with the corresponding value, and otherwise it should be left alone (equivalent to replacing it with itself). The dictionary .get method is a natural fit for this task. Finally, we join the pieces back up. Thus:
s = ''.join(d.get(token, token) for token in tokenized)
More generally, for example if the strings to replace could have spaces in them, a different tokenization rule will be needed. However, it will usually be possible to come up with a tokenization rule that is simpler than the regex from the first section (that matches all the keys by brute force).
Special case: replacing single characters
If the keys of the dict are all one character (technically, Unicode code point) each, there are more specific techniques that can be used. See Best way to replace multiple characters in a string? for details.

Using Python to check words

I'm stuck on a simple problem. I've got a dictionary of words in the English language, and a sample text that is to be checked. I've got to check every word in the sample against the dictionary, and the code I'm using is wrong.
for word in checkList: # iterates through every word in the sample
    if word not in refDict: # checks if word is not in the dictionary
        print(word) # just to see if it's recognizing misspelled words
The only problem is, as it goes through the loop it prints out every word, not just the misspelled ones. Can someone explain this and offer a solution possibly? Thank you so much!

The snippet you have is functional. See for example
>>> refDict = {'alpha':1, 'bravo':2, 'charlie':3, 'delta':4}
>>> s = 'he said bravo to charlie O\'Brian and jack Alpha'
>>> for word in s.split():
...     if word not in refDict:
...         print(repr(word)) # by temporarily using repr() we can see exactly
...                           # what the words are like
...
'he'
'said'
'to'
"O'Brian"
'and'
'jack'
'Alpha' # note how Alpha was not found in refDict (u/l case difference)
Therefore, the dictionary contents must differ from what you think, or the words out of checklist are not exactly as they appear (eg. with whitespace or capitalization; see the use of repr() (*) in print statement to help identify cases of the former).
Debugging suggestion: FOCUS on the first word from checklist (or the first that you suspect is to be found in dictionary). Then for this word and this word only, print it in details, with its length, with bracket on either side etc., for both the word out of checklist and the corresponding key in the dictionary...
(*) repr() was a suggestion from John Machin. Instead I often use brackets or other characters as in print('[' + word + ']'), but repr() is more exacting in its output.

Consider stripping your words of any whitespace that might be there, and changing all the words of both sets to the same case. Like this:
word.strip().lower()
That way you can make sure you're comparing apples to apples.

Clearly "word not in refDict" always evaluates to True. This is probably because the contents of refDict or checkList are not what you think they are. Are they both tuples or lists of strings?

The code you have would work if the keys in refDict are the correctly spelt words. If the correctly spelt words are the values in your dict then you need something like this:
for word in checkList:
    if word not in refDict.values():
        print(word)
Is there a reason your dictionary is stored as a mapping as opposed to a list or a set? A Python dict contains key-value pairs. For example, I could use the mapping {"dog":23, "cat":45, "pony":67} to store an index of words and the page number where each is found in some book. In your case, your dict is a mapping of what to what?

Are the words in the refDict the keys or the values?
Your code will only see keys, e.g.:
refDict = { 'w':'x', 'y':'z' }
for word in [ 'w','x','y','z' ]:
    if word not in refDict:
        print(word)
prints:
x
z
Otherwise you want:
if word not in refDict.values():
Of course this rather assumes that your dictionary is an actual python dictionary which seems an odd way to store a list of words.

Your refDict is probably wrong. The in keyword checks if the value is in the keys of the dictionary. I believe you've put your words in as values.
I'd propose using a set instead of a dictionary.
knownwords = {"dog", "cat"}
knownwords.add("apple")
text = "The dog eats an apple."
for word in text.split(" "):
    # to ignore case word is converted to lowercase
    if word.lower() not in knownwords:
        print(word)
# The
# eats
# an
# apple. <- doesn't work because of the dot
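One possible way around the trailing-dot problem is to strip punctuation from each word before the lookup. This is only a sketch: the function name unknown_words is made up, and string.punctuation covers ASCII punctuation only.

```python
import string

def unknown_words(text, known_words):
    """Return the words of `text` that are not in `known_words` (hypothetical helper)."""
    unknown = []
    for word in text.split():
        # Remove leading/trailing punctuation and ignore case for the lookup.
        cleaned = word.strip(string.punctuation).lower()
        if cleaned and cleaned not in known_words:
            unknown.append(word)
    return unknown

missing = unknown_words("The dog eats an apple.", {"dog", "cat", "apple"})
print(missing)
```

Because strip only touches the ends of each word, an apostrophe inside a word such as O'Brian is left alone.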
