d_hsp={"1":"I","2":"II","3":"III","4":"IV","5":"V","6":"VI","7":"VII","8":"VIII",
"9":"IX","10":"X","11":"XI","12":"XII","13":"XIII","14":"XIV","15":"XV",
"16":"XVI","17":"XVII","18":"XVIII","19":"XIX","20":"XX","21":"XXI",
"22":"XXII","23":"XXIII","24":"XXIV","25":"XXV"}
HSP_OLD['tryl'] = HSP_OLD['tryl'].replace(d_hsp, regex=True)
HSP_OLD is a DataFrame and tryl is one of its columns. Here are some example values in tryl:
SAF/HSP: Secondary diagnosis E code 1
SAF/HSP: Secondary diagnosis E code 11
I use a dictionary for the replacement. It works for 1-10, but 11 becomes "II" and 12 becomes "III".
Sorry, didn't notice that you're not merely updating the field but you actually want to replace a number at the end, but even if that's the case - it's much better to properly convert your number to roman numerals than to map every possible occurrence of such (what would happen with your code if there is a number larger than 25?). So, here's one way to do it:
ROMAN_MAP = [(1000, 'M'), (900, 'CM'), (500, 'D'), (400, 'CD'), (100, 'C'), (90, 'XC'),
(50, 'L'), (40, 'XL'), (10, 'X'), (9, 'IX'), (5, 'V'), (4, 'IV'), (1, 'I')]
def romanize(data):
    if not data or not isinstance(data, str):  # we know how to work with strings only
        return data
    data = data.rstrip()  # remove potential extra whitespace at the end
    space_pos = data.rfind(" ")  # find the last space before the number
    if space_pos != -1:
        try:
            number = int(data[space_pos + 1:])  # get the number at the end
            roman_number = ""
            for i, r in ROMAN_MAP:  # loop-reduce substitution based on the ROMAN_MAP
                while number >= i:
                    roman_number += r
                    number -= i
            return data[:space_pos + 1] + roman_number  # put everything back together
        except (TypeError, ValueError):
            pass  # couldn't extract a number
    return data
So now if we create your data frame as:
HSP_OLD = pd.DataFrame({"tryl": ["SAF/HSP: Secondary diagnosis E code 1",
None,
"SAF/HSP: Secondary diagnosis E code 11",
"Something else without a number at the end"]})
We can now easily apply our function over the whole column with:
HSP_OLD['tryl'] = HSP_OLD['tryl'].apply(romanize)
Which results in:
tryl
0 SAF/HSP: Secondary diagnosis E code I
1 None
2 SAF/HSP: Secondary diagnosis E code XI
3 Something else without a number at the end
Of course, you can adapt the romanize() function to your needs to search any number within your string and turn it to roman numerals - this is just an example for how to quickly find the number at the end of the string.
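As a rough sketch of that adaptation, the version below converts every run of digits in a string instead of only the trailing number. The names to_roman and romanize_all are hypothetical and not part of the answer above; the sketch assumes positive integers (a literal 0 would be replaced by an empty string).

```python
import re

ROMAN_MAP = [(1000, 'M'), (900, 'CM'), (500, 'D'), (400, 'CD'), (100, 'C'), (90, 'XC'),
             (50, 'L'), (40, 'XL'), (10, 'X'), (9, 'IX'), (5, 'V'), (4, 'IV'), (1, 'I')]

def to_roman(number):
    # Hypothetical helper: the same loop-reduce substitution as in romanize().
    result = ""
    for value, numeral in ROMAN_MAP:
        while number >= value:
            result += numeral
            number -= value
    return result

def romanize_all(text):
    # Replace each run of digits anywhere in the string with its Roman numeral.
    return re.sub(r"\d+", lambda m: to_roman(int(m.group(0))), text)

print(romanize_all("SAF/HSP: Secondary diagnosis E code 11"))
```

It can be applied to the column in the same way, e.g. with .apply(), as long as non-string values such as None are filtered out first.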
You need to keep the order of the items, and start searching with the longest substring.
You may use an OrderedDict here. To initialize it, use a list of tuples. You may reverse the list right away when initializing, but you can also do it later.
import collections
import pandas as pd
# My test data
HSP_OLD = pd.DataFrame({'tryl':['1. Text', '11. New Text', '25. More here']})
d_hsp_lst=[("1","I"),("2","II"),("3","III"),("4","IV"),("5","V"),("6","VI"),("7","VII"),("8","VIII"), ("9","IX"),("10","X"),("11","XI"),("12","XII"),("13","XIII"),("14","XIV"),("15","XV"), ("16","XVI"),("17","XVII"),("18","XVIII"),("19","XIX"),("20","XX"),("21","XXI"), ("22","XXII"),("23","XXIII"),("24","XXIV"),("25","XXV")]
d_hsp = collections.OrderedDict(d_hsp_lst) # Creating the OrderedDict
d_hsp = collections.OrderedDict(reversed(d_hsp.items())) # Here, reversing
>>> HSP_OLD['tryl'] = HSP_OLD['tryl'].replace(d_hsp, regex=True)
>>> HSP_OLD
tryl
0 I. Text
1 XI. New Text
2 XXV. More here
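The "longest substring first" idea can also be sketched without pandas, using only the re module. This is an illustration under assumptions, not the method used above: ROMAN is a shortened, hypothetical mapping, and replace_numbers is a made-up helper name.

```python
import re

# Shortened mapping for illustration only.
ROMAN = {"1": "I", "2": "II", "10": "X", "11": "XI", "12": "XII", "25": "XXV"}

# Sort the keys by length, longest first, so "11" is tried before "1".
pattern = re.compile("|".join(sorted(map(re.escape, ROMAN), key=len, reverse=True)))

def replace_numbers(text):
    # Each match is looked up in the mapping and replaced in a single pass.
    return pattern.sub(lambda m: ROMAN[m.group(0)], text)

print(replace_numbers("11. New Text"))
```

Because the alternation tries the longer keys first, "11" is replaced as a whole rather than as two separate "1" matches.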
Given a string, e.g. i am a string.
I can generate the n-grams of this string like so, using the nltk package, where n is variable as per a specified range.
from nltk import ngrams
s = 'i am a string'
for n in range(1, 3):
    for grams in ngrams(s.split(), n):
        print(grams)
Gives the output:
('i',)
('am',)
('a',)
('string',)
('i', 'am')
('am', 'a')
('a', 'string')
Is there a way to 'reconstruct' the original string using combinations of the generated ngrams? Or, in the words of the below commenter, is there a way to divide the sentence into consecutive word sequences where each sequence has a maximum length of k (in this case k is 2).
[('i'), ('am'), ('a'), ('string')]
[('i', 'am'), ('a'), ('string')]
[('i'), ('am', 'a'), ('string')]
[('i'), ('am'), ('a', 'string')]
[('i', 'am'), ('a', 'string')]
The question is similar to this one, though with an additional layer of complexity.
Working solution - adapted from here.
I have a working solution, but it's really slow for longer strings.
import itertools
from nltk import ngrams

def get_ngrams(s, min_=1, max_=4):
    token_lst = []
    for n in range(min_, max_):
        for idx, grams in enumerate(ngrams(s.split(), n)):
            token_lst.append(' '.join(grams))
    return token_lst

def to_sum_k(s):
    for len_ in range(1, len(s.split())+1):
        for i in itertools.permutations(get_ngrams(s), r=len_):
            if ' '.join(i) == s:
                print(i)

to_sum_k('a b c')
EDIT:
This answer was based on the assumption that the question was to reconstruct an unknown unique string based on its ngrams. I'll leave it active for anyone interested in this problem. The actual answer for the actual problem as clarified in the comments can be found here.
EDIT END
In general no. Consider e.g. the case n = 2 and s = "a b a b". Then your ngrams would be
[("a"), ("b"), ("a", "b"), ("b", "a")]
The set of strings that generate this set of ngrams in this case however would be all that may be generated by
(ab(a|(ab)*a?))|(ba(b|(ba)*b?))
Or n = 2, s = "a b c a b d a", where "c" and "d" may be arbitrarily ordered within the generating strings. E.g. "a b d a b c a" would also be a valid string. In addition the same issue as above arises and an arbitrary number of strings can generate the set of ngrams.
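The ambiguity is easy to check with a few lines of plain Python. The helper ngram_set below is hypothetical (it is not part of nltk); it collects all word n-grams of length 1 up to max_n into a set, so that different strings can be compared.

```python
def ngram_set(s, max_n):
    # Hypothetical helper: the set of all word n-grams with 1 <= n <= max_n.
    words = s.split()
    return {tuple(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)}

# Different strings, identical n-gram sets:
print(ngram_set("a b a b", 2) == ngram_set("a b a", 2))
print(ngram_set("a b c a b d a", 2) == ngram_set("a b d a b c a", 2))
```

Both comparisons print True, so the set of n-grams alone cannot tell these strings apart.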
That being said there exists a way to test whether a set of ngrams uniquely identifies a string:
Consider your set of strings as a description of a non-deterministic state-machine. Each ngram can be defined as a chain of states where the single characters are transitions. As an example for the ngrams [("a", "b", "c"), ("c", "d"), ("a", "d", "b")] we would build the following state-machine:
0 ->(a) 1 ->(b) 2 ->(c) 3
0 ->(c) 3 ->(d) 4
0 ->(a) 1 ->(d) 5 ->(b) 6
Now perform a determinization of this state-machine. Iff there exists a unique string that can be reconstructed from the ngrams, the state-machine will have a longest transition-chain that doesn't contain any cycles and contains all ngrams we built the original state-machine from. In this case the original string is simply the individual state-transitions of this path joined back together. Otherwise there exist multiple strings that can be built from the provided ngrams.
While my previous answer assumed that the problem was to find an unknown string based on its ngrams, this answer will deal with the problem of finding all ways to construct a given string using its ngrams.
Assuming repetitions are allowed the solution is fairly simple: Generate all possible number sequences summing up to the length of the original string with no number larger than n and use these to create the ngram-combinations:
import numpy
def generate_sums(l, n, intermediate):
    if l == 0:
        yield intermediate
    elif l < 0:
        return
    else:
        for i in range(1, n + 1):
            yield from generate_sums(l - i, n, intermediate + [i])

def combinations(s, n):
    words = s.split(' ')
    for c in generate_sums(len(words), n, [0]):
        cs = numpy.cumsum(c)
        yield [words[l:u] for (l, u) in zip(cs, cs[1:])]
EDIT:
As pointed out by @norok2 (thanks for the work) in the comments, it seems to be faster to use alternative cumsum implementations instead of the one provided by numpy for this use case.
END EDIT
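A minimal sketch of the same idea without numpy, using itertools.accumulate from the standard library as the cumulative sum. The function name split_into_ngrams and the tuple-based prefix are assumptions made for this illustration; no claim is made here about how its speed compares.

```python
from itertools import accumulate

def generate_sums(total, max_part, prefix=()):
    # Yield every sequence of parts (each 1..max_part) that sums to total.
    if total == 0:
        yield prefix
        return
    for part in range(1, min(max_part, total) + 1):
        yield from generate_sums(total - part, max_part, prefix + (part,))

def split_into_ngrams(s, n):
    # Hypothetical name: yields every split of s into consecutive n-grams of length <= n.
    words = s.split()
    for parts in generate_sums(len(words), n):
        bounds = [0, *accumulate(parts)]  # cumulative sums mark the slice boundaries
        yield [tuple(words[lo:hi]) for lo, hi in zip(bounds, bounds[1:])]

for combo in split_into_ngrams("i am a string", 2):
    print(combo)
```

For "i am a string" with n = 2 this yields the five splits listed in the question.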
If repetitions are not allowed things become a little bit more tricky. In this case we can use a non-deterministic finite automaton as defined in my previous answer and build our sequences based on traversals of the automaton:
from nltk import ngrams

def build_state_machine(s, n):
    next_state = 1
    transitions = {}
    for ng in ngrams(s.split(' '), n):
        state = 0
        for word in ng:
            if (state, word) not in transitions:
                transitions[(state, word)] = next_state
                next_state += 1
            state = transitions[(state, word)]
    return transitions

def combinations(s, n):
    transitions = build_state_machine(s, n)
    states = [(0, set(), [], [])]
    for word in s.split(' '):
        new_states = []
        for state, term_visited, path, cur_elem in states:
            if state not in term_visited:
                # union() needs an iterable, so wrap the state in a set
                new_states.append((0, term_visited.union({state}), path + [tuple(cur_elem)], []))
            if (state, word) in transitions:
                new_states.append((transitions[(state, word)], term_visited, path, cur_elem + [word]))
        states = new_states
    return [path + [tuple(cur_elem)] if state != 0 else path for (state, term_visited, path, cur_elem) in states if state not in term_visited]
As an example the following state machine would be generated for the string "a b a":
Red connections indicate a switch to the next ngram and need to be handled separately (second if in the loop), since they can only be traversed once.
My code:
seperated = startContent.split(' ')
seperatedNum = len(seperated)
#Ask for user input
for word in seperated and for i in seperatedNum:
    if word == 'ADJECTIVE':
        seperated[i] = input('Enter an adjective:')
    elif word == 'NOUN':
        seperated[i] = input('Enter a noun:')
    elif word == 'ADVERB':
        seperated[i] = input('Enter an adverb:')
    elif word == 'VERB':
        seperated[i] = input('Enter a verb:')
Basically, I ask for user input each time the loop runs into one of these words (there can be multiple of each).
I take my sentence and split it into a list with split. Then I run the loop over each word and want to edit the list using list[x] = 'replacement'.
The word in seperated part returns the list item, so I need another variable, e.g. i in len(list), to get the correct index of the word. I can't use list.index(str) because it returns the first index of the string when the text occurs multiple times.
You're looking for a way to pass multiple parameters in a for loop: There is nothing special about a for loop in this regard. The loop will iterate over a given sequence and will, each iteration, assign the current element to the given left-hand side.
for LEFT_HAND_SIDE in SEQUENCE
Python also supports "automatic" unpacking of sequences during assignments, as you can see in the following example:
>>> a, b = (4, 2)
>>> a
4
>>> b
2
In conclusion, you can just combine multiple variables on the left-hand side in your for loop with a sequence of sequences:
>>> for a, b in [(1, 2), (3, 4)]:
... print(a)
... print(b)
...
1
2
3
4
That for loop had two assignments a, b = (1, 2) and a, b = (3, 4).
In your specific case, you want to combine the value of an element in a sequence with its index. The built-in function enumerate comes in handy here:
>>> enumerate(["x", "y"])
<enumerate object at 0x7fc72f6685a0>
>>> list(enumerate(["x", "y"]))
[(0, 'x'), (1, 'y')]
So you could write your for loop like this:
for i, word in enumerate(seperated):
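Put together, the loop from the question could look roughly like the sketch below. The function name fill_placeholders, the PROMPTS table and the ask parameter are assumptions for this illustration; ask defaults to input and is only there so the function can be tried without typing.

```python
PROMPTS = {
    'ADJECTIVE': 'Enter an adjective:',
    'NOUN': 'Enter a noun:',
    'ADVERB': 'Enter an adverb:',
    'VERB': 'Enter a verb:',
}

def fill_placeholders(text, ask=input):
    # Hypothetical helper: replace each placeholder word with the user's answer.
    words = text.split(' ')
    for i, word in enumerate(words):  # i is the index, word is the element
        if word in PROMPTS:
            words[i] = ask(PROMPTS[word])
    return ' '.join(words)

# Non-interactive usage: feed prepared answers instead of calling input().
answers = iter(['silly', 'chandelier'])
print(fill_placeholders('The ADJECTIVE panda walked to the NOUN',
                        ask=lambda prompt: next(answers)))
```

Note that a placeholder followed directly by punctuation (e.g. 'NOUN.') would not match in this sketch, because the text is split on spaces only.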
I'm trying to print each element individually, which is fine, but also repeat each element based on its position, e.g. "abcd" = A-Bb-Ccc-Dddd etc.
So my problem is making the print statements print x times based on their position in the string. I've tried a few combinations using len and range, but I often encounter errors because I'm using strings, not ints.
Should I be using len and range here? I'd prefer if you didn't post finished code, just how to go about that specific problem (if possible), so I can still figure it out myself.
user_string = input()
def accum(s):
    for letter in s:
        pos = s[0]
        print(letter.title())
        pos = s[0 + 1]
accum(user_string)
You can enumerate iterables (lists, strings, ranges, dict keys, ...); enumerate provides the index and the value:
text = "abcdef"
for idx, c in enumerate(text):
    print(idx, c)
Output:
0 a
1 b
2 c
3 d
4 e
5 f
You can use that to print something multiple times. The print function accepts, among others, two optional keyword parameters, sep and end:
print("Bla","blubb", sep=" --->", end=" Kawumm\n")
Output:
Bla --->blubb Kawumm
They specify what is printed between the values and at the end of the output. With end="" you can continue printing on the same line.
Docs:
Print
Enumerate
Edit:
user_string = input()
def accum(s):
    t = []  # list to store stuff into
    for count, letter in enumerate(s):
        total = letter.upper() + letter * (count)  # 1st as Upper, rest as is
        t.append(total)  # add to list
    print(*t, sep="-")  # the * "unpacks" the list into its parts
accum(user_string)
Unpacking:
print( [1,2,3,4,5], sep=" +++ ") # its just 1 value to print, no sep needed
print(*[1,2,3,4,5], sep=" +++ ") # 5 values to print, sep needed
Output:
[1, 2, 3, 4, 5]
1 +++ 2 +++ 3 +++ 4 +++ 5
You could try having a counter that will increase by 1 as the loop traverses through the string. Then, within the loop that you currently have, have another for loop to loop the size of the counter. If you want it to print out the first letter capitalized then you will need to account for that along with the dashes.
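A minimal sketch of the counter approach just described, assuming the result should be returned as a string rather than printed piece by piece. The name accum_with_counter is made up for this illustration.

```python
def accum_with_counter(s):
    pieces = []
    counter = 0  # grows by one for every letter visited
    for letter in s:
        piece = letter.upper()        # first copy is capitalized
        for _ in range(counter):      # inner loop runs `counter` times
            piece += letter
        pieces.append(piece)
        counter += 1
    return "-".join(pieces)           # the dashes go between the pieces

print(accum_with_counter("abcd"))  # A-Bb-Ccc-Dddd
```

The inner loop is what makes each letter repeat according to its position; joining at the end avoids a trailing dash.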
I am trying to write a hangman game in Python. To match a character of a word, I am using the index function to get the location of the character.
Example:
word = 'COMPUTER'
user_input = raw_input('Enter a character :') # say 'T' is given here
if user_input in word:
    print "\nThe Character %c is present in the word \n" %user_input
    word_dict[word.index(user_input)] = user_input
# so the output will look like
{0: '_', 1: '_', 2: '_', 3: '_', 4: '_', 5: 'T', 6: '_', 7: '_'}
Now, my problem arises with repeated characters.
# Another example
>>> 'CARTOON'.index('O')
4
For the second 'O', how do I get its index? Since I have used this index logic, I would like to continue this way.
As per the str.index docs, signature looks like this
str.index(sub[, start[, end]])
The second parameter is the starting index to search from. So you can pass the index which you got for the first item + 1, to get the next index.
i = 'CARTOON'.index('O')
print 'CARTOON'.index('O', i + 1)
Output
5
The above code can be written like this
data = 'CARTOON'
print data.index('O', data.index('O') + 1)
You can even have this as a utility function, like this
def get_second_index(input_string, sub_string):
return input_string.index(sub_string, input_string.index(sub_string) + 1)
print get_second_index("CARTOON", "O")
Note: If the string is not found at least twice, this will throw ValueError.
The more generalized way,
def get_index(input_string, sub_string, ordinal):
    # An else clause on the for loop would always run here, so validate up front instead
    if ordinal < 1:
        raise ValueError("ordinal {} - is invalid".format(ordinal))
    current = -1
    for i in range(ordinal):
        current = input_string.index(sub_string, current + 1)
    return current
print get_index("AAABBBCCCC", "C", 4)
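For readers on Python 3, here is a rough equivalent of the generalized lookup. The name nth_index is hypothetical; the sketch relies only on str.index with its optional start argument, as described above, and lets the ValueError from str.index propagate when there are fewer occurrences than requested.

```python
def nth_index(input_string, sub_string, ordinal):
    # Hypothetical helper: index of the ordinal-th occurrence (1-based).
    if ordinal < 1:
        raise ValueError("ordinal {} is invalid".format(ordinal))
    current = -1
    for _ in range(ordinal):
        # Restart the search just after the previous match.
        current = input_string.index(sub_string, current + 1)
    return current

print(nth_index("CARTOON", "O", 2))
print(nth_index("AAABBBCCCC", "C", 4))
```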
A perhaps more pythonic method would be to use a generator, thus avoiding the intermediate array 'found':
def find_indices_of(char, in_string):
    index = -1
    while True:
        index = in_string.find(char, index + 1)
        if index == -1:
            break
        yield index

for i in find_indices_of('x', 'axccxx'):
    print i
1
4
5
An alternative would be the enumerate built-in
def find_indices_of_via_enumerate(char, in_string):
    return (index for index, c in enumerate(in_string) if char == c)
This also uses a generator.
I then got curious as to perf differences. I'm a year into using python, so I'm only beginning to feel truly knowledgeable. Here's a quick test, with various types of data:
import timeit

test_cases = [
    ('x', ''),
    ('x', 'axxxxxxxxxxxx'),
    ('x', 'abcdefghijklmnopqrstuvw_yz'),
    ('x', 'abcdefghijklmnopqrstuvw_yzabcdefghijklmnopqrstuvw_yzabcdefghijklmnopqrstuvw_yzabcdefghijklmnopqrstuvwxyz'),
]
for test_case in test_cases:
    print "('{}', '{}')".format(*test_case)
    print "string.find:", timeit.repeat(
        "[i for i in find_indices_of('{}', '{}')]".format(*test_case),
        "from __main__ import find_indices_of",
    )
    print "enumerate  :", timeit.repeat(
        "[i for i in find_indices_of_via_enumerate('{}', '{}')]".format(*test_case),
        "from __main__ import find_indices_of_via_enumerate",
    )
    print
Which, on my machine results in these timings:
('x', '')
string.find: [0.6248660087585449, 0.6235580444335938, 0.6264920234680176]
enumerate : [0.9158611297607422, 0.9153609275817871, 0.9118690490722656]
('x', 'axxxxxxxxxxxx')
string.find: [6.01502799987793, 6.077538013458252, 5.997750997543335]
enumerate : [3.595151901245117, 3.5859270095825195, 3.597352981567383]
('x', 'abcdefghijklmnopqrstuvw_yz')
string.find: [0.6462750434875488, 0.6512351036071777, 0.6495819091796875]
enumerate : [2.6581480503082275, 2.6216518878936768, 2.6187551021575928]
('x', 'abcdefghijklmnopqrstuvw_yzabcdefghijklmnopqrstuvw_yzabcdefghijklmnopqrstuvw_yzabcdefghijklmnopqrstuvwxyz')
string.find: [1.2539417743682861, 1.2511990070343018, 1.2702908515930176]
enumerate : [7.837890863418579, 7.791800022125244, 7.9181809425354]
The enumerate() approach is more expressive and Pythonic. Whether the performance differences matter depends on the actual use case.
You've asked how to find the second occurrence, and gotten an excellent answer for that, generalized for any specific occurrence. What you'll realize you actually want though is all occurrences at once. Here's a method for that:
def find_characters(word, character):
    found = []
    last_index = -1
    while True:
        try:
            last_index = word.index(character, last_index+1)
        except ValueError:
            break
        else:
            found.append(last_index)
    return found
You can use the count method of the strings to find the number of occurrences of the user_input in the string. Then, use the str.index(sub,start) method for each occurrence of the user_input in the word and increment start by 1 each time so that you do not wind up getting the same index each time.
if user_input in word:
    count = word.count(user_input)
    a = word.index(user_input)
    word_dict[a] = user_input  # a is already the position, no second lookup needed
    for i in range(count - 1):
        a = word.index(user_input, a + 1)
        word_dict[a] = user_input
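As a self-contained sketch of the count-plus-index idea, assuming word_dict maps each position to either '_' or the revealed letter as in the question. The function name reveal is made up for this illustration.

```python
def reveal(word, word_dict, guess):
    # Hypothetical helper: mark every position of guess in word_dict.
    start = 0
    for _ in range(word.count(guess)):
        pos = word.index(guess, start)  # next occurrence at or after start
        word_dict[pos] = guess
        start = pos + 1                 # move past it so the next search advances
    return word_dict

word = 'CARTOON'
word_dict = {i: '_' for i in range(len(word))}
print(reveal(word, word_dict, 'O'))
```

Because the loop runs exactly word.count(guess) times, str.index never raises ValueError here, and a guess that does not occur leaves the dictionary unchanged.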
This should really be a one-liner if you use filter because if you use index you will be forced to either iterate or use recursion. In this case, there is absolutely no need for either. You can just filter out the values that are relevant to you.
Using filter is easy. An example implementation is the following one-liner:
def f1(w, c):
    return zip(* filter(lambda (x,y): x == c, zip(w, range(len(w))) ))[1]
f1('cartoon', 'o') # --> (4, 5)
You can always add error checking as in:
def f1(w, c):
    if c not in w: return ()
    else: return zip(* filter(lambda (x,y): x == c, zip(w, range(len(w))) ))[1]
If the character isn't found in the string, you just get an empty tuple. Otherwise, you get all elements that match. If you want something generic, counting on the fact that there will only be one or two instances of a character is not the right way to go about it. For example:
In [18]: f1('supercalifragilisticexpialidocious', 'i')
Out[18]: (8, 13, 15, 18, 23, 26, 30)
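The one-liner above relies on tuple parameters in lambda, which only work in Python 2. A possible Python 3 sketch with the same results uses enumerate in a generator expression; the name all_indices is an assumption for this illustration.

```python
def all_indices(word, char):
    # Hypothetical Python 3 counterpart of f1: tuple of all positions of char.
    return tuple(i for i, c in enumerate(word) if c == char)

print(all_indices('cartoon', 'o'))
print(all_indices('supercalifragilisticexpialidocious', 'i'))
```

No separate error check is needed in this form: when the character does not occur, the result is simply an empty tuple.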
Here is another example.
a = "samesame"
po = -1  # start at -1 so that the first search (po + 1) begins at index 0
for c in a:
    if c == 's':  # 's' is the character we want to find
        po = a.index(c, po + 1)  # search again, starting just after the previous match
        print(po)
Apologies if this answer is not properly formatted or I've messed up somewhere as I am new here and this is my first post. I have used the following code in my own Hangman games to get the index of multiple repeating letters in a word and has worked great. Hopefully a newcomer will understand this ok.
a = "hangman"  # the chosen word
length = len(a)  # determines length of chosen word
for i in range(length):  # this will loop through the code length number of times
    if a[i] == "n":  # n is the player's guess. Checks if the letter is at index i
        po = a.index("n", i)  # po gets the index of the letter if previous line is true
        print(po)  # prints the position/s
Hope this helps someone!
def findcharpos(string, character, position=None):
    array = []
    index = -1
    while True:
        try:
            index = string.index(character, index+1)
        except ValueError:
            break
        else:
            array.append(index)
    if position is None:  # no position requested: return every index found
        return array
    elif position > len(array):
        raise ValueError(f"The character {character} does not occur {position} times in the {string}")
    else:
        return array[position-1]
msg = 'samesame'
print(msg.index('s', 2)) # the search starts at index 2, so this prints 4, the index of the second 's'
I love the ability to define a parseAction with pyparsing, but I've run into a roadblock for a particular use case. Take the input string and the following simple grammar:
from pyparsing import *
line = "[[one [two [three] two [three [four]]] one] zero]"
token = Word(alphas)
# Define the simple recursive grammar
grammar = Forward()
nestedBrackets = nestedExpr('[', ']', content=grammar)
grammar << (token | nestedBrackets)
P = grammar.parseString(line)
print P
I'd like the results to be:
[[('one',1), [('two',2), [('three',3)], ('two',2), [('three',3), [('four',4)]]], ('one',1)], ('zero',0)]
i.e. parse each token and return a tuple with the token and the depth. I know that this can be done post-parse, but I want to know if it is possible to do with a parseAction. Here was my incorrect attempt with a global variable:
# Try to count the depth
counter = 0
def action_token(x):
    global counter
    counter += 1
    return (x[0],counter)
token.setParseAction(action_token)
def action_nest(x):
    global counter
    counter -= 1
    return x[0]
nestedBrackets.setParseAction(action_nest)
Giving:
[('one', 1), ('two', 2), ('three', 3), ('two', 3), ('three', 4), ('four', 5), ('one', 3), ('zero', 3)]
Do this (leaving the rest as you have it):
def openB(s, l, t):
    global count
    count += 1
def closeB(s, l, t):
    global count
    count -= 1
opener = Literal("[").setParseAction(openB)
closer = Literal("]").setParseAction(closeB)
nestedBrackets = nestedExpr(opener, closer, content=grammar)
The issue is that the nesting depends not on the number of nested groups matched, but on the number of open brackets matched versus the number of closed brackets matched. Therefore, you need to adjust the count when you parse the open/close brackets, not when you parse the group. So you need to set the parseAction on the group delimiters, not the group itself.
Also, your example has the nesting off by one level (at least by my eyes). The "zero" should really be one, since it is inside one level of brackets, and likewise everything else should be shifted up by one. If you really want that outermost "zero" to have level zero and so on, you need to initialize count to -1.
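The bracket-counting principle can be illustrated without pyparsing at all. The sketch below is plain Python, not a parseAction: it scans the input once, adjusts the depth on every opening and closing bracket, and tags each word with the depth at that point. The function name tokens_with_depth and the start_depth parameter are assumptions for this illustration, and the result is a flat list rather than the nested structure pyparsing would return.

```python
def tokens_with_depth(line, start_depth=0):
    # Hypothetical helper: flat list of (word, depth) pairs.
    depth = start_depth
    result = []
    word = ""
    for ch in line:
        if ch.isalpha():
            word += ch
            continue
        if word:                      # a word just ended: record it with the current depth
            result.append((word, depth))
            word = ""
        if ch == "[":
            depth += 1                # depth changes on the delimiters, not on the groups
        elif ch == "]":
            depth -= 1
    if word:
        result.append((word, depth))
    return result

line = "[[one [two [three] two [three [four]]] one] zero]"
print(tokens_with_depth(line, start_depth=-1))
```

Starting at -1 gives the outermost "zero" depth 0, matching the numbering in the question; starting at 0 shifts every depth up by one.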