Hello, so I have a Python function that works, but not in the way I expect, and I'm not sure where my code is off.
def preprocess(text):
    case = truecase.get_true_case(text)
    doc = nlp(case)
    return doc

def summarize_texts(texts):
    actions = {}
    entities = {}
    for item in texts:
        doc = preprocess(item)
        for token in doc:
            if token.pos_ == "VERB":
                actions[str.lower(token.text)] = actions.get(token.text, 0) +1
        for token in doc.ents:
            entities[token.label_] = [token.text]
            if token.text not in entities[token.label_]:
                entities[token.label_].append(token.text)
    return {
        'actions': actions,
        'entities': entities
    }
When I call the function on a list of sentences, this is the output I get:
docs = [
    "Play something by Billie Holiday, and play again",
    "Set a timer for five minutes",
    "Play it again, Sam"
]
summarize_texts(docs)
output: {'actions': {'play': 1, 'set': 1},
'entities': {'PERSON': ['Sam'], 'TIME': ['five minutes']}}
It's working in that it's finding the action keys and entity keys, but I am having two issues:
1. it's not counting the actions right
2. it's only storing the last value of each entity.
The output should be:
output: {'actions': {'play': 3, 'set': 1},
'entities': {'PERSON': ['Billie','Sam'], 'TIME': ['five minutes']}}
Any help would be AMAZING! I have a feeling it's something easy, but I'm just too brain-fried to see it.
You're replacing the data structures, not simply updating the values. You only want to create a new container if one does not exist at that point.
For actions:
if token.pos_ == "VERB":
    action_key = str.lower(token.text)
    if action_key not in actions:
        actions[action_key] = 0
    actions[action_key] += 1
For entities:
for token in doc.ents:
    entity_key = token.label_
    entity_value = token.text
    if entity_key not in entities:
        entities[entity_key] = []
    if entity_value not in entities[entity_key]:
        entities[entity_key].append(entity_value)
As a note, you can simplify this logic by using a defaultdict. You can also use a set rather than checking the list for duplicates each time:
from collections import defaultdict

actions = defaultdict(int)
entities = defaultdict(set)
...
if token.pos_ == "VERB":
    actions[token.text.lower()] += 1
...
for token in doc.ents:
    entities[token.label_].add(token.text)
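Putting the two snippets together, a minimal sketch of the corrected function (assuming nlp and truecase are initialized as in the question):

from collections import defaultdict

def summarize_texts(texts):
    actions = defaultdict(int)    # verb -> count, 0 by default
    entities = defaultdict(set)   # label -> unique entity texts
    for item in texts:
        doc = preprocess(item)
        for token in doc:
            if token.pos_ == "VERB":
                actions[token.text.lower()] += 1
        for ent in doc.ents:
            entities[ent.label_].add(ent.text)
    # convert back to plain dicts/lists for the expected output shape
    return {
        'actions': dict(actions),
        'entities': {label: sorted(values) for label, values in entities.items()},
    }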
You're not consistent in converting the token to lowercase. You use the lowercase version when assigning to the dictionary, but the original case when calling actions.get(). So if the token has mixed case, you'll keep on getting the default when you call actions.get(), and keep setting it to 1.
actions[token.text.lower()] = actions.get(token.text.lower(), 0) + 1
Related
I have a list which contains a chat conversation between an agent and a customer.
chat = ['agent',
        'Hi',
        'how may I help you?',
        'customer',
        'I am facing issue with internet',
        'agent',
        'Can i know your name and mobile no.?',
        'customer',
        'john doe',
        '111111', ..... ]
This is a sample of the chat list. I am looking to divide the list into two parts, agent_chat and customer_chat, where agent_chat contains all the lines the agent said, and customer_chat contains the lines said by the customer.
Something like this (final output):
agent_chat = ['Hi','how may I help you?','Can i know your name and mobile no.?'...]
customer_chat = ['I am facing issue with internet','john doe','111111',...]
I'm facing issues while solving this. I tried using the list.index() method to split the chat list based on indexes, but I'm getting multiple values for the same index.
For example, the following snippet:
[chat.index(l) for l in chat if l=='agent']
Displays [0, 0], since it's only giving me the first occurrence.
Is there a better way to achieve the desired output?
index() returns only the first index of the element, so you'll need to accumulate the indexes of all occurrences by iterating over the list.
I would suggest solving this with a simple for loop:
agent_chat = []
customer_chat = []
chat_type = 'agent'
for chat in chats:
    if chat in ['agent', 'customer']:
        chat_type = chat
        continue
    if chat_type == 'agent':
        agent_chat.append(chat)
    else:
        customer_chat.append(chat)
Other approaches like list comprehension will require two iterations of the list.
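For completeness, here's a sketch of what a comprehension-based version could look like; it still needs a first pass to work out who is speaking at each position:

# First pass: record the current speaker for every position in the list.
speakers = []
current = None
for message in chats:
    if message in ('agent', 'customer'):
        current = message
    speakers.append(current)

# Second pass: drop the marker lines and split the rest by speaker.
agent_chat = [m for m, s in zip(chats, speakers)
              if s == 'agent' and m not in ('agent', 'customer')]
customer_chat = [m for m, s in zip(chats, speakers)
                 if s == 'customer' and m not in ('agent', 'customer')]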
This would be my solution to your problem.
chat = ['agent',
        'Hi',
        'how may I help you?',
        'customer',
        'I am facing issue with internet',
        'agent',
        'Can i know your name and mobile no.?',
        'customer',
        'john doe',
        '111111']
agent_list = []
customer_list = []
agent = False
customer = False

for message in chat:
    if message == 'agent':
        agent = True
        customer = False
    elif message == 'customer':
        agent = False
        customer = True
    elif agent:
        agent_list.append(message)
    elif customer:
        customer_list.append(message)
    else:
        pass
Here is my solution. I don't know if it's the best one, but I hope it helps.
def chat_lists(chat):
    agent_chat = []
    customer_chat = []
    user_flag = ""
    for message in chat:
        if message == 'agent':
            user_flag = 'agent'
        elif message == 'customer':
            user_flag = 'customer'
        else:
            if user_flag == 'agent':
                agent_chat.append(message)
            else:
                customer_chat.append(message)
    return customer_chat, agent_chat
You can do something like this.
chat = ['agent',
        'Hi',
        'how may I help you?',
        'customer',
        'I am facing issue with internet',
        'agent',
        'Can i know your name and mobile no.?',
        'customer',
        'john doe',
        '111111']
agent = []
customer = []
for j in chat:
    if j == 'agent':
        curr = 'agent'
        continue
    if j == 'customer':
        curr = 'customer'
        continue
    if curr == 'agent':
        agent.append(j)
    else:
        customer.append(j)
print(agent)
print(customer)
You can set up a loop to parse through the messages, and a variable to act as a 'switch' for whether the agent or the customer is talking.
# Get the current speaker (first speaker)
current_speaker = chat[0]

# Make the chat logs
agent_chat = []
customer_chat = []

# Iterate over the messages
for message in chat:
    # If the current speaker is updated
    if message in ['agent', 'customer']:
        # then update the speaker
        current_speaker = message
        # and skip to the next iteration
        continue
    # Add the message based on the current speaker
    if current_speaker == 'agent':
        agent_chat.append(message)
    else:
        customer_chat.append(message)
Another note to keep in mind is that this whole system will bug out heavily if a customer or agent decides, for whatever reason, to type in the word 'agent' or 'customer'.
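If you control the log format, a more robust option (purely illustrative, not from the question) is to store (speaker, message) pairs, so a message can never be mistaken for a marker:

chat = [
    ('agent', 'Hi'),
    ('agent', 'how may I help you?'),
    ('customer', 'I am facing issue with internet'),
]
agent_chat = [msg for speaker, msg in chat if speaker == 'agent']
customer_chat = [msg for speaker, msg in chat if speaker == 'customer']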
I'm performing what I imagine is a common pattern with indexing graph databases: my data is a list of edges, and I want to "stream" the upload of this data. I.e., for each edge, I want to create the two nodes on each side and then create the edge between them; I don't want to first upload all the nodes and then link them afterwards. A naive implementation would obviously result in a lot of duplicate nodes. Therefore, I want to implement some sort of "get_or_create" to avoid duplication.
My current implementation is below, using pyArango:
def get_or_create_graph(self):
    db = self._get_db()
    if db.hasGraph('citator'):
        self.g = db.graphs["citator"]
        self.judgment = db["judgment"]
        self.citation = db["citation"]
    else:
        self.judgment = db.createCollection("judgment")
        self.citation = db.createCollection("citation")
        self.g = db.createGraph("citator")

def get_or_create_node_object(self, name, vertex_data):
    object_list = self.judgment.fetchFirstExample(
        {"name": name}
    )
    if object_list:
        node = object_list[0]
    else:
        node = self.g.createVertex('judgment', vertex_data)
        node.save()
    return node
My problems with this solution are:
Since the application, not the database, is checking existence, there could be an insertion between the existence check and the creation. I have found duplicate nodes in practice; I suspect this is why.
It isn't very fast, probably because it potentially hits the DB twice.
I am wondering whether there is a faster and/or more atomic way to do this, ideally a native ArangoDB query. Suggestions? Thank you.
Update
As requested, calling code shown below. It's in a Django context, where Link is a Django model (i.e. data in a database):
... # Class definitions etc.

links = Link.objects.filter(dirty=True)
for i, batch in enumerate(batch_iterator(links, limit=LIMIT, batch_size=ITERATOR_BATCH_SIZE)):
    for link in batch:
        source_name = cleaner.clean(link.case.mnc)
        target_name = cleaner.clean(link.citation.case.mnc)
        if source_name == target_name:
            continue
        source_data = _serialize_node(link.case)
        target_data = _serialize_node(link.citation.case)
        populate_pair(citation_manager, source_name, source_data, target_name, target_data, link)
def populate_pair(citation_manager, source_name, source_data, target_name, target_data, link):
    source_node = citation_manager.get_or_create_node_object(
        source_name,
        source_data
    )
    target_node = citation_manager.get_or_create_node_object(
        target_name,
        target_data
    )
    description = source_name + " to " + target_name
    citation_manager.populate_link(source_node, target_node, description)
    link.dirty = False
    link.save()
And here's a sample of what the data looks like after cleaning and serializing:
source_data: {'name': 'P v R A Fu', 'court': 'ukw', 'collection': 'uf', 'number': 'CA 139/2009', 'tag': 'NA', 'node_id': 'uf89638', 'multiplier': '5.012480529547776', 'setdown_year': 0, 'judgment_year': 0, 'phantom': 'false'}
target_data: {'name': 'Ck v R A Fu', 'court': 'ukw', 'collection': 'uf', 'number': '10/22147', 'tag': 'NA', 'node_id': 'uf67224', 'multiplier': '1.316227766016838', 'setdown_year': 0, 'judgment_year': 0, 'phantom': 'false'}
source_name: [2010] ZAECGHC 9
target_name: [2012] ZAGPJHC 189
I don't know how to do it with the Python driver, but this could be done using AQL:

FOR doc IN judgment
    FILTER doc.name == "name"
    LIMIT 1
    INSERT MERGE(vertexObject, { _from: doc._id }) INTO citator

The vertexObject needs to be an AQL object with at least the _to value. Note: there may be typos, I'm answering from my phone.
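On the atomicity point: AQL also has an UPSERT operation, which performs the lookup and the conditional insert in a single server-side statement, avoiding the check-then-create race in the application. A sketch of how that might be called from pyArango (the bind variable names here are illustrative):

# Look the vertex up by name; insert it only if it does not exist yet,
# and return the resulting document either way.
aql = """
UPSERT { name: @name }
INSERT @vertex
UPDATE {}
IN judgment
RETURN NEW
"""
result = db.AQLQuery(aql, bindVars={'name': name, 'vertex': vertex_data},
                     rawResults=True)
node = result[0]

A unique index on name is also worth adding as a backstop, so that concurrent writers cannot create duplicates even outside this query.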
So I have two functions. The first takes a string parameter and converts it into spacy tokens.
def preprocess(texts):
    case = truecase.get_true_case(texts)
    doc = nlp(case)
    return doc
The next calls that function and processes the text into aggregated dictionaries.
def summarize_texts(texts):
    doc = preprocess(texts)  # another function that took text and processed it as a spacy doc
    actions = {}
    entities = {}
    for token in doc:
        if token.pos_ == "VERB":
            actions[token.lemma_] = actions.get(token.text, 0) +1
    for token in doc.ents:
        entities[token.label_] = [token.text]
    return {
        'actions': actions,
        'entities': entities
    }
So that when you call the function, you'll get these results:
summarize_texts("Play it again, Sam")
output: {'actions': {'play': 1}, 'entities': {'PERSON': ['Sam']}}
The issue I'm having is that my functions only work with a single string parameter, but fail if I give them a list of sentences such as:
["Play something by Billie Holiday",
"Set a timer for five minutes",
"Play it again, Sam"]
and I'm not sure how to get it to work the way I want.
For example, if I called
summarize_texts(["Play it again, Sam", "Play something by Billie Holiday"])
output: {'actions': {'play': 2}, 'entities': {'PERSON': ['Sam', 'Billie']}}
However, if I run
docs = [
    "Play something by Billie Holiday",
    "Set a timer for five minutes",
    "Play it again, Sam"
]
summarize_texts(docs)
output is:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-46-200347d5cac5> in <module>()
4 "Play it again, Sam"
5 ]
----> 6 summarize_texts(docs)
5 frames
/usr/local/lib/python3.6/dist-packages/nltk/tokenize/casual.py in _replace_html_entities(text, keep, remove_illegal, encoding)
257 return "" if remove_illegal else match.group(0)
258
--> 259 return ENT_RE.sub(_convert_entity, _str_to_unicode(text, encoding))
260
261
TypeError: expected string or bytes-like object
You can check the type of the input. Here I am checking whether it is a str or a list, and if it is a str, I create a list with just that one sentence. Your output will then be a list of results. [Optional: return just the one result if there was only one input:]
return result[0] if len(result)==1 else result
def preprocess(texts):
    case = truecase.get_true_case(texts)
    doc = nlp(case)
    return doc

def summarize_texts(texts):
    if isinstance(texts, str):
        texts = [texts]
    result = []
    for text in texts:
        doc = preprocess(text)  # another function that took text and processed it as a spacy doc
        actions = {}
        entities = {}
        for token in doc:
            if token.pos_ == "VERB":
                # look the count up by the same key we store under (the lemma)
                actions[token.lemma_] = actions.get(token.lemma_, 0) + 1
        for token in doc.ents:
            entities[token.label_] = [token.text]
        result.append({
            'actions': actions,
            'entities': entities
        })
    return result
print(summarize_texts("Play it again, Sam"))
print(summarize_texts(["Play something by Billie Holiday", "Set a timer for five minutes", "Play it again, Sam"]))
I'm working on a custom Ansible dynamic inventory python script. I have created groups from k=v pairs, but for certain groups, I want the key prefixed to the values, otherwise the group names are meaningless (1,2,3, etc.)
I've tried sticking the key name in various places, but without a proper understanding of what I'm doing. In the example below, I am trying to get the "bucket" group to have every value look something like bucket_3 (which would then be the Ansible group name).
result = {
    'all': {
        'hosts': [],
        'vars': {},
    },
    '_meta': {
        'hostvars': {}
    }
}

server = ''
for raw_line in output.split('\n'):
    line = raw_line.strip()
    if len(line) > 0 and not line.startswith(comment_char):
        if line.endswith(server_char):
            server = line[:-1]
            result['all']['hosts'].append(server)
            result['_meta']['hostvars'][server] = {}
        else:
            raw_key, raw_value = line.split('=', 1)
            key = raw_key.strip()
            value = raw_value.strip()
            result['_meta']['hostvars'][server][key] = value
            if key == 'ansible_groups':
                for group in value.split(","):
                    if group not in result.keys():
                        result[group] = {'hosts': [], 'vars': {}}
                    result[group]['hosts'].append(server)
            if key == 'bucket':
                for group in value:
                    if group not in result.keys():
                        result[group] = 'bucket_' + {'hosts': [], 'vars': {}}
                    result[group]['hosts'].append(server)
I expect to get groups such as bucket_1, bucket_2, etc. (The source has 'bucket = 1', 'bucket = 2', etc.).
Getting error "'bucket_' + {'hosts': [], 'vars': {}} TypeError: cannot concatenate 'str' and
'dict' objects"
Granted, this is just my latest attempt, so the errors have varied as I try to find the correct way to modify the group name.
Never mind... I was just not thinking.
if key == 'bucket':
    for group in value:
        group = 'bucket_' + group
        if group not in result.keys():
            result[group] = {'hosts': [], 'vars': {}}
        result[group]['hosts'].append(server)
Still a bit slower than I would like, but it is functional.
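As a side note, dict.setdefault can fold the membership test and the container creation into one step. A sketch of the same branch (treating the whole value as one group name, which matches the 'bucket = 1' source lines):

if key == 'bucket':
    group = 'bucket_' + value
    # setdefault returns the existing entry, or inserts and returns the default
    result.setdefault(group, {'hosts': [], 'vars': {}})['hosts'].append(server)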
I was trying to fetch auto scaling groups whose Application tag value is 'CCC'.
The list is as below,
Application tag    ASG name
gweb               prd-dcc-eap-w2
gweb               prd-dcc-emc
gweb               prd-dcc-ems
CCC                dev-ccc-wer
CCC                dev-ccc-gbg
CCC                dev-ccc-wer
The script I coded below gives output which includes one ASG without the CCC tag.
#!/usr/bin/python
import boto3

client = boto3.client('autoscaling', region_name='us-west-2')
response = client.describe_auto_scaling_groups()
ccc_asg = []
all_asg = response['AutoScalingGroups']
for i in range(len(all_asg)):
    all_tags = all_asg[i]['Tags']
    for j in range(len(all_tags)):
        if all_tags[j]['Key'] == 'Name':
            asg_name = all_tags[j]['Value']
            # print asg_name
        if all_tags[j]['Key'] == 'Application':
            app = all_tags[j]['Value']
            # print app
            if all_tags[j]['Value'] == 'CCC':
                ccc_asg.append(asg_name)
print ccc_asg
The output which I am getting is as below,
['prd-dcc-ein-w2', 'dev-ccc-hap', 'dev-ccc-wfd', 'dev-ccc-sdf']
Whereas 'prd-dcc-ein-w2' is an ASG with a different tag, 'gweb', and the last one (dev-ccc-msp-agt-asg) in the CCC ASG list is missing. I need output as below:
dev-ccc-hap-sdf
dev-ccc-hap-gfh
dev-ccc-hap-tyu
dev-ccc-mso-hjk
Am I missing something?
In boto3 you can use Paginators with JMESPath filtering to do this very effectively and in a more concise way.
From the boto3 docs:
JMESPath is a query language for JSON that can be used directly on paginated results. You can filter results client-side using JMESPath expressions that are applied to each page of results through the search method of a PageIterator.
When filtering with JMESPath expressions, each page of results that is yielded by the paginator is mapped through the JMESPath expression. If a JMESPath expression returns a single value that is not an array, that value is yielded directly. If the result of applying the JMESPath expression to a page of results is a list, then each value of the list is yielded individually (essentially implementing a flat map).
Here is how it looks in Python code, with the mentioned CCC value for the Application tag of the Auto Scaling Group:
import boto3

client = boto3.client('autoscaling')
paginator = client.get_paginator('describe_auto_scaling_groups')
page_iterator = paginator.paginate(
    PaginationConfig={'PageSize': 100}
)
filtered_asgs = page_iterator.search(
    'AutoScalingGroups[] | [?contains(Tags[?Key==`{}`].Value, `{}`)]'.format(
        'Application', 'CCC')
)
for asg in filtered_asgs:
    print asg['AutoScalingGroupName']
Elaborating on Michal Gasek's answer, here's an option that filters ASGs based on a dict of tag:value pairs.
def get_asg_name_from_tags(tags):
    asg_name = None
    client = boto3.client('autoscaling')
    while True:
        paginator = client.get_paginator('describe_auto_scaling_groups')
        page_iterator = paginator.paginate(
            PaginationConfig={'PageSize': 100}
        )
        filter = 'AutoScalingGroups[]'
        for tag in tags:
            filter = ('{} | [?contains(Tags[?Key==`{}`].Value, `{}`)]'
                      .format(filter, tag, tags[tag]))
        filtered_asgs = page_iterator.search(filter)
        asg = filtered_asgs.next()
        asg_name = asg['AutoScalingGroupName']
        try:
            asgX = filtered_asgs.next()
            asgX_name = asgX['AutoScalingGroupName']  # read the name from the second match
            raise AssertionError('multiple ASG\'s found for {} = {},{}'
                                 .format(tags, asg_name, asgX_name))
        except StopIteration:
            break
    return asg_name
eg:
asg_name = get_asg_name_from_tags({'Env':env, 'Application':'app'})
It expects there to be only one result and checks this by trying to use next() to get another. The StopIteration is the "good" case, which then breaks out of the paginator loop.
I got it working with the script below.
#!/usr/bin/python
import boto3

client = boto3.client('autoscaling', region_name='us-west-2')
response = client.describe_auto_scaling_groups()
ccc_asg = []
all_asg = response['AutoScalingGroups']
for i in range(len(all_asg)):
    all_tags = all_asg[i]['Tags']
    app = False
    asg_name = ''
    for j in range(len(all_tags)):
        if all_tags[j]['Key'] == 'Application' and all_tags[j]['Value'] == 'CCC':
            app = True
        if app:
            if all_tags[j]['Key'] == 'Name':
                asg_name = all_tags[j]['Value']
                ccc_asg.append(asg_name)
print ccc_asg
Feel free to ask if you have any doubts.
The right way to do this isn't via describe_auto_scaling_groups at all but via describe_tags, which will allow you to make the filtering happen on the server side.
You can construct a filter that asks for the tag key Application with any of a number of values:
Filters=[
    {
        'Name': 'key',
        'Values': [
            'Application',
        ]
    },
    {
        'Name': 'value',
        'Values': [
            'CCC',
        ]
    },
],
And then your results (in Tags in the response) are all the times when a matching tag is applied to an autoscaling group. You will have to make the call multiple times, passing back NextToken every time there is one, to go through all the pages of results.
Each result includes an ASG ID that the matching tag is applied to. Once you have all the ASG IDs you are interested in, then you can call describe_auto_scaling_groups to get their names.
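A sketch of that approach, using a paginator so the NextToken handling is automatic (for auto scaling groups, the ResourceId on each returned tag is the group name):

import boto3

client = boto3.client('autoscaling', region_name='us-west-2')

# Ask the server for only the tags we care about.
asg_names = set()
paginator = client.get_paginator('describe_tags')
for page in paginator.paginate(Filters=[
        {'Name': 'key', 'Values': ['Application']},
        {'Name': 'value', 'Values': ['CCC']},
]):
    for tag in page['Tags']:
        asg_names.add(tag['ResourceId'])  # the ASG the tag is applied to

# Then fetch the full descriptions of just those groups if needed.
if asg_names:
    details = client.describe_auto_scaling_groups(
        AutoScalingGroupNames=list(asg_names))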
Yet another solution, in my opinion simple enough to extend:
import boto3

client = boto3.client('autoscaling')
search_tags = {"environment": "stage"}
filtered_asgs = []
response = client.describe_auto_scaling_groups()
for group in response['AutoScalingGroups']:
    flattened_tags = {
        tag_info['Key']: tag_info['Value']
        for tag_info in group['Tags']
    }
    # dict items comparison: True when search_tags is a subset of the group's tags
    if search_tags.items() <= flattened_tags.items():
        filtered_asgs.append(group)
print(filtered_asgs)
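One caveat: describe_auto_scaling_groups only returns a single page of results, so with many groups some will be missed. The same loop works on top of a paginator (a sketch):

paginator = client.get_paginator('describe_auto_scaling_groups')
for page in paginator.paginate():
    for group in page['AutoScalingGroups']:
        flattened_tags = {
            tag_info['Key']: tag_info['Value']
            for tag_info in group['Tags']
        }
        if search_tags.items() <= flattened_tags.items():
            filtered_asgs.append(group)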