Automatic labeling of LDA generated topics - python

I'm trying to categorize customer feedback, and I ran an LDA in Python and got the following output for 10 topics:
(0, u'0.559*"delivery" + 0.124*"area" + 0.018*"mile" + 0.016*"option" + 0.012*"partner" + 0.011*"traffic" + 0.011*"hub" + 0.011*"thanks" + 0.010*"city" + 0.009*"way"')
(1, u'0.397*"package" + 0.073*"address" + 0.055*"time" + 0.047*"customer" + 0.045*"apartment" + 0.037*"delivery" + 0.031*"number" + 0.026*"item" + 0.021*"support" + 0.018*"door"')
(2, u'0.190*"time" + 0.127*"order" + 0.113*"minute" + 0.075*"pickup" + 0.074*"restaurant" + 0.031*"food" + 0.027*"support" + 0.027*"delivery" + 0.026*"pick" + 0.018*"min"')
(3, u'0.072*"code" + 0.067*"gps" + 0.053*"map" + 0.050*"street" + 0.047*"building" + 0.043*"address" + 0.042*"navigation" + 0.039*"access" + 0.035*"point" + 0.028*"gate"')
(4, u'0.434*"hour" + 0.068*"time" + 0.034*"min" + 0.032*"amount" + 0.024*"pay" + 0.019*"gas" + 0.018*"road" + 0.017*"today" + 0.016*"traffic" + 0.014*"load"')
(5, u'0.245*"route" + 0.154*"warehouse" + 0.043*"minute" + 0.039*"need" + 0.039*"today" + 0.026*"box" + 0.025*"facility" + 0.025*"bag" + 0.022*"end" + 0.020*"manager"')
(6, u'0.371*"location" + 0.110*"pick" + 0.097*"system" + 0.040*"im" + 0.038*"employee" + 0.022*"evening" + 0.018*"issue" + 0.015*"request" + 0.014*"while" + 0.013*"delivers"')
(7, u'0.182*"schedule" + 0.181*"please" + 0.059*"morning" + 0.050*"application" + 0.040*"payment" + 0.026*"change" + 0.025*"advance" + 0.025*"slot" + 0.020*"date" + 0.020*"tomorrow"')
(8, u'0.138*"stop" + 0.110*"work" + 0.062*"name" + 0.055*"account" + 0.046*"home" + 0.043*"guy" + 0.030*"address" + 0.026*"city" + 0.025*"everything" + 0.025*"feature"')
Is there a way to automatically label them? I do have a CSV file in which the feedback has been manually labeled, but I do not want to supply these labels myself. I want the model to create the labels. Is that possible?

The comments here link to another SO answer that links to a paper. Suppose you wanted to do the minimum needed to make this work. Here is an MVP-style solution that has worked for me: search Google for the topic terms, then look for keywords in the response.
Here is some working, though hacky, code:
pip install cssselect
then
import re
from urllib.parse import urlencode
from lxml.html import fromstring
from requests import get
from collections import Counter

def get_srp_text(search_term):
    # Fetch the Google search results page for the topic terms
    # (use the function argument, not a global variable)
    query = urlencode({"q": search_term})
    raw = get(f"https://www.google.com/search?{query}").text
    page = fromstring(raw)
    blob = ""
    for result in page.cssselect("a"):
        for res in result.findall("div"):
            blob += ' '
            blob += res.text if res.text else " "
            blob += ' '
    return blob

def blob_cleaner(blob):
    # Replace common punctuation with spaces, then keep only
    # alphanumerics and whitespace
    clean_blob = re.sub(r'[/():_\-,]', ' ', blob)
    return ''.join(e for e in clean_blob if e.isalnum() or e.isspace())

def get_name_from_srp_blob(clean_blob):
    # Keep tokens longer than two characters and name the topic
    # after the two most frequent ones
    blob_tokens = list(filter(bool, map(lambda x: x if len(x) > 2 else '', clean_blob.split(' '))))
    c = Counter(blob_tokens)
    most_common = c.most_common(10)
    name = f"{most_common[0][0]}-{most_common[1][0]}"
    return name

pipeline = lambda x: get_name_from_srp_blob(blob_cleaner(get_srp_text(x)))
Then you can just get the topic words from your model, e.g.
topic_terms = "delivery area mile option partner traffic hub thanks city way"
name = pipeline(topic_terms)
print(name)
>>> City-Transportation
and
topic_terms = "package address time customer apartment delivery number item support door"
name = pipeline(topic_terms)
print(name)
>>> Parcel-Package
You could improve this a lot. For example, you could use POS tags to find only the most common nouns, then use those for the name. Or find the most common adjective and noun, and make the name "Adjective Noun". Even better, you could get the text from the linked sites, then run YAKE to extract keywords (a rough sketch of that idea is below).
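For instance, here is a minimal, untested sketch of the YAKE idea. It assumes the yake package is installed (pip install yake) and simply labels a topic with the top keywords extracted from the cleaned search-result text instead of raw token counts:
import yake

def yake_name(blob, top_n=2):
    # YAKE returns (keyword, score) pairs; a lower score means more relevant
    extractor = yake.KeywordExtractor(lan="en", n=1, top=top_n)
    keywords = extractor.extract_keywords(blob)
    return "-".join(kw for kw, _ in keywords)

# e.g. yake_name(blob_cleaner(get_srp_text(topic_terms)))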
Regardless, this demonstrates a simple way to automatically name clusters without directly using machine learning (though Google is most certainly using it to generate the search results, so you are benefitting from it).

Related

Urdu language dataset for aspect-based sentiment analysis

When I run my code I get this error. What is it caused by?
text_raw_indices = tokenizer.text_to_sequence(text_left + " " + aspect + " " + text_right)
text_raw_without_aspect_indices = tokenizer.text_to_sequence(text_left + " " + text_right)
text_left_indices = tokenizer.text_to_sequence(text_left)
text_left_with_aspect_indices = tokenizer.text_to_sequence(text_left + " " + aspect)
text_right_indices = tokenizer.text_to_sequence(text_right, reverse=True)
text_right_with_aspect_indices = tokenizer.text_to_sequence(" " + aspect + " " + text_right, reverse=True)
aspect_indices = tokenizer.text_to_sequence(aspect)
left_context_len = np.sum(text_left_indices != 0)
aspect_len = np.sum(aspect_indices != 0)
aspect_in_text = torch.tensor([left_context_len.item(), (left_context_len + aspect_len - 1).item()])
polarity = int(polarity) + 1
Just use LASER and you'll be fine. It covers Urdu as well.
You can read more here:
https://engineering.fb.com/ai-research/laser-multilingual-sentence-embeddings/
https://github.com/facebookresearch/LASER
There's also an unofficial PyPI package here. It substitutes some inner dependencies, but still works as expected (a minimal usage sketch follows).
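As an illustration only, here is a small sketch that assumes the unofficial package in question is laserembeddings (pip install laserembeddings, then python -m laserembeddings download-models); the package name and commands are an assumption on my part, not something stated above:
from laserembeddings import Laser

laser = Laser()

# Embed Urdu sentences into LASER's language-agnostic space;
# the result is an (N, 1024) NumPy array of sentence vectors
sentences = ["یہ فون بہت اچھا ہے"]  # example Urdu sentence ("this phone is very good")
embeddings = laser.embed_sentences(sentences, lang="ur")
print(embeddings.shape)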
And the most important question, so we can better help you: what are you trying to achieve, and what is your final goal?

Change API call in for loop python

I am trying to call several API datasets within a for loop, changing the call each time, and then append those datasets together into a larger dataframe.
I have written this code, which works for the first dataset, but it then returns this error for the next call:
`url = base + "max=" + maxrec + "&" "type=" + item + "&" + "freq=" + freq + "&" + "px=" +px + "&" + "ps=" + str(ps) + "&" + "r="+ r + "&" + "p=" + p + "&" + "rg=" +rg + "&" + "cc=" + cc + "&" + "fmt=" + fmt
TypeError: must be str, not Response`
Here is my current code
import requests
import pandas as pd

base = "http://comtrade.un.org/api/get?"
maxrec = "50000"
item = "C"
freq = "A"
px="H0"
ps="all"
r="all"
p="0"
rg="2"
cc="AG2"
fmt="json"

comtrade = pd.DataFrame(columns=[])

for year in range(1991,2018):
    ps="{}".format(year)
    url = base + "max=" + maxrec + "&" "type=" + item + "&" + "freq=" + freq + "&" + "px=" +px + "&" + "ps=" + str(ps) + "&" + "r="+ r + "&" + "p=" + p + "&" + "rg=" +rg + "&" + "cc=" + cc + "&" + "fmt=" + fmt
    r = requests.get(url)
    x = r.json()
    new = pd.DataFrame(x["dataset"])
    comtrade = comtrade.append(new)
Let requests assemble the URL for you.
common_params = {
    "max": maxrec,
    "type": item,
    "freq": freq,
    # etc
}

for year in range(1991, 2018):
    response = requests.get(base, params=dict(common_params, ps=str(year)))
    response_data = response.json()
    new = pd.DataFrame(response_data["dataset"])
    comtrade = comtrade.append(new)
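If you want to see exactly what URL requests assembled, the prepared URL is available on the response object. A small illustration (the printed URL is indicative only):
response = requests.get(base, params=dict(common_params, ps="1991"))
# prints something like http://comtrade.un.org/api/get?max=50000&type=C&freq=A&ps=1991
print(response.url)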
Disclaimer: the other answer is correct and you should use it.
However, your actual problem stems from the fact that you are overwriting r here:
r = requests.get(url)
x = r.json()
During the next iteration, r will still be that Response object rather than the string you originally initialized it with. You could simply rename the response variable to result to avoid that problem. Better to let the requests library do the work, though.
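For completeness, a minimal sketch of that rename, leaving the rest of the original loop unchanged:
for year in range(1991, 2018):
    ps = str(year)
    url = base + "max=" + maxrec + "&type=" + item + "&freq=" + freq + "&px=" + px + "&ps=" + ps + "&r=" + r + "&p=" + p + "&rg=" + rg + "&cc=" + cc + "&fmt=" + fmt
    result = requests.get(url)  # no longer clobbers the query parameter r
    new = pd.DataFrame(result.json()["dataset"])
    comtrade = comtrade.append(new)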

Adding the values of two strings using Python and XML path

It writes wallTime and setupwalltime to a .dat file, which has the following format:
24000 4 0
81000 17 0
192000 59 0
648000 250 0
1536000 807 0
3000000 2144 0
6591000 5699 0
I would like to know how to add the two values (i.e. wallTime and setupwalltime) together. Can someone give me a hint? I tried converting them to float, but it doesn't seem to work.
import libxml2
import os.path
from numpy import *
from cfs_utils import *

np = [1, 2, 3, 4, 5, 6, 7, 8]
n = [20, 30, 40, 60, 80, 100, 130]
solver = ["BiCGSTABL_iluk", "BiCGSTABL_saamg", "BiCGSTABL_ssor", "CG_iluk", "CG_saamg", "CG_ssor"]  # , "cholmod", "ilu"]
file_list = ["eval_BiCGSTABL_iluk_default", "eval_BiCGSTABL_saamg_default", "eval_BiCGSTABL_ssor_default", "eval_CG_iluk_default", "eval_CG_saamg_default", "eval_CG_ssor_default"]  # "simp_cholmod_solver_3D_evaluate", "simp_ilu_solver_3D_evaluate"]

for cnt_np in np:
    i = 0
    for sol in solver:
        # open write_file = "Graphs/" + "Np" + cnt_np + "/CG_iluk.dat"
        # e.g. "Graphs/Np1/CG_iluk.dat"
        write_file = open("Graphs/" + "Np" + str(cnt_np) + "/" + sol + ".dat", "w")
        print("Reading " + "Graphs/" + "Np" + str(cnt_np) + "/" + sol + ".dat" + "\n")
        # loop through different unknowns
        for cnt_n in n:
            # open file "cfs_calculations_" + cnt_n + "np" + cnt_np + "/" + file_list(i) + "_default.info.xml"
            read_file = "cfs_calculations_" + str(cnt_n) + "np" + str(cnt_np) + "/" + file_list[i] + ".info.xml"
            print("File list " + file_list[i] + " value of i " + str(i) + "\n")
            print("Reading " + "cfs_calculations_" + str(cnt_n) + "np" + str(cnt_np) + "/" + file_list[i] + ".info.xml")
            # read wall and cpu time and write
            if os.path.exists(read_file):
                doc = libxml2.parseFile(read_file)
                xml = doc.xpathNewContext()
                walltime = xpath(xml, "//cfsInfo/sequenceStep/OLAS/mechanic/solver/summary/solve/timer/#wall")
                setupwalltime = xpath(xml, "//cfsInfo/sequenceStep/OLAS/mechanic/solver/summary/setup/timer/#wall")
                # cputime = xpath(xml, "//cfsInfo/sequenceStep/OLAS/mechanic/solver/summary/solve/timer/#cpu")
                # setupcputime = xpath(xml, "//cfsInfo/sequenceStep/OLAS/mechanic/solver/summary/solve/timer/#cpu")
                unknowns = 3 * cnt_n * cnt_n * cnt_n
                write_file.write(str(unknowns) + "\t" + walltime + "\t" + setupwalltime + "\n")
                print("Writing_point " + str(unknowns) + " %f", float(setupwalltime))
                doc.freeDoc()
                xml.xpathFreeContext()
        write_file.close()
        i = i + 1
In Java you can add strings and floats directly, but not in Python. What I understand is that you need to add the values and then display them. That would work if you convert the sum back to a string:
write_file.write(str(unknowns) + "\t" + str(float(walltime) + float(setupwalltime)) + "\n")
You are trying to add a str to a float. That doesn't work. If you want to use string concatenation, first coerce all of the values to str. Try this:
write_file.write(str(unknowns) + "\t" + str(float(walltime) + float(setupwalltime)) + "\n")
Or, perhaps more readably:
totalwalltime = float(walltime) + float(setupwalltime)
write_file.write("{}\t{}\n".format(unknowns, totalwalltime))

Naming LDA topics in Python

I am new to Python and trying to implement topic modelling. I have successfully implemented LDA in Python using gensim, but I am not able to give any label/name to these topics.
How do we name these topics? Please help with the best way to implement this in Python.
My LDA output is somewhat like this (please let me know if you need the code):
0.024*research + 0.021*students + 0.019*conference + 0.019*chi + 0.017*field + 0.014*work + 0.013*student + 0.013*hci + 0.013*group + 0.013*researchers
0.047*research + 0.034*students + 0.020*ustars + 0.018*underrepresented + 0.017*participants + 0.012*researchers + 0.012*mathematics + 0.012*graduate + 0.012*mathematical + 0.012*conference
0.027*students + 0.026*research + 0.018*conference + 0.017*field + 0.015*new + 0.014*participants + 0.013*chi + 0.012*robotics + 0.010*researchers + 0.010*student
0.023*students + 0.019*robotics + 0.018*conference + 0.017*international + 0.016*interact + 0.016*new + 0.016*ph.d. + 0.016*meet + 0.016*ieee + 0.015*u.s.
0.033*research + 0.030*flow + 0.028*field + 0.023*visualization + 0.020*challenges + 0.017*students + 0.015*project + 0.013*shape + 0.013*visual + 0.012*data
0.044*research + 0.020*mathematics + 0.017*program + 0.014*june + 0.014*conference + 0.014*- + 0.013*mathematicians + 0.013*conferences + 0.011*field + 0.011*mrc
0.023*research + 0.021*students + 0.015*field + 0.014*hovering + 0.014*mechanisms + 0.014*dpiv + 0.013*aerodynamic + 0.012*unsteady + 0.012*conference + 0.012*hummingbirds
0.031*research + 0.018*mathematics + 0.016*program + 0.014*flow + 0.014*mathematicians + 0.012*conferences + 0.011*field + 0.011*june + 0.010*visualization + 0.010*communities
0.028*students + 0.028*research + 0.018*ustars + 0.018*mathematics + 0.015*underrepresented + 0.010*program + 0.010*encouraging + 0.010*'', + 0.010*participants + 0.010*conference
0.049*research + 0.021*conference + 0.021*program + 0.020*mathematics + 0.014*mathematicians + 0.013*field + 0.013*- + 0.011*conferences + 0.010*areas
Labeling topics is completely distinct from topic modeling. Here's an article that describes using a keyword extraction technique (KERA) to apply meaningful labels to topics: http://arxiv.org/abs/1308.2359

optimize python processing json retrieved from the fb-graph-api

I'm getting JSON data from the Facebook Graph API about:
my relationship with my friends
my friends' relationships with each other.
Right now my program looks like this (in Python pseudo-code; please note some variables have been changed for privacy):
import json
import requests

# protected
_accessCode = "someAccessToken"
_accessStr = "?access_token=" + _accessCode
_myID = "myIDNumber"

r = requests.get("https://graph.facebook.com/" + _myID + "/friends/" + _accessStr)
raw = json.loads(r.text)
terminate = len(raw["data"])

# list used to store the friend/friend relationships
a = list()

for j in range(0, terminate + 1):
    # calculate terminating displacement:
    term_displacement = terminate - (j + 1)
    print("Currently processing: " + str(j) + " of " + str(terminate))
    for dj in range(1, term_displacement + 1):
        # construct urls based on the raw data:
        url = "https://graph.facebook.com/" + raw["data"][j]["id"] + "/friends/" + raw["data"][j + dj]["id"] + "/" + _accessStr
        # visit site *THIS IS THE BOTTLENECK*:
        reqTemp = requests.get(url)
        rawTemp = json.loads(reqTemp.text)
        if len(rawTemp["data"]) != 0:
            # data dumps to list which dumps to file
            a.append(str(raw["data"][j]["id"]) + "," + str(rawTemp["data"][0]["id"]))

outputFile = "C:/Users/franklin/Documents/gen/friendsRaw.csv"
output = open(outputFile, "w")

# write all me/friend relationships to file
for k in range(0, terminate):
    output.write(_myID + "," + raw["data"][k]["id"] + "\n")

# write all friend/friend relationships to file
for i in range(0, len(a)):
    output.write(a[i])

output.close()
So what it's doing is: first it calls my page and gets my friend list (this is allowed through the Facebook API using an access_token). Calling a friend's friend list is NOT allowed, but I can work around that by requesting a relationship between a friend on my list and another friend on my list. So in part two (indicated by the double for loop) I'm making another request to see if some friend, a, is also a friend of b (both of which are on my list); if so, there will be a JSON object of length one with friend a's name.
But with about 357 friends there are literally thousands of page requests that need to be made. In other words, the program is spending a lot of time just waiting around for the JSON responses.
My question is then: can this be rewritten to be more efficient? Currently, due to security restrictions, calling a friend's friend list attribute is disallowed, and it doesn't look like the API will allow it. Are there any Python tricks that can make this run faster? Maybe parallelism?
Update: modified code is pasted below in the answers section.
Update: this is the solution I came up with. Thanks @DMCS for the FQL suggestion, but I just decided to use what I had. I will post the FQL solution when I get a chance to study the implementation. As you can see, this method just makes use of more condensed API calls.
Incidentally, for future reference, the API call limit is 600 calls per 600 seconds, per token and per IP, so for every unique IP address with a unique access token the number of calls is limited to 1 call per second. I'm not sure what that means for asynchronous calling @Gerrat, but there is that.
import json
import requests

# protected
_accessCode = "someaccesscode"
_accessStr = "?access_token=" + _accessCode
_myID = "someidnumber"

r = requests.get("https://graph.facebook.com/"
                 + _myID + "/friends/" + _accessStr)
raw = json.loads(r.text)
terminate = len(raw["data"])

a = list()
for k in range(0, terminate - 1):
    friendID = raw["data"][k]["id"]
    friendName = raw["data"][k]["name"]
    url = ("https://graph.facebook.com/me/mutualfriends/"
           + friendID + _accessStr)
    req = requests.get(url)
    temp = json.loads(req.text)
    print("Processing: " + str(k + 1) + " of " + str(terminate))
    for j in range(0, len(temp["data"])):
        a.append(friendID + "," + temp["data"][j]["id"] + ","
                 + friendName + "," + temp["data"][j]["name"])

# dump contents to file:
outputFile = "C:/Users/franklin/Documents/gen/friendsRaw.csv"
output = open(outputFile, "w")
print("Dumping to file...")

# write all me/friend relationships to file
for k in range(0, terminate):
    output.write(_myID + "," + raw["data"][k]["id"]
                 + ",me," + str(raw["data"][k]["name"].encode("utf-8", "ignore")) + "\n")

# write all friend/friend relationships to file
for i in range(0, len(a)):
    output.write(str(a[i].encode("utf-8", "ignore")) + "\n")

output.close()
This isn't likely optimal, but I tweaked your code a bit to use Requests async method (untested):
import json
import requests
from requests import async

# protected
_accessCode = "someAccessToken"
_accessStr = "?access_token=" + _accessCode
_myID = "myIDNumber"

r = requests.get("https://graph.facebook.com/" + _myID + "/friends/" + _accessStr)
raw = json.loads(r.text)
terminate = len(raw["data"])

# list used to store the friend/friend relationships
a = list()

def add_to_list(reqTemp):
    rawTemp = json.loads(reqTemp.text)
    if len(rawTemp["data"]) != 0:
        # data dumps to list which dumps to file
        a.append(str(raw["data"][j]["id"]) + "," + str(rawTemp["data"][0]["id"]))

async_list = []
for j in range(0, terminate + 1):
    # calculate terminating displacement:
    term_displacement = terminate - (j + 1)
    print("Currently processing: " + str(j) + " of " + str(terminate))
    for dj in range(1, term_displacement + 1):
        # construct urls based on the raw data:
        url = "https://graph.facebook.com/" + raw["data"][j]["id"] + "/friends/" + raw["data"][j + dj]["id"] + "/" + _accessStr
        req = async.get(url, hooks={'response': add_to_list})
        async_list.append(req)

# gather up all the results
async.map(async_list)

outputFile = "C:/Users/franklin/Documents/gen/friendsRaw.csv"
output = open(outputFile, "w")
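Note that the async module used above was removed from later versions of requests (its functionality lives on in the separate grequests package). On a current requests install, a rough sketch of the same parallel idea using only the standard library's concurrent.futures, under the same assumptions about the response layout, would be:
import json
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

def check_pair(id_a, id_b):
    # One friend-of-friend lookup; returns "id_a,friend_id" if they are friends
    url = "https://graph.facebook.com/" + id_a + "/friends/" + id_b + "/" + _accessStr
    data = json.loads(requests.get(url).text)
    return id_a + "," + data["data"][0]["id"] if data["data"] else None

# every unordered pair of friends from the raw friend list
pairs = [(raw["data"][j]["id"], raw["data"][k]["id"])
         for j in range(terminate)
         for k in range(j + 1, terminate)]

a = []
with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(check_pair, id_a, id_b) for id_a, id_b in pairs]
    for future in as_completed(futures):
        result = future.result()
        if result:
            a.append(result)
Keep max_workers modest given the 600-calls-per-600-seconds limit mentioned in the question.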
