When I run my code I get this error. What is causing it?
text_raw_indices = tokenizer.text_to_sequence(text_left + " " + aspect + " " + text_right)
text_raw_without_aspect_indices = tokenizer.text_to_sequence(text_left + " " + text_right)
text_left_indices = tokenizer.text_to_sequence(text_left)
text_left_with_aspect_indices = tokenizer.text_to_sequence(text_left + " " + aspect)
text_right_indices = tokenizer.text_to_sequence(text_right, reverse=True)
text_right_with_aspect_indices = tokenizer.text_to_sequence(" " + aspect + " " + text_right, reverse=True)
aspect_indices = tokenizer.text_to_sequence(aspect)
left_context_len = np.sum(text_left_indices != 0)
aspect_len = np.sum(aspect_indices != 0)
aspect_in_text = torch.tensor([left_context_len.item(), (left_context_len + aspect_len - 1).item()])
polarity = int(polarity) + 1
Just use LASER and you'll be fine. It covers Urdu as well.
You can read more here:
https://engineering.fb.com/ai-research/laser-multilingual-sentence-embeddings/
https://github.com/facebookresearch/LASER
There's also an unofficial PyPI package. It substitutes some internal dependencies, but still works as expected.
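For example, a minimal sketch assuming the unofficial laserembeddings package (its models are fetched once with python -m laserembeddings download-models):

from laserembeddings import Laser

# LASER maps sentences from any supported language into the same
# 1024-dimensional space, so Urdu and English vectors are comparable.
laser = Laser()
embeddings = laser.embed_sentences(
    ['This is a test sentence.'],  # Urdu text works the same with lang='ur'
    lang='en',
)
print(embeddings.shape)  # (1, 1024): one vector per input sentence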
And the most important question, so we can help you better: what are you trying to achieve? What is your final goal?
I am comparing two parcel datasets and want to use fuzzywuzzy to identify owner names that are similar, for example City of Pittsburgh versus City of Pitt. I am currently using an UpdateCursor to do other things with the data and am not sure how I can add the fuzzy logic to it.
What I would like to do is have the script select all parcels that border the update cursor parcel, compare two fields for a fuzzy ratio, and then pass along the variable for the one with the highest ratio (see the sketch after the script below).
Here is the original script that I'd like to modify with the fuzzy logic.
count = 0
with arcpy.da.UpdateCursor(children,c_field_list) as update1:
for u1row in update1:
c_owner = u1row[0]
c_id = u1row[1]
c_parent = u1row[2]
c_geo = u1row[3]
if c_parent == " ":
parents_select = arcpy.management.SelectLayerByLocation(parents, 'BOUNDARY_TOUCHES', c_geo, '', 'NEW_SELECTION')
whereClause = ' "ACCOUNTPAR" ' + " != '" + str(c_owner) + " ' "
with arcpy.da.SearchCursor(parents_select, p_field_list,whereClause) as search1:
for parent in search1:
p_owner = parent[0]
p_id = parent[1]
p_add = parent[2]
print("Parent Account Party: " + p_add + " " + p_id + " Child SOW Owner: " + c_owner + " " + c_id)
u1row[2] = p_id
u1row[4] = 6
update1.updateRow(u1row)
count += 1
print("Updating " + c_owner + " with " + u1row[2])
else:
pass
print("Aggregation Analysis Complete")
print(str(count) + " parcels updated")
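A minimal sketch of the fuzzy comparison described above, assuming fuzzywuzzy's fuzz.ratio and reusing the cursor variables from the script (the 80 threshold is a placeholder):

from fuzzywuzzy import fuzz

best_ratio = -1
best_id = None
with arcpy.da.SearchCursor(parents_select, p_field_list, whereClause) as search1:
    for parent in search1:
        # fuzz.ratio returns a 0-100 similarity score for the two name fields
        ratio = fuzz.ratio(c_owner, parent[0])
        if ratio > best_ratio:
            best_ratio = ratio
            best_id = parent[1]
if best_id is not None and best_ratio >= 80:  # placeholder threshold
    u1row[2] = best_id
    update1.updateRow(u1row)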
Long time reader, first time poster!
I've been working on setting up the first piece of automation at my workplace and teaching myself my first programming language at the same time. The end goal is to set up a Sikuli script to run testing overnight.
I keep running into errors that feel like gaps in my understanding of basic Python principles, and I don't have anyone around to teach me.
The function do_math parses a .csv file, does some math, and returns those values in a tuple. I then assign the results to a variable and try to compare them, but I keep running into:
Test Run Failed: local variable 'D2LAverage' referenced before assignment
I've tried assigning D2LAverage in a number of different places, making it global, and returning a list instead of a tuple, but it just keeps getting stuck.
<handler.py>
def do_math():
with open ('C:/Program Files/TrueVision Surgical/DSM/Logs/Latency_' + timestr + '.csv', 'r') as csv_file:
csv_reader = csv.reader(csv_file, delimiter=',')
line_count = 0
D2L = []
L2D = []
next(csv_file)
for row in csv_reader:
next
D2L.append(float(row[1]))
L2D.append(float(row[2]))
line_count += 1
else:
D2LAverage = 0
L2DAverage = 0
D2LAverage = float(sum(D2L) / len(D2L))
L2DAverage = float(sum(L2D) / len(L2D))
D2Lvar = sum(pow(x-D2LAverage,2) for x in D2L) / len(D2L) # Get varience
D2Lstd = math.sqrt(D2Lvar) # Calculate STD
L2Dvar = sum(pow(x-L2DAverage,2) for x in L2D) / len(L2D) # Get varience
L2Dstd = math.sqrt(L2Dvar) # Calculate STD
return D2LAverage, D2Lstd, L2DAverage, L2Dstd
<start_script.py> - Calling the above do_math
do_math()
results = do_math()
# Specify the path
path = 'C:/Users/AeosFactory/Desktop/'
# Specify the file name
file_name = "Latency_Results" + "_" + str(datetime.datetime.now().strftime('%Y-%m-%d_%H_%M_%S')) + ".txt"
# Create a file at specified location and do comparison
with open (os.path.join(path, file_name), 'a+') as Latency_Results:
if results[0] <= 85:
Latency_Results.write("Pass" + " - " + "4 Screens, Image Mode 1, Surgery, Live, HC, D2L Average:" + " " + str(round(D2LAverage,2)) + '\n')
else:
Latency_Results.write("Fail" + " - " + "4 Screens, Image Mode 1, Surgery, Live, HC, D2L Average:" + " " + str(round(D2LAverage,2)) + '\n')
if D2Lstd <= 10:
Latency_Results.write("Pass" + " - " + "4 Screens, Image Mode 1, Surgery, Live, HC, D2L Standard Deviation:" + " " + str(round(D2Lstd,2)) + '\n')
else:
Latency_Results.write("Fail" + " - " + "4 Screens, Image Mode 1, Surgery, Live, HC, D2L Standard Deviation:" + " " + str(round(D2Lstd,2)) + '\n')
if L2DAverage <= 85:
Latency_Results.write("Pass" + " - " + "4 Screens, Image Mode 1, Surgery, Live, HC, L2D Average:" + " " + str(round(L2DAverage,2)) + '\n')
else:
Latency_Results.write("Fail" + " - " + "4 Screens, Image Mode 1, Surgery, Live, HC, L2D Average:" + " " + str(round(L2DAverage,2)) + '\n')
if L2Dstd <= 10:
Latency_Results.write("Pass" + " - " + "4 Screens, Image Mode 1, Surgery, Live, HC, L2D Standard Deviation:" + " " + str(round(L2Dstd,2)) + '\n' + '\n')
else:
Latency_Results.write("Fail" + " - " + "4 Screens, Image Mode 1, Surgery, Live, HC, L2D Standard Deviation:" + " " + str(round(L2Dstd,2)) + '\n' + '\n')
With the exception of global variables, variables aren't shared between Python files. If you look in start_script.py, you never define D2LAverage anywhere. When you attempt to use it in your if statements, you raise an error because Python "doesn't know what you're talking about". I think what tripped you up is the assumption that the names of the returned values are exposed in the scope where the function was called. All that gets returned is the return value itself.
At the end of do_math() you have the line return D2LAverage, D2Lstd, L2DAverage, L2Dstd; because a Python function can only return a single value, these get packaged into a tuple.
In start_script.py you call do_math() twice:
once without capturing the return value
once where you assign the return value to results
If you only care about the results of the function, you'll probably want to remove the first call to it.
To access the individual values returned by do_math() you can either access the desired value directly by indexing into the tuple with results[0], or use tuple unpacking to assign the 4 return values to individual variables, with something like D2LAverage, D2Lstd, L2DAverage, L2Dstd = results.
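For example, a minimal sketch of the unpacking option:

# do_math() returns a 4-tuple; unpack it into individual names
D2LAverage, D2Lstd, L2DAverage, L2Dstd = do_math()
print(round(D2LAverage, 2), round(D2Lstd, 2))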
I have a csv file with a time series of two economic variables (housing starts and unemployment). I have a list of calculations and a summary (text) that is written with the output of the calculations (basically summarizing in paragraph format what the trends in the data are). I would like feedback on how to get a for loop to go through each variable in the csv file so I have a summary for each variable as the final output.
I tried applying the basic logic of a for loop, but I'm just not sure what I have incorrect. I looked at a number of examples on Stack Overflow but nothing seems to fit. I'm sure I'm missing something simple, but I haven't been using Python that long, so I'm just not sure at this point.
raw_data = pd.read_csv('C:/Users/J042666/Desktop/2019.03 HOUST and GDP.csv')
df = pd.DataFrame(raw_data)
for i in df:
freq = "monthly "
units = " million "
pos = 1
colname = df.columns[pos]
alltime = df.mean()
low = df.min()
maximum = df.max()
today = df.iloc[720]
one_year = df.iloc[709:721].mean()
two_year = df.iloc[697:721].mean()
five_year = df.iloc[661:721].mean()
one_year_vol = df.iloc[709:721].std()
two_year_vol = df.iloc[697:721].std()
five_year_vol = df.iloc[661:721].std()
today_vs_1 = ((today/one_year) -1)*100
today_vs_2 = ((today/two_year) -1)*100
today_vs_5 = ((today/five_year) -1)*100
rolling_1 = df.rolling(window=3).mean()
rolling_2 = df.rolling(window=6).mean()
rolling_3 = df.rolling(window=9).mean()
today_vs_1_rolling = ((today/rolling_1.iloc[720]) -1)*100
today_vs_2_rolling = ((today/rolling_2.iloc[720]) -1)*100
today_vs_3_rolling = ((today/rolling_3.iloc[720]) -1)*100
summary = ("The " + str(freq) + str(colname) + " currently stands at " + str(today) + str(units) + " which compares to the 1,2 and 5 year averages of " + str(one_year) + str(units) + "," + str(two_year) + str(units) + "," + " and " + str(five_year) + str(units) + " respectively. " + " Based on the current " + str(colname) + " levels, that reflects a change of" + str(today_vs_1) + ", " + str(today_vs_2) + " and " + str(today_vs_5) + " respectively." " Since the metric began being tracked, the minimum, maximum and long run average total " + str(low) + str(units) + ", " + str(maximum) + str(units) + " and " + str(alltime) + str(units) + " respectively. " "The 1, 2 and 5 year standard deviation for " + str(colname) + " totals " + str(one_year_vol) + str(units) + " ," + str(two_year_vol) + str(units) + " and" + str(five_year_vol) + str(units) + " respectively." + " Based on the current " + str(colname) + " levels compared to the 3, 6 and 9 month rolling averages, the current level reflects a change of " + str(today_vs_1_rolling) + ", " + str(today_vs_2_rolling) + " and " + str(today_vs_3_rolling) + " respectively.")
print(summary)
As I describe above, I am hoping to have code that produces a paragraph summary of the financial metrics I calculate in the for loop for each variable.
The problem is that you are selecting the entire dataframe rather than each column on its own; hence, the analysis you were doing was applied to both columns at once. I also just extracted the values required from your operations rather than keeping the entire text that Pandas prints out.
This should work:
import pandas as pd

df = pd.read_csv('2019.03 HOUST and GDP.csv')
df = df.loc[:, ['Housing Starts', 'Unemployment Rate']]
for col in df.columns:
freq = "monthly "
units = " million "
colname = col
selectedCol = df.loc[:, [col]]
alltime = selectedCol.mean()[0]
low = selectedCol.min()[0]
maximum = selectedCol.max()[0]
today = selectedCol.iloc[720][0]
one_year = selectedCol.iloc[709:721].mean()[0]
two_year = selectedCol.iloc[697:721].mean()[0]
five_year = selectedCol.iloc[661:721].mean()[0]
one_year_vol = selectedCol.iloc[709:721].std()[0]
two_year_vol = selectedCol.iloc[697:721].std()[0]
five_year_vol = selectedCol.iloc[661:721].std()[0]
today_vs_1 = ((today/one_year) -1)*100
today_vs_2 = ((today/two_year) -1)*100
today_vs_5 = ((today/five_year) -1)*100
rolling_1 = selectedCol.rolling(window=3).mean()
rolling_2 = selectedCol.rolling(window=6).mean()
rolling_3 = selectedCol.rolling(window=9).mean()
today_vs_1_rolling = ((today/rolling_1.iloc[720]) -1)*100
today_vs_2_rolling = ((today/rolling_2.iloc[720]) -1)*100
today_vs_3_rolling = ((today/rolling_3.iloc[720]) -1)*100
summary = ("The " + str(freq) + str(colname) + " currently stands at " + str(today) + str(units) + " which compares to the 1,2 and 5 year averages of " + str(one_year) + str(units) + "," + str(two_year) + str(units) + "," + " and " + str(five_year) + str(units) + " respectively. " + " Based on the current " + str(colname) + " levels, that reflects a change of" + str(today_vs_1) + ", " + str(today_vs_2) + " and " + str(today_vs_5) + " respectively." " Since the metric began being tracked, the minimum, maximum and long run average total " + str(low) + str(units) + ", " + str(maximum) + str(units) + " and " + str(alltime) + str(units) + " respectively. " "The 1, 2 and 5 year standard deviation for " + str(colname) + " totals " + str(one_year_vol) + str(units) + " ," + str(two_year_vol) + str(units) + " and" + str(five_year_vol) + str(units) + " respectively." + " Based on the current " + str(colname) + " levels compared to the 3, 6 and 9 month rolling averages, the current level reflects a change of " + str(today_vs_1_rolling[0]) + ", " + str(today_vs_2_rolling[0]) + " and " + str(today_vs_3_rolling[0]) + " respectively.")
print(summary)
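As a side note, the long concatenation is easier to maintain as an f-string; a partial sketch covering just the first sentence:

summary_start = (
    f"The {freq}{colname} currently stands at {today:.2f}{units}"
    f"which compares to the 1, 2 and 5 year averages of {one_year:.2f}{units},"
    f" {two_year:.2f}{units} and {five_year:.2f}{units} respectively."
)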
After packaging my program I decided to test it to make sure it worked. A few things happened, but the main issue is with the Save_File.
I use a Save_File.py for static save data. However, the frozen Python executable can't do anything with this file. It can't write to it or read from it. Writing says the save was successful, but on load it resets all values to zero again.
Is it normal for any .py file to do this?
Is it an issue in PyInstaller?
Bad freeze process?
Or is there some other reason the frozen file can't write, read, or interact with files not already inside it? (Save_File was frozen inside and doesn't work, but removing it causes errors, as if it never existed.)
So the exe can't see outside of itself or change within itself...
Edit: Added the most basic version of the save file, but basically, it gets deleted and rewritten a lot.
def save():
with open("Save_file.py", "a") as file:
file.write("healthy = " + str(healthy) + "\n")
file.write("infected = " + str(infected) + "\n")
file.write("zombies = " + str(zombies) + "\n")
file.write("dead = " + str(dead) + "\n")
file.write("cure = " + str(cure) + "\n")
file.write("week = " + str(week) + "\n")
file.write("infectivity = " + str(infectivity) + "\n")
file.write("infectivity_limit = " + str(infectivity_limit) + "\n")
file.write("severity = " + str(severity) + "\n")
file.write("severity_limit = " + str(severity_limit) + "\n")
file.write("lethality = " + str(lethality) + "\n")
file.write("lethality_limit = " + str(lethality_limit) + "\n")
file.write("weekly_infections = " + str(weekly_infections) + "\n")
file.write("dna_points = " + str(dna_points) + "\n")
file.write("burst = " + str(burst) + "\n")
file.write("burst_price = " + str(burst_price) + "\n")
file.write("necrosis = " + str(necrosis) + "\n")
file.write("necrosis_price = " + str(necrosis_price) + "\n")
file.write("water = " + str(water) + "\n")
file.write("water_price = " + str(water_price) + "\n")
file.write("air = " + str(air) + "\n")
file.write("blood = " + str(blood) + "\n")
file.write("saliva = " + str(saliva) + "\n")
file.write("zombify = " + str(zombify) + "\n")
file.write("rise = " + str(rise) + "\n")
file.write("limit = int(" + str(healthy) + " + " + str(infected) + " + " + str(dead) + " + " + str(zombies) + ")\n")
file.write("old = int(1)\n")
Clear.clear()
WordCore.word_corex("SAVING |", "Save completed successfully")
time.sleep(2)
Clear.clear()
player_menu()
It's probably because the frozen copy of the file (bundled inside the executable) is the one being loaded, never the one you're writing; that's why it works when the files aren't frozen.
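One common workaround is to write save data to a real path on disk instead of a bundled resource; a minimal sketch (the data.txt name is a placeholder):

import os
import sys

# PyInstaller sets sys.frozen on the bundled executable
if getattr(sys, 'frozen', False):
    base_dir = os.path.dirname(sys.executable)  # folder next to the .exe
else:
    base_dir = os.path.dirname(os.path.abspath(__file__))
save_path = os.path.join(base_dir, 'data.txt')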
It's also bad practice to:
- have a zillion global variables to hold your persistent data
- generate code in a Python file just to evaluate it back again (that's _self-modifying code_)
If you were using C or C++, would you generate some code to store your data and then compile it into your new executable? Would you declare 300 globals? I don't think so.
You'd be better off with the JSON data format and a dictionary for your variables; that works whether frozen or not.
Your dictionary would look like:
variables = {"healthy" : True, "zombies" : 345} # and so on
Access your variables:
if variables["healthy"]: # do something
Then the save function:
import json
def save():
with open("data.txt", "w") as file:
json.dump(variables,file,indent=3)
This creates a text file with data like this:
{
"healthy": true,
"zombies": 345
}
And the load function (declaring variables as global so the assignment rebinds the module-level name instead of creating a local):
def load():
global variables
with open("data.txt", "r") as file:
variables = json.load(file)
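A quick round trip to check it works:

variables = {"healthy": True, "zombies": 345}
save()           # writes data.txt
variables = {}   # simulate a fresh program start
load()           # reads data.txt back into the global
print(variables["zombies"])  # 345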
I'm trying to categorize customer feedback, and I ran an LDA in Python and got the following output for 10 topics:
(0, u'0.559*"delivery" + 0.124*"area" + 0.018*"mile" + 0.016*"option" + 0.012*"partner" + 0.011*"traffic" + 0.011*"hub" + 0.011*"thanks" + 0.010*"city" + 0.009*"way"')
(1, u'0.397*"package" + 0.073*"address" + 0.055*"time" + 0.047*"customer" + 0.045*"apartment" + 0.037*"delivery" + 0.031*"number" + 0.026*"item" + 0.021*"support" + 0.018*"door"')
(2, u'0.190*"time" + 0.127*"order" + 0.113*"minute" + 0.075*"pickup" + 0.074*"restaurant" + 0.031*"food" + 0.027*"support" + 0.027*"delivery" + 0.026*"pick" + 0.018*"min"')
(3, u'0.072*"code" + 0.067*"gps" + 0.053*"map" + 0.050*"street" + 0.047*"building" + 0.043*"address" + 0.042*"navigation" + 0.039*"access" + 0.035*"point" + 0.028*"gate"')
(4, u'0.434*"hour" + 0.068*"time" + 0.034*"min" + 0.032*"amount" + 0.024*"pay" + 0.019*"gas" + 0.018*"road" + 0.017*"today" + 0.016*"traffic" + 0.014*"load"')
(5, u'0.245*"route" + 0.154*"warehouse" + 0.043*"minute" + 0.039*"need" + 0.039*"today" + 0.026*"box" + 0.025*"facility" + 0.025*"bag" + 0.022*"end" + 0.020*"manager"')
(6, u'0.371*"location" + 0.110*"pick" + 0.097*"system" + 0.040*"im" + 0.038*"employee" + 0.022*"evening" + 0.018*"issue" + 0.015*"request" + 0.014*"while" + 0.013*"delivers"')
(7, u'0.182*"schedule" + 0.181*"please" + 0.059*"morning" + 0.050*"application" + 0.040*"payment" + 0.026*"change" + 0.025*"advance" + 0.025*"slot" + 0.020*"date" + 0.020*"tomorrow"')
(8, u'0.138*"stop" + 0.110*"work" + 0.062*"name" + 0.055*"account" + 0.046*"home" + 0.043*"guy" + 0.030*"address" + 0.026*"city" + 0.025*"everything" + 0.025*"feature"')
Is there a way to automatically label them? I do have a csv file with feedback manually labeled, but I do not want to supply those labels myself. I want the model to create labels. Is that possible?
The comments here link to another SO answer that links to a paper. Let's say you wanted to do the minimum to try to make this work. Here is an MVP-style solution that has worked for me: search Google for the terms, then look for keywords in the response.
Here is some working, though hacky, code:
pip install cssselect
then
from urllib.parse import urlencode, urlparse, parse_qs
from lxml.html import fromstring
from requests import get
from collections import Counter
import re
def get_srp_text(search_term):
    # Use the function's own parameter (the original referenced a global)
    # and URL-encode the query with the imported urlencode
    raw = get(f"https://www.google.com/search?{urlencode({'q': search_term})}").text
page = fromstring(raw)
blob = ""
for result in page.cssselect("a"):
for res in result.findall("div"):
blob += ' '
blob += res.text if res.text else " "
blob += ' '
return blob
def blob_cleaner(blob):
    # str.replace is not regex-aware; use re.sub to turn punctuation into spaces
    clean_blob = re.sub(r'[/():_\-]', ' ', blob)
    return ''.join(e for e in clean_blob if e.isalnum() or e.isspace())
def get_name_from_srp_blob(clean_blob):
    blob_tokens = [t for t in clean_blob.split() if len(t) > 2]  # drop short tokens
c = Counter(blob_tokens)
most_common = c.most_common(10)
name = f"{most_common[0][0]}-{most_common[1][0]}"
return name
pipeline = lambda x: get_name_from_srp_blob(blob_cleaner(get_srp_text(x)))
Then you can just get the topic words from your model, e.g.
topic_terms = "delivery area mile option partner traffic hub thanks city way"
name = pipeline(topic_terms)
print(name)
>>> City-Transportation
and
topic_terms = "package address time customer apartment delivery number item support door"
name = pipeline(topic_terms)
print(name)
>>> Parcel-Package
You could improve this a lot. For example, you could use POS tags to find only the most common nouns, then use those for the name. Or find the most common adjective and noun, and make the name "Adjective Noun". Even better, you could fetch the text from the linked sites and run YAKE to extract keywords.
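For instance, a minimal sketch of the POS-tag idea, assuming NLTK with its punkt and averaged_perceptron_tagger data downloaded (get_name_from_nouns is a hypothetical drop-in for get_name_from_srp_blob):

import nltk
from collections import Counter

def get_name_from_nouns(clean_blob):
    # Keep only nouns (tags starting with NN) and name the topic
    # after the two most frequent ones.
    tokens = nltk.word_tokenize(clean_blob)
    nouns = [w for w, tag in nltk.pos_tag(tokens) if tag.startswith('NN')]
    top = Counter(nouns).most_common(2)
    return f"{top[0][0]}-{top[1][0]}"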
Regardless, this demonstrates a simple way to automatically name clusters without directly using machine learning (though Google is most certainly using it to generate the search results, so you are benefiting from it).