Extract Dictionary Values from Classifier Output - python

I'm trying zero-shot classification. I get an output like below
[{'labels': ['rep_appreciation',
'cx_service_appreciation',
'issue_resolved',
'recommend_product',
'callback_realted',
'billing_payment_related',
'disppointed_product'],
'scores': [0.9198898673057556,
0.8672246932983398,
0.79215407371521,
0.6239275336265564,
0.4782547056674957,
0.39024001359939575,
0.010263209231197834],
'sequence': 'Alan Edwards provided me with nothing less the excellent assistance'}
Above is output for one row in a data frame
I'm hoping to finally build a data frame columns and output values mapped like below. 1s for labels if the scores are above certain threshold
Any nudge/help to solve this is highly appreciated.

Define a function which returns a key : value dictionary for every row, with key being the label and value being 1/0 based on threshold
def get_label_score_dict(row, threshold):
result_dict = dict()
for _label, _score in zip(row['labels'], row['scores']):
if _score > threshold:
result_dict.update({_label: 1})
else:
result_dict.update({_label: 0})
return result_dict
Now if you have a list_of_rows with each row being in the form as shown above, then you can use the map function to get the above mentioned dictionary for every row. Once you get this, convert it into a DataFrame.
th = 0.5 #whatever threshold value you want
result = list(map(lambda x: get_label_score_dict(x, th), list_of_rows))
result_df = pd.DataFrame(result)

Related

Why is mapping dataseries with dataframe column names taking so long with map function

Hi all i have a dataframe of approx 400k rows with a column of interest. I would like to map each element in the column to a category (LU, HU, etc.). This is obtained from a smaller dataframe where the column names are the category. The function below however runs very slow voor only 400k rows. I m not sure why. In the example below ofcourse it is fast for 5 examples.
cwp_sector_mapping = {
'LU': ['C2P34', 'C2P35', 'C2P36'],
'HU': ['C2P37', 'C2P38', 'C2P39'],
'EH': ['C2P40', 'C2P41', 'C2P42'],
'EL': ['C2P43', 'C2P44', 'C2P45'],
'WL': ['C2P12', 'C2P13', 'C2P14'],
'WH': ['C2P15', 'C2P16', 'C2P17'],
'NL': ['C2P18', 'C2P19', 'C2P20'],
}
df_cwp = pd.DataFrame.from_dict(cwp_sector_mapping)
columns = df_cwp.columns
ls = pd.Series(['C2P44', 'C2P43', 'C2P12', 'C2P1'])
temp = list((map(lambda pos: columns[df_cwp.eq(pos).any()][0] if
columns[df_cwp.eq(pos).any()].size != 0 else 'UN', ls)))
Use next with iter trick for possible get first meached value of columns, if no match get default value UN:
temp = [next(iter(columns[df_cwp.eq(pos).any()]), 'UN') for pos in ls]

Calculating averaged data in and writing to csv from a pandas dataframe

I have a very large spatial dataset stored in a dataframe. I am taking a slice of that dataframe into a new smaller subset to run further calculations.
The data has x, y and z coordinates with a number of additional columns, some of which are text and some are numeric. The x and y coordinates are on a defined grid and have a known separation.
Data looks like this
x,y,z,text1,text2,text3,float1,float2
75000,45000,120,aa,bbb,ii,12,0.2
75000,45000,110,bb,bbb,jj,22,0.9
75000,45100,120,aa,bbb,ii,11,1.8
75000,45100,110,bb,bbb,jj,45,2.4
75000,45100,100,bb,ccc,ii,13.6,1
75100,45000,120,bb,ddd,jj,8.2,2.1
75100,45000,110,bb,ddd,ii,12,0.6
For each x and y pair I want to iterate over a two series of text values and do three things in the z direction.
Calculate the average of one numeric value for all the values with a third specific text value
Sum another numeric value for all the values with the same text value
Write the a resultant table of 'x, y, average, sum' to a csv.
My code does part three (albeit very slowly) but doesn't calculate 1 or 2 or at least I don't appear to get the average and sum calculations in my output.
What have I done wrong and how can I speed it up?
for text1 in text_list1:
for text2 in text_list2:
# Get the data into smaller dataframe
df = data.loc[ (data["textfield1"] == text1) & (data["textfield2"] == text2 ) ]
#Get the minimum and maximum x and y
minXw = df['x'].min()
maxXw = df['x'].max()
minYw = df['y'].min()
maxYw = df['y'].max()
# dictionary for quicker printing
dict_out = {}
rows_list = []
# Make output filename
filenameOut = text1+"_"+text2+"_Values.csv"
# Start looping through x values
for x in np.arange(minXw, maxXw, x_inc):
xcount += 1
# Start looping through y values
for y in np.arange(minYw, maxYw, y_inc):
ycount += 1
# calculate average and sum
ave_val = df.loc[df['textfield3'] == 'text3', 'float1'].mean()
sum_val = df.loc[df['textfield3'] == 'text3', 'float2'].sum()
# Make Dictionary of output values
dict_out = dict([('text1', text1),
('text2', text2),
('text3', df['text3']),
('x' , x-x_inc),
('y' , y-y_inc),
('ave' , ave_val),
('sum' , sum_val)])
rows_list_c.append(dict_out)
# Write csv
columns = ['text1','text2','text3','x','y','ave','sum']
with open(filenameOut, 'w') as csvfile:
writer = csv.DictWriter(csvfile, fieldnames=columns)
writer.writeheader()
for data in dict_out:
writer.writerow(data)
My resultant csv gives me:
text1,text2,text3,x,y,ave,sum
text1,text2,,74737.5,43887.5,nan,0.0
text1,text2,,74737.5,43912.5,nan,0.0
text1,text2,,74737.5,43937.5,nan,0.0
text1,text2,,74737.5,43962.5,nan,0.0
Not really clear what you're trying to do. But here is a starting point
If you only need to process rows with a specific text3value, start by filtering out the other rows:
df = df[df.text3=="my_value"]
If at this point, you do not need text3 anymore, you can also drop it
df = df.drop(columns="text3")
Then you process several sub dataframes, and write each of them to their own csv file. groupby is the perfect tool for that:
for (text1, text2), sub_df in df.groupby(["text1", "text2"]):
filenameOut = text1+"_"+text2+"_Values.csv"
# Process sub df
output_df = process(sub_df)
# Write sub df
output_df.to_csv(filenameOut)
Note that if you keep your data as a DataFrame instead of converting it to a dict, you can use the DataFrame to_csv method to simply write the output csv.
Now let's have a look at the process function (Note that you dont really need to make it a separate function, you could as well dump the function body in the for loop).
At this point, if I understand correctly, you want to compute the sum and the average of every rows that have the same x and y coordinates. Here again you can use groupby and the agg function to compute the mean and the sum of the group.
def process(sub_df):
# drop the text1 and text2 columns since they are in the filename anyway
out = sub_df.drop(columns=["text1","text2"])
# Compute mean and max
return out.groupby(["x", "y"]).agg(ave=("float1", "mean"), sum=("float2", "sum"))
And that's preety much it.
Bonus: 2-liner version (but don't do that...)
for (text1, text2), sub_df in df[df.text3=="my_value"].drop(columns="text3").groupby(["text1", "text2"]):
sub_df.drop(columns=["text1","text2"]).groupby(["x", "y"]).agg(ave=("float1", "mean"), sum=("float2", "sum")).to_csv(text1+"_"+text2+"_Values.csv")
To do this in an efficient way in pandas you will need to use groupby, agg and the in-built to_csv method rather than using for loops to construct lists of data and writing each one with the csv module. Something like this:
groups = data[data["text1"].isin(text_list1) & data["text2"].isin(text_list2)] \
.groupby(["text1", "text2"])
for (text1, text2), group in groups:
group.groupby("text3") \
.agg({"float1": np.mean, "float2": sum}) \
.to_csv(f"{text1}_{text2}_Values.csv")
It's not clear exactly what you're trying to do with the incrementing of x and y values, which is also what makes your current code very slow. To present sums and averages of the floating point columns by intervals of x and y, you could make bin columns and group by those too.
data["x_bin"] = (data["x"] - data["x"].min()) // x_inc
data["y_bin"] = (data["y"] - data["y"].min()) // y_inc
groups = data[data["text1"].isin(text_list1) & data["text2"].isin(text_list2)] \
.groupby(["text1", "text2"])
for (text1, text2), group in groups:
group.groupby(["text3", "x_bin", "y_bin"]) \
.agg({"x": "first", "y": "first", "float1": np.mean, "float2": sum}) \
.to_csv(f"{text1}_{text2}_Values.csv")

Array to columns in dataframe

I've built a functioning classification model following this tutorial.
I bring in a csv and then pass each row's text value into a function which calls on the classification model to make a prediction. The function returns an array which I need put into columns in the dataframe.
Function:
def get_top_k_predictions(model,X_test,k):
# get probabilities instead of predicted labels, since we want to collect top 3
np.set_printoptions(suppress=True)
probs = model.predict_proba(X_test)
# GET TOP K PREDICTIONS BY PROB - note these are just index
best_n = np.argsort(probs, axis=1)[:,-k:]
# GET CATEGORY OF PREDICTIONS
preds = [
[(model.classes_[predicted_cat], distribution[predicted_cat])
for predicted_cat in prediction]
for distribution, prediction in zip(probs, best_n)]
preds=[ item[::-1] for item in preds]
return preds
Function Call:
for index, row in df.iterrows():
category_test_features=category_loaded_transformer.transform(df['Text'].values.astype('U'))
df['PREDICTION'] = get_top_k_predictions(category_loaded_model,category_test_features,9)
This is the output from the function:
[[('Learning Activities', 0.001271131465669718),
('Communication', 0.002696299964802842),
('Learning Objectives', 0.002774964762863968),
('Learning Technology', 0.003557563051027678),
('Instructor/TAs', 0.004512712287403168),
('General', 0.006675929282872587),
('Learning Materials', 0.013051869950436862),
('Course Structure', 0.02781481160602757),
('Community', 0.9376447176288959)]]
I want the output to look like this in the end.
You function returns a list that contains a list of tuples? Why the double-nested list? One way I can think of:
tmp = {}
for index, row in df.iterrows():
predictions = get_top_k_predictions(...)
tmp[index] = {
key: value for key, value in predictions[0]
}
tmp = pd.DataFrame(tmp).T
df.join(tmp)

Change Column values in pandas applying another function

I have a data frame in pandas, one of the columns contains time intervals presented as strings like 'P1Y4M1D'.
The example of the whole CSV:
oci,citing,cited,creation,timespan,journal_sc,author_sc
0200100000236252421370109080537010700020300040001-020010000073609070863016304060103630305070563074902,"10.1002/pol.1985.170230401","10.1007/978-1-4613-3575-7_2",1985-04,P2Y,no,no
...
I created a parsing function, that takes that string 'P1Y4M1D' and returns an integer number.
I am wondering how is it possible to change all the column values to parsed values using that function?
def do_process_citation_data(f_path):
global my_ocan
my_ocan = pd.read_csv("citations.csv",
names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'],
parse_dates=['creation', 'timespan'])
my_ocan = my_ocan.iloc[1:] # to remove the first row iloc - to select data by row numbers
my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)
return my_ocan
def parse():
mydict = dict()
mydict2 = dict()
i = 1
r = 1
for x in my_ocan['oci']:
mydict[x] = str(my_ocan['timespan'][i])
i +=1
print(mydict)
for key, value in mydict.items():
is_negative = value.startswith('-')
if is_negative:
date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value[1:])
else:
date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value)
year, month, day = [int(num) if num else 0 for num in date_info[0]] if date_info else [0,0,0]
daystotal = (year * 365) + (month * 30) + day
if not is_negative:
#mydict2[key] = daystotal
return daystotal
else:
#mydict2[key] = -daystotal
return -daystotal
#print(mydict2)
#return mydict2
Probably I do not even need to change the whole column with new parsed values, the final goal is to write a new function that returns average time of ['timespan'] of docs created in a particular year. Since I need parsed values, I thought it would be easier to change the whole column and manipulate a new data frame.
Also, I am curious what could be a way to apply the parsing function on each ['timespan'] row without modifying a data frame, I can only assume It could be smth like this, but I don't have a full understanding of how to do that:
for x in my_ocan['timespan']:
x = parse(str(my_ocan['timespan'])
How can I get a column with new values? Thank you! Peace :)
A df['timespan'].apply(parse) (as mentioned by #Dan) should work. You would need to modify only the parse function in order to receive the string as an argument and return the parsed string at the end. Something like this:
import pandas as pd
def parse_postal_code(postal_code):
# Splitting postal code and getting first letters
letters = postal_code.split('_')[0]
return letters
# Example dataframe with three columns and three rows
df = pd.DataFrame({'Age': [20, 21, 22], 'Name': ['John', 'Joe', 'Carla'], 'Postal Code': ['FF_222', 'AA_555', 'BB_111']})
# This returns a new pd.Series
print(df['Postal Code'].apply(parse_postal_code))
# Can also be assigned to another column
df['Postal Code Letter'] = df['Postal Code'].apply(parse_postal_code)
print(df['Postal Code Letter'])

Extract string from data frame in python

I am new in Python and I would like to extract a string data from my data frame. Here is my data frame:
Which state has the most counties in it?
Unfortunately I could not extract a string! Here is my code:
import pandas as pd
census_df = pd.read_csv('census.csv')
def answer_five():
return census_df[census_df['COUNTY']==census_df['COUNTY'].max()]['STATE']
answer_five()
How about this:
import pandas as pd
census_df = pd.read_csv('census.csv')
def answer_five():
"""
Returns the 'STATE' corresponding to the max 'COUNTY' value
"""
max_county = census_df['COUNTY'].max()
s = census_df.loc[census_df['COUNTY']==max_county, 'STATE']
return s
answer_five()
This should output a pd.Series object featuring the 'STATE' value(s) where 'COUNTY' is maxed. If you only want the value and not the Series (as your question stated, and since in your image there's only 1 max value for COUNTY) then return s[0] (instead of return s) should do.
def answer_five():
return census_df.groupby('STNAME')['COUNTY'].nunique().idxmax()
You can aggregate data using group by state name, then get count on unique counties and return id of max count.
I had the same issue for some reason I tried using .item() and manage to extract the exact value I needed.
In your case it would look like:
return census_df[census_df['COUNTY'] == census_df['COUNTY'].max()]['STATE'].item()

Categories