Difference between masking and querying pandas.DataFrame

My example, which uses a DataFrame of floats, shows that querying might in certain cases be faster than using masks. Looking at the graph, masking performs better when the condition is composed of 1 to 5 subconditions (edit, thanks to a_guest: I originally attributed this to the query function).
Is there then any real difference between the two methods, given that both follow the same trend as the number of subconditions grows?
The function used to plot my data:
import matplotlib.pyplot as plt

def graph(data):
    # x axis: number of subconditions, starting at 1
    t = [int(i) for i in range(1, len(data["mask"]) + 1)]
    plt.xlabel('Number of conditions')
    plt.ylabel('timeit (ms)')
    plt.title('Benchmark mask vs query')
    plt.grid(True)
    plt.plot(t, data["mask"], 'r', label="mask")
    plt.plot(t, data["query"], 'b', label="query")
    plt.xlim(1, len(data["mask"]))
    plt.legend()
    plt.show()
The functions used to create the conditions to be tested by timeit:
def create_multiple_conditions_mask(columns, nb_conditions, condition):
    # builds e.g. "(df['A'] > 0) & (df['B'] > 0)"
    mask_list = []
    for i in range(nb_conditions):
        mask_list.append("(df['" + columns[i] + "']" + " " + condition + ")")
    return " & ".join(mask_list)

def create_multiple_conditions_query(columns, nb_conditions, condition):
    # builds e.g. "'A > 0 and B > 0'" (the quotes are part of the returned string)
    mask_list = []
    for i in range(nb_conditions):
        mask_list.append(columns[i] + " " + condition)
    return "'" + " and ".join(mask_list) + "'"
The function to benchmark masking vs querying using a pandas DataFrame containing floats:
from timeit import timeit

import numpy as np
import pandas as pd

def benchmarks_mask_vs_query(dim_df=(50,10), labels=[], condition="> 0", random=False):
    # init local variable
    time_results = {"mask": [], "query": []}
    nb_samples, nb_columns = dim_df
    all_labels = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
    if nb_columns > 26:
        if len(labels) == nb_columns:
            all_labels = labels
        else:
            raise Exception("labels length must match nb_columns" )
    df = pd.DataFrame(np.random.randn(nb_samples, nb_columns), columns=all_labels[:nb_columns])
    for col in range(nb_columns):
        if random:
            condition = "<" + str(np.random.random(1)[0])
        mask = "df[" + create_multiple_conditions_mask(df.columns, col+1, condition) + "]"
        query = "df.query(" + create_multiple_conditions_query(df.columns, col+1, condition) + ")"
        print("Parameters: nb_conditions=" + str(col+1) + ", condition= " + condition)
        print("Mask created: " + mask)
        print("Query created: " + query)
        print()
        # timeit returns the total seconds for 100 runs; * 10 converts that to ms per run
        result_mask = timeit(mask, number=100, globals=locals()) * 10
        result_query = timeit(query, number=100, globals=locals()) * 10
        time_results["mask"].append(result_mask)
        time_results["query"].append(result_query)
    return time_results
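The factor of 10 in the benchmark is a unit conversion rather than a scaling of the result. A minimal sketch of that arithmetic, using a throwaway statement (sum(range(1000)), chosen here only for illustration) in place of the mask and query strings:

```python
from timeit import timeit

NUMBER = 100  # same repeat count as in the benchmark function

# timeit returns the total time, in seconds, for all NUMBER executions
total_seconds = timeit("sum(range(1000))", number=NUMBER)

# average per execution, converted from seconds to milliseconds
ms_per_run = total_seconds / NUMBER * 1000

# with NUMBER == 100 this is the same as multiplying the total by 10
ms_shortcut = total_seconds * 10

print(ms_per_run, ms_shortcut)
```

If the repeat count is changed, the factor has to change with it, which is why writing the division and multiplication out explicitly may be less error-prone.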
What I run:
# benchmark on a DataFrame of shape(50,25) populating with random values
# as well as the conditions ("<random_value")
data = benchmarks_mask_vs_query((50,25), random=True)
graph(data)
What I get:
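For reference, the two statements being timed select the same rows; only the way the condition is expressed differs. A minimal sketch, assuming a small random DataFrame with columns A and B and a fixed threshold of 0.5 (both chosen here for illustration, not taken from the benchmark run):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the example is reproducible
df = pd.DataFrame(rng.standard_normal((50, 2)), columns=["A", "B"])

# boolean mask: each comparison builds a boolean Series, combined with &
masked = df[(df["A"] < 0.5) & (df["B"] < 0.5)]

# query: the same condition written as a string and evaluated by pandas
queried = df.query("A < 0.5 and B < 0.5")

# both selections contain the same rows, values and index
print(masked.equals(queried))
```

Since the results match, any timing difference comes from how the condition is evaluated (building intermediate boolean Series versus parsing and evaluating an expression string), not from what is selected.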

Related

How to display apriori algorithm dataframe in flask

I make recommendations using the apriori algorithm; its results are stored in a DataFrame in the Result variable. Result has three columns: Rule, Support and Confidence, as shown below.
Result=pd.DataFrame(columns=['Rule','Support','Confidence'])
for idx, elem in enumerate(association_results):
    # print(elem)
    thiselem = elem
    # print("1 ", thiselem)
    nextelem = association_results[(idx + 1) % len(association_results)]
    # r1 = [x for x in thiselem[0]]
    # r2 = [x for x in nextelem[0]]
    # print("rule: ", r1[0], r2[0])
    # print("sup: ", elem[1])
    # print("conf: ", elem[2][0][2])
    Result=Result.append({
        'Rule':str([str(x) for x in thiselem[0]])+ " -> " +str([str(x) for x in nextelem[0]]),
        'Support':str(round(elem[1] *100, 2))+'%',
        'Confidence':str(round(elem[2][0][2] *100, 2))+'%'
    },ignore_index=True)
I display the resulting DataFrame in Flask; this is the route:
@app.route('/rekomendasi', methods=['POST'])
def rekomendasi():
    sup = request.form.get('support')
    conf = request.form.get('confidence')
    # Model
    store_data = pd.read_csv('dataPekerjaan.csv', sep=',', header=None, error_bad_lines=False)
    records = []
    # split the data into lists
    for i in range(store_data.shape[0]):
        records.append([str(store_data.values[i, j]).split(',') for j in range(store_data.shape[1])])
    # take only the job-name data
    dataKerja = [[] for dataKerja in range(len(records))]
    for i in range(len(records)):
        for j in records[i][1]:
            dataKerja[i].append(j)
    # dataKerja
    min_sup = float('0.00' + str(sup))
    min_conf = float('0.00' + str(conf))
    association_rules = apriori(dataKerja, min_support=min_sup, min_confidence=min_conf, min_length=2)
    association_results = list(association_rules)
    # display the association results
    pd.set_option('max_colwidth', 200)
    result = pd.DataFrame(columns=['Rule', 'Support', 'Confidence'])
    for item in association_results:
        pair = item[2]
        for i in pair:
            result = result.append({
                'Rule': str([x for x in i[0]]) + " -> " + str([x for x in i[1]]),
                'Support': str(round(item[1] * 100, 2)) + '%',
                'Confidence': str(round(i[2] * 100, 2)) + '%'
            }, ignore_index=True)
    return render_template('rekomendasi.html', name='made', sup = sup, conf = conf, len= len(result)-1, query = result)
And this is rekomendasi.html:
{%for i in range(len) %}
<tr>
    <th>{{i+1}}</th>
    <td>{{query['Rule'][i]}}</td>
    <td>{{query['Support'][i]}}</td>
    <td>{{query['Confidence'][i]}}</td>
</tr>
{%endfor%}
I want the Rule column to be displayed without the [' '] around each item.
But when the Flask app is run, the brackets and quotes are still shown.
Is there any way to fix this?

You can use the str.join method to join the items of a list with a delimiter. Here the delimiter is ", ". So your code becomes:
...
for item in association_results:
    pair = item[2]
    for i in pair:
        result = result.append({
            'Rule': ", ".join([x for x in i[0]]) + " -> " + ", ".join([x for x in i[1]]),
            'Support': str(round(item[1] * 100, 2)) + '%',
            'Confidence': str(round(i[2] * 100, 2)) + '%'
        }, ignore_index=True)
...
This should remove the quotes and the brackets.
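The difference between the two approaches can be seen in isolation. A minimal sketch, assuming two hypothetical item lists that stand in for the two sides of a rule (the item names are made up):

```python
# Hypothetical item lists standing in for the two sides of one rule
antecedent = ["Programmer", "Analyst"]
consequent = ["Designer"]

# str() on a list keeps the list syntax: brackets and quotes
with_brackets = str(antecedent) + " -> " + str(consequent)

# ", ".join() concatenates only the items themselves
clean = ", ".join(antecedent) + " -> " + ", ".join(consequent)

print(with_brackets)  # ['Programmer', 'Analyst'] -> ['Designer']
print(clean)          # Programmer, Analyst -> Designer
```

Note that join only accepts strings; if the items might not be strings, converting them first with map(str, items) avoids a TypeError.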

How to iterate through two pandas columns and create a new column

I am trying to create a new column by concatenating two columns with certain conditions.
master['work_action'] = np.nan
for a,b in zip(master['repair_location'],master['work_service']):
    if a == 'Field':
        master['work_action'].append(a + " " + b)
    elif a == 'Depot':
        master['work_action'].append(a + " " + b)
    else:
        master['work_action'].append(a)
TypeError: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid
The problem is with master['work_action'].append(a + " " + b)
If I change my code to this:
test = []
for a,b in zip(master['repair_location'],master['work_service']):
    if a == 'Field':
        test.append(a + " " + b)
    elif a == 'Depot':
        test.append(a + " " + b)
    else:
        test.append(a)
I get exactly what I want in a list. But I want it in a pandas column. How do I create a new pandas column with the conditions above?

If performance is important, I would use numpy's select:
master = pd.DataFrame(
    {
        'repair_location': ['Field', 'Depot', 'Other'],
        'work_service':[1, 2, 3]
    }
)
master['work_action'] = np.select(
    condlist= [
        master['repair_location'] == 'Field',
        master['repair_location'] == 'Depot'
    ],
    choicelist= [
        master['repair_location'] + ' ' + master['work_service'].astype(str),
        master['repair_location'] + ' ' + master['work_service'].astype(str)
    ],
    default= master['repair_location']
)
Which results in:
  repair_location  work_service work_action
0           Field             1     Field 1
1           Depot             2     Depot 2
2           Other             3       Other

The append method is for inserting values at the end. You are trying to concatenate two string values. Use the apply method:
def fun(a,b):
    if a == 'Field':
        return a + " " + b
    elif a == 'Depot':
        return a + " " + b
    else:
        return a

master['work_action'] = master.apply(lambda x: fun(x['repair_location'], x['work_service']), axis=1)
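The list built in the question and the row-wise apply produce the same values, and a plain list of the right length can be assigned to a column directly. A minimal sketch, assuming hypothetical sample data in which work_service holds strings (so that + concatenates without a conversion):

```python
import pandas as pd

# Hypothetical sample data; the values are made up for illustration
master = pd.DataFrame({
    "repair_location": ["Field", "Depot", "Other"],
    "work_service": ["Repair", "Inspect", "Clean"],
})

def fun(a, b):
    # concatenate only for the two locations named in the question
    if a in ("Field", "Depot"):
        return a + " " + b
    return a

# Option 1: row-wise apply, as in the answer
applied = master.apply(
    lambda x: fun(x["repair_location"], x["work_service"]), axis=1
)

# Option 2: build a list as in the question, then assign it as a column
as_list = [
    fun(a, b)
    for a, b in zip(master["repair_location"], master["work_service"])
]
master["work_action"] = as_list

print(master)
```

Both options loop in Python, so on large frames a vectorized approach is likely to be faster; for small data the choice is mostly a matter of readability.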

Add text to figure using python's plotnine

I would like to add a label to a line in plotnine. I get the following error when using geom_text:
'NoneType' object has no attribute 'copy'
Sample code below:
df = pd.DataFrame({
    'date':pd.date_range(start='1/1/1996', periods=4*25, freq='Q'),
    'small': pd.Series([0.035]).repeat(4*25) ,
    'large': pd.Series([0.09]).repeat(4*25),
})
fig1 = (ggplot()
    + geom_step(df, aes(x='date', y='small'))
    + geom_step(df, aes(x='date', y='large'))
    + scale_x_datetime(labels=date_format('%Y'))
    + scale_y_continuous(labels=lambda l: ["%d%%" % (v * 100) for v in l])
    + labs(x=None, y=None)
    + geom_text(aes(x=pd.Timestamp('2000-01-01'), y = 0.0275, label = 'small'))
)
print(fig1)
Edit:
has2k1's answer below solves the error, but I get:
I want this: (from R)
R code:
ggplot() +
  geom_step(data=df, aes(x=date, y=small), color='#117DCF', size=0.75) +
  geom_step(data=df, aes(x=date, y=large), color='#FF7605', size=0.75) +
  scale_y_continuous(labels = scales::percent, expand = expand_scale(), limits = c(0,0.125)) +
  labs(x=NULL, y=NULL) +
  geom_text(aes(x = as.Date('1996-01-07'), y = 0.0275, label = 'small'), color = '#117DCF', size=5)
Is there any documentation beyond https://plotnine.readthedocs.io/en/stable/index.html? I have read the geom_text page there and still can't produce what I need.

geom_text has no dataframe to work with. If you want to print literal text, put it in quotes, i.e. '"small"', or put the label mapping outside aes(); but it makes more sense to use annotate.
(ggplot(df)
    ...
    # + geom_text(aes(x=pd.Timestamp('2000-01-01'), y = 0.0275, label = '"small"'))
    # + geom_text(aes(x=pd.Timestamp('2000-01-01'), y = 0.0275), label = 'small')
    + annotate('text', x=pd.Timestamp('2000-01-01'), y = 0.0275, label='small')
)

The python operation database error

I use Python to work with a PostgreSQL database. When my code builds the SQL statement, the quotation marks around the values are dropped, so the query fails. How can I avoid this?
def build_sql(self,table_name,keys,condition):
    print(condition)
    # condition = {
    #     "os":["Linux","Windows"],
    #     "client_type":["ordinary"],
    #     "client_status":'1',
    #     "offset":"1",
    #     "limit":"8"
    # }
    sql_header = "SELECT %s FROM %s" % (keys,table_name)
    sql_condition = []
    sql_range = []
    sql_sort = []
    sql_orederby = []
    for key in condition:
        if isinstance(condition[key],list):
            sql_condition.append(key+" in ("+",".join(condition[key])+")")
        elif key == 'limit' or key == 'offset':
            sql_range.append(key + " " + condition[key])
        else:
            sql_condition.append(key + " = " + condition[key])
    print(sql_condition)
    print(sql_range)
    sql_condition = [str(i) for i in sql_condition]
    if not sql_condition == []:
        sql_condition = " where " + " and ".join(sql_condition) + " "
    sql = sql_header + sql_condition + " ".join(sql_range)
    return sql
Error:
MySQL Error Code : column "winxp" does not exist
LINE 1: ...T * FROM ksc_client_info where base_client_os in (WinXP) and...

Mind you, I do not have much Python experience, but basically the values in that sequence are not wrapped in single quotes, so you need to add the quotes either before passing the list to the function or, for example, during join(), like this:
sql_condition.append(key+" in ("+"'{0}'".format("','".join(condition[key]))+")")
You can see other solutions in those questions:
Join a list of strings in python and wrap each string in quotation marks
Add quotes to every list elements
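An alternative to quoting by hand is to let the database driver substitute the values through placeholders, which also guards against SQL injection. A minimal sketch, using the standard-library sqlite3 module as a stand-in for PostgreSQL; the table contents and the name column are made up, and with psycopg2 the placeholder would be %s instead of ?:

```python
import sqlite3

# In-memory database standing in for the real PostgreSQL server
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ksc_client_info (name TEXT, base_client_os TEXT)")
conn.executemany(
    "INSERT INTO ksc_client_info VALUES (?, ?)",
    [("pc1", "WinXP"), ("pc2", "Linux"), ("pc3", "Win7")],
)

os_values = ["WinXP", "Linux"]

# one "?" per value; only the placeholders go into the SQL text
placeholders = ", ".join("?" for _ in os_values)
sql = (
    "SELECT name FROM ksc_client_info "
    "WHERE base_client_os IN (%s) ORDER BY name" % placeholders
)

# the values travel separately, so no manual quoting is needed
rows = conn.execute(sql, os_values).fetchall()

print(sql)
print(rows)
```

Table and column names cannot be passed as placeholders, so those parts of the statement still have to come from trusted code rather than user input.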

Dictionary query

Below is a function that extracts information from a database which holds information about events. Everything works except that when I try and iterate through times in rows in HTML it is apparently empty. I will therefore assume that rows.append(time) is not doing what it should be doing. I tried rows.append((time)) and that did not work either.
def extractor(n):
    date = (datetime.datetime.now() + datetime.timedelta(days=n)).date()
    rows = db.execute("SELECT * FROM events WHERE date LIKE :date ORDER BY date", date = str(date) + '%')
    printed_day = date.strftime('%A') + ", " + date.strftime('%B') + " " + str(date.day) + ", " + str(datetime.datetime.now().year)
    start_time = time.strftime("%H:%M:%S")
    for row in rows:
        date_split = str.split(row['date'])
        just_time = date_split[1]
        if just_time == '00:00:00':
            just_time = 'All Day'
        else:
            just_time = just_time[0:5]
        times.append((just_time))
    rows.append(times)
    results.append((rows, printed_day, start_time, times))

Solved it. Replace
    times.append((just_time))
    rows.append(times)
with
    row['times'] = just_time
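The fix works because each row is a dictionary, so the formatted time can be stored on the row itself and read back under the same key when iterating in the template. A minimal sketch, assuming the query returns a list of dicts; the sample rows and the name key are made up:

```python
# Hypothetical rows, shaped like a list of dicts returned by the query
rows = [
    {"name": "Conference", "date": "2024-05-01 00:00:00"},
    {"name": "Lunch", "date": "2024-05-01 12:30:00"},
]

for row in rows:
    # the time is the second whitespace-separated part of the date string
    just_time = row["date"].split()[1]
    if just_time == "00:00:00":
        just_time = "All Day"
    else:
        just_time = just_time[0:5]  # keep HH:MM only
    # store the label on the row, so each row carries its own time
    row["times"] = just_time

print([row["times"] for row in rows])  # ['All Day', '12:30']
```

Appending a separate list to rows, as in the original code, adds one extra element that is not a row dictionary, which is why looking up times on each row found nothing.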
