How to iterate through two pandas columns and create a new column

I am trying to create a new column by concatenating two columns with certain conditions.
master['work_action'] = np.nan
for a, b in zip(master['repair_location'], master['work_service']):
    if a == 'Field':
        master['work_action'].append(a + " " + b)
    elif a == 'Depot':
        master['work_action'].append(a + " " + b)
    else:
        master['work_action'].append(a)
TypeError: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid
The problem is with master['work_action'].append(a + " " + b)
If I change my code to this:
test = []
for a, b in zip(master['repair_location'], master['work_service']):
    if a == 'Field':
        test.append(a + " " + b)
    elif a == 'Depot':
        test.append(a + " " + b)
    else:
        test.append(a)
I get exactly what I want in a list. But I want it in a pandas column. How do I create a new pandas column with the conditions above?

If performance is important, I would use numpy's select:
import numpy as np
import pandas as pd

master = pd.DataFrame(
    {
        'repair_location': ['Field', 'Depot', 'Other'],
        'work_service': [1, 2, 3]
    }
)
# Each condition is paired with the choice at the same position;
# rows matching no condition fall back to the default.
master['work_action'] = np.select(
    condlist=[
        master['repair_location'] == 'Field',
        master['repair_location'] == 'Depot'
    ],
    choicelist=[
        master['repair_location'] + ' ' + master['work_service'].astype(str),
        master['repair_location'] + ' ' + master['work_service'].astype(str)
    ],
    default=master['repair_location']
)
Which results in:
  repair_location  work_service work_action
0           Field             1     Field 1
1           Depot             2     Depot 2
2           Other             3       Other

Series.append is meant for adding another Series at the end, not for concatenating string values, which is why it raises the TypeError. To build the value row by row, use the apply method:
def fun(a, b):
    if a == 'Field':
        return a + " " + b
    elif a == 'Depot':
        return a + " " + b
    else:
        return a
master['work_action'] = master.apply(lambda x: fun(x['repair_location'], x['work_service']), axis=1)
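
On larger frames, a vectorized alternative avoids calling a Python function for every row. The following is only a sketch with made-up sample data; it assumes work_service may hold non-string values (hence astype(str)) and uses Series.isin and Series.where, which keep the combined string where the condition holds and fall back to repair_location elsewhere:

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's `master` frame.
master = pd.DataFrame(
    {
        "repair_location": ["Field", "Depot", "Other"],
        "work_service": ["Install", "Repair", "Inspect"],
    }
)

# Build the concatenated string for every row in one vectorized step.
combined = master["repair_location"] + " " + master["work_service"].astype(str)

# Rows whose location is 'Field' or 'Depot' get the combined string.
needs_service = master["repair_location"].isin(["Field", "Depot"])

# where() keeps `combined` where the mask is True, otherwise uses the fallback.
master["work_action"] = combined.where(needs_service, master["repair_location"])

print(master)
```

The list built in the question can also be assigned directly with master['work_action'] = test, since a list of the same length as the frame is a valid column.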

Related

How to display apriori algorithm dataframe in flask

I generate recommendations with the apriori algorithm and store its results in a dataframe held in the Result variable. Result has three columns: Rule, Support and Confidence, as shown below.
Result = pd.DataFrame(columns=['Rule', 'Support', 'Confidence'])
for idx, elem in enumerate(association_results):
    # print(elem)
    thiselem = elem
    # print("1 ", thiselem)
    nextelem = association_results[(idx + 1) % len(association_results)]
    # r1 = [x for x in thiselem[0]]
    # r2 = [x for x in nextelem[0]]
    # print("rule: ", r1[0], r2[0])
    # print("sup: ", elem[1])
    # print("conf: ", elem[2][0][2])
    Result = Result.append({
        'Rule': str([str(x) for x in thiselem[0]]) + " -> " + str([str(x) for x in nextelem[0]]),
        'Support': str(round(elem[1] * 100, 2)) + '%',
        'Confidence': str(round(elem[2][0][2] * 100, 2)) + '%'
    }, ignore_index=True)
I display the resulting dataframe in Flask. This is the route:
@app.route('/rekomendasi', methods=['POST'])
def rekomendasi():
    sup = request.form.get('support')
    conf = request.form.get('confidence')
    # Model
    store_data = pd.read_csv('dataPekerjaan.csv', sep=',', header=None, error_bad_lines=False)
    records = []
    # split the data into lists
    for i in range(store_data.shape[0]):
        records.append([str(store_data.values[i, j]).split(',') for j in range(store_data.shape[1])])
    # keep only the job-name data
    dataKerja = [[] for dataKerja in range(len(records))]
    for i in range(len(records)):
        for j in records[i][1]:
            dataKerja[i].append(j)
    # dataKerja
    min_sup = float('0.00' + str(sup))
    min_conf = float('0.00' + str(conf))
    association_rules = apriori(dataKerja, min_support=min_sup, min_confidence=min_conf, min_length=2)
    association_results = list(association_rules)
    # display the association results
    pd.set_option('max_colwidth', 200)
    result = pd.DataFrame(columns=['Rule', 'Support', 'Confidence'])
    for item in association_results:
        pair = item[2]
        for i in pair:
            result = result.append({
                'Rule': str([x for x in i[0]]) + " -> " + str([x for x in i[1]]),
                'Support': str(round(item[1] * 100, 2)) + '%',
                'Confidence': str(round(i[2] * 100, 2)) + '%'
            }, ignore_index=True)
    return render_template('rekomendasi.html', name='made', sup=sup, conf=conf, len=len(result)-1, query=result)
and this is the rekomendasi.html
{%for i in range(len) %}
<tr>
<th>{{i+1}}</th>
<td>{{query['Rule'][i]}}</td>
<td>{{query['Support'][i]}}</td>
<td>{{query['Confidence'][i]}}</td>
</tr>
{%endfor%}
I want the Rule column to be displayed without the [' '] brackets and quotes, but when the Flask app runs they still appear around every item.
Is there any way to fix this?

You can use str.join to join the items of a list with a delimiter. Here the delimiter is ", ", so your code becomes:
...
for item in association_results:
    pair = item[2]
    for i in pair:
        result = result.append({
            'Rule': ", ".join([x for x in i[0]]) + " -> " + ", ".join([x for x in i[1]]),
            'Support': str(round(item[1] * 100, 2)) + '%',
            'Confidence': str(round(i[2] * 100, 2)) + '%'
        }, ignore_index=True)
...
This removes the quotes and the brackets from the Rule column.
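
DataFrame.append was removed in pandas 2.0, so on recent versions the loop above needs a different way to collect rows. The sketch below gathers plain dicts in a list and builds the frame once at the end. The association_results value is a hypothetical stand-in shaped the way the question's loop indexes it (items, support, then a list of (base, add, confidence) entries); it is not real apriori output:

```python
import pandas as pd

# Hypothetical stand-in for the apriori results, shaped like the
# question's loop expects: item[1] is support, item[2] holds the
# (base items, added items, confidence) entries.
association_results = [
    (
        frozenset({"designer", "writer"}),
        0.25,
        [(frozenset({"designer"}), frozenset({"writer"}), 0.8)],
    ),
]

rows = []
for item in association_results:
    for stat in item[2]:
        rows.append({
            # join() yields "a, b" instead of "['a', 'b']"
            "Rule": ", ".join(sorted(stat[0])) + " -> " + ", ".join(sorted(stat[1])),
            "Support": str(round(item[1] * 100, 2)) + "%",
            "Confidence": str(round(stat[2] * 100, 2)) + "%",
        })

# Build the DataFrame once instead of appending row by row.
result = pd.DataFrame(rows, columns=["Rule", "Support", "Confidence"])
print(result)
```

Sorting the items is an extra choice made here so the rule text is deterministic; drop sorted() to keep the original set order.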

Why does my python loop return Key Error : 0 when I change the input dataframe?

I'm trying to do an iterative calculation that stores the result of each iteration by appending it to a list, which I then turn into a dataframe.
However, when I change the input dataframe to a different one, I get KeyError: 0.
Here is my complete code:
d = []
df_it = df_ofr
i = 0
last_col = len(df_it.iloc[:,3:].columns) - 1
print("User Group : " + df_it[['user_type'][0]][0] + " " + df_it[['user_status'][0]][0])
for column in df_it.iloc[:,3:]:
    if i > 0 :
        if i < last_col: # 1 step conversion
            convert_baseline = df_it[[column][0]][0]
            convert_variant_a = df_it[[column][0]][1]
        elif i == last_col: # end to end conversion
            convert_baseline = df_it[[column][0]][0]
            convert_variant_a = df_it[[column][0]][1]
            lead_baseline = step_1_baseline
            lead_variant_a = step_1_variant_a
        #perform proportion z test
        test_stat, p_value = proportions_ztest([convert_baseline,convert_variant_a], [lead_baseline,lead_variant_a], alternative='smaller')
        #perform bayesian ab test
        #initialize a test
        test = BinaryDataTest()
        #add variant using aggregated data
        test.add_variant_data_agg("Baseline", totals=lead_baseline, positives=convert_baseline)
        test.add_variant_data_agg("Variant A", totals=lead_variant_a, positives=convert_variant_a)
        bay_result = test.evaluate(seed=99)
        #append result
        d.append(
            {
                'Convert into': column,
                '# Users Baseline': lead_baseline,
                '# Users Variant A': lead_variant_a,
                '% CVR Baseline' : convert_baseline / lead_baseline,
                '% CVR Variant A' : convert_variant_a / lead_variant_a,
                'Z Test Stat' : test_stat,
                'P-Value' : p_value,
                'Prob Baseline being the Best' : bay_result[0]['prob_being_best'],
                'Prob Variant A being the Best' : bay_result[1]['prob_being_best']
            }
        )
    elif i == 0:
        step_1_baseline = df_it[[column][0]][0]
        step_1_variant_a = df_it[[column][0]][1]
    i = i+1
    lead_baseline = df_it[[column][0]][0]
    lead_variant_a = df_it[[column][0]][1]
pd.DataFrame(d)
The part I'm trying to change is this line:
df_it = df_ofr
thanks for your help, really appreciate it
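
One likely cause, assuming the other input dataframe was produced by filtering or slicing a larger one: df_it[column][0] looks up the index label 0, not the first row. If the filtered frame's index no longer contains the label 0, pandas raises KeyError: 0. The sketch below reproduces this with made-up data (df_all, df_other and the column names are hypothetical) and shows two ways around it, positional access with .iloc or resetting the index:

```python
import pandas as pd

# Hypothetical data: filtering keeps the original index labels (2 and 3).
df_all = pd.DataFrame(
    {"user_type": ["new", "new", "old", "old"], "leads": [100, 120, 80, 90]}
)
df_other = df_all[df_all["user_type"] == "old"]

error = None
try:
    df_other["leads"][0]  # label-based lookup: there is no label 0 here
except KeyError as exc:
    error = repr(exc)

# Option 1: positional access, independent of the index labels.
first_by_position = df_other["leads"].iloc[0]

# Option 2: renumber the index from 0 so label-based [0] works again.
df_fixed = df_other.reset_index(drop=True)
first_by_label = df_fixed["leads"][0]

print(error, first_by_position, first_by_label)
```

If this is the cause, df_it = df_other.reset_index(drop=True) at the top of the loop code would be the smallest change.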

Using DataFrame.query() with pandas.Series.str.contains gets AttributeError: 'dict' object has no attribute 'append'

I created two query strings using str.contains(), combined them, and passed the result to DataFrame.query().
I get an AttributeError: 'dict' object has no attribute 'append'.
Removing the regex=False parameter works, but my data contains some '/*'.
So I need this parameter to treat my search string as a literal string.
import pandas as pd
df = pd.DataFrame({'city': ['osaka', ], 'food': ['apple', ],})
querystr1 = 'osaka'
querystr2 = 'apple'
querystr1 = "city.str.contains('" + querystr1 + "', regex=False)"
querystr2 = "food.str.contains('" + querystr2 + "', regex=False)"
querystr = querystr1 + ' & ' + querystr2
print(querystr)
value = df.query(querystr, engine='python').index.values.astype(int)
print(value)
print(value.size)
How can I query my dataframe without the search string being interpreted as a regular expression?
Is there a smarter way to do this?

I'm no expert, but removing the regex=False part worked for me:
import pandas as pd
df = pd.DataFrame({'city': ['osaka', ], 'food': ['apple', ],})
querystr1 = 'osaka'
querystr2 = 'apple'
querystr1 = "city.str.contains('" + querystr1 + "')"
querystr2 = "food.str.contains('" + querystr2 + "')"
querystr = querystr1 + ' & ' + querystr2
print(querystr)
value = df.query(querystr, engine='python').index.values.astype(int)
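
If the literal match is what matters, one option is to skip DataFrame.query() and combine boolean masks directly, where regex=False is passed as an ordinary keyword argument. This is a sketch with an extra made-up row containing '/*' so the literal behaviour is visible:

```python
import pandas as pd

# The second row is hypothetical, added to exercise the '/*' case.
df = pd.DataFrame({"city": ["osaka", "tokyo/*"], "food": ["apple", "pear"]})

querystr1 = "/*"
querystr2 = "pear"

# regex=False makes str.contains do a plain substring search,
# so '*' is not treated as a regex quantifier.
mask = (
    df["city"].str.contains(querystr1, regex=False)
    & df["food"].str.contains(querystr2, regex=False)
)

value = df[mask].index.values.astype(int)
print(value)
print(value.size)
```

Masks also avoid building an expression by string concatenation, which breaks as soon as a search string contains a quote character.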

Difference between masking and querying pandas.DataFrame

My example shows that, for a DataFrame of floats, querying might in certain cases be faster than using masks. Looking at the graph (edit, thanks to a_guest), the mask approach performs better when the condition is composed of 1 to 5 subconditions.
Is there any real difference between the two methods, given that both follow the same trend as the number of subconditions grows?
The function used to plot my data:
import matplotlib.pyplot as plt
def graph(data):
    t = [int(i) for i in range(1, len(data["mask"]) + 1)]
    plt.xlabel('Number of conditions')
    plt.ylabel('timeit (ms)')
    plt.title('Benchmark mask vs query')
    plt.grid(True)
    plt.plot(t, data["mask"], 'r', label="mask")
    plt.plot(t, data["query"], 'b', label="query")
    plt.xlim(1, len(data["mask"]))
    plt.legend()
    plt.show()
The functions used to create the conditions to be tested by timeit:
def create_multiple_conditions_mask(columns, nb_conditions, condition):
    mask_list = []
    for i in range(nb_conditions):
        mask_list.append("(df['" + columns[i] + "']" + " " + condition + ")")
    return " & ".join(mask_list)
def create_multiple_conditions_query(columns, nb_conditions, condition):
    mask_list = []
    for i in range(nb_conditions):
        mask_list.append(columns[i] + " " + condition)
    return "'" + " and ".join(mask_list) + "'"
The function to benchmark masking vs querying using a pandas DataFrame containing floats:
from timeit import timeit
import numpy as np
import pandas as pd

def benchmarks_mask_vs_query(dim_df=(50,10), labels=[], condition="> 0", random=False):
    # init local variable
    time_results = {"mask": [], "query": []}
    nb_samples, nb_columns = dim_df
    all_labels = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
    if nb_columns > 26:
        if len(labels) == nb_columns:
            all_labels = labels
        else:
            raise Exception("labels length must match nb_columns" )
    df = pd.DataFrame(np.random.randn(nb_samples, nb_columns), columns=all_labels[:nb_columns])
    for col in range(nb_columns):
        if random:
            condition = "<" + str(np.random.random(1)[0])
        mask = "df[" + create_multiple_conditions_mask(df.columns, col+1, condition) + "]"
        query = "df.query(" + create_multiple_conditions_query(df.columns, col+1, condition) + ")"
        print("Parameters: nb_conditions=" + str(col+1) + ", condition= " + condition)
        print("Mask created: " + mask)
        print("Query created: " + query)
        print()
        # total seconds for 100 runs, times 10, gives milliseconds per run
        result_mask = timeit(mask, number=100, globals=locals()) * 10
        result_query = timeit(query, number=100, globals=locals()) * 10
        time_results["mask"].append(result_mask)
        time_results["query"].append(result_query)
    return time_results
What I run:
# benchmark on a DataFrame of shape(50,25) populating with random values
# as well as the conditions ("<random_value")
data = benchmarks_mask_vs_query((50,25), random=True)
graph(data)
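What I get: a line plot titled "Benchmark mask vs query", with the number of conditions on the x-axis and timeit (ms) on the y-axis (image not preserved).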
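
Whatever the timings show, the two forms are expected to select the same rows, so the difference is in how the condition is evaluated rather than in the result. A small check, using a seeded random frame that is only illustrative and not the benchmark data above:

```python
import numpy as np
import pandas as pd

# Illustrative frame; the seed is arbitrary and only makes the run repeatable.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((50, 3)), columns=list("ABC"))

# Boolean masking: each comparison builds a full boolean Series first.
masked = df[(df["A"] > 0) & (df["B"] > 0)]

# query(): the same condition, parsed from a string expression.
queried = df.query("A > 0 and B > 0")

# Both approaches should return identical rows and index labels.
same = masked.equals(queried)
print(same)
```

On a frame as small as 50 rows, the cost of parsing the query string is likely to dominate, which would be consistent with masks being faster for few subconditions; that reading is an assumption, not something the benchmark above isolates.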

Python database operation error

I am using Python to query a PostgreSQL database. When my code builds the SQL statement, the quotation marks around the values are dropped, so the query fails. How can I avoid this?
def build_sql(self,table_name,keys,condition):
    print(condition)
    # condition = {
    #     "os":["Linux","Windows"],
    #     "client_type":["ordinary"],
    #     "client_status":'1',
    #     "offset":"1",
    #     "limit":"8"
    # }
    sql_header = "SELECT %s FROM %s" % (keys,table_name)
    sql_condition = []
    sql_range = []
    sql_sort = []
    sql_orederby = []
    for key in condition:
        if isinstance(condition[key],list):
            sql_condition.append(key+" in ("+",".join(condition[key])+")")
        elif key == 'limit' or key == 'offset':
            sql_range.append(key + " " + condition[key])
        else:
            sql_condition.append(key + " = " + condition[key])
    print(sql_condition)
    print(sql_range)
    sql_condition = [str(i) for i in sql_condition]
    if not sql_condition == []:
        sql_condition = " where " + " and ".join(sql_condition) + " "
    sql = sql_header + sql_condition + " ".join(sql_range)
    return sql
Error:
MySQL Error Code : column "winxp" does not exist
LINE 1: ...T * FROM ksc_client_info where base_client_os in (WinXP) and...

Mind you, I do not have much Python experience, but the values in that list are joined without single quotes, so PostgreSQL reads WinXP as a column name. You need to add the quotes either before passing the list to the function or during join(), like this:
sql_condition.append(key+" in ("+"'{0}'".format("','".join(condition[key]))+")")
You can see other solutions in those questions:
Join a list of strings in python and wrap each string in quotation marks
Add quotes to every list elements
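
Adding the quotes by hand works, but it leaves the query open to SQL injection when the values come from user input. A safer pattern is to emit placeholders and pass the values separately so the driver does the quoting. The sketch below is a standalone rewrite of build_sql and not the asker's class method; it assumes a driver that uses %s placeholders (psycopg2 does), and table and column names are still interpolated, so they must come from trusted code:

```python
def build_sql(table_name, keys, condition):
    """Return (sql, params) with %s placeholders instead of inlined values."""
    clauses = []  # WHERE fragments such as "os IN (%s, %s)"
    params = []   # values in the same order as the placeholders
    ranges = []   # LIMIT / OFFSET fragments

    for key, value in condition.items():
        if key in ("limit", "offset"):
            # int() rejects anything that is not a plain number
            ranges.append(f"{key.upper()} {int(value)}")
        elif isinstance(value, list):
            placeholders = ", ".join(["%s"] * len(value))
            clauses.append(f"{key} IN ({placeholders})")
            params.extend(value)
        else:
            clauses.append(f"{key} = %s")
            params.append(value)

    sql = f"SELECT {keys} FROM {table_name}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    if ranges:
        sql += " " + " ".join(ranges)
    return sql, params


sql, params = build_sql(
    "ksc_client_info",
    "*",
    {"base_client_os": ["WinXP", "Linux"], "client_status": "1",
     "limit": "8", "offset": "1"},
)
print(sql)
print(params)
```

With a database cursor, the pair would then be executed as cursor.execute(sql, params), leaving the quoting of 'WinXP' to the driver.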
