Creating dataframe columns within a for loop - python

I'm having a hard time figuring out how to create a data frame within a for loop.
df = pd.DataFrame()
for sym in sorted(snapshot):
    for lp in sorted(snapshot[sym]):
        df['trader'] = lp
        df['bid'] = snapshot[sym][lp][":b"]["LUC"]["price"] if ":b" in snapshot[sym][lp] else "0"
        df['ask'] = snapshot[sym][lp][":a"]["LUC"]["price"] if ":a" in snapshot[sym][lp] else "0"
print df
print df['trader']
Printing 'df' results in Columns: [trader, bid, ask] Index: []
Printing 'df['trader'] results in Series([], Name: bid, dtype: object)
If I change the df[column headings] to assignments, everything prints fine.
I'm trying to create a df that looks like this:
  trader   bid   ask
0    MM2  1.25  1.26
1    MM5  1.23  1.27
2    MM3  1.25  1.28
....
Thanks for all the help

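It's hard to tell from your question exactly what is going on and what data you have. However, your code overwrites the whole column on each iteration of the for loop (and because the frame has no rows yet, the scalar you assign is not stored anywhere). You could assign through loc with an explicit row index to avoid that: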
df = pd.DataFrame()
idx = 0  # running row label, incremented once per (sym, lp) pair
for sym in sorted(snapshot):
    for lp in sorted(snapshot[sym]):
        df.loc[idx, 'trader'] = lp
        df.loc[idx, 'bid'] = snapshot[sym][lp][":b"]["LUC"]["price"] if ":b" in snapshot[sym][lp] else "0"
        df.loc[idx, 'ask'] = snapshot[sym][lp][":a"]["LUC"]["price"] if ":a" in snapshot[sym][lp] else "0"
        idx += 1
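An alternative that avoids growing the frame one cell at a time is to collect one dict per row and build the DataFrame once at the end. The sketch below is only an illustration: the contents of snapshot are hypothetical sample data shaped like the lookups in the question, and the "0" string default is kept from the original code.

```python
import pandas as pd

# Hypothetical sample data, shaped like the lookups used in the question.
snapshot = {
    "EURUSD": {
        "MM2": {":b": {"LUC": {"price": 1.25}}, ":a": {"LUC": {"price": 1.26}}},
        "MM5": {":b": {"LUC": {"price": 1.23}}},  # no ask side for this trader
    }
}

rows = []
for sym in sorted(snapshot):
    for lp in sorted(snapshot[sym]):
        quotes = snapshot[sym][lp]
        # One dict per output row; missing sides fall back to "0" as in the question.
        rows.append({
            "trader": lp,
            "bid": quotes[":b"]["LUC"]["price"] if ":b" in quotes else "0",
            "ask": quotes[":a"]["LUC"]["price"] if ":a" in quotes else "0",
        })

# Build the frame in one step; the columns argument fixes the column order.
df = pd.DataFrame(rows, columns=["trader", "bid", "ask"])
print(df)
```

Building from a list of rows is usually faster than repeated loc assignments, because the frame is allocated only once.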

Related

Pandas add a new column with a string where the cell match a particular condition

I'm trying to apply Pandas style to my dataset and add a column with a string with the matching result.
This is what I want to achieve:
Link
Below is my code. An expert from Stack Overflow helped me apply df.style, so based on my tests I believe the df.style part is correct. However, how can I run iterrows(), check the cell in each column, and store a string in the new column 'check'? I have been trying to debug this but cannot get it to display what I want. Thank you so much.
df = pd.DataFrame([[10,3,1], [3,7,2], [2,4,4]], columns=list("ABC"))
df['check'] = None
def highlight(x):
    c1 = 'background-color: yellow'
    m = pd.concat([(x['A'] > 6), (x['B'] > 2), (x['C'] < 3)], axis=1)
    df1 = pd.DataFrame('', index=x.index, columns=x.columns)
    return df1.mask(m, c1)
def check(v):
    for index, row in v[[A]].iterrows():
        if row[A] > 6:
            A_check = f'row:{index},' + '{0:.1f}'.format(row[A]) + ">6"
            return A_check
    for index, row in v[[B]].iterrows():
        if row[B] > 2:
            B_check = f'row:{index}' + '{0:.1f}'.format(row[B]) + ">2"
            return B_check
    for index, row in v[[C]].iterrows():
        if row[C] < 3:
            C_check = f'row:{index}' + '{0:.1f}'.format(row[C]) + "<3"
            return C_check
df['check'] = df.apply(lambda v: check(v), axis=1)
df.style.apply(highlight, axis=None)
This is the error message I got:
NameError: name 'A' is not defined
My understanding is that the following produces what you are trying to achieve with the check function:
def check(v):
    row_str = 'row:{}, '.format(v.name)
    checks = []
    if v['A'] > 6:
        checks.append(row_str + '{:.1f}'.format(v['A']) + ">6")
    if v['B'] > 2:
        checks.append(row_str + '{:.1f}'.format(v['B']) + ">2")
    if v['C'] < 3:
        checks.append(row_str + '{:.1f}'.format(v['C']) + "<3")
    return '\n'.join(checks)
df['check'] = df.apply(check, axis=1)
Result (print(df)):
    A  B  C                                      check
0  10  3  1  row:0, 10.0>6\nrow:0, 3.0>2\nrow:0, 1.0<3
1   3  7  2                 row:1, 7.0>2\nrow:1, 2.0<3
2   2  4  4                               row:2, 4.0>2
(Replace \n with ' ' if you don't want the line breaks in the result.)
The axis=1 option in apply gives the function check one row of df as a Series with the column names of df as index (-> v). With v.name you'll get the corresponding row index. Therefore I don't see the need to use .iter.... Did I miss something?
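To make the point about axis=1 concrete, here is a small sketch that only reports what apply hands to the function. The helper name describe is made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame([[10, 3, 1], [3, 7, 2], [2, 4, 4]], columns=list("ABC"))

def describe(v):
    # With axis=1, v is one row of df as a Series indexed by the column names;
    # v.name holds that row's index label.
    return f"{type(v).__name__} name={v.name} index={list(v.index)}"

described = df.apply(describe, axis=1)
print(described[0])  # Series name=0 index=['A', 'B', 'C']
```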
There are a few mistakes in the program, which we will fix one by one.
Import pandas
import pandas as pd
In the function check(v), the names A, B and C are not defined; replace them with the strings 'A', 'B' and 'C'. Because apply is called with axis=1, v is a single row (a Series), so v[['A']] is a one-element Series. To iterate over a Series use items() (spelled iteritems() in older pandas versions and removed in pandas 2.0) rather than iterrows(). Note that the index it yields is the column name, not the row number. Making these replacements gives:
def check(v):
    truth = []
    for index, row in v[['A']].items():
        if row > 6:
            A_check = f'row:{index},' + '{0:.1f}'.format(row) + ">6"
            truth.append(A_check)
    for index, row in v[['B']].items():
        if row > 2:
            B_check = f'row:{index}' + '{0:.1f}'.format(row) + ">2"
            truth.append(B_check)
    for index, row in v[['C']].items():
        if row < 3:
            C_check = f'row:{index}' + '{0:.1f}'.format(row) + "<3"
            truth.append(C_check)
    return '\n'.join(truth)
This should give the expected output, although you also need to add logic so that the check column doesn't get the yellow color. This answer makes minimal changes, but I recommend trying axis=1 in the style call, which applies the style function one row at a time, as it seems more convenient. You can also refer to the pandas style guide.
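One possible way to keep the check column uncolored is to write the yellow style only into the columns that the mask actually covers. This is a sketch that reuses the thresholds from the question, not the original author's solution. The Styler call is left as a comment because rendering styles needs the optional jinja2 dependency.

```python
import pandas as pd

df = pd.DataFrame([[10, 3, 1], [3, 7, 2], [2, 4, 4]], columns=list("ABC"))
df["check"] = ""  # placeholder for the text column built elsewhere

def highlight(x):
    c1 = "background-color: yellow"
    # Boolean mask with columns A, B, C only (same conditions as the question).
    mask = pd.concat([x["A"] > 6, x["B"] > 2, x["C"] < 3], axis=1)
    # Start with no styling anywhere, including the 'check' column.
    styles = pd.DataFrame("", index=x.index, columns=x.columns)
    # Fill in the style only for the columns present in the mask.
    styles[mask.columns] = styles[mask.columns].mask(mask, c1)
    return styles

styles = highlight(df)
print(styles)
# To render: df.style.apply(highlight, axis=None)
```

Another option worth trying is the subset argument of Styler.apply, for example subset=['A', 'B', 'C'], so that the function never sees the check column at all.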

modifying the dataframe column and get unexpected results

I have a dataframe listed like below:
The full data actually has 120000 rows and 20000 users; this is just one user. For every user I need to make sure the predictions contain exactly three "1" values and three "0" values.
I wrote the following function to do that:
def check_prediction_quality(df):
    df_n = df.copy()
    unique = df_n['userID'].unique()
    for i in range(len(unique)):
        ex_df = df[df['userID']== unique[i]]
        v = ex_df['prediction'].tolist()
        v_bool = [i == 0 for i in v]
        if sum(v_bool) != 3:
            if sum(v_bool) > 3:
                res = [i for i,val in enumerate(v_bool) if val]
                diff = sum(v_bool) - 3
                for i in range(diff):
                    idx = np.random.choice(res,1)[0]
                    v[idx] = float(1)
                    res.remove(idx)
            elif sum(v_bool) < 3:
                res = [i for i,val in enumerate(v_bool) if not val]
                diff = 3 - sum(v_bool)
                for i in range(diff):
                    idx = np.random.choice(res,1)[0]
                    v[idx] = float(0)
                    res.remove(idx)
        for j in range(len(v)):
            df_n.loc[(0+i*6)+j:(6+i*6)+j,'prediction'] = v[j]
    return df_n
However, when I check whether the numbers of "0" and "1" are the same, it turns out they are not. I am not sure what I did wrong.
sum([i == 0 for i in df['prediction']])
should be six using the below example, but when I run on my 120000 dataframe, it does not have 60000 on each
data = {'userID': [199810,199810,199810,199810,199810,199810,199812,199812,199812,199812,199812,199812],
        'trackID':[1,2,3,4,5,6,7,8,9,10,11,12],
        'prediction':[0,0,0,0,1,1,1,1,1,1,0,0]
        }
df = pd.DataFrame(data = data)
df
Much appreciated!
When working with pandas dataframes you should reassign the post-processed Dataframe to the old one.
df = pd.DataFrame(np.array(...))
# Reassignment:
df.loc[:,3:5] = df.loc[:,3:5]*10  # multiplies the columns labelled 3 to 5 by 10
Actually never mind. I found out I don't have to modify the "0" and "1"..
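For reference, the per-user balancing described in the question could be sketched with groupby, which avoids both the reused loop variable i and the manual row arithmetic in the write-back step. The function name balance_predictions and the seed parameter are made up for this illustration, and the sketch assumes every user has enough rows of each kind to flip.

```python
import numpy as np
import pandas as pd

def balance_predictions(df, n_zeros=3, seed=0):
    """Return a copy in which every user has exactly n_zeros predictions equal to 0."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    # groups maps each userID to the index labels of that user's rows.
    for _, idx in out.groupby("userID").groups.items():
        preds = out.loc[idx, "prediction"].to_numpy().copy()
        zeros = np.flatnonzero(preds == 0)
        ones = np.flatnonzero(preds != 0)
        if len(zeros) > n_zeros:
            # Too many zeros: flip randomly chosen surplus zeros to 1.
            flip = rng.choice(zeros, size=len(zeros) - n_zeros, replace=False)
            preds[flip] = 1
        elif len(zeros) < n_zeros:
            # Too few zeros: flip randomly chosen ones to 0.
            flip = rng.choice(ones, size=n_zeros - len(zeros), replace=False)
            preds[flip] = 0
        out.loc[idx, "prediction"] = preds
    return out

data = {"userID": [199810] * 6 + [199812] * 6,
        "trackID": list(range(1, 13)),
        "prediction": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0]}
balanced = balance_predictions(pd.DataFrame(data))
zeros_per_user = balanced.groupby("userID")["prediction"].apply(lambda s: int((s == 0).sum()))
print(zeros_per_user)
```

Which rows get flipped depends on the random seed, but the count of zeros per user does not.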

Is there a way to optimize this code in order to run faster?

Hi there. I am working on an application and I am using this piece of code to create new columns in a data frame so I can make some calculations. However, it is really slow and I would like to try a new approach.
I have read about multiprocessing, but I am not sure how and where to use it, so I am asking for your help.
def create_exposed_columns(df):
    df['MONTH_INITIAL_DATE'] = df['INITIAL_DATE'].dt.to_period(
        'M')
    df['MONTH_FINAL_DATE'] = df['FINAL_DATE'].dt.to_period(
        'M')
    df['Diff'] = df['MONTH_FINAL_DATE'] - df['MONTH_INITIAL_DATE']
    list_1 = []
    for index, row in df.iterrows():
        valor = 1
        initial_date = row['INITIAL_DATE']
        diff = row['Diff']
        temporal_list = {}
        list_1.append(temporal_list)
        for i in range(meses_iterables + 1):
            date = initial_date + relativedelta(months=+1 * i)
            if len(str(date.month)) == 1:
                value = {str(date.year) + '-0' + str(date.month): valor}
                temporal_list.update(value)
            else:
                value = {str(date.year) + '-' + str(date.month): valor}
                temporal_list.update(value)
    df_2 = pd.DataFrame(list_1)
    df = df.reset_index()
    df = pd.concat([df, df_2], axis=1)
    return df
I have no idea where to start, so any kind of help will be useful.
Thanks
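One direction that may help before reaching for multiprocessing is to loop over the distinct months instead of over the rows, since there are usually far fewer months than rows. The sketch below rests on an assumption: meses_iterables is not defined in the question, so it is taken here to be the number of months between INITIAL_DATE and FINAL_DATE (the Diff column). The function name create_exposed_columns_fast is made up, and only the per-month columns are built.

```python
import pandas as pd

def create_exposed_columns_fast(df):
    out = df.reset_index(drop=True)
    start = out["INITIAL_DATE"].dt.to_period("M")
    end = out["FINAL_DATE"].dt.to_period("M")
    # Every calendar month touched by at least one row.
    months = pd.period_range(start.min(), end.max(), freq="M")
    cols = {}
    for m in months:
        covered = (start <= m) & (end >= m)  # one vectorized comparison per month
        # 1.0 where the row spans month m, NaN elsewhere (as in the original output).
        cols[str(m)] = covered.astype(float).where(covered)
    exposed = pd.DataFrame(cols, index=out.index)
    return pd.concat([out, exposed], axis=1)

# Hypothetical sample data.
sample = pd.DataFrame({
    "INITIAL_DATE": pd.to_datetime(["2020-01-15", "2020-02-01"]),
    "FINAL_DATE": pd.to_datetime(["2020-03-10", "2020-02-20"]),
})
result = create_exposed_columns_fast(sample)
print(result)
```

If the date ranges really are meant to be cut off by some other value of meses_iterables, the comparison inside the loop would need to change accordingly.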

Build table from for loop values

I have a for loop that does calculations from multiple columns in a dataframe with multiple criteria that prints float values I need to arrange in a table.
demolist = ['P13+', 'P18-34']
impcount = ['<1M', '1-5M']
for imp in impcount:
    print(imp)
    for d in demolist:
        print(d)
        target_ua = df.loc[(df['target'] == d) & (df['IMP Count'] == imp), 'in_target_ua_digital'].sum()
        target_pop = df.loc[(df['target'] == d) & (df['IMP Count'] == imp), 'in_target_pop'].sum()
        target_reach = target_ua / target_pop
        print(target_reach)
The output looks like this:
<1M
P13+
0.10
P18-34
0.12
1-5M
P13+
0.92
P18-34
0.53
The code is working correctly, but I need the output to be arranged in a new dataframe with impcount in the columns and demolist in the rows
         <1M  1-5M
P13+    0.10  0.92
P18-34  0.12  0.53
It is just a matter of how you arrange your data. A table is a 2D data structure, which in Python is often represented as a list of lists (or tuples), e.g. [[1,2], [3, 4]]. In your case, you could collect the data row by row: build a list (or tuple) holding the values of one row, then append each row to an outer list, which gives the list of lists (the table).
Here is an example showing how to form a table when the value of each cell can be calculated (here it is just a random value):
In [53]: x = list('abc')
    ...: y = list('123')
    ...:
    ...: data=[]
    ...: for i in x:
    ...:     row=[]
    ...:     for j in y:
    ...:         row.append(np.random.rand())
    ...:     data.append(row)
    ...:
    ...: df = pd.DataFrame(data, index=x, columns=y)
    ...:
In [54]: df
Out[54]:
          1         2         3
a  0.107659  0.840387  0.642285
b  0.184508  0.641443  0.475105
c  0.503608  0.379945  0.933735
Try this:
demolist = ['P13+', 'P18-34']
impcount = ['<1M', '1-5M']
# Header row: one column per IMP Count bucket
print('\t' + '\t'.join(impcount))
# One output row per demo, with one value per IMP Count bucket
for d in demolist:
    demo_str = d + '\t'
    for imp in impcount:
        target_ua = df.loc[(df['target'] == d) & (df['IMP Count'] == imp), 'in_target_ua_digital'].sum()
        target_pop = df.loc[(df['target'] == d) & (df['IMP Count'] == imp), 'in_target_pop'].sum()
        target_reach = target_ua / target_pop
        demo_str += str(target_reach) + '\t'
    print(demo_str.rstrip())
Hope this helps!
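Since the question asks for a new dataframe rather than printed text, another possible approach is to let pandas do the grouping and then pivot the result with unstack. The column names below come from the question; the rows of df are hypothetical sample data chosen for illustration.

```python
import pandas as pd

demolist = ["P13+", "P18-34"]
impcount = ["<1M", "1-5M"]

# Hypothetical sample data using the column names from the question.
df = pd.DataFrame({
    "target": ["P13+", "P13+", "P18-34", "P18-34", "P13+"],
    "IMP Count": ["<1M", "1-5M", "<1M", "1-5M", "<1M"],
    "in_target_ua_digital": [10, 46, 6, 53, 10],
    "in_target_pop": [100, 50, 50, 100, 100],
})

# Sum both measures once per (target, IMP Count) pair.
sums = df.groupby(["target", "IMP Count"])[["in_target_ua_digital", "in_target_pop"]].sum()
reach = sums["in_target_ua_digital"] / sums["in_target_pop"]

# Move IMP Count into the columns, then put rows and columns in the requested order.
table = reach.unstack("IMP Count").reindex(index=demolist, columns=impcount)
print(table)
```

The reindex step also drops any targets or IMP Count buckets that are not listed in demolist and impcount.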

Looping through a python pivot table

I have a pivot table that I have created (pivotTable) using:
pivotTable= dayData.pivot_table(index=['sector'], aggfunc='count')
which has produced the following pivot table:
                sector  id
broad_sector
Communications       2   2
Utilities            3   3
Media                3   3
Could someone let me know if there is a way to loop through the pivot table, assigning the index value and sector total to the respective variables sectorName and sectorCount?
I have tried:
i=0
while i <= lenPivotTable:
    sectorName = sectorPivot.index.get_level_values(0)
    sectorNumber = sectorPivot.index.get_level_values(1)
    i=i+1
to return for the first loop iteration:
sectorName = 'Communications'
sectorCount = 2
for the second loop iteration:
sectorName = 'Utilities'
sectorCount = 3
for the third loop iteration:
sectorName = 'Media'
sectorCount = 3
But can't get it to work.
This snippet will get you the values as asked.
for sector_name, sector_count, _ in pivotTable.to_records():
    print(sector_name, sector_count)
Well, I don't understand why you need this (looping through a DataFrame is very slow), but you can do it this way:
In [403]: for idx, row in pivotTable.iterrows():
   .....:     sectorName = idx
   .....:     sectorCount = row['sector']
   .....:     print(sectorName, sectorCount)
   .....:
Communications 2
Utilities 3
Media 3
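If only one count column is needed, iterating over that column with Series.items() is another option, since it yields (index label, value) pairs directly. In this sketch dayData is hypothetical sample data, and the pivot is indexed on broad_sector to match the table shown in the question.

```python
import pandas as pd

# Hypothetical sample data; the real dayData is not shown in the question.
dayData = pd.DataFrame({
    "broad_sector": ["Communications"] * 2 + ["Utilities"] * 3 + ["Media"] * 3,
    "sector": list("abcdefgh"),
    "id": range(8),
})
pivotTable = dayData.pivot_table(index=["broad_sector"], aggfunc="count")

pairs = []
for sectorName, sectorCount in pivotTable["sector"].items():
    # sectorName is the index label, sectorCount the count stored in that row.
    pairs.append((sectorName, int(sectorCount)))
    print(sectorName, sectorCount)
```

Note that pivot_table sorts the index, so the rows come out in alphabetical order here rather than in the order shown in the question.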
