I have a pretty big dataframe of about 1.5 million rows, and I am trying to run the code below in batches of 10,000, then append the results to the "dataset" dataframe. One of the columns, 'subjects', is structured really strangely, so I had to clean it up, but it takes a long time to process. That's why I want to use the k=10,000 batch. Thoughts on the best way to accomplish this?
reuters_set = reuters_set.loc[reuters_set['subjects'].str.contains('P:')]
reuters_set.shape[0]
1590478
reuters_set.subjects.iloc[33] #Example of data in column that needs to be processed
['B:1092', 'B:12', 'B:19', 'B:20', 'B:22', 'B:227', 'B:228', 'B:229', 'B:24', 'G:1', 'G:6', 'G:B1', 'G:K', 'G:S', 'M:1QD', 'M:AV', 'M:B6', 'M:Z', 'R:600058.SS', 'N2:ASIA', 'N2:ASXPAC', 'N2:BMAT', 'N2:BMAT08', 'N2:CMPNY', 'N2:CN', 'N2:EASIA', 'N2:EMRG', 'N2:EQTY', 'N2:IRNST', 'N2:LEN', 'N2:MEMI', 'N2:METWHL', 'N2:MIN', 'N2:MINE', 'N2:MINE08', 'N2:MTAL', 'N2:MTAL08', 'N2:STEE', 'P:4295865030']
import ast
import numpy as np
import pandas as pd

dataset = []
k = 10000
ct = 0
# Testing the first 10,000. It takes really long after this value...
bk = reuters_set.iloc[0:k]
bk.reset_index(inplace=True)
bk['id'] = np.arange(bk.shape[0])
bk['N2'] = ''
bk['P'] = ''
bk['R'] = ''
for index, row in bk.iterrows():
    # Parse the list-like string and split each 'PREFIX:code' entry
    a = [i.split(':') for i in ast.literal_eval(row['subjects'])]
    b = pd.DataFrame(a)
    # Collect the unique codes per prefix
    b = b.groupby(0, as_index=False).agg({1: 'unique'})
    dict_code = dict(zip(b[0], b[1]))
    if 'N2' in dict_code.keys():
        bk.loc[bk['id'] == index, 'N2'] = str(dict_code['N2'].tolist())
    if 'R' in dict_code.keys():
        bk.loc[bk['id'] == index, 'R'] = str(dict_code['R'].tolist())
    if 'P' in dict_code.keys():
        bk.loc[bk['id'] == index, 'P'] = str(dict_code['P'].tolist())
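One way to speed this up is to parse each 'subjects' string once with a small helper applied over the column and concatenate the finished chunks at the end, instead of writing back row by row with .loc. A minimal sketch, assuming reuters_set from above and that each cell is the string form of a list as in the example; parse_subjects and block are names introduced here for illustration:
import ast
import pandas as pd

def parse_subjects(cell):
    # Split one 'subjects' string into N2/R/P code lists (stored as strings)
    codes = {}
    for item in ast.literal_eval(cell):
        prefix, _, value = item.partition(':')
        codes.setdefault(prefix, []).append(value)
    return pd.Series({key: str(codes.get(key, [])) for key in ('N2', 'R', 'P')})

k = 10000
chunks = []
for start in range(0, len(reuters_set), k):
    block = reuters_set.iloc[start:start + k]
    chunks.append(pd.concat([block, block['subjects'].apply(parse_subjects)], axis=1))
dataset = pd.concat(chunks, ignore_index=True)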
I am writing Python code where I have a condition, and while it is true I want the calculations to happen and update the dataframe columns. However, I am noticing that the dataframe is not getting updated and all the values are from the 1st iteration only. Can an expert guide me on where I am going wrong? Below is my sample code -
'''
mbd_out_ub2 = mbd_out_ub1
mbd_out_ub2_len = len(mbd_out_ub2)
plt_mbd_c1_all = pd.DataFrame()
brd2c2_all = pd.DataFrame()
iterc = 1
### plt_mbd_c >> this is the data frame with data before the loop starts
plt_mbd_c0 = plt_mbd_c.copy()
plt_mbd_c0 = plt_mbd_c0[plt_mbd_c0['UB_OUT'] == 1]
while (iterc < 10):
    plt_mbd_c1 = plt_mbd_c0.copy()
    brd2c2 = plt_mbd_c1.groupby('KEY1')['NEST_VAL_PER'].agg([('KEY1_CNT', 'count'), ('PER1c', lambda x: x.quantile(0.75))]).reset_index()
    brd2c2_all = brd2c2_all.append(brd2c2).reset_index(drop=True)
    plt_mbd_c1 = pd.merge(plt_mbd_c1, brd2c2[['KEY1', 'PER1c']], on='KEY1', how='left')
    del brd2c2, plt_mbd_c0
    plt_mbd_c1['NEST_VAL_PER1'] = plt_mbd_c1['PER1c'] * (plt_mbd_c1['EVAL_LP_%'] / 100)
    plt_mbd_c1['NEST_VAL_PER1'] = np.where((plt_mbd_c1['BRD_OUT_FLAG'] == 0), plt_mbd_c1['NEST_VAL'], plt_mbd_c1['NEST_VAL_PER1'])
    plt_mbd_c1['SALESC'] = plt_mbd_c1['NEST_VAL_PER1'] / plt_mbd_c1['PROJR'] / plt_mbd_c1['NEWPRICE']
    plt_mbd_c1['C_SALES_C'] = np.where(plt_mbd_c1['OUT_FLAG'] == 1, plt_mbd_c1['SALESC'], plt_mbd_c1['SALESUNIT'])
    plt_mbd_c1['NEST_VAL_PER'] = plt_mbd_c1['C_SALES_C'] * plt_mbd_c1['PROJR'] * plt_mbd_c1['NEWPRICE']
    plt_mbd_c1['ITER'] = iterc
    plt_mbd_c1_all = plt_mbd_c1_all.append(plt_mbd_c1).reset_index(drop=True)
    plt_mbd_c1.drop(['PER1c'], axis=1, inplace=True)
    plt_mbd_c0 = plt_mbd_c1.copy()
    del plt_mbd_c1
    print("iter = ", iterc)
    iterc = iterc + 1
'''
So above I want to take the 75th percentile of a column by KEY1 and do a few calculations. The idea is that after every iteration my 75th percentile will keep reducing, since I am updating the same column with a calculated value that is lower than the current value (because it is based on the 75th percentile). However, when I check, I find that for all the iterations the values are the same as the 1st iteration only. I have tried deleting the data frames, saving to a temp data frame, and copying the dataframe, but none of it seems to be working.
Please help!!
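To see the intended mechanism in isolation, here is a toy illustration (the column names and the 0.75 scaling factor are stand-ins for the real PER1c * (EVAL_LP_% / 100) logic above): recomputing the per-KEY1 75th percentile and writing a fraction of it back should make the quantile shrink on every pass.
import pandas as pd

df = pd.DataFrame({'KEY1': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'NEST_VAL_PER': [10.0, 20.0, 30.0, 5.0, 15.0, 25.0]})
for iterc in range(1, 4):
    # 75th percentile per group, broadcast back to every row of that group
    per = df.groupby('KEY1')['NEST_VAL_PER'].transform(lambda x: x.quantile(0.75))
    df['NEST_VAL_PER'] = per * 0.75  # stand-in for the EVAL_LP_% scaling
    print(iterc, df.groupby('KEY1')['NEST_VAL_PER'].quantile(0.75).round(2).to_dict())
# the printed quantiles drop on each iteration, which is the behaviour described above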
I need to iterate over row data in a pandas dataframe. However, I am stuck with looping because it spends too much time on millions of rows of data. I think my code is still not optimal.
new_columns = ['alt', 'alt_anomaly']
df_new = pd.DataFrame(columns=new_columns)
loop = 20
idx = 0
for i, row in df.iterrows():
    for alt in range(loop):
        alt_anomaly = df.iloc[i]['alt'] * (400.00)
        df_new.loc[idx] = row.values.tolist() + [alt_anomaly]
        idx += 1
print(df_new)
Use 400 ft as the multiple, so the change grows gradually: the first row changes by 400 ft, the second by 800 ft, and so on. It's like:
row[1] = 27800 + 400
row[2] = 27775 + 800
etc....
Thanks for your help, I appreciate that.
You can do the following without looping:
df['alt_anomaly'] = df['alt'] + (df.index+1)*400
Or use the Pandas .add method:
df['alt'].add((df.index+1)*400)
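For example, on a toy frame (the values here are made up for illustration):
import pandas as pd

df = pd.DataFrame({'alt': [27800, 27775, 27990]})
df['alt_anomaly'] = df['alt'] + (df.index + 1) * 400
# alt_anomaly becomes 28200, 28575, 29190 (i.e. +400, +800, +1200)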
I have a dataframe read from a CSV file. I need to generate new data and add it to the end of the old data.
But it's strange that it shows a totally different result when comparing a small scale and a large scale. I guess it may relate to views, copy() & chained assignment.
I tried 2 options using DataFrame.copy() to avoid potential problems.
First option:
d_jlist = pd.read_csv('127case.csv', sep=',') #got the data shape: (46355,48) from CSV file
d_jlist2 = d_jlist.copy() #Use deep copy, in case of change the raw data
d_jlist3 = pd.DataFrame()
a = np.random.choice(range(5,46350),size = 1000*365) #Select from row 5 to row 46350
for i in a:
    d_jlist3 = d_jlist3.append(d_jlist.iloc[i].copy() + np.random.uniform(-1, 1))
d_jlist3 = d_jlist3.replace(0,0.001,regex=True)
d_jlist3 = d_jlist3.round(3)
d_jlist = d_jlist.append(d_jlist3)
a = consumption.columns.values #Something to do with header
a = a[5:53]
d_jlist.to_csv('1127case_1.csv',header = a,index=False)
Second option:
d_jlist = pd.read_csv('127case.csv', sep=',')
d_jlist2 = d_jlist.copy()
d_jlist3 = pd.DataFrame()
a = np.random.choice(range(5,46350),size = 1000*365)
for i in a:
    d_jlist3 = d_jlist3.append(d_jlist2.iloc[i] + np.random.uniform(-1, 1))
d_jlist3 = d_jlist3.replace(0,0.001,regex=True)
d_jlist3 = d_jlist3.round(3)
d_jlist = d_jlist.append(d_jlist3)
a = consumption.columns.values #Something to do with header
a = a[5:53]
d_jlist.to_csv('1117case_2.csv',header = a,index=False)
The problem is, if I use this code on a small scale, it works as expected. New rows are added to the old ones, and nothing in the old data changes.
However, if I go to the scale above (1000*365), the old rows get changed.
And the strange thing is: only the first two columns of each row stay unchanged. The rest of the columns of each row all get changed.
The results:
The left one is the old dataframe; it has shape (46356, 48). Below it are the newly generated data. The right one is the result from option 1 (both options give the same result). From the third column onward, the old data got changed.
If I try either of the options on a smaller scale (3 rows), it is fine. All the old data are kept.
d_jlist = pd.read_csv('127case.csv', sep=',')
d_jlist = d_jlist.iloc[:10] #Only select 10 rows from old ones
d_jlist2 = d_jlist.copy()
d_jlist3 = pd.DataFrame()
a = np.random.choice(range(5,6),size = 3) #Only select 3 rows randomly from old data
for i in a:
    d_jlist3 = d_jlist3.append(d_jlist2.iloc[i] + np.random.uniform(-1, 1))
d_jlist3 = d_jlist3.replace(0,0.001,regex=True)
d_jlist3 = d_jlist3.round(3)
d_jlist = d_jlist.append(d_jlist3)
a = consumption.columns.values #Something to do with header
a = a[5:53]
d_jlist.to_csv('1117case_2.csv',header = a,index=False)
How can I understand this? I spent a lot of time trying to find an explanation for this but failed.
Do some rules in Pandas change when the scale is larger (at the 365K level)?
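Independent of the explanation, one way to sidestep repeated append calls is to build the whole perturbed block in one go and concatenate once, which leaves the original rows untouched. A sketch under the assumption that all selected columns are numeric (as the row-wise addition above implies); the variable names follow the first option:
import numpy as np
import pandas as pd

d_jlist = pd.read_csv('127case.csv', sep=',')
rows = np.random.choice(range(5, 46350), size=1000 * 365)
noise = np.random.uniform(-1, 1, size=len(rows))           # one offset per sampled row
new_rows = d_jlist.iloc[rows].reset_index(drop=True).add(noise, axis=0)
new_rows = new_rows.replace(0, 0.001).round(3)
out = pd.concat([d_jlist, new_rows], ignore_index=True)     # the original rows stay as read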
I know that a few posts have been made regarding how to output the unique values of a dataframe without reordering the data.
I have tried many times to implement these methods; however, I believe that the problem relates to how the dataframe in question has been defined.
Basically, I want to look into the dataframe named "C", and output the unique values into a new dataframe named "C1", without changing the order in which they are stored at the moment.
The line that I use currently is:
C1 = pd.DataFrame(np.unique(C))
However, this returns a list in ascending order (while I simply want the list order preserved, only with duplicates removed).
Once again, I apologise to the advanced users who will look at my code and shake their heads -- I'm still learning! And, yes, I have tried numerous methods to solve this problem (redefining the C dataframe, converting the output to a list, etc.), to no avail unfortunately, so this is my cry for help to the Python gods. I defined both C and C1 as dataframes, as I understand that these are pretty much the best data structures to house data in, such that they can be recalled and used later; plus, it is quite useful to name the columns without affecting the data contained in the dataframe.
Once again, your help would be much appreciated.
F0 = ('08/02/2018','08/02/2018',50)
F1 = ('08/02/2018','09/02/2018',52)
F2 = ('10/02/2018','11/02/2018',46)
F3 = ('12/02/2018','16/02/2018',55)
F4 = ('09/02/2018','28/02/2018',48)
F_mat = [[F0,F1,F2,F3,F4]]
F_test = pd.DataFrame(np.array(F_mat).reshape(5,3),columns=('startdate','enddate','price'))
#convert string dates into DateTime data type
F_test['startdate'] = pd.to_datetime(F_test['startdate'])
F_test['enddate'] = pd.to_datetime(F_test['enddate'])
#convert datetype to be datetime type for columns startdate and enddate
F['startdate'] = pd.to_datetime(F['startdate'])
F['enddate'] = pd.to_datetime(F['enddate'])
#create contract duration column
F['duration'] = (F['enddate'] - F['startdate']).dt.days + 1
#re-order the F matrix by column 'duration', ensure that the bootstrapping
#prioritises the shorter term contracts
F = F.sort_values(by=['duration'], ascending=[True])
# create prices P
P = pd.DataFrame()
for index, row in F.iterrows():
    new_P_row = pd.Series()
    for date in pd.date_range(row['startdate'], row['enddate']):
        new_P_row[date] = row['price']
    P = P.append(new_P_row, ignore_index=True)
P.fillna(0, inplace=True)
#create C matrix, which records the unique day prices across the observation interval
C = pd.DataFrame(np.zeros((1, intNbCalendarDays)))
C.columns = tempDateRange
#create the Repatriation matrix, which records the order in which contracts will be
#stored in the A matrix, which means that once results are generated
#from the linear solver, we know exactly which CalendarDays map to
#which columns in the results array
#this array contains numbers from 1 to NbContracts
R = pd.DataFrame(np.zeros((1, intNbCalendarDays)))
R.columns = tempDateRange
#define a zero filled matrix, P1, which will house the dominant daily prices
P1 = pd.DataFrame(np.zeros((intNbContracts, intNbCalendarDays)))
#rename columns of P1 to be the dates contained in matrix array D
P1.columns = tempDateRange
#create prices in correct rows in P
for i in list(range(0, intNbContracts)):
    for j in list(range(0, intNbCalendarDays)):
        if (P.iloc[i, j] != 0 and C.iloc[0, j] == 0):
            flUniqueCalendarMarker = P.iloc[i, j]
            C.iloc[0, j] = flUniqueCalendarMarker
            P1.iloc[i, j] = flUniqueCalendarMarker
            R.iloc[0, j] = i
            for k in list(range(j+1, intNbCalendarDays)):
                if (C.iloc[0, k] == 0 and P.iloc[i, k] != 0):
                    C.iloc[0, k] = flUniqueCalendarMarker
                    P1.iloc[i, k] = flUniqueCalendarMarker
                    R.iloc[0, k] = i
        elif (C.iloc[0, j] != 0 and P.iloc[i, j] != 0):
            P1.iloc[i, j] = C.iloc[0, j]
#convert C dataframe into C_list, in preparation for converting C_list
#into a unique, order-preserved list
C_list = C.values.tolist()
#create C1 matrix, which records the unique day prices across unique days in the observation period
C1 = pd.DataFrame(np.unique(C))
Use DataFrame.duplicated() to check whether your dataframe contains any duplicates.
If it does, you can try DataFrame.drop_duplicates().
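For the order-preservation part specifically, pd.unique (and Series.drop_duplicates) keeps values in order of first appearance, unlike np.unique, which sorts. A minimal sketch against the one-row C built above:
import pandas as pd

# flatten C's single row, then deduplicate without sorting
C1 = pd.DataFrame(pd.unique(C.values.ravel()))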
I have some data (42 features) collected from people over several months (maximum 6; it varies between entries), with every month's value represented in its own row:
There are 9267 unique ID values (set as the index) and as many as 50,000 rows in the df.
I want to convert it to a 42 * 6 feature vector for each ID (even though some will have a lot of NaNs there), so that I can train on them. Here is how it should look:
Here is my solution:
def flatten_features(f_matrix, ID):
    '''constructs a 1x(6*n) vector from a 6xn matrix'''
    #check whether it is a series, not a dataframe
    if(len(f_matrix.shape) == 1):
        f_matrix['ID'] = ID
        return f_matrix
    flattened_vector = f_matrix.iloc[0]
    for i in range(1, f_matrix.shape[0]):
        vector_append = f_matrix.iloc[i]
        vector_append.index = (lambda month, series_names : series_names.map(lambda name : name + '_' + str(month)))\
            (i, vector_append.index)
        flattened_vector = flattened_vector.append(vector_append)
    flattened_vector['ID'] = ID
    return flattened_vector
#construct dataframe of flattened vectors for numerical features
new_indices = flatten_features(numerical_f.iloc[:6], 1).index
new_indices
flattened_num_f = pd.DataFrame(columns=new_indices)
flattened_num_f
for label in numerical_f.index.unique():
    matr = numerical_f.loc[label]
    flattened_num_f = flattened_num_f.append(flatten_features(matr, label))
It yields the needed results; however, it runs very slowly. I wonder, is there a more elegant and faster solution?
If you want to transpose the df, you can call the T attribute.
I assume you have the ids stored in a unique_id variable.
new_f = numerical_f.T
new_f.columns = unique_id
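If the goal is one wide row per ID rather than a plain transpose, a sketch using a per-ID month counter and unstack may be faster than appending row by row (assuming numerical_f is indexed by ID as described; 'month' is a helper column introduced here):
import pandas as pd

tmp = numerical_f.copy()
# number the rows within each ID: 0 for the first month, 1 for the second, ...
tmp['month'] = tmp.groupby(level=0).cumcount()
wide = tmp.set_index('month', append=True).unstack('month')
# flatten the (feature, month) column MultiIndex into feature_month names
wide.columns = ['{}_{}'.format(feature, month) for feature, month in wide.columns]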