Strange difference in performance of Pandas, dataframe on small & large scale

I have a dataframe read from a CSV file. I need to generate new rows and append them to the end of the existing ones.
Strangely, the result is completely different at small scale and at large scale. I suspect it may be related to views, copy() and chained assignment.
I tried two options that use DataFrame.copy() to avoid potential problems.
First option:
d_jlist = pd.read_csv('127case.csv', sep=',') #got the data shape: (46355,48) from CSV file
d_jlist2 = d_jlist.copy() #Use deep copy, in case of change the raw data
d_jlist3 = pd.DataFrame()
a = np.random.choice(range(5,46350),size = 1000*365) #Select from row 5 to row 46350
for i in a:
    d_jlist3 = d_jlist3.append(d_jlist.iloc[i].copy() +np.random.uniform(-1,1) )
d_jlist3 = d_jlist3.replace(0,0.001,regex=True)
d_jlist3 = d_jlist3.round(3)
d_jlist = d_jlist.append(d_jlist3)
a = consumption.columns.values #Something to do with header
a = a[5:53]
d_jlist.to_csv('1127case_1.csv',header = a,index=False)
Second option:
d_jlist = pd.read_csv('127case.csv', sep=',')
d_jlist2 = d_jlist.copy()
d_jlist3 = pd.DataFrame()
a = np.random.choice(range(5,46350),size = 1000*365)
for i in a:
    d_jlist3 = d_jlist3.append(d_jlist2.iloc[i] +np.random.uniform(-1,1) )
d_jlist3 = d_jlist3.replace(0,0.001,regex=True)
d_jlist3 = d_jlist3.round(3)
d_jlist = d_jlist.append(d_jlist3)
a = consumption.columns.values #Something to do with header
a = a[5:53]
d_jlist.to_csv('1117case_2.csv',header = a,index=False)
The problem is that when I run this code at a small scale, it works as expected: the new rows are added after the old ones, and nothing in the old data changes.
However, at the scale above (1000*365 new rows), the old rows get changed.
The strange thing is that only the first two columns of each old row stay unchanged; all the remaining columns are changed.
The results:
The left one is the old dataframe, which has shape (46356,48). Below it are the newly generated data.
The right one is the result from option 1 (both options give the same result). From the third column onward, the old data has changed.
If I try either option at a smaller scale (3 rows), it is fine. All the old data is kept.
d_jlist = pd.read_csv('127case.csv', sep=',')
d_jlist = d_jlist.iloc[:10] #Only select 10 rows from old ones
d_jlist2 = d_jlist.copy()
d_jlist3 = pd.DataFrame()
a = np.random.choice(range(5,6),size = 3) #Only select 3 rows randomly from old data
for i in a:
    d_jlist3 = d_jlist3.append(d_jlist2.iloc[i] +np.random.uniform(-1,1) )
d_jlist3 = d_jlist3.replace(0,0.001,regex=True)
d_jlist3 = d_jlist3.round(3)
d_jlist = d_jlist.append(d_jlist3)
a = consumption.columns.values #Something to do with header
a = a[5:53]
d_jlist.to_csv('1117case_2.csv',header = a,index=False)
How can I understand this? I have spent a lot of time trying to find an explanation but failed.
Do some rules change in Pandas when the scale gets larger (to the 365K level)?
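For reference, the generation step described above (sample existing rows, add one random offset per row, replace zeros, round, then append) can be sketched without growing a dataframe inside a loop. This is only a sketch and is not claimed to explain the changed rows. It assumes every column is numeric; the function name augment_rows and its parameters are hypothetical, and pd.concat is used because DataFrame.append no longer exists in pandas 2.0 and later.

```python
import numpy as np
import pandas as pd


def augment_rows(df, n_new, low=5, high=None, seed=None):
    """Append n_new noisy copies of randomly chosen rows to df.

    Hypothetical helper: assumes all columns of df are numeric.
    """
    rng = np.random.default_rng(seed)
    high = len(df) if high is None else high
    # Positions of the rows to copy (high is exclusive).
    picks = rng.integers(low, high, size=n_new)
    # One uniform offset per new row, broadcast across its columns,
    # mirroring `row + np.random.uniform(-1, 1)` in the loop version.
    noise = rng.uniform(-1, 1, size=(n_new, 1))
    new_values = df.iloc[picks].to_numpy(dtype=float) + noise
    new_rows = pd.DataFrame(new_values, columns=df.columns)
    # Only the generated rows are post-processed; df itself is untouched.
    new_rows = new_rows.replace(0, 0.001).round(3)
    return pd.concat([df, new_rows], ignore_index=True)
```

Because the original frame is passed to pd.concat unchanged and only the generated rows are rounded, comparing the first len(df) rows of the result with the input is a direct check that the old data was not modified.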

Related

Python mapclassify.Pooled with differently sized dataframes and create synthetic array for Y

I want to use mapclassify.Pooled with 2 different pandas dataframes that have non-identical lengths. My solution works fine for me as below. But I am curious whether you recommend other ways:
def create_synthetic_data_for_pooled_map_classify(number_simulations, df_smaller_length,
                                                  df_larger_length, col_to_analyse):
    all_synthetic_smaller_df = []
    length_other_df = len(df_larger_length)
    for i in range(number_simulations):
        syn_data = np.random.choice(df_smaller_length[col_to_analyse], length_other_df)
        all_synthetic_smaller_df.append(syn_data)
    all_synthetic_smaller_df = pd.DataFrame(all_synthetic_smaller_df).T
    all_synthetic_smaller_df = pd.DataFrame(np.sort(all_synthetic_smaller_df.values, axis = 0),
                                            index = all_synthetic_smaller_df.index, columns = all_synthetic_smaller_df.columns)
    all_synthetic_smaller_df["avg"] = all_synthetic_smaller_df.mean(axis=1)
    col_synth = all_synthetic_smaller_df["avg"].tolist()
    col_real = df_larger_length[col_to_analyse].tolist()
    zipped = list(zip(col_synth, col_real))
    df = pd.DataFrame(zipped, columns=['col_synth', 'col_real'])
    return df
Then use returned df as the input Y in mapclassify.Pooled.
Setting number_simulations large enough, the synthetic data resembles closely the actual distribution of df_smaller_length[col_to_analyse].
I am just curious whether this is how it can be done or if there are better ways.
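The same resampling idea can be written more compactly by drawing all simulations as one 2-D array. The sketch below is an assumption-laden restatement of the function above, not a recommendation from the mapclassify documentation; the name synthetic_pooled_input and the seed parameter are hypothetical.

```python
import numpy as np
import pandas as pd


def synthetic_pooled_input(df_small, df_large, col, n_sim, seed=None):
    """Resample the shorter column to the longer length, sort each
    simulation, and average the sorted simulations rank by rank."""
    rng = np.random.default_rng(seed)
    n = len(df_large)
    # One row per simulation, each a sample (with replacement) of length n.
    draws = rng.choice(df_small[col].to_numpy(), size=(n_sim, n))
    draws.sort(axis=1)  # sort every simulation independently
    return pd.DataFrame({
        "col_synth": draws.mean(axis=0),        # average at each rank
        "col_real": df_large[col].to_numpy(),   # left in its original order
    })
```

As in the original function, col_synth comes out sorted while col_real keeps its original order, which is fine as long as each column is only classified on its own values.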

Looping through Pandas data frame for a calculation based off the row values in two different list

I have been able to get the calculation to work but now I am having trouble appending the results back into the data frame e3. You can see from the picture that the values are printing out.
brand_list = list(e3["Brand Name"])
product_segment_list = list(e3['Product Segment'])
# Create a list of tuples: data
data = list(zip(brand_list, product_segment_list))
for i in data:
    step1 = e3.loc[(e3['Brand Name']==i[0]) & (e3['Product Segment']==i[1])]
    Delta_Price = (step1['Price'].diff(1).div(step1['Price'].shift(1),axis=0).mul(100.0))
    print(Delta_Price)

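It's easier to use groupby. In each iteration, r is just the rows of e3 that belong to one (Brand Name, Product Segment) group, and i is the group key.
new_df = []
for i,r in e3.groupby(['Brand Name','Product Segment']):
    r = r.copy() # work on a copy of the group to avoid SettingWithCopyWarning
    price_num = r["Price"].diff(1).values
    price_den = r["Price"].shift(1).values
    r['Price Delta'] = price_num/price_den # multiply by 100.0 for a percentage, as in the question
    new_df.append(r)
# axis = 0 stacks the groups back on top of each other (axis = 1 would place them side by side)
e3_ = pd.concat(new_df, axis = 0)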
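The per-group calculation can also be done without an explicit loop by shifting the price within each group. This is a minimal sketch that assumes the column names from the question; the function name add_price_delta is hypothetical.

```python
import pandas as pd


def add_price_delta(df, keys=("Brand Name", "Product Segment"), price="Price"):
    """Return a copy of df with the percentage change of `price`
    relative to the previous row of the same group."""
    out = df.copy()
    # Previous price within each group, aligned to the original index.
    prev = out.groupby(list(keys))[price].shift(1)
    out["Price Delta"] = (out[price] - prev) / prev * 100.0
    return out
```

Because the shifted series keeps the original index, the result lands directly in a new column of the frame, so nothing has to be appended or concatenated afterwards. The first row of each group has no previous price and therefore gets NaN.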

Randomization of a list with conditions using Pandas

I'm new to any kind of programming as you can tell by this 'beautiful' piece of hard coding. With sweat and tears (not so bad, just a little), I've created a very sequential code and that's actually my problem. My goal is to create a somewhat-automated script - probably including for-loop (I've unsuccessfully tried).
The main aim is to create a randomization loop which takes original dataset looking like this:
dataset
I want to pick rows from this dataset at random, one at a time, and save them one by one to another Excel sheet. Each row must be selected so that its values in the columns position01 and position02 do not match either of those two values in the previous pick. The result should be an Excel sheet of randomized rows in which every row shares no position01 or position02 value with the row before it: row 2 should not contain any of those values from row 1, row 3 should not contain those of row 2, and so on. It should also iterate over the length of the list, which is 0-11. The Excel output is important too, since I need the rest of the columns; I just need to shuffle the order.
I hope my aim and description are clear enough; if not, I'm happy to answer any questions. I would appreciate any hint or help that gets me 'unstuck'. Thank you. Code below. (PS: I'm aware that there is probably a much neater solution than this.)
import pandas as pd
import random
dataset = pd.read_excel("C:\\Users\\ibm\\Documents\\Psychopy\\DataInput_Training01.xlsx")
# original data set use for comparisons
imageDataset = dataset.loc[0:11, :]
# creating empty df for storing rows from imageDataset
emptyExcel = pd.DataFrame()
randomPick = imageDataset.sample() # select randomly one row from imageDataset
emptyExcel = emptyExcel.append(randomPick) # append a row to empty df
randomPickIndex = randomPick.index.tolist() # get index of the row
imageDataset2 = imageDataset.drop(index=randomPickIndex) # delete the row with index selected before
# getting raw values from the row 'position01'/02 are columns headers
randomPickTemp1 = randomPick['position01'].values[0]
randomPickTemp2 = randomPick
randomPickTemp2 = randomPickTemp2['position02'].values[0]
# getting a dataset which not including row values from position01 and position02
isit = imageDataset2[(imageDataset2.position01 != randomPickTemp1) & (imageDataset2.position02 != randomPickTemp1) & (imageDataset2.position01 != randomPickTemp2) & (imageDataset2.position02 != randomPickTemp2)]
# pick another row from dataset not including row selected at the beginning - randomPick
randomPick2 = isit.sample()
# save it in empty df
emptyExcel = emptyExcel.append(randomPick2, sort=False)
# get index of this second row to delete it in next step
randomPick2Index = randomPick2.index.tolist()
# delete the another row
imageDataset3 = imageDataset2.drop(index=randomPick2Index)
# AND REPEAT the procedure of comparison of the raw values with dataset already not including the original row:
randomPickTemp1 = randomPick2['position01'].values[0]
randomPickTemp2 = randomPick2
randomPickTemp2 = randomPickTemp2['position02'].values[0]
isit2 = imageDataset3[(imageDataset3.position01 != randomPickTemp1) & (imageDataset3.position02 != randomPickTemp1) & (imageDataset3.position01 != randomPickTemp2) & (imageDataset3.position02 != randomPickTemp2)]
# AND REPEAT with another pick - save - matching - picking again.. until end of the length of the dataset (which is 0-11)

In the end I used a solution provided by David Bridges (post from Sep 19 2019) on the PsychoPy forum. In case anyone is interested, here is the link: https://discourse.psychopy.org/t/how-do-i-make-selective-no-consecutive-trials/9186
I just adjusted the condition in the for loop to my case like this:
remaining = [choices[x] for x in choices if last['position01'] != choices[x]['position01'] and last['position01'] != choices[x]['position02'] and last['position02'] != choices[x]['position01'] and last['position02'] != choices[x]['position02']]
Thank you very much for the helpful answer! Hopefully I did not spam too much here.

import itertools as it
import random
import pandas as pd
# list of pairs of numbers
tmp1 = [x for x in it.permutations(list(range(6)),2)]
df = pd.DataFrame(tmp1, columns=["position01","position02"])
df1 = pd.DataFrame()
i = random.choice(df.index)
# DataFrame.append was removed in pandas 2.0, so pd.concat is used here
df1 = pd.concat([df1, df.loc[[i]]], ignore_index = True)
df = df.drop(index = i)
while not df.empty:
    val = list(df1.iloc[-1])
    tmp = df[(df["position01"]!=val[0])&(df["position01"]!=val[1])&(df["position02"]!=val[0])&(df["position02"]!=val[1])]
    if tmp.empty: #looped for 10000 times, was never empty
        print("here")
        break
    i = random.choice(tmp.index)
    df1 = pd.concat([df1, df.loc[[i]]], ignore_index = True)
    df = df.drop(index=i)
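The pick, filter and repeat procedure can be wrapped in a function that works on any dataframe and keeps all of its other columns. The sketch below is one possible arrangement, not the PsychoPy forum solution: the function name, the max_tries parameter and the restart-on-dead-end strategy are assumptions, and it expects a unique index.

```python
import random

import pandas as pd


def shuffle_no_shared_positions(df, cols=("position01", "position02"),
                                max_tries=1000, seed=None):
    """Return the rows of df in random order such that no row shares a
    value in `cols` with the row directly before it.

    Hypothetical helper; assumes df has a unique index.
    """
    rng = random.Random(seed)
    c1, c2 = cols
    for _ in range(max_tries):
        remaining = list(df.index)
        order = []
        last = None
        while remaining:
            if last is None:
                candidates = remaining
            else:
                # Values that the next row is not allowed to contain.
                prev = {df.at[last, c1], df.at[last, c2]}
                candidates = [i for i in remaining
                              if df.at[i, c1] not in prev
                              and df.at[i, c2] not in prev]
            if not candidates:
                break  # dead end: restart with a fresh random order
            last = rng.choice(candidates)
            order.append(last)
            remaining.remove(last)
        if not remaining:
            return df.loc[order]
    raise RuntimeError("no valid ordering found within max_tries attempts")
```

The returned frame still contains every original column, so calling to_excel on it would write the shuffled sheet. A greedy pick can run into a dead end near the end of the list, which is why the sketch restarts instead of stopping.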

How to loop over big dataframe in batches

I have a pretty big dataframe of about 1.5 million rows, and I am trying to run the code below in batches of 10,000 rows and then append the results to the "dataset" dataframe. One of the columns, 'subjects', has a really odd structure, so I have to clean it up, but that takes a long time to process. That's why I want to use batches of k=10,000. Any thoughts on the best way to accomplish this?
reuters_set = reuters_set.loc[reuters_set['subjects'].str.contains('P:')]
reuters_set.shape[0]
1590478
reuters_set.subjects.iloc[33] #Example of data in column that needs to be processed
['B:1092', 'B:12', 'B:19', 'B:20', 'B:22', 'B:227', 'B:228', 'B:229', 'B:24', 'G:1', 'G:6', 'G:B1', 'G:K', 'G:S', 'M:1QD', 'M:AV', 'M:B6', 'M:Z', 'R:600058.SS', 'N2:ASIA', 'N2:ASXPAC', 'N2:BMAT', 'N2:BMAT08', 'N2:CMPNY', 'N2:CN', 'N2:EASIA', 'N2:EMRG', 'N2:EQTY', 'N2:IRNST', 'N2:LEN', 'N2:MEMI', 'N2:METWHL', 'N2:MIN', 'N2:MINE', 'N2:MINE08', 'N2:MTAL', 'N2:MTAL08', 'N2:STEE', 'P:4295865030']
dataset = []
k = 10000
ct=0
# Testing the first 10,000. It takes really long after this value...
bk = reuters_set.iloc[0:k]
bk.reset_index(inplace = True)
bk['id'] = np.arange(bk.shape[0])
bk['N2'] = ''
bk['P'] = ''
bk['R'] = ''
for index, row in bk.iterrows():
    a = [i.split(':') for i in ast.literal_eval(row['subjects'])]
    b = pd.DataFrame(a)
    b = b.groupby(0, as_index = False).agg({1:'unique'})
    dict_code = dict(zip(b[0], b[1]))
    if 'N2' in dict_code.keys():
        bk.loc[bk['id']== index, 'N2'] = str(dict_code['N2'].tolist())
    if 'R' in dict_code.keys():
        bk.loc[bk['id']== index, 'R' ] = str(dict_code['R'].tolist())
    if 'P' in dict_code.keys():
        bk.loc[bk['id']== index, 'P' ] = str(dict_code['P'].tolist())
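One way to batch this is to slice the frame by position in steps of k and to parse each 'subjects' string with plain Python instead of building a small dataframe per row. The following is a sketch under assumptions: the helper names split_codes and process_in_batches are hypothetical, 'subjects' is assumed to hold string representations of lists as in the example above, and the index is assumed to be unique.

```python
import ast
from collections import defaultdict

import pandas as pd


def split_codes(subjects, wanted=("N2", "P", "R")):
    """Group the codes of one 'subjects' string by prefix, keeping only
    the wanted prefixes and dropping repeated codes."""
    groups = defaultdict(list)
    for item in ast.literal_eval(subjects):
        prefix, _, code = item.partition(":")
        if prefix in wanted and code not in groups[prefix]:
            groups[prefix].append(code)
    # Same output format as the loop above: str(list), or '' if absent.
    return {k: str(groups[k]) if groups[k] else "" for k in wanted}


def process_in_batches(df, batch_size=10_000):
    """Parse 'subjects' batch by batch and add N2, P and R columns."""
    parts = []
    for start in range(0, len(df), batch_size):
        chunk = df.iloc[start:start + batch_size]
        codes = pd.DataFrame([split_codes(s) for s in chunk["subjects"]],
                             index=chunk.index)
        parts.append(chunk.join(codes))
    return pd.concat(parts) if parts else df.copy()
```

Batching mainly bounds memory use and lets intermediate results be saved. Most of the time in the original loop is likely spent in the per-row groupby and the boolean-mask assignments, which this sketch avoids.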

Python looping and Pandas rank/index quirk

This question pertains to one posted here:
Sort dataframe rows independently by values in another dataframe
In the linked question, I utilize a Pandas Dataframe to sort each row independently using values in another Pandas Dataframe. The function presented there works perfectly every single time it is directly called. For example:
import pandas as pd
import numpy as np
import os
##Generate example dataset
d1 = {}
d2 = {}
d3 = {}
d4 = {}
## generate data:
np.random.seed(5)
for col in list("ABCDEF"):
    d1[col] = np.random.randn(12)
    d2[col+'2'] = np.random.random_integers(0,100, 12)
    d3[col+'3'] = np.random.random_integers(0,100, 12)
    d4[col+'4'] = np.random.random_integers(0,100, 12)
t_index = pd.date_range(start = '2015-01-31', periods = 12, freq = "M")
#place data into dataframes
dat1 = pd.DataFrame(d1, index = t_index)
dat2 = pd.DataFrame(d2, index = t_index)
dat3 = pd.DataFrame(d3, index = t_index)
dat4 = pd.DataFrame(d4, index = t_index)
## Functions
def sortByAnthr(X,Y,Xindex, Reverse=False):
    #order the subset of X.index by Y
    ordrX = [x for (x,y) in sorted(zip(Xindex,Y), key=lambda pair: pair[1],reverse=Reverse)]
    return(ordrX)
def OrderRow(row,df):
    ordrd_row = df.ix[row.dropna().name,row.dropna().values].tolist()
    return(ordrd_row)
def r_selectr(dat2,dat1, n, Reverse=False):
    ordr_cols = dat1.apply(lambda x: sortByAnthr(x,dat2.loc[x.name,:],x.index,Reverse),axis=1).iloc[:,-n:]
    ordr_cols.columns = list(range(0,n)) #assign interpretable column names
    ordr_r = ordr_cols.apply(lambda x: OrderRow(x,dat1),axis=1)
    return([ordr_cols, ordr_r])
## Call functions
ordr_cols2,ordr_r = r_selectr(dat2,dat1,5)
##print output:
print("Ordering set:\n",dat2.iloc[-2:,:])
print("Original set:\n", dat1.iloc[-2:,:])
print("Column ordr:\n",ordr_cols2.iloc[-2:,:])
As can be checked, the columns of dat1 are correctly ordered according to the values in dat2.
However, when called from a loop over dataframes, it does not rank/index correctly and produces completely dubious results. Although I am not quite able to recreate the problem using the reduced version presented here, the idea should be the same.
## Loop test:
out_list = []
data_dicts = {'dat2':dat2, 'dat3': dat3, 'dat4':dat4}
for i in range(3):
    #this outer for loop supplies different parameter values to a wrapper
    #function that calls r_selectr.
    for key in data_dicts.keys():
        ordr_cols,_ = r_selectr(data_dicts[key], dat1,5)
        out_list.append(ordr_cols)
        #do stuff here
#print output:
print("Ordering set:\n",dat3.iloc[-2:,:])
print("Column ordr:\n",ordr_cols2.iloc[-2:,:])
In my code (almost completely analogous to the example given here), the ordr_cols are no longer ordered correctly for any of the sorting data frames.
I currently solve the issue by separating the ordering and indexing operations with r_selectr into two separate functions. That, for some reason, resolves the issue though I have no idea why.
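For comparison, the row-wise ordering that r_selectr performs can be sketched with NumPy's argsort, which avoids the nested apply calls and the label-based lookup between them. This is an illustrative alternative and not a diagnosis of the loop problem; the function name order_rows_by is hypothetical, and both frames are assumed to have the same shape, the same row order and no missing values.

```python
import numpy as np
import pandas as pd


def order_rows_by(values, keys, n, reverse=False):
    """For each row, order the columns of `values` by the matching row of
    `keys` and keep the last n positions of that ordering.

    Returns (column labels, values), both indexed like `values`.
    """
    k = keys.to_numpy()
    # A stable sort keeps tied keys in their original column order.
    order = np.argsort(-k if reverse else k, axis=1, kind="stable")[:, -n:]
    cols = pd.DataFrame(values.columns.to_numpy()[order], index=values.index)
    vals = pd.DataFrame(np.take_along_axis(values.to_numpy(), order, axis=1),
                        index=values.index)
    return cols, vals
```

Since the ordering is computed purely by position, the result does not depend on index labels or on any state left over from an earlier call, which makes it easier to compare against the output produced inside the loop.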
