subset a dataframe after a for loop and manipulate - python

I'm trying to write a for loop where I can subset a dataframe for each unique ID and create a new column. In my example, I want to create a new balance based on the ID, balance and initial amount. My idea was to loop through each group of IDs, take that subset and follow it up with some if/elif statements. In each iteration I want the loop to look at one of the unique IDs; for example, when I loop through df["ID"] == 2, there should be 7 rows, since their balances are all related to each other. This is what my dataframe would look like:
df = pd.DataFrame(
    {"ID": [2, 2, 2, 2, 2, 2, 2, 3, 4, 4, 4],
     "Initial amount": [3250, 10800, 6750, 12060, 8040, 4810, 12200, 13000, 10700, 12000, 27000],
     "Balance": [0, 0, 0, 0, 0, 0, 0, 2617, 19250, 19250, 19250],
     "expected output": [0, 0, 0, 0, 0, 0, 0, 2617, 10720, 8530, 0]})
My current code looks like this, but I feel like I'm headed in the wrong direction. Thanks!
unique_ids = list(df["ID"].unique())
new_output = []
for i in range(len(unique_ids)):
    this_id = unique_ids[i]
    subset = df.loc[df["ID"] == this_id, :]
    for j in range(len(subset)):
        this_bal = subset["Balance"]
        this_amt = subset["Initial amount"]
        if j == 0:
            this_output = np.where(this_bal >= this_amt, this_amt, this_bal)
            new_output.append(this_output)
        elif this_bal - sum(this_output) >= this_amt:
            this_output = this_amt
            new_output.append(this_output)
        else:
            this_output = this_bal - sum(this_output)
            new_output.append(this_output)
Any suggestions would be greatly appreciated!
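For what it's worth, here is a minimal sketch of one way to do this with groupby instead of manual subsetting. It assumes the intended rule is that each ID's shared balance gets allocated row by row, capped at each row's initial amount (my reading of the expected-output column, so treat it as a sketch rather than the definitive logic):
import pandas as pd

out = pd.Series(0.0, index=df.index)
for _, group in df.groupby("ID"):
    # Each ID shares one balance; spend it row by row, capping each
    # row at its initial amount, until the balance runs out.
    remaining = group["Balance"].iloc[0]
    for idx, amt in zip(group.index, group["Initial amount"]):
        take = min(remaining, amt)
        out.loc[idx] = take
        remaining -= take
df["new balance"] = out
If the real rule differs (the expected output has a couple of values that don't match any initial amount exactly), only the inner loop body needs changing.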

Related

Improve performance of 8 million iterations over a dataframe and query it

There is a for loop of 8 million iterations. Each iteration takes 2 sample values from a column of a 1-million-record dataframe (say df_original_nodes), queries those 2 samples in another dataframe (say df_original_rel), and if the pair does not exist, adds it as a new row to the queried dataframe (df_original_rel). Finally the dataframe (df_original_rel) is written to a CSV.
This loop takes roughly 24+ hours to complete. How can this be made more performant? I'd be happy if it even took 8 hours rather than 12+.
Here is the piece of code:
for j in range(1, n_8000000):
    ran_num = random.randint(0, 1)
    ran_rel_type = rel_type[ran_num]
    df_ran_rel = df_original_nodes["UID"].sample(2, ignore_index=True)
    FROM = df_ran_rel[0]
    TO = df_ran_rel[1]
    if df_original_rel.query("@FROM == FROM and @TO == TO").empty:
        k += 1
        new_row = {"FROM": FROM, "TO": TO, "TYPE": ran_rel_type[0], "PART_OF": ran_rel_type[1]}
        df_original_rel = df_original_rel.append(new_row, ignore_index=True)
df_original_rel.to_csv("output/extra_rel.csv", encoding="utf-8", index=False)
My assumption is that querying the dataframe df_original_rel is the heavy-lifting part, and that dataframe also keeps growing as new rows are added.
In my view lists are faster to traverse, and maybe to query, but then there would be another layer of conversion from dataframe to list and back, which could add further complexity.
Some things that should probably help – most of them around "do less Pandas".
Since I don't have your original data or anything like it, I can't test this.
# Grab a regular list of UIDs that we can use with `random.sample`
original_nodes_uid_list = df_original_nodes["UID"].tolist()
# Make a regular set of FROM-TO tuples
rel_from_to_pairs = set(df_original_rel[["FROM", "TO"]].apply(tuple, axis=1).tolist())
# Store new rows here instead of putting them in the dataframe; we'll also update rel_from_to_pairs as we go.
new_rows = []
for j in range(1, 8_000_000):
    # These two lines could probably also be a `random.choice`
    ran_num = random.randint(0, 1)
    ran_rel_type = rel_type[ran_num]
    # Grab a from-to pair from the UID list
    FROM, TO = random.sample(original_nodes_uid_list, 2)
    # If this pair isn't in the set of known pairs...
    if (FROM, TO) not in rel_from_to_pairs:
        # ... prepare a new row to be added later
        new_rows.append({"FROM": FROM, "TO": TO, "TYPE": ran_rel_type[0], "PART_OF": ran_rel_type[1]})
        # ... and since this from-to pair _would_ exist had df_original_rel
        # been updated, update the pairs set.
        rel_from_to_pairs.add((FROM, TO))
# Finally, make a dataframe of the new rows, concatenate it with the old, and output.
df_new_rel = pd.DataFrame(new_rows)
df_original_rel = pd.concat([df_original_rel, df_new_rel], ignore_index=True)
df_original_rel.to_csv("output/extra_rel.csv", encoding="utf-8", index=False)
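Two notes on why this layout should be faster: appending to the dataframe inside the loop copies the whole frame every iteration (and DataFrame.append was removed in pandas 2.x in favour of pd.concat), while appending dicts to a plain Python list is cheap; and the (FROM, TO) in rel_from_to_pairs check is an O(1) set lookup instead of a query over a frame that keeps growing.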

How to specify a group of objects in a dataframe column using Python

In the example below, how do I specify 'mansion' under 'h_type' and find its highest price?
(This is to prevent finding the highest price across the whole dataset, which might also include 'apartment'.)
i.e.:
df = pd.DataFrame({'h_type': ['apartment', 'mansion', ...], 'h_price': [..., ..., ...]})
if df.loc[df['h_type'] == 'mansion']:  ## <= does not work
    aidMax = priceSr.idxmax()
    if not isnan(aidMax):
        amaxSr = df.loc[aidMax]
        if amost is None:
            amost = amaxSr.copy()
        else:
            if float(amaxSr['h_price']) > float(amost['h_price']):
                amost = amaxSr.copy()
        amost = amost.to_frame().transpose()
        print(amost, '\n==========')
TL;DR:
That can be a one-liner:
max_price = df[df["h_type"] == "mansion"]["h_price"].max()
Explanation
A little bit of explaining here:
df[df["h_price"] == "mansion"]]
That pieces selects all the rows who's column "h_price" value is the maximum.
df[df["h_price"] == "mansion"]]["h_price"]
Then, we access the column "h_price" of all the rows.
Finally
df[df["h_price"] == "mansion"]]["h_price"].max()
Will return the maximum value for that column (amongs all rows).
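A quick self-contained illustration of the same one-liner, on made-up data (the numbers are invented just for the example):
import pandas as pd

df = pd.DataFrame({
    "h_type": ["apartment", "mansion", "mansion", "apartment"],
    "h_price": [250000, 1200000, 900000, 310000],
})

# Filter to mansions first, then take the max of their prices
max_price = df[df["h_type"] == "mansion"]["h_price"].max()
print(max_price)  # 1200000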

How to use Lag value of a column in condition to populate another column in Pandas

I have a table like below:
I want to create another column (Check2) with the below logic:
If Check1 == 0 then Check2 = A
Else Check2 = Check2 (lagged value) - B (lagged) - C (lagged)
The expected output should be like below -
I have written the below code, but it's taking a very long time (hours) for 50,000 records. Please help.
for i in range(len(df)):
    if df.loc[i, 'Check1'] == 0:
        df.loc[i, 'Check2'] = df.loc[i, 'Volume']
    else:
        df.loc[i, 'Check2'] = df.loc[i-1, 'Check2'] - df.loc[i-1, 'B'] - df.loc[i-1, 'C']
You are looking for the .shift() function.
It does what you want.
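For illustration, here is a minimal sketch with made-up numbers (it uses Volume for the Check1 == 0 case, as the question's code does). .shift(1) gives you the previous row's values as an aligned column; because Check2 also depends on its own lagged value, the recursion additionally has to be unrolled, which can be done per run of rows between the Check1 == 0 resets:
import pandas as pd

df = pd.DataFrame({
    "Check1": [0, 1, 1, 0, 1],
    "Volume": [100, 0, 0, 80, 0],
    "B":      [5, 3, 2, 4, 1],
    "C":      [1, 2, 1, 3, 2],
})

# Lagged B + C: each row sees the previous row's B + C
lag_bc = (df["B"] + df["C"]).shift(1, fill_value=0)

# Every Check1 == 0 row starts a new run; within a run, Check2 is the
# starting Volume minus the running total of the lagged B + C values.
run = (df["Check1"] == 0).cumsum()
start_volume = df["Volume"].where(df["Check1"] == 0).groupby(run).transform("first")
df["Check2"] = start_volume - (lag_bc.groupby(run).cumsum()
                               - lag_bc.groupby(run).transform("first"))
print(df["Check2"].tolist())  # [100.0, 94.0, 89.0, 80.0, 73.0]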

Randomization of a list with conditions using Pandas

I'm new to any kind of programming, as you can tell by this 'beautiful' piece of hard coding. With sweat and tears (not so bad, just a little) I've created a very sequential piece of code, and that's actually my problem. My goal is to create a somewhat automated script, probably involving a for loop (which I've tried, unsuccessfully).
The main aim is to create a randomization loop which takes an original dataset looking like this:
dataset
From this dataset, rows are picked randomly one by one and saved to another Excel list. The point is that a pick should never match the previous pick in either of the position01 or position02 column values. That should eventually create an Excel sheet of randomized rows where each row shares none of those values with the row before it: row 2 should not include any of the position01/position02 values of row 1, row 3 should not contain values of row 2, etc. It should also iterate over the range of the list length, which is 0-11. The Excel output also matters since I need the rest of the columns; I just need to shuffle the order.
I hope my aim and description are clear enough; if not, I'm happy to answer any questions. I would appreciate any hint or help that gets me 'unstuck'. Thank you. Code below. (PS: I'm aware that there is probably a much neater solution than this.)
import pandas as pd
import random
dataset = pd.read_excel("C:\\Users\\ibm\\Documents\\Psychopy\\DataInput_Training01.xlsx")
# original data set use for comparisons
imageDataset = dataset.loc[0:11, :]
# creating empty df for storing rows from imageDataset
emptyExcel = pd.DataFrame()
randomPick = imageDataset.sample() # select randomly one row from imageDataset
emptyExcel = emptyExcel.append(randomPick) # append a row to empty df
randomPickIndex = randomPick.index.tolist() # get index of the row
imageDataset2 = imageDataset.drop(index=randomPickIndex) # delete the row with index selected before
# getting raw values from the row 'position01'/02 are columns headers
randomPickTemp1 = randomPick['position01'].values[0]
randomPickTemp2 = randomPick
randomPickTemp2 = randomPickTemp2['position02'].values[0]
# getting a dataset which not including row values from position01 and position02
isit = imageDataset2[(imageDataset2.position01 != randomPickTemp1) & (imageDataset2.position02 != randomPickTemp1) & (imageDataset2.position01 != randomPickTemp2) & (imageDataset2.position02 != randomPickTemp2)]
# pick another row from dataset not including row selected at the beginning - randomPick
randomPick2 = isit.sample()
# save it in empty df
emptyExcel = emptyExcel.append(randomPick2, sort=False)
# get index of this second row to delete it in next step
randomPick2Index = randomPick2.index.tolist()
# delete the another row
imageDataset3 = imageDataset2.drop(index=randomPick2Index)
# AND REPEAT the procedure of comparison of the raw values with dataset already not including the original row:
randomPickTemp1 = randomPick2['position01'].values[0]
randomPickTemp2 = randomPick2
randomPickTemp2 = randomPickTemp2['position02'].values[0]
isit2 = imageDataset3[(imageDataset3.position01 != randomPickTemp1) & (imageDataset3.position02 != randomPickTemp1) & (imageDataset3.position01 != randomPickTemp2) & (imageDataset3.position02 != randomPickTemp2)]
# AND REPEAT with another pick - save - matching - picking again.. until end of the length of the dataset (which is 0-11)
In the end I used a solution provided by David Bridges (post from Sep 19, 2019) on the PsychoPy forum. In case anyone is interested, here is the link: https://discourse.psychopy.org/t/how-do-i-make-selective-no-consecutive-trials/9186
I've just adjusted the condition in the for loop to my case, like this:
remaining = [choices[x] for x in choices if last['position01'] != choices[x]['position01'] and last['position01'] != choices[x]['position02'] and last['position02'] != choices[x]['position01'] and last['position02'] != choices[x]['position02']]
Thank you very much for the helpful answer, and hopefully I did not spam it over here too much.
import itertools as it
import random
import pandas as pd
# list of pair of numbers
tmp1 = [x for x in it.permutations(list(range(6)),2)]
df = pd.DataFrame(tmp1, columns=["position01","position02"])
df1 = pd.DataFrame()
i = random.choice(df.index)
df1 = df1.append(df.loc[i],ignore_index = True)
df = df.drop(index = i)
while not df.empty:
    val = list(df1.iloc[-1])
    tmp = df[(df["position01"] != val[0]) & (df["position01"] != val[1]) & (df["position02"] != val[0]) & (df["position02"] != val[1])]
    if tmp.empty:  # looped for 10000 times, was never empty
        print("here")
        break
    i = random.choice(tmp.index)
    df1 = df1.append(df.loc[i], ignore_index=True)
    df = df.drop(index=i)
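Two small follow-ups, both assumptions on my part rather than something from the answers above: DataFrame.append was removed in pandas 2.x, so on a recent install the same step can be written with pd.concat; and since the goal is an Excel output, the shuffled frame can be written back out with to_excel (the file name below is just a placeholder).
# pandas >= 2.0 replacement for df1.append(df.loc[i], ignore_index=True)
df1 = pd.concat([df1, df.loc[[i]]], ignore_index=True)

# Write the shuffled rows to an Excel file (requires openpyxl);
# the path is a made-up example.
df1.to_excel("randomized_output.xlsx", index=False)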

Data cleaning for-loop in Python for POS data

I have POS data from a message shop.
The data is as shown in the attached picture.
##read data from csv
data = pd.read_csv('test1.csv')
# make a list for each column
sales_id = list(data['sales_id'])
shop_number = list(data['shop_number'])
sales = list(data['sales'])
cashier_no = list(data['cashier_no'])
messager_no = list(data['messager_no'])
type_of_sale = list(data['type_of_sale'])
costomer_ID = list(data['costomer_ID'])
type_of_sale = list(data['type_of_sale'])
date = list(data['date'])
time = list(data['time'])
I want to make a new list showing which rows of purchase data should be deleted,
like this:
data_to_clean= [0,1,0,1,0,0,1,0,1]
To do this I want to make a for loop:
for i in range(len(type_of_sale)):
    data_to_clean = []
    if type_of_sale[i] == "purchase":
        data_to_clean = data_to_clean.append(0)
    elif type_of_sale[i] == "return":
        data_to_clean = data_to_clean.append(1)
    ## I want to write code so I can delete the matching purchase data too,
    ## with conditions: same shop_number, messager_no, costomer_ID and -price
return list(data_to_clean)
There are two main problems with this code. First, it doesn't run at all. Second, I don't know how to check shop_number, messager_no and costomer_ID to put 1 or 0 in my data_to_clean list.
Sometimes I have to check the data above, like sales_id 1628060, and sometimes the data below, like sales_id 1599414,
knowing that the cashier may differ,
but the costomer_ID should always be the same.
The question is how to write the code so I can create a list or dataframe of 0s and 1s showing which data should be deleted.
When you want to compare data with a string in Python, you should put the string in quotes:
for i in range(len(type_of_sale)):
    data_to_clean = []
    if type_of_sale[i] == "purchase":  # here
        data_to_clean = data_to_clean.append(0)
    elif type_of_sale[i] == "return":  # and here
        data_to_clean = data_to_clean.append(1)
Check the pandas docs. Getting the items which are a return order can be as simple as
returns = data.loc[data['type_of_sale'] == 'return']
If you want the sales of cashier 90
data.loc[(data['type_of_sale'] == 'purchase') & (data['cashier_no'] == 90)]
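As a starting point for the 0/1 list the question asks for, the return flag itself can be built without a loop; matching each return to the purchase it cancels (same shop_number, messager_no, costomer_ID and opposite sales amount) could then be done with a merge. A minimal sketch, assuming the column names from the question:
import pandas as pd

data = pd.read_csv('test1.csv')

# 1 where the row is a return, 0 otherwise
data['data_to_clean'] = (data['type_of_sale'] == 'return').astype(int)

# Returns and their candidate matching purchases (same shop, messager and
# customer, opposite sign on sales) can then be paired with a merge, e.g.:
returns = data[data['type_of_sale'] == 'return']
purchases = data[data['type_of_sale'] == 'purchase']
matches = returns.merge(
    purchases,
    on=['shop_number', 'messager_no', 'costomer_ID'],
    suffixes=('_ret', '_pur'),
)
matches = matches[matches['sales_ret'] == -matches['sales_pur']]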
