Data cleaning for-loop in python for pos data - python

I have POS data from a message shop.
The data is shown in the attached picture.
import pandas as pd
# read data from csv
data = pd.read_csv('test1.csv')
# make a list for each column
sales_id = list(data['sales_id'])
shop_number = list(data['shop_number'])
sales = list(data['sales'])
cashier_no = list(data['cashier_no'])
messager_no = list(data['messager_no'])
type_of_sale = list(data['type_of_sale'])
costomer_ID = list(data['costomer_ID'])
date = list(data['date'])
time = list(data['time'])
I want to make a new list that marks which rows should be deleted,
like this:
data_to_clean= [0,1,0,1,0,0,1,0,1]
To do this I want to write a for loop:
for i in range(len(type_of_sale)):
    data_to_clean=[]
    if type_of_sale[i] == "purchase":
        data_to_clean = data_to_clean.append(0)
    elif type_of_sale[i] == "return":
        data_to_clean = data_to_clean.append(1)
    ## I want to write code here so I can mark purchase data for deletion too,
    # on the condition that it has the same shop_number, messager_no, costomer_ID and -price
return list(data_to_clean)
There are two main problems with this code. First, it doesn't run. Second, I don't know how to check shop_number, messager_no and costomer_ID to decide whether to put 1 or 0 in my data_to_clean list.
Sometimes the matching row is above, as for sales_id 1628060, and sometimes it is below, as for sales_id 1599414.
The cashier may differ,
but the costomer_ID should always be the same.
The question is how to write the code so that I can create a list or DataFrame of 0s and 1s showing which rows should be deleted.

When you want to compare data with a string in Python, you should put the string in quotes. Also create the list once, before the loop, and don't assign the result of append(), because append() returns None:
data_to_clean = []
for i in range(len(type_of_sale)):
    if type_of_sale[i] == "purchase":  # here
        data_to_clean.append(0)
    elif type_of_sale[i] == "return":  # and here
        data_to_clean.append(1)

Check the pandas documentation. Getting the rows that are return orders can be as simple as
returns = data.loc[data['type_of_sale'] == 'return']
If you want the sales of cashier 90:
data.loc[(data['type_of_sale'] == 'purchase') & (data['cashier_no'] == 90)]
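The same boolean-mask idea can be extended to the second part of the question, flagging purchases that were later returned. The following is only a sketch under assumptions: it assumes a return is stored with a negative value in the sales column, it uses the column names from the question, and the small DataFrame is made-up sample data, not the real POS export.

```python
import pandas as pd

# Made-up sample data using the column names from the question
data = pd.DataFrame({
    "sales_id": [1, 2, 3, 4, 5],
    "shop_number": [10, 10, 10, 11, 11],
    "messager_no": [7, 7, 8, 9, 9],
    "costomer_ID": [100, 100, 101, 102, 102],
    "sales": [50, -50, 30, 40, -40],  # assumption: returns are negative
    "type_of_sale": ["purchase", "return", "purchase", "purchase", "return"],
})

keys = ["shop_number", "messager_no", "costomer_ID"]
is_return = data["type_of_sale"] == "return"

# Lookup table of returns, with the sign flipped so the amount
# matches the amount of the original purchase
returns = data.loc[is_return, keys + ["sales"]].copy()
returns["sales"] = -returns["sales"]
returns = returns.drop_duplicates()

# A left merge keeps the row order of data; "_merge" tells us
# which rows found a matching return
merged = data.merge(returns, on=keys + ["sales"], how="left", indicator=True)
has_matching_return = (merged["_merge"] == "both").to_numpy()

# 1 = delete (a return, or a purchase that was returned), 0 = keep
data["data_to_clean"] = (is_return.to_numpy() | has_matching_return).astype(int)
print(data["data_to_clean"].tolist())
```

Because the match ignores row position and cashier_no, it does not matter whether the purchase sits above or below its return. One limitation of this sketch: if the same customer bought the same amount twice and returned only once, both purchases are flagged.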

Related

Different ways of iterating through pandas DataFrame

I am currently working on a short pandas project. The project assessment keeps marking this task as incorrect for me even though the resulting list appears to be the same as when the provided correct code is used. Is my code wrong and it just happens to give the same results for this particular DataFrame?
My code:
# Define an empty list
colors = []
# Iterate over rows of netflix_movies_col_subset
for t in netflix_movies_col_subset['genre']:
    if t == 'Children' :
        colors.append('red')
    elif t == 'Documentaries' :
        colors.append('blue')
    elif t == 'Stand-up' :
        colors.append('green')
    else:
        colors.append('black')
# Inspect the first 10 values in your list
print(colors[:10])
Provided code:
# Define an empty list
colors = []
# Iterate over rows of netflix_movies_col_subset
for lab, row in netflix_movies_col_subset.iterrows():
    if row['genre'] == 'Children' :
        colors.append('red')
    elif row['genre'] == 'Documentaries' :
        colors.append('blue')
    elif row['genre'] == 'Stand-up' :
        colors.append('green')
    else:
        colors.append('black')
# Inspect the first 10 values in your list
print(colors[0:10])
I've always been told that the best way to iterate over a DataFrame row by row is NOT TO DO IT.
In your case, you could use df.ne().
First create a DataFrame that holds all genres (df_genres),
then use
netflix_movies_col_subset['genre'].ne(df_genres, axis=0)
This should create a DataFrame that has a row for every movie and a column for every genre. Because ne() tests for inequality, a documentary would have True in every column except the Documentaries column, where it would be False (eq() gives the opposite convention).
A vectorized method like this can be orders of magnitude faster than iterating with multiple if statements.
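Another way to avoid the loop entirely, swapped in here as an alternative to the ne() idea, is to look each genre up in a dictionary with Series.map and fill the misses with a default. This is a minimal sketch; the five-row DataFrame is made-up sample data standing in for netflix_movies_col_subset, and genre_colors is a name introduced only for this example.

```python
import pandas as pd

# Made-up stand-in for netflix_movies_col_subset
netflix_movies_col_subset = pd.DataFrame(
    {"genre": ["Children", "Dramas", "Documentaries", "Stand-up", "Comedies"]}
)

# Genres that get their own color; everything else becomes black
genre_colors = {"Children": "red", "Documentaries": "blue", "Stand-up": "green"}

# map() yields NaN for genres missing from the dict, fillna() sets the default
colors = netflix_movies_col_subset["genre"].map(genre_colors).fillna("black").tolist()

# Inspect the first 10 values in the list
print(colors[:10])
```

The result is a plain Python list, so it can be used wherever the loop-built colors list was used.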
Does this help? I haven't tested it yet.
# Define an empty list
colors = []
# Iterate over rows of netflix_movies_col_subset
for t in netflix_movies_col_subset['genre']:
    if t == 'Children' :
        x = 'red'
    elif t == 'Documentaries' :
        x = 'blue'
    elif t == 'Stand-up' :
        x = 'green'
    else:
        x = 'black'
    colors.append(x)
# Inspect the first 10 values in your list
print(colors[:10])
Or you can use match-case (Python 3.10 or later).
# Define an empty list
colors = []
# Iterate over rows of netflix_movies_col_subset
for t in netflix_movies_col_subset['genre']:
    match t:
        case 'Children':
            x = 'red'
        case 'Documentaries':
            x = 'blue'
        case 'Stand-up':
            x = 'green'
        case _:  # default branch; match has no else clause
            x = 'black'
    colors.append(x)
# Inspect the first 10 values in your list
print(colors[:10])

Filtering dataframe in a loop with use of config file values

I have the following toy dataset
data = {"Subject":["1","2","3","3","4","5","5"],
"date": ["2020-05-01 16:54:25","2020-05-03 10:31:18","2020-05-08 10:10:40","2020-05-08 10:10:42","2020-05-06 09:30:40","2020-05-07 12:46:30","2020-05-07 12:55:10"],
"Accept": ["True","False","True","True","False","True","True"],
"Amount" : [150,30,32,32,300,100,50],
"accept_1": ["True","False","True","True","False","True","True"],
"amount_1" : [20,30,32,32,150,100,30],
"Transaction":["True","True","False","False","True","True","False"],
"Label":["True","True","True","False","True","True","True"]}
data = pd.DataFrame(data)
and a small toy config file
config = [{"colname": "Accept","KeepValue":"True","RemoveTrues":"True"},
{"colname":"Transaction","KeepValue":"False","RemoveTrues":"False"}]
I want to loop through the dataset and apply these filters. After I have applied the first filter,
I want to apply the following filter on the filtered data and so on.
I run the following code and it seems it applies the filter on the data the first time and then, it applies the second filter on the original data, not the filtered one.
for i in range(len(config)):
    filtering = config[i]
    if filtering["RemoveTrues"] == "True":
        col = filtering["colname"]
        test = data[data[col] == filtering["KeepValue"]]
        print(test)
    else:
        col = filtering["colname"]
        test = data[(data[col]== filtering["KeepValue"]) | data["Label"]]
        print(test)
How can I apply the first filter on the data, then the second filter on the filtered data and so on ? I need to use a loop since I have to get the filters from the configuration file.
From what I understand, you want each filter to be applied to the result of the previous one. In your code, every iteration filters the original DataFrame (data), so the earlier filtering is lost. Copy the original into a new variable, call it "test", filter that, and assign the result back to "test" so that the next iteration works on the filtered data:
test = data.copy() # copy the original dataframe so the loop never modifies it
for i in range(len(config)):
    filtering = config[i]
    if filtering["RemoveTrues"] == "True":
        col = filtering["colname"]
        test = test[test[col] == filtering["KeepValue"]] # filter test and save the result back to test for the next iteration
        print(test)
    else:
        col = filtering["colname"]
        test = test[(test[col]== filtering["KeepValue"]) | test["Label"]] # filter test and save the result back to test for the next iteration
        print(test)
I'd suggest changing your True/False strings to booleans. You can just assign a new value to df that will persist during the loop (don't create an extra test variable).
df = pd.DataFrame(data)
config = [{"colname": "Accept","KeepValue":"True","RemoveTrues":"True"},
          {"colname":"Transaction","KeepValue":"False","RemoveTrues":"False"}]
for conf in config:
    if conf["RemoveTrues"] == "True":
        df = df[df[conf['colname']] == conf['KeepValue']]
        print(df)
    else:
        df = df[(df[conf['colname']]== conf["KeepValue"]) | df["Label"]]
        print(df)
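To illustrate the suggestion about booleans, here is a sketch of the same loop with real True/False values instead of strings. It assumes the columns and the config values can be converted to booleans up front, uses only a subset of the toy columns, and introduces the names filtered and keep purely for this example.

```python
import pandas as pd

# Subset of the toy dataset, with real booleans instead of strings
data = pd.DataFrame({
    "Subject": ["1", "2", "3", "3", "4", "5", "5"],
    "Accept": [True, False, True, True, False, True, True],
    "Transaction": [True, True, False, False, True, True, False],
    "Label": [True, True, True, False, True, True, True],
})

config = [
    {"colname": "Accept", "KeepValue": True, "RemoveTrues": True},
    {"colname": "Transaction", "KeepValue": False, "RemoveTrues": False},
]

filtered = data  # each pass narrows this frame; data itself is never modified
for conf in config:
    # Rows whose value in the configured column equals KeepValue
    keep = filtered[conf["colname"]] == conf["KeepValue"]
    if not conf["RemoveTrues"]:
        # Same rule as the else branch above: also keep rows where Label is True
        keep = keep | filtered["Label"]
    filtered = filtered[keep]
    print(filtered)
```

With boolean columns the | operator combines two boolean masks, which avoids mixing a boolean mask with a column of strings.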

Randomization of a list with conditions using Pandas

I'm new to any kind of programming as you can tell by this 'beautiful' piece of hard coding. With sweat and tears (not so bad, just a little), I've created a very sequential code and that's actually my problem. My goal is to create a somewhat-automated script - probably including for-loop (I've unsuccessfully tried).
The main aim is to create a randomization loop that takes an original dataset looking like this:
[image: dataset]
From this dataset I want to pick rows at random, one by one, and save each one to another Excel sheet. Each row must be selected so that its values in the columns position01 and position02 do not match either of those two values in the previous pick. This should eventually produce an Excel sheet of randomized rows in which every row is followed by a row that shares no position values with it: row 2 should not contain any of the position01 or position02 values of row 1, row 3 should not contain those of row 2, and so on. It should iterate over the length of the list, which is 0-11. The Excel output also matters, since I need the rest of the columns; I just need to shuffle the order.
I hope my aim and description are clear enough; if not, I'm happy to answer any questions. I would appreciate any hint or help that gets me 'unstuck'. Thank you. Code below. (PS: I'm aware that there is probably a much neater solution than this.)
import pandas as pd
import random
dataset = pd.read_excel("C:\\Users\\ibm\\Documents\\Psychopy\\DataInput_Training01.xlsx")
# original data set use for comparisons
imageDataset = dataset.loc[0:11, :]
# creating empty df for storing rows from imageDataset
emptyExcel = pd.DataFrame()
randomPick = imageDataset.sample() # select randomly one row from imageDataset
emptyExcel = emptyExcel.append(randomPick) # append a row to empty df
randomPickIndex = randomPick.index.tolist() # get index of the row
imageDataset2 = imageDataset.drop(index=randomPickIndex) # delete the row with index selected before
# getting raw values from the row 'position01'/02 are columns headers
randomPickTemp1 = randomPick['position01'].values[0]
randomPickTemp2 = randomPick
randomPickTemp2 = randomPickTemp2['position02'].values[0]
# getting a dataset which not including row values from position01 and position02
isit = imageDataset2[(imageDataset2.position01 != randomPickTemp1) & (imageDataset2.position02 != randomPickTemp1) & (imageDataset2.position01 != randomPickTemp2) & (imageDataset2.position02 != randomPickTemp2)]
# pick another row from dataset not including row selected at the beginning - randomPick
randomPick2 = isit.sample()
# save it in empty df
emptyExcel = emptyExcel.append(randomPick2, sort=False)
# get index of this second row to delete it in next step
randomPick2Index = randomPick2.index.tolist()
# delete the another row
imageDataset3 = imageDataset2.drop(index=randomPick2Index)
# AND REPEAT the procedure of comparison of the raw values with dataset already not including the original row:
randomPickTemp1 = randomPick2['position01'].values[0]
randomPickTemp2 = randomPick2
randomPickTemp2 = randomPickTemp2['position02'].values[0]
isit2 = imageDataset3[(imageDataset3.position01 != randomPickTemp1) & (imageDataset3.position02 != randomPickTemp1) & (imageDataset3.position01 != randomPickTemp2) & (imageDataset3.position02 != randomPickTemp2)]
# AND REPEAT with another pick - save - matching - picking again.. until end of the length of the dataset (which is 0-11)
In the end I used a solution provided by David Bridges (post from Sep 19 2019) on the PsychoPy forum. In case anyone is interested, here is the link: https://discourse.psychopy.org/t/how-do-i-make-selective-no-consecutive-trials/9186
I just adjusted the condition in the for loop to my case, like this:
remaining = [choices[x] for x in choices if last['position01'] != choices[x]['position01'] and last['position01'] != choices[x]['position02'] and last['position02'] != choices[x]['position01'] and last['position02'] != choices[x]['position02']]
Thank you very much for the helpful answer! Hopefully I did not spam too much here.
import itertools as it
import random
import pandas as pd
# list of pairs of numbers
tmp1 = list(it.permutations(range(6), 2))
df = pd.DataFrame(tmp1, columns=["position01","position02"])
# DataFrame.append was removed in pandas 2.0, so rows are collected with pd.concat
i = random.choice(df.index)
df1 = df.loc[[i]]  # first random pick, kept as a one-row DataFrame
df = df.drop(index = i)
while not df.empty:
    val = list(df1.iloc[-1])  # position values of the previous pick
    tmp = df[(df["position01"]!=val[0])&(df["position01"]!=val[1])&(df["position02"]!=val[0])&(df["position02"]!=val[1])]
    if tmp.empty: #looped for 10000 times, was never empty
        print("here")
        break
    i = random.choice(tmp.index)
    df1 = pd.concat([df1, df.loc[[i]]], ignore_index = True)
    df = df.drop(index=i)

Taking user input and searching through a csv using pandas python

I'm trying to take user input and search the csv file by asking the make, model, and year of the car but it is not filtering the cars correctly when I get to model. It is still showing all the car models even when I want only Toyota cars. I am also getting Empty dataframe error when I finish the input.
sData.csv
import pandas
# reads in vehicle Data
df = pandas.read_csv('sData.csv')
pandas.set_option('display.max_columns', None)
pandas.set_option('display.width', 400)
def get_choice(data, column):
    #Gets user choice
    nums = [val for val in range(len(df[column].unique()))]
    choices = list(zip(nums, df[column].unique()))
    print("Select '%s' of the car\n" % column)
    for v in choices:
        print("%s. %s" % (v))
    user_input = input("Answer: ")
    user_answer = [val[1] for val in choices if val[0]==int(user_input)][0]
    print("'%s' = %s\n" % (column, user_answer))
    return user_answer
def main():
    make_input = get_choice(data=df, column="make")
    model_input = get_choice(data=df, column="model")
    year_input = get_choice(data=df, column="year")
    newdf = df.loc[(df["make"]==make_input)&(df["model"]==model_input)&(df["year"]==year_input)]
    print(newdf)
if __name__ == "__main__":
    main()
The models are not filtered by make because you are neither modifying df nor passing a filtered copy of it to the subsequent calls to get_choice. Each time you call get_choice, the options are built from the full data, not from the result of the prior choices. Note also that get_choice reads the global df rather than its data argument, so it has to use data[column] before passing in a filtered frame has any effect.
There are multiple possible solutions. One would be to simply update df as you make choices, but this wouldn't keep the original df.
make_input = get_choice(data=df, column="make")
df = df.loc[df.make == make_input, :]
...
Alternatively:
filtered_df = df.copy() # Could be costly if df is large
make_input = get_choice(data=filtered_df, column="make")
filtered_df = filtered_df.loc[filtered_df.make == make_input, :]
model_input = get_choice(data=filtered_df, column="model")
...
As for the empty DataFrame error, I would advise using a debugger to see what each step of your code yields. If you do not already know how to use one, I would recommend looking at pdb.
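The progressive filtering described above can be sketched without the interactive prompts. Everything here is illustrative: cars is a made-up stand-in for sData.csv, selections stands in for the answers a user would type, and options_seen is a helper name introduced only to show which options each step would offer.

```python
import pandas as pd

# Made-up stand-in for sData.csv
cars = pd.DataFrame({
    "make": ["Toyota", "Toyota", "Honda"],
    "model": ["Corolla", "Camry", "Civic"],
    "year": [2018, 2020, 2018],
})

# Hypothetical stand-ins for the answers typed by the user
selections = {"make": "Toyota", "model": "Camry", "year": 2020}

filtered = cars
options_seen = {}
for column, value in selections.items():
    # Options come from the already-filtered frame, not from the full data
    options_seen[column] = filtered[column].unique().tolist()
    # Narrow the frame before asking the next question
    filtered = filtered.loc[filtered[column] == value]

print(options_seen["model"])  # only models of the chosen make
print(filtered)
```

Because every choice is drawn from rows that still match the earlier choices, the final frame cannot be empty. With unfiltered options, picking a model that does not belong to the chosen make is one plausible way to end up with an empty DataFrame.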

subset a dataframe after a for loop and manipulate

I'm trying to write a for loop where I can subset a dataframe for each unique ID and create a new column. In my example, I want to create a new balance based on the ID, balance and initial amount. My idea was to loop through each group of IDs, take that subset and follow it up with some if/else statements. In each iteration I want the loop to look at all the rows for one unique ID. For example, when I loop through df["ID"] == 2, there should be 7 rows, since their balances are all related to each other. This is what my dataframe would look like:
df = pd.DataFrame(
{"ID" : [2,2,2,2,2,2,2,3,4,4,4],
"Initial amount":
[3250,10800,6750,12060,8040,4810,12200,13000,10700,12000,27000],
"Balance": [0,0,0,0,0,0,0,2617,19250,19250,19250], "expected output":
[0,0,0,0,0,0,0,2617,10720,8530,0]})
My current code looks like this, but I feel like I'm headed in the wrong direction. Thanks!
unique_ids = list(df["ID"].unique())
new_output = []
for i in range(len(unique_ids)):
    this_id = unique_ids[i]
    subset = df.loc[df["ID"] == this_id,:]
    for j in range(len(subset)):
        this_bal = subset["Balance"]
        this_amt = subset["Initial amount"]
        if j == 0:
            this_output = np.where(this_bal >= this_amt, this_amt, this_bal)
            new_output.append(this_output)
        elif this_bal - sum(this_output) >= this_amt:
            this_output = this_amt
            new_output.append(this_output)
        else:
            this_output = this_bal - sum(this_output)
            new_output.append(this_output)
Any suggestions would be greatly appreciated!
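One possible reading of the rule in the loop above is that each row takes as much of its ID's balance as its initial amount allows, and later rows of the same ID share whatever is left. Under that assumption, which is an interpretation and not confirmed by the question, the calculation can be sketched without an explicit loop using a grouped cumulative sum. The names used_before, remaining and new_balance are introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [2, 2, 2, 2, 2, 2, 2, 3, 4, 4, 4],
    "Initial amount": [3250, 10800, 6750, 12060, 8040, 4810,
                       12200, 13000, 10700, 12000, 27000],
    "Balance": [0, 0, 0, 0, 0, 0, 0, 2617, 19250, 19250, 19250],
})

# Amount already claimed by earlier rows of the same ID
used_before = df.groupby("ID")["Initial amount"].cumsum() - df["Initial amount"]

# What is left of the balance when this row is reached (never negative)
remaining = (df["Balance"] - used_before).clip(lower=0)

# Each row takes the smaller of its own amount and the remaining balance
amount = df["Initial amount"]
df["new_balance"] = remaining.where(remaining <= amount, amount)
print(df)
```

For IDs 2 and 3 this reproduces the expected output column. For ID 4 it yields 10700, 8550, 0, whereas the question lists 10720, 8530, 0. The two differ by 20 in the first two rows, which may be a typo in the example data or may mean the intended rule is different from the one assumed here.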
