Taking user input and searching through a CSV using pandas - python

I'm trying to take user input and search the CSV file by asking for the make, model, and year of the car, but it is not filtering the cars correctly when I get to model: it still shows all the car models even when I want only Toyota cars. I also get an empty DataFrame when I finish the input.
sData.csv
import pandas

# reads in vehicle data
df = pandas.read_csv('sData.csv')
pandas.set_option('display.max_columns', None)
pandas.set_option('display.width', 400)

def get_choice(data, column):
    # gets user choice
    nums = [val for val in range(len(df[column].unique()))]
    choices = list(zip(nums, df[column].unique()))
    print("Select '%s' of the car\n" % column)
    for v in choices:
        print("%s. %s" % v)
    user_input = input("Answer: ")
    user_answer = [val[1] for val in choices if val[0] == int(user_input)][0]
    print("'%s' = %s\n" % (column, user_answer))
    return user_answer

def main():
    make_input = get_choice(data=df, column="make")
    model_input = get_choice(data=df, column="model")
    year_input = get_choice(data=df, column="year")
    newdf = df.loc[(df["make"] == make_input) & (df["model"] == model_input) & (df["year"] == year_input)]
    print(newdf)

if __name__ == "__main__":
    main()

The issue you are having with models not being filtered by make is caused by the fact that you are not modifying df, or passing a modified copy of it, in the subsequent calls to get_choice. Essentially, each time you call get_choice the options offered are not filtered by your prior choices.
There are multiple possible solutions, one would be to simply update df as you make choices, but this wouldn't keep the original df.
make_input = get_choice(data=df, column="make")
df = df.loc[df.make == make_input, :]
...
Alternatively:
filtered_df = df.copy() # Could be costly if df is large
make_input = get_choice(data=filtered_df, column="make")
filtered_df = filtered_df.loc[filtered_df.make == make_input, :]
model_input = get_choice(data=filtered_df, column="model")
...
As for the empty DataFrame, I would advise using a debugger to see what each step of your code is yielding. If you do not already know how to use one, I would recommend looking at pdb.
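To make the cascading-filter idea concrete, here is a minimal runnable sketch with an inline DataFrame (the column names mirror the question; the rows are made up), where the choice function reads from its data argument rather than the global df, and the interactive input() is replaced by a parameter:

```python
import pandas as pd

# Made-up rows standing in for sData.csv
df = pd.DataFrame({
    "make":  ["Toyota", "Toyota", "Honda"],
    "model": ["Camry", "Corolla", "Civic"],
    "year":  [2019, 2020, 2020],
})

def get_choice(data, column, answer):
    # Offer only the values present in the (already filtered) data;
    # `answer` stands in for the interactive input() call.
    options = list(data[column].unique())
    return options[answer]

filtered = df.copy()
make = get_choice(filtered, "make", 0)              # -> "Toyota"
filtered = filtered.loc[filtered["make"] == make]
model = get_choice(filtered, "model", 1)            # only Toyota models are offered
filtered = filtered.loc[filtered["model"] == model]
```

Because each call sees the already-filtered frame, the model menu no longer lists Honda models once Toyota has been chosen.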

Related

How to check total percentages in a dataframe with each respective user id

I have a huge dataset containing almost 3 million CSV files. Each CSV represents a different user.
First of all I appended all the CSVs together and assigned a user id to each of them.
Here's a preview of how I appended all of them into a single feather file:
(link: how all CSVs were appended into Dataframe 1)
Here's a preview of the dataframes:
(link: Dataframe1 (K3) preview)
(link: Dataframe2 (Questions) preview)
What I need is a program that checks each entry in column item_id (Dataframe 1, K3), matches it against column question_id (Dataframe 2, Questions), and verifies that the user_answer entry (Dataframe 1) is the same as correct_answer (Dataframe 2).
If they match, it should create a new dataframe and store the percentages.
For example, the expected output:
(link: expected output preview)
What I have tried so far: I can manually calculate each CSV one by one to get this result, but I cannot do that one by one for over 3 million files, so I want pandas to go over each user interaction.
counter = 0
for key, item_id in B['item_id'].iteritems():
    try:
        if B.loc[key, 'user_answer'] == questions.loc[questions['question_id'] == item_id, 'correct_answer'].values[0]:
            counter += 1
        else:
            pass
    except Exception:
        pass
print(counter)

Total = len(pd.value_counts(B['item_id']))
InCorrect = Total - counter
User_1_sucess = (counter / Total) * 100
print(User_1_sucess)
User_1_failure = (InCorrect / Total) * 100
print(User_1_failure)
One option, which doesn't really use the full power of pandas but may be more straightforward to code, would be to take the code you've already written, optimise it somewhat, then put it in a loop and apply it to each CSV file in turn.
Something like this (rough draft):
# make a dict of the correct answers, for speed:
correct_answers = {}
for _idx, row in questions.iterrows():
    correct_answers[row['question_id']] = row['correct_answer']
del questions  # optional, can help if memory is tight

results = []  # can be a dataframe instead
errors = []
all_files = all_files[:10]  # only process the first 10 for testing
for filename in all_files:
    try:
        df = pd.read_csv(filename, sep=',', index_col=None, header=0)
        user_id = filename.split('\\u')[-1].split('.')[0]
        print("Processing %s" % user_id)
        counter = 0
        for _idx, row in df.iterrows():
            if row['user_answer'] == correct_answers[row['item_id']]:
                counter += 1
        total = len(pd.value_counts(df['item_id']))
        results.append((filename, user_id, counter, total))
    except Exception as e:
        errors.append("Error processing %s: %s" % (filename, e))
    del df  # optional, can help if memory is tight

print("Results:", results)
print("Errors:", errors)
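Since the question says the per-user frames have already been appended into one feather file with a user id column, another option (an assumption on my part about the exact column names) is a fully vectorized merge-and-groupby, sketched here with made-up rows:

```python
import pandas as pd

# Made-up stand-ins for the appended interactions (K3) and the questions table
k3 = pd.DataFrame({
    "user_id":     [1, 1, 2, 2],
    "item_id":     ["q1", "q2", "q1", "q3"],
    "user_answer": ["a", "b", "c", "d"],
})
questions = pd.DataFrame({
    "question_id":    ["q1", "q2", "q3"],
    "correct_answer": ["a", "b", "d"],
})

# Attach the correct answer to every interaction, then compare column-wise
merged = k3.merge(questions, left_on="item_id",
                  right_on="question_id", how="left")
merged["correct"] = merged["user_answer"] == merged["correct_answer"]

# Percentage of correct answers per user, in one groupby
success = merged.groupby("user_id")["correct"].mean() * 100
```

This avoids both the per-file loop and the per-row iterrows pass, which matters at 3 million users.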

Is there a way to store the original input in a function applied to a dataframe in Python?

def update():
    name = input('Which Country?')
    if df.loc[name, 'added'] == 0:
        df.loc[name, 'added'] = 'x'
        return df.loc[name, 'added']
    return 'Already Updated!!'
I created a table with a list of countries in columns df['Country'] and df['added'], and the function above should update df['added'] with 'x' each time I enter a country. For example, if I enter 'Afghanistan', the Afghanistan row of 'added' turns from 0 to 'x' to mark it as completed. However, when I enter a new value, e.g. 'Albania', Albania is updated but Afghanistan reverts back to the original 0. Is there a way to store the original input in the dataframe (e.g. the value for Afghanistan) instead of the new input replacing it?
This should help:
import pandas as pd

def update(df):
    name = input('Which Country?')
    index = df.loc[df['Country'] == name].index[0]
    if df.loc[index, 'added'] == '0':
        df.loc[index, 'added'] = 'x'
        return df.loc[index, 'added']
    return 'Already Updated!!'

df = pd.DataFrame({'Country': ['Afghanistan', 'Albania'], 'added': ['0', '0']})
ans = update(df)
print(ans)
ans = update(df)
print(ans)
print(df.head())
This is the entire code. It updates the df properly and also returns the values you want. The mistake you made was that you did not use df.loc properly; changing that gave me the correct output. Also, x is not being reverted to 0, which means this code works properly. Please let me know if this works for you; any questions are welcome!
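As a side note (my own variant, not part of the answer above), setting Country as the index lets the question's original df.loc[name, 'added'] style work directly; a small sketch with the interactive input() replaced by a parameter:

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Afghanistan', 'Albania'],
                   'added': ['0', '0']}).set_index('Country')

def update(df, name):
    # `name` replaces the interactive input() call for this sketch
    if df.loc[name, 'added'] == '0':
        df.loc[name, 'added'] = 'x'
        return df.loc[name, 'added']
    return 'Already Updated!!'

first = update(df, 'Afghanistan')    # marks Afghanistan
second = update(df, 'Afghanistan')   # already marked
update(df, 'Albania')                # does not touch Afghanistan
```

Because df is mutated in place, earlier updates survive later calls, which is exactly the behaviour the asker wanted.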

Randomization of a list with conditions using Pandas

I'm new to any kind of programming, as you can tell by this 'beautiful' piece of hard coding. With sweat and tears (not so bad, just a little), I've created a very sequential piece of code, and that's actually my problem. My goal is to create a somewhat automated script, probably involving a for loop (which I've tried unsuccessfully).
The main aim is to create a randomization loop which takes an original dataset looking like this:
(link: dataset preview)
From this dataset, pick rows randomly one by one and save them one by one to another Excel list. The point is that the selected row's values in the columns position01 and position02 should never match the previous pick in either of those two columns. That should eventually create an Excel sheet of randomized rows in which each row shares no values with the previous one: row 2 should not include any of the position01/position02 values of row 1, row 3 should not contain values of row 2, etc. It should also iterate over the range of the list length, which is 0-11. The Excel output is also important: I need the rest of the columns, I just need to shuffle the order.
I hope my aim and description are clear enough; if not, I'm happy to answer any questions. I would appreciate any hint or help that gets me 'unstuck'. Thank you. Code below. (PS: I'm aware that there is probably a much neater solution than this.)
import pandas as pd
import random
dataset = pd.read_excel("C:\\Users\\ibm\\Documents\\Psychopy\\DataInput_Training01.xlsx")
# original data set use for comparisons
imageDataset = dataset.loc[0:11, :]
# creating empty df for storing rows from imageDataset
emptyExcel = pd.DataFrame()
randomPick = imageDataset.sample() # select randomly one row from imageDataset
emptyExcel = emptyExcel.append(randomPick) # append a row to empty df
randomPickIndex = randomPick.index.tolist() # get index of the row
imageDataset2 = imageDataset.drop(index=randomPickIndex) # delete the row with index selected before
# getting raw values from the row; 'position01'/'position02' are column headers
randomPickTemp1 = randomPick['position01'].values[0]
randomPickTemp2 = randomPick['position02'].values[0]
# getting a dataset that does not include the row's values from position01 and position02
isit = imageDataset2[(imageDataset2.position01 != randomPickTemp1) & (imageDataset2.position02 != randomPickTemp1) & (imageDataset2.position01 != randomPickTemp2) & (imageDataset2.position02 != randomPickTemp2)]
# pick another row from dataset not including row selected at the beginning - randomPick
randomPick2 = isit.sample()
# save it in empty df
emptyExcel = emptyExcel.append(randomPick2, sort=False)
# get index of this second row to delete it in next step
randomPick2Index = randomPick2.index.tolist()
# delete the another row
imageDataset3 = imageDataset2.drop(index=randomPick2Index)
# AND REPEAT the procedure of comparison of the raw values with dataset already not including the original row:
randomPickTemp1 = randomPick2['position01'].values[0]
randomPickTemp2 = randomPick2['position02'].values[0]
isit2 = imageDataset3[(imageDataset3.position01 != randomPickTemp1) & (imageDataset3.position02 != randomPickTemp1) & (imageDataset3.position01 != randomPickTemp2) & (imageDataset3.position02 != randomPickTemp2)]
# AND REPEAT with another pick - save - matching - picking again.. until end of the length of the dataset (which is 0-11)
In the end I used a solution provided by David Bridges (post from Sep 19, 2019) on the PsychoPy forums. In case anyone is interested, here is the link: https://discourse.psychopy.org/t/how-do-i-make-selective-no-consecutive-trials/9186
I just adjusted the condition in the for loop to my case, like this:
remaining = [choices[x] for x in choices if last['position01'] != choices[x]['position01'] and last['position01'] != choices[x]['position02'] and last['position02'] != choices[x]['position01'] and last['position02'] != choices[x]['position02']]
Thank you very much for the helpful answer, and hopefully I did not spam this too much.
import itertools as it
import random
import pandas as pd

# list of pairs of numbers
tmp1 = [x for x in it.permutations(list(range(6)), 2)]
df = pd.DataFrame(tmp1, columns=["position01", "position02"])
df1 = pd.DataFrame()

i = random.choice(df.index)
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
df1 = pd.concat([df1, df.loc[[i]]], ignore_index=True)
df = df.drop(index=i)
while not df.empty:
    val = list(df1.iloc[-1])
    tmp = df[(df["position01"] != val[0]) & (df["position01"] != val[1]) & (df["position02"] != val[0]) & (df["position02"] != val[1])]
    if tmp.empty:  # looped 10000 times, was never empty
        print("here")
        break
    i = random.choice(tmp.index)
    df1 = pd.concat([df1, df.loc[[i]]], ignore_index=True)
    df = df.drop(index=i)
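The same idea can be packaged as a reusable function that signals a dead end to the caller instead of printing; this is my own sketch under the same assumptions (pair columns position01/position02; pd.concat is used because DataFrame.append was removed in pandas 2.0):

```python
import itertools as it
import random
import pandas as pd

def shuffle_no_repeat(df):
    """Reorder rows so consecutive rows share no position01/position02 value.

    Returns None on a dead end, so the caller can simply retry.
    """
    remaining = df.copy()
    i = random.choice(remaining.index)
    picked = [remaining.loc[[i]]]
    remaining = remaining.drop(index=i)
    while not remaining.empty:
        prev = picked[-1].iloc[0]
        vals = {prev["position01"], prev["position02"]}
        # candidate rows sharing no value with the previous pick
        ok = remaining[~remaining["position01"].isin(vals)
                       & ~remaining["position02"].isin(vals)]
        if ok.empty:
            return None  # dead end: caller retries
        i = random.choice(ok.index)
        picked.append(remaining.loc[[i]])
        remaining = remaining.drop(index=i)
    return pd.concat(picked, ignore_index=True)

pairs = pd.DataFrame(list(it.permutations(range(6), 2)),
                     columns=["position01", "position02"])
result = shuffle_no_repeat(pairs)
while result is None:  # dead ends are rare for this input; just retry
    result = shuffle_no_repeat(pairs)
```

Keeping the original frame untouched (only the copy is consumed) also makes it easy to write the rest of the columns back out to Excel unchanged.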

Function to split & expand returning NameError

def unique_unit_split(df):
    df_unit_list = df_master.loc[df_master['type'] == 'unit']
    df_unit_list = df_unit_list.key.tolist()
    for i in range(len(df_unit_list)):
        df_unit_list[i] = int(df_unit_list[i])
    split_1 = df_units.units.str.split('[","]', expand=True).stack()
    df_units_update = df_units.join(pd.Series(index=split_1.index.droplevel(1), data=split_1.values, name='unit_split'))
    df_units_final = df_units_update[df_units_update['unit_split'].isin(df_unit_list)]
    return(df)
Updated script: still not working
df_unit_list = []
split_1 = pd.DataFrame()
df_units_update = pd.DataFrame()
df_units_final = pd.DataFrame()

def unique_unit_split(df):
    df_unit_list = df_master.loc[df_master['type'] == 'unit']
    df_unit_list = df_unit_list.key.tolist()
    for i in range(len(df_unit_list)):
        df_unit_list[i] = int(df_unit_list[i])
    split_1 = df_units.units.str.split('[","]', expand=True).stack()
    df_units_update = df_units.join(pd.Series(index=split_1.index.droplevel(1), data=split_1.values, name='unit_split'))
    df_units_final = df_units_update[df_units_update['unit_split'].isin(df_unit_list)]
    return(df)
The function above originally worked when I split the two actions up (the code up to and including the for loop was in one function, and everything from split_1 down was in another). Now that I have tried to condense them, I am getting a NameError (image attached). Does anyone know how I can resolve this issue and ensure my final df (df_units_final) is defined?
For more insight into this function: I have a df with comma-separated values in one column, and I needed to split that column, drop the [], and keep only the rows with the numbers I need, which were defined in the list "df_unit_list".
(screenshot: NameError details)
The issue was as stated above (not defining df_units_final), AND my for loop was forcing the list to int when the values in the other df were actually strings.
(screenshot: working code)
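Putting those two fixes together (return the final frame from the function, and keep the keys as strings), here is a hedged, runnable sketch; the sample df_master / df_units contents are made up for illustration:

```python
import pandas as pd

# Hypothetical sample frames standing in for the asker's df_master / df_units
df_master = pd.DataFrame({'type': ['unit', 'unit', 'other'],
                          'key':  ['101', '102', '999']})
df_units = pd.DataFrame({'units': ['[101,103]', '[102]', '[104]']})

def unique_unit_split(df_master, df_units):
    # Keep the keys as strings so they compare equal to the split values
    unit_list = df_master.loc[df_master['type'] == 'unit', 'key'].tolist()
    # Drop the surrounding brackets, then split on commas and stack
    split_1 = (df_units['units'].str.strip('[]')
               .str.split(',', expand=True).stack().dropna())
    df_units_update = df_units.join(
        pd.Series(index=split_1.index.droplevel(1),
                  data=split_1.values, name='unit_split'))
    # Return the final frame instead of the unmodified input
    return df_units_update[df_units_update['unit_split'].isin(unit_list)]

df_units_final = unique_unit_split(df_master, df_units)
```

Passing both frames in as parameters also avoids relying on globals, which is what made the NameError possible in the first place.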

Data cleaning for-loop in python for pos data

I have POS data from a message shop.
The data is as shown in the attached picture.
import pandas as pd

# read data from csv
data = pd.read_csv('test1.csv')

# make a list for each column
sales_id = list(data['sales_id'])
shop_number = list(data['shop_number'])
sales = list(data['sales'])
cashier_no = list(data['cashier_no'])
messager_no = list(data['messager_no'])
type_of_sale = list(data['type_of_sale'])
costomer_ID = list(data['costomer_ID'])
date = list(data['date'])
time = list(data['time'])
I want to make a new list marking which rows of purchase data should be deleted, like this:
data_to_clean = [0, 1, 0, 1, 0, 0, 1, 0, 1]
To do this, I want to make a for loop:
for i in range(len(type_of_sale)):
    data_to_clean = []
    if type_of_sale[i] == "purchase":
        data_to_clean = data_to_clean.append(0)
    elif type_of_sale[i] == "return":
        data_to_clean = data_to_clean.append(1)
    # I want to write code so I can delete purchase data too,
    # with conditions: if it has the same shop_number, messager_no, costomer_ID and -price
return list(data_to_clean)
There are two main problems with this code. One, it doesn't move. Two, I don't know how to check shop_number, messager_no and costomer_ID to put 1 or 0 in my data_to_clean list.
Sometimes I have to check the data above, like sales_id 1628060, and sometimes below, like sales_id 1599414, knowing that the cashier may differ, but the costomer_ID should always be the same.
The question is how to write the code so I can create a list or dataframe of 0s and 1s showing which data should be deleted.
When you want to compare data with a string in Python, you should put the string in quotes:
data_to_clean = []
for i in range(len(type_of_sale)):
    if type_of_sale[i] == "purchase":  # here
        data_to_clean.append(0)
    elif type_of_sale[i] == "return":  # and here
        data_to_clean.append(1)
Note also that the list must be created once before the loop, and that list.append returns None, so its result should not be assigned back to the variable.
Check the pandas docs. Getting the items which are a return order can be as simple as:
returns = data.loc[data['type_of_sale'] == 'return']
If you want the sales of cashier 90:
data.loc[(data['type_of_sale'] == 'purchase') & (data['cashier_no'] == 90)]
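As an alternative to looping over Python lists, the 0/1 flag column can be built in one vectorized step. A minimal sketch with made-up rows; the extra deletion rules tied to shop_number / messager_no / costomer_ID would still need the matching logic described in the question:

```python
import pandas as pd

# Made-up rows standing in for test1.csv
data = pd.DataFrame({
    "type_of_sale": ["purchase", "return", "purchase", "return"],
    "sales":        [100, -100, 50, -50],
})

# 1 flags a row for deletion (a return), 0 keeps it
data["data_to_clean"] = (data["type_of_sale"] == "return").astype(int)
data_to_clean = data["data_to_clean"].tolist()
```

Keeping the flag as a DataFrame column also makes the later cross-row checks easier, since the other columns stay aligned with it.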
