Finding the line number of specific values in pandas dataframe - python

I am working on a school project and I am trying to simulate a library's catalogue system. I have .csv files that hold all the data I need, but I am having a problem checking whether an inputted title, author, barcode, etc. is in the data set. I have searched around for quite a while and tried different solutions, but nothing is working.
The idea that I have right now is that if I can find which line the inputted data is on, then I can use .loc[] to get the needed info.
Is this the right track? Is there another, more efficient way to do this?
import pandas
mainData = pandas.read_csv("mainData.csv")
barcodes = mainData["Barcode"]
authors = mainData["Author"]
titles = mainData["Title/Subtitle"]
callNumbers = mainData["Call Number"]
k = "Han, Jenny,"
for i in authors:
    if k == i:
        print("Success")
        k = authors.index[k]
        print(authors[k])
    else:
        print("Fail" + k)
# Please note: this code only checks for an author match and leaves out all other fields, as I thought it was too inefficient to extend to the rest of the fields. The code also does not find the line on which the matches are located, therefore .loc[] cannot be used to print out all the data found.
This is the code I am using right now. It outputs the result along with the error IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices, and it is very slow. I would like the code to be able to output the books and their respective info. I have found that the .loc[] feature (mentioned above) outputs the info quite nicely. Here is the data I am using.
Edit: I have been able to reduce the time it takes for the program to run and made a functional "prototype":
authorFirst = authorFirst.lower()
authorFirst = authorFirst.title()
authorFirst += ","
authorSecond = input("Enter author's last name: ")
authorSecond = authorSecond.lower()
authorSecond = authorSecond.title()
authorSecond += ", "
authorInput = authorSecond + authorFirst
print(mainData[mainData["Author"].isin([authorInput])])
bookChoice = input("Please Enter the number to the left of the barcode to select a book: ")
print(mainData.loc[int(bookChoice)])
It provides the functionality that I am looking for, but I feel that there has to be a better way of doing it (without asking the user to input the row number). I don't know if this is possible, though.
I am new to Python and this is my first time using pandas, so I'm sorry if this is really shitty and hurts your brain.
Thank you so much for your time!

Pandas does not need the numeric index of a row in order to select it.
Since you have not provided any starting point or data, I'll just provide a few pointers here, as there are many ways to match and index things in pandas.
import pandas as pd
# build a library
library = pd.DataFrame({
    "Author": ["H.G. Wells", "Hubert Selby Jr.", "Ken Kesey"],
    "Title": [
        "The War of the Worlds",
        "Requiem for a Dream",
        "One Flew Over the Cuckoo's Nest",
    ],
    "Published": [1898, 1979, 1962],
})
# find on some characteristics
mask_wells = library.Author.str.contains("Wells")
mask_rfad = library["Title"] == "Requiem for a Dream"
mask_xixth = library["Published"] < 1900
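Masks like these can be handed straight to .loc[], which returns the matching rows without any row number being looked up first. The following is only a minimal sketch on the same small, made-up library; the variable names wells_books, old_or_kesey and matching_labels are illustrative and not from the original post.

```python
import pandas as pd

# Hypothetical sample data, mirroring the small library above.
library = pd.DataFrame({
    "Author": ["H.G. Wells", "Hubert Selby Jr.", "Ken Kesey"],
    "Title": [
        "The War of the Worlds",
        "Requiem for a Dream",
        "One Flew Over the Cuckoo's Nest",
    ],
    "Published": [1898, 1979, 1962],
})

# A boolean mask selects the matching rows directly; no row number is needed.
mask_wells = library["Author"].str.contains("Wells", regex=False)
wells_books = library.loc[mask_wells]

# Masks combine with & (and) and | (or); each condition needs parentheses.
old_or_kesey = library.loc[
    (library["Published"] < 1900) | (library["Author"] == "Ken Kesey")
]

# If the row labels are still wanted, they are available on the result.
matching_labels = wells_books.index.tolist()

print(wells_books)
print(matching_labels)
```

An empty result (for example wells_books.empty being True) is a simple way to report that nothing matched the user's input.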

Related

How to create a dataframe with a dynamic text parameter passed to a function in Python

(I'm pretty new to Python, and even to coding, so forgive me for my stupidity.)
I'm trying to pass a text value and a list as parameters to a function. Here's an example :
Names = File['Student_Name']
Scores = File['Marks']
for a in range(0,100):
    Student_Name = [Names[a]]
    Marks = []
    NewDf = pd.DataFrame(PreCovid(Student_Name,Marks))
    Master_Sheet_PreCovid = NewDf
Master_Sheet_PreCovid
What I wish to achieve is passing Name of a Student, as a string, one at a time, to the function. In this code, I'm vaguely creating a df with each loop iteration, which obviously will only return me the last value, however, I wish to get the output for complete list of Students. What modifications/additions do I make in this code to make it work.
I followed the thread "Why the function is only returning the last value?", which was similar to my query but might not work with my requirements.
Edited: I actually have two sheets that I'm fetching my data from. One is a Main Sheet that has all the data, with redundancy; the other is a Rule Book with unique values and the rules for calculation. In this code I'm only fetching values from the Rule Book, then going to the function, fetching data from the Main Sheet based on these values, performing my calculations, creating a new dataframe, inserting the values I get into that dataframe, and returning the final dataframe. Right now, the calculation based only on Student_Name has worked, but now I have a bigger problem of also calculating based on Marks.
At the risk of sounding arrogant, I only wish to pass the name as string, not as list.
Again, I'm sorry about the stupidity of my query.

Give it a try:
Names = File['Student_Name']
Scores = File['Marks']
Master_Sheet_PreCovid = []
for a in range(0,100):
    Student_Name = [Names[a]]
    Marks = []
    NewDf = pd.DataFrame(PreCovid(Student_Name,Marks))
    Master_Sheet_PreCovid.append(NewDf)
Master_Sheet_PreCovid = pd.concat(Master_Sheet_PreCovid)
print(Master_Sheet_PreCovid)
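For reference, here is a self-contained sketch of the same collect-then-concat pattern that also passes each name as a plain string rather than a one-element list. The sample File data and the body of PreCovid are hypothetical stand-ins, since the real sheet and function are not shown in the question.

```python
import pandas as pd

# Hypothetical input sheet; the real `File` comes from the asker's data.
File = pd.DataFrame({
    "Student_Name": ["Asha", "Ben", "Chen"],
    "Marks": [71, 64, 88],
})

def PreCovid(student_name, marks):
    """Hypothetical stand-in: returns one row of results for one student."""
    return {
        "Student_Name": [student_name],
        "Marks": [marks],
        "Passed": [marks >= 65],
    }

frames = []
# zip walks both columns together, so each name arrives as a plain string.
for student_name, marks in zip(File["Student_Name"], File["Marks"]):
    frames.append(pd.DataFrame(PreCovid(student_name, marks)))

# Concatenate once at the end; ignore_index gives a clean 0..n-1 index.
Master_Sheet_PreCovid = pd.concat(frames, ignore_index=True)
print(Master_Sheet_PreCovid)
```

Iterating over the columns directly also avoids the hard-coded range(0,100), which would fail if the sheet had fewer than 100 rows.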

Fixing a meeting room function schedule with double and triple bookings to determine space usage

I need to calculate the total amount of time each group uses a meeting space. But the data set has double and triple booking, so I think I need to fix the data first. Disclosure: My coding experience consists solely of working through a few Dataquest courses, and this is my first stackoverflow posting, so I apologize for errors and transgressions.
Each line of the data set contains the group ID and a start and end time. It also includes the booking type, ie. reserved, meeting, etc. Generally, the staff reserve a space for the entire period, which would create a single line, and then add multiple lines for each individual function when the details are known. They should segment the original reserved line so it's only holding space in between functions, but instead they double book the space, so I need to add multiple lines for these interim RES holds, based on the actual holds.
Here's what the data basically looks like:
Existing data:
functions = [['Function', 'Group', 'FunctionType', 'StartTime', 'EndTime'],
[01,01,'RES',2019/10/04 07:00,2019/10/06 17:00],
[02,01,'MTG',2019/10/05 09:00,2019/10/05 12:00],
[03,01,'LUN',2019/10/05 12:30,2019/10/05 13:30],
[04,01,'MTG',2019/10/05 14:00,2019/10/05 17:00],
[05,01,'MTG',2019/10/06 09:00,2019/10/06 12:00]]
I've tried to iterate using a for loop:
for index, row in enumerate(functions):
    last_row_index = len(functions) - 1
    if index == last_row_index:
        pass
    else:
        current_index = index
        next_index = index + 1
        if row[3] <= functions[next_index][2]:
            next
        elif row[4] == 'RES' or row[6] < functions[next_index][6]:
            copied_current_row = row.copy()
            row[3] = functions[next_index][2]
            copied_current_row[2] = functions[next_index][3]
            functions.append(copied_current_row)
There seems to be a logical problem in here, because that last append line seems to put the program into some kind of loop and I have to manually interrupt it. So I'm sure it's obvious to someone experienced, but I'm pretty new.
The reason I've done the comparison to see if a function is RES is that reserved should be subordinate to actual functions. But sometimes there are overlaps between actual functions, so I'll need to create another comparison to decide which one takes precedence, but this is where I'm starting.
How I (think) I want it to end up:
[['Function', 'Group', 'FunctionType', 'StartTime', 'EndTime'],
[01,01,'RES',2019/10/04 07:00,2019/10/05 09:00],
[02,01,'MTG',2019/10/05 09:00,2019/10/05 12:00],
[01,01,'RES',2019/10/05 12:00,2019/10/05 12:30],
[03,01,'LUN',2019/10/05 12:30,2019/10/05 13:30],
[01,01,'RES',2019/10/05 13:30,2019/10/05 14:00],
[04,01,'MTG',2019/10/05 14:00,2019/10/05 17:00],
[01,01,'RES',2019/10/05 17:00,2019/10/06 09:00],
[05,01,'MTG',2019/10/06 09:00,2019/10/06 12:00],
[01,01,'RES',2019/10/06 12:00,2019/10/06 17:00]]
This way, I could do a simple calculation of elapsed time for each function line and add it up to see how much time they had the space booked for.
What I'm looking for here is just some direction I should pursue, and I'm definitely not expecting anyone to do the work for me. For example, am I on the right path here, or would it be better to use pandas and vectorized functions? If I can get the basic direction right, I think I can muddle through the specifics.
Thank-you very much,
AF
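
One possible direction, sketched in plain Python: build a new list instead of appending to the list that is being iterated, since a loop over a list that keeps growing may never reach its end. This is only a rough sketch under two assumptions that the question does not guarantee: the actual functions do not overlap each other, and they all fall inside the reservation. The helper names ts and fill_gaps are made up for illustration.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M"

def ts(text):
    """Parse a timestamp in the format used by the sample data."""
    return datetime.strptime(text, FMT)

def fill_gaps(reservation, functions):
    """Return the functions plus RES rows covering only the gaps between them."""
    func_id, group, ftype, res_start, res_end = reservation
    rows = []
    cursor = res_start  # end of the time already accounted for
    for func in sorted(functions, key=lambda row: row[3]):
        start, end = func[3], func[4]
        if start > cursor:
            # The reservation holds the space until this function starts.
            rows.append([func_id, group, ftype, cursor, start])
        rows.append(list(func))
        cursor = max(cursor, end)
    if cursor < res_end:
        # Trailing hold after the last function.
        rows.append([func_id, group, ftype, cursor, res_end])
    return rows

reservation = [1, 1, "RES", ts("2019/10/04 07:00"), ts("2019/10/06 17:00")]
functions = [
    [2, 1, "MTG", ts("2019/10/05 09:00"), ts("2019/10/05 12:00")],
    [3, 1, "LUN", ts("2019/10/05 12:30"), ts("2019/10/05 13:30")],
    [4, 1, "MTG", ts("2019/10/05 14:00"), ts("2019/10/05 17:00")],
    [5, 1, "MTG", ts("2019/10/06 09:00"), ts("2019/10/06 12:00")],
]

schedule = fill_gaps(reservation, functions)
# Elapsed time per row can now be summed without double counting.
total_hours = sum((row[4] - row[3]).total_seconds() for row in schedule) / 3600
print(len(schedule), total_hours)
```

Overlaps between actual functions would need an extra precedence rule before the totals could be trusted; the sketch does not attempt that.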

How to Combine Two Referential Lists With Python

I have stupid data coming out of a system, and it needs to be flattened.
The main csv has these columns: hostname, program_name, version_name
However, there is only one row per host, so the last two fields look like this:
program_name contents:
Word
Excel
Cognos
Mozilla
version contents (not real, just for illustrative purposes):
2.3.2
121.3.0
build 22
What's the best way to ensure things match up, and to do this more concisely and Pythonically?
Here is what the real code looks like, the above is mainly for demo purposes:
for row in tan_output.programs:
    names = row["Name"].splitlines()
    versions = row["Version"].splitlines()
    if(len(names) != len(versions)):
        print("NAME and VERSION from tan_programs are not equal... Exiting")
        exit()
    else:
        for name in names:
            #tan_programs.append({"Count": row["Count"], "Hostname": row["Hostname"], "Name": row["Name"], "Version": row["Version"]})
I am stuck on the bottom for loop because I feel like I should be looping thru both lists simultaneously instead of looping thru one, and then what I was going to do, use a counter to reference the second one and form the flattened data.
PS, the file is 7 gigs... so the more efficient the better e.g. if I have to use the counter, I know from experience i += 1 is 100 times more efficient than i = i + 1

Just use the counter... unless someone has a better idea:
tan_programs = []
for row in tan_output.programs:
    names = row["Name"].splitlines()
    versions = row["Version"].splitlines()
    if(len(names) != len(versions)):
        print("NAME and VERSION from tan_programs are not equal... Exiting")
        exit()
    else:
        i = 0
        for name in names:
            tan_programs.append({"Hostname": row["Hostname"], "Name": name, "Version": versions[i]})
            i += 1
It's actually very fast... the slow part is inserting 8 million records into a DB on another server over the network.
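As an alternative to the manual counter, the built-in zip() walks both lists in step, which is what the question was reaching for. This is a small sketch only; the rows list is a hypothetical stand-in for tan_output.programs, and raising ValueError instead of calling exit() is a choice made here for illustration.

```python
# Hypothetical rows standing in for tan_output.programs.
rows = [
    {"Hostname": "host-a", "Name": "Word\nExcel", "Version": "2.3.2\n121.3.0"},
    {"Hostname": "host-b", "Name": "Cognos", "Version": "build 22"},
]

tan_programs = []
for row in rows:
    names = row["Name"].splitlines()
    versions = row["Version"].splitlines()
    if len(names) != len(versions):
        raise ValueError(f"Name/Version count mismatch for {row['Hostname']}")
    # zip yields (name, version) pairs in order, so no index is needed.
    for name, version in zip(names, versions):
        tan_programs.append(
            {"Hostname": row["Hostname"], "Name": name, "Version": version}
        )

print(tan_programs)
```

The explicit length check stays in place because zip() silently stops at the shorter list, which would hide mismatched rows.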

Update: Python average income reading and writing files

I was writing code to find the average household income and how many families are below the poverty line.
This is my code so far:
def povertyLevel():
    inFile = open('program10.txt', 'r')
    outFile = open('program10-out.txt', 'w')
    outFile.write(str("%12s %12s %15s\n" % ("Account #", "Income", "Members")))
    lineRead = inFile.readline() # Read first record
    while lineRead != '': # While there are more records
        words = lineRead.split() # Split the records into substrings
        acctNum = int(words[0]) # Convert first substring to integer
        annualIncome = float(words[1]) # Convert second substring to float
        members = int(words[2]) # Convert third substring to integer
        outFile.write(str("%10d %15.2f %10d\n" % (acctNum, annualIncome, members)))
        lineRead = inFile.readline() # Read next record
    # Close the file.
    inFile.close() # Close file
# Call the main function.
povertyLevel()
I am trying to find the average of annualIncome, and what I tried to do was
avgIncome = (sum(annualIncome)/len(annualIncome))
outFile.write(avgIncome)
I did this inside the while lineRead loop. However, it gave me an error saying
avgIncome = (sum(annualIncome)/len(annualIncome))
TypeError: 'float' object is not iterable
Currently I am trying to find which households exceed the average income.

avgIncome expects a sequence (such as a list) (Thanks for the correction, Magenta Nova.), but its argument annualIncome is a float:
annualIncome = float(words[1])
It seems to me you want to build up a list:
allIncomes = []
while lineRead != '':
    ...
    allIncomes.append(annualIncome)
averageInc = avgIncome(allIncomes)
(Note that I have one less indentation level for the avgIncome call.)
Also, once you get this working, I highly recommend a trip over to https://codereview.stackexchange.com/. You could get a lot of feedback on ways to improve this.
Edit:
In light of your edits, my advice still stands. You need to first compute the average before you can do comparisons. Once you have the average, you will need to loop over the data again to compare each income. Note: I advise saving the data somehow for the second loop, instead of reparsing the file. (You may even wish to separate reading the data from computing the average entirely.) That might best be accomplished with a new object or a namedtuple or a dict.
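A rough sketch of that two-pass idea, using a namedtuple to hold each record. The Household type, the sample records and the io.StringIO stand-in for program10.txt are all assumptions made for illustration; only the "account, income, members" layout comes from the question.

```python
from collections import namedtuple
import io

Household = namedtuple("Household", ["acct_num", "income", "members"])

# Hypothetical records in the "account income members" layout from the question.
sample = io.StringIO("1001 42000.00 4\n1002 18500.50 3\n1003 61000.00 2\n")

# First pass: parse every record once and keep it in memory.
households = []
for line in sample:
    words = line.split()
    if not words:  # skip blank lines
        continue
    households.append(Household(int(words[0]), float(words[1]), int(words[2])))

# The average is computed after the loop, over the whole list.
average_income = sum(h.income for h in households) / len(households)

# Second pass: compare each stored income against the average.
above_average = [h.acct_num for h in households if h.income > average_income]

print(f"{average_income:.2f}", above_average)
```

With a real file, the io.StringIO object would be replaced by an open file handle, ideally inside a with block so the file is closed automatically.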

sum() takes an iterable, and len() takes a container such as a list. Read the Python documentation for more on iterables. You are passing a float to them as an argument. What would it mean to get the sum, or the length, of a single floating-point number? Even thinking outside the world of coding, it's hard to make sense of that.
It seems like you need to review the basics of Python types.

Separating lists in a list through iteration

First off, this is a homework assignment, but I've been working on it for a week now and haven't made much headway. My goal for this function is to take a list of lists (each list contains data about a football player) and separate the lists based off of the teams which the players belong to. I also want to add up each player's data so that I wind up with one list for each team with all the player's stats combined.
Here's the code I have so far. The problem I'm currently running into is that some teams are printed multiple times with different data each time. Otherwise it appears to be working correctly. Also, we have the limitation imposed on us that we are not allowed to use classes.
def TopRushingTeam2010(team_info_2010): #running into trouble calculating the rusher rating for each team, it also prints out the same team multiple times but with different stats. And just not getting the right numbers and order.
    total_yards = 0
    total_TD = 0
    total_rush = 0
    total_fum = 0
    #works mostly, but is returning some teams twice, with different stats each time, which
    #should not be happening. so... yeah maybe fix that?
    for item in team_info_2010:
        team = item[0]
        total_yards = item[2]
        total_TD = item[3]
        total_rush = item[1]
        total_fum = item[4]
        new_team_info_2010.append([team, total_yards, total_TD, total_rush, total_fum])
        for other_item in team_info_2010:
            if other_item[0] == team:
                new_team_info_2010.remove([team, total_yards, total_TD, total_rush, total_fum])
                total_yards = total_yards + other_item[2]
                total_TD = total_TD + other_item[3]
                total_rush = total_rush + other_item[1]
                total_fum = total_fum + other_item[4]
                new_team_info_2010.append([team, total_yards, total_TD, total_rush, total_fum])
Any help or tips as to which direction I should head, or if I'm even headed in the right direction?

One possible problem is that you are removing from team_info_2010 while you are iterating through the list. Try deleting that line of code. I don't see a clear reason why you would want to delete from team_info_2010 and behavior is often undefined when you modify an object while iterating through it. More specifically, try deleting the following line of code:
team_info_2010.remove(item)
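Another way to avoid modifying lists mid-iteration, and to get exactly one row per team, is to accumulate totals in a dictionary keyed by team name (no classes involved). This is a sketch under an assumption taken from the indices in the question's code: each row is [team, rushes, yards, touchdowns, fumbles]. The function name total_by_team and the sample rows are made up.

```python
def total_by_team(team_info):
    """Sum each stat column per team, keeping first-seen team order."""
    totals = {}  # team name -> [rushes, yards, touchdowns, fumbles]
    for team, rushes, yards, touchdowns, fumbles in team_info:
        if team not in totals:
            totals[team] = [0, 0, 0, 0]
        running = totals[team]
        running[0] += rushes
        running[1] += yards
        running[2] += touchdowns
        running[3] += fumbles
    # Convert back to a list of lists, one entry per team.
    return [[team] + stats for team, stats in totals.items()]

# Hypothetical player rows: [team, rushes, yards, touchdowns, fumbles]
sample_rows = [
    ["DAL", 10, 50, 1, 0],
    ["NYG", 8, 30, 0, 1],
    ["DAL", 5, 20, 0, 1],
]

team_totals = total_by_team(sample_rows)
print(team_totals)
```

Because each team has a single dictionary entry, a team cannot appear twice in the result, and the input list is never changed.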
