I have an Excel spreadsheet that I read using Python. I am looking for a way to query the first column of the spreadsheet and assign every cell from that column to a variable. The number of cells in the column that contain data can change from day to day, for example:
Excel Spreadsheet:
Names
Mike
Adam
Mitchell
Desired output: Name1=Mike; Name2=Adam; Name3=Mitchell. If tomorrow Mitchell is missing from the list, or an additional name appears, I would have 3 or 4 Name variables respectively.
My try so far was:
for i in db.index:
    if i == 1:
        Name1 = db.loc[0, 'Names']
    elif i == 2:
        Name2 = db.loc[1, 'Names']
    elif i == 3:
        Name3 = db.loc[2, 'Names']
    else:
        Name4 = db.loc[3, 'Names']
Thanks, and apologies for any mistakes.
I managed to fix this, in case anyone else has the same issue. I am using two lists and zipping them into a dictionary.
names = db['Names'].tolist()
lst = []
for i in range(len(db.index)):
    lst.append(i + 1)
lst = ['Name' + str(x) for x in lst]
dictionary = dict(zip(lst, names))
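For what it's worth, the same dictionary can be built in one pass (a sketch, assuming db is the DataFrame read from the spreadsheet and that blank cells should be skipped):

# enumerate from 1 so the keys match Name1, Name2, ...
dictionary = {'Name' + str(i): name for i, name in enumerate(db['Names'].dropna(), start=1)}
print(dictionary)  # {'Name1': 'Mike', 'Name2': 'Adam', 'Name3': 'Mitchell'}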
I would like to go through every row (entry) in my df and remove every entry that has the value "" (which, yes, is an empty string).
So if my data set is:
Name  Gender  Age
Jack          5
Anna  F       6
Carl  M       7
Jake  M       7
Therefore Jack would be removed from the dataset.
On another note, I would also like to remove entries that have the value "Unspecified" or "Undetermined" as well.
Eg:
Name  Gender  Age  Address
Jack          5    *address*
Anna  F       6    *address*
Carl  M       7    Undetermined
Jake  M       7    Unspecified
Now,
Jack will be removed due to empty field.
Carl will be removed due to the value Undetermined present in a column.
Jake will be removed due to the value Unspecified present in a column.
For now, this has been my approach but I keep getting a TypeError.
lst = []
for i in df.columns:
    # this comparison is what raises the error: an entire column
    # cannot be tested for truth in a plain `if`
    if df[i] == "":
        # every time there is an empty string, add 1 to the list
        lst.append(1)
# count the list to see how many entries have an empty string
len(lst)
Please help me with this. I would prefer a for loop, since there are about 22 columns and 9000+ rows in my actual dataset.
Note - I do understand that there are other questions asked like this; it's just that none of them apply to my situation, meaning that most of them are only useful for a few columns and I do not wish to hardcode all 22 columns.
Edit - Thank you for all your feedback, you all have been incredibly helpful.
To delete a row based on a condition use the following:
df = df.drop(df[condition].index)
For example, df = df.drop(df[df['Age'] == 5].index) will drop the rows where Age is 5.
I've come across a post about the same topic dating back to 2017; it should help you understand this more clearly.
Regarding question 2, here's how to remove rows with the specified values in a given column:
df = df[~df["Address"].isin(("Undetermined", "Unspecified"))]
Let's assume we have a Pandas DataFrame object df.
To remove every row given your conditions, simply do:
df = df[~((df.Gender == "") | (df.Age == "") | (df.Address.isin(["", "Undetermined", "Unspecified"])))]
If the unspecified fields are NaN, you can also do:
df = df.dropna(how="any", axis = 0)
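If the unwanted markers are plain strings rather than NaN, one way (a sketch, assuming "", "Undetermined" and "Unspecified" are the only markers) is to convert them to NaN first and then drop:

import numpy as np

# assumption: these strings are the only unwanted markers in the data
df = df.replace(["", "Undetermined", "Unspecified"], np.nan)
df = df.dropna(how="any", axis=0)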
The answers from @ThatCSFresher and @Bence will help you remove rows based on a single column, which is great!
However, I think your query has multiple conditions that need to be checked across multiple columns at once. apply with a lambda can do that job; try the following code:
import pandas as pd

df = pd.DataFrame({"Name": ["Jack", "Anna", "Carl", "Jake"],
                   "Gender": ["", "F", "M", "M"],
                   "Age": [5, 6, 7, 7],
                   "Address": ["address", "address", "Undetermined", "Unspecified"]})

# tag rows that contain any of the unwanted values
df["Noise_Tag"] = df.apply(lambda x: "Noise" if ("" in list(x)) or ("Undetermined" in list(x)) or ("Unspecified" in list(x)) else "No Noise", axis=1)
df1 = df[df["Noise_Tag"] == "No Noise"]
del df1["Noise_Tag"]
# Output of df;
Name Gender Age Address Noise_Tag
0 Jack 5 address Noise
1 Anna F 6 address No Noise
2 Carl M 7 Undetermined Noise
3 Jake M 7 Unspecified Noise
# Output of df1;
Name Gender Age Address
1 Anna F 6 address
Well, OP actually wants to delete any row with an "empty" string.
df = df[~(df=="").any(axis=1)] # deletes all rows that have empty string in any column.
If you want to delete specifically for address column, then you can just delete using
df = df[~df["Address"].isin(("Undetermined", "Unspecified"))]
Or, if any column may contain Undetermined or Unspecified, try something similar to the first solution in my post, just replacing the empty string with Undetermined or Unspecified.
df = df[~((df=="Undetermined") | (df=="Unspecified")).any(axis=1)]
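Both checks can also be combined into a single mask (a small sketch under the same assumptions):

df = df[~((df == "") | (df == "Undetermined") | (df == "Unspecified")).any(axis=1)]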
You can build masks and then filter the df according to it:
m1 = df.eq('').any(axis=1)
# m1 is True if any cell in a row has an empty string
m2 = df['Address'].isin(['Undetermined', 'Unspecified'])
# m2 is True if a row has one of the values in the list in column 'Address'
out = df[~m1 & ~m2]  # invert both conditions to get the desired output
print(out)
Output:
Name Gender Age Address
1 Anna F 6 *address*
Used Input:
df = pd.DataFrame({'Name': ['Jack', 'Anna', 'Carl', 'Jake'],
'Gender': ['', 'F', 'M', 'M'],
'Age': [5, 6, 7, 7],
'Address': ['*address*', '*address*', 'Undetermined', 'Unspecified']}
)
Using a lambda function:
Code:
df[df.apply(lambda x: False if (x.Address in ['Undetermined', 'Unspecified'] or '' in list(x)) else True, axis=1)]
Output:
Name Gender Age Address
1 Anna F 6 *address*
Hey guys, this is my first post. I am planning on building an anime recommendation engine using Python. I ran into a problem with a list called genre_list, which stores the genres that I want to filter from the huge data spreadsheet I was given. I am using the Pandas library, and it has an isin() function to check whether the values of a list are included in the datasheet, which is supposed to filter it. I am using the function, but it is not able to detect "Action" in the datasheet although it is there. I have a feeling there's something wrong with the data types and I probably have to work around it somehow, but I'm not sure how.
I downloaded my csv file from this link for anyone interested!
https://www.kaggle.com/datasets/marlesson/myanimelist-dataset-animes-profiles-reviews?resource=download
import pandas as pd

df = pd.read_csv('animes.csv')
genre = True
genre_list = []
while genre:
    genre_input = input("What genres would you like to watch?, input \"done\" when done listing!\n")
    if genre_input == "done":
        genre = False
    else:
        genre_list.append(genre_input)
print(genre_list)
df_genre = df[df["genre"].isin(genre_list)]
# df_genre = df["genre"]
print(df_genre)
Output (screenshot): https://i.stack.imgur.com/XZzcc.png
You want to check if ANY value in your user input list is in each of the list values in the "genre" column. The isin function checks whether each cell value matches one of your inputs in its entirety, which is not what you want here. Change that line to this:
df_genre = df[df['genre'].apply(lambda x: any([i in x for i in genre_list]))]
Let me know if you need any more help.
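A quick check of that line on a toy frame (a sketch; in the real file the genre column holds strings that look like lists):

import pandas as pd

df = pd.DataFrame({"genre": ["['Action', 'Comedy']", "['Drama']"]})
genre_list = ["Action"]
# substring test: True when any requested genre appears in the cell's text
mask = df['genre'].apply(lambda x: any([i in x for i in genre_list]))
print(df[mask])  # keeps only the first row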
import pandas as pd

df = pd.read_csv('animes.csv')
genre = True
genre_list = []
while genre:
    genre_input = input("What genres would you like to watch?, input \"done\" when done listing!\n")
    if genre_input == "done":
        genre = False
    else:
        genre_list.append(genre_input)

# List of all cells in the genre column
col_list = df["genre"].values.tolist()
temp_list = []
# Each val in the list is compared with genre_list to see if there is a match
for index, val in enumerate(col_list):
    if all(x in val for x in genre_list):
        # If there is a match, the UID of that row is added to temp_list
        temp_list.append(df['uid'].iloc[index])
print(temp_list)
# This checks if the UID is contained in the temp_list of UIDs that have these genres
df_genre = df["uid"].isin(temp_list)
new_df = df.loc[df_genre, "title"]
# Prints all anime with the specified genres
print(new_df)
This is another approach I took and works as well. Thanks for all the help :D
To make a selection from a dataframe, you can write this:
df_genre = df.loc[df['genre'].isin(genre_list)]
I've downloaded the file animes.csv from Kaggle and read it into a dataframe. What I found is that the column genre actually contains strings (of lists), not lists. So one way to get what you want would be:
...
m = df["genre"].str.contains(r"'(?:" + "|".join(genre_list) + r")'")
df_genre = df[m]
Result for genre_list = ["Action"]:
uid ... link
3 5114 ... https://myanimelist.net/anime/5114/Fullmetal_A...
4 31758 ... https://myanimelist.net/anime/31758/Kizumonoga...
5 37510 ... https://myanimelist.net/anime/37510/Mob_Psycho...
7 38000 ... https://myanimelist.net/anime/38000/Kimetsu_no...
9 2904 ... https://myanimelist.net/anime/2904/Code_Geass_...
... ... ... ...
19301 10350 ... https://myanimelist.net/anime/10350/Hakuouki_S...
19303 1293 ... https://myanimelist.net/anime/1293/Urusei_Yatsura
19304 150 ... https://myanimelist.net/anime/150/Blood_
19305 4177 ... https://myanimelist.net/anime/4177/Bounen_no_X...
19309 450 ... https://myanimelist.net/anime/450/InuYasha_Mov...
[4215 rows x 12 columns]
If you want to transform the values of the genre column for some reason into lists, then you could do either
df["genre"] = df["genre"].str[1:-1].str.replace("'", "").str.split(r",\s*")
or
df["genre"] = df["genre"].map(eval)
Afterwards
df_genre = df[~df["genre"].map(set(genre_list).isdisjoint)]
would give you the filtered dataframe.
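That last line is terse; an equivalent, spelled-out version (a sketch, assuming genre now holds real lists after the transformation above) is:

wanted = set(genre_list)
# keep a row when its genre list shares at least one entry with genre_list
df_genre = df[df["genre"].map(lambda genres: bool(wanted & set(genres)))]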
I am trying to iterate over two pandas dataframes (A & B) using nested for loops. An if statement is inserted after the second for loop. The goal is to match an unique_id column from dataframes A and B and then append another column value to an empty list.
Instead of receiving 1 name per unique id, I receive 6. It seems like the loop does not iterate once there is a match.
Assistance is greatly appreciated!
empty_list = []
for i, r in dfA.iterrows():
    for j, ro in dfB.iterrows():
        if (r['unique_id'] == ro['unique_id]):
            empty_list.append(ro['name'])
            print(r['unique_id'], ro['unique_id], ro['name'])
        else:
            pass
unique_id Name
1. John
1. John
1. John
1. John
1. John
Desired Output:
1. John
2. Bob
3. Ryan
You should add some sample data so others can help you faster. Here is something to start with.
Your code works fine (except for two typos: missing apostrophes).
Also, there are better ways to "join" the two dataframes; see the merge sketch after the output below.
One reason you might be seeing 6 names could be duplicates in the unique_id column of your original data.
import io
import pandas as pd

raw1 = '''unique_id,name
1,A
2,B
3,C
'''
raw2 = '''unique_id,name
3,C
4,D
5,E
'''
dfA = pd.read_csv(io.StringIO(raw1))
dfB = pd.read_csv(io.StringIO(raw2))

empty_list = []
for i, r in dfA.iterrows():
    for j, ro in dfB.iterrows():
        if r['unique_id'] == ro['unique_id']:
            empty_list.append(ro['name'])
            print(r['unique_id'], ro['unique_id'], ro['name'])
        else:
            pass
Output:
3 3 C
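As mentioned above, a merge replaces the nested loops entirely; with the same dfA and dfB (a minimal sketch):

# inner join on unique_id; suffixes disambiguate the two name columns
merged = dfA.merge(dfB, on="unique_id", suffixes=("_A", "_B"))
print(merged["name_B"].tolist())  # ['C'] for the sample data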
I am trying to find a string in an Excel spreadsheet, but my code only captures the first row and neglects to search the rest.
In my code I am using Tkinter to get input from a user, and a link_url() function to match it against each cell of a column in the Excel sheet and, if it matches, capture the value of another column in the same row.
Here is a sample of the Excel sheet index:
Name Link
0 ABC www.linkname1.com
1 DEF www.linkname2.com
2 GHI www.linkname3.com
3 JKL www.linkname4.com
4 MNO www.linkname5.com
5 PQR www.linkname6.com
6 STU www.linkname7.com
7 VWX www.linkname8.com
8 YZZ www.linkname9.com
9 123 www.linkname10.com
I created the following function to search for the input:
def link_url():
    df = pd.read_excel('Links.xlsx')
    for i in df.index:
        # print(df['Name'])
        # print(e.get())
        if e.get() in df['Name'][i]:
            print(df['Name'][i])
            link_url = df['Link'][i]
            known.append(e.get())
            return link_url
        else:
            unknown.append(e.get())
            unknown_request = "I will search and return back to you"
            return unknown_request
My Question
When I search for ABC it returns www.linkname1.com as requested, but when I search for DEF it returns "I will search and return back to you". Why is that happening, and how can I fix it?
I may be misunderstanding the question a bit (Henry Ecker is right about the direct issue you are running into), but the solution with Pandas feels a bit weird to me.
I guess I'd personally do something more like this to filter a data frame to a specific row. I generally avoid for looping through data frames as much as I can.
import pandas as pd

my_data = pd.DataFrame(
    {'Name': ['ABC', 'DEF', 'GHI'],
     'Link': ['www.linkname1.com', 'www.linkname2.com', 'www.linkname3.com']}
)
keep = my_data.Name.eq('DEF')
result = my_data[keep]
if len(result) > 0:
    print(result.Link.values)
else:
    print("I will search and return back to you")
trv_last = []
for i in range(0, len(trv)):
    if trv_name_split.iloc[i, 3] != None:
        trv_last = trv_name_split.iloc[i, 3]
    elif trv_name_split.iloc[i, 2] != None:
        trv_last = trv_name_split.iloc[i, 2]
    else:
        trv_last = trv_name_split.iloc[i, 1]
trv_last
This returns 'Campo', which is from the last row in my range:

     0     1      2       3
1    John  Doe    None    None
2    Jane  K.     Martin  None
...  ...   ...    ...     ...
972  Jino  Campo  None    None
As you can see, all names were originally together in one column and I used str.split() to split them up. Since some names had the form first-middle-middle-last, I am left with 4 columns. I am only interested in the last name.
My goal is to create a new DF with only the last name. The logic is: if the 4th column is not None, that is the last name; otherwise move backwards toward the 2nd column, which holds the last name if all the others are None.
Thank you for having a look and I appreciate the help!
Looping through pandas dataframes isn't a great idea. That's why they made apply. Best practice is to use apply and assign.
def build_last_name(row):
    # columns 3, 2, 1 hold the name parts; None is falsy,
    # so the first truthy value walking backwards wins
    if row[3]:
        return row[3]
    if row[2]:
        return row[2]
    return row[1]

last_names = trv_name_split.apply(build_last_name, axis=1)
trv_name_split = trv_name_split.assign(last_name=last_names)
Familiarizing yourself with apply is going to save a lot of headaches. Here's the docs.
Figured out the answer to my own question:
trv_last = []
for i in range(0, len(trv)):
    if trv_name_split.iloc[i, 3] != None:
        trv_last.append(trv_name_split.iloc[i, 3])
    elif trv_name_split.iloc[i, 2] != None:
        trv_last.append(trv_name_split.iloc[i, 2])
    else:
        trv_last.append(trv_name_split.iloc[i, 1])
trv_last
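A loop-free alternative (a sketch, assuming the missing entries are None/NaN as shown above) forward-fills across the name columns and takes the right-most one:

# after a row-wise forward fill, the last of columns 1..3 holds
# the right-most non-missing name part, i.e. the last name
trv_last = trv_name_split.iloc[:, 1:4].ffill(axis=1).iloc[:, -1].tolist()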