I am trying to iterate over the rows and columns of a data frame and compare some of the column values to the values in another data frame that has identical column names.
Both data frames have about 30 columns, some of which are objects, some floats and some integers.
I would mostly like to compare all of the integer columns to another data frame that holds 1 row extracted from the first one. The goal is to compute the similarity of each row in the dataframe 'CB' to the one row in 'ip' and then write that value into the sim column of my dataframe.
(If it's possible to compare all relevant columns in one go, that would be great too.)
Image of dataframes
In the end I would like to be able to change the sim column value based on the final if statement for each row. Ideally this would be reusable, like a function, as I would like to compare against multiple "ip's".
Below is one of the variations I tried:
for i in range(len(CB)):
    current = CB.iloc[i, j]
    for j in current:
        ipValue = ip.iloc[0, j]
        if current == ipValue: top += 1
        continue
        if (current == 1) or (ipValue == 1): bottom += 1
        break
    if(bottom > 0 ): CB.iloc[i, 30] = top / bottom
If anyone could help me with this it would be wonderful, thank you :)
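One possible approach, as a rough sketch rather than a definitive answer: do the comparison on whole arrays instead of looping cell by cell. It assumes my reading of the loop above is what was intended, i.e. top counts the integer columns where the CB row and the ip row have the same value, and bottom counts the columns where either value is 1. The function name add_similarity, the sim_col parameter and the sample data are made up for illustration.

```python
import numpy as np
import pandas as pd


def add_similarity(CB, ip, sim_col="sim"):
    """Return a copy of CB with sim_col filled in by comparing each row to ip's first row.

    Assumption: only integer columns present in both frames are compared.
    """
    int_cols = [
        c for c in CB.columns
        if c != sim_col and c in ip.columns and pd.api.types.is_integer_dtype(CB[c])
    ]
    cb_vals = CB[int_cols].to_numpy()          # shape (n_rows, n_cols)
    ip_vals = ip[int_cols].iloc[0].to_numpy()  # shape (n_cols,), broadcast over rows

    top = (cb_vals == ip_vals).sum(axis=1)                     # matching columns
    bottom = ((cb_vals == 1) | (ip_vals == 1)).sum(axis=1)     # either value is 1

    out = CB.copy()
    # Leave NaN where bottom is 0, mirroring the `if bottom > 0` guard.
    out[sim_col] = np.where(bottom > 0, top / np.maximum(bottom, 1), np.nan)
    return out


# Hypothetical sample data
CB = pd.DataFrame({
    "title": ["x", "y", "z"],
    "a": [1, 1, 0],
    "b": [0, 1, 0],
    "c": [1, 0, 0],
    "sim": [0.0, 0.0, 0.0],
})
ip = CB.iloc[[0]]  # one-row data frame extracted from CB
result = add_similarity(CB, ip)
```

Because it is a function, it can be called once per "ip". If the similarity rule should be something else, only the two lines computing top and bottom need to change.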
I am iterating through a dataframe using the df.iterrows() function but am not sure how to specify the row number to start iterating from. I am using a row counter in the initial for loop (below) to search for X within the rows. Once it has found X, I need to loop through the rest of the dataframe to find Y, but without looping again through the initial rows that were already visited when searching for X.
I have tried to achieve this by deleting all rows up to X, but this does not work: it removes entries I need later, after the initial X and Y have been found, when I need to find the next X and Y.
row_count = 0
for index, row in new_df.iterrows():
    if X in row[2]:
        row_count += 1
        # take information required from row
        for visit_index, visit_row in new_df.iterrows():
            if Y in visit_row[2]:
                # take information required from row
                # append information to new dataframe
                break
    else:
        new_df.drop(index, inplace = True)
        row_count += 1
What I want to do instead is use row_count so that, when I find X, I can iterate through the dataframe again from the row where X was found onwards. How can I do this?
I believe you can do this in a much simpler way.
Using the .loc indexer of pandas you could do something like this:
subset = df.loc[df["YOUR_COLUMN_NAME"].str.contains(X)]
This returns the subset of rows in your dataframe that contain X in the column "YOUR_COLUMN_NAME". You haven't specified the name of the column behind row[2], but use that instead of "YOUR_COLUMN_NAME".
As an example, my code:
import pandas as pd
df = pd.DataFrame([[1, "Test1.1"], [2, "Test2.1"]], columns=["ID", "STR"])
x = df.loc[df["STR"].str.contains("Test1")]
print(x)
Outputs this:
   ID      STR
0   1  Test1.1
From here you could take whatever information you needed from the row.
To iterate through only certain rows, take a slice of the DataFrame that contains those rows, and iterate over it.
Separately: keep in mind that a nested inner for loop will run all over again, each time through the outer loop. If the goal is to find a "starting point" and do the rest of the iteration from there, then that should be two separate loops: one to find the start point, and one to proceed from there - once.
Thus:
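for start_index, row in new_df.iterrows():
    if X in row[2]:
        break  # `start_index` is the index label of the starting row
# .loc slices by label, so this also works when the index is not 0, 1, 2, ...
for index, row in new_df.loc[start_index:].iterrows():
    ...  # process the row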
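If the goal is to find every X and the Y that follows it, the same idea can be extended with a single position counter that only ever moves forward. This is a minimal sketch under a few assumptions: the search for Y starts on the row after X, the column holds strings, and the names find_pairs, col, x and y are invented for the example.

```python
import pandas as pd


def find_pairs(df, col, x, y):
    """Return (x_position, y_position) pairs, scanning the frame only once."""
    values = df[col].astype(str).tolist()
    n = len(values)
    pairs = []
    pos = 0
    while pos < n:
        # Advance to the next row containing x.
        while pos < n and x not in values[pos]:
            pos += 1
        if pos >= n:
            break
        x_pos = pos
        pos += 1
        # Continue from there to the next row containing y.
        while pos < n and y not in values[pos]:
            pos += 1
        if pos >= n:
            break
        pairs.append((x_pos, pos))
        pos += 1  # the next search for x resumes after this y
    return pairs


# Hypothetical sample data
new_df = pd.DataFrame({"text": ["a X", "b", "c Y", "d", "X e", "Y f"]})
pairs = find_pairs(new_df, "text", "X", "Y")
```

The positions can then be passed to new_df.iloc[...] to pull the information needed from each row, and no rows have to be dropped along the way.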
Image of movie genres with 1-5 scale
I have a dataset with different movie genres as column names and a value (1 to 5) in each of those columns.
What I want is to return only the rows whose values are between 3 and 5, and discard the others.
So far I have used this code:
req_horr = req_data[req_data['Horror'] >= 3]
where req_data is the dataframe in the image.
With the above code I can only return rows with the desired values in one column (in this case the column 'Horror').
But I need a dataframe with the desired values in every column. What is the code for that?
genre_list = ["Horror", "Thriller"]
req_data.loc[(req_data[genre_list] >= 3).all(axis=1)]
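For illustration, here is the same idea applied to every column at once. The sample ratings are invented, and the sketch assumes every column in req_data is a genre rating; if there are other columns (titles, IDs), list the genre columns explicitly as above.

```python
import pandas as pd

# Hypothetical sample data: one column per genre, values 1 to 5
req_data = pd.DataFrame({
    "Horror": [5, 2, 4],
    "Thriller": [3, 4, 1],
    "Comedy": [4, 5, 3],
})

genre_cols = list(req_data.columns)  # assumption: all columns are genres

# True for a row only if every genre value in that row is at least 3
mask = (req_data[genre_cols] >= 3).all(axis=1)
result = req_data.loc[mask]
```

The comparison produces a boolean frame of the same shape, and .all(axis=1) collapses it to one True/False per row.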
I know how to delete rows and columns from a dataframe using .drop() method, by passing axis and labels.
Here's the Dataframe:
Now, if I want to remove all rows whose STNAME falls in the range from Arizona all the way to Colorado, how should I do it?
I know I could pass row labels 2 to 7 to the .drop() method, but if I have a lot of data and don't know the starting and ending indexes, that won't be possible.
Might be kinda hacky, but here is an option:
import numpy as np
index1 = df.index[df['STNAME'] == 'Arizona'][0]    # first Arizona row
index2 = df.index[df['STNAME'] == 'Colorado'][-1]  # last Colorado row
df = df.drop(np.arange(index1, index2 + 1))
This takes the index of the first Arizona row and the index of the last Colorado row, and deletes every row between them, inclusive. It assumes the default integer index (0, 1, 2, ...) with no gaps, since drop raises an error for labels that are not present.
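If the index is not a clean 0, 1, 2, ... range, a position-based variant avoids relying on the labels. This is only a sketch with made-up sample data; it assumes the rows are already ordered so that everything to remove sits between the first Arizona row and the last Colorado row.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a non-default index
df = pd.DataFrame(
    {"STNAME": ["Alabama", "Alaska", "Arizona", "Arizona", "Arkansas",
                "California", "Colorado", "Colorado", "Connecticut"]},
    index=list("abcdefghi"),
)

# Positions (not labels) of the first Arizona row and the last Colorado row
first = np.flatnonzero((df["STNAME"] == "Arizona").to_numpy())[0]
last = np.flatnonzero((df["STNAME"] == "Colorado").to_numpy())[-1]

# Keep everything outside the [first, last] block
keep = np.ones(len(df), dtype=bool)
keep[first:last + 1] = False
result = df[keep]
```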
I have a pandas data frame in which only two columns are important: 'Name' and 'Cost'.
I have different categories for my costs, and for each one I have a list of keywords. Based on these keywords I find the related rows in the dataframe:
a = df[df['Name'].str.contains('|'.join(keywords),case=False)]
and then I calculate the sum of Cost values in those rows to get that category cost:
sum_ = 0
for index, row in a.iterrows():
    cost = float(row['Cost'])
    sum_ += cost
The problem with this approach is that I never know whether a certain row has been counted multiple times, or whether a row was missed and never allocated to any category.
My question is, first, how to get the indexes of the filtered/chosen rows when using str.contains, and then how to check whether the rows have already been used in another category.
Thank you so much.
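One way this could be approached, as a sketch only: keep the boolean mask that str.contains returns for each category instead of just the filtered rows. The masks share the index of df, so they give both the matched indexes and a way to count how many categories each row fell into. The categories dict and the sample rows below are invented for illustration.

```python
import re
import pandas as pd

# Hypothetical sample data; Cost is stored as text, as the float() call suggests
df = pd.DataFrame({
    "Name": ["Coffee shop", "Train ticket", "Coffee on train", "Rent"],
    "Cost": ["3.5", "20", "4", "500"],
})
categories = {  # assumption: category name -> list of keywords
    "food": ["coffee", "lunch"],
    "transport": ["train", "bus"],
}

# One boolean column per category, aligned with df's index
masks = pd.DataFrame({
    cat: df["Name"].str.contains("|".join(map(re.escape, kws)), case=False, na=False)
    for cat, kws in categories.items()
})

# Category totals, without an explicit loop over rows
totals = {cat: df.loc[masks[cat], "Cost"].astype(float).sum() for cat in masks}

hits = masks.sum(axis=1)                     # number of categories per row
multiple = df.index[hits > 1].tolist()       # rows counted more than once
unassigned = df.index[hits == 0].tolist()    # rows not in any category
matched_food = df.index[masks["food"]].tolist()  # indexes chosen for one category
```

re.escape is there so that keywords containing regex characters (a dot, a plus sign) are matched literally.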
I need to count the specific number of times a number is seen within a specific cell.
DataFrame ScreenShot
The values are between 1 to 7.
In the column Entity_Types, the first cell contains 7,7,6,7,6,7,1,7,7,7,2. I think I need to create 7 additional empty columns, count how often each number occurs in the cell, and store the counts in new columns labeled Entity_Types_1, Entity_Types_2, etc.
Example: the new column for 7 would hold the count of 7s, while the new column for 1 would hold the count of all 1s in that respective cell. My table has 30,000 rows, so I was wondering how to run this in a loop to fill out the rest of the dataset.
I can easily do it in excel using this formula
=SUMPRODUCT(LEN(O2)-LEN(SUBSTITUTE(O2,"2","")))
Where O2 is Entity_Types and "2" = the number we are looking to find.
End Example
It looks like Entity_Types is a column of strings in your data frame df. If that is the case, you can use:
for i in range(1, 8):  # the values run from 1 to 7
    df['Entity_Types_{}'.format(i)] = df.Entity_Types.str.count(str(i))
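Note that str.count counts substring matches, which is fine for single digits. If a cell could ever contain multi-digit values, a variant that splits on the comma first counts whole tokens instead. A small sketch, using the first cell from the question plus one invented row, and assuming the values are separated by commas:

```python
import pandas as pd

# First row is the cell quoted in the question; the second row is made up
df = pd.DataFrame({"Entity_Types": ["7,7,6,7,6,7,1,7,7,7,2", "1,1,3"]})

# Split each cell into a list of tokens, stripping stray spaces
tokens = df["Entity_Types"].str.split(",").apply(lambda t: [s.strip() for s in t])

for i in range(1, 8):
    # i=i binds the current loop value inside the lambda
    df[f"Entity_Types_{i}"] = tokens.apply(lambda t, i=i: t.count(str(i)))
```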