How to iterate through Pandas dataframe from specified row number - python

I am iterating through a dataframe using the df.iterrows() function, but I am not sure how to specify the row from which to start iterating. I am using a row counter in the initial for loop (below) to search for X within the rows; once it has found X, I need to loop through the rest of the dataframe to find Y, without looping again through the initial rows that were already covered while searching for X.
I have tried to achieve this by deleting all rows up to X, but this does not work, as it removes entries I need later, after the initial X and Y have been found and I need to find the next X and Y.
row_count = 0
for index, row in new_df.iterrows():
    if X in row[2]:
        row_count += 1
        # take information required from row
        for visit_index, visit_row in new_df.iterrows():
            if Y in visit_row[2]:
                # take information required from row
                # append information to new dataframe
                break
    else:
        new_df.drop(index, inplace=True)
        row_count += 1
What I want to do instead is use the row_count so that when I find X I can then iterate through the dataframe again from the row where X was present onwards. How can I do this?

You can do this in a much simpler way, I believe.
Using the .loc function of pandas you could do something like this:
subset = df.loc[df["YOUR_COLUMN_NAME"].str.contains(X)]
This returns the subset of rows in your dataframe that contain X in the column "YOUR_COLUMN_NAME". You haven't specified the name of row[2], but use that column's name instead of "YOUR_COLUMN_NAME".
As an example, my code:
import pandas as pd
df = pd.DataFrame([[1, "Test1.1"], [2, "Test2.1"]], columns=["ID", "STR"])
x = df.loc[df["STR"].str.contains("Test1")]
print(x)
Outputs this:
   ID      STR
0   1  Test1.1
From here you could take whatever information you needed from the row.

To iterate through only certain rows, take a slice of the DataFrame that contains those rows, and iterate over it.
Separately: keep in mind that a nested inner for loop will run all over again, each time through the outer loop. If the goal is to find a "starting point" and do the rest of the iteration from there, then that should be two separate loops: one to find the start point, and one to proceed from there - once.
Thus:
for start_index, row in new_df.iterrows():
    if X in row[2]:
        break  # `start_index` is the starting point

for index, row in new_df[start_index:].iterrows():
    ...  # process the row
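For a self-contained illustration of that two-loop pattern, here is a minimal sketch; the column name event and the sample values are made up, and .loc is used for the label-based slice from the found index onwards:
import pandas as pd

# Hypothetical data: find the row containing "X", then continue from there to find "Y".
new_df = pd.DataFrame({"event": ["setup A", "visit X", "note B", "visit Y", "note C"]})

# First loop: locate the starting row.
start_index = None
for index, row in new_df.iterrows():
    if "X" in row["event"]:
        start_index = index
        break

# Second loop: iterate from that row onwards, once.
if start_index is not None:
    for index, row in new_df.loc[start_index:].iterrows():
        if "Y" in row["event"]:
            print("found Y at row", index)
            break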

Related

How to get values from a dataframe column that contains multi-value items

I have a column like this:
[image of the column, where each cell holds multiple values]
How do I get the individual values from the column?
The desired output is a list, e.g. [42008598,26472654,42054590,42774221,42444463], so the values can be counted.
Let me give you some advice: when you have example code to show us, it would be great if you pasted it into code quotes like this. It is easier to read. Now, on to your question. You can select a row in a pandas dataframe like this:
import pandas as pd
print(df.iloc[i])
where i is the row number: 0, 1, 2,... and df is your dataframe. Here is the Documentation
I am also new to Stack Overflow. I hope this helps you.
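As a quick, self-contained illustration of that row selection (the data here is made up):
import pandas as pd

df = pd.DataFrame({"ID": [42008598, 26472654, 42054590]})

# Select the second row; positions start at 0.
print(df.iloc[1])
# ID    26472654
# Name: 1, dtype: int64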
What you need is to convert each row of the dataframe to an array and then do the operation you want with that array. The way to do it with Pandas is to declare a function that deals with each row, and then use apply to run the function on each row.
An example that counts how many elements each row has:
def treat_array(row):
    row = row.replace("{", "")
    row = row.replace("}", "")
    row = row.split(",")
    return len(row)

df["Elements Count"] = df["Name of Column with the Arrays"].apply(treat_array)

Iterating rows AND columns, in Pandas/Python

So I am trying to iterate over rows and columns within a dataframe, and would like to compare some of the column values to another dataframe's values in identically named columns.
Both dataframes have about 30 columns, where some are objects, some are floats and some are integers.
I would mostly like to compare all of the integer columns to another dataframe that holds one row extracted from the original, as I would like to compute the similarity of each row in the dataframe 'CB' to the single row in 'ip' and then put that value into the sim column in my dataframe.
(If it's possible to compare all relevant columns that way, that would be great too.)
[image of the dataframes]
In the end I would like to be able to change the sim column value based on the final if statement for each row. It would be best if this was reusable in future, like a function, as I would like to compare against multiple "ip"s.
Below is an example of one of the variations I tried doing it:
for i in range(len(CB)):
    current = CB.iloc[i, j]
    for j in current:
        ipValue = ip.iloc[0, j]
        if current == ipValue: top += 1
        continue
        if (current == 1) or (ipValue == 1): bottom += 1
        break
    if (bottom > 0): CB.iloc[i, 30] = top / bottom
If anyone could help me with this it would be wonderful, thank you :)
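(For reference, one way the comparison described above might be sketched. This guesses at the intended similarity metric from the attempt shown, and assumes both frames share their integer column names and that CB already has a sim column.)
import pandas as pd

def add_similarity(CB: pd.DataFrame, ip: pd.DataFrame) -> pd.DataFrame:
    # Integer columns that both frames share (an assumption about the data).
    int_cols = [c for c in CB.columns
                if c in ip.columns and pd.api.types.is_integer_dtype(CB[c])]
    ip_row = ip.iloc[0]
    for i in CB.index:
        # top: how many integer columns match the single ip row exactly.
        top = sum(CB.at[i, c] == ip_row[c] for c in int_cols)
        # bottom: how many columns hold a 1 in either frame (guessed from the attempt above).
        bottom = sum((CB.at[i, c] == 1) or (ip_row[c] == 1) for c in int_cols)
        if bottom > 0:
            CB.at[i, "sim"] = top / bottom
    return CB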

Looping through DataFrame via zip

I'm using this code to loop through a dataframe:
for r in zip(df['Name']):
    # statements
How do I identify a particular row in the dataframe? For example, I want to assign a new value to each row of the Name column while looping through. How do I do that?
I've tried this:
for r in zip(df['Name']):
    df['Name'] = time.time()
The problem is that every single row is getting the same value instead of different values.
The main problem is in the assignment:
df['Name']= time.time()
This says to grab the current time and assign it to every cell in the Name column. You reference the column vector, rather than a particular row. Note your iteration statement:
for r in zip(df['Name']):
Here, r is the row, but you never refer to it. That makes it highly unlikely that anything you do within the loop will affect an individual row.
Putting on my "teacher" hat ...
Look up examples of how to iterate through the rows of a Pandas data frame.
Within those, see how individual cells are referenced: that technique looks a lot like indexing a nested list.
Now, alter your code so that you put the current time in one cell at a time, one on each iteration. It will look something like
df.at[row, 'Name'] = time.time()
or
row['Name'] = time.time()
depending on how you define row in your iteration.
Does that get you to a solution?
The following also works:
import pandas as pd
import time

# example df
df = pd.DataFrame(data={'name': ['Bob', 'Dylan', 'Rachel', 'Mark'],
                        'age': [23, 27, 30, 35]})

# iterate through each row in the data frame
col_idx = df.columns.get_loc('name')  # this is so we can use iloc
for i in df.itertuples():
    df.iloc[i[0], col_idx] = time.time()
So, essentially we use the index of the dataframe as the indicator of the position of the row. The first index points to the first row in the dataframe, and so on.
EDIT: as pointed out in the comments, using .index to iterate over rows is not good practice. So let's use the number of rows of the dataframe itself. This can be obtained via df.shape, which returns a tuple (rows, columns), so we only need the rows: df.shape[0].
2nd EDIT: now using df.itertuples() for a performance gain and .iloc for integer-based indexing.
Additionally, the official pandas docs recommend using .loc for assignment to a pandas dataframe because of potential chained-indexing issues. More information here: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
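For reference, a minimal sketch of that .loc-based assignment (sample data made up):
import pandas as pd
import time

df = pd.DataFrame({'name': ['Bob', 'Dylan', 'Rachel', 'Mark']})

# Assign a timestamp one cell at a time, using label-based .loc indexing
# as the pandas docs recommend.
for idx in df.index:
    df.loc[idx, 'name'] = time.time()

print(df)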

How to get index number for a row meeting specific condition

I am curious how to grab the index number of a dataframe row that meets a specific condition. I've been playing with pandas.Index.get_loc, but no luck.
I've loaded a csv file, and it's structured in a way that has 1000+ rows with all column values filled in, but in the middle there is one completely empty row, and the data starts again. I wanted to get the index # of the row, so I can remove/delete all the subsequent rows that come after the empty row.
This is how I identified the empty row, df[df["ColumnA"] == None], but I had no luck getting the index number of that row. Please help!
What you most likely want is pd.DataFrame.dropna:
Return object with labels on given axis omitted where alternately any or all of the data are missing
If the row is empty, you can simply do this:
df = df.dropna(how='all')
If you want to find indices of null rows, you can use pd.DataFrame.isnull:
res = df[df.isnull().all(axis=1)].index
To remove the empty row and everything after it, keep only the rows whose index is less than that of the first empty row:
df = df[df.index < res[0]]
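Putting those two steps together on a small made-up frame:
import pandas as pd
import numpy as np

# Made-up data with one completely empty row in the middle.
df = pd.DataFrame({"ColumnA": ["a", "b", np.nan, "c"],
                   "ColumnB": [1, 2, np.nan, 4]})

# Index of the first all-null row...
res = df[df.isnull().all(axis=1)].index

# ...then keep only the rows that come before it.
df = df[df.index < res[0]]
print(df)  # rows 0 and 1 remain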

Returning unique values in .csv and unique strings in python+pandas

My question is very similar to this one: Find unique values in a Pandas dataframe, irrespective of row or column location
I am very new to coding, so I apologize in advance for the cringing.
I have a .csv file which I open as a pandas dataframe, and would like to be able to return unique values across the entire dataframe, as well as all unique strings.
I have tried:
for row in df:
    pd.unique(df.values.ravel())
This fails to iterate through rows.
The following code prints what I want:
for index, row in df.iterrows():
    if isinstance(row, object):
        print('%s\n%s' % (index, row))
However, trying to place these values into a previously defined set (myset = set()) fails when I hit a blank column (NoneType error):
for index, row in df.iterrows():
    if isinstance(row, object):
        myset.update(print('%s\n%s' % (index, row)))
I get closest to what I want when I try the following:
for index, row in df.iterrows():
    if isinstance(row, object):
        myset.update('%s\n%s' % (index, row))
However, my set prints out a list of characters rather than the strings/floats/values that appear on my screen when I print above.
Someone please help point out where I fail miserably at this task. Thanks!
I think the following should work for almost any dataframe. It will extract each value that is unique in the entire dataframe.
Post a comment if you encounter a problem; I'll try to solve it.
# Replace all Nones / NAs with empty strings - so they won't bother us later
df = df.fillna('')

# Preparing a list
list_sets = []

# Iterate over all columns (much faster than rows)
for col in df.columns:
    # List containing all the unique values of this column
    this_set = list(set(df[col].values))
    # Creating a combined list
    list_sets = list_sets + this_set

# Making a set of the combined list
final_set = list(set(list_sets))

# For completion's sake, you can remove the empty string introduced by the fillna step
final_set.remove('')
Edit:
I think I know what happens. You must have some float columns, and fillna is failing on those, as the code I gave you was replacing missing values with an empty string. Try one of these:
df = df.fillna(np.nan)
or
df = df.fillna(0)
For the first option, you'll need to import numpy first (import numpy as np). It must already be installed, as you have pandas.
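If looping over the columns feels heavy, a shorter variation on the same idea (flattening the frame after the fillna step) might be:
import pandas as pd
import numpy as np

# Made-up frame with a missing value and mixed types.
df = pd.DataFrame({"a": ["x", "y", np.nan], "b": [1.5, "y", 2]})

# Flatten every cell into one array and keep each distinct value once.
unique_values = pd.unique(df.fillna('').values.ravel())

# Drop the placeholder introduced by fillna, if it is present.
unique_values = [v for v in unique_values if v != '']
print(unique_values)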
