How to modify a Pandas dataframe while iterating over it - python

So I have a dataframe that I am iterating over, and about halfway through the df I want to modify a column name but continue my iteration. I have code like this:
for index, row in df.iterrows():
    # do something with row
    if certain condition is met:
        df.rename(columns={'old_name': 'new_name'}, inplace=True)
After I do the rename, the column name is changed in the 'df' variable for subsequent iterations, but the value of 'row' still contains the old column name. How can I fix this? I know I have encountered similar situations in pandas before. Maybe the iterator doesn't get updated even though the dataframe itself is modified?

Changing the source of something you're iterating over is not a good practice.
You could set a flag if the condition is met, and then after the iteration, make any necessary changes to the dataframe.
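For illustration, a minimal sketch of that flag approach (the condition and column names here are placeholders):
rename_needed = False
for index, row in df.iterrows():
    # do something with row
    if row['some_col'] > some_threshold:  # placeholder for your actual condition
        rename_needed = True
# only touch the dataframe once the iteration is finished
if rename_needed:
    df.rename(columns={'old_name': 'new_name'}, inplace=True)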
Edited to add: I have a large dataset that needs "line by line" parsing (an instruction given to me by a non-programmer). Here's what I did: I added a boolean condition to the dataframe, split the dataframe into two separate dataframes based on that condition, stored one for later integration, and moved on with the other dataframe. At the end I used pd.concat to put everything back together. Note that if you change a column name in only one part, pd.concat will create extra columns in the result.
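A rough sketch of that split-and-recombine approach (the boolean column name is made up):
mask = df['needs_special_handling']  # placeholder boolean column
df_later = df[mask].copy()           # stored for later integration
df_now = df[~mask].copy()            # continue working with this part
# ... process df_now row by row ...
# rename in both parts before concatenating, otherwise pd.concat
# treats 'old_name' and 'new_name' as two different columns
df_now = df_now.rename(columns={'old_name': 'new_name'})
df_later = df_later.rename(columns={'old_name': 'new_name'})
df = pd.concat([df_now, df_later], ignore_index=True)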

Related

Converting for loop to numpy calculation for pandas dataframes

So I have a python script that compares two dataframes and works to find any rows that are not in both dataframes. It currently iterates through a for loop which is slow.
I want to improve the speed of the process, and I know that iteration is the problem. However, I haven't been having much luck using various pandas and numpy methods such as merge and where.
Couple of caveats:
The column names from my file sources aren't the same, so I set their names into variables and use the variable names to compare.
I want to only use the column names from one of the dataframes.
df_new represents new information to be checked against what is currently on file (df_current)
My current code:
set_current = set(df_current[current_col_name])
df_out = pd.DataFrame(columns=df_new.columns)
for i in range(len(df_new.index)):
    # if the row entry is new, we add it to our dataset
    if not df_new[new_col_name][i] in set_current:
        df_out.loc[len(df_out)] = df_new.iloc[i]
    # if the row entry is a match, then we aren't going to do anything with it
    else:
        continue
# create a xlsx file with the new items
df_out.to_excel("data/new_products_to_examine.xlsx", index=False)
Here are some simple examples of dataframes I would be working with:
df_current
|partno|description|category|cost|price|upc|brand|color|size|year|
|:-----|:----------|:-------|:---|:----|:--|:----|:----|:---|:---|
|123|Logo T-Shirt||25|49.99||apple|red|large|2021|
|456|Knitted Shirt||35|69.99||apple|green|medium|2021|
df_new
|mfgr_num|desc|category|cost|msrp|upc|style|brand|color|size|year|
|:-------|:---|:-------|:---|:---|:--|:----|:----|:----|:---|:---|
|456|Knitted Shirt||35|69.99|||apple|green|medium|2021|
|789|Logo Vest||20|39.99|||apple|yellow|small|2022|
There are usually many more columns in the current sheet, but I wanted the table displayed to be somewhat readable. The key is that I would only want the columns in the "new" dataframe to be output.
I would want to match partno with mfgr_num since the spreadsheets will always have them, whereas some items don't have upc/gtin/ean.
It's still a bit unclear what you want without examples of each dataframe. But if you want to test unique IDs in differently named columns in two different dataframes, try an approach like this.
Find the IDs that exist in the second dataframe
test_ids = df2['cola_id'].unique().tolist()
then filter the first dataframe for those IDs.
df1[df1['keep_id'].isin(test_ids)]
Here is the answer that works; it was supplied to me by someone much smarter.
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]
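Applied to the sample dataframes above, a quick sketch of how that one-liner behaves (only the key columns are reproduced here):
import pandas as pd

df_current = pd.DataFrame({'partno': [123, 456],
                           'description': ['Logo T-Shirt', 'Knitted Shirt']})
df_new = pd.DataFrame({'mfgr_num': [456, 789],
                       'desc': ['Knitted Shirt', 'Logo Vest']})
current_col_name = 'partno'
new_col_name = 'mfgr_num'

# keep only the rows of df_new whose id is not already in df_current
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]
print(df_out)  # only the 789 / Logo Vest row remains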

Ask Pandas to delete all rows beneath a certain row

I have imported an Excel file as a dataframe using pandas.
I now need to delete all rows from row 41,504 (index 41,505) and below.
I have tried df.drop(df.index[41504]), although that only catches the one row. How do I tell Pandas to delete onwards from that row?
I did not want to delete by an index range as the dataset has tens of thousands of rows, and I would prefer not to scroll through the whole thing.
Thank you for your help.
Kind regards
df.drop(df.index[41504:])
Drop the remaining range. If you don't mind creating a new df, then use a filter, keeping rows [:41504].
You can reassign the range you do want back into the variable instead of removing the range you do not want.
You can just get the first rows that you need, ignoring all the rest:
result=df[:41504]
df = df.iloc[:41504]
just another way
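As a quick sanity check, both approaches keep the same rows (a toy example, assuming the default RangeIndex):
import pandas as pd

df = pd.DataFrame({'value': range(10)})
kept_by_drop = df.drop(df.index[5:])  # drop everything from position 5 onward
kept_by_slice = df.iloc[:5]           # or simply keep the first 5 rows
assert kept_by_drop.equals(kept_by_slice)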

Weird inconsistency between df.drop() and df.idxmin()

I am encountering a weird issue with pandas. After some careful debugging I have found the problem, but I would like a fix, and an explanation as to why this is happening.
I have a dataframe which consists of a list of cities with some distances. I have to iteratively find a city which is closest to some "Seed" city (details are not too important here).
To locate the "closest" city to my seed city, I use:
id_new_point = df["Time from seed"].idxmin(skipna=True)
Then, I want to remove the city I just found from the dataframe, for which I use:
df.drop(index=emte_df.index[id_new_point], inplace=True)
Now comes the strange part: I've connected the PyCharm debugger and closely observed my dataframe as I stepped through my loops.
When you delete a row in pandas with df.drop, it seems to delete the entire row by row number. This means, for example, that if I delete row #126, the index column of the dataframe doesn't adapt accordingly.
idxmin seems to return the actual index of the row, i.e. the index label associated with the row in the pandas dataframe.
Now here comes the issue: say I want to remove row 130 (EMTE WIJCHEN). I would use df.idxmin, which would give me the index, namely 130. Then I call df.drop, but since df.drop uses row numbers to delete instead of dataframe indices, and I just deleted a row (#126), it now (wrongly) removes row #131, which slid up into place 130.
Why is this happening, and how can I improve my code so I can drop the right rows in my dataframe? Any suggestions are greatly appreciated. (Also, is there a better practice for approaching these kinds of problems, maybe a solution that doesn't modify the dataframe in place?)
Pandas doesn't update the index column because it is not always a range of numbers; it can be, for example, names (John, Maria, etc.), and thus it wouldn't make sense to "update" the index after dropping a specific row. You can reset_index after each iteration like this:
df.drop(index=emte_df.index[id_new_point], inplace=True)
df = df.reset_index(drop=True)
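Put together, a minimal sketch of the per-iteration pattern (the stopping condition is hypothetical, the variable is assumed to be df rather than emte_df, and the dataframe is assumed to start with the default 0..n-1 index):
while len(df) > 1:  # hypothetical stopping condition
    id_new_point = df["Time from seed"].idxmin(skipna=True)
    # after reset_index the label returned by idxmin matches the row position again
    df = df.drop(index=df.index[id_new_point]).reset_index(drop=True)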

Pandas: Dealing with missing column in input dataframe

I have Python code which performs mathematical calculations on multiple columns of the dataframe. This input comes from various sources, so there is a possibility that sometimes one column is missing.
This column is missing because it's insignificant, but I need to have at least a null column for the code to run without errors.
I can add a null column using an if block, but there are around 120 columns and I do not want to slow down the code. Is there any other way for the code to check that each column is present in the original dataframe and, if any column is not present, add a null column before starting execution of the actual code?
If you know that the column name is the same for every dataframe, you could do something like this without having to loop over the column names:
if col_name not in df.columns:
    df[col_name] = ''  # or whatever value you want to set it to
If speed is a big concern, which I can't tell, you could always convert the columns to a set with set(df.columns) and reduce the search to O(1) time because it will be a hashed lookup. You can read more detail on the efficiency of the in operator at this link: How efficient is Python's 'in' or 'not in' operators?
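For the ~120-column case, a rough sketch (the list of expected column names is a placeholder):
import numpy as np

expected_cols = ['col_a', 'col_b', 'col_c']  # placeholder for your ~120 names
present = set(df.columns)                    # O(1) membership checks
for col in expected_cols:
    if col not in present:
        df[col] = np.nan                     # add the missing column filled with nulls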

How to add new values to a dataframe's columns based on a specific row without overwriting existing data

I have a batch of identifiers and pairs of values that behave in the following manner within an iteration.
For example,
print(indexIDs[i], (coordinate_x, coordinate_y))
Sample output looks like
I would like to add these data into a dataframe, where I can use indexIDs[i] as the row and append incoming pairs of values with the same identifier into the next consecutive columns.
I have attempted the following code, which didn't work.
spatio_location = pd.DataFrame()
spatio_location.loc[indexIDs[i], column_counter] = (coordinate_x, coordinate_y)
It was a good start for associating indexIDs[i] with a row; however, I could not take in incoming data without overwriting the previous contents of the dataframe. I am aware it has something to do with the second line, which uses the "=" sign and keeps overwriting the previous result over and over again. I am looking for an appropriate way to change my second line so it inserts new incoming data into the existing dataframe without overwriting it.
Appreciate your time and effort, thanks.
I'm a bit confused about the nature of coordinate_x (is it a list or what?), but anyway, maybe try to use append.
You could define an empty df with three columns:
df = pd.DataFrame([], columns=['a', 'b', 'c'])
then populate it with a loop over your lists:
for i in range(TOFILL):
    df = df.append({'a': indexIDs[i], 'b': coordinate_x[i], 'c': coordinate_y[i]}, ignore_index=True)
and finally set one column as the index:
df = df.set_index('a')
hope it helps
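Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on newer versions the same idea can be written by collecting dicts and building the frame once; a rough sketch with made-up data:
import pandas as pd

indexIDs = ['id1', 'id2', 'id3']  # made-up sample identifiers
coordinate_x = [10, 20, 30]
coordinate_y = [5, 15, 25]

# build one row per identifier, then construct the dataframe in a single call
rows = [{'a': indexIDs[i], 'b': coordinate_x[i], 'c': coordinate_y[i]}
        for i in range(len(indexIDs))]
df = pd.DataFrame(rows).set_index('a')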
