I am encountering a weird issue with pandas. After some careful debugging I have found the problem, but I would like a fix, and an explanation as to why this is happening.
I have a dataframe which consists of a list of cities with some distances. I have to iteratively find a city which is closest to some "Seed" city (details are not too important here).
To locate the "closest" city to my seed city, I use:
id_new_point = df["Time from seed"].idxmin(skipna=True)
Then, I want to remove the city I just found from the dataframe, for which I use:
df.drop(index=emte_df.index[id_new_point], inplace=True)
Now comes the strange part; I've connected the PyCharm debugger and closely observed my dataframe as I stepped through my loops.
When you delete a row in pandas with df.drop, it seems to delete the entire row by row number. This means, for example, that if I delete row #126, the "index column" of the dataframe doesn't adapt accordingly. (See screenshot below.)
idxmin seems to return the actual index of the row, i.e. the index label associated with the row in the pandas dataframe.
Now here comes the issue: say I want to remove row 130 (EMTE WIJCHEN). I would use idxmin, which gives me the index, namely 130. Then I call df.drop(index), but since df.drop uses row numbers to delete instead of dataframe indices, and I just deleted a row (#126), it now (wrongly) removes row #131, which slid up into place 130.
Why is this happening, and how can I improve my code so that I drop the right rows in my dataframe? Any suggestions are greatly appreciated. (Also, is there a better practice for approaching these kinds of problems, maybe a solution which doesn't modify the dataframe in place?)
Pandas doesn't update the index after a drop because the index is not always a range of numbers; it can be, for example, names (John, Maria, etc.), so it wouldn't make sense to "renumber" it after dropping a specific row. You can call reset_index after each iteration like this:
df.drop(index=emte_df.index[id_new_point], inplace=True)
df = df.reset_index(drop=True)
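Alternatively, since idxmin returns the index label rather than the row position, that label can be passed to drop directly; it is the .index[...] lookup that mixes positions and labels. A minimal sketch, keeping the question's variable names:

# idxmin returns the index *label* of the minimum row
id_new_point = df["Time from seed"].idxmin(skipna=True)

# drop by label; returning a new dataframe avoids modifying in place
df = df.drop(index=id_new_point)

With a label-based drop it no longer matters how many rows were removed earlier; the label still identifies the same row.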
So I have a dataframe that I am iterating over, and about halfway through the df I want to modify a column name but continue my iteration. I have code like this:
for index, row in df.iterrows():
    # ... do something with row ...
    if certain_condition_is_met:
        df.rename(columns={'old_name': 'new_name'}, inplace=True)
After I do the rename, the column name is changed in the 'df' variable for subsequent iterations, but the value of 'row' still contains the old column name. How can I fix this? I know I have encountered similar situations in pandas before. Maybe the iterator doesn't get updated even though the dataframe itself is modified?
Changing the source of something you're iterating over is not a good practice.
You could set a flag if the condition is met, and then after the iteration, make any necessary changes to the dataframe.
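A minimal sketch of that flag pattern (the condition and column names here are placeholders standing in for whatever the real logic is):

rename_needed = False

for index, row in df.iterrows():
    # ... do something with row ...
    if condition_is_met(row):   # placeholder for the actual condition
        rename_needed = True

# apply the rename once, after the iteration has finished
if rename_needed:
    df.rename(columns={'old_name': 'new_name'}, inplace=True)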
Edited to add: I have a large dataset that needs "line by line" parsing, but that instruction was given to me by a non-programmer. Here's what I did: I added a boolean condition column to the dataframe, split the dataframe into two separate dataframes based on that condition, stored one for later integration, and moved on with the other dataframe. At the end I used pd.concat to put everything back together. But beware: if you change a column name in one of the pieces, pd.concat will create extra columns at the end.
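A rough sketch of that split-and-recombine workflow, assuming a hypothetical boolean column named needs_later (the real flag column and the processing in between are whatever your parsing requires):

import pandas as pd

# split on the boolean flag (hypothetical column name)
mask = df['needs_later']
deferred = df[mask]        # stored for later integration
working = df[~mask]        # continue the line-by-line work on this one

# ... iterate over / process `working` here ...

# put everything back together; the column names of both pieces must still
# match, otherwise concat introduces extra (mostly NaN) columns
result = pd.concat([working, deferred]).sort_index()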
The problem
I'm working with a data set, which has been given to me as a csv file with lines of the form id,data. I would like to work with this data in a pandas dataframe, with the id as the index.
Unfortunately, somewhere along the data pipeline, my csv file has a number of rows where the ids are missing. Fortunately, the rows of my data are not fully independent, so I can recreate the missing values: each row is linked to its predecessor, and I have access to an oracle which, when given an id, can give me all its data. This includes the id of its predecessor.
My question is therefore whether there's a simple way of filling in these missing values in my dataframe.
My Solution
I don't have much experience working with pandas, but after playing around for a bit I came up with the following approach. I start by reading the csv file into a dataframe without setting the index, so I end up with a RangeIndex. I then:
1. Find the location of the rows with missing ids
2. Add 1 to the index to get the children of each row
3. Ask the oracle for the parents of each child
4. Merge the parents and children on the child id
5. Subtract one from the index again, and set the parent ids
In code:
# 1 & 2: rows with a missing id, shifted down one row to reach each row's "child"
children = df.loc[df[df['id'].isna()].index + 1, 'id']
# 3: ask the oracle about each child; it reports the child's parent (predecessor) id
parents = pd.Series({x.id: x.parent_id for x in ask_oracle(children)},
                    name='parent_id')
# 4: line the parents up against the children on the child id
combined = pd.merge(children, parents, left_on='id', right_index=True)
# 5: shift the index back up one row and fill in the missing ids
combined.set_index(combined.index - 1, inplace=True)
df.loc[combined.index, 'id'] = combined['parent_id']
This works, but I'm 95% sure it's going to look like scary black magic in a few months' time.
In particular, I'm unhappy with
The way I get the location of the NaN rows. Three lots of df[ in one line is just too many.
The manual fiddling about with the indices I have to do to get the rows to match up.
Does anyone have any suggestions for a better way of doing things?
The format of the input data is fixed, as are the properties of the oracle, but if there's a smarter way of organising my dataframe, I'm more than happy to hear it.
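For what it's worth, the same five steps can be written with the mask and the shifted index pulled out into named variables, which removes the nested df[ and keeps the index arithmetic in one visible place. A sketch of that rewrite (missing_idx and the other names are purely illustrative; the behaviour is meant to be identical):

# index labels of the rows whose id is missing
missing_idx = df.index[df['id'].isna()]

# the row after each missing one is its "child", whose id is present
children = df.loc[missing_idx + 1, 'id']

# the oracle reports, for each child id, the id of its predecessor
parents = pd.Series({x.id: x.parent_id for x in ask_oracle(children)},
                    name='parent_id')

# look up each child's parent id and write it back one row above
df.loc[missing_idx, 'id'] = children.map(parents).to_numpy()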
As a very very new beginner with Python & Pandas, I am looking for your support regarding an issue.
I need to iterate over the columns of a dataframe, find the maximum value across the relevant columns, and write it into a new column for each row. The number of columns is not manageable by hand (almost 200 columns), so I do not want to write out each required column id manually. Most importantly, I need to start from a given column id and continue in increments of two column ids up to a given last column id.
I would appreciate sample code; see the attachment too.
Try:
df['x']=df.max(axis=1)
Replace x with the name for your desired output column.
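If only every second column between a given first and last column should enter the maximum, the same idea works on a positional slice with a step of 2. A sketch, where 'first_col' and 'last_col' stand in for the actual start and end column ids:

# positions of the given start and end columns ('first_col'/'last_col' are placeholders)
start = df.columns.get_loc('first_col')
stop = df.columns.get_loc('last_col') + 1   # +1 so the last column is included

# every second column in that range, then the row-wise maximum
df['x'] = df.iloc[:, start:stop:2].max(axis=1)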
I have a dataframe customers with some "bad" rows; the key in this dataframe is CustomerID. I know I should drop these rows. I have a list called badcu that looks like [23770, 24572, 28773, ...]; each value corresponds to a different "bad" customer.
Then I have another dataframe, let's call it sales, and I want to drop all the records for the bad customers, the ones in the badcu list.
If I do the following
sales[sales.CustomerID.isin(badcu)]
I get a dataframe with precisely the records I want to drop, but if I do
sales.drop(sales.CustomerID.isin(badcu))
It returns a dataframe with the first row dropped (which is a legitimate order) and the rest of the rows intact (it doesn't delete the bad ones). I think I know why this happens, but I still don't know how to drop the incorrect customer id rows.
You need
new_df = sales[~sales.CustomerID.isin(badcu)]
You can also use query
sales.query('CustomerID not in @badcu')
I think the best way is to drop by index; try it and let me know:
sales.drop(sales[sales.CustomerID.isin(badcu)].index.tolist())
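To see the two working approaches side by side, here is a tiny made-up example (the data is invented purely for illustration):

import pandas as pd

sales = pd.DataFrame({'CustomerID': [23770, 11111, 24572, 22222],
                      'Amount': [10, 20, 30, 40]})
badcu = [23770, 24572, 28773]

# boolean mask marking rows that belong to bad customers
mask = sales.CustomerID.isin(badcu)

# keep only the good rows
good_sales = sales[~mask]

# equivalent: drop the bad rows by their index labels
also_good = sales.drop(sales[mask].index)

# note: sales.drop(mask) does not work as a filter -- drop expects
# index labels, not a boolean mask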