I am working in Python and have a large dataset in a pandas dataframe. I have taken a section of this data and put it into another dataframe, where I have created a new column and populated it. I now want to put this new column back into the original dataframe, overwriting one of the existing columns, but only for the section I have edited.
Please can you help advise how this is best done? The only unique identifier is the index that is automatically generated. The second dataframe has kept the same index values as the larger one, so it should be quite straightforward, but I cannot work out how to
a) reference the automatically created indexes
b) use these indexes to overwrite the existing data in the column from another dataframe
So, it should be something like this (I realise this is a mashup of syntax but just trying to better explain what I am trying to do!):
where df1.ROW.INDEX == df2.ROW.INDEX
insert into df1['col_name'].value from df2['col_name'].value
Any help would be greatly appreciated.
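For concreteness, a minimal sketch of the situation described above (the data here is made up; the frame and column names match the updates below):

import pandas as pd

main_df = pd.DataFrame({'pop': [10, 20, 30, 40]})  # the original, larger dataframe
edited_df = main_df.loc[1:2].copy()                # the edited section; index values 1 and 2 are kept
edited_df['new_col'] = edited_df['pop'] * 2        # the newly created, populated column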
UPDATE:
I now have this code which almost works:
index_values = edited_df.index.values
for i in index_values:
    main_df.iloc[i]['pop'] = edited_df.iloc[i]['new_col']
I get a SettingWithCopyWarning (the "caveats in the documentation" warning), and main_df is not changed. It looks like each iteration is writing to a copy rather than updating the main dataframe.
UPDATE: FIXED
I finally managed to work out the kinks, solution below for anyone that has a similar problem.
index_values = edited_df.index.values
for i in index_values:
    main_df.iloc[i, main_df.columns.get_loc('pop')] = edited_df.iloc[i]['new_col']
Consider using pandas.DataFrame.update for an in-place update from a passed-in dataframe. Be sure the column names match in both datasets.
main_df.update(edited_df, join='left', overwrite=True)
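Since the edited column in the question has a different name (new_col) than the target column (pop), one way to line the names up before calling update (a sketch):

# rename the edited column so update can align it with 'pop' in main_df
main_df.update(edited_df[['new_col']].rename(columns={'new_col': 'pop'}))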
I appreciate that you've found a solution that works. However, you're using a for loop when you don't need to. I'll start by improving your loop, then I'll back up @Partfait's update idea.
You use loc to reference by index and column labels. Your iloc-based fix relies on the coincidence that your index values are sequential integers starting at zero: iloc is positional, while loc looks things up by label.
index_values = edited_df.index.values
for i in index_values:
    main_df.loc[i, 'pop'] = edited_df.loc[i, 'new_col']
However, loc can take array-like indexers, and you're only using scalar indexers here. That means you're better off using at:
index_values = edited_df.index.values
for i in index_values:
    main_df.at[i, 'pop'] = edited_df.at[i, 'new_col']
Or you can go even faster with set_value (note: set_value and get_value were deprecated in pandas 0.21 and removed in 1.0; the at/iat accessors above are their replacements):
index_values = edited_df.index.values
for i in index_values:
    main_df.set_value(i, 'pop', edited_df.get_value(i, 'new_col'))
All that said, here is how you could use loc in one go. Index with edited_df.index so rows outside the edited section are left untouched (assigning with a plain : would set them to NaN through index alignment):
main_df.loc[edited_df.index, 'pop'] = edited_df['new_col']
Or, as @Partfait suggested:
main_df.update(edited_df['new_col'].rename('pop'))
I'm new to working with pandas and I'm trying to do a very simple thing with it. Using the flights.csv file, I'm defining a new column, underperforming: if the number of passengers is below the average, the value is 1. My problem is that there might be something wrong with the logic, since the values are not being updated. Here is an example:
import pandas as pd

df = pd.read_csv('flights.csv')
passengers_mean = df['passengers'].mean()
df['underperforming'] = 0
for idx, row in df.iterrows():
    if row['passengers'] < passengers_mean:
        row['underperforming'] = 1
print(df)
print(passengers_mean)
Any clue?
According to the iterrows docs:
You should never modify something you are iterating over. This is not guaranteed to work in all cases.
What you can do instead is:
df["underperforming"] = (df.passengers < x.passengers.mean()).astype('int')
Quoting the documentation:
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
Kindly use vectorized operations like the one above, or apply() where no vectorized form exists.
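For example, a version with apply(), reusing passengers_mean from the question (note that apply() runs a Python function per element, so the vectorized comparison above will usually be faster):

df['underperforming'] = df['passengers'].apply(lambda p: 1 if p < passengers_mean else 0)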
I am trying to assign values to rows where a condition is verified (True/False).
for i in range(0, 3):
    new_dataset = df[str(i)][df[str(i)]["Current Amount"] != "3m"]

for i in range(0, 3):
    df[str(i)]['Value'] = np.where(df[str(i)]['Amount'] == True, 100, 50)
where i spans from 0 to 2. Value is the new column that I would like to create; Amount is a column that already exists in the original dataframe. In the first part, I create new dataframes by filtering out the rows whose Current Amount equals 3 million ("3m").
However I got the following error:
/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
after removing the cwd from sys.path.
I have tried to follow the steps suggested in this post: How to deal with SettingWithCopyWarning in Pandas?, but it is still not clear to me how to fix the issue.
Could you please help me to fix the issue? I would really appreciate it.
Why not use a solution without [i], and without comparing to True, since the column is already boolean?
df['Value'] = np.where(df['Amount'], 100, 50)
EDIT: Here DataFrame.copy is necessary:
new_dataset = df[str(i)][df[str(i)]["Current Amount"] != "3m"].copy()
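Putting both pieces together, a sketch of a corrected version (assuming, as in the question, that df is a dict-like of dataframes keyed by '0' to '2' and that Amount is boolean; the container name new_datasets is hypothetical):

import numpy as np

new_datasets = {}
for i in range(0, 3):
    # .copy() gives an independent frame, so the later assignment is unambiguous
    nd = df[str(i)][df[str(i)]["Current Amount"] != "3m"].copy()
    nd['Value'] = np.where(nd['Amount'], 100, 50)
    new_datasets[str(i)] = nd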
I just discovered the assign method for pandas dataframes, and it looks nice and very similar to dplyr's mutate in R. However, I've always gotten by just initializing a new column 'on the fly'. Is there a reason why assign is better?
For instance (based on the example in the pandas documentation), to create a new column in a dataframe, I could just do this:
import numpy as np
from pandas import DataFrame

df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
df['ln_A'] = np.log(df['A'])
but the pandas.DataFrame.assign documentation recommends doing this:
df.assign(ln_A=lambda x: np.log(x.A))
# or
newcol = np.log(df['A'])
df.assign(ln_A=newcol)
Both methods return the same dataframe. In fact, the first method (my 'on the fly' assignment) is significantly faster (0.202 seconds for 1000 iterations) than the .assign method (0.353 seconds for 1000 iterations).
So is there a reason I should stop using my old method in favour of df.assign?
The difference concerns whether you wish to modify an existing frame, or create a new frame while maintaining the original frame as it was.
In particular, DataFrame.assign returns you a new object that has a copy of the original data with the requested changes ... the original frame remains unchanged.
In your particular case:
>>> df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
Now suppose you wish to create a new frame in which A is everywhere 1, without destroying df. Then you could use .assign:
>>> new_df = df.assign(A=1)
If you do not wish to maintain the original values, then clearly df["A"] = 1 will be more appropriate. This also explains the speed difference: by necessity, .assign must copy the data, while [...] does not.
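A quick check in the interpreter illustrates this, continuing the example above:

>>> new_df['A'].unique()
array([1])
>>> df['A'].tolist()  # the original frame is untouched
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]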
The premise of assign is that it returns:
A new DataFrame with the new columns in addition to all the existing columns.
Also, you cannot use it to change the original dataframe in place:
The callable must not change input DataFrame (though pandas doesn't check it).
On the other hand, df['ln_A'] = np.log(df['A']) modifies the frame in place.
So is there a reason I should stop using my old method in favour of df.assign?
I think you can try df.assign, but if you do memory-intensive stuff, it is better to stick with what you did before or use operations with inplace=True.
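For what it's worth, one practical reason to reach for assign despite the extra copy is method chaining, since each call returns a new frame; a sketch using the frame from the question:

# build the column and filter on it in a single expression
result = (df.assign(ln_A=lambda d: np.log(d['A']))
            .query('ln_A > 1'))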
Problem Overview:
I am attempting to clean stock data loaded from a CSV file into a pandas DataFrame. The indexing operation I perform works. If I call print, I can see the values I want are being pulled from the frame. However, when I try to replace the values, as shown in the screenshot, pandas ignores my request. Ultimately, I'm just trying to extract a value out of one column and move it to another. The pandas documentation suggests using the .replace() method, but that doesn't seem to work with the operation I'm trying to perform.
Here's a pic of the code and data before and after code is run.
And the for loop (as referenced in the pic):
for i, j in zip(all_exchanges['MarketCap'], all_exchanges['MarketCapSym']):
    if 'M' in i: j = j.replace('n/a', 'M')
    elif 'B' in i: j = j.replace('n/a', 'M')
The problem is that j is a string, thus immutable.
You're replacing data, but not in the original dataset.
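A tiny illustration of why the loop has no effect:

s = "n/a"
j = s
j = j.replace("n/a", "M")  # str.replace returns a new string; j is rebound to it
print(s)                   # still "n/a" -- the original value is untouched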
You have to do it another way, less elegantly, without zip (by the way, I simplified your test, since it did the same thing in both branches):
aem = all_exchanges['MarketCap']
aems = all_exchanges['MarketCapSym']
for i in range(min(len(aem), len(aems))):  # like zip: shortest of both
    if 'M' in aem[i] or 'B' in aem[i]:
        aems[i] = aems[i].replace('n/a', 'M')
Now you're replacing in the original dataset.
If both columns are in the same dataframe, all_exchanges, iterate over the rows.
for i, row in all_exchanges.iterrows():
    # get whatever you want from row
    # using the index you should be able to set a value
    all_exchanges.loc[i, 'columnname'] = xyz
That should be the syntax, if I remember correctly ;)
Here is a quite exhaustive tutorial on missing values in pandas. I suggest using fillna():
df['MarketCap'].fillna('M', inplace=True)
df['MarketCapSym'].fillna('M', inplace=True)
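One caveat, for what it's worth: fillna only sees real NaN values. If the 'n/a' entries came through as literal strings (as the screenshot suggests), convert them first; a sketch:

import numpy as np

# turn the literal string 'n/a' into real NaN, then fill
df['MarketCapSym'] = df['MarketCapSym'].replace('n/a', np.nan)
df['MarketCapSym'].fillna('M', inplace=True)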
Avoid iterating if you can. As already pointed out, you're not modifying the original data. Filter on the MarketCap column and perform the replace as follows.
mask = all_exchanges['MarketCap'].str.contains('M|B')

# overwrites any data in the MarketCapSym column
all_exchanges.loc[mask, 'MarketCapSym'] = 'M'

# only replaces 'n/a'
all_exchanges.loc[mask, 'MarketCapSym'] = all_exchanges.loc[mask, 'MarketCapSym'].replace('n/a', 'M')
Thanks to all who posted. After thinking about your solutions and the problem a bit longer, I realized there might be a different approach. Instead of initializing a MarketCapSym column with 'n/a', I instead created that column as a copy of MarketCap and then extracted anything that wasn't an "M" or "B".
I was able to get the solution down to one line:
import re

all_exchanges['MarketCapSymbol'] = [re.sub('[$.0-9]', '', i) for i in all_exchanges.loc[:, 'MarketCap']]
A breakdown of the solution is as follows:
all_exchanges['MarketCapSymbol'] = - Make a new column on the DataFrame called 'MarketCapSymbol'.
for i in all_exchanges.loc[:, 'MarketCap'] - Walk over the values of the existing 'MarketCap' column.
re.sub('[$.0-9]', '', i) - Since all I want is the 'M' or 'B', apply re.sub() to each element, stripping the characters in [$.0-9] and leaving only the M or B.
Using a list comprehension this way seemed a bit more natural / readable to me in my limited experience with pandas. Let me know what you think!
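For comparison, the same idea is possible with pandas' vectorized string methods, which avoids the Python-level loop (a sketch using the same regex):

all_exchanges['MarketCapSymbol'] = all_exchanges['MarketCap'].str.replace('[$.0-9]', '', regex=True)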
I couldn't find an answer to this in the existing SettingWithCopy warning questions, because the common .loc solution doesn't seem to apply. I'm loading a table into pandas then trying to create some mask columns based on values in the other columns. For some reason, this returns a SettingWithCopy warning even when I'm wrapping the test in a pd.Series constructor.
Here's the relevant code. The output at the end seems to be right, but does anyone know what would be causing this?
all_invs = pd.read_table('all_quads.inv.bed', index_col=False,
                         header=None, names=clustered_names)
invs = all_invs[all_invs['uniqueIDs'].str.contains('p1')]
samples = [line.strip() for line in open('success_samples.list')]
for sample in samples:
    invs[sample] = invs['uniqueIDs'].str.contains(sample)
It happens with another boolean test as well.
invs["%s_private_denovo" % proband] = pd.Series(
invs[proband] & ~invs[father] & ~invs[mother] &
invs["%s_private" % proband])
Thanks!
I guess invs causes the warning: it is a slice of all_invs rather than an independent frame, so pandas cannot tell whether your later assignments should propagate back. To resolve that, copy it explicitly like this:
invs = all_invs[all_invs['uniqueIDs'].str.contains('p1')].copy()
This is a copy of the selected answer from this post.
This warning comes up because your dataframe x is a copy of a slice. It is not always easy to see why, but it has to do with how you arrived at the dataframe's current state.
You can either create a proper dataframe out of x by doing
x = x.copy()
This will remove the warning, but it is not the proper way!
You should be using the DataFrame.loc method, as the warning suggests, like this:
x.loc[:,'Mass32s'] = pandas.rolling_mean(x.Mass32, 5).shift(-2)
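Note that pandas.rolling_mean was deprecated in pandas 0.18 and removed in later releases; in current pandas the equivalent would be:

x.loc[:, 'Mass32s'] = x['Mass32'].rolling(5).mean().shift(-2)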