Let's say I have a dataframe with a multi-index column:
import pandas as pd

P = pd.DataFrame(
    [[100, 101],
     [101, 102],
     [ 98,  99]],
    columns=pd.MultiIndex.from_tuples(
        [('price', 'bid'),
         ('price', 'ask')]
    )
)
P
and I want to add a new column which shows me the data from the previous row:
P['price_prev'] = P['price'].shift(1)
This throws the error
ValueError: Cannot set a DataFrame with multiple columns to the single column price_prev
I understand why this happens, and doing
P[[('price_prev', 'bid'), ('price_prev', 'ask')]] = P['price'].shift(1)
gives me what I want without errors:
But is there really no way to do this which avoids repeating the names of the subcolumns? I.e., telling pandas to copy the respective column including all of its subcolumns, renaming the top level to whatever was specified, and then shifting all of the data one row down?
try this:
P.join(P.shift().rename(lambda x: f'{x}_prev', axis=1, level=0))
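For reference, here's a minimal, self-contained run of that one-liner; the printed column list is what I'd expect from the rename, assuming the P defined in the question:

import pandas as pd

P = pd.DataFrame(
    [[100, 101], [101, 102], [98, 99]],
    columns=pd.MultiIndex.from_tuples([('price', 'bid'), ('price', 'ask')])
)
# shift every column down one row and tag the shifted copy's top level
out = P.join(P.shift().rename(lambda x: f'{x}_prev', axis=1, level=0))
print(out.columns.tolist())
# [('price', 'bid'), ('price', 'ask'), ('price_prev', 'bid'), ('price_prev', 'ask')]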
@ziying35's answer does work, but only if I want to shift my entire dataframe.
Here's a similar and slightly less verbose version that also works for individual columns (in this case price):
P = P.join(P[['price']].shift(), rsuffix='_prev')
The one drawback of this compared to the explicit
P[[('price_prev', 'bid'), ('price_prev', 'ask')]] = P['price'].shift()
is higher memory usage, so there seems to be extra copying going on somewhere when using join. However, this might also just be my Jupyter notebook acting up.
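A related sketch, if you'd rather control the new top-level name explicitly instead of relying on rsuffix (this assumes you only want to shift the price block):

# shift only the 'price' subcolumns and rename the top level in one go
P = P.join(P[['price']].shift().rename(columns={'price': 'price_prev'}, level=0))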
I have a dataframe that looks like this:
import pandas as pd

df = pd.DataFrame(data=list(range(0, 10)),
                  index=pd.MultiIndex.from_product(
                      [[str(list(range(0, 1000)))], list(range(0, 10))],
                      names=["ind1", "ind2"]),
                  columns=["col1"])
df['col2'] = str(list(range(0, 1000)))
Unfortunately, the display of the above dataframe looks like this:
If I set pd.options.display.max_colwidth = 5, then col2 behaves and is displayed in a single row, but ind1 doesn't behave:
Since ind1 is part of a MultiIndex, I don't mind that it occupies multiple rows, but I would like to limit its width. If I could also restrict each row to at most the height of a single line, that would be great. I don't mind individual cells being truncated on display, because I prefer to scroll less, in any direction, to see a cell.
I am aware I can create my own HTML display. That's great and all, but I think it's too complex for my use case of just wanting smaller width columns for data analysis in jupyter notebooks. Nevertheless, such a solution might help other similar use cases, if you are inclined to write one.
What I'm looking for is some setting, which I thought was pd.options.display.max_colwidth, that limits the column width, even if it's an index. Something that disables wrapping for long texts would probably solve the same issue as well.
I also tried to just print without the index, df.style.hide_index(), in combination with pd.options.display.max_colwidth = 5, but then col2 stops behaving:
By now I've run out of ideas. Any suggestions?
Here is one way to do it:
import pandas as pd
df = pd.DataFrame(
    data=list(range(0, 10)),
    index=pd.MultiIndex.from_product(
        [[str(list(range(0, 1000)))], list(range(0, 10))], names=["ind1", "ind2"]
    ),
    columns=["col1"],
)
df["col2"] = str(list(range(0, 1000)))
In the next Jupyter cell, run:
df.style.set_properties(**{"width": "10"}).set_table_styles(
    [{"selector": "th", "props": [("vertical-align", "top")]}]
)
Which outputs:
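Building on that, here's a hedged variant (the extra CSS properties are my own guess at what the notebook renderer will honor) that also caps the header and index cell width and disables wrapping:

# hypothetical tweak: cap width on headers/index cells and truncate overflow
df.style.set_properties(**{"max-width": "60px"}).set_table_styles(
    [{"selector": "th",
      "props": [("vertical-align", "top"),
                ("max-width", "60px"),
                ("overflow", "hidden"),
                ("text-overflow", "ellipsis"),
                ("white-space", "nowrap")]}]
)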
Problem Overview:
I am attempting to clean stock data loaded from a CSV file into a pandas DataFrame. The indexing operation I perform works: if I call print, I can see the values I want being pulled from the frame. However, when I try to replace the values, as shown in the screenshot, pandas ignores my request. Ultimately, I'm just trying to extract a value out of one column and move it to another. The pandas documentation suggests using the .replace() method, but that doesn't seem to work with the operation I'm trying to perform.
Here's a pic of the code and data before and after code is run.
And the for loop (as referenced in the pic):
for i, j in zip(all_exchanges['MarketCap'], all_exchanges['MarketCapSym']):
    if 'M' in i: j = j.replace('n/a','M')
    elif 'B' in i: j = j.replace('n/a','M')
The problem is that j is a string, thus immutable.
You're replacing data, but not in the original dataset.
You have to do it another way, less elegant, without zip (I simplified your test BTW since it did the same on both conditions):
aem = all_exchanges['MarketCap']
aems = all_exchanges['MarketCapSym']
for i in range(min(len(aem), len(aems))):  # like zip: stops at the shortest of both
    if 'M' in aem[i] or 'B' in aem[i]:
        aems[i] = aems[i].replace('n/a','M')
Now you're replacing in the original dataset.
If both columns are in the same dataframe, all_exchanges, iterate over the rows.
for i, row in all_exchanges.iterrows():
    # get whatever you want from row
    # using the index you should be able to set a value
    all_exchanges.loc[i, 'columnname'] = xyz
That should be the syntax, if I remember correctly ;)
Here is a quite exhaustive tutorial on missing values in pandas. I suggest using fillna():
df['MarketCap'].fillna('M', inplace=True)
df['MarketCapSym'].fillna('M', inplace=True)
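One caveat worth checking: fillna only sees real missing values (NaN/None). If the column actually holds the literal string 'n/a', as the question's loop suggests, you'd have to convert it first. A minimal sketch, assuming the 'n/a' strings come straight from the CSV:

import numpy as np
# turn the literal 'n/a' strings into real NaN so fillna can act on them
df['MarketCapSym'] = df['MarketCapSym'].replace('n/a', np.nan)
df['MarketCapSym'].fillna('M', inplace=True)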
Avoid iterating if you can. As already pointed out, you're not modifying the original data. Index on the MarketCap column and perform the replace as follows.
# overwrites any data in the MarketCapSym column
all_exchanges.loc[all_exchanges['MarketCap'].str.contains('M|B'),
                  'MarketCapSym'] = 'M'
# only replaces 'n/a'
all_exchanges.loc[all_exchanges['MarketCap'].str.contains('M|B'),
                  'MarketCapSym'].replace({'n/a': 'M'}, inplace=True)
Thanks to all who posted. After thinking about your solutions and the problem a bit longer, I realized there might be a different approach. Instead of initializing a MarketCapSym column with 'n/a', I instead created that column as a copy of MarketCap and then extracted anything that wasn't an "M" or "B".
I was able to get the solution down to one line:
import re
all_exchanges['MarketCapSymbol'] = [re.sub('[$.0-9]', '', i) for i in all_exchanges.loc[:, 'MarketCap']]
A breakdown of the solution is as follows:
all_exchanges['MarketCapSymbol'] = - Make a new column on the DataFrame called 'MarketCapSymbol'.
all_exchanges.loc[:,'MarketCap'] - Initialize the values in the new column to those in 'MarketCap'.
re.sub('[$.0-9]', '', i) for i in - Since all I want is the 'M' or 'B', apply re.sub() to each element, stripping the characters in [$.0-9] and leaving only the M or B.
Using a list comprehension this way seemed a bit more natural / readable to me in my limited experience with pandas. Let me know what you think!
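For what it's worth, the same idea can stay inside pandas via the vectorized string methods. A sketch of the equivalent one-liner, assuming the same MarketCap column (regex=True must be passed explicitly in newer pandas):

# strip '$', '.' and digits in one vectorized pass, leaving only 'M' or 'B'
all_exchanges['MarketCapSymbol'] = all_exchanges['MarketCap'].str.replace(r'[$.0-9]', '', regex=True)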
I am creating a Python script that drives an old Fortran code to locate earthquakes. I want to vary the input parameters to the Fortran code in the Python script and record the results, as well as the values that produced them, in a dataframe. The results from each run are also convenient to put in a dataframe, leading me to a situation where I have a nested dataframe (i.e., a DataFrame assigned to an element of a DataFrame). So for example:
import pandas as pd
import numpy as np

def some_operation(row):
    results = np.random.rand(50, 3) * row['p1'] / row['p2']
    res = pd.DataFrame(results, columns=['foo', 'bar', 'rms'])
    return res

# Init master df
df_master = pd.DataFrame(columns=['p1', 'p2', 'results'], index=range(3))
df_master['p1'] = np.random.rand(len(df_master))
df_master['p2'] = np.random.rand(len(df_master))
df_master = df_master.astype(object)  # make sure generic types can be used

# loop over each row, call some_operation and store the results DataFrame
for ind, row in df_master.iterrows():
    df_master.loc[ind, "results"] = some_operation(row)
Which raises this exception:
ValueError: Incompatible indexer with DataFrame
It works as expected, however, if I change the last line to this:
df_master["results"][ind] = some_operation(row)
I have a few questions:
Why does .loc (and .ix) fail when the slice assignment succeeds? If the some_operation function returned a list, dictionary, etc., it seems to work fine.
Should the DataFrame be used in this way? I know that dtype object can be ultra slow for sorting and whatnot, but I am really just using the dataframe as a convenient container, because the column/index notation is quite slick. If DataFrames should not be used in this way, is there a similar alternative? I was looking at the Panel class, but I am not sure if it is the proper solution for my application. I would hate to forge ahead, apply the hack shown above to some code, and then have it not supported in future releases of pandas.
Why does .loc (and .ix) fail when the slice assignment succeeds? If the some_operation function returned a list, dictionary, etc. it seems to work fine.
This is a strange little corner case of the code. It stems from the fact that if the item being assigned is a DataFrame, loc and ix assume that you want to fill the given indices with the content of the DataFrame. For example:
>>> df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> df2 = pd.DataFrame({'a': [100], 'b': [200]})
>>> df1.loc[[0], ['a', 'b']] = df2
>>> df1
     a    b
0  100  200
1    2    5
2    3    6
If this syntax also allowed storing a DataFrame as an object, it's not hard to imagine a situation where the user's intent would be ambiguous, and ambiguity does not make a good API.
Should the DataFrame be used in this way?
As long as you know the performance drawbacks of the method (and it sounds like you do) I think this is a perfectly suitable way to use a DataFrame. For example, I've seen a similar strategy used to store the trained scikit-learn estimators in cross-validation across a large grid of parameters (though I can't recall the exact context of this at the moment...)
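If the object-dtype tricks ever feel too fragile, one way to sidestep the ambiguity entirely is to keep the nested frames outside the master DataFrame, in a plain dict keyed by the master index. A sketch using the names from the question:

# store each per-run DataFrame under its master row's index instead
results = {ind: some_operation(row) for ind, row in df_master.iterrows()}
# look up a run's output later via its parameter row's index
print(results[0].head())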
This code works, but generates SettingWithCopyWarning. Since the warning can be useful, I'd rather not turn it off globally. In other cases I've found ways to achieve the same result, without triggering the warning, but I can't think of an alternative here:
# df does not yet have a ColZ
# df is indexed by Date and Name (which is not used here)
df["ColZ"] = 0.0
Z = df.xs(myDate, level="Date", drop_level=False)["ColZ"]
Z[:36] = 99.0
df.loc[(myDate,), ("ColZ",)] = Z
I can't take the cross section (xs) and then assign to a new column, because the cross section will give me a copy. And I can't take a cross section for just the cells I want to set to 99, because I need to slice by label AND by position, so I need a blend of iloc and loc. One possibility would be to reset the index to drop the Name level and then put it back again afterward, but that seems yucky.
Any suggestions, or do I just live with the warning?
You can disable the warning locally with the following code:
with pd.option_context('mode.chained_assignment', None):
    # your code here
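Alternatively, here is a sketch that avoids the chained assignment altogether, assuming (as in the question) that ColZ already exists and you want the first 36 rows of the myDate cross-section: translate the label selection into positions and do a single iloc write.

import numpy as np
# boolean mask for the Date level, then the first 36 matching positions
mask = df.index.get_level_values("Date") == myDate
rows = np.flatnonzero(mask)[:36]
df.iloc[rows, df.columns.get_loc("ColZ")] = 99.0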
In a pandas (v0.8.0) DataFrame, I want to overwrite one slice of columns with another.
The below code throws the listed error.
What would be an efficient alternative method for achieving this?
import numpy as np
from pandas import DataFrame

df = DataFrame({'a': range(0, 7),
                'b': np.random.randn(7),
                'c': np.random.randn(7),
                'd': np.random.randn(7),
                'e': np.random.randn(7),
                'f': np.random.randn(7),
                'g': np.random.randn(7)})
# overwrite cols
df.ix[:,'b':'d'] = df.ix[:, 'e':'g']
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\pandas\core\indexing.py", line 68, in __setitem__
    self._setitem_with_indexer(indexer, value)
  File "C:\Python27\lib\site-packages\pandas\core\indexing.py", line 98, in _setitem_with_indexer
    raise ValueError('Setting mixed-type DataFrames with '
ValueError: Setting mixed-type DataFrames with array/DataFrame pieces not yet supported
Edit:
As a permutation, how could I also specify a subset of the rows to set?
df.ix[df['a'] < 3, 'b':'d'] = df.ix[df['a'] < 3, 'e':'g']
The issue is that using .ix[] returns a view to the actual memory objects for that subset of the DataFrame, rather than a new DataFrame made out of its contents.
Instead use
# The left-hand-side does not use .ix, since we're assigning into it.
df[['b','c']] = df.ix[:,'e':'f'].copy()
Note that you will need .copy() if you are intent on using .ix to do the slicing, otherwise it would set columns 'b' and 'c' as the same objects in memory as the columns 'e' and 'f', which does not seem like what you want to do here.
Alternatively, to avoid worrying about the copying, you can just do:
df[['b','c']] = df[['e','f']]
If the convenience of indexing matters to you, one way to simulate this effect is to write your own function:
def col_range(df, col1, col2):
    return list(df.ix[df.index.values[0], col1:col2].index)
Now you could do the following:
df[col_range(df,'b','d')] = df.ix[:,'e':'g'].copy()
Note: in the definition of col_range I used the first index which will select the first row of the data frame. I did this because making a view of the whole data frame just to select a range of columns seems wasteful, whereas one row probably won't matter. Since slicing this way produces a Series, the way to extract the columns is to actually grab the index, and I return them as a list.
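As an aside, in more recent pandas (where .ix is gone), the same label range can be resolved without touching any row at all. A sketch using Index.slice_indexer, under the assumption that the columns are unique:

# translate the label bounds into a positional slice over the columns
cols = df.columns[df.columns.slice_indexer('b', 'd')]
df[cols] = df.loc[:, 'e':'g'].to_numpy()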
Added for additional row slice request:
To specify a set of rows in the assignment, you can use .ix, but you need to specify just a matrix of values on the right-hand side. Having the structure of a sub-DataFrame on the right-hand side will cause problems.
df.ix[0:4,col_range(df,'b','d')] = df.ix[0:4,'e':'g'].values
You can replace the [0:4] with [df.index.values[i]:df.index.values[j]] or [df.index.values[i] for i in range(N)] or even with logical values such as [df['a']>5] to only get rows where the 'a' column exceeds 5, for example.
The full slice for an example of logical indexing where you want column 'a' bigger than 5 and column 'e' less than 10 might look like this:
import numpy as np
my_rows = np.logical_and(df['a'] > 5, df['e'] < 10)
df.ix[my_rows,col_range(df,'b','d')] = df.ix[my_rows,'e':'g'].values
In many cases, you will not need to use the .ix on the left-hand side (I recommend against it because it only works in some cases and not in others). For instance, something like:
df["A"] = np.repeat(False, len(df))
df["A"][df["B"] > 0] = True
will work as is, no special .ix needed for identifying the rows where the condition is true. The .ix seems to be needed on the left when the thing on the right is complicated.
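For readers on a pandas version where .ix has been removed, the same pattern is spelled with .loc; a sketch of the equivalent:

# initialize the flag column, then set it only where the condition holds
df["A"] = False
df.loc[df["B"] > 0, "A"] = True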