Python Pandas Series - order output via Slice

Trying to produce an 'ordered' output from a Pandas Series using slice notation. The example below fails:
x = pd.Series([2,4,6], index=['a', 'b', 'c'])
print(x[2,1]) #Fail

To expand a little on the comments: when you write x[2,1], pandas treats the comma-separated key as a single tuple, not as two separate selections, so the lookup fails. If you want to pick out two items, the item at position 2 and the item at position 1, you need to pass a list of positions:
x[[2, 1]]
will return what you want.
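A minimal runnable sketch of the failure and the fix; the .iloc form is shown because plain [] with integer positions on a labelled Series is deprecated in recent pandas:
import pandas as pd

x = pd.Series([2, 4, 6], index=['a', 'b', 'c'])

# x[2, 1] raises a KeyError: the key (2, 1) is treated as a single tuple label.
# x[[2, 1]] selects positions 2 and 1, but integer positions in plain []
# are deprecated on a labelled Series, so .iloc is the safer spelling.
print(x.iloc[[2, 1]])
# c    6
# b    4
# dtype: int64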

Related

Looping through rows based on column values

Trying to loop through a report and eliminate/hide/replace cell values if they are repeated in the row above. This is conditional on certain columns in the row, not the entire row, as each row contains at least one piece of data that is unique to it. I know I am close, but I'm missing my mark and looking for a nudge in the right direction. I'm trying to eliminate redundant information to increase the legibility of the final report. Essentially, what I am trying to do is:
for cell in row:
    if column["column_name"] == (line above):
        cell.value = " "
Because each row has a unique piece of data drop duplicates does not work.
Once I can clear the intended column in each row where applicable, I will expand the process to loop through and apply it to other columns where the initial one is blanked out. I should be able to work that out once the first domino falls. Any advice is appreciated.
I've tried
np.where(cell) = [iloc-1]
and
masking based on the same parameter.
I get errors that 'row' and 'iloc' are undefined or None of [Index (all content)] are in the [index].
You can use shift() to compare each row with the row above it. If I understand your issue correctly, the example code below shows the approach: it replaces a value with 0 whenever it repeats the value in the row above.
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 4, 5],
                   'B': ['a', 'b', 'c', 'd', 'e']})

# Keep 'A' where it differs from the value in the row above, otherwise write 0.
df['A'] = df['A'].where(df['A'] != df['A'].shift(), 0)
print(df)
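A sketch closer to the report scenario in the question, blanking a string column wherever it repeats the row above; the column names here are hypothetical:
import pandas as pd

report = pd.DataFrame({
    'region': ['East', 'East', 'West', 'West', 'West'],  # hypothetical report column
    'detail': ['r1', 'r2', 'r3', 'r4', 'r5'],            # unique per row
})

# Blank the cell when it matches the cell directly above it.
repeated = report['region'] == report['region'].shift()
report.loc[repeated, 'region'] = ' '
print(report)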

Remove index from MultiIndex dataframe if child index has column value meeting criteria

I had originally asked this question here, and I believe it was incorrectly marked as a duplicate. I will do my best here to clarify my question and how I believe it is unique.
Given the following example MultiIndex dataframe:
import pandas as pd
import numpy as np

first = ['A', 'B', 'C']
second = ['a', 'b', 'c', 'd']
third = ['1', '2', '3']
indices = [first, second, third]
index = pd.MultiIndex.from_product(indices, names=['first', 'second', 'third'])
df = pd.DataFrame(np.random.randint(10, size=(len(first) * len(second) * len(third), 4)),
                  index=index, columns=['Val1', 'Val2', 'Val3', 'Val4'])
Goal: I would like to retain a specific level=1 index (such as 'a') only if the value of column 'Val2' corresponding to index value 1 in level=2 is greater than 5 for that level=1 index. If this criterion is not met (i.e. column 'Val2' is less than or equal to 5 for index 1 in level=2), the corresponding level=1 index should be removed from the dataframe. If no level=1 index meets the criterion for a given level=0 index, that level=0 index should also be removed. My previous post contains my expected output (I can add it here, but I wanted this post to be as succinct as possible for clarity).
Here is my current solution, the performance of which I'm sure can be improved:
grouped = df.groupby(level=0)
output = pd.concat([
    grouped.get_group(key)
           .groupby(level=1)
           .filter(lambda x: (x.loc[pd.IndexSlice[:, :, '1'], 'Val2'] > 5).any())
    for key, group in grouped
])
This does produce my desired output, but for a dataframe with hundreds of thousands of rows the performance is rather poor. Is there something obvious I am missing here to better utilize the under-the-hood optimization of pandas?
I got the same result as your example solution by doing the following:
df.loc[df.xs('1', level=2)['Val2'] > 5]
Comparing time performance, this is ~15x faster (on my machine your example takes 36 ms while this takes 2 ms).
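If the implicit alignment of that boolean mask seems opaque (or does not work in your pandas version), an equivalent, more explicit sketch, continuing from the df built in the question, broadcasts the per-(first, second) mask back to the full three-level index by hand:
# One boolean per (first, second) pair: is 'Val2' > 5 in the row where third == '1'?
keep = df.xs('1', level='third')['Val2'] > 5

# Broadcast the mask to every row by dropping 'third' from the full index
# and reindexing the mask against the result, then filter.
keep_full = keep.reindex(df.index.droplevel('third'))
result = df[keep_full.to_numpy()]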

Index objects in pandas - why df.columns returns an Index rather than a list

Coming from an R background, I find the (very heavy) use of Index objects in pandas a little disconcerting. For example, if train is a pandas DataFrame, is there some special reason why train.columns should return an Index rather than a list? What additional purpose is served by it being an Index object? As per the definition of pandas.Index, it is the basic object storing axis labels for all pandas objects. While train.index.values does return the row labels (axis=0), how can I get the column labels or column names from a pandas Index? Unlike an earlier question, in this question I have a specific example in mind.
A pd.Index is an array-like container of the column names, so in some sense it doesn't make sense to ask how to get the labels from the index, because the index is the labels.
That said, you can always get the underlying NumPy array with df.columns.values, or convert it to a Python list with tolist() as @Mitch showed.
As for why an Index is used over a bare array: an Index provides extra functionality and performance used throughout pandas, the core of which is hash-table-based indexing.
As an example, consider the following frame and its columns.
df = pd.DataFrame(np.random.randn(10, 10),
                  columns=list('abcdefghkm'))
cols = df.columns
cols
Out[16]: Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'k', 'm'], dtype='object')
Now say you want to select column 'h' out of the frame. With a list or array version of the columns, you would have to loop over the columns to find the position of 'h', which is O(n) in the number of columns - something like this:
for i, col in enumerate(cols):
    if col == 'h':
        found_loc = i
        break
found_loc
Out[18]: 7
df.values[:, found_loc]
Out[19]:
array([-0.62916208, 2.04403495, 0.29498066, 1.07939374, -1.49619915,
-0.54592646, -1.04382192, -0.45934113, -1.02935858, 1.62439231])
df['h']
Out[20]:
0 -0.629162
1 2.044035
2 0.294981
3 1.079394
4 -1.496199
5 -0.545926
6 -1.043822
7 -0.459341
8 -1.029359
9 1.624392
Name: h, dtype: float64
With the Index, pandas constructs a hash table of the column values, so finding the location of 'h' is an amortized O(1) operation - generally significantly faster, especially when the number of columns is large.
df.columns.get_loc('h')
Out[21]: 7
This example only selected a single column, but as @ayhan notes in the comments, the same hash table structure also speeds up many other operations, like merging, alignment, filtering, and grouping.
From the documentation for pandas.Index
Immutable ndarray implementing an ordered, sliceable set. The basic object storing axis labels for all pandas objects
Using a plain list as the index of a DataFrame could cause issues with unorderable or unhashable objects: since an Index is backed by a hash table, the same principles apply as to why lists can't be dictionary keys in regular Python.
At the same time, the Index object being explicit permits us to use different types as an Index, as compared to the implicit integer index that NumPy has for instance, and perform fast lookups.
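An illustrative sketch of those lookups on the ten-column frame from above; the membership test and get_loc / get_indexer all hit the same hash table rather than scanning the labels:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 10), columns=list('abcdefghkm'))

print('h' in df.columns)                   # True  -- hash lookup, not a linear scan
print(df.columns.get_loc('h'))             # 7     -- label -> position
print(df.columns.get_indexer(['h', 'a']))  # [7 0] -- vectorised label -> positions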
If you want to retrieve a list of column names, the Index object has a tolist method.
>>> df.columns.tolist()
['a', 'b', 'c']

python pandas dataframe - can't figure out how to look up an index given a value from a df

I have 2 dataframes of numerical data. Given a value from one of the columns in the second df, I would like to look up the index for that value in the first df. More specifically, I would like to create a third df which contains only index labels, using values from the second to look up their coordinates in the first.
listso = [[21, 101], [22, 110], [25, 113], [24, 112], [21, 109],
          [28, 108], [30, 102], [26, 106], [25, 111], [24, 110]]
data = pd.DataFrame(listso, index=list('abcdefghij'), columns=list('AB'))
rollmax = pd.DataFrame(data.rolling(center=False, window=5).max())
So for the third df, I hope to use the values from rollmax and figure out which row they showed up in data. We can call this third df indexlookup.
For example, rollmax.ix['j','A'] = 30, so indexlookup.ix['j','A'] = 'g'.
Thanks!
You can build a Series that maps the other way around, from value back to index label:
mapA = pd.Series(data.index, index=data.A)
Then mapA[rollmax.ix['j','A']] gives 'g'.
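A runnable sketch of the same idea, using .loc instead of the now-removed .ix; note that if a value occurs more than once in a column (as 21, 24 and 25 do in column 'A'), the lookup returns every matching label, so you may want to drop duplicate values first:
import pandas as pd

listso = [[21, 101], [22, 110], [25, 113], [24, 112], [21, 109],
          [28, 108], [30, 102], [26, 106], [25, 111], [24, 110]]
data = pd.DataFrame(listso, index=list('abcdefghij'), columns=list('AB'))
rollmax = data.rolling(window=5).max()

# Map values of column 'A' back to their row labels.
mapA = pd.Series(data.index, index=data['A'])

print(mapA[int(rollmax.loc['j', 'A'])])  # 'g'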

Adding a column to a python pandas data frame based on the value of another column

I have a pandas data frame, and I would like to add a column that is the row-to-row difference of another column, computed separately for each value of a third column. Here is a toy example:
import pandas as pd
import numpy as np

d = {'one': pd.Series(range(4), index=['a', 'b', 'c', 'd']),
     'two': pd.Series(range(4), index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df['three'] = [2, 2, 3, 3]

four = []
for i in set(df['three']):
    for j in range(len(df) - 1):
        four.append(df[df['three'] == i]['two'][j + 1] - df[df['three'] == i]['two'][j])
    four.append(0)
df['four'] = four
The final column should be [1, 1, 1, NaN], since that is the difference between successive rows of the 'two' column.
This makes more sense in the context of my original code: my data frame is organized by IDs and then by time, so when I take a subset of the data frame by ID, I'm left with the time-series evolution of the variables for each individual ID. However, I keep either getting a KeyError or ending up editing a copy of the original data frame. What is the right way to go about this?
You could replace df[df['three'] == i] with a groupby on column three. And perhaps replace ['two'][j + 1] - ['two'][j] with df['two'].shift(-1) - df['two'].
I think that would be identical to what you are doing now within the nested loop. How exactly you implement it depends a bit on the format you want for the result. One way would be:
df.groupby('three').apply(lambda grp: pd.Series(grp['two'].shift(-1) - grp['two']))
Which would result in:
two        a    b
three
2          1  NaN
3          1  NaN
The column names become a bit meaningless after this operation.
If all you want is the difference between consecutive rows of column 'two', you can use the shift method:
df['four'] = df.two.shift(-1) - df.two
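If the difference should also respect the grouping in 'three' (the per-ID case described in the question), a sketch combining the two suggestions, writing the result back as an ordinary column:
import pandas as pd

d = {'one': pd.Series(range(4), index=['a', 'b', 'c', 'd']),
     'two': pd.Series(range(4), index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df['three'] = [2, 2, 3, 3]

# Shift within each group so the last row of every group gets NaN.
df['four'] = df.groupby('three')['two'].shift(-1) - df['two']
print(df['four'].tolist())  # [1.0, nan, 1.0, nan]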
