pandas get position of a given index in DataFrame - python

Let's say I have a DataFrame like this:
df
     A  B
5    0  1
18   2  3
125  4  5
where 5, 18, 125 are the index
I'd like to get the row before (or after) a certain index label. For instance, I have index 18 (e.g. from df[df.A==2].index) and I want the row before it, without knowing in advance that this row has 5 as its index.
2 sub-questions:
How can I get the position of index 18? Something like df.loc[18].get_position() which would return 1 so I could reach the line before with df.iloc[df.loc[18].get_position()-1]
Is there another solution, a bit like options -C, -A or -B with grep ?

For your first question:
base = df.index.get_indexer_for((df[df.A == 2].index))
or alternatively, for a single label (note that get_loc returns a plain integer here rather than an array):
base = df.index.get_loc(18)
To get the surrounding ones:
mask = pd.Index(base).union(pd.Index(base - 1)).union(pd.Index(base + 1))
I used Indexes and unions to remove duplicates. You may want to keep them, in which case you can use np.concatenate
Be careful with matches on the very first or last rows :)
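As a rough sketch of the grep -A/-B idea, including the edge cases mentioned above, something like the following could work. The helper name context_rows and its before/after parameters are made up for illustration; it assumes a unique index and silently skips labels that are not found.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0, 2, 4], "B": [1, 3, 5]}, index=[5, 18, 125])

def context_rows(frame, labels, before=1, after=1):
    """Hypothetical helper: rows at `labels` plus neighbouring rows, like grep -B/-A."""
    pos = frame.index.get_indexer_for(labels)
    pos = pos[pos >= 0]                        # -1 marks labels that were not found
    offsets = np.arange(-before, after + 1)    # e.g. [-1, 0, 1]
    wanted = (pos[:, None] + offsets).ravel()  # every position around every match
    # Drop positions that fall off the first or last row
    wanted = wanted[(wanted >= 0) & (wanted < len(frame))]
    return frame.iloc[np.unique(wanted)]       # unique also sorts the positions

around_18 = context_rows(df, [18])
before_5 = context_rows(df, [5], before=1, after=0)
```

Here np.unique plays the same role as the Index unions in the answer: it removes duplicate positions when two matches are next to each other.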

If you need to convert more than 1 index, you can use np.where.
Example:
# df
#      A  B
# 5    0  1
# 18   2  3
# 125  4  5
import pandas as pd
import numpy as np
df = pd.DataFrame({"A": [0,2,4], "B": [1,3,5]}, index=[5,18,125])
np.where(df.index.isin([18,125]))
Output:
(array([1, 2]),)
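Note that np.where with isin reports positions in the order of the frame, not in the order of the labels you asked for. If the order of the request matters, Index.get_indexer may be a better fit; a small sketch, assuming the index is unique:

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 2, 4], "B": [1, 3, 5]}, index=[5, 18, 125])

# Positions come back in the order of the requested labels;
# a label that is not in the index is reported as -1
positions = df.index.get_indexer([125, 18, 7])
```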

Related

Idiomatic way to create pandas dataframe as concatenation of function of another's rows

Say I have one dataframe
import pandas as pd
input_df = pd.DataFrame(dict(a=[1, 2], b=[2, 3]))
Also I have a function f that maps each row to another dataframe. Here's an example of such a function. Note that in general the function could take any form so I'm not looking for answers that use agg to reimplement the f below.
def f(row):
    return pd.DataFrame(dict(x=[row['a'] * row['b'], row['a'] + row['b']],
                             y=[row['a']**2, row['b']**2]))
I want to create one dataframe that is the concatenation of the function applied to each of the first dataframe's rows. What is the idiomatic way to do this?
output_df = pd.concat([f(row) for _, row in input_df.iterrows()])
I thought I should be able to use apply or similar for this purpose but nothing seemed to work.
x y
0 2 1
1 3 4
0 6 4
1 5 9
You can use DataFrame.agg to calculate the sum and the product, df.pow(2) or np.square to calculate the squares, and numpy.ndarray.reshape to flatten the results (df here is the question's input_df).
out = pd.DataFrame({'x': df.agg(['prod', 'sum'],axis=1).to_numpy().reshape(-1),
'y': np.square(df).to_numpy().reshape(-1)})
out
x y
0 2 1
1 3 4
2 6 4
3 5 9
You should avoid iterating over rows (see "How to iterate over rows in a DataFrame in Pandas").
Instead try:
df = df.assign(product=df.a*df.b, sum=df.sum(axis=1),
asq=df.a**2, bsq=df.b**2)
Then:
df = [[[p, s], [asq, bsq]] for p, s, asq, bsq in df.to_numpy()]
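If f really has to stay a black box, as the question asks, then collecting the pieces and concatenating once is probably about as idiomatic as it gets. The sketch below keeps the question's f and shows two variants: a dict of pieces, so the source row ends up as an outer index level, and ignore_index=True for a flat 0..n-1 index. The names pieces, output_df and flat_df are only illustrative.

```python
import pandas as pd

input_df = pd.DataFrame(dict(a=[1, 2], b=[2, 3]))

def f(row):
    return pd.DataFrame(dict(x=[row["a"] * row["b"], row["a"] + row["b"]],
                             y=[row["a"] ** 2, row["b"] ** 2]))

# One small frame per input row, keyed by the input row's label
pieces = {idx: f(row) for idx, row in input_df.iterrows()}

# Variant 1: the dict keys become an outer index level
output_df = pd.concat(pieces)

# Variant 2: discard the per-piece indexes and renumber from 0
flat_df = pd.concat(list(pieces.values()), ignore_index=True)
```

Calling pd.concat once on the whole collection avoids repeatedly copying a growing frame, which is what makes appending inside the loop slow.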

Display unique values & count of a data-frame side by side in Python

I know how to display the number of unique values in a column & the count of the number of columns, but I want to know if there is a way to display this information side by side?
That is, I want to know if there is a way to also display the number of columns (1338) next to the values 47, 2, 548, 6, ... respectively.
I.e. how do you output this number next to each of the nunique values.
It may seem unnecessary/redundant, but I would like to know if this is possible.
Current code & output:
Desired output (or something similar):
How about:
dataframe.groupby('name of key').count()
This should do the trick:
import pandas as pd
df=pd.DataFrame({"a": [3,5,4,3,6,5,4,3,7,1], "b": list("aaabccaabb"), "c": list("pqqqpppqqq")})
df.agg(["nunique", "count"]).T
Outputs:
   nunique  count
a        6     10
b        3     10
c        2     10
Edit
To turn the index into a named column:
df2=df.agg(["nunique", "count"]).T.reset_index().rename(columns={"index": "column"})
Outputs:
  column  nunique  count
0      a        6     10
1      b        3     10
2      c        2     10
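One caveat with the approach above: count skips missing values, so it is not necessarily the total number of rows. If the total row count is what should sit next to nunique, a sketch along these lines might do (the sample data and the column name "rows" are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [3, 5, np.nan, 3], "b": list("aabc")})

summary = pd.DataFrame({
    "nunique": df.nunique(),  # distinct non-missing values per column
    "count": df.count(),      # non-missing values per column
    "rows": len(df),          # total rows, repeated for every column
})
```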

What's the fastest way to select values from columns based on keys in another columns in pandas?

I need a fast way to extract the right values from a pandas dataframe:
Given a dataframe with (a lot of) data in several named columns and an additional column whose values contain only the names of the other columns, how do I select values from the data columns using the additional column as keys?
It's simple to do via an explicit loop, but this is extremely slow with something like .iterrows() directly on the DataFrame. If converting to numpy-arrays, it's faster, but still not fast. Can I combine methods from pandas to do it even faster?
Example: This is the kind of DataFrame structure, where columns A and B contain data and column keys contains the keys to select from:
import pandas
df = pandas.DataFrame(
{'A': [1,2,3,4],
'B': [5,6,7,8],
'keys': ['A','B','B','A']},
)
print(df)
output:
Out[1]:
   A  B keys
0  1  5    A
1  2  6    B
2  3  7    B
3  4  8    A
Now I need some fast code that returns a DataFrame like
Out[2]:
val_keys
0 1
1 6
2 7
3 4
I was thinking something along the lines of this:
tmp = df.melt(id_vars=['keys'], value_vars=['A','B'])
out = tmp.loc[tmp['keys']==tmp['variable']]
which produces:
Out[2]:
  keys variable  value
0    A        A      1
3    A        A      4
5    B        B      6
6    B        B      7
but doesn't have the right order or index. So it's not quite a solution.
Any suggestions?
See if either of these works for you (both need import numpy as np):
df['val_keys']= np.where(df['keys'] =='A', df['A'],df['B'])
or
df['val_keys']= np.select([df['keys'] =='A', df['keys'] =='B'], [df['A'],df['B']])
The code below works without hard-coding the column names:
def value(row):
    a = row.name     # row label
    b = row['keys']  # name of the column to read from
    c = df.loc[a, b]
    return c
df.apply(value, axis=1)
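Have you tried filtering then mapping (the dicts are keyed by row label, so each row gets its own value):
df_A = df[df['keys'].isin(['A'])]
df_B = df[df['keys'].isin(['B'])]
A_dict = dict(zip(df_A.index, df_A['A']))  # row label -> value from column A
B_dict = dict(zip(df_B.index, df_B['B']))  # row label -> value from column B
rows = df.index.to_series()
df['val_keys'] = rows.map(A_dict)
df['val_keys'] = rows.map(B_dict).fillna(df['val_keys']) # non-exhaustive mapping for the second one; values come back as floats because of the intermediate NaNs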
Your df['val_keys'] column will now contain the result as in your val_keys output.
If you want you can just retain that column as in your expected output by:
df = df[['val_keys']]
Hope this helps :))
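For the speed part of the question, a fully vectorized lookup with NumPy integer indexing might be worth trying. This is only a sketch: it assumes every value in keys names one of the data columns, and the names data_cols, col_pos and row_pos are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4],
                   "B": [5, 6, 7, 8],
                   "keys": ["A", "B", "B", "A"]})

data_cols = ["A", "B"]

# Translate each key into the position of its column within data_cols
col_pos = pd.Index(data_cols).get_indexer(df["keys"].to_numpy())
row_pos = np.arange(len(df))

# Pick one element per row: (row 0, col_pos[0]), (row 1, col_pos[1]), ...
values = df[data_cols].to_numpy()[row_pos, col_pos]

out = pd.DataFrame({"val_keys": values}, index=df.index)
```

Unlike np.where or np.select, this does not need one condition per column, so it stays the same size when there are many data columns.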

pandas DataFrame sum method works counterintuitively

import numpy as np
from pandas import DataFrame
my_df = DataFrame(np.arange(1,13).reshape(4,3), columns=list('abc'))
my_df.sum(axis="rows")
The output is
a 22
b 26
c 30
# I expect it to sum by rows, thereby giving
0 6
1 15
2 24
3 33
my_df.sum(axis="columns")  # helps achieve this
Why does it work counterintuitively?
In a similar context, the drop method works as it should, i.e. when I write
my_df.drop(['a'],axis="columns")
# This drops column "a".
Am I missing something? Please enlighten me.
Short version
It is a naming convention. The sum over the columns gives a row-wise sum. You are looking for axis='columns'.
Long version
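OK, that was interesting. In pandas, axis 0 is the index (the row labels) and axis 1 is the columns.
Looking in the docs, we find that the documented values are:
axis : {index (0), columns (1)}
The axis names the dimension that the operation collapses, not the direction of the result. axis="rows" is accepted as an alias for axis=0 (an axis name pandas does not recognize raises a ValueError rather than falling back to the default), so the sum runs down the index and returns one total per column. This can thus be read as: the sum over the columns returns the row sums, and the sum over the index returns the column sums. What you want to use is axis=1 or axis='columns', which results in your desired output: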
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(1,13).reshape(4,3), columns=list('abc'))
print(df.sum(axis=1))
Returns:
0 6
1 15
2 24
3 33
dtype: int64
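To see both directions side by side, and why drop feels different, here is a small sketch using the same data (the variable names are just for illustration). For sum, the axis is the one that gets collapsed; for drop, the axis is the one in which the label is looked up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 13).reshape(4, 3), columns=list("abc"))

col_totals = df.sum(axis="index")    # collapse the index: one total per column
row_totals = df.sum(axis="columns")  # collapse the columns: one total per row

# drop looks up the label "a" along the columns axis and removes it
dropped = df.drop(["a"], axis="columns")
```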

Is there a way to do a Series.map in place, but keep original value if no match?

The scenario here is that I've got a dataframe df with raw integer data, and a dict map_array which maps those ints to string values.
I need to replace the values in the dataframe with the corresponding values from the map, but keep the original value if it doesn't map to anything.
So far, the only way I've been able to figure out how to do what I want is by using a temporary column. However, with the size of data that I'm working with, this could sometimes get a little bit hairy. And so, I was wondering if there was some trick to do this in pandas without needing the temp column...
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1,5, size=(100,1)))
map_array = {1:'one', 2:'two', 4:'four'}
df['__temp__'] = df[0].map(map_array, na_action=None)
#I've tried varying the na_action arg to no effect
nan_index = df['__temp__'][df['__temp__'].isnull()].index
df.loc[nan_index, '__temp__'] = df.loc[nan_index, 0]  # .ix was removed from pandas; .loc does the same here
df[0] = df['__temp__']
df = df.drop(['__temp__'], axis=1)
I think you can simply use .replace, whether on a DataFrame or a Series:
>>> df = pd.DataFrame(np.random.randint(1,5, size=(3,3)))
>>> df
0 1 2
0 3 4 3
1 2 1 2
2 4 2 3
>>> map_array = {1:'one', 2:'two', 4:'four'}
>>> df.replace(map_array)
0 1 2
0 3 four 3
1 two one two
2 four two 3
>>> df.replace(map_array, inplace=True)
>>> df
0 1 2
0 3 four 3
1 two one two
2 four two 3
I'm not sure what the memory hit of changing column dtypes will be, though.
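If you would rather stay with Series.map, another option that needs no temporary column is to let the mapping function fall back to the original value itself. A minimal sketch with fixed sample data, assuming the per-element Python call is acceptable for your data size (I have not timed it against replace):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])
map_array = {1: "one", 2: "two", 4: "four"}

# dict.get(v, v) returns the mapped string, or v itself when there is no match
mapped = s.map(lambda v: map_array.get(v, v))
```

On the question's frame this would be df[0] = df[0].map(lambda v: map_array.get(v, v)).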
