pandas select subset of pivot_table - python

There are a few questions here on this topic, but none seem to be helpful in my case. Here's a dumbed down version of what I want:
This is the csv file of interest: http://pastebin.com/rP7tPDse
I'm creating the pivot table as:
piv = pd.read_csv("test.csv",delimiter = "\s+").pivot_table('z','x','y')
And this returns
y 0.0 1.0 1.3 2.0
x
0.0 1.0 5.0 NaN 4.0
1.0 3.0 4.0 NaN 6.0
1.5 NaN NaN 7.0 NaN
2.0 3.0 5.0 NaN 7.0
I would like to find a slice of this array as a pivot_table, such as:
y 1.3 2.0
x
0.0 NaN 4.0
1.0 NaN 6.0
Based on the x and y values. I want to include the NaN's as well, to do processing on them later. Help much appreciated.
EDIT: updating the question to be more specific.
I'm looking to extract a pivot table that has values denoted by the column 'z' and indexed by 'x' and 'y', with the condition that:
All x values between arbitrary xmin and xmax
All y values between arbitrary ymin and ymax
From piv, as defined above, I want to do something like:
piv.loc[(piv.y <= 2.0) &
(piv.y >= 1.3) &
(piv.x >= 0.0) &
(piv.x <= 1.2)]
And this would yield me the example answer, above.
Also, in the actual dataset, which I did not post here, there are many more columns. 'x', 'y' and 'z' are just some of them.

When I copied dataframe, the columns were strings and rows were floats.
To get the columns as float
df.columns = df.columns.astype(float)
Now you can pd.IndexSlice
df.loc[pd.IndexSlice[0:1], pd.IndexSlice[1.3:2]]

Related

Replace all NaN values with value from other column

I have the following dataframe:
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
[3, 4, np.nan, 1],
[np.nan, np.nan, 5, np.nan],
[np.nan, 3, np.nan, 4]],
columns=list('ABCD'))
I want to do a ffill() on column B with df["B"].ffill(inplace=True) which results in the following df:
A B C D
0 NaN 2.0 NaN 0.0
1 3.0 4.0 NaN 1.0
2 NaN 4.0 5.0 NaN
3 NaN 3.0 NaN 4.0
Now I want to replace all NaN values with their corresponding value from column B. The documentation states that you can give fillna() a Series, so I tried df.fillna(df["B"], inplace=True). This results in the exact same dataframe as above.
However, if I put in a simple value (e.g. df.fillna(0, inplace=True), then it does work:
A B C D
0 0.0 2.0 0.0 0.0
1 3.0 4.0 0.0 1.0
2 0.0 4.0 5.0 0.0
3 0.0 3.0 0.0 4.0
The funny thing is that the fillna() does seem to work with a Series as value parameter when operated on another Series object. For example, df["A"].fillna(df["B"], inplace=True) results in:
A B C D
0 2.0 2.0 NaN 0
1 3.0 4.0 NaN 1
2 4.0 4.0 NaN 5
3 3.0 3.0 NaN 4
My real dataframe has a lot of columns and I would hate to manually fillna() all of them. Am I overlooking something here? Didn't I understand the docs correctly perhaps?
EDIT I have clarified my example in such a way that 'ffill' with axis=1 does not work for me. In reality, my dataframe has many, many columns (hundreds) and I am looking for a way to not have to explicitly mention all the columns.
Try changing the axis to 1 (columns):
df = df.ffill(1).bfill(1)
If you need to specify the columns, you can do something like this:
df[["B","C"]] = df[["B","C"]].ffill(1)
EDIT:
Since you need something more general and df.fillna(df.B, axis = 1) is not implemented yet, you can try with:
df = df.T.fillna(df.B).T
Or, equivalently:
df.T.fillna(df.B, inplace=True)
This works because the indices of df.B coincides with the columns of df.T so pandas will know how to replace it. From the docs:
value: scalar, dict, Series, or DataFrame.
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.
So, for example, the NaN in column 0 at row A (in df.T) will be replaced for the value with index 0 in df.B.

Pandas - df.compare() how to change self/other labels?

Using df.compare in Pandas, is it possible to change the labels of self/other from the output?
I need to send this output directly to less technically savvy users and would like to change them to more descriptive labels.
My code:
if df_1.equals(df_2):
return None
else:
return df_1.compare(df_2, align_axis=0)
You can rename the index level to something more obvious:
df1 = pd.DataFrame([[1,2,3,4], [1,2,3,4]])
df2 = pd.DataFrame([[1,2,5,4], [5,2,3,1]])
df1.compare(df2, align_axis=0).rename(index={'self': 'left', 'other': 'right'}, level=-1)
0 2 3
0 left NaN 3.0 NaN
right NaN 5.0 NaN
1 left 1.0 NaN 4.0
right 5.0 NaN 1.0

combine rows with identical index

How do I combine values from two rows with identical index and has no intersection in values?
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,None,None],[None,5,6]],index=['a','b','b'])
df
#input
0 1 2
a 1.0 2.0 3.0
b 4.0 NaN NaN
b NaN 5.0 6.0
Desired output
0 1 2
a 1.0 2.0 3.0
b 4.0 5.0 6.0
Please stack(), drops all nans and unstack()
df.stack().unstack()
If possible simplify solution for first non missing values per index labels use GroupBy.first:
df1 = df.groupby(level=0).first()
If possible same output from sample data is use sum per labels use sum:
df1 = df.sum(level=0)
If there is multiple non missing values per groups is necessary specify expected output, obviously is is more complicated.

python- flagging a second set of items in a series

I have a dataframe column which contains a list of numbers from a .csv. These numbers range from 1-1400 and may or may not be repeated and the a NaN value can appear pretty much anywhere at random.
Two examples would be
a=[1,4,NaN,5,6,7,...1398,1400,1,2,3,NaN,8,9,...,1398,NaN]
b=[1,NaN,2,3,4,NaN,7,10,...,1398,1399,1400]
I would like to create another column that finds the first 1-1400 and records a '1' in the same index and if the second set of 1-1400 exists, then mark that down as a '2' in the new column
I can think of some roundabout ways using temporary placeholders and some other kind of checks, but I was wondering if there was a 1-3 liner to do this operation
Edit1: I would prefer there to be a single column returned
a1=[1,1,NaN,1,1,1,...1,1,2,2,2,NaN,2,2,...,2,NaN]
b1=[1,NaN,1,1,1,NaN,1,1,...,1,1,1]
You can use groupby() and cumcount() to count numbers in each column:
# create new columns for counting
df['a1'] = np.nan
df['b1'] = np.nan
# take groupby for each value in column `a` and `b` and count each value
df.a1 = df.groupby('a').cumcount() + 1
df.b1 = df.groupby('b').cumcount() + 1
# set np.nan as it is
df.loc[df.a.isnull(), 'a1'] = np.nan
df.loc[df.b.isnull(), 'b1'] = np.nan
EDIT (after receiving a comment of 'does not work'):
df['a2'] = df.ffill().a.diff()
df['a1'] = df.loc[df.a2 < 0].groupby('a').cumcount() + 1
df['a1'] = df['a1'].bfill().shift(-1)
df.loc[df.a1.isnull(), 'a1'] = df.a1.max() + 1
df.drop('a2', axis=1, inplace=True)
df.loc[df.a.isnull(), 'a1'] = np.nan
you can use diff to check when the difference between two following values is negative, meaning of the start of a new range. Let's create a dataframe:
import pandas as pd
import numpy as np
# to create a dataframe with two columns my range go up to 12 but 1400 is the same
df = pd.DataFrame({'a':[1,4,np.nan,5,10,12,2,3,4,np.nan,8,12],'b':range(1,13)})
df.loc[[4,8],'b'] = np.nan
Because you have 'NaN', you need to use ffill to fill NaN with previous value and you want the opposite of the row (using ~) where the diff is greater or equal than 0 (I know it sound like less than 0, but not exactely here as it miss the first row of the dataframe). For column 'a' for example
print (df.loc[~(df.a.ffill().diff()>=0),'a'])
0 1.0
6 2.0
Name: a, dtype: float64
you get the two rows where a "new" range start. To use this property to create 'a1', you can do:
# put 1 in the rows with a new range start
df.loc[~(df.a.ffill().diff()>=0),'a1'] = 1
# create a mask to select notnull row in a:
mask_a = df.a.notnull()
# use cumsum and ffill on column a1 with the mask_a
df.loc[mask_a,'a1'] = df.loc[mask_a,'a1'].cumsum().ffill()
Finally, for several column, you can do:
list_col = ['a','b']
for col in list_col:
df.loc[~(df[col].ffill().diff()>=0),col+'1'] = 1
mask = df[col].notnull()
df.loc[mask,col+'1'] = df.loc[mask,col+'1'].cumsum().ffill()
and with my input, you get:
a b a1 b1
0 1.0 1.0 1.0 1.0
1 4.0 2.0 1.0 1.0
2 NaN 3.0 NaN 1.0
3 5.0 4.0 1.0 1.0
4 10.0 NaN 1.0 NaN
5 12.0 6.0 1.0 1.0
6 1.0 7.0 2.0 1.0
7 3.0 8.0 2.0 1.0
8 4.0 NaN 2.0 NaN
9 NaN 10.0 NaN 1.0
10 8.0 11.0 2.0 1.0
11 12.0 12.0 2.0 1.0
EDIT: you can even do it in one line for each column, same result:
df['a1'] = df[df.a.notnull()].a.diff().fillna(-1).lt(0).cumsum()
df['b1'] = df[df.b.notnull()].b.diff().fillna(-1).lt(0).cumsum()

Replace values in column based on other column

Using:
import pandas as pd
import numpy as np
a = pd.read_csv('Bvitoria_argos.csv', na_values=[' -99999.0'])
The dataframe is something like that:
HS Tp
3.0 12.0
2.0 11.3
nan 19.2
nan 5.9
5.6 7.0
The objective is to replace values in ''Tp'' column based on ''HS'' values and get something like that:
HS Tp
3.0 12.0
2.0 11.3
nan nan
nan nan
5.6 7.0
I've tried to use this, but it's not working:
c.loc[c.HS==np.nan,'Tp']=np.nan
To be more specifc, when is nan in ''HS'' column ''Tp'' column need to be nan to. Would be thankful if someone could help.
Use isnull():
df.loc[df['HS'].isnull(),'Tp'] = np.nan
You could use np.where. If cond is a boolean array, and A and B are arrays, then
C = np.where(cond, A, B)
defines C to be equal to A where cond is True, and B where cond is False.
Compare Indexing where condition.

Categories