Set pandas DataFrame values with another list of values - python

I have a data frame with column A, df['A']
df is something like
index A
1 nan
2 nan
3 nan
4 nan
5 nan
I have a list of True/False values that acts as a mask for the data frame, where True means the value should be replaced.
mask = [False, True, False, True, True]
I have another list of values that I want to write into df['A'] at the masked positions, starting at index 2:
value = [1, 3, 2]
The result I want is:
index A
1 nan
2 1
3 nan
4 3
5 2
I tried df['A'][mask] = value, but it does not work.
Can anyone help? Thank you!

Use DataFrame.loc so that the assignment targets the DataFrame itself rather than a possible copy:
df.loc[mask, 'A'] = value
print (df)
A
1 NaN
2 1.0
3 NaN
4 3.0
5 2.0
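For reference, here is a self-contained sketch of the same assignment, assuming the frame from the question (index 1 to 5, a single all-NaN column A). The boolean list needs one entry per row, and the value list one entry per True in the mask.

```python
import numpy as np
import pandas as pd

# Frame from the question: index 1..5, column A all NaN.
df = pd.DataFrame({"A": [np.nan] * 5}, index=range(1, 6))

mask = [False, True, False, True, True]
value = [1, 3, 2]

# A single indexing call on the original frame:
# rows where mask is True, column "A".
df.loc[mask, "A"] = value
print(df)
```

By contrast, df['A'][mask] = value chains two indexing operations, so the assignment may land on a temporary object instead of df.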

Related

Map only first occurrence of key/value match in dataframe

Is it possible to map only the first occurrence of key in a dataframe?
Ex:
testDict = {'A': 1, 'B': 2}
df
Name Num
A
A
B
B
Expected output
Name Num
A 1
A
B 2
B
Use duplicated to find the first occurrence and then map:
df['Num'] = df.Name[~df.Name.duplicated()].map(testDict)
print(df)
Output
Name Num
0 A 1.0
1 A NaN
2 B 2.0
3 B NaN
To remove the NaN values, if you wish, do:
df = df.fillna('')
Alternatively, map the result of drop_duplicates, assuming you have a unique Index for alignment. (It is probably best to keep the NaN values so the column remains numeric.)
df['Num'] = df['Name'].drop_duplicates().map(testDict)
Name Num
0 A 1.0
1 A NaN
2 B 2.0
3 B NaN
You can use duplicated and map:
df['Num'] = np.where(~df['Name'].duplicated(), df['Name'].map(testDict), '')
Output:
Name Num
0 A 1
1 A
2 B 2
3 B
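To check the first-occurrence logic on less regular data, here is a small sketch with a name that occurs three times and one that occurs only once. The frame and the extra key "C" are made up for illustration.

```python
import pandas as pd

test_dict = {"A": 1, "B": 2, "C": 3}

# "A" occurs three times, "B" twice, "C" once.
df = pd.DataFrame({"Name": ["A", "A", "A", "B", "C", "B"]})

# duplicated() with the default keep='first' flags every repeat,
# so its negation marks exactly the first occurrence of each name.
first = ~df["Name"].duplicated()

# Map only the first occurrences; the remaining rows align to NaN.
df["Num"] = df.loc[first, "Name"].map(test_dict)
print(df)
```

Keeping NaN in the unmatched rows leaves Num as a float column, which is usually easier to work with than a column mixing numbers and empty strings.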

Fill few missing values in python

I want to fill missing values of a specific column only if a condition is met.
e.g. A B
Nan 0
Nan 0
0 0
Nan 1
Nan 1
.....................
.....................
In the above case I want to fill Nan values in column A only when corresponding value in column B is 0. Rest values in A (with Nan) should not change.
Use mask with fillna:
df['A'] = df['A'].mask(df['B'] == 0, df['A'].fillna(3))
Alternatives with loc, numpy.where:
df.loc[df['B'] == 0, 'A'] = df['A'].fillna(3)
df['A'] = np.where(df['B'] == 0, df['A'].fillna(3), df['A'])
print (df)
A B
0 3.0 0
1 3.0 0
2 0.0 0
3 NaN 1
4 NaN 1
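np.where is a quick and simple solution. Note the parentheses around df['B'] == 0: & binds more tightly than ==, so without them the condition is evaluated incorrectly.
In [47]: df['A'] = np.where(np.isnan(df['A']) & (df['B'] == 0), 3, df['A'])
In [48]: df
Out[48]:
A B
0 3.0 0
1 3.0 0
2 0.0 0
3 NaN 1
4 NaN 1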
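When combining conditions for np.where, keep in mind that Python's & operator binds more tightly than ==, so each comparison should be wrapped in parentheses. A minimal sketch, assuming the five-row frame from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [np.nan, np.nan, 0.0, np.nan, np.nan],
                   "B": [0, 0, 0, 1, 1]})

# Parenthesize the comparison: fill only where A is missing AND B == 0.
cond = df["A"].isna() & (df["B"] == 0)

# Take 3 where cond holds, otherwise keep the existing value of A.
filled = np.where(cond, 3, df["A"])
print(filled)
```

The existing 0 in row 2 is left alone because A is not missing there.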
You could also loop over all elements, something like this:
for i in range(len(A)):
    if numpy.isnan(A[i]) and B[i] == 0:
        A[i] = value
There are nicer ways to implement these loops, but I don't know what structures you are using.

How to select a list of rows by name in Pandas dataframe

I am trying to extract rows from a Pandas dataframe using a list of row names, but it can't be done. Here is an example
# df
alleles chrom pos strand assembly# center protLSID assayLSID
rs#
TP3 A/C 0 3 + NaN NaN NaN NaN
TP7 A/T 0 7 + NaN NaN NaN NaN
TP12 T/A 0 12 + NaN NaN NaN NaN
TP15 C/A 0 15 + NaN NaN NaN NaN
TP18 C/T 0 18 + NaN NaN NaN NaN
test = ['TP3','TP12','TP18']
df.select(test)
This is what I tried with the elements of the list, and I get this error: TypeError: 'Index' object is not callable. What am I doing wrong?
You can use df.loc[['TP3','TP12','TP18']]
Here is a small example:
In [26]: df = pd.DataFrame({"a": [1,2,3], "b": [3,4,5], "c": [5,6,7]})
In [27]: df.index = ["x", "y", "z"]
In [28]: df
Out[28]:
a b c
x 1 3 5
y 2 4 6
z 3 5 7
[3 rows x 3 columns]
In [29]: df.loc[["x", "y"]]
Out[29]:
a b c
x 1 3 5
y 2 4 6
[2 rows x 3 columns]
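One caveat: in recent pandas versions, df.loc with a list of labels raises a KeyError if any label is absent from the index. The sketch below shows two ways to handle that, using a hypothetical list that contains one unknown label.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]}, index=["x", "y", "z"])

# Hypothetical request list; "missing" is not in the index.
wanted = ["x", "z", "missing"]

# Keep only rows whose label is in the list; unknown labels are ignored.
present = df[df.index.isin(wanted)]

# reindex follows the list order and adds an all-NaN row for unknown labels.
padded = df.reindex(wanted)

print(present)
print(padded)
```

Which of the two fits depends on whether a missing row name should be silently skipped or made visible in the result.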
There are at least 3 ways to access the elements of a pandas dataframe.
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.uniform(size=(10,10)),columns= list('PQRSTUVWXY'),index= list("ABCDEFGHIJ"))
Using df[['P','Q']] you can only access columns of the dataframe. To select rows as well, use dataframe.loc[] (label-based) or dataframe.iloc[] (integer-position-based), both of which support numpy-style slicing of the dataframe.
df.loc[:,['P','Q']]
Above will give you columns named by 'P' and 'Q'.
df.loc[['A','B'],:]
Above will return rows with keys 'A' and 'B'.
You can also select by integer position using the iloc indexer. Positions are zero-based.
df.iloc[:,[1,2]]
This will return the columns at positions 1 and 2 (here 'Q' and 'R'),
while
df.iloc[[1,2],:]
will return the rows at positions 1 and 2 (here 'B' and 'C').
You can access any specific element by
df.iloc[1,2]
or,
df.loc['A','Q']
You can select the rows by position:
df.iloc[[0,2,4], :]
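To compare label-based and position-based selection side by side, here is a short sketch on a small deterministic frame (made up for illustration, so the results are easy to verify by eye).

```python
import numpy as np
import pandas as pd

# Values 0..11 laid out row by row: A -> 0..3, B -> 4..7, C -> 8..11.
df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=list("ABC"), columns=list("PQRS"))

by_label = df.loc[["A", "C"], ["P", "Q"]]   # rows A and C, columns P and Q
by_position = df.iloc[[0, 2], [0, 1]]       # the same cells, by position
single = df.loc["B", "R"]                   # one scalar, by label

print(by_label)
print(single)
```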

Apply Across Dynamic Number of Columns

I have a pandas dataframe and I want to make the last N columns null values. N is dependent on the value in another column.
Here is an example:
df = pd.DataFrame(np.random.randn(4, 5))
df['lookup_key'] = df.index #(actual data does not use index here)
lkup_dict = {0:1,1:2,2:2,3:3}
In this DataFrame, I want to use the value in the 'lookup_key' column to determine which columns to set to null.
Row 0 -> df.ix[0,lkup_dict[0]:4] = np.nan #key = 0, value = 1
Row 1 -> df.ix[1,lkup_dict[1]:4] = np.nan #key = 1, value = 2
Row 2 -> df.ix[2,lkup_dict[2]:4] = np.nan #key = 2, value = 2
Row 3 -> df.ix[3,lkup_dict[3]:4] = np.nan #key = 3, value = 3
The end result looking like this:
0 1 2 3 4 lookup_key
0 -0.882864 NaN NaN NaN NaN 0
1 1.358663 -0.024898 NaN NaN NaN 1
2 0.885058 0.673621 NaN NaN NaN 2
3 -1.487506 0.031021 -1.313646 NaN NaN 3
In this example I have to manually type out the df.ix... line for each row. I need something that will do this for all rows of my DataFrame.
You can do this with a for loop. To demonstrate, I generate a DataFrame with some random values. I then insert a lookup_key column in the front with some random integers. Finally, I generate lkup_dict dictionary with some random values.
>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
>>> df.insert(0, 'lookup_key', np.random.randint(0, 5, 10))
>>> print(df)
lookup_key A B C D
0 0 0.048738 0.773304 -0.912366 -0.832459
1 3 -0.573221 -1.381395 -0.644223 1.888484
2 0 0.198043 -0.751243 0.138277 2.006188
3 2 -1.692605 -1.586282 -0.656690 0.647510
4 3 -0.847591 -0.368447 0.510250 -0.172055
5 1 0.927243 -0.447478 0.796221 0.372763
6 3 0.027285 0.177276 1.087456 -0.420614
7 4 -1.147004 -0.172367 -0.767347 -0.855318
8 1 -0.649695 -0.572409 -0.664149 0.863050
9 4 -0.820982 -0.499889 -0.624889 1.397271
>>> lkup_dict = {i: np.random.randint(0, 5) for i in range(5)}
>>> print(lkup_dict)
{0: 3, 1: 0, 2: 0, 3: 4, 4: 1}
Now I iterate over the rows in the DataFrame. key gets the value under the lookup_key column for that row. nNulls uses the key to get the number of null values from lkup_dict. startIndex gets the index for the first column with a null value in that row. The final line replaces the relevant values with null values.
>>> for i, row in df.iterrows():
...     key = row['lookup_key'].astype(int)
...     nNulls = lkup_dict[key]
...     startIndex = df.shape[1] - nNulls
...     # positional slice; i equals the row position for the default index
...     df.iloc[i, startIndex:] = np.nan
...
>>> print(df)
lookup_key A B C D
0 0 0.048738 NaN NaN NaN
1 3 NaN NaN NaN NaN
2 0 0.198043 NaN NaN NaN
3 2 -1.692605 -1.586282 -0.656690 0.647510
4 3 NaN NaN NaN NaN
5 1 0.927243 -0.447478 0.796221 0.372763
6 3 NaN NaN NaN NaN
7 4 -1.147004 -0.172367 -0.767347 NaN
8 1 -0.649695 -0.572409 -0.664149 0.863050
9 4 -0.820982 -0.499889 -0.624889 NaN
That's it. Hopefully that's what you're looking for.
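For larger frames, the same idea can be expressed without a Python-level loop by broadcasting a per-row start position against the column positions. This is only a sketch under assumptions: it uses the 4x5 layout and lkup_dict from the question, keeps the lookup keys in a separate Series rather than as a column, and treats the dictionary values as the first column position to blank, as the df.ix lines in the question do.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((4, 5)))

# Assumption: lookup keys held separately from the data columns.
lookup_key = pd.Series([0, 1, 2, 3], index=df.index)
lkup_dict = {0: 1, 1: 2, 2: 2, 3: 3}

# Start position per row, shaped (n_rows, 1) so it broadcasts over columns.
start = lookup_key.map(lkup_dict).to_numpy()[:, None]

# True where the column position is at or past that row's start position.
to_null = np.arange(df.shape[1]) >= start

# DataFrame.mask replaces values with NaN where the condition is True.
result = df.mask(to_null)
print(result)
```

The boolean array is built once for the whole frame, so the cost no longer grows with a per-row Python loop.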

Permute Matrix while keeping some items in place

I've got a numpy array (actually a pandas Data Frame, but the array will do) whose values I would like to permute. The catch is that there are a number of non-randomly positioned NaN's that I'd need to keep in place. So far I have an iterative solution involving populating a list of indices, making a permuted copy of that list and then assigning values from the original matrix from the original index to the permuted index. Any suggestions on how to do this more quickly? The matrix has millions of values and optimally I'd like to do many permutations but it's prohibitively slow with the iterative solution.
Here's the iterative solution:
import numpy, pandas
df = pandas.DataFrame(numpy.random.randn(3,3), index=list("ABC"), columns=list("abc"))
# rows are labeled "A".."C", so select them by label
df.loc[["A","C"], "a"] = numpy.nan
indices = []
for row in df.index:
    for col in df.columns:
        if not numpy.isnan(df.loc[row, col]):
            indices.append((row, col))
permutedIndices = numpy.random.permutation(indices)
permuteddf = pandas.DataFrame(index=df.index, columns=df.columns)
for i in range(len(indices)):
    permuteddf.loc[permutedIndices[i][0], permutedIndices[i][1]] = df.loc[indices[i][0], indices[i][1]]
With results:
In [19]: df
Out[19]:
a b c
A NaN 0.816350 -1.187731
B -0.58708 -1.054487 -1.570801
C NaN -0.290624 -0.453697
In [20]: permuteddf
Out[20]:
a b c
A NaN -0.290624 0.8163501
B -1.570801 -0.4536974 -1.054487
C NaN -0.5870797 -1.187731
How about:
>>> df = pd.DataFrame(np.random.randn(5,5))
>>> df[df < 0.1] = np.nan
>>> df
0 1 2 3 4
0 NaN 1.721657 0.446694 NaN 0.747747
1 1.178905 0.931979 NaN NaN NaN
2 1.547098 NaN NaN NaN 0.225014
3 NaN NaN NaN 0.886416 0.922250
4 0.453913 0.653732 NaN 1.013655 NaN
[5 rows x 5 columns]
>>> movers = ~np.isnan(df.values)
>>> df.values[movers] = np.random.permutation(df.values[movers])
>>> df
0 1 2 3 4
0 NaN 1.013655 1.547098 NaN 1.721657
1 0.886416 0.446694 NaN NaN NaN
2 1.178905 NaN NaN NaN 0.453913
3 NaN NaN NaN 0.747747 0.653732
4 0.922250 0.225014 NaN 0.931979 NaN
[5 rows x 5 columns]
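Writing through df.values relies on it being a writable view of the frame's data, which is not guaranteed (for example, with mixed column dtypes it is a copy). A more defensive sketch of the same technique works on an explicit copy of the array and builds a new frame from it; the seed and the 5x5 shape are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((5, 5)))
df[df < 0.1] = np.nan

# Work on an explicit copy so the original frame is left untouched.
arr = df.to_numpy(copy=True)

# Positions that hold real values; the NaN positions stay fixed.
movers = ~np.isnan(arr)

# Shuffle only the non-NaN values among the non-NaN positions.
arr[movers] = rng.permutation(arr[movers])

permuted = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(permuted)
```

For repeated permutations, the movers mask can be computed once and reused, with only the copy and the permutation step repeated.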
