I have a pandas dataframe and I want to make the last N columns null values. N is dependent on the value in another column.
Here is an example:
df = pd.DataFrame(np.random.randn(4, 5))
df['lookup_key'] = df.index #(actual data does not use index here)
lkup_dict = {0:1,1:2,2:2,3:3}
In this DataFrame, I want to use the value in the 'lookup_key' column to determine which columns to set to null.
Row 0 -> df.loc[0, lkup_dict[0]:4] = np.nan #key = 0, value = 1
Row 1 -> df.loc[1, lkup_dict[1]:4] = np.nan #key = 1, value = 2
Row 2 -> df.loc[2, lkup_dict[2]:4] = np.nan #key = 2, value = 2
Row 3 -> df.loc[3, lkup_dict[3]:4] = np.nan #key = 3, value = 3
The end result looks like this:
          0         1         2   3   4  lookup_key
0 -0.882864       NaN       NaN NaN NaN           0
1  1.358663 -0.024898       NaN NaN NaN           1
2  0.885058  0.673621       NaN NaN NaN           2
3 -1.487506  0.031021 -1.313646 NaN NaN           3
In this example I have to type out the assignment manually for each row. I need something that will do this for all rows of my DataFrame.
You can do this with a for loop. To demonstrate, I generate a DataFrame with some random values. I then insert a lookup_key column in the front with some random integers. Finally, I generate lkup_dict dictionary with some random values.
>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
>>> df.insert(0, 'lookup_key', np.random.randint(0, 5, 10))
>>> print(df)
lookup_key A B C D
0 0 0.048738 0.773304 -0.912366 -0.832459
1 3 -0.573221 -1.381395 -0.644223 1.888484
2 0 0.198043 -0.751243 0.138277 2.006188
3 2 -1.692605 -1.586282 -0.656690 0.647510
4 3 -0.847591 -0.368447 0.510250 -0.172055
5 1 0.927243 -0.447478 0.796221 0.372763
6 3 0.027285 0.177276 1.087456 -0.420614
7 4 -1.147004 -0.172367 -0.767347 -0.855318
8 1 -0.649695 -0.572409 -0.664149 0.863050
9 4 -0.820982 -0.499889 -0.624889 1.397271
>>> lkup_dict = {i: np.random.randint(0, 5) for i in range(5)}
>>> print(lkup_dict)
{0: 3, 1: 0, 2: 0, 3: 4, 4: 1}
Now I iterate over the rows in the DataFrame. key gets the value under the lookup_key column for that row. nNulls uses the key to get the number of null values from lkup_dict. startIndex gets the position of the first column with a null value in that row. The final line replaces the relevant values with null values; it uses iloc because startIndex is a column position rather than a label, and i can serve as the row position here because the DataFrame has the default integer index.
>>> for i, row in df.iterrows():
...     key = int(row['lookup_key'])
...     nNulls = lkup_dict[key]
...     startIndex = df.shape[1] - nNulls
...     df.iloc[i, startIndex:] = np.nan
...
>>> print(df)
lookup_key A B C D
0 0 0.048738 NaN NaN NaN
1 3 NaN NaN NaN NaN
2 0 0.198043 NaN NaN NaN
3 2 -1.692605 -1.586282 -0.656690 0.647510
4 3 NaN NaN NaN NaN
5 1 0.927243 -0.447478 0.796221 0.372763
6 3 NaN NaN NaN NaN
7 4 -1.147004 -0.172367 -0.767347 NaN
8 1 -0.649695 -0.572409 -0.664149 0.863050
9 4 -0.820982 -0.499889 -0.624889 NaN
That's it. Hopefully that's what you're looking for.
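For larger frames, the row-by-row loop can be replaced by one masked assignment. The following is only a minimal sketch, not the method of the answer above. It assumes the convention of the question, where lkup_dict maps each key to the first column position to blank out; value_cols, start and keep are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((4, 5)))
df["lookup_key"] = df.index
lkup_dict = {0: 1, 1: 2, 2: 2, 3: 3}

# Columns that may be blanked out (assumed: every column except lookup_key)
value_cols = [0, 1, 2, 3, 4]

# First column position to set to NaN, one entry per row
start = df["lookup_key"].map(lkup_dict).to_numpy()

# keep[r, c] is True where column position c lies before that row's start
keep = np.arange(len(value_cols)) < start[:, None]

# where() keeps values where the mask is True and writes NaN elsewhere
df[value_cols] = df[value_cols].where(keep)
print(df)
```

The mask is built with NumPy broadcasting, so no Python-level loop over rows is needed.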
I have a df that looks like this:
|ID|PREVIOUS |CURRENT|NEXT|
|--| --- | --- |---|
|1||A||
|1||B||
|2||C||
|2||D||
|2||E||
|2||F||
|3||G||
|4||H||
|4||I||
I want it to populate PREVIOUS and NEXT columns like this:
|ID|PREVIOUS |CURRENT|NEXT|
|--| --- | --- |---|
|1|nan|A|B|
|1|A|B|nan|
|2|nan|C|D|
|2|C|D|E|
|2|D|E|F|
|2|E|F|nan|
|3|nan|G|nan|
|4|nan|H|I|
|4|H|I|nan|
So for each unique ID I want to populate the PREVIOUS and NEXT columns based on the values of the CURRENT column.
So far I have figured out how to do it if the df had only one ID (except for the case where there is no PREVIOUS or NEXT, i.e. ID=3), but I am struggling to generalize it to multiple IDs.
for i in range(0,len(df)):
    if i==0:
        df["PREVIOUS"].iloc[i] = str(np.NaN)
        df["NEXT"].iloc[i] = df["CURRENT"].iloc[i+1]
    if i == (len(df)-1):
        df["NEXT"].iloc[i] = str(np.NaN)
        df["PREVIOUS"].iloc[i] = df["CURRENT"].iloc[i-1]
    if (i > 0) and (i < (len(df)-1)):
        df["PREVIOUS"].iloc[i] = df["CURRENT"].iloc[i-1]
        df["NEXT"].iloc[i] = df["CURRENT"].iloc[i+1]
I am guessing it should employ a groupby and size(), but so far I couldn't achieve the result I wanted.
This should do what your question asks:
import pandas as pd
import numpy as np
from collections import defaultdict
df = pd.DataFrame({'ID':[1,1,2,2,2,2,3,4,4], 'CURRENT':list('ABCDEFGHI')})
print(df)
# collect the CURRENT values of each ID, in row order
valById = defaultdict(list)
df.apply(lambda x: valById[x['ID']].append(x['CURRENT']), axis = 1)
# rebuild the frame: PREVIOUS/NEXT are the neighbours within the same ID
df = pd.DataFrame([
    {'ID': k,
     'PREVIOUS': v[i-1] if i else np.nan,
     'CURRENT': v[i],
     'NEXT': v[i+1] if i+1 < len(v) else np.nan}
    for k, v in valById.items() for i in range(len(v))])
print(df)
Output:
ID CURRENT
0 1 A
1 1 B
2 2 C
3 2 D
4 2 E
5 2 F
6 3 G
7 4 H
8 4 I
ID PREVIOUS CURRENT NEXT
0 1 NaN A B
1 1 A B NaN
2 2 NaN C D
3 2 C D E
4 2 D E F
5 2 E F NaN
6 3 NaN G NaN
7 4 NaN H I
8 4 H I NaN
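The question guessed that a groupby might be involved. As an alternative to rebuilding the frame, here is a minimal sketch that shifts CURRENT within each ID group; it assumes the rows of each ID are already in the desired order, and grouped is a name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 1, 2, 2, 2, 2, 3, 4, 4],
                   "CURRENT": list("ABCDEFGHI")})

# Shifting inside each ID group never pulls in a value from another ID;
# the first/last row of every group gets a missing value instead
grouped = df.groupby("ID")["CURRENT"]
df["PREVIOUS"] = grouped.shift(1)
df["NEXT"] = grouped.shift(-1)

# Reorder the columns to match the layout asked for in the question
df = df[["ID", "PREVIOUS", "CURRENT", "NEXT"]]
print(df)
```

This keeps the original row order and index, and a single-row group such as ID=3 ends up with missing values on both sides.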
I have a large pandas dataframe, consisting of a different number of columns throughout the dataframe.
Here is an example: Current dataframe example
I would like to split the dataframe into multiple dataframes, based on the number of columns it has.
Example output image here: Output image
Thanks.
If you have a dataframe of, say, 10 columns and you want to put the records with 3 NaN values in a different result dataframe than those with 1 NaN, you can do this as follows:
# evaluate the number of NaNs per row
num_counts = df.isna().sum(axis='columns')
# group by this number and add the grouped
# dataframe to a dictionary
results = dict()
for key, sub_df in df.groupby(num_counts):
    results[key] = sub_df
After executing this code, results contains subsets of df where each subset contains the same number of NaNs (so the same number of non-NaNs).
If you want to write your results to an Excel file, you just need to execute the following code:
with pd.ExcelWriter('sorted_output.xlsx') as writer:
    for key, sub_df in results.items():
        # if you want to avoid the detour of using dictionaries
        # just replace the previous line by
        # for key, sub_df in df.groupby(num_counts):
        sub_df.to_excel(
            writer,
            sheet_name=f'missing {key}',
            na_rep='',
            inf_rep='inf',
            float_format=None,
            index=True,
            index_label=True,
            header=True)
Example:
# create an example dataframe
df=pd.DataFrame(dict(a=[1, 2, 3, 4, 5, 6], b=list('abbcac')))
df.loc[[2, 4, 5], 'c']= list('xyz')
df.loc[[2, 3, 4], 'd']= list('vxw')
df.loc[[1, 2], 'e']= list('qw')
It looks like this:
Out[58]:
a b c d e
0 1 a NaN NaN NaN
1 2 b NaN NaN q
2 3 b x v w
3 4 c NaN x NaN
4 5 a y w NaN
5 6 c z NaN NaN
If you execute the code above on this dataframe, you get a dictionary with the following content:
0: a b c d e
2 3 b x v w
1: a b c d e
4 5 a y w NaN
2: a b c d e
1 2 b NaN NaN q
3 4 c NaN x NaN
5 6 c z NaN NaN
3: a b c d e
0 1 a NaN NaN NaN
The keys of the dictionary are the number of NaNs in the row and the values are the dataframes which contain only rows with that number of NaNs in them.
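Since the question asks for frames with different numbers of columns, each subset can additionally have its completely empty columns dropped. This is a sketch under that assumption, not part of the answer above; it rebuilds the example frame explicitly, and nan_counts is a name introduced here.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": list("abbcac"),
    "c": [nan, nan, "x", nan, "y", "z"],
    "d": [nan, nan, "v", "x", "w", nan],
    "e": [nan, "q", "w", nan, nan, nan],
})

# Number of NaNs in each row
nan_counts = df.isna().sum(axis="columns")

# One frame per NaN count; columns that are entirely NaN
# within a subset are removed from that subset
results = {
    int(key): sub_df.dropna(axis="columns", how="all")
    for key, sub_df in df.groupby(nan_counts)
}

for key, sub_df in results.items():
    print(key, list(sub_df.columns), list(sub_df.index))
```

A column is only dropped when every row of the subset lacks it, so a subset whose rows miss different columns keeps all of them.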
If I'm getting you right, what you want to do is split one existing dataframe with n columns into ceil(n/5) dataframes, each with 5 columns, except the last one, which holds the remaining columns when n is not a multiple of 5.
If that's the case this will do the trick:
import pandas as pd
import math
max_cols=5
dt={"a": [1,2,3], "b": [6,5,3], "c": [8,4,2], "d": [8,4,0], "e": [1,9,5], "f": [9,7,9]}
df=pd.DataFrame(data=dt)
dfs=[df[df.columns[max_cols*i:max_cols*i+max_cols]] for i in range(math.ceil(len(df.columns)/max_cols))]
for el in dfs:
    print(el)
And output:
a b c d e
0 1 6 8 8 1
1 2 5 4 4 9
2 3 3 2 0 5
f
0 9
1 7
2 9
I am trying to extract rows from a Pandas dataframe using a list of row names, but I can't get it to work. Here is an example:
# df
alleles chrom pos strand assembly# center protLSID assayLSID
rs#
TP3 A/C 0 3 + NaN NaN NaN NaN
TP7 A/T 0 7 + NaN NaN NaN NaN
TP12 T/A 0 12 + NaN NaN NaN NaN
TP15 C/A 0 15 + NaN NaN NaN NaN
TP18 C/T 0 18 + NaN NaN NaN NaN
test = ['TP3','TP12','TP18']
df.select(test)
This is what I was trying to do with the elements of the list, and I am getting this error: TypeError: 'Index' object is not callable. What am I doing wrong?
You can use df.loc[['TP3','TP12','TP18']]
Here is a small example:
In [26]: df = pd.DataFrame({"a": [1,2,3], "b": [3,4,5], "c": [5,6,7]})
In [27]: df.index = ["x", "y", "z"]
In [28]: df
Out[28]:
a b c
x 1 3 5
y 2 4 6
z 3 5 7
[3 rows x 3 columns]
In [29]: df.loc[["x", "y"]]
Out[29]:
a b c
x 1 3 5
y 2 4 6
[2 rows x 3 columns]
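One caveat worth sketching: in recent pandas versions, df.loc with a list raises a KeyError if any label in the list is missing from the index. A minimal sketch of a tolerant selection follows; the wanted list, including its absent label, is made up here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5], "c": [5, 6, 7]},
                  index=["x", "y", "z"])

# Hypothetical list of row names; the last one is not in the index
wanted = ["x", "z", "not_there"]

# isin() builds a boolean mask over the index, so absent labels are
# silently ignored instead of raising
present = df.loc[df.index.isin(wanted)]
print(present)
```

Note that the mask returns rows in the order of the dataframe, not in the order of the list.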
There are at least 3 ways to access the elements of a pandas dataframe.
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.uniform(size=(10,10)),columns= list('PQRSTUVWXY'),index= list("ABCDEFGHIJ"))
Using df[['P','Q']] you can only access the columns of the dataframe. For NumPy-style slicing of rows and columns, you can use dataframe.loc[] (label-based location) or dataframe.iloc[] (integer location).
df.loc[:,['P','Q']]
Above will give you columns named by 'P' and 'Q'.
df.loc[['A','B'],:]
Above will return rows with keys 'A' and 'B'.
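You can also use position-based slicing with the iloc method.
df.iloc[:,[1,2]]
This will return the columns at positions 1 and 2, i.e. the second and third columns ('Q' and 'R'), because positions start at 0.
While,
df.iloc[[1,2],:]
will return the rows at positions 1 and 2, i.e. the second and third rows ('B' and 'C').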
You can access any specific element by
df.iloc[1,2]
or,
df.loc['A','Q']
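To make the label/position correspondence concrete, here is a small sketch that uses the same row and column names as above but fills the frame with consecutive integers (an assumption made here so the selected values are easy to check).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(100).reshape(10, 10),
                  columns=list("PQRSTUVWXY"),
                  index=list("ABCDEFGHIJ"))

# Label-based: rows 'A' and 'B', columns 'P' and 'Q'
by_label = df.loc[["A", "B"], ["P", "Q"]]

# Position-based: the same block, addressed by 0-based positions
by_position = df.iloc[[0, 1], [0, 1]]

# A single cell: row 'B' is position 1, column 'R' is position 2
cell_by_label = df.loc["B", "R"]
cell_by_position = df.iloc[1, 2]

print(by_label.equals(by_position), cell_by_label, cell_by_position)
```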
You can select the rows by position:
df.iloc[[0,2,4], :]
I am trying to add a column to a pandas dataframe, like so:
df = pd.DataFrame()
df['one'] = pd.Series({'1':4, '2':6})
print (df)
df['two'] = pd.Series({'0':4, '2':6})
print (df)
This yields:
one two
1 4 NaN
2 6 6
However, I would like the result to be:
one two
0 NaN 4
1 4 NaN
2 6 6
How do you do that?
One possibility is to use pd.concat:
ser1 = pd.Series({'1':4, '2':6})
ser2 = pd.Series({'0':4, '2':6})
df = pd.concat((ser1, ser2), axis=1)
to get
0 1
0 NaN 4
1 4 NaN
2 6 6
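The concat result above has the default column names 0 and 1. A minimal sketch, assuming you want the columns named 'one' and 'two' as in the question, is to pass keys to pd.concat. The sort_index() call is a precaution added here, because depending on the pandas version the combined index may come back in order of appearance rather than sorted.

```python
import pandas as pd

ser1 = pd.Series({"1": 4, "2": 6})
ser2 = pd.Series({"0": 4, "2": 6})

# keys= supplies the column names; sort_index() orders the row labels
df = pd.concat([ser1, ser2], axis=1, keys=["one", "two"]).sort_index()
print(df)
```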
You can use join, telling pandas exactly how you want to do it:
df = pd.DataFrame()
df['one'] = pd.Series({'1':4, '2':6})
df.join(pd.Series({'0':4, '2':6}, name = 'two'), how = 'outer')
This results in
one two
0 NaN 4
1 4 NaN
2 6 6
I've got a numpy array (actually a pandas DataFrame, but the array will do) whose values I would like to permute. The catch is that there are a number of non-randomly positioned NaNs that I need to keep in place. So far I have an iterative solution: populate a list of indices, make a permuted copy of that list, and then assign each value of the original matrix from its original index to the permuted index. Any suggestions on how to do this more quickly? The matrix has millions of values, and ideally I'd like to do many permutations, but the iterative solution is prohibitively slow.
Here's the iterative solution:
import numpy, pandas
df = pandas.DataFrame(numpy.random.randn(3,3), index=list("ABC"), columns=list("abc"))
# the index labels are strings, so the rows must be selected by label
df.loc[["A","C"], "a"] = numpy.nan
# collect the (row, column) labels of every non-NaN cell
indices = []
for row in df.index:
    for col in df.columns:
        if not numpy.isnan(df.loc[row, col]):
            indices.append((row, col))
# shuffle the cell labels and move each value to its shuffled cell
permutedIndices = numpy.random.permutation(indices)
permuteddf = pandas.DataFrame(index=df.index, columns=df.columns)
for i in range(len(indices)):
    permuteddf.loc[permutedIndices[i][0], permutedIndices[i][1]] = df.loc[indices[i][0], indices[i][1]]
With results:
In [19]: df
Out[19]:
a b c
A NaN 0.816350 -1.187731
B -0.58708 -1.054487 -1.570801
C NaN -0.290624 -0.453697
In [20]: permuteddf
Out[20]:
a b c
A NaN -0.290624 0.8163501
B -1.570801 -0.4536974 -1.054487
C NaN -0.5870797 -1.187731
How about:
>>> df = pd.DataFrame(np.random.randn(5,5))
>>> df[df < 0.1] = np.nan
>>> df
0 1 2 3 4
0 NaN 1.721657 0.446694 NaN 0.747747
1 1.178905 0.931979 NaN NaN NaN
2 1.547098 NaN NaN NaN 0.225014
3 NaN NaN NaN 0.886416 0.922250
4 0.453913 0.653732 NaN 1.013655 NaN
[5 rows x 5 columns]
>>> movers = ~np.isnan(df.values)
>>> df.values[movers] = np.random.permutation(df.values[movers])
>>> df
0 1 2 3 4
0 NaN 1.013655 1.547098 NaN 1.721657
1 0.886416 0.446694 NaN NaN NaN
2 1.178905 NaN NaN NaN 0.453913
3 NaN NaN NaN 0.747747 0.653732
4 0.922250 0.225014 NaN 0.931979 NaN
[5 rows x 5 columns]
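Assigning through df.values relies on it returning a writable view of the frame's data, which may not hold for every pandas version or for frames with mixed dtypes. A hedged variant of the same idea works on an explicit copy and builds a new frame; permute_keep_nan is a helper name invented here, and the sketch assumes all columns are numeric.

```python
import numpy as np
import pandas as pd

def permute_keep_nan(df, rng):
    """Return a copy of df with its non-NaN values shuffled; NaNs stay put."""
    values = df.to_numpy(dtype=float, copy=True)
    movers = ~np.isnan(values)  # True wherever a real value sits
    values[movers] = rng.permutation(values[movers])
    return pd.DataFrame(values, index=df.index, columns=df.columns)

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((5, 5)))
df[df < 0.1] = np.nan

shuffled = permute_keep_nan(df, rng)
print(shuffled)
```

Because the original frame is left untouched, the function can be called repeatedly to draw many independent permutations.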