I am looking for a function that achieves the following. It is best shown in an example. Consider:
pd.DataFrame([[1, 2, 3], [4, 5, np.nan]], columns=['x', 'y1', 'y2'])
which looks like:
x y1 y2
0 1 2 3
1 4 5 NaN
I would like to collapse the y1 and y2 columns, lengthening the DataFrame where necessary, so that the output is:
x y
0 1 2
1 1 3
2 4 5
That is, one row for each pairing of x with y1 or with y2, dropping the pairs where the y value is missing. I am looking for a function that does this relatively efficiently, as I have multiple y columns and many rows.
You can use stack to get this done, i.e.
pd.DataFrame(df.set_index('x').stack().reset_index(level=0).values, columns=['x', 'y'])
x y
0 1.0 2.0
1 1.0 3.0
2 4.0 5.0
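For reference, here is a sketch of the same reshape using melt (assuming every column except x is a y column). Unlike stack, melt keeps the NaN pairs, so they are dropped explicitly:
out = (df.melt(id_vars='x', value_name='y')  # one row per (x, y-column) pair
         .dropna(subset=['y'])               # remove the pairs where y was NaN
         .drop(columns='variable')           # the original column names are not needed
         .sort_values('x', kind='stable', ignore_index=True))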
Repeat each item in the first column according to the count of non-null values in its row. Then simply build the final dataframe from the remaining non-null values in the other columns. You can use the DataFrame.count() method to count non-null values and numpy.repeat() to repeat an array according to a respective count array.
>>> rest = df.loc[:, 'y1':]
>>> pd.DataFrame({'x': np.repeat(df['x'], rest.count(1)).values,
...               'y': rest.values[rest.notna()]})
Demo:
>>> df
x y1 y2 y3 y4
0 1 2.0 3.0 NaN 6.0
1 4 5.0 NaN 9.0 3.0
2 10 NaN NaN NaN NaN
3 9 NaN NaN 6.0 NaN
4 7 6.0 NaN NaN NaN
>>> rest = df.loc[:, 'y1':]
>>> pd.DataFrame({'x': np.repeat(df['x'], rest.count(1)).values,
...               'y': rest.values[rest.notna()]})
x y
0 1 2.0
1 1 3.0
2 1 6.0
3 4 5.0
4 4 9.0
5 4 3.0
6 9 6.0
7 7 6.0
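Note that rows whose y values are all NaN (like x = 10 above) simply contribute no rows to the output.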
Here's one based on NumPy, as you were looking for performance -
def gather_columns(df):
    # mask selecting the y* columns
    col_mask = [i.startswith('y') for i in df.columns]
    ally_vals = df.iloc[:, col_mask].values
    # True where a y value is present (not NaN)
    y_valid_mask = ~np.isnan(ally_vals)
    # number of valid y values per row -> how many times to repeat each x
    reps = np.count_nonzero(y_valid_mask, axis=1)
    x_vals = np.repeat(df.x.values, reps)
    y_vals = ally_vals[y_valid_mask]
    return pd.DataFrame({'x': x_vals, 'y': y_vals})
Sample run -
In [78]: df #(added more cols for variety)
Out[78]:
x y1 y2 y5 y7
0 1 2 3.0 NaN NaN
1 4 5 NaN 6.0 7.0
In [79]: gather_columns(df)
Out[79]:
x y
0 1 2.0
1 1 3.0
2 4 5.0
3 4 6.0
4 4 7.0
If the y columns always start at the second column and run to the end, we can simply slice the dataframe and get a further performance boost, like so -
def gather_columns_v2(df):
    ally_vals = df.iloc[:, 1:].values  # assumes x is the first column
    y_valid_mask = ~np.isnan(ally_vals)
    reps = np.count_nonzero(y_valid_mask, axis=1)
    x_vals = np.repeat(df.x.values, reps)
    y_vals = ally_vals[y_valid_mask]
    return pd.DataFrame({'x': x_vals, 'y': y_vals})
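As a quick sanity check on the sample above (where x is indeed the first column), the two versions agree -
In [80]: gather_columns(df).equals(gather_columns_v2(df))
Out[80]: True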
I have a slightly specific problem.
A pandas DataFrame with, let's say, 6 columns. Each column has its own set of values that need to be looked up and removed / updated with a new value.
The lookup list would look like this:
lookup_values_per_column = [[-99999], [9999], [99, 98],[9],[99],[996, 997, 998, 999]]
Now, what I want to do is: look at column 1 of the dataframe and check if -99999 is present; if yes, remove / update each instance with a fixed value (let's say NaN).
Then we move to the next column, check for all 9999 values, and update them with NaN as well.
If we don't find a match, we just leave the column as it is.
I couldn't find a solution, and I guess it's not so hard anyway.
We can use DataFrame.replace with a dictionary built from the list and columns:
df = df.replace(
    to_replace=dict(zip(df.columns, lookup_values_per_column)),
    value=np.nan
)
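For clarity, the dictionary passed to to_replace simply pairs each column name with its own lookup list (shown here with the A-F columns from the setup below), so each column is only searched for its own values:
>>> dict(zip(df.columns, lookup_values_per_column))
{'A': [-99999], 'B': [9999], 'C': [99, 98], 'D': [9], 'E': [99], 'F': [996, 997, 998, 999]}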
Sample output:
A B C D E F
0 4.0 1.0 NaN 2.0 3.0 NaN
1 NaN 3.0 2.0 NaN 4.0 NaN
2 1.0 4.0 NaN 1.0 1.0 NaN
3 2.0 2.0 3.0 3.0 2.0 1.0
4 3.0 NaN 1.0 4.0 NaN NaN
Setup Used:
from random import sample, seed
from string import ascii_uppercase
import numpy as np
import pandas as pd
lookup_values_per_column = [
[-99999], [9999], [99, 98], [9], [99], [996, 997, 998, 999]
]
df_len = max(map(len, lookup_values_per_column)) + 1
seed(10)
df = pd.DataFrame({
k: sample(v + list(range(1, df_len + 1 - len(v))), df_len)
for k, v in
zip(ascii_uppercase, lookup_values_per_column)
})
df:
A B C D E F
0 4 1 99 2 3 997
1 -99999 3 2 9 4 998
2 1 4 98 1 1 999
3 2 2 3 3 2 1
4 3 9999 1 4 99 996
I have a data frame like
df = pd.DataFrame({"A":[1,np.nan,5],"B":[np.nan,10,np.nan], "C":[2,3,np.nan]})
A B C
0 1 NaN 5
1 NaN 10 NaN
2 2 3 NaN
I want to left shift all the values to occupy the nulls. Desired output:
A B C
0 1 5 NaN
1 10 NaN NaN
2 2 3 NaN
I tried doing this using a chain of df['A'].fillna(df['B'].fillna(df['C'])), but in my actual data there are more than 100 columns. Is there a better way to do this?
Let us do
out = df.T.apply(lambda x: sorted(x, key=pd.isnull)).T  # stable sort: non-nulls keep their order and move to the front
Out[41]:
A B C
0 1.0 5.0 NaN
1 10.0 NaN NaN
2 2.0 3.0 NaN
I also figured out another way to do this without the sort:
def shift_null(arr):
    # x == x is False only for NaN, so keep the non-null values (in order)
    # and pad the row with NaNs on the right
    return [x for x in arr if x == x] + [np.nan for x in arr if x != x]
out = df.T.apply(lambda arr: shift_null(arr)).T
This was faster for big dataframes.
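For large dataframes, here is a vectorized sketch of the same idea (shift_null_np is my name for it, and it assumes all columns are numeric): a stable argsort of the NaN mask moves the non-null values to the left while preserving their order.
def shift_null_np(df):
    a = df.to_numpy(dtype=float)
    # False sorts before True; a stable sort keeps the non-null values in order
    order = np.argsort(np.isnan(a), axis=1, kind='stable')
    return pd.DataFrame(np.take_along_axis(a, order, axis=1),
                        index=df.index, columns=df.columns)
out = shift_null_np(df)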
I have a pd.DataFrame that looks like this:
key_value a b c d e
value_01 1 10 x NaN NaN
value_01 NaN 12 NaN NaN NaN
value_01 NaN 7 NaN NaN NaN
value_02 7 4 y NaN NaN
value_02 NaN 5 NaN NaN NaN
value_02 NaN 6 NaN NaN NaN
value_03 19 15 z NaN NaN
So now, based on the key_value:
For columns 'a' and 'c', I want to copy the last seen value from the same column down through the rows that share the key_value.
For column 'd', I want to copy the value of column 'b' from row i - 1 into the i-th cell of column 'd'.
Lastly, for column 'e', I want the i-th cell to hold the running sum of the column 'b' values from the previous rows of the same key_value.
For every key_value, the columns 'a', 'b' and 'c' have some value in their first row, based on which the next values are copied over or computed. The desired output:
key_value a b c d e
value_01 1 10 x NaN NaN
value_01 1 12 x 10 10
value_01 1 7 x 12 22
value_02 7 4 y NaN NaN
value_02 7 5 y 4 4
value_02 7 6 y 5 9
value_03 19 15 z NaN NaN
My current approach:
size = df.key_value.size
for i in range(size):
    if pd.isna(df.a[i]) and df.key_value[i] == df.key_value[i - 1]:
        df.a[i] = df.a[i - 1]
        df.c[i] = df.c[i - 1]
        df.d[i] = df.b[i - 1]
        df.e[i] = df.e[i] + df.b[i - 1]
For columns like 'a' and 'c', the NaN values are all at the same row indexes.
My approach works but takes very long, since my dataframe has over 50000 records. I was wondering if there is a different way to do this, since I have multiple columns like 'a' and 'c' where values need to be copied over based on 'key_value', and some columns where the values are computed from another column like 'b'.
pd.concat with groupby and assign
pd.concat([
    g.ffill().assign(d=lambda d: d.b.shift(), e=lambda d: d.d.cumsum())
    for _, g in df.groupby('key_value')
])
key_value a b c d e
0 value_01 1.0 1 x NaN NaN
1 value_01 1.0 2 x 1.0 1.0
2 value_01 1.0 3 x 2.0 3.0
3 value_02 7.0 4 y NaN NaN
4 value_02 7.0 5 y 4.0 4.0
5 value_02 7.0 6 y 5.0 9.0
6 value_03 19.0 7 z NaN NaN
groupby and apply
def h(g):
    return g.ffill().assign(
        d=lambda d: d.b.shift(), e=lambda d: d.d.cumsum())
df.groupby('key_value', as_index=False, group_keys=False).apply(h)
You can use groupby + ffill for the groupwise filling; the shift and cumsum steps also need to be applied per group so that values do not leak across key_values.
In general, note that many common operations have been implemented efficiently in Pandas.
g = df.groupby('key_value')
df['a'] = g['a'].ffill()
df['c'] = g['c'].ffill()
df['d'] = g['b'].shift()
df['e'] = df.groupby('key_value')['d'].cumsum()
print(df)
  key_value     a  b  c    d    e
0  value_01   1.0  1  x  NaN  NaN
1  value_01   1.0  2  x  1.0  1.0
2  value_01   1.0  3  x  2.0  3.0
3  value_02   7.0  4  y  NaN  NaN
4  value_02   7.0  5  y  4.0  4.0
5  value_02   7.0  6  y  5.0  9.0
6  value_03  19.0  7  z  NaN  NaN
I'm working with a huge dataframe in Python, and sometimes I need to add an empty row, or several rows, at a definite position in the dataframe. For this question I create a small dataframe df in order to show what I want to achieve.
df = pd.DataFrame(np.random.randint(10, size=(3, 3)), columns=['A', 'B', 'C'])
   A  B  C
0  4  5  2
1  6  7  0
2  8  1  9
Let's say I need to add an empty row, if I have a zero-value in the column 'C'. Here the empty row should be added after the second row. So at the end I want to have a new dataframe like:
new_df
     A    B    C
0    4    5    2
1    6    7    0
2  NaN  NaN  NaN
3    8    1    9
I tried with concat and append, but I didn't get what I wanted. Could you help me, please?
You can try in this way:
l = df[df['C'] == 0].index.tolist()
for c, i in enumerate(l):
    dfs = np.split(df, [i + 1 + c])
    df = pd.concat([dfs[0], pd.DataFrame([[np.nan, np.nan, np.nan]], columns=df.columns), dfs[1]], ignore_index=True)
print(df)
Input:
A B C
0 4 3 0
1 4 0 4
2 4 4 2
3 3 2 1
4 3 1 2
5 4 1 4
6 1 0 4
7 0 2 0
8 2 0 3
9 4 1 3
Output:
A B C
0 4.0 3.0 0.0
1 NaN NaN NaN
2 4.0 0.0 4.0
3 4.0 4.0 2.0
4 3.0 2.0 1.0
5 3.0 1.0 2.0
6 4.0 1.0 4.0
7 1.0 0.0 4.0
8 0.0 2.0 0.0
9 NaN NaN NaN
10 2.0 0.0 3.0
11 4.0 1.0 3.0
Last thing: it can happen that the last row has 0 in 'C', so you can add:
if df["C"].iloc[-1] == 0 :
df.loc[len(df)] = [np.NaN, np.NaN, np.NaN]
Try using slicing.
First, you need to find the rows where C == 0. So let's create a bool df for this. I'll just name it 'a':
a = (df['C'] == 0)
So, whenever C == 0, a == True.
Now we need to find the index of each row where C == 0, create an empty row and add it to the df:
df2 = df.copy()  # make a copy because we want to be safe here
for i in df.loc[a].index:
    empty_row = pd.DataFrame([], index=[i])  # creating the empty data
    j = i + 1  # just to make things easier to read
    df2 = pd.concat([df2.loc[:i], empty_row, df2.loc[j:]])  # slicing the df
df2 = df2.reset_index(drop=True)  # reset the index
I must say... I don't know the size of your df and if this is fast enough, but give it a try
In case you know the index where you want to insert a new row, concat can be a solution.
Example dataframe:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
# A B C
# 0 1 4 7
# 1 2 5 8
# 2 3 6 9
Your new row as a dataframe with index 1:
new_row = pd.DataFrame({'A': np.nan, 'B': np.nan,'C': np.nan}, index=[1])
Inserting your new row after the second row:
new_df = pd.concat([df.loc[:1], new_row, df.loc[2:]]).reset_index(drop=True)
# A B C
# 0 1.0 4.0 7.0
# 1 2.0 5.0 8.0
# 2 NaN NaN NaN
# 3 3.0 6.0 9.0
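If you need this in several places, the same concat idea can be wrapped in a small helper (insert_empty_row is a hypothetical name, not a pandas API):
def insert_empty_row(df, pos):
    # return a copy of df with one all-NaN row inserted before integer position pos
    empty = pd.DataFrame([[np.nan] * df.shape[1]], columns=df.columns)
    return pd.concat([df.iloc[:pos], empty, df.iloc[pos:]], ignore_index=True)
new_df = insert_empty_row(df, 2)  # insert after the second row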
something like this should work for you, using fractional index labels to slot the new rows in and then sorting:
for key, row in df.copy().iterrows():
    if row['C'] == 0:
        df.loc[key + 0.5] = np.nan  # a fractional label sorts between key and key + 1
df = df.sort_index().reset_index(drop=True)
I want to make the whole row NaN according to a condition, based on a column. For example, if B > 5, I want to make the whole row NaN.
Unprocessed data frame looks like this:
A B
0 1 4
1 3 5
2 4 6
3 8 7
Make whole row NaN, if B > 5:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Thank you.
Use boolean indexing to assign values by condition:
df[df['B'] > 5] = np.nan
print (df)
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Or DataFrame.mask, which by default replaces values with NaN where the condition is True:
df = df.mask(df['B'] > 5)
print (df)
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
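DataFrame.mask also accepts an other argument if you need a fill value different from NaN, e.g. df.mask(df['B'] > 5, other=0).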
Thank you Bharath shetty:
df = df.where(~(df['B']>5))
You can also use df.loc[df.B > 5, :] = np.nan
Example
In [14]: df
Out[14]:
A B
0 1 4
1 3 5
2 4 6
3 8 7
In [15]: df.loc[df.B > 5, :] = np.nan
In [16]: df
Out[16]:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
in human language, df.loc[df.B > 5, :] = np.nan can be translated to:
assign np.nan to every column (:) of the dataframe (df) in the rows where the
condition df.B > 5 holds.
Or using reindex: keeping only the rows where B <= 5 and then reindexing to the original index brings the dropped rows back as all-NaN.
df.loc[df.B <= 5, :].reindex(df.index)
Out[83]:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN