Modifying the NaN's position in a DataFrame - Python

I'm hoping I can explain this well. I have this df with 2 columns: group and numbers. I'm trying to take that np.nan and move it into its new group.
import numpy as np
import pandas as pd

def check_for_nan():
    # for example, let's say my new value is 14.5
    new_nan_value = 14.5
    data = {"group": [-1, 0, 1, 2, 3],
            'numbers': [[np.nan], [11, 12], [14, 15], [16, 17], [18, 19]],
            }
    df = pd.DataFrame(data=data)
    # *** add some code ***
    # I created a new dataframe to show what the result should look like, but we want to operate only on the df from above
    data_2 = {"group": [0, 1, 2, 3],
              'numbers': [[11, 12], [14, np.nan, 15], [16, 17], [18, 19]],
              }
    df_2 = pd.DataFrame(data=data_2)
    # should return the new group number where the nan would live
    return data_2["group"][1]
Output:
current:
   group   numbers
0     -1     [nan]
1      0  [11, 12]
2      1  [14, 15]
3      2  [16, 17]
4      3  [18, 19]
Desired output when new_nan_value = 14.5
   group        numbers
0      0       [11, 12]
1      1  [14, nan, 15]
2      2       [16, 17]
3      3       [18, 19]
return 1

With the dataframe you provided:
import pandas as pd
df = pd.DataFrame(
{
"group": [-1, 0, 1, 2, 3],
"numbers": [[pd.NA], [11, 12], [14, 15], [16, 17], [18, 19]],
}
)
new_nan_value = 14.5
Here is one way to do it:
def move_nan(df, new_nan_value):
    """Helper function.
    Args:
        df: input dataframe.
        new_nan_value: insertion value.
    Returns:
        Dataframe with nan value at insertion point, new group.
    """
    # Reshape dataframe along row axis
    df = df.explode("numbers").dropna().reset_index(drop=True)
    # Insert new row
    insert_pos = df.loc[df["numbers"] < new_nan_value, "numbers"].index[-1] + 1
    df = pd.concat(
        [
            df.loc[: insert_pos - 1, :],
            pd.DataFrame({"group": [pd.NA], "numbers": pd.NA}, index=[insert_pos]),
            df.loc[insert_pos:, :],
        ]
    )
    # bfill() replaces fillna(method="bfill"), which is deprecated in recent pandas
    df["group"] = df["group"].bfill()
    # Find new group value
    new_group = df.loc[df["numbers"].isna(), "group"].values[0]
    # Groupby and reshape dataframe along column axis
    df = df.groupby("group").agg(list).reset_index(drop=False)
    return df, new_group
So that:
df, new_group = move_nan(df, 14.5)
print(df)
# Output
group numbers
0 0 [11, 12]
1 1 [14, nan, 15]
2 2 [16, 17]
3 3 [18, 19]
print(new_group) # 1
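A lighter variant that avoids exploding and regrouping can be sketched with the standard-library bisect module. It rests on assumptions the question does not state: the placeholder row is the one with a negative group, every list is sorted, and the groups appear in ascending order. The helper name place_nan is hypothetical.

```python
import bisect

import numpy as np
import pandas as pd


def place_nan(df, value):
    """Hypothetical helper: return (new frame, group that receives the nan)."""
    # Assumption: the placeholder row holding [nan] has a negative group.
    kept = df[df["group"] >= 0]
    # Copy each list so the caller's frame is left untouched.
    rows = [(group, list(nums)) for group, nums in zip(kept["group"], kept["numbers"])]
    new_group = None
    for i, (group, nums) in enumerate(rows):
        is_last = i == len(rows) - 1
        # Pick the first group whose largest value is >= value (or the last group).
        if value <= nums[-1] or is_last:
            # bisect_left gives the slot that keeps the list ordered.
            nums.insert(bisect.bisect_left(nums, value), np.nan)
            new_group = group
            break
    out = pd.DataFrame(
        {"group": [g for g, _ in rows], "numbers": [n for _, n in rows]}
    )
    return out, new_group


df = pd.DataFrame(
    {
        "group": [-1, 0, 1, 2, 3],
        "numbers": [[np.nan], [11, 12], [14, 15], [16, 17], [18, 19]],
    }
)
result, new_group = place_nan(df, 14.5)
print(result)
print(new_group)
```

Unlike move_nan above, this sketch never compares values across group boundaries, so a value that falls between two groups lands at the start of the later group.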

Related

How to find the overlapping count of rows between two keys of a MultiIndex dataframe?

Two dataframes with the same date index have been concatenated under different keys into a MultiIndex dataframe. Each dataframe has different products as column names, with their prices as values. I need the correlation between the two dataframes and the count of overlapping rows. The correlation is done, but how do I count the overlapping rows for each pair of products? The result should be a matrix: a dataframe with the products from dataframe 1 as column names, the products from dataframe 2 as row names, and the number of overlapping rows over the same period as values.
For example:
df1 = pd.DataFrame(data={'col1': ['1/12/2020', '2/12/2020', '3/12/2020'],
                         'col2': [10, 11, 12], 'col3': [13, 14, 10]})
df2 = pd.DataFrame(data={'col1': ['1/12/2020', '2/12/2020', '3/12/2020'],
                         'A': [10, 9, 12], 'B': [4, 14, 2]})
df1 = df1.set_index('col1')
df2 = df2.set_index('col1')
concat_data1 = pd.concat([df1, df2], axis=1, keys=['df1', 'df2'])
concat_data1
           df1       df2
          col2 col3    A   B
col1
1/12/2020   10   13   10   4
2/12/2020   11   14    9  14
3/12/2020   12   10   12   2
Needed output, the overlapping period:
   col2  col3
A     2     0
B     0     1
This is a way of doing it:
import itertools
import pandas as pd
data1 = {
'col1': ['1/12/2020', '2/12/2020', '3/12/2020', '4/12/2020'],
'col2': [10, 11, 12, 14],
'col3': [13, 14, 10, 6],
'col4': [10, 9, 15, 10],
'col5': [10, 9, 15, 5],
}
data2 = {
'col1': ['1/12/2020', '2/12/2020', '3/12/2020', '4/12/2020'],
'A': [10, 9, 12, 14],
'B' :[4, 14, 2, 9],
'C': [6, 9, 1, 3],
'D': [6, 9, 1, 8]
}
df1 = pd.DataFrame(data1).set_index('col1')
df2 = pd.DataFrame(data2).set_index('col1')
concat_data = pd.concat([df1, df2], axis=1, keys=['df1', 'df2'])
columns = {df: list(concat_data[df].columns) for df in set(concat_data.columns.get_level_values(0))}
matrix = pd.DataFrame(data=0, columns=columns['df1'], index=columns['df2'])
for row in concat_data.iterrows():
    for cols in list(itertools.product(columns['df1'], columns['df2'])):
        matrix.loc[cols[1], cols[0]] += row[1]['df1'][cols[0]] == row[1]['df2'][cols[1]]
print(matrix)
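The same counts can also be sketched without the row loop by comparing every column pair at once with NumPy broadcasting. This assumes both frames share the same index in the same order (otherwise align them first, for example with reindex) and that "overlap" means equal values in the same row, as in the loop above.

```python
import numpy as np
import pandas as pd

dates = ["1/12/2020", "2/12/2020", "3/12/2020", "4/12/2020"]
df1 = pd.DataFrame(
    {
        "col2": [10, 11, 12, 14],
        "col3": [13, 14, 10, 6],
        "col4": [10, 9, 15, 10],
        "col5": [10, 9, 15, 5],
    },
    index=dates,
)
df2 = pd.DataFrame(
    {
        "A": [10, 9, 12, 14],
        "B": [4, 14, 2, 9],
        "C": [6, 9, 1, 3],
        "D": [6, 9, 1, 8],
    },
    index=dates,
)

# Add a trailing axis to df1 and a middle axis to df2 so that the
# comparison broadcasts to shape (rows, df1 columns, df2 columns).
a = df1.to_numpy()[:, :, None]
b = df2.to_numpy()[:, None, :]
# Summing over the row axis counts the matching rows per column pair.
counts = (a == b).sum(axis=0)

# Transpose so df2's products are the rows and df1's products the columns.
matrix = pd.DataFrame(counts.T, index=df2.columns, columns=df1.columns)
print(matrix)
```

The intermediate boolean array has one entry per row and column pair, so for very wide frames the loop may still be the more memory-friendly choice.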

Modifying a 2D array

I'm stuck on trying to modify a 2D array... Nothing I try seems to work... I'm trying to write a function that will add a value at its specific location in the numbers column...
import pandas as pd
def twod_array(num):
    data = {"group": [-1, 0, 1, 2],
            'numbers': [[2], [14, 15], [16, 17], [19, 20, 21]],
            }
    df = pd.DataFrame(data=data)
    print(df)
    return 0
Currently it prints this:
group numbers
0 -1 [2]
1 0 [14, 15]
2 1 [16, 17]
3 2 [19, 20, 21]
What I'd like to do is to add a value based on the passed input, so for example if I pass 14.5 as a num, this is the output I'd like to see:
group numbers
0 -1 [2]
1 0 [14, 14.5, 15]
2 1 [16, 17]
3 2 [19, 20, 21]
Another example:
If I pass 18 as a num:
group numbers
0 -1 [2]
1 0 [14, 15]
2 1 [16, 17, 18]
3 2 [19, 20, 21]
I'm hoping someone can help with this.
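df = pd.DataFrame({"group": [-1, 0, 1, 2],
                   'numbers': [[2], [14, 15], [16, 17], [19, 20, 21]],
                   })
arr = df['numbers'].to_list()
in_num = 18
done = False
for i, sub_arr in enumerate(arr):
    for j, num in enumerate(sub_arr):
        if num > in_num:
            # Insert before the first larger value, or append to the
            # previous row when that value opens its row
            if j != 0 or i == 0:
                sub_arr.insert(j, in_num)
            else:
                arr[i - 1].append(in_num)
            done = True
            break
    # Stop after one insertion so the value is not added again
    if done:
        break
df['numbers'] = arr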
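The same placement rule can be sketched with the standard-library bisect module instead of nested loops. This assumes that every list is sorted and that the rows are ordered and non-overlapping, as in the example; the helper name insert_sorted is hypothetical.

```python
import bisect

import pandas as pd


def insert_sorted(df, num):
    """Hypothetical helper: return a copy of df with num placed in its row."""
    # Copy the lists so the input frame is not modified.
    rows = [list(r) for r in df["numbers"]]
    # First value of every row; the rows are assumed sorted and non-overlapping.
    starts = [r[0] for r in rows]
    # Row whose first value is the last one <= num (row 0 if num is smaller).
    i = max(bisect.bisect_right(starts, num) - 1, 0)
    # insort keeps the chosen row sorted.
    bisect.insort(rows[i], num)
    out = df.copy()
    out["numbers"] = pd.Series(rows, index=df.index)
    return out


df = pd.DataFrame(
    {
        "group": [-1, 0, 1, 2],
        "numbers": [[2], [14, 15], [16, 17], [19, 20, 21]],
    }
)
print(insert_sorted(df, 14.5))
print(insert_sorted(df, 18))
```

A value larger than everything is appended to the last row, and a value smaller than everything goes to the front of the first row.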

Need a list of row-values in pandas

What I have, and what I need
I have a pandas DataFrame p with cols 'a', 'b', 'c' (col names stored in pc).
From that I would like to create a DataFrame pn of the same shape, but each cell as a list of values from selected rows.
The DataFrame n tells me which rows to select from p for each row in pn.
import pandas as pd
pc = ['a', 'b', 'c']
p = pd.DataFrame([[11, 12, 13],
[21, 22, 23]],
columns=pc,
index=[1001,
1002])
n = pd.DataFrame([[[1001] ],
[[1001, 1002]]],
columns=['sel_row'],
index=[1001,
1002])
What I could (and want to) achieve
The farthest I could get... gives me a list of cols, rather than rows.
So, am I confusing the nested for loops ?
pn = pd.DataFrame([ [p.loc[ix, pc].values for ix in n.loc[indx].values[0]]
for indx in n.index ])
print (pn)
# The actual output:
# 0 1
# 0 [11, 12, 13] None
# 1 [11, 12, 13] [21, 22, 23]
# The required output:
# 0 1 2
# 0 [11] [12] [13]
# 1 [11, 21] [12, 22] [13, 23]
Stray thoughts
Maybe I should also iterate over something like p.loc[ix, c] ... for c in pc, but how can there be 3 loops?
A further (optional) wish
Is this possible with a lambda too? My intuition is that it would be faster, but I'm not sure!
Thanks for going through the question, and for any help offered.
You can explode the n, use that to slice p and groupby:
s = n['sel_row'].explode()
p.loc[s].groupby(s.index).agg(list)
Output:
a b c
1001 [11] [12] [13]
1002 [11, 21] [12, 22] [13, 23]
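To see why this works, the three steps can be spelled out one at a time. This is only a sketch of the same idea; the astype and to_numpy conversions and the intermediate name picked are additions made here to keep the dtypes explicit.

```python
import pandas as pd

p = pd.DataFrame(
    [[11, 12, 13], [21, 22, 23]],
    columns=["a", "b", "c"],
    index=[1001, 1002],
)
n = pd.DataFrame(
    [[[1001]], [[1001, 1002]]],
    columns=["sel_row"],
    index=[1001, 1002],
)

# Step 1: one line per selected label; the index repeats the row of n
# that each label came from.
s = n["sel_row"].explode().astype("int64")

# Step 2: pull the selected rows out of p, in the same order as s.
picked = p.loc[s.to_numpy()]

# Step 3: group the pulled rows by the row of n they belong to and
# collect every column into a list.
pn = picked.groupby(s.index.to_numpy()).agg(list)
print(pn)
```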
You can write a custom function here.
pc = ['a', 'b', 'c']
p = pd.DataFrame([[11, 12, 13],
[21, 22, 23]],
columns=pc,
index=[1001,
1002])
n = pd.DataFrame([[[1001] ],
[[1001, 1002]]],
columns=['sel_row'],
index=[1001,
1002])
def f(idx):
    return pd.Series(p.loc[idx, :].values.T.tolist())
n.sel_row.apply(f)
0 1 2
1001 [11] [12] [13]
1002 [11, 21] [12, 22] [13, 23]
With a lambda, the above could be rewritten as:
n.sel_row.apply(lambda idx: pd.Series(p.loc[idx, :].values.T.tolist()))
IIUC, you could do:
data = [[[*x] for x in zip(*p.loc[idxs].values)] for idxs in n['sel_row']]
result = pd.DataFrame(data=data, columns=p.columns, index=n.index)  # one output row per row of n
print(result)
Output
a b c
1001 [11] [12] [13]
1002 [11, 21] [12, 22] [13, 23]

Creating a pandas Dataframe from a matrix of occurrences and a list of values

I have an occurrences DataFrame :
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1,3,size=(4,3)))
Out[0] :
0 1 2
0 2 2 1
1 2 2 2
2 1 1 1
3 2 1 2
and a list of values :
L = np.random.randint(10, 16, size=df.values.sum())  # random_integers is deprecated; randint's upper bound is exclusive
Out[1] :
array([13, 11, 15, 11, 15, 13, 12, 11, 12, 15, 11, 11, 10, 11, 13, 11, 14,
10, 12])
I need help creating a new DataFrame of the same shape as df that holds the values of the list L, distributed according to the occurrences matrix df:
0 1 2
0 [13, 11] [15, 11] [15]
1 [13, 12] [11, 12] [15, 11]
2 [11] [10] [11]
3 [13, 11] [14] [10, 12]
Simple nested loop variant:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1,3,size=(4,3)))
L = np.random.randint(10, 16, size=df.values.sum())  # upper bound is exclusive
new_df = df.astype(object).copy()
L_ind = 0
for i in range(df.shape[0]):
    for j in range(df.shape[1]):
        # .at sets a single cell, so the list is stored as one object
        new_df.at[i, j] = list(L[L_ind: L_ind + df.iloc[i, j]])
        L_ind += df.iloc[i, j]
df:
0 1 2
0 2 2 1
1 1 1 2
2 1 2 2
3 2 2 2
L:
array([15, 12, 10, 12, 13, 15, 13, 13, 15, 13, 15, 15, 12, 11, 14, 11, 10,
15, 15, 13])
new_df:
0 1 2
0 [15, 12] [10, 12] [13]
1 [15] [13] [13, 15]
2 [13] [15, 15] [12, 11]
3 [14, 11] [10, 15] [15, 13]
This code might help:
import numpy as np
import pandas as pd
np.random.seed(7)
df = pd.DataFrame(np.random.randint(1,3,size=(4,3)))
# print(df)
L = np.random.randint(10, 16, size=df.values.sum())  # upper bound is exclusive
currentIndex = 0
new_df = pd.DataFrame()
# Note: L is consumed column by column here
for c in df.columns.tolist():
    new_list = []
    for val in df[c]:
        small_list = []
        for i in range(val):
            small_list.append(L[currentIndex])
            currentIndex += 1
        new_list.append(small_list)
    new_df.insert(c, c, new_list)
print(new_df)
new_df
0 1 2
0 [10, 11] [14] [14, 15]
1 [12] [10, 13] [10, 10]
2 [12, 10] [12, 13] [15]
3 [14, 10] [14] [10, 13]
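The slicing done by the loops above can also be sketched with np.cumsum and np.split, which cut L at the running totals of the occurrence counts. The example uses the fixed df and L from the question rather than random data, and it fills the cells row by row, as in the question's expected output.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[2, 2, 1], [2, 2, 2], [1, 1, 1], [2, 1, 2]])
L = np.array([13, 11, 15, 11, 15, 13, 12, 11, 12, 15,
              11, 11, 10, 11, 13, 11, 14, 10, 12])

# Occurrence counts in row-major order.
counts = df.to_numpy().ravel()
# Cut L at the running totals; the last total is the end of L, so drop it.
chunks = np.split(L, np.cumsum(counts)[:-1])

# Fill an object array cell by cell so NumPy keeps each list as one item.
cells = np.empty(counts.size, dtype=object)
for k, chunk in enumerate(chunks):
    cells[k] = chunk.tolist()

new_df = pd.DataFrame(cells.reshape(df.shape), index=df.index, columns=df.columns)
print(new_df)
```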

Transforming multiindex to row-wise multi-dimensional NumPy array.

Suppose I have a MultiIndex DataFrame similar to an example from the MultiIndex docs.
>>> df
               0   1   2   3
first second
bar   one      0   1   2   3
      two      4   5   6   7
baz   one      8   9  10  11
      two     12  13  14  15
foo   one     16  17  18  19
      two     20  21  22  23
qux   one     24  25  26  27
      two     28  29  30  31
I want to generate a NumPy array from this DataFrame with a 3-dimensional structure like
>>> desired_arr
array([[[ 0, 4],
[ 1, 5],
[ 2, 6],
[ 3, 7]],
[[ 8, 12],
[ 9, 13],
[10, 14],
[11, 15]],
[[16, 20],
[17, 21],
[18, 22],
[19, 23]],
[[24, 28],
[25, 29],
[26, 30],
[27, 31]]])
How can I do so?
Hopefully it is clear what is happening here - I am effectively unstacking the DataFrame by the first level and then trying to turn each top level in the resulting column MultiIndex to its own 2-dimensional array.
I can get half way there with
>>> df.unstack(1)
         0       1       2       3
second one two one two one two one two
first
bar      0   4   1   5   2   6   3   7
baz      8  12   9  13  10  14  11  15
foo     16  20  17  21  18  22  19  23
qux     24  28  25  29  26  30  27  31
but then I am struggling to find a nice way to turn each column into a 2-dimensional array and then join them together, beyond doing so explicitly with loops and lists.
I feel like there should be some way to specify the shape of my desired NumPy array beforehand, fill it with np.nan, and then use a specific iteration order to fill in the values from my DataFrame, but I have not managed to solve the problem with this approach yet.
To generate the sample DataFrame
iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
ind = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
df = pd.DataFrame(np.arange(8*4).reshape((8, 4)), index=ind)
Some reshape and swapaxes magic -
df.values.reshape(4,2,-1).swapaxes(1,2)
Generalizable to -
m,n = len(df.index.levels[0]), len(df.index.levels[1])
arr = df.values.reshape(m,n,-1).swapaxes(1,2)
Basically splitting the first axis into two of lengths 4 and 2 creating a 3D array and then swapping the last two axes, i.e. pushing in the axis of length 2 to the back (as the last one).
Sample output -
In [35]: df.values.reshape(4,2,-1).swapaxes(1,2)
Out[35]:
array([[[ 0, 4],
[ 1, 5],
[ 2, 6],
[ 3, 7]],
[[ 8, 12],
[ 9, 13],
[10, 14],
[11, 15]],
[[16, 20],
[17, 21],
[18, 22],
[19, 23]],
[[24, 28],
[25, 29],
[26, 30],
[27, 31]]])
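The idea raised in the question, pre-allocating an array filled with np.nan and then writing the values into place, can be sketched with the integer codes of the MultiIndex. This is an alternative to the reshape, not part of the answer above; it yields a float array, and any (first, second) combination missing from the index would simply stay NaN.

```python
import numpy as np
import pandas as pd

iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]]
ind = pd.MultiIndex.from_product(iterables, names=["first", "second"])
df = pd.DataFrame(np.arange(8 * 4).reshape((8, 4)), index=ind)

m, n = len(df.index.levels[0]), len(df.index.levels[1])
# Target shape: (first level, columns, second level), pre-filled with NaN.
arr = np.full((m, df.shape[1], n), np.nan)

# codes hold, for every row, its position within each index level.
first_codes, second_codes = df.index.codes
# Each DataFrame row is written to arr[first, :, second].
arr[first_codes, :, second_codes] = df.to_numpy()
print(arr[0])
```

Because the placement is driven by the index codes rather than by row order, this sketch does not require the DataFrame to be sorted by its index.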
To complete @Divakar's answer, here is a multidimensional generalisation:
# sort values by index
A = df.sort_index()
# fill na
for idx in A.index.names:
    A = A.unstack(idx).fillna(0).stack(1)
# create a tuple with the right dimensions
reshape_size = tuple([len(x) for x in A.index.levels])
# reshape
arr = np.reshape(A.values, reshape_size).swapaxes(0, 1)
