How to concatenate the elements of arrays with different lengths? - python

I have list1 = [1,2,3,4,5,6,7,8,9,0]. I want to take out one element ("4") and then split the remaining list with np.array_split(list1, 5), which gives [array([1, 2]), array([5, 6]), array([7, 8]), array([ 9, 10]), array([11])]. When I try to convert this into a pandas DataFrame, the output looks like:
index    0     1
0        1   2.0
1        5   6.0
2        7   8.0
3        9  10.0
4       11   NaN
But I want the result to be a single-column DataFrame with one value in each cell and no NaN at the end.
Any suggestions on this matter would be appreciated.

Put your array into a dict and create your dataframe from that:
list1 = [1,2,3,4,5,77,8,9,0]
x = np.array_split(list1, 5)
df = pd.DataFrame({'column': x})
Output:
>>> df
    column
0   [1, 2]
1   [3, 4]
2  [5, 77]
3   [8, 9]
4      [0]
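If the goal is one scalar value per cell rather than a list per cell, a minimal sketch (assuming the split pieces from the code above) is to flatten the pieces back into a single 1-D array before building the DataFrame, which also avoids the trailing NaN:
import numpy as np
import pandas as pd

list1 = [1, 2, 3, 4, 5, 77, 8, 9, 0]
x = np.array_split(list1, 5)

# Concatenating the unevenly sized pieces restores one flat sequence,
# so no padding (and therefore no NaN) is needed.
df = pd.DataFrame({'column': np.concatenate(x)})
print(df)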

Related

How to find column numbers in increasing order

I have a pandas DataFrame with a column containing item numbers that are supposed to increase by 1 from one row to the next.
df1 = pd.DataFrame({
    "item_number": [1, 2, 3, 4, 5, 6, 8, 10],
    "col_A": ['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff', 'hhh', 'jjj']})
df1
   item_number col_A
0            1   aaa
1            2   bbb
2            3   ccc
3            4   ddd
4            5   eee
5            6   fff
6            8   hhh
7           10   jjj
As you can see, the item number increases by two between 6 and 8 and again between 8 and 10. Is there a way to write a function that will return a list of the skipped numbers, i.e. [7, 9], and otherwise return True?
s = pd.Series(range(df1['item_number'].min(), df1['item_number'].max() + 1))
s[~s.isin(df1['item_number'])].values
array([7, 9], dtype=int64)
one-liner (an empty set is falsy, so the expression returns True when nothing is missing and the set of missing numbers otherwise):
set(range(df1.item_number.min(), df1.item_number.max()+1)) - set(df1.item_number) or True
You can take advantage of Python set and list operations to find out whether the condition you are proposing holds for the input list:
li = [1, 2, 3, 4, 5, 6, 8, 10]

def fun(l):
    a = list(set(range(l[0], l[-1] + 1)) - set(l))
    if a == []:
        return True
    else:
        return a

print(fun(li))
Output:
[9, 7]
Also, you can use return sorted(a) if you want the list elements to be returned in order.
Use range with np.setdiff1d:
In [1518]: import numpy as np
In [1519]: rng = range(df1.item_number.min(), df1.item_number.max() + 1)
In [1523]: res = np.setdiff1d(rng, df1.item_number)
In [1524]: res
Out[1524]: array([7, 9])
This will do it:
def foo(df):
    x = df.set_index('item_number').reindex(range(df.item_number.min(), df.item_number.max() + 1))
    x = list(x.index[x.col_A.isna()])
    return x if x else True
Examples:
y = foo(df1)
print(y)
y = foo(df1.loc[range(1, 6)])
print(y)
Output:
[7, 9]
True

Is it okay to use lambda in this case?

I'm trying to figure out a way to loop over a pandas DataFrame to generate a new key.
Here's an example of the dataframe:
df = pd.DataFrame({"pdb" : ["a", "b"], "beg": [1, 2], "end" : [10, 11]})
for index, row in df.iterrows():
    df['range'] = [list(x) for x in zip(df['beg'], df['end'])]
Now I want to create a new key that takes the first and last number of df["range"] and fills in the numbers in between (i.e. the first one would be [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).
So far I think I should be using something like this, but I could be completely wrong:
df["total"] = df["range"].map(lambda x: ...)  # here I should append all the values between df["range"][0] and df["range"][1]
Here's an example of the result that I'm looking for:
  pdb  beg  end    range                   total
0   a    1   10  [1, 10]   1 2 3 4 5 6 7 8 9 10
1   b    2   11  [2, 11]  2 3 4 5 6 7 8 9 10 11
I could use some help with the lambda function; I get really confused by the syntax.
Try with apply
df['new'] = df.apply(lambda x : list(range(x['beg'],x['end']+1)),axis=1)
Out[423]:
0 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
1 [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
dtype: object
This should work:
df['total'] = df['range'].apply(lambda x: [n for n in range(x[0], x[1]+1)])
As per your output, you need
In [18]: df['new'] = df.apply(lambda x : " ".join(list(map(str,range(x['beg'],x['end']+1)))),axis=1)
In [19]: df
Out[19]:
  pdb  beg  end    range                     new
0   a    1   10  [1, 10]   1 2 3 4 5 6 7 8 9 10
1   b    2   11  [2, 11]  2 3 4 5 6 7 8 9 10 11
If you want to use iterrows then you can do it in the loop itself as follows:
Code :
import pandas as pd
df = pd.DataFrame({"pdb" : ["a", "b"], "beg": [1, 2], "end" : [10, 11]})
for index, row in df.iterrows():
    df['range'] = [list(x) for x in zip(df['beg'], df['end'])]
    df['total'] = [range(*x) for x in zip(df['beg'], df['end'])]
Output:
  pdb  beg  end    range                         total
0   a    1   10  [1, 10]   (1, 2, 3, 4, 5, 6, 7, 8, 9)
1   b    2   11  [2, 11]  (2, 3, 4, 5, 6, 7, 8, 9, 10)
Note that range excludes its stop value; use zip(df['beg'], df['end'] + 1) if the end point should be included, as in the desired output.

numpy: efficient summation of values within variably-sized array blocks

The problem is very similar to that in How to evaluate the sum of values within array blocks, where I need to sum up elements in a matrix by blocks. What is different here is that the blocks could have different sizes. For example, given a 4-by-5 matrix
1 1 | 1 1 1
----|------
1 1 | 1 1 1
1 1 | 1 1 1
1 1 | 1 1 1
and block sizes 1 and 3 along the rows and 2 and 3 along the columns, the result should be a 2-by-2 matrix:
2 3
6 9
Is there a way of doing this without loops?
Seems like a good fit to use np.add.reduceat to basically sum along rows and then along cols -
def sum_blocks(a, row_sizes, col_sizes):
    # Sum rows based on row sizes
    s1 = np.add.reduceat(a, np.r_[0, row_sizes[:-1].cumsum()], axis=0)
    # Sum cols from the row-summed output based on col sizes
    return np.add.reduceat(s1, np.r_[0, col_sizes[:-1].cumsum()], axis=1)
Sample run -
In [45]: np.random.seed(0)
...: a = np.random.randint(0,9,(4,5))
In [46]: a
Out[46]:
array([[5, 0, 3, 3, 7],
       [3, 5, 2, 4, 7],
       [6, 8, 8, 1, 6],
       [7, 7, 8, 1, 5]])
In [47]: row_sizes = np.array([1,3])
...: col_sizes = np.array([2,3])
In [48]: sum_blocks(a, row_sizes, col_sizes)
Out[48]:
array([[ 5, 13],
       [36, 42]])
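As a quick sanity check, this also reproduces the all-ones example from the question (a usage sketch, reusing sum_blocks as defined above):
import numpy as np

a = np.ones((4, 5), dtype=int)
row_sizes = np.array([1, 3])
col_sizes = np.array([2, 3])

# Expected, per the question:
# [[2, 3],
#  [6, 9]]
print(sum_blocks(a, row_sizes, col_sizes))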

Add column for squares/cubes/etc for each column in numpy/pandas

I'm trying to take a set of data that consists of N rows and expand each row to include the squares/cubes/etc. of each column in that row (how high a power to go up to is determined by a variable j). The data starts out as a pandas DataFrame but can be turned into a numpy array.
For example:
If the row is [3,2] and j is 3, the row should be transformed to [3, 2, 9, 4, 27, 8]
I currently have a semi-working version that consists of a bunch of nested for loops and is pretty ugly. I'm hoping for a cleaner way to make this transformation so things will be a bit easier for me to debug.
The behavior I'm looking for is basically the same as sklearn's PolynomialFeatures, but I'm trying to do it in numpy and/or pandas only.
Thanks!
Use NumPy broadcasting for a vectorized solution -
In [66]: a = np.array([3,2])
In [67]: j = 3
In [68]: a**np.arange(1,j+1)[:,None]
Out[68]:
array([[ 3,  2],
       [ 9,  4],
       [27,  8]])
And there's a NumPy builtin : np.vander -
In [142]: np.vander(a,j+1).T[::-1][1:]
Out[142]:
array([[ 3,  2],
       [ 9,  4],
       [27,  8]])
Or with increasing flat set as True -
In [180]: np.vander(a,j+1,increasing=True).T[1:]
Out[180]:
array([[ 3,  2],
       [ 9,  4],
       [27,  8]])
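For a full N-row array rather than a single row, one possible extension of the same broadcasting idea (a sketch, not part of the original answer; the array X and power j below are illustrative) is:
import numpy as np

X = np.array([[3, 2],
              [1, 4]])   # N rows, d columns
j = 3

# Raise every element to powers 1..j via broadcasting, giving shape (N, j, d),
# then flatten the last two axes so each row becomes
# [x1, x2, x1**2, x2**2, ..., x1**j, x2**j].
powers = X[:, None, :] ** np.arange(1, j + 1)[None, :, None]
out = powers.reshape(len(X), -1)
print(out[0])   # [ 3  2  9  4 27  8], matching the example in the question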
Try concat with the ignore_index option to avoid duplicated column names:
df = pd.DataFrame(np.arange(9).reshape(3,3))
j = 3
pd.concat([df**i for i in range(1,j+1)], axis=1,ignore_index=True)
Output:
   0  1  2   3   4   5    6    7    8
0  0  1  2   0   1   4    0    1    8
1  3  4  5   9  16  25   27   64  125
2  6  7  8  36  49  64  216  343  512

Convert column suffixes from pandas join into a MultiIndex

I have two pandas DataFrames with (not necessarily) identical index and column names.
>>> df_L = pd.DataFrame({'X': [1, 3],
...                      'Y': [5, 7]})
>>> df_R = pd.DataFrame({'X': [2, 4],
...                      'Y': [6, 8]})
I can join them together and assign suffixes.
>>> df_L.join(df_R, lsuffix='_L', rsuffix='_R')
   X_L  Y_L  X_R  Y_R
0    1    5    2    6
1    3    7    4    8
But what I want is to make 'L' and 'R' sub-columns under both 'X' and 'Y'.
The desired DataFrame looks like this:
>>> pd.DataFrame(columns=pd.MultiIndex.from_product([['X', 'Y'], ['L', 'R']]),
...              data=[[1, 5, 2, 6],
...                    [3, 7, 4, 8]])
   X     Y
   L  R  L  R
0  1  5  2  6
1  3  7  4  8
Is there a way I can combine the two original DataFrames to get this desired DataFrame?
You can use pd.concat with the keys argument, along the first axis:
df = pd.concat([df_L, df_R], keys=['L','R'],axis=1).swaplevel(0,1,axis=1).sort_index(level=0, axis=1)
>>> df
   X     Y
   L  R  L  R
0  1  2  5  6
1  3  4  7  8
For those looking for an answer to the more general problem of joining two data frames with different indices or columns into a multi-index table:
# Prepend a key-level to the column index
# https://stackoverflow.com/questions/14744068
df_L = pd.concat([df_L], keys=["L"], axis=1)
df_R = pd.concat([df_R], keys=["R"], axis=1)
# Join the two dataframes
df = df_L.join(df_R)
# Reorder levels if needed:
df = df.reorder_levels([1,0], axis=1).sort_index(axis=1)
Example:
# Data:
df_L = pd.DataFrame({'X': [1, 3, 5], 'Y': [7, 9, 11]})
df_R = pd.DataFrame({'X': [2, 4], 'Y': [6, 8], 'Z': [10, 12]})
# Result:
#    X       Y         Z
#    L    R  L    R      R
# 0  1  2.0  7   6.0  10.0
# 1  3  4.0  9   8.0  12.0
# 2  5  NaN  11  NaN   NaN
This also solves the special case of the OP with equal indices and columns. An alternative to pd.concat for prepending the key level is to set the columns directly:
df_L.columns = pd.MultiIndex.from_product([["L"], df_L.columns])
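A compact sketch of that special case (assuming the df_L and df_R from the question, and reordering so 'X'/'Y' become the outer level):
import pandas as pd

df_L = pd.DataFrame({'X': [1, 3], 'Y': [5, 7]})
df_R = pd.DataFrame({'X': [2, 4], 'Y': [6, 8]})

# Prepend the 'L'/'R' key as an extra column level, join, then swap the
# levels so 'X'/'Y' sit on top with 'L'/'R' underneath.
df_L.columns = pd.MultiIndex.from_product([['L'], df_L.columns])
df_R.columns = pd.MultiIndex.from_product([['R'], df_R.columns])
df = df_L.join(df_R).swaplevel(axis=1).sort_index(axis=1)
print(df)
#    X     Y
#    L  R  L  R
# 0  1  2  5  6
# 1  3  4  7  8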
