Add data from one column to another column on every other row - python

I have two data frames:
import pandas as pd
import numpy as np
sgRNA = pd.Series(["ABL1_sgABL1_130854834","ABL1_sgABL1_130862824","ABL1_sgABL1_130872883","ABL1_sgABL1_130884018"])
sequence = pd.Series(["CTTAGGCTATAATCACAATG","GGTTCATCATCATTCAACGG","TCAGTGATGATATAGAACGG","TTGCTCCCTCGAAAAGAGCG"])
df1=pd.DataFrame(sgRNA,columns=["sgRNA"])
df1["sequence"]=sequence
df2=pd.DataFrame(columns=["column"],
                 index=np.arange(len(df1) * 2))
I want to add values from both columns from df1 to df2 every other row, like this:
ABL1_sgABL1_130854834
CTTAGGCTATAATCACAATG
ABL1_sgABL1_130862824
GGTTCATCATCATTCAACGG
ABL1_sgABL1_130872883
TCAGTGATGATATAGAACGG
ABL1_sgABL1_130884018
TTGCTCCCTCGAAAAGAGCG
To do this for df1["sgRNA"] I used this code:
df2.iloc[0::2, :]=df1["sgRNA"]
But I get this error:
ValueError: could not broadcast input array from shape (4,) into shape (4,1).
What am I doing wrong?

I think you're looking for DataFrame.stack():
df2["column"] = df1.stack().reset_index(drop=True)
print(df2)
Prints:
column
0 ABL1_sgABL1_130854834
1 CTTAGGCTATAATCACAATG
2 ABL1_sgABL1_130862824
3 GGTTCATCATCATTCAACGG
4 ABL1_sgABL1_130872883
5 TCAGTGATGATATAGAACGG
6 ABL1_sgABL1_130884018
7 TTGCTCCCTCGAAAAGAGCG

Besides Andrej Kesely's superior solution, to answer the question of what went wrong in the code: the problem is really minor.
df1["sgRNA"] is a Series (one-dimensional), while df2.iloc[0::2, :] is a DataFrame (two-dimensional).
The fix is to make the df2 side one-dimensional by selecting its one and only column by position, instead of slicing across all columns (of which there happens to be just one):
df2.iloc[0::2, 0] = df1["sgRNA"]
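To fill both halves this way, the same positional assignment can be repeated for the odd rows. The following is a minimal sketch on a shortened two-row version of df1; the use of .to_numpy() is an assumption added here so that the values are written purely by position and no index alignment can interfere.

```python
import numpy as np
import pandas as pd

# Shortened version of the question's data (two rows instead of four).
df1 = pd.DataFrame({
    "sgRNA": ["ABL1_sgABL1_130854834", "ABL1_sgABL1_130862824"],
    "sequence": ["CTTAGGCTATAATCACAATG", "GGTTCATCATCATTCAACGG"],
})
df2 = pd.DataFrame(columns=["column"], index=np.arange(len(df1) * 2))

# Even rows get the sgRNA names, odd rows get the sequences.
# Column 0 is selected by position, so both sides are one-dimensional.
df2.iloc[0::2, 0] = df1["sgRNA"].to_numpy()
df2.iloc[1::2, 0] = df1["sequence"].to_numpy()

result = df2["column"].tolist()
print(df2)
```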


Numpy Array fill empty data to "uniformity"

I have a 2D numpy array with missing data, and I want to fill the gaps so that the values change uniformly across the array. I have something like this:
[[72829],
[nan],
[73196],
[73087],
[nan],
[nan],
[72294.5]]
I want to fill those empty cells with the mean of the closest cells, to get something like this:
[[72829],
[73012.5],
[73196],
[73087],
[72888.875],
[72492.625],
[72294.5]]
I tried SimpleImputer and KNNImputer from scikit-learn, but all I got was the same value for every missing cell, not the mean of the neighbouring cells as described above. This is the code:
for label, column in data.items():  # iteritems() was removed in pandas 2.0
    reshaped = np.array(column.values)  # Creating a np array to use scikit-learn
    reshaped = reshaped.reshape(-1,1)  # changing shape of data to a 2D array
    normalized = imputer.fit_transform(reshaped)  # transforming data
    data[label] = normalized  # changing the column value to the new one
With KNNImputer, I got something like this (which is not what I want):
[[72829],
[68088.71106114],
[73196],
[73087],
[68088.71106114],
[68088.71106114],
[72294.5]]
Does anyone know an idea or algorithm that could give this kind of "uniformity" to the array values? The goal is to be able to plot graphs without missing data. A solution using pandas/numpy/scikit-learn would be best, thanks.
Convert the data to a DataFrame and average the backward fill (bfill) and the forward fill (ffill). Note that consecutive gaps all receive the same value:
x = [[72829],
[np.nan],
[73196],
[73087],
[np.nan],
[np.nan],
[72294.5]]
df = pd.DataFrame(x)
df = (df[0].bfill() + df[0].ffill())/2
df
>>>
0 72829.00
1 73012.50
2 73196.00
3 73087.00
4 72690.75
5 72690.75
6 72294.50
In[0]:
import pandas as pd
series = pd.Series([72829,
None,
73196,
73087,
None,
None,
72294.5])
series.interpolate(method='linear')
Out[0]:
0 72829.000000
1 73012.500000
2 73196.000000
3 73087.000000
4 72822.833333
5 72558.666667
6 72294.500000
dtype: float64
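Since the question starts from a 2D numpy array rather than a Series, the same linear interpolation can also be sketched with np.interp directly. This assumes a single-column array as in the question; the variable names are illustrative only.

```python
import numpy as np

arr = np.array([[72829], [np.nan], [73196], [73087],
                [np.nan], [np.nan], [72294.5]])

flat = arr.ravel()             # work on a 1-D view of the single column
idx = np.arange(flat.size)     # positions 0..n-1 serve as the x axis
known = ~np.isnan(flat)        # True where a value is present

# Interpolate linearly at every position from the known points,
# then restore the original (n, 1) shape.
filled = np.interp(idx, idx[known], flat[known]).reshape(arr.shape)
print(filled)
```

The filled values agree with the series.interpolate(method='linear') output above; known values are left untouched.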

Create a 2-dimensional NumPy array with 1 row and 2 columns

Is it possible to create a 2-dimensional NumPy array with 1 row and 2 columns (row vector)?
This is what I'm doing (from the documentation), but I'd like to know if it's possible to do it in one (easier) step:
X_new2 = np.array([8.5,156])
X_new2 = X_new2[np.newaxis, :]
I've also tried:
X_new2 = np.array([[8.5], [156]])
But this is returning a column instead.
You can use the following syntax to achieve the same result as in your example:
X_new2 = np.array([[8.5,156]])
(Notice the extra [ and ] to make the array the correct shape.)
Alternatively, np.expand_dims adds the leading axis (here x is the 1-D array np.array([8.5, 156])):
y = np.expand_dims(x, axis=0)
print(y.shape)
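The options mentioned in this thread can be compared side by side by checking their shapes. This is a small sketch; reshape(1, -1) is an additional variant not mentioned in the answers above.

```python
import numpy as np

x = np.array([8.5, 156])            # 1-D array, shape (2,)

a = np.array([[8.5, 156]])          # nested list: one row, two columns
b = x[np.newaxis, :]                # the approach from the question
c = np.expand_dims(x, axis=0)       # the expand_dims answer
d = x.reshape(1, -1)                # extra variant: -1 infers the column count

col = np.array([[8.5], [156]])      # the attempt that produced a column

for name, arr in [("a", a), ("b", b), ("c", c), ("d", d), ("col", col)]:
    print(name, arr.shape)
```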

Element-by-element division in pandas dataframe with "/"?

It would be great to understand how this actually works. Perhaps there is something in Python/Pandas that I don't quite understand.
I have a dataframe (price data) and would like to calculate the returns. Rows are the stocks while columns are the dates.
For simplicity, I have created the prices with some random numbers.
import pandas as pd
import numpy as np
df_price = pd.DataFrame(np.random.rand(10,10))
df_ret = df_price.iloc[:,1:]/df_price.iloc[:,:-1]-1
There are two things I find strange here:
My numerator and denominator are both 10 x 9. Why is the output 10 x 10, with the first column being NaNs?
Why are the results all 0, apart from the first column being NaNs? I.e. why wasn't the calculation performed?
Thanks.
When dividing, pandas first aligns the index and columns of df_price.iloc[:,1:] and df_price.iloc[:,:-1], so each column is divided by the column with the same label, not by its neighbour. Adding .values to one operand strips the labels, so the division is done by position and gives the expected result.
df_ret = df_price.iloc[:,1:]/df_price.iloc[:,:-1].values-1
Example
s=pd.Series([2,4,6])
s.iloc[1:]/s.iloc[:-1]
Out[54]:
0 NaN # index 0 exists only in s.iloc[:-1]
1 1.0
2 NaN # index 2 exists only in s.iloc[1:]
dtype: float64
From the above we can see that pandas objects are matched on their index first, much like an outer join.
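The difference between label-aligned and positional division can be seen on a small frame. The sketch below uses made-up prices; the shift-based variant is an alternative added here for comparison and is not part of the answer above.

```python
import numpy as np
import pandas as pd

# Two stocks (rows), three dates (columns); values are made up.
df_price = pd.DataFrame([[1.0, 2.0, 4.0],
                         [10.0, 5.0, 5.0]])

# Label-aligned: column 1 is divided by column 1, column 0 and column 2
# have no partner on the other side, so they become NaN.
aligned = df_price.iloc[:, 1:] / df_price.iloc[:, :-1] - 1

# Positional: .values drops the labels, so column 1 is divided by column 0
# and column 2 by column 1.
positional = df_price.iloc[:, 1:] / df_price.iloc[:, :-1].values - 1

# Alternative: shift the columns by one and divide with labels intact.
shifted = (df_price / df_price.shift(1, axis=1) - 1).iloc[:, 1:]

print(aligned)
print(positional)
```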

Why do these two arrays have the same shape?

So I am trying to create an array and then access the columns by name. So I came up with something like this:
import numpy as np
data = np.ndarray(shape=(3,1000),
                  dtype=[('x',np.float64),
                         ('y',np.float64),
                         ('z',np.float64)])
I am confused as to why
data.shape
and
data['x'].shape
both come back as (3,1000). This is causing me issues when I try to populate my data fields:
data['x'] = xvalues
where xvalues has a shape of (1000,). Is there a better way to do this?
The shapes come out the same because data has more structure than shape alone reveals.
Example:
data[0][0] returns (the values are arbitrary because np.ndarray does not initialize the memory):
(6.9182540632428e-310, 6.9182540633353e-310, 6.9182540633851e-310)
while data['x'][0][0] returns:
6.9182540632427993e-310
So data contains 3 rows and 1000 columns, and each element is a 3-tuple.
data['x'] is the first element of that tuple at every one of the 3 x 1000 positions, so its shape is (3,1000) as well.
Just set shape=(1000,). The three-field dtype already provides the 3 columns.
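A minimal sketch of that suggestion follows. np.zeros is used here instead of np.ndarray so that the fields start out initialized, and xvalues is a stand-in for the asker's data.

```python
import numpy as np

dtype = [("x", np.float64), ("y", np.float64), ("z", np.float64)]

# One record per point: 1000 records, each with fields x, y and z.
data = np.zeros(1000, dtype=dtype)

xvalues = np.linspace(0.0, 1.0, 1000)   # stand-in data of shape (1000,)
data["x"] = xvalues                      # shapes now match

print(data.shape, data["x"].shape)
print(data[0])                           # a single record with three fields
```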

Pythonic way to compare sign of numpy array with Dataframe

I have a pandas data frame, df. The contents of the first row are as follows:
0 -1387.900
1 -1149.000
2 1526.300
3 1306.300
4 1134.300
5 -1077.200
6 -734.890
7 -340.870
8 -268.970
9 -176.070
10 -515.510
11 283.440
12 -55.148
13 -1701.800
14 -63.294
15 -270.720
16 2216.800
17 4251.200
18 1459.000
19 -613.680
This is basically a Series. I also have a 1x20 numpy array, as follows:
array([[ 1308.22000654, -920.02730748, 1285.54273707, -1119.67498439,
789.50281435, -331.14325768, 756.67399745, -101.9251545 ,
157.17779635, -333.17043669, -191.10517521, -127.80219696,
698.32168135, 154.30798847, -1055.54268665, -1795.96042107,
202.53471769, 25.58830318, 793.63902134, 220.94259961]])
Now, for each cell in this top row of df, I need to check whether its sign matches the sign of the corresponding cell of the numpy array above. If the signs differ, the sign of every value in that column of df should be flipped. For example, look at the first cell: df has -1387 while the numpy array has 1308, so the whole first column of df should have its sign reversed. The same applies to the other columns.
I am currently doing it with a for loop, like this:
for x in range(20):
    if np.sign(Y1[0][x]) != np.sign(df.iloc[0, x]):  # .ix has been removed from pandas
        if np.sign(Y1[0][x]) == 0 and df.iloc[0, x] > 0:
            df[x] = df[x] * 1  # a zero in Y1 counts as positive: leave the column as it is
        else:
            df[x] = df[x] * (-1)
I also need to make sure that if np.sign(Y1[0][x]) is 0, it is treated as +1 rather than zero. I can add that condition to the code above, but the question is how to make this more pythonic.
EDIT: I have added the code I wrote, which seems to work and flips the signs of the df columns according to the conditions above. Any idea how to do this in a pythonic way?
EDIT II: I have one more question. My numpy array is supposed to be one-dimensional, but as you can see above it comes out two-dimensional, so I have to index each cell with two indexes. Why is that? This is how I created the array: a dot product of a 1x11025 row of a df with a 11025x20 matrix, which gives a 1x20 array, but it comes out as an array of arrays. Code to create the numpy array:
Y1=np.dot(X_smilie_norm[0:1],W)
X_smilie_norm is a 28x11025 pandas dataframe. I take just its first row and compute the dot product with W, which is a 11025x20 matrix. This gives a two-dimensional array, when all I want is a one-dimensional one so that I can access Y1 values with a single index.
Here is the code, but I don't know what result you want when the first row of df contains zero.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(-10, 10, (10, 12)))
sign = np.random.randint(-10, 10, 12)
df.loc[:, (df.iloc[0] >= 0) ^ (sign >= 0)] *= -1
You could use a mask and apply it to the dataframe
mask = (arr <= 0) != (df <= 0) # true if signs are different
df[mask] = -df[mask] # flip the signs on those members where mask is true
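Putting the pieces together, here is a hedged sketch of a column-wise version on small made-up data. It assumes that a zero counts as positive on both sides (the question only states this for the array), and it flattens Y1 with ravel() because slicing with [0:1] keeps a one-row, two-dimensional result.

```python
import numpy as np
import pandas as pd

# Made-up data: 2 rows, 4 columns.
df = pd.DataFrame([[-2.0, 3.0, 0.0, -1.0],
                   [5.0, -6.0, 7.0, 8.0]])
Y1 = np.array([[4.0, 1.0, -9.0, 0.0]])   # shape (1, 4), like the np.dot result

target = Y1.ravel()                      # 1-D view, single index access
first_row = df.iloc[0].to_numpy()

# True where the first row of df and the target disagree in sign,
# with zero treated as positive on both sides (assumption).
flip = (first_row >= 0) != (target >= 0)

# Flip the sign of every value in the selected columns.
df.loc[:, flip] *= -1
print(df)
```

Taking the first row as a Series (for example with X_smilie_norm.iloc[0]) before the dot product would also give a one-dimensional result, since the extra dimension comes from the one-row slice.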
