I have a 2D numpy array with missing data, and I want to fill the gaps in a way that gives mathematical uniformity to the array. I have something like this:
[[72829],
[nan],
[73196],
[73087],
[nan],
[nan],
[72294.5]]
I want to fill those empty cells with the mean of the closest cells, returning something like this:
[[72829],
[73012.5],
[73196],
[73087],
[72888.875],
[72492.625],
[72294.5]]
I tried to use SimpleImputer and KNNImputer from scikit-learn, but all I got was the same value for every missing cell, not the mean of the neighboring cells as described above. This is the code:
for label, column in data.items():  # iteritems() was removed in pandas 2.0
    reshaped = np.array(column.values)  # creating a np array to use with scikit-learn
    reshaped = reshaped.reshape(-1, 1)  # reshaping to the 2D array the imputer expects
    normalized = imputer.fit_transform(reshaped)  # imputing the missing values
    data[label] = normalized  # replacing the column with the imputed values
With KNNImputer, I got something like this (which is not what I want):
[[72829],
[68088.71106114],
[73196],
[73087],
[68088.71106114],
[68088.71106114],
[72294.5]]
Does anyone know an idea or an algorithm that could give a "uniformity" to the array values like this? The idea is that this method's output lets me plot graphs without missing data. Something with pandas/numpy/scikit-learn would be preferred, thanks.
Convert the data to a DataFrame and use b(ackward)fill and f(orward)fill:
x = [[72829],
[np.nan],
[73196],
[73087],
[np.nan],
[np.nan],
[72294.5]]
df = pd.DataFrame(x)
df = (df[0].bfill() + df[0].ffill())/2
df
>>>
0 72829.00
1 73012.50
2 73196.00
3 73087.00
4 72690.75
5 72690.75
6 72294.50
In[0]:
import pandas as pd
series = pd.Series([72829,
None,
73196,
73087,
None,
None,
72294.5])
series.interpolate(method='linear')
Out[0]:
0 72829.000000
1 73012.500000
2 73196.000000
3 73087.000000
4 72822.833333
5 72558.666667
6 72294.500000
dtype: float64
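One edge case worth knowing (my note, not part of the original answer): by default, linear interpolation leaves leading NaNs untouched, since there is no earlier value to interpolate from. If the series can start with a gap, limit_direction also fills it, using the nearest valid value; a small sketch:
import pandas as pd

series = pd.Series([None, 72829, None, 73196])

series.interpolate(method='linear')
# 0        NaN   <- leading NaN stays NaN
# 1    72829.0
# 2    73012.5
# 3    73196.0

series.interpolate(method='linear', limit_direction='both')
# 0    72829.0   <- filled with the nearest valid value
# 1    72829.0
# 2    73012.5
# 3    73196.0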
I have two data frames:
import pandas as pd
import numpy as np
sgRNA = pd.Series(["ABL1_sgABL1_130854834","ABL1_sgABL1_130862824","ABL1_sgABL1_130872883","ABL1_sgABL1_130884018"])
sequence = pd.Series(["CTTAGGCTATAATCACAATG","GGTTCATCATCATTCAACGG","TCAGTGATGATATAGAACGG","TTGCTCCCTCGAAAAGAGCG"])
df1=pd.DataFrame(sgRNA,columns=["sgRNA"])
df1["sequence"]=sequence
df2 = pd.DataFrame(columns=["column"],
                   index=np.arange(len(df1) * 2))
I want to add values from both columns from df1 to df2 every other row, like this:
ABL1_sgABL1_130854834
CTTAGGCTATAATCACAATG
ABL1_sgABL1_130862824
GGTTCATCATCATTCAACGG
ABL1_sgABL1_130872883
TCAGTGATGATATAGAACGG
ABL1_sgABL1_130884018
TTGCTCCCTCGAAAAGAGCG
To do this for df1["sgRNA"] I used this code:
df2.iloc[0::2, :]=df1["sgRNA"]
But I get this error:
ValueError: could not broadcast input array from shape (4,) into shape (4,1).
What am I doing wrong?
I think you're looking for DataFrame.stack():
df2["column"] = df1.stack().reset_index(drop=True)
print(df2)
Prints:
column
0 ABL1_sgABL1_130854834
1 CTTAGGCTATAATCACAATG
2 ABL1_sgABL1_130862824
3 GGTTCATCATCATTCAACGG
4 ABL1_sgABL1_130872883
5 TCAGTGATGATATAGAACGG
6 ABL1_sgABL1_130884018
7 TTGCTCCCTCGAAAAGAGCG
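For intuition (my annotation, not part of the original answer): stack() pivots each row's columns into consecutive entries, which is exactly the interleaving the question asks for, and reset_index(drop=True) then flattens the resulting MultiIndex back to 0..7. The intermediate result looks roughly like this:
print(df1.stack())
# 0  sgRNA       ABL1_sgABL1_130854834
#    sequence     CTTAGGCTATAATCACAATG
# 1  sgRNA       ABL1_sgABL1_130862824
# ...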
Besides Andrej Kesely's superior solution, to answer the question of what went wrong in the code, it's really minor:
df1["sgRNA"] is a series, one-dimensional, while df2.iloc[0::2, :] is
a dataframe, two-dimensional.
The solution would be to make the "df2" part one-dimensional by selecting the
one and only column, instead of selecting a slice of "all one columns", so to
say:
df2.iloc[0::2, 0] = df1["sgRNA"]
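For concreteness, a quick sketch of the shapes involved (using the df1/df2 setup from the question); alternatively, the two-dimensional slice can be kept if the right-hand side is converted to a matching (4, 1) array:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"sgRNA": ["a", "b", "c", "d"],
                    "sequence": ["w", "x", "y", "z"]})
df2 = pd.DataFrame(columns=["column"], index=np.arange(len(df1) * 2))

print(df1["sgRNA"].shape)       # (4,)   -> one-dimensional Series
print(df2.iloc[0::2, :].shape)  # (4, 1) -> two-dimensional DataFrame slice

df2.iloc[0::2, 0] = df1["sgRNA"].to_numpy()       # 1D values into a 1D target
df2.iloc[1::2, :] = df1[["sequence"]].to_numpy()  # (4, 1) values into a (4, 1) target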
I have a very large numpy array with 2 dimensions, one with 369 elements and the other with 370 elements. They are all floats.
The array looks like this:
[[[-18.08621204 -18.08622591 ... -18.08850475 -18.08850187]
  ...
  [-45.95094274 -45.94523995 ... -44.90436858 -44.90151675]]]
My final output would be to have something like this:
Column1 Column2
0 -18.473131 -45.404821
1 -18.475842 -45.404828
2 -18.478553 -45.404834
3 -18.481265 -45.404841
4 -18.483976 -45.404847
I have no idea how to achieve this, though. My (terrible) attempt went as follows: I flattened the array and turned it into a dictionary, then I turned that into a pandas dataframe like this and named my desired column:
data = pd.DataFrame.from_dict(di, orient='index')
data.columns = ['Column 1']
Then what I did to get something similar to the example I gave above was:
data['Column1'] = data['Column2'] < -44
This results in:
Column1 Column2
1 -18.086212 False
2 -18.086226 False
... ... ...
273056 -44.910072 True
273057 -44.907220 True
273058 -44.904369 True
The row organization I gave in the example is very important and must be kept, since the rows represent converted coordinates. I achieved my first example by converting selected points, but ideally it needs to be done in bulk, which gives me the numpy array first mentioned.
EDIT:
So before doing everything I described above I had this DF:
Column1 Column2 RGBA
0 0 0 (0, 0, 0, 255)
1 0 1 (0, 0, 0, 255)
...
136529 369 368 (255, 255, 255, 255)
Then I applied the conversions described here: Converting X, Y to lat and long, like this:
xx, yy = cell_transform * np.meshgrid(np.arange(369), np.arange(368))
bulkx_proj = xx
bulky_proj = yy
yy_latlng, xx_latlng = proj_latlng.transform(bulkx_proj, bulky_proj)
xx_latlng and yy_latlng are numpy arrays, and I've verified that the values indeed follow the proper order from my original dataframe, so that part works as expected. I then tried to store the values in my dataframe, without changing the order, by doing this:
df['Column1'] = xx_latlng
df['Column2'] = yy_latlng
But then it raises this ValueError: Length of values (368) does not match length of index (136530), exactly at this point. What I expected, and want, is for each value of the numpy arrays to be stored in the dataframe under the columns I specified.
I was able to solve it myself. I flattened the numpy arrays, turned each one into a Series, and then concatenated the Series; the order of the data was preserved. This was achieved using flatten() on the arrays, pd.Series(one_dimensional_array), and finally pd.concat([series1, series2]).
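A minimal sketch of that approach (the array contents here are stand-ins for the projected coordinates); note that axis=1 in pd.concat is what places the two flattened arrays side by side as separate columns:
import numpy as np
import pandas as pd

# Stand-ins for xx_latlng and yy_latlng from the question
xx_latlng = np.random.uniform(-19, -18, (368, 369))
yy_latlng = np.random.uniform(-46, -44, (368, 369))

# flatten() walks the arrays in row-major order, so the original ordering is kept
s1 = pd.Series(xx_latlng.flatten())
s2 = pd.Series(yy_latlng.flatten())

df = pd.concat([s1, s2], axis=1)  # side by side, one Series per column
df.columns = ["Column1", "Column2"]
print(df.head())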
So I have an array X which is (398, 5).
I am trying to replace all missing values in this array with 0s and print out the last 15 values of the attribute with missing values.
I converted X into a numpy array from a dataframe. I am told that I can tell which attribute has the missing values by looking at the DataFrame info I generated earlier.
My dataframe is X_df
I'm a bit confused by this so any help would be appreciated.
Edit:
For more clarification: I had a dataframe with NaN values called X_df.
I turned that into a numpy array called X.
I then replaced all NaN values of X with 0 using the code below. The assignment asks me to print out the last 15 changed rows; that is where I am a bit stuck.
index = np.isnan(X)
X[index] = 0
On a DataFrame:
df.where(~np.isnan(df), 0) # replace NaNs with 0
df.tail(15) # show last 15 rows
On a numpy ndarray:
a[np.where(np.isnan(a))] = 0 # Set NaNs to 0
a[-15:, :] # Last 15 rows
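To also identify which attribute has the missing values and print its last 15 entries (assuming the X_df and X from the question, with X inspected before the NaNs are replaced), a sketch:
import numpy as np

nan_counts = X_df.isna().sum()       # NaNs per column; a nonzero count marks the attribute
print(nan_counts)

col = nan_counts.idxmax()            # name of the column with missing values
col_idx = X_df.columns.get_loc(col)  # its integer position in the array X

X[np.isnan(X)] = 0                   # replace all NaNs with 0
print(X[-15:, col_idx])              # last 15 values of that attribute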
I have a dataset with 4 variables (Bearing 1 to Bearing 4) and 20,152,319 observations. It looks like this:
Now, I am trying to find the correlation matrix of the 4 variables. The code I use is this:
corr_mat = Data.corr(method = 'pearson')
print(corr_mat)
However in the result, I get the correlation information for only Bearing 2 to Bearing 4. Bearing 1 is nowhere to be seen. I am providing a snapshot of the result down below:
I have tried removing NULL values from each of the variables and also tried looking for missing values, but nothing works. What is interesting is that if I isolate the first two variables (Bearing 1 and Bearing 2) and then try to find the correlation matrix between them, Bearing 1 still does not come up and the matrix is a 1x1 matrix with only Bearing 2.
Any explanation on why this occurs and how to solve it would be appreciated.
Try to see if the first column 'Bearing 1' is numeric.
Data.dtypes # This will show the type of each column
cols = Data.columns # Saving column names to a variable
Data[cols] = Data[cols].apply(pd.to_numeric, errors='coerce') # Converting the columns to numeric (note the assignment back; apply alone does not modify Data)
Now apply your Calculations,
corr_mat = Data.corr(method = 'pearson')
print(corr_mat)
The dtype of the first column is object, so pandas omits it by default. The solution is to convert it to numeric:
Data['Bearing 1'] = Data['Bearing 1'].astype(float)
Or, if there are some non-numeric values, use to_numeric with errors='coerce' to parse those values to NaNs:
Data['Bearing 1'] = pd.to_numeric(Data['Bearing 1'], errors='coerce')
If you want to convert all columns to numeric:
Data = Data.astype(float)
Or:
Data = Data.apply(pd.to_numeric, errors='coerce')
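A small reproduction of the behavior with made-up data; the object-dtype column is invisible to the numeric correlation until it is converted:
import pandas as pd

Data = pd.DataFrame({"Bearing 1": ["0.1", "0.2", "0.3"],  # strings -> object dtype
                     "Bearing 2": [0.2, 0.1, 0.4]})

print(Data.dtypes)  # Bearing 1 shows up as 'object'

Data["Bearing 1"] = pd.to_numeric(Data["Bearing 1"], errors="coerce")
print(Data.corr(method="pearson"))  # now a full 2x2 correlation matrix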
I have a pandas data frame, df. The contents of the first row are as follows:
0 -1387.900
1 -1149.000
2 1526.300
3 1306.300
4 1134.300
5 -1077.200
6 -734.890
7 -340.870
8 -268.970
9 -176.070
10 -515.510
11 283.440
12 -55.148
13 -1701.800
14 -63.294
15 -270.720
16 2216.800
17 4251.200
18 1459.000
19 -613.680
Which is basically a series. I have a (1x20) numpy array, as follows:
array([[ 1308.22000654, -920.02730748, 1285.54273707, -1119.67498439,
789.50281435, -331.14325768, 756.67399745, -101.9251545 ,
157.17779635, -333.17043669, -191.10517521, -127.80219696,
698.32168135, 154.30798847, -1055.54268665, -1795.96042107,
202.53471769, 25.58830318, 793.63902134, 220.94259961]])
Now what I want is, for each cell of this top row of the df data frame, to check whether the sign of that cell matches the sign of the corresponding cell of the numpy array above. If the signs differ, then for every row of df, flip the sign of the value in that column. For example, take the first cell: df has -1387 while the numpy array has 1308.22, so the first column of df should have its sign reversed. Same with the other columns.
I am doing it using a for loop, like:
for x in range(20):
    if np.sign(Y1[0][x]) != np.sign(df.iloc[0, x]):   # .ix is deprecated; use .iloc
        if np.sign(Y1[0][x]) == 0 and np.sign(df.iloc[0, x]) > 0:
            df[x] = df[x] * 1     # sign(0) is treated as +1, so keep the column as is
        else:
            df[x] = df[x] * (-1)  # otherwise flip the whole column
I also need to make sure that if np.sign(Y1[0][x]) == 0 then the sign it takes is not zero but +1. I can add that condition to the code above, but the point is: how do I make this more pythonic?
EDIT: I have added the code I wrote, which seems to work fine and flips the signs of the df columns based on the conditions mentioned above. Any idea how to do this in a pythonic way?
EDIT II: I have one more doubt. My numpy array is supposed to be one-dimensional, but as you see above it comes out two-dimensional, and I have to access each cell with two indexes unnecessarily. Why is that? This is how I created the numpy array (a dot product of a 1x11025 row of a df with an 11025x20 matrix, giving a 1x20 array, but it comes out as an array of arrays as you see above):
Y1=np.dot(X_smilie_norm[0:1],W)
X_smilie_norm is a 28x11025 pandas dataframe. I am accessing just its first row and taking a dot product with W, which is an 11025x20 matrix. It gives a two-dimensional array when all I want is a one-dimensional one, so that I could access the Y1 values with a single index.
Here is the code, but I don't know what result you want when the first row of df contains zero.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(-10, 10, (10, 12)))  # sample data
sign = np.random.randint(-10, 10, 12)                    # reference signs
# XOR is True exactly where the signs differ; >= 0 treats zero as positive,
# which also covers the "sign(0) should count as +1" requirement
df.loc[:, (df.iloc[0] >= 0) ^ (sign >= 0)] *= -1
You could use a mask and apply it to the dataframe
mask = (arr <= 0) != (df <= 0) # true if signs are different
df[mask] = -df[mask] # flip the signs on those members where mask is true
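One caveat (my reading, not part of the original answer): because arr broadcasts against every row of df, this mask flips individual cells whose own sign differs from arr, not whole columns keyed off the first row as the question asks. A column-wise variant, under the same setup, might look like:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(-10, 10, (10, 20)))
arr = np.random.randint(-10, 10, (1, 20))

# Compare signs of the first row only; >= 0 treats zero as positive,
# matching the "sign(0) counts as +1" requirement from the question
cols_to_flip = (df.iloc[0].to_numpy() >= 0) != (arr[0] >= 0)
df.loc[:, cols_to_flip] *= -1  # flip the whole column where signs disagree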