I have a pandas data frame, df. The contents of the first row are as follows:
0    -1387.900
1 -1149.000
2 1526.300
3 1306.300
4 1134.300
5 -1077.200
6 -734.890
7 -340.870
8 -268.970
9 -176.070
10 -515.510
11 283.440
12 -55.148
13 -1701.800
14 -63.294
15 -270.720
16 2216.800
17 4251.200
18 1459.000
19 -613.680
Which is basically a series. I have a (1x20) numpy array, as follows:
array([[ 1308.22000654, -920.02730748, 1285.54273707, -1119.67498439,
789.50281435, -331.14325768, 756.67399745, -101.9251545 ,
157.17779635, -333.17043669, -191.10517521, -127.80219696,
698.32168135, 154.30798847, -1055.54268665, -1795.96042107,
202.53471769, 25.58830318, 793.63902134, 220.94259961]])
Now what I want is this: for each cell of the top row of df, check whether its sign is the same as the sign of the corresponding cell of the numpy array above. If the signs differ, then flip the sign of every value in that column of df. For example, take the first cell: df has -1387.9 while the numpy array has 1308.22, so the first column of df should have its sign reversed. Same with the other columns.
I am doing it using a for loop, like this:
for x in range(20):
    if np.sign(Y1[0][x]) != np.sign(df.iloc[0, x]):
        if np.sign(Y1[0][x]) == 0 and np.sign(df.iloc[0, x]) > 0:
            df[x] = df[x] * 1  # a zero in Y1 counts as +1, so a positive df column stays as is
        else:
            df[x] = df[x] * (-1)  # signs differ, flip the whole column
I also need to make sure that if np.sign(Y1[0][x]) == 0 then the sign it takes is not zero but +1. I can add that condition to the above code, but the point is: how can I make it more pythonic?
EDIT: I have added the code I wrote, which seems to work fine and flips the signs of the df columns based on the conditions mentioned above. Any idea how to do this in a pythonic way?
EDIT II: I have one more doubt. My numpy array is supposed to be one-dimensional, but as you see above it comes out two-dimensional, and I have to access each cell with two indexes unnecessarily. Why is that? This is how I created the numpy array: the dot product of one 1x11025 row of a df with an 11025x20 matrix should give a 1x20 array, but it comes out as an array of arrays, as you see above. Code to create the numpy array:
Y1=np.dot(X_smilie_norm[0:1],W)
X_smilie_norm is a 28x11025 pandas dataframe. I am accessing just its first row and taking the dot product with W, which is an 11025x20 matrix. This gives a two-dimensional array when all I want is a one-dimensional one, so that I could access the Y1 values with a single index.
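A minimal sketch of the shape issue (using random stand-in data for X_smilie_norm and W): slicing with [0:1] returns a 1x11025 frame, so the dot product stays two-dimensional, while selecting the row with .iloc[0] gives a one-dimensional result.
import numpy as np
import pandas as pd
X_smilie_norm = pd.DataFrame(np.random.randn(28, 11025))  # stand-in for the real data
W = np.random.randn(11025, 20)
Y1 = np.dot(X_smilie_norm[0:1], W)          # the slice keeps 2 dimensions -> shape (1, 20)
Y1_flat = np.dot(X_smilie_norm.iloc[0], W)  # row selection gives 1-D -> shape (20,)
print(Y1.shape, Y1_flat.shape)              # Y1.ravel() would also flatten the 2-D result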
Here is the code, but I don't know what result you want when the first row of df contains zero.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(-10, 10, (10, 12)))
sign = np.random.randint(-10, 10, 12)
df.loc[:, (df.iloc[0] >= 0) ^ (sign >= 0)] *= -1
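Applied to your variables (assuming Y1 is the (1, 20) array, and treating a zero as positive, as you wanted), this would be something like:
df.loc[:, (df.iloc[0] >= 0) ^ (Y1[0] >= 0)] *= -1
The ^ (XOR) is True exactly where the two sign tests disagree, so only those columns get flipped.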
You could use a mask and apply it to the dataframe:
mask = (arr <= 0) != (df <= 0)  # True where the signs differ (arr being your 1-D reference array)
df[mask] = -df[mask]            # flip the signs where the mask is True
I have a Dataframe with several rows and columns, and I have transformed it into a numpy array to speed up the calculations.
The first five columns of the Dataframe looked like this:
par1 par2 par3 par4 par5
1.502366 2.425301 0.990374 1.404174 1.929536
1.330468 1.460574 0.917349 1.172675 0.766603
1.212440 1.457865 0.947623 1.235930 0.890041
1.222362 1.348485 0.963692 1.241781 0.892205
...
These columns are now stored in a numpy array a = df.values
I need to check whether at least two of the five columns satisfy a condition (i.e., their value is larger than a certain threshold). Initially I wrote a function that performed the operation directly on the dataframe. However, because I have a very large amount of data and need to repeat the calculations over and over, I switched to numpy to take advantage of the vectorization.
To check the condition I was thinking of using
df['Result'] = np.where(condition_on_parameters > 2, True, False)
However, I cannot figure out how to write condition_on_parameters such that it returns True or False when at least 2 out of the 5 parameters are larger than the threshold. I thought of using the sum() function on condition_on_parameters, but I am not sure how to write such a condition.
EDIT
It is important to specify that the thresholds are different for each parameter. For example thr1=1.2, thr2=2.0, thr3=1.5, thr4=2.2, thr5=3.0. So I need to check that par1 > thr1, par2 > thr2, ..., par5 > thr5.
Assuming condition_on_parameters returns an array the same size as a with entries True or False, you can use np.sum(condition_on_parameters, axis=1) to sum the true values of each row (True has a numerical value of 1). This gives a 1D array whose entries are the number of columns meeting the condition in each row, which can then be compared against the count you need:
df['result'] = np.sum(condition_on_parameters, axis=1) >= 2
(Since "at least 2" means two or more, the comparison is >= 2; np.where with a single argument would return indices rather than a boolean column, so assign the comparison directly.)
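With the per-parameter thresholds from the edit, a minimal sketch (assuming df holds the columns par1..par5 as shown) could look like:
import numpy as np
a = df[['par1', 'par2', 'par3', 'par4', 'par5']].values
thresholds = np.array([1.2, 2.0, 1.5, 2.2, 3.0])  # thr1..thr5 from the edit
condition_on_parameters = a > thresholds          # broadcasts: each column against its own threshold
df['result'] = np.sum(condition_on_parameters, axis=1) >= 2  # at least 2 of the 5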
Can you exploit pandas functionalities? For example, you can efficiently check conditions on multiple rows/columns with .apply and then .sum(axis=1).
Here is some sample code:
import pandas as pd
df = pd.DataFrame([[1.50, 2.42, 0.88], [0.98,1.3, 0.56]], columns=['par1', 'par2', 'par3'])
# custom_condition, e.g. value less or equal than threshold
def leq(x, t):
    return x <= t
condition = df.apply(lambda x: leq(x, 1)).sum(axis=1)
# filter
df.loc[condition >=2]
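If you need the different per-parameter thresholds from the edit, one possible variant (the threshold values here are only illustrative) is:
thresholds = {'par1': 1.2, 'par2': 2.0, 'par3': 1.5}  # illustrative values
condition = df.apply(lambda col: col > thresholds[col.name]).sum(axis=1)
df.loc[condition >= 2]
Each column is compared against its own threshold via col.name, and rows where at least two comparisons hold are kept.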
I think this should be roughly equivalent to numpy in terms of efficiency, as pandas is ultimately built on top of it, however I'm not entirely sure...
It seems you are looking for numpy.any
import numpy as np
import pandas as pd

a = np.array([[1.502366, 2.425301, 0.990374, 1.404174, 1.929536],
              [1.330468, 1.460574, 0.917349, 1.172675, 0.766603],
              [1.212440, 1.457865, 0.947623, 1.235930, 0.890041],
              [1.222362, 1.348485, 0.963692, 1.241781, 0.892205]])
df = pd.DataFrame(a, columns=[f'par{i}' for i in range(1, 6)])
df['Result'] = np.any(df > 1.46, axis=1) # append the result column
This gives a dataframe with a boolean Result column appended.
We have two dataframes; the first one contains some float values (which represent average speed).
0 1 2
1 15.610826 19.182879 6.678087
2 13.740250 15.666897 17.640749
3 2.379010 2.889702 2.955097
4 20.540628 9.661226 9.479921
And another dataframe with geographical coordinates, where the average speed takes place.
0 1 2
1 [52.2399255, 21.0654495] [52.23893150000001, 21.06087] [52.23800850000001,21.056779]
2 [52.2449705, 21.0755175] [52.2452905, 21.075118000000003] [52.245557500000004, 21.0748175]
3 [52.2401885, 21.012981500000002] [52.239134, 21.009432] [52.238420500000004, 21.007080000000002]
4 [52.221506500000004, 20.9665085] [52.222458, 20.968952] [52.224409, 20.969248999999998]
Now I want to create a list with the coordinates where the average speed is above 18; in this case it would be
list_above_18=[[52.23893150000001, 21.06087] , [52.221506500000004, 20.9665085]]
How can I select values from a dataframe based on values in another dataframe?
You can use enumerate to zip the dataframes and work on the elements separately. See below (A, B are your dataframes, in the same order you provided them):
list_above_18 = []
p = list(enumerate(zip(A.values, B.values)))
for i in p:
    for k in range(3):
        if i[1][0][k] > 18:
            list_above_18.append(i[1][1][k])
Output:
>>>print(list_above_18)
[[52.23893150000001, 21.06087] , [52.221506500000004, 20.9665085]]
Assuming the shape of the average speed dataframe will remain the same as the coordinates dataframe, you can try the below:
coord_df[data_df.iloc[:,:] > 18].T.stack().values
Here,
coord_df = DataFrame with coordinate values
data_df = Average Speed values
This would return a numpy array with just the coordinate values where the Average speed is greater than 18
How this works:
data_df.iloc[:,:] > 18
creates a dataframe mask in which all values that are not greater than 18 are marked False and the rest True.
coord_df[data_df.iloc[:,:] > 18]
passes the mask to the target dataframe, i.e. the coordinate dataframe, which results in a dataframe showing coordinate values only for those cells where the mask is True, i.e. where the average speed was above 18.
.T.stack().values
then retrieves only the non-null values from the resulting dataframe and returns a numpy array.
References:
Get non-null elements in a pandas DataFrame --- to get only the non-null values from a dataframe (.T.stack().values)
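A tiny self-contained demo of the mechanics, with made-up numbers:
import pandas as pd
data_df = pd.DataFrame([[15.6, 19.2, 6.7], [20.5, 9.7, 9.5]])
coord_df = pd.DataFrame([[[52.24, 21.07], [52.24, 21.06], [52.24, 21.06]],
                         [[52.22, 20.97], [52.22, 20.97], [52.22, 20.97]]])
print(coord_df[data_df > 18].T.stack().values)  # only the coordinates whose speed is above 18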
Let the first df be df1 and the second df be df2:
output_array = df2[df1>18].values.flatten() # df1>18 would create the mask
output_array = [val for val in output_array if type(val) == list] # removing the nan values. We can't use np.isnan as it would not work for list
Sample output, given a speed dataframe df1 and a coordinate dataframe df2 (the sample inputs are not reproduced here):
output_array
[[15.1, 20.5], [91.5, 95.8]]
So I have an array X which is (398, 5).
I am trying to replace all missing values in this array with 0's and print out the last 15 values of the attribute with missing values.
I converted X into a numpy array from a dataframe. I am told that I will be able to tell which attribute has missing values by looking at the DataFrame info I generated earlier.
My dataframe is X_df
I'm a bit confused by this so any help would be appreciated.
Edit:
For more clarification: I had a dataframe with NaN values called X_df.
I turned that into a numpy array called X
I then replaced all nan values of X with 0 using the code below. He wants me to print out the last 15 changed rows. That is where I am a bit stuck
index = np.isnan(X)
X[index] = 0
On a DataFrame:
df.where(~np.isnan(df), 0) # replace NaNs with 0
df.tail(15) # show last 15 rows
On a numpy ndarray:
a[np.where(np.isnan(a))] = 0 # Set NaNs to 0
a[-15:, :] # Last 15 rows
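To tie it back to the question (finding which attribute had the missing values and printing its last 15 entries), a minimal sketch, assuming X is the (398, 5) float array:
import numpy as np
had_nan = np.isnan(X).any(axis=0)  # per-column flag: did this attribute contain NaNs?
X[np.isnan(X)] = 0                 # replace the NaNs with 0
for col in np.where(had_nan)[0]:
    print(X[-15:, col])            # last 15 values of each attribute that had NaNs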
I would like to create a somewhat dynamic query based on a numpy array. Ideally, the indices of one array (uu) should be returned, given the conditions for each row in a second array (cond).
The sample below demonstrates what I have in mind, and it works using a loop. I am wondering if there is a more efficient method. Thanks for your help.
import numpy as np
# create an array that has four rows, each containing a vector (here: identical vectors)
n = 101
u = np.linspace(0,100,n)
uu = np.ones((4, n)) * u
# ideally I would like the indices of uu
# (so I could access another array of the same shape as uu)
# that meets the following conditions:
# the condition based on which values in uu should be selected:
# in the first row (index 0) all values <= 10. should be selected
# in the second row (index 1) all values <= 20. should be selected
cond = np.array([10,20])
# this gives the correct indices, but in a series of 1D solutions
# this would work as a work-around
for i in range(cond.size):
    ix = np.where(uu[i, :] <= cond[i])
    print(ix)
# this seems like a work-around using True/False
# but here I am not sure how to best convert this to indices
for i in range(cond.size):
    uu[i, :] = uu[i, :] <= cond[i]
print(uu)
NumPy allows you to compare arrays directly:
import numpy as np
# just making my own uu with random numbers
n = 101
uu = np.random.rand(n,4)
# then if you have an array or a list of values...
cond = [.2,.5,.7,.8]
# ... you can directly compare the array to it
comparison = uu <= cond
# comparison now has True/False values, and the shape of uu
# you can directly use comparison to get the desired values in uu
uu[comparison] # gives the values of uu where comparison is True
# but if you really need the indices you can do
np.where(comparison)
#returns 2 arrays containing i-indices and j-indices where comparison is True
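Note this sketch uses an (n, 4) layout; with the question's (4, n) orientation you can keep one threshold per row by adding an axis so the broadcasting lines up. A minimal sketch (extending cond to four values for illustration):
import numpy as np
n = 101
uu = np.ones((4, n)) * np.linspace(0, 100, n)
cond = np.array([10, 20, 30, 40])    # one threshold per row
comparison = uu <= cond[:, None]     # (4, n) compared against (4, 1) broadcasts row-wise
i_idx, j_idx = np.where(comparison)  # row and column indices where the condition holds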
I've run into an odd problem yet again.
Suppose I have the following dummy data frame (by way of demonstrating my problem):
import numpy as np
import pandas as pd
import string
# Test data frame
N = 3
col_ids = string.ascii_uppercase[:N]
df = pd.DataFrame(
np.random.randn(5, 3*N),
columns=['{}_{}'.format(letter, coord) for letter in col_ids for coord in list('xyz')])
df
This produces:
A_x A_y A_z B_x B_y B_z C_x C_y C_z
0 -1.339040 0.185817 0.083120 0.498545 -0.569518 0.580264 0.453234 1.336992 -0.346724
1 -0.938575 0.367866 1.084475 1.497117 0.349927 -0.726140 -0.870142 -0.371153 -0.881763
2 -0.346819 -1.689058 -0.475032 -0.625383 -0.890025 0.929955 0.683413 0.819212 0.102625
3 0.359540 -0.125700 -0.900680 -0.403000 2.655242 -0.607996 1.117012 -0.905600 0.671239
4 1.624630 -1.036742 0.538341 -0.682000 0.542178 -0.001380 -1.126426 0.756532 -0.701805
Now I would like to use scipy.spatial.distance.pdist on this pandas data frame. This turns out to be a rather non-trivial process. What pdist does is compute the distance between m points, using Euclidean distance (2-norm) as the distance metric; the points are arranged as m n-dimensional row vectors in the matrix X (source).
So, there are a couple of things one has to do to create a function that operates on a pandas data frame such that the pdist function can be used. You will note that pdist is convenient when the number of points gets very large. I've tried making my own, which works for a one-row data frame, but I cannot get it to work, ideally, on the whole data frame at once.
Here's my attempt:
from scipy.spatial.distance import pdist, squareform
import numpy as np
import pandas as pd
import string
def Euclidean_distance(df):
    EcDist = pd.DataFrame(index=df.index)  # results container
    arr = df.values  # store the data frame values in a numpy array
    tag_list = [num for elem in arr for num in elem]  # flatten the array into a single list
    tag_list_3D = list(zip(*[iter(tag_list)] * 3))  # split into length-3 sub-lists that pdist() can work with
    EcDist = pdist(tag_list_3D)  # the distance between m points using Euclidean distance (2-norm)
    return EcDist
First I begin by creating a results container in pandas form to store the result in. Secondly I save the pandas data frame as a numpy array, in order to get it into list form in the next step. It has to be in list form because the pdist function only operates on lists. When saving the data frame into an array, it stores it as a list within a list. This has to be flattened, which is saved in the tag_list variable. Thirdly, tag_list is further reduced into sub-lists of length three, such that the x, y and z coordinates can be obtained for each point, which can then be used to find the Euclidean distance between all of these points (in this example there are three points: A, B and C, each being three-dimensional).
As said, the function works if the data frame is a single row, but when using the function on the given example it calculates the Euclidean distance for 5x3 points, which yields a total of 105 distances. What I want it to do is calculate the distances per row (so pdist should only work on the three points of one row at a time), such that my final result, for this example, would look something like this:
dist_1 dist_2 dist_3
0 0.807271 0.142495 1.759969
1 0.180112 0.641855 0.257957
2 0.196950 1.334812 0.638719
3 0.145780 0.384268 0.577387
4 0.044030 0.735428 0.549897
(these are just dummy numbers to show the desired shape)
Hence how do I get my function to apply to the data frame in a row-wise fashion?
Or better yet, how can I get it to perform the function on the entire data frame at once, and then store the result in a new data frame?
Any help would be very appreciated. Thanks.
If I understand correctly, you have "groups" of points. In your example each group has three points, which you call A, B and C. A is represented by three columns A_x, A_y, A_z, and likewise for B and C.
What I suggest is that you restructure your "wide-form" data into a "long" form in which each row contains only one point. Each row then will have only three columns for the coordinates, and then you will add an additional column to represent which group a point is in. Here's an example:
>>> d = pandas.DataFrame(np.random.randn(12, 3), columns=["X", "Y", "Z"])
>>> d["Group"] = np.repeat([1, 2, 3, 4], 3)
>>> d
X Y Z Group
0 -0.280505 0.888417 -0.936790 1
1 0.823741 -0.428267 1.483763 1
2 -0.465326 0.005103 -1.107431 1
3 -1.009077 -1.618600 -0.443975 2
4 0.535634 0.562617 1.165269 2
5 1.544621 -0.858873 -0.349492 2
6 0.839795 0.720828 -0.973234 3
7 -2.273654 0.125304 0.469443 3
8 -0.179703 0.962098 -0.179542 3
9 -0.390777 -0.715896 -0.897837 4
10 -0.030338 0.746647 0.250173 4
11 -1.886581 0.643817 -2.658379 4
The three points with Group==1 correspond to A, B and C in your first row; the three points with Group==2 correspond to A, B, and C in your second row; etc.
With this structure, computing the pairwise distances by group using pdist (here accessed as distance.pdist, via from scipy.spatial import distance) becomes straightforward:
>>> d.groupby('Group')[["X", "Y", "Z"]].apply(lambda g: pandas.Series(distance.pdist(g), index=["D1", "D2", "D3"]))
D1 D2 D3
Group
1 2.968517 0.918435 2.926395
2 3.119856 2.665986 2.309370
3 3.482747 1.314357 2.346495
4 1.893904 2.680627 3.451939
It is possible to do a similar thing with your existing setup, but it will be more awkward. The problem with the way you set it up is that you have encoded critical information in a difficult-to-extract way: the information about which columns are X coordinates and which are Y or Z coordinates, as well as which columns refer to point A versus B or C, is encoded in the textual names of the columns. You as a human can see which columns are X values just by looking at them, but specifying that programmatically requires parsing the string names of the columns.
You can see this in how you made the column names with your '{}_{}'.format(letter, coord) business. It means that in order to use pdist on your data, you would have to do the reverse operation, parsing the column names as strings to decide which columns to compare. Needless to say, this would be awkward. On the other hand, if you put the data into "long" form, there is no such difficulty: the X coordinates of all points line up in one column, and likewise for Y and Z, and the information about which points are to be compared is also contained in one column (the "Group" column).
When you want to do large-scale operations on subsets of data, it's usually better to split out things into separate rows. This allows you to leverage the power of groupby, and is also usually what is expected by scipy tools.
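For what it's worth, here is a minimal sketch of that wide-to-long conversion for the original frame, assuming the 'letter_coord' column naming holds throughout:
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

# rebuild a wide frame like the one in the question
cols = ['{}_{}'.format(letter, coord) for letter in 'ABC' for coord in 'xyz']
df = pd.DataFrame(np.random.randn(5, 9), columns=cols)

# split 'A_x'-style names into a (point, coord) MultiIndex, then stack points into rows
wide = df.copy()
wide.columns = pd.MultiIndex.from_tuples(
    [tuple(c.split('_')) for c in wide.columns], names=['point', 'coord'])
long_df = wide.stack(level='point').reset_index(level='point')

# one pdist call per original row: distances A-B, A-C, B-C
result = long_df.groupby(level=0)[['x', 'y', 'z']].apply(
    lambda g: pd.Series(pdist(g.values), index=['dist_1', 'dist_2', 'dist_3']))
This reproduces the dist_1/dist_2/dist_3 layout the question asked for, one row of distances per original row.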