numpy array converted to pandas dataframe drops values - python

I need to calculate statistics for each node of a 2D grid. I figured the easy way to do this was to take the cross join (a.k.a. Cartesian product) of two ranges. I implemented this using numpy in the following function:
import numpy as np
from itertools import product

def node_grid(x_range, y_range, x_increment, y_increment):
    x_min = float(x_range[0])
    x_max = float(x_range[1])
    x_num = int(round((x_max - x_min) / x_increment)) + 1  # np.linspace needs an integer sample count
    y_min = float(y_range[0])
    y_max = float(y_range[1])
    y_num = int(round((y_max - y_min) / y_increment)) + 1
    x = np.linspace(x_min, x_max, x_num)
    y = np.linspace(y_min, y_max, y_num)
    ng = list(product(x, y))  # cross join (Cartesian product) of the two ranges
    ng = np.array(ng)
    return ng, x, y
However, when I convert this to a pandas DataFrame, it drops values. For example:
In [2]: ng = node_grid(x_range=(-60, 120), y_range=(0, 40), x_increment=0.1, y_increment=0.1)
In [3]: ng[0][(ng[0][:,0] > -31) & (ng[0][:,0] < -30) & (ng[0][:,1]==10)]
Out[3]: array([[-30.9, 10. ],
[-30.8, 10. ],
[-30.7, 10. ],
[-30.6, 10. ],
[-30.5, 10. ],
[-30.4, 10. ],
[-30.3, 10. ],
[-30.2, 10. ],
[-30.1, 10. ]])
In [4]: node_df = pd.DataFrame(ng[0])
node_df.columns = ['xx','depth']
print(node_df[(node_df.depth==10) & node_df.xx.between(-30,-31)])
Out[4]: Empty DataFrame
Columns: [xx, depth]
Index: []
The dataframe isn't empty:
In [5]: print(node_df.head())
Out[5]: xx depth
0 -60.0 0.0
1 -60.0 0.1
2 -60.0 0.2
3 -60.0 0.3
4 -60.0 0.4
Values from the numpy array are being dropped when they are put into the pandas DataFrame. Why?

the "between" function demands that the first argument be less than the latter.
In: print(node_df[(node_df.depth==10) & node_df.xx.between(-31,-30)])
xx depth
116390 -31.0 10.0
116791 -30.9 10.0
117192 -30.8 10.0
117593 -30.7 10.0
117994 -30.6 10.0
118395 -30.5 10.0
118796 -30.4 10.0
119197 -30.3 10.0
119598 -30.2 10.0
119999 -30.1 10.0
120400 -30.0 10.0
For clarity, the product() function used comes from the itertools package, i.e., from itertools import product.
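To see the between behaviour in isolation, here is a minimal sketch with a small made-up Series (not the full grid): Series.between(left, right) keeps values where left <= value <= right, so passing the bounds in descending order can never match anything.

import pandas as pd

s = pd.Series([-30.9, -30.5, 10.0])
print(s.between(-31, -30))  # True, True, False -- lower bound first
print(s.between(-30, -31))  # all False -- no value can satisfy -30 <= v <= -31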

I can't fully reproduce your code, but I find that the problem is the order of the lower and upper boundaries in the between query: they have to be swapped. The following works for me:
print(node_df[(node_df.depth==10) & node_df.xx.between(-31,-30)])
when using:
ng = np.array([[-30.9, 10. ],
[-30.8, 10. ],
[-30.7, 10. ],
[-30.6, 10. ],
[-30.5, 10. ],
[-30.4, 10. ],
[-30.3, 10. ],
[-30.2, 10. ],
[-30.1, 10. ]])
node_df = pd.DataFrame(ng)

Related

Seaborn custom axis scale: matplotlib.scale.FuncScale

I'm trying to figure out how to get a custom scale for my axis. My x-axis goes from 0 to 1,000,000 in 100,000-step increments, but I want to scale each of these numbers by 1/1,000, so that they go from 0 to 1,000 in 100-step increments. I came across matplotlib.scale.FuncScale, but I'm having trouble getting it to work.
Here's what the plot currently looks like:
My code looks like this:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
dataPlot = pd.DataFrame({"plot1" : [1, 2, 3], "plot2" : [4, 5, 6], "plot3" : [7, 8, 9]})
ax = sns.lineplot(data = dataPlot, dashes = False, palette = ["blue", "red", "green"])
ax.set_xlim(1, numRows)
ax.set_xticks(range(0, numRows, 100000))
plt.ticklabel_format(style='plain')
plt.scale.FuncScale("xaxis", ((lambda x : x / 1000), (lambda y : y * 1000)))
When I run this code specifically, I get AttributeError: module 'matplotlib.pyplot' has no attribute 'scale', so I tried adding import matplotlib as mpl to the top of the code and changing the last line to mpl.scale.FuncScale("xaxis", ((lambda x : x / 1000), (lambda y : y * 1000))). That actually ran without error, but it didn't change anything.
How can I get this to properly scale the axis?
Based on the clarification in the question comments, a straightforward solution is to scale the x-axis data in the dataframe (the x-data in the question's case being the df index) and then plot.
I'm using example data since the code from the question wasn't running on its own.
The x starting range is 0 to 100 and is then scaled to 0 to 10, but that's equivalent to any other starting range and scaling.
1st, the default df.plot (just as a reference):
import pandas as pd
import numpy as np
arr = np.arange(0, 101, 1) * 1.5
df = pd.DataFrame(arr, columns=['y_data'])
print(df)
y_data
0 0.0
1 1.5
2 3.0
3 4.5
4 6.0
.. ...
96 144.0
97 145.5
98 147.0
99 148.5
100 150.0
df.plot()
Note that by default df.plot uses the index as the x-axis.
2nd, scaling the x-data in the dataframe:
The interim dfs are displayed only so you can follow along.
Preparation
df.reset_index(inplace=True)
This gets the original index data as a column to work with further (see Scaling below).
index y_data
0 0 0.0
1 1 1.5
2 2 3.0
3 3 4.5
4 4 6.0
.. ... ...
96 96 144.0
97 97 145.5
98 98 147.0
99 99 148.5
100 100 150.0
df = df.rename(columns = {'index':'x_data'}) # just to be more explicit
x_data y_data
0 0 0.0
1 1 1.5
2 2 3.0
3 3 4.5
4 4 6.0
.. ... ...
96 96 144.0
97 97 145.5
98 98 147.0
99 99 148.5
100 100 150.0
Scaling
df['x_data'] = df['x_data'].apply(lambda x: x/10)
x_data y_data
0 0.0 0.0
1 0.1 1.5
2 0.2 3.0
3 0.3 4.5
4 0.4 6.0
.. ... ...
96 9.6 144.0
97 9.7 145.5
98 9.8 147.0
99 9.9 148.5
100 10.0 150.0
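As a side note, and not part of the original answer: the same scaling can be done without apply, using vectorized division on the column.

df['x_data'] = df['x_data'] / 10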
3rd df.plot with specific columns:
df.plot(x='x_data', y = 'y_data')
With x= set, a specific column is used as the x-axis instead of the default (the index).
Note that the y data hasn't changed, but the x-axis is now scaled compared to the default df.plot in the 1st step above.
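For completeness, and not part of the original answer: if you would rather leave the data untouched and only change how the tick labels are displayed, matplotlib's FuncFormatter is another option. A minimal sketch with made-up data, assuming the goal is simply to show each x tick divided by 1,000:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

fig, ax = plt.subplots()
ax.plot(np.arange(0, 1000001, 100000), np.arange(0, 11))  # x runs 0..1,000,000
# display each x tick divided by 1000, so 100000 is labelled 100
ax.xaxis.set_major_formatter(FuncFormatter(lambda x, pos: f'{x / 1000:g}'))
plt.show()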

How do I mask only the output (labelled data)? I don't have any problem with the input data

I have many NaN values in my output data, and I padded those values with zeros. Please don't suggest deleting the NaNs or imputing them with any other number; I want the model to skip those NaN positions.
example:
x = np.arange(0.5, 30)
x.shape = [10, 3]
x = [[ 0.5 1.5 2.5]
[ 3.5 4.5 5.5]
[ 6.5 7.5 8.5]
[ 9.5 10.5 11.5]
[12.5 13.5 14.5]
[15.5 16.5 17.5]
[18.5 19.5 20.5]
[21.5 22.5 23.5]
[24.5 25.5 26.5]
[27.5 28.5 29.5]]
y = np.arange(2, 10, 0.8)
y.shape = [10, 1]
y[4, 0] = 0.0
y[6, 0] = 0.0
y[7, 0] = 0.0
y = [[2. ]
[2.8]
[3.6]
[4.4]
[0. ]
[6. ]
[0. ]
[0. ]
[8.4]
[9.2]]
I expect the Keras deep learning model to predict zeros for the 5th, 7th and 8th rows, similar to the padded values in 'y'.
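One common way to make the loss skip the padded rows is to give them a per-sample weight of zero. Below is a minimal sketch; the small dense regression model is an assumption for illustration, not part of the question:

import numpy as np
from tensorflow import keras

x = np.arange(0.5, 30).reshape(10, 3)
y = np.arange(2, 10, 0.8).reshape(10, 1)
y[[4, 6, 7], 0] = 0.0  # padded positions, as in the question

# zero weight where y was padded, so those rows contribute nothing to the loss
sample_weight = (y[:, 0] != 0).astype('float32')

model = keras.Sequential([
    keras.layers.Dense(8, activation='relu', input_shape=(3,)),
    keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')
model.fit(x, y, sample_weight=sample_weight, epochs=10, verbose=0)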

How to make a scatter plot with varying scatter size and color corresponding to a range of values from a dataframe?

I have a Dataframe
df =
Magnitude   Lon    Lat   Depth
3.5         33.3   76.2  22
3.5         33.1   75.9  34
2.5         30.5   79.6  25
5.5         30.4   79.5  40
5.1         32     78.8  58
4.5         31.5   74    NaN
2.1         33.9   74.7  64
5.1         30.8   79.1  33
1.1         32.6   78.2  78
NaN         33.3   76    36
5.2         32.7   79.5  36
NaN         33.6   78.6  NaN
I want to make a scatter plot with Lon on the x-axis, Lat on the y-axis, and scatter points sized according to the range of values in Magnitude:
size = 1 : Magnitude < 2, size = 1.5 : 2 < Magnitude < 3, size = 2 : 3 < Magnitude < 4, size = 2.5 : Magnitude > 4,
and coloured according to the range of values in Depth:
color = red : Depth < 30, color = blue : 30 < Depth < 40, color = black : 40 < Depth < 60, color = yellow : Depth > 60.
I am thinking of solving this by defining a dictionary for the size and color (just giving the idea; I need the correct syntax), more like:
def magnitude_size(df.Magnitude):
    if df.Magnitude < 2:
        return 1
    if df.Magnitude > 2 and df.Magnitude < 3:
        return 1.5
    if df.Magnitude > 3 and df.Magnitude < 4:
        return 2
    if df.Magnitude > 4:
        return 2.5

def depth_color(df.Depth):
    if df.Depth < 30:
        return 'red'
    if df.Depth > 30 and df.Depth < 40:
        return 'blue'
    if df.Depth > 40 and df.Depth < 60:
        return 'black'
    if df.Depth > 60:
        return 'yellow'

di = {
    'size': magnitude_size(df.Magnitude),
    'color': depth_color(df.Depth)
}

plt.scatter(df.Lon, df.Lat, c=di['color'], s=di['size'])
plt.show()
If there are any NaN values in Magnitude, give the scatter point a different symbol (*), and if there are any NaN values in Depth, give it a different color (green).
NEED HELP
You could use pandas.cut to create a couple of helper columns in df based on your color and size mappings. This should make it easier to pass these arguments to pyplot.scatter.
N.B. the values you've chosen for size may not distinguish the markers very well in the plot - it'd be worth experimenting with different sizes until you get the desired result.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df['color'] = pd.cut(df['Depth'], bins=[-np.inf, 30, 40, 60, np.inf], labels=['red', 'blue', 'black', 'yellow'])
df['size'] = pd.cut(df['Magnitude'], bins=[-np.inf, 2, 3, 4, np.inf], labels=[1, 1.5, 2, 2.5])
plt.scatter(df['Lon'], df['Lat'], c=df['color'], s=df['size'])
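The question also asks for green where Depth is NaN; pd.cut leaves those rows as NaN in the helper column, so one way to cover that case (an extension, not part of the original answer) is to add the category and fill:

df['color'] = df['color'].cat.add_categories('green').fillna('green')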
Update
It's not what I would recommend, but if you insist on using dict and functions then use:
def magnitude_size(magnitude):
    if magnitude < 2:
        return 1
    if magnitude >= 2 and magnitude < 3:
        return 1.5
    if magnitude >= 3 and magnitude < 4:
        return 2
    if magnitude >= 4:
        return 2.5

def depth_color(depth):
    if depth < 30:
        return 'red'
    if depth >= 30 and depth < 40:
        return 'blue'
    if depth >= 40 and depth < 60:
        return 'black'
    if depth >= 60:
        return 'yellow'
    # NaN fails every comparison above, so it falls through to this check
    if np.isnan(depth):
        return 'green'

di = {
    'size': df.Magnitude.apply(magnitude_size),
    'color': df.Depth.apply(depth_color)
}

plt.scatter(df.Lon, df.Lat, c=di['color'], s=di['size'])

Combine two numpy arrays and convert them into a dataframe

I have two DataFrames (X & y) sliced off the main DataFrame df as below:
X = df.ix[:,df.columns!='Class']
y = df.ix[:,df.columns=='Class']
from imblearn.over_sampling import SMOTE
sm = SMOTE()
X_resampled , y_resampled = sm.fit_sample(X,y.values.ravel())
The last line returns a numpy 2-d array for X_resampled and y_resampled.
So I would like to know how to convert X_resampled and y_resampled back into a DataFrame.
Example data:
X_resampled: dimensions (2, 30) - 2 rows, 30 columns:
array([[ 0. , -1.35980713, -0.07278117, 2.53634674, 1.37815522,
-0.33832077, 0.46238778, 0.23959855, 0.0986979 , 0.36378697,
0.09079417, -0.55159953, -0.61780086, -0.99138985, -0.31116935,
1.46817697, -0.47040053, 0.20797124, 0.02579058, 0.40399296,
0.2514121 , -0.01830678, 0.27783758, -0.11047391, 0.06692807,
0.12853936, -0.18911484, 0.13355838, -0.02105305, 0.24496426],
[ 0. , 1.19185711, 0.26615071, 0.16648011, 0.44815408,
0.06001765, -0.08236081, -0.07880298, 0.08510165, -0.25542513,
-0.16697441, 1.61272666, 1.06523531, 0.48909502, -0.1437723 ,
0.63555809, 0.46391704, -0.11480466, -0.18336127, -0.14578304,
-0.06908314, -0.22577525, -0.63867195, 0.10128802, -0.33984648,
0.1671704 , 0.12589453, -0.0089831 , 0.01472417, -0.34247454]])
y_resampled: dimensions (2,) - corresponding to the two rows of X_resampled.
array([0, 0], dtype=int64)
I believe you need numpy.hstack:
import numpy as np
import pandas as pd

a = np.array([[ 0. , -1.35980713, -0.07278117, 2.53634674, 1.37815522,
-0.33832077, 0.46238778, 0.23959855, 0.0986979 , 0.36378697,
0.09079417, -0.55159953, -0.61780086, -0.99138985, -0.31116935,
1.46817697, -0.47040053, 0.20797124, 0.02579058, 0.40399296,
0.2514121 , -0.01830678, 0.27783758, -0.11047391, 0.06692807,
0.12853936, -0.18911484, 0.13355838, -0.02105305, 0.24496426],
[ 0. , 1.19185711, 0.26615071, 0.16648011, 0.44815408,
0.06001765, -0.08236081, -0.07880298, 0.08510165, -0.25542513,
-0.16697441, 1.61272666, 1.06523531, 0.48909502, -0.1437723 ,
0.63555809, 0.46391704, -0.11480466, -0.18336127, -0.14578304,
-0.06908314, -0.22577525, -0.63867195, 0.10128802, -0.33984648,
0.1671704 , 0.12589453, -0.0089831 , 0.01472417, -0.34247454]])
b = np.array([0, 100])
c = pd.DataFrame(np.hstack((a,b[:, None])))
print (c)
0 1 2 3 4 5 6 7 \
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803
8 9 ... 21 22 23 24 \
0 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928
1 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846
25 26 27 28 29 30
0 0.128539 -0.189115 0.133558 -0.021053 0.244964 0.0
1 0.167170 0.125895 -0.008983 0.014724 -0.342475 100.0
[2 rows x 31 columns]
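If you also want the original column names back, here is a small sketch continuing from the question's code (it assumes X and y are the DataFrames from the question and X_resampled / y_resampled are the arrays returned by fit_sample):

import pandas as pd

X_res_df = pd.DataFrame(X_resampled, columns=X.columns)  # reuse the feature column names
X_res_df['Class'] = y_resampled                          # append the resampled labels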

deleting rows by default value

I have found code that I am interested in on this forum, but it's not working for my dataframe.
INPUT:
x  , y  , value , value2
1.0, 1.0, 12.33 , 1.23367543
2.0, 2.0, 11.5  , 1.1523123
4.0, 2.0, 22.11 , 2.2112312
5.0, 5.0, 78.13 , 7.8131239
6.0, 6.0, 33.68 , 3.3681231
I need to delete rows that are within a distance of 1 of each other (in both x and y) and keep only the row with the highest "value".
RESULT to get:
1.0, 1.0, 12.33 , 1.23367543
4.0, 2.0, 22.11 , 2.2112312
5.0, 5.0, 78.13 , 7.8131239
CODE:
def dist_value_comp(row):
    # mask of rows whose x and y are both within 1 of this row
    x_dist = abs(df['y'] - row['y']) <= 1
    y_dist = abs(df['x'] - row['x']) <= 1
    xy_dist = x_dist & y_dist
    # keep this row only if it has the highest value2 in that neighbourhood
    max_value = df.loc[xy_dist, 'value2'].max()
    return row['value2'] == max_value

df['keep_row'] = df.apply(dist_value_comp, axis=1)
df.loc[df['keep_row'], ['x', 'y', 'value', 'value2']]
PROBLEM:
When I add a 4th column, value2, whose values have more digits after the decimal point, the code shows me only the row with the highest value2, but the result should be the same as for value.
UPDATE:
It works when I use an old PyCharm and Python 2.7; on a newer version it doesn't. Any idea why?
