I want to achieve the following workflow (as an edge case in a larger project):
1. Create a Dask array from a 2D numpy array
2. Correlate with map_overlap using depth=1 and no boundary, similar to dask_image.ndfilters.correlate
3. Compute and store the Dask array in the original numpy array
I have trouble achieving step 3 without doubling memory usage. I get artifacts at the chunk boundaries when using dask_array.store(numpy_array, compute=True), but not when I use numpy_array = dask_array.compute().
My attempt at a minimal reproducible example that follows my workflow uses dask_image.ndfilters.correlate:
import numpy as np
import dask.array as da
import matplotlib.pyplot as plt
import dask_image.ndfilters as da_ndf
def initalize_arrays():
    array = np.ones((150,100),dtype=np.uint8)
    dask_array = da.from_array(array,chunks=((8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 6),
                                             (8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4)))
    return array, dask_array
array, dask_array = initalize_arrays()
weight_sums = da_ndf.correlate(dask_array,weights=np.ones((3,3)),mode='constant',cval=0.0)
weight_sums.store(array,compute=True)
array_store = array.copy()
array, dask_array = initalize_arrays()
weight_sums = da_ndf.correlate(dask_array,weights=np.ones((3,3)),mode='constant',cval=0.0)
array_compute = weight_sums.compute()
Image of the results (cannot embed images yet): 2D arrays showing artifacts at chunk boundaries.
I am learning pandas and NumPy. I am trying to write a script that will loop through a dataframe and calculate the R2 of an increasingly larger number of rows. This is what I came up with for now:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
df=pd.DataFrame()
a=[1, 1.5, 2, 4, 5, 6, 7, 8, 9, 10]
b=[2, 4, 6, 7, 8, 9, 10, 11, 12, 13]
df['a']=a
df['b']=b
print(df)
plt.scatter(x=df['a'], y=df['b'])
lr = LinearRegression()
for i in range(len(df)):
    n=0
    X=np.column_stack([np.ones(len(df), dtype=np.float32),(df['a'].loc[0+n]).values()])
    y=(df['b'].loc[0+n])
    n=n+1
    model = lr.fit(X,y)
    print(f'R Squared: {model.score(X,y)}')
But I only get the error:
'numpy.float64' object has no attribute 'values'
When I use .values without the for-loop, it converts the values without any problem.
I don't fully understand your goal, but (df['a'].loc[0+n]) results in a single number, not a Series, so you can't call .values() on it.
I've added some hopefully helpful comments to parts of your code.
Can you please clarify what you expect X to be in each iteration of the loop?
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
df=pd.DataFrame()
a=[1, 1.5, 2, 4, 5, 6, 7, 8, 9, 10]
b=[2, 4, 6, 7, 8, 9, 10, 11, 12, 13]
df['a']=a
df['b']=b
print(df)
plt.scatter(x=df['a'], y=df['b'])
lr = LinearRegression()
n=0 #I moved this outside the for-loop. before n would always be reset to 0 in each iteration of the for-loop
#you can also consider using i instead of n since i
for i in range(len(df)):
    print('n =',n,'i =',i) #keep track of what n and i are
    X=np.column_stack([
        np.ones(len(df), dtype=np.float32), #first column is all ones
        #(df['a'].loc[0+n]).values() #do you need to add 0?
        np.repeat(df['a'].loc[n], len(df)), #did you want the second column to all be the n-th value of column A?
    ])
    y=(df['b'].loc[0+n]) #do you need to add 0?
    n=n+1
    print(X)
    #skipping the below for now
    #model = lr.fit(X,y) #Error! complains it can't fit since "TypeError: Singleton array array(2) cannot be considered a valid collection."
    #print(f'R Squared: {model.score(X,y)}')
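If the goal is the R² of a straight-line fit over an increasing number of rows, one possible reading is sketched below. This is an assumption about the intended result, not something the question confirms: each iteration fits the first k rows (starting at k=2, since R² is undefined for a single point) and uses np.polyfit instead of LinearRegression to keep the example small. The helper name expanding_r2 is hypothetical.

```python
import numpy as np

a = np.array([1, 1.5, 2, 4, 5, 6, 7, 8, 9, 10], dtype=float)
b = np.array([2, 4, 6, 7, 8, 9, 10, 11, 12, 13], dtype=float)

def expanding_r2(x, y, min_rows=2):
    """R^2 of a least-squares line fitted to the first k rows, for k = min_rows..len(x)."""
    scores = []
    for k in range(min_rows, len(x) + 1):
        xk, yk = x[:k], y[:k]                      # slice of the first k rows
        slope, intercept = np.polyfit(xk, yk, 1)   # degree-1 least-squares fit
        residuals = yk - (slope * xk + intercept)
        ss_res = np.sum(residuals ** 2)            # unexplained variation
        ss_tot = np.sum((yk - yk.mean()) ** 2)     # total variation
        scores.append(1.0 - ss_res / ss_tot)
    return scores

r2_values = expanding_r2(a, b)
for k, r2 in enumerate(r2_values, start=2):
    print(f"first {k} rows: R Squared = {r2:.4f}")
```

With scikit-learn the same slicing idea would be lr.fit(df[['a']].iloc[:k], df['b'].iloc[:k]), where the double brackets keep X two-dimensional.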
I'm trying to plot a histogram from a database using the matplotlib library in Python. The query is shown here:
cnx = sqlite3.connect('practice.db')
sql = pd.read_sql_query('''
SELECT CAST((deliverydistance/1)as int)*1 as bin, count(*)
FROM orders
group by 1
order by 1;
''',cnx)
which outputs this: [linked image]
From the resulting DataFrame, I extract the columns using a for loop and place them in lists.
distance =[]
counts = []
for x,y in sql.iterrows():
    y = y["count(*)"]
    counts.append(y)
    distance.append(x)
print(distance)
print(counts)
OUTPUT:
distance = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
counts = [57136, 4711, 6569, 7268, 6755, 5757, 7643, 6175, 7954, 9418, 4945, 4178, 2844, 2104, 1829, 9, 4, 1, 3]
When I plot a histogram
plt.hist(counts,bins=distance)
I get this output: [linked image]
My question is, how do I make it so that the count is on the Y axis and the distance is on the X axis? It doesn't seem to allow me to put it there.
You could also skip the for loop and plot directly from your pandas DataFrame using
sql.bin.plot(kind='hist', weights=sql['count(*)'])
Or, with the for loop:
import matplotlib.pyplot as plt
import pandas as pd
distance =[]
counts = []
for x,y in sql.iterrows():
    y = y["count(*)"]
    counts.append(y)
    distance.append(x)
plt.hist(distance, bins=distance, weights=counts)
You can skip the middle section where you count the instances of each distance. Check out this example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'distance':np.round(20 * np.random.random(100))})
df['distance'].hist(bins = np.arange(0,21,1))
Pandas has a built-in histogram plot which counts, then plots, the occurrences of each distance. You can specify the bins (in this case 0-20 with a width of 1).
If you want a horizontal histogram rather than a bar chart, pass orientation='horizontal':
distance = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
# plt.style.use('dark_background')
counts = [57136, 4711, 6569, 7268, 6755, 5757, 7643, 6175, 7954, 9418, 4945, 4178, 2844, 2104, 1829, 9, 4, 1, 3]
plt.hist(counts,bins=distance, orientation='horizontal')
Use:
plt.bar(distance,counts)
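To see why plt.hist(counts, bins=distance) looks wrong while the weights and plt.bar suggestions work, here is a small sketch using np.histogram, which bins values the same way a histogram plot does. It only illustrates the binning; the extra bin edge at 19 is my own choice so that distance 18 gets a bin of its own.

```python
import numpy as np

distance = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
counts = [57136, 4711, 6569, 7268, 6755, 5757, 7643, 6175, 7954, 9418,
          4945, 4178, 2844, 2104, 1829, 9, 4, 1, 3]

# Histogramming `counts` bins the count values themselves.
# Only the counts 9, 4, 1 and 3 fall inside the edges 0..18.
wrong_heights, _ = np.histogram(counts, bins=distance)

# Binning the distances and weighting each one by its count
# recovers the pre-aggregated table.
edges = np.arange(0, 20)
heights, _ = np.histogram(distance, bins=edges, weights=counts)

print(wrong_heights.sum())          # number of count values that landed in a bin
print(heights.tolist() == counts)   # True when the bar heights match the table
```

Since the data is already aggregated, plt.bar(distance, counts) draws the same heights without any re-binning.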
I have a set of data like this:
numpy.array([[3, 7],[5, 8],[6, 19],[8, 59],[10, 42],[12, 54], [13, 32], [14, 19], [99, 19]])
which I want to split into a number of chunks with a percentage of overlap, for each column separately. For example, for column 1, splitting into 3 chunks with 50% overlap (results in a 2-D array):
[[3, 5, 6, 8],
[6, 8, 10, 12],
[10, 12, 13, 14]]
(ignoring the last chunk, which would be [13, 14, 99] and so not the same size as the rest).
I'm trying to write a function that takes the array, the number of chunks and the overlap percentage and returns the result.
That's a window function, so use skimage.util.view_as_windows:
from skimage.util import view_as_windows
out = view_as_windows(in_arr[:, 0], window_shape = 4, step = 2)
If you need numpy only, you can use this recipe
For numpy only, a quite fast approach is:
import numpy as np

def rolling(a, window, step):
    shape = ((a.size - window)//step + 1, window)
    strides = (step*a.itemsize, a.itemsize)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
And you can call it like so:
rolling(arr[:,0].copy(), 4, 2)
Remark: I got unexpected output for rolling(arr[:,0], 4, 2), because arr[:,0] is a non-contiguous view whose stride is the row size rather than a.itemsize, so I pass a copy instead.
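As a possible alternative that avoids hand-written strides, newer NumPy versions (1.20 and later) provide sliding_window_view, which also accepts non-contiguous views such as arr[:, 0]. The sketch below is one way to wrap it; the function name overlapping_chunks and its parameters (a window size plus an overlap fraction, rather than a number of chunks) are my own assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def overlapping_chunks(column, window, overlap):
    """Split a 1-D array into windows of length `window` that share
    the fraction `overlap` (0 <= overlap < 1) of their elements."""
    # Step between window starts; at least 1 so the slice below is valid.
    step = max(1, int(round(window * (1 - overlap))))
    # All windows with step 1, then keep every `step`-th one.
    return sliding_window_view(column, window)[::step]

arr = np.array([[3, 7], [5, 8], [6, 19], [8, 59], [10, 42],
                [12, 54], [13, 32], [14, 19], [99, 19]])

chunks = overlapping_chunks(arr[:, 0], window=4, overlap=0.5)
print(chunks)
```

Incomplete trailing windows are dropped, which matches the question's choice to ignore the shorter last chunk.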
I'm trying to understand how time series work in matplotlib.
Unfortunately, this doc just loads data straight from a file using numpy, which makes it very cryptic for anyone not fluent in numpy.
From the doc:
with cbook.get_sample_data('goog.npz') as datafile:
    r = np.load(datafile)['price_data'].view(np.recarray)
r = r[-30:] # get the last 30 days
# Matplotlib works better with datetime.datetime than np.datetime64, but the
# latter is more portable.
date = r.date.astype('O')
In my case, I have a dictionary mapping datetime (key) to int, which I can transform into an array or list, but I haven't managed to produce anything that pyplot will accept, and the doc isn't much help, especially for time series.
def toArray(dict):
    data = list(dict.items())
    return np.array(data)
>>>
[datetime.datetime(2020, 5, 4, 16, 44) -13]
[datetime.datetime(2020, 5, 4, 16, 45) 7]
[datetime.datetime(2020, 5, 4, 16, 46) -11]
[datetime.datetime(2020, 5, 4, 16, 47) -75]
[datetime.datetime(2020, 5, 4, 16, 48) -41]
[datetime.datetime(2020, 5, 4, 16, 49) -39]
[datetime.datetime(2020, 5, 4, 16, 50) -4]
The most important part is to split the X axis from the Y axis (in your case, dates from values). Using your function toArray() to retrieve the data, the following code produces the desired result:
import matplotlib.pyplot as plt
data = toArray(your_dict)
fig, ax = plt.subplots(figsize=(20, 10))
dates = [x[0] for x in data]
values = [x[1] for x in data]
ax.plot(dates, values, 'o-')
ax.set_title("Default")
fig.autofmt_xdate()
plt.show()
Note how we split the 2D array of dates and values into two 1D lists, dates and values.
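The same split can also be done straight from the dictionary, without going through a numpy object array. This is only a sketch: your_dict here is a small stand-in built from the first rows of the sample output, and sorting by key is my own addition so the line is drawn in time order.

```python
from datetime import datetime

# Stand-in data; the keys are deliberately out of order.
your_dict = {
    datetime(2020, 5, 4, 16, 46): -11,
    datetime(2020, 5, 4, 16, 44): -13,
    datetime(2020, 5, 4, 16, 45): 7,
}

# Sort the (date, value) pairs by date, then unzip them into two tuples.
dates, values = zip(*sorted(your_dict.items()))

print(dates[0])
print(values)
# dates and values can then be passed to ax.plot(dates, values, 'o-')
```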
This is a simple question. I have
range(1, 11)[::-1]
which gives me
[10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
Is there a 'cleaner' way to generate the above list? With a single function perhaps?
You can use range(10, 0, -1)
Using numpy you could do
import numpy as np
np.arange(10, 0, -1)
I don't know if it's cleaner but you could also use:
reversed(range(1, 11))
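A quick check that the pure-Python options agree. Note that in Python 3, range and reversed are lazy, so the sketch wraps them in list() to get an actual list like the one shown in the question.

```python
# Three ways to count down from 10 to 1; list() materialises the lazy objects.
descending = list(range(10, 0, -1))
also_descending = list(reversed(range(1, 11)))
sliced = list(range(1, 11)[::-1])

print(descending)
print(descending == also_descending == sliced)
```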