I am trying to subset a matrix by using values from another smaller matrix. The number of rows in each are the same, but the smaller matrix has fewer columns. Each column in the smaller matrix contains the value of the column in the larger matrix that should be referenced. Here is what I have done, along with comments that hopefully describe this better, along with what I have tried. (The wrinkle in this is that the values of the columns to be used in each row change...)
I have tried Google, searching on stackoverflow, etc and can't find what I'm looking for. (The closest I came was something in sage called matrix_from_columns, which isn't being used here) So I'm probably making a very simple referencing error.
TIA,
mconsidine
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.ticker import LinearLocator
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
#Problem: for each row in a matrix/image I need to replace
# a value in a particular column in that row by a
# weighted average of some of the values on either
# side of that column in that row. The wrinkle
# is that the column that needs to be changed may
# vary from row to row. The columns that need to
# have their values changes is stored in an array.
#
# How do I do something like:
# img[:, selectedcolumnarray] = somefunction(img,targetcolumnmatrix)
#
# I can do this for setting the selectedcolumnarray to a value, like 0
# But I am not figuring out how to select the targeted values to
# average.
#dimensions of subset of the matrix/image that will be averaged
rows = 7
columns = 5
#weights that will be used to average surrounding values
the_weights = np.ones((rows,columns)).astype(float)*(1/columns)
print(the_weights)
#make up some data to create a set of column
# values that vary by row
y = np.asarray(range(0,rows)).astype(float)
x = -0.095*(y**2) - 0.05*y + 12.123
fit=[x.astype(int),x-x.astype(int),y]
print(np.asarray(fit)[0])
#create a test array, eg "image' of 20 columns that will have
# values in targeted columns replaced
testarray = np.asarray(range(1,21))
img = np.ones((rows,20)).astype(np.uint16)
img = img*testarray.T #give it some values
print(img)
#values of the rows that will be replaced
targetcolumn = np.asarray(fit)[0].astype(int)
print(targetcolumn)
#calculate the range of columns in each row that
# will be used in the averaging
startcol = targetcolumn-2
endcol = targetcolumn+2
testcoords=np.linspace(startcol,endcol,5).astype(int).T
#this is the correct set of columns in the corresponding
# row to use for averaging
print(testcoords)
img2=img.copy()
#this correctly replaces the targetcolumn values with 0
# but I want to replace them with the sum of the values
# in the respective row of testcoords, weighted by the_weights
img2[np.arange(rows),targetcolumn]=0
#so instead of selecting the one column, I want to select
# the block of the image represented by testcoords, calculate
# a weighted average for each row, and use those values instead
# of 0 to set the values in targetcolumn
#starting again with the 7x20 (rowsxcolumns) "image"
img3=img.copy()
#this gives me the wrong size, ie 7,7,5 when I think I want 7,5;
print(testcoords.shape)
#I thought "take" might help, but ... nope
#img3=np.take(img,testcoords,axis=1)
#something here maybe??? :
#https://stackoverflow.com/questions/40084931/taking-subarrays-from-numpy-array-with-given-stride-stepsize
# but I can't figure out what
##### plot surface to try to visualize what is going on ####
'''
fig, ax = plt.subplots(subplot_kw={"projection": "3d"})
# Make data.
X = np.arange(0, 20, 1)
Y = np.arange(0, rows, 1)
X, Y = np.meshgrid(X, Y)
Z = img2
# Plot the surface.
surf = ax.plot_surface(X, Y, Z, cmap=cm.coolwarm,
linewidth=0, antialiased=False)
# Customize the z axis.
ax.set_zlim(0, 20)
ax.zaxis.set_major_locator(LinearLocator(10))
# A StrMethodFormatter is used automatically
ax.zaxis.set_major_formatter('{x:.02f}')
# Add a color bar which maps values to colors.
fig.colorbar(surf, shrink=0.5, aspect=5)
plt.show()
It turns out that "take_along_axis" does the trick:
imgsubset = np.take_along_axis(img3,testcoords,axis=1)
print(imgsubset)
newvalues = imgsubset * the_weights
print(newvalues)
newvalues = np.sum(newvalues, axis=1)
print(newvalues)
img3[np.arange(rows),targetcolumn] = np.round(newvalues,0)
print(img3)
(It becomes more obvious when non trivial weights are used.)
Thanks for listening...
mconsidine
I would like to colour code the scatter plot based upon the two data frame values such that for each different values of df[1], a new color is to be assigned and for each df[2] value having same df[1] value, the assigned color earlier needs the opacity variation with highest value of df[2] (among df[2] values having same df[1] value) getting 100 % opaque and the lowest getting least opaque among the group of the data points.
Here is the code:
def func():
...
df = pd.read_csv(PATH + file, sep=",", header=None)
b = 2.72
a = 0.00000009
popt, pcov = curve_fit(func, df[2], df[5]/df[4], p0=[a,b])
perr = np.sqrt(np.diag(pcov))
plt.scatter(df[1], df[5]/df[4]/df[2])
# Plot responsible for the datapoints in the figure
plt.plot(df[1], func_cpu(df[2], *popt)/df[2], "r")
# plot responsible for the curve in the figure
plt.legend(loc="upper left")
Here is the sample dataset:
**df[0],df[1],df[2],df[3],df[4],df[5],df[6]**
file_name_1_i1,31,413,36120,10,9,10
file_name_1_i2,31,1240,60488,10,25,27
file_name_1_i3,31,2769,107296,10,47,48
file_name_1_i4,31,8797,307016,10,150,150
file_name_2_i1,34,72,10868,11,9,10
file_name_2_i2,34,6273,250852,11,187,196
file_name_3_i1,36,84,29568,12,9,10
file_name_3_i2,36,969,68892,12,25,26
file_name_3_i3,36,6545,328052,12,150,151
file_name_4_i1,69,116,40712,13,25,26
file_name_4_i2,69,417,80080,13,47,48
file_name_4_i2,69,1313,189656,13,149,150
file_name_4_i4,69,3009,398820,13,195,196
file_name_4_i5,69,22913,2855044,13,3991,4144
file_name_5_i1,85,59,48636,16,47,48
file_name_5_i2,85,163,64888,15,77,77
file_name_5_i3,85,349,108728,16,103,111
file_name_5_i4,85,1063,253180,14,248,248
file_name_5_i5,85,2393,526164,15,687,689
file_name_5_i6,85,17713,3643728,15,5862,5867
file_name_6_i1,104,84,75044,33,137,138
file_name_6_i2,104,455,204792,28,538,598
file_name_6_i3,104,1330,513336,31,2062,2063
file_name_6_i4,104,2925,1072276,28,3233,3236
file_name_6_i5,104,6545,2340416,28,7056,7059
...
So, the x-axis would be df[1] which are 31, 31, 31, 31, 34, 34,... and the y-axis is df[5], df[4], df[2] which are 9, 10, 413. For each different value of df[1], a new colour needs to be assigned. It would be fine to repeat the color cycles say after 6 unique colours. And among each color the opacity needs to be changed wrt to the value of df[2] (though y-axis is df[5], df[4], df[2]). The highest getting the darker version of the same color, and the lowest getting the lightest version of the same color.
and the scatter plot:
This is roughly how my desired solution of the color code needs to look like:
I have around 200 entries in the csv file.
Does using NumPy in this scenario is more advantageous ?
Let me know if this is appropriate or if I have misunderstood anything-
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# not needed for you
# df = pd.read_csv('~/Documents/tmp.csv')
max_2 = pd.DataFrame(df.groupby('1').max()['2'])
no_unique_colors = 3
color_set = [np.random.random((3)) for _ in range(no_unique_colors)]
# assign colors to unique df2 in cyclic order
max_2['colors'] = [color_set[unique_df2 % no_unique_colors] for unique_df2 in range(max_2.shape[0])]
# calculate the opacities for each entry in the dataframe
colors = [list(max_2.loc[df1].colors) + [float(df['2'].iloc[i])/max_2['2'].loc[df1]] for i, df1 in enumerate(df['1'])]
# repeat thrice so that df2, df4 and df5 share the same opacity
colors = [x for x in colors for _ in range(3)]
plt.scatter(df['1'].values.repeat(3), df[['2', '4', '5']].values.reshape(-1), c=colors)
plt.show()
Well, what do you know. I understood this task totally differently. I thought the point was to have alpha levels according to all df[2], df[4], and df[5] values for each df[1] value. Oh well, since I have done the work already, why not post it?
from matplotlib import pyplot as plt
import pandas as pd
from itertools import cycle
from matplotlib.colors import to_rgb
#read the data, column numbers will be generated automatically
df = pd.read_csv("data.txt", sep = ",", header=None)
#our figure with the ax object
fig, ax = plt.subplots(figsize=(10,10))
#definition of the colors
sc_color = cycle(["tab:orange", "red", "blue", "black"])
#get groups of the same df[1] value, they will also be sorted at the same time
dfgroups = df.iloc[:, [2, 4, 5]].groupby(by=df[1])
#plot each group with a different colour
for groupkey, groupval in dfgroups:
#create group dataframe with df[1] value as x and df[2], df[4], and df[5] values as y
groupval= groupval.melt(var_name="x", value_name="y")
groupval.x = groupkey
#get min and max y for the normalization
y_high = groupval.y.max()
y_low = groupval.y.min()
#read out r, g, and b values of the next color in the cycle
r, g, b = to_rgb(next(sc_color))
#create a colour array with nonlinear normalized alpha levels
#between 0.2 and 0.8, so that all data point are visible
group_color = [(r, g, b, 0.19 + 0.8 * ((y_high-val) / (y_high-y_low))**7) for val in groupval.y]
#and plot
ax.scatter(groupval.x, groupval.y, c=group_color)
plt.show()
Sample output of your data:
Two main problems here. One is that alpha in a scatter plot does not accept an array. But color does, hence, the detour to read out the RGB values and create an RGBA array with added alpha levels.
The other is that your data are spread over a rather wide range. A linear normalization makes changes near the lowest values invisible. There is surely some optimization possible; I like for instance this suggestion.
CSV1only is a dataframe uploaded from a CSV
Let CSV1only as a dataframe be a column such that:
TRADINGITEMID:
1233
2455
3123
1235
5098
as a small example
How can I plot a scatterplot accordingly, specifically the y-axis?
I tried:
import pandas as pd
import matplotlib.pyplot as plt
CSV1only.plot(kind='scatter',x='TRADINGITEMID', y= [1,2], color='b')
plt.xlabel('TRADINGITEMID Numbers')
plt.ylabel('Range')
plt.title('Distribution of ItemIDNumbers')
and it doesn't work because of the y.
So, my main question is just how I can get a 0, 1, 2 y-axis for this scatter plot, as I want to make a distribution graph.
The following code doesn't work because it doesn't match the amount of rows included in the original TRADINGITEMID column, which has 5000 rows:
newcolumn_values = [1, 2]
CSV1only['un et deux'] = newcolumn_values
#and then I changed the y = [1,2] from before into y = ['un et deux']
Therefore the solution would need to work from any integer 1 to N, N being the # of rows. Yet, it would only have a range of [0, 2] or some [0, m], m being some arbitrary integer.
Don't need to worry about the actual pandas data frame CSV1only.
The 'TRADINGITEMIDNUMBERS' contains 5000 rows of unique numbers, so I just wanna plot those numbers on a line, with the y-axis being instances (which will never pass 1 since it is unique).
I think you are looking for the following: You need to generate y-values starting from 0 until n-1 where n is the total number of rows
y = np.arange(len(CSV1only['TRADINGITEMID']))
plt.scatter(CSV1only['TRADINGITEMID'], y, c='DarkBlue')
Assume two dataframes, each with a datetime index, and each with one column of unnamed data. The dataframes are of different lengths and the datetime indexes may or may not overlap.
df1 is length 20. df2 is length 400. The data column consists of random floats.
I want to iterate through df2 taking 20 units per iteration, with each iteration incrementing the start vector by one unit - and similarly the end vector by one unit. On each iteration I want to calculate the correlation between the 20 units of df1 and the 20 units I've selected for this iteration of df2. This correlation coefficient and other statistics will then be recorded.
Once the loop is complete I want to plot df1 with the 20-unit vector of df2 that satisfies my statistical search - thus needing to keep up with some level of indexing to reacquire the vector once analysis has been completed.
Any thoughts?
Without knowing more specifics of the questions such as, why are you doing this or do dates matter, this will do what you asked. I'm happy to update based on your feedback.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
df1 = pd.DataFrame({'a':[random.randint(0, 20) for x in range(20)]}, index = pd.date_range(start = '2013-01-01',periods = 20, freq = 'D'))
df2 = pd.DataFrame({'b':[random.randint(0, 20) for x in range(400)]}, index = pd.date_range(start = '2013-01-10',periods = 400, freq = 'D'))
corr = pd.DataFrame()
for i in range(0,380):
t0 = df1.reset_index()['a'] # grab the numbers from df1
t1 = df2.iloc[i:i+20].reset_index()['b'] # grab 20 days, incrementing by one each time
t2 = df2.iloc[i:i+20].index[0] # set the index to be the first day of df2
corr = corr.append(pd.DataFrame({'corr':t0.corr(t1)}, index = [t2])) #calculate the correlation and append it to the DF
# plot it and save the graph
corr.plot()
plt.title("Correlation Graph")
plt.ylabel("(%)")
plt.grid(True)
plt.show()
plt.savefig('corr.png')
I am learning to use matplotlib with pandas and I am having a little trouble with it. There is a dataframe which has districts and coffee shops as its y and x labels respectively. And the column values represent the start date of the coffee-shops in respective districts
starbucks cafe-cool barista ........ 60 shops
dist1 2008-09-18 2010-05-04 2007-02-21 ...............
dist2 2007-06-12 2011-02-17
dist3
.
.
100 districts
I want to plot a scatter plot with x axis as time series and y axis as coffee-shops. Since I couldn't figure out a direct one line way to plot this, I extracted the coffee-shops as one list and dates as other list.
shops = list(df.columns.values)
dt = pd.DataFrame(df.ix['dist1'])
dates = dt.set_index('dist1')
First I tried plt.plot(dates, shops). Got a ZeroDivisionError: integer division or modulo by zero - error. I could not figure out the reason for it. I saw on some posts that the data should be numeric, so I used ytick function.
y = [1, 2, 3, 4, 5, 6,...60]
still plt.plot(dates, y) threw same ZeroDivisionError. If I could get past this may be I would be able to plot using tick function. Source -
http://matplotlib.org/examples/ticks_and_spines/ticklabels_demo_rotation.html
I am trying to plot the graph for only first row/dist1. For that I fetched the first row as a dataframe df1 = df.ix[1] and then used the following
for badges, dates in df.iteritems():
date = dates
ax.plot_date(date, yval)
# Record the number and label of the coffee shop
label_ticks.append(yval)
label_list.append(badges)
yval+=1
.
I got an error at line ax.plot_date(date, yval) saying x and y should be have same first dimension. Since I am plotting one by one for each coffe-shop for dist1 shouldn't the length always be one for both x and y? PS: date is a datetime.date object
To achieve this you need to convert the dates to datetimes, see here for
an example. As mentioned you also need to convert the coffee shops into
some numbering system then change the tick labels accordingly.
Here is an attempt
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import pandas as pd
from datetime import datetime
def get_datetime(string):
"Converts string '2008-05-04' to datetime"
return datetime.strptime(string, "%Y-%m-%d")
# Generate datarame
df = pd.DataFrame(dict(
starbucks=["2008-09-18", "2007-06-12"],
cafe_cool=["2010-05-04", "2011-02-17"],
barista=["2007-02-21"]),
index=["dist1", "dist2"])
ax = plt.subplot(111)
label_list = []
label_ticks = []
yval = 1 # numbering system
# Iterate through coffee shops
for coffee_shop, dates in df.iteritems():
# Convert strings into datetime list
datetimes = [get_datetime(date) for date in dates]
# Create list of yvals [yval, yval, ...] to plot against
yval_list = np.zeros(len(dates))+yval
ax.plot_date(datetimes, yval_list)
# Record the number and label of the coffee shop
label_ticks.append(yval)
label_list.append(coffee_shop)
yval+=1 # Change the number so they don't all sit at the same y position
# Now set the yticks appropriately
ax.set_yticks(label_ticks)
ax.set_yticklabels(label_list)
# Set the limits so we can see everything
ax.set_ylim(ax.get_ylim()[0]-1,
ax.get_ylim()[1]+1)