I have multiple data files, each with two columns (x, y). I want to take the mean of the y column across the datasets and write the result to a new file. For example, y1, y2, y3 are the y columns of datasets dat1, dat2, dat3 respectively. Say column y1 has three entries y1_a1, y1_a2, y1_a3, and the other columns similarly have y2_a1, y2_a2, y2_a3, and so on. I want to calculate the mean row by row across the columns, i.e. mean(y1_a1, y2_a1, y3_a1). Does anyone have an idea how I can do that? Here is my code, where I load the data files and split them into their x and y columns.
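import numpy as np
import matplotlib.pyplot as plt

def plot_data(data, i):
    # split the two-column array into its x and y columns
    xData, yData = np.hsplit(data, 2)
    x = xData[:,0]
    y = yData[:,0]
    plt.plot(x, y, label="data0{index}".format(index=i))
    plt.title("Graph number {index}".format(index=i))

fig, ax = plt.subplots(figsize=(12,8))
for i in range(0, 3):
    # fill the loop index into the file name: data00.dat, data01.dat, data02.dat
    data = np.loadtxt('data0{i}.dat'.format(i=i))
    plot_data(data, i)
plt.legend()
plt.show()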
If I understood correctly you are looking for something like this:
import pandas as pd
import numpy as np
# making up some data
dat1=pd.DataFrame({"x":["x1","x2","x3","x4"],"y":[24,35,81,99]})
dat2=pd.DataFrame({"x":["x1","x2","x3","x4"],"y":[12,17,1,76]})
# applying mean and converting it into a 1-column dataframe
df_means=pd.DataFrame({"means":[dat1["y"].mean(),dat2["y"].mean()]})
print(df_means)
output:
   means
0  59.75
1  26.50
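The snippet above gives one mean per dataset. If the goal is instead the row-by-row mean across datasets, mean(y1_a1, y2_a1, y3_a1), as the question describes, a minimal sketch could stack the y columns and average along the dataset axis. The arrays below are made-up stand-ins for the loaded files, the output file name mean_y.dat is hypothetical, and the sketch assumes every file has the same number of rows and the same x values.

```python
import numpy as np

# Made-up stand-ins for np.loadtxt('data00.dat') etc.; columns are x, y
dat1 = np.array([[0.0, 24.0], [1.0, 35.0], [2.0, 81.0]])
dat2 = np.array([[0.0, 12.0], [1.0, 17.0], [2.0, 1.0]])
dat3 = np.array([[0.0, 6.0], [1.0, 8.0], [2.0, 2.0]])

datasets = [dat1, dat2, dat3]

# Stack the y columns into an array of shape (n_datasets, n_rows)
y_all = np.stack([d[:, 1] for d in datasets])

# Average over the dataset axis: one mean per row
y_mean = y_all.mean(axis=0)

# Pair the means with the shared x column
result = np.column_stack([dat1[:, 0], y_mean])
print(result)

# To write the result as a new two-column file (hypothetical name):
# np.savetxt("mean_y.dat", result)
```

Here the first row mean is (24 + 12 + 6) / 3 = 14, and so on down the rows.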
I am trying to subset a matrix by using values from another, smaller matrix. The number of rows in each is the same, but the smaller matrix has fewer columns. Each column in the smaller matrix contains the index of the column in the larger matrix that should be referenced. Here is what I have done, with comments that hopefully describe this better, along with what I have tried. (The wrinkle in this is that the columns to be used change from row to row...)
I have tried Google, searching on Stack Overflow, etc. and can't find what I'm looking for. (The closest I came was something in sage called matrix_from_columns, which isn't being used here.) So I'm probably making a very simple referencing error.
TIA,
mconsidine
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.ticker import LinearLocator
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
#Problem: for each row in a matrix/image I need to replace
# a value in a particular column in that row by a
# weighted average of some of the values on either
# side of that column in that row. The wrinkle
# is that the column that needs to be changed may
# vary from row to row. The columns that need to
# have their values changed are stored in an array.
#
# How do I do something like:
# img[:, selectedcolumnarray] = somefunction(img,targetcolumnmatrix)
#
# I can do this for setting the selectedcolumnarray to a value, like 0
# But I am not figuring out how to select the targeted values to
# average.
#dimensions of subset of the matrix/image that will be averaged
rows = 7
columns = 5
#weights that will be used to average surrounding values
the_weights = np.ones((rows,columns)).astype(float)*(1/columns)
print(the_weights)
#make up some data to create a set of column
# values that vary by row
y = np.asarray(range(0,rows)).astype(float)
x = -0.095*(y**2) - 0.05*y + 12.123
fit=[x.astype(int),x-x.astype(int),y]
print(np.asarray(fit)[0])
#create a test array, eg an "image" of 20 columns that will have
# values in targeted columns replaced
testarray = np.asarray(range(1,21))
img = np.ones((rows,20)).astype(np.uint16)
img = img*testarray.T #give it some values
print(img)
#index of the column in each row whose value will be replaced
targetcolumn = np.asarray(fit)[0].astype(int)
print(targetcolumn)
#calculate the range of columns in each row that
# will be used in the averaging
startcol = targetcolumn-2
endcol = targetcolumn+2
testcoords=np.linspace(startcol,endcol,5).astype(int).T
#this is the correct set of columns in the corresponding
# row to use for averaging
print(testcoords)
img2=img.copy()
#this correctly replaces the targetcolumn values with 0
# but I want to replace them with the sum of the values
# in the respective row of testcoords, weighted by the_weights
img2[np.arange(rows),targetcolumn]=0
#so instead of selecting the one column, I want to select
# the block of the image represented by testcoords, calculate
# a weighted average for each row, and use those values instead
# of 0 to set the values in targetcolumn
#starting again with the 7x20 (rowsxcolumns) "image"
img3=img.copy()
#this gives me the wrong size, ie 7,7,5 when I think I want 7,5;
print(testcoords.shape)
#I thought "take" might help, but ... nope
#img3=np.take(img,testcoords,axis=1)
#something here maybe??? :
#https://stackoverflow.com/questions/40084931/taking-subarrays-from-numpy-array-with-given-stride-stepsize
# but I can't figure out what
##### plot surface to try to visualize what is going on ####
'''
fig, ax = plt.subplots(subplot_kw={"projection": "3d"})
# Make data.
X = np.arange(0, 20, 1)
Y = np.arange(0, rows, 1)
X, Y = np.meshgrid(X, Y)
Z = img2
# Plot the surface.
surf = ax.plot_surface(X, Y, Z, cmap=cm.coolwarm,
                       linewidth=0, antialiased=False)
# Customize the z axis.
ax.set_zlim(0, 20)
ax.zaxis.set_major_locator(LinearLocator(10))
# A StrMethodFormatter is used automatically
ax.zaxis.set_major_formatter('{x:.02f}')
# Add a color bar which maps values to colors.
fig.colorbar(surf, shrink=0.5, aspect=5)
plt.show()
'''
It turns out that "take_along_axis" does the trick:
imgsubset = np.take_along_axis(img3,testcoords,axis=1)
print(imgsubset)
newvalues = imgsubset * the_weights
print(newvalues)
newvalues = np.sum(newvalues, axis=1)
print(newvalues)
img3[np.arange(rows),targetcolumn] = np.round(newvalues,0)
print(img3)
(It becomes more obvious when non trivial weights are used.)
Thanks for listening...
mconsidine
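As a self-contained illustration of the take_along_axis approach with non-trivial weights, here is a minimal sketch. The image, the target columns, and the weights are all made up for the example, and the same weights are assumed for every row.

```python
import numpy as np

rows, width = 4, 10

# Made-up "image": every row holds the squares 1, 4, 9, ..., 100
img = np.tile(np.arange(1, width + 1) ** 2, (rows, 1)).astype(float)

# Made-up target column for each row (it varies by row)
targetcolumn = np.array([2, 3, 5, 7])

# Column indices of the 5-wide window around each target, shape (rows, 5)
offsets = np.arange(-2, 3)
window_cols = targetcolumn[:, None] + offsets

# Hypothetical non-uniform weights, shared by all rows; they sum to 1
weights = np.array([0.1, 0.2, 0.4, 0.2, 0.1])

# Pull out each row's own window of values, shape (rows, 5)
window = np.take_along_axis(img, window_cols, axis=1)

# Weighted average per row
newvalues = (window * weights).sum(axis=1)

# Write the averages back into the target column of each row
out = img.copy()
out[np.arange(rows), targetcolumn] = newvalues
print(out)
```

The key point is that take_along_axis expects an index array with the same number of rows as the image, so each row selects its own set of columns.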
I am plotting a scatter plot to show null values in a DataFrame. As you can see, the plt.scatter() call is not expressive enough: the relation between list(range(0,1200)) and 'a' is not clear unless you read the previous lines. Can plt.scatter(x,y) be written more explicitly, so that it is easy to see how x and y are related? Ideally, somebody who sees only the plt.scatter(x,y) line would understand what it is about.
a = []
for i in range(0,1200):
    feature_with_na = [feature for feature in df.columns if df[feature].isnull().sum()>i]
    a.append(len(feature_with_na))
plt.scatter(list(range(0,1200)), a)
On the x-axis you have a number n, and on the y-axis you want the number of columns in your DataFrame that have more than n null values.
Instead of your loop, you can count the null values within each column and use NumPy broadcasting (via [:, None]) to compare those counts with an array of your numbers. This lets you define xarr once and use the same array for both the comparison and the x-axis.
Sample Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.choice([1,2,3,4,5,np.nan], (100,10)))
Code
# Range of 'x' values to consider
xarr = np.arange(0, 100)
plt.scatter(xarr, (df.isnull().sum().to_numpy()>xarr[:, None]).sum(axis=1))
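To see what the broadcast comparison produces, here is a small sketch on made-up null counts (the four counts and the threshold range are illustrative, not taken from any real DataFrame).

```python
import numpy as np

# Made-up null counts for four columns of a DataFrame
null_counts = np.array([0, 2, 2, 5])

# Thresholds to test: "more than n nulls" for n = 0..5
xarr = np.arange(0, 6)

# xarr[:, None] has shape (6, 1); comparing it with shape (4,) broadcasts
# to (6, 4): row n says, for each column, whether its count exceeds n
exceeds = null_counts > xarr[:, None]

# Summing each row counts the columns with more than n nulls
n_columns = exceeds.sum(axis=1)
print(n_columns)
```

Three columns have more than 0 nulls, one column has more than 2, and none has more than 5, so the counts fall from 3 to 0 as n grows.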
ALollz's answer is good, but here's a less numpy-heavy alternative if that's your thing:
feature_null_counts = df.isnull().sum()
n_nulls = list(range(100))
features_with_n_nulls = [sum(feature_null_counts > n) for n in n_nulls]
plt.scatter(n_nulls, features_with_n_nulls)
I'm working in Python, monitoring delay flows between two hosts. My program creates a file that contains two columns of information: column 1 is the time interval that passed before I received the value in column 2. Example:
2.0 -0.430053710938
4.0 -0.0391845703125
1.0 5.830078125
4.0 5.07067871094
It took 2 seconds before I received the value -0.430053710938, 4 seconds later I got -0.0391845703125, a second later the value 5.830078125, and so on.
How can I plot this so that it makes sense? I tried gnuplot, but it uses column 1 directly as the x-axis, which messes everything up because my third value is 1.0.
What you need is a cumulative sum (np.cumsum) of the time (first column) after reading from the file. Below is a complete working answer. I am reading the data from the file and then converting the time list to an array followed by taking a cumulative sum for the time.
import matplotlib.pyplot as plt
import numpy as np
with open('data.dat',"r") as file:
    lines = file.readlines()
x = np.cumsum(np.array([float(row.split()[0]) for row in lines]))
y = [float(row.split()[1]) for row in lines]
plt.plot(x, y, '-kx')
plt.show()
Alternative way to load data
data = np.loadtxt('data.dat', usecols=(0,1))
x = np.cumsum(data[:,0])
y = data[:,1]
plt.plot(x, y, '-kx')
plt.show()
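As a quick check of what the cumulative sum does, here is a sketch that applies it to the four sample rows from the question, inlined as an array instead of being read from data.dat.

```python
import numpy as np

# The four sample rows from the question: (interval in seconds, value)
data = np.array([
    [2.0, -0.430053710938],
    [4.0, -0.0391845703125],
    [1.0, 5.830078125],
    [4.0, 5.07067871094],
])

# Running total of the intervals gives the elapsed time of each reading:
# 2, 6, 7 and 11 seconds
elapsed = np.cumsum(data[:, 0])
values = data[:, 1]
print(elapsed)
```

The elapsed times are strictly increasing, so the third reading (interval 1.0) lands at 7 seconds rather than back at x = 1.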
I'm trying to plot two columns that have been read in using pandas.read_csv, the code:-
from pandas import read_csv
from matplotlib import pyplot
data = read_csv('Stats.csv', sep=',')
#data = data.astype(float)
data.plot(x = 1, y = 2)
pyplot.show()
the csv file snippet:-
1,a4,2000,125,1.9,2.8,25.6
2,a4,7000,125,1.7,2.3,18
3,a2,7000,30,0.84,1.1,8.11
4,a2,5000,30,0.83,1.05,6.87
5,a2,4000,45,2.8,3.48,16.54
When x = 1 and y = 2, it plots the second column against the fourth, not the third as I expected.
When I try to plot the third column against the fourth (x = 2, y = 3), it plots the third against the fifth.
What I want right now is the third against the fourth. When both x and y = 2, it does plot the third column against the fourth, but the values are incorrect. What am I missing? Is read_csv changing the order of the columns?
Your input CSV has no headers, which doesn't help clarity (see Murali's comment). But I think the problem stems from the nature of the column that contains a4, a2.
This column can be used for the x-axis but not for the y-axis (non-numeric data on an x-axis appears to be just read in order). Hence the count offset: y "reads over" the column at index 1 (all 0-indexed), but x does not.
Conducting
data.plot(x=1,y=0)
and
data.plot(x=0,y=1)
and inspecting the axis helps visualise what's going on.
Bizarrely this means you can do
data.plot(x=1,y=1)
to get what you want.
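One way to sidestep the positional ambiguity altogether is to read the file with header=None, so the first data row is not consumed as a header, and then refer to columns by label. The sketch below inlines the snippet from the question; the column names are hypothetical labels chosen for the example, since the file itself has none.

```python
import io
import pandas as pd

# The snippet from the question, inlined so the sketch is self-contained
csv_text = """1,a4,2000,125,1.9,2.8,25.6
2,a4,7000,125,1.7,2.3,18
3,a2,7000,30,0.84,1.1,8.11
4,a2,5000,30,0.83,1.05,6.87
5,a2,4000,45,2.8,3.48,16.54
"""

# header=None keeps the first line as data; the names are hypothetical labels
names = ["id", "kind", "c3", "c4", "c5", "c6", "c7"]
data = pd.read_csv(io.StringIO(csv_text), header=None, names=names)

# Select the third and fourth columns by label, so there is no positional guessing
x = data["c3"]
y = data["c4"]
print(data[["c3", "c4"]])

# Plotting by label (requires matplotlib):
# data.plot(x="c3", y="c4")
```

Without header=None, read_csv treats the first line (1,a4,2000,...) as column names, so that row is missing from the plotted values.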
I have a very ambitious project (for my novice level) using a numpy array, where I load a series of data and make different plots based on my needs. I have uploaded a slim version of my data file, input_data, and want to make plots based on F (where I would like to choose the desired F before looping). Each series will hold the data for one value of column E (e.g. A12 is one data series, A23 another data series in the plot, etc.), and on the x-axis I would like to use the corresponding values in D.
To summarize: for a chosen value in column F, I want 4 different data series (one per distinct value in column E), with the data plotted against the value of column D (which is the date) on the x-axis.
I stumbled at the first step (despite spending too much time on it), where I wanted to plot all the data with a given F column identifier as one plot.
Here is what I have up to now:
import numpy as np
import matplotlib.pyplot as plt
N = 8 #different values on column F
M = 4 #different values on column E
dataset = open('array_data.txt').readlines()[1:]
data = np.genfromtxt(dataset)
my_array = data
day = len(my_array)/M/N # number of measurement sets - variation on column D
for i in range(0, len(my_array), N):
    plt.xlim(0, )
    plt.ylim(-1, 2)
    plt.plot(my_array[i, 0], my_array[i, 2], 'o')
    # plt.hold(True) was removed in matplotlib 3.0; successive plot calls already share the axes
plt.show()
This does nothing, and I still have a long way to go.
With pandas you can do:
import pandas as pd
dataset = pd.read_table("toplot.txt", sep="\t")
#make D index (automatically puts it on the x axis)
dataset.set_index("D", inplace=True)
#plotting R vs. D
dataset.R.plot()
#plotting F vs. D
dataset.F.plot()
dataset is a DataFrame object, and DataFrame.plot is just a wrapper around the matplotlib plotting functions.
I'm not clear on how you want to plot it, but it sounds like you'll need to select some values of a column. This would be:
# get where F == 1000
maskF = dataset.F == 1000
# get the values where F == 1000
rows = dataset[maskF]
# get the values where A12 is in column E
rows = rows[rows.E == "A12"]
#remove the columns we don't want to see
del rows["E"]
del rows["F"]
#Plot the result
rows.plot(xlim=(0,None), ylim=(-1,2))
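To get one data series per value of E for a chosen F, one option is to pivot the filtered rows so that each E label becomes its own column indexed by D. This is only a sketch on made-up data: the values are invented, and the column R (the measured value) is assumed from the answer above, since the question names only D, E and F.

```python
import pandas as pd

# Made-up data in the layout the question describes:
# D = day, E = series label, F = identifier, R = measured value (assumed)
dataset = pd.DataFrame({
    "D": [1, 1, 2, 2, 1, 1, 2, 2],
    "E": ["A12", "A23", "A12", "A23", "A12", "A23", "A12", "A23"],
    "F": [1000, 1000, 1000, 1000, 2000, 2000, 2000, 2000],
    "R": [0.1, 0.5, 0.2, 0.6, 0.9, 1.1, 1.0, 1.2],
})

chosen_F = 1000

# Keep only the rows for the chosen F
rows = dataset[dataset.F == chosen_F]

# One column per E label, indexed by D
wide = rows.pivot(index="D", columns="E", values="R")
print(wide)

# Each column of `wide` is one data series against D (requires matplotlib):
# wide.plot(ylim=(-1, 2))
```

Calling plot on the pivoted frame draws one line per column, so the E labels end up as the legend entries.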