Importing data as an array for plotting in Python - python

I am new to this topic and hope to benefit from your advice. Sorry if the question is amateurish.
I have the following code, which ends by showing a plot. I am only posting the relevant part of the code.
...
cov = np.dot(A, A.T)
samps2 = np.random.multivariate_normal([0]*ndim, cov, size=nsamp)
print(samps2)
names = ["x%s"%i for i in range(ndim)]
labels = ["x_%s"%i for i in range(ndim)]
samples2 = MCSamples(samples=samps2,names = names, labels = labels, label='Second set')
g = plots.getSubplotPlotter()
g.triangle_plot([samples2], filled=True)
This works without problems. The plot is drawn from the data in samps2. To see what samps2 contains, we call print(samps2) and get:
[[-0.11213986 -0.0582685 ]
[ 0.20346731 0.25309022]
[ 0.22737737 0.2250694 ]
[-0.09544588 -0.12754274]
[-1.05491483 -1.15432073]
[-0.31340717 -0.36144749]
[-0.99158936 -1.12785124]
[-0.5218308 -0.59193326]
[ 0.76552123 0.82138362]
[ 0.65083618 0.70784292]]
My question is: if I want to read this data from a txt file instead, what should I do?
Thank you.

There are several ways. The ones that come to mind are:
Plain Python:
data = []
with open(filename, 'r') as f:
    for line in f:
        data.append([float(num) for num in line.split()])
numpy:
import numpy as np
data = np.genfromtxt(filename, ...)
pandas:
import pandas as pd
df = pd.read_table(filename, sep=r'\s+', header=None)
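As a minimal sketch of the numpy route, assuming the txt file holds plain whitespace-separated numbers (one row per line, no square brackets as in the printed output), np.loadtxt returns a 2-D array directly. The temporary file and its three rows are only there to make the example self-contained.

```python
import os
import tempfile

import numpy as np

# Hypothetical file contents: three rows copied from the printed samps2 output,
# written without the surrounding brackets.
text = "-0.11213986 -0.0582685\n 0.20346731 0.25309022\n 0.22737737 0.2250694\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(text)
    filename = f.name

# loadtxt splits on any whitespace by default and returns a float array
samps2 = np.loadtxt(filename)
os.remove(filename)

print(samps2.shape)  # (rows, columns) of the loaded array
```

The resulting array has one row per line of the file, so it should be usable in place of the samps2 generated by np.random.multivariate_normal in the question.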

Related

Weird Time-Series Graph Using Pycaret and plotly

I am trying to visualize air quality data as time-series charts using the pycaret and plotly dash Python libraries, but I am getting very weird graphs. Below is my code:
import pandas as pd
import plotly.express as px
data = pd.read_csv('E:/Self Learning/Djang_Dash/2019-2020_5.csv')
data['Date'] = pd.to_datetime(data['Date'], format='%d/%m/%Y')
#data.set_index('Date', inplace=True)
# combine store and item column as time_series
data['OBJECTID'] = ['Location_' + str(i) for i in data['OBJECTID']]
#data['AQI_Bins_AI'] = ['Bin_' + str(i) for i in data['AQI_Bins_AI']]
data['time_series'] = data[['OBJECTID']].apply(lambda x: '_'.join(x), axis=1)
data.drop(['OBJECTID'], axis=1, inplace=True)
# extract features from date
data['month'] = [i.month for i in data['Date']]
data['year'] = [i.year for i in data['Date']]
data['day_of_week'] = [i.dayofweek for i in data['Date']]
data['day_of_year'] = [i.dayofyear for i in data['Date']]
data.head(4000)
data['time_series'].nunique()
for i in data['time_series'].unique():
    subset = data[data['time_series'] == i]
    subset['moving_average'] = subset['CO'].rolling(window = 30).mean()
    fig = px.line(subset, x="Date", y=["CO","moving_average"], title = i, template = 'plotly_dark')
    fig.show()
I would appreciate any help with this.
Here is my sample data: Google Drive Link
The data was not provided in a usable form, so I looked for similar publicly available data and found: https://www.kaggle.com/rohanrao/air-quality-data-in-india?select=station_hour.csv
Using this data, with a couple of cleanups to your code, the plots show no issues. I suspect your data has one of these problems:
the date column is not datetime64[ns] in your data frame
the date column is not sorted, which leads to lines being drawn the way you describe
By refactoring the way the moving average is calculated, you can use an animation instead of lots of separate figures.
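The two suspected causes can be checked in a few lines. This is only a sketch on a made-up three-row frame (the column names Date and CO follow the question, the values are invented): it parses the dates, confirms the dtype, and sorts before plotting.

```python
import pandas as pd

# Made-up rows, deliberately out of order, with day/month/year strings
df = pd.DataFrame({
    "Date": ["03/01/2020", "01/01/2020", "02/01/2020"],
    "CO": [3.0, 1.0, 2.0],
})

# 1. make sure the column really is datetime64, not plain strings
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
is_datetime = pd.api.types.is_datetime64_any_dtype(df["Date"])

# 2. sort by date so a line plot connects points in chronological order
df = df.sort_values("Date").reset_index(drop=True)

print(is_datetime, df["Date"].is_monotonic_increasing)
```

If either check fails on the real data, a line chart will tend to zig-zag back and forth across the x axis.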
Get some data:
import kaggle.cli
import sys, math
import pandas as pd
from pathlib import Path
from zipfile import ZipFile
import plotly.express as px
# download data set
# https://www.kaggle.com/rohanrao/air-quality-data-in-india?select=station_hour.csv
sys.argv = [
    sys.argv[0]
] + "datasets download rohanrao/air-quality-data-in-india".split(
    " "
)
kaggle.cli.main()
zfile = ZipFile("air-quality-data-in-india.zip")
print([f.filename for f in zfile.infolist()])
Plot using the code from the question:
import pandas as pd
import plotly.express as px
from pathlib import Path
from distutils.version import StrictVersion
# data = pd.read_csv('E:/Self Learning/Djang_Dash/2019-2020_5.csv')
# use kaggle data
# dfs = {f.filename:pd.read_csv(zfile.open(f)) for f in zfile.infolist() if f.filename in ['station_day.csv',"stations.csv"]}
# data = pd.merge(dfs['station_day.csv'],dfs["stations.csv"], on="StationId")
# data['Date'] = pd.to_datetime(data['Date'])
# # kaggle data is different from question, make it compatible with questions data
# data = data.assign(OBJECTID=lambda d: d["StationId"])
# sample data from google drive link
data2 = pd.read_csv(Path.home().joinpath("Downloads").joinpath("AQI.csv"))
data2["Date"] = pd.to_datetime(data2["Date"])
data = data2
# as per the very first comment - it's important that the data is ordered!
data = data.sort_values(["Date","OBJECTID"])
data['time_series'] = "Location_" + data["OBJECTID"].astype(str)
# clean up data, remove rows where there is no CO value
data = data.dropna(subset=["CO"])
# can do moving average in one step (can also be used by animation)
if StrictVersion(pd.__version__) < StrictVersion("1.3.0"):
    data["moving_average"] = data.groupby("time_series",as_index=False)["CO"].rolling(window=30).mean().to_frame()["CO"].values
else:
    data["moving_average"] = data.groupby("time_series",as_index=False)["CO"].rolling(window=30).mean()["CO"]
# just the first three for the purpose of demonstration
for i in data['time_series'].unique()[0:3]:
    subset = data.loc[data['time_series'] == i]
    fig = px.line(subset, x="Date", y=["CO","moving_average"], title = i, template = 'plotly_dark')
    fig.show()
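As an alternative to the version check, and only as a sketch rather than what the answer above does, the per-group moving average can be written with groupby(...).transform, which keeps the original index and so should not need a branch on the pandas version (distutils, used for StrictVersion, is also no longer available from Python 3.12 on). The frame below is invented, and the window is shortened from 30 to 2 so the result is easy to inspect.

```python
import pandas as pd

# Invented data: two locations, four days each
data = pd.DataFrame({
    "time_series": ["Location_1"] * 4 + ["Location_2"] * 4,
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"] * 2),
    "CO": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})
data = data.sort_values(["time_series", "Date"])

window = 2  # the answer uses 30; shortened here for the toy data

# transform returns a Series aligned with data's index, one rolling mean per group
data["moving_average"] = data.groupby("time_series")["CO"].transform(
    lambda s: s.rolling(window=window).mean()
)

print(data)
```

The first row of each location is NaN because the window is not yet full, which also shows that the average does not leak from one location into the next.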
An animation can be used instead:
px.line(
data,
x="Date",
y=["CO", "moving_average"],
animation_frame="time_series",
template="plotly_dark",
).update_layout(yaxis={"range":[data["CO"].min(), data["CO"].quantile(.97)]})

HeatMap on pandas python 3.7

I'm trying to make a nice heatmap using pandas. The data is a CSV file in the same folder as the Python script.
I get an error in my code:
File "<ipython-input-6-1b7ca215e6d0>", line 4
fid = datadf u'/my_Path/File.csv'
^
SyntaxError: invalid syntax
I suspect the underlying problem is something other than the syntax.
Could you help me?
My code is:
datadf = pd.read_csv("D:\my_Path\File.csv")
## Loading the data
fid = datadf u'/my_Path/File.csv'
key = u'dataset_key'
## Load the dataframe
df = pd.read_hdf(fid,key)
## Default plot ranges:
long_range = (datadf['long'].min(), datadf['long'].max())
lat_range = (datadf['lat'].min(), datadf['lat'].max())
## France plot ranges
long_range_fr = (-5,10)
lat_range_fr = (40,52)
## Visualization
### Custom functions
def bg(img):
    return tf.set_background(img,"black")
def create_image(long_range=long_range, lat_range=lat_range, w=800, h=800):
    cvs = ds.Canvas(x_range=long_range, y_range=lat_range, plot_height=h, plot_width=w)
    agg = cvs.points(df, 'lon', 'lat')
    return bg(tf.shade(agg, cmap = cm(Hot,0.2), how='eq_hist'))
### Static plot
create_image(long_range=long_range_fr, lat_range=lat_range_fr)
A sample of my data:
long lat
-0.91655 43.456863
-0.495795 43.162117
-0.029272 43.097401
-0.108955 43.233845
-0.10237 43.207676
-0.096726 43.19257
-0.102862 43.216438
-0.1091 43.234241
-0.105826 43.225636
-0.096518 43.190247
-0.098496 43.19902
-0.079585 43.229698
-0.081321 43.232929
-0.079448 43.232937
-0.624699 43.364143
-0.429526 43.328094
As @Kristóf Varga rightly mentioned, a seaborn heatmap can be used here.
A solution can be found here: Using seaborn heatmap
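To give a rough idea of what the data preparation for such a heatmap could look like, here is a sketch that bins longitude/latitude points into a 2-D grid of counts with numpy. The six points are taken from the sample data in the question; the choice of 4 bins per axis is an arbitrary assumption.

```python
import numpy as np

# First six rows of the sample data from the question
lon = np.array([-0.91655, -0.495795, -0.029272, -0.108955, -0.10237, -0.096726])
lat = np.array([43.456863, 43.162117, 43.097401, 43.233845, 43.207676, 43.19257])

# Count how many points fall into each longitude/latitude cell
counts, lon_edges, lat_edges = np.histogram2d(lon, lat, bins=4)

# histogram2d puts longitude on the first axis; transpose and flip so that
# rows run north to south and columns run west to east, like a map
grid = counts.T[::-1]

print(grid)
```

Assuming seaborn is installed, the grid can then be drawn with seaborn.heatmap(grid). In the real script the arrays would come from the CSV, for example datadf['long'] and datadf['lat'].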

Python input for Spectral Clustering

I am using the code from https://github.com/pin3da/spectral-clustering/blob/master/spectral/utils.py to spectrally cluster data in https://cs.joensuu.fi/sipu/datasets/s1.txt
May I know how I can change the code so that it can take a txt file as input?
I have given the original code below for reference
Original code from GitHub
import numpy
import scipy.io
import h5py
def load_dot_mat(path, db_name):
    try:
        mat = scipy.io.loadmat(path)
    except NotImplementedError:
        mat = h5py.File(path)
    return numpy.array(mat[db_name]).transpose()
I do not understand the purpose of the variable, db_name
The code you show here just opens a given mat or h5 file. The path to the file (path) and the name of the data set within the file (db_name) are provided as arguments to the load_dot_mat function.
To load your txt file, we can create our own little load function:
def load_txt(filename):
    with open(filename, "r") as f:
        data = [[int(x) for x in line.split(" ") if x != ""] for line in f]
    return np.array(data)
This function takes the path to your "txt" file as an argument and returns a numpy array with the data from your file. The data array has shape (5000,2) for the file you provided. You may want to use float instead of int if other files contain float values and not only integers.
The complete clustering step for your data could then look like this:
from itertools import cycle, islice
import matplotlib.pyplot as plt
import numpy as np
import seaborn
from spectral import affinity, clustering
seaborn.set()
def load_txt(filename):
    with open(filename, "r") as f:
        data = [[int(x) for x in line.split(" ") if x != ""] for line in f]
    return np.array(data)
data = load_txt("s1.txt")
A = affinity.com_aff_local_scaling(data)
n_cls = 15 # found by looking at your data
Y = clustering.spectral_clustering(A, n_cls)
colors = np.array(list(islice(cycle(seaborn.color_palette()), int(max(Y) + 1))))
fig = plt.figure(1)
ax = fig.add_subplot(111)
ax.scatter(data[:, 0], data[:, 1], color=colors[Y], s=6, alpha=0.6)
plt.show()

Python - Outputting two data sets (lists?) to data file as two columns

I am a novice when it comes to Python; most of my programming has been in C++. I have a program which generates the fast Fourier transform of a data set and plots both the data and the FFT in two windows using matplotlib. Instead of plotting, I want to output the data to a file. This would be a simple task for me in C++, but I can't seem to figure it out in Python. So the question is: how can I output powerx and powery to a data file in which the two data sets are in separate columns? Below is the program:
import matplotlib.pyplot as plt
from fft import fft
from fft import fft_power
from numpy import array
import math
import time
# data downloaded from ftp://ftp.cmdl.noaa.gov/ccg/co2/trends/co2_mm_mlo.txt
print ' C02 Data from Mauna Loa'
data_file_name = 'co2_mm_mlo.txt'
file = open(data_file_name, 'r')
lines = file.readlines()
file.close()
print ' read', len(lines), 'lines from', data_file_name
window = False
yinput = []
xinput = []
for line in lines :
    if line[0] != '#' :
        try:
            words = line.split()
            xval = float(words[2])
            yval = float( words[4] )
            yinput.append( yval )
            xinput.append( xval )
        except ValueError :
            print 'bad data:',line
N = len(yinput)
log2N = math.log(N, 2)
if log2N - int(log2N) > 0.0 :
    print 'Padding with zeros!'
    pads = [300.0] * (pow(2, int(log2N)+1) - N)
    yinput = yinput + pads
    N = len(yinput)
    print 'Padded : '
    print len(yinput)
# Apply a window to reduce ringing from the 2^n cutoff
if window :
    for iy in xrange(len(yinput)) :
        yinput[iy] = yinput[iy] * (0.5 - 0.5 * math.cos(2*math.pi*iy/float(N-1)))
y = array( yinput )
x = array([ float(i) for i in xrange(len(y)) ] )
Y = fft(y)
powery = fft_power(Y)
powerx = array([ float(i) for i in xrange(len(powery)) ] )
Yre = [math.sqrt(Y[i].real**2+Y[i].imag**2) for i in xrange(len(Y))]
plt.subplot(2, 1, 1)
plt.plot( x, y )
ax = plt.subplot(2, 1, 2)
p1, = plt.plot( powerx, powery )
p2, = plt.plot( x, Yre )
ax.legend( [p1, p2], ["Power", "Magnitude"] )
plt.yscale('log')
plt.show()
You can use csv.writer() for this task; here is the reference: https://docs.python.org/2.6/library/csv.html
Basic usage:
Zip your lists into rows:
rows=zip(powery,powerx)
Use a csv writer to write the data to a csv file:
import csv
# 'wb' is the Python 2 idiom; on Python 3 use open('test.csv', 'w', newline='')
with open('test.csv', 'wb') as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)
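For Python 3, the same idea might look like the sketch below. The short powerx and powery lists are stand-ins for the real arrays, the header row is an optional addition, and the file is written to a temporary directory only to keep the example self-contained.

```python
import csv
import os
import tempfile

# Stand-in values for the arrays computed in the question
powerx = [0.0, 1.0, 2.0]
powery = [10.5, 3.2, 0.7]

path = os.path.join(tempfile.mkdtemp(), "test.csv")

# Text mode with newline="" is what the csv module expects on Python 3
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["powerx", "powery"])   # optional header
    writer.writerows(zip(powerx, powery))   # one row per (x, y) pair

# Read the file back to check the two-column layout
with open(path, newline="") as f:
    rows = list(csv.reader(f))

print(rows)
```

writerows takes any iterable of rows, so the explicit loop over writerow is not needed.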
Depending on what you want to use the file for, I'd suggest either the csv module or the json module.
Writing the file as CSV data will give you the ability to open it with a spreadsheet, graph it, edit it, etc.
Writing the file as JSON data will give you the ability to quickly import it into other programming languages, and to inspect it (generally read-only -- if you want to do serious editing, go with CSV).
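A minimal sketch of the JSON variant, again with stand-in lists in place of the real data. The key names are an assumption chosen to match the variable names in the question; numpy arrays would need to be converted with .tolist() first, since the json module does not serialize them directly.

```python
import json
import os
import tempfile

# Stand-in values for the arrays computed in the question
powerx = [0.0, 1.0, 2.0]
powery = [10.5, 3.2, 0.7]

path = os.path.join(tempfile.mkdtemp(), "power.json")

# Store both data sets under named keys in a single JSON object
with open(path, "w") as f:
    json.dump({"powerx": powerx, "powery": powery}, f, indent=2)

# Reading it back gives plain Python lists again
with open(path) as f:
    loaded = json.load(f)

print(loaded["powerx"], loaded["powery"])
```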
This is how you can write data from two different lists into a text file in two columns.
# Two random lists
index = [1, 2, 3, 4, 5]
value = [4.5, 5, 7.0, 11, 15.7]
# Opening file for output
file_name = "output.txt"
fwm = open(file_name, 'w')
# Writing data in file
for i in range(len(index)):
    fwm.write(str(index[i])+"\t")
    fwm.write(str(value[i])+"\n")
# Closing file after writing
fwm.close()
If your lists already contain strings, drop the str() calls when writing the data to the file.
If you want to save the data in a CSV file, replace
fwm.write(str(index[i])+"\t")
with
fwm.write(str(index[i])+",")
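Since the data in the question already lives in numpy arrays, another option (not mentioned in the answers above, so treat it as a suggestion) is np.savetxt, which writes the columns in one call. The sketch reuses the index and value lists from this answer; the "%g" format is an arbitrary choice.

```python
import os
import tempfile

import numpy as np

index = np.array([1, 2, 3, 4, 5])
value = np.array([4.5, 5, 7.0, 11, 15.7])

path = os.path.join(tempfile.mkdtemp(), "output.txt")

# column_stack puts the two 1-D arrays side by side as columns
np.savetxt(path, np.column_stack([index, value]), delimiter="\t", fmt="%g")

# Load the file again to confirm the two-column layout
loaded = np.loadtxt(path, delimiter="\t")

print(loaded)
```

Changing delimiter to "," would give a CSV file instead.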

Create a weighted adjacency list from an alphanumeric edgelist in Python

I've been working on this dataset of protein-protein interactions. I have the edgelist in the following format:
AIG676464 AIG8475985 0.00035. Protein 1, Protein 2, weight.
I've tried several methods and can't get it to output the matrix. What I am hoping to get is the matrix form of the interactions. Any help would be greatly appreciated. Python or R is fine.
I've tried networkx:
import networkx as nx
fh = open("InWeb29.txt", 'rb')
#d = fh.write(textline)
#fh.close()
G = nx.read_edgelist(fh)
G = nx.Graph([()])
A = nx.adjacency_matrix(G)
print(A.todense())
A.setdiag(A.diagonal()*2)
print(A.todense())
Here is my other code so far:
import csv
import pandas as pd
"Load in data file"
"""Read in the data file"""
df = pd.read_csv("datafile.txt", sep= '\t', header=0)
headers = list(df)
prot1 = df[df.columns[0]]
prot2 = df[df.columns[1]]
weight = df[df.columns[2]]
print prot1
with open("datafile.txt") as f:
    next(f)
    data = [tuple(map(str,row)) for row in csv.reader(f)]
n = max(max(prot1, prot2) for prot1, prot2, weight in data)
matrix = [[None]* n for i in range(n)]
for prot1, prot2 in data:
    matrix[prot1][prot2]= weight
for row in matrix:
    print(row)
In NetworkX you can use read_weighted_edgelist:
import networkx as nx
import StringIO
s = StringIO.StringIO("AIG676464 AIG8475985 0.00035")
G = nx.read_weighted_edgelist(s)
A = nx.adjacency_matrix(G)
print A.todense()
Output
[[ 0. 0.00035]
[ 0.00035 0. ]]
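If you want to see what happens under the hood, or avoid the NetworkX dependency, the matrix can also be built by hand. The second attempt in the question fails because it indexes the matrix with protein names; the sketch below first maps each name to an integer position. The function name adjacency_from_edges is made up for this example, and the graph is assumed to be undirected.

```python
import numpy as np

def adjacency_from_edges(edges):
    """Build a symmetric weighted adjacency matrix from (a, b, weight) triples."""
    # Collect every protein name once and give it a fixed row/column position
    nodes = sorted({name for a, b, _ in edges for name in (a, b)})
    position = {name: i for i, name in enumerate(nodes)}

    matrix = np.zeros((len(nodes), len(nodes)))
    for a, b, weight in edges:
        matrix[position[a], position[b]] = weight
        matrix[position[b], position[a]] = weight  # undirected: mirror the entry
    return nodes, matrix

# The single example edge from the question
edges = [("AIG676464", "AIG8475985", 0.00035)]
nodes, matrix = adjacency_from_edges(edges)

print(nodes)
print(matrix)
```

The nodes list tells you which protein each row and column corresponds to. For a large edge list a dense matrix can get big, which is one reason the NetworkX version returns a sparse matrix.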
