Gurobi in Python: best way to read csv file - python

I'm learning how to solve combinatorial optimization problems in Gurobi using Python. I would like to know the best way to read a csv file so the data can be used as model parameters. I'm using 'genfromtxt' to read the csv file, but I'm having difficulties using it for constraint construction (Gurobi doesn't support this type - see the error below).
Here are my code and error message; my_data is composed of 4 columns: node index, x coordinate, y coordinate and maximum degree.
from gurobipy import *
from numpy import genfromtxt
import math
# Read data from csv file
my_data = genfromtxt('prob25.csv', delimiter=',')
# Number of vertices
n = len(my_data)
# Function to calculate Euclidean distances
dist = {(i,j) :
        math.sqrt(sum((my_data[i][k]-my_data[j][k])**2 for k in [1,2]))
        for i in range(n) for j in range(i)}
# Create a new model
m = Model("dcstNarula")
# Create variables
vars = m.addVars(dist.keys(), obj=dist, vtype=GRB.BINARY, name='e')
for i,j in vars.keys():
    vars[j,i] = vars[i,j] # edge in opposite direction
m.update()
# Add degree-b constraint
m.addConstrs((vars.sum('*',j) <= my_data[:,3]
              for i in range(n)), name='degree')
GurobiError: Unsupported type (<type 'numpy.ndarray'>) for LinExpr addition argument
First two lines of data
1,19.007,35.75,1
2,4.4447,6.0735,2

Actually it was a problem of indexing instead of data type. In the code:
# Add degree-b constraint
m.addConstrs((vars.sum('*',j) <= my_data[:,3]
              for i in range(n)), name='degree')
vars.sum('*',i) should be used instead of vars.sum('*',j), and my_data[i,3] instead of my_data[:,3].
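With those two changes, the constraint block would read:
# Add degree-b constraint (corrected indexing)
m.addConstrs((vars.sum('*',i) <= my_data[i,3]
              for i in range(n)), name='degree')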

Even though this question is already answered, pandas deserves a mention for future visitors looking for good ways to read a csv file:
import pandas as pd
df = pd.read_csv('prob25.csv', header=None, index_col=0, names=['x', 'y', 'idx'])
df
         x        y  idx
1  19.0070  35.7500    1
2   4.4447   6.0735    2
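For the Gurobi model above, the same inputs can then be pulled straight from the dataframe. A minimal sketch (column names follow the read_csv call above, where 'idx' holds the maximum degree):
import math
import pandas as pd

df = pd.read_csv('prob25.csv', header=None, index_col=0, names=['x', 'y', 'idx'])
coords = df[['x', 'y']].to_numpy()
degree = df['idx'].to_numpy()
n = len(df)

# same distance dictionary as with genfromtxt, built from the dataframe instead
dist = {(i, j): math.sqrt(((coords[i] - coords[j]) ** 2).sum())
        for i in range(n) for j in range(i)}
# degree[i] can then replace my_data[i, 3] on the right-hand side of the degree constraint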

Related

How do I optimize a for loop for faster results in Python

I've written a piece of code to extract data from an HDF5 file and save it into a dataframe that I can export as .csv later. The final data frame has roughly 2.5 million rows and the code takes a long time to execute.
Is there any way I can optimize this code so that it runs faster?
Current runtime is 7.98 minutes!
Ideally I would want to run this program for 48 files like these and expect a faster run time.
Link to source file: https://drive.google.com/file/d/1g2fpJHZmD5FflfB4s3BlAoiB5sGISKmg/view
import h5py
import numpy as np
import pandas as pd
#import geopandas as gpd
#%%
f = h5py.File('mer.h5', 'r')
for key in f.keys():
    #print(key) #Names of the root level object names in HDF5 file - can be groups or datasets.
    #print(type(f[key])) # get the object type: usually group or dataset
    ls = list(f.keys())
#Get the HDF5 group; key needs to be a group name from above
key ='DHI'
#group = f['OBSERVATION_TIME']
#print("Group")
#print(group)
#for key in ls:
#data = f.get(key)
#dataset1 = np.array(data)
#length=len(dataset1)
masterdf=pd.DataFrame()
data = f.get(key)
dataset1 = np.array(data)
#masterdf[key]=dataset1
X = f.get('X')
X_1 = pd.DataFrame(X)
Y = f.get('Y')
Y_1 = pd.DataFrame(Y)
#%%
data_df = pd.DataFrame(index=range(len(Y_1)),columns=range(len(X_1)))
for i in data_df.index:
    data_df.iloc[i] = dataset1[0][i]
#data_df.to_csv("test.csv")
#%%
final = pd.DataFrame(index=range(1616*1616),columns=['X', 'Y','GHI'])
k=0
for y in range(len(Y_1)):
    for x in range(len(X_1[:-2])): #X and Y ranges are not same
        final.loc[k,'X'] = X_1[0][x]
        final.loc[k,'Y'] = Y_1[0][y]
        final.loc[k,'GHI'] = data_df.iloc[y,x]
        k=k+1
        # print(k)
We can optimize loops by vectorizing operations. Vectorized operations are one to two orders of magnitude faster than their pure-Python equivalents (especially in numerical computations). Vectorization is something we get with NumPy, a library with efficient data structures designed to hold matrix data.
Could you please try the following (file.h5 being your file):
import pandas as pd
import h5py
with h5py.File("file.h5", "r") as file:
    df_X = pd.DataFrame(file.get("X")[:-2], columns=["X"])
    df_Y = pd.DataFrame(file.get("Y"), columns=["Y"])
    DHI = file.get("DHI")[0][:, :-2].reshape(-1)

final = df_Y.merge(df_X, how="cross").assign(DHI=DHI)[["X", "Y", "DHI"]]
Some explanations:
First read the data with key X into a dataframe df_X with one column X, except for the last 2 data points.
Then read the full data with key Y into a dataframe df_Y with one column Y.
Then get the data with key DHI and take the first element [0] (there are no more): the result is a NumPy array with 2 dimensions, a matrix. Now remove the last two columns ([:, :-2]) and reshape the matrix into a 1-dimensional array, in the order you are looking for (order="C" is the default). The result is the column DHI of your final dataframe.
Finally take the cross product of df_Y and df_X (y is your outer dimension in the loop) via .merge with how="cross", add the DHI column, and rearrange the columns in the order you want.
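Since the eventual goal is to run this over 48 similar files and export each result as .csv, the vectorized version can be wrapped in a small loop. A rough sketch (the file pattern and output names are assumptions):
import glob
import h5py
import pandas as pd

def extract(path, key="DHI"):
    # vectorized extraction, as above
    with h5py.File(path, "r") as file:
        df_X = pd.DataFrame(file.get("X")[:-2], columns=["X"])
        df_Y = pd.DataFrame(file.get("Y"), columns=["Y"])
        values = file.get(key)[0][:, :-2].reshape(-1)
    return df_Y.merge(df_X, how="cross").assign(**{key: values})[["X", "Y", key]]

for path in glob.glob("*.h5"):  # hypothetical pattern covering the 48 files
    extract(path).to_csv(path.replace(".h5", ".csv"), index=False)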

How to get nearest match in csv file python

I want to get the nearest match to a given row in my big .csv file in python. My (shortened) .csv file is:
0,4,5,0,132,24055,0,64,6,23215,39635,22,21451751,3233419908,8,0,4126,368,15087,0
0,4,5,16,52,22607,0,64,6,24727,22,39635,3233439332,21453192,8,0,26,501,28207,0
1,4,5,0,40,1727,0,128,6,29216,62281,22,123196295,3338477204,5,0,26,513,30738,0
0,4,5,0,116,24108,0,64,6,23178,39635,22,21452647,3233437508,8,0,4126,644,61163,0
0,4,5,0,724,32046,0,64,6,14632,38655,22,1452688218,1828171762,8,0,4126,343,31853,0
0,4,5,0,76,26502,0,128,6,4405,50266,22,1776918274,3172205875,5,0,4126,512,9381,0
1,4,5,0,40,7662,0,64,6,39665,22,62202,3176642698,3972914889,5,0,26,501,63331,0
1,4,5,0,52,939,0,128,6,29992,62206,22,1466629610,0,8,0,44,64240,43460,0
0,4,5,16,76,10076,0,64,6,37199,22,50268,4016221794,718292575,5,0,4126,501,310,0
0,4,5,0,40,26722,0,128,6,4221,50270,22,38340335,3852724687,5,0,26,510,36549,0
0,4,5,0,76,26631,0,128,6,4276,50266,22,1776920362,3172222235,5,0,4126,511,61692,0
0,4,5,16,148,38558,0,64,6,8680,22,37221,2019795091,3598991383,8,0,4126,501,9098,0
0,4,5,0,52,24058,0,64,6,23292,39635,22,21452135,3233420036,8,0,26,368,38558,0
0,4,5,16,76,10249,0,64,6,37026,22,50266,3172221011,1776919966,5,0,4126,501,31557,0
0,4,5,16,212,38490,0,64,6,8684,22,37221,2019776067,3598991175,8,0,4126,501,56063,0
0,4,5,0,60,0,0,64,6,47342,22,44751,2722242689,3606442876,10,0,4426,65160,29042,0
0,4,5,16,76,10234,0,64,6,37041,22,50266,3172220319,1776919498,5,0,4126,501,49854,0
1,4,5,0,1016,1737,0,128,6,28230,62273,22,3387237183,3449598142,5,0,4126,513,49536,0
1,4,5,0,40,20630,0,64,6,26697,22,62288,4040909519,95375909,5,0,26,501,36104,0
0,4,5,16,180,22591,0,64,6,24615,22,39635,3233437764,21452775,8,0,4126,501,28548,0
0,4,5,0,52,31654,0,64,6,15696,47873,22,3476257438,205382502,8,0,26,368,59804,0
1,4,5,0,320,20922,0,64,6,26125,22,62195,2187234888,2519273239,5,0,4126,501,52263,0
0,4,5,0,1132,22526,0,64,6,23744,22,39635,3233417124,21450447,8,0,4126,509,12391,0
1,4,5,0,52,0,0,64,6,47315,22,62282,3209938138,2722777338,8,0,4426,64240,36683,0
0,4,5,0,52,3091,0,64,6,44259,22,38655,1828172842,1452688914,8,0,26,504,7425,0
0,4,5,16,132,10184,0,64,6,37035,22,50266,3172212167,1776918310,5,0,4126,501,44260,0
0,4,5,16,256,10167,0,64,6,36928,22,50266,3172210503,1776918310,5,0,4126,501,19165,0
1,4,5,0,120,2043,0,128,6,28820,62294,22,644393448,2960970388,5,0,4126,512,36939,0
0,4,5,16,196,38575,0,64,6,8615,22,37221,2019796627,3598991543,8,0,4126,501,29587,0
0,4,5,16,148,22599,0,64,6,24639,22,39635,3233438532,21452967,8,0,4126,501,41316,0
1,4,5,0,88,1733,0,128,6,29162,62267,22,872073945,3114048214,5,0,4126,508,23918,0
I have made a program, but it isn't finished and I don't know how I can complete it. Do I have to use another program?
with open("<dir>", "r") as file:
    file = file.readlines()
    len_ = len(file)

string = "4,5,0,52,32345,0,64,6,15005,37221,22,3598991799,2019801315,8,0,26,691,17176,0" # The string for which I want to find the nearest data in the .csv file.
list_ = []
for i in range(1, len_):
    item = str(file[i])
    item2 = item[2:]
    list_.append(item2)
for item in list_:
    # algorithm: look from left to right over the row and find the row with the most sequential matches to the search data.
It seems you are handling a machine learning problem, with a dataset and a point for which to find the nearest neighbor. I assume you want the point of the dataset that has the shortest Euclidean distance (in 19 dimensions) to the given point.
I would use pandas and scikit-learn packages with the NearestNeighbors algorithm.
Import the packages
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd
Load file.csv as a Pandas DataFrame (with generic column names)
df = pd.read_csv('file.csv', index_col=False, names=np.arange(20))
Since you want the first column of values as results, I move it to a Pandas Series called "first_column" and drop it from the "df" dataframe
first_column = df[0]
df.drop(columns=[0], inplace=True)
What you called "string", I call "y" and set as a numpy array:
y = np.array([[4,5,0,52,32345,0,64,6,15005,37221,22,3598991799,2019801315,8,0,26,691,17176,0]])
now let's fit the NearestNeighbors model
nnb = NearestNeighbors(n_neighbors=1).fit(df)
and now compute which point in the data set is closest to the given point y:
distances, indices = nnb.kneighbors(y, n_neighbors=1)
print(indices)
[[13]]
So the nearest point has index 13 in the dataframe. Let's print the value of first_column at index 13:
print(first_column.loc[13])
0

"IndexError: too many indices for array" while merging VIPERS and PRIMUS

Hi, I'm trying to extract RA, Dec and redshift information from two surveys (PRIMUS and VIPERS) and collect them into a single nd-array.
The code is as follows :
from astropy.io import fits
import numpy as np
hdulist_PRIMUS = fits.open('data/PRIMUS_2013_zcat_v1.fits')
data_PRIMUS = hdulist_PRIMUS[1].data
data_PRIMUS = np.column_stack((data_PRIMUS['RA'], data_PRIMUS['DEC'],
                               data_PRIMUS['Z'], data_PRIMUS['FIELD']))
data_PRIMUS = np.array(filter(lambda x: x[3].strip() == 'xmm', data_PRIMUS))[:, :3]
data_PRIMUS = np.array(map(lambda x: [float(x[0]), float(x[1]), float(x[2])], data_PRIMUS))
hdulist_VIPERS = fits.open('data/VIPERS_W1_SPECTRO_PDR2.fits')
data_VIPERS = hdulist_VIPERS[1].data
data_VIPERS = np.column_stack((data_VIPERS['alpha'], data_VIPERS['delta'], data_VIPERS['zspec']))
from astropy import units as u
from astropy.coordinates import SkyCoord
PRIMUS_catalog = SkyCoord(ra=data_PRIMUS[:, 0]*u.degree, dec =data_PRIMUS[:, 1]*u.degree)
VIPERS_catalog = SkyCoord(ra=data_VIPERS[:, 0]*u.degree, dec=data_VIPERS[:, 1]*u.degree)
idx, d2d, d3d = PRIMUS_catalog.match_to_catalog_sky(VIPERS_catalog)
feasible_indices = np.array(map(
    lambda x: x[0],
    filter(lambda x: x[1].value > 1e-3, zip(idx, d2d))))
data_VIPERS = data_VIPERS[feasible_indices]
data_HZ = np.vstack((data_PRIMUS, data_VIPERS))
When I run this I'm getting an "IndexError: too many indices for array"
Datasets:
PRIMUS Redshift Catalog - https://primus.ucsd.edu/version1.html
VIPERS Redshift Catalog - https://projects.ift.uam-csic.es/skies-universes/VIPERS/photometry/
I think there are a few places where you're making this harder for yourself by not using the existing, available tools effectively. For example, since you are working with tabular data from a FITS file, you can take advantage of Astropy's Table interface:
>>> from astropy.table import Table
>>> primus = Table.read('PRIMUS_2013_zcat_v1.fits')
(for this particular file I got some warnings about some of the headers in the table being non-standard, but this can be ignored).
If you want to do some operations on just a few columns of the table, you can do this easily. For example, rather than doing what you did, of selecting a few columns together, and then stacking them into a new array
np.column_stack((data_PRIMUS['RA'], data_PRIMUS['DEC'],
data_PRIMUS['Z'], data_PRIMUS['FIELD']))
you can select a subset of columns from the table like so:
>>> primus[['RA', 'DEC', 'Z', 'FIELD']]
<Table length=213696>
RA DEC Z FIELD
degree degree
float64 float64 float32 bytes13
------------------ ------------------- ---------- -------------
52.892275339281994 -27.833172368069615 0.3420992 calib
52.88448889270391 -27.85252305560996 0.4824943 calib
52.880363885710295 -27.86221750021335 0.33976158 calib
52.88334306466262 -27.86937808271639 0.6134631 calib
52.8866138857103 -27.871773055662942 0.58744365 calib
52.885607068267845 -27.889578785511922 0.26873255 calib
... ... ... ...
34.54856 -4.5544 0.8544105 xmm
34.56942 -4.57564 0.6331108 xmm
34.567412432719756 -4.572718190305209 1.1456184 xmm
34.57134 -4.56414 0.6346616 xmm
34.58088 -4.56804 1.081143 xmm
34.58686 -4.57449 0.7471819 xmm
Then it seems you select the RA, DEC, and Z columns where the field is xmm by using a filter function, but as these are Numpy arrays you can use the filtering capabilities built into Numpy array indexing, as well as Table indexing. The only tricky part is that since these are fixed width character fields you do still need to perform comparisons correctly. You can use Numpy's string functions like np.char.startswith for this:
>>> primus = primus[np.char.startswith(primus['FIELD'], b'xmm')]
In the process of doing a performance comparison, I realized this line is where you're probably getting the error IndexError: too many indices for array:
>>> np.array(filter(lambda x: x[3].strip() == 'xmm', primus))
array(<filter object at 0x7f5170981940>, dtype=object)
In Python 3, the filter function returns an iterable, so wrapping it in np.array() just makes a 0-D array containing this Python object; it's probably not what you intended, so it fails here (this is where looking at the traceback might have been useful). Even if you wrapped the filter() call in list() it wouldn't work, because np.array() only takes homogeneous arrays normally. So an approach like the one I gave is perfectly sufficient (though there may be slightly more efficient ways). It also makes the next line:
np.array(map(lambda x: [float(x[0]), float(x[1]), float(x[2])], data_PRIMUS))
unnecessary. In particular, the first three columns are already in floating point format, so this would not be necessary anyway.
Some similar advice applies to the other parts of your code. I'd have written it more like this:
import numpy as np
from astropy.table import Table, vstack
from astropy import units as u
from astropy.coordinates import SkyCoord
primus = Table.read('PRIMUS_2013_zcat_v1.fits')
primus_field = primus['FIELD']
primus = primus[['RA', 'DEC', 'Z']]
primus = primus[np.char.startswith(primus_field, b'xmm')]
vipers = Table.read('VIPERS_W1_SPECTRO_PDR2.fits')[['alpha', 'delta', 'zspec']]
primus_catalog = SkyCoord(ra=primus['RA']*u.degree, dec=primus['DEC']*u.degree)
vipers_catalog = SkyCoord(ra=vipers['alpha']*u.degree, dec=vipers['delta']*u.degree)
idx, d2d, d3d = primus_catalog.match_to_catalog_sky(vipers_catalog)
feasible_indices = idx[d2d > 1e-3*u.degree]
vipers = vipers[feasible_indices]
vipers.rename_columns(['alpha', 'delta', 'zspec'], ['RA', 'DEC', 'Z'])
hz = vstack([primus, vipers])
Please let me know if there are any parts of this you have questions on.

Dask read_csv: skip periodically occurring lines

I want to use Dask to read in a large file of atom coordinates at multiple time steps. The format is called XYZ file, and it looks like this:
3
timestep 1
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
3
timestep 2
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
The first line contains the atom number, the second line is just a comment.
After that, the atoms are listed with their names and positions.
After all atoms are listed, the same is repeated for the next time step.
I would now like to load such a trajectory via dask.dataframe.read_csv.
However, I could not figure out how to skip the periodically occurring lines containing the atom number and the comment. Is this actually possible?
Edit:
Reading this format into a Pandas Dataframe is possible via:
atom_nr = 3

def skip(line_nr):
    return line_nr % (atom_nr + 2) < 2

pd.read_csv(xyz_filename, skiprows=skip, delim_whitespace=True,
            header=None)
But it looks like the Dask dataframe does not support passing a function to skiprows.
Edit 2:
MRocklin's answer works! Just for completeness, I write down the full code I used.
from io import BytesIO
import pandas as pd
import dask.bytes
import dask.dataframe
import dask.delayed
atom_nr = ...
filename = ...
def skip(line_nr):
    return line_nr % (atom_nr + 2) < 2

def pandaread(data_in_bytes):
    pseudo_file = BytesIO(data_in_bytes[0])
    return pd.read_csv(pseudo_file, skiprows=skip, delim_whitespace=True,
                       header=None)
bts = dask.bytes.read_bytes(filename, delimiter=f"{atom_nr}\ntimestep".encode())
dfs = dask.delayed(pandaread)(bts)
sol = dask.dataframe.from_delayed(dfs)
sol.compute()
The only remaining question is: How do I tell dask to only compute the first n frames? At the moment it seems the full trajectory is read.
Short answer
No, dask.dataframe.read_csv does not offer this kind of functionality (to my knowledge); pandas.read_csv accepts a callable for skiprows, as your edit shows, but Dask does not support that.
Long Answer
If you can write code to convert some of this data into a pandas dataframe, then you can probably do this on your own with moderate effort using
dask.bytes.read_bytes
dask.dataframe.from_delayed
In general this might look something like the following:
import dask
import dask.dataframe as dd
from dask.bytes import read_bytes
values = read_bytes('filenames.*.txt', delimiter='...', blocksize=2**27)
dfs = [dask.delayed(load_pandas_from_bytes)(v) for v in values]
df = dd.from_delayed(dfs)
Each of the dfs corresponds to roughly blocksize bytes of your data (and then up until the next delimiter). You can control how fine you want your partitions to be using this blocksize. If you want, you can also select only a few of these dfs objects to get a smaller portion of your data:
dfs = dfs[:5] # only the first five blocks of `blocksize` data
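For the follow-up question about only computing the first n frames: since each delayed block becomes one partition, you can slice the list of blocks before calling from_delayed. A rough sketch, assuming read_bytes returns a (sample, blocks) pair as in recent Dask versions, with load_pandas_from_bytes being the same placeholder as above:
sample, blocks = read_bytes('filenames.*.txt', delimiter='...', blocksize=2**27)
# blocks is a list (one entry per file) of lists of delayed byte chunks;
# flatten it and keep only the first few chunks (each chunk holds roughly
# blocksize bytes, i.e. several frames, so pick blocksize accordingly)
chunks = [blk for file_blocks in blocks for blk in file_blocks][:5]
dfs = [dask.delayed(load_pandas_from_bytes)(blk) for blk in chunks]
df = dd.from_delayed(dfs)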

Graphlab and numpy issue

I'm currently doing a course on Coursera (Machine Learning) offered by the University of Washington and I'm facing a little problem with numpy and graphlab.
The course requires a version of graphlab higher than 1.7.
Mine is higher, as you can see below; however, when I run the script below, I get the following error:
[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started.
def get_numpy_data(data_sframe, features, output):
    data_sframe['constant'] = 1
    features = ['constant'] + features # this is how you combine two lists
    # the following line will convert the features_SFrame into a numpy matrix:
    feature_matrix = features_sframe.to_numpy()
    # assign the column of data_sframe associated with the output to the SArray output_sarray
    # the following will convert the SArray into a numpy array by first converting it to a list
    output_array = output_sarray.to_numpy()
    return(feature_matrix, output_array)

(example_features, example_output) = get_numpy_data(sales,['sqft_living'], 'price') # the [] around 'sqft_living' makes it a list
print example_features[0,:] # this accesses the first row of the data the ':' indicates 'all columns'
print example_output[0] # and the corresponding output
----> 8 feature_matrix = features_sframe.to_numpy()
NameError: global name 'features_sframe' is not defined
The script above was written by the course authors, so I believe there is something I'm doing wrong.
Any help will be highly appreciated.
You are supposed to complete the function get_numpy_data before running it; that's why you are getting an error. Follow the instructions in the original function, which actually are:
def get_numpy_data(data_sframe, features, output):
    data_sframe['constant'] = 1 # this is how you add a constant column to an SFrame
    # add the column 'constant' to the front of the features list so that we can extract it along with the others:
    features = ['constant'] + features # this is how you combine two lists
    # select the columns of data_SFrame given by the features list into the SFrame features_sframe (now including constant):
    # the following line will convert the features_SFrame into a numpy matrix:
    feature_matrix = features_sframe.to_numpy()
    # assign the column of data_sframe associated with the output to the SArray output_sarray
    # the following will convert the SArray into a numpy array by first converting it to a list
    output_array = output_sarray.to_numpy()
    return(feature_matrix, output_array)
The graphlab assignment instructions have you convert from graphlab to pandas and then to numpy. You could just skip the graphlab parts and use pandas directly. (This is explicitly allowed in the homework description.)
First, read in the data files.
import pandas as pd
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}
sales = pd.read_csv('data//kc_house_data.csv', dtype=dtype_dict)
train_data = pd.read_csv('data//kc_house_train_data.csv', dtype=dtype_dict)
test_data = pd.read_csv('data//kc_house_test_data.csv', dtype=dtype_dict)
The convert to numpy function then becomes
def get_numpy_data(df, features, output):
    df['constant'] = 1
    # add the column 'constant' to the front of the features list so that we can extract it along with the others
    features = ['constant'] + features
    # select the columns of data_SFrame given by the features list into the SFrame features_sframe
    features_df = pd.DataFrame(**FILL IN THE BLANK HERE WITH YOUR CODE**)
    # cast the features_df into a numpy matrix
    feature_matrix = features_df.as_matrix()
etc.
The remaining code should be the same (since you only work with the numpy versions for the rest of the assignment).
