How to get the nearest match in a .csv file in Python

I want to get the nearest match in my big .csv file in Python. My (shortened) .csv file is:
0,4,5,0,132,24055,0,64,6,23215,39635,22,21451751,3233419908,8,0,4126,368,15087,0
0,4,5,16,52,22607,0,64,6,24727,22,39635,3233439332,21453192,8,0,26,501,28207,0
1,4,5,0,40,1727,0,128,6,29216,62281,22,123196295,3338477204,5,0,26,513,30738,0
0,4,5,0,116,24108,0,64,6,23178,39635,22,21452647,3233437508,8,0,4126,644,61163,0
0,4,5,0,724,32046,0,64,6,14632,38655,22,1452688218,1828171762,8,0,4126,343,31853,0
0,4,5,0,76,26502,0,128,6,4405,50266,22,1776918274,3172205875,5,0,4126,512,9381,0
1,4,5,0,40,7662,0,64,6,39665,22,62202,3176642698,3972914889,5,0,26,501,63331,0
1,4,5,0,52,939,0,128,6,29992,62206,22,1466629610,0,8,0,44,64240,43460,0
0,4,5,16,76,10076,0,64,6,37199,22,50268,4016221794,718292575,5,0,4126,501,310,0
0,4,5,0,40,26722,0,128,6,4221,50270,22,38340335,3852724687,5,0,26,510,36549,0
0,4,5,0,76,26631,0,128,6,4276,50266,22,1776920362,3172222235,5,0,4126,511,61692,0
0,4,5,16,148,38558,0,64,6,8680,22,37221,2019795091,3598991383,8,0,4126,501,9098,0
0,4,5,0,52,24058,0,64,6,23292,39635,22,21452135,3233420036,8,0,26,368,38558,0
0,4,5,16,76,10249,0,64,6,37026,22,50266,3172221011,1776919966,5,0,4126,501,31557,0
0,4,5,16,212,38490,0,64,6,8684,22,37221,2019776067,3598991175,8,0,4126,501,56063,0
0,4,5,0,60,0,0,64,6,47342,22,44751,2722242689,3606442876,10,0,4426,65160,29042,0
0,4,5,16,76,10234,0,64,6,37041,22,50266,3172220319,1776919498,5,0,4126,501,49854,0
1,4,5,0,1016,1737,0,128,6,28230,62273,22,3387237183,3449598142,5,0,4126,513,49536,0
1,4,5,0,40,20630,0,64,6,26697,22,62288,4040909519,95375909,5,0,26,501,36104,0
0,4,5,16,180,22591,0,64,6,24615,22,39635,3233437764,21452775,8,0,4126,501,28548,0
0,4,5,0,52,31654,0,64,6,15696,47873,22,3476257438,205382502,8,0,26,368,59804,0
1,4,5,0,320,20922,0,64,6,26125,22,62195,2187234888,2519273239,5,0,4126,501,52263,0
0,4,5,0,1132,22526,0,64,6,23744,22,39635,3233417124,21450447,8,0,4126,509,12391,0
1,4,5,0,52,0,0,64,6,47315,22,62282,3209938138,2722777338,8,0,4426,64240,36683,0
0,4,5,0,52,3091,0,64,6,44259,22,38655,1828172842,1452688914,8,0,26,504,7425,0
0,4,5,16,132,10184,0,64,6,37035,22,50266,3172212167,1776918310,5,0,4126,501,44260,0
0,4,5,16,256,10167,0,64,6,36928,22,50266,3172210503,1776918310,5,0,4126,501,19165,0
1,4,5,0,120,2043,0,128,6,28820,62294,22,644393448,2960970388,5,0,4126,512,36939,0
0,4,5,16,196,38575,0,64,6,8615,22,37221,2019796627,3598991543,8,0,4126,501,29587,0
0,4,5,16,148,22599,0,64,6,24639,22,39635,3233438532,21452967,8,0,4126,501,41316,0
1,4,5,0,88,1733,0,128,6,29162,62267,22,872073945,3114048214,5,0,4126,508,23918,0
I have made a program, but it isn't finished and I don't know how to complete it. Do I have to use another program?
with open("<dir>", "r") as file:
    lines = file.readlines()

# The string whose nearest match I want to find in the .csv data.
string = "4,5,0,52,32345,0,64,6,15005,37221,22,3598991799,2019801315,8,0,26,691,17176,0"

list_ = []
for i in range(1, len(lines)):
    list_.append(lines[i][2:])  # strip the leading label column ("0," or "1,")

for item in list_:
    # algorithm: look from left to right along the row and find the row
    # with the most sequential matches to the search string
    pass
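For reference, the left-to-right sequential-match idea described in that comment could be sketched in pure Python like this (a minimal sketch, reusing the placeholder path "<dir>" from the question):
def leading_matches(row, target):
    # count how many comma-separated fields match, left to right, before the first mismatch
    count = 0
    for a, b in zip(row.split(","), target.split(",")):
        if a != b:
            break
        count += 1
    return count

with open("<dir>", "r") as file:
    rows = [line.strip()[2:] for line in file.readlines()[1:]]  # strip label column, as above

string = "4,5,0,52,32345,0,64,6,15005,37221,22,3598991799,2019801315,8,0,26,691,17176,0"
best = max(rows, key=lambda row: leading_matches(row, string))
print(best)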

It seems you are handling a machine learning problem: a dataset and a point whose nearest neighbor you want to find. I assume you want the point of the dataset that has the shortest Euclidean distance (in 19 dimensions) to the given point.
I would use the pandas and scikit-learn packages, with the NearestNeighbors algorithm.
Import the packages:
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd
Load file.csv as a pandas DataFrame (with generic column names):
df = pd.read_csv('file.csv', index_col=False, names=np.arange(20))
Since you want the values of the first column as results, I move it to a pandas Series called first_column and drop it from the df dataframe:
first_column = df[0]
df.drop(columns=[0], inplace=True)
What you called string, I call y, and set it as a NumPy array:
y = np.array([[4,5,0,52,32345,0,64,6,15005,37221,22,3598991799,2019801315,8,0,26,691,17176,0]])
Now let's fit the NearestNeighbors model:
nnb = NearestNeighbors(n_neighbors=1).fit(df)
and now compute which point in the dataset is closest to the given point y:
distances, indices = nnb.kneighbors(y, n_neighbors=1)
print(indices)
[[13]]
So the nearest point has index 13 in the dataframe. Let's print the entry of first_column at index 13:
print(first_column.loc[13])
0
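Putting the steps together, a complete sketch (assuming the data is in file.csv) might look like:
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd

# load the data with generic integer column names
df = pd.read_csv('file.csv', index_col=False, names=np.arange(20))

# separate the first column (the label to report) from the 19 feature columns
first_column = df[0]
features = df.drop(columns=[0])

# the query point: the 19 fields of the search string
y = np.array([[4, 5, 0, 52, 32345, 0, 64, 6, 15005, 37221, 22,
               3598991799, 2019801315, 8, 0, 26, 691, 17176, 0]])

nnb = NearestNeighbors(n_neighbors=1).fit(features)
distances, indices = nnb.kneighbors(y)
print(first_column.loc[indices[0][0]])  # label of the nearest row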

Related

How do I optimize a for loop for faster results in Python

I've written a piece of code to extract data from an HDF5 file and save it into a dataframe that I can export as .csv later. The final data frame effectively has 2.5 million rows and takes a lot of time to execute.
Is there any way I can optimize this code so that it runs faster?
The current runtime is 7.98 minutes!
Ideally I would want to run this program for 48 files like these, so a faster run time matters.
Link to source file: https://drive.google.com/file/d/1g2fpJHZmD5FflfB4s3BlAoiB5sGISKmg/view
import h5py
import numpy as np
import pandas as pd

f = h5py.File('mer.h5', 'r')
ls = list(f.keys())  # names of the root-level objects in the HDF5 file (groups or datasets)

# get the HDF5 dataset; key needs to be a dataset name from above
key = 'DHI'
masterdf = pd.DataFrame()
data = f.get(key)
dataset1 = np.array(data)

X = f.get('X')
X_1 = pd.DataFrame(X)
Y = f.get('Y')
Y_1 = pd.DataFrame(Y)

data_df = pd.DataFrame(index=range(len(Y_1)), columns=range(len(X_1)))
for i in data_df.index:
    data_df.iloc[i] = dataset1[0][i]

final = pd.DataFrame(index=range(1616*1616), columns=['X', 'Y', 'GHI'])
k = 0
for y in range(len(Y_1)):
    for x in range(len(X_1[:-2])):  # X and Y ranges are not the same
        final.loc[k, 'X'] = X_1[0][x]
        final.loc[k, 'Y'] = Y_1[0][y]
        final.loc[k, 'GHI'] = data_df.iloc[y, x]
        k = k + 1
We can optimize loops by vectorizing operations: vectorized code is typically one or two orders of magnitude faster than its pure-Python equivalent, especially in numerical computations. Vectorization is what NumPy gives us; it is a library with efficient data structures designed to hold matrix data.
Could you please try the following (with file.h5 being your file):
import pandas as pd
import h5py

with h5py.File("file.h5", "r") as file:
    df_X = pd.DataFrame(file.get("X")[:-2], columns=["X"])
    df_Y = pd.DataFrame(file.get("Y"), columns=["Y"])
    DHI = file.get("DHI")[0][:, :-2].reshape(-1)

final = df_Y.merge(df_X, how="cross").assign(DHI=DHI)[["X", "Y", "DHI"]]
Some explanations:
First read the data with key X into a dataframe df_X with one column X, except for the last 2 data points.
Then read the full data with key Y into a dataframe df_Y with one column Y.
Then get the data with key DHI and take the first element [0] (there are no more): the result is a NumPy array with 2 dimensions, a matrix. Now remove the last two columns ([:, :-2]) and reshape the matrix into a 1-dimensional array, in the order you are looking for (order="C" is the default). The result is the column DHI of your final dataframe.
Finally take the cross join of df_Y and df_X (y is your outer dimension in the loop) via .merge with how="cross", add the DHI column, and rearrange the columns in the order you want.
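As a quick sanity check that the cross join reproduces the loop order (y as the outer dimension, x innermost), here is a tiny synthetic example:
import pandas as pd

df_Y = pd.DataFrame({"Y": [10, 20]})
df_X = pd.DataFrame({"X": [1, 2, 3]})

# rows come out Y-major: (10,1), (10,2), (10,3), (20,1), ...
# exactly like "for y: for x:" with k incrementing innermost
print(df_Y.merge(df_X, how="cross"))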

Multiplication of values in a dataframe with scalars

I am working on a problem where I want to convert X and Y pixel values to physical coordinates. I have a huge folder containing many csv files; I load each one, pass it to my function, compute the coordinates, overwrite the columns, and return the data frame, which I then write back outside the function. I have the formula that does this correctly, but I am having some problems implementing it in Python.
Each CSV file has many columns. The columns I am interested in are Latitude (degree), Longitude (degree), XPOS and YPOS. The former two are blank, and the latter two hold the data with which I need to fill the former two.
import pandas as pd
import glob

max_long = float(XXXX)
max_lat = float(XXXX)
min_long = float(XXXX)
min_lat = float(XXXX)
hoi = int(909)
woi = int(1070)

def pixel2coor(filepath, max_long, max_lat, min_lat, min_long, hoi, woi):
    data = pd.read_csv(filepath)  # read the csv
    data2 = data.set_index("Log File")  # set the dataframe index to the first column
    data2.loc[data2['Longitude (degree)']] = (((max_long-min_long)/hoi)*[data2[:,'XPOS']]+min_long)  # compute longitude & overwrite
    data2.loc[data2['Latitude (degree)']] = (((max_lat-min_lat)/woi)*[data2[:,'YPOS']]+min_lat)  # compute latitude & overwrite
    return data2  # return the dataframe

filenames = sorted(glob.glob('*.csv'))
for file in filenames:
    # call pixel2coor, passing one csv file per iteration
    df = pixel2coor(file, max_long, max_lat, min_lat, min_long, hoi, woi)
    df.to_csv(file)  # overwrite the file with the dataframe
I am getting the following error:
TypeError: '(slice(None, None, None), 'XPOS')' is an invalid key
It looks to me like your syntax is off. In the following line:
data2.loc[data2['Longitude (degree)']] = (((max_long-min_long)/hoi)*[data2[:,'XPOS']]+min_long) #Computing Longitude & Overwriting
The left side of your equation appears to refer to a column, but you have it in the 'row' section of the .loc slicer. So it should be:
data2.loc[:, 'Longitude (degree)']
On the right side of your equation, you've forgotten .loc, or you need to drop the ':,', so there are two possible solutions:
(((max_long-min_long)/hoi)*data2.loc[:,'XPOS']+min_long)
(((max_long-min_long)/hoi)*data2['XPOS']+min_long)
Also, I would make the brackets on the right side more explicit. It's a bit unclear how you want the scalars to act on the series: do you want to add min_long first, or multiply by (max_long-min_long)/hoi first?
Your final row might look like this, forcing addition first as an example:
data2.loc[:, 'Longitude (degree)'] = ((max_long-min_long)/hoi)*(data2.loc[:,'XPOS']+min_long)
This applies to your next line as well. You may get more errors after you fix this.
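A corrected version of the whole function might look like the sketch below. It keeps the original scale-then-offset formula (multiply the pixel value by the per-pixel step, then add the minimum); whether that precedence, and the pairing of XPOS/hoi with longitude and YPOS/woi with latitude, is what you intend is an assumption carried over from the question:
import glob
import pandas as pd

def pixel2coor(filepath, max_long, max_lat, min_lat, min_long, hoi, woi):
    data2 = pd.read_csv(filepath).set_index("Log File")
    # scale the pixel value by the per-pixel step, then add the offset
    data2['Longitude (degree)'] = (max_long - min_long) / hoi * data2['XPOS'] + min_long
    data2['Latitude (degree)'] = (max_lat - min_lat) / woi * data2['YPOS'] + min_lat
    return data2

for file in sorted(glob.glob('*.csv')):
    pixel2coor(file, max_long, max_lat, min_lat, min_long, hoi, woi).to_csv(file)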

Append model output to pd df rows

I'm trying to put Pyomo model output into pandas.DataFrame rows. I'm accomplishing it now by saving data as a .csv, then reading the .csv file as a DataFrame. I would like to skip the .csv step and put output directly into a DataFrame.
When I solve an optimization problem with Pyomo, the optimal assignments are 1 in the model.x[i] output data (0 otherwise). model.x[i] is indexed by the dict keys in v; model.x is Pyomo-specific syntax.
Pyomo assigns timeItem[i], platItem[i], payItem[i], demItem[i], v[i] for each value that is part of an optimal solution. The 0807results.csv file accurately records the optimal assignments, showing the values of timeItem[i], platItem[i], payItem[i], demItem[i], v[i] for each valid assignment in the optimal solution.
When model.x[i] is 1, how can I get timeItem[i], platItem[i], payItem[i], demItem[i], v[i] directly into a DataFrame? Your assistance is greatly appreciated. My current code is below.
import datetime
from pandas import read_csv
from pyomo.environ import value

index = sorted(v.keys())
with open('0807results.csv', 'w') as f:
    for i in index:
        if value(model.x[i]) > 0:
            f.write("%s,%s,%s,%s,%s\n" % (timeItem[i], platItem[i], payItem[i], demItem[i], v[i]))

now = datetime.datetime.now()
dtg = now.strftime("%Y%m%d_%H%M")
df = read_csv('0807results.csv')
df.columns = ['Time', 'Platform', 'Payload', 'DemandType', 'Value']
df['Payload'] = df['Payload'].astype(str)  # convert payload types to string so they are not summed
df = df.sort_values('Time')
df.to_csv('results' + dtg + '.csv')
# do stats & visualization with the pandas df
I have no idea what is in the timeItem etc. iterables from the code you've posted. However, I suspect that something similar to:
import pandas as pd
results = pd.DataFrame([timeItem, platItem, payItem, demItem, v], index=["time", "plat", "pay", "dem", "v"]).T
will work.
If you want to filter on 1s in model.x, you might add it as a column as well, and do a filter with pandas directly:
import pandas as pd
results = pd.DataFrame([timeItem, platItem, payItem, demItem, v, model.x], index=["time", "plat", "pay", "dem", "v", "x"]).T
filtered_results = results[results["x"]>0]
You can also use the DataFrame.from_records() function:
import pandas as pd

def record_generator():
    for i in sorted(v.keys()):
        if value(model.x[i]) > 1E-6:  # integer tolerance
            yield (timeItem[i], platItem[i], payItem[i], demItem[i], v[i])

df = pd.DataFrame.from_records(
    record_generator(), columns=['Time', 'Platform', 'Payload', 'DemandType', 'Value'])
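From there you can reuse the rest of the original post-processing directly on the dataframe; a sketch, assuming the same steps as in the question:
import datetime

df['Payload'] = df['Payload'].astype(str)  # keep payload types as strings so they are not summed
df = df.sort_values('Time')

dtg = datetime.datetime.now().strftime("%Y%m%d_%H%M")
df.to_csv('results' + dtg + '.csv')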

Gurobi in Python: best way to read csv file

I'm learning how to solve combinatorial optimization problems in Gurobi using Python. I would like to know the best way to read a csv file so the data can be used as model parameters. I'm using genfromtxt to read the csv file, but I'm having difficulties using it for constraint construction (Gurobi doesn't support this type; see the error).
Here are my code and error message; my_data is composed of 4 columns: node index, x coordinate, y coordinate and maximum degree.
from gurobipy import *
from numpy import genfromtxt
import math

# Read data from csv file
my_data = genfromtxt('prob25.csv', delimiter=',')

# Number of vertices
n = len(my_data)

# Euclidean distances between pairs of vertices
dist = {(i, j):
        math.sqrt(sum((my_data[i][k] - my_data[j][k])**2 for k in [1, 2]))
        for i in range(n) for j in range(i)}

# Create a new model
m = Model("dcstNarula")

# Create variables
vars = m.addVars(dist.keys(), obj=dist, vtype=GRB.BINARY, name='e')
for i, j in vars.keys():
    vars[j, i] = vars[i, j]  # edge in opposite direction
m.update()

# Add degree-b constraint
m.addConstrs((vars.sum('*', j) <= my_data[:, 3]
              for i in range(n)), name='degree')
GurobiError: Unsupported type (<type 'numpy.ndarray'>) for LinExpr addition argument
First two lines of data:
1,19.007,35.75,1
2,4.4447,6.0735,2
Actually it was a problem of indexing rather than data type. In the code:
# Add degree-b constraint
m.addConstrs((vars.sum('*', j) <= my_data[:, 3]
              for i in range(n)), name='degree')
it should use vars.sum('*', i) instead of vars.sum('*', j), and my_data[i, 3] instead of my_data[:, 3].
Even though this question is answered, for future visitors who are looking for good ways to read a csv file, pandas must be mentioned:
import pandas as pd
df = pd.read_csv('prob25.csv', header=None, index_col=0, names=['x', 'y', 'idx'])
df
        x        y  idx
1  19.0070  35.7500    1
2   4.4447   6.0735    2
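Building on that, the distance dictionary from the question can be computed from the dataframe like this (a sketch; note that the column named 'idx' above actually holds the maximum-degree values):
import math
import pandas as pd

df = pd.read_csv('prob25.csv', header=None, index_col=0, names=['x', 'y', 'idx'])

# positional access via .iloc keeps the (i, j) pairs aligned with range(n)
n = len(df)
dist = {(i, j): math.hypot(df['x'].iloc[i] - df['x'].iloc[j],
                           df['y'].iloc[i] - df['y'].iloc[j])
        for i in range(n) for j in range(i)}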

Graphlab and numpy issue

I'm currently doing a course on Coursera (Machine Learning) offered by the University of Washington, and I'm facing a small problem with numpy and graphlab.
The course requires a version of graphlab higher than 1.7.
Mine is higher, as you can see below; however, when I run the script below, I get the following error:
[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started.
def get_numpy_data(data_sframe, features, output):
    data_sframe['constant'] = 1
    features = ['constant'] + features  # this is how you combine two lists
    # the following line will convert the features_SFrame into a numpy matrix:
    feature_matrix = features_sframe.to_numpy()
    # assign the column of data_sframe associated with the output to the SArray output_sarray
    # the following will convert the SArray into a numpy array by first converting it to a list
    output_array = output_sarray.to_numpy()
    return (feature_matrix, output_array)

(example_features, example_output) = get_numpy_data(sales, ['sqft_living'], 'price')  # the [] around 'sqft_living' makes it a list
print example_features[0,:]  # this accesses the first row of the data; the ':' indicates 'all columns'
print example_output[0]  # and the corresponding output

----> 8 feature_matrix = features_sframe.to_numpy()
NameError: global name 'features_sframe' is not defined
The script above was written by the course authors, so I believe there is something I'm doing wrong. Any help will be highly appreciated.
You are supposed to complete the function get_numpy_data before running it; that's why you are getting an error. Follow the instructions in the original function, which actually are:
def get_numpy_data(data_sframe, features, output):
    data_sframe['constant'] = 1  # this is how you add a constant column to an SFrame
    # add the column 'constant' to the front of the features list so that we can extract it along with the others:
    features = ['constant'] + features  # this is how you combine two lists
    # select the columns of data_SFrame given by the features list into the SFrame features_sframe (now including constant):
    # the following line will convert the features_SFrame into a numpy matrix:
    feature_matrix = features_sframe.to_numpy()
    # assign the column of data_sframe associated with the output to the SArray output_sarray
    # the following will convert the SArray into a numpy array by first converting it to a list
    output_array = output_sarray.to_numpy()
    return (feature_matrix, output_array)
The graphlab assignment instructions have you convert from graphlab to pandas and then to numpy. You could just skip the graphlab parts and use pandas directly. (This is explicitly allowed in the homework description.)
First, read in the data files.
import pandas as pd
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}
sales = pd.read_csv('data//kc_house_data.csv', dtype=dtype_dict)
train_data = pd.read_csv('data//kc_house_train_data.csv', dtype=dtype_dict)
test_data = pd.read_csv('data//kc_house_test_data.csv', dtype=dtype_dict)
The convert-to-numpy function then becomes:
def get_numpy_data(df, features, output):
    df['constant'] = 1
    # add the column 'constant' to the front of the features list so that we can extract it along with the others
    features = ['constant'] + features
    # select the columns of the dataframe given by the features list
    features_df = pd.DataFrame(**FILL IN THE BLANK HERE WITH YOUR CODE**)
    # cast features_df into a numpy matrix
    feature_matrix = features_df.as_matrix()
    etc.
The remaining code should be the same (since you only work with the numpy versions for the rest of the assignment).
