Python: How to extract strings from a text file to use as data
This is my first time writing a Python script and I'm having some trouble getting started. Let's say I have a txt file named Test.txt that contains this information:
x y z Type of atom
ATOM 1 C1 GLN D 10 26.395 3.904 4.923 C
ATOM 2 O1 GLN D 10 26.431 2.638 5.002 O
ATOM 3 O2 GLN D 10 26.085 4.471 3.796 O
ATOM 4 C2 GLN D 10 26.642 4.743 6.148 C
What I want to do is eventually write a script that will find the center of mass of these atoms. So basically I want to sum up all of the x values in that txt file, with each number multiplied by a weight that depends on the type of atom.
I know I need to define the positions for each x-value, but I'm having trouble figuring out how to represent these x-values as numbers instead of strings. I also have to keep in mind that I'll need to multiply these numbers by a mass that depends on the atom type, so I need a way to keep them associated with each atom type. Can anyone push me in the right direction?
mass_dictionary = {'C': 12.0107,
                   'O': 15.999
                   # Others...?
                   }

# If your files are this structured, you can just
# hardcode some column assumptions.
coords_idxs = [6, 7, 8]
type_idx = 9

# Open file, get lines, close file.
# Probably prudent to add try-except here for bad file names.
f_open = open("Test.txt", 'r')
lines = f_open.readlines()
f_open.close()

# Initialize a list to hold needed intermediate data.
output_coms = []
total_mass = 0.0

# Loop through the lines of the file.
for line in lines:
    # Split the line on whitespace.
    line_stuff = line.split()
    # If the line is empty or fails to start with 'ATOM', skip it.
    if (not line_stuff) or (not line_stuff[0] == 'ATOM'):
        pass
    # Otherwise, append the mass-weighted coordinates to a list and increment the total mass.
    else:
        output_coms.append([mass_dictionary[line_stuff[type_idx]]*float(line_stuff[i]) for i in coords_idxs])
        total_mass = total_mass + mass_dictionary[line_stuff[type_idx]]

# After getting all the data, finish off the averages.
avg_x, avg_y, avg_z = tuple(map(lambda x: (1.0/total_mass)*sum(x),
                                [[elem[i] for elem in output_coms] for i in [0, 1, 2]]))

# A lot of this will be better with NumPy arrays if you'll be using this often or on
# larger files. Python Pandas might be an even better option if you want to just
# store the file data and play with it in Python.
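Following up on that last comment, here is a minimal NumPy sketch of the same center-of-mass computation (a sketch, assuming the Test.txt layout and the mass table from above):

import numpy as np

mass_dictionary = {'C': 12.0107, 'O': 15.999}

masses = []
coords = []
with open("Test.txt") as f:
    for line in f:
        fields = line.split()
        if fields and fields[0] == 'ATOM':
            masses.append(mass_dictionary[fields[9]])
            coords.append([float(fields[i]) for i in (6, 7, 8)])

masses = np.array(masses)   # shape (n_atoms,)
coords = np.array(coords)   # shape (n_atoms, 3)

# Mass-weighted average of each coordinate column gives the center of mass.
com = (masses[:, None] * coords).sum(axis=0) / masses.sum()
print(com)  # [x, y, z]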
Basically, using the open function in Python you can open any file. So you can do something as follows (the following snippet is not a solution to the whole problem, but an approach):
def read_file():
    f = open("filename", 'r')
    for line in f:
        line_list = line.split()
        ....
        ....
    f.close()
From this point on you have a nice setup for what you can do with these values. Basically, the second line just opens the file for reading. The third line defines a for loop that reads the file one line at a time; each line goes into the line variable.
The last line in that snippet breaks the string, at every whitespace, into a list. So line_list[0] will be the value in your first column, and so forth. From this point, if you have any programming experience, you can just use if statements and such to get the logic that you want.
Also keep in mind that the values stored in that list will all be strings, so if you want to perform any arithmetic operations, such as adding, you have to convert them first.
(Edited for syntax correction.)
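To illustrate the point about string values, a quick sketch of converting the split fields to numbers, using one data line from the Test.txt sample above (column indices as in the first answer):

line = "ATOM 1 C1 GLN D 10 26.395 3.904 4.923 C"
line_list = line.split()
x = float(line_list[6])    # the string '26.395' becomes the float 26.395
atom_type = line_list[9]   # 'C' stays a string, usable as a dictionary key
print(x + 1.0)             # arithmetic now works: 27.395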
If you have pandas installed, check out the read_fwf function, which imports a fixed-width file and creates a DataFrame (a 2-d tabular data structure). It'll save you lines of code on import and also give you a lot of data munging functionality if you want to do any additional data manipulations.
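A minimal sketch of that approach (the column names are assumptions, the inferred column widths may need adjusting for real files, and the file is assumed to contain only ATOM records):

import pandas as pd

df = pd.read_fwf("Test.txt", header=None,
                 names=["record", "serial", "name", "res", "chain",
                        "res_seq", "x", "y", "z", "element"])

masses = {'C': 12.0107, 'O': 15.999}
w = df["element"].map(masses)                         # per-atom mass weights
com = df[["x", "y", "z"]].mul(w, axis=0).sum() / w.sum()
print(com)                                            # mass-weighted center as a Series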
Related
String index out of range when reading data
For this problem, I have a separate txt file which contains the list of values below:

Years+1900 Populationx106
0 1650
10 1750
20 1860
30 2070
40 2300
50 2560
60 3040
70 3710
80 4450
90 5280
100 6080
110 6870

For the problem I'm working on, I'm supposed to obtain that file and path name to then use for calculations with some functions I created. I have finished the functions I need, however I'm having an issue running them, because I believe the function reads the "Years+1900 Populationx106" part first instead of the numbers below it. Here's the code for my functions:

# Input: year
# Output: estimate of population for that year
def pop(year):
    return 1436.53*((1.01395)**year)

# Input: data
# Return: the average error as per equation 18.
def error(data):
    error = 0
    for i in data:
        error += (abs(i[1]-pop(i[0]))/i[1])
    return 100*error/12

Here is the code I created to retrieve the data from my separate txt file:

def get_data(path, name):
    with open("Assignment7/pop.txt", "r") as path:
        path = open("Assignment7/pop.txt", "r")
        name = path.read()
    return name

The error I'm receiving is for the part below. It is an IndexError saying the string index is out of range. I believe this is because it is reading the first part of the data in pop.txt. How can I remove the first line in pop.txt so that it only reads the numerical values?

error += (abs(i[1]-pop(i[0]))/i[1])

I have already tried changing the index values, however it still says that my string index is out of range.
Let's assume you are correct and passing the first line of your text file to your function is breaking it. You can "throw away" the first line of the text file by reading it as a single line (but doing nothing with it) and then reading the data you actually want, like this:

def get_data(path, name):
    with open("Assignment7/pop.txt", "r") as path:
        header = path.readline()  # Read the header line, but don't use it
        name = path.read()        # Read subsequent lines as the data you want
    return name
I suspect that you've simply read the entire file as one string, so each element, i, is a single character and has no dimensionality. You'll need to split the file contents on the newline character to break it into lines (and likely split again to get the two separate columns). Python's string split() will be useful for that. You're correct that the first line will pose issues, but it can be removed with a path.readline() call, as Richard said.
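Putting both suggestions together, a minimal sketch that parses the file into (year, population) number pairs of the kind the error() function above can index (the path is taken from the question; the two-argument signature is dropped for simplicity):

def get_data():
    data = []
    with open("Assignment7/pop.txt", "r") as f:
        f.readline()               # skip the "Years+1900 Populationx106" header
        for line in f:
            fields = line.split()  # e.g. ['0', '1650']
            if len(fields) == 2:
                data.append((float(fields[0]), float(fields[1])))
    return data

# error(get_data()) now receives pairs indexable as i[0] and i[1]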
Using python to read and replace specific text after mathematical operation
I have a text file in the following format: [screenshot of the file omitted]. What I am interested in is to have Python go through the Atoms chunk of the file (lines 34 to 24033), isolate the 3rd column (which corresponds to the atom type), and perform the following operations:

1. Isolate only atom types 2 and 4
2. Randomly select a user-defined percentage from that list of atoms (e.g. randomly select 40% of type 2 and 4 atoms)
3. Add 3 to those selected atoms (2->5, 4->7)
4. Write the new file with the updated atom types

Unfortunately, I don't have a lot of knowledge of reading/writing text files in Python, and so far the only pseudo-solution I was able to come up with is the following:

import numpy as np
import random

data = np.loadtxt('5mer_SA.data', skiprows=33, max_rows=(24032-32))
atomType = data[:,2]

reactive = []
for i in range(len(atomType)):
    if atomType[i] == 2 or atomType[i] == 4:
        x = atomType[i]
        reactive.append(x)
reactive_array = np.array(reactive)

p = 0.5  # user-defined percentage
newType = np.zeros(int(len(reactive_array)*p))
for j in range(len(newType)):
    newType[j] = random.choice(reactive_array)+3

What I am struggling with is putting the modified atom types back into the text file in their original positions. Any suggestions will be greatly appreciated. My ideal goal would be a function that takes the user-defined percentage as input and automatically performs all the mentioned operations, generating a new text file as output.
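No answer is included here, but a minimal sketch of one way to do the read-modify-write (the file name, line range, and whitespace-separated layout with the type in the 3rd column are all assumptions taken from the question; note that rejoining with single spaces will change the original column spacing):

import random

def retype_atoms(infile, outfile, percentage, start=33, end=24033):
    # Hypothetical helper based on the question's description.
    with open(infile) as f:
        lines = f.readlines()

    # Indices of Atoms-section lines whose 3rd column is type 2 or 4.
    candidates = [i for i in range(start, min(end, len(lines)))
                  if len(lines[i].split()) > 2 and lines[i].split()[2] in ('2', '4')]

    # Pick the user-defined fraction of them at random and add 3 to the type.
    for i in random.sample(candidates, int(len(candidates) * percentage)):
        fields = lines[i].split()
        fields[2] = str(int(fields[2]) + 3)  # 2 -> 5, 4 -> 7
        lines[i] = ' '.join(fields) + '\n'

    with open(outfile, 'w') as f:
        f.writelines(lines)

# e.g. retype_atoms('5mer_SA.data', '5mer_SA_new.data', 0.4)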
How to find matches between csv files based on two columns within a range
I'm currently struggling to put together some code that will find the matches of values in two different columns in two csv files within a range. I have tried using the code below, but it doesn't output what I am trying to accomplish. Basically, I want to output a new file that contains all of the lines in the second file that have matches to the same columns in the first file, not merge them together. I've added more detailed clarification below my code. I feel like what I've done so far is probably completely wrong. What do I need to change in order for my code to produce the results I am looking for?

import csv

with open('F435W.csv') as csvF435:
    readCSV1 = csv.reader(csvF435, delimiter=',')

    with open("F550Mnew.csv", "w") as new_F550M:
        pass

    with open("F550Mnew.csv", "a") as new_F550M:
        for header in readCSV1:
            new_F550M.write(','.join(header)+'\n')
            break

        for l435 in readCSV1:
            with open('F550M.csv') as csvF550:
                readCSV2 = csv.reader(csvF550, delimiter=',')
                for l550 in readCSV2:
                    if isfloat(l435[12]) and isfloat(l550[12]) and abs(float(l435[12])-float(l550[12])) < 0.002778:
                        if isfloat(l435[13]) and isfloat(l550[13]) and abs(float(l435[13])-float(l550[13])) < 0.002778:
                            new_F550M.write(','.join(l550)+'\n')

For clarification, each file has an X column and a Y column, so basically each row corresponds to an (X,Y) point. In addition, there are 21 other columns of data that are not necessary for finding matches, but need to be included in the final output file. I am trying to find points in the second file that match the points in the first file within a radius. This is because I know that none of my points will be exact matches. In my data, X is column 13 and Y is column 14. The way I have tried to accomplish this is by finding the differences between every X in the first file and every X in the second file (e.g. X1-X2), and the differences between every Y in the first file and every Y in the second file (e.g. Y1-Y2). Then, every row in the second file whose differences for both X and Y are less than my radius value (0.0002778) would be considered a match to the first file. Unfortunately, my code produces a file with over 300,000 points when my original files only have 7000 points. There should be less data, not more. It also includes many repeats of data, when there should not be any repeats at all. Thank you for your time!

Sample of what the data looks like (I apologize for the length, but I am afraid the samples will not contain enough matches to be useful if I don't include enough of the data):
F435W.csv (file 1) 1,2017.013,0.01242859,-8.2618,0,51434.12,0.3269918,-11.7781,0,0.01957931,1387.9406,541.916,49.9898514,41.5266996,8.81E+01,1.63E+03,1.44E+02,40.535,8.65,84.72,0.00061,0.00035,62.14 2,84.73392,0.01245409,-4.8201,0.0002,112.9723,0.04012135,-5.1324,0.0004,-0.002142646,150.306,146.7986,49.9942613,41.5444392,4.92E+00,5.60E+00,-2.02E-01,2.379,2.206,-74.69,0.00339,0.0029,88.88 3,215.1939,0.01242859,-5.8321,0.0001,262.2751,0.03840466,-6.0469,0.0002,-0.002961465,3248.686,52.8478,50.003155,41.5019044,4.77E+00,5.05E+00,-1.63E-01,2.263,2.166,-65.29,0.002,0.0019,-66.78 4,0.3796681,0.01240305,1.0515,0.0355,0.5823653,0.05487975,0.587,0.1023,-0.00425157,3760.344,11.113,50.0051049,41.4949256,1.93E+00,1.02E+00,-7.42E-02,1.393,1.007,-4.61,0.05461,0.03818,-6.68 5,0.9584663,0.01249223,0.0461,0.0142,1.043696,0.0175857,-0.0464,0.0183,-0.004156116,4013.2063,9.1225,50.0057256,41.4914444,1.12E+00,9.75E-01,1.09E-01,1.085,0.957,28.34,0.01934,0.01745,44.01 6,2.379565,0.01249223,-0.9412,0.0057,0.231205,0.02710035,1.59,0.1273,-0.004135321,3824.3706,9.0756,50.0052903,41.4940468,7.81E-01,6.99E-02,4.27E-02,0.885,0.26,3.42,0.01265,0.00622,15.52 7,0.3171223,0.01250492,1.2469,0.0428,0.5233852,0.05406558,0.7029,0.1122,-0.00399635,4097.3604,7.0301,50.0059585,41.4902884,9.61E-01,1.63E+00,-3.94E-01,1.346,0.883,-65.16,0.06171,0.04005,-65.05 8,0.289245,0.0125176,1.3468,0.047,0.2744479,0.02238134,1.4039,0.0886,-0.004173243,3904.7402,7.3912,50.0055069,41.4929422,7.90E-01,2.38E-01,7.13E-02,0.894,0.479,7.24,0.04501,0.02071,8.29 9,0.3543034,0.01247953,1.1266,0.0383,0.7666836,0.06376094,0.2885,0.0903,-0.004009248,4107.0684,3.259,50.0060503,41.4901611,3.53E+00,1.28E+00,-4.60E-01,1.903,1.09,-11.12,0.06873,0.03955,-11.22 10,1.308331,0.01250492,-0.2918,0.0104,-0.005209296,0.004877397,99,99,-0.004193406,3933.9834,6,50.0056001,41.4925416,5.78E-01,8.33E-02,0.00E+00,0.76,0.289,0,0.01272,0.00424,0 11,3.995717,0.01250492,-1.504,0.0034,0.1589517,0.007450347,1.9968,0.0509,-0.003990021,4069.0469,3.0234,50.0059668,41.4906855,8.03E-01,2.29E-02,1.02E-02,0.896,0.151,0.75,0.00888,0.00361,5.59 12,1.067634,0.01250492,-0.0711,0.0127,0.1260926,0.02787585,2.2483,0.2401,-0.004042602,4048.9148,4,50.0059023,41.4909612,7.40E-01,8.33E-02,0.00E+00,0.86,0.289,0,0.02449,0.00576,0 13,0.2808423,0.01162418,1.3788,0.0449,0.4633991,0.02235104,0.8351,0.0524,-0.004015559,4114.6655,2.0641,50.0060898,41.4900585,9.65E-01,5.88E-01,-9.47E-02,0.994,0.752,-13.34,0.05405,0.03814,-15.13 14,1.067291,0.01245409,-0.0707,0.0127,1.081617,0.01516444,-0.0852,0.0152,-0.004168633,3960.8787,18.0524,50.0054405,41.4921501,6.84E-01,8.29E-01,-6.18E-02,0.923,0.813,-69.77,0.01468,0.01229,-78.83 15,0.5216251,0.0125176,0.7066,0.0261,0.584776,0.01824955,0.5825,0.0339,-0.003026338,2661.6533,58.4563,50.0016952,41.5099844,8.51E-01,1.17E+00,-7.27E-02,1.089,0.914,-77.72,0.03244,0.02498,-81.68 16,0.6062042,0.01249223,0.5435,0.0224,0.8726375,0.05509822,0.1479,0.0686,-0.003950399,4149.8169,31.0127,50.0056384,41.489524,9.30E-01,3.48E+00,2.03E-01,1.87,0.956,85.48,0.05307,0.0241,86.01 17,0.1324067,0.01242859,2.1952,0.1019,0.1208224,0.01290438,2.2946,0.116,-0.004166729,3911.6807,12.661,50.005426,41.4928374,2.17E-01,2.24E-01,-1.08E-01,0.574,0.335,-45.89,0.0721,0.04162,-44.98 18,0.2136006,0.01247953,1.676,0.0634,0.3511444,0.02471001,1.1363,0.0764,-0.003978713,4096.9111,15.6285,50.0057993,41.4902797,1.00E+00,4.37E-01,2.85E-01,1.058,0.564,22.64,0.07548,0.03957,23.17 
19,0.1470979,0.01244135,2.081,0.0919,0.1216703,0.0168958,2.287,0.1508,-0.004147241,3695.311,13.7044,50.004907,41.4958173,2.14E-01,2.08E-01,9.20E-02,0.551,0.345,44.05,0.07073,0.04115,45.12 20,0.5434682,0.01250492,0.6621,0.025,0.5819249,0.01592951,0.5878,0.0297,-0.004136056,3866.6416,24.8316,50.0050981,41.493437,8.34E-01,9.96E-01,2.74E-01,1.096,0.793,53.22,0.02966,0.02055,58.08 21,0.2259093,0.01249223,1.6152,0.0601,0.2848583,0.01867901,1.3634,0.0712,-0.00409535,3645.521,20.0162,50.0046759,41.4964926,5.71E-01,4.26E-01,-1.11E-02,0.756,0.652,-4.34,0.03735,0.0305,0.08 22,0.9499883,0.01247953,0.0557,0.0143,0.9711754,0.01891141,0.0318,0.0211,-0.003134006,3378.7927,19.5305,50.0040686,41.5001691,8.66E-01,4.09E-01,3.57E-03,0.931,0.639,0.45,0.01623,0.01142,-1.19 23,1.125635,0.01240305,-0.1285,0.012,1.050538,0.02402694,-0.0535,0.0248,-0.003295973,3132.9458,24.9024,50.0034018,41.5035477,9.65E-01,7.83E-01,-1.44E-01,1.022,0.839,-28.88,0.01702,0.01288,-21 24,0.168302,0.01249223,1.9348,0.0806,0.2447732,0.01930529,1.5281,0.0857,-0.004140488,3904.7268,27.0386,50.0051454,41.4929084,4.47E-01,4.56E-01,-1.28E-02,0.682,0.662,-54.61,0.04399,0.04068,89.66 25,0.0542859,0.01244135,3.1633,0.2489,0.08799078,0.007964755,2.6389,0.0983,-0.003241792,3454.2612,25.2749,50.0041373,41.4991191,1.93E-01,1.99E-01,-7.18E-02,0.518,0.353,-46.27,0.06408,0.03839,-44.76 26,0.4379335,0.01242859,0.8965,0.0308,0.4661828,0.01542368,0.8286,0.0359,-0.00336337,3478.7058,32.3355,50.0040639,41.4987701,6.15E-01,8.96E-01,-2.91E-02,0.948,0.782,-84.15,0.02891,0.02521,-70.04 27,0.1515608,0.01249223,2.0485,0.0895,0.1935181,0.01712885,1.7832,0.0961,-0.002904789,2982.0017,29.9904,50.0029594,41.505619,3.46E-01,3.61E-01,1.55E-05,0.601,0.588,89.94,0.05241,0.05241,-80.48 28,0.6658883,0.01250492,0.4415,0.0204,0.718064,0.01780974,0.3596,0.0269,-0.00324104,3408.0103,36.2539,50.0038284,41.4997375,9.45E-01,1.11E+00,1.98E-01,1.115,0.902,56.45,0.02706,0.02147,51.52 29,0.7244126,0.01244135,0.35,0.0187,1.030102,0.02744665,-0.0322,0.0289,-0.00280412,3259.0889,37.3165,50.0034648,41.5017879,8.65E-01,1.01E+00,5.85E-02,1.017,0.919,70.87,0.02225,0.02011,55.79 30,0.1651701,0.01247953,1.9552,0.0821,0.163293,0.01641976,1.9676,0.1092,-0.003909466,3595.4846,31.9761,50.0043403,41.4971614,2.50E-01,4.42E-01,2.21E-01,0.766,0.324,56.75,0.08087,0.03087,58.28 F550M.csv (file 2) 2,1921.566,0.01258874,-8.2091,0,37128.06,0.2618096,-11.4243,0,0.01455503,4617.5225,554.576,49.9887896,41.5264699,6.09E+01,8.09E+02,1.78E+01,28.459,7.779,88.63,0.00054,0.00036,77.04 3,1.055918,0.01256313,-0.0591,0.0129,9.834856,0.1109255,-2.4819,0.0122,-0.002955142,3936.4946,85.3255,49.9949149,41.5370016,3.98E+01,1.23E+01,1.54E+01,6.83,2.336,24.13,0.06362,0.01965,23.98 4,151.2355,0.01260153,-5.4491,0.0001,184.0693,0.03634057,-5.6625,0.0002,-0.002626019,3409.2642,76.9891,49.9931935,41.5442109,4.02E+00,4.35E+00,-1.47E-03,2.086,2.005,-89.75,0.00227,0.00198,66.61 5,0.3506025,0.01258874,1.138,0.039,0.3466277,0.01300407,1.1503,0.0407,-0.002441164,3351.9893,8.9147,49.9942299,41.5451727,4.97E-01,5.07E-01,7.21E-03,0.715,0.702,62.75,0.02,0.01989,82.88 6,1.166133,0.01257594,-0.1669,0.0117,0.005819145,0.009692424,5.5879,1.8089,-0.003201006,3476.9932,10,49.9946543,41.5434658,5.88E-01,8.33E-02,0.00E+00,0.767,0.289,0,0.01497,0.00499,0 7,0.1372164,0.0125503,2.1565,0.0993,0.1238123,0.02608246,2.2681,0.2288,-0.003556473,3535.5281,13.4586,49.9947993,41.5426587,2.49E-01,2.48E-01,-7.69E-03,0.506,0.491,-43.27,0.05264,0.05237,-55.87 
8,0.6174777,0.01260153,0.5234,0.0222,0.6206718,0.01300407,0.5178,0.0228,-0.002441164,3357.0044,20.0487,49.9940449,41.5450748,5.10E-01,5.22E-01,-6.28E-03,0.724,0.712,-66.7,0.01194,0.01192,84.82 9,1.46848,0.01260153,-0.4172,0.0093,0.001897994,0.009688255,6.8043,5.5435,-0.003612399,3584.0171,16,49.9949252,41.5419909,5.87E-01,8.33E-02,0.00E+00,0.766,0.289,0,0.01175,0.00392,0 10,1.452348,0.01258874,-0.4052,0.0094,3.124427,0.04807406,-1.2369,0.0167,-0.003148756,3805.6069,39.5791,49.9952831,41.5389075,2.25E+00,3.87E+00,-6.77E-01,2.03,1.416,-70.08,0.0302,0.01891,-67.61 11,0.1548658,0.01260153,2.0251,0.0884,0.1777253,0.01630147,1.8756,0.0996,-0.002919044,3459.7681,25.6248,49.9943085,41.5436591,4.64E-01,2.34E-01,8.40E-02,0.701,0.455,18.09,0.05739,0.03321,18.33 12,0.5046132,0.01253746,0.7426,0.027,0.7798272,0.04462456,0.27,0.0621,-0.00261193,3418.9119,65.5326,49.9934365,41.5441099,6.87E-01,2.77E+00,-2.92E-01,1.678,0.804,-82.19,0.05363,0.02182,-83.28 13,0.380733,0.01260153,1.0484,0.0359,0.4313257,0.01605258,0.913,0.0404,-0.003497544,3548.8484,34.5602,49.9944623,41.542421,8.27E-01,8.51E-01,8.92E-02,0.964,0.865,48.75,0.03776,0.03252,30.61 14,0.1643925,0.01258874,1.9603,0.0832,0.2181225,0.01839054,1.6532,0.0916,-0.003121084,3710.6785,33.3215,49.9950598,41.5402182,2.18E-01,2.18E-01,1.03E-01,0.567,0.339,45,0.0757,0.04376,45 15,0.3959635,0.01260153,1.0059,0.0346,0.9984215,0.0763398,0.0017,0.083,-0.003106286,3805.9988,48.3363,49.995125,41.5388789,1.87E+00,3.12E+00,4.86E-01,1.813,1.304,71.09,0.0559,0.04105,67.61 16,0.1625628,0.01260153,1.9724,0.0842,0.3490304,0.02234424,1.1428,0.0695,-0.002472953,3410.77,38.0388,49.9939083,41.544294,1.77E-01,4.75E-01,8.92E-03,0.689,0.421,88.29,0.0769,0.04707,89.86 17,0.1725209,0.01260153,1.9079,0.0793,0.2965718,0.02357189,1.3197,0.0863,-0.003454017,3629.0247,40.9706,49.9946304,41.541311,3.73E-01,7.91E-01,-3.73E-01,1.004,0.393,-59.65,0.09781,0.03734,-58.27 18,0.3034717,0.01260153,1.2947,0.0451,0.5031242,0.02774418,0.7458,0.0599,-0.003073985,4079.0825,42,49.9962105,41.5351731,6.68E-01,8.33E-02,0.00E+00,0.818,0.289,0,0.06348,0.02106,0 19,1.593927,0.01260153,-0.5062,0.0086,1.860803,0.0219809,-0.6743,0.0128,-0.003038161,4065.9434,58.3703,49.9958657,41.5353087,1.75E+00,1.41E+00,-7.15E-03,1.323,1.188,-1.21,0.01697,0.01464,-0.43 20,0.5464995,0.01258874,0.656,0.025,0.5661472,0.0144696,0.6177,0.0278,-0.003053429,4045.0474,54.439,49.9958631,41.535604,5.43E-01,8.46E-01,-1.22E-03,0.92,0.737,-89.77,0.02257,0.01649,-89.72 21,1.303251,0.01253746,-0.2876,0.0104,1.296672,0.01418861,-0.2821,0.0119,-0.00259741,4240.1406,55.2714,49.9965409,41.5329423,6.05E-01,6.81E-01,7.89E-03,0.826,0.777,84.15,0.00892,0.00852,69.62 22,0.5174786,0.01260153,0.7153,0.0264,0.5260691,0.01390194,0.6974,0.0287,-0.003019847,3828.95,55.19,49.9950817,41.5385478,5.18E-01,7.56E-01,-6.34E-02,0.879,0.709,-75.96,0.0236,0.01643,-75.02 23,0.1551826,0.01260153,2.0229,0.0882,0.166565,0.01726119,1.946,0.1125,-0.003271136,3504.7439,52.7386,49.9939745,41.5429739,1.91E-01,6.86E-01,1.89E-01,0.866,0.356,71.33,0.10376,0.04235,71.56 24,0.2214222,0.01260153,1.6369,0.0618,0.2389908,0.01360924,1.554,0.0618,-0.00285033,3750.3167,54.0027,49.994824,41.5396229,4.32E-01,5.51E-01,1.68E-03,0.742,0.657,89.18,0.04862,0.04505,89.94 25,0.1336059,0.01253746,2.1854,0.1019,0.1320868,0.009830156,2.1979,0.0808,-0.002921393,3459.6851,51.7091,49.9938331,41.5435908,2.16E-01,2.06E-01,-9.16E-02,0.55,0.345,-43.52,0.06231,0.03626,-45.19 
26,0.1703959,0.01260153,1.9214,0.0803,0.1577456,0.0152816,2.0051,0.1052,-0.002779523,3446.95,49,49.9938372,41.5437717,7.29E-01,8.33E-02,0.00E+00,0.854,0.289,0,0.11183,0.03721,0 27,1.896325,0.01258874,-0.6948,0.0072,1.941203,0.0152816,-0.7202,0.0085,-0.00306097,3809.6836,57.8143,49.9949655,41.5388035,7.38E-01,6.80E-01,7.46E-03,0.86,0.824,7.18,0.00713,0.00678,59.71 28,0.6522877,0.01260153,0.4639,0.021,0.1713469,0.01312423,1.9153,0.0832,-0.002447558,4271.9614,52,49.9967135,41.5325172,5.92E-01,8.33E-02,0.00E+00,0.77,0.289,0,0.0274,0.00913,0 29,0.1370073,0.0125503,2.1581,0.0995,0.101415,0.02614047,2.4847,0.2799,-0.002207851,4324.667,55.3374,49.99684,41.5317898,2.22E-01,2.24E-01,1.12E-01,0.579,0.332,45.18,0.07753,0.04476,45 30,0.2240251,0.01253746,1.6243,0.0608,0.2254432,0.01360924,1.6174,0.0656,-0.003037372,3960.3042,58.9024,49.9954807,41.5367473,4.18E-01,4.81E-01,-1.07E-02,0.695,0.645,-80.65,0.03802,0.03492,-88.86
You are complicating the program by nesting all the loops and conditionals. Break it down into simple steps:

1. Read both csv files and convert them into 2d lists.
2. Compare the columns/values of the lists within a loop based on the given index, and add the matching rows from the second list to a new output list.
3. Write the output list to a csv file.

import csv

def read_file(filepath):
    with open(filepath, 'r') as f:
        x = csv.reader(f)
        l = list(x)
    return l

l435 = read_file('F435W.csv')
l550 = read_file('F550M.csv')

new_F550M = []
r = 0.002778

for i in l550:
    for j in l435:
        # I didn't exactly get your if condition, so I am putting it down based on
        # what I understood; if it is wrong, modify it accordingly.
        if isfloat(i[12]) and isfloat(j[12]) and abs(float(i[12]) - float(j[12])) < r:
            if isfloat(i[13]) and isfloat(j[13]) and abs(float(i[13]) - float(j[13])) < r:
                new_F550M.append(i)

with open('new_F550M.csv', 'w') as f:
    out = csv.writer(f)
    out.writerows(new_F550M)
Performing fuzzy differences between 2 files
When doing some retrocomputing stuff, I sometimes have to compare 2 MC68000 disassembled executables of the same game. Games are published using different languages (english, french...) or have slight modifications / revisions. The code is roughly the same, but the global labels are shifted because of previous code changes (or data which is wrongly interpreted as branches, generating more or fewer fake labels depending on the data), so I can have for the first file:

LAB_0012:
    MOVE #0,D0
    MOVE #2,D2
LAB_0013:
    RTS

and for the second file:

LAB_0015:
    MOVE #0,D0
    SUB #3,D1
    MOVE #2,D2
LAB_0016:
    RTS

If I perform a diff on both files, the labels scramble/pollute the required result, which I'd like to be SUB #3,D1 added in file 2. So I perform a pre-processing pass using a regex to replace all labels with LAB_XXXX, like this:

def readlines(filepath):
    with open(filepath) as f:
        lines = list(f)
    return [x.rstrip() for x in lines], [r.sub("LAB_XXXX", l).partition(";")[0] for l in lines]

and use difflib to print the diffs. It kind of works, but of course it doesn't map back to the original label values. So I keep the original data and parse difflib's output to try to print the original data instead, but that's lame and doesn't work very well:

lines1, filtered_lines1 = readlines(file1)
lines2, filtered_lines2 = readlines(file2)

for line in difflib.unified_diff(filtered_lines1, filtered_lines2, fromfile=file1, tofile=file2, lineterm=''):
    m = re.match(r"##..(\d+),(\d+).*(\d+),(\d+)", line)
    if m:
        start, end, start2, end2 = [int(x) for x in m.groups()]
        print(line)
        for i in range(start, start+end):
            print("{} <=> {}".format(lines1[i], lines2[i-start2+start]))

I've checked this answer Fuzzy file diff but that doesn't cut it for me: pre-processing both files is already what I'm doing. I'd like to instruct difflib (or any other diff tool) to ignore this LAB_.... regex when comparing (a bit like you can compare data ignoring blanks, or case-insensitively), so the original file content is printed (either side would do) when showing the diffs. For my above example I'd like:

LAB_0015:
    MOVE #0,D0
##added:235,1##   <== this is just an example: 1 line added at line 235
> SUB #3,D1
    MOVE #2,D2
LAB_0016:
    RTS

I'd prefer to keep it within Python, but if I have to perform system calls to external commands, that's okay too.
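Since no answer appears here, one way to get that behaviour, sketched under the question's own setup (readlines and the label normalization are taken from above; the report format is an assumption): run difflib.SequenceMatcher on the normalized lines, but print the untouched originals. The opcode indices are valid for both lists, because normalization keeps line counts identical.

import difflib

def diff_original(lines1, filtered1, lines2, filtered2):
    # Compare the label-normalized text, but report using the original lines.
    sm = difflib.SequenceMatcher(None, filtered1, filtered2, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == 'equal':
            continue
        print("##{}: file1 {},{} file2 {},{}##".format(tag, i1, i2 - i1, j1, j2 - j1))
        for line in lines1[i1:i2]:
            print("< " + line)
        for line in lines2[j1:j2]:
            print("> " + line)

lines1, filtered1 = readlines(file1)
lines2, filtered2 = readlines(file2)
diff_original(lines1, filtered1, lines2, filtered2)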
Fast extraction of chunks of lines from large CSV file
I have a large CSV file full of stock-related data formatted as such:

Ticker Symbol, Date, [some variables...]

So each line starts off with the symbol (like "AMZN"), then has the date, then has 12 variables related to price or volume on the selected date. There are about 10,000 different securities represented in this file, and I have a line for each day that the stock has been publicly traded, for each of them. The file is ordered first alphabetically by ticker symbol and second chronologically by date. The entire file is about 3.3 GB.

The sort of task I want to solve would be to extract the most recent n lines of data for a given ticker symbol with respect to the current date. I have code that does this, but based on my observations it seems to take, on average, around 8-10 seconds per retrieval (all tests have been extracting 100 lines). I have functions I'd like to run that require me to grab such chunks for hundreds or thousands of symbols, and I would really like to reduce the time. My code is inefficient, but I am not sure how to make it run faster.

First, I have a function called getData:

def getData(symbol, filename):
    out = ["Symbol","Date","Open","High","Low","Close","Volume","Dividend",
           "Split","Adj_Open","Adj_High","Adj_Low","Adj_Close","Adj_Volume"]
    l = len(symbol)
    beforeMatch = True
    with open(filename, 'r') as f:
        for line in f:
            match = checkMatch(symbol, l, line)
            if beforeMatch and match:
                beforeMatch = False
                out.append(formatLineData(line[:-1].split(",")))
            elif not beforeMatch and match:
                out.append(formatLineData(line[:-1].split(",")))
            elif not beforeMatch and not match:
                break
    return out

(This code has a couple of helper functions, checkMatch and formatLineData, which I will show below.)

Then, there is another function called getDataColumn that gets the column I want with the correct number of days represented:

def getDataColumn(symbol, col=12, numDays=100, changeRateTransform=False):
    dataset = getData(symbol)
    if not changeRateTransform:
        column = [day[col] for day in dataset[-numDays:]]
    else:
        n = len(dataset)
        column = [(dataset[i][col] - dataset[i-1][col])/dataset[i-1][col] for i in range(n - numDays, n)]
    return column

(changeRateTransform converts raw numbers into daily change rate numbers if True.)

The helper functions:

def checkMatch(symbol, symbolLength, line):
    out = False
    if line[:symbolLength+1] == symbol + ",":
        out = True
    return out

def formatLineData(lineData):
    out = [lineData[0]]
    out.append(datetime.strptime(lineData[1], '%Y-%m-%d').date())
    out += [float(d) for d in lineData[2:6]]
    out += [int(float(d)) for d in lineData[6:9]]
    out += [float(d) for d in lineData[9:13]]
    out.append(int(float(lineData[13])))
    return out

Does anyone have any insight on which parts of my code run slow and how I can make this perform better? I can't do the sort of analysis I want to do without speeding this up.
EDIT: In response to the comments, I made some changes to the code in order to utilize the existing methods in the csv module:

def getData(symbol, database):
    out = ["Symbol","Date","Open","High","Low","Close","Volume","Dividend",
           "Split","Adj_Open","Adj_High","Adj_Low","Adj_Close","Adj_Volume"]
    l = len(symbol)
    beforeMatch = True
    with open(database, 'r') as f:
        databaseReader = csv.reader(f, delimiter=",")
        for row in databaseReader:
            match = (row[0] == symbol)
            if beforeMatch and match:
                beforeMatch = False
                out.append(formatLineData(row))
            elif not beforeMatch and match:
                out.append(formatLineData(row))
            elif not beforeMatch and not match:
                break
    return out

def getDataColumn(dataset, col=12, numDays=100, changeRateTransform=False):
    if not changeRateTransform:
        out = [day[col] for day in dataset[-numDays:]]
    else:
        n = len(dataset)
        out = [(dataset[i][col] - dataset[i-1][col])/dataset[i-1][col] for i in range(n - numDays, n)]
    return out

Performance was worse using the csv.reader class. I tested on two stocks, AMZN (near the top of the file) and ZNGA (near the bottom of the file). With the original method, the run times were 0.99 seconds and 18.37 seconds, respectively. With the new method leveraging the csv module, the run times were 3.04 seconds and 64.94 seconds, respectively. Both return the correct results. My thought is that the time is being taken up more by finding the stock than by the parsing. If I try these methods on the first stock in the file, A, the methods both run in about 0.12 seconds.
When you're going to do lots of analysis on the same dataset, the pragmatic approach would be to read it all into a database. It is made for fast querying; CSV isn't. Use the sqlite command line tools, for example, which can directly import from CSV. Then add a single index on (Symbol, Date) and lookups will be practically instantaneous.

If for some reason that is not feasible, for example because new files can come in at any moment and you cannot afford the preparation time before starting your analysis of them, you'll have to make the best of dealing with CSV directly, which is what the rest of my answer will focus on. Remember that it's a balancing act, though. Either you pay a lot upfront, or a bit extra for every lookup. Eventually, for some amount of lookups it would have been cheaper to pay upfront.

Optimization is about maximizing the amount of work not done. Using generators and the built-in csv module aren't going to help much with that in this case. You'd still be reading the whole file and parsing all of it, at least for line breaks. With that amount of data, it's a no-go.

Parsing requires reading, so you'll have to find a way around it first. Best practices of leaving all intricacies of the CSV format to the specialized module bear no meaning when they can't give you the performance you want. Some cheating must be done, but as little as possible. In this case, I suppose it is safe to assume that the start of a new line can be identified as b'\n"AMZN",' (sticking with your example). Yes, binary here, because remember: no parsing yet.

You could scan the file as binary from the beginning until you find the first line. From there read the amount of lines you need, decode and parse them the proper way, etc. No need for optimization there, because 100 lines are nothing to worry about compared to the hundreds of thousands of irrelevant lines you're not doing that work for.

Dropping all that parsing buys you a lot, but the reading needs to be optimized as well. Don't load the whole file into memory first, and skip as many layers of Python as you can. Using mmap lets the OS decide what to load into memory transparently and lets you work with the data directly.

Still you're potentially reading the whole file, if the symbol is near the end. It's a linear search, which means the time it takes is linearly proportional to the number of lines in the file. You can do better though. Because the file is sorted, you could improve the function to instead perform a kind of binary search. The number of steps that will take (where a step is reading a line) is close to the binary logarithm of the number of lines. In other words: the number of times you can divide your file into two (almost) equally sized parts. When there are one million lines, that's a difference of five orders of magnitude!
Here's what I came up with, based on Python's own bisect_left, with some measures to account for the fact that your "values" span more than one index:

import csv
from itertools import islice
import mmap

def iter_symbol_lines(f, symbol):
    # How to recognize the start of a line of interest
    ident = b'"' + symbol.encode() + b'",'
    # The memory-mapped file
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Skip the header
    mm.readline()
    # The inclusive lower bound of the byte range we're still interested in
    lo = mm.tell()
    # The exclusive upper bound of the byte range we're still interested in
    hi = mm.size()
    # As long as the range isn't empty
    while lo < hi:
        # Find the position of the beginning of a line near the middle of the range
        mid = mm.rfind(b'\n', 0, (lo+hi)//2) + 1
        # Go to that position
        mm.seek(mid)
        # Is it a line that comes before lines we're interested in?
        if mm.readline() < ident:
            # If so, ignore everything up to right after this line
            lo = mm.tell()
        else:
            # Otherwise, ignore everything from right before this line
            hi = mid
    # We found where the first line of interest would be expected; go there
    mm.seek(lo)
    while True:
        line = mm.readline()
        if not line.startswith(ident):
            break
        yield line.decode()

with open(filename) as f:
    r = csv.reader(islice(iter_symbol_lines(f, 'AMZN'), 10))
    for line in r:
        print(line)

No guarantees about this code; I didn't pay much attention to edge cases, and I couldn't test with (any of) your file(s), so consider it a proof of concept. It is plenty fast, however; think tens of milliseconds on an SSD!
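Circling back to the database suggestion at the top of this answer, here is a minimal sqlite3 sketch of the import-once-then-query route (the file names, the three-column table, and the header row are assumptions; the real file has more columns):

import csv
import sqlite3

conn = sqlite3.connect("prices.db")  # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS prices (symbol TEXT, date TEXT, close REAL)")

# One-time import from a hypothetical three-column CSV (header skipped),
# then the index that makes per-symbol lookups practically instantaneous.
with open("prices.csv") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    conn.executemany("INSERT INTO prices VALUES (?, ?, ?)", reader)
conn.execute("CREATE INDEX IF NOT EXISTS idx_symbol_date ON prices (symbol, date)")
conn.commit()

# The 100 most recent rows for one ticker come straight from the index.
rows = conn.execute(
    "SELECT * FROM prices WHERE symbol = ? ORDER BY date DESC LIMIT 100",
    ("AMZN",)).fetchall()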
So I have an alternative solution which I ran and tested on my own, with a sample data set that I got on Quandl that appears to have all the same headers and similar data. (Assuming that I haven't misunderstood the end result that you're trying to achieve.)

I have this command line tool that one of our engineers built for us for parsing massive csvs - since I deal with absurd amounts of data on a day to day basis - and it is open sourced. You can get it here: https://github.com/DataFoxCo/gocsv

I also already wrote the short bash script for it in case you don't want to pipeline the commands, but it does also support pipelining. The command to run the following short script follows a super simple convention:

bash tickers.sh wikiprices.csv 'AMZN' '2016-12-\d+|2016-11-\d+'

#!/bin/bash
dates="$3"

cat "$1" \
  | gocsv filter --columns 'ticker' --regex "$2" \
  | gocsv filter --columns 'date' --regex "$dates" > "$2"'-out.csv'

Both arguments, for ticker and for dates, are regexes. You can add as many variations as you want into that one regex, separating them by |. So if you wanted AMZN and MSFT, then you would simply modify it to this: AMZN|MSFT. I did something very similar with the dates - I only limited my sample run to dates from this month or last month.

End result - starting data:

myusername$ gocsv dims wikiprices.csv
Dimensions:
  Rows: 23946
  Columns: 14

myusername$ bash tickers.sh wikiprices.csv 'AMZN|MSFT' '2016-12-\d+'

myusername$ gocsv dims AMZN|MSFT-out.csv
Dimensions:
  Rows: 24
  Columns: 14

Here is a sample where I limited to only those 2 tickers and then to December only. Voila - in a matter of seconds you have a second file saved with the data you care about.

The gocsv program has great documentation, by the way - and a ton of other functions, e.g. running a vlookup basically at any scale (which is what inspired the creator to make the tool).
In addition to using csv.reader, I think using itertools.groupby would speed up looking for the wanted sections, so the actual iteration could look something like this:

import csv
from itertools import groupby
from operator import itemgetter  # for the keyfunc for groupby

def getData(wanted_symbol, filename):
    with open(filename) as file:
        reader = csv.reader(file)
        # so each line in reader is basically line[:-1].split(",") from the plain file
        for symb, lines in groupby(reader, itemgetter(0)):
            # so here symb is the symbol at the start of each line of lines
            # and lines is the lines that all have that symbol in common
            if symb != wanted_symbol:
                continue  # skip this whole section if it has a different symbol
            for line in lines:
                # here we have each line as a list of fields
                # for only the lines that have wanted_symbol as the first element
                <DO STUFF HERE>

So in the space of <DO STUFF HERE> you could have the out.append(formatLineData(line)) to do what your current code does, but the code for that function has a lot of unnecessary slicing and += operators, which I think are pretty expensive for lists (might be wrong). Another way you could apply the conversions is to have a list of all the conversions:

def conv_date(date_str):
    return datetime.strptime(date_str, '%Y-%m-%d').date()

# the conversions applied to each element (taken from the original formatLineData)
castings = [str, conv_date,              # 0, 1
            float, float, float, float,  # 2:6
            int, int, int,               # 6:9
            float, float, float, float,  # 9:13
            int]                         # 13

then use zip to apply these to each field in a line in a list comprehension:

[conv(val) for conv, val in zip(castings, line)]

so you would replace <DO STUFF HERE> with an out.append of that comprehension.

I'd also wonder if switching the order of groupby and reader would be better, since you don't need to parse most of the file as csv, just the parts you are actually iterating over, so you could use a keyfunc that separates just the first field of the string:

def getData(wanted_symbol, filename):
    out = []  # why are you starting this with strings in it?
    def checkMatch(line):
        # define the function to only take the line
        # this would be the keyfunc for groupby in this example
        return line.split(",", 1)[0]  # only split once, return the first element
    with open(filename) as file:
        for symb, lines in groupby(file, checkMatch):
            # so here symb is the symbol at the start of each line of lines
            if symb != wanted_symbol:
                continue  # skip this whole section if it has a different symbol
            for line in csv.reader(lines):
                out.append([typ(val) for typ, val in zip(castings, line)])
    return out
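For completeness, a hypothetical call matching the question's usage (datetime must be imported for conv_date, and the filename is made up):

from datetime import datetime

dataset = getData("AMZN", "prices.csv")  # rows of typed fields, one per trading day
closes = [row[12] for row in dataset]    # column 12 is Adj_Close in the question's layout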