Using python to read and replace specific text after mathematical operation - python

I have a text file in the following format:
File
What I am interested in is to have python go through the Atoms chunk of the file (lines 34 to 24033), isolate the 3rd column (which correspond to the atoms type), and perform the following operation:
Isolate only atom type 2 and 4
Randomly select a user-defined percentage from that list of atoms (e.g. randomly select 40% of type 2 and 4 atoms)
Add 3 to those selected atoms (2->5, 4->7)
Write the new file with the updated atom types
Unfortunately, I don't have a lot of knowledge of reading/writing text files in python and so far the only pseudo solution I was able to come up with is the following:
import numpy as np
import random
data = np.loadtxt('5mer_SA.data',skiprows=33,max_rows=(24032-32))
atomType = data[:,2]
reactive = []
for i in range(len(atomType)):
if atomType[i] == 2 or atomType[i] == 4:
x = atomType[i]
reactive.append(x)
reactive_array = np.array(reactive)
p = 0.5 #user defined perecentage
newType = np.zeros(int(len(reactive_array)*p))
for j in range(len(newType)):
newType[j] = random.choice(reactive_array)+3
What I am struggling is to put back modified atomTypes to the text file in the original position. Any suggestions will be greatly appreciated. My ideal goal would be just to have a function that takes as input the user-defined percentage and automatically perform all the mentioned operation generating a new text file as output.

Related

How to turn items from extracted data to numbers for plotting in python?

So i have a text document with a lot of values from calculations. I have extracted all the data and stored it in an array, but they are not numbers that I can use for anything. I want to use the number to plot them in a graph, but the elements in the array are text-strings, how would i turn them into numbers and remove unneccesary signs like commas and n= for instance?
Here is code, and under is my print statement.
import numpy as np
['n=1', 'n=2', 'n=3', 'n=4', 'n=5', 'n=6', 'n=7', 'n=8', 'n=9', 'n=10', 'n=11', 'n=12', 'n=13', 'n=14', 'n=15', 'n=16', 'n=17', 'n=18', 'n=19'])
I'd use the conversion method presented in this post within the extract function, so e.g.
...
delta_x.append(strtofloat(words[1]))
...
where you might as well do the conversion inline (my strtofloat is a function you'd have to write based on mentioned post) and within a try/except block, so failed conversions are just ignored from your list.
To make it more consistent, any conversion error should discard the whole line affected, so you might want to use intermediate variables and a check for each field.
Btw. I noticed the argument to the extract function, it would seem logical to make the argument a string containing the file name from which to extract the data?
EDIT: as a side note, you might want to look into pandas, which is a library specialised in numerical data handling. Depending on the format of your data file there are probably standard functions to read your whole file into a DataFrame (which is a kind of super-charged array class which can handle a lot of data processing as well) in a single command.
I would consider using regular expression:
import re
match_number = re.compile('-?[0-9]+\.?[0-9]*(?:[Ee]-?[0-9]+)?')
for line in infile:
words = line.split()
new_delta_x = float(re.search(match_number, words[1]).group())
new_abs_error = float(re.search(match_number, words[7]).group())
new_n = int(re.search(match_number, words[10]).group())
delta_x.append(new_delta_x)
abs_error.append(new_abs_error)
n.append(new_n)
But it seems like your data is already in csv format. So try using pandas.
Then read data into dataframe without header (column names will be integers).
import numpy as np
import pandas as pd
df = pd.read_csv('approx_derivative_sine.txt', header=None)
delta_x = df[1].to_numpy()
abs_error = df[7].to_numpy()
# if n is always number of the row
n = df.index.to_numpy(dtype=int)
# if n is always in the form 'n=<integer>'
n = df[10].apply(lambda x: x.strip()[2:]).to_numpy(dtype=int)
If you could post a few rows of your approx_derivative_sine.txt file, that would be useful.
From the given array in the question, If you would like to remove the 'n=' and convert each element to an integer, you may try the following.
import numpy as np
array = np.array(['n=1', 'n=2', 'n=3', 'n=4', 'n=5', 'n=6', 'n=7', 'n=8', 'n=9',
'n=10', 'n=11', 'n=12', 'n=13', 'n=14', 'n=15', 'n=16', 'n=17', 'n=18', 'n=19'])
array = [int(i.replace('n=', '')) for i in array]
print(array)

How to find matches between csv files based on two columns within a range

I'm currently struggling to put together some code that will find the matches of values in two different columns in two csv files within a range. I have tried using the code below, but it doesn't output what I am trying to accomplish. Basically, I want to output a new file that contains all of the lines in the second file that have matches to the same columns in the first file, not merge them together. I've added more detailed clarification below my code. I feel like what I've done so far is probably completely wrong. What do I need to change in order for my code to produce the results I am looking for?
import csv
with open('F435W.csv') as csvF435:
readCSV1 = csv.reader(csvF435, delimiter=',')
with open("F550Mnew.csv", "w") as new_F550M:
pass
with open("F550Mnew.csv", "a") as new_F550M:
for header in readCSV1:
new_F550M.write(','.join(header)+'\n')
break
for l435 in readCSV1:
with open('F550M.csv') as csvF550:
readCSV2 = csv.reader(csvF550, delimiter=',')
for l550 in readCSV2:
if isfloat(l435[12]) and isfloat(l550[12]) and abs(float(l435[12])-float(l550[12])) < 0.002778:
if isfloat(l435[13]) and isfloat(l550[13]) and abs(float(l435[13])-float(l550[13])) < 0.002778:
new_F550M.write(','.join(l550)+'\n')
For clarification, each file has an X column and a Y column so basically each row corresponds to an (X,Y) point. In addition, there are 21 other columns of data that are not necessary for finding matches, but need to be included in the final output file. I am trying to find points in the second file that match the points in the first file within a radius. This is because I know that none of my points will be exact matches. In my data, my X is column 13 and my Y is column 14.
The way I have tried to accomplish this is by finding the differences between every X in the first file and every X in the second file (eg. X1-X2), and the differences between every Y in the first file and every Y in the second file (eg. Y1-Y2). Then, every row in the second file which corresponds to differences for both X and Y which are less than my radius value (0.0002778) would be considered a match to the first file.
Unfortunately, my code produces a file with over 300,000 points when my original files only have 7000 points. There should be less data, not more data. It also includes many repeats of data, when there should not be any repeats at all.
Thank you for your time!
Sample of what the data looks like: I apologize for the length, but I am afraid they will not contain enough matches to be useful if I don't include enough of the data.
F435W.csv (file 1)
1,2017.013,0.01242859,-8.2618,0,51434.12,0.3269918,-11.7781,0,0.01957931,1387.9406,541.916,49.9898514,41.5266996,8.81E+01,1.63E+03,1.44E+02,40.535,8.65,84.72,0.00061,0.00035,62.14
2,84.73392,0.01245409,-4.8201,0.0002,112.9723,0.04012135,-5.1324,0.0004,-0.002142646,150.306,146.7986,49.9942613,41.5444392,4.92E+00,5.60E+00,-2.02E-01,2.379,2.206,-74.69,0.00339,0.0029,88.88
3,215.1939,0.01242859,-5.8321,0.0001,262.2751,0.03840466,-6.0469,0.0002,-0.002961465,3248.686,52.8478,50.003155,41.5019044,4.77E+00,5.05E+00,-1.63E-01,2.263,2.166,-65.29,0.002,0.0019,-66.78
4,0.3796681,0.01240305,1.0515,0.0355,0.5823653,0.05487975,0.587,0.1023,-0.00425157,3760.344,11.113,50.0051049,41.4949256,1.93E+00,1.02E+00,-7.42E-02,1.393,1.007,-4.61,0.05461,0.03818,-6.68
5,0.9584663,0.01249223,0.0461,0.0142,1.043696,0.0175857,-0.0464,0.0183,-0.004156116,4013.2063,9.1225,50.0057256,41.4914444,1.12E+00,9.75E-01,1.09E-01,1.085,0.957,28.34,0.01934,0.01745,44.01
6,2.379565,0.01249223,-0.9412,0.0057,0.231205,0.02710035,1.59,0.1273,-0.004135321,3824.3706,9.0756,50.0052903,41.4940468,7.81E-01,6.99E-02,4.27E-02,0.885,0.26,3.42,0.01265,0.00622,15.52
7,0.3171223,0.01250492,1.2469,0.0428,0.5233852,0.05406558,0.7029,0.1122,-0.00399635,4097.3604,7.0301,50.0059585,41.4902884,9.61E-01,1.63E+00,-3.94E-01,1.346,0.883,-65.16,0.06171,0.04005,-65.05
8,0.289245,0.0125176,1.3468,0.047,0.2744479,0.02238134,1.4039,0.0886,-0.004173243,3904.7402,7.3912,50.0055069,41.4929422,7.90E-01,2.38E-01,7.13E-02,0.894,0.479,7.24,0.04501,0.02071,8.29
9,0.3543034,0.01247953,1.1266,0.0383,0.7666836,0.06376094,0.2885,0.0903,-0.004009248,4107.0684,3.259,50.0060503,41.4901611,3.53E+00,1.28E+00,-4.60E-01,1.903,1.09,-11.12,0.06873,0.03955,-11.22
10,1.308331,0.01250492,-0.2918,0.0104,-0.005209296,0.004877397,99,99,-0.004193406,3933.9834,6,50.0056001,41.4925416,5.78E-01,8.33E-02,0.00E+00,0.76,0.289,0,0.01272,0.00424,0
11,3.995717,0.01250492,-1.504,0.0034,0.1589517,0.007450347,1.9968,0.0509,-0.003990021,4069.0469,3.0234,50.0059668,41.4906855,8.03E-01,2.29E-02,1.02E-02,0.896,0.151,0.75,0.00888,0.00361,5.59
12,1.067634,0.01250492,-0.0711,0.0127,0.1260926,0.02787585,2.2483,0.2401,-0.004042602,4048.9148,4,50.0059023,41.4909612,7.40E-01,8.33E-02,0.00E+00,0.86,0.289,0,0.02449,0.00576,0
13,0.2808423,0.01162418,1.3788,0.0449,0.4633991,0.02235104,0.8351,0.0524,-0.004015559,4114.6655,2.0641,50.0060898,41.4900585,9.65E-01,5.88E-01,-9.47E-02,0.994,0.752,-13.34,0.05405,0.03814,-15.13
14,1.067291,0.01245409,-0.0707,0.0127,1.081617,0.01516444,-0.0852,0.0152,-0.004168633,3960.8787,18.0524,50.0054405,41.4921501,6.84E-01,8.29E-01,-6.18E-02,0.923,0.813,-69.77,0.01468,0.01229,-78.83
15,0.5216251,0.0125176,0.7066,0.0261,0.584776,0.01824955,0.5825,0.0339,-0.003026338,2661.6533,58.4563,50.0016952,41.5099844,8.51E-01,1.17E+00,-7.27E-02,1.089,0.914,-77.72,0.03244,0.02498,-81.68
16,0.6062042,0.01249223,0.5435,0.0224,0.8726375,0.05509822,0.1479,0.0686,-0.003950399,4149.8169,31.0127,50.0056384,41.489524,9.30E-01,3.48E+00,2.03E-01,1.87,0.956,85.48,0.05307,0.0241,86.01
17,0.1324067,0.01242859,2.1952,0.1019,0.1208224,0.01290438,2.2946,0.116,-0.004166729,3911.6807,12.661,50.005426,41.4928374,2.17E-01,2.24E-01,-1.08E-01,0.574,0.335,-45.89,0.0721,0.04162,-44.98
18,0.2136006,0.01247953,1.676,0.0634,0.3511444,0.02471001,1.1363,0.0764,-0.003978713,4096.9111,15.6285,50.0057993,41.4902797,1.00E+00,4.37E-01,2.85E-01,1.058,0.564,22.64,0.07548,0.03957,23.17
19,0.1470979,0.01244135,2.081,0.0919,0.1216703,0.0168958,2.287,0.1508,-0.004147241,3695.311,13.7044,50.004907,41.4958173,2.14E-01,2.08E-01,9.20E-02,0.551,0.345,44.05,0.07073,0.04115,45.12
20,0.5434682,0.01250492,0.6621,0.025,0.5819249,0.01592951,0.5878,0.0297,-0.004136056,3866.6416,24.8316,50.0050981,41.493437,8.34E-01,9.96E-01,2.74E-01,1.096,0.793,53.22,0.02966,0.02055,58.08
21,0.2259093,0.01249223,1.6152,0.0601,0.2848583,0.01867901,1.3634,0.0712,-0.00409535,3645.521,20.0162,50.0046759,41.4964926,5.71E-01,4.26E-01,-1.11E-02,0.756,0.652,-4.34,0.03735,0.0305,0.08
22,0.9499883,0.01247953,0.0557,0.0143,0.9711754,0.01891141,0.0318,0.0211,-0.003134006,3378.7927,19.5305,50.0040686,41.5001691,8.66E-01,4.09E-01,3.57E-03,0.931,0.639,0.45,0.01623,0.01142,-1.19
23,1.125635,0.01240305,-0.1285,0.012,1.050538,0.02402694,-0.0535,0.0248,-0.003295973,3132.9458,24.9024,50.0034018,41.5035477,9.65E-01,7.83E-01,-1.44E-01,1.022,0.839,-28.88,0.01702,0.01288,-21
24,0.168302,0.01249223,1.9348,0.0806,0.2447732,0.01930529,1.5281,0.0857,-0.004140488,3904.7268,27.0386,50.0051454,41.4929084,4.47E-01,4.56E-01,-1.28E-02,0.682,0.662,-54.61,0.04399,0.04068,89.66
25,0.0542859,0.01244135,3.1633,0.2489,0.08799078,0.007964755,2.6389,0.0983,-0.003241792,3454.2612,25.2749,50.0041373,41.4991191,1.93E-01,1.99E-01,-7.18E-02,0.518,0.353,-46.27,0.06408,0.03839,-44.76
26,0.4379335,0.01242859,0.8965,0.0308,0.4661828,0.01542368,0.8286,0.0359,-0.00336337,3478.7058,32.3355,50.0040639,41.4987701,6.15E-01,8.96E-01,-2.91E-02,0.948,0.782,-84.15,0.02891,0.02521,-70.04
27,0.1515608,0.01249223,2.0485,0.0895,0.1935181,0.01712885,1.7832,0.0961,-0.002904789,2982.0017,29.9904,50.0029594,41.505619,3.46E-01,3.61E-01,1.55E-05,0.601,0.588,89.94,0.05241,0.05241,-80.48
28,0.6658883,0.01250492,0.4415,0.0204,0.718064,0.01780974,0.3596,0.0269,-0.00324104,3408.0103,36.2539,50.0038284,41.4997375,9.45E-01,1.11E+00,1.98E-01,1.115,0.902,56.45,0.02706,0.02147,51.52
29,0.7244126,0.01244135,0.35,0.0187,1.030102,0.02744665,-0.0322,0.0289,-0.00280412,3259.0889,37.3165,50.0034648,41.5017879,8.65E-01,1.01E+00,5.85E-02,1.017,0.919,70.87,0.02225,0.02011,55.79
30,0.1651701,0.01247953,1.9552,0.0821,0.163293,0.01641976,1.9676,0.1092,-0.003909466,3595.4846,31.9761,50.0043403,41.4971614,2.50E-01,4.42E-01,2.21E-01,0.766,0.324,56.75,0.08087,0.03087,58.28
F550M.csv (file 2)
2,1921.566,0.01258874,-8.2091,0,37128.06,0.2618096,-11.4243,0,0.01455503,4617.5225,554.576,49.9887896,41.5264699,6.09E+01,8.09E+02,1.78E+01,28.459,7.779,88.63,0.00054,0.00036,77.04
3,1.055918,0.01256313,-0.0591,0.0129,9.834856,0.1109255,-2.4819,0.0122,-0.002955142,3936.4946,85.3255,49.9949149,41.5370016,3.98E+01,1.23E+01,1.54E+01,6.83,2.336,24.13,0.06362,0.01965,23.98
4,151.2355,0.01260153,-5.4491,0.0001,184.0693,0.03634057,-5.6625,0.0002,-0.002626019,3409.2642,76.9891,49.9931935,41.5442109,4.02E+00,4.35E+00,-1.47E-03,2.086,2.005,-89.75,0.00227,0.00198,66.61
5,0.3506025,0.01258874,1.138,0.039,0.3466277,0.01300407,1.1503,0.0407,-0.002441164,3351.9893,8.9147,49.9942299,41.5451727,4.97E-01,5.07E-01,7.21E-03,0.715,0.702,62.75,0.02,0.01989,82.88
6,1.166133,0.01257594,-0.1669,0.0117,0.005819145,0.009692424,5.5879,1.8089,-0.003201006,3476.9932,10,49.9946543,41.5434658,5.88E-01,8.33E-02,0.00E+00,0.767,0.289,0,0.01497,0.00499,0
7,0.1372164,0.0125503,2.1565,0.0993,0.1238123,0.02608246,2.2681,0.2288,-0.003556473,3535.5281,13.4586,49.9947993,41.5426587,2.49E-01,2.48E-01,-7.69E-03,0.506,0.491,-43.27,0.05264,0.05237,-55.87
8,0.6174777,0.01260153,0.5234,0.0222,0.6206718,0.01300407,0.5178,0.0228,-0.002441164,3357.0044,20.0487,49.9940449,41.5450748,5.10E-01,5.22E-01,-6.28E-03,0.724,0.712,-66.7,0.01194,0.01192,84.82
9,1.46848,0.01260153,-0.4172,0.0093,0.001897994,0.009688255,6.8043,5.5435,-0.003612399,3584.0171,16,49.9949252,41.5419909,5.87E-01,8.33E-02,0.00E+00,0.766,0.289,0,0.01175,0.00392,0
10,1.452348,0.01258874,-0.4052,0.0094,3.124427,0.04807406,-1.2369,0.0167,-0.003148756,3805.6069,39.5791,49.9952831,41.5389075,2.25E+00,3.87E+00,-6.77E-01,2.03,1.416,-70.08,0.0302,0.01891,-67.61
11,0.1548658,0.01260153,2.0251,0.0884,0.1777253,0.01630147,1.8756,0.0996,-0.002919044,3459.7681,25.6248,49.9943085,41.5436591,4.64E-01,2.34E-01,8.40E-02,0.701,0.455,18.09,0.05739,0.03321,18.33
12,0.5046132,0.01253746,0.7426,0.027,0.7798272,0.04462456,0.27,0.0621,-0.00261193,3418.9119,65.5326,49.9934365,41.5441099,6.87E-01,2.77E+00,-2.92E-01,1.678,0.804,-82.19,0.05363,0.02182,-83.28
13,0.380733,0.01260153,1.0484,0.0359,0.4313257,0.01605258,0.913,0.0404,-0.003497544,3548.8484,34.5602,49.9944623,41.542421,8.27E-01,8.51E-01,8.92E-02,0.964,0.865,48.75,0.03776,0.03252,30.61
14,0.1643925,0.01258874,1.9603,0.0832,0.2181225,0.01839054,1.6532,0.0916,-0.003121084,3710.6785,33.3215,49.9950598,41.5402182,2.18E-01,2.18E-01,1.03E-01,0.567,0.339,45,0.0757,0.04376,45
15,0.3959635,0.01260153,1.0059,0.0346,0.9984215,0.0763398,0.0017,0.083,-0.003106286,3805.9988,48.3363,49.995125,41.5388789,1.87E+00,3.12E+00,4.86E-01,1.813,1.304,71.09,0.0559,0.04105,67.61
16,0.1625628,0.01260153,1.9724,0.0842,0.3490304,0.02234424,1.1428,0.0695,-0.002472953,3410.77,38.0388,49.9939083,41.544294,1.77E-01,4.75E-01,8.92E-03,0.689,0.421,88.29,0.0769,0.04707,89.86
17,0.1725209,0.01260153,1.9079,0.0793,0.2965718,0.02357189,1.3197,0.0863,-0.003454017,3629.0247,40.9706,49.9946304,41.541311,3.73E-01,7.91E-01,-3.73E-01,1.004,0.393,-59.65,0.09781,0.03734,-58.27
18,0.3034717,0.01260153,1.2947,0.0451,0.5031242,0.02774418,0.7458,0.0599,-0.003073985,4079.0825,42,49.9962105,41.5351731,6.68E-01,8.33E-02,0.00E+00,0.818,0.289,0,0.06348,0.02106,0
19,1.593927,0.01260153,-0.5062,0.0086,1.860803,0.0219809,-0.6743,0.0128,-0.003038161,4065.9434,58.3703,49.9958657,41.5353087,1.75E+00,1.41E+00,-7.15E-03,1.323,1.188,-1.21,0.01697,0.01464,-0.43
20,0.5464995,0.01258874,0.656,0.025,0.5661472,0.0144696,0.6177,0.0278,-0.003053429,4045.0474,54.439,49.9958631,41.535604,5.43E-01,8.46E-01,-1.22E-03,0.92,0.737,-89.77,0.02257,0.01649,-89.72
21,1.303251,0.01253746,-0.2876,0.0104,1.296672,0.01418861,-0.2821,0.0119,-0.00259741,4240.1406,55.2714,49.9965409,41.5329423,6.05E-01,6.81E-01,7.89E-03,0.826,0.777,84.15,0.00892,0.00852,69.62
22,0.5174786,0.01260153,0.7153,0.0264,0.5260691,0.01390194,0.6974,0.0287,-0.003019847,3828.95,55.19,49.9950817,41.5385478,5.18E-01,7.56E-01,-6.34E-02,0.879,0.709,-75.96,0.0236,0.01643,-75.02
23,0.1551826,0.01260153,2.0229,0.0882,0.166565,0.01726119,1.946,0.1125,-0.003271136,3504.7439,52.7386,49.9939745,41.5429739,1.91E-01,6.86E-01,1.89E-01,0.866,0.356,71.33,0.10376,0.04235,71.56
24,0.2214222,0.01260153,1.6369,0.0618,0.2389908,0.01360924,1.554,0.0618,-0.00285033,3750.3167,54.0027,49.994824,41.5396229,4.32E-01,5.51E-01,1.68E-03,0.742,0.657,89.18,0.04862,0.04505,89.94
25,0.1336059,0.01253746,2.1854,0.1019,0.1320868,0.009830156,2.1979,0.0808,-0.002921393,3459.6851,51.7091,49.9938331,41.5435908,2.16E-01,2.06E-01,-9.16E-02,0.55,0.345,-43.52,0.06231,0.03626,-45.19
26,0.1703959,0.01260153,1.9214,0.0803,0.1577456,0.0152816,2.0051,0.1052,-0.002779523,3446.95,49,49.9938372,41.5437717,7.29E-01,8.33E-02,0.00E+00,0.854,0.289,0,0.11183,0.03721,0
27,1.896325,0.01258874,-0.6948,0.0072,1.941203,0.0152816,-0.7202,0.0085,-0.00306097,3809.6836,57.8143,49.9949655,41.5388035,7.38E-01,6.80E-01,7.46E-03,0.86,0.824,7.18,0.00713,0.00678,59.71
28,0.6522877,0.01260153,0.4639,0.021,0.1713469,0.01312423,1.9153,0.0832,-0.002447558,4271.9614,52,49.9967135,41.5325172,5.92E-01,8.33E-02,0.00E+00,0.77,0.289,0,0.0274,0.00913,0
29,0.1370073,0.0125503,2.1581,0.0995,0.101415,0.02614047,2.4847,0.2799,-0.002207851,4324.667,55.3374,49.99684,41.5317898,2.22E-01,2.24E-01,1.12E-01,0.579,0.332,45.18,0.07753,0.04476,45
30,0.2240251,0.01253746,1.6243,0.0608,0.2254432,0.01360924,1.6174,0.0656,-0.003037372,3960.3042,58.9024,49.9954807,41.5367473,4.18E-01,4.81E-01,-1.07E-02,0.695,0.645,-80.65,0.03802,0.03492,-88.86
You are complicating the program by nesting all the loops and conditionals. Break it down into simple steps.
Do the following.
1. Read both the csv files and convert them into 2d lists.
2. Compare the columns/values of the lists within a loop based on the given index, add the rows from second list to a new output list.
3. Write the output list to a csv file.
def read_file(filepath):
with open(filepath,'r') as f:
x = csv.reader(f)
l = list(x)
return l
l435 = read_file('F435W.csv')
l550 = read_file('F550M.csv')
new_F550M = []
r = 0.002778
for i in l550:
for j in l435:
# I did't exactly get your if condition, so I am putting it down based on what I understood, so if it is wrong, modify it accordingly.
if isfloat(i[12]) and isfloat(j[12]) and abs(float(i[12]) float(j[12])) < r:
if isfloat(i[13]) and isfloat(j[13]) and abs(float(i[13]) float(j[13])) < r:
new_F550M.append(i)
with open('new_F550M.csv','w') as f:
out = csv.writer(f)
out.writerows(new_F550M)

Output with Python Glob // Cannot find where is error in Python code

I have the following code, which does NOT give an error but it also does not produce an output.
The script is made to do the following:
The script takes an input file of 4 tab-separated columns:
It then counts the unique values in Column 1 and the frequency of corresponding values in Column 4 (which contains 2 different tags: C and D).
The output is 3 tab-separated columns containing the unique values of column 1 and their corresponding frequency of values in Column 4: Column 2 has the frequency of the string in Column 1 that corresponds with Tag C and Column 3 has the frequency of the string in Column 1 that corresponds with Tag D.
Here is a sample of input:
algorithm-n like-1-resonator-n 8.1848 C
algorithm-n produce-hull-n 7.9104 C
algorithm-n like-1-resonator-n 8.1848 D
algorithm-n produce-hull-n 7.9104 D
anything-n about-1-Zulus-n 7.3731 C
anything-n above-shortage-n 6.0142 C
anything-n above-1-gig-n 5.8967 C
anything-n above-1-magnification-n 7.8973 C
anything-n after-1-memory-n 2.5866 C
and here is a sample of the desired output:
algorithm-n 2 2
anything-n 5 0
The code I am using is the following (which one will see takes into consideration all suggestions from the comments):
from collections import defaultdict, Counter
def sortAndCount(opened_file):
lemma_sense_freqs = defaultdict(Counter)
for line in opened_file:
lemma, _, _, senseCode = line.split()
lemma_sense_freqs[lemma][senseCode] += 1
return lemma_sense_freqs
def writeOutCsv(output_file, input_dict):
with open(output_file, "wb") as outfile:
for lemma in input_dict.keys():
for senseCode in input_dict[lemma].keys():
outstring = "\t".join([lemma, senseCode,\
str(input_dict[lemma][senseCode])])
outfile.write(outstring + "\n")
import os
import glob
folderPath = "Python_Counter" # declare here
for input_file in glob.glob(os.path.join(folderPath, 'out_')):
with open(input_file, "rb") as opened_file:
lemma_sense_freqs = sortAndCount(input_file)
output_file = "count_*.csv"
writeOutCsv(output_file, lemma_sense_freqs)
My intuition is the problem is coming from the "glob" function.
But, as I said before: the code itself DOES NOT give me an error -- but it doesn't seem to produce an output either.
Can someone help?
I have referred to the documentation here and here, and I cannot seem to understand what I am doing wrong.
Can someone provide me insight on how to solve the problem by outputting the results from glob. As I have a large amount of files I need to process.
In regards to your original code, *lemma_sense_freqs* is not defined cause it should be returned by the function sortAndCount(). And you never call that function.
For instance, you have a second function in your code, which is called writeOutCsv. You define it, and then you actually call it on the last line.
While you never call the function sortAndCount() (which is the one that should return the value of *lemma_sense_freqs*). Hence, the error.
I don't know what you want to achieve exactly with that code, but you definitely need to write at a certain point (try before the last line) something like this
lemma_sense_freqs = sortAndCount(input_file)
this is the way you call the function you need and lemma_sense_freqs will then have a value associated and you shouldn't get the error.
I cannot be more specific cause it is not clear exactly what you want to achieve with that code. However, you just are experiencing a basic issue at the moment (you defined a function but never used it to retrieve the value lemma_sense_freqs). Try to add the piece of code I suggest and play with it.

How to compare clusters?

Hopefully this can be done with python! I used two clustering programs on the same data and now have a cluster file from both. I reformatted the files so that they look like this:
Cluster 0:
Brucellaceae(10)
Brucella(10)
abortus(1)
canis(1)
ceti(1)
inopinata(1)
melitensis(1)
microti(1)
neotomae(1)
ovis(1)
pinnipedialis(1)
suis(1)
Cluster 1:
Streptomycetaceae(28)
Streptomyces(28)
achromogenes(1)
albaduncus(1)
anthocyanicus(1)
etc.
These files contain bacterial species info. So I have the cluster number (Cluster 0), then right below it 'family' (Brucellaceae) and the number of bacteria in that family (10). Under that is the genera found in that family (name followed by number, Brucella(10)) and finally the species in each genera (abortus(1), etc.).
My question: I have 2 files formatted in this way and want to write a program that will look for differences between the two. The only problem is that the two programs cluster in different ways, so two cluster may be the same, even if the actual "Cluster Number" is different (so the contents of Cluster 1 in one file might match Cluster 43 in the other file, the only different being the actual cluster number). So I need something to ignore the cluster number and focus on the cluster contents.
Is there any way I could compare these 2 files to examine the differences? Is it even possible? Any ideas would be greatly appreciated!
Given:
file1 = '''Cluster 0:
giant(2)
red(2)
brick(1)
apple(1)
Cluster 1:
tiny(3)
green(1)
dot(1)
blue(2)
flower(1)
candy(1)'''.split('\n')
file2 = '''Cluster 18:
giant(2)
red(2)
brick(1)
tomato(1)
Cluster 19:
tiny(2)
blue(2)
flower(1)
candy(1)'''.split('\n')
Is this what you need?
def parse_file(open_file):
result = []
for line in open_file:
indent_level = len(line) - len(line.lstrip())
if indent_level == 0:
levels = ['','','']
item = line.lstrip().split('(', 1)[0]
levels[indent_level - 1] = item
if indent_level == 3:
result.append('.'.join(levels))
return result
data1 = set(parse_file(file1))
data2 = set(parse_file(file2))
differences = [
('common elements', data1 & data2),
('missing from file2', data1 - data2),
('missing from file1', data2 - data1) ]
To see the differences:
for desc, items in differences:
print desc
print
for item in items:
print '\t' + item
print
prints
common elements
giant.red.brick
tiny.blue.candy
tiny.blue.flower
missing from file2
tiny.green.dot
giant.red.apple
missing from file1
giant.red.tomato
So just for help, as I see lots of different answers in the comment, I'll give you a very, very simple implementation of a script that you can start from.
Note that this does not answer your full question but points you in one of the directions in the comments.
Normally if you have no experience I'd argue to go a head and read up on Python (which i'll do anyways, and i'll throw in a few links in the bottom of the answer)
On to the fun stuffs! :)
class Cluster(object):
'''
This is a class that will contain your information about the Clusters.
'''
def __init__(self, number):
'''
This is what some languages call a constructor, but it's not.
This method initializes the properties with values from the method call.
'''
self.cluster_number = number
self.family_name = None
self.bacteria_name = None
self.bacteria = []
#This part below isn't a part of the class, this is the actual script.
with open('bacteria.txt', 'r') as file:
cluster = None
clusters = []
for index, line in enumerate(file):
if line.startswith('Cluster'):
cluster = Cluster(index)
clusters.append(cluster)
else:
if not cluster.family_name:
cluster.family_name = line
elif not cluster.bacteria_name:
cluster.bacteria_name = line
else:
cluster.bacteria.append(line)
I wrote this as dumb and overly simple as I could without any fancy stuff and for Python 2.7.2
You could copy this file into a .py file and run it directly from command line python bacteria.py for example.
Hope this helps a bit and don't hesitate to come by our Python chat room if you have any questions! :)
http://learnpythonthehardway.org/
http://www.diveintopython.net/
http://docs.python.org/2/tutorial/inputoutput.html
check if all elements in a list are identical
Retaining order while using Python's set difference
You have to write some code to parse the file. If you ignore the cluster, you should be able to distinguish between family, genera and species based on indentation.
The easiest way it to define a named tuple:
import collections
Bacterium = collections.namedtuple('Bacterium', ['family', 'genera', 'species'])
You can make in instance of this object like this:
b = Bacterium('Brucellaceae', 'Brucella', 'canis')
Your parser should read a file line by line, and set the family and genera. If it then finds a species, it should add a Bacterium to a list;
with open('cluster0.txt', 'r') as infile:
lines = infile.readlines()
family = None
genera = None
bacteria = []
for line in lines:
# set family and genera.
# if you detect a bacterium:
bacteria.append(Bacterium(family, genera, species))
Once you have a list of all bacteria in each file or cluster, you can select from all the bacteria like this:
s = [b for b in bacteria if b.genera == 'Streptomycetaceae']
Comparing two clusterings is not trivial task and reinventing the wheel is unlikely to be successful. Check out this package which has lots of different cluster similarity metrics and can compare dendrograms (the data structure you have).
The library is called CluSim and can be found here:
https://github.com/Hoosier-Clusters/clusim/
After learning so much from Stackoverflow, finally I have an opportunity to give back! A different approach from those offered so far is to relabel clusters to maximize alignment, and then comparison becomes easy. For example, if one algorithm assigns labels to a set of six items as L1=[0,0,1,1,2,2] and another assigns L2=[2,2,0,0,1,1], you want these two labelings to be equivalent since L1 and L2 are essentially segmenting items into clusters identically. This approach relabels L2 to maximize alignment, and in the example above, will result in L2==L1.
I found a soution to this problem in "Menéndez, Héctor D. A genetic approach to the graph and spectral clustering problem. MS thesis. 2012." and below is an implementation in Python using numpy. I'm relatively new to Python, so there may be better implementations, but I think this gets the job done:
def alignClusters(clstr1,clstr2):
"""Given 2 cluster assignments, this funciton will rename the second to
maximize alignment of elements within each cluster. This method is
described in in Menéndez, Héctor D. A genetic approach to the graph and
spectral clustering problem. MS thesis. 2012. (Assumes cluster labels
are consecutive integers starting with zero)
INPUTS:
clstr1 - The first clustering assignment
clstr2 - The second clustering assignment
OUTPUTS:
clstr2_temp - The second clustering assignment with clusters renumbered to
maximize alignment with the first clustering assignment """
K = np.max(clstr1)+1
simdist = np.zeros((K,K))
for i in range(K):
for j in range(K):
dcix = clstr1==i
dcjx = clstr2==j
dd = np.dot(dcix.astype(int),dcjx.astype(int))
simdist[i,j] = (dd/np.sum(dcix!=0) + dd/np.sum(dcjx!=0))/2
mask = np.zeros((K,K))
for i in range(K):
simdist_vec = np.reshape(simdist.T,(K**2,1))
I = np.argmax(simdist_vec)
xy = np.unravel_index(I,simdist.shape,order='F')
x = xy[0]
y = xy[1]
mask[x,y] = 1
simdist[x,:] = 0
simdist[:,y] = 0
swapIJ = np.unravel_index(np.where(mask.T),simdist.shape,order='F')
swapI = swapIJ[0][1,:]
swapJ = swapIJ[0][0,:]
clstr2_temp = np.copy(clstr2)
for k in range(swapI.shape[0]):
swapj = [swapJ[k]==i for i in clstr2]
clstr2_temp[swapj] = swapI[k]
return clstr2_temp

Python: How to extract string from text file to use as data

this is my first time writing a python script and I'm having some trouble getting started. Let's say I have a txt file named Test.txt that contains this information.
x y z Type of atom
ATOM 1 C1 GLN D 10 26.395 3.904 4.923 C
ATOM 2 O1 GLN D 10 26.431 2.638 5.002 O
ATOM 3 O2 GLN D 10 26.085 4.471 3.796 O
ATOM 4 C2 GLN D 10 26.642 4.743 6.148 C
What I want to do is eventually write a script that will find the center of mass of these three atoms. So basically I want to sum up all of the x values in that txt file with each number multiplied by a given value depending on the type of atom.
I know I need to define the positions for each x-value, but I'm having trouble with figuring out how to make these x-values be represented as numbers instead of txt from a string. I have to keep in mind that I'll need to multiply these numbers by the type of atom, so I need a way to keep them defined for each atom type. Can anyone push me in the right direction?
mass_dictionary = {'C':12.0107,
'O':15.999
#Others...?
}
# If your files are this structured, you can just
# hardcode some column assumptions.
coords_idxs = [6,7,8]
type_idx = 9
# Open file, get lines, close file.
# Probably prudent to add try-except here for bad file names.
f_open = open("Test.txt",'r')
lines = f_open.readlines()
f_open.close()
# Initialize an array to hold needed intermediate data.
output_coms = []; total_mass = 0.0;
# Loop through the lines of the file.
for line in lines:
# Split the line on white space.
line_stuff = line.split()
# If the line is empty or fails to start with 'ATOM', skip it.
if (not line_stuff) or (not line_stuff[0]=='ATOM'):
pass
# Otherwise, append the mass-weighted coordinates to a list and increment total mass.
else:
output_coms.append([mass_dictionary[line_stuff[type_idx]]*float(line_stuff[i]) for i in coords_idxs])
total_mass = total_mass + mass_dictionary[line_stuff[type_idx]]
# After getting all the data, finish off the averages.
avg_x, avg_y, avg_z = tuple(map( lambda x: (1.0/total_mass)*sum(x), [[elem[i] for elem in output_coms] for i in [0,1,2]]))
# A lot of this will be better with NumPy arrays if you'll be using this often or on
# larger files. Python Pandas might be an even better option if you want to just
# store the file data and play with it in Python.
Basically using the open function in python you can open any file. So you can do something as follows: --- the following snippet is not a solution to the whole problem but an approach.
def read_file():
f = open("filename", 'r')
for line in f:
line_list = line.split()
....
....
f.close()
From this point on you have a nice setup of what you can do with these values. Basically the second line just opens the file for reading. The third line define a for loop that reads the file one line at a time and each line goes into the line variable.
The last line in that snippet basically breaks the string --at every whitepsace -- into an list. So line_list[0] will be the value on your first column and so forth. From this point if you have any programming experience you can just use if statements and such to get the logic that you want.
** Also keep in mind that the type of values stored in that list will all be string so if you want to perform any arithmetic operations such as adding you have to be careful.
* Edited for syntax correction
If you have pandas installed, checkout the read_fwf function that imports a fixed-width file and creates a DataFrame (2-d tabular data structure). It'll save you lines of code on import and also give you a lot of data munging functionality if you want to do any additional data manipulations.

Categories