How can I take data from files without pandas in Python?

I have two text files, codes.txt and values.txt. The code file has categorical values, and values.txt has the numerical value for each corresponding category. Consecutive identical categories are treated as one segment. An example is given in the figure below: data points between 3 and 11 are all of category "H", and they form one segment.
I want to write a function that takes these two files (code.txt and values.txt) and returns a dictionary as the output. The dictionary should have a key for each category, and under each category key a nested dictionary whose keys are the segment ids for that category. I cannot use pandas or numpy for this.
In the end, the two files combined should look like this (one code and its value per line):
H 0.76923
H 0.131979
H 0.503175
T 0.867538
T 0.123256

code.txt
A
A
B
B
C
C
B
C
A
C
B
A
B
A
A
C
values.txt
1.00
2.89
3.46
3.5443
343.234
3535.35235
253415.3512
561.343
0.544534
222.453
213.5525
4532.3435
3541.134
55.31314
341.3143
131.4534
Complete code
codes = []
values = []
with open(r"values.txt") as file:
    for line in file:
        values.append(float(line.strip()))
with open(r"code.txt") as file:
    for line in file:
        codes.append(line.strip())
# print(codes)
# print(values)
dictt = {}
for code in set(codes):
    dictt[code] = {"values": [], "mean": "", "length": ""}
for i in range(len(codes)):
    dictt[codes[i]]["values"].append(values[i])
for key, value in dictt.items():
    dictt[key]["length"] = len(value["values"])
    dictt[key]["mean"] = sum(value["values"]) / len(value["values"])
print(dictt)
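The code above groups values by category but does not yet build the per-segment dictionary the question asks for. A sketch using itertools.groupby to number consecutive runs; `build_segments` is an illustrative helper, and the 1-based per-category segment ids are an assumption since the question does not fix a numbering scheme:

```python
from itertools import groupby

def build_segments(codes, values):
    """Group values by category, then by consecutive runs (segments)."""
    result = {code: {} for code in set(codes)}
    pos = 0
    for code, run in groupby(codes):
        run_len = len(list(run))
        seg_id = len(result[code]) + 1  # next segment id for this category
        result[code][seg_id] = values[pos:pos + run_len]
        pos += run_len
    return result

segments = build_segments(["A", "A", "B", "A"], [1.0, 2.0, 3.0, 4.0])
# {'A': {1: [1.0, 2.0], 2: [4.0]}, 'B': {1: [3.0]}}
```

The same `codes` and `values` lists read in the script above can be passed straight into this helper.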
OUTPUT

Related

How to get all combinations of records in csv file using python

I want all the combinations of records in a csv file (only one column).
Here is what the column looks like.
Here is my code for what I had done:
corpus = []
for index, row in df.iterrows():
    corpus.append(pre_process(row['Chief Complaints']))
print(embeddings_similarity(corpus))

def embeddings_similarity(sentences):
    # first we need to get data into | sentence_a | sentence_b | format
    sentence_pairs = list(itertools.combinations(sentences, 2))
    sentence_a = [pair[0] for pair in sentence_pairs]
    sentence_b = [pair[1] for pair in sentence_pairs]
    sentence_pairs_df = pd.DataFrame({'sentence_a': sentence_a, 'sentence_b': sentence_b})
From the above code I was able to get fairly good output, that is 36 rows (6×6) for the input given in the picture.
But it takes a long time for more records, so I was wondering whether there is any other way to obtain the combinations of all records of a single column in a csv file.
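The pairing itself does not need pandas at all: itertools.combinations can be applied to the column values directly, and generating pairs is rarely the slow part. A minimal sketch; `column_pairs` is a hypothetical helper, and the csv module accepts any iterable of lines, so an open file object works too:

```python
import csv
import itertools

def column_pairs(lines, col=0):
    """Yield all unordered pairs of values taken from one csv column.

    `lines` is any iterable of csv text lines (an open file works).
    """
    values = [row[col] for row in csv.reader(lines) if row]
    return itertools.combinations(values, 2)

# usage on a real file: column_pairs(open("records.csv", newline=""))
pairs = list(column_pairs(["a", "b", "c"]))
# [('a', 'b'), ('a', 'c'), ('b', 'c')]
```

For n records this produces n·(n−1)/2 pairs, so for large inputs the quadratic pair count itself, not pandas, dominates the runtime.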

Openpyxl and Binary Search

The problem: I have two spreadsheets. Spreadsheet 1 has about 20,000 rows. Spreadsheet 2 has near 1 million rows. When a value from a row in spreadsheet 1 matches a value from a row in spreadsheet 2, the entire row from spreadsheet 2 is written to excel. The problem isn't too difficult, but with such a large number of rows, the run time is incredibly long.
Book 1 Example:
|Key |Value |
|------|------------------|
|397241|587727227839578000|
An example of book 2:
|ID                |a  |b |c   |
|------------------|---|--|----|
|587727227839578000|393|24|0.43|
My current solution is:
g1 = openpyxl.load_workbook('path/to/sheet/sheet1.xlsx', read_only=True)
grid1 = g1.active
grid1_rows = list(grid1.rows)
g2 = openpyxl.load_workbook('path/to/sheet2/sheet2.xlsx', read_only=True)
grid2 = g2.active
grid2_rows = list(grid2.rows)

for row in grid1_rows:
    value1 = int(row[1].value)
    print(value1)
    for row2 in grid2_rows:
        value2 = int(row2[0].value)
        if value1 == value2:
            new_Name = int(row[0].value)
            print("match")
            output_file.write(str(new_Name))
            output_file.write(",")
            output_file.write(",".join(str(c.value) for c in row2[1:]))
            output_file.write("\n")
This solution works, but again the runtime is absurd. Ideally I'd like to take value1 (which comes from the first sheet), perform a binary search for that value on the other sheet, and then, just like my current solution, copy the entire row to a new file if it matches.
If there's an even faster method to do this I'm all ears. I'm not the greatest at python so any help is appreciated.
Thanks.
You are getting your butt kicked here because you are using an inappropriate data structure, which requires you to use the nested loop.
The below example uses sets to match indices from first sheet to those in the second sheet. This assumes there are no duplicates on either sheet, which would seem weird given your problem description. Once we make sets of the indices from both sheets, all we need to do is intersect the 2 sets to find the ones that are on sheet 2.
Then we have the matches, but we can do better. If we put the second sheet row data into dictionary with the indices as the keys, then we can hold onto the row data while we do the match, rather than have to go hunting for the matching indices after intersecting the sets.
I've also put in an enumeration, which may or may not be needed to identify which rows in the spreadsheet are the ones of interest. Probably not needed.
This should execute in the blink of an eye after things are loaded. If you start to have memory issues, you may want to just construct the dictionary at the start rather than the list and the dictionary.
Book 1:
Book 2:
Code:
import openpyxl

g1 = openpyxl.load_workbook('Book1.xlsx', read_only=True)
grid1 = g1.active
grid1_rows = list(grid1.rows)[1:]  # exclude the header

g2 = openpyxl.load_workbook('Book2.xlsx', read_only=True)
grid2 = g2.active
grid2_rows = list(grid2.rows)[1:]  # exclude the header

# make a set of the values in Book 1 that we want to search for...
search_items = {int(t[0].value) for t in grid1_rows}
# print(search_items)

# make a dictionary (key-value pairing) for the items in the 2nd book, and
# include an enumeration so we can capture the row number
lookup_dict = {int(t[0].value): (idx, t) for idx, t in enumerate(grid2_rows, start=1)}
# print(lookup_dict)

# now let's intersect the set of search items and key values to get the keys of the matches...
keys = search_items & lookup_dict.keys()
# print(keys)

for key in keys:
    idx = lookup_dict[key][0]       # the row index, if needed
    row_data = lookup_dict[key][1]  # the row data
    print(f'row {idx} matched value {key} and has data:')
    print(f'    name: {row_data[1].value:10s} \t qty: {int(row_data[2].value)}')
Output:
row 3 matched value 202 and has data:
    name: steak        qty: 3
row 1 matched value 455 and has data:
    name: dogfood      qty: 10

How can I fill a dataframe from a recursive dictionary values?

I have created a script that allows me to read multiple pdf files and extract information recursively one by one. This script generates a dictionary with data by pdf.
Ex:
1st iteration, from the 1st PDF file:
d = {"GGT":["transl","mut"], "ATT":["alt3"], "ATC":["alt5"], "AUC":["alteration"]}
2nd iteration, from the 2nd PDF file:
d = {"GGT":["transl","mut"], "AUC":["alteration"]}
... and so on, up to 200 pdf files.
Initially I have a dataframe created with all the genes that the analysis can detect.
df = pd.DataFrame(data=None, columns=["GGT","AUC","ATC","ATT","UUU","UUT"], dtype=None, copy=False)
Desired output:
What I would like to obtain is a dataframe where the information of the values is stored in a recursive way line by line.
For example:
Is there an easy way to implement this? or functions that can help me?
IIUC, you are trying to loop through the dictionaries and add them as rows in your dataframe? I'm not sure how this applies to recursion with "What I would like to obtain is a dataframe where the information of the values is stored in a recursive way line by line."
d1 = {"GGT": ["transl", "mut"], "ATT": ["alt3"], "ATC": ["alt5"], "AUC": ["alteration"]}
d2 = {"GGT": ["transl", "mut"], "AUC": ["alteration"]}
dicts = [d1, d2]  # imagine this list contains the 200 dictionaries

df = pd.DataFrame(data=None, columns=["GGT", "AUC", "ATC", "ATT", "UUU", "UUT"], dtype=None, copy=False)
for d in dicts:  # since there are only 200 rows, a simple loop with append
    df = df.append(d, ignore_index=True)
df
Out[1]:
GGT AUC ATC ATT UUU UUT
0 [transl, mut] [alteration] [alt5] [alt3] NaN NaN
1 [transl, mut] [alteration] NaN NaN NaN NaN
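One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the loop should collect the rows first and build the frame in one call. A sketch; `dicts_to_rows` is a hypothetical helper, and the final pandas call is left as a comment so the row-building part stands on its own:

```python
cols = ["GGT", "AUC", "ATC", "ATT", "UUU", "UUT"]

def dicts_to_rows(dicts, cols, missing=None):
    """Normalise each dict to a full row covering every column."""
    return [{c: d.get(c, missing) for c in cols} for d in dicts]

rows = dicts_to_rows([
    {"GGT": ["transl", "mut"], "ATT": ["alt3"], "ATC": ["alt5"], "AUC": ["alteration"]},
    {"GGT": ["transl", "mut"], "AUC": ["alteration"]},
], cols)
# build the frame once instead of appending repeatedly:
# df = pd.DataFrame(rows, columns=cols)
```

Building all rows first is also much faster than repeated appends, since each append copies the whole frame.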

convert crosstab to columns without using pandas in python

How do I convert the crosstab data from the input file mentioned below into columns based on the input list without using pandas?
Input list
[A,B,C]
Input data file
Labels A, B, C are only for representation; the original file only has the numeric values.
We can ignore the columns XX & YY based on the length of the input list.
A B C XX YY
A 0 2 3 4 8
B 4 0 6 4 8
C 7 8 0 5 8
Output (Output needs to have labels)
A A 0
A B 2
A C 3
B A 4
B B 0
B C 6
C A 7
C B 8
C C 0
The labels need to be present in the output file even though they are not present in the input file; hence I have shown their representation in the output above.
NB: In reality the labels are sorted city names without duplicates in ascending order & not single alphabets like A or B.
Unfortunately this would have been easier if I could install pandas on the server & use unstack(), but installations aren't allowed on this old server right now.
This is on python 3.5
Considering you tagged the post csv, I'm assuming the actual input data is a .csv file, without header as you indicated.
So example data would look like:
0,2,3,4,8
4,0,6,4,8
7,8,0,5,8
If the labels are provided as a list matching the order of the columns and rows (i.e. ['A', 'B', 'C']), this would turn the example output into:
'A','A',0
'A','B',2
'A','C',3
'B','A',4
etc.
Note that this implies the number of rows and columns in the file cannot exceed the number of labels provided.
You indicate that the columns you label 'XX' and 'YY' are to be ignored, but not how that is supposed to be communicated; since you mention the length of the input list determines it, I assume this means 'everything after column n can be ignored'.
This is a simple implementation:
from csv import reader

def unstack_csv(fn, columns, labels):
    with open(fn) as f:
        cr = reader(f)
        row = 0
        for line in cr:
            col = 0
            for x in line[:columns]:
                yield labels[row], labels[col], x
                col += 1
            row += 1

print(list(unstack_csv('unstack.csv', 3, ['A', 'B', 'C'])))
or if you like it short and sweet:
from csv import reader

with open('unstack.csv') as f:
    content = reader(f)
    labels = ['A', 'B', 'C']
    print([(labels[row], labels[col], x)
           for row, data in enumerate(content)
           for col, x in enumerate(data) if col < 3])
(I'm also assuming using numpy is out, for the same reason as pandas, but that stuff like csv is in, since it's a standard library)
If you don't want to provide the labels explicitly, but just want them generated, you could do something like:
def label(n):
    r = n // 26
    c = chr(65 + (n % 26))
    if r > 0:
        return label(r - 1) + c
    else:
        return c
And then of course just remove the labels from the examples and replace with calls to label(col) and label(row).
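For reference, this numbering behaves like spreadsheet-style column names. Restating the function above in a compact self-contained form to demonstrate:

```python
def label(n):
    # base-26 column naming: 0 -> 'A', 25 -> 'Z', 26 -> 'AA', ...
    r = n // 26
    c = chr(65 + (n % 26))
    return label(r - 1) + c if r > 0 else c

print([label(i) for i in (0, 1, 25, 26, 27, 701, 702)])
# ['A', 'B', 'Z', 'AA', 'AB', 'ZZ', 'AAA']
```

So any number of rows or columns gets a label, with no hand-written label list needed.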

Python: Remove all lines of csv files that do not match within a range

I have been given two sets of data in the form of csv files which have 23 columns and thousands of lines of data.
The data in column 14 corresponds to the positions of stars in an image of a galaxy.
The issue is that one set of data contains positions that do not exist in the other. They need to both contain the same positions, but the positions differ by 0.0002 between the two data sets.
F435.csv has values which are 0.0002 greater than the values in F550.csv. I am trying to find the matches between the two files, but within a certain range, because all values are off by a fixed amount.
Then, I need to delete all lines of data that correspond to values that do not match.
Below is a sample of the data from each of the two files:
F435W.csv:
NUMBER,FLUX_APER,FLUXERR_APER,MAG_APER,MAGERR_APER,FLUX_BEST,FLUXERR_BEST,MAG_BEST,MAGERR_BEST,BACKGROUND,X_IMAGE,Y_IMAGE,ALPHA_J2000,DELTA_J2000,X2_IMAGE,Y2_IMAGE,XY_IMAGE,A_IMAGE,B_IMAGE,THETA_IMAGE,ERRA_IMAGE,ERRB_IMAGE,ERRTHETA_IMAGE
1,2017.013,0.01242859,-8.2618,0,51434.12,0.3269918,-11.7781,0,0.01957931,1387.9406,541.916,49.9898514,41.5266996,8.81E+01,1.63E+03,1.44E+02,40.535,8.65,84.72,0.00061,0.00035,62.14
2,84.73392,0.01245409,-4.8201,0.0002,112.9723,0.04012135,-5.1324,0.0004,-0.002142646,150.306,146.7986,49.9942613,41.5444109,4.92E+00,5.60E+00,-2.02E-01,2.379,2.206,-74.69,0.00339,0.0029,88.88
3,215.1939,0.01242859,-5.8321,0.0001,262.2751,0.03840466,-6.0469,0.0002,-0.002961465,3248.686,52.8478,50.003155,41.5019044,4.77E+00,5.05E+00,-1.63E-01,2.263,2.166,-65.29,0.002,0.0019,-66.78
4,0.3796681,0.01240305,1.0515,0.0355,0.5823653,0.05487975,0.587,0.1023,-0.00425157,3760.344,11.113,50.0051049,41.4949256,1.93E+00,1.02E+00,-7.42E-02,1.393,1.007,-4.61,0.05461,0.03818,-6.68
5,0.9584663,0.01249223,0.0461,0.0142,1.043696,0.0175857,-0.0464,0.0183,-0.004156116,4013.2063,9.1225,50.0057256,41.4914444,1.12E+00,9.75E-01,1.09E-01,1.085,0.957,28.34,0.01934,0.01745,44.01
F550M.csv:
NUMBER,FLUX_APER,FLUXERR_APER,MAG_APER,MAGERR_APER,FLUX_BEST,FLUXERR_BEST,MAG_BEST,MAGERR_BEST,BACKGROUND,X_IMAGE,Y_IMAGE,ALPHA_J2000,DELTA_J2000,X2_IMAGE,Y2_IMAGE,XY_IMAGE,A_IMAGE,B_IMAGE,THETA_IMAGE,ERRA_IMAGE,ERRB_IMAGE,ERRTHETA_IMAGE,,FALSE
2,1921.566,0.01258874,-8.2091,0,37128.06,0.2618096,-11.4243,0,0.01455503,4617.5225,554.576,49.9887896,41.5264699,6.09E+01,8.09E+02,1.78E+01,28.459,7.779,88.63,0.00054,0.00036,77.04,,
3,1.055918,0.01256313,-0.0591,0.0129,9.834856,0.1109255,-2.4819,0.0122,-0.002955142,3936.4946,85.3255,49.9949149,41.5370016,3.98E+01,1.23E+01,1.54E+01,6.83,2.336,24.13,0.06362,0.01965,23.98,,
4,151.2355,0.01260153,-5.4491,0.0001,184.0693,0.03634057,-5.6625,0.0002,-0.002626019,3409.2642,76.9891,49.9931935,41.5442109,4.02E+00,4.35E+00,-1.47E-03,2.086,2.005,-89.75,0.00227,0.00198,66.61,,
5,0.3506025,0.01258874,1.138,0.039,0.3466277,0.01300407,1.1503,0.0407,-0.002441164,3351.9893,8.9147,49.9942299,41.5451727,4.97E-01,5.07E-01,7.21E-03,0.715,0.702,62.75,0.02,0.01989,82.88
Below is the code I have so far, but I'm unsure how to find matches based on that specific column. I am very new to Python, and this task is probably way beyond my knowledge of Python, but I desperately need to figure it out. I've been working on this single task for weeks, trying different methods. Thank you in advance!
import csv

with open('F435W.csv') as csvF435:
    readCSV = csv.reader(csvF435, delimiter=',')
with open('F550M.csv') as csvF550:
    readCSV = csv.reader(csvF550, delimiter=',')

for x in range(0, 6348):
    a = csvF435[x]
    for y in range(0, 6349):
        b = csvF550[y]
        if b < a + 0.0002 and b > a - 0.0002:
            newlist.append(b)
            break
You can use the following sample:
import csv

def isfloat(value):
    try:
        float(value)
        return True
    except ValueError:
        return False

interval = 0.0002

with open('F435W.csv') as csvF435:
    csvF435_in = csv.reader(csvF435, delimiter=',')
    # clean the file content before processing
    with open("merge.csv", "w") as merge_out:
        pass
    with open("merge.csv", "a") as merge_out:
        # write the header of the output csv file
        for header in csvF435_in:
            merge_out.write(','.join(header) + '\n')
            break
        for l435 in csvF435_in:
            with open('F550M.csv') as csvF550:
                csvF550_in = csv.reader(csvF550, delimiter=',')
                for l550 in csvF550_in:
                    if isfloat(l435[13]) and isfloat(l550[13]) and abs(float(l435[13]) - float(l550[13])) < interval:
                        merge_out.write(','.join(l435) + '\n')
With the same F435W.csv and F550M.csv samples shown above, the script produces:
merge.csv:
NUMBER,FLUX_APER,FLUXERR_APER,MAG_APER,MAGERR_APER,FLUX_BEST,FLUXERR_BEST,MAG_BEST,MAGERR_BEST,BACKGROUND,X_IMAGE,Y_IMAGE,ALPHA_J2000,DELTA_J2000,X2_IMAGE,Y2_IMAGE,XY_IMAGE,A_IMAGE,B_IMAGE,THETA_IMAGE,ERRA_IMAGE,ERRB_IMAGE,ERRTHETA_IMAGE
2,84.73392,0.01245409,-4.8201,0.0002,112.9723,0.04012135,-5.1324,0.0004,-0.002142646,150.306,146.7986,49.9942613,41.5444109,4.92E+00,5.60E+00,-2.02E-01,2.379,2.206,-74.69,0.00339,0.0029,88.88
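Note that this answer reopens and rescans F550M.csv once per F435W row, which is O(n·m). Because the tolerance is a fixed small interval, the comparison column (index 13) can instead be bucketed into integer ticks of the interval so each lookup is O(1). A sketch under those assumptions; `match_within` is a hypothetical helper shown on synthetic values rather than the original files:

```python
def match_within(values_a, values_b, interval=0.0002):
    """Return the values from values_a lying within `interval` of any value in values_b."""
    # bucket every b-value by which interval-sized tick it rounds to
    buckets = {}
    for b in values_b:
        buckets.setdefault(round(b / interval), []).append(b)
    matched = []
    for a in values_a:
        t = round(a / interval)
        # a true match can only live in the same bucket or an adjacent one
        candidates = [b for k in (t - 1, t, t + 1) for b in buckets.get(k, [])]
        if any(abs(a - b) < interval for b in candidates):
            matched.append(a)
    return matched
```

The same idea applied here would mean reading F550M.csv once to build the buckets from column 14, then streaming F435W.csv through the lookup, turning the quadratic scan into a single pass over each file.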
