How to reshape a numpy data table? - python

I'm pretty new to python and I have a task to "reshape" some data in a .txt file. The simplified format of the original data looks like this:
A 1 x
A 2 y
A 3 z
B 1 q
B 2 w
B 3 e
...
What I need to get looks like this:
A B
1 x q
2 y w
3 z e
...
The thing is, there are multiple .txt files I have to reshape, and there's no fixed number of 1-2-3s per A-B-C: A could go from 1 to 50, while B could go from 1 to 10 or 75.
I'm looking for an algorithm for this. I've figured out how to reach the data I need and discard the data I don't, but I can't figure out how to "reduce" the dimension of the data.
What I've done so far is get the necessary data into arrays and put those arrays into a numpy array:
data = np.array([station, depth, temperature])
Now I'm trying to fill a new 2D data array, with the x and y axes being the number of different stations and depths: if the original data has AAAABBCCDDDD, then the new data array's x axis will contain ABCD (using Counter().keys()).
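A minimal sketch of that plan (the station/depth/temperature names come from the question; the sample values here are made up):

import numpy as np
from collections import Counter

station = ["A", "A", "A", "B", "B", "B"]
depth = [1, 2, 3, 1, 2, 3]
temperature = ["x", "y", "z", "q", "w", "e"]

stations = list(Counter(station).keys())  # unique stations, e.g. ['A', 'B']
depths = sorted(set(depth))               # unique depths, e.g. [1, 2, 3]

# fill a (depths x stations) grid, defaulting to "0" for missing pairs
grid = np.full((len(depths), len(stations)), "0", dtype=object)
for s, d, t in zip(station, depth, temperature):
    grid[depths.index(d), stations.index(s)] = t
print(grid)  # [['x' 'q'] ['y' 'w'] ['z' 'e']]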

First, you could parse everything, reading line by line, and store the values in a dictionary. Since each line looks something like A 1 x, the general case is as follows:
BIG_LETTER INDEX VALUE WHITESPACE
In the dictionary, the keys would be the BIG_LETTERs and the values another dictionary that maps each index to its value, something like {'A': {1: 'q', 2: 'c'}}. This can trivially be achieved:
replace_with_your_file_name = "./text.txt"
data = {}  # {big_letter: {number: value}}
with open(replace_with_your_file_name, "r") as file:
    for line in file.readlines():
        line = line.strip().split(' ')  # remove the trailing whitespace and split on ' '
        # Store in the dictionary the big letter and all its values,
        # something like {'A': {1: 'q', 2: 'c'}}
        if line[0] not in data:
            data[line[0]] = {}
        data[line[0]][int(line[1])] = line[2]  # data[big_letter][number] = char
Then, after that is finished, you could use another for loop to sort the keys in each nested dictionary, so {'B': {5: 'a', 2: 'c'}} would become {'B': {2: 'c', 5: 'a'}}. You can then also easily extract, for each big letter, the maximum number it has a value for, which solves the problem of non-fixed length. The highest maximum number is saved for later:
# Sort each nested dictionary by its keys
GLOBAL_MAX_NUMBER: int = 0  # the largest number among all big letters
for item in data:
    big_letter: dict = data[item]
    data[item] = dict(sorted(big_letter.items()))  # sort according to the keys
    local_max_number = list(data[item])[-1]  # the last key is the largest
    if local_max_number > GLOBAL_MAX_NUMBER:
        GLOBAL_MAX_NUMBER = local_max_number
iterations = GLOBAL_MAX_NUMBER  # improve readability
Now you can write the data to a new file in the format you wish:
# Write them to a new file
with open("newfile.txt", "w") as file:
    # FORMAT:   A B C D ...  (big letters)
    # ------  1 a b c d ...  (index and value for each big letter in the first row)
    # Write all the big letters in a row
    WHITESPACE: str = " "
    file.write(WHITESPACE + " ".join(list(data)) + "\n")
    # iterate up to that GLOBAL_MAX_NUMBER we kept track of
    for i in range(iterations):
        current_number: int = i + 1  # current index
        file.write(f'{current_number} ')
        for big_letter in data:  # A, B, C ...
            if current_number not in data[big_letter]:
                file.write("0 ")  # in case this value does not exist, write 0
            else:
                file.write(f'{data[big_letter][current_number]} ')  # write the value
        file.write("\n")
All of the above, combined, would give the desired output:
A B
1 x q
2 y w
3 z e
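If pandas is available (an assumption; the question only mentions numpy), the same reshape can be sketched as a pivot, with missing cells filled with 0 as in the code above:

import pandas as pd

# text.txt holds the whitespace-separated BIG_LETTER INDEX VALUE lines
df = pd.read_csv("text.txt", sep=r"\s+", header=None,
                 names=["letter", "index", "value"])
table = df.pivot(index="index", columns="letter", values="value").fillna(0)
table.to_csv("newfile.txt", sep=" ")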

Related

How to convert '2.6840000e+01'-style data to float in Python?

I've got an "input.txt" file that contains lines like:
1 66.3548 1011100110110010 25
Then I apply some functions column by column:
the 1st column stays the same,
the 2nd column is rounded in a specific way,
the 3rd column is converted from binary to decimal,
the 4th column is converted from hexadecimal to binary.
And finally I get this:
[1.0000000e+00 6.6340000e+01 4.7538000e+04 1.0010100e+05]
Then I write this to "fall.txt".
All the operations are working correctly, but I want to see the numbers like:
1 66.34 47538 100101
I placed the columns of the relevant rows in list_for_1, then applied the functions to the indexes and put the results in another list, list_for_11. Finally, I put all the answers in a matrix and wrote the matrix to "fall.txt".
Here's what I did:
with open("input.txt", "r") as file:
#1. TİP SATIRLAR İÇİN GEREKLİ OBJELER
list_for_1 = list()
list_for_11 = list()
#list_final_1 = list()
for line in file:
#EĞER SATIR TİPİ 1 İSE
if line.startswith("1"):
line = line[:-1]
list_for_1 = line.split(" ") #tüm elemanları 1 listede toplama
#1. tip satır için elemanlara gerekli işlemlerin yapılması
list_for_11.append(list_for_1[0]) #ilk satır 1 kalacak
list_for_11.append(float_yuvarla(float(list_for_1[1]))) #float yuvarlama
list_for_11.append(binary_decimal(list_for_1[2])) #binary'den decimal'e
list_for_11.append(hexa_binary(list_for_1[3])) #hexa'dan binary'e
m = 0
n = 0
array1 = np.zeros((6,4))
for i in list_for_11: #listedeki elemanları matrise yerleştirme
if(m > 5):
break
if(isinstance(i, str)):
x = int(i, 2)
array1[m][n] = float(i)
n += 1
if(n == 4):
n = 0
m += 1
with open("fall.txt","w") as ff:
ff.write(str(array1))
ff.write("\n")
Here I actually send a float to the matrix, but it's not working:
if(isinstance(i, str)):
    x = int(i, 2)
    array1[m][n] = float(i)
I'm sort of a new Python user, so I might write unnecessarily long and complex code. If there's a shorter way to do what I did, I'd like to get opinions on that as well.
Here's a function to format your numbers the way you want them:
def formatNumber(num):
    if num % 1 == 0:
        return int(num)
    else:
        return num
Your list of numbers:
l = [1.0000000e+00, 6.6340000e+01, 4.7538000e+04, 1.0010100e+05]
Reformatting your list of numbers:
for x in l:
    print(formatNumber(x))
Output:
1
66.34
47538
100101
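Since the matrix in the question is a numpy array, np.savetxt with a %g format string is another way to get plain-number output; a sketch, assuming array1 holds the converted values as in the question:

import numpy as np

array1 = np.array([[1.0, 66.34, 47538.0, 100101.0]])

# '%g' drops trailing zeros and avoids scientific notation at these magnitudes
with open("fall.txt", "w") as ff:
    np.savetxt(ff, array1, fmt="%g")
# fall.txt now contains: 1 66.34 47538 100101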

How to find the sum of a certain column in a .txt file in Python?

I have a .txt file with 3 rows and 3 columns of data, shown below:
1.5 3.1425 blank
10 12 14
8.2 blank 9.5
I am looking to create a function that lets a user input a number (1, 2, or 3) and get the sum of the specified column.
The error I receive is as follows:
Traceback (most recent call last):
File "<pyshell#41>", line 1, in <module>
summarizer(2)
File "/Users/"practice.py", line
403, in summarizer
print(sum(float(col2)))
ValueError: could not convert string to float: '.'
I'm just practicing my indexing and am running into trouble when trying to pick a specific column or row to analyze. I have the following code, but get errors about my index being out of range or a float object not being iterable:
def summarizer(searchNum):
    infile = open('nums.txt','r')
    fileContents = infile.readlines()
    infile.close
    newList = []
    for numbers in fileContents:
        numVals = numbers.split('\t')
        for i in range(len(numVals)):
            for j in range(0, len(numVals[i])):
                newList += numVals[i][j]
        col1 = numVals[i][0]
        col2 = numVals[i][1]
        col3 = numVals[i][2]
    if searchNum == 1:
        print(sum(float(col1)))
    elif searchNum == 2:
        print(sum(float(col2)))
    else:
        print(sum(float(col3)))
If a user inputs summarizer(3), I would like the output to be 23.5, since 14 + 9.5 + 0 = 23.5.
I put comments in the script. You can create three column lists to collect the values of the corresponding columns, then sum the requested one at the end.
def summarizer(searchNum):
    infile = open('nums.txt', 'r')
    fileContents = infile.readlines()
    infile.close()
    col1, col2, col3 = [], [], []  # initialize the columns
    for numbers in fileContents:
        numVals = numbers.replace('\n', '').split('\t')  # also remove the newline at the end (\n)
        col1.append(float(numVals[0]) if numVals[0] else 0)  # convert to float if not blank, else 0, then add to col1
        col2.append(float(numVals[1]) if numVals[1] else 0)
        col3.append(float(numVals[2]) if numVals[2] else 0)
    if searchNum == 1:
        print(sum(col1))
    elif searchNum == 2:
        print(sum(col2))
    else:
        print(sum(col3))  # print the sum of col3
    return
Result:
summarizer(3)
23.5
You need to make sure that the text file is perfectly formatted with tabs. Then you need to append each row to a list and split each value on tabs.
Then you need to get rid of 'blanks' and '\n' or whatever other non-numbers.
Then sum them.
This is how I would do it:
infile = open('nums.txt', 'r')
fileContents = infile.readlines()
infile.close()
newList = []  # list of lists; each inner list is a row
for line in fileContents:
    newList.append(line.split('\t'))
# Blanks must become 0. Let's get rid of \n as well
for i in range(len(newList)):
    for j in range(len(newList[i])):
        if '\n' in newList[i][j]:
            newList[i][j] = newList[i][j].replace('\n', '')
        try:
            newList[i][j] = float(newList[i][j])  # convert numeric strings
        except ValueError:
            newList[i][j] = 0  # get rid of string entries like 'blank'
sum = 0
if searchNum == 1:
    for i in range(len(newList)):
        sum += newList[i][0]
if searchNum == 2:
    for i in range(len(newList)):
        sum += newList[i][1]
if searchNum == 3:
    for i in range(len(newList)):
        sum += newList[i][2]
print(sum)
Explanation of the "could not convert string to float: '.'" error:
The col2 variable holds the string "blank", which is not a number.
When you apply float() to a string that is not a number (in our case float(col2)), it throws the error you mentioned.
What your code actually does:
1. It creates an n*n 2D array and puts all the elements from the text file into the 2D array.
2. It assigns the last element in each column to the variables col1, col2, col3.
3. It applies the sum operation to the last element in each column.
What you were trying to do:
1. Create an n*n 2D array and put all the elements from the text file into the 2D array.
2. Apply the sum operation to each column's elements and display the result.
So your code is not actually doing what you wanted to do.
I have written the code below, which does what you actually intended to do.
Solution code:
def summarizer(searchNum):
    infile = open('nums.txt', 'r')
    fileContents = infile.readlines()
    infile.close()
    newList = [[], [], []]  # one list per column
    for numbers in fileContents:
        # replace the "blank" string with 0, drop the newline, and make
        # every item float-parseable
        numbers = numbers.replace("blank", "0").replace('\n', '').split('\t')
        # build the 2D array (one list per column) from the items in the text file
        for i in range(len(numbers)):
            newList[i].append(float(numbers[i]))
    # print the sum of the column you wanted (searchNum is 1-based)
    print(sum(newList[searchNum - 1]))
You can do this more easily by using the csv library:
https://docs.python.org/2/library/csv.html
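A sketch of that csv-based approach (assuming a tab-delimited nums.txt where blank cells are empty or non-numeric, as in the question):

import csv

def summarizer(searchNum):
    total = 0.0
    with open('nums.txt', newline='') as infile:
        for row in csv.reader(infile, delimiter='\t'):
            try:
                total += float(row[searchNum - 1])  # searchNum is 1-based
            except ValueError:
                pass  # treat 'blank' (or anything non-numeric) as 0
    print(total)

summarizer(3)  # prints 23.5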

Optimize searching two text files and output based upon a third using Python

I'm having performance issues with a Python function that loads two 5+ GB tab-delimited .txt files, which have the same format but different values, and uses a third text file as a key to determine which values should be kept for output. I'd appreciate any help with speed gains.
Here is the code:
import csv  # needed for the key-file reader below

def rchfile():
    # there are 24752 text lines per stress period, 520 columns, 476 rows
    # there are 52 lines per MODFLOW model row
    lst = []
    out = []
    tcel = 0
    end_loop_break = False
    # key file that sets which file's values to use. If the cell address is not
    # present or the value of cellid == 1, use baseline.csv; otherwise use the
    # test_p97 file.
    with open('input/nrd_cells.csv') as csvfile:
        reader = csv.reader(csvfile)
        for item in reader:
            lst.append([int(item[0]), int(item[1])])
    # two files that are used for data
    with open('input/test_baseline.rch', 'r') as b, open('input/test_p97.rch', 'r') as c:
        for x in range(3):  # skip the first 3 lines that are the file header
            b.readline()
            c.readline()
        while True:  # loop until end of file; this should loop here 1,025 times
            if end_loop_break == True: break
            for x in range(2):  # skip the first 2 lines that are the stress period header
                b.readline()
                c.readline()
            for rw in range(1, 477):
                if end_loop_break == True: break
                for cl in range(52):
                    # read both files at the same time to get the different data
                    # and split the 10 values in the row
                    b_row = b.readline().split()
                    c_row = c.readline().split()
                    if not b_row:
                        end_loop_break = True
                        break
                    for x in range(1, 11):
                        # search for the cell address in the key file to find which file's data to keep
                        testval = [i for i, xi in enumerate(lst) if xi[0] == cl * 10 + x + tcel]
                        if not testval:  # cell address not in key file
                            out.append(b_row[x - 1])
                        elif lst[testval[0]][1] == 1:  # cell address value == 1
                            out.append(b_row[x - 1])
                        elif lst[testval[0]][1] == 2:  # cell address value == 2
                            out.append(c_row[x - 1])
                        print(cl * 10 + x + tcel)  # test output for cell location
                tcel += 520
    print('success')
The key file looks like:
37794, 1
37795, 0
37796, 2
The data files are large (~5 GB each) and complex from a counting standpoint, but they are standard in format and look like:
0 0 0 0 0 0 0 0 0 0
1.5 1.5 0 0 0 0 0 0 0 0
This process is taking a very long time, and I was hoping someone could help speed it up.
I believe your speed problem is coming from this line:
testval = [i for i, xi in enumerate(lst) if xi[0] == cl * 10 + x + tcel]
You are iterating over the whole key list for every single value in the HUGE output files. This is not good.
It looks like cl * 10 + x + tcel is the formula you are looking for in lst[n][0].
May I suggest you use a dict instead of a list for storing the data in lst.
lst = {}
for item in reader:
    lst[int(item[0])] = int(item[1])
Now, lst is a mapping, which means you can simply use the in operator to check for the presence of a key. This is a near instant lookup because the dict type is hash based and very efficient for key lookups.
something in lst
# for example
(cl * 10 + x + tcel) in lst
And you can grab the value by:
lst[something]
# or
lst[cl * 10 + x + tcel]
A little bit of refactoring and your code should PROFOUNDLY speed up.
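A minimal sketch of what the refactored inner loop could look like (assuming lst is now the {cell_address: cellid} dict built above):

for x in range(1, 11):
    addr = cl * 10 + x + tcel
    cellid = lst.get(addr)        # O(1) hash lookup instead of an O(n) list scan
    if cellid == 2:
        out.append(c_row[x - 1])  # cell address value == 2: use the test_p97 value
    else:
        out.append(b_row[x - 1])  # missing key or cellid == 1: use the baseline value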

How to find duplicates in a python list that are adjacent to each other and list them with respect to their indices?

I have a program that reads a .csv file, checks for any mismatch in column length (by comparing it to the header fields), and then returns everything it found as a list (which is then written to a file). What I want to do with this list is to print the results as follows:
row numbers where the same mismatch is found : the number of columns in those rows
e.g.
rows: n-m : y
where n and m are the numbers of the rows which share the same number of columns mismatching the header.
I have looked into these topics, and while the information is useful, they do not answer the question:
Find and list duplicates in a list?
Identify duplicate values in a list in Python
This is where I am right now:
r = csv.reader(data, delimiter='\t')
columns = []
for row in r:
    # adds column length to a list
    colm = len(row)
    columns.append(colm)
b = len(columns)
for a in range(b):
    # checks if the current member matches the header length of columns
    if columns[a] != columns[0]:
        # if it doesn't, write the row and the amount of columns in that row to a file
        file.write("row " + str(a + 1) + ": " + str(columns[a]) + " \n")
The file output looks like this:
row 7220: 0
row 7221: 0
row 7222: 0
row 7223: 0
row 7224: 0
row 7225: 1
row 7226: 1
while the desired end result is:
rows 7220 - 7224 : 0
rows 7225 - 7226 : 1
So what I essentially need, the way I see it, is a dictionary where the key is the rows that share a duplicate value and the value is the number of columns in that mismatch. What I think I need (in rough pseudocode that doesn't make much sense now that I'm reading it years after writing this question) is here:
def pseudoList():
    i = 1
    ListOfLists = []
    while (i < len(originalList)):
        duplicateList = []
        if originalList[i] == originalList[i-1]:
            duplicateList.append(originalList[i])
            i += 1
        ListOfLists.append(duplicateList)

def PseudocreateDict(ListOfLists):
    pseudoDict = {}
    for x in ListOfLists:
        a = ListOfLists[x][0]  # this is the first node in the uniqueList created
        i = len(ListOfLists) - 1
        b = listOfLists[x][i]  # this is the last node of the uniqueList created
        pseudoDict.update('key' : '{} - {}'.format(a, b))
This, however, seems like a very convoluted way of doing what I want, so I was wondering if there's (a) a more efficient way or (b) an easier way to do this?
You can use a list comprehension to collect the elements of the columns list that differ from their predecessor; these are the end-points of your ranges. Then enumerate these ranges and print/write out those that differ from the first (header) element. An extra element is appended to the list of ranges to mark the end index of the list and avoid out-of-range indexing.
columns = [2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 2, 1]
ranges = [[i+1, v] for i, v in enumerate(columns[1:]) if columns[i] != columns[i+1]]
ranges.append([len(columns), 0])  # special case for the last element
for i, v in enumerate(ranges[:-1]):
    if v[1] != columns[0]:
        print("rows", v[0] + 1, "-", ranges[i+1][0], ":", v[1])
output:
rows 2 - 5 : 1
rows 6 - 9 : 0
rows 10 - 11 : 1
rows 13 - 13 : 1
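For what it's worth, the standard library's itertools.groupby does the adjacent grouping directly; a sketch using the same columns list:

from itertools import groupby

columns = [2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 2, 1]

row = 1  # 1-based row counter
for value, group in groupby(columns):
    n = len(list(group))
    if value != columns[0]:  # report only rows that mismatch the header
        print("rows", row, "-", row + n - 1, ":", value)
    row += n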
You can also try the following code:
b = len(columns)
check = None  # value of the current mismatch group (None = not inside a group)
start = 0
for a in range(b):
    # still inside the current mismatch group
    if check is not None and columns[a] == check:
        continue
    # the current group ended on the previous row; write it out
    elif check is not None and columns[a] != check:
        if start != a:
            file.write("rows " + str(start) + " - " + str(a) + ": " + str(check) + " \n")
        else:
            file.write("row " + str(start) + ": " + str(check) + " \n")
        check = None
    if columns[a] != columns[0]:
        # a new mismatch group starts at this row
        start = a + 1
        check = columns[a]
if check is not None:  # write out the final group
    file.write("rows " + str(start) + " - " + str(b) + ": " + str(check) + " \n")
What you want to do is a map/reduce operation, but without the sorting that is normally done between the mapping and the reducing.
If you output
row 7220: 0
row 7221: 0
row 7222: 0
row 7223: 0
To stdout, you can pipe this data to another python program that generates the groups you want.
The second python program could look something like this:
import sys
import re
line = sys.stdin.readline()
last_rowid, last_diff = re.findall('(\d+)', line)
for line in sys.stdin:
rowid, diff = re.findall('(\d+)', line)
if diff != last_diff:
print "rows", last_rowid, rowid, last_diff
last_diff = diff
last_rowid = rowid
print "rows", last_rowid, rowid, last_diff
You would execute them like this in a unix environment to get the output into a file:
python yourprogram.py | python myprogram.py > youroutputfile.dat
If you cannot run this on a unix environment, you can still use the algorithm I wrote in your program with a few modifications.
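A sketch of one such modification, folding the grouping into a single program (assuming results is a non-empty list of (row_number, column_count) pairs for the mismatching rows):

def write_groups(results, out):
    start, last_val = results[0]
    last_row = start
    for row, val in results[1:]:
        # a group ends when the value changes or the rows stop being adjacent
        if val != last_val or row != last_row + 1:
            out.write("rows %d - %d : %d\n" % (start, last_row, last_val))
            start, last_val = row, val
        last_row = row
    out.write("rows %d - %d : %d\n" % (start, last_row, last_val))

# e.g. write_groups([(7220, 0), (7221, 0), (7225, 1)], sys.stdout)
# prints: rows 7220 - 7221 : 0  and  rows 7225 - 7225 : 1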

How to create a dataset using sequence file in python

I have a protein sequence file that looks like this:
>102L:A MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL -------------------------------------------------------------------------------------------------------------------------------------------------------------------XX
The first field is the name of the sequence, the second is the actual protein sequence, and the third is an indicator that shows whether there are any missing coordinates. In this case, notice that there are two "X" at the end: that means the last two residues of the sequence, which are "NL" in this case, are missing coordinates.
By coding in Python I would like to generate a table which should contain:
1) the name of the sequence
2) the total number of missing coordinates (which is the number of X)
3) the range of these missing coordinates (which is the range of the positions of those X)
4) the length of the sequence
5) the actual sequence
So the final result should look like this:
>102L:A 2 163-164 164 MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL
And my code looks like this so far:
total_seq = []
with open('sample.txt') as lines:
    for l in lines:
        split_list = l.split()
        # Assign the list number
        header = split_list[0]    # 1
        seq = split_list[1]       # 5
        disorder = split_list[2]
        # count sequence length and the total residues with missing coordinates
        sequence_length = len(seq)  # 4
        counts = 0
        for x in disorder:
            if x == 'X':
                counts = counts + 1
        total_seq.append([header, seq, str(counts)])  # obviously I haven't finished coding 2 & 3
with open('new_sample.txt', 'a') as f:
    for lol in total_seq:
        f.write('\n'.join(lol))
I'm new to Python; would anyone help, please?
Here's your modified code. It now produces your desired output.
with open("sample.txt") as infile:
matrix = [line.split() for line in infile.readlines()]
header_list = [row[0] for row in matrix]
seq_list = [str(row[1]) for row in matrix]
disorder_list = [str(row[2]) for row in matrix]
f = open('new_sample.txt', 'a')
for i in range(len(header_list)):
header = header_list[i]
seq = seq_list[i]
disorder = disorder_list[i]
# count sequence length and total residue of missing coordinates
sequence_length = len(seq)
# get total number of missing coordinates
num_missing = disorder.count('X')
# get the range of these missing coordinates
first_X_pos = disorder.find('X')
last_X_pos = disorder.rfind('X')
range_missing = '-'.join([str(first_X_pos), str(last_X_pos)])
reformat_seq=" ".join([header, str(num_missing), range_missing, str(sequence_length), seq, '\n'])
f.write(reformat_seq)
f.close()
Some more tips:
Don't forget about Python's string functions. They will solve a lot of your problems automatically, and the documentation is very good.
If you had searched for how to do just part 2 or just part 3 of your question, you would have found the results elsewhere.
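As a tiny illustration of those string functions, on a made-up indicator string:

disorder = "-----XX"
print(disorder.count('X'))      # 2 -> total missing coordinates
print(disorder.find('X') + 1)   # 6 -> first missing position (1-based)
print(disorder.rfind('X') + 1)  # 7 -> last missing position (1-based)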
