I have a very large square matrix, of order around 570,000 x 570,000, and I want to raise it to the power of 2.
The data is in JSON format, parsed into a nested associative array (a dict inside a dict in Python).
Let's say I want to represent this matrix:
[ [0, 0, 0],
[1, 0, 5],
[2, 0, 0] ]
In JSON it's stored like:
{"3": {"1": 2}, "2": {"1": 1, "3": 5}}
Here, for example, "3": {"1": 2} means the number in the 3rd row and 1st column is 2.
I want the output in the same JSON format, but with the matrix raised to the power of 2 (matrix multiplication).
The programming language isn't important; I want to calculate it as fast as possible (in less than 2 days, if possible).
So I tried NumPy in Python (numpy.linalg.matrix_power), but it doesn't work with my nested, unsorted dict format.
I wrote a simple Python script to do it, but I estimated it would take 18 days to finish:
jsonFileName = "file.json"

def matrix_power(arr):
    result = {}
    for x1, subarray in arr.items():
        print("doing item:", x1)
        for y1, value1 in subarray.items():
            for x2, subarray2 in arr.items():
                if y1 != x2:
                    continue
                for y2, value2 in subarray2.items():
                    partSum = value1 * value2
                    result[x1][y2] = result.setdefault(x1, {}).setdefault(y2, 0) + partSum
    return result
import json

with open(jsonFileName, 'r') as reader:
    jsonFile = reader.read()
print("reading is successful")
jsonArr = json.loads(jsonFile)
print("matrix is in array form")
matrix = matrix_power(jsonArr)
print("Well done! matrix is powered by 2 now")
output = json.dumps(matrix)
print("result is in json format")
writer = open("output.json", 'w+')
writer.write(output)
writer.close()
print("Task is done! You can close this window now")
Here, x1, y1 are the row and column of an element of the first matrix, which is multiplied by the corresponding element of the second matrix (x2, y2).
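Aside: most of the projected 18 days comes from the innermost scan over all of arr just to find row y1 (the for x2 loop); a direct dictionary lookup removes that factor entirely. A sketch of the same function with only that change:

def matrix_power(arr):
    result = {}
    for x1, subarray in arr.items():
        for y1, value1 in subarray.items():
            subarray2 = arr.get(y1)  # direct lookup instead of scanning every row
            if subarray2 is None:    # row y1 of the matrix is empty
                continue
            for y2, value2 in subarray2.items():
                row = result.setdefault(x1, {})
                row[y2] = row.get(y2, 0) + value1 * value2
    return result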
NumPy is not the problem; you just need to feed it a format that NumPy can understand. But since your matrix is really big, it probably won't fit in memory as a dense array, so it's a good idea to use a sparse matrix (scipy.sparse.csr_matrix):
import scipy.sparse

# The JSON indices are 1-based, so subtract 1 to get 0-based matrix indices
# (the conversion back to JSON below adds the 1 back).
m = scipy.sparse.csr_matrix((
    [v for row in data.values() for v in row.values()],                # values
    (
        [int(row_n) - 1 for row_n, row in data.items() for v in row],  # row indices
        [int(column) - 1 for row in data.values() for column in row]   # column indices
    )
))
Then it's just a matter of doing:
m**2
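For completeness, a minimal end-to-end sketch under the assumptions above (the file name is from the question; passing shape explicitly is an assumption worth making, because trailing all-zero rows or columns cannot be inferred from the data and the matrix must be square):

import json
import scipy.sparse

with open("file.json") as reader:
    data = json.load(reader)

n = 570_000  # the known matrix order
values = [v for row in data.values() for v in row.values()]
rows = [int(r) - 1 for r, row in data.items() for _ in row]  # 0-based row indices
cols = [int(c) - 1 for row in data.values() for c in row]    # 0-based column indices
m = scipy.sparse.csr_matrix((values, (rows, cols)), shape=(n, n))

m2 = m ** 2  # matrix power for csr_matrix; m @ m is an equivalent, unambiguous spelling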
Now I have to somehow translate the csr_matrix back to something JSON-serializable.
Here's one way to do that, using the attributes data, indices, indptr - m is the csr_matrix:
d = {}
end = m.indptr[0]
for row in range(m.shape[0]):
    start = end
    end = m.indptr[row + 1]
    if end > start:  # if row not empty
        d.update({str(1 + row): dict(zip([str(1 + i) for i in m.indices[start:end]],
                                         m.data[start:end]))})
output = json.dumps(d, default=int)
I don't understand how it can hold the csr_matrix in memory but not the dictionary; d.update raises a MemoryError after some time.
Here's a variant which doesn't construct the whole output dictionary and JSON string in memory, but prints the individual rows directly to the output file; this should need considerably less memory.
#!/usr/bin/env python3
…
import json
import sys

sys.stdout = open("output.json", 'w')
delim = '{'
end = m.indptr[0]
for row in range(m.shape[0]):
    start = end
    end = m.indptr[row + 1]
    if end > start:  # if row not empty
        print(delim, '"' + str(1 + row) + '":',
              json.dumps(dict(zip([str(1 + i) for i in m.indices[start:end]],
                                  m.data[start:end])),
                         default=int))
        delim = ','
print('}')
Related
I'm pretty new to python and I have a task to "reshape" some data in a .txt file. The simplified format of the original data looks like this:
A 1 x
A 2 y
A 3 z
B 1 q
B 2 w
B 3 e
...
What I need to get looks like this
A B
1 x q
2 y w
3 z e
...
The thing is, there are multiple .txt files I have to reshape and there's no fixed number of 1-2-3s per A-B-C, meaning A could go from 1 to 50, while B could go from 1 to 10 or 75.
I'm looking for an algorithm to do this. I've figured out how to reach the data I need and discard the data I don't need, but I can't figure out how to "reduce" the dimension of the data.
What I've done so far is get the necessary data into arrays and put those arrays into a NumPy array:
data = np.array([station, depth, temperature])
Now I'm trying to fill a new 2d data array, with x and y axis being the number of different stations and depths: if the original data has AAAABBCCDDDD, then the new data array's x axis will contain ABCD (using Counter().keys()).
First you could parse everything, reading line by line, and store the values in a dictionary. Since each line looks something like A 1 x, the general case is as follows:
BIG_LETTER INDEX VALUE WHITESPACE
In the dictionary, the keys would be the BIG_LETTERs and the values another dictionary that maps each index to its value, something like {'A': {1: 'q', 2: 'c'}}. This can trivially be achieved:
data = {}

replace_with_your_file_name = "./text.txt"
with open(replace_with_your_file_name, "r") as file:
    for line in file.readlines():
        line = line.strip().split(' ')  # remove trailing whitespace and split on spaces
        # Store each big letter and all its values in the dictionary,
        # something like {'A': {1: 'q', 2: 'c'}}
        if line[0] not in data:
            data[line[0]] = {}
        # the index is stored as an int so the sort and max below work numerically
        data[line[0]][int(line[1])] = line[2]  # data[big_letter][number] = char
Then, after that is finished, you could use another for loop to sort the keys of each nested dictionary, so {'B': {5: 'a', 2: 'c'}} would become {'B': {2: 'c', 5: 'a'}}. You can then easily extract, for each big letter, the maximum number it has a value for, which solves the problem of the non-fixed length. The highest maximum number is saved for later.
# Sort each nested dictionary by its keys
GLOBAL_MAX_NUMBER: int = 0  # the largest number among all big letters
for item in data:
    big_letter: dict = data[item]
    data[item] = dict(sorted(big_letter.items()))  # sort according to the keys
    local_max_number = list(data[item])[-1]  # the last key is the largest
    if local_max_number > GLOBAL_MAX_NUMBER:
        GLOBAL_MAX_NUMBER = local_max_number
iterations = GLOBAL_MAX_NUMBER  # improve readability
Now you can write the data in a new file in the format you wish
# Write the reshaped data to a new file
with open("newfile.txt", "w") as file:
    # FORMAT:   A B C D ...  (big letters in the header row)
    #         1 a b c d ...  (index plus the value for each big letter per row)
    WHITESPACE: str = " "
    file.write(WHITESPACE + " ".join(list(data)) + "\n")
    # iterate up to that GLOBAL_MAX_NUMBER we kept track of
    for i in range(iterations):
        current_number: int = i + 1  # current index
        file.write(f'{current_number} ')
        for big_letter in data:  # A, B, C ...
            if current_number not in data[big_letter]:
                file.write("0 ")  # write 0 in case this index does not exist
            else:
                file.write(f'{data[big_letter][current_number]} ')  # write the value
        file.write("\n")
All of the above, combined, gives the desired output:
A B
1 x q
2 y w
3 z e
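As an aside, the same reshape is also a short pandas pivot, shown here as a sketch (it assumes pandas is available and the whitespace-separated three-column layout from the example; the column names are made up):

import pandas as pd

# Read the three whitespace-separated columns into a DataFrame
df = pd.read_csv("./text.txt", sep=r'\s+', names=["letter", "index", "value"])

# Indexes become rows, big letters become columns; missing entries are filled with 0
table = df.pivot(index="index", columns="letter", values="value").fillna(0)
table.to_csv("newfile.txt", sep=" ")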
I need to check whether the numbers in my NxM matrix (a NumPy array) are in gradeScale. If, for example, the number 8 is in my matrix, I would like to append that number to one empty list and its row number to another list.
So how do I check if a number in my matrix isn't in gradeScale? I have tried different types of loops, but they don't work.
wrongNumber = []
Rows = []
gradeScale = np.array([-3, 0, 2, 4, 7, 10, 12])

# if there is a number in matrix which is not in gradeScale:
wrongNumber.append[number]
Rows.append[rownumber]

print("the grade {} in line {} is out of range",format(wrongNumber),
      format(Rows))
You can use numpy.ndarray.shape to go through your rows.
for row in range(matrix.shape[0]):
    for x in matrix[row]:
        if x not in gradeScale:
            wrongNumber.append(x)
            Rows.append(row)
In addition, you do not use format correctly. Your print statement should be
print("The grade {} in line {} is out of range".format(wrongNumber, Rows))
The following post has some more information on formatting: String formatting in Python.
Example
import numpy as np

wrongNumber = []
Rows = []
matrix = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
gradeScale = [1, 3, 4, 5, 8]

for row in range(matrix.shape[0]):
    for x in matrix[row]:
        if x not in gradeScale:
            wrongNumber.append(x)
            Rows.append(row)
print("The grades {} in lines {} (respectively) are out of range.".format(wrongNumber, Rows))
Output
The grades [2, 6, 7] in lines [0, 2, 3] (respectively) are out of range.
Probably a for loop with enumerate() is what you are looking for.
Example:
for rowNumber, row in enumerate(matrix):
    for number in row:
        if number not in gradeScale:
            wrongNumber.append(number)
            Rows.append(rowNumber)
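For larger matrices, a vectorized alternative to both loops above is numpy.isin; a sketch on the same example data:

import numpy as np

matrix = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
gradeScale = np.array([1, 3, 4, 5, 8])

mask = ~np.isin(matrix, gradeScale)  # True where an entry is not in gradeScale
rows, cols = np.nonzero(mask)
wrongNumber = matrix[mask].tolist()  # [2, 6, 7]
Rows = rows.tolist()                 # [0, 2, 3]
print("The grades {} in lines {} (respectively) are out of range.".format(wrongNumber, Rows))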
I'm splitting the words from a text file in Python. I compute the number of rows (c) and a dictionary (word_positions) mapping each word to an index. Then I create a zero matrix of shape (c, index). Here is the code:
from collections import defaultdict
import re
import numpy as np

c = 0
f = open('/Users/Half_Pint_Boy/Desktop/sentenses.txt', 'r')
for line in f:
    c = c + 1

word_positions = {}
with open('/Users/Half_Pint_Boy/Desktop/sentenses.txt', 'r') as f:
    index = 0
    for word in re.findall(r'[a-z]+', f.read().lower()):
        if word not in word_positions:
            word_positions[word] = index
            index += 1
print(word_positions)

matrix = np.zeros((c, index))  # np.zeros takes the shape as a single tuple
My question: how can I populate the matrix so that matrix[c, index] = count, where c is the row number, index is the word's index position, and count is the number of occurrences of the word in that row?
Try this:
import re
import numpy as np
from itertools import chain

text = open('/Users/Half_Pint_Boy/Desktop/sentenses.txt')
text_list = text.readlines()

c = 0
for i in range(len(text_list)):
    c = c + 1

text_niz = []
for i in range(len(text_list)):
    text_niz.append(text_list[i].lower())  # converted to lower case

slovo = []
for j in range(len(text_niz)):
    slovo.append(re.split('[^a-z]', text_niz[j]))  # tokenization
for e in range(len(slovo)):
    while slovo[e].count('') != 0:
        slovo[e].remove('')  # removed the empty words

slovo_list = list(chain(*slovo))
print(slovo_list)  # built the list of words
slovo_list = list(set(slovo_list))  # removed the duplicates
x = len(slovo_list)

s = []
for i in range(len(slovo)):
    for j in range(len(slovo_list)):
        s.append(slovo[i].count(slovo_list[j]))  # counted the words in each sentence

matr = np.array(s)  # matrix of word occurrences in the sentences
d = matr.reshape((c, x))  # reshaped into a c-by-x matrix (22*254)
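For comparison, a shorter sketch of the same counting idea, built directly on the question's word_positions dictionary and its [a-z]+ tokenization (same file path assumed):

import re
import numpy as np

with open('/Users/Half_Pint_Boy/Desktop/sentenses.txt') as f:
    lines = f.readlines()

# Assign each distinct word a column index, in order of first appearance
word_positions = {}
for line in lines:
    for word in re.findall(r'[a-z]+', line.lower()):
        if word not in word_positions:
            word_positions[word] = len(word_positions)

# matrix[c, index] = count: occurrences of word `index` in sentence `c`
matrix = np.zeros((len(lines), len(word_positions)), dtype=int)
for c, line in enumerate(lines):
    for word in re.findall(r'[a-z]+', line.lower()):
        matrix[c, word_positions[word]] += 1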
It looks like you are trying to create something similar to an n-dimensional list. These are achieved by nesting lists inside each other, like so:
two_d_list = [[0, 1], [1, 2], ["example", "blah", "blah blah"]]
words = two_d_list[2]
single_word = two_d_list[2][1]  # notice the second index operator
This concept is very flexible in Python and can also be done with a dictionary nested inside as you would like:
two_d_list = [{"word": 1}, {"example": 1, "blah": 3}]
words = two_d_list[1]  # type(words) == dict
single_word = two_d_list[1]["example"]  # similar index operator, but for the dictionary
This achieves what you want functionally, but does not use the syntax matrix[c, index]; that syntax does not exist for plain Python lists and dicts (a comma inside an index expression actually forms a tuple key, which plain lists don't accept). Instead you can access the row's dictionary's element with matrix[c][index] = count.
You may be able to overload the index operator to achieve the syntax you want. Here is a question about achieving the syntax you desire. In summary:
Overload the __getitem__(self, index) function (and __setitem__ for assignment) in a wrapper class and have it accept a tuple. The tuple can be created without parentheses, giving the syntax matrix[c, index] = count.
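As a rough illustration of that summary (the class and names here are made up, not from the linked question), a minimal wrapper could look like this:

class Matrix:
    """A dict-of-dicts wrapper that supports the matrix[row, col] syntax."""
    def __init__(self):
        self.rows = {}

    def __setitem__(self, key, value):
        row, col = key  # matrix[c, index] = count passes key as the tuple (c, index)
        self.rows.setdefault(row, {})[col] = value

    def __getitem__(self, key):
        row, col = key
        return self.rows[row][col]

matrix = Matrix()
matrix[0, 'word'] = 3
print(matrix[0, 'word'])  # 3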
I am working on a problem to create a function find_row with three input parameters: file name, col_number and value. I want output like in the given example:
For example, if we have a file a.csv:
1, 1.1, 1.2
2, 2.1, 2.2
3
4, 4.1, 4.2
then
print(find_row('a.csv', 0, 4)) would print 3,
print(find_row('a.csv', 2, 2.2)) would print 1, and
print(find_row('a.csv', 0, 100)) would print None.
The code I tried is:
import csv

def find_row(filename, col_number, value):
    var = str(value)
    coln = str(col_number)
    o = open(filename, 'r')
    myData = csv.reader(o)
    index = 0
    for row in myData:
        if row[col_number] == var:
            return index
        else:
            index += 1

print(find_row('a.csv', 2, 2.2))
It is throwing this error:
File "C:/Users/ROHIT SHARMA/Desktop/1.py", line 17, in find_row
if row[col_number] == var:
IndexError: list index out of range
I understand the error now, but I'm not able to improve the code. Any help here, guys?
Thanks.
In your CSV file, the 3rd row has only one column, so 2 is not a valid index.
As an aside, it's cleaner to do
for index, row in enumerate(myData):
    if row[col_number] == var:
        return index
Edit: Also, that CSV is going to give you problems. It can't find '2.2' because it actually returns ' 2.2'. Strip the spaces when you read or make sure the CSV is saved the "correct" way (no spaces between comma and content).
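As a sketch of the stripping option, csv can drop those spaces for you via the skipinitialspace format parameter (the rest of the code stays the same):

import csv

with open('a.csv', 'r') as o:
    # skipinitialspace=True makes csv.reader drop the space after each comma,
    # so ' 2.2' is read as '2.2'
    myData = csv.reader(o, skipinitialspace=True)
    for row in myData:
        print(row)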
Edit2: If you MUST have a CSV with unequal rows, this will do the trick:
for index, row in enumerate(myData):
    try:
        if row[col_number] == var:
            return index
    except IndexError:
        pass
I have a dataset of the form:
user_id::item_id1::rating::timestamp
user_id::item_id2::rating::timestamp
user_id::item_id3::rating::timestamp
user_id::item_id4::rating::timestamp
I need the item_ids to be contiguous from 1 to n: there are n distinct item ids, the rows are guaranteed to be sorted by item_id, and consecutive rows may or may not share an item id. Currently the ids range from 1 to k, with k >> n.
My code below isn't quite correct, and I've been at it for a couple of hours, so I would really appreciate any help with it - or, if there is a simpler way to do this in Python, guidance on that as well.
I currently have the following code:
def reOrderItemIds(inputFile, outputFile):
    # This is a list in the range of 1 to 10681.
    itemIdsRange = set(range(1, 10682))
    # currKey = 1
    currKey = itemIdsRange.pop()
    lastContiguousKey = 1
    # currKey + 1
    contiguousKey = itemIdsRange.pop()
    f = open(inputFile)
    g = open(outputFile, "w")
    oldKeyToNewKeyMap = dict()
    for line in f:
        if int(line.split(":")[1]) == currKey and int(line.split(":")[1]) == lastContiguousKey:
            g.write(line)
        elif int(line.split(":")[1]) != currKey and int(line.split(":")[1]) != contiguousKey:
            oldKeyToNewKeyMap[line.split(":")[1]] = contiguousKey
            lastContiguousKey = contiguousKey
            # update current key to the value of the current key.
            currKey = int(line.split(":")[1])
            contiguousKey = itemIdsRange.pop()
            g.write(line.split(":")[0] + ":" + str(lastContiguousKey) + ":" + line.split(":")[2] + ":" + line.split(":")[3])
        elif int(line.split(":")[1]) == currKey and int(line.split(":")[1]) != contiguousKey:
            g.write(line.split(":")[0] + ":" + str(lastContiguousKey) + ":" + line.split(":")[2] + ":" + line.split(":")[3])
        elif int(line.split(":")[1]) != currKey and int(line.split(":")[1]) == contiguousKey:
            currKey = int(line.split(":")[1])
            lastContiguousKey = contiguousKey
            oldKeyToNewKeyMap[line.split(":")[1]] = lastContiguousKey
            contiguousKey = itemIdsRange.pop()
            g.write(line.split(":")[0] + ":" + str(lastContiguousKey) + ":" + line.split(":")[2] + ":" + line.split(":")[3])
    f.close()
    g.close()
Example:
1::1::3::100
10::1::5::104
20::2::3::110
1::5::2::104
I require the output to be of the form:
1::1::3::100
10::1::5::104
20::2::3::110
1::3::2::104
So only the item_id column changes; everything else remains the same.
Any help would be much appreciated!
Because your data is already sorted by item_id, you can use itertools.groupby(), which makes easy work of the solution (groupby merges runs of adjacent equal keys, so the guaranteed sort order is exactly what makes this valid):
from operator import itemgetter
from itertools import groupby

item_id = itemgetter(1)

def reOrderItemIds(inputFile, outputFile):
    n = 1
    with open(inputFile) as infile, open(outputFile, "w") as outfile:
        dataset = (line.split('::') for line in infile)
        for key, group in groupby(dataset, item_id):
            for line in group:
                line[1] = str(n)
                outfile.write('::'.join(line))
            n += 1
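A quick check against the four sample lines from the question (the file names here are made up):

with open("sample.txt", "w") as f:
    f.write("1::1::3::100\n10::1::5::104\n20::2::3::110\n1::5::2::104\n")

reOrderItemIds("sample.txt", "reordered.txt")
print(open("reordered.txt").read())
# 1::1::3::100
# 10::1::5::104
# 20::2::3::110
# 1::3::2::104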
With my apologies for grossly misreading your question the first time, suppose data is a file containing
1::1::3::100
10::1::5::104
20::2::3::110
30::5::3::121
40::9::7::118
50::10::2::104
(If your data cannot all be cast to integers, this could be modified.)
>>> with open('data', 'r') as datafile:
...     dataset = datafile.read().splitlines()
...
>>> ids = {0}
>>> for i, line in enumerate(dataset):
...     data = list(map(int, line.split('::')))
...     if data[1] not in ids:
...         data[1] = max(ids) + 1
...         ids.add(data[1])
...     dataset[i] = '::'.join((str(d) for d in data))
...
>>> print('\n'.join(dataset))
1::1::3::100
10::1::5::104
20::2::3::110
30::3::3::121
40::4::7::118
50::5::2::104
Again, if your dataset is large, faster solutions can be devised.
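In the same spirit, a simpler first-seen remapping sketch that streams the file once (the function and file handling here are my own, assuming the sorted order the question guarantees):

def reorder_item_ids(input_file, output_file):
    new_ids = {}  # maps each old item_id to the next contiguous id, assigned on first sight
    with open(input_file) as infile, open(output_file, 'w') as outfile:
        for line in infile:
            user, item, rating, ts = line.rstrip('\n').split('::')
            if item not in new_ids:
                new_ids[item] = str(len(new_ids) + 1)
            outfile.write('::'.join((user, new_ids[item], rating, ts)) + '\n')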