Input box to populate Excel formulas in a file - Python

I would like to create a Python script which would open a CSV (or XLS) file and, via an input box, let me paste an Excel formula into a specific row... then apply it to the rest of the empty rows in that column. To help visualize it, here is an example:
DATA, FORMULA
001, [here would be inserted the formula]
002, [here would be populated the amended formula]
003, [here would be populated the amended formula]
004, [here would be populated the amended formula]
So the idea is to have a script which shows an input box asking:
- from which row do you want to start? | answer = B2
- what formula do you want to populate there? | "=COUNTIF(A:A,A2)"
...and then it would put the formula in B2 and auto-populate the next cells B3, B4, B5 and B6, with the formula adjusted to each specific cell. The reason I want to do this is that I deal with large Excel files which very often crash on my computer, so I would like to do it without running Excel itself.
I did some research and xlwt probably is not capable of this. Could you please help me find a solution? I would appreciate any ideas and guidance.

Unfortunately what you want to do can't be done without implementing a part of the spreadsheet program (Excel) in your code. There are no shortcuts there.
As for the file format, Python can deal natively with CSV files, but I think you'd have trouble importing raw formulas (as opposed to numeric or textual content) from CSV into Excel itself.
Since you are already into Python, maybe it would be a better idea to move your logic from the spreadsheet into the program: use Excel or another spreadsheet program to input your data (just the numbers), and use your script not to modify the sheet but to perform the calculations you need - perhaps storing the data in a SQL database (Python's built-in SQLite will perform nicely for a single-user app like this one) - and output just the calculated numbers to a spreadsheet file, or even generate your intended charts directly from Python using matplotlib.
That said, what you are asking can be done from Python - but it might lead to more and more complications in your general workflow as your dataset grows.
Here are some helper functions that convert between the Excel cell naming convention and numeric indices, and back, so that you have numbers to operate on in the Python program.
Parsing a typed-in formula to extract the cell addresses is no easy deal, however
(rendering them back into the formula after the numeric indices are adjusted should be a lot easier). I'd suggest you hard-code your formula in the script instead of allowing the input of any possible formula.
def parse_num(address):
    # "B2" -> 1  (zero-based row index)
    digits = ""
    for ch in address:
        if ch.isdigit():
            digits += ch
    return int(digits) - 1

def parse_col(address):
    # "B2" -> 1  (zero-based column index; also handles multi-letter columns like "AA")
    x = 0
    for ch in address:
        if ch.isdigit():
            break
        x = x * 26 + (ord(ch.upper()) - ord("A") + 1)
    return x - 1

def render_address(col, row):
    # (1, 1) -> "B2"  (zero-based column and row back to Excel notation)
    col_letters = ""
    col += 1
    while col:
        col, rem = divmod(col - 1, 26)
        col_letters = chr(rem + ord("A")) + col_letters
    return col_letters + str(row + 1)
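For instance (these example calls are mine, not part of the original answer), the round trip for the cells in the question looks like this:
print(parse_col("B2"))       # -> 1 (column B, zero-based)
print(parse_num("B2"))       # -> 1 (row 2, zero-based)
print(render_address(1, 1))  # -> "B2"
print(render_address(1, 4))  # -> "B5", the adjusted address a few rows down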
Now, if you are willing to do your work in Python, just have your data input as CSV and use a small Python program to get your results, instead of trying to fit them into a spreadsheet. For the formula above, COUNTIF(A:A,A2), you basically want to count how many other rows have the same value in the first column as this row - and for 750,000 data positions that is a piece of cake in Python. It only starts to get tougher if the data won't fit in RAM, but that would happen at around 100 million data points on a 2 GB machine; at that point you can still fit everything in RAM by resorting to specialized structures, and above that you would need a bit more logic, which would be a few lines long using SQLite as I mentioned above.
Now, here is the code which, given a CSV file with one column of data, produces a second CSV file where an extra column contains the total number of occurrences of the value in the first column:
import csv
from collections import Counter

data_count = Counter()
with open("data.csv", "rt") as input_file:
    reader = csv.reader(input_file)
    # skip header:
    next(reader)
    for row in reader:
        data_count[int(row[0])] += 1

# everything is accounted for now - output the result:
with open("data.csv", "rt") as input_file, open("counted_data.csv", "wt") as output_file:
    reader = csv.reader(input_file)
    writer = csv.writer(output_file)
    header = next(reader)
    header.append("Count")
    writer.writerow(header)
    for row in reader:
        writer.writerow(row + [str(data_count[int(row[0])])])
And that is only if you really need all of the first column, in order, in the final file. If all you want is the count for each number in column 1, regardless of the order in which they appear, you just need the data in data_count after the first block - you can play with it interactively at the Python prompt and get, in fractions of a second, results that would take tens of minutes in a spreadsheet program.
If you have datasets that don't fit in memory, you can drop them into a database with a script simpler than this one, and you will still have your results in a fraction of a second.
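As a minimal sketch of that SQLite route (the database, table and column names below are my own assumptions, not from the original question):
import csv
import sqlite3

conn = sqlite3.connect("data.db")
conn.execute("CREATE TABLE IF NOT EXISTS data (value INTEGER)")
with open("data.csv", "rt") as input_file:
    reader = csv.reader(input_file)
    next(reader)  # skip header
    conn.executemany("INSERT INTO data (value) VALUES (?)",
                     ([int(row[0])] for row in reader))
conn.commit()

# the COUNTIF-style result, computed by the database instead of in RAM:
for value, count in conn.execute("SELECT value, COUNT(*) FROM data GROUP BY value"):
    print(value, count)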

Related

Historical database number formatting

Currently I am working with a historical database (in MS Access) containing passages of ships through the Sound (the strait between Denmark and Sweden).
I am having problems with the way the amounts of products on board ships were recorded. This generally takes the following forms:
12 1/15 (integer - space - fraction)
1/4 (fraction)
1 (integer)
I'd like to convert all these numbers to floats/decimals in order to do some calculations. There are some additional challenges, mainly caused by the lack of uniform input:
- not all rows have a value
- some rows have the value '-'; I'd like to skip these
- some rows contain '*' when a number or part of a number is missing; these can be skipped too
My first question is: is there a way I could convert this directly in Access SQL? I have not been able to find anything, but perhaps I overlooked something.
The second option I attempted is to export the table (called cargo), use Python to convert the values and then import the table again. I have a function to convert the three standard formats:
from fractions import Fraction
import pandas
import numpy

def fracToString(number):
    conversionResult = float(sum(Fraction(s) for s in number.split()))
    return conversionResult

df = pandas.read_csv('cargo.csv', usecols=[0, 5], header=None, names=['id_passage', 'amount'])
df['amountDecimal'] = df['amount'].dropna().apply(fracToString)
This works for empty rows; however, values containing '*' or '-' or other characters that the fracToString function can't handle raise a ValueError. Since these are just a couple of records out of over 4 million, they can be omitted. Is there a way to tell pandas.apply() to just skip to the next row if the fracToString function throws a ValueError?
Thank you in advance,
Alex
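One common way to handle this (a sketch of my own, not part of the original post) is to catch the error inside the conversion function and return NaN, which pandas simply carries through; this reuses the df from the snippet above:
from fractions import Fraction
import numpy

def fracToStringSafe(number):
    # rows with '-', '*' or other unparsable content become NaN instead of raising
    try:
        return float(sum(Fraction(s) for s in number.split()))
    except (ValueError, ZeroDivisionError):
        return numpy.nan

df['amountDecimal'] = df['amount'].dropna().apply(fracToStringSafe)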

How can one convert an uncertainty expression (e.g. 3.23 +/- 0.01) from a string to a float?

I was taking some data from a .csv file and placing it into a dict within my Python script when I noticed a discrepancy in one of the columns that contained uncertainty values (e.g. 3.23 +/- 0.01). After a new table was built and the results were exported to Excel, this column would not sort numerically: only the very first value was treated like a number, while the rest of the values were treated as if they were expressions.
I suspect this might have to do with the fact that, when I first read the .csv file, it was read with 'rU' (universal newlines) instead of 'rb' (binary). I did this because the original +/- symbol in the .csv file was not being read properly. So after the .csv file was read in, it had ' \xb1 ' as a placeholder for the +/- symbol, which I subsequently replaced with ' +/- '.
import csv
import re

folder_contents = {}
with open("greencandidates.csv", "rU") as csvfile:
    green = csv.reader(csvfile, dialect='excel')
    for line in green:
        candidate_number = line[0]
        fluorescence = line[1].replace(" \xb1 ", " +/- ")
        folder_contents[candidate_number] = [fluorescence]
However, given that there is a lot of data that gets processed from the original .csv file, I really would like to be able to sort the data in descending order (largest to smallest). Although there is a module that allows for the creation of uncertainty expressions (https://pythonhosted.org/uncertainties/), I'm not sure how to use it to make the uncertainty expressions be treated as floats that can be arranged in descending order. Below is how uncertainty expressions can be created with the uncertainties package.
from uncertainties import ufloat
x = ufloat(1, 0.1) # x = 1+/-0.1
Use a key function in your sort, such as:
def u_float_key(num):
    return float(num.split('+')[0])
Then you can use the built-in sorted even with strings:
sorted(results, key=u_float_key, reverse=True)
>>> test = ["1+/-1", "0.2+/-0", "4+/-2", "3+/-100"]
>>> sorted(test, key=u_float_key)
['0.2+/-0', '1+/-1', '3+/-100', '4+/-2']
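If you would rather go through the uncertainties package itself, here is a sketch of my own (assuming the strings look like "3.23+/-0.01"): ufloat_fromstr parses them, and the nominal value works as the sort key.
from uncertainties import ufloat_fromstr

test = ["1+/-1", "0.2+/-0", "4+/-2", "3+/-100"]
print(sorted(test, key=lambda s: ufloat_fromstr(s).nominal_value, reverse=True))
# ['4+/-2', '3+/-100', '1+/-1', '0.2+/-0']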

Going out of memory for python dictionary when the numbers are integer

I have a Python script that is supposed to read large files into a dictionary in memory and do some operations. What puzzles me is that it goes out of memory in only one case: when the values in the file are integers...
The structure of my file is like this:
string value_1 .... value_n
The files I have vary in size from 2 GB to 40 GB. I have 50 GB of memory to read the file into. When I have something like this:
string 0.001334 0.001473 -0.001277 -0.001093 0.000456 0.001007 0.000314 ... with n=100 and the number of rows equal to 10M, I am able to read it into memory relatively fast. The file size is about 10 GB. However, when I have string 4 -2 3 1 1 1 ... with the same dimension (n=100) and the same number of rows, I am not able to read it into memory.
for line in f:
    tokens = line.strip().split()
    if len(tokens) <= 5:  # ignore w2v first line
        continue
    word = tokens[0]
    number_of_columns = len(tokens) - 1
    features = {}
    for dim, val in enumerate(tokens[1:]):
        val = float(val)
        features[dim] = val
    matrix[word] = features
This results in Killed in the second case, while it works in the first case.
I know this does not answer the question specifically, but it probably offers a better solution to the problem you are trying to solve:
May I suggest you use pandas for this kind of work?
It seems a lot more appropriate for what you're trying to do. http://pandas.pydata.org/index.html
import pandas as pd
pd.read_csv('file.txt', sep=' ', skiprows=1)
Then do all your manipulations.
pandas is a package designed specifically to handle large datasets and process them. It has tons of useful features that you will probably end up needing if you're dealing with big data.
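For example (a sketch under my own assumptions about the file layout, not from the original answer), the string keys and the numeric block can be kept separately, with the numbers stored in one compact float32 array instead of a dict of dicts:
import numpy as np
import pandas as pd

df = pd.read_csv("file.txt", sep=" ", skiprows=1, header=None)
words = df.iloc[:, 0].tolist()                      # the string keys
vectors = df.iloc[:, 1:].values.astype(np.float32)  # one contiguous block: ~4 GB for 10M rows x 100 columns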

Project Euler #13 in Python, trying to find smart solution

I'm trying to solve problem 13 from Project Euler, and I'm trying to make the solution beautiful (at least, not ugly). The only "ugly" thing I do is pre-format the input and keep it in the solution file (for some technical reasons, and because I want to concentrate on the numeric part of the problem).
The problem is "Work out the first ten digits of the sum of the following one-hundred 50-digit numbers."
I wrote some code that should work, as far as I know, but it gives the wrong result. I've checked the input several times and it seems to be OK...
nums=[37107287533902102798797998220837590246510135740250,
46376937677490009712648124896970078050417018260538,
74324986199524741059474233309513058123726617309629,
91942213363574161572522430563301811072406154908250,
23067588207539346171171980310421047513778063246676,
89261670696623633820136378418383684178734361726757,
28112879812849979408065481931592621691275889832738,
44274228917432520321923589422876796487670272189318,
47451445736001306439091167216856844588711603153276,
70386486105843025439939619828917593665686757934951,
62176457141856560629502157223196586755079324193331,
64906352462741904929101432445813822663347944758178,
92575867718337217661963751590579239728245598838407,
58203565325359399008402633568948830189458628227828,
80181199384826282014278194139940567587151170094390,
35398664372827112653829987240784473053190104293586,
86515506006295864861532075273371959191420517255829,
71693888707715466499115593487603532921714970056938,
54370070576826684624621495650076471787294438377604,
53282654108756828443191190634694037855217779295145,
36123272525000296071075082563815656710885258350721,
45876576172410976447339110607218265236877223636045,
17423706905851860660448207621209813287860733969412,
81142660418086830619328460811191061556940512689692,
51934325451728388641918047049293215058642563049483,
62467221648435076201727918039944693004732956340691,
15732444386908125794514089057706229429197107928209,
55037687525678773091862540744969844508330393682126,
18336384825330154686196124348767681297534375946515,
80386287592878490201521685554828717201219257766954,
78182833757993103614740356856449095527097864797581,
16726320100436897842553539920931837441497806860984,
48403098129077791799088218795327364475675590848030,
87086987551392711854517078544161852424320693150332,
59959406895756536782107074926966537676326235447210,
69793950679652694742597709739166693763042633987085,
41052684708299085211399427365734116182760315001271,
65378607361501080857009149939512557028198746004375,
35829035317434717326932123578154982629742552737307,
94953759765105305946966067683156574377167401875275,
88902802571733229619176668713819931811048770190271,
25267680276078003013678680992525463401061632866526,
36270218540497705585629946580636237993140746255962,
24074486908231174977792365466257246923322810917141,
91430288197103288597806669760892938638285025333403,
34413065578016127815921815005561868836468420090470,
23053081172816430487623791969842487255036638784583,
11487696932154902810424020138335124462181441773470,
63783299490636259666498587618221225225512486764533,
67720186971698544312419572409913959008952310058822,
95548255300263520781532296796249481641953868218774,
76085327132285723110424803456124867697064507995236,
37774242535411291684276865538926205024910326572967,
23701913275725675285653248258265463092207058596522,
29798860272258331913126375147341994889534765745501,
18495701454879288984856827726077713721403798879715,
38298203783031473527721580348144513491373226651381,
34829543829199918180278916522431027392251122869539,
40957953066405232632538044100059654939159879593635,
29746152185502371307642255121183693803580388584903,
41698116222072977186158236678424689157993532961922,
62467957194401269043877107275048102390895523597457,
23189706772547915061505504953922979530901129967519,
86188088225875314529584099251203829009407770775672,
11306739708304724483816533873502340845647058077308,
82959174767140363198008187129011875491310547126581,
97623331044818386269515456334926366572897563400500,
42846280183517070527831839425882145521227251250327,
55121603546981200581762165212827652751691296897789,
32238195734329339946437501907836945765883352399886,
75506164965184775180738168837861091527357929701337,
62177842752192623401942399639168044983993173312731,
32924185707147349566916674687634660915035914677504,
99518671430235219628894890102423325116913619626622,
73267460800591547471830798392868535206946944540724,
76841822524674417161514036427982273348055556214818,
97142617910342598647204516893989422179826088076852,
87783646182799346313767754307809363333018982642090,
10848802521674670883215120185883543223812876952786,
71329612474782464538636993009049310363619763878039,
62184073572399794223406235393808339651327408011116,
66627891981488087797941876876144230030984490851411,
60661826293682836764744779239180335110989069790714,
85786944089552990653640447425576083659976645795096,
66024396409905389607120198219976047599490197230297,
64913982680032973156037120041377903785566085089252,
16730939319872750275468906903707539413042652315011,
94809377245048795150954100921645863754710598436791,
78639167021187492431995700641917969777599028300699,
15368713711936614952811305876380278410754449733078,
40789923115535562561142322423255033685442488917353,
44889911501440648020369068063960672322193204149535,
41503128880339536053299340368006977710650566631954,
81234880673210146739058568557934581403627822703280,
82616570773948327592232845941706525094512325230608,
22918802058777319719839450180888072429661980811197,
77158542502016545090413245809786882778948721859617,
72107838435069186155435662884062257473692284509516,
20849603980134001723930671666823555245252804609722,
53503534226472524250874054075591789781264330331690]
result_sum = []
tmp_sum = 0
for j in xrange(50):
    for i in xrange(100):
        tmp_sum += nums[i] % 10
        nums[i] = nums[i] / 10
    result_sum.insert(0, int(tmp_sum % 10))
    tmp_sum = tmp_sum / 10
for i in xrange(10):
    print result_sum[i]
Your code works by adding all the numbers in nums the way a person would: column by column. It does not work because when you sum the far-left column, you treat it like every other column. When people get to the far-left column, they write down the entire sum. So this line
result_sum.insert(0,int(tmp_sum % 10))
doesn't work for the far-left column; you need to insert something else into result_sum in that case. I would post the code, but 1) I'm sure you don't need it, and 2) it's against the Project Euler tag rules. If you would like, I can email it to you, but I'm sure that won't be necessary.
You could save the numbers in a file (with a number on each line), and read from it:
nums = []
with open('numbers.txt', 'r') as f:
    for num in f:
        nums.append(int(num))
# nums is now populated with all of the numbers, so do your actual algorithm
Also, it looks like you want to store the sum as an array of digits. The cool thing about Python is that it automatically handles large integers. Here is a quote from the docs:
Plain integers (also just called integers) are implemented using long in C, which gives them at least 32 bits of precision (sys.maxint is always set to the maximum plain integer value for the current platform, the minimum value is -sys.maxint - 1). Long integers have unlimited precision.
So using an array of digits isn't really necessary if you are working with Python. In C, it is another story...
Also, regarding your code, you need to factor in the digits in tmp_sum, which contains your carry-over digits. You can add them into result_sum like this:
while tmp_sum:
    result_sum.insert(0, int(tmp_sum % 10))
    tmp_sum /= 10
This will fix your issue. Here, it works.
Since you already have all the numbers in a list, you should be able to take the sum of them pretty easily. Then you just need to take the first ten digits of the sum. I won't put any code here, though.
As simple as this:
values.txt will contain all the numbers, one per line.
nums = []
with open("values.txt", 'r') as f:
    for num in f:
        nums.append(int(num))

print(str(sum(nums))[:10])
Just as easy is storing it in a CSV and using pandas:
def foo():
    import pandas as pd
    table = pd.read_csv("data.txt", header=None, usecols=[0])
and then iterate through the pandas DataFrame (still inside foo):
    sum = 0
    for x in range(len(table)):
        sum += int(table[0][x])
    return str(sum)[:10]
Just keep in mind that Python handles the large integers for you.

From a text file to Matrix Market format

I am working in Python and I have a matrix stored in a text file. The text file is arranged in the following format:
row_id, col_id
row_id, col_id
...
row_id, col_id
row_id and col_id are integers and they take values from 0 to n (in order to know n for row_id and col_id I have to scan the entire file first).
There's no header, and row_ids and col_ids appear multiple times in the file, but each combination row_id, col_id appears only once. There is no explicit value for each row_id, col_id combination; each cell value is actually 1. The file is almost 1 gigabyte in size.
Unfortunately the file is difficult to handle in memory: it contains 2257205 row_ids and 122905 col_ids, for 26622704 elements. So I was looking for better ways to handle it. The Matrix Market format could be a way to deal with it.
Is there a fast and memory-efficient way to convert this file into a file in Matrix Market format (http://math.nist.gov/MatrixMarket/formats.html#mtx) using Python?
There is a fast and memory-efficient way of handling such matrices: using the sparse matrices offered by SciPy (which is the de facto standard in Python for this kind of thing).
For a matrix of size N by N:
from scipy.sparse import lil_matrix

result = lil_matrix((N, N))  # In order to save memory, one may add: dtype=bool, or dtype=numpy.int8

with open('matrix.csv') as input_file:
    for line in input_file:
        x, y = map(int, line.split(',', 1))  # The "1" is only here to speed the splitting up
        result[x, y] = 1
(or, in one line instead of two: result[tuple(map(int, line.split(',', 1)))] = 1).
The argument 1 given to split() is just here to speed things up when parsing the coordinates: it instructs Python to stop parsing the line when the first (and only) comma is found. This can matter some, since you are reading a 1 GB file.
Depending on your needs, you might find one of the other six sparse matrix representations offered by SciPy to be better suited.
If you want a faster but also more memory-consuming array, you can use result = numpy.array(…) (with NumPy) instead.
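To go from the sparse matrix to an actual Matrix Market file, which is what the question asks for, scipy.io.mmwrite can do the writing (a short sketch of mine; the output file name is arbitrary):
from scipy import io

io.mmwrite("matrix.mtx", result.tocoo())  # result is the lil_matrix filled above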
Unless I am missing something...
The Matrix Market (MM) format is a header line with the dimensions followed by "row col value" lines. Since you already have the rows and cols and all values are 1, simply append the value and that should be it.
Wouldn't it be easier to simply use sed as in
n=`wc -l < file`
echo "2257205 122905 $n" > file.mm
cat file | sed -e 's/,/ /' -e 's/$/ 1/' >> file.mm
That should work if your coordinates are one-based. If they are zero-based, you should add 1 to each coordinate: simply read the coordinates, add one to each of them, and print coordx, coordy, "1". You can do that from the shell, from Awk or from Python with very little effort.
Q&D code (untested, produced just as a hint, YMMV and you may want to preprocess the file to compute some values):
In the shell
echo "2257205 122905 $n"
cat file | while read x,y ; do x=$((x+1)); y=$((y+1)); echo "$x $y 1" ; done
In Python, more or less...
f = open("file")
lines = f.readlines()
print 2257205, 122905, len(lines)
for l in lines:
    (x, y) = l.split(',')
    x = int(x) + 1
    y = int(y) + 1
    print x, y, 1
Or am I missing something?
