Historical database number formatting - python

Currently I am working with a historical database (in MS Access) containing passages of ships through the Sound (the strait between Denmark and Sweden).
I am having problems with the way amounts of products on board ships were recorded. This generally takes one of the following forms:
12 1/15 (integer - space - fraction)
1/4 (fraction)
1 (integer)
I'd like to convert all these numbers to floats/decimal, in order to do some calculations. There are some additional challenges which are mainly caused by the lack of uniform input:
- not all rows have a value
- some rows have the value '-'; I'd like to skip these
- some rows contain '*' when a number or part of a number is missing; these can be skipped too
My first question is: Is there a way I could directly convert this in Access SQL? I have not been able to find anything but perhaps I overlooked something.
The second option I attempted is to export the table (called cargo), use python to convert the values, and then import the converted table again. I have a function to convert the three standard formats:
from fractions import Fraction
import pandas
import numpy

def fracToString(number):
    conversionResult = float(sum(Fraction(s) for s in number.split()))
    return conversionResult
df = pandas.read_csv('cargo.csv', usecols = [0,5], header = None, names = ['id_passage', 'amount'])
df['amountDecimal'] = df['amount'].dropna().apply(fracToString)
This works for empty rows; however, values containing '*' or '-' or other characters that the fracToString function can't handle raise a ValueError. Since these are just a couple of records out of over 4 million, they can be omitted. Is there a way to tell pandas.apply() to just skip to the next row if fracToString throws a ValueError?
Thank you in advance,
Alex
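A minimal sketch of one way to do that (my illustration, not from the thread): catch the ValueError inside the conversion function and return None, which pandas stores as NaN, so the bad records simply drop out of later calculations. The name fracToStringSafe is hypothetical:

from fractions import Fraction

def fracToStringSafe(number):
    # hypothetical variant of fracToString: returns None for '-', '*'
    # and other unparseable values, which pandas records as NaN
    try:
        return float(sum(Fraction(s) for s in number.split()))
    except (ValueError, ZeroDivisionError):
        return None

df['amountDecimal'] = df['amount'].dropna().apply(fracToStringSafe)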


How to count cells containing numbers in specific range with cells that contain both text and numbers

I thought I could easily sort this issue out but it took me ages to solve just half of it.
I have a table that contains 100 data cells in a row. Each cell contains either text only or text and numbers (see the layout at the bottom).
I need a function that COUNTs how many cells in the table report a value of N2 OR E >= 37.
Negative
Positive (N2: 23, E: 23)
Negative
Positive (N2: 37, E: 26)
Positive (N2: 31, E: 38)
Expected function answer: 2
So far I could only extract each N2 number with a function [=MID(A2,15,FIND(",",A2)-15)] that starts from the 15th character; then a second function counts how many of the extracted numbers (placed in column B) are >=37 [=COUNTIF(B2:B100, ">=37")]. But I have no clue how to take the E value into account.
In addition, I would like the function to count cells where either the N2 value OR the E value is >=37.
Is there a way to have one big function that does all that? And is there a way to avoid relying on KUTools for Excel?
If you have the newest version of Excel, you can use FILTERXML after making some minor changes: first concatenate the whole range with CONCAT, then eliminate all the ","s and replace the ")"s with spaces in the concatenated string.
For example, the formula below gets you all the instances >= 37 (if you only want the count, wrap it in COUNT):
=FILTERXML("<t><s>"&SUBSTITUTE(
SUBSTITUTE(SUBSTITUTE(CONCAT($F$2:$F$7), ")", " "), ",", ""), " ",
"</s><s>")&"</s></t>", "//s[number()>=37]")
EDIT: Thanks @MarkBalhoff for catching a missing space in the formula and @JvdV for giving another way with:
=IFERROR(COUNT(FILTERXML("<t><s>"&SUBSTITUTE(TEXTJOIN(" ",,F2:F6)," ","</s><s>")&"</s></t>","//s[translate(.,',','')*1>=37 or translate(following::*[2],')','')*1>=37]")),0)
Since you include the python tag and also reference KUTools, I assume you have some familiarity with VBA.
You could easily, and flexibly, implement the logic in Excel VBA using regular expressions.
For this function, I allowed three arguments:
- the range to search
- the threshold for the values
- a list of values to look for
In the regex, the pattern looks for the digits that follow either of the strings in searchFor. Note that, as written, you need to include the colons in the searchFor strings, and that the strings are case-sensitive (easily changed).
Option Explicit

Function CountVals(r As Range, Threshold As Long, ParamArray searchFor() As Variant) As Long
    Dim RE As Object, MC As Object, M As Object
    Dim counter As Long
    Dim vSrc As Variant, v As Variant
    Dim sPat As String

    'read range into variant array for fastest processing
    vSrc = r

    'create pattern
    sPat = "(?:" & Join(searchFor, "|") & ")\s*(\d+)"

    'initialize regex
    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Global = True
        .IgnoreCase = False 'or change to True if capitalization is not important
        .Pattern = sPat

        counter = 0
        'check each string for the values
        For Each v In vSrc
            Set MC = .Execute(v)
            For Each M In MC
                If CLng(M.SubMatches(0)) >= Threshold Then counter = counter + 1
            Next M
        Next v

        CountVals = counter
    End With
End Function
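For the sample layout above, a worksheet call along these lines (the range reference is illustrative; adjust it to wherever your data actually sits) should return 2:

=CountVals(A2:A6, 37, "N2:", "E:")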

Add leading 0 to column of data in data frame

Existing data in the data frame has dropped the leading zeros, which are required as part of the part number. The stored numbers need to be 8 digits long, and they now vary in length because of the removal of the leading 0's.
Sorry, I am new to Python and this may be a built-in function in pandas, but I have not found a way to convert this type of formatting.
I have over 2000 part numbers to convert overall.
E.g.:
Part No
9069
38661
90705
9070
907
970206
Part number needs to be:
Part No
00009069
00038661
00090705
00009070
00000907
00970206
Use astype before using zfill, as follows:
df['Part'].astype(str).str.zfill(8)
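A quick demonstration of that answer applied to the question's data (the column name 'Part No' is taken from the sample above; the assignment back to the column is my addition):

import pandas as pd

df = pd.DataFrame({'Part No': [9069, 38661, 90705, 9070, 907, 970206]})
df['Part No'] = df['Part No'].astype(str).str.zfill(8)
print(df['Part No'].tolist())
# ['00009069', '00038661', '00090705', '00009070', '00000907', '00970206']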

Fastest way to compare two huge csv files in python (numpy)

I am trying to find the intersecting subset between two pretty big csv files of phone numbers (one has 600k rows and the other has 300 million). I am currently using pandas to open both files, converting the needed columns into 1d numpy arrays, and then using numpy's intersect to get the intersection. Is there a better way of doing this, either with python or any other method? Thanks for any help.
import pandas as pd
import numpy as np
df_dnc = pd.read_csv('dncTest.csv', names = ['phone'])
df_test = pd.read_csv('phoneTest.csv', names = ['phone'])
dnc_phone = df_dnc['phone']
test_phone = df_test['phone']
np.intersect1d(dnc_phone, test_phone)
I will give you a general solution with some Python pseudo-code. What you are trying to solve here is the classical problem from the book "Programming Pearls" by Jon Bentley.
This is solved very efficiently with just a simple bit array, hence my comment asking how long (how many digits) the phone numbers are.
Let's say the phone number is at most 10 digits long; then the max phone number you can have is 9 999 999 999 (spaces are used for better readability). Here we can use 1 bit per number to indicate whether the number is in the set or not (bit set or not set, respectively), so we are going to use 9 999 999 999 bits, one per possible number, i.e.:
bits[0] identifies the number 0 000 000 000
bits[193] identifies the number 0 000 000 193
the number 659-234-4567 would be addressed by bits[6592344567]
Doing so we'd need to pre-allocate 9 999 999 999 bits initially set to 0, which is: 9 999 999 999 / 8 / 1024 / 1024 = around 1.2 GB of memory.
I think that holding the intersection of numbers at the end will use far less space than the bit representation: at most 600k ints will be stored, i.e. 64 bits * 600k is around 4.6 MB (actually ints are not stored that efficiently and might use much more); if these are strings, you'll probably end up with even higher memory requirements.
Parsing a phone-number string from the CSV file (line by line or with a buffered file reader), converting it to a number, and then doing a constant-time memory lookup will IMO be faster than dealing with strings and merging them. Unfortunately, I don't have these phone number files to test, but I would be interested to hear your findings.
from bitstring import BitArray

max_number = 9999999999
found_phone_numbers = BitArray(length=max_number + 1)

# replace this function with one that opens the file and yields
# each phone number found in it
def number_from_file_iterator(dummy_data):
    for number in dummy_data:
        yield number

def calculate_intersect():
    # should open file1 and iterate over the numbers from it;
    # we use dummy data here
    for number in number_from_file_iterator([1, 25, 77, 224322323, 8292, 1232422]):
        found_phone_numbers[number] = True
    # open the second file and check whether each number is there
    for number in number_from_file_iterator([4, 24, 224322323, 1232422, max_number]):
        if found_phone_numbers[number]:
            yield number

number_intersection = set(calculate_intersect())
print(number_intersection)
I used BitArray from the bitstring pip package, and it needed around 2 secs to initialize the entire bit string. Afterwards, scanning the files will use constant memory. At the end I used a set to store the items.
Note 1: This algorithm can be modified to just use a list. In that case, in the second loop, as soon as a number's bit matches, that bit must be reset, so that duplicates do not match again.
Note 2: Storing in the set/list occurs lazily, because we use a generator in the second for loop. Runtime complexity is linear, i.e. O(N).
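A sketch of the Note 1 variant (my illustration, not the author's code): collect matches into a list and clear each bit on the first hit, so a duplicate number in the second file cannot match twice:

matches = []
for number in number_from_file_iterator([4, 24, 224322323, 1232422, max_number]):
    if found_phone_numbers[number]:
        found_phone_numbers[number] = False  # reset so a duplicate won't match again
        matches.append(number)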
Read the 600k phone numbers into a set.
Input the larger file row by row, checking each row against the set.
Write matches to an output file immediately.
That way you don't have to load all the data in memory at once.
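A minimal sketch of that approach, reusing the file names from the question ('matches.csv' is an assumed output name, and I assume the phone number is in the first column):

import csv

# load the small file (600k numbers) into a set
with open('dncTest.csv', newline='') as f:
    dnc = {row[0] for row in csv.reader(f)}

# stream the big file row by row and write matches immediately
with open('phoneTest.csv', newline='') as fin, \
     open('matches.csv', 'w', newline='') as fout:
    writer = csv.writer(fout)
    for row in csv.reader(fin):
        if row[0] in dnc:
            writer.writerow(row)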

How can one convert an uncertainty expression (e.g. 3.23 +/- 0.01) from a string to a float?

I was taking some data from a .csv file and placing it into a dict within my Python script when I noticed a discrepancy in one of the columns, which contained values with uncertainty (e.g. 3.23 +/- 0.01). After a new table was built and the results were exported to Excel, this column would not sort numerically: only the very first value was treated like a number, while the rest of the values were treated as if they were an expression.
I suspect this might have to do with the fact that, when I first read the .csv file, I read it with 'rU' (read with universal newlines, instead of 'rb' for read binary). I did this since the original +/- symbol in the .csv file was not being read properly. So after the .csv file was read in, it had ' \xb1 ' as a placeholder for the +/- symbol, which I subsequently replaced with ' +/- '.
import csv
import re

folder_contents = {}
with open("greencandidates.csv", "rU") as csvfile:
    green = csv.reader(csvfile, dialect='excel')
    for line in green:
        candidate_number = line[0]
        fluorescence = line[1].replace(" \xb1 ", " +/- ")
        folder_contents[candidate_number] = [fluorescence]
However, given that there is a lot of data processed from the original .csv file, I really would like to be able to sort the data in descending order (largest to smallest). Although there is a module that allows for the creation of uncertainty expressions (https://pythonhosted.org/uncertainties/), I'm not sure how to use it to make the uncertainty expressions be treated as floats that can be arranged in descending order. I posted below a way in which uncertainty expressions can be created with the uncertainties package.
from uncertainties import ufloat
x = ufloat(1, 0.1) # x = 1+/-0.1
Use a key function in your sort, such as:
def u_float_key(num):
    return float(num.split('+')[0])
Then you can use the built-in sorted even with strings:
sorted(results, key=u_float_key, reverse=True)
>>> test = ["1+/-1", "0.2+/-0", "4+/-2", "3+/-100"]
>>> sorted(test, key=u_float_key)
['0.2+/-0', '1+/-1', '3+/-100', '4+/-2']
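If you want real uncertainty objects rather than plain floats, the uncertainties package mentioned in the question can also parse these strings directly. A sketch (I strip the spaces first, since I am not certain the parser accepts the ' +/- ' spacing used in the question):

from uncertainties import ufloat_fromstr

values = ["3.23 +/- 0.01", "1 +/- 1", "0.2 +/- 0"]
parsed = [ufloat_fromstr(v.replace(" ", "")) for v in values]
# sort by the nominal value, descending
parsed.sort(key=lambda u: u.nominal_value, reverse=True)
print(parsed)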

Input box to populate Excel formulas in a file

I would like to create a Python script which would open the csv (or xls) file and, with an input box, let me copy and paste an Excel formula into a specific row, then apply it to the rest of the empty rows in that column. To help visualize it, here is an example:
DATA, FORMULA
001, [here would be inserted the formula]
002, [here would be populated the amended formula]
003, [here would be populated the amended formula]
004, [here would be populated the amended formula]
So, the idea is to have a script which would give me input boxes asking:
- from which row do you want to start? | answer = B2
- what formula do you want to populate there? | "=COUNTIF(A:A,A2)"
...and then it will populate the formula in cell B2 and auto-populate the next B3, B4, B5 and B6, where the formula is adjusted to the specific cell. The reason why I want to do this is that I deal with large Excel files which very often crash on my computer, so I would like to execute this without running Excel itself.
I did some research and xlwt probably is not capable of doing this. Could you please help me find a solution for how I should do this? I would appreciate any ideas and guidance from you.
Unfortunately what you want to do can't be done without implementing a part of the spreadsheet program (Excel) in your code. There are no shortcuts there.
As for the file format, Python can deal natively with CSV files, but I think you'd have trouble importing raw formulas (as opposed to numeric or textual content) from CSV into Excel itself.
Since you are already into Python, maybe it would be a better idea to move your logic from the spreadsheet into the program: use Excel or another spreadsheet program to input your data, just the numbers, and use your script not to modify the sheet but to perform the calculations you need, maybe storing the data in a SQL database (Python's built-in SQLite will perform nicely for a single-user app like this one), and output just the calculated numbers to a spreadsheet file, or maybe generate your intended charts directly from Python using matplotlib.
That said, what you are asking can be done from Python, but it might lead to more and more complications in your general workflow as your dataset grows.
Here, these helper functions will allow you to convert from the Excel cell-naming convention to numbers and vice versa, so that you have the numeric indices with which to operate in the Python program.
Parsing the typed-in formula to extract the cell addresses is no easy deal, however (rendering them back into the formula after the numeric indices are adjusted should be a lot easier). I'd suggest you hard-code your formula in the script instead of allowing the input of any possible formula.
def parse_num(address):
    # "B12" -> 11  (0-based row index)
    digits = ""
    for ch in address:
        if ch.isdigit():
            digits += ch
    return int(digits) - 1

def parse_col(address):
    # "B12" -> 1, "AA3" -> 26  (0-based column index)
    x = 0
    for ch in address:
        if ch.isdigit():
            break
        x = x * 26 + (ord(ch.upper()) - ord("A")) + 1
    return x - 1

def render_address(col, row):
    # (1, 11) -> "B12"  (the inverse of the two functions above)
    col_letters = ""
    n = col + 1  # bijective base 26: "A" is 1, "Z" is 26, "AA" is 27
    while n > 0:
        n, rem = divmod(n - 1, 26)
        col_letters = chr(rem + ord("A")) + col_letters
    return col_letters + str(row + 1)
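A quick illustration of the round trip (my example values, not from the original answer):

address = "B2"
col, row = parse_col(address), parse_num(address)  # (1, 1)
print(render_address(col, row))                    # prints "B2" again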
Now, if you are willing to do your work in Python, just have your data input as CSV and use a small Python program to get your results, instead of trying to fit them in a spreadsheet. For the formula above, COUNTIF(A:A,A2), you basically want to count how many other rows have the same first-column value as this row. For 750,000 data positions, that is a piece of cake in Python. (It starts to get tougher if all the data won't fit in RAM, but that would happen at around 100 million data points on a 2 GB machine; at that point you can still fit everything in RAM by resorting to specialized structures. Above that, it would start to need some more logic, which would be a few lines long using SQLite, as I mentioned above.)
Now, here is the code that, given a CSV file with one column of data, produces a second CSV file where an added column contains the total number of occurrences of the value in the first column:
import csv
from collections import Counter

data_count = Counter()
with open("data.csv", "rt") as input_file:
    reader = csv.reader(input_file)
    next(reader)  # skip header
    for row in reader:
        data_count[int(row[0])] += 1

# everything is accounted for now - output the result:
with open("data.csv", "rt") as input_file, open("counted_data.csv", "wt") as output_file:
    reader = csv.reader(input_file)
    writer = csv.writer(output_file)
    header = next(reader)
    header.append("Count")
    writer.writerow(header)
    for row in reader:
        writer.writerow(row + [str(data_count[int(row[0])])])
And that is only if you really need all of the first column, in order, in the final file. If all you want are the counts for each number in column 1, regardless of the order in which they appear, you just need the data in data_count after the first block, and you can play with that interactively at the Python prompt, getting in fractions of a second results that would take tens of minutes in a spreadsheet program.
If you have datasets that don't fit in memory, you can drop them into a database with a script simpler than this one, and you will still have your results in a fraction of a second.
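A sketch of that database variant (my illustration; the file, database, and table names are assumptions), using Python's built-in sqlite3 to do the counting with a GROUP BY instead of holding a Counter in memory:

import csv
import sqlite3

con = sqlite3.connect("counts.db")
con.execute("CREATE TABLE IF NOT EXISTS data (value INTEGER)")

# stream the CSV into the table without loading it all at once
with open("data.csv", "rt") as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    con.executemany("INSERT INTO data VALUES (?)", ([int(r[0])] for r in reader))
con.commit()

# let the database do the counting
for value, count in con.execute("SELECT value, COUNT(*) FROM data GROUP BY value"):
    print(value, count)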
