Extracting integer prices from different variants of budgets from Wikipedia - python

I'm trying to use Python to call an API and clean a bunch of strings that represent a movie budget.
So far, I have the following 6 variants of data that come up.
"$1.2 million"
"$1,433,333"
"US$ 2 million"
"US$1,644,736 (est.)"
"$6-7 million"
"£3 million"
So far, I've only gotten variants 1 and 2 parsed without a problem, using the code below. What is the best way to handle all of the other cases, or a general case that may not be listed here?
def clean_budget_string(input_string):
    number_to_integer = {'million' : 1000000, 'thousand' : 1000}
    budget_parts = input_string.split(' ')
    #Currently, only indices 0 and 1 are necessary for computation
    text_part = budget_parts[1]
    if text_part in number_to_integer:
        number = budget_parts[0].lstrip('$')
        int_representation = number_to_integer[text_part]
        return int(float(number) * int_representation)
    else:
        number = budget_parts[0]
        idx_dollar = 0
        for idx in xrange(len(number)):
            if number[idx] == '$':
                idx_dollar = idx
        return int(number[idx_dollar+1:].replace(',', ''))

The way I would approach a parsing task like this (and I'm happy to hear other opinions) would be to break up your function into several parts, each of which identifies a single piece of information in the input string.
For instance, I'd start by identifying what float number can be parsed from the string, ignoring currency and order of magnitude (a million, a thousand) for now:
f = float(''.join([c for c in input_str if c in '0123456789.']))
(You might want to add error handling for when you end up with a trailing dot because of additions like 'est.')
Then, in a second step, you determine whether the float needs to be multiplied to adjust for the correct order of magnitude. One way of doing this would be with multiple if-statements:
import math
if 'million' in input_str:
    oom = 6
elif 'thousand' in input_str:
    oom = 3
else:
    oom = 0  # no multiplier: 10**0 == 1
# adjust number for order of magnitude
f = f * math.pow(10, oom)
Those checks could of course be improved to account for small differences in formatting by using regular expressions.
Finally, you separately determine the currency mentioned in your input string, again using one or more if-statements:
if '£' in input_str:
    currency = 'GBP'
else:
    currency = 'USD'
Now the one case that this doesn't yet handle is the dash one where lower and upper estimates are given. One way of making the function work with these inputs is to split the initial input string on the dash and use the first (or second) of the substrings as input for the initial float parsing. So we would replace our first line of code with something like this:
if '-' in input_str:
    lower = input_str.split('-')[0]
    f = float(''.join([c for c in lower if c in '0123456789.']))
else:
    f = float(''.join([c for c in input_str if c in '0123456789.']))
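Putting the steps above together, a minimal sketch could look like the following. The function name parse_budget and the returned (amount, currency) tuple are assumptions made for illustration, not part of the question. Ranges such as "$6-7 million" are reduced to their lower bound, and a regular expression is swapped in for the character filter so that the trailing dot in "(est.)" is never picked up.

```python
import re

# Multipliers for the order-of-magnitude words seen in the sample data.
MULTIPLIERS = {'thousand': 10**3, 'million': 10**6}


def parse_budget(input_str):
    """Return (integer amount, currency code) for a budget string.

    Hypothetical helper: ranges like '$6-7 million' use the lower bound.
    """
    # Step 1: the first number in the string (digits, commas, optional decimals).
    match = re.search(r'\d[\d,]*(?:\.\d+)?', input_str)
    if match is None:
        raise ValueError('no number found in %r' % input_str)
    value = float(match.group().replace(',', ''))

    # Step 2: order of magnitude.
    multiplier = 1
    for word, factor in MULTIPLIERS.items():
        if word in input_str.lower():
            multiplier = factor
            break

    # Step 3: currency; anything without a pound sign is assumed to be USD.
    currency = 'GBP' if '£' in input_str else 'USD'

    return int(round(value * multiplier)), currency


for s in ['$1.2 million', '$1,433,333', 'US$ 2 million',
          'US$1,644,736 (est.)', '$6-7 million', '£3 million']:
    print(s, '->', parse_budget(s))
```

Returning a currency code rather than the raw symbol is a design choice of this sketch; any currency other than GBP would need its own check.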

Using a regex and the string replace method; I also return the currency, in case it is needed.
Modify it accordingly to handle more input formats or multipliers such as billion.
import re
# take in string and return integer amount and currency
def clean_budget_string(s):
    mult_dict = {'million':1000000,'thousand':1000}
    tmp = re.search(r'(^\D*?)\s*((?:\d+\.?,?)+)(?:-\d+)?\s*((?:million|thousand)?)', s).groups()
    currency = tmp[0]
    mult = tmp[-1]
    tmp_int = ''.join(tmp[1:-1]).replace(',', '') # join the digit group(s), remove commas
    tmp_int = int(float(tmp_int) * mult_dict.get(mult, 1))
    return tmp_int, currency
>>> clean_budget_string("$1.2 million")
(1200000, '$')
>>> clean_budget_string("$1,433,333")
(1433333, '$')
>>> clean_budget_string("US$ 2 million")
(2000000, 'US$')
>>> clean_budget_string("US$1,644,736 (est.)")
(1644736, 'US$')
>>> clean_budget_string("$6-7 million")
(6000000, '$')
>>> clean_budget_string("£3 million")
(3000000, '£') # my script doesn't recognize the £ char; the encoding might need to be set properly

Related

search for two identical characters and filter them on python

How can this task be carried out more efficiently?
I receive data in various forms.
Example data: test/123, test, test/, test/123/
I need to write the correct data to my database, but first I need to find it. The correct entries are in the test/123/ format,
and they then need to be split into variables:
a = test
b = 123
How can this be done?
The data can be of any size.
This works for data consisting of one or more word characters ("test" in the example above) followed by one or more digits ("123" in the example above):
import re
data = 'test/123/'
m = re.match(r'(\w+)/(\d+)/', data)
if m:
    a = m[1]
    b = m[2]
    print(f'Found a = {a}, b = {b}')
else:
    print('no match')
For further tweaks to the pattern, refer e.g. to Python's Regular Expressions HOWTO.
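Note that re.match only anchors at the start of the string, so an input such as test/123/extra would also be accepted by the pattern above. If the whole string has to have the test/123/ shape, one option is re.fullmatch. A minimal sketch; the helper name split_record is hypothetical:

```python
import re

# One or more word characters, a slash, one or more digits, a closing slash.
RECORD = re.compile(r'(\w+)/(\d+)/')


def split_record(data):
    """Return (a, b) for strings shaped like 'test/123/', else None."""
    m = RECORD.fullmatch(data)
    if m is None:
        return None
    return m.group(1), m.group(2)


for sample in ['test/123', 'test', 'test/', 'test/123/', 'test/123/extra']:
    print(repr(sample), '->', split_record(sample))
```

Only 'test/123/' yields a pair here; the other samples return None and can be skipped before writing to the database.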

python - combine 2 time formats

def convert(time):
    pos = ["s","m","h","d"]
    time_dict = {"s": 1,"m": 60,"h": 3600,"d": 24*3600 }
    unit = time[-1]
    if unit not in pos:
        return -1
    try:
        timeVal = int(time[:-1])
    except:
        return -2
    return timeVal*time_dict[unit]
Currently, this is my code, and I'm using it to translate strings like 5d or 30m to seconds. That works, but if I try to combine them (like 5d 30m), it gives me the output -2. I don't really see what's wrong here.
Your problem is that you're only checking the last character. You need to parse the string to find each individual group and then work from that:
import re
def convert(time):
    time_dict = {"s": 1,"m": 60,"h": 3600,"d": 24*3600 }
    regex_groups = re.findall(r"(\d+)([smhd])", time)
    return sum(int(x) * time_dict[y] for x,y in regex_groups)
"I don't really see what's wrong here."
Let's say you provide 5d 30m as input: [:-1] drops the last character, which results in 5d 30. You then try to convert that to int, which fails because d is not allowed in an integer representation.
You first need to split the input into tokens, then convert each token to a value in seconds, and then sum them. A simplified example with h and m only:
def to_seconds(token):
    q = {"h":3600,"m":60}
    return int(token[:-1])*q[token[-1]]
def convert(time):
    return sum(to_seconds(i) for i in time.split())
print(convert("5h 30m"))
output
19800
Disclaimer: this solution assumes that the elements are separated by whitespace.
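Both approaches above are lenient in different ways: re.findall silently skips anything it does not recognise, while the split() version fails with an exception on an unknown unit. If invalid strings should be rejected explicitly, one possibility is to validate the whole string first. This is a sketch under two assumptions: tokens may be separated by optional whitespace, and raising ValueError is acceptable (the original code returns negative error codes instead).

```python
import re

SECONDS_PER_UNIT = {"s": 1, "m": 60, "h": 3600, "d": 24 * 3600}

# The whole string must consist of <digits><unit> tokens, optionally
# separated by whitespace.
DURATION = re.compile(r"\s*(?:\d+[smhd]\s*)+")
TOKEN = re.compile(r"(\d+)([smhd])")


def convert(time):
    """Convert strings such as '5d 30m' to a number of seconds."""
    if not DURATION.fullmatch(time):
        raise ValueError("invalid duration: %r" % time)
    return sum(int(n) * SECONDS_PER_UNIT[u] for n, u in TOKEN.findall(time))


print(convert("5d 30m"))
print(convert("30m"))
```

With this variant an input like "5x 30m" raises ValueError instead of being partially converted.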

How to solve a Linear System of Equations in Python When the Coefficients are Unknown (but still real numbers)

I'm not a programmer, so go easy on me please! I have a system of 4 linear equations in 4 unknowns, which I think I could use Python to solve relatively easily. However, my equations are not of the form " 5x+2y+z-w=0 "; instead I have algebraic constants c_i whose explicit numerical values I don't know. For example, " c_1 x + c_2 y + c_3 z+ c_4w=c_5 " would be one of my four equations. So does a solver exist which gives answers for x,y,z,w in terms of the c_i?
Numpy has a function for this exact problem: numpy.linalg.solve
To construct the matrix we first need to digest the string turning it into an array of coefficients and solutions.
Finding Numbers
First we need to write a function that takes a string like "c_1 3" and returns the number 3.0. Depending on the format you want for your input string, you can either iterate over all characters in the string and stop when you find a non-digit character, or you can simply split on the space and parse the second part. Here are both solutions:
def find_number(sub_expr):
    """
    Finds the number from the format
    number*string or numberstring.
    Example:
    3x -> 3
    4*x -> 4
    """
    num_str = str()
    for char in sub_expr:
        if char.isdigit():
            num_str += char
        else:
            break
    return float(num_str)
or the simpler solution
def find_number(sub_expr):
    """
    Returns the number from the format "string number"
    """
    return float(sub_expr.split()[1])
Note: See edits
Get matrices
Now we can use that to split each expression at the "=" into two parts: the equation and the solution. The equation is then split into sub-expressions at each "+". This way we would turn the string "3x+4y = 3" into
sub_expressions = ["3x", "4y"]
solution_string = "3"
Each sub-expression then needs to be fed into our find_number function. The end result can be appended to the coefficient and solution matrices:
def get_matrices(expressions):
    """
    Returns coefficient_matrix and solutions from array of string-expressions.
    """
    coefficient_matrix = list()
    solutions = list()
    last_len = -1
    for expression in expressions:
        # Note: In this solution all coefficients must be explicitly noted and must always be in the same order.
        # Could be solved with dicts but is probably overengineered.
        if not "=" in expression:
            print(f"Invalid expression {expression}. Missing \"=\"")
            return False
        try:
            c_string, s_string = expression.split("=")
            c_strings = c_string.split("+")
            solutions.append(float(s_string))
            current_len = len(c_strings)
            if last_len != -1 and current_len != last_len:
                print(f"The expression {expression} has a mismatching number of coefficients")
                return False
            last_len = current_len
            coefficients = list()
            for c_string in c_strings:
                coefficients.append(find_number(c_string))
            coefficient_matrix.append(coefficients)
        except Exception as e:
            # report the expression being parsed when the error occurred
            print(f"An unexpected Runtime Error occurred at {expression}")
            print(e)
            exit()
    return coefficient_matrix, solutions
Now let's write a simple main function to test this code:
# This is not the code you want to copy-paste
# Look further down.
from sys import argv as args
def main():
    expressions = args[1:]
    matrix, solutions = get_matrices(expressions)
    for row in matrix:
        print(row)
    print("")
    print(solutions)
if __name__ == "__main__":
    main()
Let's run the program in the console!
user:$ python3 solve.py 2x+3y=4 3x+3y=2
[2.0, 3.0]
[3.0, 3.0]
[4.0, 2.0]
You can see that the program identified all our numbers correctly
AGAIN: use the find_number function appropriate for your format
Put The Pieces Together
These Matrices now just need to be pumped directly into the numpy function:
# This is the main you want
from sys import argv as args
from numpy.linalg import solve as solve_linalg
def main():
    expressions = args[1:]
    matrix, solutions = get_matrices(expressions)
    coefficients = solve_linalg(matrix, solutions)
    print(coefficients)
# This bit needs to be at the very bottom of your code to load all functions first.
# You could just paste the main-code here, but this is considered best-practice
if __name__ == '__main__':
    main()
Now let's test that:
$ python3 solve.py x*2+y*4+z*0=20 x*1+y*1+z*-1=3 x*2+y*2+z*-3=3
[2. 4. 3.]
As you can see, the program now solves the system of equations for us.
Out of curiosity: Math homework? This feels like math homework.
Edit: I had a typo, "c_string" instead of "c_strings", which worked in all tests out of pure and utter luck.
Edit 2: Upon further inspection I would recommend splitting the sub-expressions at a "*":
def find_number(sub_expr):
    """
    Returns the number from the format "string*number"
    """
    return float(sub_expr.split("*")[1])
This results in fairly readable input strings.
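Each find_number variant above accepts a single input format, and the first one stops at anything other than a digit, so signs and decimal points are lost. A regular expression can pick the coefficient out of 3x, 4*x and x*-1.5 alike. This is only a sketch: it assumes that variable names contain no digits (so it would not work for names such as c_1) and that every term carries an explicit coefficient.

```python
import re

# Optional sign, digits, optional decimal part.
NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def find_number(sub_expr):
    """Return the first number in a term such as '3x', '4*x' or 'x*-1.5'."""
    match = NUMBER.search(sub_expr)
    if match is None:
        raise ValueError(f"no coefficient found in {sub_expr!r}")
    return float(match.group())


for term in ["3x", "4*x", "x*-1.5", "-2.5y"]:
    print(term, "->", find_number(term))
```

Because it has the same name and signature, this version could be dropped in place of either find_number above without changing get_matrices.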

Sorting with two digits in string - Python

I am new to Python and I have a hard time solving this.
I am trying to sort a list to be able to human sort it 1) by the first number and 2) the second number. I would like to have something like this:
'1-1bird'
'1-1mouse'
'1-1nmouses'
'1-2mouse'
'1-2nmouses'
'1-3bird'
'10-1birds'
(...)
Those numbers can be from 1 to 99 ex: 99-99bird is possible.
This is the code I have after a couple of headaches. Being able to then sort by the following first letter would be a bonus.
Here is what I've tried:
#!/usr/bin/python
myList = list()
myList = ['1-10bird', '1-10mouse', '1-10nmouses', '1-10person', '1-10cat', '1-11bird', '1-11mouse', '1-11nmouses', '1-11person', '1-11cat', '1-12bird', '1-12mouse', '1-12nmouses', '1-12person', '1-13mouse', '1-13nmouses', '1-13person', '1-14bird', '1-14mouse', '1-14nmouses', '1-14person', '1-14cat', '1-15cat', '1-1bird', '1-1mouse', '1-1nmouses', '1-1person', '1-1cat', '1-2bird', '1-2mouse', '1-2nmouses', '1-2person', '1-2cat', '1-3bird', '1-3mouse', '1-3nmouses', '1-3person', '1-3cat', '2-14cat', '2-15cat', '2-16cat', '2-1bird', '2-1mouse', '2-1nmouses', '2-1person', '2-1cat', '2-2bird', '2-2mouse', '2-2nmouses', '2-2person']
def mysort(x,y):
    x1=""
    y1=""
    for myletter in x :
        if myletter.isdigit() or "-" in myletter:
            x1=x1+myletter
    x1 = x1.split("-")
    for myletter in y :
        if myletter.isdigit() or "-" in myletter:
            y1=y1+myletter
    y1 = y1.split("-")
    if x1[0]>y1[0]:
        return 1
    elif x1[0]==y1[0]:
        if x1[1]>y1[1]:
            return 1
        elif x1==y1:
            return 0
        else :
            return -1
    else :
        return -1
myList.sort(mysort)
print myList
Thanks !
Martin
You have some good ideas with splitting on '-' and using isalpha() and isdigit(), but then we'll use those to create a function that takes in an item and returns a "clean" version of the item, which can be easily sorted. It will create a three-digit, zero-padded representation of the first number, then a similar thing with the second number, then the "word" portion (instead of just the first character). The result looks something like "001001bird" (that won't display - it'll just be used internally). The built-in function sorted() will use this callback function as a key, taking each element, passing it to the callback, and basing the sort order on the returned value. In the test, I use the * operator and the sep argument to print it without needing to construct a loop, but looping is perfectly fine as well.
def callback(item):
    phrase = item.split('-')
    first = phrase[0].rjust(3, '0')
    second = ''.join(filter(str.isdigit, phrase[1])).rjust(3, '0')
    word = ''.join(filter(str.isalpha, phrase[1]))
    return first + second + word
Test:
>>> myList = ['1-10bird', '1-10mouse', '1-10nmouses', '1-10person', '1-10cat', '1-11bird', '1-11mouse', '1-11nmouses', '1-11person', '1-11cat', '1-12bird', '1-12mouse', '1-12nmouses', '1-12person', '1-13mouse', '1-13nmouses', '1-13person', '1-14bird', '1-14mouse', '1-14nmouses', '1-14person', '1-14cat', '1-15cat', '1-1bird', '1-1mouse', '1-1nmouses', '1-1person', '1-1cat', '1-2bird', '1-2mouse', '1-2nmouses', '1-2person', '1-2cat', '1-3bird', '1-3mouse', '1-3nmouses', '1-3person', '1-3cat', '2-14cat', '2-15cat', '2-16cat', '2-1bird', '2-1mouse', '2-1nmouses', '2-1person', '2-1cat', '2-2bird', '2-2mouse', '2-2nmouses', '2-2person']
>>> print(*sorted(myList, key=callback), sep='\n')
1-1bird
1-1cat
1-1mouse
1-1nmouses
1-1person
1-2bird
1-2cat
1-2mouse
1-2nmouses
1-2person
1-3bird
1-3cat
1-3mouse
1-3nmouses
1-3person
1-10bird
1-10cat
1-10mouse
1-10nmouses
1-10person
1-11bird
1-11cat
1-11mouse
1-11nmouses
1-11person
1-12bird
1-12mouse
1-12nmouses
1-12person
1-13mouse
1-13nmouses
1-13person
1-14bird
1-14cat
1-14mouse
1-14nmouses
1-14person
1-15cat
2-1bird
2-1cat
2-1mouse
2-1nmouses
2-1person
2-2bird
2-2mouse
2-2nmouses
2-2person
2-14cat
2-15cat
2-16cat
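An alternative to building a zero-padded string is to return a tuple from the key function: Python compares tuples element by element, so the two numbers can be compared as integers of any width. A minimal sketch, assuming every item consists of a number, a dash, a second number and a trailing word (as in 1-10bird); the name natural_key is hypothetical:

```python
import re

# first number, dash, second number, then whatever text follows
ITEM = re.compile(r"(\d+)-(\d+)(.*)")


def natural_key(item):
    """Sort key: first number, second number, then the trailing word."""
    first, second, word = ITEM.fullmatch(item).groups()
    return int(first), int(second), word


sample = ['1-10bird', '1-2mouse', '10-1birds', '1-1mouse', '1-1bird', '2-14cat']
print(sorted(sample, key=natural_key))
```

Since no padding width is fixed, this key would also cope with numbers above 999, which the three-digit padding would not.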
You need leading zeros. Strings are sorted character by character, which gives a different order from sorting by numeric value. It should be
'01-1bird'
'01-1mouse'
'01-1nmouses'
'01-2mouse'
'01-2nmouses'
'01-3bird'
'10-1birds'
As you can see, 1 comes after 0.
The other answers here are very respectable, I'm sure, but for full credit you should ensure that your answer fits on a single line and uses as many list comprehensions as possible:
import itertools
[''.join(r) for r in sorted([[''.join(x) for _, x in
itertools.groupby(v, key=str.isdigit)]
for v in myList], key=lambda v: (int(v[0]), int(v[2]), v[3]))]
That should do nicely:
['1-1bird',
'1-1cat',
'1-1mouse',
'1-1nmouses',
'1-1person',
'1-2bird',
'1-2cat',
'1-2mouse',
...
'2-2person',
'2-14cat',
'2-15cat',
'2-16cat']

I need to change a zip code into a series of dots and dashes (a barcode), but I can't figure out how

Here's what I've got so far:
def encodeFive(zip):
    zero = "||:::"
    one = ":::||"
    two = "::|:|"
    three = "::||:"
    four = ":|::|"
    five = ":|:|:"
    six = ":||::"
    seven = "|:::|"
    eight = "|::|:"
    nine = "|:|::"
    codeList = [zero,one,two,three,four,five,six,seven,eight,nine]
    allCodes = zero+one+two+three+four+five+six+seven+eight+nine
    code = ""
    digits = str(zip)
    for i in digits:
        code = code + i
    return code
With this I get the original zip code back as a string, but none of the digits are encoded into the barcode. I've figured out how to encode one digit, but it won't work the same way with five digits.
codeList = ["||:::", ":::||", "::|:|", "::||:", ":|::|",
":|:|:", ":||::", "|:::|", "|::|:", "|:|::" ]
barcode = "".join(codeList[int(digit)] for digit in str(zipcode))
Perhaps use a dictionary:
barcode = {'0':"||:::",
           '1':":::||",
           '2':"::|:|",
           '3':"::||:",
           '4':":|::|",
           '5':":|:|:",
           '6':":||::",
           '7':"|:::|",
           '8':"|::|:",
           '9':"|:|::",
           }
def encodeFive(zipcode):
    return ''.join(barcode[n] for n in str(zipcode))
print(encodeFive(72353))
# |:::|::|:|::||::|:|:::||:
PS. It is better not to name a variable zip, since doing so shadows the builtin function zip. Similarly, it is better to avoid naming a variable code, since code is a module in the standard library.
You're just adding i (the character in digits) to the string where I think you want to be adding codeList[int(i)].
The code would probably be much simpler by just using a dict for lookups.
I find it easier to use split() to create lists of strings:
codes = "||::: :::|| ::|:| ::||: :|::| :|:|: :||:: |:::| |::|: |:|::".split()
def zipencode(numstr):
    return ''.join(codes[int(x)] for x in str(numstr))
print(zipencode("32345"))
This is written in Python.
number = ["||:::",
          ":::||",
          "::|:|",
          "::||:",
          ":|::|",
          ":|:|:",
          ":||::",
          "|:::|",
          "|::|:",
          "|:|::"
          ]
def encode(num):
    return ''.join(map(lambda x: number[int(x)], str(num)))
print(encode(32345))
I don't know what language you are using, so I made an example in C#:
int zip = 72353;
string[] codeList = {
    "||:::", ":::||", "::|:|", "::||:", ":|::|",
    ":|:|:", ":||::", "|:::|", "|::|:", "|:|::"
};
string code = String.Empty;
while (zip > 0) {
    code = codeList[zip % 10] + code;
    zip /= 10;
}
return code;
Note: Instead of converting the zip code to a string and then converting each character back to a number, I calculated the digits numerically.
Just for fun, here's a one-liner:
return String.Concat(zip.ToString().Select(c => "||::::::||::|:|::||::|::|:|:|::||::|:::||::|:|:|::".Substring(((c-'0') % 10) * 5, 5)).ToArray());
It appears you're trying to generate a "postnet" barcode. Note that the five-digit ZIP postnet barcodes were obsoleted by ZIP+4 postnet barcodes, which were obsoleted by ZIP+4+2 delivery point postnet barcodes, all of which are supposed to include a checksum digit and leading and ending framing bars. In any case, all of those forms are being obsoleted by the new "intelligent mail" 4-state barcodes, which require a lot of computational code to generate and no longer rely on straight digit-to-bars mappings. Search USPS.COM for more details.
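For illustration, the framing bars and check digit mentioned above can be added to the digit mapping used in the earlier answers. This sketch assumes the usual POSTNET rule that the check digit brings the digit sum up to a multiple of 10, and that the barcode starts and ends with one full bar (|); check this against the USPS documentation before relying on it. The function name postnet is hypothetical, and the ZIP code is taken as a string so that leading zeros are preserved.

```python
CODES = ["||:::", ":::||", "::|:|", "::||:", ":|::|",
         ":|:|:", ":||::", "|:::|", "|::|:", "|:|::"]


def postnet(zipcode):
    """Encode a string of digits as frame bar + digits + check digit + frame bar."""
    digits = [int(ch) for ch in zipcode if ch.isdigit()]  # ignores '-' in ZIP+4
    check = (10 - sum(digits) % 10) % 10  # assumed rule: digit sum becomes a multiple of 10
    bars = "".join(CODES[d] for d in digits + [check])
    return "|" + bars + "|"


print(postnet("72353"))
print(postnet("02134"))
```

For a five-digit ZIP code this produces 32 characters: two frame bars plus six digits of five bars each.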
