I have the following table:
+---------+------------+----------------+
| IRR     | Price List | Cambridge Data |
+=========+============+================+
| '1.56%' | '0'        | '6/30/1989'    |
+---------+------------+----------------+
| '5.17%' | '100'      | '9/30/1989'    |
+---------+------------+----------------+
| '4.44%' | '0'        | '12/31/1990'   |
+---------+------------+----------------+
I'm trying to write a calculator that updates the Price List field by making a simple calculation. The logic is basically this:
previous price * (1 + IRR%)
So for the last row, the calculation would be: 100 * (1 + 4.44%) = 104.44
Since I'm using petl, I'm trying to figure out how to update a field using the value from the row above together with a value from the same row, and then populate this across the whole Price List column. I can't seem to find a useful petl utility for this. Should I just write a method manually? What do you guys think?
Try this:
# conversion can access other values from the same row
table = etl.convert(table, 'Price List',
                    lambda row: 100 * (1 + row.IRR),
                    pass_row=True)
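If a petl conversion turns out not to fit, a plain-Python pass over the rows is a reasonable fallback. Below is a minimal sketch of the running calculation, using only the last two rows of the sample table so the numbers line up with the 104.44 example; note the IRR strings need their '%' stripped before any arithmetic:

```python
# Sketch of the running price calculation without petl.
# Rows are (IRR, Price List, Cambridge Data); we seed the running
# price from the first row and update every following row.
rows = [
    ('5.17%', '100', '9/30/1989'),
    ('4.44%', '0', '12/31/1990'),
]

prices = []
prev_price = None
for irr, price, date in rows:
    irr_frac = float(irr.rstrip('%')) / 100      # '4.44%' -> 0.0444
    if prev_price is None:
        new_price = float(price)                 # first row: keep the given price
    else:
        new_price = prev_price * (1 + irr_frac)  # previous price * (1 + IRR%)
    prices.append(round(new_price, 2))
    prev_price = new_price

print(prices)
```

The computed list could then be pushed back into the Price List column in a second step.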
Given the following example dataframe:
advertiser_id | name    | amount      | total               | max_total_advertiser
4061          | source1 | -434.955284 | -354882.75336200005 | -355938.53950700007
4061          | source2 | -594.012216 | -355476.76557800005 | -355938.53950700007
4061          | source3 | -461.773929 | -355938.53950700007 | -355938.53950700007
I need to sum the amount and the max_total_advertiser fields in order to get the correct total value in each row, taking into account that I need this total value for every group partitioned by advertiser_id. (The total column in the initial dataframe is incorrect, which is why I want to recalculate it.)
Something like that should be:
w = Window.partitionBy("advertiser_id").orderBy("advertiser_id")
df.withColumn("total_aux",
              when(lag("advertiser_id").over(w) == col("advertiser_id"),
                   lag("total_aux").over(w) + col("amount")
              ).otherwise(col("max_total_advertiser") + col("amount")))
This lag("total_aux") is not working because the column has not been generated yet. That's what I want to achieve: if it is the first row in the group, sum the columns in the same row; if not, sum the previously obtained value with the current amount field.
Example output:
advertiser_id | name    | amount      | total_aux
4061          | source1 | -434.955284 | -356373.494791
4061          | source2 | -594.012216 | -356967.507007
4061          | source3 | -461.773929 | -357429.280936
Thanks.
I assume that name is a distinct value for each advertiser_id and your dataset is therefore sortable by name. I also assume that max_total_advertiser contains the same value for each advertiser_id. If one of those is not the case, please add a comment.
What you need is a rangeBetween window which gives you all preceding and following rows within the specified range. We will use Window.unboundedPreceding as we want to sum up all the previous values.
import pyspark.sql.functions as F
from pyspark.sql import Window
l = [
    (4061, 'source1', -434.955284, -354882.75336200005, -355938.53950700007),
    (4061, 'source2', -594.012216, -355476.76557800005, -345938.53950700007),
    (4062, 'source1', -594.012216, -355476.76557800005, -5938.53950700007),
    (4062, 'source2', -594.012216, -355476.76557800005, -5938.53950700007),
    (4061, 'source3', -461.773929, -355938.53950700007, -355938.53950700007)
]
columns = ['advertiser_id', 'name', 'amount', 'total', 'max_total_advertiser']
df = spark.createDataFrame(l, columns)
w = Window.partitionBy('advertiser_id').orderBy('name').rangeBetween(Window.unboundedPreceding, 0)
df = df.withColumn('total', F.sum('amount').over(w) + df.max_total_advertiser)
df.show()
Output:
+-------------+-------+-----------+-------------------+--------------------+
|advertiser_id| name| amount| total|max_total_advertiser|
+-------------+-------+-----------+-------------------+--------------------+
| 4062|source1|-594.012216|-6532.5517230000705| -5938.53950700007|
| 4062|source2|-594.012216| -7126.563939000071| -5938.53950700007|
| 4061|source1|-434.955284| -356373.4947910001| -355938.53950700007|
| 4061|source2|-594.012216| -346967.5070070001| -345938.53950700007|
| 4061|source3|-461.773929|-357429.28093600005| -355938.53950700007|
+-------------+-------+-----------+-------------------+--------------------+
You might be looking for the orderBy() function. Does this work?
from pyspark.sql.window import Window
from pyspark.sql import functions as F

df.withColumn("cumulativeSum",
              F.sum(df["amount"])
               .over(Window.partitionBy("advertiser_id").orderBy("amount")))
Hi, I have a rather simple task, but it seems like none of the online help is working.
I have a data set like this:
ID    | Px_1       | Px_2
theta | 106.013676 | 102.8024788702673
Rho   | 100.002818 | 102.62640389123405
gamma | 105.360589 | 107.21999706084836
Beta  | 106.133046 | 115.40449479551263
alpha | 106.821119 | 110.54312246081719
I want to find the min of each row in a fourth column, so the output for theta, for example, would be 102.802, because it is the min of Px_1 and Px_2.
I tried this, but it doesn't work; I constantly get the max value:
df_subset = read.set_index('ID')[['Px_1','Px_2']]
d = df_subset.min( axis=1)
Thanks
You can try this
df["min"] = df[["Px_1", "Px_2"]].min(axis=1)
Select the columns needed, here ["Px_1", "Px_2"], and perform the min operation row-wise with axis=1.
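For reference, here is a minimal, self-contained reproduction of that step with two of the sample rows from the question; the key point is that axis=1 takes the minimum across each row rather than down each column:

```python
import pandas as pd

# Two rows from the question's data set
df = pd.DataFrame({
    'ID': ['theta', 'Rho'],
    'Px_1': [106.013676, 100.002818],
    'Px_2': [102.8024788702673, 102.62640389123405],
}).set_index('ID')

# axis=1 -> row-wise minimum across the selected columns
df['min'] = df[['Px_1', 'Px_2']].min(axis=1)
print(df['min'])
```

If you were getting the max instead, double-check that axis=1 is actually being passed; min() with the default axis=0 reduces each column instead.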
I am writing Python code to show items in a store. As I am still learning, I want to know how to make a table which looks exactly like a table made using Texttable.
My code is
Goods = ['Book', 'Gold']
Itemid = [711001, 711002]
Price = [200, 50000]
Count = [100, 2]
Category = ['Books', 'Jewelry']
titles = ['', 'Item Id', 'Price', 'Count', 'Category']
data = [titles] + list(zip(Goods, Itemid, Price, Count, Category))
for i, d in enumerate(data):
    line = '|'.join(str(x).ljust(12) for x in d)
    print(line)
    if i == 0:
        print('=' * len(line))
My Output:
|Item Id |Price |Count |Category
================================================================
Book |711001 |200 |100 |Books
Gold |711002 |50000 |2 |Jewelry
Output I want:
+------+---------+-------+-------+-----------+
| | Item Id | Price | Count | Category |
+======+=========+=======+=======+===========+
| Book | 711001 | 200 | 100 | Books |
+------+---------+-------+-------+-----------+
| Gold | 711002 | 50000 | 2 | Jewelry |
+------+---------+-------+-------+-----------+
Your code is building your output by hand, using str.join(). You can do it that way, but it is very tedious. Use string formatting instead.
To help you along here is one line:
content_format = "| {Goods:4.4s} | {ItemId:<7d} | {Price:<5d} | {Count:<5d} | {Category:9s} |"
output_line = content_format.format(Goods="Book",ItemId=711001,Price=200,Count=100,Category="Books")
Texttable adjusts its cell widths to fit the data. If you want to do the same, then you will have to put computed field widths in content_format instead of using numeric literals the way I have done in the example above. Again, here is one example to get you going:
content_format = "| {Goods:4.4s} | {ItemId:<7d} | {Price:<5d} | {Count:<5d} | {Category:{CategoryWidth}s} |"
output_line = content_format.format(Goods="Book",ItemId=711001,Price=200,Count=100,Category="Books",CategoryWidth=9)
But if you already know how to do this using Texttable, why not use that? Your comment says it's not available in Python: not true, I just downloaded version 0.9.0 using pip.
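To make the computed-widths idea concrete, here is a short sketch (not Texttable itself) that derives each column's width from the data and prints a Texttable-style grid; the helper names rule and fmt are just illustrative:

```python
# Compute each column's width from the data, then print a
# Texttable-style grid by hand.
goods = ['Book', 'Gold']
item_ids = [711001, 711002]
prices = [200, 50000]
counts = [100, 2]
categories = ['Books', 'Jewelry']

header = ['', 'Item Id', 'Price', 'Count', 'Category']
rows = [header] + [list(map(str, r))
                   for r in zip(goods, item_ids, prices, counts, categories)]

# Width of each column = longest cell in that column
widths = [max(len(row[i]) for row in rows) for i in range(len(header))]

def rule(ch):
    """A horizontal border line like +------+---------+...+"""
    return '+' + '+'.join(ch * (w + 2) for w in widths) + '+'

def fmt(row):
    """One content line with cells left-justified to the column width."""
    return '| ' + ' | '.join(cell.ljust(w) for cell, w in zip(row, widths)) + ' |'

print(rule('-'))
print(fmt(rows[0]))
print(rule('='))       # Texttable uses '=' under the header row
for row in rows[1:]:
    print(fmt(row))
    print(rule('-'))
```

The exact padding differs slightly from Texttable's defaults, but the structure (borders, header rule, per-column widths) is the same.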
I'm attempting to output my database table data, which works aside from long table rows. The columns need to be as wide as the longest database row, and I'm having trouble implementing a calculation to output the table proportionally, instead of a huge mess when long rows are output (without using a third-party library, e.g. Print results in MySQL format with Python). Please let me know if you need more information.
Database connection:
connection = sqlite3.connect("test_.db")
c = connection.cursor()
c.execute("SELECT * FROM MyTable")
results = c.fetchall()
formatResults(results)
Table formatting:
def formatResults(x):
    try:
        widths = []
        columns = []
        tavnit = '|'
        separator = '+'
        for cd in c.description:
            widths.append(max(cd[2], len(cd[0])))
            columns.append(cd[0])
        for w in widths:
            tavnit += " %-" + "%ss |" % (w,)
            separator += '-' * w + '--+'
        print(separator)
        print(tavnit % tuple(columns))
        print(separator)
        for row in x:
            print(tavnit % row)
        print(separator)
        print ""
    except:
        showMainMenu()
        pass
Output problem example:
+------+------+---------+
| Date | Name | LinkOrFile |
+------+------+---------+
| 03-17-2016 | hi.com | Locky |
| 03-18-2016 | thisisitsqq.com | None |
| 03-19-2016 | http://ohiyoungbuyff.com\69.exe?1 | None |
| 03-20-2016 | http://thisisitsqq..com\69.exe?1 | None |
| 03-21-2016 | %Temp%\zgHRNzy\69.exe | None |
| 03-22-2016 | | None |
| 03-23-2016 | E52219D0DA33FDD856B2433D79D71AD6 | Downloader |
| 03-24-2016 | microsoft.com | None |
| 03-25-2016 | 89.248.166.132 | None |
| 03-26-2016 | http://89.248.166.131/55KB5js9dwPtx4= | None |
If your main problem is making column widths consistent across all the lines, this python package could do the job: https://pypi.python.org/pypi/tabulate
Below you find a very simple example of a possible formatting approach.
The key point is to find the largest length of each column and then use format method of the string object:
#!/usr/bin/python
import random
import string

def randomString(minLen=1, maxLen=10):
    """ Random string of length between 1 and 10 """
    l = random.randint(minLen, maxLen)
    return ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(l))

COLUMNS = 4

def randomTable():
    table = []
    for i in range(10):
        table.append([randomString() for j in range(COLUMNS)])
    return table

def findMaxColumnLengs(table):
    """ Returns tuple of max column lengs """
    maxLens = [0] * COLUMNS
    for l in table:
        lens = [len(s) for s in l]
        maxLens = [max(maxLens[e[0]], e[1]) for e in enumerate(lens)]
    return maxLens

if __name__ == '__main__':
    ll = randomTable()
    ml = findMaxColumnLengs(ll)
    # tuple of formatting statements, see format docs
    formatStrings = ["{:<%s}" % str(m) for m in ml]
    fmtStr = "|".join(formatStrings)
    print "=================================="
    for l in ll:
        print l
    print "=================================="
    for l in ll:
        print fmtStr.format(*l)
This prints the initial table packed in the list of lists and the formatted output.
==================================
['2U7Q', 'DZK8Z5XT', '7ZI0W', 'A9SH3V3U']
['P7SOY3RSZ1', 'X', 'Z2W', 'KF6']
['NO8IEY9A', '4FVGQHG', 'UGMJ', 'TT02X']
['9S43YM', 'JCUT0', 'W', 'KB']
['P43T', 'QG', '0VT9OZ0W', 'PF91F']
['2TEQG0H6A6', 'A4A', '4NZERXV', '6KMV22WVP0']
['JXOT', 'AK7', 'FNKUEL', 'P59DKB8']
['BTHJ', 'XVLZZ1Q3H', 'NQM16', 'IZBAF']
['G0EF21S', 'A0G', '8K9', 'RGOJJYH2P9']
['IJ', 'SRKL8TXXI', 'R', 'PSUZRR4LR']
==================================
2U7Q |DZK8Z5XT |7ZI0W |A9SH3V3U
P7SOY3RSZ1|X |Z2W |KF6
NO8IEY9A |4FVGQHG |UGMJ |TT02X
9S43YM |JCUT0 |W |KB
P43T |QG |0VT9OZ0W|PF91F
2TEQG0H6A6|A4A |4NZERXV |6KMV22WVP0
JXOT |AK7 |FNKUEL |P59DKB8
BTHJ |XVLZZ1Q3H|NQM16 |IZBAF
G0EF21S |A0G |8K9 |RGOJJYH2P9
IJ |SRKL8TXXI|R |PSUZRR4LR
The code that you used is for MySQL. The critical part is the line widths.append(max(cd[2], len(cd[0]))) where cd[2] gives the length of the longest data in that column. This works for MySQLdb.
However, you are using sqlite3, for which the value cd[2] is set to None:
https://docs.python.org/2/library/sqlite3.html#sqlite3.Cursor.description
Thus, you will need to replace the following logic:
for cd in c.description:
    widths.append(max(cd[2], len(cd[0])))
    columns.append(cd[0])
with your own. The rest of the code should be fine as long as widths is computed correctly.
The easiest way to get the widths variable correct would be to traverse each row of the result, find the max width of each column, and store it in widths. This is just some pseudo-code:
for cd in c.description:
    columns.append(cd[0])  # Get column headers

widths = [0] * len(c.description)  # Initialize to number of columns
for row in x:
    for i in range(len(row)):  # This assumes that row is an iterable, like list
        v = row[i]  # Take value of ith column
        widths[i] = max(len(str(v)), widths[i])  # Compare length of current value with the stored maximum
At the end of this, widths should contain the maximum length of each column.
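Putting that together, here is a runnable sketch of the corrected width computation against an in-memory sqlite3 database; the table name and sample rows are borrowed from the question, and str() is used so None and non-string values don't break len():

```python
import sqlite3

# Small in-memory database with a couple of rows from the question
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute("CREATE TABLE MyTable (Date TEXT, Name TEXT, LinkOrFile TEXT)")
c.executemany("INSERT INTO MyTable VALUES (?, ?, ?)", [
    ('03-17-2016', 'hi.com', 'Locky'),
    ('03-18-2016', 'thisisitsqq.com', None),
])
c.execute("SELECT * FROM MyTable")
rows = c.fetchall()

columns = [cd[0] for cd in c.description]
# Start from the header widths, then widen each column to fit its data
widths = [len(name) for name in columns]
for row in rows:
    for i, v in enumerate(row):
        widths[i] = max(widths[i], len(str(v)))  # str() handles None, ints, etc.

separator = '+' + '+'.join('-' * (w + 2) for w in widths) + '+'
fmt = '|' + '|'.join(' %%-%ds ' % w for w in widths) + '|'
print(separator)
print(fmt % tuple(columns))
print(separator)
for row in rows:
    print(fmt % tuple(str(v) for v in row))
print(separator)
```

The rest of the original formatResults logic can stay the same once widths is computed this way.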
I have two tables with the following structures. In table 1, the ID is next to the Name, while in table 2, the ID is next to Title 1. The one similarity between the two tables is that the first person always has the ID next to their name; they differ for subsequent people.
Table 1:
Name&Title    | ID #
----------------------
Random_Name 1 | 2000
Title_1_1     | -
Title_1_2     | -
Random_Name 2 | 2000
Title_2_1     | -
Title_2_2     | -
...           | ...
Table 2:
Name&Title    | ID #
----------------------
Random_Name 1 | 2000
Title_1_1     | -
Title_1_2     | -
Random_Name 2 | -
Title_2_1     | 2000
Title_2_2     | -
...           | ...
I have the code to recognize table 1 but struggle to incorporate structure 2. The table is stored as a nested list of rows (each row is a list). Usually, for one person there is only one row with the name but multiple rows of titles. The pseudo-code is this:
set count = 0
find the ID next to the first name, set it to be a recognizer
for row_i, row in enumerate(table):
    compare the ID of the next row until row[1] == recognizer is found
    set count = row_i
    slice the table to get the first person
The actual code is this:
header_ind = 0 # something related to the rest of the code
recognizer = data[header_ind+1][1]
count = header_ind+1
result = []
result.append(data[0]) #this append the headers
for i, row in enumerate(data[header_ind+2:]):
if i <= len(data[header_ind+4:]):
if row[1] and data[i+1+header_ind+2][1] is recognizer:
print data[i+header_ind+3]
one_person = data[count:i+header_ind+3]
result.append(one_person)
count = i+header_ind+3
else:
if i == len(data[header_ind+3:]):
last_person = data[count:i+header_ind+3]
result.append(last_person)
count = i+header_ind+3
I have been thinking about this for a while, so I just want to know whether it is possible to get an algorithm that incorporates Table 2, given that we cannot distinguish the name rows from the title rows.
Going to stick this here
So these are your inputs; the assumption is you are restricted to this...:
# Table 1
data1 = [['Name&Title', 'ID#'],
         ['Random_Name1', '2000'],
         ['Title_1_1', '-'],
         ['Title_1_2', '-'],
         ['Random_Name2', '2000'],
         ['Title_2_1', '-'],
         ['Title_2_2', '-']]

# Table 2
data2 = [['Name&Title', 'ID#'],
         ['Random_Name1', '2000'],
         ['Title_1_1', '-'],
         ['Title_1_2', '-'],
         ['Random_Name2', '-'],
         ['Title_2_1', '2000'],
         ['Title_2_2', '-']]
And this is your desired output:
for x in data:
    print x

['Name&Title', 'ID#']
[['Random_Name1', '2000'], ['Title_1_1', '-'], ['Title_1_2', '-']]
[['Random_Name2', '2000'], ['Title_2_1', '-'], ['Title_2_2', '-']]
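One sketch of the recognizer approach that produces exactly that grouping for the Table 1 layout (a new person starts whenever the ID column matches the recognizer again; Table 2 would still need some way to tell name rows from title rows, since there the ID can land on a title row):

```python
data1 = [['Name&Title', 'ID#'],
         ['Random_Name1', '2000'],
         ['Title_1_1', '-'],
         ['Title_1_2', '-'],
         ['Random_Name2', '2000'],
         ['Title_2_1', '-'],
         ['Title_2_2', '-']]

recognizer = data1[1][1]       # the ID next to the first name
result = [data1[0]]            # keep the header row
person = []
for row in data1[1:]:
    if row[1] == recognizer and person:
        result.append(person)  # a new person starts: flush the previous one
        person = []
    person.append(row)
result.append(person)          # flush the last person

for x in result:
    print(x)
```

Note the comparison uses ==, not is; is compares object identity and only happens to work for some interned strings.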