I'm looking to add a footer to my PrettyTable, totalling the data stored in the rows above. I've created a count in the script, but I'd like to add this into the table.
The code I have to create the table below is as follows (.add_row is in a loop):
outTbl = PrettyTable(["Projects", "Number"])
outTbl.add_row([eachProj, count])
...which generates a table looking like this:
+--------------------------+-----------+
|         Projects         |  Number   |
+--------------------------+-----------+
|        Project A         |     5     |
|        Project B         |     9     |
|        Project C         |     8     |
|        Project D         |     2     |
+--------------------------+-----------+
...but I'm looking for the functionality to create the above table with a summary footer at the bottom:
+--------------------------+-----------+
|         Projects         |  Number   |
+--------------------------+-----------+
|        Project A         |     5     |
|        Project B         |     9     |
|        Project C         |     8     |
|        Project D         |     2     |
+--------------------------+-----------+
|          Total           |    24     |
+--------------------------+-----------+
I've searched the module docs online (the PrettyTable tutorial and the Google Code prettytable Tutorial) and can't see any reference to a footer, which I find surprising given that a header is an option. Can this be done in PrettyTable, or is there another Python module with this functionality anyone can recommend?
You can use texttable with a small hack around it:
import texttable

table = texttable.Texttable()
table.add_rows([['Projects', 'Number'],
                ['Project A\nProject B\nProject C\nProject D', '5\n9\n8\n2'],
                ['Total', 24]])
print(table.draw())
Output:
+-----------+--------+
| Projects  | Number |
+===========+========+
| Project A | 5      |
| Project B | 9      |
| Project C | 8      |
| Project D | 2      |
+-----------+--------+
| Total     | 24     |
+-----------+--------+
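If you don't want to hand-write the newline-joined cells, they can be built from the row data. A small sketch (the rows list here is a hypothetical stand-in for the question's loop data):

import texttable

# hypothetical stand-in for the (project, count) pairs from the question
rows = [('Project A', 5), ('Project B', 9), ('Project C', 8), ('Project D', 2)]

table = texttable.Texttable()
table.add_rows([['Projects', 'Number'],
                ['\n'.join(name for name, _ in rows),
                 '\n'.join(str(n) for _, n in rows)],
                ['Total', sum(n for _, n in rows)]])
print(table.draw())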
There is no separate function to create a footer in PrettyTable. However, you can use a little trick to create one, in case you are particular about using only PrettyTable:
total = 0  # avoid shadowing the built-in sum()
for row in outTbl:
    # parse the "Number" cell out of the rendered single-row table
    total += int(row.get_string(fields=["Number"]).split('\n')[3].replace('|', '').replace(' ', ''))
outTbl.add_row(['------------', '-----------'])
outTbl.add_row(['Total', total])
print(outTbl)
Or, if you are looking for a dedicated function for footers, you can look at https://stackoverflow.com/a/26937531/3249782 for different approaches you can use.
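Since the question already computes the count inside a loop, an even simpler variant is to accumulate the total while adding rows. A sketch, with a projects dict standing in for the question's (unshown) loop data:

from prettytable import PrettyTable

# hypothetical stand-in for the question's loop data
projects = {'Project A': 5, 'Project B': 9, 'Project C': 8, 'Project D': 2}

outTbl = PrettyTable(["Projects", "Number"])
total = 0
for eachProj, count in projects.items():
    outTbl.add_row([eachProj, count])
    total += count
outTbl.add_row(['Total', total])
print(outTbl)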
I had the same problem today and used the following approach to treat the last n lines of my table as result lines that are separated from the rest by a horizontal line (like the one separating the header):
from prettytable import PrettyTable
t = PrettyTable(['Project', 'Numbers'])
t.add_row(['Project A', '5'])
t.add_row(['Project B', '9'])
t.add_row(['Project C', '8'])
t.add_row(['Project D', '2'])
# NOTE: t is the prettytable table object
# Get string to be printed and create list of elements separated by \n
list_of_table_lines = t.get_string().split('\n')
# Use the first line (+---+-- ...) as horizontal rule to insert later
horizontal_line = list_of_table_lines[0]
# Print the table
# Treat the last n lines as "result lines" that are separated from the
# rest of the table by the horizontal line
result_lines = 1
print("\n".join(list_of_table_lines[:-(result_lines + 1)]))
print(horizontal_line)
print("\n".join(list_of_table_lines[-(result_lines + 1):]))
This results in the following output:
+-----------+---------+
|  Project  | Numbers |
+-----------+---------+
| Project A |    5    |
| Project B |    9    |
| Project C |    8    |
+-----------+---------+
| Project D |    2    |
+-----------+---------+
I know I'm late, but I've created a function to automagically append a "Total" row to the table. For now it does NOT handle the case where a footer value is wider than its column.
Python 3.6+
Function:
def table_footer(tbl, text, dc):
    res = f"{tbl._vertical_char} {text}{' ' * (tbl._widths[0] - len(text))} {tbl._vertical_char}"

    for idx, item in enumerate(tbl.field_names):
        if idx == 0:
            continue
        if item not in dc:
            res += f"{' ' * (tbl._widths[idx] + 1)} {tbl._vertical_char}"
        else:
            res += f"{' ' * (tbl._widths[idx] - len(str(dc[item])))} {dc[item]} {tbl._vertical_char}"

    res += f"\n{tbl._hrule}"
    return res
Usage:
tbl = PrettyTable()
tbl.field_names = ["Symbol", "Product", "Size", "Price", "Subtotal", "Allocation"]
tbl.add_row([......])
print(tbl)
print(table_footer(tbl, "Total", {'Subtotal': 50000, 'Allocation': '29 %'}))
+--------+----------------------------+-------+--------+----------+------------+
| Symbol |          Product           | Size  | Price  | Subtotal | Allocation |
+--------+----------------------------+-------+--------+----------+------------+
|  AMD   | Advanced Micro Devices Inc | 999.9 | 75.99  | 20000.0  |   23.00    |
|  NVDA  |        NVIDIA Corp         | 88.8  | 570.63 | 30000.0  |    6.00    |
+--------+----------------------------+-------+--------+----------+------------+
| Total  |                            |       |        |    50000 |       29 % |
+--------+----------------------------+-------+--------+----------+------------+
After inspecting the source code of PrettyTable, you can see that after printing the table you can get each column's width. Using this, you can create a footer yourself, because PrettyTable does not give you that option. Here is my approach:
from prettytable import PrettyTable
t = PrettyTable(['Project', 'Numbers'])
t.add_row(['Project A', '5'])
t.add_row(['Project B', '9'])
t.add_row(['Project C', '8'])
t.add_row(['Project D', '2'])
print(t)
total = '24'
padding_bw = (3 * (len(t.field_names)-1))
tb_width = sum(t._widths)
print('| ' + 'Total' + (' ' * (tb_width - len('Total' + total)) +
      ' ' * padding_bw) + total + ' |')
print('+-' + '-' * tb_width + '-' * padding_bw + '-+')
And here is the output:
+-----------+---------+
|  Project  | Numbers |
+-----------+---------+
| Project A |    5    |
| Project B |    9    |
| Project C |    8    |
| Project D |    2    |
+-----------+---------+
| Total            24 |
+---------------------+
Just change the total variable in the code and everything should work fine.
I stole @Niels's solution and wrote this function to print with a delimiter before the last num_footers lines:
def print_with_footer(ptable, num_footers=1):
    """Print a prettytable with an extra delimiter before the last `num_footers` rows."""
    lines = ptable.get_string().split("\n")
    hrule = lines[0]
    lines.insert(-(num_footers + 1), hrule)
    print("\n".join(lines))
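For example, reusing the table from the earlier answer (a sketch; the "Total" row is added as an ordinary row first, and the function then draws a rule above it):

from prettytable import PrettyTable

t = PrettyTable(['Project', 'Numbers'])
t.add_row(['Project A', '5'])
t.add_row(['Project B', '9'])
t.add_row(['Project C', '8'])
t.add_row(['Project D', '2'])
t.add_row(['Total', '24'])
print_with_footer(t)  # prints the table with a rule above the 'Total' row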
Related
I have one column (an object dtype) containing multiple values separated by ( | ).
I would like to extract only the customer order numbers, which start with ( 44 ). Sometimes the order number is at the beginning, sometimes in the middle, sometimes at the end, and sometimes it is duplicated.
44019541285_P_002 | 0317209757 | 87186978110350851 | 387186978103840544 |
87186978110202440 | 44019119315 | 87186978110202440 | 44019119315
87186978110326832 | 44019453624 | 87186978110326832 | 44019453624
44019406029 | 0317196878 | 87186978110313085 | 387186978120481881|
44019480564 | 0317202711 | 87186978110335810 | 387186978103844160 |
Desired result:
44019541285
44019119315
44019453624
44019406029
44019480564
My code:
import pandas as pd
from io import StringIO
data = '''
Order_Numbers
44019541285_P_002 | 0317209757 | 87186978110350851 | 387186978103840544 | 0652569032
87186978110202440 | 44019119315 | 87186978110202440 | 44019119315
87186978110326832 | 44019453624 | 87186978110326832 | 44019453624
44019406029 | 0317196878 | 87186978110313085 | 387186978120481881|
44019480564 | 0317202711 | 87186978110335810 | 387186978103844160 | 630552498
'''
df = pd.read_csv(StringIO(data.replace(' ','')))
df
'''
Order_Numbers
0 44019541285_P_002|0317209757|87186978110350851...
1 87186978110202440|44019119315|8718697811020244...
2 87186978110326832|44019453624|8718697811032683...
3 44019406029|0317196878|87186978110313085|38718...
4 44019480564|0317202711|87186978110335810|38718...
'''
Final code:
(
    df.Order_Numbers.str.split('|', expand=True)             # one column per |-separated token
      .astype(str)
      .where(lambda x: x.applymap(lambda y: y[:2] == '44'))  # keep only tokens starting with '44'
      .bfill(axis=1)                                         # pull the first match into column 0
      [0]
      .str.split('_').str.get(0)                             # drop suffixes like '_P_002'
)
0 44019541285
1 44019119315
2 44019453624
3 44019406029
4 44019480564
Name: 0, dtype: object
import pandas as pd

df = pd.DataFrame({
    'order_number': [
        '44019541285_P_002 | 0317209757 | 87186978110350851 | 387186978103840544 | 0652569032',
        '87186978110202440 | 44019119315 | 87186978110202440 | 44019119315',
        '87186978110326832 | 44019453624 | 87186978110326832 | 44019453624',
        '44019406029 | 0317196878 | 87186978110313085 | 387186978120481881|',
        '44019480564 | 0317202711 | 87186978110335810 | 387186978103844160 | 630552498'
    ]
})

def extract_customer_order(order_number):
    order_number = order_number.replace(' ', '')  # remove all spaces to make it easy to process, e.g. '44019541285_P_002 | 0317209757 ' -> '44019541285_P_002|0317209757'
    order_number_list = order_number.split('|')   # split the string at every |, e.g. '44019541285_P_002|0317209757' -> ['44019541285_P_002', '0317209757']
    result = []
    for order in order_number_list:
        if order.startswith('44'):       # select only order numbers starting with '44'
            order = order.split('_')[0]  # drop suffixes like '_P_002' so the result matches the desired output
            if order not in result:      # prevent duplicate order numbers
                result += [order]
    # if you want the result as a string separated by '|', uncomment the line below
    # result = '|'.join(result)
    return result

df['customer_order'] = df['order_number'].apply(extract_customer_order)
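For reference, printing the new column should give roughly this (a sketch of the expected output, given the suffix-stripping line above):

print(df['customer_order'])
# 0    [44019541285]
# 1    [44019119315]
# 2    [44019453624]
# 3    [44019406029]
# 4    [44019480564]
# Name: customer_order, dtype: object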
I want to calculate APRU for several countries.
country_list = ['us', 'gb', 'ca', 'id']

count = {}
for i in country_list:
    count[i] = df_day_country[df_day_country.isin([i])]
    count[i+'_reverse'] = count[i].iloc[::-1]
    for j in range(1, len(count[i+'_reverse'])):
        count[i+'_reverse']['count'].iloc[j] = count[i+'_reverse']['count'][j-1:j+1].sum()
    for k in range(1, len(count[i])):
        count[i][revenue_sum].iloc[k] = count[i][revenue_sum][k-1:k+1].sum()
    count[i]['APRU'] = count[i][revenue_sum] / count[i]['count'][0] / 100
After that, I will create 4 dataframes (df_us, df_gb, df_ca, df_id) that show each country's APRU.
But the dataset is large, and the running time becomes extremely slow as the country list grows. Is there a way to decrease the running time?
Consider using numba
Your code thus becomes
from numba import njit

country_list = ['us', 'gb', 'ca', 'id']

@njit
def count(country_list):
    count = {}
    for i in country_list:
        count[i] = df_day_country[df_day_country.isin([i])]
        count[i+'_reverse'] = count[i].iloc[::-1]
        for j in range(1, len(count[i+'_reverse'])):
            count[i+'_reverse']['count'].iloc[j] = count[i+'_reverse']['count'][j-1:j+1].sum()
        for k in range(1, len(count[i])):
            count[i][revenue_sum].iloc[k] = count[i][revenue_sum][k-1:k+1].sum()
        count[i]['APRU'] = count[i][revenue_sum] / count[i]['count'][0] / 100
    return count
Numba makes Python loops a lot faster and is in the process of being integrated into the more heavy-duty Python libraries like SciPy. Definitely give this a look.
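One caveat worth noting: numba's njit compiles numeric Python/NumPy code and does not understand pandas objects, so the DataFrame work above would need to be pushed into plain arrays first. A minimal sketch of the kind of loop numba does accelerate (the running-sum pattern from the question, on a NumPy array):

import numpy as np
from numba import njit

@njit
def running_sum(values):
    # cumulative sum written as an explicit loop that numba can compile
    out = np.empty_like(values)
    total = 0.0
    for idx in range(values.shape[0]):
        total += values[idx]
        out[idx] = total
    return out

print(running_sum(np.array([5.0, 9.0, 8.0, 2.0])))  # [ 5. 14. 22. 24.]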
IIUC, from your code and variable names, it looks like you are trying to compute an average:
import numpy as np
import pandas as pd

# toy data set:
country_list = ['us', 'gb']
np.random.seed(1)
datalen = 10
df_day_country = pd.DataFrame({'country': np.random.choice(country_list, datalen),
                               'count': np.random.randint(0, 100, datalen),
                               'revenue_sum': np.random.uniform(0, 100, datalen)})

df_day_country['APRU'] = (df_day_country.groupby('country', group_keys=False)
                          .apply(lambda x: x['revenue_sum'] / x['count'].sum()))
Output:
+---+---------+-------+-------------+----------+
|   | country | count | revenue_sum |   APRU   |
+---+---------+-------+-------------+----------+
| 0 | gb      | 16    | 20.445225   | 0.150333 |
| 1 | gb      | 1     | 87.811744   | 0.645675 |
| 2 | us      | 76    | 2.738759    | 0.011856 |
| 3 | us      | 71    | 67.046751   | 0.290246 |
| 4 | gb      | 6     | 41.730480   | 0.306842 |
| 5 | gb      | 25    | 55.868983   | 0.410801 |
| 6 | gb      | 50    | 14.038694   | 0.103226 |
| 7 | gb      | 20    | 19.810149   | 0.145663 |
| 8 | gb      | 18    | 80.074457   | 0.588783 |
| 9 | us      | 84    | 96.826158   | 0.419161 |
+---+---------+-------+-------------+----------+
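An equivalent formulation that avoids apply, using transform (a sketch on the same toy frame; transform broadcasts each group's sum back onto its rows):

df_day_country['APRU'] = (df_day_country['revenue_sum']
                          / df_day_country.groupby('country')['count'].transform('sum'))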
I need to format data, contained in a list of lists, as a table.
I can make a grid using tabulate:
import tabulate

x = [['Alice', 'min', 2],
     ['', 'max', 5],
     ['Bob', 'min', 8],
     ['', 'max', 15]]
header = ['Name', '', 'value']
print(tabulate.tabulate(x, headers=header, tablefmt="grid"))
+--------+-----+---------+
| Name   |     |   value |
+========+=====+=========+
| Alice  | min |       2 |
+--------+-----+---------+
|        | max |       5 |
+--------+-----+---------+
| Bob    | min |       8 |
+--------+-----+---------+
|        | max |      15 |
+--------+-----+---------+
However, we require grouping of rows, like this:
+--------+-----+---------+
| Name   |     |   value |
+========+=====+=========+
| Alice  | min |       2 |
+        +     +         +
|        | max |       5 |
+--------+-----+---------+
| Bob    | min |       8 |
+        +     +         +
|        | max |      15 |
+--------+-----+---------+
I tried using multiline rows (using "\n".join()), which is apparently supported in tabulate 0.8.3, with no success.
This is required to run on the production server, so we can't use any heavy libraries. We are using tabulate because the whole tabulate library is a single file, and we can ship that file with the product.
You can try this:
x = [['Alice', 'min\nmax', '2\n5'],
     ['Bob', 'min\nmax', '8\n15'],
    ]
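Rendered with the same tabulate call as in the question (a sketch; the answerer's output below evidently used a different header list for the value column):

import tabulate

x = [['Alice', 'min\nmax', '2\n5'],
     ['Bob', 'min\nmax', '8\n15']]
header = ['Name', '', 'value']
print(tabulate.tabulate(x, headers=header, tablefmt="grid"))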
+--------+-----+------------------------+
| Name   |     | ['value1', 'value2']   |
+========+=====+========================+
| Alice  | min | 2                      |
|        | max | 5                      |
+--------+-----+------------------------+
| Bob    | min | 8                      |
|        | max | 15                     |
+--------+-----+------------------------+
I am trying to get the number of network edges from an SQLite database which has been normalised as follows:
Authors                        Papers                       Paper_Authors
| authorID | name | etc |      | paperID | title | etc |    | paperID | authorID |
| 1        | .... | ... |      | 1       | ..... | ... |    | 1       | 1        |
| 2        | .... | ... |      | 2       | ..... | ... |    | 1       | 2        |
| 3        | .... | ... |      | .       | ..... | ... |    | 1       | 3        |
| 4        | .... | ... |      | 60,000  | ..... | ... |    | 2       | 1        |
| 5        | .... | ... |                                   | 2       | 4        |
| .        | .... | ... |                                   | 2       | 5        |
| 120,000  | .... | ... |                                   | .       | .        |
                                                            | 60,000  | 120,000  |
With somewhere in the region of 120,000 authors and 60,000 papers, and the index table has around 250,000 rows.
I am trying to get this into networkX to do some connectivity analysis. Inputting the nodes is simple:
conn = sqlite3.connect('../input/database.sqlite')
c = conn.cursor()
g = nx.Graph()
c.execute('SELECT authorID FROM Authors;')
authors = c.fetchall()
g.add_nodes_from(authors)
The problem I am having arises from trying to determine the edges to feed to networkX, which requires tuples of the two nodes to connect. Using the data above as an example:
[(1,1),(1,2),(1,3),(2,3),(1,4),(1,5),(4,5)]
...would describe the dataset above.
I have the following code, which works, but is inelegant:
def coauthors(pID):
    c.execute('SELECT authorID \
               FROM Paper_Authors \
               WHERE paperID IS ?;', (pID,))
    out = c.fetchall()
    g.add_edges_from(itertools.product(out, out))

c.execute('SELECT COUNT() FROM Papers;')
papers = c.fetchall()

for i in range(1, papers[0][0] + 1):
    if i % 1000 == 0:
        print('On record:', str(i))
    coauthors(i)
This loops through each of the papers in the database, returns the list of authors, iteratively builds the author-combination tuples, and adds them to the network piecemeal. It works, but it took 30-45 minutes:
print(nx.info(g))
Name:
Type: Graph
Number of nodes: 120670
Number of edges: 697389
Average degree: 11.5586
So my question is: is there a more elegant way to come to the same result, ideally with the paperID as the edge label, to make it easier to navigate the network outside of networkX?
You can get all combinations of authors for each paper with a self join:
SELECT paperID,
       a1.authorID AS author1,
       a2.authorID AS author2
FROM Paper_Authors AS a1
JOIN Paper_Authors AS a2 USING (paperID)
WHERE a1.authorID < a2.authorID;  -- prevent duplicate edges
This will be horribly inefficient unless you have an index on paperID, or better, a covering index on both paperID and authorID, or better, a WITHOUT ROWID table.
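To feed this straight into networkX with paperID kept as an edge attribute, something like the following should work (a sketch reusing the conn/g setup from the question; add_edges_from accepts (u, v, attr_dict) 3-tuples):

import sqlite3
import networkx as nx

conn = sqlite3.connect('../input/database.sqlite')
c = conn.cursor()
g = nx.Graph()

c.execute('''SELECT paperID, a1.authorID, a2.authorID
             FROM Paper_Authors AS a1
             JOIN Paper_Authors AS a2 USING (paperID)
             WHERE a1.authorID < a2.authorID;''')
# note: in a plain Graph, a pair that co-authored several papers
# keeps only the last paperID seen for that edge
g.add_edges_from((author1, author2, {'paperID': paper_id})
                 for paper_id, author1, author2 in c)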
I have a table with several million transactions. The table contains a timestamp for the transaction, an amount, and several other properties (e.g., address). For each transaction I want to calculate the count and amount sum of transactions that have happened in a timeframe, e.g., 1 month, with the same, e.g., address.
Here is an example input:
+----+---------------------+----------------+--------+
| id | ts                  | address        | amount |
+----+---------------------+----------------+--------+
| 0  | 2016-10-11 00:34:21 | 123 First St.  | 56.20  |
+----+---------------------+----------------+--------+
| 1  | 2016-10-13 02:53:58 | 456 Second St. | 96.19  |
+----+---------------------+----------------+--------+
| 2  | 2016-10-23 02:28:17 | 123 First St.  | 64.65  |
+----+---------------------+----------------+--------+
| 3  | 2016-10-31 07:14:35 | 456 Second St. | 36.38  |
+----+---------------------+----------------+--------+
| 4  | 2016-11-04 09:25:39 | 123 First St.  | 93.65  |
+----+---------------------+----------------+--------+
| 5  | 2016-11-20 22:30:15 | 123 First St.  | 88.39  |
+----+---------------------+----------------+--------+
| 6  | 2016-11-28 09:39:14 | 123 First St.  | 74.40  |
+----+---------------------+----------------+--------+
| 7  | 2016-12-03 17:09:12 | 123 First St.  | 83.13  |
+----+---------------------+----------------+--------+
This should output:
+----+-------+--------+
| id | count | amount |
+----+-------+--------+
| 0  | 0     | 0.00   |
+----+-------+--------+
| 1  | 0     | 0.00   |
+----+-------+--------+
| 2  | 1     | 56.20  |
+----+-------+--------+
| 3  | 1     | 96.19  |
+----+-------+--------+
| 4  | 2     | 120.85 |
+----+-------+--------+
| 5  | 1     | 64.65  |
+----+-------+--------+
| 6  | 1     | 88.39  |
+----+-------+--------+
| 7  | 2     | 162.79 |
+----+-------+--------+
In order to do this, I sorted the table by timestamp and then I'm essentially using queues and dictionaries, but it seems to be running really slow, so I was wondering if there's a better way to do it.
Here is my code:
import csv
import Queue
import time

props = [ 'address', ... ]
spans = { '1m': 2629800, ... }

h = ['id']
for value in ['count', 'amount']:
    for span in spans:
        for prop in props:
            h.append(span + '_' + prop + '_' + value)

tq = {}
kq = {}
vq = {}
for span in spans:
    tq[span] = Queue.Queue()
    kq[span] = {}
    vq[span] = {}
    for prop in props:
        kq[span][prop] = Queue.Queue()
        vq[span][prop] = {}

with open('transactions.csv', 'r') as csvin, open('velocities.csv', 'w') as csvout:
    reader = csv.DictReader(csvin)
    writer = csv.DictWriter(csvout, h)
    writer.writeheader()
    for i in reader:
        o = {'id': i['id']}
        ts = time.mktime(time.strptime(i['ts'], '%Y-%m-%d %H:%M:%S'))
        for span in spans:
            while not tq[span].empty() and ts > tq[span].queue[0] + spans[span]:
                tq[span].get()
                for prop in props:
                    key = kq[span][prop].get()
                    vq[span][prop][key].get()
                    if vq[span][prop][key].empty():
                        del vq[span][prop][key]
            tq[span].put(ts)
            for prop in props:
                kq[span][prop].put(i[prop])
                if not i[prop] in vq[span][prop]:
                    vq[span][prop][i[prop]] = Queue.Queue()
                o[span + '_' + prop + '_count'] = vq[span][prop][i[prop]].qsize()
                o[span + '_' + prop + '_amount'] = sum(vq[span][prop][i[prop]].queue)
                vq[span][prop][i[prop]].put(float(i['auth']))
        writer.writerow(o)
        csvout.flush()
I also tried replacing vq[span][prop] with RB-trees, but the performance was even worse.
Either I fundamentally misunderstand what you're trying to do, or you do, because your code is vastly more complicated (not complex, complicated) than it needs to be if you're doing what you say you're doing.
import csv
from collections import namedtuple, defaultdict, Counter
from datetime import datetime

Span = namedtuple('Span', ('start', 'end'))
month_span = Span(start=datetime(2016, 1, 1), end=datetime(2016, 1, 31))

counts = defaultdict(Counter)
amounts = defaultdict(Counter)

with open('transactions.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
        timestamp = datetime.strptime(row['ts'], '%Y-%m-%d %H:%M:%S')
        if month_span.start < timestamp < month_span.end:  # or <=
            # You do some checking for properties. If you *will* always
            # have these columns, you *should* just use ``row['count']``
            # and ``row['amount']``
            counts[month_span][row['address']] += int(row.get('count', 0))
            amounts[month_span][row['address']] += float(row.get('amount', 0.00))

print(counts)
print(amounts)
Note that you're still operating, as you say, over "several million transactions". That's going to take a while no matter which way you turn it, because you're doing the same thing several million times. If you want to see where your current code is spending all its time, you can profile it. I find that the line profiler is easy to use and works well.
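For reference, a minimal line_profiler workflow looks roughly like this (a sketch; velocity.py is a hypothetical name for your script): decorate the hot function with @profile (kernprof injects that name at runtime, no import needed), then run:

pip install line_profiler
kernprof -l -v velocity.py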
Chances are, because you're doing what you're doing a million times, you're not going to be able to speed this up much without dropping to a lower-level language, e.g. Cython, C, or C++. That will speed some things up, but it will definitely be a lot harder to write the code.