python datatable read expression from csv

I am using Python's datatable.
I have two CSV files.
CSV 1
A,B
1,2
3,4
5,6
CSV 2
NAME,EXPR
A_GREATER_THAN_B, A>B
A_GREATER_THAN_10, A>10
B_GREATER_THAN_5, B>5
Expected Output
A,B,A_GREATER_THAN_B,A_GREATER_THAN_10,B_GREATER_THAN_5
1,2,0,0,0
3,4,0,0,0
5,6,0,0,1
Code
import datatable as dt

dt1 = dt.fread("csv_1.csv")     # the first CSV (filename assumed)
exprdt = dt.fread("csv_2.csv")  # the second CSV
exprdict = dict(exprdt.to_tuples())
dt1[:, dt.update(**exprdict)]
print(dt1)
Current output
   |     A      B      C  A_G_B          A_G_1     B_G_4
   | int32  int32  int32  str32          str32     str32
-- + -----  -----  -----  -------------  --------  --------
 0 |     0      1      1  dt.f.A>dt.f.B  dt.f.A>1  dt.f.B>4
 1 |     1      5      6  dt.f.A>dt.f.B  dt.f.A>1  dt.f.B>4
I am trying to extend the first datatable using the expressions from the second datatable. When I use fread to read the CSV files, the expression is read in as a string, not as an expression.
How do I use the second datatable (CSV) to update the first datatable using the NAME and EXPR columns?

You can do what you want, but just because you can do something doesn't mean it's a good idea. Any solution that requires eval() is probably more complicated than it needs to be, and it introduces serious risks if you don't have complete control over the data going in.
Having said that, this script shows a naive approach without fancy expressions from a table, and the approach you suggest, which I strongly recommend against; try to find a better way to achieve what you need:
from io import StringIO
import re
import datatable as dt
csv1 = """A,B
1,2
3,4
5,6"""
csv2 = """NAME,EXPR
A_GREATER_THAN_B, A>B
A_GREATER_THAN_10, A>10
B_GREATER_THAN_5, B>5"""
def naive():
    # naive approach
    d = dt.fread(StringIO(csv1))
    d['A_GREATER_THAN_B'] = d[:, dt.f.A > dt.f.B]
    d['A_GREATER_THAN_10'] = d[:, dt.f.A > 10]
    d['B_GREATER_THAN_5'] = d[:, dt.f.B > 5]
    print(d)

def update_with_expressions(d, expressions):
    for n in range(expressions.nrows):
        col = expressions[n, :][0, 'NAME']
        expr = re.sub('([A-Za-z]+)', r'dt.f.\1', expressions[n, :][0, 'EXPR'])
        # here's hoping that expression is trustworthy...
        d[col] = d[:, eval(expr)]

def fancy():
    # fancy, risky approach
    d = dt.fread(StringIO(csv1))
    update_with_expressions(d, dt.fread(StringIO(csv2)))
    print(d)

if __name__ == '__main__':
    naive()
    fancy()
Result (showing you get the same result from either approach):
   |     A      B  A_GREATER_THAN_B  A_GREATER_THAN_10  B_GREATER_THAN_5
   | int32  int32             bool8              bool8             bool8
-- + -----  -----  ----------------  -----------------  ----------------
 0 |     1      2                 0                  0                 0
 1 |     3      4                 0                  0                 0
 2 |     5      6                 0                  0                 1
[3 rows x 5 columns]

   |     A      B  A_GREATER_THAN_B  A_GREATER_THAN_10  B_GREATER_THAN_5
   | int32  int32             bool8              bool8             bool8
-- + -----  -----  ----------------  -----------------  ----------------
 0 |     1      2                 0                  0                 0
 1 |     3      4                 0                  0                 0
 2 |     5      6                 0                  0                 1
[3 rows x 5 columns]
Note: if someone knows of a nicer way to iterate over the rows of a datatable.Frame, please leave a comment, because I'm not a fan of this part:
for n in range(expressions.nrows):
    col = expressions[n, :][0, 'NAME']
Note that StringIO is only imported to embed the .csv files in the script; you wouldn't need it when reading actual files.
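As a side note, one eval()-free direction is to accept only a restricted expression grammar. This is a sketch under the assumption that every EXPR is a simple "column operator column-or-integer" comparison; the regex, the operator table and the function names below are mine, not part of the original answer:
import operator
import re

import datatable as dt

OPS = {'>': operator.gt, '<': operator.lt, '>=': operator.ge,
       '<=': operator.le, '==': operator.eq}

def parse_expr(expr):
    # only accepts "<name> <op> <name-or-integer>"; anything else is rejected
    m = re.fullmatch(r'\s*(\w+)\s*(>=|<=|==|>|<)\s*(\w+)\s*', expr)
    if m is None:
        raise ValueError(f'unsupported expression: {expr!r}')
    lhs, op, rhs = m.groups()
    rhs = int(rhs) if rhs.isdigit() else dt.f[rhs]
    return OPS[op](dt.f[lhs], rhs)

def update_from_frame(d, expressions):
    for n in range(expressions.nrows):
        name = expressions[n, :][0, 'NAME']
        d[name] = d[:, parse_expr(expressions[n, :][0, 'EXPR'])]
Because only whitelisted operators and plain column names are accepted, nothing untrusted ever reaches eval().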


How to vectorize and speed-up double for-loop for pandas dataframe when doing text similarity scoring

I have the following dataframe:
d_test = {
    'name': ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
    'cluster_number': [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)
I want to identify similar names in the name column when those names belong to the same cluster number, and create a unique id for them. For example, South Beach and Beach belong to cluster number 1 and their similarity score is pretty high, so we associate them with a unique id, say 1. The next cluster is number 2 and three entities from the name column belong to this cluster: Dog, Big Dog and Cat. Dog and Big Dog have a high similarity score and their unique id will be, say, 2. For Cat the unique id will be, say, 3. And so on.
I created a code for the logic above:
# pip install thefuzz
from thefuzz import fuzz
import pandas as pd

d_test = {
    'name': ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
    'cluster_number': [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)

df_test['id'] = 0
i = 1
is_i_used = False
for index, row in df_test.iterrows():
    for index_, row_ in df_test.iterrows():
        if row['cluster_number'] == row_['cluster_number'] and row_['id'] == 0:
            if fuzz.ratio(row['name'], row_['name']) > 50:
                df_test.loc[index_, 'id'] = int(i)
                is_i_used = True
    if is_i_used == True:
        i += 1
        is_i_used = False
Code generates expected result:
          name  cluster_number  id
0  South Beach               1   1
1          Dog               2   2
2         Bird               3   3
3          Ant               3   4
4      Big Dog               2   2
5        Beach               1   1
6         Dear               4   5
7          Cat               2   6
Note that for Cat we got an id of 6, but that is fine because it is unique anyway.
While the algorithm above works for the test data, I am not able to use it for the real data that I have (about 1 million rows), and I am trying to understand how to vectorize the code and get rid of the two for-loops.
Also, the thefuzz module has a process function that allows processing the data at once:
from thefuzz import process
out = process.extract("Beach", df_test['name'], limit=len(df_test))
But I don't see how it can help with speeding up the code.
tl;dr: Avoid O(N^2) running time if N is big.
help with speeding up the code.
People get down on .iterrows(), calling it "slow".
Switching from .iterrows to a vectorized approach
might "speed things up" somewhat, but that's a relative measure.
Let's talk about complexity.
time complexity
Your current algorithm is quadratic;
it features a pair of nested .iterrows loops.
But then we immediately filter on
if same_cluster and not_yet_assigned:
Now, that could be workable for "small" N.
But an N of 400K quickly becomes infeasible:
>>> 419_776 ** 2 / 1e9
176.211890176
One hundred seventy-six billion iterations (with a "B")
is nothing to sneeze at,
even if each filter step has trivial (yet non-zero) cost.
At the risk of reciting facts that have tediously been
repeated many times before:
sorting costs O(N log N), and
N log N is very significantly less than N^2.
I'm not convinced that what you want is to "go fast".
Rather, I suspect what you really want is to "do less".
Start by ordering your rows, and then make
a roughly linear pass over that dataset.
You didn't specify your typical cluster group size G.
But since there are many distinct cluster numbers,
we definitely know that G << N.
Sorting lets us bring the complexity down from O(N^2)
to roughly O(N × G): about N/G clusters,
each costing O(G^2) to fuzzy-match within.
With G around 10, that is millions of comparisons
rather than hundreds of billions.
df = df_test.sort_values(['cluster_number', 'name'])
You wrote
for index, row in df_test.iterrows():
    for index_, row_ in df_test.iterrows():
Turn that into
for index, row in df.iterrows():
    while ...
and use .iloc[] to examine the relevant rows.
The while loop gets to terminate as soon
as a new cluster number is seen, instead
of every time having to slog through hundreds of thousands
of rows until end-of-dataframe is seen.
Why can it exit early?
Due to the sort order.
A more convenient way to structure this might be
to write a clustering helper.
def get_clusters(df):
    cur_num = -1
    cluster = []
    for _, row in df.iterrows():
        if row.cluster_number != cur_num and cluster:
            yield cluster
            cluster = []
        cur_num = row.cluster_number
        cluster.append(row)
    if cluster:
        yield cluster  # don't drop the final cluster
Now your top level code can iterate through a bunch
of clusters, performing a fuzzy match of cost O(G^2)
on each cluster.
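For illustration, a hedged sketch of that top-level loop, using the get_clusters() helper above. The > 50 threshold and the integer id scheme are copied from the question; the function name assign_ids and the exact loop shape are my own:
from thefuzz import fuzz

def assign_ids(df):
    # sketch only: df must already be sorted by cluster_number (see the sort above)
    df['id'] = 0
    next_id = 1
    for cluster in get_clusters(df):
        for row in cluster:                      # at most G rows per cluster
            if df.loc[row.name, 'id'] != 0:      # row.name is the index label
                continue
            for other in cluster:
                if (df.loc[other.name, 'id'] == 0
                        and fuzz.ratio(row['name'], other['name']) > 50):
                    df.loc[other.name, 'id'] = next_id
            next_id += 1
    return df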
The invariant on each generated cluster
is that all rows within cluster
shall have identical cluster_number.
And, due to the sorting,
we guarantee that a given cluster_number
shall be generated at most once.
https://stackoverflow.com/help/self-answer
Please measure the current running time,
implement these suggestions,
measure again,
and post the code + timings.
Attempt #1
Based on #J_H suggestions I made some changes in the original code:
import pandas as pd
from thefuzz import fuzz

d_test = {
    'name': ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat', 'Fish', 'Dry Fish'],
    'cluster_number': [1, 2, 3, 3, 2, 1, 4, 2, 2, 2]
}
df_test = pd.DataFrame(d_test)
df_test = df_test.sort_values(['cluster_number', 'name'])
df_test.reset_index(drop=True, inplace=True)

df_test['id'] = 0
i = 1
is_i_used = False
for index, row in df_test.iterrows():
    row_ = row
    index_ = index
    while (index_ < len(df_test)
           and df_test.loc[index, 'cluster_number'] == df_test.loc[index_, 'cluster_number']
           and df_test.loc[index_, 'id'] == 0):
        if row['name'] == df_test.loc[index_, 'name'] or fuzz.ratio(row['name'], df_test.loc[index_, 'name']) > 50:
            df_test.loc[index_, 'id'] = i
            is_i_used = True
        index_ += 1
    if is_i_used == True:
        i += 1
        is_i_used = False
Now, instead of hours of computation, it runs in only 210 seconds for a dataframe with 1 million rows, where on average each cluster has about 10 rows and the maximum cluster size is about 200 rows.
While that is a significant improvement, I am still looking for a vectorized option.
Attempt #2
I created vectorized version:
import numpy as np
import pandas as pd
from rapidfuzz import process, fuzz

df_test = pd.DataFrame(d_test)
names = df_test["name"]
scores = pd.DataFrame(process.cdist(names, names, workers=-1), columns=names, index=names)
x, y = np.where(scores > 50)
groups = (pd.DataFrame(scores.index[x], scores.index[y])
          .groupby(level=0)
          .agg(frozenset)
          .drop_duplicates()
          .reset_index(drop=True)
          .reset_index()
          .explode("name"))
groups.rename(columns={'index': 'restaurant_id'}, inplace=True)
groups.restaurant_id += 1
df_test = df_test.merge(groups, how="left")
but it is not possible to use it for a dataframe with 1 million rows, because cdist returns a matrix of len(queries) x len(choices) x size(dtype). By default this dtype is float, so for 1 million names the result matrix would require about 3.6 terabytes of memory.
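As a rough back-of-the-envelope check (assuming 4 bytes per score, i.e. float32; the exact default dtype is an assumption here):
n = 1_000_000
print(n * n * 4 / 2**40)  # ~3.64 TiB for the full n x n score matrix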
Following up on your own answer, you don't need to compute process.cdist on all the names; you are only interested in those within the same cluster.
To do so, you can iterate over the groups:
threshold = 50
index_start = 0
groups = []
for grp_name, grp_df in df_test.groupby("cluster_number"):
    names = grp_df["name"]
    scores = pd.DataFrame(
        data=process.cdist(names, names, workers=-1),
        columns=names,
        index=names,
    )
    x, y = np.where(scores > threshold)
    grps_in_group = (pd.DataFrame(scores.index[x], scores.index[y])
                     .groupby(level=0)
                     .agg(frozenset)
                     .drop_duplicates()
                     .reset_index(drop=True)
                     .assign(restaurant_id=lambda t: t.index + index_start)
                     .explode("name")
                     )
    index_start = grps_in_group["restaurant_id"].max() + 1
    groups.append(grps_in_group)

df_test.merge(pd.concat(groups), on="name")
| | name | cluster_number | id | restaurant_id |
|---:|:------------|-----------------:|-----:|----------------:|
| 0 | Beach | 1 | 0 | 0 |
| 1 | South Beach | 1 | 0 | 0 |
| 2 | Big Dog | 2 | 0 | 1 |
| 3 | Cat | 2 | 0 | 2 |
| 4 | Dog | 2 | 0 | 1 |
| 5 | Dry Fish | 2 | 0 | 3 |
| 6 | Fish | 2 | 0 | 3 |
| 7 | Ant | 3 | 0 | 4 |
| 8 | Bird | 3 | 0 | 5 |
| 9 | Dear | 4 | 0 | 6 |
Yet I am not sure this is an improvement.
Now, transforming the loop body into a function, we can use .groupby(...).apply(...); however, we lose track of the consecutive index. To address that I am using a trick with the pandas categorical type:
def create_restaurant_id(
        dframe: pd.DataFrame,
        threshold: int = 50
) -> pd.DataFrame:
    names = dframe["name"]
    scores = pd.DataFrame(
        data=process.cdist(names, names, workers=-1),
        columns=names,
        index=names,
    )
    x, y = np.where(scores > threshold)
    grps_in_group = (pd.DataFrame(scores.index[x], scores.index[y])
                     .groupby(level=0)
                     .agg(frozenset)
                     .drop_duplicates()
                     .reset_index(drop=True)
                     .assign(restaurant_id=lambda t: t.index)
                     .explode("name")
                     )
    return grps_in_group

(df_test
 .groupby("cluster_number")
 .apply(create_restaurant_id)
 .reset_index(level=0)
 .assign(restaurant_id=lambda t: (
     t["cluster_number"].astype(str) + t["restaurant_id"].astype(str)
 ).astype("category").cat.codes)
)
In terms of performance on my laptop, with such a small dataframe, the two are almost identical.
I think you are thinking very analytically. Try this:
What I'm doing here is giving a non-repeating ID number (details below).
import pandas as pd

d_test = {
    'name': ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
    'cluster_number': [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)
df_test = df_test.sort_values(['cluster_number', 'name'])
df_test.reset_index(drop=True, inplace=True)

# Does the word occur more than once? (int)
repeat = 0
for i in range(df_test.shape[0]):
    heywtu = df_test[df_test['name'].str.contains(*df_test['name'][i].split())].index.values
    # 0 enters the special case, so we took it as 1 directly.
    if i == 0:
        df_test.loc[i, 'id'] = i + 1
    else:
        # Does the word occur more than once?
        repeat += len(heywtu) == 2
        # Fill the id column with a specific id number
        df_test.loc[i, 'id'] = i - repeat
    # Editing the id of names that share a word, other than row 0
    if (len(heywtu) == 2) & (i != 0):
        df_test.loc[i, 'id'] = heywtu[0]
        continue

# Special case, if there are only 2 values:
if len(df_test['name']) == 2:
    df_test.loc[1, 'id'] = 2

# For first d_test values
print(df_test.head(10))
>>>           name  cluster_number   id
>>> 0        Beach               1  1.0
>>> 1  South Beach               1  1.0
>>> 2      Big Dog               2  2.0
>>> 3          Cat               2  3.0
>>> 4          Dog               2  2.0
>>> 5          Ant               3  4.0
>>> 6         Bird               3  5.0
>>> 7         Dear               4  6.0
# For last d_test values
print(df_test.head(10))
>>>           name  cluster_number   id
>>> 0        Beach               1  1.0
>>> 1  South Beach               1  1.0
>>> 2      Big Dog               2  2.0
>>> 3          Cat               2  3.0
>>> 4          Dog               2  2.0
>>> 5     Dry Fish               2  4.0
>>> 6         Fish               2  4.0
>>> 7          Ant               3  5.0
>>> 8         Bird               3  6.0
>>> 9         Dear               4  7.0
# If there are only 2 values
df_test.head()
>>>           name  cluster_number   id
>>> 0      Big Dog               1  1.0
>>> 1  South Beach               2  2.0
What is repeat? If another string contains the word Dog it gets counted too, like Dog and Big Dog, and we subtract that count from the index number. I hope this is helpful for your problem.

Pandas: how to incrementally add one to column while sum is less than corresponding column?

I am trying to increment a column by 1 while the sum of that column is less than or equal to a total supply figure. I also need each value in that column to be less than the corresponding value in the 'allocation' column. The supply variable will be dynamic, from 1-400, based on user input. Below is the desired output (the Allocation Final column).
supply = 14
| rank | allocation | Allocation Final |
| ---- | ---------- | ---------------- |
| 1 | 12 | 9 |
| 2 | 3 | 3 |
| 3 | 1 | 1 |
| 4 | 1 | 1 |
Below is the code I have so far:
import pandas as pd

data = [[1.05493, 12], [.94248, 3], [.82317, 1], [.75317, 1]]
df = pd.DataFrame(data, columns=['score', 'allocation'])
df['rank'] = df['score'].rank()
df['allocation_new'] = 0

# static for testing
supply = 14

for index in df.index:
    while df.loc[index, 'allocation_new'] < df.loc[index, 'allocation'] and df.loc[index, 'allocation_new'].sum() < supply:
        df.loc[index, 'allocation_new'] += 1
print(df)
This should do:
def allocate(df, supply):
    if supply > df['allocation'].sum():
        raise ValueError(f'Unachievable supply {supply}, maximal {df["allocation"].sum()}')
    under_alloc = pd.Series(True, index=df.index)
    df['allocation final'] = 0
    while (missing := supply - df['allocation final'].sum()) >= 0:
        assert under_alloc.any()
        if missing <= under_alloc.sum():
            df.loc[df.index[under_alloc][:missing], 'allocation final'] += 1
            return df
        df.loc[under_alloc, 'allocation final'] = (
            df.loc[under_alloc, 'allocation final'] + missing // under_alloc.sum()
        ).clip(upper=df.loc[under_alloc, 'allocation'])
        under_alloc = df['allocation final'] < df['allocation']
    return df
At every iteration, we add the missing quota to any rows that have not reached their allocation yet (rounded down, that's missing // under_alloc.sum()), then use pd.Series.clip() to ensure we stay below the allocation.
If there are fewer missing quotas than rows still under allocation (e.g. run the same dataframe with supply=5 or 6), we allocate to the first under-allocated ranks.
>>> df = pd.DataFrame({'allocation': {0: 12, 1: 3, 2: 1, 3: 1}, 'rank': {0: 1, 1: 2, 2: 3, 3: 4}})
>>> print(allocate(df, 14))
   allocation  rank  allocation final
0          12     1                 9
1           3     2                 3
2           1     3                 1
3           1     4                 1
>>> print(allocate(df, 5))
   allocation  rank  allocation final
0          12     1                 2
1           3     2                 1
2           1     3                 1
3           1     4                 1
Here is a simpler version:
def allocate(series, supply):
    allocated = 0
    values = [0] * len(series)
    while True:
        for i in range(len(series)):
            if allocated >= supply:
                return values
            if values[i] < series.iloc[i]:
                values[i] += 1
                allocated += 1

allocate(df['allocation'], 14)
output:
[9,3,1,1]

Comparing a value from one dataframe with values from columns in another dataframe and getting the data from third column

The title is a bit confusing, but I'll do my best to explain my problem here. I have 2 pandas dataframes, a and b:
>> print a
id | value
1 | 250
2 | 150
3 | 350
4 | 550
5 | 450
>> print b
low | high | class
100 | 200 | 'A'
200 | 300 | 'B'
300 | 500 | 'A'
500 | 600 | 'C'
I want to create a new column called class in table a that contains the class of the value in accordance with table b. Here's the result I want:
>> print a
id | value | class
1 | 250 | 'B'
2 | 150 | 'A'
3 | 350 | 'A'
4 | 550 | 'C'
5 | 450 | 'A'
I have the following code written that sort of does what I want:
a['class'] = pd.Series()
for i in range(len(a)):
    val = a['value'][i]
    cl = (b['class'][(b['low'] <= val) &
                     (b['high'] >= val)].iat[0])
    a['class'].set_value(i, cl)
The problem is, this is quick for tables of length 10 or so, but I am trying to do this with tables of 100,000+ rows for both a and b. Is there a quicker way to do this, using some function/attribute in pandas?
Here is a way to do a range join inspired by #piRSquared's solution:
import numpy as np
import pandas as pd

A = a['value'].values
bh = b.high.values
bl = b.low.values

i, j = np.where((A[:, None] >= bl) & (A[:, None] <= bh))

pd.DataFrame(
    np.column_stack([a.values[i], b.values[j]]),
    columns=a.columns.append(b.columns)
)
Output:
  id value  low high class
0  1   250  200  300   'B'
1  2   150  100  200   'A'
2  3   350  300  500   'A'
3  4   550  500  600   'C'
4  5   450  300  500   'A'
Here's a solution that is admittedly less elegant than using Series.searchsorted, but it runs super fast!
I pull the data out of the pandas DataFrames and convert them to lists, and then use np.where to populate a variable called "aclass" where the conditions are satisfied (with brute-force for loops). Then I write "aclass" back to the original dataframe a.
The evaluation time was 0.07489705 s, so it's pretty fast, even with 200,000 data points!
import time

import numpy as np

# create 200,000 fake a data points
avalue = 100 + 600*np.random.random(200000)  # assuming you extracted this from a with avalue = np.array(a['value'])
blow = [100, 200, 300, 500]    # assuming you extracted this from b with list(b['low'])
bhigh = [200, 300, 500, 600]   # assuming you extracted this from b with list(b['high'])
bclass = ['A', 'B', 'A', 'C']  # assuming you extracted this from b with list(b['class'])

aclass = [[]]*len(avalue)  # initialize aclass

start_time = time.time()  # this is just for timing the execution
for i in range(len(blow)):
    for j in np.where((avalue >= blow[i]) & (avalue <= bhigh[i]))[0]:
        aclass[j] = bclass[i]

# add the class column to the original a DataFrame
a['class'] = aclass

print("--- %s seconds ---" % np.round(time.time() - start_time, decimals=8))

Fix columns indentation with python

There is a file format called .xyz that helps with visualizing molecular bonds. Basically the format asks for a specific pattern:
On the first line there must be the number of atoms, which in my case is 30.
After that comes the data, where the first field is the name of the atom (in my case they are all carbon), the second field is the x value, the third field is the y value, and the last field is the z value, which is 0 in my case. The indentation should be consistent, so that all of the corresponding fields start at the same place. So something like this:
30
C   x1   y1   z1
C   x2   y2   z2
...
...
...
and not:
30
C x1 y1 z1
C    x2  y2   z2
since this is the wrong indentation.
My generated data is stored like this in a .txt file:
C 2.99996 7.31001e-05 0
C 2.93478 0.623697 0
C 2.74092 1.22011 0
C 2.42702 1.76343 0
C 2.0079 2.22961 0
C 1.50006 2.59812 0
C 0.927076 2.8532 0
C 0.313848 2.98349 0
C -0.313623 2.9837 0
C -0.927229 2.85319 0
C -1.5003 2.5981 0
C -2.00732 2.22951 0
C -2.42686 1.76331 0
C -2.74119 1.22029 0
C -2.93437 0.623802 0
C -2.99992 -5.5509e-05 0
C -2.93416 -0.623574 0
C -2.7409 -1.22022 0
C -2.42726 -1.7634 0
C -2.00723 -2.22941 0
C -1.49985 -2.59809 0
C -0.92683 -2.85314 0
C -0.313899 -2.98358 0
C 0.31363 -2.98356 0
C 0.927096 -2.85308 0
C 1.50005 -2.59792 0
C 2.00734 -2.22953 0
C 2.4273 -1.76339 0
C 2.74031 -1.22035 0
C 2.93441 -0.623647 0
I want to correct the indentation of this by making all of the lines start from the same point. I tried to do this with AWK to no avail. So I turned to Python. So far I have this:
#!/usr/bin/env python
text_file = open("output.txt", "r")
lines = text_file.readlines()
myfile = open("output.xyz", "w")
for line in lines:
    atom, x, y, z = line.split()
    x, y, z = map(float, (x, y, z))
    myfile.write("{}\t {}\t {}\t {}\n".format(atom, x, y, z))
myfile.close()
text_file.close()
but I currently don't know how to add the indentation to this.
tl;dr: I have a data file in .txt; I want to convert it to the .xyz format specified above, but I am running into problems with the indentation.
It appears that I misinterpreted your requirement...
To achieve a fixed width output using awk, you could use printf with a format string like this:
$ awk '{printf "%-4s%12.6f%12.6f%5d\n", $1, $2, $3, $4}' data.txt
C       2.999960    0.000073    0
C       2.934780    0.623697    0
C       2.740920    1.220110    0
C       2.427020    1.763430    0
C       2.007900    2.229610    0
C       1.500060    2.598120    0
C       0.927076    2.853200    0
C       0.313848    2.983490    0
C      -0.313623    2.983700    0
# etc.
Numbers after the % specify the width of the field. A negative number means that the output should be left aligned (as in the first column). I have specified 6 decimal places for the floating point numbers.
Original answer, in case it is useful:
To ensure that there is a tab character between each of the columns of your input, you could use this awk script:
awk '{$1=$1}1' OFS="\t" data.txt > output.xyz
$1=$1 just forces awk to touch each line, which makes sure that the new Output Field Separator (OFS) is applied.
awk scripts are built up from a series of condition { action }. If no condition is supplied, the action is performed for every line. If a condition but no action is supplied, the default action is to print the line. 1 is a condition that always evaluates to true, so awk prints the line.
Note that even though the columns are all tab-separated, they are still not lined up because the content of each column is of a variable length.
Your data has already been ill-formatted and converted to strings. To correctly align the numeric and non-numeric data, you need to parse the individual fields into their respective data types (possibly using duck typing) before formatting with str.format:
def convert(st):
    try:
        return int(st)
    except ValueError:
        pass
    try:
        return float(st)
    except ValueError:
        pass
    return st

# st holds the contents of the .txt file as a single string
for line in st.splitlines():
    print("{:8}{:12.5f}{:12.5f}{:5d}".format(*map(convert, line.split())))
C            2.99996     0.00007    0
C            2.93478     0.62370    0
C            2.74092     1.22011    0
C            2.42702     1.76343    0
C            2.00790     2.22961    0
C            1.50006     2.59812    0
C            0.92708     2.85320    0
C            0.31385     2.98349    0
C           -0.31362     2.98370    0
C           -0.92723     2.85319    0
Using this: awk '{printf "%s\t%10f\t%10f\t%i\n",$1,$2,$3,$4}' atoms
gives this output:
C      2.999960      0.000073    0
C      2.934780      0.623697    0
C      2.740920      1.220110    0
C      2.427020      1.763430    0
C      2.007900      2.229610    0
C      1.500060      2.598120    0
C      0.927076      2.853200    0
C      0.313848      2.983490    0
C     -0.313623      2.983700    0
C     -0.927229      2.853190    0
C     -1.500300      2.598100    0
C     -2.007320      2.229510    0
C     -2.426860      1.763310    0
C     -2.741190      1.220290    0
C     -2.934370      0.623802    0
C     -2.999920     -0.000056    0
C     -2.934160     -0.623574    0
C     -2.740900     -1.220220    0
C     -2.427260     -1.763400    0
C     -2.007230     -2.229410    0
C     -1.499850     -2.598090    0
C     -0.926830     -2.853140    0
C     -0.313899     -2.983580    0
C      0.313630     -2.983560    0
C      0.927096     -2.853080    0
C      1.500050     -2.597920    0
C      2.007340     -2.229530    0
C      2.427300     -1.763390    0
C      2.740310     -1.220350    0
C      2.934410     -0.623647    0
Is this what you meant, or did I misunderstand?
Edit, as a side note: I used tabs (\t) for separation, a space would do too, and I used a field width of 10; I didn't verify the lengths of your input values.
You can use string formatting to print values with consistent padding. For your case, you might write lines like this to the file:
>>> '%-12s %-12s %-12s %-12s\n' % ('C', '2.99996', '7.31001e-05', '0')
'C            2.99996      7.31001e-05  0           \n'
"%-12s" means "take the str() of the value and make it take up at least 12 characters left-justified.

How do I put data from a while loop into a table?

Basically I'm estimating pi using polygons. I have a loop which gives me a value for n, ann and bnn before running the loop again. Here is what I have so far:
from math import sqrt

def printPiTable(an, bn, n, k):
    """Prints out a table for values n,2n,...,(2^k)n"""
    u = (2**k)*n
    power = 0
    t = (2**power)*n
    while t <= u:
        if power < 1:
            print(t, an, bn)
            power = power + 1
            t = (2**power)*n
        else:
            afrac = (1/2)*((1/an)+(1/bn))
            ann = 1/afrac
            bnn = sqrt(ann*bn)
            print(t, ann, bnn)
            an = ann
            bn = bnn
            power = power + 1
            t = (2**power)*n
    return
This is what I get if I run it with these values:
>>> printPiTable(4,2*sqrt(2),4,5)
4 4 2.8284271247461903
8 3.3137084989847607 3.0614674589207187
16 3.1825978780745285 3.121445152258053
32 3.1517249074292564 3.1365484905459398
64 3.1441183852459047 3.1403311569547534
128 3.1422236299424577 3.1412772509327733
Instead of just printing out these raw values, I want to print them in a nice, neat table. Any help?
Use string formatting. For example,
print('{:<4}{:>20f}{:>20f}'.format(t,ann,bnn))
produces
4               4.000000            2.828427
8               3.313708            3.061467
16              3.182598            3.121445
32              3.151725            3.136548
64              3.144118            3.140331
128             3.142224            3.141277
{:<4} is replaced by t, left-justified, formatted to a string of length 4.
{:>20f} is replaced by ann, right-justified, formatted as a float to a string of length 20.
The full story on the format string syntax is explained here.
To add column headers, just add a print statement like
print('{:<4}{:>20}{:>20}'.format('t','a','b'))
For fancier ascii tables, consider using a package like prettytable:
from math import sqrt

import prettytable

def printPiTable(an, bn, n, k):
    """Prints out a table for values n,2n,...,(2^k)n"""
    table = prettytable.PrettyTable(['t', 'a', 'b'])
    u = (2**k)*n
    power = 0
    t = (2**power)*n
    while t <= u:
        if power < 1:
            table.add_row((t, an, bn))
            power = power + 1
            t = (2**power)*n
        else:
            afrac = (1/2)*((1/an)+(1/bn))
            ann = 1/afrac
            bnn = sqrt(ann*bn)
            table.add_row((t, ann, bnn))
            an = ann
            bn = bnn
            power = power + 1
            t = (2**power)*n
    print(table)

printPiTable(4, 2*sqrt(2), 4, 5)
yields
+-----+---------------+---------------+
|  t  |       a       |       b       |
+-----+---------------+---------------+
|  4  |       4       | 2.82842712475 |
|  8  | 3.31370849898 | 3.06146745892 |
| 16  | 3.18259787807 | 3.12144515226 |
| 32  | 3.15172490743 | 3.13654849055 |
| 64  | 3.14411838525 | 3.14033115695 |
| 128 | 3.14222362994 | 3.14127725093 |
+-----+---------------+---------------+
Perhaps it is overkill for this sole purpose, but Pandas can make nice tables too, and can export them in other formats, such as HTML.
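If you go that route, here is a minimal sketch; the rows list stands in for (t, ann, bnn) tuples collected inside the while loop in place of the print or add_row calls (the first two rows below are copied from the output above, the rest would follow the same way):
import pandas as pd

rows = [(4, 4.0, 2.8284271247461903),
        (8, 3.3137084989847607, 3.0614674589207187)]  # ... collected in the loop
table = pd.DataFrame(rows, columns=['t', 'a', 'b'])
print(table.to_string(index=False))
print(table.to_html(index=False))  # or export as HTML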
You can use output formatting to make it look pretty. Look here for an example:
http://docs.python.org/release/1.4/tut/node45.html
