Python: read csv file and add one column through function computation

I have an example csv file with name 'r2.csv':
Factory | Product_Number | Date      | mu  | cs  | co
--------|----------------|-----------|-----|-----|----
A       | 1              | 01APR2017 | 5.6 | 125 | 275
A       | 1              | 02APR2017 | 4.5 | 200 | 300
A       | 1              | 03APR2017 | 6.6 | 150 | 250
A       | 1              | 04APR2017 | 7.5 | 175 | 325
I would like to add one more column named 'Order_Number', computed with the following function:
Order_Number = np.ceil(poisson.ppf(co/(cs+co), mu))
Here is the code I have:
import numpy as np
from scipy.stats import poisson, norm
import csv

# Read Data
with open('r2.csv', 'r') as infile:
    reader = csv.DictReader(infile)
    data = {}
    for row in reader:
        for header, value in row.items():
            try:
                data[header].append(value)
            except KeyError:
                data[header] = [value]

# To create a list for the following parameters
mu = data['mu']
cs = data['cs']
co = data['co']

# Obtain Order_Number
Order_Number = np.ceil(poisson.ppf(co/(cs+co), mu))
Everything works fine until 'Order_Number' is computed; that line raises the following error:
TypeError: unsupported operand type(s) for /: 'list' and 'list'
How could I change my code to obtain the following table as output:
Factory | Product_Number | Date      | mu  | cs  | co  | Order_Number
--------|----------------|-----------|-----|-----|-----|-------------
A       | 1              | 01APR2017 | 5.6 | 125 | 275 | ?
A       | 1              | 02APR2017 | 4.5 | 200 | 300 | ?
A       | 1              | 03APR2017 | 6.6 | 150 | 250 | ?
A       | 1              | 04APR2017 | 7.5 | 175 | 325 | ?

It looks like the contents of mu, cs and co are lists of strings.
First convert them to floats (note that in Python 3, map returns an iterator, so wrap it in list):
mu = list(map(float, mu))
cs = list(map(float, cs))
co = list(map(float, co))
Then, since you have lists of values, you need to map your np.ceil(poisson.ppf(co/(cs+co), mu)) function over each triple of values from these lists:
Order_Number = list(map(lambda mu_, cs_, co_: np.ceil(poisson.ppf(co_/(cs_+co_), mu_)), mu, cs, co))
Result is as follows,
>>> list(map(lambda mu_, cs_, co_: np.ceil(poisson.ppf(co_/(cs_+co_), mu_)), mu, cs, co))
[7.0, 5.0, 7.0, 8.0]
Hope this helps.
EDIT-1
Code to write the data to a csv file. You may want to look at reading your csv into an OrderedDict so that you don't need to write each column header manually; you can just call data.keys().
# Convert the string elements of each list to float
mu = list(map(float, mu))
cs = list(map(float, cs))
co = list(map(float, co))

# Obtain Order_Number
Order_Number = list(map(lambda mu_, cs_, co_: np.ceil(poisson.ppf(co_/(cs_+co_), mu_)), mu, cs, co))

# Add Order_Number to the data dict
data['Order_Number'] = Order_Number
header = 'Factory', 'Product_Number', 'Date', 'mu', 'cs', 'co', 'Order_Number'

# Write data to csv (on Python 2 open with 'wb'; on Python 3 use 'w' with newline='')
with open("output.csv", 'w', newline='') as resultFile:
    wr = csv.writer(resultFile, quoting=csv.QUOTE_ALL)
    wr.writerow(header)
    z = zip(data['Factory'], data['Product_Number'], data['Date'], data['mu'], data['cs'], data['co'], data['Order_Number'])
    for i in z:
        wr.writerow(i)
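As hinted above, on Python 3.7+ plain dicts preserve insertion order, so the header and the zipped rows can be derived from the dict instead of written out by hand; a small sketch of that shortcut:
# column order matches the csv header plus the appended Order_Number
header = list(data.keys())
rows = zip(*data.values())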

As created,
mu = data['mu']
cs = data['cs']
co = data['co']
are lists of strings. Look at them, or at least a subset, e.g. mu[:10]. You can't do array math with lists:
co/(cs+co)
cs+co will concatenate the two lists (that's the + definition for lists), but / is not defined for lists at all.
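A quick interactive illustration of those operator rules (with shortened, made-up lists):
>>> ['125', '200'] + ['275', '300']   # + joins two lists end to end
['125', '200', '275', '300']
>>> ['275', '300'] / ['125', '200']   # / has no meaning for lists
Traceback (most recent call last):
  ...
TypeError: unsupported operand type(s) for /: 'list' and 'list'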
mu = np.array(data['mu'], dtype=float)
cs = np.array(data['cs'], dtype=float)
co = np.array(data['co'], dtype=float)
might do the trick, converting the lists into 1d numpy arrays.
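With those arrays in place, the original vectorized expression from the question should work as written:
# elementwise division and broadcasting replace any per-row looping
Order_Number = np.ceil(poisson.ppf(co / (cs + co), mu))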
An alternative is to use np.genfromtxt with dtype=None and names=True to load the data into a structured array. But then I'd have to explain how to access the named fields. And unfortunately adding a new field to this array (the calc results) isn't trivial. And writing a new csv from a structured array requires some extra knowledge.
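Still, for the curious, a minimal sketch of that structured-array route (assuming 'r2.csv' is comma-delimited):
import numpy as np
from scipy.stats import poisson

# dtype=None infers each column's type; names=True takes the field
# names from the header row; encoding=None avoids byte strings
arr = np.genfromtxt('r2.csv', delimiter=',', dtype=None, names=True, encoding=None)

# named fields are accessed like dict keys and are already numeric
Order_Number = np.ceil(poisson.ppf(arr['co'] / (arr['cs'] + arr['co']), arr['mu']))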
Try the list to array conversion.

Related

Efficient Way to Build Large Scale Hierarchical Data Tree Path

I have a large dataset (think: big data) of network elements that form a tree-like network.
A toy dataset looks like this:
| id | type | parent_id |
|-----:|:-------|:------------|
| 1 | D | <NA> |
| 2 | C | 1 |
| 3 | C | 2 |
| 4 | C | 3 |
| 5 | B | 3 |
| 6 | B | 4 |
| 7 | A | 4 |
| 8 | A | 5 |
| 9 | A | 3 |
Important rules:
The root nodes (type D in the toy example) and the leaf nodes (type A in the toy example) cannot be connected with each other or amongst themselves. I.e., a D node cannot be connected with another D node (likewise for A nodes), and an A node cannot be directly connected with a D node.
For simplicity reasons, any other node type can randomly be connected in terms of types.
The tree depth can be arbitrarily deep.
The leaf node is always of type A.
A leaf node does not need to be connected through all intermediate nodes. In reality there are only a handful of intermediary nodes that are mandatory to pass through; this circumstance can be neglected for this example.
If you are to recommend doing it in Spark, the solution must be written with pyspark in mind.
What I would like to achieve is to build an efficient way (preferably in Spark) to calculate the tree-path for each node, like so:
| id | type | parent_id | path |
|-----:|:-------|:------------|:--------------------|
| 1 | D | <NA> | D:1 |
| 2 | C | 1 | D:1>C:2 |
| 3 | C | 2 | D:1>C:2>C:3 |
| 4 | C | 3 | D:1>C:2>C:3>C:4 |
| 5 | B | 3 | D:1>C:2>C:3>B:5 |
| 6 | B | 4 | D:1>C:2>C:3>C:4>B:6 |
| 7 | A | 4 | D:1>C:2>C:3>C:4>A:7 |
| 8 | A | 5 | D:1>C:2>C:3>B:5>A:8 |
| 9 | A | 3 | D:1>C:2>C:3>A:9 |
Note:
Each element in the tree path is constructed like this: id:type.
If you have other efficient ways to store and calculate the tree path (e.g., closure tables), I am happy to hear them as well. However, the runtime for calculation must be really low (less than an hour, preferably minutes) and retrieval later needs to be in the area of a few seconds.
The ultimate end goal is to have a data structure that allows me to aggregate any network node underneath a certain node efficiently (runtime of a few seconds at most).
The actual dataset, consisting of around 3M nodes, can be constructed like this:
Note:
The commented-out node_counts produces the toy example shown above.
The distribution of the node elements is close to reality.
import random
import pandas as pd

random.seed(1337)

node_counts = {'A': 1424383, 'B': 596994, 'C': 234745, 'D': 230937, 'E': 210663, 'F': 122859, 'G': 119453, 'H': 57462, 'I': 23260, 'J': 15008, 'K': 10666, 'L': 6943, 'M': 6724, 'N': 2371, 'O': 2005, 'P': 385}
#node_counts = {'A': 3, 'B': 2, 'C': 3, 'D': 1}

elements = list()
candidates = list()
root_type = list(node_counts.keys())[-1]
leaf_type = list(node_counts.keys())[0]
root_counts = node_counts[root_type]
leaves_count = node_counts[leaf_type]
ids = [i + 1 for i in range(sum(node_counts.values()))]
idcounter = 0
for i, (name, count) in enumerate(sorted(node_counts.items(), reverse=True)):
    for _ in range(count):
        _id = ids[idcounter]
        idcounter += 1
        _type = name
        if i == 0:
            _parent = None
        else:
            # select a random one that is not a root or a leaf
            if len(candidates) == 0:  # first bootstrap case
                candidate = random.choice(elements)
            else:
                candidate = random.choice(candidates)
            _parent = candidate['id']
        _obj = {'id': _id, 'type': _type, 'parent_id': _parent}
        #print(_obj)
        elements.append(_obj)
        if _type != root_type and _type != leaf_type:
            candidates.append(_obj)

df = pd.DataFrame.from_dict(elements).astype({'parent_id': 'Int64'})
In order to produce the tree path in pure python with the above toy data you can use the following function:
def get_hierarchy_path(df, cache_dict, ID='id', LABEL='type', PARENT_ID='parent_id', node_sep='|', elem_sep=':'):
    def get_path(record):
        if pd.isna(record[PARENT_ID]):
            return f'{record[LABEL]}{elem_sep}{record[ID]}'
        else:
            if record[PARENT_ID] in cache_dict:
                parent_path = cache_dict[record[PARENT_ID]]
            else:
                try:
                    parent_path = get_path(df.query(f'{ID} == {record[PARENT_ID]}').iloc[0])
                except IndexError as e:
                    print(f'Index Miss for {record[PARENT_ID]} on record {record.to_dict()}')
                    parent_path = f'{record[LABEL]}{elem_sep}{record[ID]}'
                cache_dict[record[PARENT_ID]] = parent_path
            return f"{parent_path}{node_sep}{record[LABEL]}{elem_sep}{record[ID]}"
    return df.apply(get_path, axis=1)

df['path'] = get_hierarchy_path(df, dict(), node_sep='>')
What I have already tried:
Calculating in pure python with the above function on the large dataset takes me around 5.5 hours. So this is not really a solution. Anything quicker than this is appreciated.
Technically, using the spark graphframes package, I could use BFS. This would give me a good solution for individual leaf nodes, but it does not scale to the entire network.
I think Pregel is the way to go here. But I do not know how to construct it in Pyspark.
Thank you for your help.
My current solution for this challenge no longer relies on Spark but on SQL.
I load the whole dataset into a Postgres DB and place a unique index on id, type and parent_id.
Then using the following query, I can calculate the path:
with recursive recursive_hierarchy AS (
    -- starting point
    select
        parent_id
        , id
        , type
        , type || ':' || id as path
        , 1 as lvl
    from hierarchy.nodes

    union all

    -- recursion
    select
        ne.parent_id as parent_id
        , h.id
        , h.type
        , ne.type || ':' || ne.id || '|' || h.path as path
        , h.lvl + 1 as lvl
    from (
        select *
        from hierarchy.nodes
    ) ne
    inner join recursive_hierarchy h
        on ne.id = h.parent_id
), paths as (
    -- complete results
    select *
    from recursive_hierarchy
), max_lvl as (
    -- retrieve the longest path of a network element
    select
        id
        , max(lvl) as max_lvl
    from paths
    group by id
)
-- all results with only the longest path of a network element
select distinct
    p.id
    , p.type
    , p.path
from paths p
inner join max_lvl l
    on p.id = l.id
    and p.lvl = l.max_lvl
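To sketch the retrieval half of the end goal from the question (aggregating everything underneath a node), here is one way to use the materialized path once the query result is pulled into pandas; the '|' separator and root-first ordering follow the path expression above, and the DataFrame literal is just the toy data:
import pandas as pd

# toy query result: id, type and the materialized root-first path
paths = pd.DataFrame({
    'id':   [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'type': ['D', 'C', 'C', 'C', 'B', 'B', 'A', 'A', 'A'],
    'path': ['D:1', 'D:1|C:2', 'D:1|C:2|C:3', 'D:1|C:2|C:3|C:4',
             'D:1|C:2|C:3|B:5', 'D:1|C:2|C:3|C:4|B:6',
             'D:1|C:2|C:3|C:4|A:7', 'D:1|C:2|C:3|B:5|A:8',
             'D:1|C:2|C:3|A:9'],
})

# every descendant of node 3 has node 3's full path plus '|' as a prefix
node3_path = paths.loc[paths['id'] == 3, 'path'].iat[0]
subtree = paths[paths['path'].str.startswith(node3_path + '|')]
print(subtree['id'].tolist())  # [4, 5, 6, 7, 8, 9]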

How should I solve logic error in timestamp using Python?

I have written code to calculate a, b, and c; they are initialized to 0.
This is my input file
-------------------------------------------------------------
| Line | Time | Command | Data |
-------------------------------------------------------------
| 1 | 0015 | ACTIVE | |
| 2 | 0030 | WRITING | |
| 3 | 0100 | WRITING_A | |
| 4 | 0115 | PRECHARGE | |
| 5 | 0120 | REFRESH | |
| 6 | 0150 | ACTIVE | |
| 7 | 0200 | WRITING | |
| 8 | 0314 | PRECHARGE | |
| 9 | 0318 | ACTIVE | |
| 10 | 0345 | WRITING_A | |
| 11 | 0430 | WRITING_A | |
| 12 | 0447 | WRITING | |
| 13 | 0503 | WRITING | |
and the timestamps and commands are used to calculate a, b, and c.
import re

count = {}
timestamps = {}
with open("page_stats.txt", "r") as f:
    for line in f:
        m = re.split(r"\s*\|\s*", line)
        if len(m) > 3 and re.match(r"\d+", m[1]):
            count[m[3]] = count[m[3]] + 1 if m[3] in count else 1
            #print(m[2])
            if m[3] in timestamps:
                timestamps[m[3]].append(m[2])
                #print(m[3], m[2])
            else:
                timestamps[m[3]] = [m[2]]
                #print(m[3], m[2])

a = b = c = 0
for key in count:
    print("%-10s: %2d, %s" % (key, count[key], timestamps[key]))

if timestamps["ACTIVE"] > timestamps["PRECHARGE"]:  # line causing logic error
    a = a + 1
print(a)
Before getting into the calculation, I assign the timestamps with respect to the commands. This is the output for this section.
ACTIVE : 3, ['0015', '0150', '0318']
WRITING : 4, ['0030', '0200', '0447', '0503']
WRITING_A : 3, ['0100', '0345', '0430']
PRECHARGE : 2, ['0115', '0314']
REFRESH : 1, ['0120']
To get a, the timestamp of ACTIVE must be greater than PRECHARGE's, and the timestamp of WRITING must be greater than ACTIVE's. (Lines 4, 6, 7 contribute to the first a; lines 8, 9, and 12 contribute to the second a.)
To get b, the timestamp of WRITING must be greater than ACTIVE's. Lines that already contribute to a (lines 4, 6, 7, 8, 9, and 12) cannot be used to calculate b, so lines 1 and 2 contribute to b.
To get c, the rest of the unused lines containing WRITING contribute to c.
The expected output:
a = 2
b = 1
c = 1
However, when I print a, it displays 0, which shows the logic has an error. Any suggestion to amend my code to achieve the goal? I have tried for a few days and the problem is still not solved.
First, why your check fails: timestamps["ACTIVE"] > timestamps["PRECHARGE"] compares the two lists lexicographically as whole lists rather than matching up individual commands, so a never increments.
I made a function that returns the commands, in order, that match a pattern with gaps allowed. I also made a more compact version of your file reading.
There is probably a better way to divide the list into two parts; the difficulty was to only keep elements that match the whole pattern. In this version I iterate over the elements twice.
import re

commands = list()
with open("page_stats.txt", "r") as f:
    for line in f:
        m = re.split(r"\s*\|\s*", line)
        if len(m) > 3 and re.match(r"\d+", m[1]):
            _, line, time, command, data, _ = m
            commands.append((line, time, command))

def search_pattern(pattern, iterable, key=None):
    iter = 0
    count = 0
    length = len(pattern)
    results = []
    sentinel = object()
    for elem in iterable:
        original_elem = elem
        if key is not None:
            elem = key(elem)
        if elem == pattern[iter]:
            iter += 1
            results.append((original_elem, sentinel))
            if iter >= length:
                iter = iter % length
                count += length
        else:
            results.append((sentinel, original_elem))
    matching = []
    nonmatching = []
    for res in results:
        first, second = res
        if count > 0:
            if second is sentinel:
                matching.append(first)
                count -= 1
            elif first is sentinel:
                nonmatching.append(second)
        else:
            value = first if second is sentinel else second
            nonmatching.append(value)
    return matching, nonmatching

pattern_a = ['PRECHARGE', 'ACTIVE', 'WRITING']
pattern_b = ['ACTIVE', 'WRITING']
pattern_c = ['WRITING']

matching, nonmatching = search_pattern(pattern_a, commands, key=lambda t: t[2])
a = len(matching)//len(pattern_a)
matching, nonmatching = search_pattern(pattern_b, nonmatching, key=lambda t: t[2])
b = len(matching)//len(pattern_b)
matching, nonmatching = search_pattern(pattern_c, nonmatching, key=lambda t: t[2])
c = len(matching)//len(pattern_c)
print(f'{a=}')
print(f'{b=}')
print(f'{c=}')
Output:
a=2
b=1
c=1

Nearest neighbors in a given range

I faced the problem of quickly finding the nearest neighbors in a given range.
Example of dataset:
id | string | float
0 | AA | 0.1
12 | BB | 0.5
2 | CC | 0.3
102| AA | 1.1
33 | AA | 2.8
17 | AA | 0.5
For each line, print the number of other lines satisfying both of the following conditions:
the string field is equal to the current one
the float field is below the current float, but by no more than del (current float - del <= float field < current float)
For this example with del = 1.5:
id  | count
0   | 0
12  | 0
2   | 0
102 | 2 (the string matches rows id=0, 33, 17, but only rows id=0 and 17 satisfy the float condition: 1.1-1.5 <= 0.1 and 1.1-1.5 <= 0.5)
33  | 0 (the string matches rows id=0, 102, 17, but 2.8-1.5 is greater than 0.1, 1.1 and 0.5)
17  | 1
To solve this problem I used the BallTree class with a custom metric, but it runs for a very long time on a large dataset because of the reverse tree walk.
Can someone suggest other solutions, or how to raise the speed of a custom metric to the speed of the built-in metrics from sklearn.neighbors.DistanceMetric?
My code:
from sklearn.neighbors import BallTree

def distance(x, y):
    if x[0] == y[0] and x[1] > y[1]:
        return x[1] - y[1]
    else:
        return x[1] + y[1]

tree2 = BallTree(X, leaf_size=X.shape[0], metric=distance)
# 'del' is a reserved word in Python, so the radius is named delta here
mas = tree2.query_radius(X, r=delta, count_only=True)
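As a sketch of a faster route (assuming, per the worked example above, that a match means the same string and a float in the half-open range [current - del, current)): sort each string group's floats once and count with np.searchsorted, instead of calling a Python-level metric for every pair:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id':     [0, 12, 2, 102, 33, 17],
                   'string': ['AA', 'BB', 'CC', 'AA', 'AA', 'AA'],
                   'float':  [0.1, 0.5, 0.3, 1.1, 2.8, 0.5]})
delta = 1.5  # 'del' is reserved in Python

def count_in_range(group):
    vals = np.sort(group['float'].to_numpy())
    lo = np.searchsorted(vals, group['float'] - delta, side='left')
    hi = np.searchsorted(vals, group['float'], side='left')  # strictly below current
    return pd.Series(hi - lo, index=group.index)

df['count'] = df.groupby('string', group_keys=False).apply(count_in_range)
print(df[['id', 'count']])  # counts: 0, 0, 0, 2, 0, 1, matching the example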

How to change just one column in a txt file leaving all else the same and respecting the whitespaces?

I have a file with a lot of rows. To save space, I am just copy/pasting enough rows to give the shape of my file.
| Martini system from 2b97.pdb |
| 55601 |
| 1ALA BB 1 13.904 5.512 1.259 |
| 12VAL BB 12 4.199 35.292 21.353 |
| 112VAL SCC 113 4.367 5.234 21.445 |
| 1113CYS BB 1114 4.041 4.969 21.220 |
| 11113CYS SCC11115 4.088 14.816 21.041 |
| 19293DEC C55598 19.018 0.828 7.094 |
| 9.05570 9.05570 30.02670 |
I need to add 0.1 units to the last column.
Therefore, my output file should look exactly like this:
| Martini system from 2b97.pdb |
| 55601 |
| 1ALA BB 1 13.904 5.512 1.359 |
| 12VAL BB 12 4.199 35.292 21.453 |
| 112VAL SCC 113 4.367 5.234 21.545 |
| 1113CYS BB 1114 4.041 4.969 21.320 |
| 11113CYS SCC11115 4.088 14.816 21.141 |
| 19293DEC C55598 19.018 0.828 7.194 |
| 9.05570 9.05570 30.02670 |
The most important thing is that my output file should have exactly the same whitespaces, format and dtype. Everything in this file is a string.
If whitespaces, format and dtype are not respected then I cannot use the output file to run in the program I need.
Just in case it matters: I do not need to keep the initial file (although I think this detail is irrelevant).
Thanks for your help.
I have tried, but my problem is that I cannot keep the same shape with Python.
Like another answer, I would use string slicing to get just the final column, and string concatenation to put the line back together again. However, I would use decimal.Decimal for the fixed-point math:
import fileinput
import decimal
import sys

files = ['x.txt']
for line in fileinput.input(files, inplace=True):
    number = line[38:46]
    try:
        number = decimal.Decimal(number)
        number += decimal.Decimal('.1')
        number = '{:8}'.format(number)
        line = line[:38] + number + line[46:]
    except decimal.InvalidOperation:
        pass
    sys.stdout.write(line)
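With inplace=True, fileinput redirects standard output into the file, so the sys.stdout.write(line) calls rewrite x.txt in place; if you do want to keep the original after all, fileinput.input also accepts a backup='.bak' argument.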
Similar to other answers, but here's another option. The string reversal is done so that the replace works right-to-left.
with open('in.txt', 'r') as fi, open('out.txt', 'w') as fo:
    fo.write(fi.readline())
    fo.write(fi.readline())
    for line in fi.readlines():
        try:
            old = line[-10:-3]
            new = '{:>7.3f}'.format(float(old) + 0.1)
            line = line[::-1].replace(old[::-1], new[::-1], 1)[::-1]
        except ValueError as _:
            pass
        finally:
            fo.write(line)
I also used decimal math and string slicing. Here's my version:
from decimal import Decimal, InvalidOperation

def add_zero_point_one(data):
    new_data = []
    for l in data.split('\n'):
        try:
            d = Decimal(l[38:46]) + Decimal('0.1')
            l = l[:38] + str(d).rjust(8) + l[46:]
        except InvalidOperation:
            pass
        new_data.append(l)
    return '\n'.join(new_data)
This worked with the provided sample, but assumes:
that all data in slice 38:46 is column data you want to increment
that the column widths are fixed
Here's my full working sample:
from decimal import Decimal, InvalidOperation

data = '''| Martini system from 2b97.pdb |
| 55601 |
| 1ALA BB 1 13.904 5.512 1.259 |
| 12VAL BB 12 4.199 35.292 21.353 |
| 112VAL SCC 113 4.367 5.234 21.445 |
| 1113CYS BB 1114 4.041 4.969 21.220 |
| 11113CYS SCC11115 4.088 14.816 21.041 |
| 19293DEC C55598 19.018 0.828 7.094 |
| 9.05570 9.05570 30.02670 |'''

def add_zero_point_one(data):
    new_data = []
    for l in data.split('\n'):
        try:
            d = Decimal(l[38:46]) + Decimal('0.1')
            l = l[:38] + str(d).rjust(8) + l[46:]
        except InvalidOperation:
            pass
        new_data.append(l)
    return '\n'.join(new_data)

print(data)
print(add_zero_point_one(data))

How do I put data from a while loop into a table?

Basically I'm estimating pi using polygons. I have a loop which gives me a value for n, ann and bnn before running the loop again. Here is what I have so far:
from math import sqrt  # needed for bnn below and for the 2*sqrt(2) argument

def printPiTable(an, bn, n, k):
    """Prints out a table for values n,2n,...,(2^k)n"""
    u = (2**k)*n
    power = 0
    t = ((2**power)*n)
    while t <= u:
        if power < 1:
            print(t, an, bn)
            power = power + 1
            t = ((2**power)*n)
        else:
            afrac = (1/2)*((1/an)+(1/bn))
            ann = 1/afrac
            bnn = sqrt(ann*bn)
            print(t, ann, bnn)
            an = ann
            bn = bnn
            power = power + 1
            t = ((2**power)*n)
    return
This is what I get if I run it with these values:
>>> printPiTable(4,2*sqrt(2),4,5)
4 4 2.8284271247461903
8 3.3137084989847607 3.0614674589207187
16 3.1825978780745285 3.121445152258053
32 3.1517249074292564 3.1365484905459398
64 3.1441183852459047 3.1403311569547534
128 3.1422236299424577 3.1412772509327733
Instead of printing the values out like this, I want to print them in a nice, neat table. Any help?
Use string formatting. For example,
print('{:<4}{:>20f}{:>20f}'.format(t,ann,bnn))
produces
4               4.000000            2.828427
8               3.313708            3.061467
16              3.182598            3.121445
32              3.151725            3.136548
64              3.144118            3.140331
128             3.142224            3.141277
{:<4} is replaced by t, left-justified, formatted to a string of length 4.
{:>20f} is replaced by ann, right-justified, formatted as a float to a string of length 20.
The full story on the format string syntax is explained in the Python documentation.
To add column headers, just add a print statement like
print('{:<4}{:>20}{:>20}'.format('t','a','b'))
For fancier ascii tables, consider using a package like prettytable:
import prettytable
from math import sqrt  # needed for bnn and the 2*sqrt(2) argument

def printPiTable(an, bn, n, k):
    """Prints out a table for values n,2n,...,(2^k)n"""
    table = prettytable.PrettyTable(['t', 'a', 'b'])
    u = (2**k)*n
    power = 0
    t = ((2**power)*n)
    while t <= u:
        if power < 1:
            table.add_row((t, an, bn))
            power = power + 1
            t = ((2**power)*n)
        else:
            afrac = (1/2)*((1/an)+(1/bn))
            ann = 1/afrac
            bnn = sqrt(ann*bn)
            table.add_row((t, ann, bnn))
            an = ann
            bn = bnn
            power = power + 1
            t = ((2**power)*n)
    print(table)

printPiTable(4, 2*sqrt(2), 4, 5)
yields
+-----+---------------+---------------+
| t | a | b |
+-----+---------------+---------------+
| 4 | 4 | 2.82842712475 |
| 8 | 3.31370849898 | 3.06146745892 |
| 16 | 3.18259787807 | 3.12144515226 |
| 32 | 3.15172490743 | 3.13654849055 |
| 64 | 3.14411838525 | 3.14033115695 |
| 128 | 3.14222362994 | 3.14127725093 |
+-----+---------------+---------------+
Perhaps it is overkill for this sole purpose, but Pandas can make nice tables too, and can export them in other formats, such as HTML.
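For instance, a minimal sketch of that Pandas route, feeding it the same (t, a, b) rows (only the first two rows from the run above are shown):
import pandas as pd

rows = [(4, 4, 2.8284271247461903), (8, 3.3137084989847607, 3.0614674589207187)]
df = pd.DataFrame(rows, columns=['t', 'a', 'b'])
print(df.to_string(index=False))  # plain-text table
html = df.to_html(index=False)    # or export to HTML, as mentioned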
You can use output formatting to make it look pretty. Look here for an example:
http://docs.python.org/release/1.4/tut/node45.html
