Substitute function for pandas read_table not working - python

I am new to Python!
For certain reasons I cannot use pandas in my environment, so I am writing the pandas read_table() functionality myself, i.e. converting code that uses pandas.read_table. The code using pandas that has to be replaced is as follows:
import pandas as pd
import numpy as np
import scipy.sparse as sp  # lil_matrix lives in scipy.sparse, not the top-level scipy namespace

data_file = pd.read_table(r'records.csv', sep=';', header=None)

id = np.unique(data_file[0])
tags = np.unique(data_file[1])
number_of_rows = len(id)
number_of_columns = len(tags)

words_indices, letter_indices = {}, {}
for i in range(len(tags)):
    words_indices[tags[i]] = i
for i in range(len(id)):
    letter_indices[id[i]] = i

# scipy sparse matrix
Vector = sp.lil_matrix((number_of_rows, number_of_columns))

# adds data into the sparse matrix
for line in data_file.values:
    u, i, r = map(str, line)
    Vector[letter_indices[u], words_indices[i]] = r
The CSV file has about 100 records in this format:
REC000034232657,CRC FIX OE Resubmit,0.0073410, 45
Now I have replaced pandas.read_table as follows, reading directly from a database rather than from the .csv file:
def fetch_table(**kwargs):
    qry = kwargs['qrystr']
    try:
        cursor = conn.cursor()
        cursor.execute(qry)
        all_tuples = cursor.fetchall()
        return all_tuples
    except pyodbc.ProgrammingError as e:
        print("Exception occurred as:", type(e), e)
# pandas alternate code
total_col = 0
count = 0
dict_csv = {}
stmt = "select * from tickets;"
fetched_rows = fetch_table(qrystr=stmt)

for row in fetched_rows:
    total_col = len(row)
    break

for i in range(0, total_col):
    dict_csv[i] = []

for row in fetched_rows:
    for i in range(0, total_col):
        dict_csv[i].append(row[i])
# End of pandas alternate code
# End of pandas alternate code
The rest of the code stays the same as the earlier chunk, except that instead of data_file (returned by pd.read_table()) I am now using dict_csv, so the for loop that adds data to the sparse matrix is changed to:
for line in dict_csv.values:
    u, i, r = map(str, line)
    Vector[letter_indices[u], words_indices[i]] = r
However, I am getting the following TypeError:
Traceback (most recent call last):
File "C:\Python32\my_scripts\ds.py", line 132, in <module>
for line in dict_csv.values:
TypeError: 'builtin_function_or_method' object is not iterable
I understand that dict_csv.values is not returning an iterable list; can anybody please point out what mistake I am making?
Also, the integer 45 is coming back as Decimal(45); how can I get rid of that?
Thanks a lot
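For reference, a minimal sketch of the likely fix, assuming dict_csv maps each column index to a list of column values as built above: dict.values is a method (it must be called, and even then it yields the column lists rather than rows), and pyodbc returns DECIMAL columns as decimal.Decimal objects, which can be cast with int() or float().
# Sketch only: dict.values is a method, and even dict_csv.values() would
# yield the column lists, not rows. zip(*columns) transposes them back.
for line in zip(*(dict_csv[i] for i in range(total_col))):
    u, i, r = line[0], line[1], line[2]
    # pyodbc returns DECIMAL columns as decimal.Decimal; float() (or int())
    # converts them to plain Python numbers, so Decimal(45) becomes 45.0.
    Vector[letter_indices[u], words_indices[i]] = float(r)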

Related

Read file.csv (two columns; x and y) then calculate cumulative moving average of second column

I wanted to read my CSV file first.
https://github.com/hamzaal014/file/blob/main/file.csv
The .csv file contains two columns, X and Y. Here is my script:
import numpy as np
from pandas import DataFrame as df
import csv

origin_data = open("file.csv", "r")
dato = list(csv.reader(origin_data, delimiter=","))
print(dato)

rowcount = 0
# iterating through the whole file
for row in dato:
    rowcount += 1
# printing the result
# print("Number of lines present:-", rowcount)
print(rowcount)

dati = df(dato, columns=['x', 'y'])
window = 6
roll_avg = dati.rolling(window).mean()
roll_avg_cumulative = dati['y'].cumsum()/np.arange(1, 25)
print(roll_avg_cumulative)
But my script is not working. This is the error I get:
Traceback (most recent call last):
File "/home/haz/miniconda39/lib/python3.9/site-packages/pandas/core/ops/array_ops.py", line 163, in _na_arithmetic_op
result = func(left, right)
File "/home/haz/miniconda39/lib/python3.9/site-packages/pandas/core/computation/expressions.py", line 239, in evaluate
return _evaluate(op, op_str, a, b) # type: ignore[misc]
File "/home/haz/miniconda39/lib/python3.9/site-packages/pandas/core/computation/expressions.py", line 128, in _evaluate_numexpr
result = _evaluate_standard(op, op_str, a, b)
File "/home/haz/miniconda39/lib/python3.9/site-packages/pandas/core/computation/expressions.py", line 69, in _evaluate_standard
return op(a, b)
TypeError: unsupported operand type(s) for /: 'str' and 'int'
When reading from a file, you get strings back. This is the source of your problem, since the strings are never converted into numbers. You can fix it with:
dati = df(dato, columns=['x', 'y'], dtype=float)
If it is helpful, I would also like to point out a few things that may improve your code:
you are using pandas as your container for data, so I would suggest using the pandas functions to convert a CSV file to a DataFrame instead of doing it manually (use pandas.read_csv)
the row count can easily be obtained with len, without iterating over all rows
please stick to the more widely used import aliases (import pandas as pd) instead of creating your own; this will make your code more readable to everyone else
So your code can become:
import numpy as np
import pandas as pd
dati = pd.read_csv("file.csv", sep=",", dtype=float, names=["x", "y"])
rowcount = len(dati)
window = 6
roll_avg = dati.rolling(window).mean()
roll_avg_cumulative = dati["y"].cumsum() / np.arange(1, 25)
print(roll_avg_cumulative)
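One caveat worth adding (not in the original answer): np.arange(1, 25) hard-codes 24 rows, so the division breaks if the file length changes. A safer variant derives the divisor from the frame itself:
# Cumulative moving average without hard-coding the row count.
roll_avg_cumulative = dati["y"].cumsum() / np.arange(1, len(dati) + 1)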
What went wrong in your code: all values are loaded as str, which is why the division by an integer array fails.
Simple way
import numpy as np
import pandas as pd
dati = pd.read_csv('file.csv', header=None)
window = 6
roll_avg = dati.rolling(window).mean()
print(dati[1].cumsum())
roll_avg_cumulative = dati[1].cumsum()/np.arange(1, 25)
print(roll_avg_cumulative)

Reading data in exponential format in python (numpy)

I am trying to read the data, but its first column has values in exponential format, which is preventing me from reading the file. Here is a minimal working example of my code, and here is the link to the data file for trying out the code:
import numpy as np

filename = "0 A.dat"
data = np.loadtxt(filename, delimiter=',', skiprows=3)
but I am getting this error
ValueError: could not convert string to float:
You can read them with pandas:
import pandas as pd
data = pd.read_csv(filename, delimiter=',', skiprows=3)
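If you need a plain numpy array instead, here is a sketch that sidesteps the trailing comma, which leaves an empty final field on every data row and is the likely cause of the ValueError; it assumes the five data columns named in the header line:
import numpy as np

# Sketch: the trailing comma on each data row produces an empty final field,
# which np.loadtxt cannot convert to float. Restricting usecols to the five
# populated columns (an assumption based on the header line) avoids it.
data = np.loadtxt("0 A.dat", delimiter=',', skiprows=3, usecols=range(5))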
import numpy as np

def yesfloat(string):
    """True if the given string can be parsed as a float, else False."""
    try:
        float(string)
        return True
    except ValueError:
        return False

data = []
with open('0 A.dat', 'r') as f:
    d = f.readlines()
for i in d:
    k = i.rstrip().split(",")
    data.append([float(i) if yesfloat(i) else i for i in k])
data = np.array(data, dtype='O')
data
I don't know if that is the answer you are looking for, but I tried it with your data and it returned this:
array([list(['% Version 1.00']), list(['%']),
list(['%freq[Hz]\tTrc1_S21[dB]\tTrc2_S21[U]\tTrc3_S21[U]\tTrc4_S21[U]']),
...,
list([9998199819.981998, -22.89936928953151, 0.07161954135843378, -0.0618770495057106, -0.03606368601322174, '']),
list([9999099909.991, -22.91188769540125, 0.07151639513438152, -0.06464007496833801, -0.03059829212725163, '']),
list([10000000000.0, -22.92596306398167, 0.07140059761720122, -0.0669037401676178, -0.02493862248957157, ''])],
dtype=object)

Load data from csv into numpy array

I am trying to load data from a csv file (with delimiter ',') into a numpy array. An example line is: 81905.75578271,81906.6205052,50685.487931,... (1000 columns).
I have this code, but it does not seem to work properly: at the exit of the function the debugger cannot recognize the data, and x_train.shape comes back as 0:
def load_data(path):
    # return np.loadtxt(path, dtype=int, delimiter=',')
    file = open(path, 'r')
    data = []
    for line in file:
        array_vals = line.split(",")
        array = []
        for val in array_vals:
            if not val:
                array.append(float(val))
        data.append(np.asarray(array))
    return np.asarray(data)

x_train = load_data(path)
This should give you your required output.
import numpy as np

def load_data(path):
    return np.loadtxt(path, delimiter=',')
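For completeness, the hand-rolled reader likely returns an empty array because of the inverted test: if not val: appends a value only when it is empty. A minimal corrected sketch, in case np.loadtxt is not an option:
import numpy as np

def load_data(path):
    data = []
    with open(path, 'r') as file:
        for line in file:
            # Keep only non-empty fields; the original `if not val:` did the opposite.
            data.append([float(val) for val in line.split(",") if val.strip()])
    return np.asarray(data)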

Numpy ValueError: setting an array element with a sequence reading in list

I have this code that reads in numbers and is meant to calculate the std and %rms using numpy:
import numpy as np
import glob
import os

values = []
line_number = 6
road = '/Users/allisondavis/Documents/HCl'

for pbpfile in glob.glob(os.path.join(road, 'pbpfile*')):
    lines = open(pbpfile, 'r').readlines()
    while line_number < len(lines):
        variables = lines[line_number].split()
        values.append(variables)
        line_number = line_number + 3

a = np.asarray(values).astype(np.float)
std = np.std(a)
rms = std * 100
print rms
However, I keep getting this error:
Traceback (most recent call last):
File "rmscalc.py", line 17, in <module>
a = np.asarray(values).astype(np.float)
ValueError: setting an array element with a sequence.
Any idea how to fix this? I am new to python/numpy. If I print my values it looks something like this:
[[1,2,3,4],[2,4,5,6],[1,3,5,6]]
I can think of a modification to your code which can potentially fix your problem:
Initialize values as an empty 2-D numpy array, and use numpy append or concatenate (assuming four numbers per line, as in your sample):
values = np.empty((0, 4), dtype=float)
Then inside the loop, convert each split line to floats before appending:
variables = np.array(lines[line_number].split(), dtype=float)
values = np.append(values, [variables], axis=0)
# or
values = np.concatenate((values, [variables]), axis=0)
Alternatively, if your files are .csv (or any other type pandas can read):
import pandas as pd

# Replace `read_csv` with your appropriate file reader
a = pd.concat([pd.read_csv(pbpfile)
               for pbpfile in glob.glob(os.path.join(road, 'pbpfile*'))]).values
# or
a = np.concatenate([pd.read_csv(pbpfile).values
                    for pbpfile in glob.glob(os.path.join(road, 'pbpfile*'))], axis=0)

How to filter out data into unique pandas dataframes from a combined csv of multiple datatypes?

Sample csv
time,type,-1,
time,type,0,w
time,type,1,a,12,b,13,c,15,name,apple
time,type,5,r,2,s,43,t,45,u,67,style,blue,font,13
time,type,11,a,12,c,15
time,type,5,r,2,s,43,t,45,u,67,style,green,font,15
time,type,1,a,12,b,13,c,15,name,apple
time,type,11,a,12,c,15
time,type,5,r,2,s,43,t,45,u,67,style,green,font,15
time,type,1,a,12,b,13,c,15,name,apple
time,type,5,r,2,s,43,t,45,u,67,style,yellow,font,9
time,type,19,b,12
type,19,b,42
I would like to filter each of "type,1", "type,5", "type,11" and "type,19" into a separate pandas frame for further analysis. What's the best way to do it? [Also, I will be ignoring "type,0" and "type,-1".]
Sample Code
import pandas as pd
type1_header = ['type','a','b','c','name']
type5_header = ['type','r','s','t','u','style','font']
type11_header = ['type','a','c']
type19_header = ['type','b']
type1_data = pd.read_csv(file_path_to_csv, usecols=[2,4,6,8,10] , names=type1_header)
type5_data = pd.read_csv(file_path_to_csv, usecols=[2,4,6,8,10,12,14] , names=type5_header)
import pandas as pd

headers = {1: ['a', 'b', 'c', 'name'],
           5: ['r', 's', 't', 'u', 'style', 'font'],
           }
usecols = {1: [4, 6, 8, 10],
           5: [4, 6, 8, 10, 12, 14],
           }

frames = {}
for h in headers:
    frames[h] = pd.DataFrame(columns=headers[h])

count = 0
for line in open('irreg.csv'):
    row = line.split(',')
    count += 1
    # guard against malformed lines (the last sample row has no time field)
    try:
        ID = int(row[2])
    except ValueError:
        print('WARNING: line %d: malformed row %r' % (count, line))
        continue
    row_subset = []
    if ID in frames:
        for col in usecols[ID]:
            row_subset.append(row[col])
        frames[ID].loc[len(frames[ID])] = row_subset
    else:
        print('WARNING: line %d: type %s not found' % (count, row[2]))
That done, though: how often do you do this, and how often does the data change? For a one-off it's probably easiest to split up the incoming csv file, e.g. with
grep 'type,19' irreg.csv > 19.csv
at the command line, and then import each csv according to its headers and usecols.
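As a further alternative sketch (not from the answer above), pandas itself can do the split, assuming the type code always sits in the third field and read_csv is given enough column slots for the widest row; the malformed last line still needs handling, here by coercing its type slot to NaN:
import pandas as pd

# Sketch: read the ragged CSV with enough column slots; short rows pad with NaN.
raw = pd.read_csv('irreg.csv', header=None, names=range(15))

# The type code sits in the third field; coerce it to numeric so the
# malformed last line (whose code slot holds 'b') drops out as NaN.
raw[2] = pd.to_numeric(raw[2], errors='coerce')

# One frame per type code, dropping the all-NaN padding columns.
frames = {t: g.dropna(axis=1, how='all') for t, g in raw.groupby(2)}
type5_data = frames.get(5)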
