Reading text files and skipping rows with non-integer values - Python

I am trying to read values from a very long text file (2,552 lines), placing various columns of the file into different arrays. I want to later use these values to plot a graph of the data. However, not all the rows in a column are numeric (e.g. "<1.6" instead of "1.6"), and some of the rows are blank.
Is there a way to skip the rows that are blank or hold non-numeric values, without skipping a slot in my arrays? (And hence find out how long my arrays need to be in the first place, to avoid excess zeros at the end.)
Here is my code so far:
# Light curve plot
jul_day = np.zeros(2551)
mag = np.zeros(2551)
mag_err = np.zeros(2551)
file = open("total_data.txt")
lines = file.readlines()[1:]
i = 0
for line in lines:
    fields = line.split(",")
    jul_day[i] = float(fields[0])
    mag[i] = float(fields[1])
    mag_err[i] = float(fields[2])
    i = i + 1
Here is an example of an error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-21-d091536c6666> in <module>()
18 fields = line.split(",")
19 jul_day[i] = float(fields[0])
---> 20 mag[i] = float(fields[1])
21 #mag_err[i] = float(fields[2])
22
ValueError: could not convert string to float: '<1.6'

isinstance is good for discerning types, but it won't help here: line.split(",") always returns strings, so isinstance(fields[1], int) is always False. Instead, attempt the conversion and catch the failure:
try:
    mag[i] = float(fields[1])
except ValueError:
    pass  # skip non-numeric values such as "<1.6"

Note that the fields produced by split() are always strings, so an isinstance(fields[1], int) check can never succeed. Guard each conversion with try/except instead:
for line in lines:
    fields = line.split(",")
    try:
        jul_day[i] = float(fields[0])
        mag[i] = float(fields[1])
        mag_err[i] = float(fields[2])
    except (ValueError, IndexError):
        continue  # skip blank or non-numeric rows
    i = i + 1
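Since the question also asks how to avoid preallocating arrays of a guessed length, here is a minimal sketch that collects the valid rows in lists and converts them to numpy arrays at the end (assuming a comma-separated total_data.txt with one header line, as in the question):

```python
import numpy as np

def load_light_curve(lines):
    """Parse comma-separated rows, skipping blank or non-numeric ones."""
    jul_day, mag, mag_err = [], [], []
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) < 3:
            continue  # blank or incomplete row
        try:
            day, m, err = float(fields[0]), float(fields[1]), float(fields[2])
        except ValueError:
            continue  # non-numeric value such as "<1.6"
        jul_day.append(day)
        mag.append(m)
        mag_err.append(err)
    return np.array(jul_day), np.array(mag), np.array(mag_err)
```

Called as jul_day, mag, mag_err = load_light_curve(open("total_data.txt").readlines()[1:]), the arrays come out exactly as long as the number of valid rows, with no trailing zeros to trim.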

Related

Too many values to unpack in multi dictionary

I'm importing data from .csv files and creating a lot of data dictionaries. My code is based on someone else's work with a dataset that has substantially fewer columns than mine. I'll show her code, then mine, then the error I'm receiving.
Original Code:
capacitya = open('C:/Users/Nafiseh/Desktop/Book chapter-code/arc-s.csv', 'r')
csv_capacitya = csv.reader(capacitya)
mydict_capacitya = {}
for row in csv_capacitya:
    mydict_capacitya[(row[0], row[1], row[2])] = float(row[3])
My modification:
# arc capacity
capacitya = open('C:/Users/Emma/Documents/2021-2022/Thesis/Data/arcs.csv', 'r')
csv_capacitya = csv.reader(capacitya)
mydict_capacitya = {}
for row in csv_capacitya:
    mydict_capacitya[(row[0], row[1], row[2])] = list(row[3:22])
When I run this later segment of code:
# arc capacity
capacitya = open('C:/Users/Emma/Documents/2021-2022/Thesis/Data/arcs.csv', 'r')
csv_capacitya = csv.reader(capacitya)
mydict_capacitya = {}
for row in csv_capacitya:
    mydict_capacitya[(row[0], row[1], row[2])] = list(row[3:22])
#print(mydict_capacitya)
capacityaatt = open('C:/Users/Emma/Documents/2021-2022/Thesis/Data/distarc.csv', 'r')
csv_capacityaatt = csv.reader(capacityaatt)
mydict_capacityaatt = {}
for row in csv_capacityaatt:
    mydict_capacityaatt[(row[0], row[1], row[2])] = float(row[3])
attarc, capacityatt = multidict(mydict_capacityaatt)
attarc = tuplelist(attarc)
arc, capacitya = multidict(mydict_capacitya)
Error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-29-66e3074f2135> in <module>
120 attarc, capacityatt= multidict(mydict_capacityaatt)
121 attarc = tuplelist(attarc)
--> 122 arc, capacitya = multidict(mydict_capacitya)
123
ValueError: too many values to unpack (expected 2)
If it helps, both in the original code and in my modification, columns [0:2] represent [k,i,j]. In the original dataset, column [4] represented the value. In the updated dataset, columns [3:22] represent values on the new index g. That is, column [4] represents values when g = 2, for example.
Thanks!
Edit: Added more relevant segments of code
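Assuming multidict here is gurobipy's: when every value in the dict is a list of length n, multidict splits it into n separate dicts and returns n + 1 items (the key list plus one dict per list position). With 19-element lists it returns 20 items, which is why unpacking into just arc, capacitya raises "too many values to unpack (expected 2)". A pure-Python sketch of that behavior (multidict_sketch is a hypothetical stand-in, not gurobipy's actual implementation):

```python
def multidict_sketch(data):
    """Mimic gurobipy.multidict for list values: a dict whose values are
    lists of length n is split into n dicts; returns (keys, d1, ..., dn)."""
    keys = list(data)
    n = len(next(iter(data.values())))
    split = [{} for _ in range(n)]
    for key, values in data.items():
        for i, v in enumerate(values):
            split[i][key] = v
    return (keys, *split)

# Two-element value lists already return three items, not two:
keys, cap_g1, cap_g2 = multidict_sketch({("k", "i", "j"): [10, 20]})
```

To keep each 19-element list intact as a single value, skip multidict for this dict and use mydict_capacitya directly, or capture every returned dict with arc, *capacity_by_g = multidict(mydict_capacitya).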

How to loop through different Dataframes to check in pandas whether any value in a given column is different than 0 and save the result in an array

The goal of this pandas code is to loop through several DataFrames (results of SQL queries), check for every column of each DataFrame whether it contains any value different from 0, and, based on that, append the column name to one of two lists (ready_data or pending_data) for each DataFrame.
The code is as follows:
#4). We will execute all the queries and change NAN for 0 so as to be able to track whether data is available or not
SQL_Queries = ('dutch_query', 'Fix_Int_Period_query', 'Zinsen_Port_query')
Dataframes = ('dutch', 'Fix_Int_Period', 'Zinsen_Port')
Clean_Dataframes = ('clean_dutch', 'clean_Fix_Int_Period', 'clean_Zinsen_Port')
dutch = pd.read_sql(dutch_query.format(ultimo=report_date), engine)
clean_dutch = dutch.fillna(0)
Fix_Int_Period = pd.read_sql(Fix_Int_Period_query.format(ultimo=report_date), engine)
clean_Fix_Int_Period = Fix_Int_Period.fillna(0)
Zinsen_Port = pd.read_sql(Zinsen_Port_query.format(ultimo=report_date), engine)
clean_Zinsen_Port = Zinsen_Port.fillna(0)
#5). We will check whether all data is available by looping through the columns and checking whether values are different than 0
dutch_ready_data = []
dutch_pending_data = []
Fix_Int_Period_ready_data = []
Fix_Int_Period_pending_data = []
Zinsen_Port_ready_data = []
Zinsen_Port_pending_data = []
for df in Dataframes:
    for cdf in Clean_Dataframes:
        for column in cdf:
            if (((str(cdf)+[column]) != 0).any()) == False:
                (str((str(df))+str('_pending_data'))).append([column])
            else:
                (str((str(df))+str('_ready_data'))).append([column])
The error message I keep getting is:
TypeError Traceback (most recent call last)
<ipython-input-70-fa18d45f0070> in <module>
13 for cdf in Clean_Dataframes:
14 for column in cdf:
---> 15 if (((str(cdf)+[column]) != 0).any()) == False:
16 (str((str(df))+str('_pending_data'))).append([column])
17 else:
TypeError: can only concatenate str (not "list") to str
It would be much appreciated if someone could help me out.
Thousand thanks!
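The inner loops iterate over the name strings ('clean_dutch', ...), not the DataFrames themselves, which is why str(cdf) + [column] ends up concatenating a str and a list. A sketch of the intended check, keeping the cleaned DataFrames in a dict keyed by name (the names come from the question; the function name is my own):

```python
import pandas as pd

def split_ready_pending(clean_frames):
    """For each named DataFrame, collect columns with any non-zero value
    (ready) and columns that are all zeros (pending)."""
    ready, pending = {}, {}
    for name, df in clean_frames.items():
        ready[name] = [c for c in df.columns if (df[c] != 0).any()]
        pending[name] = [c for c in df.columns if not (df[c] != 0).any()]
    return ready, pending
```

With clean_frames = {'dutch': clean_dutch, 'Fix_Int_Period': clean_Fix_Int_Period, 'Zinsen_Port': clean_Zinsen_Port}, ready['dutch'] then plays the role of dutch_ready_data, with no string-based variable lookup needed.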

Question regarding index in decision-tree code in Python

I'm building a decision tree following this tutorial and base code:
https://www.youtube.com/watch?v=LDRbO9a6XPU
and https://github.com/random-forests/tutorials/blob/master/decision_tree.py
However, when loading my own datasets into the base code, it throws the following error:
File "main.py", line 245, in find_best_split
values = set([row[col] for row in rows]) # unique values in the column
File "main.py", line 245, in <listcomp>
values = set([row[col] for row in rows]) # unique values in the column
IndexError: list index out of range
I'm not quite sure why this is happening.
The code:
def find_best_split(rows):
    """Find the best question to ask by iterating over every feature / value
    and calculating the information gain."""
    print("All rows in find_best_split are: ", len(rows))
    best_gain = 0  # keep track of the best information gain
    best_question = None  # keep track of the feature / value that produced it
    current_uncertainty = gini(rows)
    n_features = len(rows[0]) - 1  # number of columns
    for col in range(n_features):  # for each feature
        values = set([row[col] for row in rows])  # unique values in the column
        print("Just read the col: ", col)
        print("All the values are: ", len(values))
        for val in values:  # for each value
            question = Question(col, val)
            # try splitting the dataset
            true_rows, false_rows = partition(rows, question)
            # Skip this split if it doesn't divide the dataset.
            if len(true_rows) == 0 or len(false_rows) == 0:
                continue
            # Calculate the information gain from this split
            gain = info_gain(true_rows, false_rows, current_uncertainty)
            # You actually can use '>' instead of '>=' here
            # but I wanted the tree to look a certain way for our
            # toy dataset.
            if gain >= best_gain:
                best_gain, best_question = gain, question
    return best_gain, best_question
I added the prints for clarity, it prints:
Length of all rows in find_best_split are: 200
Just read the col: 0
All the values length are: 200
Yet with the basic fruit example the tutorial came with, this didn't happen. I just don't get it. All help is very appreciated!
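The IndexError suggests the loaded rows are ragged: some row is shorter than rows[0], so row[col] runs past its end (a trailing blank line or a record with missing fields is a common cause). A small check worth running on the data before calling find_best_split:

```python
def ragged_row_indices(rows):
    """Return indices of rows whose length differs from the first row's."""
    expected = len(rows[0])
    return [i for i, row in enumerate(rows) if len(row) != expected]
```

If it reports any indices, either fix the loader or drop those rows, e.g. rows = [r for r in rows if len(r) == len(rows[0])].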

I have a dataset containing both strings and integers; how do I write a program that will read only the integer values in Python?

Need it to only read the integer values, and not the strings.
This is an example of a line in the text file:
yye5 mooProject No yeetcity Nrn de 0 .1 .5 0
We want to skip the first 5 columns (Nrn de is one column) and put every line in the file (which looks like this) into a numpy or pandas array.
Try/Except block is your friend.
x = ('yye5', 'mooProject', 'No', 'yeetcity', 'Nrn', 'de', '0', '.1', '.5', '0')
result = []
for i in x:
    try:
        result.append(float(i))
    except ValueError:
        pass
print(result)
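Applied to the whole file, the same try/except idea builds a 2-D numpy array from just the numeric tokens of each whitespace-separated line, as in the example line above:

```python
import numpy as np

def numeric_rows(lines):
    """Keep only the tokens of each line that parse as floats."""
    rows = []
    for line in lines:
        row = []
        for token in line.split():
            try:
                row.append(float(token))
            except ValueError:
                pass  # skip string columns like 'yye5' or 'Nrn'
        if row:
            rows.append(row)
    return np.array(rows)
```

This assumes every line yields the same number of numeric values, so the rows stack into a rectangular array; pass it open("total_data.txt").readlines() or any other iterable of lines.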

ValueError: could not convert string to float, values in CSV can't be converted to floats

I am trying to do some mathematical operations on the values of a column fetched from a CSV file. For that I wrote the code given below:
rows = csv.reader(open('sample_data_ml.csv', 'r'))
newrows = []
selling_price = []
count = 0
Y_pred = np.asarray(Y_pred, dtype='float64')
for margin in Y_pred:
    for row in rows:
        if count == 0:
            count = count + 1
        else:
            #print(row[7])
            sell = float(row[7]) + margin*float(row[7])
            selling_price.append(sell)
print(selling_price)
I am getting this error :
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-29-d6009e8dad12> in <module>()
16 #row[7] = float(row[7])
17
---> 18 sell = float(row[7]) + margin*float(row[7])
19 selling_price.append(sell)
20
ValueError: could not convert string to float:
The problem is likely with the values of row[7]. How can I overcome it?
Edit:
The row[7] in csv looks like this (some sample values):
After adding a try/except block as suggested, I am getting all values of the column as the output of the except block.
[array([312.81321038]), array([223.43800741]), array([1489.58671609]), array([49.34255997]), array([726.17352409]), array([2583.50196071]), array([116.37396219]), array([395.67147146]), array([27.92975093]), array([260.67767531]), array([1117.19003706]), array([1024.09086731]), array([884.44211268]), array([325.84709414]), array([186.19833951]), array([316.53717717]), array([43.75660979]), array([605.14460341]), array([5492.85101557]), array([65.16941883]), array([3798.44612602]), array([884.44211268]), array([1210.28920682]), array([726.17352409]), array([625.62642076]), array([698.24377317]), array([204.81817346]), array([1396.48754633]), array([325.84709414]), array([1815.43381023]....)
It seems all the values in that column trigger the problem. How should I proceed?
Put it in a try/except:
try:
    sell = float(row[7]) + margin*float(row[7])
    selling_price.append(sell)
except ValueError:
    # report the error in some way that is helpful -- maybe print
    print(row[7])
    row[7] = 0  # just to be safe
Possible solutions: wrap your code in a try/except block and handle the error in the except clause. In the except block you can
extract only the numbers from the string and convert them to float,
just skip the row,
default the value to 0 when the string cannot be converted to float, or
log the failed rows.
try:
    sell = float(row[7]) + margin*float(row[7])
except ValueError:
    pass  # handle here: skip, default, or log
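Beyond the empty row[7] cells, note that rows is a csv.reader over an open file: the inner for row in rows loop exhausts it on the first margin, so every later margin sees no rows at all. A sketch that reads the file once and applies a default for unconvertible cells (the column index, the header skip, and the default of 0 are assumptions based on the question):

```python
import csv

def selling_prices(csv_path, margins, price_col=7, default=0.0):
    """Compute price + margin*price for every margin and row."""
    with open(csv_path, newline="") as f:
        data = list(csv.reader(f))[1:]  # materialise rows once, skip header
    prices = []
    for row in data:
        try:
            prices.append(float(row[price_col]))
        except (ValueError, IndexError):
            prices.append(default)  # empty or malformed cell
    return [p + m * p for m in margins for p in prices]
```

Because the rows are stored in a list, every margin in Y_pred is applied to every row, instead of only the first margin seeing data.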
