Too many values to unpack in multi dictionary - python

I'm importing data from .csvs and creating a lot of data dictionaries. My code is based on someone else's work with a dataset that has substantially fewer columns than mine. I'll show her code, then my modification, and then the error I'm receiving.
Original Code:
capacitya = open('C:/Users/Nafiseh/Desktop/Book chapter-code/arc-s.csv', 'r')
csv_capacitya = csv.reader(capacitya)
mydict_capacitya = {}
for row in csv_capacitya:
    mydict_capacitya[(row[0], row[1], row[2])] = float(row[3])
My modification:
# arc capacity
capacitya = open('C:/Users/Emma/Documents/2021-2022/Thesis/Data/arcs.csv', 'r')
csv_capacitya = csv.reader(capacitya)
mydict_capacitya = {}
for row in csv_capacitya:
    mydict_capacitya[(row[0], row[1], row[2])] = list(row[3:22])
When I run this later segment of code:
# arc capacity
capacitya = open('C:/Users/Emma/Documents/2021-2022/Thesis/Data/arcs.csv', 'r')
csv_capacitya = csv.reader(capacitya)
mydict_capacitya = {}
for row in csv_capacitya:
    mydict_capacitya[(row[0], row[1], row[2])] = list(row[3:22])
#print(mydict_capacitya)
capacityaatt = open('C:/Users/Emma/Documents/2021-2022/Thesis/Data/distarc.csv', 'r')
csv_capacityaatt = csv.reader(capacityaatt)
mydict_capacityaatt = {}
for row in csv_capacityaatt:
    mydict_capacityaatt[(row[0], row[1], row[2])] = float(row[3])

attarc, capacityatt = multidict(mydict_capacityaatt)
attarc = tuplelist(attarc)
arc, capacitya = multidict(mydict_capacitya)
Error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-29-66e3074f2135> in <module>
120 attarc, capacityatt= multidict(mydict_capacityaatt)
121 attarc = tuplelist(attarc)
--> 122 arc, capacitya = multidict(mydict_capacitya)
123
ValueError: too many values to unpack (expected 2)
If it helps, both in the original code and in my modification, columns [0:2] represent [k,i,j]. In the original dataset, column [4] represented the value. In the updated dataset, columns [3:22] represent values on the new index g. That is, column [4] represents values when g = 2, for example.
Thanks!
Edit: Added more relevant segments of code
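For reference, assuming multidict here is gurobipy's: when each key maps to a list, multidict returns the list of keys plus one dict per list element. With 19 value columns (row[3:22]) that is 20 objects, which is why unpacking into two names fails. A minimal sketch of unpacking it (names are illustrative):

# Sketch, assuming gurobipy's multidict: each key maps to a list of 19 values,
# so multidict returns the key list plus 19 separate dicts (20 objects total).
arc, *capacitya_by_g = multidict(mydict_capacitya)  # capacitya_by_g is a list of 19 dicts
arc = tuplelist(arc)
capacity_g2 = capacitya_by_g[1]  # e.g. the dict of values for the second g index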

Related

Strange difference in performance of Pandas, dataframe on small & large scale

I have a dataframe read from a CSV file. I need to generate new data and append it to the end of the old data.
But strangely, I get a totally different result when comparing a small scale with a large scale. I guess it may relate to views, copy(), and chained assignment.
I tried two options using .copy() to avoid potential problems.
First option:
d_jlist = pd.read_csv('127case.csv', sep=',') #got the data shape: (46355,48) from CSV file
d_jlist2 = d_jlist.copy() #Use deep copy, in case of change the raw data
d_jlist3 = pd.DataFrame()
a = np.random.choice(range(5,46350),size = 1000*365) #Select from row 5 to row 46350
for i in a:
    d_jlist3 = d_jlist3.append(d_jlist.iloc[i].copy() + np.random.uniform(-1,1))
d_jlist3 = d_jlist3.replace(0,0.001,regex=True)
d_jlist3 = d_jlist3.round(3)
d_jlist = d_jlist.append(d_jlist3)
a = consumption.columns.values #Something to do with header
a = a[5:53]
d_jlist.to_csv('1127case_1.csv',header = a,index=False)
Second option:
d_jlist = pd.read_csv('127case.csv', sep=',')
d_jlist2 = d_jlist.copy()
d_jlist3 = pd.DataFrame()
a = np.random.choice(range(5,46350),size = 1000*365)
for i in a:
    d_jlist3 = d_jlist3.append(d_jlist2.iloc[i] + np.random.uniform(-1,1))
d_jlist3 = d_jlist3.replace(0,0.001,regex=True)
d_jlist3 = d_jlist3.round(3)
d_jlist = d_jlist.append(d_jlist3)
a = consumption.columns.values #Something to do with header
a = a[5:53]
d_jlist.to_csv('1117case_2.csv',header = a,index=False)
The problem is, if I use this code on a small scale, it works as expected: new rows are added after the old ones, and nothing in the old data changes.
However, at the scale above (1000*365), the old rows get changed.
And the strange thing is: only the first two columns of each row stay unchanged; the rest of the columns of each row all get changed.
The results:
The left one is the old dataframe, with shape (46356, 48). Below it are the newly generated rows.
The right one is the result from option 1 (both options give the same result). From the third column onward, the old data has changed.
If I try either option on a smaller scale (3 rows), it is fine; all the old data is kept.
d_jlist = pd.read_csv('127case.csv', sep=',')
d_jlist = d_jlist.iloc[:10] #Only select 10 rows from old ones
d_jlist2 = d_jlist.copy()
d_jlist3 = pd.DataFrame()
a = np.random.choice(range(5,6),size = 3) #Only select 3 rows randomly from old data
for i in a:
    d_jlist3 = d_jlist3.append(d_jlist2.iloc[i] + np.random.uniform(-1,1))
d_jlist3 = d_jlist3.replace(0,0.001,regex=True)
d_jlist3 = d_jlist3.round(3)
d_jlist = d_jlist.append(d_jlist3)
a = consumption.columns.values #Something to do with header
a = a[5:53]
d_jlist.to_csv('1117case_2.csv',header = a,index=False)
How can I understand this? I have spent a lot of time trying to find an explanation for it but failed.
Do some rules in Pandas change when the scale gets larger (to the 365K level)?
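For what it's worth, here is a sketch of an alternative that builds all the new rows in one step rather than appending row by row (DataFrame.append is deprecated in recent pandas and slow in a loop), and that leaves the original frame untouched. File names and sizes are taken from the snippets above, and only numeric columns are perturbed:

import numpy as np
import pandas as pd

# Read the original data once; it is never modified in place below.
d_jlist = pd.read_csv('127case.csv', sep=',')

idx = np.random.choice(range(5, 46350), size=1000 * 365)   # same sampling as above
noise = np.random.uniform(-1, 1, size=len(idx))             # one noise value per sampled row

# Take all sampled rows at once and give them a fresh, unique index.
new_rows = d_jlist.iloc[idx].reset_index(drop=True)

# Perturb only numeric columns, one noise value per row, as in the loop above.
num_cols = new_rows.select_dtypes('number').columns
new_rows[num_cols] = new_rows[num_cols].add(noise, axis=0)
new_rows = new_rows.replace(0, 0.001).round(3)

combined = pd.concat([d_jlist, new_rows], ignore_index=True)
combined.to_csv('1127case_1.csv', index=False)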

Question regarding index in decision-tree code in Python

I'm building a decision tree following this tutorial and base code:
https://www.youtube.com/watch?v=LDRbO9a6XPU
and https://github.com/random-forests/tutorials/blob/master/decision_tree.py
However, when loading my own dataset into the base code, it throws the following error:
File "main.py", line 245, in find_best_split
values = set([row[col] for row in rows]) # unique values in the column
File "main.py", line 245, in <listcomp>
values = set([row[col] for row in rows]) # unique values in the column
IndexError: list index out of range
I'm not quite sure why this is happening.
The code:
def find_best_split(rows):
    """Find the best question to ask by iterating over every feature / value
    and calculating the information gain."""
    print("All rows in find_best_split are: ", len(rows))
    best_gain = 0  # keep track of the best information gain
    best_question = None  # keep track of the feature / value that produced it
    current_uncertainty = gini(rows)
    n_features = len(rows[0]) - 1  # number of columns

    for col in range(n_features):  # for each feature
        values = set([row[col] for row in rows])  # unique values in the column
        print("Just read the col: ", col)
        print("All the values are: ", len(values))

        for val in values:  # for each value
            question = Question(col, val)

            # try splitting the dataset
            true_rows, false_rows = partition(rows, question)

            # Skip this split if it doesn't divide the dataset.
            if len(true_rows) == 0 or len(false_rows) == 0:
                continue

            # Calculate the information gain from this split
            gain = info_gain(true_rows, false_rows, current_uncertainty)

            # You actually can use '>' instead of '>=' here
            # but I wanted the tree to look a certain way for our
            # toy dataset.
            if gain >= best_gain:
                best_gain, best_question = gain, question

    return best_gain, best_question
I added the prints for clarity. The output is:
Length of all rows in find_best_split are: 200
Just read the col: 0
All the values length are: 200
yet with the basic fruit example the code came with, this didn't happen. I just don't get it. All help is very appreciated!
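A common cause of this IndexError is ragged input: if any row has fewer fields than the first row (for example a trailing blank line or a short row in the CSV), row[col] goes out of range once col exceeds that row's length. A small diagnostic sketch, not part of the tutorial code, that checks for this on the same rows list passed to find_best_split:

# Diagnostic sketch: report any row whose length differs from the first row.
def check_row_lengths(rows):
    expected = len(rows[0])
    for i, row in enumerate(rows):
        if len(row) != expected:
            print("Row", i, "has", len(row), "fields, expected", expected, ":", row)

check_row_lengths(rows)  # run on the same rows you pass to find_best_split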

Reading Text files and Skipping rows if not integer values

I am trying to read values out of a very long text file (2552 lines), putting various columns of the file into different arrays. I want to later use these values to plot a graph from the data in the file. However, not all the rows in a column are valid numbers (e.g. "<1.6" instead of "1.6"), and some of the rows are blank.
Is there a way to skip over the rows which are completely blank or hold non-numeric values, without skipping a slot in my arrays (and hence find out how long my arrays need to be in the first place, to remove excess zeros at the end)?
Here is my code so far:
# Light curve plot
jul_day = np.zeros(2551)
mag = np.zeros(2551)
mag_err = np.zeros(2551)
file = open("total_data.txt")
lines = file.readlines()[1:]
i = 0
for line in lines:
    fields = line.split(",")
    jul_day[i] = float(fields[0])
    mag[i] = float(fields[1])
    mag_err[i] = float(fields[2])
    i = i + 1
Here is an example of an error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-21-d091536c6666> in <module>()
18 fields = line.split(",")
19 jul_day[i] = float(fields[0])
---> 20 mag[i] = float(fields[1])
21 #mag_err[i] = float(fields[2])
22
ValueError: could not convert string to float: '<1.6'
I've found that isinstance is good for discerning types.
You could insert the logic to determine if a value is, indeed, an integer prior to casting it. For instance:
if isinstance(fields[1], int):
    mag[i] = float(fields[1])
Use isinstance() to make sure the type is an int:
for line in lines:
    fields = line.split(",")
    jul_day[i] = float(fields[0])
    if isinstance(fields[1], int):
        mag[i] = float(fields[1])
    if isinstance(fields[2], int):
        mag_err[i] = float(fields[2])
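One caveat with the isinstance approach: str.split() always returns strings, so isinstance(fields[1], int) is never True and those assignments will simply never run. A sketch of a try/except variant that skips blank lines and values like '<1.6', and sizes the arrays automatically by collecting into lists first (file name taken from the question):

# Sketch: collect valid rows into lists, then convert to arrays at the end.
import numpy as np

jul_day, mag, mag_err = [], [], []
with open("total_data.txt") as file:
    for line in file.readlines()[1:]:        # skip the header line
        fields = line.split(",")
        try:
            j, m, e = float(fields[0]), float(fields[1]), float(fields[2])
        except (ValueError, IndexError):
            continue                          # skip blanks and values like '<1.6'
        jul_day.append(j)
        mag.append(m)
        mag_err.append(e)

jul_day = np.array(jul_day)
mag = np.array(mag)
mag_err = np.array(mag_err)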

Trouble importing Excel fields into Python via Pandas - index out of bounds error

I'm not sure what happened, but my code worked earlier today; now it won't. I have an Excel spreadsheet of projects I want to import individually and put into lists. However, I'm getting an "IndexError: index 8 is out of bounds for axis 0 with size 8" error, and Google searches have not resolved this for me. Any help is appreciated. I have the following fields in my Excel sheet: id, funding_end, keywords, pi, summaryurl, htmlabstract, abstract, project_num, title. Not sure what I'm missing...
import pandas as pd
dataset = pd.read_excel('new_ahrq_projects_current.xlsx',encoding="ISO-8859-1")
df = pd.DataFrame(dataset)
cols = [0,1,2,3,4,5,6,7,8]
df = df[df.columns[cols]]
tt = df['funding_end'] = df['funding_end'].astype(str)
tt = df.funding_end.tolist()
for t in tt:
    allenddates.append(t)

bb = df['keywords'] = df['keywords'].astype(str)
bb = df.keywords.tolist()
for b in bb:
    allkeywords.append(b)

uu = df['pi'] = df['pi'].astype(str)
uu = df.pi.tolist()
for u in uu:
    allpis.append(u)

vv = df['summaryurl'] = df['summaryurl'].astype(str)
vv = df.summaryurl.tolist()
for v in vv:
    allsummaryurls.append(v)

ww = df['htmlabstract'] = df['htmlabstract'].astype(str)
ww = df.htmlabstract.tolist()
for w in ww:
    allhtmlabstracts.append(w)

xx = df['abstract'] = df['abstract'].astype(str)
xx = df.abstract.tolist()
for x in xx:
    allabstracts.append(x)

yy = df['project_num'] = df['project_num'].astype(str)
yy = df.project_num.tolist()
for y in yy:
    allprojectnums.append(y)

zz = df['title'] = df['title'].astype(str)
zz = df.title.tolist()
for z in zz:
    alltitles.append(z)
"IndexError: index 8 is out of bounds for axis 0 with size 8"
cols = [0,1,2,3,4,5,6,7,8]
should be cols = [0,1,2,3,4,5,6,7].
I think you have 8 columns, but your cols list contains 9 indices.
IndexError: index out of bounds means you're trying to access something beyond its limit or range.
Whenever you load a file such as test.xls, test.csv, or test.xlsx with Pandas, for example:
data_set = pd.read_excel('file_example_XLS_10.xls', encoding="ISO-8859-1")
it is a good idea to check the number of columns in the DataFrame before indexing into it, especially when working with large datasets. e.g.
import pandas as pd
data_set = pd.read_excel('file_example_XLS_10.xls', encoding="ISO-8859-1")
data_frames = pd.DataFrame(data_set)
print("Length of Columns:", len(data_frames.columns))
This will give you the exact number of columns in the spreadsheet. Then you can specify the column indices accordingly:
Length of Columns: 8
cols = [0, 1, 2, 3, 4, 5, 6, 7]
I agree with #Bill CX that it sounds like you're trying to access a column that doesn't exist. Although I cannot reproduce your error, I have some ideas that may help you move forward.
First, double check the shape of your data frame:
import pandas as pd
dataset = pd.read_excel('new_ahrq_projects_current.xlsx',encoding="ISO-8859-1")
df = pd.DataFrame(dataset)
print(df.shape) # print shape of data read in to python
The output should be
(X, 9) # "X" is the number of rows
If the data frame has 8 columns, then df.shape will be (X, 8). This could be why you're getting the error.
Another check for you is to print out the first few rows of your data frame.
print(df.head())
This will let you double-check to see if you have read in the data in the correct form. I'm not sure, but it might be possible that your .xlsx file has 9 columns, but pandas is reading in only 8 of them.
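If the column count turns out not to match, one way to sidestep positional indexing altogether is to select the expected columns by name. This is a sketch using the field names listed in the question; any column missing from the sheet is reported rather than triggering an IndexError:

import pandas as pd

# Sketch: select columns by name instead of by position.
df = pd.read_excel('new_ahrq_projects_current.xlsx')

expected = ['id', 'funding_end', 'keywords', 'pi', 'summaryurl',
            'htmlabstract', 'abstract', 'project_num', 'title']
missing = [c for c in expected if c not in df.columns]
if missing:
    print("Columns not found in the sheet:", missing)

df = df[[c for c in expected if c in df.columns]]
print(df.shape)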

Sorting for duplicates by several columns as unique in Python

I need to find duplicates in my txt file. The file looks like this:
3,3090,21,f,2,3
4,231,22,m,2,3
5,9427,13,f,2,2
6,9942,7,m,2,3
7,6802,33,f,3,2
8,8579,11,f,2,4
9,8598,11,f,2,4
10,16729,23,m,1,1
11,8472,11,f,3,4
12,10976,21,f,3,3
13,2870,21,f,2,3
14,12032,10,f,3,4
15,16999,13,m,2,2
16,570,7,f,2,3
17,8485,11,f,2,4
18,8728,11,f,3,4
19,20861,9,f,2,2
20,19771,34,f,2,2
21,17964,10,f,2,2
There are ~30000 lines like this. Now I need to find duplicates in the second column and save the data to new files without any duplicates. My code is:
def dedupe(data):
    d = []
    for l in lines:
        if l[0] in d:
            d[l[0]] += l[:1]
        else:
            d[l[0]] = l[1]
    return d
#m - male
#f - female
data = open('plec.txt', 'r')
save_m = open('plec_m.txt', 'w')
save_f = open('plec_f.txt', 'w')
lines = data.readlines()[1:]

for line in lines:
    gender = line.strip().split(',')[3]
    if gender is 'f':
        dedupe(line)
        save_f.write(line)
    elif gender is 'm':
        dedupe(line)
        save_m.write(line)
But I'm getting this error:
Traceback (most recent call last):
  File "plec.py", line 88, in <module>
    dedupe(line)
  File "plec.py", line 75, in dedupe
    d[l[0]] = l[1]
TypeError: list indices must be integers, not str
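As a side note on that traceback: dedupe initializes d as a list but then indexes it with a string key, and list indices must be integers; the lookup pattern used here needs a dict. A tiny illustration:

d = []
# d['3090'] = 1    # TypeError: list indices must be integers, not str
d = {}
d['3090'] = 1      # a dict accepts string keys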
EDIT 2018-10-28:
I don't remember exactly what I had to sort in this file; I think the 2nd and 4th columns had to be unique, but I'm not sure now. I did find the wrong part in my code, though, and rebuilt all of it; the rebuilt version below also works.
def dedup(my_list, new_file):
    d = list()
    for single_line in my_list:
        if single_line.split(',')[1] not in [i.split(',')[1] for i in d]:
            d.append(single_line)
    print(len(my_list), len(d))
    new_file.writelines(d)

data = open('plec.txt', 'r').readlines()[1:]  # skip the header line
males = open('m.txt', 'w')
females = open('f.txt', 'w')
males_list = list()
females_list = list()

for line in data:
    gender = line.split(',')[3]
    if gender == 'm':
        males_list.append(line)
    if gender == 'f':
        females_list.append(line)

dedup(males_list, males)
dedup(females_list, females)
You can use Pandas to read your input file and remove the duplicates based on any column you want.
from io import StringIO
import pandas as pd

data = StringIO("""col1,col2,col3,col4,col5,col6
3,3090,21,f,2,3
4,231,22,m,2,3
5,9427,13,f,2,2
6,9942,7,m,2,3
7,6802,33,f,3,2
8,8579,11,f,2,4
9,8598,11,f,2,4
10,16729,23,m,1,1
11,8472,11,f,3,4
12,10976,21,f,3,3
13,2870,21,f,2,3
14,12032,10,f,3,4
15,16999,13,m,2,2
16,570,7,f,2,3
17,8485,11,f,2,4
18,8728,11,f,3,4
19,20861,9,f,2,2
20,19771,34,f,2,2
21,17964,10,f,2,2""")
df = pd.read_csv(data, sep=",")
df = df.drop_duplicates(subset='col2')
df.to_csv("no_dups.txt", index=False)
seen = set()
for row in my_filehandle:
    my_2nd_col = row.split(",")[1]
    if my_2nd_col in seen:
        continue
    output_filehandle.write(row)
    seen.add(my_2nd_col)
is one very verbose way of doing this
OP, I don't know what's wrong with your code, but this solution should fit your requirements, assuming your requirements are:
Filter the file based on the second column
Store male and female entries in separate files
Here's the code:
with open('plec.txt') as file:
    lines = map(lambda line: line.split(','), file.read().split('\n'))  # split the file into lines and the lines by comma

filtered_lines_male = []
filtered_lines_female = []
second_column_set = set()

for line in lines:
    if line[1] not in second_column_set:
        second_column_set.add(line[1])  # add to index set
        if line[3] == 'm':
            filtered_lines_male.append(line)  # add to male list
        else:
            filtered_lines_female.append(line)  # add to female list

filtered_lines_male = '\n'.join([','.join(line) for line in filtered_lines_male])  # apply source formatting
filtered_lines_female = '\n'.join([','.join(line) for line in filtered_lines_female])  # apply source formatting

with open('plec_m.txt', 'w') as male_write_file:
    male_write_file.write(filtered_lines_male)  # write male entries
with open('plec_f.txt', 'w') as female_write_file:
    female_write_file.write(filtered_lines_female)  # write female entries
Please use better variable naming the next time you write code and please make sure your questions are more specific.
