Python, convert all entries of list from string to float

I am brand new to Python and looking up examples for what I want to do. I am not sure what is wrong with this loop. What I would like to do is read a CSV file line by line and, for each line:
Split by comma
Remove the first entry (which is a name) and store it as name
Convert all other entries to floats
Store name and the float entries in my Community class
This is what I am trying at the moment:
class Community:
    num = 0

    def __init__(self, inName, inVertices):
        self.name = inName
        self.vertices = inVertices
        Community.num += 1
allCommunities = []

f = open("communityAreas.csv")

for i, line in enumerate(f):
    entries = line.split(',')
    name = entries.pop(0)
    for j, vertex in entries: entries[j] = float(vertex)
    print name+", "+entries[0]+", "+str(type(entries[0]))
    allCommunities.append(Community(name, entries))

f.close()
The error I am getting is:
>>>>> PYTHON ERROR!!! Traceback (most recent call last):
File "alexChicago.py", line 86, in <module>
for j, vertex in entries: entries[j] = float(vertex)
ValueError: too many values to unpack
It may be worth pointing out that this is running in omegalib, a library for a visual cluster that runs in C and interprets Python.

I think you forgot the enumerate() function on line 86; should be
for j, vertex in enumerate(entries): entries[j] = float(vertex)
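With that change in place, a minimal sketch of the corrected read loop (keeping the Python 2 print statement from the question; note that once the entries are floats, entries[0] also needs str() before string concatenation):
    allCommunities = []
    f = open("communityAreas.csv")
    for i, line in enumerate(f):
        entries = line.split(',')
        name = entries.pop(0)
        for j, vertex in enumerate(entries):
            entries[j] = float(vertex)
        print name + ", " + str(entries[0]) + ", " + str(type(entries[0]))
        allCommunities.append(Community(name, entries))
    f.close()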

If there's always a name and then a variable number of float values, it sounds like you need to split twice: the first time with a maxsplit of 1, and the other as many times as possible. Example:
name, float_values = line.split(',',1)
float_values = [float(x) for x in float_values.split(',')]

I may not be absolutely certain about what you want to achieve here, but if you just want to convert all the elements in entries to float, shouldn't this be sufficient on line 86?
entries=map(float, entries)
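Note that in Python 3, map() returns a lazy iterator rather than a list, so if a real list is needed this becomes:
    entries = list(map(float, entries))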

Pandas dataframe not returning the index using the loc method

I'm trying to retrieve the index of a row within a dataframe using the loc method and a comparison of data from another dataframe within a for loop. Maybe I'm going about this wrong, I dunno. Here's a bit of information to help give the problem some context...
The following function imports some inventory data into a pandas dataframe from an xlsx file; this seemingly works just fine:
def import_inventory():
    import warnings
    try:
        with warnings.catch_warnings(record=True):
            warnings.simplefilter("always")
            return pandas.read_excel(config_data["inventory_file"], header=1)
    except Exception as E:
        writelog.error(E)
        sys.exit(E)
The following function imports some data from a combination of CSV files, creating a singular dataframe to work from during comparison; this seemingly works just fine:
def get_report_results():
    output_dir = f"{config_data['output_path']}/reports"
    report_ids = []
    ......
    ...execute and download the report csv files
    ......
    reports_content = []
    for path, current_directory, files in os.walk(output_dir):
        for file in files:
            file_path = os.path.join(path, file)
            # cleans up the CSV content (removes blank rows and unnecessary
            # footer data); updates the same file upon successful completion
            clean_csv_data(file_path)
            current_file_content = pandas.read_csv(file_path, index_col=None, header=7)
            reports_content.append(current_file_content)
    reports_content = pandas.concat(reports_content, axis=0, ignore_index=True)
    return reports_content
The problem occurs in the following function, which is supposed to search the reports content for the existence of an ID value and then grab that row's index so I can use it later to modify some columns and add others.
def search_reports(inventory_df, reports_df):
    for index, row in inventory_df.iterrows():
        reports_index = reports_df.loc[reports_df["Inventory ID"] == row["Inv ID"]].index[0]
        print(reports_df.iloc[reports_index]["Lookup ID"])
Here's the error I receive upon comparison
Length of values (1) does not match length of index (4729)
I can't quite figure out why this is happening. If I pull everything out of functions the work seems to happen the way it should. Any ideas?
There's a bit more work happening to the dataframe that comes from import_inventory, but didn't want to clutter the question. It's nothing major - one function adds a few columns that splits out a comma-separated value in the inventory into its own columns, another adds a column based on the contents of another column.
Edit:
As requested, the full stack trace is below. I've also included the other functions that operate on the original inventory_df object between its retrieval (import_inventory) and its final comparison (search_reports).
This function again operates on the inventory_df dataframe, only this time it retrieves a single column from each row (if it has data) and breaks the semicolon-separated list of key-value pair tags apart for further inspection. If it finds a known key, it creates the necessary column for it and populates that row with the found value.
def sort_tags(inventory_df):
    cluster_key = "Cluster:"
    nodetype_key = "NodeType:"
    project_key = "project:"
    tags = inventory_df["Tags List"]
    for index, tag in tags.items():  # iterate the "Tags List" column, not the whole dataframe
        if not pandas.isna(tag):
            tag_keysvalues = tag.split(";")
            if any(cluster_key in string for string in tag_keysvalues):
                pair = [x for x in tag_keysvalues if x.startswith(cluster_key)]
                key_value_split = pair[0].split(":")
                inventory_df.loc[index, "Cluster Name"] = key_value_split[1]
            if any(nodetype_key in string for string in tag_keysvalues):
                pair = [x for x in tag_keysvalues if x.startswith(nodetype_key)]
                key_value_split = pair[0].split(":")
                inventory_df.loc[index, "Node Type"] = key_value_split[1]
            if any(project_key in string for string in tag_keysvalues):
                pair = [x for x in tag_keysvalues if x.startswith(project_key)]
                key_value_split = pair[0].split(":")
                inventory_df.loc[index, "Project Name"] = key_value_split[1]
    return inventory_df
This function compares the new inventory DF with a CSV import-to-DF of the old inventory. It creates new columns based on old inventory data if it finds a match. I know this is ugly code, but I'm hoping to replace it when I can find a solution to my current problem.
def compare_inventories(old_inventory_df, inventory_df):
    aws_rowcount = len(inventory_df)
    now = parser.parse(datetime.utcnow().isoformat()).replace(tzinfo=timezone.utc).astimezone(tz=None)
    for a_index, a_row in inventory_df.iterrows():
        if a_row["Comments"] != "none":
            for o_index, o_row in old_inventory_df.iterrows():
                last_checkin = parser.parse(str(o_row["last_checkin"])).replace(tzinfo=timezone.utc).astimezone(tz=None)
                if (a_row["Comments"] == o_row["asset_name"]) and ((now - timedelta(days=30)) <= last_checkin):
                    inventory_df.loc[a_index, ["Found in OldInv", "OldInv Address", "OldInv Asset ID", "Inv ID"]] = ["true", o_row["address"], o_row["asset_id"], o_row["host_id"]]
    return inventory_df
Here's the stack trace for the error:
Traceback (most recent call last):
File "c:\Users\beefcake-quad\Code\INVENTORYAssetSnapshot\main.py", line 52, in main
reports_index = reports_df.loc[reports_df["Inventory ID"] == row["Inv ID"]].index
File "c:\Users\beefcake-quad\Code\INVENTORYAssetSnapshot\.venv\lib\site-packages\pandas\core\ops\common.py", line 70, in new_method
return method(self, other)
File "c:\Users\beefcake-quad\Code\INVENTORYAssetSnapshot\.venv\lib\site-packages\pandas\core\arraylike.py", line 40, in __eq__
return self._cmp_method(other, operator.eq)
File "c:\Users\beefcake-quad\Code\INVENTORYAssetSnapshot\.venv\lib\site-packages\pandas\core\series.py", line 5625, in _cmp_method
return self._construct_result(res_values, name=res_name)
File "c:\Users\beefcake-quad\Code\INVENTORYAssetSnapshot\.venv\lib\site-packages\pandas\core\series.py", line 3017, in _construct_result
out = self._constructor(result, index=self.index)
File "c:\Users\beefcake-quad\Code\INVENTORYAssetSnapshot\.venv\lib\site-packages\pandas\core\series.py", line 442, in __init__
com.require_length_match(data, index)
File "c:\Users\beefcake-quad\Code\INVENTORYAssetSnapshot\.venv\lib\site-packages\pandas\core\common.py", line 557, in require_length_match
raise ValueError(
ValueError: Length of values (1) does not match length of index (7150)
The line is missing a closing ] at the end; as written, .index[0] is applied to row["Inv ID"] instead of to the result of .loc[...]. It should be:
reports_index = reports_df.loc[report_data["Inventory ID"] == row["Inv ID"]].index[0]
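As an aside, a minimal sketch of search_reports that also guards against IDs with no match (.index[0] raises IndexError on an empty result); it assumes reports_df keeps the default RangeIndex produced by ignore_index=True, so index labels equal positions for .iloc:
    def search_reports(inventory_df, reports_df):
        for index, row in inventory_df.iterrows():
            # all report rows whose ID matches this inventory row
            matches = reports_df.loc[reports_df["Inventory ID"] == row["Inv ID"]].index
            if len(matches) > 0:
                print(reports_df.iloc[matches[0]]["Lookup ID"])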

Error: matplotlib does not support generators as input in python code

I am trying to generate a chart that shows the top 5 spending categories.
I've got it working up to a certain point and then it says "matplotlib does not support generators as input". I am pretty new to python in general but am trying to learn more about it.
Up to this point in the code it works:
import Expense
import collections
import matplotlib.pyplot as plt

expenses = Expense.Expenses()
expenses.read_expenses(r"C:\Users\budget\data\spending_data.csv")
spending_categories = []
for expense in expenses.list:
    spending_categories.append(expense.category)
spending_counter = collections.Counter(spending_categories)
top5 = spending_counter.most_common(5)
If you did a print(top5) on the above it would show the following results:
[('Eating Out', 8), ('Subscriptions', 6), ('Groceries', 5), ('Auto and Gas', 5), ('Charity', 2)]
Now I was trying to separate the items (the count from the category) and I guess I'm messing up on that part.
The rest of the code looks like this:
categories = zip(*top5)
count = zip(*top5)
fig, ax = plt.subplots()
ax.bar(count,categories)
ax.set_title('# of Purchases by Category')
plt.show()
This is where the error is occurring. I can get something to show if I make count and categories a string but it doesn't actually plot anything and doesn't make sense.
The error shows (the name of this .py file I'm working in is FrequentExpenses.py)
Traceback (most recent call last):
File "C:\Users\budget\data\FrequentExpenses.py", line 24, in <module>
ax.bar(count,categories)
File "C:\Users\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\matplotlib\__init__.py", line 1447, in inner
return func(ax, *map(sanitize_sequence, args), **kwargs)
File "C:\Users\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\matplotlib\axes\_axes.py", line 2407, in bar
self._process_unit_info(xdata=x, ydata=height, kwargs=kwargs)
File "C:\Users\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\matplotlib\axes\_base.py", line 2189, in _process_unit_info
kwargs = _process_single_axis(xdata, self.xaxis, 'xunits', kwargs)
File "C:\Users\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\matplotlib\axes\_base.py", line 2172, in _process_single_axis
axis.update_units(data)
File "C:\Users\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\matplotlib\axis.py", line 1460, in update_units
converter = munits.registry.get_converter(data)
File "C:\Users\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\matplotlib\units.py", line 210, in get_converter
first = cbook.safe_first_element(x)
File "C:\Users\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\matplotlib\cbook\__init__.py", line 1669, in safe_first_element
raise RuntimeError("matplotlib does not support generators "
RuntimeError: matplotlib does not support generators as input
The "Expense" import is another another file (Expense.py) which looks like this that create two classes (Expenses & Expense) and also has a method of read_expenses()
import csv
from datetime import datetime

class Expense():
    def __init__(self, date_str, vendor, category, amount):
        self.date_time = datetime.strptime(date_str, '%m/%d/%Y %H:%M:%S')
        self.vendor = vendor
        self.category = category
        self.amount = amount

class Expenses():
    def __init__(self):
        self.list = []
        self.sum = 0

    # Read in the December spending data, row[2] is the $$, and need to format $$
    def read_expenses(self, filename):
        with open(filename, newline='') as csvfile:
            csvreader = csv.reader(csvfile, delimiter=',')
            for row in csvreader:
                if '-' not in row[3]:
                    continue
                amount = float((row[3][2:]).replace(',', ''))
                self.list.append(Expense(row[0], row[1], row[2], amount))
                self.sum += amount

    def categorize_for_loop(self):
        necessary_expenses = set()
        food_expenses = set()
        unnecessary_expenses = set()
        for i in self.list:
            if (i.category == 'Phone' or i.category == 'Auto and Gas' or
                    i.category == 'Classes' or i.category == 'Utilities' or
                    i.category == 'Mortgage'):
                necessary_expenses.add(i)
            elif (i.category == 'Groceries' or i.category == 'Eating Out'):
                food_expenses.add(i)
            else:
                unnecessary_expenses.add(i)
        return [necessary_expenses, food_expenses, unnecessary_expenses]
I know this probably seems pretty simple to most; can anyone help me? I appreciate all the help, and I'm looking forward to learning much more about Python!
Python has a data type called a "generator", which is a thing that generates values when asked for them (similar to an iterator). Very often it is cheaper to have a generator than to have a list produced up front. One example is the zip() function: instead of returning a list of tuples, it returns a generator which in turn produces one tuple after the other:
>>> zip([1, 2, 3], [4, 5, 6])
<zip object at 0x7f7955c6dd40>
If you iterate over such a generator it will generate one value after the other, so in this case it behaves like a list:
>>> for q in zip([1, 2, 3], [4, 5, 6]):
...     print(q)
...
(1, 4)
(2, 5)
(3, 6)
But in other contexts it doesn't behave like a list, e.g. when it is asked for the length of the result. A generator (typically) doesn't know that up front:
>>> len(zip([1, 2, 3], [4, 5, 6]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'zip' has no len()
This is mostly to save time during execution and is called lazy evaluation. Read more about generators in general.
In your case, you can simply skip the performance optimization and construct a true list out of the generator by calling list(...) explicitly:
>>> r = list(zip([1, 2, 3], [4, 5, 6]))
Then you can also ask for the length of the result:
>>> len(r)
3
The matplotlib library probably does this internally as well, so it accepts lists as input but not generators. Pass it a list instead of a generator, and you will be fine.
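Applied to the code in the question, a minimal sketch of the fix: one zip(*top5) unpacked into two real tuples replaces the two separate generators, and ax.bar() takes the x labels first, then the heights:
    categories, count = zip(*top5)

    fig, ax = plt.subplots()
    ax.bar(categories, count)
    ax.set_title('# of Purchases by Category')
    plt.show()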

Traceback (most recent call last): File "None", line 18, in <module> IndexError: list index out of range

text = open("/Users/amanshah/Desktop/hsn/a.tcp", "r")
lines = text.readlines()
a = []
c = []
d = []
e = []
sum1 = 0
for line in lines:
    temp = line.split()
    a.append(int(temp[1]))
    c.append(int(temp[5]))
for i in range(0, len(a)):
    if a[i] == a[i+1]:
        sum1 = sum1 + c[i]
        d[i].append(a[i])
        e[i].append(sum)
    else:
        d[i+1].append(a[i])
        e[i+1].append(sum1)
print d
print e
showing error
Traceback (most recent call last):
File "None", line 18, in
IndexError: list index out of range
You are appending to a list element that doesn't exist:
d[i+1].append(a[i])
This is equivalent to:
d = []
d[1].append('a')
This will give you the same error. I don't know what you are trying to put into d and e, but you can append to them, but not to elements of them that don't exist.
It looks like a dictionary would be a better choice for this application.
d = {}
if i not in d:
    d[i] = []
d[i].append(a[i])
I don't have a full understanding of what you are trying, but this will check to see if this item has been found before. If not then it will create an empty list to append to. Then it appends the entry for this.
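For what it's worth, dict.setdefault collapses that check-and-create pattern into a single line; a tiny self-contained example:
    d = {}
    for i, value in enumerate([3, 3, 5]):
        # returns d[value] if it exists, otherwise inserts [] first
        d.setdefault(value, []).append(i)
    # d is now {3: [0, 1], 5: [2]}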
It looks as though this should be rewritten as something like:
from collections import defaultdict

INPUT = "/Users/amanshah/Desktop/hsn/a.tcp"

payloads = defaultdict(int)

with open(INPUT) as inf:
    for line in inf:
        values = line.split()
        port = int(values[1])
        payload = int(values[5])
        payloads[port] += payload

for port in sorted(payloads):
    print("{}: {}".format(port, payloads[port]))
Edit: based on your comment above, it looks like you are scanning a tcp log file and adding up total transfer amounts per port.
So I created a dictionary called payloads where payloads[port] keeps the sum of all payloads for that port; additionally, I made it a defaultdict(int), which just means if I ask for payloads[port_I_havent_seen_yet] it automagically creates and returns a new entry == 0 instead of throwing a KeyError.
I then scan through each line of the file, updating payloads as I go.
At the end, sorted(payloads) gets a list of payloads's keys (all the ports I have encountered) and sorts it in ascending order; I then print out each port and its total payload. Based on your sample data, you should see
1: 156
2: 97
4: 124
5: 241
Hope this helps!

Python: Create coordinate list (convert string to int)

I want to import several coordinates (could add up to 20,000) from a text file.
These coordinates need to be added to a list, looking like the following:
coords = [[0,0],[1,0],[2,0],[0,1],[1,1],[2,1],[0,2],[1,2],[2,2]]
However, when I import the coordinates I get the following error:
invalid literal for int() with base 10
I can't figure out how to import the coordinates correctly. Does anyone have any suggestions as to why this does not work? I think there's some problem with creating the integers.
I use the following script:
Bronbestand = open("D:\\Documents\\SkyDrive\\afstuderen\\99 EEM - Abaqus 6.11.2\\scripting\\testuitlezen4.txt", "r")
headerLine = Bronbestand.readline()
valueList = headerLine.split(",")
xValueIndex = valueList.index("x")
#xValueIndex = int(xValueIndex)
yValueIndex = valueList.index("y")
#yValueIndex = int(yValueIndex)

coordList = []
for line in Bronbestand.readlines():
    segmentedLine = line.split(",")
    coordList.extend([segmentedLine[xValueIndex], segmentedLine[yValueIndex]])

coordList = [x.strip(' ') for x in coordList]
coordList = [x.strip('\n') for x in coordList]

coordList2 = []
#CoordList3 = [map(int, x) for x in coordList]
for i in coordList:
    coordList2 = [coordList[int(i)], coordList[int(i)]]

print "coordList = ", coordList
print "coordList2 = ", coordList2
#print "coordList3 = ", coordList3
The coordinates to be imported look like this (this is "Bronbestand" in the script):
id,x,y,
1, -1.24344945, 4.84291601
2, -2.40876842, 4.38153362
3, -3.42273545, 3.6448431
4, -4.22163963, 2.67913389
5, -4.7552824, 1.54508495
6, -4.99013376, -0.313952595
7, -4.7552824, -1.54508495
8, -4.22163963, -2.67913389
9, -3.42273545, -3.6448431
Thus the script should result in:
[[-1.24344945, 4.84291601],[-2.40876842, 4.38153362],[-3.42273545, 3.6448431],[-4.22163963, 2.67913389],[-4.7552824, 1.54508495],[-4.99013376,-0.313952595],[-4.7552824, -1.54508495],[-4.22163963, -2.67913389],[-3.42273545, -3.6448431]]
I also tried importing the coordinates with the native python csv parser but this didn't work either.
Thank you all in advance for the help!
Your numbers are not integers, so the conversion to int fails.
Try using float(i) instead of int(i) to convert to floating-point numbers:
>>> int('1.5')
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
int('1.5')
ValueError: invalid literal for int() with base 10: '1.5'
>>> float('1.5')
1.5
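Applied to the loop in your script, a minimal sketch that builds the [x, y] pairs directly (reusing the xValueIndex/yValueIndex found from the header; float() tolerates the surrounding spaces and newlines, so the strip steps become unnecessary):
    coordList = []
    for line in Bronbestand.readlines():
        segmentedLine = line.split(",")
        coordList.append([float(segmentedLine[xValueIndex]),
                          float(segmentedLine[yValueIndex])])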
Other answers have said why your script fails, however, there is another issue here - you are massively reinventing the wheel.
This whole thing can be done in a couple of lines using the csv module and a list comprehension:
import csv

with open("test.csv") as file:
    data = csv.reader(file)
    next(data)
    print([[float(x) for x in line[1:]] for line in data])
Gives us:
[[-1.24344945, 4.84291601], [-2.40876842, 4.38153362], [-3.42273545, 3.6448431], [-4.22163963, 2.67913389], [-4.7552824, 1.54508495], [-4.99013376, -0.313952595], [-4.7552824, -1.54508495], [-4.22163963, -2.67913389], [-3.42273545, -3.6448431]]
We open the file, make a csv.reader() to parse the csv file, skip the header row, then make a list of the numbers parsed as floats, ignoring the first column.
As pointed out in the comments, as you are dealing with a lot of data, you may wish to iterate over the data lazily. While making a list is good to test the output, in general, you probably want a generator rather than a list. E.g:
([float(x) for x in line[1:]] for line in data)
Note that the file will need to remain open while you utilize this generator (remain inside the with block).

What is the lightest way of doing this task?

I have a file whose contents are of the form:
.2323 1
.2327 1
.3432 1
.4543 1
and so on, for some 10,000 lines in each file.
I have a variable whose value is, say, a=.3344.
From the file I want to get the row number of the row whose first column is closest to this variable; for example, it should give row_num='3', as .3432 is closest to it.
I have tried loading the first column's elements into a list, comparing the variable to each element, and taking the index of the closest one.
Done that way it is very time consuming and slows my model down; I want a very quick method, as this needs to be called some 1,000 times at minimum.
I want a method with the least possible overhead. Since the file size is at most 100 kB, can this be done directly, without loading it into a list? If yes, how?
Any method quicker than the one described above is welcome; I am desperate to improve the speed.
def get_list(file, cmp, fout):
    ind, _ = min(enumerate(file), key=lambda x: abs(x[1] - cmp))
    return fout[ind].rstrip('\n').split(' ')

#root = r'c:\begpython\wavnk'
header = 6

for lst in lists:
    save = database_index[lst]
    #print save
    index, base, abs2, _, abs1 = save
    using_data[index] = save
    base = 'C:/begpython/wavnk/' + base.replace('phone', 'text')
    fin, fout = base + '.pm', base + '.mcep'
    file = open(fin)
    fout = open(fout).readlines()
    [next(file) for _ in range(header)]
    file = [float(line.partition(' ')[0]) for line in file]
    join_cost_index_end[index] = get_list(file, float(abs1), fout)
    join_cost_index_strt[index] = get_list(file, float(abs2), fout)
This is the code I was using, copying the file into a list. Please suggest better alternatives.
Building on John Kugelman's answer, here's a way you might be able to do a binary search on a file with fixed-length lines:
class SubscriptableFile(object):
    def __init__(self, file):
        self._file = file
        file.seek(0, 0)
        self._line_length = len(file.readline())
        file.seek(0, 2)
        self._len = file.tell() / self._line_length

    def __len__(self):
        return self._len

    def __getitem__(self, key):
        self._file.seek(key * self._line_length)
        s = self._file.readline()
        if s:
            return float(s.split()[0])
        else:
            raise KeyError('Line number too large')
This class wraps a file in a list-like structure, so that now you can use the functions of the bisect module on it:
def find_row(file, target):
    fw = SubscriptableFile(file)
    i = bisect.bisect_left(fw, target)
    if fw[i + 1] - target < target - fw[i]:
        return i + 1
    else:
        return i
Here file is an open file object and target is the number you want to find. The function returns the number of the line with the closest value.
I will note, however, that the bisect module will try to use a C implementation of its binary search when it is available, and I'm not sure if the C implementation supports this kind of behavior. It might require a true list, rather than a "fake list" (like my SubscriptableFile).
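A hypothetical usage sketch, assuming a numbers.txt with sorted, fixed-length lines and that SubscriptableFile and find_row are already defined:
    import bisect

    with open('numbers.txt') as f:
        print(find_row(f, 0.3344))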
Is the data in the file sorted in numerical order? Are all the lines of the same length? If not, the simplest approach is best. Namely, reading through the file line by line. There's no need to store more than one line in memory at a time.
Code:
def closest(num):
    closest_row = None
    closest_value = None
    for row_num, row in enumerate(file('numbers.txt')):
        value = float(row.split()[0])
        if closest_value is None or abs(value - num) < abs(closest_value - num):
            closest_row = row
            closest_row_num = row_num
            closest_value = value
    return (closest_row_num, closest_row)

print closest(.3344)
Output for sample data:
(2, '.3432 1\n')
If the lines are all the same length and the data is sorted then there are some optimizations that will make this a very fast process. All the lines being the same length would let you seek directly to particular lines (you can't do this in a normal text file with lines of different length). Which would then enable you to do a binary search.
A binary search would be massively faster than a linear search. A linear search will on average have to read 5,000 lines of a 10,000 line file each time, whereas a binary search would on average only read log2 10,000 ≈ 13 lines.
Load it into a list then use bisect.
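For example, a minimal sketch of that approach, assuming the first column is sorted in ascending order:
    import bisect

    with open('numbers.txt') as f:
        values = [float(line.split()[0]) for line in f]

    def closest_row_num(target):
        # insertion point i satisfies values[i-1] < target <= values[i]
        i = bisect.bisect_left(values, target)
        if i == 0:
            return 0
        if i == len(values):
            return len(values) - 1
        # pick whichever neighbour is nearer
        return i if values[i] - target < target - values[i - 1] else i - 1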
