Increasing Values in one Column Using CSV Module and Python

All I would like to do is add .001 to each value that isn't a 0 in one column (say column 7 for example) in my csv file.
So instead of being 35, the value would be changed to 35.001, for example. I need to do this to make my ArcMap script work: if the first value read is a whole number, the column is assigned as a short integer when it needs to be read as a float.
As of right now, I have:
writer.writerow([f if f.strip() =='0' else f+.001 for f in row])
This raises a concatenation error, however, and does not yet address the specific column I need it to work on.
Any help would be greatly appreciated.
Thank you.

The easiest thing to do is to just mutate the row in place, i.e.
if row[7].strip() != '0' and '.' not in row[7]:
    row[7] = row[7] + '.001'
writer.writerow(row)
The concatenation error is caused by trying to add a float to a string; you just have to wrap the extra decimals in quotes so you are concatenating two strings.
The extra condition on the if ensures that you don't accidentally end up with a number with two decimal points.
It's pretty standard for numbers of the form 35.0 to be treated as floats even though the value is a whole number - check whether ArcMap follows this convention, and then you can avoid reducing the accuracy of your numbers by just appending '.0' instead.
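Putting it all together, a minimal end-to-end sketch (Python 3; the file names input.csv and output.csv are placeholders, and column index 7 comes from the question):

import csv

# 'input.csv' and 'output.csv' are placeholder names; adjust to your paths.
with open('input.csv', newline='') as src, open('output.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        # Append '.001' only to non-zero values that aren't already decimals.
        if row[7].strip() != '0' and '.' not in row[7]:
            row[7] = row[7] + '.001'
        writer.writerow(row)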


Finding rows with specific words in an excel sheet [closed]

I have a table similar to this format:
Table
My table has 50000 rows and 800 columns; certain cells (the file is tab-separated) contain multiple comma-separated words (e.g. L,N). I want to retain only the rows that contain one of a set of words (say A to N) in a given column (say col2) and remove the remaining rows.
Is it possible to do this using Vlookup, or is there another way? Any suggestion is appreciated.
Create a helper column, perhaps at the extreme right of a copy of your worksheet. Enter this formula in row 2 of that column. Modify the formula to replace column C with the ID of the helper column (the one in which you write the formula) and replace B with the column in which to find the word. Copy the formula all the way down.
=ISERROR(FIND(C$1,$B2))
Now enter the word you want to keep in cell(1) of the helper column (C$1 in my example). The column will fill with TRUE and FALSE.
TRUE means that the word wasn't found and the row should be deleted
FALSE means that the word exists and should be kept
Now sort the sheet on that column and delete the block with TRUE in it. Fifteen seconds flat once you have done it a few times. That's faster than any VBA, R, or Python solution could be made to run.
The biggest issue I had was in conveying what I did, how, and why. There was no need to, so I deleted that explanation.
So, select any part, section, or range of your table and run the code.
You can remove the section of the code below 'Clear any existing filters' if you don't need to delete the found data and want to keep it in your sheet, still tagged for additional purposes (for example, you could copy those rows to another file).
The code below should do what you asked in your question and leave you with just the parts of the table that DON'T contain, in your selection, any of your terms; i.e. it will delete rows and move the remaining ones up according to your criteria.
And yes, Python and R can simply do this for you too, with dataframes in Python, with less code (see the pandas sketch at the end of this answer). But this VBA code worked for my many examples. I don't know how it will fare with 50000 rows and X columns, but it should be alright {Post edit: It works fine}.
Sub SearchTableorSelectionDeletetermsFound5()
    Dim corresspondingpartner() As Variant
    Dim rng As Range: Set rng = Selection
    Dim col As Range
    For Each col In rng.Columns
        Dim r As Range
        Dim rm() As Variant
        Dim Rownum As Long
        Rownum = Selection.Rows.Count
        ReDim rm(0 To Rownum) As Variant 'the size of this array needs to be equal to or bigger than your selection
        terms = Sheets("Sheet2").Cells(1, 1).CurrentRegion
        k = 1
        For rw = 0 To UBound(terms)
            ReDim Preserve corresspondingpartner(rw)
            corresspondingpartner(rw) = (k / k) 'gives each corresponding partner element an id of 1
            k = k + 1
        Next
        For Each r In Selection
            n = 0
            For Each c In terms
                'Check for each term in turn in the terms column.
                'If one is found, tag the corresponding row in column ADU (column 801).
                If r.Value Like "*" & c & "*" Then
                    rm(n) = corresspondingpartner(n)
                    'In the end this could simply have been = 1; no HLOOKUP or
                    'offset columns are needed.
                    Cells(r.Row, 801).Value = rm(n) / rm(n)
                End If
                n = n + 1
            Next
        Next
    Next col
    'Clear any existing filters
    On Error Resume Next
    ActiveSheet.ShowAllData
    On Error GoTo 0
    '1. Apply the filter to the tag column
    ActiveSheet.Range("A1:ADU5000").AutoFilter Field:=801, Criteria1:="1"
    '2. Delete the visible (tagged) rows
    Application.DisplayAlerts = False
    ActiveSheet.Range("A1:ADU5000").SpecialCells(xlCellTypeVisible).Delete
    Application.DisplayAlerts = True
    '3. Clear the filter
    On Error Resume Next
    ActiveSheet.ShowAllData
    On Error GoTo 0
End Sub
You might be able to see that in the beginning I was working to print offset column results from the table/selection, which took up unnecessary space and employed a VBA Application.WorksheetFunction.HLookup in the code to produce a final result column tagging the rows to delete, but that was in the end unnecessary. Those earlier versions/macros worked too, but were slower, so I did it without the need for helper columns, using arrays.
I turned to my friend, [Excel Campus - delete rows based on conditions][1], to embed and tweak the AutoFilter code at the end, which deletes the rows you don't want, so you don't have to do it yourself.
It's now a "virtual" HLookup on the array matches in your selection (or data), deleting all rows that match your specifications/requirements and leaving you with the data you want.
I know, and have a huge hunch, that it could be further improved and streamlined (starting with the way I produce the arrays), but I'm happy with its functionality and potential scope for now.
[1]: https://www.youtube.com/watch?v=UwNcSZrzd_w&t=6s
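As mentioned before the code, a dataframe approach needs far less code. A minimal pandas sketch, assuming a tab-separated file named data.tsv and a hypothetical terms list; the column name col2 comes from the question:

import pandas as pd

# Hypothetical file name and terms; adjust to your data.
df = pd.read_csv('data.tsv', sep='\t')
terms = ['A', 'B', 'L', 'N']

# Keep a row if any of the comma-separated words in col2 is one of the terms.
def has_term(cell):
    return any(w.strip() in terms for w in str(cell).split(','))

df[df['col2'].apply(has_term)].to_csv('filtered.tsv', sep='\t', index=False)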

Problems with the isin function: numbers fail while words work normally

I have a problem with this function:
print(df.loc[df['Kriterij1'] == '63'])
and I tried the same with:
df[df.Kriterij1.isin(['aaa','63'])]
When I try to filter by numbers, the output is only the header (empty cells); it works only for the word 'aaa'.
Or maybe I can use another function?
I think you need to change '63' (a string) to 63 (a number), if numeric values are mixed with string values:
print(df.loc[df['Kriterij1'] == 63])
print(df[df.Kriterij1.isin(['aaa',63])])
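If you cannot be sure which values are numbers and which are strings, one further option (my suggestion, not part of the original answer) is to coerce the column to numeric first:

import pandas as pd

# Hypothetical reproduction of a column mixing strings and numbers.
df = pd.DataFrame({'Kriterij1': ['aaa', 63, '63']})

# errors='coerce' turns non-numeric values into NaN, so the comparison is uniform.
numeric = pd.to_numeric(df['Kriterij1'], errors='coerce')
print(df[(numeric == 63) | df['Kriterij1'].isin(['aaa'])])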

Selecting rows based on conditions (column) using the comparison operator >

While discovering Python, I found myself stuck trying to select rows (food items) based on the values of a column (macronutrients). My condition uses a relational operation and the output is not correct
(particularly with the > or < operators; I don't have the problem with the == operator).
data.loc[data['protein']=='10']
Result of my code sample
The result is correct because all the rows (food items) seem to have a protein value of 10.
data.loc[data['protein']>'10']
Result of my code sample
The result is incorrect because the rows do not respect the given condition (there are rows with protein < 10 as well as rows with protein > 10).
Any thoughts on the issue? Do you think it's related to the file format (see the code sample below)? If so, how can I get around the problem?
data = pd.read_excel('Documents/test.xlsx', names=col_names, usecols="D,E,F,G,H,J,M,N,P,Q,R,T,Y,Z,AA", index_col=[3])
Thanks in advance and Happy Holidays !!
[EDITED]
So I did more digging; indeed I am comparing two different things. @Daniel Mesejo: the type of protein is Object. Since I want the protein column in a float format, I decided to convert it into a string first and then into a float. Unfortunately, converting it to a string using .astype(str) didn't work.
result
Use data['protein'] = data['protein'].astype('int64') to convert the strings to integers, and then retry what you were doing.
Your problem is that you are comparing strings rather than integers. Change data.loc[data['protein'] > '10'] to data.loc[data['protein'].astype(int) > 10].
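If the column has dtype object and a plain astype fails because of stray non-numeric text, a hedged alternative (my addition, not from the answers above) is pd.to_numeric with errors='coerce':

import pandas as pd

# Hypothetical protein column mixing clean and messy values.
data = pd.DataFrame({'protein': ['10', '12.5', 'N/A', '8']})

# Unparseable entries become NaN instead of raising an error.
data['protein'] = pd.to_numeric(data['protein'], errors='coerce')
print(data.loc[data['protein'] > 10])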

Function awfully slow

I was looking for historical data on the Brazilian stock market and found it at Bovespa's website.
The problem is that the format the data comes in is terrible; it is mingled with all sorts of other information about any particular stock!
So far so good! A great opportunity to test my fresh python skills (or so I thought)!
I managed to "organize/parse" pretty much all of the data with a few lines of code,
and then stumbled on a very annoying fact about the data. The very information I needed, the stock prices (open, high, low, close), had no commas and was formatted like this: 0000000011200, which is equivalent to 11 digits before the decimal comma.
So basically 0000000011200 = 112,00... You get the gist..
I wrote a few lines of code to edit that and then the nightmare kicked in.
The whole data set is around 358K rows long, and with my current script, the deeper it runs into the list, the longer each edit takes.
Here is the code snippet I used for that:
#profile
def dataFix(datas):
    x = 0
    for entry in datas:
        for i in range(9, 16):
            data_org[datas.index(entry)][i] = entry[i][:11] + '.' + entry[i][11:]
        x += 1
        print x
Would anyone mind shining some light on this matter?
datas.index(entry)
There's your problem: datas.index(entry) requires Python to go through the datas list one element at a time, searching for entry. It's an incredibly slow way to do things, slower the bigger the list is, and it doesn't even work correctly, because duplicate elements are always found at their first occurrence instead of the occurrence you're processing.
If you want to use the indices of the elements in a loop, use enumerate:
for index, entry in enumerate(datas):
    ...
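Applied to the snippet from the question, a minimal sketch (assuming data_org exists as in the original code):

def dataFix(datas):
    # enumerate yields each index directly, avoiding the O(n) list search per row.
    for index, entry in enumerate(datas):
        for i in range(9, 16):
            data_org[index][i] = entry[i][:11] + '.' + entry[i][11:]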
First, it is probably easier to convert the price directly to a more usable format.
For example, the Decimal format permits easy calculation without losing precision.
Secondly, I think you don't even need the index and can just use append.
Thirdly, say welcome to list comprehensions and slices :P
from decimal import Decimal

data_org = []
for entries in datas:
    data_org.append([Decimal(entry).scaleb(-2) for entry in entries[9:16]])
or even:
data_org = [[Decimal(entry).scaleb(-2) for entry in entries[9:16]] for entries in datas]
or in a generator form:
data_org = ([Decimal(entry).scaleb(-2) for entry in entries[9:16]] for entries in datas)
or, if you want to keep the text form:
data_org = [['.'.join((entry[:-2], entry[-2:])) for entry in entries[9:16]] for entries in datas]
(replacing [:11] with [:-2] makes the code independent of the input size by taking the two decimals from the end)
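For reference, a quick check of what scaleb does with the sample value from the question:

from decimal import Decimal

# scaleb(-2) shifts the decimal point two places to the left: 11200 -> 112.00
print(Decimal('0000000011200').scaleb(-2))  # prints 112.00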

how to only show numbers in a sorted list from a csv file

I have a huge CSV file where I'm supposed to show only the columns "name" and "runtime".
My problem is that I have to sort the file and print the 10 lowest and the 10 highest values from the runtime column.
But the column 'runtime' contains text like this:
['http://dbpedia.org/ontology/runtime',
'XMLSchema#double',
'http://www.w3.org/2001/XMLSchema#double',
'4140.0',
'5040.0',
'5700.0',
'{5940.0|6600.0}',
'NULL',
'6480.0',....n]
How do I sort the list showing only numbers?
My code so far:
import csv
import urllib

run = []
fp = urllib.urlopen('Film.csv')
reader = csv.DictReader(fp, delimiter=',')
for line in reader:
    if line:
        run.append(line)

name = []
for row in run:
    name.append(row['name'])

runtime = []
for row in run:
    runtime.append(row['runtime'])
runtime
The csv file contains NULL values and values looking like this: {5940.0|6600.0}.
Expected output:
'4140.0',
'5040.0',
'5700.0',
'6600.0',
'6800.0',....n]
not containing the NULL values, and with only the highest value kept from entries like {5940.0|6600.0}.
You could filter it like this, but you should probably wait for better answers.
>>> l = [1, 1.3, 7, 'text']
>>> [i for i in l if isinstance(i, (int, float))]  # only ints and floats allowed
[1, 1.3, 7]
This should do, though.
My workflow would probably be: use str.isdigit() as a filter, convert to a number with the built-in int() or float(), and then use sort() or sorted(). A sketch of that workflow follows.
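A hedged sketch of that workflow, adapted to the sample data (the helper name clean_runtime is mine; dropping NULL entries and keeping the highest value from cells like {5940.0|6600.0} follows the expected output described in the question):

def clean_runtime(values):
    result = []
    for v in values:
        # Split {5940.0|6600.0}-style cells and keep only parseable numbers.
        parts = v.strip('{}').split('|')
        numbers = [float(p) for p in parts if p.replace('.', '', 1).isdigit()]
        if numbers:  # skips 'NULL' and other non-numeric entries
            result.append(max(numbers))
    return sorted(result)

runtimes = clean_runtime(['4140.0', '5040.0', 'NULL', '{5940.0|6600.0}', '6480.0'])
print(runtimes[:10])   # 10 lowest
print(runtimes[-10:])  # 10 highest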
While you could use one of the many answers that will show up here, I personally would exploit some domain knowledge of your csv file:
runtime = runtime[3:]
Based on your example value for the runtime column, the first three entries contain metadata. So you know more about the structure of your input file than just "it is a csv file".
Then, all you need to do is sort:
runtime = sorted(runtime)
max_10 = runtime[-10:]
min_10 = runtime[:10]
The syntax I'm using here is called a "slice", which lets you access a range of a sequence by specifying the start index and the "up-to-but-not-including" index in square brackets, separated by a colon. Neat trick: negative indexes are counted from the end of the sequence.
