Python: How does converters work in genfromtxt() function? - python

I am new to Python, I have a following example that I don't understand
The following is a csv file with some data
%%writefile wood.csv
item,material,number
100,oak,33
110,maple,14
120,oak,7
145,birch,3
Then, the example tries to define a function to convert those trees name above to integers.
tree_to_int = dict(oak = 1,
maple=2,
birch=3)
def convert(s):
return tree_to_int.get(s, 0)
The first question is why is there a "0" after "s"? I removed that "0" and get same result.
The last step is to read those data by numpy.array
data = np.genfromtxt('wood.csv',
delimiter=',',
dtype=np.int,
names=True,
converters={1:convert}
)
I was wondering for the converters argument, what does {1:convert} exact mean? Especially what does number 1 mean in this case?

For the second question, according to the documentation (https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html), {1:convert} is a dictionary whose keys are column numbers (where the first column is column 0) and whose values are functions that convert the entries in that column.
So in this code, the 1 indicates column one of the csv file, the one with the names of the trees. Including this argument causes numpy to use the convert function to replace the tree names with their corresponding numbers in data.

Related

Create new columns in pandas DataFrame using List Comprehension

So I have a pandas DataFrame that has several columns that contain values I'd like to use to create new columns using a function I've defined. I'd been planning on doing this using Python's List Comprehension as detailed in this answer. Here's what I'd been trying:
df['NewCol1'], df['NewCol2'] = [myFunction(x=row[0], y=row[1]) for row in zip(df['OldCol1'], df['OldCol2'])]
This runs correctly until it comes time to assign the values to the new columns, at which point it fails, I believe because it hasn't been iteratively assigning the values and instead tries to assign a constant value to each column. I feel like I'm close to doing this correctly, but I can't quite figure out the assignment.
EDIT:
The data are all strings, and the function performs a fetching of some different information from another source based on those strings like so:
def myFunction(x, y):
# read file based on value of x
# search file for values a and b based on value of y
return(a, b)
I know this is a little vague, but the helper function is fairly complicated to explain.
The error received is:
ValueError: too many values to unpack (expected 4)
You can use zip()
df['NewCol1'], df['NewCol2'] = zip(*[myFunction(x=row[0], y=row[1]) for row in zip(df['OldCol1'], df['OldCol2'])])

gspread - Getting values as string from numeric like column

I am trying to read a google sheet using python using the gspread library.
The initial authentication settings is done and I am able to read the respective sheet.
However when I do
sheet.get_all_records()
The column containing numeric like values (eg. 0001,0002,1000) are converted as numeric field. That is the leading zeroes are truncated. How to prevent this from happening?
You can prevent gspread from casting values to int passing the numericise_ignore parameter to the get_all_records() method.
You can disable it for a specific list of indices in the row:
# Disable casting for columns 1, 2 and 4 (1 indexed):
sheet.get_all_records(numericise_ignore=[1, 2, 4])
Or, disable it for the whole row values with numericise_ignore set to 'all' :
sheet.get_all_records(numericise_ignore=['all'])
How about this answer? In this answer, as one of several workarounds, get_all_values() is used instead of get_all_records(). After the values are retrieved, the array is converted to the list. Please think of this as just one of several answers.
Sample script:
values = worksheet.get_all_values()
head = values.pop(0)
result = [{head[i]: col for i, col in enumerate(row)} for row in values]
Reference:
get_all_values()
If this was not the direction you want, I apologize.

Python for loops with if statement: TypeError: string indices must be integers

I am new to Python and I have been working on this for a few days. I have 9 excel files I am looping over and creating Panda DataFrames.
Each of the 9 xls_file DataFrames have these columns:
code, date_pub (which is uniform within each xls_file but unique between each xls_file), years_q ...
xls_file
My other DataFrame is called estimate. This is a single Dataframe with these columns:
rel, est, date_pub (which is unique)
estimate
My goal is to create 4 panda DataFrames named (AR_41, AR_42, DG_44, DG_46)
The DataFrames would look very similar to 'estimate' but would have an additional column named value equal to the last column of a given xls_file with the same date_pub
x=0
for x in range(1, 9):
xls_file = pd.ExcelFile('Data' + str(x) + '.xls')
final_data = estimate
for row in estimate:
for innerrow in xls_file:
if innerrow['date_pub'] == row['date_pub']:
final_data['value'].append(row.ix[:,-1])
I am getting this error: "string indices must be integers" when checking:
if innerrow['date_pub'] == row['date_pub']:
The error may mean either innerrow or row is actually pointing to a string, when it looks like you need it to be an object. Double-check what your variables are pointing to, you could be over-narrowing the dimension and acessing a string when you want a row or column.
When you call 'string'[i] in Python, it gets the i-th character in 'string', where i is necessarily an integer.
For example s = 'certain_word' then print s[2] will print 'r'.
This may be what your code is trying to do and I am guessing 'date_pub' is a different variable type.

Saving/loading a table (with different column lengths) using numpy

A bit of context: I am writting a code to save the data I plot to a text file. This data should be stored in such a way it can be loaded back using a script so it can be displayed again (but this time without performing any calculation). The initial idea was to store the data in columns with a format x1,y1,x2,y2,x3,y3...
I am using a code which would be simplified to something like this (incidentally, I am not sure if using a list to group my arrays is the most efficient approach):
import numpy as np
MatrixResults = []
x1 = np.array([1,2,3,4,5,6])
y1 = np.array([7,8,9,10,11,12])
x2 = np.array([0,1,2,3])
y2 = np.array([0,1,4,9])
MatrixResults.append(x1)
MatrixResults.append(y1)
MatrixResults.append(x2)
MatrixResults.append(y2)
MatrixResults = np.array(MatrixResults)
TextFile = open('/Users/UserName/Desktop/Datalog.txt',"w")
np.savetxt(TextFile, np.transpose(MatrixResults))
TextFile.close()
However, this code gives and error when any of the data sets have different lengths. Reading similar questions:
Can numpy.savetxt be used on N-dimensional ndarrays with N>2?
Table, with the different length of columns
However, this requires to break the format (either with flattening or adding some filling strings to the shorter columns to fill the shorter arrays)
My issue summarises as:
1) Is there any method that at the same time we transpose the arrays these are saved individually as consecutive columns?
2) Or maybe is there anyway to append columns to a text file (given a certain number of rows and columns to skip)
3) Should I try this with another library such as pandas?
Thank you very for any advice.
Edit 1:
After looking a bit more it seems that leaving blank spaces is more innefficient than filling the lists.
In the end I wrote my own (not sure if there is numpy function for this) in which I match the arrays length with "nan" values.
To get the data back I use the genfromtxt method and then I use this line:
x = x[~isnan(x)]
To remove the these cells from the arrays
If I find a better solution I will post it :)
To save your array you can use np.savez and read them back with np.load:
# Write to file
np.savez(filename, matrixResults)
# Read back
matrixResults = np.load(filename + '.npz').items[0][1]
As a side note you should follow naming conventions i.e. only class names start with upper case letters.

genfromtxt - Force column name generation for unknown number of columns

I have trouble getting numpy to load tabular data and automatically generate column names. It seems pretty simple but I cannot nail it.
If i knew the number of columns I could easily create names parameter, but I don't have this knowledge, and I would like to avoid prior introspection of the data file.
How can I force numpy to generate the column names, or use tuple-like dtype automatically, when I have no knowledge how many columns there are in file? I want to manipulate the column names after reading the data.
My approaches so far:
data = np.genfromtxt(tar_member, unpack = True, names = '') - I wanted to force automatic generation of column names by giving some "empty" parameter. Results with error ValueError: size of tuple must match number of fields.
data = np.genfromtxt(tar_member, unpack = True, names = True) - "Works" but consumes 1st row of data.
data = np.genfromtxt(tar_member, unpack = True, dtype = None) - Worked for data with mixed types. Automatic type guessing expanded dtype into a tuple, and assigned the names. However, for data where everything was actually float, dtype was set to float64, and I got ValueError: there are no fields defined when I tried accessing data.dtype.names.
I know there is a cleaner way to do this, but if you don't mind forcing the issue you can generate your dtype structure and assign it directly to the data array.
x = numpy.random.rand(10,10)
numpy.savetxt('test.out', x, delimiter=',')
dataa = numpy.genfromtxt('test.out',delimiter=",", dtype=None)
if dataa.dtype.names is None:#then dataa is homogenous?
l1 = map(lambda z:('f%d'%(z),dataa.dtype),range(0,dataa.shape[1]))
dataa.dtype = dtype(l1)
dataa.dtype
dataa.dtype.names

Categories