Select columns and lines with condition from file using numpy - python

I'm trying to write a program that opens a text file containing only numbers arranged in rows and columns, and saves a selection of them to a new file. The column-selection part works, but the row selection doesn't: I need to keep only the lines where x > 10e13, where x is the value in a specific column.
My problems are mainly in the row selection.
Since the files are very large, I was advised to use numpy, so I would like to stick with that approach.
This is the code I have written:
import numpy as np

matrix = np.loadtxt('file.dat')

# select columns
column_indices = [0]
selected_columns = matrix[:, column_indices]

# select lines
x = 1E14
for line in matrix:
    if float(line) > x:
        # any ideas?

selected_matrix = matrix[selected_lines, selected_columns]
np.savetxt('new_file.dat', selected_matrix, fmt='%1.4f')
This is a small sample of my input data:
185100000000000.0000
121300000000000.0000
257800000000000.0000
43980000000000.0000

There are a lot of ways to do this. Here is something you might want to try, along the same direction you have already taken:
matrix = <your matrix>
column_index = <the column index you want to compare>
x = 1E14

new_rows = []                        # collect matching rows in a plain list
for line in range(matrix.shape[0]):
    if matrix[line][column_index] > x:
        new_rows.append(matrix[line])

new_matrix = np.array(new_rows)      # convert back to an array at the end
Hope that makes sense. We append only the rows that meet your condition to a plain Python list (NumPy arrays have no append method, and growing an array row by row would be slow), then convert the list back to an array at the end. I hope that gives you an idea of how to accomplish this task.
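Since the files are very large, it is also worth knowing that the whole loop can be replaced by a single boolean mask, which is the idiomatic (and much faster) NumPy way. A sketch using the sample values from the question in place of np.loadtxt('file.dat'):

```python
import numpy as np

# In practice: matrix = np.loadtxt('file.dat'); here, the question's sample data.
matrix = np.array([[1.851e14],
                   [1.213e14],
                   [2.578e14],
                   [4.398e13]])

x = 1E14
column_index = 0

mask = matrix[:, column_index] > x   # boolean array, one True/False per row
selected_matrix = matrix[mask]       # keeps only the rows where mask is True

np.savetxt('new_file.dat', selected_matrix, fmt='%1.4f')
print(selected_matrix.shape)         # three of the four sample rows pass
```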

Related

How to run different functions in different parts of a dataframe in python?

I have a dataframe (df).
I need to build the standard-deviation dataframe from it. For the first row I want to use the traditional variance formula,
sum of (x - mean(x))^2 / n
and from the second row (= i) onward I want to use the following formula:
lamb*(variance of the previous row) + (1 - lamb)*(previous row of returns)^2
※by "first row" in the formula, I meant the previous row.
# Generate Sample Dataframe
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': range(1, 7),
                   'b': [x**2 for x in range(1, 7)],
                   'c': [x**3 for x in range(1, 7)]})
# Generate return Dataframe
returns=df.pct_change()
# Generate new Zero dataframe
d=pd.DataFrame(0,index=np.arange(len(returns)),columns=returns.columns)
#populate first row
lamb=0.94
d.iloc[0]=list(returns.var())
Now my question is how to populate the rows from the second one to the end using the second formula.
It should be something like
d[1:].agg(lambda x: lamb*x.shift(-1)+(1-lamb)*returns[:2]
but it obviously returned a long error.
Could you please help?
for i in range(1, len(d)):
    d.iloc[i] = lamb * d.iloc[i-1] + (1 - lamb) * returns.iloc[i-1]**2
I'm not completely sure this gives the right answer for your model, but it won't throw an error. The key points are to assign the result back to d.iloc[i] (calling apply on a row without assigning the result does nothing) and to iterate with a plain for loop and .iloc, since each row depends on the previous one; this should do the job for you if you plug in the correct formula.
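Putting it together as a runnable sketch on the sample dataframe. One caveat (my assumption, not from the question): pct_change leaves NaN in the first returns row, so the recursion below uses returns.iloc[i] rather than iloc[i-1] to keep that NaN from propagating; shift the index back if your model really needs the previous row's return:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(1, 7),
                   'b': [x**2 for x in range(1, 7)],
                   'c': [x**3 for x in range(1, 7)]})

returns = df.pct_change()
lamb = 0.94

d = pd.DataFrame(0.0, index=np.arange(len(returns)), columns=returns.columns)
d.iloc[0] = returns.var()            # seed row: ordinary sample variance

# EWMA-style recursion: each row mixes the previous variance estimate
# with the squared return of the current row.
for i in range(1, len(d)):
    d.iloc[i] = lamb * d.iloc[i - 1] + (1 - lamb) * returns.iloc[i] ** 2

print(d)
```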

Calculate Gunning-Fog score on excel values

I have a spreadsheet with fields containing a body of text.
I want to calculate the Gunning-Fog score for each row and have the value written back to that same Excel file as a new column. To do that, I first need to calculate the score for each row. The code below works if I hard-code the text into the df variable. However, it does not work when I read the field from the sheet (i.e., rfds) and pass that through to my r variable. I get the following error, even though the two fields I am testing contain 3,896 and 4,843 words respectively:
readability.exceptions.ReadabilityException: 100 words required.
Am I missing something obvious? Disclaimer, I am very new to python and coding in general! Any help is appreciated.
from readability import Readability
import pandas as pd
df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
rfd = df["Item 1A"]
rfds = rfd.to_string() # to fix "TypeError: expected string or buffer"
r = Readability(rfds)
fog = r.gunning_fog()
print(fog.score)
TL;DR: You need to pass a single cell's text, and you are currently passing a whole column of cells.
The line rfd = df["Item 1A"] returns the entire column as a pandas Series. rfd.to_string() then renders that whole Series (index labels included) as one string, which is not the text of any single cell. That is also why passing the Series in directly raised a TypeError: a Series is neither a string nor a buffer.
Rather than taking a column and going down it, approach it from the other direction. Take the rows and then pull out the column:
for index, row in df.iterrows():
    print(row.iloc[2])
The [2] is the column index.
Now that you have a single cell value, it can be passed to the Readability calculator:
r = Readability(row.iloc[2])
fog = r.gunning_fog()
print(fog.score)
Note that these can be combined together into one command:
print(Readability(row.iloc[2]).gunning_fog())
This shows you how commands can be chained together - which way you find it easier is up to you. The chaining is useful when you give it to something like apply or applymap.
Putting the whole thing together (the step by step way):
from readability import Readability
import pandas as pd

df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")

for index, row in df.iterrows():
    r = Readability(row.iloc[2])
    fog = r.gunning_fog()
    print(fog.score)
Or the clever way:
from readability import Readability
import pandas as pd
df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
print(df["Item 1A"].apply(lambda x: Readability(x).gunning_fog().score))
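To get the scores back into the spreadsheet as a new column (the original goal), the same apply pattern works. In the sketch below, fake_fog is a hypothetical stand-in scorer so the example runs without the readability package installed; substitute lambda x: Readability(x).gunning_fog().score for the real thing, then write the result out with df.to_excel:

```python
import pandas as pd

def fake_fog(text):
    # Hypothetical stand-in for Readability(text).gunning_fog().score;
    # it only mimics the rough shape of the score, NOT the real calculation.
    words = text.split()
    complex_words = [w for w in words if len(w) > 7]
    return 0.4 * (len(words) + 100.0 * len(complex_words) / len(words))

df = pd.DataFrame({"Item 1A": ["a short body of filing text",
                               "another considerably longer disclosure passage"]})

df["fog_score"] = df["Item 1A"].apply(fake_fog)
print(df["fog_score"])
# df.to_excel("item1a_scored.xlsx", index=False)  # write back out when ready
```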

Slicing columns in Python

I am new to Python. I want to slice out the columns from index 1 to the end of a matrix and perform some operations on those sliced-out columns. Following is the code:
import numpy as np
import pandas as pd
train_df = pd.read_csv('train_475_60_W1.csv',header = None)
train = train_df.as_matrix()
y = train[:,0]
X = train[:,1:-1]
The problem is that if I execute "train.shape", it gives me (89512, 61), but when I execute "X.shape", it gives me (89512, 59). I was expecting 60, since I want to operate on all the columns except the first one. Can anyone please help me solve this?
In the line
X = train[:,1:-1]
you cut off the last column: -1 refers to the last column, and a Python slice includes its start but excludes its end, so lst[2:6] gives you entries 2, 3, 4, and 5. Correct it to
X = train[:,1:]
BTW, you can make your code format properly by indenting each line four spaces (you can just highlight it and hit Ctrl+K).
The thing you should know about slicing, even for ordinary one-dimensional lists, is that it looks like this:
[start:end]
with start included and end excluded.
You can also use these:
[:x] # from the start up to (but not including) x
[x:] # from x to the end
You can then generalize this to 2D or more, so in your case it would be:
X = train[:, 1:] # the first : takes all rows, and 1: takes all columns except the first
The official Python tutorial's section on lists covers slicing in more depth if you want to practice.
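A quick sketch of the difference on a small array standing in for the real data (3 rows, 5 columns):

```python
import numpy as np

train = np.arange(15).reshape(3, 5)   # 3 rows, 5 columns

y = train[:, 0]            # just the first column
X_wrong = train[:, 1:-1]   # drops the first AND the last column
X_right = train[:, 1:]     # drops only the first column

print(X_wrong.shape)       # (3, 3)
print(X_right.shape)       # (3, 4)
```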

Pandas to_CSV break lines at a given length

I am outputting items from a dataframe to a csv. The rows, however, are too long. I need the csv to have line breaks (\n) inserted every X items (columns) so that individual rows in the output aren't too long. Is there a way to do this?
A,B,C,D,E,F,G,H,I,J,K
Becomes in the file (X=3) -
A,B,C
D,E,F
G,H,I
J,K
EDIT:
I have a 95% solution (assuming you have only 1 column):
size = 50
# have to use numpy since range is now an immutable type in python 3
indexes = np.arange(0, len(data), size)
indexes = np.append(indexes, [len(data)])  # add the uneven final index
i = 0
while i < len(indexes) - 1:
    holder = pd.DataFrame(data.iloc[indexes[i]:indexes[i+1]]).T
    holder.to_csv(filename, index=False, header=False)
    i += 1
The only weirdness is that, despite not throwing any errors, the final loop of the while (with the uneven final index) does not write to the file, even though the information is in holder perfectly. Since no errors are thrown, I cannot figure out why the final information is not being written.
Assuming you have a number of values that is a multiple of 3 (note how I added L):
s = pd.Series(["A","B","C","D","E","F","G","H","I","J","K","L"])
df = pd.DataFrame(s.to_numpy().reshape((-1, 3)))
(A pandas Series has no .reshape method, so reshape the underlying NumPy array instead.) You can then write df to CSV.
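If the number of values is not a multiple of X (like the 11-item row in the question), one option (a sketch; the empty-string padding choice is my own) is to pad the tail before reshaping:

```python
import numpy as np
import pandas as pd

values = list("ABCDEFGHIJK")   # 11 items, not a multiple of 3
X = 3

pad = (-len(values)) % X       # fillers needed to reach a multiple of X
padded = values + [""] * pad   # pad the tail with empty strings
df = pd.DataFrame(np.array(padded).reshape(-1, X))

df.to_csv("wrapped.csv", index=False, header=False)
print(df)
```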

Saving/loading a table (with different column lengths) using numpy

A bit of context: I am writing some code to save the data I plot to a text file. The data should be stored in such a way that it can be loaded back by a script and displayed again (this time without redoing any calculation). The initial idea was to store the data in columns with the format x1, y1, x2, y2, x3, y3, ...
I am using code that simplifies to something like this (incidentally, I am not sure whether using a list to group my arrays is the most efficient approach):
import numpy as np
MatrixResults = []
x1 = np.array([1,2,3,4,5,6])
y1 = np.array([7,8,9,10,11,12])
x2 = np.array([0,1,2,3])
y2 = np.array([0,1,4,9])
MatrixResults.append(x1)
MatrixResults.append(y1)
MatrixResults.append(x2)
MatrixResults.append(y2)
MatrixResults = np.array(MatrixResults)
TextFile = open('/Users/UserName/Desktop/Datalog.txt',"w")
np.savetxt(TextFile, np.transpose(MatrixResults))
TextFile.close()
However, this code gives an error when any of the data sets have different lengths. I have read similar questions:
Can numpy.savetxt be used on N-dimensional ndarrays with N>2?
Table, with the different length of columns
However, these require breaking the format (either by flattening or by padding the shorter columns with filler strings).
My issue summarises as:
1) Is there any method that, at the same time the arrays are transposed, saves them individually as consecutive columns?
2) Or is there any way to append columns to a text file (given a certain number of rows and columns to skip)?
3) Should I try this with another library such as pandas?
Thank you very much for any advice.
Edit 1:
After looking a bit more, it seems that leaving blank spaces is less efficient than filling the lists.
In the end I wrote my own routine (I am not sure whether there is a numpy function for this) in which I pad the arrays to a common length with "nan" values.
To get the data back I use the genfromtxt method and then this line:
x = x[~isnan(x)]
to remove those cells from the arrays.
If I find a better solution I will post it :)
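A sketch of that nan-padding approach (the pad_columns helper name is my own); each array is padded with nan up to the longest length so they stack as columns:

```python
import numpy as np

def pad_columns(arrays):
    """Pad 1-D arrays with nan so they all share the longest length."""
    n = max(len(a) for a in arrays)
    return np.column_stack([np.pad(a.astype(float), (0, n - len(a)),
                                   constant_values=np.nan)
                            for a in arrays])

x1 = np.array([1, 2, 3, 4, 5, 6]); y1 = np.array([7, 8, 9, 10, 11, 12])
x2 = np.array([0, 1, 2, 3]);       y2 = np.array([0, 1, 4, 9])

table = pad_columns([x1, y1, x2, y2])
np.savetxt('Datalog.txt', table)

# Loading back, drop the padding column by column:
loaded = np.genfromtxt('Datalog.txt')
x2_back = loaded[:, 2]
x2_back = x2_back[~np.isnan(x2_back)]
print(x2_back)   # [0. 1. 2. 3.]
```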
To save your arrays you can use np.savez and read them back with np.load:
# Write each array to one .npz archive (stored as arr_0, arr_1, ...)
np.savez(filename, *matrixResults)
# Read back
data = np.load(filename + '.npz')
matrixResults = [data[key] for key in data.files]
Note that matrixResults has to stay a plain list of arrays here: converting it with np.array first is exactly what fails for columns of different lengths. As a side note, you should follow Python naming conventions, i.e. only class names start with upper-case letters.
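Since positional savez arguments only get generic arr_0, arr_1, ... keys, a variant worth considering is passing keyword names so the x/y pairing stays explicit:

```python
import numpy as np

x1 = np.array([1, 2, 3, 4, 5, 6]); y1 = np.array([7, 8, 9, 10, 11, 12])
x2 = np.array([0, 1, 2, 3]);       y2 = np.array([0, 1, 4, 9])

# Each keyword becomes the key the array is stored under.
np.savez('Datalog.npz', x1=x1, y1=y1, x2=x2, y2=y2)

data = np.load('Datalog.npz')
print(sorted(data.files))    # ['x1', 'x2', 'y1', 'y2']
x2_back = data['x2']         # arrays keep their original, unequal lengths
```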
