Python: How to parse variables from several pandas dataframes?

I want to extract the X and y variables from several pandas DataFrames (before proceeding to the next steps). I read each tab-delimited .txt file first, then extract the information.
The error raised is ValueError: too many values to unpack (expected 2).
import pandas as pd
class DataProcessing:
    def __init__(self, data):
        self.df = pd.read_csv(data, sep="\t")
        X, y = self.df.iloc[1:, 1:]
        return X, y
dp_linear_cna = DataProcessing("data_linear_cna.txt")
dp_mrna_seq_v2_rsem = DataProcessing("data_mrna_seq_v2_rsem.txt")
dp_linear_cna.extract_info()
dp_mrna_seq_v2_rsem.extract_info()
Traceback:
ValueError: too many values to unpack (expected 2)

The sep="/t" is supposed to be sep="\t".
Don't iterate over rows or columns; select data by index instead.
E.g. selecting a column: df['some_column_name']

Your coding style is quite bad. First of all, don't return anything from __init__. It's a constructor. Make another method instead.
class DataProcessing:
    def __init__(self, data):
        self.df = pd.read_csv(data, sep="\t")
    def split_data(self):
        X = self.df.iloc[:, :-1]
        y = self.df.iloc[:, -1]
        return X, y
Call your DataProcessing class like this:
def main():
    dp = DataProcessing('data_linear_cna.txt')
    X, y = dp.split_data()
    print(X)
    print()
    print(y)
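The main point here is selection by position via df.iloc[rows, columns].
X, y = self.df.iloc[1:, 1:]
This is not a valid statement. pandas.DataFrame.iloc returns another pandas.DataFrame, not a tuple. Unpacking a DataFrame iterates over its column labels, so with more than two columns you get "too many values to unpack", and with exactly two you would get the labels rather than the data.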
Indexing both axes
You can mix the indexer types for the index and columns. Use : to select the entire axis.
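To make the point about df.iloc[rows, columns] concrete, here is a minimal sketch. The small DataFrame and its column names (f1, f2, target) are made up for illustration and only stand in for the tab-delimited files from the question.

```python
import pandas as pd

# Hypothetical stand-in for the tab-delimited file: two features and one target.
df = pd.DataFrame({"f1": [1, 2, 3], "f2": [4, 5, 6], "target": [0, 1, 0]})

# Iterating over a DataFrame yields its column labels, not its data,
# which is why "X, y = df.iloc[...]" cannot split features from the target.
labels = list(df.iloc[1:, 1:])

# Use two separate positional selections instead.
X = df.iloc[:, :-1]   # all rows, every column except the last (a DataFrame)
y = df.iloc[:, -1]    # all rows, last column only (a Series)

print(labels)
print(X)
print(y)
```

Whether the target really is the last column depends on your files, so adjust the column slices accordingly.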

Related

Duplicate columns & possible reduce dimensionality key error 0 Python Error

I have the following data set:
Its shape is 21 rows x 50 columns.
I would like to apply the following condition:
If "defaultstore" equals 1 in a row, then the "FinalSL" column receives 4 times the value in the "FCST:TOTAL" column.
So I created the following function to do this calculation:
def SLFinal(defaultStore, fcst):
    if (defaultStore==1):
        return (fcst*4)
    else:
        return 2
SLFinal(DFstore.iloc[i],FcstList.iloc[i])
The function works, but I would like to apply it to my dataset, so I created the following loops to walk each row and store the data from the "defaultstore" and "FCST:TOTAL" columns:
Fcst = copiedData.iloc[:,45:46]
FcstList = []
lenOfRows2 = len(copiedData)
for i in range(0, lenOfRows2):
    FcstList.append(Fcst.loc[i])
DFstore = copiedData.iloc[:,46:47]
DFstore
DFstoreList = []
lenOfRows2 = len(copiedData)
for i in range(0, lenOfRows2):
    DFstoreList.append(DFstore.loc[i])
And finally, the new list that will hold the values after the function is applied:
FinalSLlist1 = []
for i in range(0, lenOfRows2 ):
    Rows = []
    for j in range(45, 50):
        Rows.append( SLFinal(DFstore[i],FcstList[i]) )
    FinalSLlist1.append(Rows)
But the following error occurs:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
   2693             # get column
   2694             if self.columns.is_unique:
-> 2695                 return self._get_item_cache(key)
   2696
   2697         # duplicate columns & possible reduce dimensionality
KeyError: 0
What should I do?
You can use boolean indexing and avoid any loops like so:
df.loc[df.defaultstore==1, 'FCST:TOTAL'] *= 4
df.loc[df.defaultstore!=1, 'FCST:TOTAL'] = 2
It might be helpful to look at the pandas documentation on boolean indexing.
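As a side note on where the KeyError: 0 comes from: DFstore is still a DataFrame, and square brackets on a DataFrame look up a column label, not a row position. The sketch below reproduces that with a made-up two-row frame; only the names copiedData and DFstore are taken from the question.

```python
import pandas as pd

# Hypothetical miniature of the question's data.
copiedData = pd.DataFrame({"FCST:TOTAL": [10.0, 20.0], "defaultstore": [1, 0]})

DFstore = copiedData.iloc[:, 1:2]   # a one-column DataFrame, not a list

try:
    DFstore[0]                      # searches for a COLUMN labelled 0
    raised = False
except KeyError:
    raised = True

# Positional access to row 0, column 0 works as expected.
first_row_value = DFstore.iloc[0, 0]

print(raised, first_row_value)
```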
Simply use the apply() method:
import pandas as pd
df['FCST:TOTAL'] = df.apply(lambda x: x['FCST:TOTAL']*4 if x['defaultstore']==1 else 2, axis=1)
OR
If you are familiar with numpy, use its where() function, which is more efficient than the pandas apply() method:
import numpy as np
df['FCST:TOTAL'] = np.where(df['defaultstore']==1, df['FCST:TOTAL']*4, 2)
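The snippets above overwrite "FCST:TOTAL" itself. If the result should go into the "FinalSL" column, as the question describes, a sketch along these lines may be closer to what is wanted. The three-row frame is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data using the column names from the question.
df = pd.DataFrame({"FCST:TOTAL": [10.0, 20.0, 30.0], "defaultstore": [1, 0, 1]})

# Write the result to a separate column so the forecast values stay intact:
# 4 * FCST:TOTAL where defaultstore == 1, otherwise the constant 2.
df["FinalSL"] = np.where(df["defaultstore"] == 1, df["FCST:TOTAL"] * 4, 2)

print(df)
```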

Creating dataset from loops

I created a small dataset, shown below:
Later I formed groups on the CIQ column (using pandas groupby):
Entire code:
fd = pd.read_csv("C:....\Test.csv")
coder_gr = fd.groupby(["CIQ"])
print(coder_gr.first())
for x, y in coder_gr:
    y.Date.duplicated()
Here I check for duplicates inside each group using a for loop.
But I want the output to contain each group's full data together with the duplicate check for that group. For that I tried the code below:
emp = []
for x, y in coder_gr:
    emp.append(y)
    emp.append(y.Date.duplicated())
The output looks like:
Desired output:
I am not getting the output in the proper format, and I don't know how to set it up.
Try this (pd.option_context only takes effect inside a with block):
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    for x, y in coder_gr:
        print(y)
        print(y.Date.duplicated())
Finally I got the answer:
emp = pd.DataFrame()
for x, y in coder_gr:
    emp = emp.append(pd.Series(y), ignore_index=True)
    emp = emp.append(y)
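Note that DataFrame.append was removed in pandas 2.0, so the loop above only runs on older versions. One loop-free alternative, sketched here with invented rows and an assumed column name is_duplicate, is to flag duplicates on the (CIQ, Date) pair. That marks the same rows as calling Date.duplicated() inside each CIQ group and keeps everything in one DataFrame.

```python
import pandas as pd

# Hypothetical data with the column names used in the question.
fd = pd.DataFrame({
    "CIQ": ["a", "a", "a", "b", "b"],
    "Date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-01", "2020-01-01"],
})

# True where the same Date already appeared earlier within the same CIQ group.
fd["is_duplicate"] = fd.duplicated(subset=["CIQ", "Date"])

print(fd)
```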

How can I apply a function to each row in a pandas dataframe?

I am pretty new to coding so this may be simple, but none of the answers I've found so far have provided information in a way I can understand.
I'd like to take a column of data and apply a function (a x e^bx) where a > 0 and b < 0. The (x) in this case would be the float value in each row of my data.
See what I have so far, but I'm not sure where to go from here....
def plot_data():
    # read the file
    data = pd.read_excel(FILENAME)
    # convert to pandas dataframe
    df = pd.DataFrame(data, columns=['FP Signal'])
    # add a blank column to store the normalized data
    headers = ['FP Signal', 'Normalized']
    df = df.reindex(columns=headers)
    df.plot(subplots=True, layout=(1, 2))
    df['Normalized'] = df.apply(normalize(['FP Signal']), axis=1)
    print(df['Normalized'])
    # show the plot
    plt.show()
# normalization formula (exponential) = a x e ^bx where a > 0, b < 0
def normalize(x):
    x = A * E ** (B * x)
    return x
I can get this image to show, but not the 'normalized' data...
thanks for any help!
Your code is almost correct.
# normalization formula (exponential) = a x e ^bx where a > 0, b < 0
def normalize(x):
    x = A * E ** (B * x)
    return x
def plot_data():
    # read the file
    data = pd.read_excel(FILENAME)
    # convert to pandas dataframe
    df = pd.DataFrame(data, columns=['FP Signal'])
    # add a blank column to store the normalized data
    headers = ['FP Signal', 'Normalized']
    df = df.reindex(columns=headers)
    df['Normalized'] = df['FP Signal'].apply(lambda x: normalize(x))
    print(df['Normalized'])
    df.plot(subplots=True, layout=(1, 2))
    # show the plot
    plt.show()
I changed the apply call to the following: df['FP Signal'].apply(lambda x: normalize(x)).
It operates only on the values in df['FP Signal'], because you don't need the entire row. The lambda receives each value as x and passes it to normalize.
You can also write df['FP Signal'].apply(normalize), which is more direct and simpler. Using a lambda is just my personal preference, and many may disagree.
One small addition: call df.plot(subplots=True, layout=(1, 2)) after you have changed the dataframe. If you plot before changing it, the plot won't reflect the change. df.plot actually draws the plot; plt.show just displays it. That's why df.plot must come after you are done processing your data.
You can use map to apply a function to each value in a column (see pandas.Series.map):
s = pd.Series(['cat', 'dog', 'rabbit'])
s.map(lambda x: x.upper())
0       CAT
1       DOG
2    RABBIT
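For a purely arithmetic formula like a x e^bx, apply and map are not strictly needed, because NumPy functions operate on a whole column at once. A minimal sketch, assuming made-up constants A = 2.0 and B = -0.5 (the question only requires a > 0 and b < 0) and a three-value column:

```python
import numpy as np
import pandas as pd

# Hypothetical constants; the question only states a > 0 and b < 0.
A = 2.0
B = -0.5

df = pd.DataFrame({"FP Signal": [0.0, 1.0, 2.0]})

# np.exp is applied element-wise to the column, so no per-row function call is needed.
df["Normalized"] = A * np.exp(B * df["FP Signal"])

print(df)
```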

vaex: shift column by n steps

I'm preparing a big multivariate time series data set for a supervised learning task and I would like to create time shifted versions of my input features so my model also infers from past values. In pandas there's the shift(n) command that lets you shift a column by n rows. Is there something similar in vaex?
I could not find anything comparable in the vaex documentation.
No, we do not support that yet (https://github.com/vaexio/vaex/issues/660). Because vaex is extensible (see http://docs.vaex.io/en/latest/tutorial.html#Adding-DataFrame-accessors), I thought I would give you the solution as a DataFrame accessor:
import vaex
import numpy as np
@vaex.register_dataframe_accessor('mytool', override=True)
class mytool:
    def __init__(self, df):
        self.df = df
    def shift(self, column, n, inplace=False):
        # make a copy without column
        df = self.df.copy().drop(column)
        # make a copy with just the column
        df_column = self.df[[column]]
        # slice off the head and tail
        df_head = df_column[-n:]
        df_tail = df_column[:-n]
        # stitch them together
        df_shifted = df_head.concat(df_tail)
        # and join (based on row number)
        return df.join(df_shifted, inplace=inplace)
x = np.arange(10)
y = x**2
df = vaex.from_arrays(x=x, y=y)
df['shifted_y'] = df.y
df2 = df.mytool.shift('shifted_y', 2)
df2
It generates a single-column DataFrame, slices that up, concatenates the pieces, and joins the result back, all without a single memory copy.
I am assuming a cyclic shift (rotate) here.
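To show what "cyclic" means compared with the pandas shift(n) mentioned in the question, here is a small sketch in pandas and NumPy rather than vaex. The five-value series is made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 1, 4, 9, 16])

# pandas shift: values move down by n rows and the vacated rows become NaN.
shifted = s.shift(2)

# Cyclic shift (rotate): values pushed off the end wrap around to the start,
# which matches concatenating the last n rows in front of the remaining rows.
rotated = pd.Series(np.roll(s.to_numpy(), 2), index=s.index)

print(shifted)
print(rotated)
```

For lagged features in a supervised learning setup, the non-cyclic form is usually what you want, since wrapped-around values would leak data from the end of the series into the start.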
The function needs to be modified slightly in order to work in the latest release (vaex 4.0.0ax), see this thread.
Code by Maarten should be updated as follows:
import vaex
import numpy as np
@vaex.register_dataframe_accessor('mytool', override=True)
class mytool:
    def __init__(self, df):
        self.df = df
    # mytool.shift is the analog of pandas.shift(), but it returns the shifted column under the specified new name
    def shift(self, column, new_column, n, cyclic=True):
        df = self.df.copy().drop(column)
        df_column = self.df[[column]]
        if cyclic:
            df_head = df_column[-n:]
        else:
            df_head = vaex.from_dict({column: np.ma.filled(np.ma.masked_all(n, dtype=float), 0)})
        df_tail = df_column[:-n]
        df_shifted = df_head.concat(df_tail)
        df_shifted.rename(column, new_column)
        return df_shifted
x = np.arange(10)
y = x**2
df = vaex.from_arrays(x=x, y=y)
df2 = df.join(df.mytool.shift('y', 'shifted_y', 2))
df2

A value is trying to be set on a copy of a slice from a DataFrame. using pandas during the initialization

I am trying to initialize an instance and pass it a DataFrame, but for some reason I get a warning (shown after the code):
class TestReg:
    def __init__(self, x, y, create_intercept=False):
        self.x = x
        self.y = y
        if create_intercept:
            self.x['intercept'] = 1
x = data[['class', 'year']]
y = data['performance']
reg = TestReg(x, y, create_intercept=True)
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self.x['intercept'] = 1
Any idea what I am doing wrong?
You are trying to change values in an extract of a DataFrame (a slice, in pandas terminology).
Stripped down, what you are doing is:
x = data[['class', 'year']] # x is a slice here
x['intercept'] = 1 # dangerous because behaviour is undefined => warning
Pandas can return either a copy or a view when you take a slice (here, 2 columns from a DataFrame). It does not matter when you only read data, but it does if you try to change it, hence the warning.
You should pass the original dataframe and only make changes through it:
class TestReg:
    def __init__(self, data, cols, y, create_intercept=False):
        self.data = data
        self.y = y
        if create_intercept:
            self.data['intercept'] = 1
            cols.append('intercept')
        self.x = data[cols]
...
reg = TestReg(data, ['class', 'year'], y, create_intercept=True)
Alternatively, you could force a copy if you do not want to change the original dataframe:
...
x = data[['class', 'year']].copy()
y = data['performance']
reg = TestReg(x, y, create_intercept=True)
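Another option, sketched below, is to take the copy inside the constructor so callers do not have to remember it. The three columns and their values are invented for illustration; only the class name and argument names come from the question.

```python
import pandas as pd

class TestReg:
    def __init__(self, x, y, create_intercept=False):
        # Copy up front so adding a column never touches the caller's DataFrame.
        self.x = x.copy()
        self.y = y
        if create_intercept:
            self.x["intercept"] = 1

# Hypothetical data with the column names from the question.
data = pd.DataFrame({"class": [1, 2], "year": [2019, 2020], "performance": [0.5, 0.7]})

reg = TestReg(data[["class", "year"]], data["performance"], create_intercept=True)

print(reg.x)
print(data.columns.tolist())
```

The trade-off is an extra copy of the feature columns, which is usually negligible unless the frame is very large.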
