Here is a dataframe which I want to convert to features and label list/arrays.
The dataframe represents Fedex Ground Shipping rates for weight and zone Ids (columns of the dataframe).
The features need to be like below
[weight,zone]
e.g. [[1,2],[1,3] ...[1,25],[2,2],[2,3] ...[2,25]....[8,25]]
And the labels corresponding to them are basically the shipping charges so,
[[shipping charge]]
e.g. [[8.95],[9.44] .....[35.18]]
I am using the following code, but I am sure there has to be a faster, more optimized, and perhaps more direct way to achieve this, either with the dataframe or with numpy:
i = 0
j = 0
for weight in df_ground.Weight:
    for column in column_list[1:]:  # skipping the Weight column
        features[j] = [df_ground.Weight[i], column]
        labels[j] = df_ground[column][df_ground['Weight'] == df_ground.Weight[i]]
        j += 1
    i += 1
For a dataframe of size 2700 this code takes between 1 and 2 seconds. I am asking for suggestions on a more optimized way.
First, make 'Weight' the index and stack the remaining columns into it:
mixed = df_ground.set_index('Weight').stack()
#Weight
#1 2 8.95
# 3 9.44
# 4 9.89
#....
#2 2 9.24
# 3 9.92
# 4 10.41
Now, your new index is your features and the data column is your labels:
features = [list(x) for x in mixed.index]
#[[1, 2], [1, 3], [1, 4], ..., [2, 2], [2, 3], [2, 4], ...]
labels = [[x] for x in mixed.values]
#[[8.95], [9.44], [9.89], [9.24], [9.92], [10.41]]
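If you ultimately need numpy arrays (e.g. for scikit-learn) rather than lists, here is a minimal sketch building them straight from the stacked Series:
import numpy as np
features = np.array(mixed.index.tolist())  # shape (n, 2): [[1, 2], [1, 3], ...]
labels = mixed.to_numpy().reshape(-1, 1)   # shape (n, 1): [[8.95], [9.44], ...]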
What I want to do
I would like to count the number of rows that satisfy a condition, with a separate count for each column.
import numpy as np
import pandas as pd
## Sample DataFrame
data = [[1, 2], [0, 3], [np.nan, np.nan], [1, -1]]
index = ['i1', 'i2', 'i3', 'i4']
columns = ['c1', 'c2']
df = pd.DataFrame(data, index=index, columns=columns)
print(df)
## Output
# c1 c2
# i1 1.0 2.0
# i2 0.0 3.0
# i3 NaN NaN
# i4 1.0 -1.0
## Question 1: Count non-NaN values
## Expected result
# [3, 3]
## Question 2: Count non-zero numerical values
## Expected result
# [2, 3]
Note: Data types of results are not important. They can be list, pandas.Series, pandas.DataFrame etc. (I can convert data types anyway.)
What I have checked
## For Question 1
print(df[df['c1'].apply(lambda x: not pd.isna(x))].count())
## For Question 2
print(df[df['c1'] != 0].count())
Obviously these two print calls apply the condition to column c1 only. It's easy to check the columns one by one; I would like to know if there is a way to compute the counts for all columns at once.
Environment
Python 3.10.5
pandas 1.4.3
You should not iterate over your data with apply. You can achieve both results in a vectorized fashion:
print(df.notna().sum().to_list()) # [3, 3]
print((df.ne(0) & df.notna()).sum().to_list()) # [2, 3]
Note that I have assumed "Question 2: Count non-zero numerical values" also excludes NaN values; otherwise you would get [3, 4].
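As a side note, the built-in df.count() gives the first result directly, since it skips NaN by construction:
print(df.count().to_list())  # [3, 3]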
You were close, I think! To answer your first question:
>>> df.apply(lambda x: x.notna().sum(), axis=0)
c1    3
c2    3
dtype: int64
Change axis to 1 to apply this operation to each row instead.
To answer your second question, this comes from an already answered question on SO:
>>> df.astype(bool).sum(axis=0)
c1 3
c2 4
dtype: int64
In the same way, you can change axis to 1 if you want. Hope it helps!
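Note that NaN is truthy, so astype(bool) counts it as non-zero; that is why c1 shows 3 rather than the expected 2. A small variant that zeroes out NaN first:
>>> df.fillna(0).astype(bool).sum(axis=0)
c1    2
c2    3
dtype: int64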
I have a pandas data series that acts like a reference of certain values for specific labels. I would like to populate the values for the corresponding "label/index" into another dataframe. As an example:
import pandas as pd
A = pd.DataFrame(index=[0, 1, 2], data=[[1, 2, "goat"], [4, 5, "monkey"], [7, 8, "goat"]], columns=["I", "L", "data"])
B = pd.DataFrame(index=["goat", "monkey", "sheep"], data=[[10], [40], [70]])
Here B acts like a reference for the labels, indicated as animals. I would like to add a column to dataframe A and fill in the corresponding value for the animal in the data column, i.e. the final result should look like:
A = pd.DataFrame(index=[0, 1, 2], data=[[1, 2, "goat", 10], [4, 5, "monkey", 40], [7, 8, "goat", 10]], columns=["I", "L", "data", "data value"])
I could loop over the unique values of B, filter for the corresponding rows, and add the value, but I feel there is a better way to do it in Python.
Let's try join on data, renaming the column in B so it appears as expected in A:
A = A.join(B.rename(columns={0: 'data value'}), on='data')
A:
I L data data value
0 1 2 goat 10
1 4 5 monkey 40
2 7 8 goat 10
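An alternative for a single reference column is Series.map, which performs the same lookup against B's index:
A['data value'] = A['data'].map(B[0])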
I have a for loop that iteratively adds columns to a pandas dataframe, and I wish to name these new columns from a list. I have a convoluted way of doing this now; is there a more elegant way?
When assigning a new column with .assign(), you have to specify the column name as a keyword argument, which cannot be a variable. So I use dummy names and afterwards change the column names based on a list I defined beforehand. This doesn't seem too elegant, though.
The dataframe columns should be [wavelength, layers[0]_n, layers[0]_k, ... layers[z]_n, layers[z]_k]
layers = ['Ag', 'SiO2', 'Au']
colnames = ['wavelength']
for l in layers:
    colnames.append(l + '_n')
    colnames.append(l + '_k')

n = pd.read_csv('matdata\\' + layers[0] + '.csv')
n = n.iloc[:, [0]]  # keep only the wavelength column (as a DataFrame, so .assign still works)
for l in layers:
    data = pd.read_csv('matdata\\' + l + '.csv')  # read the appropriate file
    n = n.assign(a=data.iloc[:, 1].values)  # dummy name, overwritten on each pass of the loop
    n = n.assign(b=data.iloc[:, 2].values)  # dummy name, overwritten on each pass of the loop
n.columns = colnames
Because I don't have access to your CSVs, I am creating some fake data to simulate this process...
Let's start with several DataFrames:
n = pd.DataFrame([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]],
columns=['x', 'y', 'z'])
dfb = pd.DataFrame([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
layers = ['Ag', 'SiO2']
for layer in layers:
    n[layer] = dfb.iloc[:, 1].values
Yields:
x y z Ag SiO2
0 1 2 3 2 2
1 4 5 6 5 5
2 7 8 9 8 8
Using this technique rather than .assign() allows a variable name to be used for the column header as each column is created.
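Applied back to the loop in the question, a sketch (assuming, as in your code, that each CSV's first column is the wavelength and the second and third columns hold the n and k values):
import pandas as pd

layers = ['Ag', 'SiO2', 'Au']
n = pd.read_csv('matdata\\' + layers[0] + '.csv').iloc[:, [0]]  # wavelength only
n.columns = ['wavelength']
for l in layers:
    data = pd.read_csv('matdata\\' + l + '.csv')
    n[l + '_n'] = data.iloc[:, 1].values  # column name built from a variable
    n[l + '_k'] = data.iloc[:, 2].values
If you do want to stick with .assign(), dictionary unpacking also accepts variable names: n = n.assign(**{l + '_n': data.iloc[:, 1].values}).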
I have the following dataframe of returns
ret
Out[3]:
Symbol FX OGDC PIB WTI
Date
2010-03-02 0.000443 0.006928 0.000000 0.012375
2010-03-03 -0.000690 -0.007873 0.000171 0.014824
2010-03-04 -0.001354 0.001545 0.000007 -0.008195
2010-03-05 -0.001578 0.008796 -0.000164 0.015955
And the following weights for each symbol:
df3
Out[4]:
Symbol Weight
0 OGDC 0.182022
1 WTI 0.534814
2 FX 0.131243
3 PIB 0.151921
I am trying to get a weighted return for each day and tried:
port_ret = ret.dot(df3)
but I get the following error message:
ValueError: matrices are not aligned
My objective is to have a weighted return for each date such that, for example 2010-03-02 would be as follows:
weighted_ret = 0.000443*.131243+.006928*.182022+0.000*0.151921+0.012375*.534814 = 0.007937512
I am not sure why I am getting this error, but I would be very happy with an alternative solution for computing the weighted return.
You have two columns in your weight matrix:
df3.shape
Out[38]: (4, 2)
Set the index to Symbol on that matrix to get the proper dot:
ret.dot(df3.set_index('Symbol'))
Out[39]:
Weight
Date
2010-03-02 0.007938
2010-03-03 0.006430
2010-03-04 -0.004278
2010-03-05 0.009902
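If you'd rather get a Series than a one-column frame, select the Weight column before taking the dot product:
port_ret = ret.dot(df3.set_index('Symbol')['Weight'])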
For a dot product of dataframe dfA with dataframe dfB, the column names of dfA must coincide with the index of dfB; otherwise you'll get ValueError: matrices are not aligned:
dfA = pd.DataFrame( data = [[1, 2], [3, 4], [5, 6]], columns=['one', 'two'])
dfB = pd.DataFrame( data = [[1, 2, 3], [4, 5, 6]], index=['one', 'two'])
dfA.dot(dfB)
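which gives:
    0   1   2
0   9  12  15
1  19  26  33
2  29  40  51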
Check the shape of the matrices you're calling the dot product on. The dot product A.dot(B) can be computed only if the second axis of A is the same size as the first axis of B.
In your example the weights sit in a separate Symbol column of df3, which ruins the alignment. Note that positional slicing such as ret[:,1:] is numpy syntax and does not work on a DataFrame; move the symbols into the index instead (as in the other answer) so that pandas can align the two objects.
For future cases, check the .shape attribute (or numpy.shape()) to debug matrix calculations; it is a really helpful tool.
I have a pandas dataframe, e.g.
one two three four five
0 1 2 3 4 5
1 1 1 1 1 1
What I would like is to be able to convert only a select number of columns to a list, such that we obtain:
[[1,2],[1,1]]
These are rows 0 and 1, selecting columns one and two.
Similarly if we selected columns one, two, four:
[[1,2,4],[1,1,1]]
Ideally I would like to avoid iteration of rows as it is slow!
You can select just those columns with:
In [11]: df[['one', 'two']]
Out[11]:
one two
0 1 2
1 1 1
and get the list of lists from the underlying numpy array using tolist:
In [12]: df[['one', 'two']].values.tolist()
Out[12]: [[1, 2], [1, 1]]
In [13]: df[['one', 'two', 'four']].values.tolist()
Out[13]: [[1, 2, 4], [1, 1, 1]]
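In recent pandas (0.24+), the same conversion is usually spelled with .to_numpy():
In [14]: df[['one', 'two']].to_numpy().tolist()
Out[14]: [[1, 2], [1, 1]]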
Note: this should never really be necessary unless this is your end game... it's going to be much more efficient to do the work inside pandas or numpy.
So I worked out how to do it.
First, we select the columns we would like the values from:
y = x[['one','two']]
This gives us a subset df.
Now we can choose the values:
> y.values
array([[1, 2],
[1, 1]])
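Calling .tolist() on that array then gives the list of lists:
> y.values.tolist()
[[1, 2], [1, 1]]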