Hi, I'm using the df.mode() function with df.mode(axis=1) to find the most common value in each row. This gives me an extra column; how can I get only one column?
for example I have a data frame
0 1 2 3 4
1 1 0 1 1 1
2 0 1 0 0 1
3 0 0 1 1 0
so I want the output
1 1
2 0
3 0
but I am getting
1 1 NaN
2 0 NaN
3 0 NaN
Does anyone know why?
The code you tried gives the expected output in Python 3.7.6 with Pandas 1.0.3.
import pandas as pd
df = pd.DataFrame(
data=[[1, 0, 1, 1, 1], [0, 1, 0, 0, 1], [0, 0, 1, 1, 0]],
index=[1, 2, 3])
df
0 1 2 3 4
1 1 0 1 1 1
2 0 1 0 0 1
3 0 0 1 1 0
df.mode(axis=1)
0
1 1
2 0
3 0
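Note that df.mode(axis=1) grows extra columns whenever some row has more than one mode: each tied value gets its own column, and rows with fewer modes are padded with NaN. A minimal sketch of that behavior (made-up data):
import pandas as pd
df_tie = pd.DataFrame([[1, 1, 0, 0],   # tie: 0 and 1 each appear twice
                       [1, 1, 1, 0]])  # single mode: 1
df_tie.mode(axis=1)
#      0    1
# 0  0.0  1.0
# 1  1.0  NaN
To keep only one value per row, select the first column: df_tie.mode(axis=1)[0].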
There could be different data types in your columns, and mode cannot compare columns of different data types.
Use astype(str) or astype(int) to convert your Series to a suitable data type, and make sure the dtypes are consistent across the DataFrame before calling mode(axis=1).
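For example, a sketch of normalizing the dtypes first (assuming the columns should be integers):
df = df.astype(int)  # make every column the same dtype before taking the mode
df.mode(axis=1)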
My head is spinning trying to figure out if I have to use pivot_table, melt, or some other function.
I have a DF that looks like this:
month day week_day classname_en origin destination
0 1 7 2 1 2 5
1 1 2 6 2 1 167
2 2 1 5 1 2 54
3 2 2 6 4 1 6
4 1 2 6 5 6 1
But I want to turn it into something like:
month_1 month_2 ... classname_en_1 classname_en_2 ... origin_1 origin_2 ... destination_1
0 1 0 1 0 0 1 0
1 1 0 0 1 1 0 0
2 0 1 1 0 0 1 0
3 0 1 0 0 1 0 0
4 1 0 0 0 0 0 1
Basically, turn all values into columns and fill the rows with binary indicators: 1 if the value is present in that row, 0 if not.
I don't know if this is possible with a single function, but I would appreciate any help!
To expand on @Corralien's answer:
It is indeed a way to do it, but since you say this is for ML purposes, it might introduce a bug.
With the code above you get a matrix with 20 features. Now, say you want to predict on some data that suddenly has one more month than your training data; your prediction matrix would then have 21 features, so you could not pass it to your fitted model.
To overcome this you can use one-hot encoding from sklearn. It will make sure that you always have the same number of features on new data as on your training data.
import pandas as pd
df_train = pd.DataFrame({"color":["red","blue"],"age":[10,15]})
pd.get_dummies(df_train)
# output
age color_blue color_red
0 10 0 1
1 15 1 0
df_new = pd.DataFrame({"color":["red","blue","green"],"age":[10,15,20]})
pd.get_dummies(df_new)
#output
age color_blue color_green color_red
0 10 0 0 1
1 15 1 0 0
2 20 0 1 0
As you can see, the set and order of the color dummy columns has also changed (color_green now sits between color_blue and color_red).
If we use OneHotEncoder instead, all those issues go away:
from sklearn.preprocessing import OneHotEncoder
df_train = pd.DataFrame({"color":["red","blue"],"age":[10,15]})
ohe = OneHotEncoder(handle_unknown="ignore")
color_ohe_transformed = ohe.fit_transform(df_train[["color"]])  # creates a sparse matrix
ohe_features = ohe.get_feature_names_out()  # ['color_blue', 'color_red']
pd.DataFrame(color_ohe_transformed.todense(),columns = ohe_features, dtype=int)
# output
color_blue color_red
0 0 1
1 1 0
# now transform new data
df_new = pd.DataFrame({"color":["red","blue","green"],"age":[10,15,20]})
new_data_ohe_transformed = ohe.transform(df_new[["color"]])
pd.DataFrame(new_data_ohe_transformed.todense(), columns=ohe_features, dtype=int)
#output
color_blue color_red
0 0 1
1 1 0
2 0 0
Note that in the last row both blue and red are zero, since that row has color="green", which was not present in the training data.
Note that the todense() function is only used here to illustrate how it works. Usually you would keep it a sparse matrix and use e.g. scipy.sparse.hstack to append your other features, such as age, to it.
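For completeness, a sketch of that pattern, appending the numeric age column to the sparse one-hot matrix with scipy.sparse.hstack (variable names reuse the example above):
import scipy.sparse as sp
age_col = sp.csr_matrix(df_train[["age"]].to_numpy())               # shape (2, 1)
X_train = sp.hstack([color_ohe_transformed, age_col], format="csr")
X_train.shape  # (2, 3): color_blue, color_red, age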
Use pd.get_dummies:
out = pd.get_dummies(df, columns=df.columns)
print(out)
# Output
month_1 month_2 day_1 day_2 day_7 week_day_2 week_day_5 ... origin_2 origin_6 destination_1 destination_5 destination_6 destination_54 destination_167
0 1 0 0 0 1 1 0 ... 1 0 0 1 0 0 0
1 1 0 0 1 0 0 0 ... 0 0 0 0 0 0 1
2 0 1 1 0 0 0 1 ... 1 0 0 0 0 1 0
3 0 1 0 1 0 0 0 ... 0 0 0 0 1 0 0
4 1 0 0 1 0 0 0 ... 0 1 1 0 0 0 0
[5 rows x 20 columns]
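If you later need to score new rows with this plain-pandas approach, a common pattern (sketched here with a hypothetical df_new holding the same raw columns) is to reindex the new dummies to the training columns:
new_out = pd.get_dummies(df_new, columns=df_new.columns)
# Align to the training layout: unseen categories are dropped,
# categories missing from df_new are filled with 0.
new_out = new_out.reindex(columns=out.columns, fill_value=0)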
You can use the get_dummies function of pandas to convert values into columns based on the data.
For your data, the code would be:
import pandas as pd
df = pd.DataFrame({
'month': [1, 1, 2, 2, 1],
'day': [7, 2, 1, 2, 2],
'week_day': [2, 6, 5, 6, 6],
'classname_en': [1, 2, 1, 4, 5],
'origin': [2, 1, 2, 1, 6],
'destination': [5, 167, 54, 6, 1]
})
response = pd.get_dummies(df, columns=df.columns)
print(response)
Result: the same 20-column dummy table shown in the previous answer.
In the DataFrame below, the column "CumRetperTrade" consists of a few vertical vectors (sequences of numbers) separated by zeros; these vectors correspond to the non-zero elements of the column "Portfolio".
I would like to find the local maximum of every non-zero vector contained in the column "CumRetperTrade".
To be precise, I would like to transform (using vectorized or other methods) the column "CumRetperTrade" into the column "PeakCumRet" (the desired result), which gives each vector contained in "CumRetperTrade" its local maximum value. A numeric example is below. Thanks in advance.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({"Portfolio": [1, 1, 1, 0, 0, 0, 1, 1, 1],
                    "CumRetperTrade": [3, 2, 1, 0, 0, 0, 4, 2, 1],
                    "PeakCumRet": [3, 3, 3, 0, 0, 0, 4, 4, 4]})
df1
Portfolio CumRetperTrade PeakCumRet
1 3 3
1 2 3
1 1 3
0 0 0
0 0 0
0 0 0
1 4 4
1 2 4
1 1 4
You can use:
df1['PeakCumRet'] = (df1.groupby(df1['Portfolio'].ne(df1['Portfolio'].shift()).cumsum())
['CumRetperTrade'].transform('max')
)
Output:
Portfolio CumRetperTrade PeakCumRet
0 1 3 3
1 1 2 3
2 1 1 3
3 0 0 0
4 0 0 0
5 0 0 0
6 1 4 4
7 1 2 4
8 1 1 4
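How it works: the ne/shift/cumsum idiom assigns a run id to every consecutive block of equal "Portfolio" values, and transform('max') broadcasts each run's maximum back onto its rows:
group_id = df1['Portfolio'].ne(df1['Portfolio'].shift()).cumsum()
group_id.tolist()  # [1, 1, 1, 2, 2, 2, 3, 3, 3]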
I need to find the rows where columns A, B, and C all have the value 1, and then create a new column that holds the result.
My idea is to use np.where() with some condition, but I don't know the correct way to approach this problem; from what I have read, I'm not supposed to iterate through a DataFrame but rather use pandas' vectorized methods?
import pandas as pd

df1 = pd.DataFrame({'A': [0, 1, 1, 0],
                    'B': [1, 1, 0, 1],
                    'C': [0, 1, 1, 1]},
                   index=[0, 1, 2, 4])
print(df1)
what I am after is this:
A B C TRUE
0 0 1 0 0
1 1 1 1 1 <----
2 1 0 1 0
4 0 1 1 0
If the data is always 0/1, you can simply take the product per row:
df1['TRUE'] = df1.prod(axis=1)
output:
A B C TRUE
0 0 1 0 0
1 1 1 1 1
2 1 0 1 0
4 0 1 1 0
This is what you are looking for:
df1["TRUE"] = (df1==1).all(axis=1).astype(int)
As a minimal working example, I have a file.txt containing a list of numbers:
1.1
2.1
3.1
4.1
5.1
6.1
7.1
8.1
which actually should be presented with indices that make it a 3D array:
0 0 1.1
1 0 2.1
0 1 3.1
1 1 4.1
0 2 5.1
1 2 6.1
0 3 7.1
1 3 8.1
I want to import the 3D array into Python. I have been using bash to generate the indices and paste them onto file.txt before importing the resulting full.txt into Python using pandas:
for ((y=0;y<=3;y++)); do
for ((x=0;x<=1;x++)); do
echo -e "$x\t$y"
done
done
done > index.txt
paste index.txt file.txt> full.txt
The writing of index.txt has been slow in my actual code, where x goes up to 9000 and y up to 5000. Is there a way to generate the indices into the first two columns of a 2D numpy array, so that I only need to import the data from file.txt as the third column?
I would recommend using pandas for loading the data and managing columns of different types.
We can generate the indices with np.indices for the desired dimensions and reshape them to match your format, then concatenate the data from 'file.txt'.
Creating the index for (9000, 5000) takes about 950 ms on a Colab instance.
import numpy as np
import pandas as pd
x,y = 2,4 # dimensions, also works with 9000,5000 but assumes 'file.txt' has the correct size
pd.concat([
pd.DataFrame(np.indices((x,y)).ravel('F').reshape(-1,2), columns=['ind1','ind2']),
pd.read_csv('file.txt', header=None, names=['Value'])
], axis=1)
Out:
ind1 ind2 Value
0 0 0 1.1
1 1 0 2.1
2 0 1 3.1
3 1 1 4.1
4 0 2 5.1
5 1 2 6.1
6 0 3 7.1
7 1 3 8.1
How this works
First create the indices for your desired dimensions with np.indices
np.indices((2,4))
Out:
array([[[0, 0, 0, 0],
[1, 1, 1, 1]],
[[0, 1, 2, 3],
[0, 1, 2, 3]]])
This gives us the right indices, but in the wrong order.
With ravel('F') we can flatten the array in column-major (Fortran-style) order:
np.indices((2,4)).ravel('F')
Out:
array([0, 0, 1, 0, 0, 1, 1, 1, 0, 2, 1, 2, 0, 3, 1, 3])
To get the desired columns, reshape into a 2D array of shape (8, 2); with (-1, 2) the first dimension is inferred.
np.indices((2,4)).ravel('F').reshape(-1,2)
Out:
array([[0, 0],
[1, 0],
[0, 1],
[1, 1],
[0, 2],
[1, 2],
[0, 3],
[1, 3]])
Then convert into a dataframe with columns ind1 and ind2.
Working with more dimensions
pd.DataFrame(np.indices((2,4,3)).ravel('F').reshape(-1,3)).add_prefix('ind')
Out:
ind0 ind1 ind2
0 0 0 0
1 1 0 0
2 0 1 0
3 1 1 0
4 0 2 0
5 1 2 0
6 0 3 0
7 1 3 0
8 0 0 1
9 1 0 1
10 0 1 1
11 1 1 1
12 0 2 1
13 1 2 1
14 0 3 1
15 1 3 1
16 0 0 2
17 1 0 2
18 0 1 2
19 1 1 2
20 0 2 2
21 1 2 2
22 0 3 2
23 1 3 2
Here is a quick example of how to create the 3D array from a 1D array. As dummy data I use random numbers; the code then creates tuples of (x, y, value).
It takes about a minute for 45M rows.
from random import randrange
x = 5000
y = 9000
numbers = [randrange(100000,999999) for i in range(x*y)]
array = [(a, b, numbers[b*x + a]) for a in range(x) for b in range(y)]  # b*x + a maps each (a, b) pair to a unique index
Output
pd.DataFrame(array)
Out[23]:
0 1 2
0 0 0 878704
1 0 1 524573
2 0 2 943657
3 0 3 496507
4 0 4 802714
If you want to stick with your bash approach, you can at least avoid the inner loop:
Code:
for ((y=0;y<=3;y++)); do
echo -e "0\t$y\n1\t$y"
done
Output:
0 0
1 0
0 1
1 1
0 2
1 2
0 3
1 3
The above in Python is:
Code:
for y in range(4):
print(f'0\t{y}\n1\t{y}')
Output:
0 0
1 0
0 1
1 1
0 2
1 2
0 3
1 3
I'm trying to one-hot encode one column of a dataframe.
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
minitable = enc.fit_transform(df["ids"])
But I'm getting
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17
and will raise ValueError in 0.19.
Is there a workaround for this?
I think you can use get_dummies:
df = pd.DataFrame({'ids':['a','b','c']})
print (df)
ids
0 a
1 b
2 c
print (df.ids.str.get_dummies())
a b c
0 1 0 0
1 0 1 0
2 0 0 1
EDIT:
If the input is a column of lists, first cast it to str, remove the [] with strip, and call get_dummies:
df = pd.DataFrame({'ids':[[0,4,5],[4,7,8],[5,1,2]]})
print(df)
ids
0 [0, 4, 5]
1 [4, 7, 8]
2 [5, 1, 2]
print (df.ids.astype(str).str.strip('[]').str.get_dummies(', '))
0 1 2 4 5 7 8
0 1 0 0 1 1 0 0
1 0 0 0 1 0 1 1
2 0 1 1 0 1 0 0
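If you would rather keep OneHotEncoder, the warning is about the 1d input: pass a 2D, single-column selection instead of a Series. A sketch for recent scikit-learn versions, which accept string categories directly:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
# df[["ids"]] (double brackets) is 2D, so there is no 1d-array deprecation warning
minitable = enc.fit_transform(df[["ids"]])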