My head is spinning trying to figure out if I have to use pivot_table, melt, or some other function.
I have a DF that looks like this:
month day week_day classname_en origin destination
0 1 7 2 1 2 5
1 1 2 6 2 1 167
2 2 1 5 1 2 54
3 2 2 6 4 1 6
4 1 2 6 5 6 1
But I want to turn it into something like:
month_1 month_2 ...classname_en_1 classname_en_2 ... origin_1 origin_2 ...destination_1
0 1 0 1 0 0 1 0
1 1 0 0 1 1 0 0
2 0 1 1 0 0 1 0
3 0 1 0 0 1 0 0
4 1 0 0 0 0 0 1
Basically, I want to turn every value into its own column and fill the rows with binary flags: 1 if the row has that value, 0 if not.
I don't know if this is possible with a single function, but I would appreciate any help!
To expand on @Corralien's answer:
That is indeed a way to do it, but since you mention this is for ML purposes, it might introduce a bug.
With the code above you get a matrix with 20 features. Now, say you want to predict on data that suddenly contains a month not seen in your training data: the matrix for your prediction data would then have 21 features, so you cannot pass it to your fitted model.
To overcome this you can use OneHotEncoder from scikit-learn. It makes sure that new data always gets the same number of features as your training data.
import pandas as pd
df_train = pd.DataFrame({"color":["red","blue"],"age":[10,15]})
pd.get_dummies(df_train)
# output
age color_blue color_red
0 10 0 1
1 15 1 0
df_new = pd.DataFrame({"color":["red","blue","green"],"age":[10,15,20]})
pd.get_dummies(df_new)
#output
age color_blue color_green color_red
0 10 0 0 1
1 15 1 0 0
2 20 0 1 0
and as you can see, the order of the color-binary representation has also changed.
If, on the other hand, we use OneHotEncoder, we can avoid all those issues:
from sklearn.preprocessing import OneHotEncoder
df_train = pd.DataFrame({"color":["red","blue"],"age":[10,15]})
ohe = OneHotEncoder(handle_unknown="ignore")
color_ohe_transformed= ohe.fit_transform(df_train[["color"]]) #creates sparse matrix
ohe_features = ohe.get_feature_names_out() # [color_blue, color_red]
pd.DataFrame(color_ohe_transformed.todense(),columns = ohe_features, dtype=int)
# output
color_blue color_red
0 0 1
1 1 0
# now transform new data
df_new = pd.DataFrame({"color":["red","blue","green"],"age":[10,15,20]})
new_data_ohe_transformed = ohe.transform(df_new[["color"]])
pd.DataFrame(new_data_ohe_transformed.todense(),columns = ohe_features, dtype=int)
#output
color_blue color_red
0 0 1
1 1 0
2 0 0
Note that in the last row both blue and red are zero, since that row has color = "green", which was not present in the training data.
Note that todense() is only used here to illustrate how it works. Usually you would keep the result as a sparse matrix and use e.g. scipy.sparse.hstack to append your other features, such as age, to it.
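The sparse workflow mentioned above could look roughly like the following. This is a minimal sketch, assuming the same toy `color`/`age` frames as before; the names `X_train` and `X_new` are made up for illustration.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

df_train = pd.DataFrame({"color": ["red", "blue"], "age": [10, 15]})
df_new = pd.DataFrame({"color": ["red", "blue", "green"], "age": [10, 15, 20]})

ohe = OneHotEncoder(handle_unknown="ignore")
color_train = ohe.fit_transform(df_train[["color"]])  # sparse, 2 columns
color_new = ohe.transform(df_new[["color"]])          # same 2 columns, "green" is all zeros

# Wrap the numeric column as a sparse matrix and append it on the right
age_train = sparse.csr_matrix(df_train[["age"]].to_numpy())
age_new = sparse.csr_matrix(df_new[["age"]].to_numpy())

X_train = sparse.hstack([color_train, age_train]).tocsr()
X_new = sparse.hstack([color_new, age_new]).tocsr()

print(X_train.shape, X_new.shape)
```

Both matrices end up with the same three columns (two color indicators plus age), which is what a fitted model needs.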
Use pd.get_dummies:
out = pd.get_dummies(df, columns=df.columns)
print(out)
# Output
month_1 month_2 day_1 day_2 day_7 week_day_2 week_day_5 ... origin_2 origin_6 destination_1 destination_5 destination_6 destination_54 destination_167
0 1 0 0 0 1 1 0 ... 1 0 0 1 0 0 0
1 1 0 0 1 0 0 0 ... 0 0 0 0 0 0 1
2 0 1 1 0 0 0 1 ... 1 0 0 0 0 1 0
3 0 1 0 1 0 0 0 ... 0 0 0 0 1 0 0
4 1 0 0 1 0 0 0 ... 0 1 1 0 0 0 0
[5 rows x 20 columns]
You can use the pandas get_dummies function to turn each value into its own column.
For your data the code would be:
import pandas as pd
df = pd.DataFrame({
'month': [1, 1, 2, 2, 1],
'day': [7, 2, 1, 2, 2],
'week_day': [2, 6, 5, 6, 6],
'classname_en': [1, 2, 1, 4, 5],
'origin': [2, 1, 2, 1, 6],
'destination': [5, 167, 54, 6, 1]
})
response = pd.get_dummies(df, columns=df.columns)
print(response)
I have two inputs in a dataframe, and I need to create an output that depends on both inputs (same row, different columns), but also on its previous value (same column, previous row).
This dataframe command will create an example of what I need:
df=pd.DataFrame([[0,0,0], [0,1,0], [0,0,0], [1,1,1], [0,1,1], [0,1,1], [0,0,0], [0,1,0], [0,1,0], [1,1,1], [1,1,1], [0,1,1], [0,1,1], [1,1,1], [0,1,1], [0,1,1], [0,0,0], [0,1,0]], columns=['input_1', 'input_2', 'output'])
The rules are simple:
If input_1 is 1, output is 1 (input_1 is a trigger function)
output will remain as 1 as long as input_2 is also 1. (input_2 works kind of like a memory function)
For all the others, output will be 0
The rows go in sequence as they happen in time, I mean, row 0 output influences row 1 output, row 1 output influences row 2 output, and so on. So output depends on input_1, input_2, but also on its own previous value.
I could code it looping through the dataframe, computing and assigning values using iloc, but it is painfully slow. I need to run this through many thousands of rows for tens of thousands of dataframes, so I am looking for the most efficient way to do it (preferably vectorization). It can be with numpy or other library/method that you know.
I searched and found some questions about vectorization and row-looping, but I still don't see how to use those techniques. Example questions: How to iterate over rows in a DataFrame in Pandas?. Also this one, What is the most efficient way to loop through dataframes with pandas?
I appreciate your help
If I understand you right, you want to know how to compute column output. You can do for example:
import numpy as np

df['output_2'] = (df['input_1'] + df['input_2']).replace(1, np.nan).ffill().replace(2, 1).astype(int)
print(df)
Prints:
input_1 input_2 output output_2
0 0 0 0 0
1 0 1 0 0
2 0 0 0 0
3 1 1 1 1
4 0 1 1 1
5 0 1 1 1
6 0 0 0 0
7 0 1 0 0
8 0 1 0 0
9 1 1 1 1
10 1 1 1 1
11 0 1 1 1
12 0 1 1 1
13 1 1 1 1
14 0 1 1 1
15 0 1 1 1
16 0 0 0 0
17 0 1 0 0
As you explained in the discussion above we have just two inputs loaded using pandas dataframe:
df=pd.DataFrame([[0,0], [0,1], [0,0], [1,1], [0,1], [0,1], [0,0], [0,1], [0,1], [1,1], [1,1], [0,1], [0,1], [1,1], [0,1], [0,1], [0,0], [0,1]], columns=['input_1', 'input_2'])
We have to create outputs using following rules:
#1 if input_1 is one the output is one
#2 if both inputs are zero the output is zero
#3 if input_1 is zero and input_2 is one the output holds the previous value
#4 the initial output value is zero
To generate the outputs we can:
duplicate input_1 to the output
update output with previous value if input_1 is zero and input_2 is one
because of the rules above we don't need to update the first output
df['output'] = df.input_1
for idx, row in df.iterrows():
    if (idx > 0) and (row.input_1 == 0) and (row.input_2 == 1):
        # use .loc instead of chained indexing so the assignment hits df itself
        df.loc[idx, 'output'] = df.loc[idx-1, 'output']
print(df)
The output is:
>>> print(df)
input_1 input_2 output
0 0 0 0
1 0 1 0
2 0 0 0
3 1 1 1
4 0 1 1
5 0 1 1
6 0 0 0
7 0 1 0
8 0 1 0
9 1 1 1
10 1 1 1
11 0 1 1
12 0 1 1
13 1 1 1
14 0 1 1
15 0 1 1
16 0 0 0
17 0 1 0
UPDATE1
A faster way is a modification of the formula proposed by @Andrej:
df['output_2'] = (df['input_1'] + df['input_2'] * 2).replace(2, np.nan).ffill().replace(3, 1).astype(int)
Without the modification, his formula produces the wrong output for the input combination [1, 0]: it holds the previous output instead of setting it to 1.
UPDATE2
This is just to compare the results:
df=pd.DataFrame([[0,0], [1,0], [0,1], [1,1], [0,1], [0,1], [0,0], [0,1], [0,1], [1,1], [1,1], [0,1], [0,1], [1,1], [0,1], [0,1], [0,0], [0,1]], columns=['input_1', 'input_2'])
df['output'] = df.input_1
for idx, row in df.iterrows():
    if (idx > 0) and (row.input_1 == 0) and (row.input_2 == 1):
        # use .loc instead of chained indexing so the assignment hits df itself
        df.loc[idx, 'output'] = df.loc[idx-1, 'output']
df['output_1'] = (df['input_1'] + df['input_2'] * 2).replace(2, np.nan).ffill().replace(3, 1).astype(int)
df['output_2'] = (df['input_1'] + df['input_2']).replace(1, np.nan).ffill().replace(2, 1).astype(int)
print(df)
The result is:
>>> print(df)
input_1 input_2 output output_1 output_2
0 0 0 0 0 0
1 1 0 1 1 0
2 0 1 1 1 0
3 1 1 1 1 1
4 0 1 1 1 1
5 0 1 1 1 1
6 0 0 0 0 0
7 0 1 0 0 0
8 0 1 0 0 0
9 1 1 1 1 1
10 1 1 1 1 1
11 0 1 1 1 1
12 0 1 1 1 1
13 1 1 1 1 1
14 0 1 1 1 1
15 0 1 1 1 1
16 0 0 0 0 0
17 0 1 0 0 0
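For comparison, rules #1 to #4 can also be applied without a Python-level loop and without NaN placeholders. The sketch below is not taken from either answer; `latch` is a hypothetical helper name. It assumes both inputs contain only 0 and 1, and it treats the output before the first row as 0 (rule #4), so a leading [0, 1] row yields 0.

```python
import numpy as np
import pandas as pd

def latch(input_1, input_2):
    """Output is input_1 on 'decided' rows and the previous output on hold rows."""
    i1 = np.asarray(input_1)
    i2 = np.asarray(input_2)
    hold = (i1 == 0) & (i2 == 1)              # rule #3: keep the previous value
    # Position of each decided row, -1 on hold rows
    pos = np.where(hold, -1, np.arange(len(i1)))
    # Running maximum gives the most recent decided row at or before each row
    last = np.maximum.accumulate(pos)
    # On a decided row the output equals input_1 (rules #1 and #2);
    # rows before any decided row fall back to the initial 0 (rule #4)
    return np.where(last >= 0, i1[last], 0)

df = pd.DataFrame([[0,0], [1,0], [0,1], [1,1], [0,1], [0,1], [0,0], [0,1], [0,1],
                   [1,1], [1,1], [0,1], [0,1], [1,1], [0,1], [0,1], [0,0], [0,1]],
                  columns=['input_1', 'input_2'])
df['output_np'] = latch(df['input_1'], df['input_2'])
print(df)
```

On this data `output_np` should agree with the looped `output` column, including the [1, 0] row.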
I have two dataframes, x.head() looks like this:
top mid adc support jungle
Irelia Ahri Jinx Janna RekSai
Gnar Ahri Caitlyn Leona Rengar
Renekton Fizz Sivir Annie Rengar
Irelia Leblanc Sivir Thresh JarvanIV
Gnar Lissandra Tristana Janna JarvanIV
and dataframe fullmatrix.head() that I have created looks like this:
Irelia Gnar Renekton Kassadin Sion Jax Lulu Maokai Rumble Lissandra ... XinZhao Amumu Udyr Ivern Shaco Skarner FiddleSticks Aatrox Volibear MonkeyKing
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ...
Now what I cannot figure out is how to assign a value of 1 for each name in the x dataframe to the respective column that has the same name in the fullmatrix dataframe row by row (both dataframes have the same number of rows).
I'm sure this can be improved but one advantage is that it only requires the first DataFrame, and it's conceptually nice to chain operations until you get the desired solution.
fullmatrix = (x.stack()
               .reset_index(name='names')
               .pivot(index='level_0', columns='names', values='names')
               .notna()      # True where the name occurs in that row
               .astype(int)
               .reset_index(drop=True))
Note that only the names that appear in your x DataFrame will appear as columns in fullmatrix. If you want the additional columns you can simply perform a join.
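One way to get the additional all-zero columns is sketched below. It uses `reindex` rather than a join, which is a swapped-in technique, and both the two-column `x` and the `all_names` list are made-up stand-ins for the real data.

```python
import pandas as pd

# Cut-down stand-in for x, plus a hypothetical list of every column wanted
x = pd.DataFrame({"top": ["Irelia", "Gnar"], "mid": ["Ahri", "Ahri"]})
all_names = ["Irelia", "Gnar", "Renekton", "Ahri"]

present = (x.stack()
             .reset_index(name="names")
             .pivot(index="level_0", columns="names", values="names")
             .notna()
             .astype(int)
             .reset_index(drop=True))

# Add the names that never occur in x as all-zero columns, in a fixed order
fullmatrix = present.reindex(columns=all_names, fill_value=0)
print(fullmatrix)
```

Reindexing also fixes the column order, which is handy if the matrix later feeds a model.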
Consider adding a key = 1 column and then iterating through each column for a list of pivoted dfs which you then horizontally merge with pd.concat. Finally run a DataFrame.update() to update original fullmatrix with values from pvt_df, aligned to indices.
x['key'] = 1
dfs = []
for col in x.columns[:-1]:
    dfs.append(x.pivot_table(index=x.index, columns=[col], values='key').fillna(0))
pvt_df = pd.concat(dfs, axis=1).astype(int)
fullmatrix.update(pvt_df)
fullmatrix = fullmatrix.astype(int)
fullmatrix # ONLY FOR VISIBLE COLUMNS IN ORIGINAL POST
# Irelia Gnar Renekton Kassadin Sion Jax Lulu Maokai Rumble Lissandra XinZhao Amumu Udyr Ivern Shaco Skarner FiddleSticks Aatrox Volibear MonkeyKing
# 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
The OP wants to build a table of dummy variables from a set of data points. Each data point has 5 attributes, and there are N unique attributes in total.
We will use a simplified dataset to demonstrate how to do it:
5 unique attributes
3 data entries
each data entry contains 3 attributes.
x = pd.DataFrame([['a', 'b', 'c'],
['b', 'd', 'e'],
['e', 'b', 'a']])
fullmatrix = pd.DataFrame([[0 for _ in range(5)] for _ in range(3)],
columns=['a','b','c','d','e'])
""" fullmatrix:
a b c d e
0 0 0 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
"""
# each entry in x_row_joined is a string of attributes delimited by ","
x_row_joined = pd.Series((",".join(row[1]) for row in x.iterrows()))
fullmatrix = x_row_joined.str.get_dummies(sep=',')
The method is inspired by offbyone's answer. It uses pandas.Series.str.get_dummies. We first join each row of x with a chosen delimiter, then call Series.str.get_dummies with that same delimiter as sep, which generates the dummy-variable table for you. (Caution: don't pick a sep that occurs in the values of x.)
I currently have a CSV file that produces a dataframe as follows:
[in]
df = pd.read_csv(file_name)
df.sort_values('TOTAL_MONTHS', inplace=True)
print(df[['TOTAL_MONTHS','COUNTEM']])
[out]
TOTAL_MONTHS COUNTEM
12 0
12 0
12 2
25 10
25 0
37 1
68 3
I want to get the total number of rows (by TOTAL_MONTHS) for which the 'COUNTEM' value falls within a preset bin.
The data is going to be entered into a histogram via excel/powerpoint with:
X-axis = Number of contracts
Y-axis = Total_months
Color of bar = COUNTEM
The input of the graph is like this (columns being COUNTEM bins):
MONTHS 0 1-3 4-6 7-10 10+ 20+
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
...
12 2 1 0 0 0 0
...
25 1 0 0 0 1 0
...
37 0 1 0 0 0 0
...
68 0 1 0 0 0 0
Ideally I'd like the code to output a dataframe in that format.
Interesting problem. Someone who knows pandas better than I do may well have a fancier and simpler solution. However, doing it through iteration is also possible, in the following manner:
#First, imports and create your data
import pandas as pd
DF = pd.DataFrame({'TOTAL_MONTHS' : [12, 12, 12, 25, 25, 37, 68],
'COUNTEM' : [0, 0, 2, 10, 0, 1, 3]
})
#Next create a data frame of 'bins' with the months as index and all
#values set at a default of zero
New_DF = pd.DataFrame({'bin0' : 0,
'bin1' : 0,
'bin2' : 0,
'bin3' : 0,
'bin4' : 0,
'bin5' : 0},
index = DF.TOTAL_MONTHS.unique())
In [59]: New_DF
Out[59]:
bin0 bin1 bin2 bin3 bin4 bin5
12 0 0 0 0 0 0
25 0 0 0 0 0 0
37 0 0 0 0 0 0
68 0 0 0 0 0 0
#Create a list of bins (rather than 20 to infinity I limited it to 100)
bins = [[0], range(1, 4), range(4, 7), range(7, 10), range(10, 20), range(20, 100)]
#Now iterate over the months of the New_DF index and slice the original
#DF where TOTAL_MONTHS equals the month of the current iteration. Then
#get a value count from the original data frame and use integer indexing
#to place the value count in the appropriate column of the New_DF:
for month in New_DF.index:
    monthly = DF[DF['TOTAL_MONTHS'] == month]
    counts = monthly['COUNTEM'].value_counts()
    for count in counts.keys():
        for x in range(len(bins)):
            if count in bins[x]:
                # add rather than assign, so different values in the same bin accumulate
                New_DF.loc[month, New_DF.columns[x]] += counts[count]
Which gives me:
In [62]: New_DF
Out[62]:
bin0 bin1 bin2 bin3 bin4 bin5
12 2 1 0 0 0 0
25 1 0 0 0 1 0
37 0 1 0 0 0 0
68 0 1 0 0 0 0
Which appears to be what you want. You can rename the index as you see fit....
Hope this helps. Perhaps someone has a solution that uses a built in pandas function, but for now this seems to work.
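As for a built-in route, one possibility is `pd.cut` followed by `pd.crosstab`. This is a sketch rather than a drop-in replacement: the bin edges below are an assumption copied from the loop above (0, 1-3, 4-6, 7-9, 10-19, 20 and up), and the `bin0`..`bin5` labels are reused from `New_DF`.

```python
import numpy as np
import pandas as pd

DF = pd.DataFrame({'TOTAL_MONTHS': [12, 12, 12, 25, 25, 37, 68],
                   'COUNTEM': [0, 0, 2, 10, 0, 1, 3]})

# Left-closed bins: [0,1), [1,4), [4,7), [7,10), [10,20), [20,inf)
edges = [0, 1, 4, 7, 10, 20, np.inf]
labels = ['bin0', 'bin1', 'bin2', 'bin3', 'bin4', 'bin5']

# labels=False returns the integer bin number for every row
codes = pd.cut(DF['COUNTEM'], bins=edges, right=False, labels=False)

# Count rows per (month, bin), then add the bins that never occur as zeros
table = (pd.crosstab(DF['TOTAL_MONTHS'], codes)
           .reindex(columns=range(len(labels)), fill_value=0))
table.columns = labels
print(table)
```

To get a row for every month from 0 upward, as in the requested layout, the result could additionally be reindexed on its index with `fill_value=0`.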
I am writing a script to calculate the volume of an arbitrarily shaped 3D object. I don't care whether the object is hollow or not; I need its total volume.
The data model I have is a 3D table (a histogram of pixels) of ones and zeros. Ones are where the object is and zeros where there is nothing. Calculating the volume of a completely filled object is as easy as counting all the pixels that contain a one and multiplying by the pixel volume.
The main difficulty is a hollow object, where zeros are surrounded by ones, so the straightforward method described above is no longer valid. What we need to do is fill the whole object area with ones. Here is a 2D example so you can see what I mean.
A 2D table:
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 1 1 1 1 1 0 0 0
0 0 1 1 0 0 0 1 1 1 0 0
0 0 0 1 0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 1 0 0 0 0
0 0 1 0 0 0 0 1 0 0 0 0
0 0 1 1 1 1 1 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
I need to transform it to this
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 1 1 1 1 1 0 0 0
0 0 1 1 1 1 1 1 1 1 0 0
0 0 0 1 1 1 1 0 0 0 0 0
0 0 1 1 1 1 1 1 0 0 0 0
0 0 1 1 1 1 1 1 0 0 0 0
0 0 1 1 1 1 1 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
If you use scipy you can do this in one line with binary_fill_holes. And this works in n-dimensions. With your example:
import numpy as np
from scipy import ndimage
shape=np.array([
[0,0,0,0,0,0,0,0,0,0,0,0],
[0,0,0,1,1,1,1,1,1,0,0,0],
[0,0,1,1,0,0,0,1,1,1,0,0],
[0,0,0,1,0,0,1,0,0,0,0,0],
[0,0,1,0,0,0,0,1,0,0,0,0],
[0,0,1,0,0,0,0,1,0,0,0,0],
[0,0,1,1,1,1,1,1,0,0,0,0],
[0,0,0,0,0,0,0,0,0,0,0,0]
])
shape[ndimage.binary_fill_holes(shape)] = 1
#Output:
[[0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 1 1 1 1 1 0 0 0]
[0 0 1 1 1 1 1 1 1 1 0 0]
[0 0 0 1 1 1 1 0 0 0 0 0]
[0 0 1 1 1 1 1 1 0 0 0 0]
[0 0 1 1 1 1 1 1 0 0 0 0]
[0 0 1 1 1 1 1 1 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0]]
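Since the question is ultimately about volume in 3D, here is a small sketch of the same call on a voxel grid. The hollow cube and the `voxel_volume` value are made up for illustration; only `binary_fill_holes` itself comes from the answer above.

```python
import numpy as np
from scipy import ndimage

# Hypothetical test object: a 5x5x5 cube with a closed 3x3x3 cavity,
# padded with one layer of empty voxels on every side
voxels = np.zeros((7, 7, 7), dtype=int)
voxels[1:6, 1:6, 1:6] = 1
voxels[2:5, 2:5, 2:5] = 0

filled = ndimage.binary_fill_holes(voxels)   # boolean array, cavity set to True

voxel_volume = 0.5   # assumed volume of a single voxel, in whatever unit applies
volume = filled.sum() * voxel_volume
print(voxels.sum(), filled.sum(), volume)
```

Note that only cavities fully enclosed by ones are filled; a shell with an opening to the outside is left as it is.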
A standard flood fill should be extensible to three dimensions. From Wikipedia, the 2-d version in outline:
1. If the color of node is not equal to target-color, return.
2. Set the color of node to replacement-color.
3. Perform Flood-fill (one step to the west of node, target-color, replacement-color).
Perform Flood-fill (one step to the east of node, target-color, replacement-color).
Perform Flood-fill (one step to the north of node, target-color, replacement-color).
Perform Flood-fill (one step to the south of node, target-color, replacement-color).
4. Return.
Notice that in step 3. you are keeping track of all the adjacent cells. If you change this to find all adjacent cells in 3-d and run as before it should work nicely.
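The 3-d extension described above can be sketched as follows. This version uses an explicit queue instead of recursion (a swapped-in choice, to avoid hitting Python's recursion limit on large grids), and `flood_fill_3d` is a hypothetical name. It assumes the grid is a nested list indexed as grid[z][y][x].

```python
from collections import deque

def flood_fill_3d(grid, start, target, replacement):
    """Replace the 6-connected region of `target` cells that contains `start`."""
    if target == replacement:
        return
    depth, height, width = len(grid), len(grid[0]), len(grid[0][0])
    z, y, x = start
    if grid[z][y][x] != target:
        return
    grid[z][y][x] = replacement
    queue = deque([start])
    # The six face neighbours replace west/east/north/south from the 2-d version
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    while queue:
        z, y, x = queue.popleft()
        for dz, dy, dx in steps:
            nz, ny, nx = z + dz, y + dy, x + dx
            inside = 0 <= nz < depth and 0 <= ny < height and 0 <= nx < width
            if inside and grid[nz][ny][nx] == target:
                grid[nz][ny][nx] = replacement
                queue.append((nz, ny, nx))

# Made-up example: a 3x3x3 block of ones with a single empty cell in the middle
cube = [[[1] * 3 for _ in range(3)] for _ in range(3)]
cube[1][1][1] = 0
flood_fill_3d(cube, (1, 1, 1), target=0, replacement=1)
total = sum(cell for plane in cube for row in plane for cell in row)
print(total)
```

For the volume problem the seed has to lie inside the cavity; when no interior point is known, a common alternative is to flood the outside from a corner and count everything that was not reached.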
Not intuitive and hard to read, but compact:
matrix = [[0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 1, 0],
[0, 1, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0]]
ranges = [1 in m and range(m.index(1), len(m)-list(reversed(m)).index(1)) or None for m in matrix]
result = [[ranges[j] is not None and i in ranges[j] and 1 or 0 for i,a in enumerate(m)] for j,m in enumerate(matrix)]
result
[[0, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 0],
[0, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0]]
matrix=[
[0,0,0,0,0,0,0,0,0,0,0,0],
[0,0,0,1,1,1,1,1,1,0,0,0],
[0,0,1,1,0,0,0,1,1,1,0,0],
[0,0,0,1,0,0,1,0,0,0,0,0],
[0,0,1,0,0,0,0,1,0,0,0,0],
[0,0,1,0,0,0,0,1,0,0,0,0],
[0,0,1,1,1,1,1,1,0,0,0,0],
[0,0,0,0,0,0,0,0,0,0,0,0]
]
def fill (x,y):
    global matrix
    # stop at the edges of the matrix and at cells that are already filled
    if ( x==len (matrix)
         or y==len (matrix[0])
         or x==-1
         or y==-1
         or matrix[x][y]==1 ):
        return
    else:
        matrix[x][y]=1
        fill (x+1,y)
        fill (x-1,y)
        fill (x,y+1)
        fill (x,y-1)

fill (4,4)
for i in matrix:
    print(i)
Assuming you're talking about something like filling a voxel shape, why can't you just do something like this (take it as pseudocode example for the simplified 2D case, as I don't know what data structure you're using - maybe a numpy.array? - so I'm just taking a hypothetical "list of lists as a matrix" and I don't take in consideration the problem of modifying an iterable while traversing it etc.):
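def fill_matrix(matrix, row, column_start, column_end):
    # fill the cells strictly between column_start and column_end
    for j in range(column_start + 1, column_end):
        matrix[row][j] = 1

for i, row in enumerate(matrix):
    last_filled_voxel_j = None  # column of the previous filled voxel in this row
    for j, voxel in enumerate(row):
        if voxel:
            if last_filled_voxel_j is not None:
                fill_matrix(matrix, i, last_filled_voxel_j, j)
            last_filled_voxel_j = j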
...assuming that fill_matrix(matrix, row, column_start, column_end) just fills the row of voxels between and not including column_start and column_end.
I guess this is probably not the answer you're looking for, but can you explain how what you actually need differs from what I sketched above, so we can be of more help?