Currently, I have a dataframe like this:
0 0 0 3 0 0
0 7 8 9 1 0
0 4 5 2 4 0
My code to stack it:
dt = dataset.iloc[:, 0:7].stack().sort_index(level=1).reset_index(level=0, drop=True).to_frame()
dt['variable'] = pandas.Categorical(dt.index).codes + 1
dt.rename(columns={0: index_column_name}, inplace=True)
dt.set_index(index_column_name, inplace=True)
dt['variable'] = numpy.sort(dt['variable'])
However, it drops the first row when stacking, and I want to keep the headers / first row. How would I achieve this?
In essence, I'm losing the data from the first row (i.e. the column headers) and I want to keep it.
Desired Output:
value,variable
0 1
0 1
0 1
0 2
7 2
4 2
0 3
8 3
5 3
3 4
9 4
2 4
0 5
1 5
4 5
0 6
0 6
0 6
Current output:
value,variable
0 1
0 1
7 2
4 2
8 3
5 3
9 4
2 4
1 5
4 5
0 6
0 6
Why not use df.melt, as @WeNYoBen mentioned?
print(df)
1 2 3 4 5 6
0 0 0 0 3 0 0
1 0 7 8 9 1 0
2 0 4 5 2 4 0
print(df.melt())
variable value
0 1 0
1 1 0
2 1 0
3 2 0
4 2 7
5 2 4
6 3 0
7 3 8
8 3 5
9 4 3
10 4 9
11 4 2
12 5 0
13 5 1
14 5 4
15 6 0
16 6 0
17 6 0
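If you want the exact value, variable column order from the desired output, you can reorder after melting. A minimal self-contained sketch, assuming the numeric headers 1-6 from the example above:
import pandas as pd

df = pd.DataFrame([[0, 0, 0, 3, 0, 0],
                   [0, 7, 8, 9, 1, 0],
                   [0, 4, 5, 2, 4, 0]],
                  columns=[1, 2, 3, 4, 5, 6])

# melt keeps the headers as the 'variable' column; reorder to match the desired output
out = df.melt()[['value', 'variable']]
print(out)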
Related
I have a series that I want to cumsum, but starting over every time I hit a 0, something like this:
    orig  wanted result
0      0              0
1      1              1
2      1              2
3      1              3
4      1              4
5      1              5
6      1              6
7      0              0
8      1              1
9      1              2
10     1              3
11     0              0
12     1              1
13     1              2
14     1              3
15     1              4
16     1              5
17     1              6
Any ideas? (pandas, pure Python, or other)
Use df['orig'].eq(0).cumsum() to generate groups starting on each 0, then cumcount to get the increasing values:
df['result'] = df.groupby(df['orig'].eq(0).cumsum()).cumcount()
output:
orig wanted result result
0 0 0 0
1 1 1 1
2 1 2 2
3 1 3 3
4 1 4 4
5 1 5 5
6 1 6 6
7 0 0 0
8 1 1 1
9 1 2 2
10 1 3 3
11 0 0 0
12 1 1 1
13 1 2 2
14 1 3 3
15 1 4 4
16 1 5 5
17 1 6 6
Intermediate:
df['orig'].eq(0).cumsum()
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 2
8 2
9 2
10 2
11 3
12 3
13 3
14 3
15 3
16 3
17 3
Name: orig, dtype: int64
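For reference, a self-contained version of that answer, with the orig column rebuilt from the question's table:
import pandas as pd

df = pd.DataFrame({'orig': [0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]})

# every 0 opens a new group; cumcount numbers rows 0, 1, 2, ... inside each group
df['result'] = df.groupby(df['orig'].eq(0).cumsum()).cumcount()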
import pandas as pd

# note: the column is 'orig', not 'Orig'; the cumsum of the zero-mask gives the
# group ids, which then feed groupby().cumcount() to get the wanted result
condition = df['orig'].eq(0)
df['reset'] = df.groupby(condition.cumsum()).cumcount()
I have the following column in my pandas dataframe named as FailureLabel
ID FailureLabel
0 1 1
1 2 1
2 3 1
3 4 0
4 5 0
5 6 0
6 7 0
7 8 1
8 9 1
9 10 0
10 11 0
11 12 1
12 13 1
I would like to assign a unique_id to this column such that each 1 gets its own unique id, whereas all zeros plus the next 1 share a common unique id.
I tried the following code,
df['unique_id'] = (df['FailureLabel'] | (df['FailureLabel']!=df['FailureLabel'].shift())).cumsum()
which gives me the following output,
ID FailureLabel unique_id
0 1 1 1
1 2 1 2
2 3 1 3
3 4 0 4
4 5 0 4
5 6 0 4
6 7 0 4
7 8 1 5
8 9 1 6
9 10 0 7
10 11 0 7
11 12 1 8
12 13 1 9
But what I desire is,
ID FailureLabel unique_id
0 1 1 1
1 2 1 2
2 3 1 3
3 4 0 4
4 5 0 4
5 6 0 4
6 7 0 4
7 8 1 4
8 9 1 5
9 10 0 6
10 11 0 6
11 12 1 6
12 13 1 7
Use Series.shift with bfill to backfill the first value, compare with 1, and take the cumulative sum:
df['unique_id'] = df['FailureLabel'].shift().bfill().eq(1).cumsum()
print (df)
ID FailureLabel unique_id
0 1 1 1
1 2 1 2
2 3 1 3
3 4 0 4
4 5 0 4
5 6 0 4
6 7 0 4
7 8 1 4
8 9 1 5
9 10 0 6
10 11 0 6
11 12 1 6
12 13 1 7
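A self-contained sketch of that answer, with ID and FailureLabel taken from the question:
import pandas as pd

df = pd.DataFrame({'ID': range(1, 14),
                   'FailureLabel': [1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1]})

# look at the previous label (shift), backfill the first row,
# and let every previous-row 1 start a new id
df['unique_id'] = df['FailureLabel'].shift().bfill().eq(1).cumsum()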
I have the following dataframe:
df = pd.DataFrame({'group_nr':[0,0,1,1,1,2,2,3,3,0,0,1,1,2,2,2,3,3]})
print(df)
group_nr
0 0
1 0
2 1
3 1
4 1
5 2
6 2
7 3
8 3
9 0
10 0
11 1
12 1
13 2
14 2
15 2
16 3
17 3
and would like to change from repeating group numbers to incremental group numbers:
group_nr incremental_group_nr
0 0 0
1 0 0
2 1 1
3 1 1
4 1 1
5 2 2
6 2 2
7 3 3
8 3 3
9 0 4
10 0 4
11 1 5
12 1 5
13 2 6
14 2 6
15 2 6
16 3 7
17 3 7
I can't find a way of doing this without looping through the rows. Does someone have an idea how to implement this nicely?
You can check whether each value differs from the previous one, and take a cumsum of the boolean Series to generate the groups:
df['incremental_group_nr'] = df.group_nr.ne(df.group_nr.shift()).cumsum().sub(1)
print(df)
group_nr incremental_group_nr
0 0 0
1 0 0
2 1 1
3 1 1
4 1 1
5 2 2
6 2 2
7 3 3
8 3 3
9 0 4
10 0 4
11 1 5
12 1 5
13 2 6
14 2 6
15 2 6
16 3 7
17 3 7
Compare with the shifted values using Series.shift and Series.ne, then take the cumulative sum and subtract 1:
df['incremental_group_nr'] = df['group_nr'].ne(df['group_nr'].shift()).cumsum() - 1
print(df)
group_nr incremental_group_nr
0 0 0
1 0 0
2 1 1
3 1 1
4 1 1
5 2 2
6 2 2
7 3 3
8 3 3
9 0 4
10 0 4
11 1 5
12 1 5
13 2 6
14 2 6
15 2 6
16 3 7
17 3 7
Another idea is to backfill the first missing value after the shift with bfill:
df['incremental_group_nr'] = df['group_nr'].ne(df['group_nr'].shift().bfill()).cumsum()
print(df)
group_nr incremental_group_nr
0 0 0
1 0 0
2 1 1
3 1 1
4 1 1
5 2 2
6 2 2
7 3 3
8 3 3
9 0 4
10 0 4
11 1 5
12 1 5
13 2 6
14 2 6
15 2 6
16 3 7
17 3 7
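All three variants build the same boolean change mask. As one more sketch (not from the answers above), groupby(...).ngroup() returns the 0-based group numbers directly:
# a new group starts wherever the value differs from the previous row
changed = df['group_nr'].ne(df['group_nr'].shift())
df['incremental_group_nr'] = df.groupby(changed.cumsum()).ngroup()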
I've succeeded in splitting a DataFrame into several smaller DataFrames. I'm now working on giving these DataFrames sequential names so that they can be called independently.
shuffled = df.sample(frac=1)
result = np.array_split(shuffled, 3)
for part in result:
    print(part, '\n')
movie_id 1 2 5 borda rank IRAM
2 3 4 0 0 4 3 2
1 2 3 0 3 6 2 1
movie_id 1 2 5 borda rank IRAM
4 5 3 0 0 3 4 3
0 1 5 4 4 13 1 4
movie_id 1 2 5 borda rank IRAM
3 4 3 0 0 3 4 3
I want to give sequential names to these separated DataFrames with a loop (or any other helpful method).
For instance:
df_1
movie_id 1 2 5 borda rank IRAM
2 3 4 0 0 4 3 2
1 2 3 0 3 6 2 1
df_2
movie_id 1 2 5 borda rank IRAM
4 5 3 0 0 3 4 3
0 1 5 4 4 13 1 4
df_3
movie_id 1 2 5 borda rank IRAM
3 4 3 0 0 3 4 3
I've been searching for a solution for a while, but I can't find an ideal answer to my problem.
This can be done by creating a dictionary and adding all the dataframes to it:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': np.random.randint(10, size=10)})
shuffled = df.sample(frac=1)
result = np.array_split(shuffled, 3)

d = {}
for i, part in enumerate(result):
    d['df_' + str(i)] = part  # to start the numbering from 1, use str(i + 1)

print(d['df_0'])
Col1
7 7
6 0
4 5
2 3
print(d['df_1'])
Col1
0 0
8 1
1 5
print(d['df_2'])
Col1
5 2
3 2
9 4
df_dict = {}
for index, splited in enumerate(result):
    df_name = "df_{}".format(index)
    # if you want to set the name attribute of the dataframe
    splited.name = df_name
    # if you want to map the name to the dataframe
    df_dict[df_name] = splited
print(df_dict)
{'df_0': movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
9 10 3 2 0 0 0 4 0 0 0 0 0 9
7 8 1 0 0 0 4 5 0 0 0 4 0 14
6 7 4 0 0 0 2 5 3 4 4 0 0 22
0 1 5 4 0 4 4 0 0 0 4 0 0 21,
'df_1': movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
8 9 5 0 0 0 4 5 0 0 4 5 0 23
3 4 3 0 0 0 0 5 0 0 4 0 5 17
5 6 5 0 0 0 0 0 0 5 0 0 0 10,
'df_2': movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
4 5 3 0 0 0 0 0 0 0 0 0 0 3
2 3 4 0 0 0 0 0 0 0 0 0 0 4
1 2 3 0 0 3 0 0 0 0 0 0 0 6}
Then you can access any split DataFrame via df_dict[df_name].
You can use a dictionary, like this:
d = {"df_"+str(k):v for (k,v) in [(i,result[i]) for i in range(len(result))]}
I have a matrix that looks like this:
com 0 1 2 3 4 5
AAA 0 5 0 4 2 1 4
ABC 0 9 8 9 1 0 3
ADE 1 4 3 5 1 0 1
BCD 1 6 7 8 3 4 1
BCF 2 3 4 2 1 3 0 ...
Here AAA, ABC, ... are the dataframe index, and the dataframe columns are com, 0, 1, 2, 3, 4, 5.
I want to set a cell value to 0 when the row's com value equals the column number. So, for instance, the above matrix will look like:
com 0 1 2 3 4 5
AAA 0 0 0 4 2 1 4
ABC 0 0 8 9 1 0 3
ADE 1 4 0 5 1 0 1
BCD 1 6 0 8 3 4 1
BCF 2 3 4 0 1 3 0 ...
I tried to iterate over the rows using both .loc and .ix, but with no success.
This just requires a numpy trick:
In [22]:
print df
0 1 2 3 4 5
0 5 0 4 2 1 4
0 9 8 9 1 0 3
1 4 3 5 1 0 1
1 6 7 8 3 4 1
2 3 4 2 1 3 0
[5 rows x 6 columns]
In [23]:
# make a masking matrix: 0 where column and index values are equal, 1 elsewhere; a vectorized "if equal then 0 else 1"
print df*np.where(df.columns.values==df.index.values[..., np.newaxis], 0,1)
0 1 2 3 4 5
0 0 0 4 2 1 4
0 0 8 9 1 0 3
1 4 0 5 1 0 1
1 6 0 8 3 4 1
2 3 4 0 1 3 0
[5 rows x 6 columns]
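A self-contained version of that trick. One caveat worth noting: the broadcast comparison only matches where the index values and column labels share a dtype, integers in this example:
import numpy as np
import pandas as pd

df = pd.DataFrame([[5, 0, 4, 2, 1, 4],
                   [9, 8, 9, 1, 0, 3],
                   [4, 3, 5, 1, 0, 1],
                   [6, 7, 8, 3, 4, 1],
                   [3, 4, 2, 1, 3, 0]],
                  index=[0, 0, 1, 1, 2], columns=range(6))

# broadcast the index against the columns: 0 where they coincide, 1 elsewhere
mask = np.where(df.columns.values == df.index.values[..., np.newaxis], 0, 1)
print(df * mask)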
I think this should work.
for line in range(len(matrix)):
    matrix.iloc[line, matrix.iloc[line, 0] + 1] = 0
NOTE: depending on your matrix setup you may not need the +1.
Basically, it takes the com value (the first entry) of each row and uses it as the position of the cell to set to 0.
i.e. if the row was
c 0 1 2 3 4 5
AAA 4 3 2 3 9 5 9,
it would change the 5 below the number 4 to 0
c 0 1 2 3 4 5
AAA 4 3 2 3 9 0 9
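A hypothetical single-row run of that loop, with the matrix built from the example above and com as the first column:
import pandas as pd

matrix = pd.DataFrame([[4, 3, 2, 3, 9, 5, 9]],
                      index=['AAA'],
                      columns=['com', 0, 1, 2, 3, 4, 5])

for line in range(len(matrix)):
    # com sits at position 0, so +1 skips it and lands on the matching column
    matrix.iloc[line, matrix.iloc[line, 0] + 1] = 0

print(matrix)  # the 5 under column 4 becomes 0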