Convert data to a pivot table - python

I am trying to convert a data set with 100,000 rows and 3 columns into a pivot table. While the following code runs without an error, the values are displayed as NaN.
df1 = pd.pivot_table(df_TEST, values='actions', index=['sku'], columns=['user'])
It is not picking up the values (which range from 1 to 36) from the DataFrame. Has anyone come across this situation?

This can happen when you are doing a pivot, since not all index/column combinations might be present in the data. e.g.
In [10]: df_TEST
Out[10]:
   a  b  c
0  0  0  0
1  0  1  0
2  0  2  0
3  1  1  1
4  1  2  3
5  1  4  5
Now, when you do pivot on this,
In [9]: df_TEST.pivot_table(index='a', values='c', columns='b')
Out[9]:
b    0  1  2    4
a
0    0  0  0  NaN
1  NaN  1  3    5
Note that you get NaN at index 0, column 4, since there is no row in df_TEST with a = 0 and b = 4.
Typically you fill such values with zeros.
In [11]: df_TEST.pivot_table(index='a', values='c', columns='b').fillna(0)
Out[11]:
b  0  1  2  4
a
0  0  0  0  0
1  0  1  3  5
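pivot_table also accepts a fill_value argument, so the missing combinations can be filled in one step instead of a separate fillna call. A minimal sketch on the same toy data (the DataFrame is reconstructed here from the output above):
import pandas as pd

df_TEST = pd.DataFrame({'a': [0, 0, 0, 1, 1, 1],
                        'b': [0, 1, 2, 1, 2, 4],
                        'c': [0, 0, 0, 1, 3, 5]})

# fill_value replaces the NaNs produced by missing (a, b) combinations
print(df_TEST.pivot_table(index='a', values='c', columns='b', fill_value=0))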

Related

Getting 2 columns after mode(axis=1) on a DataFrame

I'm trying to find the most frequent value in each row of a DataFrame. I found the way to do that here, but I'm getting two columns instead of one column after doing that.
What do I want to do?
Let's say I have this DataFrame
In [88]: df
Out[88]:
   a  b  c
0  2  3  3
1  1  1  2
2  7  7  8
and I want this
In [89]: df.mode(axis=1)
Out[89]:
   0
0  3
1  1
2  7
I'm trying to apply this to my DataFrame, but it's not working properly. My DataFrame looks like:
In [45]: data.head()
   a  b  c  d  e  f
0  1  1  1  1  1  1
1  0  0  0  0  0  0
2  0  0  0  0  1  0
3  0  0  0  0  0  0
4  1  1  1  1  1  1
In [46]: data.shape
Out[46]: (5665, 6)
Getting this output
In [47]: data.mode(axis=1)
Out[47]:
     0   1
0  1.0 NaN
1  0.0 NaN
2  0.0 NaN
3  0.0 NaN
4  1.0 NaN
Note: If I apply mode to just a few rows, data.head().mode(axis=1), it works fine, but it doesn't work for the full DataFrame.
Is this what you are trying to do?
df['Mode'] = df.mode(axis='columns', numeric_only=True)[0]
df
A set of values can have more than one mode, e.g. the array [0, 0, 0, 1, 1, 1] has modes [0, 1] because both values appear equally often. In that case, df.mode will create a second column. If you only want one of the most common values for each row, you can simply take the first column of the output dataframe, as above.
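A short self-contained sketch of why the second column appears (toy data assumed here, with one deliberately multimodal row):
import pandas as pd

data = pd.DataFrame({'a': [1, 0, 0],
                     'b': [1, 0, 0],
                     'c': [1, 0, 0],
                     'd': [1, 0, 1],
                     'e': [1, 0, 1],
                     'f': [1, 0, 1]})

# mode(axis=1) returns one column per possible mode; rows with a
# single mode are padded with NaN in the extra columns
modes = data.mode(axis=1)
print(modes)    # row 2 has two modes (0 and 1), hence two columns

# keep exactly one most-frequent value per row
print(modes[0])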

Pandas: idempotent/force join between dataframes with column overlap

I am working in a notebook, so if I run:
df1 = df1.join(series2)
it works fine. However, if I run it again, I receive the following error:
ValueError: columns overlap but no suffix specified
because running the cell twice is equivalent to df1 = df1.join(series2).join(series2). Is there any way I can force an overwrite of the overlapping columns without creating an endless number of columns with the _y suffix?
Sample df1
index  a
0      0
0      1
1      2
1      3
2      4
2      5
Sample series2
index  b
0      1
1      2
2      3
Desired output from df1 = df1.join(series2)
index  a  b
0      0  1
0      1  1
1      2  2
1      3  2
2      4  3
2      5  3
Desired output from df1 = df1.join(series2); df1 = df1.join(series2)
# same as above because of forced overwrite on either the left or right join.
index  a  b
0      0  1
0      1  1
1      2  2
1      3  2
2      4  3
2      5  3
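One possible sketch (not from an accepted answer; it assumes series2 is a named Series, as in the sample): drop any would-be overlapping column before joining, so re-running the cell overwrites instead of raising:
import pandas as pd

df1 = pd.DataFrame({'a': [0, 1, 2, 3, 4, 5]}, index=[0, 0, 1, 1, 2, 2])
series2 = pd.Series([1, 2, 3], index=[0, 1, 2], name='b')

# errors='ignore' makes the drop a no-op on the first run, so this
# line is idempotent and can be re-executed in a notebook cell
df1 = df1.drop(columns=series2.name, errors='ignore').join(series2)
print(df1)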

Delete rows whose values are all zero in every column except one in a pandas DataFrame

I have a pandas DataFrame with 2 million rows × 10 columns.
I want to delete every row whose values are all zero, ignoring the single column (Time) that always has non-zero elements.
For example, my DataFrame looks like:
Index  Time  a  b  c  d  e
0      1     0  0  0  0  0
1      2     1  2  0  0  0
2      3     0  0  0  0  0
3      4     5  0  0  0  0
4      5     0  0  0  0  0
5      6     7  0  0  0  0
What I need:
Index  Time  a  b  c  d  e
0      2     1  2  0  0  0
1      4     5  0  0  0  0
2      6     7  0  0  0  0
My requirements:
Requirement 1:
Leaving out the 1st column (Time), it should check for zero elements in every row. If all of the remaining column values are zero, delete that particular row.
Requirement 2:
Finally, I want my index to be renumbered properly.
What I tried:
I have been looking at this link.
I understood the logic used, but I wasn't able to reproduce the result for my requirement.
I hope there is a simple method to do the operation...
Use iloc to select all columns except the first, compare for not-equal with ne, test for at least one True per row with any, filter by boolean indexing, and finally reset_index:
df = df[df.iloc[:, 1:].ne(0).any(axis=1)].reset_index(drop=True)
Alternative, removing the column Time instead:
df = df[df.drop('Time', axis=1).ne(0).any(axis=1)].reset_index(drop=True)
print (df)
   Time  a  b  c  d  e
0     2  1  2  0  0  0
1     4  5  0  0  0  0
2     6  7  0  0  0  0
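A self-contained run of that approach on the sample data above (column names taken from the question):
import pandas as pd

df = pd.DataFrame({'Time': [1, 2, 3, 4, 5, 6],
                   'a': [0, 1, 0, 5, 0, 7],
                   'b': [0, 2, 0, 0, 0, 0],
                   'c': [0, 0, 0, 0, 0, 0],
                   'd': [0, 0, 0, 0, 0, 0],
                   'e': [0, 0, 0, 0, 0, 0]})

# keep rows where at least one non-Time column is non-zero,
# then renumber the index
df = df[df.drop('Time', axis=1).ne(0).any(axis=1)].reset_index(drop=True)
print(df)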

return rows with unique pairs across columns

I'm trying to find rows that have unique pairs of values across 2 columns, so this dataframe:
A  B
1  0
2  0
3  0
0  1
2  1
3  1
0  2
1  2
3  2
0  3
1  3
2  3
will be reduced to only the rows that don't match up when flipped. For instance, 1 and 3 is a combination I only want returned once, so if the same pair exists with the columns flipped (3 and 1), it can be removed. The table I'm looking to get is:
A  B
0  2
0  3
1  0
1  2
1  3
2  3
where there is only one occurrence of each pair of values that mirror each other when the columns are flipped.
I think you can use apply with sorted + drop_duplicates:
df = df.apply(lambda r: pd.Series(sorted(r), index=r.index), axis=1).drop_duplicates()
print (df)
   A  B
0  0  1
1  0  2
2  0  3
4  1  2
5  1  3
8  2  3
Faster solution with numpy.sort:
df = pd.DataFrame(np.sort(df.values, axis=1), index=df.index,
                  columns=df.columns).drop_duplicates()
print (df)
   A  B
0  0  1
1  0  2
2  0  3
4  1  2
5  1  3
8  2  3
Solution without sorting with DataFrame.min and DataFrame.max:
a = df.min(axis=1)
b = df.max(axis=1)
df['A'] = a
df['B'] = b
df = df.drop_duplicates()
print (df)
   A  B
0  0  1
1  0  2
2  0  3
4  1  2
5  1  3
8  2  3
Loading the data:
import numpy as np
import pandas as pd
a = np.array("1 2 3 0 2 3 0 1 3 0 1 2".split(), dtype=np.double)
b = np.array("0 0 0 1 1 1 2 2 2 3 3 3".split(), dtype=np.double)
df = pd.DataFrame(dict(A=a, B=b))
In case you don't need to sort the entire DF:
df["trans"] = df.apply(
    lambda row: (min(row['A'], row['B']), max(row['A'], row['B'])), axis=1
)
df.drop_duplicates("trans")
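Another compact variant (a sketch, not from the answers above): build an order-insensitive, hashable key per row and filter on duplicated, which leaves the original A/B values untouched:
# frozenset({1, 3}) == frozenset({3, 1}), so flipped pairs share a key
key = df[['A', 'B']].apply(frozenset, axis=1)
print(df[~key.duplicated()])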

Python pandas nonzero cumsum

I want to apply cumsum to a DataFrame in pandas in Python, but without touching the zeros. Simply put, I want to leave the zeros as they are and do a cumsum over the rest of the DataFrame. Suppose I have a DataFrame like this:
import pandas as pd
df = pd.DataFrame({'a' : [1,2,0,1],
                   'b' : [2,5,0,0],
                   'c' : [0,1,2,5]})
   a  b  c
0  1  2  0
1  2  5  1
2  0  0  2
3  1  0  5
and the result should be
   a  b  c
0  1  2  0
1  3  7  1
2  0  0  3
3  4  0  8
Any ideas how to do that while avoiding loops? In R there is the ave function, but I'm very new to Python and I don't know the equivalent.
You can mask the df so that you only overwrite the non-zero cells:
In [173]:
df[df!=0] = df.cumsum()
df
Out[173]:
   a  b  c
0  1  2  0
1  3  7  1
2  0  0  3
3  4  0  8
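An equivalent non-mutating form (a sketch of the same idea): compute the cumulative sum once, then use where to put the zeros back wherever the original frame was zero:
# keep the cumulative sum where the original value was non-zero,
# otherwise restore 0
result = df.cumsum().where(df != 0, 0)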
