What does "col_level" do in the melt function?

From the documentation:
pd.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)
What does col_level do?
Examples with different values of col_level would be great.
My current dataframe is created by the following:
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
'B': {0: 1, 1: 3, 2: 5},
'C': {0: 2, 1: 4, 2: 6}})
df.columns = [list('ABC'), list('DEF'), list('GHI')]
Thanks.

From the melt documentation:
col_level : int or string, optional
If columns are a MultiIndex then use this level to melt.
And examples:
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
'B': {0: 1, 1: 3, 2: 5},
'C': {0: 2, 1: 4, 2: 6}})
# use MultiIndex.from_arrays to set the level names
df.columns = pd.MultiIndex.from_arrays([list('ABC'), list('DEF'), list('GHI')],
names=list('abc'))
print (df)
a A B C
b D E F
c G H I
0 a 1 2
1 b 3 4
2 c 5 6
#melt by first level of MultiIndex
print (df.melt(col_level=0))
a value
0 A a
1 A b
2 A c
3 B 1
4 B 3
5 B 5
6 C 2
7 C 4
8 C 6
#melt by level a of MultiIndex
print (df.melt(col_level='a'))
a value
0 A a
1 A b
2 A c
3 B 1
4 B 3
5 B 5
6 C 2
7 C 4
8 C 6
#melt by level c of MultiIndex
print (df.melt(col_level='c'))
c value
0 G a
1 G b
2 G c
3 H 1
4 H 3
5 H 5
6 I 2
7 I 4
8 I 6
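As a further illustration, col_level can be combined with id_vars. The sketch below is not part of the original answer; it rebuilds the same frame and melts on the middle level (position 1, named 'b'), keeping column 'D' as an identifier. The variable name melted is only illustrative.

```python
import pandas as pd

# Same data as above, with named MultiIndex levels a, b, c.
df = pd.DataFrame({"A": ["a", "b", "c"], "B": [1, 3, 5], "C": [2, 4, 6]})
df.columns = pd.MultiIndex.from_arrays(
    [list("ABC"), list("DEF"), list("GHI")], names=list("abc")
)

# col_level=1 selects level 'b' (labels D, E, F) before melting;
# id_vars then refers to labels of that level.
melted = df.melt(col_level=1, id_vars=["D"])
print(melted)
```

The variable column should take its name from the selected level ('b' here), which is why naming the levels with MultiIndex.from_arrays is convenient.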

Related

Python Pandas: How can I make labels for dropped data?

I used drop_duplicates() on the original data (subset = A and B) and made labels for the deduplicated data.
Now I have to make labels for the original data, but my approach takes too much time and is not efficient.
For example,
My original dataframe is as follows:
A B
1 1
1 1
2 2
2 3
5 3
6 4
5 4
5 4
after drop_duplicates():
A B
1 1
2 2
2 3
5 3
6 4
5 4
after labeling:
A B label
1 1 1
2 2 0
2 3 1
5 3 1
6 4 0
5 4 1
Following is my expected output:
A B label
1 1 1
1 1 1
2 2 0
2 3 1
5 3 1
6 4 0
5 4 1
5 4 1
My current code for achieving above result is as follows:
for i in range(len(origin_data)):
    check = False
    j = 0
    # scan the deduplicated rows until the matching (A, B) pair is found
    while not check:
        if origin_data['A'].iloc[i] == dropped_data['A'].iloc[j] and origin_data['B'].iloc[i] == dropped_data['B'].iloc[j]:
            origin_data['label'].iloc[i] = dropped_data['label'].iloc[j]
            check = True
        j += 1
As my code takes a long time, is there any way I can do this more efficiently?

You can merge the labeled dataset with the original one:
original.merge(labeled, how="left", on=["A", "B"])
result:
A B label
0 1 1 1
1 1 1 1
2 1 2 0
3 1 3 0
4 1 4 1
5 1 4 1
Full code:
import pandas as pd
original = pd.DataFrame(
{'A': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1},
'B': {0: 1, 1: 1, 2: 2, 3: 3, 4: 4, 5: 4}}
)
labeled = pd.DataFrame(
{'A': {0: 1, 1: 1, 2: 1, 3: 1},
'B': {0: 1, 1: 2, 2: 3, 3: 4},
'label': {0: 1, 1: 0, 2: 0, 3: 1}}
)
print(original.merge(labeled, how="left", on=["A", "B"]))
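For reference, here is the same merge applied to the data from the question. This is a minimal sketch: the names original, labeled, and result are illustrative, and the labels are copied from the question's "after labeling" table.

```python
import pandas as pd

# Original data from the question (with duplicate (A, B) rows).
original = pd.DataFrame({"A": [1, 1, 2, 2, 5, 6, 5, 5],
                         "B": [1, 1, 2, 3, 3, 4, 4, 4]})

# Deduplicated rows plus the labels given in the question.
labeled = original.drop_duplicates(subset=["A", "B"]).copy()
labeled["label"] = [1, 0, 1, 1, 0, 1]

# A left merge keeps every original row and looks up its label by (A, B).
# validate="many_to_one" raises if (A, B) is not unique in `labeled`.
result = original.merge(labeled, how="left", on=["A", "B"],
                        validate="many_to_one")
print(result)
```

The validate argument is optional; it is included here only as a guard against accidentally duplicated keys in the labeled frame.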

If the label depends only on 'B' and each 'B' value appears exactly once in dropped_data, you can use map:
origin_data.B.map(dropped_data.set_index('B').label)

python pandas dataframe: Creating new column with default value, when default value is an iterable

I have a pandas dataframe as below:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([[1,2],[3,4],[5,6]], columns=['a','b'])
In [3]: print(df)
Out [3]:
a b
0 1 2
1 3 4
2 5 6
Now I want to add a new column 'c' with a default value as a dictionary. The resulting dataframe should look like this:
a b c
0 1 2 {1: 2, 3: 4}
1 3 4 {1: 2, 3: 4}
2 5 6 {1: 2, 3: 4}
I tried the following:
df.at[:, 'c'] = {1: 2, 3: 4}
ValueError: Length of values does not match length of index
and
df['c'] = {1: 2, 3: 4}
ValueError: Must have equal len keys and value when setting with an iterable
This one works for me
df['c'] = df.apply(lambda x: {1:2, 3:4}, axis=1)
but looks like a dirty approach.
Is there a cleaner way to do this?

It is possible, but not recommended, to store dicts in a DataFrame column, because vectorized pandas functions cannot operate on them:
df['c'] = [{1: 2, 3: 4} for _ in range(len(df))]
print (df)
a b c
0 1 2 {1: 2, 3: 4}
1 3 4 {1: 2, 3: 4}
2 5 6 {1: 2, 3: 4}

You have three rows in your DF and only two elements in your dict, do:
c = {0:1,1:1,2:2}
df['c'] = c
Output:
a b c
0 1 2 0
1 3 4 1
2 5 6 2
To have the same dictionary repeated along your dataframe you need to create a list of such dicts
c = {1:2,3:4}
c = [c]*3
df['c'] = c
Output
a b c
0 1 2 {1: 2, 3: 4}
1 3 4 {1: 2, 3: 4}
2 5 6 {1: 2, 3: 4}
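One caveat with the list-multiplication approach: [c]*3 repeats a reference to the same dict object, so changing it through one row changes it for all rows. A minimal sketch of the difference, assuming independent per-row dicts are wanted (the names default and shared are illustrative):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["a", "b"])
default = {1: 2, 3: 4}

# [default] * n holds n references to one and the same dict.
shared = [default] * len(df)

# Building a fresh dict per row keeps the rows independent.
df["c"] = [dict(default) for _ in range(len(df))]

# Mutating the dict in row 0 leaves the other rows untouched.
df["c"].iloc[0][1] = 99
print(df)
```

If the dicts are never mutated, the shared version is harmless and slightly cheaper.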

Pandas dataframe rearrangement stack to two value columns (for factorplots)

I have been trying to rearrange my dataframe to use it as input for a factorplot. The raw data would look like this:
A B C D
1 0 1 2 "T"
2 1 2 3 "F"
3 2 1 0 "F"
4 1 0 2 "T"
...
My question is how can I rearrange it into this form:
col val val2
1 A 0 "T"
1 B 1 "T"
1 C 2 "T"
2 A 1 "F"
...
I was trying:
df = DF.cumsum(axis=0).stack().reset_index(name="val")
However, this produces only one value column, not two. Thanks for your support.

I would use melt; you can then sort the result however you like:
pd.melt(df.reset_index(),id_vars=['index','D'], value_vars=['A','B','C']).sort_values(by='index')
Out[40]:
index D variable value
0 1 T A 0
4 1 T B 1
8 1 T C 2
1 2 F A 1
5 2 F B 2
9 2 F C 3
2 3 F A 2
6 3 F B 1
10 3 F C 0
3 4 T A 1
7 4 T B 0
11 4 T C 2
Then you can rename the columns as you like, assuming the melted result above was assigned to melted:
melted.set_index('index').rename(columns={'variable': 'col', 'value': 'val', 'D': 'val2'})
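Putting the steps together, the whole reshaping can be sketched in one chain. This is an illustrative version, not the answerer's exact code: it names the output columns directly through melt's var_name and value_name, and long_df is a made-up name.

```python
import pandas as pd

# Raw data from the question, indexed 1..4.
df = pd.DataFrame({"A": [0, 1, 2, 1],
                   "B": [1, 2, 1, 0],
                   "C": [2, 3, 0, 2],
                   "D": ["T", "F", "F", "T"]},
                  index=[1, 2, 3, 4])

long_df = (
    df.reset_index()                      # unnamed index becomes column 'index'
      .melt(id_vars=["index", "D"], value_vars=["A", "B", "C"],
            var_name="col", value_name="val")
      .rename(columns={"D": "val2"})
      .sort_values(["index", "col"])      # group rows by original index
      .set_index("index")[["col", "val", "val2"]]
)
print(long_df)
```

Passing var_name and value_name to melt avoids a separate rename for those two columns.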

Consider your dataframe df:
df = pd.DataFrame([
[0, 1, 2, 'T'],
[1, 2, 3, 'F'],
[2, 1, 0, 'F'],
[1, 0, 2, 'T'],
], [1, 2, 3, 4], list('ABCD'))
Solution:
df.set_index('D', append=True) \
.rename_axis(['col'], axis=1) \
.rename_axis([None, 'val2']) \
.stack().to_frame('val') \
.reset_index(['col', 'val2']) \
[['col', 'val', 'val2']]

pandas DataFrame split the column and extend the rows

I want to turn a DataFrame like:
A B C D
1 1 2 3 ['a','b']
2 4 6 7 ['b','c']
3 1 0 1 ['a']
4 2 1 1 ['b']
5 1 2 3 []
to:
A B C D
1 1 2 3 ['a']
2 1 2 3 ['b']
3 4 6 7 ['b']
4 4 6 7 ['c']
5 1 0 1 ['a']
6 2 1 1 ['b']
7 1 2 3 []
P.S. I want to split each list in column "D" and extend the rows accordingly, using pandas DataFrame operations.

One way would be to use a list comprehension with a doubly nested for-loop:
>>> [(key + (item,))
for key, val in df.set_index(['A','B','C'])['D'].items()
for item in ([[x] for x in val] or [[]])]
# [(1, 2, 3, ['a']),
# (1, 2, 3, ['b']),
# (4, 6, 7, ['b']),
# (4, 6, 7, ['c']),
# (1, 0, 1, ['a']),
# (2, 1, 1, ['b']),
# (1, 2, 3, [])]
Passing the data in this form to pd.DataFrame produces the desired result:
import pandas as pd
df = pd.DataFrame({'A': {1: 1, 2: 4, 3: 1, 4: 2, 5: 1},
'B': {1: 2, 2: 6, 3: 0, 4: 1, 5: 2},
'C': {1: 3, 2: 7, 3: 1, 4: 1, 5: 3},
'D': {1: ['a', 'b'], 2: ['b', 'c'], 3: ['a'], 4: ['b'], 5: []}})
result = pd.DataFrame(
[(key + (item,))
for key, val in df.set_index(['A','B','C'])['D'].items()
for item in ([[x] for x in val] or [[]])])
yields
0 1 2 3
0 1 2 3 [a]
1 1 2 3 [b]
2 4 6 7 [b]
3 4 6 7 [c]
4 1 0 1 [a]
5 2 1 1 [b]
6 1 2 3 []
Another option is to use df['D'].apply to expand the items in the list into different columns, and then use stack to expand the rows:
df = pd.DataFrame({'A': {1: 1, 2: 4, 3: 1, 4: 2, 5: 1},
'B': {1: 2, 2: 6, 3: 0, 4: 1, 5: 2},
'C': {1: 3, 2: 7, 3: 1, 4: 1, 5: 3},
'D': {1: ['a', 'b'], 2: ['b', 'c'], 3: ['a'], 4: ['b'], 5: []}})
df = df.set_index(['A', 'B', 'C'])
result = df['D'].apply(lambda x: pd.Series([[item] for item in x] if x else [[]]))
# 0 1
# A B C
# 1 2 3 [a] [b]
# 4 6 7 [b] [c]
# 1 0 1 [a] NaN
# 2 1 1 [b] NaN
# 1 2 3 [] NaN
result = result.stack()
# A B C
# 1 2 3 0 [a]
# 1 [b]
# 4 6 7 0 [b]
# 1 [c]
# 1 0 1 0 [a]
# 2 1 1 0 [b]
# 1 2 3 0 []
# dtype: object
result.index = result.index.droplevel(-1)
result = result.reset_index()
# A B C 0
# 0 1 2 3 [a]
# 1 1 2 3 [b]
# 2 4 6 7 [b]
# 3 4 6 7 [c]
# 4 1 0 1 [a]
# 5 2 1 1 [b]
# 6 1 2 3 []
Although this does not use explicit for-loops or a list comprehension, there is an implicit for-loop hidden in the call to apply. In fact, it is much slower than using a list comprehension:
In [170]: df = pd.concat([df]*10)
In [171]: %%timeit
.....: result = df['D'].apply(lambda x: pd.Series(map(list, x) if x else [[]]))
result = result.stack()
result.index = result.index.droplevel(-1)
result = result.reset_index()
100 loops, best of 3: 11.5 ms per loop
In [172]: %%timeit
.....: result = pd.DataFrame(
[(key + (item,))
for key, val in df['D'].iteritems()
for item in map(list, val) or [[]]])
1000 loops, best of 3: 618 µs per loop
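On newer pandas versions (0.25 and later), DataFrame.explode performs this kind of row expansion directly. The sketch below is an alternative to the approaches above, not part of the original answer. Note that it yields scalars ('a') rather than one-element lists (['a']), and an empty list becomes NaN; the ignore_index argument assumes pandas 1.1 or later.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 4, 1, 2, 1],
                   "B": [2, 6, 0, 1, 2],
                   "C": [3, 7, 1, 1, 3],
                   "D": [["a", "b"], ["b", "c"], ["a"], ["b"], []]},
                  index=[1, 2, 3, 4, 5])

# One output row per list element; the other columns are repeated.
exploded = df.explode("D", ignore_index=True)
print(exploded)
```

If the one-element-list form is really needed, the scalars can be wrapped back into lists afterwards, at the cost of another Python-level pass.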

Assuming your column D content is of type string:
print(type(df.loc[0, 'D']))
<class 'str'>
df = df.set_index(['A', 'B', 'C']).sort_index()
df.loc[:, 'D'] = df.loc[:, 'D'].str.strip('[').str.strip(']')
df = df.loc[:, 'D'].str.split(',', expand=True).stack()
df = df.str.strip().apply(lambda x: '[{}]'.format(x)).reset_index().drop('level_3', axis=1).rename(columns={0: 'D'})
A B C D
0 1 0 1 ['a']
1 1 2 3 ['a']
2 1 2 3 ['b']
3 1 2 3 []
4 2 1 1 ['b']
5 4 6 7 ['b']
6 4 6 7 ['c']

Use a different row as labels in pandas after read

I need to use the third row as the labels for a dataframe, but keep the first two rows for other uses. How can you change the labels on an existing dataframe to an existing row?
So basically this dataframe
A B C D
1 2 3 4
5 7 8 9
a b c d
6 4 2 1
becomes
a b c d
6 4 2 1
I cannot just set the headers when the file is read in, because I need the first two rows and the original labels for some processing.

One way would be just to take a slice and then overwrite the columns:
In [71]:
df1 = df.loc[3:]
df1.columns = df.loc[2].values
df1
Out[71]:
a b c d
3 6 4 2 1
You can then assign back to df a slice of the rows of interest:
In [73]:
df = df[:2]
df
Out[73]:
A B C D
0 1 2 3 4
1 5 7 8 9

First copy the first two rows into a new DataFrame. Then rename the columns using the data contained in the third row (index label 2). Finally, delete the first three rows of data.
import pandas as pd
df = pd.DataFrame({'A': {0: '1', 1: '5', 2: 'a', 3: '6'},
'B': {0: '2', 1: '7', 2: 'b', 3: '4'},
'C': {0: '3', 1: '8', 2: 'c', 3: '2'},
'D': {0: '4', 1: '9', 2: 'd', 3: '1'}})
df2 = df.loc[:1, :].copy()
df.columns = [c for c in df.loc[2, :]]
df.drop(df.index[:3], inplace=True)
>>> df
a b c d
3 6 4 2 1
>>> df2
A B C D
0 1 2 3 4
1 5 7 8 9
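The same idea can be written without modifying df in place, using positional slicing so that it does not depend on the index labels. This is a minimal sketch; header_pos, top, and body are illustrative names, not part of either answer.

```python
import pandas as pd

df = pd.DataFrame([["1", "2", "3", "4"],
                   ["5", "7", "8", "9"],
                   ["a", "b", "c", "d"],
                   ["6", "4", "2", "1"]],
                  columns=list("ABCD"))

header_pos = 2  # position of the row to promote to column labels

# Rows before the header row keep the original labels.
top = df.iloc[:header_pos].copy()

# Rows after it get the header row's values as their column labels.
body = df.iloc[header_pos + 1:].copy()
body.columns = df.iloc[header_pos].tolist()

print(top)
print(body)
```

Taking copies avoids chained-assignment warnings if either piece is modified later.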
