What does "col_level" do in the melt function? - python

From the documentation:
pd.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)
What does col_level do?
Examples with different values of col_level would be great.
My current dataframe is created by the following:
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
'B': {0: 1, 1: 3, 2: 5},
'C': {0: 2, 1: 4, 2: 6}})
df.columns = [list('ABC'), list('DEF'), list('GHI')]
Thanks.

You can check melt:
col_level : int or string, optional
If columns are a MultiIndex then use this level to melt.
And examples:
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
'B': {0: 1, 1: 3, 2: 5},
'C': {0: 2, 1: 4, 2: 6}})
# use MultiIndex.from_arrays to set the level names
df.columns = pd.MultiIndex.from_arrays([list('ABC'), list('DEF'), list('GHI')],
names=list('abc'))
print (df)
a A B C
b D E F
c G H I
0 a 1 2
1 b 3 4
2 c 5 6
#melt by first level of MultiIndex
print (df.melt(col_level=0))
a value
0 A a
1 A b
2 A c
3 B 1
4 B 3
5 B 5
6 C 2
7 C 4
8 C 6
#melt by level a of MultiIndex
print (df.melt(col_level='a'))
a value
0 A a
1 A b
2 A c
3 B 1
4 B 3
5 B 5
6 C 2
7 C 4
8 C 6
#melt by level c of MultiIndex
print (df.melt(col_level='c'))
c value
0 G a
1 G b
2 G c
3 H 1
4 H 3
5 H 5
6 I 2
7 I 4
8 I 6
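A few more variations may help, sketched here under the assumption of a reasonably recent pandas. The variable names (by_position, renamed, fallback) are only illustrative. The sketch shows an integer col_level other than 0, the var_name override, and what the variable column is called when the levels have no names.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': [1, 3, 5], 'C': [2, 4, 6]})
df.columns = pd.MultiIndex.from_arrays(
    [list('ABC'), list('DEF'), list('GHI')], names=list('abc'))

# An integer col_level counts levels from the top, so 1 selects level 'b'.
by_position = df.melt(col_level=1)

# var_name replaces the level name in the melted output.
renamed = df.melt(col_level='c', var_name='label')

# With unnamed levels (as in the question), the column is called 'variable'.
unnamed = df.copy()
unnamed.columns = pd.MultiIndex.from_arrays(
    [list('ABC'), list('DEF'), list('GHI')])
fallback = unnamed.melt(col_level=0)

print(by_position.head(3))
print(renamed.columns.tolist())
print(fallback.columns.tolist())
```

In every case only the chosen level survives in the result; the other two levels are dropped.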

Related

Python Pandas: How can I make labels for dropped data?

I used drop_duplicates() on the original data (subset = A and B), and I made labels for the deduplicated data.
Now I have to assign those labels back to the original data, but my approach takes too much time and is not efficient.
For example,
My original dataframe is as follows:
A B
1 1
1 1
2 2
2 3
5 3
6 4
5 4
5 4
after drop_duplicates():
A B
1 1
2 2
2 3
5 3
6 4
5 4
after labeling:
A B label
1 1 1
2 2 0
2 3 1
5 3 1
6 4 0
5 4 1
Following is my expected output:
A B label
1 1 1
1 1 1
2 2 0
2 3 1
5 3 1
6 4 0
5 4 1
5 4 1
My current code for achieving above result is as follows:
for i in range(len(origin_data)):
    check = False
    j = 0
    # scan dropped_data until the row with the same (A, B) pair is found
    while not check:
        if origin_data['A'].iloc[i] == dropped_data['A'].iloc[j] and origin_data['B'].iloc[i] == dropped_data['B'].iloc[j]:
            origin_data['label'].iloc[i] = dropped_data['label'].iloc[j]
            check = True
        j += 1
As this code is very slow, is there a more efficient way to do it?
You can merge the labeled dataset with the original one:
original.merge(labeled, how="left", on=["A", "B"])
result:
A B label
0 1 1 1
1 1 1 1
2 1 2 0
3 1 3 0
4 1 4 1
5 1 4 1
Full code:
import pandas as pd
original = pd.DataFrame(
{'A': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1},
'B': {0: 1, 1: 1, 2: 2, 3: 3, 4: 4, 5: 4}}
)
labeled = pd.DataFrame(
{'A': {0: 1, 1: 1, 2: 1, 3: 1},
'B': {0: 1, 1: 2, 2: 3, 3: 4},
'label': {0: 1, 1: 0, 2: 0, 3: 1}}
)
print(original.merge(labeled, how="left", on=["A", "B"]))
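If the problem is just mapping the 'B' labels to the original dataframe, you can use map (this assumes each 'B' value appears only once in dropped_data, since map needs a unique index to look values up):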
origin_data.B.map(dropped_data.set_index('B').label)
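As a sketch of the merge approach on the data from the question itself: the frame names origin_data and dropped_data follow the question, while the label values are copied from the question's "after labeling" table and the validate argument is an optional safety check added here.

```python
import pandas as pd

origin_data = pd.DataFrame({'A': [1, 1, 2, 2, 5, 6, 5, 5],
                            'B': [1, 1, 2, 3, 3, 4, 4, 4]})

# Deduplicate on (A, B) and attach the labels from the question.
dropped_data = origin_data.drop_duplicates(subset=['A', 'B']).copy()
dropped_data['label'] = [1, 0, 1, 1, 0, 1]

# A left merge keeps every original row, in order, and looks up its label.
# validate raises if dropped_data has repeated (A, B) pairs.
labeled = origin_data.merge(dropped_data, how='left', on=['A', 'B'],
                            validate='many_to_one')
print(labeled)
```

This replaces the nested Python loops with a single join on both key columns.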

python pandas dataframe: Creating new column with default value, when default value is an iterable

I have a pandas dataframe as below:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([[1,2],[3,4],[5,6]], columns=['a','b'])
In [3]: print(df)
Out [3]:
a b
0 1 2
1 3 4
2 5 6
Now I want to add a new column 'c' with a default value as a dictionary. The resulting dataframe should look like this:
a b c
0 1 2 {1: 2, 3: 4}
1 3 4 {1: 2, 3: 4}
2 5 6 {1: 2, 3: 4}
I tried the following:
df.at[:, 'c'] = {1: 2, 3: 4}
ValueError: Length of values does not match length of index
and
df['c'] = {1: 2, 3: 4}
ValueError: Must have equal len keys and value when setting with an iterable
This one works for me
df['c'] = df.apply(lambda x: {1:2, 3:4}, axis=1)
but looks like a dirty approach.
Is there a cleaner way to do this?
It is possible, but not recommended, to store dicts in a DataFrame column, because vectorized pandas functions cannot be used on them:
df['c'] = [{1: 2, 3: 4} for _ in range(len(df))]
print (df)
a b c
0 1 2 {1: 2, 3: 4}
1 3 4 {1: 2, 3: 4}
2 5 6 {1: 2, 3: 4}
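You have three rows in your DataFrame and only two elements in your dict. With a dict of matching length the assignment goes through, although the output below shows the column filled with the dict's keys rather than holding the dict itself: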
c = {0:1,1:1,2:2}
df['c'] = c
Output:
a b c
0 1 2 0
1 3 4 1
2 5 6 2
To have the same dictionary repeated along your dataframe you need to create a list of such dicts
c = {1:2,3:4}
c = [c]*3
df['c'] = c
Output
a b c
0 1 2 {1: 2, 3: 4}
1 3 4 {1: 2, 3: 4}
2 5 6 {1: 2, 3: 4}
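One thing to watch with the list-multiplication form: every row ends up pointing at the same dict object. A small sketch of the difference, where the names default, shared and independent are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
default = {1: 2, 3: 4}

# [default] * n repeats a reference to one and the same dict.
shared = df.copy()
shared['c'] = [default] * len(shared)

# dict(default) makes a separate shallow copy for each row.
independent = df.copy()
independent['c'] = [dict(default) for _ in range(len(independent))]

# Mutating the dict in one row of `shared` shows up in every row.
shared.at[0, 'c'][1] = 99
print(shared)
print(independent)
```

If the dicts are only ever read, the shared form is fine; if any row might be modified later, per-row copies are the safer choice.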

Pandas dataframe rearrangement stack to two value columns (for factorplots)

I have been trying to rearrange my dataframe to use it as input for a factorplot. The raw data would look like this:
A B C D
1 0 1 2 "T"
2 1 2 3 "F"
3 2 1 0 "F"
4 1 0 2 "T"
...
My question is how can I rearrange it into this form:
col val val2
1 A 0 "T"
1 B 1 "T"
1 C 2 "T"
2 A 1 "F"
...
I was trying:
df = DF.cumsum(axis=0).stack().reset_index(name="val")
However, this produces only one value column, not two. Thanks for your support.
I would use melt, and then you can sort the result however you like:
pd.melt(df.reset_index(),id_vars=['index','D'], value_vars=['A','B','C']).sort_values(by='index')
Out[40]:
index D variable value
0 1 T A 0
4 1 T B 1
8 1 T C 2
1 2 F A 1
5 2 F B 2
9 2 F C 3
2 3 F A 2
6 3 F B 1
10 3 F C 0
3 4 T A 1
7 4 T B 0
11 4 T C 2
Then you can rename the columns of the melted result to match the requested layout (here df is the melted frame from above):
df.set_index('index').rename(columns={'variable': 'col', 'value': 'val', 'D': 'val2'})
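Putting the melt steps together, here is one possible end-to-end sketch. It assumes the four sample rows from the question; the name tidy is made up, and var_name/value_name are used so that only D needs renaming afterwards.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2, 'T'],
                   [1, 2, 3, 'F'],
                   [2, 1, 0, 'F'],
                   [1, 0, 2, 'T']],
                  index=[1, 2, 3, 4], columns=list('ABCD'))

tidy = (df.reset_index()
          # name the melted columns directly instead of renaming later
          .melt(id_vars=['index', 'D'], value_vars=['A', 'B', 'C'],
                var_name='col', value_name='val')
          .rename(columns={'D': 'val2'})
          # sort on both keys so the row order is deterministic
          .sort_values(['index', 'col'])
          .set_index('index')
          .rename_axis(None)
          [['col', 'val', 'val2']])
print(tidy)
```

The result has one row per original cell of A, B and C, with the D flag repeated alongside.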
consider your dataframe df
df = pd.DataFrame([
[0, 1, 2, 'T'],
[1, 2, 3, 'F'],
[2, 1, 0, 'F'],
[1, 0, 2, 'T'],
], [1, 2, 3, 4], list('ABCD'))
solution
df.set_index('D', append=True) \
.rename_axis('col', axis=1) \
.rename_axis([None, 'val2']) \
.stack().to_frame('val') \
.reset_index(['col', 'val2']) \
[['col', 'val', 'val2']]

pandas DataFrame split the column and extend the rows

I have a DataFrame like:
A B C D
1 1 2 3 ['a','b']
2 4 6 7 ['b','c']
3 1 0 1 ['a']
4 2 1 1 ['b']
5 1 2 3 []
and I want to turn it into:
A B C D
1 1 2 3 ['a']
2 1 2 3 ['b']
3 4 6 7 ['b']
4 4 6 7 ['c']
5 1 0 1 ['a']
6 2 1 1 ['b']
7 1 2 3 []
P.S. Each list in column "D" should be split into single-element lists, with the rest of the row repeated for each element.
I would like to do this with pandas.
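One way would be to use a list comprehension with a doubly nested for-loop (on Python 3, build the one-element lists eagerly, because a map object is always truthy and the `or [[]]` fallback would never fire):
>>> [(key + (item,))
for key, val in df.set_index(['A','B','C'])['D'].items()
for item in ([[v] for v in val] or [[]])]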
# [(1, 2, 3, ['a']),
# (1, 2, 3, ['b']),
# (4, 6, 7, ['b']),
# (4, 6, 7, ['c']),
# (1, 0, 1, ['a']),
# (2, 1, 1, ['b']),
# (1, 2, 3, [])]
Passing the data in this form to pd.DataFrame produces the desired result:
import pandas as pd
df = pd.DataFrame({'A': {1: 1, 2: 4, 3: 1, 4: 2, 5: 1},
'B': {1: 2, 2: 6, 3: 0, 4: 1, 5: 2},
'C': {1: 3, 2: 7, 3: 1, 4: 1, 5: 3},
'D': {1: ['a', 'b'], 2: ['b', 'c'], 3: ['a'], 4: ['b'], 5: []}})
result = pd.DataFrame(
[(key + (item,))
for key, val in df.set_index(['A','B','C'])['D'].items()
for item in ([[v] for v in val] or [[]])])
yields
0 1 2 3
0 1 2 3 [a]
1 1 2 3 [b]
2 4 6 7 [b]
3 4 6 7 [c]
4 1 0 1 [a]
5 2 1 1 [b]
6 1 2 3 []
Another option is to use df['D'].apply to expand the items in the list into different columns, and then use stack to expand the rows:
df = pd.DataFrame({'A': {1: 1, 2: 4, 3: 1, 4: 2, 5: 1},
'B': {1: 2, 2: 6, 3: 0, 4: 1, 5: 2},
'C': {1: 3, 2: 7, 3: 1, 4: 1, 5: 3},
'D': {1: ['a', 'b'], 2: ['b', 'c'], 3: ['a'], 4: ['b'], 5: []}})
df = df.set_index(['A', 'B', 'C'])
result = df['D'].apply(lambda x: pd.Series([[v] for v in x] if x else [[]]))
# 0 1
# A B C
# 1 2 3 [a] [b]
# 4 6 7 [b] [c]
# 1 0 1 [a] NaN
# 2 1 1 [b] NaN
# 1 2 3 [] NaN
result = result.stack()
# A B C
# 1 2 3 0 [a]
# 1 [b]
# 4 6 7 0 [b]
# 1 [c]
# 1 0 1 0 [a]
# 2 1 1 0 [b]
# 1 2 3 0 []
# dtype: object
result.index = result.index.droplevel(-1)
result = result.reset_index()
# A B C 0
# 0 1 2 3 [a]
# 1 1 2 3 [b]
# 2 4 6 7 [b]
# 3 4 6 7 [c]
# 4 1 0 1 [a]
# 5 2 1 1 [b]
# 6 1 2 3 []
Although this does not use explicit for-loops or a list comprehension, there is an implicit for-loop hidden in the call to apply. In fact, it is much slower than using a list comprehension:
In [170]: df = pd.concat([df]*10)
In [171]: %%timeit
.....: result = df['D'].apply(lambda x: pd.Series([[v] for v in x] if x else [[]]))
result = result.stack()
result.index = result.index.droplevel(-1)
result = result.reset_index()
100 loops, best of 3: 11.5 ms per loop
In [172]: %%timeit
.....: result = pd.DataFrame(
[(key + (item,))
for key, val in df['D'].items()
for item in ([[v] for v in val] or [[]])])
1000 loops, best of 3: 618 µs per loop
Assuming your column D content is of type string:
print(type(df.loc[0, 'D']))
<class 'str'>
df = df.set_index(['A', 'B', 'C']).sort_index()
df.loc[:, 'D'] = df.loc[:, 'D'].str.strip('[]')
df = df.loc[:, 'D'].str.split(',', expand=True).stack()
df = df.str.strip().apply(lambda x: '[{}]'.format(x)).reset_index().drop('level_3', axis=1).rename(columns={0: 'D'})
A B C D
0 1 0 1 ['a']
1 1 2 3 ['a']
2 1 2 3 ['b']
3 1 2 3 []
4 2 1 1 ['b']
5 4 6 7 ['b']
6 4 6 7 ['c']
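On newer pandas versions (0.25 and later), DataFrame.explode offers another route when column D holds real lists. The following is only a sketch under that assumption; the name exploded is made up. explode turns an empty list into NaN, so the last step wraps each value back into a list and maps NaN back to [].

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 1, 2, 1],
                   'B': [2, 6, 0, 1, 2],
                   'C': [3, 7, 1, 1, 3],
                   'D': [['a', 'b'], ['b', 'c'], ['a'], ['b'], []]})

# One output row per list element; the other columns are repeated.
exploded = df.explode('D').reset_index(drop=True)

# Re-wrap each element as a one-item list; NaN marks an originally empty list.
# Caveat: a genuine NaN inside a list would also be turned into [].
exploded['D'] = exploded['D'].apply(lambda v: [] if pd.isna(v) else [v])
print(exploded)
```

Unlike the string-based answer, this keeps the original row order.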

Use a different row as labels in pandas after read

I need to use the third row as the labels for a dataframe, but keep the first two rows for other uses. How can you change the labels on an existing dataframe to an existing row?
So basically this dataframe
A B C D
1 2 3 4
5 7 8 9
a b c d
6 4 2 1
becomes
a b c d
6 4 2 1
And I cannot just set the headers when the file is read in because I need the first two rows and labels for some processing
One way would be just to take a slice and then overwrite the columns:
In [71]:
df1 = df.loc[3:]
df1.columns = df.loc[2].values
df1
Out[71]:
a b c d
3 6 4 2 1
You can then assign back to df a slice of the rows of interest:
In [73]:
df = df[:2]
df
Out[73]:
A B C D
0 1 2 3 4
1 5 7 8 9
First copy the first two rows into a new DataFrame. Then rename the columns using the data contained in the third row (index label 2). Finally, delete the first three rows of data.
import pandas as pd
df = pd.DataFrame({'A': {0: '1', 1: '5', 2: 'a', 3: '6'},
'B': {0: '2', 1: '7', 2: 'b', 3: '4'},
'C': {0: '3', 1: '8', 2: 'c', 3: '2'},
'D': {0: '4', 1: '9', 2: 'd', 3: '1'}})
df2 = df.loc[:1, :].copy()
df.columns = [c for c in df.loc[2, :]]
df.drop(df.index[:3], inplace=True)
>>> df
a b c d
3 6 4 2 1
>>> df2
A B C D
0 1 2 3 4
1 5 7 8 9
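The same steps can be wrapped in a small helper that leaves the original frame untouched. This is just a sketch: split_at_header_row is a hypothetical name, not a pandas function, and it assumes the index labels are unique.

```python
import pandas as pd

def split_at_header_row(df, header_label):
    """Return (rows before the header row, rows after it relabeled by it)."""
    pos = df.index.get_loc(header_label)   # integer position; needs a unique index
    head = df.iloc[:pos].copy()            # rows kept for other processing
    body = df.iloc[pos + 1:].copy()        # data rows below the header row
    body.columns = df.iloc[pos].tolist()   # promote the row's values to labels
    return head, body

df = pd.DataFrame({'A': ['1', '5', 'a', '6'],
                   'B': ['2', '7', 'b', '4'],
                   'C': ['3', '8', 'c', '2'],
                   'D': ['4', '9', 'd', '1']})

head, body = split_at_header_row(df, 2)
print(head)
print(body)
```

Because both pieces are copies, df itself still has its original columns and all four rows afterwards.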
