Column order in pandas.concat - python

I do as below:
data1 = pd.DataFrame({ 'b' : [1, 1, 1], 'a' : [2, 2, 2]})
data2 = pd.DataFrame({ 'b' : [1, 1, 1], 'a' : [2, 2, 2]})
frames = [data1, data2]
data = pd.concat(frames)
data
a b
0 2 1
1 2 1
2 2 1
0 2 1
1 2 1
2 2 1
The columns of data end up in alphabetical order. Why is that, and how can I keep the original order?

You are creating DataFrames out of dictionaries. In Python versions before 3.7, dictionaries are unordered, which means the keys have no guaranteed order. So
d1 = {'key_a': 'val_a', 'key_b': 'val_b'}
and
d2 = {'key_b': 'val_b', 'key_a': 'val_a'}
are (probably) the same.
In addition to that, I assume that pandas sorts the dictionary's keys alphabetically by default (unfortunately I did not find any hint in the docs to prove that assumption), leading to the behavior you encountered.
So the basic approach would be to reorder the columns of your DataFrame. You can do this as follows:
import pandas as pd
data1 = pd.DataFrame({ 'b' : [1, 1, 1], 'a' : [2, 2, 2]})
data2 = pd.DataFrame({ 'b' : [1, 1, 1], 'a' : [2, 2, 2]})
frames = [data1, data2]
data = pd.concat(frames)
print(data)
cols = ['b', 'a']   # desired column order
data = data[cols]   # reorder by selecting columns in that order
print(data)
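After the reordering, the second print should show the columns back in the original order:
b a
0 1 2
1 1 2
2 1 2
0 1 2
1 1 2
2 1 2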

Starting from version 0.23.0, you can prevent the concat() method from sorting the returned DataFrame. For example:
df1 = pd.DataFrame({ 'a' : [1, 1, 1], 'b' : [2, 2, 2]})
df2 = pd.DataFrame({ 'b' : [1, 1, 1], 'a' : [2, 2, 2]})
df = pd.concat([df1, df2], sort=False)
A future version of pandas will change to not sort by default (this has since happened: pandas 1.0 and later default to sort=False).
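A quick check of what sort=False preserves, assuming Python 3.7+ so that df1's insertion order ['a', 'b'] survives the dict:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 1, 1], 'b': [2, 2, 2]})
df2 = pd.DataFrame({'b': [1, 1, 1], 'a': [2, 2, 2]})

# sort=False keeps the column order of the first frame and appends
# any columns that only appear in later frames
df = pd.concat([df1, df2], sort=False)
print(df.columns.tolist())  # ['a', 'b']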

def concat_ordered_columns(frames):
    columns_ordered = []
    for frame in frames:
        columns_ordered.extend(x for x in frame.columns if x not in columns_ordered)
    final_df = pd.concat(frames)
    return final_df[columns_ordered]

# Usage
dfs = [df_a, df_b, df_c]
full_df = concat_ordered_columns(dfs)
This should work.
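For example, with three small frames (the names df_a, df_b, df_c here are placeholders), the columns come out in first-seen order:
import pandas as pd

df_a = pd.DataFrame({'b': [1], 'a': [2]})
df_b = pd.DataFrame({'a': [3], 'c': [4]})
df_c = pd.DataFrame({'c': [5], 'd': [6]})

full_df = concat_ordered_columns([df_a, df_b, df_c])
print(full_df.columns.tolist())  # ['b', 'a', 'c', 'd'] on Python 3.7+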

You can create the original DataFrames with OrderedDicts:
from collections import OrderedDict
odict = OrderedDict()
odict['b'] = [1, 1, 1]
odict['a'] = [2, 2, 2]
data1 = pd.DataFrame(odict)
data2 = pd.DataFrame(odict)
frames = [data1, data2]
data = pd.concat(frames)
data
b a
0 1 2
1 1 2
2 1 2
0 1 2
1 1 2
2 1 2
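Note that on Python 3.7+ with pandas 0.23+, plain dicts preserve insertion order, so the OrderedDict is no longer strictly necessary:
import pandas as pd

data1 = pd.DataFrame({'b': [1, 1, 1], 'a': [2, 2, 2]})  # keys keep insertion order
print(data1.columns.tolist())  # ['b', 'a'] on modern Python/pandas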

You can also specify the order like this:
import pandas as pd
data1 = pd.DataFrame({ 'b' : [1, 1, 1], 'a' : [2, 2, 2]})
data2 = pd.DataFrame({ 'b' : [1, 1, 1], 'a' : [2, 2, 2]})
listdf = [data1, data2]
data = pd.concat(listdf)
sequence = ['b','a']
data = data.reindex(columns=sequence)

The simplest way is to first put the columns in the same order, then concat:
df2=df2[df1.columns]
df=pd.concat((df1,df2),axis=0)
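Note this assumes both frames have exactly the same set of columns; df2[df1.columns] raises a KeyError otherwise. With the frames from the question:
import pandas as pd

df1 = pd.DataFrame({'b': [1, 1, 1], 'a': [2, 2, 2]})
df2 = pd.DataFrame({'a': [2, 2, 2], 'b': [1, 1, 1]})

df2 = df2[df1.columns]              # align df2's columns to df1's order
df = pd.concat((df1, df2), axis=0)
print(df.columns.tolist())          # ['b', 'a']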

Related

How to generate numeric mapping for categorical columns in pandas?

I want to manipulate categorical data using a pandas DataFrame and then convert it to a numpy array for model training.
Say I have the following data frame in pandas.
import pandas as pd
df2 = pd.DataFrame({"c1": ['a','b',None], "c2": ['d','e','f']})
>>> df2
c1 c2
0 a d
1 b e
2 None f
And now I want to "compress the categories" horizontally, as follows:
compressed_categories
0 c1-a, c2-d <--- this could be a string, ex. "c1-a, c2-d" or array ["c1-a", "c2-d"] or categorical data
1 c1-b, c2-e
2 c1-nan, c2-f
Next I want to generate a dictionary/vocabulary based on the unique occurrences plus "nan" columns in compressed_categories, ex:
volcab = {
"c1-a": 0,
"c1-b": 1,
"c1-c": 2,
"c1-nan": 3,
"c2-d": 4,
"c2-e": 5,
"c2-f": 6,
"c2-nan": 7,
}
So I can then encode them numerically as follows:
compressed_categories_numeric
0 [0, 4]
1 [1, 5]
2 [3, 6]
So my ultimate goal is to make it easy to convert them to numpy array for each row and thus I can further convert it to tensor.
input_data = np.asarray(df['compressed_categories_numeric'].tolist())
then I can train my model using input_data.
Can anyone please show me an example how to make this series of conversion? Thanks in advance!
To build the volcab dictionary and compressed_categories_numeric, you can use:
import numpy as np

df3 = df2.fillna(np.nan).astype(str).apply(lambda x: x.name + '-' + x)
volcab = {k: v for v, k in enumerate(np.unique(df3))}
df2['compressed_categories_numeric'] = df3.replace(volcab).agg(list, axis=1)
Output:
>>> volcab
{'c1-a': 0, 'c1-b': 1, 'c1-nan': 2, 'c2-d': 3, 'c2-e': 4, 'c2-f': 5}
>>> df2
c1 c2 compressed_categories_numeric
0 a d [0, 3]
1 b e [1, 4]
2 None f [2, 5]
>>> np.array(df2['compressed_categories_numeric'].tolist())
array([[0, 3],
[1, 4],
[2, 5]])
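If the vocabulary must also include categories that never occur in the data (like 'c1-c' and 'c2-nan' in your example), you can build it from explicit category lists instead; a sketch, assuming the full category set is known up front:
# hypothetical full category sets per column, including 'nan'
categories = {'c1': ['a', 'b', 'c', 'nan'], 'c2': ['d', 'e', 'f', 'nan']}

volcab = {}
for col, cats in categories.items():
    for cat in cats:
        volcab[f'{col}-{cat}'] = len(volcab)
# {'c1-a': 0, 'c1-b': 1, 'c1-c': 2, 'c1-nan': 3, 'c2-d': 4, 'c2-e': 5, 'c2-f': 6, 'c2-nan': 7}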

I would like to know how to print only specific columns of a pandas DataFrame.

cols = list(ds.columns.values)
ds = ds[cols[1:3] + cols[5:6] + [cols[9]]]
print(ds)
Why do we convert to a list in the line cols = list(ds.columns.values)?
If ds is a DataFrame from Pandas:
type(ds.columns.values)
>>> <class 'numpy.ndarray'>
If you add two different numpy arrays of strings or chars, they are concatenated element-wise:
a1 = np.char.array(['a', 'b'])
a2 = np.char.array(['c', 'd'])
a1 + a2
>>> chararray(['ac', 'bd'], dtype='<U2')
and not:
np.char.array(['a', 'b', 'c', 'd'])
That is why you should convert it to a list, because:
list1 = ['a','b']
list2 = ['c','d']
list1 + list2
>>> ['a','b','c','d']
Remember, selecting multiple columns from a pandas.DataFrame takes a list, which is why you should feed it a list:
ds[[column1, column2, column5, column9]]
If you slice a single numpy.ndarray or a single list, you can get the dataframe either way:
cols = ds.columns.values #numpy.ndarray
ds = ds[cols[1:3]] #ok
cols = ds.columns.tolist() #list
ds = ds[cols[1:3]] #ok
However, if you use the + operator, the behavior differs between numpy.ndarray and list:
cols = ds.columns.values #numpy.ndarray
ds = ds[cols[1:3] + cols[5:6]] #ERROR
cols = ds.columns.tolist() #list
ds = ds[cols[1:3] + cols[5:6]] #ok
That is because the + operator is "concatenation" for list,
whereas for numpy.ndarray, the + operator is numpy.add.
In other words, cols[1:3] + cols[5:6] is actually doing np.add(cols[1:3], cols[5:6])
Refer to documentation for more details.
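If you do want to stay with ndarrays, the element-joining equivalent of list + is np.concatenate; a minimal sketch, assuming ds is the DataFrame from the question:
import numpy as np

cols = ds.columns.values                                      # ndarray of labels
selected = np.concatenate([cols[1:3], cols[5:6], [cols[9]]])  # join label arrays
ds = ds[selected]                                             # same as the list version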
ds.columns.values returns an ndarray, so slicing it also produces ndarrays, and + between ndarrays behaves differently than between lists:
df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [1, 2, 3, 4], 'col3': [1, 2, 3, 4], 'col4': [1, 2, 3, 4],
'col5': [1, 2, 3, 4], 'col6': [1, 2, 3, 4], 'col7': [1, 2, 3, 4], 'col8': [1, 2, 3, 4]})
cols_arr = df.columns.values
cols_list = list(df.columns.values)
print(cols_arr[0:2] + cols_arr[3:4] + [cols_arr[7]])
print(cols_list[0:2] + cols_list[3:4] + [cols_list[7]])
Output
['col1col4col8' 'col2col4col8']
['col1', 'col2', 'col4', 'col8']
When you try to access the dataframe with the first result, df[cols_arr[0:2] + cols_arr[3:4] + [cols_arr[7]]], you will get
KeyError: "None of [Index(['col1col4col8', 'col2col4col8'], dtype='object')] are in the [columns]"
With the lists, df[cols_list[0:2] + cols_list[3:4] + [cols_list[7]]] gives you the new dataframe:
col1 col2 col4 col8
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
3 4 4 4 4
A simpler way to convert columns into a list:
ds.columns.tolist()
But this also seems unnecessary. ds.columns returns an Index. You can slice an Index just like a normal list, and then combine slices using .union (note that .union sorts the combined labels by default):
cols = ds.columns
ds = ds[cols[1:3].union(cols[5:6]).union(cols[9:10])]
Note that you can use .iloc to reach your goal in a more idiomatic way:
ds = ds.iloc[:, [1, 2, 5, 9]]
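If you prefer positional selection, numpy's np.r_ builds the integer index compactly (a sketch, assuming ds as above):
import numpy as np

ds = ds.iloc[:, np.r_[1:3, 5, 9]]   # columns at positions 1, 2, 5, 9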

Conditioning pandas Dataframe on an Array

I'm trying to figure out how to condition on an array I've created.
first6 = df["Tbl_Name_Dur"].unique()[0:6]
for element in first6:
print(element)
df_test = df[df['Tbl_Name_Dur'] for element in first6]
I've printed the elements, and that works. How do I filter my dataframe based on first6? I've tried the following:
df_test = df[df['Tbl_Name_Dur'] in first6]
df_test = df[df['Tbl_Name_Dur'] == first6]
Any help would be much appreciated!
You can use the isin method. Here is an example:
import pandas as pd
data_dict = {'col': pd.Series([1, 2, 3, 4, 4, 5, 6, 7, 8 ,8 ])}
df = pd.DataFrame(data_dict)
first6 = df.col.unique()[0:6]
df = df[df.isin(first6)]
df.dropna(inplace=True)
print(df)
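Applied directly to the column from your question (assuming df and Tbl_Name_Dur as in your snippet), the usual idiom is a boolean mask on that one column:
df_test = df[df['Tbl_Name_Dur'].isin(first6)]   # keep rows whose value is in first6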
Alternatively, you can use a lambda function together with map:
import pandas as pd
data_dict = {'col': pd.Series([1, 2, 3, 4, 4, 5, 6, 7, 8, 8 ])}
df = pd.DataFrame(data_dict)
first6 = df.col.unique()[0:6]
df = df[df.col.map(lambda x : x in first6)]
print(df)
Output:
col
0 1
1 2
2 3
3 4
4 4
5 5
6 6
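For completeness, numpy offers an equivalent boolean mask via np.isin (numpy 1.13+):
import numpy as np

df = df[np.isin(df['col'], first6)]   # same rows as the isin/map versions above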

How to sort column names in pandas dataframe by specifying keywords

Is it possible to sort the columns of a DataFrame by specifying keywords in list or dict format, as follows?
df = pd.DataFrame({
"col_cc_7": [0, 0, 0],
"col_aa_7": [1, 1, 1],
"col_bb_7": [2, 2, 2]})
# before
col_cc_7, col_aa_7, col_bb_7
0, 1, 2
0, 1, 2
0, 1, 2
# sort
custom_sort_key = ["aa", "bb", "cc"]
# ... sort codes ...
# after
col_aa_7, col_bb_7, col_cc_7
1, 2, 0
1, 2, 0
1, 2, 0
For me, your question is a little confusing. If you only want to sort by column values, that is straightforward; otherwise I could not understand the question.
df = df.sort_values(by=['col1', 'col2', 'col3'], ascending=[True, True, False])
The by= sets the sorting priority, and ascending is self-explanatory.
We can split on the middle token to create a dictionary mapping your columns to their keywords, sort it, and then assign this back. I've added some extra columns that are not in your sort to show what happens to them.
df = pd.DataFrame ({
"col_cc_7": [0, 0, 0],
"col_aa_7": [1, 1, 1],
"col_bb_7": [2, 2, 2],
"col_ee_7": [2, 2, 2],
"col_dd_7": [2, 2, 2]})
custom_sort_key = ["bb", "cc", "aa"]
col_dict = dict(zip(df.columns,[x.split('_')[1] for x in df.columns.tolist()]))
#{'col_cc_7': 'cc',
# 'col_aa_7': 'aa',
# 'col_bb_7': 'bb',
# 'col_ee_7': 'ee',
# 'col_dd_7': 'dd'}
d = {v:k for k,v in enumerate(custom_sort_key)}
# this will only work on python 3.6 +
new_cols = dict(sorted(col_dict.items(), key=lambda x: d.get(x[1], float('inf'))))
df[list(new_cols.keys())]
col_bb_7 col_cc_7 col_aa_7 col_ee_7 col_dd_7
0 2 0 1 2 2
1 2 0 1 2 2
2 2 0 1 2 2
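A shorter alternative, under the same assumption that the keyword is the middle '_'-separated token: sort the column labels directly with a key function.
import pandas as pd

df = pd.DataFrame({
    "col_cc_7": [0, 0, 0],
    "col_aa_7": [1, 1, 1],
    "col_bb_7": [2, 2, 2]})

custom_sort_key = ["aa", "bb", "cc"]
order = {k: i for i, k in enumerate(custom_sort_key)}

# unknown keywords sort to the end via float('inf')
new_cols = sorted(df.columns, key=lambda c: order.get(c.split('_')[1], float('inf')))
df = df[new_cols]
print(df.columns.tolist())  # ['col_aa_7', 'col_bb_7', 'col_cc_7']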

How to get the n most frequent values from each column in pandas

I know how to get the most frequent value of each column in a dataframe using "mode". For example:
df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3]})
df.mode()
A
0 2
But I am unable to find the "n" most frequent values of each column of a dataframe. For example, for the dataframe above, I would like the following output for n=2:
A
0 2
1 1
Any pointers?
One way is to use pd.Series.value_counts and extract the index:
df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3]})
res = pd.DataFrame({col: df[col].value_counts().head(2).index for col in df})
# A
# 0 2
# 1 1
Use value_counts and select the index values by indexing. This works on each column separately, so you need apply or a dict comprehension with the DataFrame constructor. Casting to Series is necessary for a more general solution, in case some indices do not exist, e.g.:
df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3],
'B': [1, 1, 1, 1, 1, 1]})
N = 2
df = df.apply(lambda x: pd.Series(x.value_counts().index[:N]))
Or:
N = 2
df = pd.DataFrame({x:pd.Series( df[x].value_counts().index[:N]) for x in df.columns})
print (df)
A B
0 2 1.0
1 1 NaN
For a more general solution, first select only the numeric columns with select_dtypes:
import numpy as np

df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3],
'B': [1, 1, 1, 1, 1, 1],
'C': list('abcdef')})
N = 2
df = df.select_dtypes([np.number]).apply(lambda x: pd.Series(x.value_counts().index[:N]))
Or:
N = 2
cols = df.select_dtypes([np.number]).columns
df = pd.DataFrame({x: pd.Series(df[x].value_counts().index[:N]) for x in cols})
print (df)
A B
0 2 1.0
1 1 NaN
