Input: a DataFrame with hierarchical headers (MultiIndex columns).
Ask: select a combination of specific columns [level0, level1] and a broadcast [level0, :].
Example:
import numpy as np
import pandas as pd
index = pd.MultiIndex.from_product([["A", "B"], ["x", "y", "z"]])
df = pd.DataFrame(np.random.randn(8, 6), columns=index)
The desired result is to select ('A','y') and everything under 'B'.
I've managed to achieve this using the solution below:
df[[x for x in df.columns if x == ('A','y') or x[0] == 'B']]
I've tried to use .loc[] with slice(None), but that did not work.
Is there a more elegant solution than iterating over the columns' tuples?
Cheers.
Your list comprehension should work:
cols = [ent for ent in df if ent == ('A','y') or ent[0] == 'B']
df.loc[:, cols]
A B
y x y z
0 0.069915 2.563734 1.034784 0.659189
1 -0.240847 1.924626 1.241827 0.973155
2 -1.091353 -1.003005 1.648075 -1.162863
3 -0.747503 -0.211539 1.861991 -1.011261
4 -0.354648 -0.117533 0.524876 0.884997
5 0.786158 -2.073479 1.374893 1.428770
6 0.597740 0.853482 -0.187112 0.000626
7 -0.749839 -1.084576 -0.327888 -0.286908
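If you want to avoid the Python-level loop entirely, a minimal sketch of the same selection expressed as a boolean mask over the MultiIndex, using isin and get_level_values:
mask = df.columns.isin([('A', 'y')]) | (df.columns.get_level_values(0) == 'B')
df.loc[:, mask]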
Another option is with select_columns from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
df.select_columns(('A','y'), 'B')
A B
y x y z
0 0.069915 2.563734 1.034784 0.659189
1 -0.240847 1.924626 1.241827 0.973155
2 -1.091353 -1.003005 1.648075 -1.162863
3 -0.747503 -0.211539 1.861991 -1.011261
4 -0.354648 -0.117533 0.524876 0.884997
5 0.786158 -2.073479 1.374893 1.428770
6 0.597740 0.853482 -0.187112 0.000626
7 -0.749839 -1.084576 -0.327888 -0.286908
I have a list of dataframes:
new_list = [df1, df2, df3]
I know the following command to merge them:
pd.concat(new_list, axis=1)
How can I create a dataframe by selecting the last column of every dataframe in the list, without using for loops?
Pick the last column of each dataframe using map:
import pandas as pd
# This is to have a test list to use
df1 = pd.DataFrame({"a":[1,2,3], "b":[2,3,4]})
df2 = pd.DataFrame({"c":[1,2,3], "d":[2,3,4]})
new_list = [df1, df2]
# logic
pd.concat(map(lambda x: x[x.columns[-1]], new_list), axis=1)
OUTPUT
b d
0 2 2
1 3 3
2 4 4
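Equivalently, since pd.concat accepts any iterable, a generator expression with position-based iloc does the same thing:
pd.concat((df.iloc[:, -1] for df in new_list), axis=1)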
You can also slice the last column of each dataframe with iloc:
import pandas as pd
new_list = [df1.iloc[:,-1:], df2.iloc[:,-1:], df3.iloc[:,-1:]]
pd.concat(new_list, axis = 1)
You can also use a lambda:
import pandas as pd
func = lambda x: x.iloc[:, -1]  # select all rows and the last column
new_list = [func(df1), func(df2), func(df3)]
pd.concat(new_list, axis=1)
I have the following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array(([1,2,3], [1,2,3], [1,2,3], [4,5,6])),
columns=['one','two','three'])
# Below I am subsetting by rows and columns, but I want more than just one column,
# in this case columns 'one' and 'two'.
small = df[df.one == 1].one
What is the alternative here?
You can use loc:
df = pd.DataFrame(np.array(([1,2,3], [1,2,3], [1,2,3], [4,5,6])),
columns=['one','two','three'])
small = df.loc[df.one == 1, ["one", "two"]]
# > one two
# 0 1 2
# 1 1 2
# 2 1 2
The first argument to loc selects the wanted rows; the second selects the wanted columns. As demonstrated here, it allows both masking and indexing.
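For completeness, iloc is the positional analogue; a small sketch selecting the same cells by integer position (rows 0-2, first two columns):
df.iloc[0:3, 0:2]
#    one  two
# 0    1    2
# 1    1    2
# 2    1    2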
I have a dataframe with some columns containing NaN. I'd like to drop the columns that have a certain number of NaNs. For example, in the following code, I'd like to drop any column with 2 or more NaNs. In this case, column 'C' will be dropped and only 'A' and 'B' will be kept. How can I implement it?
import pandas as pd
import numpy as np
dff = pd.DataFrame(np.random.randn(10,3), columns=list('ABC'))
dff.iloc[3,0] = np.nan
dff.iloc[6,1] = np.nan
dff.iloc[5:8,2] = np.nan
print(dff)
There is a thresh param for dropna: it is the minimum number of non-NA values a column must have to be kept. To drop columns with 2 or more NaNs, pass the length of your df minus 1 as the threshold:
In [13]:
dff.dropna(thresh=len(dff) - 1, axis=1)
Out[13]:
A B
0 0.517199 -0.806304
1 -0.643074 0.229602
2 0.656728 0.535155
3 NaN -0.162345
4 -0.309663 -0.783539
5 1.244725 -0.274514
6 -0.254232 NaN
7 -1.242430 0.228660
8 -0.311874 -0.448886
9 -0.984453 -0.755416
So the above drops any column that does not have at least len(dff) - 1 non-NA values, i.e. any column containing 2 or more NaNs.
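As a sanity check before picking the threshold, you can inspect the per-column NaN counts:
print(dff.isnull().sum())
# A    1
# B    1
# C    3
# dtype: int64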
You can use a conditional list comprehension:
>>> dff[[c for c in dff if dff[c].isnull().sum() < 2]]
A B
0 -0.819004 0.919190
1 0.922164 0.088111
2 0.188150 0.847099
3 NaN -0.053563
4 1.327250 -0.376076
5 3.724980 0.292757
6 -0.319342 NaN
7 -1.051529 0.389843
8 -0.805542 -0.018347
9 -0.816261 -1.627026
Here is a possible solution:
s = dff.isnull().sum()  # count the number of NaNs in each column
print(s)
A    1
B    1
C    3
dtype: int64
for col in list(dff):  # iterate over a copy, since we delete while looping
    if s[col] >= 2:
        del dff[col]
Or:
for c in list(dff):
    if dff[c].isnull().sum() >= 2:
        dff.drop(c, axis=1, inplace=True)
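Both loops are equivalent to a single vectorized selection that keeps the columns whose NaN count stays below the limit:
dff = dff.loc[:, dff.isnull().sum() < 2]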
I recommend the drop method. This is an alternative solution:
dff.drop(dff.columns[dff.isnull().sum() >= 2], axis=1)
Say you have to drop columns having more than 70% null values:
data.drop(data.loc[:, 100 * data.isnull().sum() / len(data.index) > 70].columns, axis=1)
Another approach for dropping columns having a certain number of NaN values:
df = df.drop(columns=[x for x in df if df[x].isna().sum() > 5])
For dropping columns having a certain percentage of NaN values:
df = df.drop(columns=[x for x in df if round(df[x].isna().sum() / len(df) * 100, 2) > 20])
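A slightly shorter variant of the percentage check, as a sketch with the same 20% cut-off: isna().mean() gives the NaN fraction per column directly.
df = df.drop(columns=[x for x in df if df[x].isna().mean() > 0.20])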
I'm struggling with properly loading a CSV that has a multi-line header with blanks. The CSV looks like this:
,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
What I would like to get is:
   A  B  C        D
         X  Y  Z  X  Y  Z
0  1  2  3  4  5  6  7  8
When I try to load with pd.read_csv(file, header=[0,1], sep=','), the blanks in the first header row come back as 'Unnamed: n_level_0' placeholders instead:
  Unnamed: 0_level_0 Unnamed: 1_level_0  C Unnamed: 3_level_0 Unnamed: 4_level_0  D Unnamed: 6_level_0 Unnamed: 7_level_0
                   A                  B  X                  Y                  Z  X                  Y                  Z
0                  1                  2  3                  4                  5  6                  7                  8
Is there a way to get the desired result?
Note: alternatively, I would accept this as a result:
Versions used:
Python: 2.7.8
Pandas 0.16.0
Here is an automated way to fix the column index. First,
pull the column level values into a DataFrame:
columns = pd.DataFrame(df.columns.tolist())
then rename the Unnamed: columns to NaN:
columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan
and then forward-fill the NaNs:
columns[0] = columns[0].fillna(method='ffill')
so that columns now looks like
In [314]: columns
Out[314]:
0 1
0 NaN A
1 NaN B
2 C X
3 C Y
4 C Z
5 D X
6 D Y
7 D Z
Now we can find the remaining NaNs and fill them with empty strings:
mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')
To make the first two columns, A and B, indexable as df['A'] and df['B'] -- as though they were single-leveled -- you could swap the values in the first and second columns:
columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values
Now you can build a new MultiIndex and assign it to df.columns:
df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())
Putting it all together, if data is
,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
3,4,5,6,7,8,9,0
then
import numpy as np
import pandas as pd
df = pd.read_csv('data', header=[0,1], sep=',')
columns = pd.DataFrame(df.columns.tolist())
columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan
columns[0] = columns[0].fillna(method='ffill')
mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')
columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values
df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())
print(df)
yields
A B C D
X Y Z X Y Z
0 1 2 3 4 5 6 7 8
1 3 4 5 6 7 8 9 0
There is no magical way of making pandas aware of how you want your index to look; the closest you can get is to specify a lot yourself, like this:
names = ['A', 'B',
         ('C','X'), ('C','Y'), ('C','Z'),
         ('D','X'), ('D','Y'), ('D','Z')]
pd.read_csv(file, mangle_dupe_cols=True,
            header=1, names=names, index_col=[0, 1])
Gives:
C D
X Y Z X Y Z
A B
1 2 3 4 5 6 7 8
To do this in a dynamic fashion, you could read the first two lines of the CSV as they are and loop through the columns you get to generate the names variable dynamically before loading the full dataset.
pd.read_csv(file, nrows=1, header=[0,1], index_col=[0, 1])
Then access the columns and loop to create your header.
Again, not a very clean solution, but should work.
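A sketch of that dynamic approach, under the same assumptions about the file layout (leading blank columns are single-level, later blanks continue the previous group):
import pandas as pd

preview = pd.read_csv(file, nrows=1, header=[0, 1])
names, last_top = [], None
for top, sub in preview.columns:
    if top.startswith('Unnamed'):
        # blank before any named group: a single-level column;
        # blank after one: a continuation of that group
        names.append(sub if last_top is None else (last_top, sub))
    else:
        last_top = top
        names.append((top, sub))
pd.read_csv(file, header=1, names=names, index_col=[0, 1])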
You can read it using:
df = pd.read_csv('file.csv', header=[0, 1], skipinitialspace=True, tupleize_cols=True)
and then:
df.columns = pd.MultiIndex.from_tuples(df.columns)
Note that tupleize_cols only exists in older pandas versions; it was deprecated and later removed.
Load the dataframe, with MultiIndex:
df = pd.read_csv(filelist, header=[0,1], sep=',')
Write a function to replace the index:
def replace_index(df):
    arr = df.columns.values
    l = [list(x) for x in arr]
    # forward-fill first-level names into the Unnamed placeholders
    for i in range(len(l)):
        if l[i][0][:7] == 'Unnamed':
            if l[i-1][0][:7] != 'Unnamed':
                l[i][0] = l[i-1][0]
    # remaining Unnamed entries are single-level columns: promote the
    # second-level name and blank out the second level
    for i in range(len(l)):
        if l[i][0][:7] == 'Unnamed':
            l[i][0] = l[i][1]
            l[i][1] = ''
    index = pd.MultiIndex.from_tuples(l)
    df.columns = index
    return df
Return the new dataframe properly indexed:
replace_index(df)
I used a technique to flatten the multi-index columns into a single level. It works well for me.
your_df.columns = ['_'.join(col).strip() for col in your_df.columns.values]
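For example, a quick self-contained sketch of what the flattening produces:
import pandas as pd
df = pd.DataFrame([[1, 2]], columns=pd.MultiIndex.from_tuples([('C', 'X'), ('C', 'Y')]))
df.columns = ['_'.join(col).strip() for col in df.columns.values]
print(df.columns.tolist())  # ['C_X', 'C_Y']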
Import your csv file providing the header row indexes:
df = pd.read_csv('file.csv', header=[0, 1, 2])
Then, you can iterate over each column header, clean it up, assign it to a tuple, then re-assign the dataframe columns using pd.MultiIndex.from_tuples(list_of_tuples):
df.columns = pd.MultiIndex.from_tuples(
[tuple(['' if y.find('Unnamed')==0 else y for y in x]) for x in df.columns]
)
This is the quick one-liner I was looking for when trying to figure this out.
Is it possible to create a DataFrame from a list of series without duplicating their names?
Ex, creating the same DataFrame as:
>>> pd.DataFrame({ "foo": data["foo"], "bar": other_data["bar"] })
But without needing to explicitly name the columns?
Try pandas.concat which takes a list of items to combine as its argument:
df1 = pd.DataFrame(np.random.randn(100, 4), columns=list('abcd'))
df2 = pd.DataFrame(np.random.randn(100, 3), columns=list('xyz'))
df3 = pd.concat([df1['a'], df2['y']], axis=1)
Note that you need to use axis=1 to stack things together side-by side and axis=0 (which is the default) to combine them one-over-the-other.
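With the frames above, df3 ends up with the two selected series side by side, aligned on the shared row index:
print(df3.columns.tolist())  # ['a', 'y']
print(df3.shape)             # (100, 2)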
Seems like you want to join the dataframes (this works similarly to SQL):
import numpy as np
import pandas

df1 = pandas.DataFrame(
    np.random.randint(low=0, high=11, size=(10, 2)),  # randint's high is exclusive, so 11 gives 0-10
    columns=['foo', 'bar'],
    index=list('ABCDEFHIJK')
)
df2 = pandas.DataFrame(
    np.random.randint(low=0, high=11, size=(10, 2)),
    columns=['bar', 'bax'],
    index=list('DEFHIJKLMN')
)
df1[['foo']].join(df2['bar'], how='outer')
The on kwarg takes a list of columns or None. If None, it joins on the indices of the two dataframes. You just need to make sure that you're using a dataframe for the left side -- hence the double brackets to force df1[['foo']] to a dataframe (df1['foo'] returns a series).
This gives me:
foo bar
A 4 NaN
B 0 NaN
C 10 NaN
D 8 3
E 2 0
F 3 3
H 9 10
I 0 9
J 5 6
K 2 9
L NaN 3
M NaN 1
N NaN 1
You can also do inner, left, and right joins.
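For instance, an inner join keeps only the index labels present in both frames (D through K in the example above):
df1[['foo']].join(df2['bar'], how='inner')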
I prefer the explicit way, as presented in your original post, but if you really want to write each name only once, you could try this:
import pandas as pd
import numpy as np
def dictify(*args):
    return dict((i, n[i]) for i, n in args)

data = {'foo': np.random.randn(5)}
other_data = {'bar': np.random.randn(5)}
print(pd.DataFrame(dictify(('foo', data), ('bar', other_data))))
The output is as expected:
bar foo
0 0.533973 -0.477521
1 0.027354 0.974038
2 -0.725991 0.350420
3 1.921215 0.648210
4 0.547640 1.652310
[5 rows x 2 columns]