Create a Pandas DataFrame from series without duplicating their names? - python

Is it possible to create a DataFrame from a list of series without duplicating their names?
Ex, creating the same DataFrame as:
>>> pd.DataFrame({ "foo": data["foo"], "bar": other_data["bar"] })
But without needing to explicitly name the columns?

Try pandas.concat, which takes a list of items to combine as its argument:
df1 = pd.DataFrame(np.random.randn(100, 4), columns=list('abcd'))
df2 = pd.DataFrame(np.random.randn(100, 3), columns=list('xyz'))
df3 = pd.concat([df1['a'], df2['y']], axis=1)
Note that you need axis=1 to stack things together side-by-side, and axis=0 (the default) to stack them one on top of the other.
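Applied to the question's example, a minimal sketch (data and other_data here are hypothetical stand-ins for the question's sources): since each selected column is a named Series, concat reuses those names, so nothing is typed twice.
import pandas as pd

# hypothetical stand-ins for the question's data sources
data = pd.DataFrame({'foo': [1, 2, 3]})
other_data = pd.DataFrame({'bar': [4, 5, 6]})

# the series names become the column names automatically
df = pd.concat([data['foo'], other_data['bar']], axis=1)
print(df.columns.tolist())  # ['foo', 'bar']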

Seems like you want to join the dataframes (it works similarly to SQL):
import numpy as np
import pandas
df1 = pandas.DataFrame(
    np.random.randint(low=0, high=11, size=(10, 2)),  # random_integers is gone from modern NumPy; randint's high is exclusive
    columns=['foo', 'bar'],
    index=list('ABCDEFHIJK')
)
df2 = pandas.DataFrame(
    np.random.randint(low=0, high=11, size=(10, 2)),
    columns=['bar', 'bax'],
    index=list('DEFHIJKLMN')
)
df1[['foo']].join(df2['bar'], how='outer')
The on kwarg takes a list of columns or None. If None, it joins on the indices of the two dataframes. You just need to make sure you're using a DataFrame on the left side -- hence the double brackets to force df1[['foo']] to a DataFrame (df1['foo'] returns a Series).
This gives me:
   foo  bar
A    4  NaN
B    0  NaN
C   10  NaN
D    8    3
E    2    0
F    3    3
H    9   10
I    0    9
J    5    6
K    2    9
L  NaN    3
M  NaN    1
N  NaN    1
You can also do inner, left, and right joins.
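For instance, an inner join keeps only the index labels common to both frames; a quick sketch reusing df1 and df2 from above:
# only the shared labels D..K survive an inner join
df_inner = df1[['foo']].join(df2['bar'], how='inner')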

I prefer the explicit way, as presented in your original post, but if you really want to write certain names once, you could try this:
import pandas as pd
import numpy as np
def dictify(*args):
    return dict((i, n[i]) for i, n in args)
data = {'foo': np.random.randn(5)}
other_data = {'bar': np.random.randn(5)}
print(pd.DataFrame(dictify(('foo', data), ('bar', other_data))))
The output is as expected:
        bar       foo
0  0.533973 -0.477521
1  0.027354  0.974038
2 -0.725991  0.350420
3  1.921215  0.648210
4  0.547640  1.652310
[5 rows x 2 columns]

Related

Nested loops in pandas python

I have two DataFrames, one with many rows and another with only a few, and I need to merge them according to some conditions (on strings). I used nested loops in Pandas like this:
density = []
for row in df.itertuples():
    for row1 in df2.itertuples():
        if (row['a'].find(row1['b'])) > 0:
            density.append(row1['c'])
But I receive the error message:
TypeError: tuple indices must be integers, not str
What's wrong?
Consider df and df2
df = pd.DataFrame(dict(
a=['abcd', 'stk', 'shij', 'dfffedeffj', 'abcdefghijk'],
))
df2 = pd.DataFrame(dict(
b=['abc', 'hij', 'def'],
c=[1, 2, 3]
))
You can get decent-ish speed with pandas' scalar accessors .at (the older get_value/set_value equivalents were removed in pandas 1.0). And I'd store the values in a DataFrame:
density = pd.DataFrame(index=df.index, columns=df2.index)
for i in df.index:
    for j in df2.index:
        a = df.at[i, 'a']
        b = df2.at[j, 'b']
        if a.find(b) >= 0:
            density.at[i, j] = df2.at[j, 'c']
print(density)
     0    1    2
0    1  NaN  NaN
1  NaN  NaN  NaN
2  NaN    2  NaN
3  NaN  NaN    3
4    1    2    3
You can also compose NumPy and pandas string functions:
t = df2.b.apply(lambda x: df.a.str.contains(x)).values
c = df2.c.values[:, None]
density = pd.DataFrame(
    np.where(t, np.hstack([c] * t.shape[1]), np.nan).T,
    df.index, df2.index)
The method DataFrame.itertuples returns namedtuples, and to access the values in a namedtuple you have to use dot notation:
density = []
for row in df.itertuples():
    for row1 in df2.itertuples():
        if row.a.find(row1.b) >= 0:  # >= 0, since str.find returns 0 for a match at the start
            density.append(row1.c)
Nevertheless, this does not produce a merge of the two DataFrames.
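If an actual merged frame is what you're after, here is one hedged sketch (assuming pandas >= 1.2 for how='cross'): cross-join the two frames, then keep the rows where b is a substring of a.
# pair every row of df with every row of df2, then filter on the substring condition
merged = df.merge(df2, how='cross')
merged = merged[[b in a for a, b in zip(merged['a'], merged['b'])]]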

How to add a hierarchically-named column to a Pandas DataFrame

I have an empty DataFrame:
import pandas as pd
df = pd.DataFrame()
I want to add a hierarchically-named column. I tried this:
df['foo', 'bar'] = [1,2,3]
But it gives a column whose name is a tuple:
(foo, bar)
0 1
1 2
2 3
I want this:
foo
bar
0 1
1 2
2 3
Which I can get if I construct a brand new DataFrame this way:
pd.DataFrame([1,2,3], columns=pd.MultiIndex.from_tuples([('foo', 'bar')]))
How can I create such a layout when adding new columns to an existing DataFrame? The number of levels is always 2...and I know all the possible values for the first level in advance.
If you are looking to build the multi-index DF one column at a time, you could append the frames and drop the NaNs that introduces, leaving you with the desired multi-index DF, as shown below:
Demo:
df = pd.DataFrame()
df['foo', 'bar'] = [1,2,3]
df['foo', 'baz'] = [3,4,5]
df
Taking one column at a time, build the corresponding headers (df[[0]] is label-based, so use iloc for positional selection):
pd.concat([df.iloc[:, [0]], df.iloc[:, [1]]]).apply(lambda x: x.dropna())
Due to the NaNs produced, the values are typecast to float dtype, which can be cast back to integers with DF.astype(int).
Note:
This assumes that the number of levels matches during concatenation.
I'm not sure there is a way to get away with this without redefining the column index to be a MultiIndex. If I'm not mistaken, the levels of the MultiIndex class are actually made up of Index objects. While you can have DataFrames with hierarchical indices that lack values for one or more of the levels, the index object itself must still be a MultiIndex. For example:
In [2]: df = pd.DataFrame({'foo': [1,2,3], 'bar': [4,5,6]})
In [3]: df
Out[3]:
bar foo
0 4 1
1 5 2
2 6 3
In [4]: df.columns
Out[4]: Index([u'bar', u'foo'], dtype='object')
In [5]: df.columns = pd.MultiIndex.from_tuples([('', 'foo'), ('foo','bar')])
In [6]: df.columns
Out[6]:
MultiIndex(levels=[[u'', u'foo'], [u'bar', u'foo']],
           labels=[[0, 1], [1, 0]])
In [7]: df.columns.get_level_values(0)
Out[7]: Index([u'', u'foo'], dtype='object')
In [8]: df
Out[8]:
       foo
  foo  bar
0    4    1
1    5    2
2    6    3
In [9]: df['bar', 'baz'] = [7,8,9]
In [10]: df
Out[10]:
       foo  bar
  foo  bar  baz
0    4    1    7
1    5    2    8
2    6    3    9
So as you can see, once the MultiIndex is in place you can add columns as you thought; unfortunately, I am not aware of any way to coerce the DataFrame into adopting a MultiIndex adaptively.
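One workaround sketch, assuming the row index is known up front: give the frame an empty two-level column index first, so that later tuple assignments become real MultiIndex columns rather than tuple-named ones.
import pandas as pd

df = pd.DataFrame(index=range(3))                 # rows must exist before assigning columns
df.columns = pd.MultiIndex.from_arrays([[], []])  # empty two-level header
df['foo', 'bar'] = [1, 2, 3]                      # lands as a two-level column
df['foo', 'baz'] = [4, 5, 6]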

Splitting multiple/all columns of a pandas dataframe

I have a pandas dataframe full of tuples (it would work the same with arrays), and I would like to split all the columns into even more columns (each array or tuple has the same length).
Let's take this as an example:
df = pd.DataFrame([[(1,2),(3,4)],[(5,6),(7,8)]], columns=['column0', 'column1'])
which outputs:
column0 column1
0 (1, 2) (3, 4)
1 (5, 6) (7, 8)
I tried to build on this solution (https://stackoverflow.com/a/16245109/4218755), using derivatives of the expression:
df.textcol.apply(lambda s: pd.Series({'feature1': s+1, 'feature2': s-1}))
like
df.column0.apply(lambda s: pd.Series({'feature1': s[0], 'feature2': s[1]}))
which outputs:
feature1 feature2
0 1 2
1 5 6
This is the desired behavior. So it works well, but if I happen to try to use
df2=df[df.columns].apply(lambda s: pd.Series({'feature1':s[0], 'feature2':s[1]}))
then df2 is:
         column0  column1
feature1  (1, 2)   (3, 4)
feature2  (5, 6)   (7, 8)
which is obviously wrong. Applying on df directly doesn't work either; it outputs the same result as df2.
How can I apply such a splitting technique to a whole dataframe, and are there alternatives?
Thanks
You could extract the DataFrame values as a NumPy array, use IT.chain.from_iterable to extract the ints from the tuples, and then reshape and rebuild the array into a new DataFrame:
import itertools as IT
import numpy as np
import pandas as pd
df = pd.DataFrame([[(1,2),(3,4)],[(5,6),(7,8)]], columns=['column0', 'column1'])
arr = df.values
arr = np.array(list(IT.chain.from_iterable(arr))).reshape(len(df), -1)
result = pd.DataFrame(arr)
yields
   0  1  2  3
0  1  2  3  4
1  5  6  7  8
By the way, you might have fallen into an XY-trap -- you're asking for X when
you really should be looking for Y. Instead of trying to transform df into
result, it might be easier to build the desired DataFrame, result, from
the original data source.
For example, if your original data is a list of lists of tuples:
data = [[(1,2),(3,4)],[(5,6),(7,8)]]
Then the desired DataFrame could be built using
df = pd.DataFrame(np.array(data).reshape(2,-1))
# 0 1 2 3
# 0 1 2 3 4
# 1 5 6 7 8
Once you have non-NumPy-native data types in your DataFrame
(such as tuples), you are doomed to using at least one Python loop to extract
the ints from the tuples. (I'm regarding things like df.apply(func) and
list(IT.chain.from_iterable(arr)) as essentially Python loops since they work
at Python-loop speed.)
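With that caveat in mind, another hedged alternative that splits every tuple column at once (the loop over columns is still Python-level): expand each column via .tolist() and glue the pieces side by side; the dict keys become the first level of a MultiIndex header.
import pandas as pd

df = pd.DataFrame([[(1, 2), (3, 4)], [(5, 6), (7, 8)]],
                  columns=['column0', 'column1'])
# one 2-column frame per original column, concatenated side by side
result = pd.concat(
    {col: pd.DataFrame(df[col].tolist(), index=df.index) for col in df.columns},
    axis=1)
# columns: (column0, 0), (column0, 1), (column1, 0), (column1, 1)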
IIUC you can use:
df=pd.DataFrame([[(1,2),(3,4)],[(5,6),(7,8)]], columns=['column0', 'column1'])
print (df)
column0 column1
0 (1, 2) (3, 4)
1 (5, 6) (7, 8)
for col in df.columns:
    df[col] = df[col].apply(lambda s: pd.Series({'feature1': s[0], 'feature2': s[1]}))
print (df)
column0 column1
0 1 3
1 5 7
You may iterate over each column you want to split and assign the new columns to your DataFrame:
import pandas as pd

df = pd.DataFrame([[(1, 2), (3, 4)],
                   [(5, 6), (7, 8)]], columns=['column0', 'column1'])
# empty DataFrame
df2 = pd.DataFrame()
for col in df.columns:
    # names of new columns
    feature_columns = ["{col}_feature1".format(col=col), "{col}_feature2".format(col=col)]
    # split current column
    df2[feature_columns] = df[col].apply(lambda s: pd.Series({feature_columns[0]: s[0],
                                                              feature_columns[1]: s[1]}))
print(df2)
which gives
   column0_feature1  column0_feature2  column1_feature1  column1_feature2
0                 1                 2                 3                 4
1                 5                 6                 7                 8

Include empty series when creating a pandas dataframe with .concat

UPDATE: This is no longer an issue since at least pandas version 0.18.1; concatenating empty series doesn't drop them anymore, so this question is out of date.
I want to create a pandas dataframe from a list of series using .concat. The problem is that when one of the series is empty it doesn't get included in the resulting dataframe, and this gives the dataframe the wrong dimensions when I then try to rename its columns with a multi-index.
UPDATE: Here's an example...
import pandas as pd
sers1 = pd.Series()
sers2 = pd.Series(['a', 'b', 'c'])
df1 = pd.concat([sers1, sers2], axis=1)
This produces the following dataframe:
>>> df1
0 a
1 b
2 c
dtype: object
But I want it to produce something like this:
>>> df2
0 1
0 NaN a
1 NaN b
2 NaN c
It does this if I put a single NaN value anywhere in sers1, but it seems like this should happen automatically even if some of my series are totally empty.
Passing an argument for levels will do the trick. Here's an example. First, the wrong way:
import pandas as pd
ser1 = pd.Series()
ser2 = pd.Series([1, 2, 3])
list_of_series = [ser1, ser2, ser1]
df = pd.concat(list_of_series, axis=1)
Which produces this:
>>> df
0
0 1
1 2
2 3
But if we add some labels to the levels argument, it will include all the empty series too:
import pandas as pd
ser1 = pd.Series()
ser2 = pd.Series([1, 2, 3])
list_of_series = [ser1, ser2, ser1]
labels = range(len(list_of_series))
df = pd.concat(list_of_series, levels=labels, axis=1)
Which produces the desired dataframe:
>>> df
    0  1    2
0 NaN  1  NaN
1 NaN  2  NaN
2 NaN  3  NaN
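Per the update at the top of this question, on pandas >= 0.18.1 the plain concat already keeps empty series as all-NaN columns; a quick check:
import pandas as pd

empty = pd.Series(dtype=float)  # explicit dtype avoids the empty-Series warning
full = pd.Series([1, 2, 3])
print(pd.concat([empty, full, empty], axis=1))
#     0  1    2
# 0 NaN  1  NaN
# 1 NaN  2  NaN
# 2 NaN  3  NaN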

Merge multiple data frames with different dimensions using Pandas [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I have the following data frames (in reality there are more than three).
import pandas as pd
df1 = pd.DataFrame({'head1': ['foo', 'bix', 'bar'],'val': [11, 22, 32]})
df2 = pd.DataFrame({'head2': ['foo', 'xoo', 'bar','qux'],'val': [1, 2, 3,10]})
df3 = pd.DataFrame({'head3': ['xoo', 'bar',],'val': [20, 100]})
# Note that the value in column 'head' is always unique
What I want to do is merge them based on the head column, and whenever a head value does not exist in one data frame, assign it NA.
In the end it'll look like this:
      head1  head2  head3
-------------------------
foo      11      1     NA
bix      22     NA     NA
bar      32      3    100
xoo      NA      2     20
qux      NA     10     NA
How can I achieve that using Pandas?
You can use pandas.concat with axis=1 to concatenate your multiple DataFrames.
Note, however, that I've first set the index of df1, df2, and df3 to use the key values (foo, bar, etc.) rather than the default integers.
import pandas as pd
df1 = pd.DataFrame({'head1': ['foo', 'bix', 'bar'],'val': [11, 22, 32]})
df2 = pd.DataFrame({'head2': ['foo', 'xoo', 'bar','qux'],'val': [1, 2, 3,10]})
df3 = pd.DataFrame({'head3': ['xoo', 'bar',],'val': [20, 100]})
df1 = df1.set_index('head1')
df2 = df2.set_index('head2')
df3 = df3.set_index('head3')
df = pd.concat([df1, df2, df3], axis = 1)
columns = ['head1', 'head2', 'head3']
df.columns = columns
print(df)
     head1  head2  head3
bar     32      3    100
bix     22    NaN    NaN
foo     11      1    NaN
qux    NaN     10    NaN
xoo    NaN      2     20
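An alternative sketch using merge instead of concat (a variation, not the answer above): rename each key column to a shared name and fold the frames together with outer merges; the overlapping val columns pick up merge suffixes, so they are relabeled at the end.
from functools import reduce
import pandas as pd

df1 = pd.DataFrame({'head1': ['foo', 'bix', 'bar'], 'val': [11, 22, 32]})
df2 = pd.DataFrame({'head2': ['foo', 'xoo', 'bar', 'qux'], 'val': [1, 2, 3, 10]})
df3 = pd.DataFrame({'head3': ['xoo', 'bar'], 'val': [20, 100]})

# give every frame the same key column name, then fold with outer merges
dfs = [df.rename(columns={col: 'head'})
       for df, col in [(df1, 'head1'), (df2, 'head2'), (df3, 'head3')]]
merged = reduce(lambda l, r: pd.merge(l, r, on='head', how='outer'), dfs)
merged.columns = ['head', 'head1', 'head2', 'head3']  # relabel the suffixed val columns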
