I am trying to convert a list of 2D dataframes into one large dataframe. Let's assume I have the following example, where I create a set of dataframes, each one having the same columns / index:
import pandas as pd
import numpy as np
frames = []
names = []
frame_columns = ['DataPoint1', 'DataPoint2']
for i in range(5):
    names.append("DataSet{0}".format(i))
    frames.append(pd.DataFrame(np.random.randn(3, 2), columns=frame_columns))
I would like to convert this set of dataframes into one dataframe df which I can access using df['DataSet0']['DataPoint1'].
This dataset would have to have a multi-index consisting of the product of ['DataPoint1', 'DataPoint2'] and the index of the individual dataframes (which is of course the same for all individual frames).
Conversely, the columns would be given as the product of ['Dataset0', ...] and ['DataPoint1', 'DataPoint2'].
In either case, I can create a corresponding MultiIndex and derive an (empty) dataframe based on that:
mux = pd.MultiIndex.from_product([names, frames[0].columns])
frame = pd.DataFrame(index=mux).T
However, I would like to have the contents of the dataframes present rather than having to add them afterwards.
Note that a similar question has been asked here. However, the answers seem to revolve around the Panel class, which is, as of now, deprecated.
Similarly, this thread suggests a join, which is not really what I need.
You can use concat with keys:
total_frame = pd.concat(frames, keys=names)
Output:
DataPoint1 DataPoint2
DataSet0 0 -0.656758 1.776027
1 -0.940759 1.355495
2 0.173670 0.274525
DataSet1 0 -0.744456 -1.057482
1 0.186901 0.806281
2 0.148567 -1.065477
DataSet2 0 -0.980312 -0.487479
1 2.117227 -0.511628
2 0.093718 -0.514379
DataSet3 0 0.046963 -0.563041
1 -0.663800 -1.130751
2 -1.446891 0.879479
DataSet4 0 1.586213 1.552048
1 0.196841 1.933362
2 -0.545256 0.387289
Then you can extract DataSet0 with:
total_frame.loc['DataSet0']
If you really want to use MultiIndex columns instead, you can add axis=1 to concat:
total_frame = pd.concat(frames, axis=1, keys=names)
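With the column MultiIndex variant, the access pattern from the question then works as written:

# total_frame now has (dataset, datapoint) MultiIndex columns,
# so the lookup from the question works directly:
total_frame['DataSet0']['DataPoint1']

# Equivalently, a single lookup with a tuple key:
total_frame[('DataSet0', 'DataPoint1')]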
I've consulted a bunch of previous related SO posts, but I could not adapt them to solve my question.
Here is an example dataframe.
# Using pandas 0.24.2
data = {'customer_id': [1, 2, 3],
        'prev_due_date': ['Jun-2010', 'Apr-2019', 'Dec-1999'],
        'current_due_date': ['Aug-2019', 'Dec-2045', 'Jan-2000'],
        'next_due_date': ['Feb-2025', 'Nov-2065', 'Sep-2001']}
df = pd.DataFrame(data)
Here is what the dataframe looks like. There are many more such columns to parse in the actual dataframe, hence my question.
customer_id prev_due_date current_due_date next_due_date
0 1 Jun-2010 Aug-2019 Feb-2025
1 2 Apr-2019 Dec-2045 Nov-2065
2 3 Dec-1999 Jan-2000 Sep-2001
I have created a function to parse one column (i.e., it adds two parsed columns, a month column and a year column, to the supplied df):
def parse_column(df, col_parse):
    col_parse_mmm = col_parse + '_mmm'
    col_parse_yyyy = col_parse + '_yyyy'
    df[[col_parse_mmm, col_parse_yyyy]] = df[col_parse].str.split('-', expand=True)
    return df
Calling this function below does the job for the supplied column:
parse_column(df, 'prev_due_date')
Now, my question is:
How can I do this for an arbitrary number of columns of my choosing (e.g., a list of tens or hundreds of columns that I want to parse), using apply?
Is it possible to avoid using apply?
for c in df.columns:
    if c.endswith('_date'):
        parse_column(df, c)
(You don't need to return the df in your parse_column function; it modifies the dataframe in place.)
If you already have the list with the column names you're interested in:
for c in my_columns_list:
    parse_column(df, c)
You don't need any apply.
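If the list of columns is very long, you could also build all the new columns in one pass and attach them with a single concat, which avoids growing the dataframe column by column. A minimal sketch, using the my_columns_list name from above:

# Split each selected column once, collect the results,
# and attach everything with a single concat call.
parsed = {}
for c in my_columns_list:
    split = df[c].str.split('-', expand=True)
    parsed[c + '_mmm'] = split[0]
    parsed[c + '_yyyy'] = split[1]

df = pd.concat([df, pd.DataFrame(parsed)], axis=1)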
I am trying to create a pandas dataframe using two lists, and the output is erroneous for certain lengths of the lists (this is not due to varying lengths). Here I have two cases, one that works as expected and one that doesn't (commented out):
import string
from pandas import DataFrame

d = dict.fromkeys(string.ascii_lowercase, 0).keys()
groups = sorted(d)[:3]
numList = range(0, 4)
# groups = sorted(d)[:20]
# numList = range(0, 25)
df = DataFrame({'Number': sorted(numList)*len(groups), 'Group': sorted(groups)*len(numList)})
df.sort_values(['Group', 'Number'])
Expected output: every item in groups corresponds to all items in numList:
Group Number
a 0
a 1
a 2
a 3
b 0
b 1
b 2
b 3
c 0
c 1
c 2
c 3
Actual results: this works for the case in which the lists are sized 3 and 4, but not for sizes 20 and 25 (I have commented out that case in the code above).
Why is that, and how can I fix it?
If I understand this correctly, you want to make a dataframe which has all pairs of groups and numbers. That operation is called a cartesian product.
Your approach happens to work here because the two lengths, 3 and 4, are coprime: repeating each list just cycles through it, and the cycles only line up to cover every pair when the lengths share no common factor. With lengths 20 and 25 (common factor 5), only a fifth of the pairs ever appear, each repeated five times. So the working case is pure accident. For the general case, you want to do this:
df1 = DataFrame({'Number': sorted(numList)})
df2 = DataFrame({'Group': sorted(groups)})
df = df1.assign(key=1).merge(df2.assign(key=1), on='key').drop('key', axis=1)
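As a side note, newer pandas versions (1.2+) support a cross join directly, which avoids the dummy key column; a minimal sketch, assuming a recent pandas:

import pandas as pd

# pandas >= 1.2: merge(how='cross') builds the cartesian product directly.
df = df1.merge(df2, how='cross')

# An index-based alternative: build the product as a MultiIndex first,
# then flatten it into an ordinary dataframe.
mux = pd.MultiIndex.from_product([sorted(groups), sorted(numList)],
                                 names=['Group', 'Number'])
df = mux.to_frame(index=False)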
And one note about dataframe sorting: remember that in pandas, most DataFrame operations return a new DataFrame by default and do not modify the old one, unless you pass the inplace=True parameter.
So you should do
df = df.sort_values(['Group', 'Number'])
or
df.sort_values(['Group', 'Number'], inplace=True)
and it should work now.
I am trying to manipulate a bunch of multi-index pandas dataframes. Each column is a time series with different categorical groupings. I would like to sort the data, then parse through all the categories and do some additional data manipulation. Here is sample code of what I attempted, which didn't work:
import pandas as pd
import numpy as np
df = pd.DataFrame({'t': range(1, 11)})
df.set_index(['t'], inplace=True)
for num in range(2):
    labely = (str(num), 'A', 'y')
    labelx = (str(num), 'A', 'x')
    labelbx = (str(num), 'B', 'x')
    df[labelx] = np.random.randn(10)
    df[labelbx] = np.random.randn(10)
    df[labely] = np.random.randn(10) + range(1, 11)
df.columns = pd.MultiIndex.from_tuples(df.columns, names=['ID', 'Location', 'Direction'])
df[('0', 'A', 'tot')] = df[('0', 'A', 'y')] + df[('0', 'A', 'x')]
df.sort_index(level='ID', inplace=True)
df.head()
This doesn't sort the columns. This is the result, with the total not grouped with the other 0-ID columns and the Locations not grouped together:
ID 0 ... 1 0
Location A B A ... B A A
Direction x x y ... x y tot
t ...
1 0.430386 -0.121109 0.263314 ... 0.243839 0.313505 0.693700
2 -1.262746 -0.678889 1.289814 ... -0.893230 0.373103 0.027068
3 0.245483 -0.565859 3.766628 ... 0.012933 1.652484 4.012111
4 1.518357 0.447032 5.649877 ... -1.205161 5.513507 7.168233
5 -0.095216 -0.571333 6.794958 ... -0.777933 4.073334 6.699741
I have 2 questions associated with this:
1. How do I sort the columns so that they are organized by each of the levels?
2. How do I efficiently parse through the dataframe to do additional data manipulation?
This is some pseudocode for the second question:
for id in ID:
    for loc in Location:
        df[(id, loc, 'tot')] = df[(id, loc, 'x')] + df[(id, loc, 'y')]
To sort by columns, as Ian answered, use axis=1:
df.sort_index(level='ID', axis=1, inplace=True)
To get a list of the tuples of the column names to parse, I needed columns.values, and then I re-sorted after the calculations:
for id, loc, dir in df.columns.values:
    df[(id, loc, 'tot')] = (df[(id, loc, 'x')]**2 + df[(id, loc, 'y')]**2)**.5
df.sort_index(level='ID', axis=1, inplace=True)
Since these are basic column calculations, I think this method should be efficient.
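One caveat: iterating over every column tuple recomputes each total once per direction, and it assumes both an 'x' and a 'y' column exist for every (ID, Location) pair (in the sample data the 'B' locations have no 'y'). A sketch that iterates over the unique pairs instead, skipping incomplete ones:

# Iterate over unique (ID, Location) pairs rather than every column,
# so each total is computed exactly once.
for id_, loc in df.columns.droplevel('Direction').unique():
    x, y = (id_, loc, 'x'), (id_, loc, 'y')
    # Only compute the total where both direction columns exist.
    if x in df.columns and y in df.columns:
        df[(id_, loc, 'tot')] = (df[x]**2 + df[y]**2)**.5
df.sort_index(level='ID', axis=1, inplace=True)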
To pass multiple variables to a normal python function you can just write something like:
def a_function(date, string, float):
    # do something...
    # convert string to int
    # date = date + (float * int) days
    return date
When using Pandas DataFrames I know you can create a new column based on the contents of one like so:
df['new_col'] = df['column_A'].map(a_function)
# This might return the year from a date column
# return date.year
What I'm wondering is: in the same way you can pass multiple pieces of data to a single function (as seen in the first example above), can you use multiple columns in the creation of a new pandas DataFrame column?
For example combining three separate parts of a date Y - M - D into one field.
df['whole_date'] = df['Year','Month','Day'].map(a_function)
I get a key error with the following test.
def combine(one, two, three):
    return one + two + three

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [4, 5, 6]})
df['d'] = df['a', 'b', 'c'].map(combine)
Is there a way of creating a new column in a pandas DataFrame using .map or something else which takes as input three columns and returns a single column?
-> Example input: 1, 2, 3
-> Example output: 1*2*3
Likewise is there also a way of having a function take in one argument, a date and return three new pandas DataFrame columns; one for the year, month and day?
Is there a way of creating a new column in a pandas dataframe using .map or something else which takes as input three columns and returns a single column? For example, input would be 1, 2, 3 and output would be 1*2*3.
To do that, you can use apply with axis=1. However, instead of being called with three separate arguments (one for each column) your specified function will then be called with a single argument for each row, and that argument will be a Series containing the data for that row. You can either account for this in your function:
def combine(row):
    return row['a'] + row['b'] + row['c']
>>> df.apply(combine, axis=1)
0 7
1 10
2 13
Or you can pass a lambda which unpacks the Series into separate arguments:
def combine(one, two, three):
    return one + two + three
>>> df.apply(lambda x: combine(*x), axis=1)
0 7
1 10
2 13
If you want to pass only specific columns, you need to select them by indexing on the DataFrame with a list:
>>> df[['a', 'b', 'c']].apply(lambda x: combine(*x), axis=1)
0 7
1 10
2 13
Note the double brackets. (This doesn't really have anything to do with apply; indexing with a list is the normal way to access multiple columns from a DataFrame.)
However, it's important to note that in many cases you don't need to use apply, because you can just use vectorized operations on the columns themselves. The combine function above can simply be called with the DataFrame columns themselves as the arguments:
>>> combine(df.a, df.b, df.c)
0 7
1 10
2 13
This is typically much more efficient when the "combining" operation is vectorizable.
Likewise, is there also a way of having a function take in one argument, a date, and return three new pandas dataframe columns: one each for the year, month, and day?
As above, there are two basic ways to do this: a general but non-vectorized way using apply, and a faster vectorized way. Suppose you have a DataFrame like this:
>>> df = pandas.DataFrame({'date': pandas.date_range('2015/05/01', '2015/05/03')})
>>> df
date
0 2015-05-01
1 2015-05-02
2 2015-05-03
You can define a function that returns a Series for each value, and then apply it to the column:
def dateComponents(date):
    return pandas.Series([date.year, date.month, date.day], index=["Year", "Month", "Day"])
>>> df.date.apply(dateComponents)
   Year  Month  Day
0 2015 5 1
1 2015 5 2
2 2015 5 3
In older versions of pandas this was the only option, since there was no vectorized way to access the individual date components; newer versions expose them through the .dt accessor (a sketch follows the next example). However, in some cases you can use vectorized operations:
>>> df = pandas.DataFrame({'a': ["Hello", "There", "Pal"]})
>>> df
a
0 Hello
1 There
2 Pal
>>> pandas.DataFrame({'FirstChar': df.a.str[0], 'Length': df.a.str.len()})
FirstChar Length
0 H 5
1 T 5
2 P 3
Here again the operation is vectorized by operating directly on the values instead of applying a function elementwise. In this case, we have two vectorized operations (getting first character and getting the string length), and then we wrap the results in another call to DataFrame to create separate columns for each of the two kinds of results.
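Returning to the date case: modern pandas (0.15 and later) exposes the same date components in vectorized form through the Series .dt accessor, so apply is no longer the only option there either. A minimal sketch using the date frame defined above:

import pandas

df = pandas.DataFrame({'date': pandas.date_range('2015/05/01', '2015/05/03')})

# Vectorized access to the date components via the .dt accessor;
# each expression returns a whole column at once.
pandas.DataFrame({'Year': df.date.dt.year,
                  'Month': df.date.dt.month,
                  'Day': df.date.dt.day})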
I normally use apply for this kind of thing; it's basically the DataFrame version of map (the axis parameter lets you decide whether to apply your function to rows or columns):
df.apply(lambda row: row.a*row.b*row.c, axis=1)
or
df.apply(np.prod, axis=1)
0 8
1 30
2 72
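Note that for a plain row product you don't need apply at all; DataFrame.prod computes the same result vectorized along the requested axis:

# Vectorized row-wise product; same output as df.apply(np.prod, axis=1).
df.prod(axis=1)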
I have a dataframe that looks like this:
Sweep Index
Sweep0001 0 -70.434570
1 -67.626953
2 -68.725586
3 -70.556641
4 -71.899414
5 -69.946289
6 -63.964844
7 -73.974609
...
Sweep0039 79985 -63.964844
79986 -66.406250
79987 -67.993164
79988 -68.237305
79989 -66.894531
79990 -71.411133
I want to slice out different ranges of Sweeps.
So for example, I want Sweep0001 : Sweep0003, Sweep0009 : Sweep0015, etc.
I know I can do this in separate lines with ix, i.e.:
df.ix['Sweep0001':'Sweep0003']
df.ix['Sweep0009':'Sweep0015']
And then put those back together into one dataframe (I'm doing this so I can average sweeps together, but I need to select some of them and remove others).
Is there a way to do that selection in one line though? I.e. without having to slice each piece separately, followed by recombining all of it into one dataframe.
Use pandas IndexSlice:
import pandas as pd
idx = pd.IndexSlice
df.loc[idx[["Sweep0001", "Sweep0002", ..., "Sweep0003", "Sweep0009", ..., "Sweep0015"]]]
You can retrieve the labels you want this way:
list1 = df.index.get_level_values(0).unique()
list2 = [x for x in list1]
list3 = list2[1:4]         # for your Sweep0001:Sweep0003
list3.extend(list2[9:16])  # for your Sweep0009:Sweep0015
df.loc[idx[list3]]         # note that you need one less set of "[]" around
                           # list3, as the list comes with its own "[]"
In case you want to also slice by columns you can use:
df.loc[idx[list3], :]                 # same as above, including all columns
df.loc[idx[list3], :"column label"]   # returns data up to that "column label"
More information on slicing is on the Pandas website (http://pandas.pydata.org/pandas-docs/stable/advanced.html#using-slicers) or in this similar Stackoverflow Q/A: Python Pandas slice multiindex by second level index (or any other level)
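If the goal is literally to do the selection in one line while keeping the range syntax from the question, you can also concatenate ordinary .loc label slices; a minimal sketch, assuming the first index level is sorted (label slices on a sorted level include both endpoints):

import pandas as pd

# Concatenate several contiguous label ranges into one dataframe.
subset = pd.concat([df.loc['Sweep0001':'Sweep0003'],
                    df.loc['Sweep0009':'Sweep0015']])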