I have a dict that I would like to turn into a DataFrame with a MultiIndex.
The dict is:
dikt = {'bloomberg': Timestamp('2009-01-26 10:00:00'),
        'investingcom': Timestamp('2009-01-01 09:00:00')}
I construct a MultiIndex as follows:
MI = MultiIndex(levels=[['Existing Home Sales MoM'], ['investingcom', 'bloomberg']],
                labels=[[0, 0], [0, 1]],
                names=['indicator', 'source'])
Then a DataFrame as such:
df = pd.DataFrame(index=MI, columns=["datetime"], data=np.full((2, 1), np.nan))
Then, lastly, I fill the df with the data stored in the dict:
for key in ['investingcom', 'bloomberg']:
    df.loc[('Existing Home Sales MoM', key), "datetime"] = dikt[key]
and get the expected result:
But is there a more concise way of doing so, by passing dikt directly into the construction of the df, such as
df = pd.DataFrame(index=MI, columns=["datetime"], data=dikt)
so as to combine the last 2 steps into 1?
You can create a DataFrame from a dictionary using from_dict:
pd.DataFrame.from_dict(dikt, orient='index')
                                0
bloomberg     2009-01-26 10:00:00
investingcom  2009-01-01 09:00:00
You can chain the column and index definitions to get the result you're after in 1 step:
pd.DataFrame.from_dict(dikt, orient='index') \
.rename(columns={0: 'datetime'}) \
.set_index(MI)
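One caveat worth noting: set_index assigns the new MultiIndex purely by position, and the row order produced by from_dict (bloomberg first) does not match the order encoded in MI (investingcom first), so the timestamps can end up under the wrong source. A defensive variant of the same idea (a sketch, reusing MI and the 'source' level name defined above) reindexes the rows to the MultiIndex order first:

pd.DataFrame.from_dict(dikt, orient='index') \
  .rename(columns={0: 'datetime'}) \
  .reindex(MI.get_level_values('source')) \
  .set_index(MI)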
My integers become NaNs when I add the index to the DataFrame.
I run this:
newRows = {'guavas': [10, 20],
           'pears': [111, 222],
           'avocados': [200, 3000]}
newDF = pd.DataFrame(newRows)
print (newDF)
newDF = pd.DataFrame(newDF, index=['Thriftway', 'Meijer'])
newDF.index.name = 'Store'
print(newDF)
and I get this:
   guavas  pears  avocados
0      10    111       200
1      20    222      3000

           guavas  pears  avocados
Store
Thriftway     NaN    NaN       NaN
Meijer        NaN    NaN       NaN
The "old" newDF has index [0, 1] while the "new" newDF has index ['Thriftway', 'Meijer']. When using the DataFrame-constructor with a DataFrame, i.e. pd.DataFrame(newDF, index=['Thriftway', 'Meijer']), pandas internally does a reindex with the list in the index-argument on the index of newDF.
Values in the new index that do not have corresponding records in the DataFrame are assigned NaN. The index [0, 1] and the index ['Thriftway', 'Meijer'] have no overlapping values thus result is a DataFrame with NaN as values.
To appreciate this try running the following:
import pandas as pd
newRows = {'guavas': [10, 20],
           'pears': [111, 222],
           'avocados': [200, 3000]}
newDF = pd.DataFrame(newRows)
print (newDF)
newDF = pd.DataFrame(newDF, index=['Thriftway', 'Meijer', 0, 1])
newDF.index.name = 'Store'
print(newDF)
and notice that the new DataFrame now contains the old data. To achieve what you want you can instead reindex the existing DataFrame with the new index like so:
import pandas as pd
newRows = {'guavas': [10, 20],
           'pears': [111, 222],
           'avocados': [200, 3000]}
newDF = pd.DataFrame(newRows)
print(newDF)
newDF = newDF.reindex(['Thriftway', 'Meijer'])
newDF.index.name = 'Store'
You can even reproduce what pandas is doing internally by using the index-argument of reindex:
newDF.reindex(index=['Thriftway', 'Meijer'])
The result is, as before, a DataFrame where labels that were not in the DataFrame before have been assigned NaN:
           guavas  pears  avocados
Thriftway     NaN    NaN       NaN
Meijer        NaN    NaN       NaN
newDF = pd.DataFrame(newDF, index=['Thriftway', 'Meijer'])
In the above line, you are passing both a DataFrame and an index to pd.DataFrame().
From the source code of pandas.DataFrame(), here is the relevant part, under the assumption that data is a DataFrame:
def __init__(
    self,
    data=None,
    index: Optional[Axes] = None,
    columns: Optional[Axes] = None,
    dtype: Optional[Dtype] = None,
    copy: bool = False,
):
    if isinstance(data, BlockManager):
        if index is None and columns is None and dtype is None and copy is False:
            # GH#33357 fastpath
            NDFrame.__init__(self, data)
            return

        mgr = self._init_mgr(
            data, axes={"index": index, "columns": columns}, dtype=dtype, copy=copy
        )
If index is given, pandas.DataFrame() will create a DataFrame with the same columns as the passed DataFrame but reindexed to the new index; every cell whose index label does not exist in the original is filled with NaN.
If index is not given, it will create a DataFrame identical to the passed DataFrame, including index, columns and data.
As far as I understand, you want to set the index in your DataFrame to something other than 0, 1. However,
newDF = pd.DataFrame(newDF, index=['Thriftway', 'Meijer'])
actually selects the rows for the given index values (['Thriftway', 'Meijer']) from newDF. And since newDF currently has no values for these two index values, it writes the column values as NaN for them.
Two possible solutions for setting up your custom index:
Specify the index when you create your DataFrame:
newRows = {'guavas': [10, 20],
           'pears': [111, 222],
           'avocados': [200, 3000]}
newDF = pd.DataFrame(newRows, index=['Thriftway', 'Meijer'])
newDF.index.name = 'Store'
print(newDF)
Use set_index afterwards:
newRows = {'guavas': [10, 20],
           'pears': [111, 222],
           'avocados': [200, 3000]}
newDF = pd.DataFrame(newRows)
newDF = newDF.set_index(pd.Index(['Thriftway', 'Meijer']))
newDF.index.name = 'Store'
print(newDF)
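Both variants should print the same result:

           guavas  pears  avocados
Store
Thriftway      10    111       200
Meijer         20    222      3000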
I would like to group the ids by the Type column and, for each grouped stock, return the first row where the Value column is not NaN, copying it into a separate data frame.
I got the following so far:
Dummy data:
df1 = {'Date': ['04.12.1998', '05.12.1998', '06.12.1998', '04.12.1998', '05.12.1998', '06.12.1998'],
       'Type': [1, 1, 1, 2, 2, 2],
       'Value': ['NaN', 100, 120, 'NaN', 'NaN', 20]}
df2 = pd.DataFrame(df1, columns = ['Date', 'Type', 'Value'])
print (df2)
         Date  Type Value
0  04.12.1998     1   NaN
1  05.12.1998     1   100
2  06.12.1998     1   120
3  04.12.1998     2   NaN
4  05.12.1998     2   NaN
5  06.12.1998     2    20
import pandas as pd

selectedStockDates = {'Date': [], 'Type': [], 'Value': []}
selectedStockDates = pd.DataFrame(selectedStockDates, columns=['Date', 'Type', 'Value'])

first_valid_index = df2[['Value']].first_valid_index()
selectedStockDates.loc[df2.index[first_valid_index]] = df2.iloc[first_valid_index]
The code above should work for the first id, but I am struggling to apply this to all ids in the data frame. Does anyone know how to do this?
Let's mask the rows of the dataframe where the values in column Value are NaN, then group the dataframe on Type and aggregate using first:
df2['Value'] = pd.to_numeric(df2['Value'], errors='coerce')
df2.mask(df2['Value'].isna()).groupby('Type', as_index=False).first()
   Type        Date  Value
0   1.0  05.12.1998  100.0
1   2.0  06.12.1998   20.0
Just use groupby and first, but you need to make sure that your null values are np.nan and not strings like they are in your sample data:
df2.groupby('Type')['Value'].first()
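For completeness, a minimal sketch (reusing the to_numeric conversion from the first answer) that returns the whole first non-NaN row per Type, Date included:

df2['Value'] = pd.to_numeric(df2['Value'], errors='coerce')
# drop rows without a Value, then take the first remaining row per Type
df2.dropna(subset=['Value']).groupby('Type', as_index=False).first()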
I have a dataframe (DF1) as such (each Personal-ID will have 3 dates associated with that ID):
I have created a dataframe (DF_ID) with 1 row for each Personal-ID and a column for each respective date (currently blank), and would like to load/loop the 3 dates per Personal-ID from DF1 into the respective date columns, so that the final dataframe looks as such:
I am trying to learn Python and have tried a number of coding scripts to accomplish this, such as:
for index, row in df_bnp_5.iterrows():
    df_id['Date-1'] = row.loc[0, 'hv_lab_test_dt']
    df_id['Date-2'] = row.loc[1, 'hv_lab_test_dt']
    df_id['Date-3'] = row.loc[2, 'hv_lab_test_dt']

for i in range(len(df_bnp_5)):
    df_id['Date-1'] = df1.iloc[i, 0]
    df_id['Date-2'] = df1.iloc[i, 2]
Any assistance would be appreciated.
Thank You!
Here is one way. I created a 'helper' column to arrange the dates for each Personal-ID.
import pandas as pd
# create data frame
df = pd.DataFrame({'Personal-ID': [1, 1, 1, 5, 5, 5],
'Date': ['10/01/2019', '12/28/2019', '05/08/2020',
'01/19/2020', '06/05/2020', '07/19/2020']})
# change data type
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
# create grouping key
df['x'] = df.groupby('Personal-ID')['Date'].rank().astype(int)
# convert to wide table
df = df.pivot(index='Personal-ID', columns='x', values='Date')
# change column names
df = df.rename(columns={1: 'Date-1', 2: 'Date-2', 3: 'Date-3'})
print(df)
x                Date-1     Date-2     Date-3
Personal-ID
1            2019-10-01 2019-12-28 2020-05-08
5            2020-01-19 2020-06-05 2020-07-19
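If the rows are already in the desired order within each Personal-ID, a common alternative to rank() for building the helper column is groupby().cumcount(), which numbers rows by position instead of by date value (rank() orders by the Date values themselves and assumes no duplicate dates per ID):

# positional counter per Personal-ID, starting at 1
df['x'] = df.groupby('Personal-ID').cumcount() + 1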
I created a dataframe as:
df1 = pandas.read_csv(ifile_name, header=None, sep=r"\s+", usecols=[0,1,2,3,4],
                      index_col=[0,1,2], names=["year", "month", "day", "something1", "something2"])
Now I would like to create another dataframe where year > 2008. Hence I tried:
df2 = df1[df1.year>2008]
But I get this error:
AttributeError: 'DataFrame' object has no attribute 'year'
I guess it is not seeing "year" among the columns because I defined it within the index. But how can I select data based on year > 2008 in that case?
Get the level by name using MultiIndex.get_level_values and create a boolean mask for row selection:
df2 = df1[df1.index.get_level_values('year') > 2008]
If you plan to make modifications, create a copy of df1 so as to not operate on the view.
df2 = df1[df1.index.get_level_values('year') > 2008].copy()
You are correct that year is an index rather than a column. One solution is to use pd.DataFrame.query, which lets you use index names directly:
df = pd.DataFrame({'year': [2005, 2010, 2015], 'value': [1, 2, 3]})
df = df.set_index('year')
res = df.query('year > 2008')
print(res)
      value
year
2010      2
2015      3
Assuming your index is sorted:
df.loc[2008:]
Out[259]:
      value
year
2010      2
2015      3
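One caveat: .loc label slicing is inclusive of both endpoints, so df.loc[2008:] would also return rows for 2008 itself if any were present. To match the strict year > 2008 condition, start the slice one year later:

df.loc[2009:]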
I received a DataFrame from somewhere and want to create another DataFrame with the same number and names of columns and rows (indexes). For example, suppose that the original data frame was created as
import pandas as pd
df1 = pd.DataFrame([[11,12],[21,22]], columns=['c1','c2'], index=['i1','i2'])
I copied the structure by explicitly defining the columns and index:
df2 = pd.DataFrame(columns=df1.columns, index=df1.index)
I don't want to copy the data, otherwise I could just write df2 = df1.copy(). In other words, after df2 is created it must contain only NaN elements:
In [1]: df1
Out[1]:
    c1  c2
i1  11  12
i2  21  22
In [2]: df2
Out[2]:
     c1   c2
i1  NaN  NaN
i2  NaN  NaN
Is there a more idiomatic way of doing it?
That's a job for reindex_like. Start with the original:
df1 = pd.DataFrame([[11, 12], [21, 22]], columns=['c1', 'c2'], index=['i1', 'i2'])
Construct an empty DataFrame and reindex it like df1:
pd.DataFrame().reindex_like(df1)
Out:
     c1   c2
i1  NaN  NaN
i2  NaN  NaN
As of version 0.18 of pandas, the DataFrame constructor has no option for creating a dataframe like another dataframe with NaN instead of the values.
The code you use, df2 = pd.DataFrame(columns=df1.columns, index=df1.index), is the most logical way; the only way to improve on it is to spell out even more what you are doing, by adding data=None, so that other coders directly see that you intentionally leave out the data from this new DataFrame.
TLDR: So my suggestion is:
Explicit is better than implicit
df2 = pd.DataFrame(data=None, columns=df1.columns, index=df1.index)
Very much like yours, but more spelled out.
Not exactly answering this question, but a similar one, for people coming here via a search engine.
My case was creating a copy of the data frame without data and without an index. One can achieve this by doing the following. This will maintain the dtypes of the columns.
empty_copy = df.drop(df.index)
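Since drop only removes rows, the column dtypes are untouched; a quick illustrative check:

empty_copy = df.drop(df.index)
print(empty_copy.shape)   # (0, n_columns)
print(empty_copy.dtypes)  # identical to df.dtypes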
Let's start with some sample data
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']],
...: columns=['num', 'char'])
In [3]: df
Out[3]:
   num char
0    1    a
1    2    b
2    3    c
In [4]: df.dtypes
Out[4]:
num      int64
char    object
dtype: object
Now let's use a simple DataFrame initialization using the columns of the original DataFrame but providing no data:
In [5]: empty_copy_1 = pd.DataFrame(data=None, columns=df.columns)
In [6]: empty_copy_1
Out[6]:
Empty DataFrame
Columns: [num, char]
Index: []
In [7]: empty_copy_1.dtypes
Out[7]:
num     object
char    object
dtype: object
As you can see, the column data types are not the same as in our original DataFrame.
So, if you want to preserve the column dtype...
If you want to preserve the column data types, you need to construct the DataFrame one Series at a time:
In [8]: empty_copy_2 = pd.DataFrame.from_items([
...: (name, pd.Series(data=None, dtype=series.dtype))
...: for name, series in df.iteritems()])
In [9]: empty_copy_2
Out[9]:
Empty DataFrame
Columns: [num, char]
Index: []
In [10]: empty_copy_2.dtypes
Out[10]:
num      int64
char    object
dtype: object
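Note that DataFrame.from_items was deprecated in pandas 0.23 and removed in 1.0, and iteritems() is likewise gone as of pandas 2.0. On modern pandas, a plain dict comprehension gives the same result (a sketch of the same one-Series-at-a-time idea):

# build each column as an empty Series with the original dtype
empty_copy_2 = pd.DataFrame({name: pd.Series(dtype=series.dtype)
                             for name, series in df.items()})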
A simple alternative -- first copy the basic structure or indexes and columns with datatype from the original dataframe (df1) into df2
df2 = df1.iloc[0:0]
Then fill your dataframe with empty rows; pseudocode that will need to be adapted to better match your actual structure:
import numpy as np
s = pd.Series([np.nan, np.nan, np.nan], index=['Col1', 'Col2', 'Col3'])
for _ in range(len(df1)):  # loop through the rows in df1
    df2 = df2.append(s, ignore_index=True)  # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
To preserve the column types you can use the astype method,
like pd.DataFrame(columns=df1.columns).astype(df1.dtypes)
import pandas as pd
df1 = pd.DataFrame(
[
[11, 12, 'Alice'],
[21, 22, 'Bob']
],
columns=['c1', 'c2', 'c3'],
index=['i1', 'i2']
)
df2 = pd.DataFrame(columns=df1.columns).astype(df1.dtypes)
print(df2.shape)
print(df2.dtypes)
output:
(0, 3)
c1     int64
c2     int64
c3    object
dtype: object
You can simply mask by notna(), i.e.:
df1 = pd.DataFrame([[11, 12], [21, 22]], columns=['c1', 'c2'], index=['i1', 'i2'])
df2 = df1.mask(df1.notna())
     c1   c2
i1  NaN  NaN
i2  NaN  NaN
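An equivalent spelling uses where, which replaces values wherever the condition is False:

# df1.isna() is all False here, so every value is replaced with NaN
df2 = df1.where(df1.isna())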
A simple way to copy the df structure into df2 is:
df2 = pd.DataFrame(columns=df.columns)
Note that this copies only the column labels; the index and the column dtypes are not preserved.
This has worked for me in pandas 0.22:
df2 = pd.DataFrame(index=df.index.delete(slice(None)), columns=df.columns)
Convert types:
df2 = df2.astype(df.dtypes)
The delete(slice(None)) part removes all labels from the original index, for the case where you do not want to keep the index values.
I know this is an old question, but I thought I would add my two cents.
def df_cols_like(df):
    """
    Returns an empty data frame with the same column names and types as df
    """
    df2 = pd.DataFrame({i[0]: pd.Series(dtype=i[1])
                        for i in df.dtypes.iteritems()},
                       columns=df.dtypes.index)
    return df2
This approach centers on the df.dtypes attribute of the input data frame, df, which is a pd.Series. A pd.DataFrame is constructed from a dictionary of empty pd.Series objects, named using the input column names, with the column order taken from the input df.
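A hypothetical usage example (note that on pandas >= 2.0 the df.dtypes.iteritems() call inside the function needs to become df.dtypes.items()):

df = pd.DataFrame({'num': [1, 2], 'char': ['a', 'b']})
empty = df_cols_like(df)
print(empty.dtypes)  # num: int64, char: object, same as df.dtypes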