Where did my numbers go when adding index to DataFrame? - python

My integers become NaNs when I add the index to the DataFrame.
I run this:
newRows = {'guavas': [10, 20],
'pears': [111,222],
'avocados':[200,3000]}
newDF = pd.DataFrame(newRows)
print (newDF)
newDF = pd.DataFrame(newDF, index=['Thriftway', 'Meijer'])
newDF.index.name = 'Store'
print(newDF)
and I get this:
   guavas  pears  avocados
0      10    111       200
1      20    222      3000
           guavas  pears  avocados
Store
Thriftway     NaN    NaN       NaN
Meijer        NaN    NaN       NaN

The "old" newDF has index [0, 1] while the "new" newDF has index ['Thriftway', 'Meijer']. When you call the DataFrame constructor with a DataFrame, i.e. pd.DataFrame(newDF, index=['Thriftway', 'Meijer']), pandas internally reindexes newDF with the list passed in the index argument.
Labels in the new index that have no corresponding row in the DataFrame are assigned NaN. The index [0, 1] and the index ['Thriftway', 'Meijer'] have no labels in common, so the result is a DataFrame containing only NaN.
To appreciate this try running the following:
import pandas as pd
newRows = {'guavas': [10, 20],
'pears': [111,222],
'avocados':[200,3000]}
newDF = pd.DataFrame(newRows)
print (newDF)
newDF = pd.DataFrame(newDF, index=['Thriftway', 'Meijer', 0, 1])
newDF.index.name = 'Store'
print(newDF)
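and notice that the rows labelled 0 and 1 now contain the old data, while the rows for 'Thriftway' and 'Meijer' are still NaN. Calling reindex on the existing DataFrame with the new labels aligns in the same way, so it also produces NaN rather than relabelling the rows: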
import pandas as pd
newRows = {'guavas': [10, 20],
'pears': [111,222],
'avocados':[200,3000]}
newDF = pd.DataFrame(newRows)
print(newDF)
newDF = newDF.reindex(['Thriftway', 'Meijer'])
newDF.index.name = 'Store'
You can even reproduce what pandas is doing internally by using the index-argument of reindex:
newDF.reindex(index=['Thriftway', 'Meijer'])
The result is, as before, a DataFrame where labels that were not in the DataFrame before have been assigned NaN:
           guavas  pears  avocados
Thriftway     NaN    NaN       NaN
Meijer        NaN    NaN       NaN
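Because both the constructor call and reindex align on the existing labels, giving the rows new names means replacing the index itself rather than aligning to it. A minimal sketch, assuming the two rows should simply be relabelled in their current order (the name relabelled is introduced here only for illustration):

```python
import pandas as pd

newRows = {'guavas': [10, 20],
           'pears': [111, 222],
           'avocados': [200, 3000]}
newDF = pd.DataFrame(newRows)

# Work on a copy so the original frame keeps its 0, 1 index.
relabelled = newDF.copy()

# Assigning to .index replaces the labels by position;
# the values stay attached to their rows.
relabelled.index = pd.Index(['Thriftway', 'Meijer'], name='Store')
print(relabelled)
```

The new index must have exactly as many labels as the DataFrame has rows, otherwise pandas raises a ValueError.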

newDF = pd.DataFrame(newDF, index=['Thriftway', 'Meijer'])
In the line above, you are passing both a DataFrame and an index to pd.DataFrame().
From the source code of pandas.DataFrame(), I picked out the relevant code below, assuming that data is a DataFrame:
def __init__(
    self,
    data=None,
    index: Optional[Axes] = None,
    columns: Optional[Axes] = None,
    dtype: Optional[Dtype] = None,
    copy: bool = False,
):
    if isinstance(data, BlockManager):
        if index is None and columns is None and dtype is None and copy is False:
            # GH#33357 fastpath
            NDFrame.__init__(self, data)
            return
        mgr = self._init_mgr(
            data, axes={"index": index, "columns": columns}, dtype=dtype, copy=copy
        )
If index is given, pandas.DataFrame() creates a DataFrame with the same columns as the passed DataFrame, reindexed to the given labels. Labels that do not exist in the passed DataFrame get rows filled with NaN.
If index is not given, it creates a DataFrame identical to the passed one, including index, columns and data.
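The alignment behaviour can be checked with a small experiment. This is a sketch with made-up two-column data; the variable names are illustrative and not taken from the pandas source:

```python
import pandas as pd

df = pd.DataFrame({'guavas': [10, 20], 'pears': [111, 222]})

# No label in common with the original index [0, 1]: every cell is NaN.
no_overlap = pd.DataFrame(df, index=['Thriftway', 'Meijer'])

# Label 1 exists in the original, so that row keeps its values;
# 'Meijer' does not exist, so that row is NaN.
partial_overlap = pd.DataFrame(df, index=[1, 'Meijer'])

# reindex returns the same rows for the same labels.
via_reindex = df.reindex([1, 'Meijer'])

print(no_overlap)
print(partial_overlap)
print(via_reindex)
```

Note that the integer columns are upcast to float as soon as a NaN row is introduced.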

As far as I understand, you want to set the index of your DataFrame to something other than 0, 1. However,
newDF = pd.DataFrame(newDF, index=['Thriftway', 'Meijer'])
selects the rows of newDF that carry the given index labels (['Thriftway', 'Meijer']). Since newDF currently has no rows with these labels, the column values for them are filled with NaN.
There are two possible ways to set up your custom index:
Specify the index when you create your DataFrame:
newRows = {'guavas': [10, 20],
'pears': [111,222],
'avocados':[200,3000]}
newDF = pd.DataFrame(newRows, index=['Thriftway', 'Meijer'])
newDF.index.name = 'Store'
print(newDF)
Or use set_index afterwards:
newRows = {'guavas': [10, 20],
'pears': [111,222],
'avocados':[200,3000]}
newDF = pd.DataFrame(newRows)
newDF = newDF.set_index(pd.Index(['Thriftway', 'Meijer']))
newDF.index.name = 'Store'
print(newDF)

Related

How to compare coordinates in two dataframes?

I have two dataframes
df1
x1  y1  x2    y2    label
0   0   1240  1755  label1
0   0   1240  2     label2
df2
x1      y1     x2      y2     text
992.0   943.0  1166.0  974.0  tex1
1110.0  864.0  1166.0  890.0  text2
Based on a condition like the following:
if df1['x1'] >= df2['x1'] or df1['y1'] >= df2['y1']:
    # I want to add a new column 'text' in df1 with the text from df2.
    df1['text'] = df2['text']
What's more, it is possible in df2 to have more than one row that makes the above-mentioned condition True, so I will need to add another if statement for df2 to get the best match.
My problem here is not the conditions but how am I supposed to approach the interaction between both data frames. Any help, or advice would be appreciated.
If you want to check each row of df1 against every row of df2 and return a match, you can do it with the .apply() function on df1 and use df2 as a lookup table.
NOTE: In the example below I return only the first match (by using .iloc[0]), not all the matches.
Create two dummy dataframes
import pandas as pd
df1 = pd.DataFrame({'x1': [1, 2, 3], 'y1': [1, 5, 6]})
df2 = pd.DataFrame({'x1': [11, 1, 13], 'y1': [3, 52, 26], 'text': ['text1', 'text2', 'text3']})
Create a lookup function
def apply_condition(row, df):
    condition = ((row['x1'] >= df['x1']) | (row['y1'] >= df['y1']))
    return df[condition]['text'].iloc[0]  # ATTENTION: only the first match is returned
Create new column and print results
df1['text'] = df1.apply(lambda row: apply_condition(row, df2), axis=1)
df1.head()
Result:
   x1  y1   text
0   1   1  text2
1   2   5  text1
2   3   6  text1
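One caveat: .iloc[0] raises an IndexError when no row of the lookup table satisfies the condition. A hedged variant that returns None in that case is sketched below; the function name first_match_or_none and the third row of df1 (chosen so that nothing matches) are assumptions made for this illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'x1': [1, 2, 0], 'y1': [1, 5, 0]})
df2 = pd.DataFrame({'x1': [11, 1, 13], 'y1': [3, 52, 26],
                    'text': ['text1', 'text2', 'text3']})

def first_match_or_none(row, lookup):
    """Return the text of the first matching lookup row, or None if none match."""
    condition = (row['x1'] >= lookup['x1']) | (row['y1'] >= lookup['y1'])
    matches = lookup.loc[condition, 'text']
    return matches.iloc[0] if not matches.empty else None

# The last row of df1 (0, 0) matches nothing in df2 and gets None.
df1['text'] = df1.apply(lambda row: first_match_or_none(row, df2), axis=1)
print(df1)
```

To pick the best match instead of the first one, the matches could be sorted by whatever ranking criterion applies before taking .iloc[0].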

Change column values based on other dataframe columns

I have two dataframes that look like this
df1 ==
IDLocation x-coord y-coord
1 -1.546 7.845
2 3.256 1.965
.
.
35 5.723 -2.724
df2 ==
PIDLocation DIDLocation
14 5
3 2
7 26
I want to replace the columns PIDLocation, DIDLocation with Px-coord, Py-coord, Dx-coord, Dy-coord such that the two columns PIDLocation, DIDLocation are IDLocation and each IDLocation corresponds to an x-coord and y-coord in the first dataframe.
If you set the ID column as the index of df1, you can get the coord values by indexing. I changed the values in df2 in the example below to avoid index errors that would result from not having the full dataset.
import pandas as pd
df1 = pd.DataFrame({'IDLocation': [1, 2, 35],
'x-coord': [-1.546, 3.256, 5.723],
'y-coord': [7.845, 1.965, -2.724]})
df2 = pd.DataFrame({'PIDLocation': [35, 1, 2],
'DIDLocation': [2, 1, 35]})
df1.set_index('IDLocation', inplace=True)
df2['Px-coord'] = [df1['x-coord'].loc[i] for i in df2.PIDLocation]
df2['Py-coord'] = [df1['y-coord'].loc[i] for i in df2.PIDLocation]
df2['Dx-coord'] = [df1['x-coord'].loc[i] for i in df2.DIDLocation]
df2['Dy-coord'] = [df1['y-coord'].loc[i] for i in df2.DIDLocation]
del df2['PIDLocation']
del df2['DIDLocation']
print(df2)
Px-coord Py-coord Dx-coord Dy-coord
0 5.723 -2.724 3.256 1.965
1 -1.546 7.845 -1.546 7.845
2 3.256 1.965 5.723 -2.724
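The same lookup can also be written with Series.map instead of list comprehensions. This is a sketch using the example data above; the names coords and result are introduced here for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'IDLocation': [1, 2, 35],
                    'x-coord': [-1.546, 3.256, 5.723],
                    'y-coord': [7.845, 1.965, -2.724]})
df2 = pd.DataFrame({'PIDLocation': [35, 1, 2],
                    'DIDLocation': [2, 1, 35]})

coords = df1.set_index('IDLocation')

# Series.map looks up each ID in the index of the Series passed to it;
# IDs that are missing from df1 become NaN instead of raising a KeyError.
result = pd.DataFrame({
    'Px-coord': df2['PIDLocation'].map(coords['x-coord']),
    'Py-coord': df2['PIDLocation'].map(coords['y-coord']),
    'Dx-coord': df2['DIDLocation'].map(coords['x-coord']),
    'Dy-coord': df2['DIDLocation'].map(coords['y-coord']),
})
print(result)
```

Unlike .loc inside a list comprehension, this does not fail when df2 references an ID that df1 lacks, which may or may not be the behaviour you want.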

extract values from a data frame

The first and the second data frames are as below:
import pandas as pd
d = {'0': [2154,799,1023,4724], '1': [27, 2981, 952,797],'2':[4905,569,4767,569]}
df1 = pd.DataFrame(data=d)
and
d={'PART_NO': ['J661-03982','661-08913', '922-8972','661-00352','661-06291',''], 'PART_NO_ENCODED': [2154,799,1023,27,569]}
df2 = pd.DataFrame(data=d)
I want to get the corresponding part_no for each row in df1 so the resulting data frame should look like this:
d={'PART_NO': ['J661-03982','661-00352',''], 'PART_NO_ENCODED': [2154,27,4905]}
df3 = pd.DataFrame(data=d)
This I can achieve like this:
df2.set_index('PART_NO_ENCODED').reindex(df1.iloc[0,:]).reset_index().rename(columns={0:'PART_NO_ENCODED'})
But instead of passing one row at a time, as in reindex(df1.iloc[0,:]), I want to get the corresponding part_no for all the rows in df1. Please help?
You can use the second dataframe as a dictionary of replacements:
df3 = df1.replace(df2.set_index('PART_NO_ENCODED').to_dict()['PART_NO'])
The values that are not in df2, will not be replaced. They have to be identified and discarded:
df3 = df3[df1.isin(df2['PART_NO_ENCODED'].tolist())]
# 0 1 2
#0 J661-03982 661-00352 NaN
#1 661-08913 NaN 661-06291
#2 922-8972 NaN NaN
#3 NaN NaN 661-06291
You can later replace the missing values with '' or any other value of your choice with fillna.
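An alternative that combines the replacement and the filtering is to map every column of df1 through a lookup Series. This is a sketch, assuming df2 pairs exactly five part numbers with the five codes (the trailing empty string in the question's PART_NO list is left out so that both lists have the same length):

```python
import pandas as pd

df1 = pd.DataFrame({'0': [2154, 799, 1023, 4724],
                    '1': [27, 2981, 952, 797],
                    '2': [4905, 569, 4767, 569]})
# Assumption: five part numbers paired with five codes.
df2 = pd.DataFrame({'PART_NO': ['J661-03982', '661-08913', '922-8972',
                                '661-00352', '661-06291'],
                    'PART_NO_ENCODED': [2154, 799, 1023, 27, 569]})

# Series indexed by code, with the part number as value.
lookup = df2.set_index('PART_NO_ENCODED')['PART_NO']

# Map each column of df1 through the lookup; unknown codes become NaN,
# which fillna then turns into empty strings.
df3 = df1.apply(lambda column: column.map(lookup)).fillna('')
print(df3)
```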

Python Pandas: Construct a DataFrame with MultiIndex and Dict

I have dict that I would like to turn into a DataFrame with MultiIndex.
The dict is:
dikt = {'bloomberg': Timestamp('2009-01-26 10:00:00'),
'investingcom': Timestamp('2009-01-01 09:00:00')}
I construct a MultiIndex such as follow:
MI= MultiIndex(levels=[['Existing Home Sales MoM'], ['investingcom', 'bloomberg']],
               codes=[[0, 0], [0, 1]],
               names=['indicator', 'source'])
Then a DataFrame as such:
df = pd.DataFrame(index = MI, columns=["datetime"],data =np.full((2,1),np.nan))
Then lastly I fill the df with data stored in a dict such :
for key in ['investingcom', 'bloomberg']:
    df.loc[('Existing Home Sales MoM', key), "datetime"] = dikt[key]
and get the expected result:
But would there be a more concise way of doing so by passing the dikt directly into the construction of the df such as
df = pd.DataFrame(index = MI, columns=["datetime"],data =dikt)
so as to combine the 2 last steps in 1?
You can create a DataFrame from a dictionary using from_dict:
pd.DataFrame.from_dict(dikt, orient='index')
0
bloomberg 2009-01-26 10:00:00
investingcom 2009-01-01 09:00:00
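You can chain the column and index definitions to get the result you're after in one step. Note that set_index(MI) assigns the new labels by position, so the rows must already be in the same order as the entries of MI: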
pd.DataFrame.from_dict(dikt, orient='index') \
.rename(columns={0: 'datetime'}) \
.set_index(MI)
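Since set_index attaches labels purely by position, a variant that looks each value up by its source name avoids depending on the order of the dict. A minimal sketch, assuming the MultiIndex is built with from_product (an illustrative choice, not the construction used in the question):

```python
import pandas as pd

dikt = {'bloomberg': pd.Timestamp('2009-01-26 10:00:00'),
        'investingcom': pd.Timestamp('2009-01-01 09:00:00')}

MI = pd.MultiIndex.from_product(
    [['Existing Home Sales MoM'], ['investingcom', 'bloomberg']],
    names=['indicator', 'source'])

# Pull the values out of the dict in the order of the 'source' level,
# so every timestamp lands on the row that carries its own label.
sources = MI.get_level_values('source')
df = pd.DataFrame({'datetime': [dikt[s] for s in sources]}, index=MI)
print(df)
```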

Is there a way to copy only the structure (not the data) of a Pandas DataFrame?

I received a DataFrame from somewhere and want to create another DataFrame with the same number and names of columns and rows (indexes). For example, suppose that the original data frame was created as
import pandas as pd
df1 = pd.DataFrame([[11,12],[21,22]], columns=['c1','c2'], index=['i1','i2'])
I copied the structure by explicitly defining the columns and index:
df2 = pd.DataFrame(columns=df1.columns, index=df1.index)
I don't want to copy the data, otherwise I could just write df2 = df1.copy(). In other words, after df2 is created it must contain only NaN elements:
In [1]: df1
Out[1]:
c1 c2
i1 11 12
i2 21 22
In [2]: df2
Out[2]:
c1 c2
i1 NaN NaN
i2 NaN NaN
Is there a more idiomatic way of doing it?
That's a job for reindex_like. Start with the original:
df1 = pd.DataFrame([[11, 12], [21, 22]], columns=['c1', 'c2'], index=['i1', 'i2'])
Construct an empty DataFrame and reindex it like df1:
pd.DataFrame().reindex_like(df1)
Out:
c1 c2
i1 NaN NaN
i2 NaN NaN
In version 0.18 of pandas, the DataFrame constructor has no option for creating a DataFrame like another DataFrame with NaN instead of the values.
The code you use, df2 = pd.DataFrame(columns=df1.columns, index=df1.index), is the most logical way. The only way to improve on it is to spell out even more clearly what you are doing by adding data=None, so that other coders directly see that you intentionally leave the data out of the new DataFrame.
TLDR: So my suggestion is:
Explicit is better than implicit
df2 = pd.DataFrame(data=None, columns=df1.columns, index=df1.index)
Very much like yours, but more spelled out.
Not exactly answering this question, but a similar one for people coming here via a search engine
My case was creating a copy of the data frame without data and without index. One can achieve this by doing the following. This will maintain the dtypes of the columns.
empty_copy = df.drop(df.index)
Let's start with some sample data
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']],
...: columns=['num', 'char'])
In [3]: df
Out[3]:
num char
0 1 a
1 2 b
2 3 c
In [4]: df.dtypes
Out[4]:
num int64
char object
dtype: object
Now let's use a simple DataFrame initialization using the columns of the original DataFrame but providing no data:
In [5]: empty_copy_1 = pd.DataFrame(data=None, columns=df.columns)
In [6]: empty_copy_1
Out[6]:
Empty DataFrame
Columns: [num, char]
Index: []
In [7]: empty_copy_1.dtypes
Out[7]:
num object
char object
dtype: object
As you can see, the column data types are not the same as in our original DataFrame.
If you want to preserve the column data types, you need to construct the DataFrame one Series at a time:
In [8]: empty_copy_2 = pd.DataFrame.from_items([
...: (name, pd.Series(data=None, dtype=series.dtype))
...: for name, series in df.iteritems()])
In [9]: empty_copy_2
Out[9]:
Empty DataFrame
Columns: [num, char]
Index: []
In [10]: empty_copy_2.dtypes
Out[10]:
num int64
char object
dtype: object
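DataFrame.from_items was removed in pandas 1.0 and iteritems in pandas 2.0, so the session above no longer runs on current versions. A sketch of the same idea with a dict comprehension, which should work on old and new releases alike (the name empty_copy is illustrative):

```python
import pandas as pd

df = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']], columns=['num', 'char'])

# One empty Series per column, each carrying the dtype of the original column.
empty_copy = pd.DataFrame({name: pd.Series(dtype=dtype)
                           for name, dtype in df.dtypes.items()})

print(empty_copy)
print(empty_copy.dtypes)
```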
A simple alternative: first copy the basic structure (index type and columns with their dtypes) from the original DataFrame (df1) into df2
df2 = df1.iloc[0:0]
Then fill df2 with empty rows, one per row of df1. Reindexing on the index of df1 does this without a loop (DataFrame.append was removed in pandas 2.0):
df2 = df2.reindex(df1.index)
To preserve column type you can use the astype method,
like pd.DataFrame(columns=df1.columns).astype(df1.dtypes)
import pandas as pd
df1 = pd.DataFrame(
[
[11, 12, 'Alice'],
[21, 22, 'Bob']
],
columns=['c1', 'c2', 'c3'],
index=['i1', 'i2']
)
df2 = pd.DataFrame(columns=df1.columns).astype(df1.dtypes)
print(df2.shape)
print(df2.dtypes)
output:
(0, 3)
c1 int64
c2 int64
c3 object
dtype: object
You can simply mask by notna() i.e
df1 = pd.DataFrame([[11, 12], [21, 22]], columns=['c1', 'c2'], index=['i1', 'i2'])
df2 = df1.mask(df1.notna())
c1 c2
i1 NaN NaN
i2 NaN NaN
A simple way to copy df structure into df2 is:
df2 = pd.DataFrame(columns=df.columns)
This has worked for me in pandas 0.22:
df2 = pd.DataFrame(index=df.index.delete(slice(None)), columns=df.columns)
Convert types:
df2 = df2.astype(df.dtypes)
delete(slice(None)) is there in case you do not want to keep the values of the index.
I know this is an old question, but I thought I would add my two cents.
def df_cols_like(df):
    """
    Returns an empty data frame with the same column names and types as df
    """
    df2 = pd.DataFrame({i[0]: pd.Series(dtype=i[1])
                        for i in df.dtypes.items()},
                       columns=df.dtypes.index)
    return df2
This approach centers around the df.dtypes attribute of the input data frame, df, which is a pd.Series. A pd.DataFrame is constructed from a dictionary of empty pd.Series objects named using the input column names with the column order being taken from the input df.
