I tried the following code but the new column consists of only NAN values.
df['new'] = pd.Series(np.repeat(1, len(df)))
Can someone explain to me what the problem is here?
It is possible that the index of the DataFrame df does not match the index of the newly created Series. For example,
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [11, 22, 33, 44, 55]}, index=['r1','r2','r3','r4','r5'])
df['new'] = pd.Series(np.repeat(1, len(df)))
print(df)
and the output will be:
     a  new
r1  11  NaN
r2  22  NaN
r3  33  NaN
r4  44  NaN
r5  55  NaN
since the index of pd.Series(np.repeat(1, len(df))) is Int64Index([0, 1, 2, 3, 4], dtype='int64').
To prevent that, specify the index argument when creating the Series:
df['new'] = pd.Series(np.repeat(1, len(df)), index=df.index)
Alternatively, you can just pass a numpy array if the index is to be ignored:
df['new'] = np.repeat(1, len(df))
without needing to create a Series (in fact, df['new'] = 1 will do for this case). Using a Series is helpful when you need to align the new column with the existing DataFrame using the index.
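To see the alignment in action -- a minimal sketch, with made-up values chosen only for illustration:

import pandas as pd

df = pd.DataFrame({'a': [11, 22]}, index=['r1', 'r2'])

# assignment aligns on the index labels, not on position,
# so 'r1' receives 100 even though it is listed second
df['new'] = pd.Series([200, 100], index=['r2', 'r1'])
print(df)
#      a  new
# r1  11  100
# r2  22  200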
Related
I need to join dataframes with different columns created in a for-loop.
This is the question in a simplified version: I have made two dataframes.
The first dataframe has 5 columns, and the column numbers are not continuous (0, 2, 5 and 7 are missing).
The second has 6 columns, also not continuous (0, 6 and 7 are missing), and its columns do not completely match the first df's.
What I need to do is :
Step 1: create a new df with continuous column numbers 0,1,2,3,4,5,6,7,8.
Step 2: Add the rows of df1 and df2 under the matching column numbers. Wherever a dataframe has no value for a column number, the cell should be NaN.
Note : This has to be done in a loop as I have thousands of dataframes to merge
So the resulting dataframe will have all the columns 0 through 8, with NaN in every cell whose column number is missing from the source dataframe.
# store your dfs in a list
# df_list = [ ... ]

# the columns you want your final df to have
final_columns = range(9)

# add these columns with value None to your dfs if not there already
for df in df_list:
    for i in final_columns:
        if i not in df.columns:
            df[i] = None

# merge all of your dfs together
final_df = pd.concat(df_list, ignore_index=True)
final_df
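For instance, with two toy dataframes standing in for df_list (both frames are made up just for illustration):

import pandas as pd

df_list = [
    pd.DataFrame([[34, 56]], columns=[1, 3]),
    pd.DataFrame([[66, 77]], columns=[2, 4]),
]

final_columns = range(9)
for df in df_list:
    for i in final_columns:
        if i not in df.columns:
            df[i] = None

# final_df has one row per input frame and all of the columns 0-8,
# with None in the cells the source frames did not have
final_df = pd.concat(df_list, ignore_index=True)
print(final_df.sort_index(axis=1))  # sorting is only to display the columns in order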
Try concat + reindex:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([[34, 56, 66, 77, 77]], columns=[1, 3, 4, 6, 7])
df2 = pd.DataFrame([[34, 56, 66, 77, 77, 66]], columns=[1, 2, 3, 4, 5, 8])
# Collection of all DataFrames
dfs = (df1, df2)
# Concat
new_df = pd.concat(dfs, ignore_index=True).reindex(columns=np.arange(0, 9))
print(new_df)
new_df:
     0   1     2   3   4     5     6     7     8
0  NaN  34   NaN  56  66   NaN  77.0  77.0   NaN
1  NaN  34  56.0  66  77  77.0   NaN   NaN  66.0
I need to apply a function that takes subcolumns (aka Series) of multiindexed columns as arguments. I have come up with a solution that works, but I was curious if there was a more pythonic/proper pandas way to do this.
Let's say we have a function that takes two series as arguments and performs some user-defined operation on those series and returns a single series.
import pandas as pd
def user_defined_function(series1, series2):
    return 12 * (series1 * series2 / 3)
Let's create a dataframe with multiindexed columns.
data = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [10, 11, 12, 13],
        [14, 15, 16, 17]]

columns = (('A', 'sub_col_1'),
           ('A', 'sub_col_2'),
           ('B', 'sub_col_1'),
           ('B', 'sub_col_2'))

df = pd.DataFrame(data, columns=columns)
print(df)
           A                     B
   sub_col_1  sub_col_2  sub_col_1  sub_col_2
0          1          2          3          4
1          5          6          7          8
2         10         11         12         13
3         14         15         16         17
I want to apply my user_defined_function() to the sub columns of A and B.
Now if you try to use apply in the traditional way, pandas will traverse each column individually, passing a single series to the function each time. So you can't just do this:
df.apply(lambda x: user_defined_function(x['sub_col_1'], x['sub_col_2']))
You'll end up getting a KeyError, because pandas passes in a single series, not a normally indexed "sub dataframe".
So this is the solution I came up with.
level1_labels = set(df.columns.get_level_values(0))

processed_df = pd.DataFrame()
for label in level1_labels:
    data_to_apply_function_to = df[label]
    processed_series = user_defined_function(data_to_apply_function_to['sub_col_1'],
                                             data_to_apply_function_to['sub_col_2'])
    processed_df[label] = processed_series
print(processed_df)
       A       B
0    8.0    48.0
1  120.0   224.0
2  440.0   624.0
3  840.0  1088.0
This returns what I want it to. However, I am curious if there is a cleaner, more pythonic, proper way to do this.
You can groupby over the columns axis. Your function requires a Series so we'll need to squeeze if we want to select by label.
(df.groupby(level=0, axis=1)
   .apply(lambda gp: user_defined_function(gp.xs('sub_col_1', level=1, axis=1).squeeze(),
                                           gp.xs('sub_col_2', level=1, axis=1).squeeze()))
)
#        A       B
# 0    8.0    48.0
# 1  120.0   224.0
# 2  440.0   624.0
# 3  840.0  1088.0
A bit more error prone, though fine if you know all groups have the two Series in the same positions:
(df.groupby(level=0, axis=1)
   .apply(lambda gp: user_defined_function(gp.iloc[:, 0], gp.iloc[:, 1]))
)
It looks to me that this is a very custom case. It is actually possible to use apply within the level 0 columns, as follows:
import pandas as pd

# renamed because the original name was very long
def udf(series1, series2):
    return 12 * (series1 * series2 / 3)

col = "A"
df[col].apply(lambda x: udf(x['sub_col_1'], x['sub_col_2']), axis=1)\
       .to_frame()\
       .rename(columns={0: col})
returns
       A
0    8.0
1  120.0
2  440.0
3  840.0
But again, for the output you are looking for, you still need to loop:
out = []
for col in set(df.columns.get_level_values(0)):
    out.append(
        df[col].apply(lambda x: udf(x['sub_col_1'],
                                    x['sub_col_2']),
                      axis=1)
               .to_frame()
               .rename(columns={0: col}))
out = pd.concat(out, axis=1)
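Printing out then gives the same values as the loop in the question (the column order may vary, since sets are unordered):

print(out)
#        A       B
# 0    8.0    48.0
# 1  120.0   224.0
# 2  440.0   624.0
# 3  840.0  1088.0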
Given the following DataFrame:
      A     B
0 -10.0   NaN
1   NaN  20.0
2 -30.0   NaN
I want to merge columns A and B, filling the NaN cells in column A with the values from column B and then drop column B, resulting in a DataFrame like this:
      A
0 -10.0
1  20.0
2 -30.0
I have managed to solve this problem by using the iterrows() function.
Complete code example:
import numpy as np
import pandas as pd
example_data = [[-10, np.NaN], [np.NaN, 20], [-30, np.NaN]]
example_df = pd.DataFrame(example_data, columns = ['A', 'B'])
for index, row in example_df.iterrows():
    if pd.isnull(row['A']):
        row['A'] = row['B']
example_df = example_df.drop(columns = ['B'])
example_df
This seems to work fine, but I find this information in the documentation for iterrows():
You should never modify something you are iterating over.
So it seems like I'm doing it wrong.
What would be a better/recommended approach for achieving the same result?
Use Series.fillna with Series.to_frame:
df = df['A'].fillna(df['B']).to_frame()
#alternative
#df = df['A'].combine_first(df['B']).to_frame()
print(df)
      A
0 -10.0
1  20.0
2 -30.0
If there are more columns and you need the first non-missing value per row, back-fill the missing values along the columns and select the first column (using a one-element list, so that the result is a one-column DataFrame):
df = df.bfill(axis=1).iloc[:, [0]]
print(df)
      A
0 -10.0
1  20.0
2 -30.0
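For example, with a third column (a made-up column C, just to have more than two):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [-10.0, np.nan, np.nan],
                   'B': [np.nan, 20.0, np.nan],
                   'C': [1.0, 2.0, -30.0]})

# back fill along each row, then keep only the first column
df = df.bfill(axis=1).iloc[:, [0]]
print(df)
#       A
# 0 -10.0
# 1  20.0
# 2 -30.0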
UPDATE: This is no longer an issue since at least pandas version 0.18.1. Concatenating empty series doesn't drop them anymore so this question is out of date.
I want to create a pandas dataframe from a list of series using pd.concat. The problem is that when one of the series is empty, it doesn't get included in the resulting dataframe, which makes the dataframe the wrong dimensions when I then try to rename its columns with a multi-index.
UPDATE: Here's an example...
import pandas as pd
sers1 = pd.Series()
sers2 = pd.Series(['a', 'b', 'c'])
df1 = pd.concat([sers1, sers2], axis=1)
This produces the following dataframe:
>>> df1
0    a
1    b
2    c
dtype: object
But I want it to produce something like this:
>>> df2
     0  1
0  NaN  a
1  NaN  b
2  NaN  c
It does this if I put a single NaN value anywhere in sers1, but it seems like this should happen automatically even when some of my series are totally empty.
Passing an argument for levels will do the trick. Here's an example. First, the wrong way:
import pandas as pd
ser1 = pd.Series()
ser2 = pd.Series([1, 2, 3])
list_of_series = [ser1, ser2, ser1]
df = pd.concat(list_of_series, axis=1)
Which produces this:
>>> df
   0
0  1
1  2
2  3
But if we add some labels to the levels argument, it will include all the empty series too:
import pandas as pd
ser1 = pd.Series()
ser2 = pd.Series([1, 2, 3])
list_of_series = [ser1, ser2, ser1]
labels = range(len(list_of_series))
df = pd.concat(list_of_series, levels=labels, axis=1)
Which produces the desired dataframe:
>>> df
    0  1   2
0 NaN  1 NaN
1 NaN  2 NaN
2 NaN  3 NaN
I received a DataFrame from somewhere and want to create another DataFrame with the same number and names of columns and rows (indexes). For example, suppose that the original data frame was created as
import pandas as pd
df1 = pd.DataFrame([[11,12],[21,22]], columns=['c1','c2'], index=['i1','i2'])
I copied the structure by explicitly defining the columns and names:
df2 = pd.DataFrame(columns=df1.columns, index=df1.index)
I don't want to copy the data, otherwise I could just write df2 = df1.copy(). In other words, after df2 being created it must contain only NaN elements:
In [1]: df1
Out[1]:
    c1  c2
i1  11  12
i2  21  22

In [2]: df2
Out[2]:
    c1  c2
i1 NaN NaN
i2 NaN NaN
Is there a more idiomatic way of doing it?
That's a job for reindex_like. Start with the original:
df1 = pd.DataFrame([[11, 12], [21, 22]], columns=['c1', 'c2'], index=['i1', 'i2'])
Construct an empty DataFrame and reindex it like df1:
pd.DataFrame().reindex_like(df1)
Out:
    c1  c2
i1 NaN NaN
i2 NaN NaN
In version 0.18 of pandas, the DataFrame constructor has no options for creating a dataframe like another dataframe with NaN instead of the values.
The code you use, df2 = pd.DataFrame(columns=df1.columns, index=df1.index), is the most logical way to do it. The only way to improve on it is to spell out even more what you are doing, by adding data=None, so that other coders directly see that you intentionally leave out the data from this new DataFrame.
TLDR: So my suggestion is:
Explicit is better than implicit
df2 = pd.DataFrame(data=None, columns=df1.columns, index=df1.index)
Very much like yours, but more spelled out.
Not exactly answering this question, but a similar one, for people coming here via a search engine.
My case was creating a copy of the data frame without the data and without the index. One can achieve this by doing the following. This will maintain the dtypes of the columns.
empty_copy = df.drop(df.index)
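A quick check that the dtypes survive:

import pandas as pd

df = pd.DataFrame({'num': [1, 2, 3], 'char': ['a', 'b', 'c']})
empty_copy = df.drop(df.index)

print(empty_copy.dtypes)
# num      int64
# char    object
# dtype: object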
Let's start with some sample data
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']],
...: columns=['num', 'char'])
In [3]: df
Out[3]:
   num char
0    1    a
1    2    b
2    3    c

In [4]: df.dtypes
Out[4]:
num      int64
char    object
dtype: object
Now let's use a simple DataFrame initialization using the columns of the original DataFrame but providing no data:
In [5]: empty_copy_1 = pd.DataFrame(data=None, columns=df.columns)
In [6]: empty_copy_1
Out[6]:
Empty DataFrame
Columns: [num, char]
Index: []

In [7]: empty_copy_1.dtypes
Out[7]:
num     object
char    object
dtype: object
As you can see, the column data types are not the same as in our original DataFrame.
So, if you want to preserve the column dtype...
If you want to preserve the column data types, you need to construct the DataFrame one Series at a time:
In [8]: empty_copy_2 = pd.DataFrame.from_items([
...: (name, pd.Series(data=None, dtype=series.dtype))
...: for name, series in df.iteritems()])
In [9]: empty_copy_2
Out[9]:
Empty DataFrame
Columns: [num, char]
Index: []

In [10]: empty_copy_2.dtypes
Out[10]:
num      int64
char    object
dtype: object
A simple alternative -- first copy the basic structure (the indexes and the columns, with their datatypes) from the original dataframe (df1) into df2:
df2 = df1.iloc[0:0]
Then fill your dataframe with empty rows -- a sketch that will need to be adapted to better match your actual structure:

s = pd.Series([np.nan, np.nan, np.nan], index=['Col1', 'Col2', 'Col3'])
# loop through the rows in df1
for _ in range(len(df1)):
    df2 = df2.append(s, ignore_index=True)
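Put together with the df1 from the question, a sketch looks like this (note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so this only runs on older versions):

import numpy as np
import pandas as pd

df1 = pd.DataFrame([[11, 12], [21, 22]], columns=['c1', 'c2'], index=['i1', 'i2'])
df2 = df1.iloc[0:0]  # empty frame with df1's columns and dtypes

s = pd.Series([np.nan] * len(df2.columns), index=df2.columns)
for idx in df1.index:
    df2 = df2.append(s.rename(idx))  # rename keeps the original row labels

print(df2)
#     c1  c2
# i1 NaN NaN
# i2 NaN NaN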
To preserve column type you can use the astype method,
like pd.DataFrame(columns=df1.columns).astype(df1.dtypes)
import pandas as pd
df1 = pd.DataFrame(
[
[11, 12, 'Alice'],
[21, 22, 'Bob']
],
columns=['c1', 'c2', 'c3'],
index=['i1', 'i2']
)
df2 = pd.DataFrame(columns=df1.columns).astype(df1.dtypes)
print(df2.shape)
print(df2.dtypes)
output:
(0, 3)
c1     int64
c2     int64
c3    object
dtype: object
You can simply mask by notna(), i.e.
df1 = pd.DataFrame([[11, 12], [21, 22]], columns=['c1', 'c2'], index=['i1', 'i2'])
df2 = df1.mask(df1.notna())
    c1  c2
i1 NaN NaN
i2 NaN NaN
A simple way to copy df structure into df2 is:
df2 = pd.DataFrame(columns=df.columns)
This has worked for me in pandas 0.22:
df2 = pd.DataFrame(index=df.index.delete(slice(None)), columns=df.columns)
Convert types:
df2 = df2.astype(df.dtypes)
delete(slice(None)) removes all of the index values, in case you do not want to keep them.
I know this is an old question, but I thought I would add my two cents.
def df_cols_like(df):
    """
    Returns an empty data frame with the same column names and types as df
    """
    df2 = pd.DataFrame({i[0]: pd.Series(dtype=i[1])
                        for i in df.dtypes.iteritems()},
                       columns=df.dtypes.index)
    return df2
This approach centers around the df.dtypes attribute of the input data frame df, which is a pd.Series. A pd.DataFrame is constructed from a dictionary of empty pd.Series objects, named using the input column names, with the column order taken from the input df.
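A quick usage check, on a pandas version where Series.iteritems still exists (it was removed in 2.0), reusing the num/char frame from the earlier answer:

import pandas as pd

df = pd.DataFrame({'num': [1, 2, 3], 'char': ['a', 'b', 'c']})
empty = df_cols_like(df)

print(empty.dtypes)
# num      int64
# char    object
# dtype: object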