Copy a column to multiple columns of a DataFrame with Pandas - python

I have a DataFrame with multiple columns, a few columns being NaN. The dataframe is quite big having around 5,000 columns. Below is a sample from it:
GeoCode ESP FIN USA EZ19 PRT
1 Geography Spain Finland USA EZ Portugal
2 31-Mar-15 NaN NaN 0.26 0.89 NaN
3 30-Jun-15 NaN NaN NaN 0.90 NaN
4 30-Sep-15 NaN NaN 0.31 0.90 NaN
5 31-Dec-15 NaN NaN 0.41 0.91 NaN
I want to copy the value of column 'EZ19' to all columns where all values for row 2 and below are NaN. I tried the following code and it works:
nan_cols = df.columns[df[2:].isnull().all()].to_list()
for c in nan_cols:
    df.loc[2:, c] = df.loc[2:, 'EZ19']
But I think there should be a way to assign the values of column 'EZ19' to the target columns without a loop, and I am surprised that there doesn't seem to be a straightforward way to do this. Other questions here don't handle my exact issue, and I couldn't find a solution that worked for me.
Given the size of my dataframe (and it is expected to grow larger over time), I really want to avoid a loop in my final code, so any help with this will be greatly appreciated.

If you're interested in replacing the values of columns that contain only nulls, you can take a shortcut: identify the columns that are entirely null from row 2 onwards, then overwrite those rows in a single assignment.
# Identify columns that contain only null values from row 2 onwards
all_null_cols = df.loc[2:].isnull().all()
# Overwrite row 2 onwards in only the all-null columns with values from "EZ19"
df.loc[2:, all_null_cols] = df.loc[2:, ["EZ19"]].values
print(df)
GeoCode ESP FIN USA EZ19 PRT
1 Geography Spain Finland USA EZ Portugal
2 31-Mar-15 0.89 0.89 0.26 0.89 0.89
3 30-Jun-15 0.90 0.90 NaN 0.90 0.90
4 30-Sep-15 0.90 0.90 0.31 0.90 0.90
5 31-Dec-15 0.91 0.91 0.41 0.91 0.91
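For reference, here is a minimal self-contained sketch of the same idea. It assumes a simplified, numeric-only version of the sample (the text row labelled 1 is left out and the index starts at 2), and it repeats the EZ19 values explicitly with np.repeat so the assigned array has exactly the shape of the target block. The names target_cols and source are illustrative, not from the original answer.

```python
import numpy as np
import pandas as pd

# Assumed numeric-only sample: rows 2..5 of the question's frame, without the text row.
df = pd.DataFrame(
    {
        "ESP": [np.nan] * 4,
        "FIN": [np.nan] * 4,
        "USA": [0.26, np.nan, 0.31, 0.41],
        "EZ19": [0.89, 0.90, 0.90, 0.91],
        "PRT": [np.nan] * 4,
    },
    index=[2, 3, 4, 5],
)

# Boolean Series indexed by column name: True where every value is NaN.
all_null = df.isna().all()
target_cols = all_null[all_null].index

# Repeat the single EZ19 column once per target column, then assign in one step.
source = df[["EZ19"]].to_numpy()
df[target_cols] = np.repeat(source, len(target_cols), axis=1)
print(df)
```

USA keeps its NaN because it is not entirely null, which matches the behaviour described above.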

Not sure if this is what you have in mind:
outcome = df.loc[2:, df.loc[2:].isna().all()].mask(
    lambda df: df.isna(), df.loc[2:, "EZ19"], axis=0
)
outcome
ESP FIN PRT
2 0.89 0.89 0.89
3 0.90 0.90 0.90
4 0.90 0.90 0.90
5 0.91 0.91 0.91
df.update(outcome)
df
GeoCode ESP FIN USA EZ19 PRT
1 Geography Spain Finland USA EZ Portugal
2 31-Mar-15 0.89 0.89 0.26 0.89 0.89
3 30-Jun-15 0.90 0.90 NaN 0.90 0.90
4 30-Sep-15 0.90 0.90 0.31 0.90 0.90
5 31-Dec-15 0.91 0.91 0.41 0.91 0.91
It only fills columns that are completely null from row 2 downwards; USA is not completely null from row 2, which is why it was not altered.

A simple one-liner that replaces every empty value in a row with that row's EZ19 value (note that this also fills the single NaN in USA):
df = df.apply(lambda row: row.where(pd.notnull(row), row.EZ19), axis=1)
Output:
GeoCode ESP FIN USA EZ19 PRT
0 Geography Spain Finland USA EZ Portugal
1 31-Mar-15 0.89 0.89 0.26 0.89 0.89
2 30-Jun-15 0.90 0.90 0.90 0.90 0.90
3 30-Sep-15 0.90 0.90 0.31 0.90 0.90
4 31-Dec-15 0.91 0.91 0.41 0.91 0.91
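If a row-wise apply turns out to be slow on a wide frame, a possible alternative is DataFrame.where with axis=0, which aligns the EZ19 Series along the index and fills every NaN in one vectorized call. This is only a sketch on an assumed numeric-only sample (the text row is omitted); like the one-liner above, it fills all NaN cells, not just the all-null columns.

```python
import numpy as np
import pandas as pd

# Assumed numeric-only sample based on the question's data.
df = pd.DataFrame(
    {
        "ESP": [np.nan] * 4,
        "USA": [0.26, np.nan, 0.31, 0.41],
        "EZ19": [0.89, 0.90, 0.90, 0.91],
    },
    index=[1, 2, 3, 4],
)

# Keep existing values; where a cell is NaN, take the EZ19 value from the same row.
filled = df.where(df.notna(), df["EZ19"], axis=0)
print(filled)
```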

Related

Formating a data frame reducing column and increasing rows

I have a pandas data frame like this
data = [['NGS_34',0.47,0.55,0.51,0.53,0.54,0.62], ['NGS_38',0.52,0.52,0.49,0.51,0.52,0.45]]
df = pd.DataFrame(data, columns = ['probes','01_PLAGL1', '02_PLAGL1','H19','H19', 'H19','GNAS_A/B'])
df = df.set_index('probes')
df
01_PLAGL1 02_PLAGL1 H19 H19 H19 GNAS_A/B
probes
NGS_34 0.47 0.55 0.51 0.53 0.54 0.62
NGS_38 0.52 0.52 0.49 0.51 0.52 0.45
This is actually a minimal reproducible example. The real data frame is formed by many paired columns like the example 01_PLAGL1 02_PLAGL1, then 2 sets of three columns like the example H19 H19 H19 and 2 unique columns. With this explanation and the columns of my real dataset below, I think you will understand the input data of my problem.
data_no_control.columns.values
array(['PLAGL1', 'PLAGL1', 'GRB10', 'GRB10', 'MEST', 'MEST', 'H19', 'H19',
'H19', 'KCNQ1OT1', 'KCNQ1OT1', 'MEG3', 'MEG3', 'MEG8', 'MEG8',
'SNRPN', 'SNRPN', 'PEG3', 'PEG3', 'PEG3', 'NESP55', 'GNAS-AS1',
'GNASXL', 'GNASXL', 'GNAS_A/B'], dtype=object)
The final output I would like to achieve should be like this
01_PLAGL1 H19 GNAS A/B
probes
NGS_34 0.47 0.51 0.62
0.55 0.53
0.54
(One empty row)
(Second empty row)
NGS_38 0.52 0.49 0.45
0.52 0.51
0.52
(One empty row)
(Second empty row)
NGS_41 ...
I have tried this
df = data_no_control.reset_index(level=0)
empty_rows = 5
df.index = range(0, empty_rows*len(df), empty_rows)
new_df = df.reindex(index=range(empty_rows*len(df)))
new_df = new_df.set_index('index')
new_df
index 01_PLAGL1 02_PLAGL1 H19 H19 H19 GNAS_A/B
NGS_34 0.47 0.55 0.51 0.53 0.54 0.62
NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN
NGS_38 0.52 0.52 0.49 0.51 0.52 0.45
NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN
Use:
import numpy as np
import pandas as pd

data = [['NGS_34',0.47,0.55,0.51,0.53,0.54,0.62], ['NGS_38',0.52,0.52,0.49,0.51,0.52,0.45]]
df = pd.DataFrame(data, columns = ['probes','01_PLAGL1', '02_PLAGL1','H19','H19', 'H19','GNAS_A/B'])
df = df.set_index('probes')
# Number of new (empty) rows to add per probe
new = 2
# Remove the part before _ in paired column names
s = df.columns.str.split('_').str[-1].to_series()
# Create a MultiIndex from the name and a per-name counter
df.columns = [s, s.groupby(s).cumcount()]
# Reshape: move the counter level into the row index
df = df.stack()
# Create a MultiIndex that adds the new rows and keeps the original column order
mux = pd.MultiIndex.from_product([df.index.levels[0],
                                  np.arange(df.index.levels[1].max() + new + 1)])
df = df.reindex(index=mux, columns=s.unique())
print (df)
PLAGL1 H19 A/B
probes
NGS_34 0 0.47 0.51 0.62
1 0.55 0.53 NaN
2 NaN 0.54 NaN
3 NaN NaN NaN
4 NaN NaN NaN
NGS_38 0 0.52 0.49 0.45
1 0.52 0.51 NaN
2 NaN 0.52 NaN
3 NaN NaN NaN
4 NaN NaN NaN
Finally, if you need empty strings instead of missing values and no counter level, use:
df = df.droplevel(1).fillna('')
df.index = df.index.where(~df.index.duplicated(), '')
print (df)
PLAGL1 H19 A/B
probes
NGS_34 0.47 0.51 0.62
0.55 0.53
0.54
NGS_38 0.52 0.49 0.45
0.52 0.51
0.52
EDIT: The real data has no duplicate column names, so the output is different:
d = {'PLAGL1': {'NGS_34': 0.55, 'NGS_38': 0.52}, 'GRB10': {'NGS_34': 0.48, 'NGS_38': 0.49}, 'MEST': {'NGS_34': 0.56, 'NGS_38': 0.5}, 'H19': {'NGS_34': 0.54, 'NGS_38': 0.52}, 'KCNQ1OT1': {'NGS_34': 0.41, 'NGS_38': 0.49}, 'MEG3': {'NGS_34': 0.5, 'NGS_38': 0.55}, 'MEG8': {'NGS_34': 0.46, 'NGS_38': 0.5}, 'SNRPN': {'NGS_34': 0.55, 'NGS_38': 0.46}, 'PEG3': {'NGS_34': 0.51, 'NGS_38': 0.51}, 'NESP55': {'NGS_34': 0.55, 'NGS_38': 0.53}, 'GNAS-AS1': {'NGS_34': 0.52, 'NGS_38': 0.48}, 'GNASXL': {'NGS_34': 0.49, 'NGS_38': 0.44}, 'GNAS A/B': {'NGS_34': 0.62, 'NGS_38': 0.45}}
df = pd.DataFrame(d)
print (df)
PLAGL1 GRB10 MEST H19 KCNQ1OT1 MEG3 MEG8 SNRPN PEG3 NESP55 \
NGS_34 0.55 0.48 0.56 0.54 0.41 0.50 0.46 0.55 0.51 0.55
NGS_38 0.52 0.49 0.50 0.52 0.49 0.55 0.50 0.46 0.51 0.53
GNAS-AS1 GNASXL GNAS A/B
NGS_34 0.52 0.49 0.62
NGS_38 0.48 0.44 0.45
# Number of new (empty) rows to add per probe
new = 2
# Remove the part before _ in paired column names
s = df.columns.str.split('_').str[-1].to_series()
# Create a MultiIndex from the name and a per-name counter
df.columns = [s, s.groupby(s).cumcount()]
# Reshape: move the counter level into the row index
df = df.stack()
# Create a MultiIndex that adds the new rows and keeps the original column order
mux = pd.MultiIndex.from_product([df.index.levels[0],
                                  np.arange(df.index.levels[1].max() + new + 1)])
df = df.reindex(index=mux, columns=s.unique())
print (df)
PLAGL1 GRB10 MEST H19 KCNQ1OT1 MEG3 MEG8 SNRPN PEG3 \
NGS_34 0 0.55 0.48 0.56 0.54 0.41 0.50 0.46 0.55 0.51
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
NGS_38 0 0.52 0.49 0.50 0.52 0.49 0.55 0.50 0.46 0.51
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
NESP55 GNAS-AS1 GNASXL GNAS A/B
NGS_34 0 0.55 0.52 0.49 0.62
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
NGS_38 0 0.53 0.48 0.44 0.45
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
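The key step in this answer is the per-name counter: groupby(...).cumcount() numbers each repeat of a column name, and that number later becomes the row position within each probe. A small sketch of just that step, using the column names from the minimal example (the variable names here are illustrative, and a plain Series with a default index is used instead of to_series()):

```python
import pandas as pd

names = ["01_PLAGL1", "02_PLAGL1", "H19", "H19", "H19", "GNAS_A/B"]

# Keep only the part after the last underscore, as in the answer.
s = pd.Series(names).str.split("_").str[-1]

# Number each occurrence of a name: first -> 0, second -> 1, and so on.
counter = s.groupby(s).cumcount()

print(pd.DataFrame({"name": s, "counter": counter}))
```

Without duplicate names every counter is 0, which is why the EDIT output above has values only in row 0 for each probe.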

Python Pandas - Dataframe - Add column depending on another column, which has a mathematical operation from another two columns

I have a Pandas dataframe that looks something like this:
timestamp  Place    Data A  Data B  Data C
16508      France   0.03    0.06    0.15
16510      England  0.05    0.07    0.11
16515      England  0.04    0.03    0.87
What I would like to do is the following:
Add a new column for every different value in the column "Place".
In this new column, add the division between Data A and Data B in percentage (Data A / Data B * 100).
The expected output would be:
timestamp  Place    Data A  Data B  Data C  To France  To England
16508      France   0.03    0.06    0.15    50         0
16510      England  0.05    0.07    0.11    0          71.42
16515      England  0.04    0.03    0.87    0          133.33
I tried the following:
for column in data['Place'].unique():
    column_name = f'To {Place}'
    data[column_name] = data[data['Place'] == column]['Data A'].div(['Data B'])*100
    data[column_name].fillna(method='ffill', inplace=True)
    data[column_name].fillna(value=0, inplace=True)
But it's not working. I get a "'list' object has no attribute 'div'" error. I have tried other different things but they are not working either.
Could somebody give me a hand with this?
IIUC, you can try with pivot:
df["Ratio"] = df["Data A"].div(df["Data B"])
output = df.drop("Ratio", axis=1).join(df.pivot(columns="Place", values="Ratio").mul(100).fillna(0).add_prefix("To "))
>>> output
Place Data A Data B Data C To England To France
0 France 0.03 0.06 0.15 0.000000 50.0
1 England 0.05 0.07 0.11 71.428571 0.0
2 England 0.04 0.03 0.87 133.333333 0.0
I'd do it like this:
df_ratio = ((df['Data A'].div(df['Data B'])*100).to_frame()
            .assign(col='To '+df['Place'])
            .set_index('col', append=True)[0]
            .unstack(fill_value=0))
pd.concat([df, df_ratio], axis=1)
Output:
timestamp Place Data A Data B Data C To England To France
0 16508 France 0.03 0.06 0.15 0.000000 50.0
1 16510 England 0.05 0.07 0.11 71.428571 0.0
2 16515 England 0.04 0.03 0.87 133.333333 0.0
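If you prefer to keep the loop from the question, a corrected sketch could look like the one below. Two things stand out in the original attempt: the f-string refers to Place, while the loop variable is column, and .div(['Data B']) passes a list containing the string 'Data B' instead of the column itself. The version here computes the ratio once and uses Series.where to keep it only on matching rows; the frame is rebuilt from the question's sample, and the variable names are illustrative.

```python
import pandas as pd

data = pd.DataFrame(
    {
        "timestamp": [16508, 16510, 16515],
        "Place": ["France", "England", "England"],
        "Data A": [0.03, 0.05, 0.04],
        "Data B": [0.06, 0.07, 0.03],
        "Data C": [0.15, 0.11, 0.87],
    }
)

# Percentage ratio for every row, computed once.
ratio = data["Data A"].div(data["Data B"]).mul(100)

for place in data["Place"].unique():
    # Keep the ratio where the row belongs to this place, 0 elsewhere.
    data[f"To {place}"] = ratio.where(data["Place"] == place, 0)

print(data)
```

The loop runs once per distinct place rather than once per row, so it stays cheap as long as the number of places is small.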

count values of each month, fill NaN if under certain limit

I am working with a dataframe, where every column represents a company. The index is a datetime index with daily frequency. My problem is the following: For each company, I would like to fill a month with NaN if there are less than 20 values in that month. In the example below, this would mean that Company_1's entry 0.91 on 2012-08-31 would be changed to NaN, while company_2 and 3 would be unchanged.
Company_1 Company_2 Company_3
2012-08-01 NaN 0.99 0.11
2012-08-02 NaN 0.21 NaN
2012-08-03 NaN 0.32 0.40
... ... ... ...
2012-08-29 NaN 0.50 -0.36
2012-08-30 NaN 0.48 -0.32
2012-08-31 0.91 0.51 -0.33
Total Values: 1 22 21
I am struggling to find an efficient way to count the number of values for each month of each stock. I could theoretically write a function which creates a new dataframe, which reports the number of values for each month (and for each stock), to then use that dataframe for the original company information, but I am sure that there has to be an easier way. Any help is highly appreciated. Thanks in advance.
Group the dataframe by month and use transform('count') to get, for each column, the number of non-null values in each month. Then use lt(20) to build a boolean mask and pass it to mask to replace the values in those months with NaN:
df1 = df.mask(df.groupby(pd.Grouper(freq='M')).transform('count').lt(20))
print(df1)
Company_1 Company_2 Company_3
2012-08-01 NaN 0.99 0.11
2012-08-02 NaN 0.21 NaN
2012-08-03 NaN 0.32 0.40
....
2012-08-29 NaN 0.50 -0.36
2012-08-30 NaN 0.48 -0.32
2012-08-31 NaN 0.51 -0.33
IIUC (note that this counts non-null values over the whole column rather than per month):
df.loc[:, df.apply(lambda d: d.notnull().sum()<20)] = np.nan
print (df)
Company_1 Company_2 Company_3
2012-08-01 NaN 0.99 0.11
2012-08-02 NaN 0.21 NaN
2012-08-03 NaN 0.32 0.40
2012-08-29 NaN 0.50 -0.36
2012-08-30 NaN 0.48 -0.32
2012-08-31 NaN 0.51 -0.33
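To see the per-month behaviour on more than one month, here is a small self-contained sketch with assumed data: two months of daily values, where Company_1 has a single value in August and a full September. It groups by df.index.to_period("M") instead of pd.Grouper(freq='M'), which is a substitution made here to avoid depending on the frequency alias; the idea is otherwise the same.

```python
import numpy as np
import pandas as pd

# Assumed sample: daily data for August and September 2012.
idx = pd.date_range("2012-08-01", "2012-09-30", freq="D")
df = pd.DataFrame({"Company_1": 1.0, "Company_2": 2.0}, index=idx)

# Company_1 has only one value in August (on the 31st).
df.loc["2012-08-01":"2012-08-30", "Company_1"] = np.nan

# Non-null count per column and month, broadcast back to the daily shape.
counts = df.groupby(df.index.to_period("M")).transform("count")

# Blank out every cell that belongs to a month with fewer than 20 values.
result = df.mask(counts < 20)
print(result.loc["2012-08-29":"2012-09-02"])
```

Only Company_1's August is set to NaN; its September values and all of Company_2 are left as they were.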

python pandas dataframe Melt multiindex multi-levels

I have a DF with the following structure:
| Level | Rate |
Indicator | AAA | BBB | CCC | XXX | YYY |
location variable |
One 2017 0.69 0.22 0.71 0.02 0.98
2018 0.31 0.15 0.78 0.03 0.96
2019 0.55 0.19 0.82 0.04 0.83
Two 2017 0.31 0.33 0.93 0.11 0.21
2018 0.24 0.35 0.01 0.12 0.14
2019 0.16 0.25 0.12 0.14 0.17
Three 2017 0.58 0.11 0.55 0.21 0.27
2018 0.75 0.10 0.68 0.22 0.25
2019 0.42 0.08 0.71 0.23 0.41
I need to get a DF the following structure (with only one level):
location | variable | Indicator | Level | Rate |
------------------------------------------------
One | 2017 | AAA | 0.69 | NaN |
...
Three | 2019 | YYY | NaN | 0.41 |
I've made several attempts like this below but they don't work:
df.melt(col_level=0, id_vars = ['Location','Indicator','variable'] , value_vars = ['Level', 'Rate'])
Any help would be highly appreciated
Use DataFrame.stack with DataFrame.rename_axis and DataFrame.reset_index:
df = df.stack().rename_axis(('location','variable','indicator')).reset_index()
print (df.head(10))
location variable indicator Level Rate
0 One 2017 AAA 0.69 NaN
1 One 2017 BBB 0.22 NaN
2 One 2017 CCC 0.71 NaN
3 One 2017 XXX NaN 0.02
4 One 2017 YYY NaN 0.98
5 One 2018 AAA 0.31 NaN
6 One 2018 BBB 0.15 NaN
7 One 2018 CCC 0.78 NaN
8 One 2018 XXX NaN 0.03
9 One 2018 YYY NaN 0.96
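To try this without the original data, the sketch below builds a small frame with the same kind of two-level columns (the values are placeholders, not the numbers from the question) and applies the same stack, rename_axis, reset_index chain. Row order after stack can differ between pandas versions, so the example looks rows up by key rather than by position.

```python
import numpy as np
import pandas as pd

# Assumed two-level columns: an unnamed outer level and an "Indicator" inner level.
cols = pd.MultiIndex.from_tuples(
    [("Level", "AAA"), ("Level", "BBB"), ("Rate", "XXX")],
    names=[None, "Indicator"],
)
idx = pd.MultiIndex.from_product(
    [["One", "Two"], [2017, 2018]], names=["location", "variable"]
)
df = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3), index=idx, columns=cols)

# Move the inner column level into the rows, name the index levels, then flatten.
long = (
    df.stack()
    .rename_axis(["location", "variable", "indicator"])
    .reset_index()
)
print(long)
```

Each indicator belongs to only one of Level or Rate, so the other column is NaN on that row, as in the output above.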

splitting a dataframe into chunks and naming each new chunk into a dataframe

Is there a good way to split a dataframe into chunks and automatically name each chunk as its own dataframe?
For example, dfmaster has 1000 records; split it into chunks of 200 and create df1, df2, …, df5.
Any guidance would be much appreciated.
I've looked on other boards and found no guidance on a function that can automatically create new dataframes.
Use numpy for splitting. See the example below:
In [2095]: df
Out[2095]:
0 1 2 3 4 5 6 7 8 9 10
0 0.25 0.00 0.00 0.0 0.00 0.0 0.94 0.00 0.00 0.63 0.00
1 0.51 0.51 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 0.54 0.54 0.00 0.0 0.63 0.0 0.51 0.54 0.51 1.00 0.51
3 0.81 0.05 0.13 0.7 0.02 NaN NaN NaN NaN NaN NaN
In [2096]: np.split(df, 2)
Out[2096]:
[ 0 1 2 3 4 5 6 7 8 9 10
0 0.25 0.00 0.0 0.0 0.0 0.0 0.94 0.0 0.0 0.63 0.0
1 0.51 0.51 NaN NaN NaN NaN NaN NaN NaN NaN NaN,
0 1 2 3 4 5 6 7 8 9 10
2 0.54 0.54 0.00 0.0 0.63 0.0 0.51 0.54 0.51 1.0 0.51
3 0.81 0.05 0.13 0.7 0.02 NaN NaN NaN NaN NaN NaN]
df gets split into 2 dataframes of 2 rows each.
The second argument is the number of equal-sized sections, so for 1000 records in chunks of 200 you can do np.split(df, 5).
I find these ideas helpful:
solution via list:
https://stackoverflow.com/a/49563326/10396469
solution using numpy.split:
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.split.html
Just use df = df.values first to convert the dataframe to a numpy.array.
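As for the naming part of the question, one common pattern is to store the chunks in a dictionary keyed by name instead of creating separate variables. The sketch below assumes a 1000-row dfmaster with a default integer index and a chunk size of 200; the key format df1, df2, ... follows the question, and the other names are illustrative.

```python
import pandas as pd

# Assumed stand-in for dfmaster: 1000 rows with a default RangeIndex.
dfmaster = pd.DataFrame({"value": range(1000)})
chunk_size = 200

# Slice by position and key each chunk as df1, df2, ...
chunks = {
    f"df{i + 1}": dfmaster.iloc[start:start + chunk_size]
    for i, start in enumerate(range(0, len(dfmaster), chunk_size))
}

print(list(chunks))
print(chunks["df5"].head())
```

Positional slicing with iloc also handles a final chunk that is shorter than chunk_size, which np.split with an integer section count would reject.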
