Formatting a data frame by reducing columns and increasing rows


I have a pandas data frame like this:
import pandas as pd

data = [['NGS_34',0.47,0.55,0.51,0.53,0.54,0.62], ['NGS_38',0.52,0.52,0.49,0.51,0.52,0.45]]
df = pd.DataFrame(data, columns = ['probes','01_PLAGL1', '02_PLAGL1','H19','H19', 'H19','GNAS_A/B'])
df = df.set_index('probes')
df
        01_PLAGL1  02_PLAGL1   H19   H19   H19  GNAS_A/B
probes
NGS_34       0.47       0.55  0.51  0.53  0.54      0.62
NGS_38       0.52       0.52  0.49  0.51  0.52      0.45
This is a minimal reproducible example. The real data frame consists of many paired columns (like 01_PLAGL1 and 02_PLAGL1 in the example), 2 sets of three columns (like H19 H19 H19 in the example) and 2 unique columns. Together with the columns of my real dataset below, this should make the input data of my problem clear.
data_no_control.columns.values
array(['PLAGL1', 'PLAGL1', 'GRB10', 'GRB10', 'MEST', 'MEST', 'H19', 'H19',
'H19', 'KCNQ1OT1', 'KCNQ1OT1', 'MEG3', 'MEG3', 'MEG8', 'MEG8',
'SNRPN', 'SNRPN', 'PEG3', 'PEG3', 'PEG3', 'NESP55', 'GNAS-AS1',
'GNASXL', 'GNASXL', 'GNAS_A/B'], dtype=object)
The final output I would like to achieve should be like this:
        01_PLAGL1   H19  GNAS A/B
probes
NGS_34       0.47  0.51      0.62
             0.55  0.53
                   0.54
(One empty row)
(Second empty row)
NGS_38       0.52  0.49      0.45
             0.52  0.51
                   0.52
(One empty row)
(Second empty row)
NGS_41 ...
I have tried this
df = data_no_control.reset_index(level=0)
empty_rows = 5
df.index = range(0, empty_rows*len(df), empty_rows)
new_df = df.reindex(index=range(empty_rows*len(df)))
new_df = new_df.set_index('index')
new_df
index   01_PLAGL1  02_PLAGL1   H19   H19   H19  GNAS_A/B
NGS_34       0.47       0.55  0.51  0.53  0.54      0.62
NaN           NaN        NaN   NaN   NaN   NaN       NaN
NaN           NaN        NaN   NaN   NaN   NaN       NaN
NaN           NaN        NaN   NaN   NaN   NaN       NaN
NaN           NaN        NaN   NaN   NaN   NaN       NaN
NGS_38       0.52       0.52  0.49  0.51  0.52      0.45
NaN           NaN        NaN   NaN   NaN   NaN       NaN
NaN           NaN        NaN   NaN   NaN   NaN       NaN
NaN           NaN        NaN   NaN   NaN   NaN       NaN
NaN           NaN        NaN   NaN   NaN   NaN       NaN

Use:
import numpy as np
import pandas as pd

data = [['NGS_34',0.47,0.55,0.51,0.53,0.54,0.62], ['NGS_38',0.52,0.52,0.49,0.51,0.52,0.45]]
df = pd.DataFrame(data, columns = ['probes','01_PLAGL1', '02_PLAGL1','H19','H19', 'H19','GNAS_A/B'])
df = df.set_index('probes')
# number of new (empty) rows
new = 2
# remove the values before _ so paired columns share one name
s = df.columns.str.split('_').str[-1].to_series()
# create a MultiIndex in the columns from the name and a per-name counter
df.columns = [s, s.groupby(s).cumcount()]
# reshape: move the counter level into the row index
df = df.stack()
# create a MultiIndex that adds the new rows, and keep the original order of the column names
mux = pd.MultiIndex.from_product([df.index.levels[0],
                                  np.arange(df.index.levels[1].max() + new + 1)])
df = df.reindex(index=mux, columns=s.unique())
print (df)
          PLAGL1   H19   A/B
probes
NGS_34 0    0.47  0.51  0.62
       1    0.55  0.53   NaN
       2     NaN  0.54   NaN
       3     NaN   NaN   NaN
       4     NaN   NaN   NaN
NGS_38 0    0.52  0.49  0.45
       1    0.52  0.51   NaN
       2     NaN  0.52   NaN
       3     NaN   NaN   NaN
       4     NaN   NaN   NaN
Finally, if you need empty strings instead of missing values and no counter level, use:
df = df.droplevel(1).fillna('')
df.index = df.index.where(~df.index.duplicated(), '')
print (df)
        PLAGL1   H19   A/B
probes
NGS_34    0.47  0.51  0.62
          0.55  0.53
                0.54
NGS_38    0.52  0.49  0.45
          0.52  0.51
                0.52
EDIT: The real data has no duplicated column names, so the output is different:
d = {'PLAGL1': {'NGS_34': 0.55, 'NGS_38': 0.52}, 'GRB10': {'NGS_34': 0.48, 'NGS_38': 0.49}, 'MEST': {'NGS_34': 0.56, 'NGS_38': 0.5}, 'H19': {'NGS_34': 0.54, 'NGS_38': 0.52}, 'KCNQ1OT1': {'NGS_34': 0.41, 'NGS_38': 0.49}, 'MEG3': {'NGS_34': 0.5, 'NGS_38': 0.55}, 'MEG8': {'NGS_34': 0.46, 'NGS_38': 0.5}, 'SNRPN': {'NGS_34': 0.55, 'NGS_38': 0.46}, 'PEG3': {'NGS_34': 0.51, 'NGS_38': 0.51}, 'NESP55': {'NGS_34': 0.55, 'NGS_38': 0.53}, 'GNAS-AS1': {'NGS_34': 0.52, 'NGS_38': 0.48}, 'GNASXL': {'NGS_34': 0.49, 'NGS_38': 0.44}, 'GNAS A/B': {'NGS_34': 0.62, 'NGS_38': 0.45}}
df = pd.DataFrame(d)
print (df)
PLAGL1 GRB10 MEST H19 KCNQ1OT1 MEG3 MEG8 SNRPN PEG3 NESP55 \
NGS_34 0.55 0.48 0.56 0.54 0.41 0.50 0.46 0.55 0.51 0.55
NGS_38 0.52 0.49 0.50 0.52 0.49 0.55 0.50 0.46 0.51 0.53
GNAS-AS1 GNASXL GNAS A/B
NGS_34 0.52 0.49 0.62
NGS_38 0.48 0.44 0.45
# number of new (empty) rows
new = 2
# remove the values before _ so paired columns share one name
s = df.columns.str.split('_').str[-1].to_series()
# create a MultiIndex in the columns from the name and a per-name counter
df.columns = [s, s.groupby(s).cumcount()]
# reshape: move the counter level into the row index
df = df.stack()
# create a MultiIndex that adds the new rows, and keep the original order of the column names
mux = pd.MultiIndex.from_product([df.index.levels[0],
                                  np.arange(df.index.levels[1].max() + new + 1)])
df = df.reindex(index=mux, columns=s.unique())
print (df)
PLAGL1 GRB10 MEST H19 KCNQ1OT1 MEG3 MEG8 SNRPN PEG3 \
NGS_34 0 0.55 0.48 0.56 0.54 0.41 0.50 0.46 0.55 0.51
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
NGS_38 0 0.52 0.49 0.50 0.52 0.49 0.55 0.50 0.46 0.51
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
NESP55 GNAS-AS1 GNASXL GNAS A/B
NGS_34 0 0.55 0.52 0.49 0.62
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
NGS_38 0 0.53 0.48 0.44 0.45
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
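For comparison, here is a minimal sketch of the same reshaping written with plain loops instead of stack and reindex. The helper name spread_repeated_columns and its extra_rows parameter are made up for this illustration. It assumes all values are numeric and that the part before "_" has already been stripped from the column names.

```python
import numpy as np
import pandas as pd


def spread_repeated_columns(df, extra_rows=2):
    """Hypothetical helper: stack repeated column names under each other."""
    cols = np.asarray(df.columns, dtype=object)
    names = list(dict.fromkeys(cols))  # unique names in original order
    # rows per probe: largest number of repeats plus the empty rows
    height = max(int((cols == name).sum()) for name in names) + extra_rows
    blocks = []
    for probe, row in zip(df.index, df.to_numpy(dtype=float)):
        block = np.full((height, len(names)), np.nan)
        for j, name in enumerate(names):
            values = row[cols == name]  # all values stored under this name
            block[:len(values), j] = values
        index = pd.MultiIndex.from_product([[probe], range(height)])
        blocks.append(pd.DataFrame(block, index=index, columns=names))
    return pd.concat(blocks)


data = [[0.47, 0.55, 0.51, 0.53, 0.54, 0.62],
        [0.52, 0.52, 0.49, 0.51, 0.52, 0.45]]
df = pd.DataFrame(data, index=['NGS_34', 'NGS_38'],
                  columns=['01_PLAGL1', '02_PLAGL1', 'H19', 'H19', 'H19', 'GNAS_A/B'])
# same prefix stripping as in the answer
df.columns = df.columns.str.split('_').str[-1]
out = spread_repeated_columns(df)
print(out)
```

With the sample data this gives five rows per probe (three for H19, the most repeated name, plus two empty ones). The loop is slower than the vectorised answer on wide frames, but each step is easier to inspect.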

Related

Pandas dataframe column subtraction, handling NaN

I have a data frame, for example:
import numpy as np
import pandas as pd
df = pd.DataFrame([(np.nan, .32), (.01, np.nan), (np.nan, np.nan), (.21, .18)],
                  columns=['A', 'B'])
A B
0 NaN 0.32
1 0.01 NaN
2 NaN NaN
3 0.21 0.18
And I want to subtract column B from A
df['diff'] = df['A'] - df['B']
A B diff
0 NaN 0.32 NaN
1 0.01 NaN NaN
2 NaN NaN NaN
3 0.21 0.18 0.03
Difference returns NaN if one of the columns is NaN. To overcome this I use fillna
df['diff'] = df['A'].fillna(0) - df['B'].fillna(0)
A B diff
0 NaN 0.32 -0.32
1 0.01 NaN 0.01
2 NaN NaN 0.00
3 0.21 0.18 0.03
This removes the NaN values from the diff column, but at index 2 the result is 0, whereas I want the difference to be NaN there because both A and B are NaN.
Is there a way to explicitly tell pandas to output NaN if both columns are NaN?

Use Series.sub with fill_value=0 parameter:
df['diff'] = df['A'].sub(df['B'], fill_value=0)
print (df)
A B diff
0 NaN 0.32 -0.32
1 0.01 NaN 0.01
2 NaN NaN NaN
3 0.21 0.18 0.03
If you need to replace the remaining NaNs with 0, add Series.fillna:
df['diff'] = df['A'].sub(df['B'], fill_value=0).fillna(0)
print (df)
A B diff
0 NaN 0.32 -0.32
1 0.01 NaN 0.01
2 NaN NaN 0.00
3 0.21 0.18 0.03
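To make the behaviour of fill_value concrete, the following small sketch compares the plain subtraction, Series.sub with fill_value=0, and a hand-written equivalent. The variable names plain, filled and manual are only for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, .01, np.nan, .21],
                   'B': [.32, np.nan, np.nan, .18]})

# plain subtraction: NaN whenever either side is NaN
plain = df['A'] - df['B']

# fill_value replaces a missing value only when the other side is present,
# so rows where both sides are NaN stay NaN
filled = df['A'].sub(df['B'], fill_value=0)

# the same result spelled out: fill both sides, then restore NaN
# where both inputs were missing
both_missing = df['A'].isna() & df['B'].isna()
manual = (df['A'].fillna(0) - df['B'].fillna(0)).mask(both_missing)

print(pd.DataFrame({'plain': plain, 'filled': filled, 'manual': manual}))
```

Only the last row is non-NaN in plain, while filled and manual agree and are NaN only at index 2.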

count values of each month, fill NaN if under certain limit

I am working with a dataframe where every column represents a company. The index is a datetime index with daily frequency. My problem is the following: for each company, I would like to fill a month with NaN if there are fewer than 20 values in that month. In the example below, this would mean that Company_1's entry 0.91 on 2012-08-31 would be changed to NaN, while Company_2 and Company_3 would be unchanged.
Company_1 Company_2 Company_3
2012-08-01 NaN 0.99 0.11
2012-08-02 NaN 0.21 NaN
2012-08-03 NaN 0.32 0.40
... ... ... ...
2012-08-29 NaN 0.50 -0.36
2012-08-30 NaN 0.48 -0.32
2012-08-31 0.91 0.51 -0.33
Total Values: 1 22 21
I am struggling to find an efficient way to count the number of values for each month of each stock. I could theoretically write a function which creates a new dataframe, which reports the number of values for each month (and for each stock), to then use that dataframe for the original company information, but I am sure that there has to be an easier way. Any help is highly appreciated. Thanks in advance.

Group the dataframe by month with pd.Grouper and transform with count to get the number of non-NaN values per month for each column. Then use DataFrame.lt to create a boolean mask and pass it to DataFrame.mask to set those months to NaN:
df1 = df.mask(df.groupby(pd.Grouper(freq='M')).transform('count').lt(20))
print(df1)
Company_1 Company_2 Company_3
2012-08-01 NaN 0.99 0.11
2012-08-02 NaN 0.21 NaN
2012-08-03 NaN 0.32 0.40
....
2012-08-29 NaN 0.50 -0.36
2012-08-30 NaN 0.48 -0.32
2012-08-31 NaN 0.51 -0.33
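A self-contained sketch of the same idea on synthetic data is shown below. It groups by df.index.to_period('M') instead of pd.Grouper(freq='M'), which is my substitution: recent pandas releases deprecate the 'M' offset alias in favour of 'ME', while the monthly period alias is unaffected. The two-month date range and the constant values are invented for the example.

```python
import numpy as np
import pandas as pd

# two months of daily data; Company_1 has a single value in August
idx = pd.date_range('2012-08-01', '2012-09-30', freq='D')
df = pd.DataFrame({'Company_1': 1.0, 'Company_2': 2.0}, index=idx)
df.loc['2012-08-01':'2012-08-30', 'Company_1'] = np.nan

# number of non-NaN values per calendar month, broadcast back to every day
counts = df.groupby(df.index.to_period('M')).transform('count')

# blank out every month/column combination with fewer than 20 values
out = df.mask(counts.lt(20))

print(out.loc['2012-08-29':'2012-09-02'])
```

Here Company_1 loses its lone August value but keeps all of September, and Company_2 is unchanged.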

IIUC:
df.loc[:, df.apply(lambda d: d.notnull().sum()<20)] = np.nan
print (df)
Company 1 Company 2 Company 3
2012-08-01 NaN 0.99 0.11
2012-08-02 NaN 0.21 NaN
2012-08-03 NaN 0.32 0.40
2012-08-29 NaN 0.50 -0.36
2012-08-30 NaN 0.48 -0.32
2012-08-31 NaN 0.51 -0.33

splitting a dataframe into chunks and naming each new chunk into a dataframe

Is there a good way to split a dataframe into chunks and automatically name each chunk as its own dataframe?
For example, dfmaster has 1000 records. Split it by 200 and create df1, df2, ..., df5.
Any guidance would be much appreciated.
I've looked on other boards and there is no guidance for a function that can automatically create new dataframes.

Use numpy for splitting:
See example below:
In [2095]: df
Out[2095]:
0 1 2 3 4 5 6 7 8 9 10
0 0.25 0.00 0.00 0.0 0.00 0.0 0.94 0.00 0.00 0.63 0.00
1 0.51 0.51 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 0.54 0.54 0.00 0.0 0.63 0.0 0.51 0.54 0.51 1.00 0.51
3 0.81 0.05 0.13 0.7 0.02 NaN NaN NaN NaN NaN NaN
In [2096]: np.split(df, 2)
Out[2096]:
[ 0 1 2 3 4 5 6 7 8 9 10
0 0.25 0.00 0.0 0.0 0.0 0.0 0.94 0.0 0.0 0.63 0.0
1 0.51 0.51 NaN NaN NaN NaN NaN NaN NaN NaN NaN,
0 1 2 3 4 5 6 7 8 9 10
2 0.54 0.54 0.00 0.0 0.63 0.0 0.51 0.54 0.51 1.0 0.51
3 0.81 0.05 0.13 0.7 0.02 NaN NaN NaN NaN NaN NaN]
df gets split into 2 dataframes with 2 rows each.
The second argument is the number of equal sections, so for 1000 records in chunks of 200 you can do np.split(dfmaster, 5).
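As for naming the chunks, one common pattern is to keep them in a dict keyed by name rather than creating separate variables. The sketch below uses only pandas slicing with iloc. The column x, the chunk_size variable and the df1..df5 keys are assumptions taken from the question's example, not an established API.

```python
import pandas as pd

# stand-in for the 1000-record dfmaster from the question
dfmaster = pd.DataFrame({'x': range(1000)})
chunk_size = 200

# slice consecutive blocks of rows and store them under the names df1, df2, ...
chunks = {
    f'df{start // chunk_size + 1}': dfmaster.iloc[start:start + chunk_size]
    for start in range(0, len(dfmaster), chunk_size)
}

print(list(chunks))          # names of the chunks
print(chunks['df3'].head())  # access one chunk by name
```

Unlike splitting into equal sections, this slicing also works when the number of rows is not a multiple of chunk_size, in which case the last chunk is simply shorter.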

I find these ideas helpful:
solution via list:
https://stackoverflow.com/a/49563326/10396469
solution using numpy.split:
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.split.html
Just use df = df.values first to convert the dataframe to a numpy.array.

calculate mean only when the number of values in each row is higher than a certain number in python pandas

I have a daily time series dataframe with nine columns. Each column represents the measurement from a different method. I want to calculate the daily mean only when there are more than two measurements, and otherwise assign NaN. How can I do that with a pandas dataframe?
Suppose my df looks like:
0 1 2 3 4 5 6 7 8
2000-02-25 NaN 0.22 0.54 NaN NaN NaN NaN NaN NaN
2000-02-26 0.57 NaN 0.91 0.21 NaN 0.22 NaN 0.51 NaN
2000-02-27 0.10 0.14 0.09 NaN 0.17 NaN 0.05 NaN NaN
2000-02-28 NaN NaN NaN NaN NaN NaN NaN NaN 0.14
2000-02-29 0.82 NaN 0.75 NaN NaN NaN 0.14 NaN NaN
and I'm expecting mean values like:
0
2000-02-25 NaN
2000-02-26 0.48
2000-02-27 0.11
2000-02-28 NaN
2000-02-29 0.57

Use DataFrame.where to set rows to NaN by a condition built from DataFrame.count (which counts the non-NaN values per row) and compared with Series.gt (>):
s = df.where(df.count(axis=1).gt(2)).mean(axis=1)
#alternative solution with changed order
#s = df.mean(axis=1).where(df.count(axis=1).gt(2))
print (s)
2000-02-25 NaN
2000-02-26 0.484
2000-02-27 0.110
2000-02-28 NaN
2000-02-29 0.570
dtype: float64
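The alternative order can be wrapped in a small function and checked against the data from the question. The function name row_mean_min_count and its min_count parameter are invented for this sketch.

```python
import numpy as np
import pandas as pd


def row_mean_min_count(df, min_count=3):
    """Row-wise mean, kept only where a row has at least min_count values."""
    enough = df.count(axis=1) >= min_count  # count() ignores NaN
    return df.mean(axis=1).where(enough)


nan = np.nan
df = pd.DataFrame(
    [[nan, 0.22, 0.54, nan, nan, nan, nan, nan, nan],
     [0.57, nan, 0.91, 0.21, nan, 0.22, nan, 0.51, nan],
     [0.10, 0.14, 0.09, nan, 0.17, nan, 0.05, nan, nan],
     [nan, nan, nan, nan, nan, nan, nan, nan, 0.14],
     [0.82, nan, 0.75, nan, nan, nan, 0.14, nan, nan]],
    index=pd.date_range('2000-02-25', periods=5, freq='D'),
)

s = row_mean_min_count(df)
print(s.round(3))
```

"More than two measurements" corresponds to min_count=3 here, so the rows with two and one values come out as NaN.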

How to merge pandas DataFrame with same column name?

Both DataFrames are indexed by timestamp and share the same column name. I want to merge them and also be able to replace NaN with a value, but it does not seem to be working.
Sample:
import numpy as np
import pandas as pd
times = pd.to_datetime(pd.Series(['2014-07-4',
    '2014-07-15','2014-08-24','2014-08-25','2014-09-10','2014-09-17']))
valuea = [0.01, 0.02, -0.03, 0.4 ,0.5,np.nan]
times2 = pd.to_datetime(pd.Series(['2014-07-6',
    '2014-07-16','2014-08-27','2014-09-5','2014-09-11','2014-09-17']))
valuea2 = [1, 2, 3, 4,5,-6]
df1 = pd.DataFrame({'value A': valuea}, index=times)
df2 = pd.DataFrame({'value A': valuea2}, index=times2)
df3=pd.merge(df1,df2, left_index=True, right_index=True)
df3.head()

Assuming you need outer join
pd.concat([df1,df2],axis=1)
Out[321]:
value A value A
2014-07-04 0.01 NaN
2014-07-06 NaN 1.0
2014-07-15 0.02 NaN
2014-07-16 NaN 2.0
2014-08-24 -0.03 NaN
2014-08-25 0.40 NaN
2014-08-27 NaN 3.0
2014-09-05 NaN 4.0
2014-09-10 0.50 NaN
2014-09-11 NaN 5.0
2014-09-17 NaN -6.0
Update
df1.combine_first(df2)
Out[324]:
value A
2014-07-04 0.01
2014-07-06 1.00
2014-07-15 0.02
2014-07-16 2.00
2014-08-24 -0.03
2014-08-25 0.40
2014-08-27 3.00
2014-09-05 4.00
2014-09-10 0.50
2014-09-11 5.00
2014-09-17 -6.00
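The sketch below, on a cut-down version of the sample data, shows how combine_first treats a timestamp that occurs in both frames: the value from df1 wins unless it is NaN, in which case the value from df2 is used. The fillna(0) at the end is just one possible way to replace any NaN that remains. The shortened data is my own reduction of the question's sample.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'value A': [0.01, 0.5, np.nan]},
                   index=pd.to_datetime(['2014-07-04', '2014-09-10', '2014-09-17']))
df2 = pd.DataFrame({'value A': [1.0, 7.0, -6.0]},
                   index=pd.to_datetime(['2014-07-06', '2014-09-10', '2014-09-17']))

# union of both indexes; df1 takes priority, df2 fills df1's gaps
combined = df1.combine_first(df2)

# replace any NaN that is still left (none in this example)
combined_filled = combined.fillna(0)

print(combined)
```

On 2014-09-10 both frames have a value and df1's 0.5 is kept. On 2014-09-17 df1 holds NaN, so df2's -6.0 is used.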
