I have an input file of wavelengths and absorbances from a spectrometer. In this file each new measurement is simply appended as the last two columns of the dataframe. The wavelength columns are needed to specify the wavelength at which a specific absorbance (= data) was measured.
Wavelength1  Data1  Wavelength2  Data2  Wavelength3  Data3  ...
800          0.1    798          0.02   798.5        0.6    ...
799          0.15   797          0.03   798.0        0.2    ...
798          0.133  796          0.2    797.5        0.4    ...
797          0.14   795          0.052  797.0        0.34   ...
...          ...    ...          ...    ...          ...    ...
I would like to have a dataframe that makes my analysis a bit easier. Something like this:
Wavelength1  Data1  Wavelength2  Data2  Wavelength3  Data3  ...
800          0.1    NaN          NaN    798.5        0.6    ...
799          0.15   NaN          NaN    798.0        0.2    ...
NaN          NaN    NaN          NaN    798.5        0.6    ...
798          0.133  798          0.02   798.0        0.2    ...
NaN          NaN    NaN          NaN    797.5        0.4    ...
797          0.14   797          0.03   797.0        0.34   ...
...          ...    ...          ...    ...          ...    ...
With my quite basic Python skill set, I know that I could probably store each wavelength-data pair in a list of tuples and make some complicated sorting magic happen. But ever since starting to learn more about the pandas module, I have been wondering if I can tackle this problem with more ease. However, while I have found the pandas shift function, I have not found a way of making it conditional, nor of shifting and sorting each column individually.
This is a classic wide-to-long reshape. Your sample data has two readings for wavelength 796 in the second set of data, which effectively means duplicates; this is dealt with by putting a subreading level in place. Finally, transform back to wide, where values line up on Wavelength, with provision for the duplicates. The steps to convert back to wide and flatten the columns are clearly optional, depending on how you want to run your analysis.
import pandas as pd

df = pd.DataFrame({'Wavelength1': [800, 799, 798, 797],
                   'Data1': [0.1, 0.15, 0.133, 0.14],
                   'Wavelength2': [798, 797, 796, 796],
                   'Data2': [0.02, 0.03, 0.2, 0.052],
                   'Wavelength3': [798.5, 798.0, 797.5, 797.0],
                   'Data3': [0.6, 0.2, 0.4, 0.34]})
# wide to long
df2 = (
    pd.wide_to_long(df.reset_index(), ["Wavelength", "Data"], i="index", j="reading")
    .droplevel(0)
    .reset_index()
    .set_index(["Wavelength", "reading"])
)
Long data:

                     Data
Wavelength reading
800.0      1        0.100
799.0      1        0.150
798.0      1        0.133
797.0      1        0.140
798.0      2        0.020
797.0      2        0.030
796.0      2        0.200
           2        0.052
798.5      3        0.600
798.0      3        0.200
797.5      3        0.400
797.0      3        0.340
Back to wide, with the wavelengths lined up:
# long back to wide, dealing with duplicate "Wavelength"
df2 = df2.set_index(
    pd.Series(df2.groupby(level=[0, 1]).cumcount().values, name="subreading"),
    append=True,
).unstack("reading")
# flatten the columns..
df2.columns = ["".join(map(str, c)) for c in df2.columns]
Final output:

                       Data1  Data2  Data3
Wavelength subreading
796.0      0             NaN  0.200    NaN
           1             NaN  0.052    NaN
797.0      0           0.140  0.030   0.34
797.5      0             NaN    NaN   0.40
798.0      0           0.133  0.020   0.20
798.5      0             NaN    NaN   0.60
799.0      0           0.150    NaN    NaN
800.0      0           0.100    NaN    NaN
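With the wavelengths in the index, the lookups needed for the analysis become one-liners. A small illustration, assuming the df2 produced by the code above:

df2.loc[797.0]            # every reading measured at 797.0 nm
df2["Data3"].dropna()     # just the third spectrum, without the padding NaN rows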
Related
I have a pandas data frame like this
data = [['NGS_34',0.47,0.55,0.51,0.53,0.54,0.62], ['NGS_38',0.52,0.52,0.49,0.51,0.52,0.45]]
df = pd.DataFrame(data, columns = ['probes','01_PLAGL1', '02_PLAGL1','H19','H19', 'H19','GNAS_A/B'])
df = df.set_index('probes')
df
01_PLAGL1 02_PLAGL1 H19 H19 H19 GNAS_A/B
probes
NGS_34 0.47 0.55 0.51 0.53 0.54 0.62
NGS_38 0.52 0.52 0.49 0.51 0.52 0.45
This is actually a minimal reproducible example. The real data frame is made up of many paired columns like 01_PLAGL1 02_PLAGL1 in the example, then two sets of three columns like H19 H19 H19, and two unique columns. With this explanation and the columns of my real dataset below, I think you will understand the input data of my problem.
data_no_control.columns.values
array(['PLAGL1', 'PLAGL1', 'GRB10', 'GRB10', 'MEST', 'MEST', 'H19', 'H19',
'H19', 'KCNQ1OT1', 'KCNQ1OT1', 'MEG3', 'MEG3', 'MEG8', 'MEG8',
'SNRPN', 'SNRPN', 'PEG3', 'PEG3', 'PEG3', 'NESP55', 'GNAS-AS1',
'GNASXL', 'GNASXL', 'GNAS_A/B'], dtype=object)
The final output I would like to achieve should be like this
01_PLAGL1 H19 GNAS A/B
probes
NGS_34 0.47 0.51 0.62
0.55 0.53
0.54
(One empty row)
(Second empty row)
NGS_38 0.52 0.49 0.45
0.52 0.51
0.52
(One empty row)
(Second empty row)
NGS_41 ...
I have tried this
df = data_no_control.reset_index(level=0)
empty_rows = 5
df.index = range(0, empty_rows*len(df), empty_rows)
new_df = df.reindex(index=range(empty_rows*len(df)))
new_df = new_df.set_index('index')
new_df
index    01_PLAGL1  02_PLAGL1   H19   H19   H19  GNAS_A/B
NGS_34        0.47       0.55  0.51  0.53  0.54      0.62
NaN            NaN        NaN   NaN   NaN   NaN       NaN
NaN            NaN        NaN   NaN   NaN   NaN       NaN
NaN            NaN        NaN   NaN   NaN   NaN       NaN
NaN            NaN        NaN   NaN   NaN   NaN       NaN
NGS_38        0.52       0.52  0.49  0.51  0.52      0.45
NaN            NaN        NaN   NaN   NaN   NaN       NaN
NaN            NaN        NaN   NaN   NaN   NaN       NaN
NaN            NaN        NaN   NaN   NaN   NaN       NaN
NaN            NaN        NaN   NaN   NaN   NaN       NaN
Use:
import pandas as pd
import numpy as np

data = [['NGS_34',0.47,0.55,0.51,0.53,0.54,0.62], ['NGS_38',0.52,0.52,0.49,0.51,0.52,0.45]]
df = pd.DataFrame(data, columns = ['probes','01_PLAGL1', '02_PLAGL1','H19','H19', 'H19','GNAS_A/B'])
df = df.set_index('probes')
#No of new rows
new = 2
#remove values before _ for paired column names
s = df.columns.str.split('_').str[-1].to_series()
#create MultiIndex by counter
df.columns = [s, s.groupby(s).cumcount()]
#reshape
df = df.stack()
#create MultiIndex to add new rows and keep original order of column names
mux = pd.MultiIndex.from_product([df.index.levels[0],
                                  np.arange(df.index.levels[1].max() + new + 1)])
df = df.reindex(index=mux, columns=s.unique())
print (df)
PLAGL1 H19 A/B
probes
NGS_34 0 0.47 0.51 0.62
1 0.55 0.53 NaN
2 NaN 0.54 NaN
3 NaN NaN NaN
4 NaN NaN NaN
NGS_38 0 0.52 0.49 0.45
1 0.52 0.51 NaN
2 NaN 0.52 NaN
3 NaN NaN NaN
4 NaN NaN NaN
Last, if you need empty values instead of missing values and no counter level, use:
df = df.droplevel(1).fillna('')
df.index = df.index.where(~df.index.duplicated(), '')
print (df)
PLAGL1 H19 A/B
probes
NGS_34 0.47 0.51 0.62
0.55 0.53
0.54
NGS_38 0.52 0.49 0.45
0.52 0.51
0.52
EDIT: In the real data there are no duplicates, so the output is different:
d = {'PLAGL1': {'NGS_34': 0.55, 'NGS_38': 0.52}, 'GRB10': {'NGS_34': 0.48, 'NGS_38': 0.49}, 'MEST': {'NGS_34': 0.56, 'NGS_38': 0.5}, 'H19': {'NGS_34': 0.54, 'NGS_38': 0.52}, 'KCNQ1OT1': {'NGS_34': 0.41, 'NGS_38': 0.49}, 'MEG3': {'NGS_34': 0.5, 'NGS_38': 0.55}, 'MEG8': {'NGS_34': 0.46, 'NGS_38': 0.5}, 'SNRPN': {'NGS_34': 0.55, 'NGS_38': 0.46}, 'PEG3': {'NGS_34': 0.51, 'NGS_38': 0.51}, 'NESP55': {'NGS_34': 0.55, 'NGS_38': 0.53}, 'GNAS-AS1': {'NGS_34': 0.52, 'NGS_38': 0.48}, 'GNASXL': {'NGS_34': 0.49, 'NGS_38': 0.44}, 'GNAS A/B': {'NGS_34': 0.62, 'NGS_38': 0.45}}
df = pd.DataFrame(d)
print (df)
PLAGL1 GRB10 MEST H19 KCNQ1OT1 MEG3 MEG8 SNRPN PEG3 NESP55 \
NGS_34 0.55 0.48 0.56 0.54 0.41 0.50 0.46 0.55 0.51 0.55
NGS_38 0.52 0.49 0.50 0.52 0.49 0.55 0.50 0.46 0.51 0.53
GNAS-AS1 GNASXL GNAS A/B
NGS_34 0.52 0.49 0.62
NGS_38 0.48 0.44 0.45
#No of new rows
new = 2
#remove values before _ for paired column names
s = df.columns.str.split('_').str[-1].to_series()
#create MultiIndex by counter
df.columns = [s, s.groupby(s).cumcount()]
#reshape
df = df.stack()
#create MultiIndex to add new rows and keep original order of column names
mux = pd.MultiIndex.from_product([df.index.levels[0],
                                  np.arange(df.index.levels[1].max() + new + 1)])
df = df.reindex(index=mux, columns=s.unique())
print (df)
PLAGL1 GRB10 MEST H19 KCNQ1OT1 MEG3 MEG8 SNRPN PEG3 \
NGS_34 0 0.55 0.48 0.56 0.54 0.41 0.50 0.46 0.55 0.51
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
NGS_38 0 0.52 0.49 0.50 0.52 0.49 0.55 0.50 0.46 0.51
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
NESP55 GNAS-AS1 GNASXL GNAS A/B
NGS_34 0 0.55 0.52 0.49 0.62
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
NGS_38 0 0.53 0.48 0.44 0.45
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
First, I created a csv file from data0 as shown:
data0 = ["car", 0.82, 0.0026, 0.914, 0.59]
test_df = pd.DataFrame([data0])
test_df.to_csv("testfile1.csv")
The contents of "testfile1.csv" appear like this:
0 1 2 3 4
0 car 0.82 0.0026 0.914 0.59
I want to append new data (data1 = ["bus", 0.9, 0.123, 12.907, 42], data2 = ["van", 0.23, 0.41, .031, 0.894, 6.16, 4.104]) to the old csv file so that it appears in new rows, exactly under the previous rows, as shown below:
0 1 2 3 4 5 6
0 car 0.82 0.0026 0.914 0.590 NaN NaN
1 bus 0.90 0.1230 12.907 42.000 NaN NaN
2 van 0.23 0.4100 0.031 0.894 6.16 4.104
I tried the program below, as well as other similar methods using .append() or .to_csv() with mode="a":
test_df = pd.read_csv("testfile1.csv", index_col=0)
test_df = test_df.append([data1], ignore_index=True)
test_df = test_df.append([data2], ignore_index=True)
test_df.to_csv("testfile1.csv")
However, every time the new data is appended as new columns instead of under the previous columns:
0 1 2 3 4 ... 0 1 2 3 4
0 NaN NaN NaN NaN NaN ... car 0.82 0.0026 0.914 0.59
1 bus 0.90 0.123 12.907 42.000 ... NaN NaN NaN NaN NaN
2 van 0.23 0.410 0.031 0.894 ... NaN NaN NaN NaN NaN
My project requires reading the existing CSV file, appending to it, and saving back to the same file. I even tried type casting to pandas.Series.
You probably want to turn each list into a pandas Series first:
data1 = ['bus', 0.90, 0.1230, 12.907, 42.000]
row1 = pd.Series(data1)
data2 = ["van", 0.23, 0.41, .031, 0.894, 6.16, 4.104]
row2 = pd.Series(data2)
Then append it:
test_df = test_df.append(row1, ignore_index=True)
test_df = test_df.append(row2, ignore_index=True)
Output:
0 1 2 3 4 5 6
0 car 0.82 0.0026 0.914 0.590 NaN NaN
1 bus 0.90 0.1230 12.907 42.000 NaN NaN
2 van 0.23 0.4100 0.031 0.894 6.16 4.104
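Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions the same result has to be built with pd.concat. A minimal sketch under that assumption, reusing the file and lists from the question:

import pandas as pd

data1 = ["bus", 0.90, 0.123, 12.907, 42]
data2 = ["van", 0.23, 0.41, 0.031, 0.894, 6.16, 4.104]

test_df = pd.read_csv("testfile1.csv", index_col=0)
test_df.columns = range(test_df.shape[1])   # make the column labels positional integers (0, 1, 2, ...)

new_rows = pd.DataFrame([data1, data2])     # shorter rows are padded with NaN on the right
test_df = pd.concat([test_df, new_rows], ignore_index=True)
test_df.to_csv("testfile1.csv")

Aligning the column labels first matters because read_csv returns the header as strings ('0', '1', ...) while rows built from plain lists get integer labels; mismatched labels are what pushed the appended rows into separate columns in the original attempt.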
I am working with a dataframe where every column represents a company. The index is a datetime index with daily frequency. My problem is the following: for each company, I would like to fill a month with NaN if there are fewer than 20 values in that month. In the example below, this would mean that Company_1's entry 0.91 on 2012-08-31 would be changed to NaN, while Company_2 and Company_3 would be unchanged.
Company_1 Company_2 Company_3
2012-08-01 NaN 0.99 0.11
2012-08-02 NaN 0.21 NaN
2012-08-03 NaN 0.32 0.40
... ... ... ...
2012-08-29 NaN 0.50 -0.36
2012-08-30 NaN 0.48 -0.32
2012-08-31 0.91 0.51 -0.33
Total Values: 1 22 21
I am struggling to find an efficient way to count the number of values for each month of each stock. I could theoretically write a function which creates a new dataframe reporting the number of values for each month (and for each stock), and then use that dataframe to mask the original company information, but I am sure there has to be an easier way. Any help is highly appreciated. Thanks in advance.
Group the dataframe on monthly frequency and transform using count, then use lt to create a boolean mask and apply it with DataFrame.mask to replace the values in those months with NaN:
df1 = df.mask(df.groupby(pd.Grouper(freq='M')).transform('count').lt(20))
print(df1)
Company_1 Company_2 Company_3
2012-08-01 NaN 0.99 0.11
2012-08-02 NaN 0.21 NaN
2012-08-03 NaN 0.32 0.40
....
2012-08-29 NaN 0.50 -0.36
2012-08-30 NaN 0.48 -0.32
2012-08-31 NaN 0.51 -0.33
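One caveat, assuming you are on a recent pandas version: since pandas 2.2 the 'M' frequency alias is deprecated in favor of 'ME' (month end), so the same line would be written as:

df1 = df.mask(df.groupby(pd.Grouper(freq='ME')).transform('count').lt(20))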
IIUC:
df.loc[:, df.apply(lambda d: d.notnull().sum()<20)] = np.NaN
print (df)
Company 1 Company 2 Company 3
2012-08-01 NaN 0.99 0.11
2012-08-02 NaN 0.21 NaN
2012-08-03 NaN 0.32 0.40
2012-08-29 NaN 0.50 -0.36
2012-08-30 NaN 0.48 -0.32
2012-08-31 NaN 0.51 -0.33
I have a 5k x 2 column dataframe called "both".
I want to create a new 5k x 1 DataFrame or column (doesn't matter) by replacing any NaN value in one column with the value of the adjacent column.
ex:
Gains Loss
0 NaN NaN
1 NaN -0.17
2 NaN -0.13
3 NaN -0.75
4 NaN -0.17
5 NaN -0.99
6 1.06 NaN
7 NaN -1.29
8 NaN -0.42
9 0.14 NaN
So, for example, I need to replace the NaNs in the first column in rows 1 through 5 with the values in the same rows of the second column, to get a new df of the following form:
Change
0 NaN
1 -0.17
2 -0.13
3 -0.75
4 -0.17
5 -0.99
6 1.06
How do I tell Python to do this?
You may fill the NaN values with zeroes and then simply add your columns:
both["Change"] = both["Gains"].fillna(0) + both["Loss"].fillna(0)
Then, if you need it, you may turn the resulting zeroes back into NaNs:
both["Change"].replace(0, np.nan, inplace=True)
The result:
Gains Loss Change
0 NaN NaN NaN
1 NaN -0.17 -0.17
2 NaN -0.13 -0.13
3 NaN -0.75 -0.75
4 NaN -0.17 -0.17
5 NaN -0.99 -0.99
6 1.06 NaN 1.06
7 NaN -1.29 -1.29
8 NaN -0.42 -0.42
9 0.14 NaN 0.14
Finally, if you want to get rid of your original columns, you may drop them:
both.drop(columns=["Gains", "Loss"], inplace=True)
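An alternative that avoids the zero round-trip entirely is to fill the NaNs of one column directly from the other, which also keeps any genuine 0 values intact. A short sketch, assuming the same both frame:

both["Change"] = both["Gains"].fillna(both["Loss"])
# or, equivalently
both["Change"] = both["Gains"].combine_first(both["Loss"])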
There are many ways to achieve this. One is using the loc property:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Price1': [np.nan,np.nan,np.nan,np.nan,
                              np.nan,np.nan,1.06,np.nan,np.nan],
                   'Price2': [np.nan,-0.17,-0.13,-0.75,-0.17,
                              -0.99,np.nan,-1.29,-0.42]})
df.loc[df['Price1'].isnull(), 'Price1'] = df['Price2']
df = df.loc[:6,'Price1']
print(df)
Output:
Price1
0 NaN
1 -0.17
2 -0.13
3 -0.75
4 -0.17
5 -0.99
6 1.06
You can see more complex recipes in the pandas Cookbook.
IIUC, we can filter for null values and just sum the columns to make your new dataframe.
cols = ['Gains','Loss']
s = df.isnull().cumsum(axis=1).eq(len(df.columns)).any(axis=1)
# add df[cols].isnull() if you only want to measure the price columns for nulls.
df['prices'] = df[cols].loc[~s].sum(axis=1)
df = df.drop(cols,axis=1)
print(df)
prices
0 NaN
1 -0.17
2 -0.13
3 -0.75
4 -0.17
5 -0.99
6 1.06
7 -1.29
8 -0.42
Is there a good way to split a dataframe into chunks and automatically name each chunk as its own dataframe?
For example, dfmaster has 1000 records. Split it by 200 and create df1, df2, ..., df5.
Any guidance would be much appreciated.
I've looked on other boards and there is no guidance for a function that can automatically create new dataframes.
Use numpy for splitting:
See example below:
In [2095]: df
Out[2095]:
0 1 2 3 4 5 6 7 8 9 10
0 0.25 0.00 0.00 0.0 0.00 0.0 0.94 0.00 0.00 0.63 0.00
1 0.51 0.51 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 0.54 0.54 0.00 0.0 0.63 0.0 0.51 0.54 0.51 1.00 0.51
3 0.81 0.05 0.13 0.7 0.02 NaN NaN NaN NaN NaN NaN
In [2096]: np.split(df, 2)
Out[2096]:
[ 0 1 2 3 4 5 6 7 8 9 10
0 0.25 0.00 0.0 0.0 0.0 0.0 0.94 0.0 0.0 0.63 0.0
1 0.51 0.51 NaN NaN NaN NaN NaN NaN NaN NaN NaN,
0 1 2 3 4 5 6 7 8 9 10
2 0.54 0.54 0.00 0.0 0.63 0.0 0.51 0.54 0.51 1.0 0.51
3 0.81 0.05 0.13 0.7 0.02 NaN NaN NaN NaN NaN NaN]
df gets split into 2 dataframes having 2 rows each.
For your example, you can do np.split(dfmaster, 5) to get five chunks of 200 rows each.
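If the length does not divide evenly, np.split raises an error, while numpy.array_split tolerates uneven chunks. Rather than creating the variables df1, df2, ... by hand, it is usually cleaner to keep the chunks in a dict. A sketch, assuming a 1000-row dfmaster split into five roughly 200-row pieces:

import numpy as np

chunks = np.array_split(dfmaster, 5)                         # five dataframes of ~200 rows each
dfs = {f"df{i + 1}": chunk for i, chunk in enumerate(chunks)}
dfs["df1"].head()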
I find these ideas helpful:
solution via a list (see the sketch below):
https://stackoverflow.com/a/49563326/10396469
solution using numpy.split:
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.split.html
Just use df = df.values first to convert from a dataframe to a numpy array.
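The list-based idea from the first link boils down to slicing with iloc, which keeps every chunk as a pandas DataFrame with its column labels intact. A minimal sketch, assuming the 200-row chunk size from the question:

chunk_size = 200
chunks = [dfmaster.iloc[i:i + chunk_size] for i in range(0, len(dfmaster), chunk_size)]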