Pandas: interpolation where the first and last data points in a column are NaN - python

I would like to use the interpolate function, but only between known data values in a pandas DataFrame column. The issue is that the first and last values in the column are often NaN, and sometimes many rows pass before a value is not NaN:
      col 1  col 2
0       NaN    NaN
1       NaN    NaN
...
1000      1    NaN
1001    NaN      1  <-----
1002      3    NaN  <----- only want to fill in these 'in between value' rows
1003      4      3
...
3999    NaN    NaN
4000    NaN    NaN
I am tying together a dataset which is updated 'on event' but separately for each column, and is indexed via Timestamp. This means that there are often rows where no data is recorded for some columns, hence a lot of NaNs!

I select the slice of each column between the indices returned by idxmin and idxmax, and forward-fill it with ffill.
print(df)
#       col 1  col 2
# 0       NaN    NaN
# 1       NaN    NaN
# 1000      1    NaN
# 1001    NaN      1
# 1002      3    NaN
# 1003      4      3
# 3999    NaN    NaN
# 4000    NaN    NaN
df.loc[df['col 1'].idxmin(): df['col 1'].idxmax()] = df.loc[df['col 1'].idxmin(): df['col 1'].idxmax()].ffill()
df.loc[df['col 2'].idxmin(): df['col 2'].idxmax()] = df.loc[df['col 2'].idxmin(): df['col 2'].idxmax()].ffill()
print(df)
#       col 1  col 2
# 0       NaN    NaN
# 1       NaN    NaN
# 1000      1    NaN
# 1001      1      1
# 1002      3      1
# 1003      4      3
# 3999    NaN    NaN
# 4000    NaN    NaN
Added different solution, thanks HStro.
df['col 1'].loc[df['col 1'].first_valid_index() : df['col 1'].last_valid_index()] = df['col 1'].loc[df['col 1'].first_valid_index(): df['col 1'].last_valid_index()].astype(float).interpolate()
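For what it's worth, newer pandas (0.23+) can do this in one call with interpolate(limit_area='inside'), which fills only NaNs bounded by valid values on both sides. A minimal sketch of both approaches (the small example frame is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col 1': [np.nan, 1.0, np.nan, 3.0, np.nan],
                   'col 2': [np.nan, np.nan, 1.0, 3.0, np.nan]})

# One call on newer pandas: only NaNs surrounded by valid values are filled.
result = df.interpolate(limit_area='inside')

# Equivalent per-column version for older pandas, mirroring the
# first_valid_index/last_valid_index approach above.
def interpolate_inside(s):
    out = s.copy()
    first, last = s.first_valid_index(), s.last_valid_index()
    if first is not None:
        out.loc[first:last] = s.loc[first:last].astype(float).interpolate()
    return out

result2 = df.apply(interpolate_inside)
```

In both versions the leading and trailing NaNs survive; only the gaps between known values are interpolated.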

Related

How to groupby and count since the first non-NaN value?

I have this df:
     CODE  TMAX
0  000130   NaN
1  000130   NaN
2  000130  32.0
3  000130  32.2
4  000130   NaN
5  158328   NaN
6  158328   8.8
7  158328   NaN
8  158328   NaN
9  158328   9.2
...   ...   ...
I want to count the number of non-NaN values and the number of NaN values in the 'TMAX' column, but starting from the first non-NaN value and grouped by CODE.
Expected result for code 000130: 2 non-NaN values and 1 NaN value.
Expected result for code 158328: 2 non-NaN values and 2 NaN values.
Same with the other codes...
How can I do this?
Thanks in advance.
If you need the CODEs too, add GroupBy.cummax to mask everything before each group's first non-NaN value, and count the values with crosstab:
m = df.TMAX.notna()
s = m[m.groupby(df['CODE']).cummax()]
df1 = pd.crosstab(df['CODE'], s).rename(columns={True:'non NaNs',False:'NaNs'})
print (df1)
TMAX    NaNs  non NaNs
CODE
130        1         2
158328     2         2
If you need to explicitly filter column CODE by the mask as well:
m = df.TMAX.notna()
mask = m.groupby(df['CODE']).cummax()
df1 = pd.crosstab(df.loc[mask, 'CODE'], m[mask]).rename(columns={True:'non NaNs',False:'NaNs'})
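A self-contained sketch of the cummax + crosstab approach, using a made-up frame that mirrors the question (keeping CODE as strings is an assumption):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'CODE': ['000130'] * 5 + ['158328'] * 5,
    'TMAX': [np.nan, np.nan, 32.0, 32.2, np.nan,
             np.nan, 8.8, np.nan, np.nan, 9.2],
})

m = df.TMAX.notna()
# cummax turns each group's mask into False...False, True...True:
# everything from the first non-NaN value onward survives the filter.
s = m[m.groupby(df['CODE']).cummax()]
df1 = pd.crosstab(df['CODE'], s).rename(columns={True: 'non NaNs', False: 'NaNs'})
```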
Use first_valid_index to find the first non-NaN index and filter. Then use isna to create a boolean mask and count the values.
def countNaN(s):
    return (
        s.loc[s.first_valid_index():]
        .isna()
        .value_counts()
        .rename({True: 'NaN', False: 'notNaN'})
    )

df.groupby('CODE')['TMAX'].apply(countNaN)
Output
CODE
000130  notNaN    2
        NaN       1
158328  notNaN    2
        NaN       2
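The same function, runnable end to end on a made-up frame (TMAX is selected explicitly so first_valid_index looks at that column only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'CODE': ['000130'] * 5 + ['158328'] * 5,
    'TMAX': [np.nan, np.nan, 32.0, 32.2, np.nan,
             np.nan, 8.8, np.nan, np.nan, 9.2],
})

def countNaN(s):
    # Drop everything before the first non-NaN value, then tally.
    return (
        s.loc[s.first_valid_index():]
        .isna()
        .value_counts()
        .rename({True: 'NaN', False: 'notNaN'})
    )

out = df.groupby('CODE')['TMAX'].apply(countNaN)
```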

How to split the regex result into multiple columns (Python)

I am trying to run a regex over a pandas DataFrame column using this script:
import pandas as pd

df1 = pd.DataFrame({'data': ['1gsmxx,2gsm', 'abc10gsm', '10gsm', '18gsm hhh4gsm',
                             'Abc:10gsm', '5gsmaaab3gsmABC55gsm', 'abc - 15gsm',
                             '3gsm,,ff40gsm', '9gsm', 'VV - fg 8gsm',
                             'kk 5gsm 00g', '001….abc..5gsm']})
df1['Result'] = df1['data'].str.findall(r'(\d{1,3}\s?gsm)')
OR
df2 = df1['data'].str.extractall(r'(\d{1,3}\s?gsm)').unstack()
However, this turns out as multiple results in one column.
Is it possible to get a result like the one below?
Use pandas.Series.str.extractall with unstack.
If you want your original series, use pandas.concat.
df2 = df1['data'].str.extractall('(\d{1,3}\s?gsm)').unstack()
df = pd.concat([df1, df2.droplevel(0, 1)], 1)
print(df)
Output:
                    data      0      1      2
0            1gsmxx,2gsm   1gsm   2gsm    NaN
1               abc10gsm  10gsm    NaN    NaN
2                  10gsm  10gsm    NaN    NaN
3          18gsm hhh4gsm  18gsm   4gsm    NaN
4              Abc:10gsm  10gsm    NaN    NaN
5   5gsmaaab3gsmABC55gsm   5gsm   3gsm  55gsm
6            abc - 15gsm  15gsm    NaN    NaN
7          3gsm,,ff40gsm   3gsm  40gsm    NaN
8                   9gsm   9gsm    NaN    NaN
9           VV - fg 8gsm   8gsm    NaN    NaN
10           kk 5gsm 00g   5gsm    NaN    NaN
11        001….abc..5gsm   5gsm    NaN    NaN
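A condensed, runnable version of the extractall + unstack pipeline (three sample rows only):

```python
import pandas as pd

df1 = pd.DataFrame({'data': ['1gsmxx,2gsm', 'abc10gsm', '5gsmaaab3gsmABC55gsm']})

# extractall gives one row per match with a 'match' index level;
# unstack pivots the match numbers into columns 0, 1, 2, ...
df2 = df1['data'].str.extractall(r'(\d{1,3}\s?gsm)')[0].unstack()
out = pd.concat([df1, df2], axis=1)
```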

Python pandas DataFrame, sum row's value which data is True

I am very new to pandas and even newer to programming.
I have a DataFrame of [500 rows x 24 columns].
The 500 rows are ranks of data and the 24 columns are years and months.
What I want is to:
select data from df
get each match's row position as an int
sum all the row positions
I did DATAF = df1[df1.isin(['MYDATA'])]
DATAF is something like below
    19_01   19_02   19_03   19_04  19_05
0     NaN  MYDATA     NaN     NaN    NaN
1  MYDATA     NaN  MYDATA     NaN    NaN
2     NaN     NaN     NaN  MYDATA    NaN
3     NaN     NaN     NaN     NaN    NaN
4     NaN     NaN     NaN     NaN    NaN
so I want to sum all the row values, which would be like 1 + 0 + 1 + 2.
It would be nicer if the sum were 2 + 1 + 2 + 3, because the rows are ranks of the data.
Is there any way to do this?
You can use np.where:
import numpy as np

rows, cols = np.where(DATAF.notna())
# rows: array([0, 1, 1, 2], dtype=int64)
print((rows + 1).sum())
# 8
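Put together with a small made-up frame (the column names and MYDATA placements are assumptions), the whole thing is:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'19_01': [np.nan, 'MYDATA', np.nan],
                    '19_02': ['MYDATA', np.nan, np.nan],
                    '19_03': [np.nan, 'MYDATA', np.nan],
                    '19_04': [np.nan, np.nan, 'MYDATA']})
DATAF = df1[df1.isin(['MYDATA'])]

# np.where yields the integer positions of every non-NaN cell.
rows, cols = np.where(DATAF.notna())
total = (rows + 1).sum()   # +1 shifts 0-based positions to 1-based ranks
```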

Pandas combines empty rows in Excel file to a single row in dataframe

I have different excel files that I am processing with Pandas. I need to remove a certain number of rows from the top of each file. These extra rows could be empty or they could contain text. Pandas is combining some of the rows so I am not sure how many need to be removed. For example:
Here is an example excel file (represented as csv):
,,
,,
some text,,
,,
,,
,,
name, date, task
Jason,1-Jan,swim
Aem,2-Jan,workout
Here is my current python script:
import pandas as pd
xl = pd.ExcelFile('extra_rows.xlsx')
dfs = xl.parse(xl.sheet_names[0])
print ("dfs: ", dfs)
Here is the results when I print the dataframe:
dfs:        Unnamed: 0           Unnamed: 1 Unnamed: 2
0      some other text                  NaN        NaN
1                  NaN                  NaN        NaN
2                  NaN                  NaN        NaN
3                  NaN                  NaN        NaN
4                 name                 date       task
5                Jason  2016-01-01 00:00:00       swim
6                  Aem  2016-01-02 00:00:00    workout
From the file, I would remove the first 6 rows. However, from the dataframe I would only remove 4. Is there a way to read in the Excel file with the data in its raw state so the number of rows remains consistent?
I used python3 and pandas 0.18.1. The Excel load function is pandas.read_excel; you can set the parameter header=None to achieve this. Here are sample codes:
(1) With default parameters, the result ignores leading blank lines:
In [12]: pd.read_excel('test.xlsx')
Out[12]:
  Unnamed: 0 Unnamed: 1 Unnamed: 2
0      text1        NaN        NaN
1        NaN        NaN        NaN
2         n1         t2         c3
3        NaN        NaN        NaN
4        NaN        NaN        NaN
5        jim        sum        tim
(2) With header=None, the result keeps leading blank lines:
In [13]: pd.read_excel('test.xlsx', header=None)
Out[13]:
       0    1    2
0    NaN  NaN  NaN
1    NaN  NaN  NaN
2  text1  NaN  NaN
3    NaN  NaN  NaN
4     n1   t2   c3
5    NaN  NaN  NaN
6    NaN  NaN  NaN
7    jim  sum  tim
Here is what you are looking for:
import pandas as pd
xl = pd.ExcelFile('extra_rows.xlsx')
dfs = xl.parse(skiprows=6)
print ("dfs: ", dfs)
Check the docs on ExcelFile for more details.
If you read your file in with pd.read_excel and pass header=None, the blank rows should be included:
In [286]: df = pd.read_excel("test.xlsx", header=None)
In [287]: df
Out[287]:
           0     1      2
0        NaN   NaN    NaN
1        NaN   NaN    NaN
2  something   NaN    NaN
3        NaN   NaN    NaN
4       name  date  other
5          1     2      3
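Building on header=None: once the row count matches the file, you can also locate the header row programmatically instead of hard-coding skiprows. A sketch where an in-memory frame stands in for pd.read_excel('extra_rows.xlsx', header=None), and using 'name' as the header marker is an assumption:

```python
import numpy as np
import pandas as pd

# Stand-in for: raw = pd.read_excel('extra_rows.xlsx', header=None)
raw = pd.DataFrame([
    [np.nan, np.nan, np.nan],
    [np.nan, np.nan, np.nan],
    ['some text', np.nan, np.nan],
    [np.nan, np.nan, np.nan],
    ['name', 'date', 'task'],
    ['Jason', '1-Jan', 'swim'],
    ['Aem', '2-Jan', 'workout'],
])

# Find the row whose first cell is the known header marker, then re-slice.
header_row = raw.index[raw[0] == 'name'][0]
df = raw.iloc[header_row + 1:].reset_index(drop=True)
df.columns = raw.iloc[header_row].tolist()
```

This way the number of junk rows at the top of each file no longer matters.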

Best way to eliminate columns with only one value from pandas dataframe

I'm trying to build a function to eliminate from my dataset the columns with only one value. I used this function:
def oneCatElimination(dataframe):
    columns = dataframe.columns.values
    for column in columns:
        if len(dataframe[column].value_counts().unique()) == 1:
            del dataframe[column]
    return dataframe
The problem is that the function also eliminates columns with more than one distinct value, e.g. an index column with integer numbers.
Just
df.dropna(thresh=2, axis=1)
will work. No need for anything else. It will keep all columns with 2 or more non-NA values (controlled by the value passed to thresh). The axis kwarg will let you work with rows or columns. It is rows by default, so you need to pass axis=1 explicitly to work on columns (I forgot this at the time I answered, hence this edit). See dropna() for more information.
A couple of assumptions went into this:
Null/NA values don't count
You need multiple non-NA values to keep a column
Those values need to be different in some way (e.g., a column full of 1's and only 1's should be dropped)
All that said, I would select the columns with a boolean mask.
If you start with this dataframe:
import pandas
df = pandas.DataFrame(index=range(10), columns=list('ABCD'))
df.loc[2, 'A'] = 23
df.loc[3, 'B'] = 52
df.loc[4, 'B'] = 36
df.loc[5, 'C'] = 11
df.loc[6, 'C'] = 11
df.loc[7, 'D'] = 43
df.loc[8, 'D'] = 63
df.loc[9, 'D'] = 97
df
Which creates:
     A    B    C    D
0  NaN  NaN  NaN  NaN
1  NaN  NaN  NaN  NaN
2   23  NaN  NaN  NaN
3  NaN   52  NaN  NaN
4  NaN   36  NaN  NaN
5  NaN  NaN   11  NaN
6  NaN  NaN   11  NaN
7  NaN  NaN  NaN   43
8  NaN  NaN  NaN   63
9  NaN  NaN  NaN   97
Given my assumptions above, columns A and C should be dropped since A only has one value and both of C's values are the same. You can then do:
df.loc[:, df.apply(lambda c: c.dropna().nunique() > 1)]
And that gives me:
     B    D
0  NaN  NaN
1  NaN  NaN
2  NaN  NaN
3   52  NaN
4   36  NaN
5  NaN  NaN
6  NaN  NaN
7  NaN   43
8  NaN   63
9  NaN   97
This will work for both text and numbers:
for col in list(dataframe.columns):
    if len(dataframe.loc[:, col].unique()) == 1:
        dataframe.pop(col)
Note: This will remove the columns having only one value from the original dataframe.
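A compact sketch of the boolean-mask version on a made-up frame (the column names are assumptions); NaNs are ignored, so a column with a single repeated non-NaN value is dropped too:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3],            # several distinct values: kept
    'const': ['x', 'x', 'x'],   # one distinct value: dropped
    'partial': [np.nan, 5, 5],  # one distinct non-NaN value: dropped
})

# Keep only columns with more than one distinct non-NaN value.
keep = df.apply(lambda c: c.dropna().nunique() > 1)
result = df.loc[:, keep]
```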
