check if column is blank in pandas dataframe - python

I have the following CSV file:
A|B|C
1100|8718|2021-11-21
1104|21|
I want to create a dataframe that gives me the date output as follows:
A B C
0 1100 8718 20211121000000
1 1104 21 ""
This means:
if C is empty:
    put double quotes
else:
    format the date as yyyymmddhhmmss (appending zeros for hhmmss)
My code:
df['C'] = np.where(df['C'].empty, df['C'].str.replace('', '""'), df['C'] + '000000')
but it gives me the following:
A B C
0 1100 8718 2021-11-21
1 1104 21 0
I have tried another piece of code:
if df['C'].empty:
    df['C'] = df['C'].str.replace('', '""')
else:
    df['C'] = df['C'].str.replace('-', '') + '000000'
OUTPUT:
A B C
0 1100 8718 20211121000000
1 1104 21 0000000

Use dt.strftime:
df = pd.read_csv('data.csv', sep='|', parse_dates=['C'])
df['C'] = df['C'].dt.strftime('%Y%m%d%H%M%S').fillna('""')
print(df)
# Output:
A B C
0 1100 8718 20211121000000
1 1104 21 ""
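
Note that Series.empty in the question's attempt is a single boolean for the whole Series, not a per-row test, which is why the np.where version misbehaved. If you would rather keep C as plain strings instead of parsing dates, here is a minimal sketch of the same fix, assuming the data.csv layout shown above:

import pandas as pd
import numpy as np

df = pd.read_csv('data.csv', sep='|', dtype=str)     # blank cells come in as NaN
blank = df['C'].isna() | df['C'].eq('')              # per-row "is blank" test
df['C'] = np.where(blank, '""',
                   df['C'].str.replace('-', '', regex=False) + '000000')
print(df)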

A good way would be to convert the column to datetime using pd.to_datetime with the parameter errors='coerce', then drop the resulting NaT values.
import pandas as pd

x = pd.DataFrame({
    'one': 20211121000000,
    'two': 'not true',
    'three': '20211230'
}, index=[1])

x.apply(lambda x: pd.to_datetime(x, errors='coerce')).T.dropna()
# Output:
1
one 1970-01-01 05:36:51.121
three 2021-12-30 00:00:00.000
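
Tied back to the question's column, the same coerce idea can be combined with the strftime/fillna step from the first answer instead of dropna, so no rows are lost. A rough sketch, again assuming the data.csv file from the question:

import pandas as pd

df = pd.read_csv('data.csv', sep='|')
c = pd.to_datetime(df['C'], errors='coerce')            # blank cells become NaT
df['C'] = c.dt.strftime('%Y%m%d%H%M%S').fillna('""')    # NaT -> NaN after strftime -> '""'
print(df)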

Related

Python : Remove all data in a column of a dataframe and keep the last value in the first row

Let's say that I have a simple DataFrame.
import pandas as pd
data1 = [12,34,'fsdf',678,'','','dfs','','']
df1 = pd.DataFrame(data1, columns= ['Data'])
print(df1)
Data
0 12
1 34
2 fsdf
3 678
4
5
6 dfs
7
8
I want to delete all the data in the column except the last non-empty value, which I want to keep in the first row. The column can have thousands of rows. So I would like this result:
Data
0 dfs
1
2
3
4
5
6
7
8
And I have to keep the shape of this dataframe, so rows must not be removed.
What are the simplest functions to do that efficiently?
Thank you
Get the index of the last non-empty string value and assign that value to the first row of the column:
s = df1.loc[df1['Data'].iloc[::-1].ne('').idxmax(), 'Data']
print (s)
dfs
df1['Data'] = ''
df1.loc[0, 'Data'] = s
print (df1)
Data
0 dfs
1
2
3
4
5
6
7
8
If empty strings are missing values:
import numpy as np

data1 = [12, 34, 'fsdf', 678, np.nan, np.nan, 'dfs', np.nan, np.nan]
df1 = pd.DataFrame(data1, columns=['Data'])
print(df1)
Data
0 12
1 34
2 fsdf
3 678
4 NaN
5 NaN
6 dfs
7 NaN
8 NaN
s = df1.loc[df1['Data'].iloc[::-1].notna().idxmax(), 'Data']
print (s)
dfs
df1['Data'] = ''
df1.loc[0, 'Data'] = s
print (df1)
Data
0 dfs
1
2
3
4
5
6
7
8
A simple pandas condition check like this can help:
df1['Data'] = [df1.loc[df1['Data'].ne(""), "Data"].iloc[-1]] + [''] * (len(df1) - 1)
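The same one-liner, spelled out in two steps with comments (just a readability sketch, same logic):

last_val = df1.loc[df1['Data'].ne(''), 'Data'].iloc[-1]   # last non-empty value in the column
df1['Data'] = [last_val] + [''] * (len(df1) - 1)          # put it in row 0, blank the rest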
You can replace '' with NaN using df.replace, then use df.last_valid_index:
val = df1.loc[df1.replace('', np.nan).last_valid_index(), 'Data']
# Below two lines taken from #jezrael's answer
df1.loc[0, 'Data'] = val
df1.loc[1:, 'Data'] = ''
Or
You can use np.full with fill_value set to np.nan here.
val = df1.loc[df1.replace("", np.nan).last_valid_index(), "Data"]
df1 = pd.DataFrame(np.full(df1.shape, np.nan),
                   index=df1.index,
                   columns=df1.columns)
df1.loc[0, "Data"] = val
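Note that this variant leaves NaN in rows 1 onward rather than the empty strings shown in the question; if the empty-string output is wanted, a small follow-up fill restores it:

df1 = df1.fillna('')   # replace the NaN filler with empty strings to match the question's output
print(df1)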

Pandas replace() on all masked values

I want to replace 'bee' with 'ass' on all masked values m in df.
import pandas as pd
data = {'Data1': [899, 900, 901, 902],
        'Data2': ['as-bee', 'be-bee', 'bee-be', 'bee-as']}
df = pd.DataFrame(data)
Data1 Data2
0 899 as-bee
1 900 be-bee
2 901 bee-be
3 902 bee-as
wrong = {'Data1':[900,901]}
df1 = pd.DataFrame(wrong)
Data1
0 900
1 901
m = df['Data1'].isin(wrong['Data1'])
df[m]['Data2'].apply(lambda x: x.replace('bee','aas'))
1 be-aas
2 aas-be
Name: Data2, dtype: object
It returns the desired changes, but the values in df do not change. Doing df[m]['Data2']=df[m]['Data2'].apply(lambda x: x.replace('bee','aas')) does not help either, as it raises an error.
IIUC, you can do this using:
Method 1: df.loc[]:
m = df.Data1.isin(df1.Data1)  # boolean mask
df.loc[m, 'Data2'] = df.loc[m, 'Data2'].replace('bee', 'ass', regex=True)
print(df)
Method 2: np.where() (needs import numpy as np):
m = df.Data1.isin(df1.Data1)
df.Data2 = np.where(m, df.Data2.replace('bee', 'ass', regex=True), df.Data2)
print(df)
Data1 Data2
0 899 as-bee
1 900 be-ass
2 901 ass-be
3 902 bee-as
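
If you prefer plain substring replacement over the regex flavour, Series.str.replace through .loc is an equivalent sketch:

m = df['Data1'].isin(df1['Data1'])
df.loc[m, 'Data2'] = df.loc[m, 'Data2'].str.replace('bee', 'ass', regex=False)
print(df)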

Why does concat Series to DataFrame with index matching columns not work?

I want to append a Series to a DataFrame where the Series's index matches the DataFrame's columns, using pd.concat, but the result surprises me:
df = pd.DataFrame(columns=['a', 'b'])
sr = pd.Series(data=[1,2], index=['a', 'b'], name=1)
pd.concat([df, sr], axis=0)
Out[11]:
a b 0
a NaN NaN 1.0
b NaN NaN 2.0
What I expected is of course:
df.append(sr)
Out[14]:
a b
1 1 2
It really surprises me that pd.concat is not index-columns aware. So is it true that if I want to concat a Series as a new row to a DF, then I can only use df.append instead?
You need a DataFrame from the Series, via to_frame and transpose:
a = pd.concat([df, sr.to_frame(1).T])
print (a)
a b
1 1 2
Detail:
print (sr.to_frame(1).T)
a b
1 1 2
Or use setting with enlargement:
df.loc[1] = sr
print (df)
a b
1 1 2
"df.loc[1] = sr" will drop the column if it isn't in df
df = pd.DataFrame(columns = ['a','b'])
sr = pd.Series({'a':1,'b':2,'c':3})
df.loc[1] = sr
df will be like:
a b
1 1 2
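
For what it's worth, in recent pandas versions DataFrame.append has been removed, so the to_frame().T + concat route above is also the forward-compatible one, and unlike df.loc[1] = sr it keeps keys that are not yet columns. A small sketch using the extra 'c' key from the comment above:

import pandas as pd

df = pd.DataFrame(columns=['a', 'b'])
sr = pd.Series({'a': 1, 'b': 2, 'c': 3}, name=1)
df = pd.concat([df, sr.to_frame().T])   # column 'c' is added instead of dropped
print(df)
#    a  b  c
# 1  1  2  3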

Displaying only the intersection of date range rows in pandas

Following from here
import pandas as pd
data = {'date': ['1998-03-01 00:00:01', '2001-04-01 00:00:01', '1998-06-01 00:00:01',
                 '2001-08-01 00:00:01', '2001-05-03 00:00:01', '1994-03-01 00:00:01'],
        'node1': [1, 1, 2, 2, 3, 2],
        'node2': [8, 316, 26, 35, 44, 56],
        'weight': [1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data, columns=['date', 'node1', 'node2', 'weight'])
df['date'] = pd.to_datetime(df['date'])
mask = df.groupby('node1').apply(lambda x : (x['date'].dt.year.isin([1998,1999,2000])).any())
mask2 = df.groupby('node1').apply(lambda x : (x['date'].dt.year.isin([2001,2002,2003])).any())
print(df[df['node1'].isin(mask[mask & mask2].index)])
The output I require is the nodes which appear in the year range (98-00) and also in (01-03); it should only display the rows for nodes present in both ranges.
Expected Output-
node1 node2 date
1 8 1998-03-01
1 316 2001-04-01
2 26 1998-06-01
2 35 2001-08-01
Right now this code also prints the row 2 56 1994-03-01.
One simple solution is to first remove the dates that fall in neither of the two ranges, then apply the masks, i.e.
l1 = [1998,1999,2000]
l2 = [2001,2002,2003]
ndf = df[df['date'].dt.year.isin(l1+l2)]
After getting the ndf:
Option 1: You can go for dual groupby mask based approach i.e
mask = ndf.groupby('node1').apply(lambda x : (x['date'].dt.year.isin(l1)).any())
mask2 = ndf.groupby('node1').apply(lambda x : (x['date'].dt.year.isin(l2)).any())
new = ndf[ndf['node1'].isin(mask[mask & mask2].index)]
Thank you @Zero
Option 2: You can go for groupby transform
new = ndf[ndf.groupby('node1')['date'].transform(lambda x: x.dt.year.isin(l1).any() & x.dt.year.isin(l2).any())]
Option 3: groupby filter
new = ndf.groupby('node1').filter(lambda x: x['date'].dt.year.isin(l1).any() & x['date'].dt.year.isin(l2).any())
Output :
date node1 node2 weight
0 1998-03-01 00:00:01 1 8 1
1 2001-04-01 00:00:01 1 316 1
2 1998-06-01 00:00:01 2 26 1
3 2001-08-01 00:00:01 2 35 1
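
An equivalent way to express the intersection (a sketch, same l1/l2 and ndf as above) is to collect the node1 values seen in each range and intersect the two sets:

in_l1 = set(ndf.loc[ndf['date'].dt.year.isin(l1), 'node1'])
in_l2 = set(ndf.loc[ndf['date'].dt.year.isin(l2), 'node1'])
new = ndf[ndf['node1'].isin(in_l1 & in_l2)]
print(new)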

Iterating through pandas string index turned them into floats

I have a csv file:
SID done good_ecg good_gsr good_resp comment
436 0 1 1
2411 1 1 1
3858 0 1 1
4517 0 1 1 117 min diff between files
9458 1 0 1 ######### error in my script
9754 0 1 1 trigger fehler
#REF!
88.8888888889
Which I load in a pandas dataframe it like this:
df = pandas.read_csv(f ,delimiter="\t", dtype="str", index_col='SID')
I want to iterate through the index and print each one. But when I try
for subj in df.index:
    print(subj)
I get
436.0
2411.0
...
Now there is this '.0' at the end of each number. What am I doing wrong?
I have also tried iterating with iterrows() and have the same problem.
Thank you for any help!
EDIT: Here is the whole code I am using:
import pandas

def write():
    df = pandas.read_csv("overview.csv", delimiter="\t", dtype="str", index_col='SID')
    for subj in df.index:
        print(subj)

write()
Ah. The dtype parameter doesn't apply to the index_col:
>>> !cat sindex.csv
a,b,c
123,50,R
234,51,R
>>> df = pd.read_csv("sindex.csv", dtype="str", index_col="a")
>>> df
b c
a
123 50 R
234 51 R
>>> df.index
Int64Index([123, 234], dtype='int64', name='a')
Instead, read it in without an index_col (None is actually the default, so you don't need index_col=None at all, but here I'll be explicit) and then set the index:
>>> df = pd.read_csv("sindex.csv", dtype="str", index_col=None)
>>> df = df.set_index("a")
>>> df
b c
a
123 50 R
234 51 R
>>> df.index
Index(['123', '234'], dtype='object', name='a')
(I can't think of circumstances under which df.index would have dtype object but when you iterate over it you'd get integers, but you didn't actually show any self-contained code that generated that problem.)
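
A workaround sometimes used, assuming a pandas version where converters run before the index is set, is to keep index_col and force the conversion per column via converters; treat this as a hedged sketch rather than guaranteed behaviour:

df = pd.read_csv("sindex.csv", index_col="a", converters={"a": str})
# df.index should then be Index(['123', '234'], dtype='object', name='a')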
