check if column is blank in pandas dataframe - python

I have the following CSV file:
A|B|C
1100|8718|2021-11-21
1104|21|
I want to create a dataframe that gives me the date output as follows:
A B C
0 1100 8718 20211121000000
1 1104 21 ""
This means:
if C is empty:
    put double quotes
else:
    format the date as yyyymmddhhmmss (appending zeros for hhmmss)
My code:
df['C'] = np.where(df['C'].empty, df['C'].str.replace('', '""'), df['C'] + '000000')
but it gives me the following:
A B C
0 1100 8718 2021-11-21
1 1104 21 0
I have tried another piece of code:
if df['C'].empty:
    df['C'] = df['C'].str.replace('', '""')
else:
    df['C'] = df['C'].str.replace('-', '') + '000000'
OUTPUT:
A B C
0 1100 8718 20211121000000
1 1104 21 0000000

Use dt.strftime:
df = pd.read_csv('data.csv', sep='|', parse_dates=['C'])
df['C'] = df['C'].dt.strftime('%Y%m%d%H%M%S').fillna('""')
print(df)
# Output:
A B C
0 1100 8718 20211121000000
1 1104 21 ""
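
Note that Series.empty in the question's attempt is a single boolean for the whole Series, not a per-row test, which is why the np.where version misbehaved. If you would rather keep C as plain strings instead of parsing dates, here is a minimal sketch of the same fix, assuming the data.csv layout shown above:

import pandas as pd
import numpy as np

df = pd.read_csv('data.csv', sep='|', dtype=str)     # blank cells come in as NaN
blank = df['C'].isna() | df['C'].eq('')              # per-row "is blank" test
df['C'] = np.where(blank, '""',
                   df['C'].str.replace('-', '', regex=False) + '000000')
print(df)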

A good way would be to convert the column to datetime using pd.to_datetime with the parameter errors='coerce', then drop the resulting NaT values.
import pandas as pd

x = pd.DataFrame({
    'one': 20211121000000,
    'two': 'not true',
    'three': '20211230'
}, index=[1])

x.apply(lambda x: pd.to_datetime(x, errors='coerce')).T.dropna()
# Output:
1
one 1970-01-01 05:36:51.121
three 2021-12-30 00:00:00.000
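
Tied back to the question's column, the same coerce idea can be combined with the strftime/fillna step from the first answer instead of dropna, so no rows are lost. A rough sketch, again assuming the data.csv file from the question:

import pandas as pd

df = pd.read_csv('data.csv', sep='|')
c = pd.to_datetime(df['C'], errors='coerce')            # blank cells become NaT
df['C'] = c.dt.strftime('%Y%m%d%H%M%S').fillna('""')    # NaT -> NaN after strftime -> '""'
print(df)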

Related

Python : Remove all data in a column of a dataframe and keep the last value in the first row

Let's say that I have a simple DataFrame.
import pandas as pd
data1 = [12,34,'fsdf',678,'','','dfs','','']
df1 = pd.DataFrame(data1, columns= ['Data'])
print(df1)
Data
0 12
1 34
2 fsdf
3 678
4
5
6 dfs
7
8
I want to delete all the data in the column except the last non-empty value, which I want to keep in the first row. The column can have thousands of rows. So I would like this result:
Data
0 dfs
1
2
3
4
5
6
7
8
And I have to keep the shape of this dataframe, so rows must not be removed.
What are the simplest functions to do that efficiently?
Thank you
Get the index of the last non-empty string value and assign that value to the first row of the column:
s = df1.loc[df1['Data'].iloc[::-1].ne('').idxmax(), 'Data']
print (s)
dfs
df1['Data'] = ''
df1.loc[0, 'Data'] = s
print (df1)
Data
0 dfs
1
2
3
4
5
6
7
8
If empty strings are missing values:
import numpy as np

data1 = [12, 34, 'fsdf', 678, np.nan, np.nan, 'dfs', np.nan, np.nan]
df1 = pd.DataFrame(data1, columns=['Data'])
print(df1)
Data
0 12
1 34
2 fsdf
3 678
4 NaN
5 NaN
6 dfs
7 NaN
8 NaN
s = df1.loc[df1['Data'].iloc[::-1].notna().idxmax(), 'Data']
print (s)
dfs
df1['Data'] = ''
df1.loc[0, 'Data'] = s
print (df1)
Data
0 dfs
1
2
3
4
5
6
7
8
A simple pandas condition check like this can help:
df1['Data'] = [df1.loc[df1['Data'].ne(""), "Data"].iloc[-1]] + [''] * (len(df1) - 1)
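The same one-liner, spelled out in two steps with comments (just a readability sketch, same logic):

last_val = df1.loc[df1['Data'].ne(''), 'Data'].iloc[-1]   # last non-empty value in the column
df1['Data'] = [last_val] + [''] * (len(df1) - 1)          # put it in row 0, blank the rest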
You can replace '' with NaN using df.replace, then use df.last_valid_index:
val = df1.loc[df1.replace('', np.nan).last_valid_index(), 'Data']
# Below two lines taken from #jezrael's answer
df1.loc[0, 'Data'] = val
df1.loc[1:, 'Data'] = ''
Or
You can use np.full with fill_value set to np.nan here.
val = df1.loc[df1.replace("", np.nan).last_valid_index(), "Data"]
df1 = pd.DataFrame(np.full(df1.shape, np.nan),
                   index=df1.index,
                   columns=df1.columns)
df1.loc[0, "Data"] = val
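Note that this variant leaves NaN in rows 1 onward rather than the empty strings shown in the question; if the empty-string output is wanted, a small follow-up fill restores it:

df1 = df1.fillna('')   # replace the NaN filler with empty strings to match the question's output
print(df1)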

Pandas replace() on all masked values

I want to replace 'bee' with 'ass' on all masked values m in df.
import pandas as pd
data = {'Data1': [899, 900, 901, 902],
        'Data2': ['as-bee', 'be-bee', 'bee-be', 'bee-as']}
df = pd.DataFrame(data)
Data1 Data2
0 899 as-bee
1 900 be-bee
2 901 bee-be
3 902 bee-as
wrong = {'Data1':[900,901]}
df1 = pd.DataFrame(wrong)
Data1
0 900
1 901
m = df['Data1'].isin(wrong['Data1'])
df[m]['Data2'].apply(lambda x: x.replace('bee','aas'))
1 be-aas
2 aas-be
Name: Data2, dtype: object
It returns the desired changes, but the values in df do not change. Doing df[m]['Data2']=df[m]['Data2'].apply(lambda x: x.replace('bee','aas')) does not help either, as it raises an error.
IIUC, you can do this using:
Method 1: df.loc[]:
m = df.Data1.isin(df1.Data1)  # boolean mask
df.loc[m, 'Data2'] = df.loc[m, 'Data2'].replace('bee', 'ass', regex=True)
print(df)
Method 2: np.where() (needs import numpy as np):
m = df.Data1.isin(df1.Data1)
df.Data2 = np.where(m, df.Data2.replace('bee', 'ass', regex=True), df.Data2)
print(df)
Data1 Data2
0 899 as-bee
1 900 be-ass
2 901 ass-be
3 902 bee-as
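
If you prefer plain substring replacement over the regex flavour, Series.str.replace through .loc is an equivalent sketch:

m = df['Data1'].isin(df1['Data1'])
df.loc[m, 'Data2'] = df.loc[m, 'Data2'].str.replace('bee', 'ass', regex=False)
print(df)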

Why does concat Series to DataFrame with index matching columns not work?

I want to append a Series to a DataFrame where the Series's index matches the DataFrame's columns, using pd.concat, but the result surprises me:
df = pd.DataFrame(columns=['a', 'b'])
sr = pd.Series(data=[1,2], index=['a', 'b'], name=1)
pd.concat([df, sr], axis=0)
Out[11]:
a b 0
a NaN NaN 1.0
b NaN NaN 2.0
What I expected is of course:
df.append(sr)
Out[14]:
a b
1 1 2
It really surprises me that pd.concat is not index-columns aware. So is it true that if I want to concat a Series as a new row to a DF, then I can only use df.append instead?
You need a DataFrame from the Series, via to_frame and transpose:
a = pd.concat([df, sr.to_frame(1).T])
print (a)
a b
1 1 2
Detail:
print (sr.to_frame(1).T)
a b
1 1 2
Or use setting with enlargement:
df.loc[1] = sr
print (df)
a b
1 1 2
"df.loc[1] = sr" will drop the column if it isn't in df
df = pd.DataFrame(columns = ['a','b'])
sr = pd.Series({'a':1,'b':2,'c':3})
df.loc[1] = sr
df will be like:
a b
1 1 2
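
For what it's worth, in recent pandas versions DataFrame.append has been removed, so the to_frame().T + concat route above is also the forward-compatible one, and unlike df.loc[1] = sr it keeps keys that are not yet columns. A small sketch using the extra 'c' key from the comment above:

import pandas as pd

df = pd.DataFrame(columns=['a', 'b'])
sr = pd.Series({'a': 1, 'b': 2, 'c': 3}, name=1)
df = pd.concat([df, sr.to_frame().T])   # column 'c' is added instead of dropped
print(df)
#    a  b  c
# 1  1  2  3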

Displaying only the intersection of date range rows in pandas

Following from here
import pandas as pd
data = {'date': ['1998-03-01 00:00:01', '2001-04-01 00:00:01', '1998-06-01 00:00:01',
                 '2001-08-01 00:00:01', '2001-05-03 00:00:01', '1994-03-01 00:00:01'],
        'node1': [1, 1, 2, 2, 3, 2],
        'node2': [8, 316, 26, 35, 44, 56],
        'weight': [1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data, columns=['date', 'node1', 'node2', 'weight'])
df['date'] = pd.to_datetime(df['date'])
mask = df.groupby('node1').apply(lambda x : (x['date'].dt.year.isin([1998,1999,2000])).any())
mask2 = df.groupby('node1').apply(lambda x : (x['date'].dt.year.isin([2001,2002,2003])).any())
print(df[df['node1'].isin(mask[mask & mask2].index)])
The output I require is the nodes which appear in the year range (98-00) and also in (01-03); it should only display the rows for nodes present in both ranges.
Expected Output-
node1 node2 date
1 8 1998-03-01
1 316 2001-04-01
2 26 1998-06-01
2 35 2001-08-01
Right now this code also prints the row 2 56 1994-03-01.
One simple solution is to first remove the dates that fall in neither of the two ranges, then apply the masks, i.e.
l1 = [1998,1999,2000]
l2 = [2001,2002,2003]
ndf = df[df['date'].dt.year.isin(l1+l2)]
After getting the ndf:
Option 1: You can go for dual groupby mask based approach i.e
mask = ndf.groupby('node1').apply(lambda x : (x['date'].dt.year.isin(l1)).any())
mask2 = ndf.groupby('node1').apply(lambda x : (x['date'].dt.year.isin(l2)).any())
new = ndf[ndf['node1'].isin(mask[mask & mask2].index)]
Thank you @Zero
Option 2: You can go for groupby transform
new = ndf[ndf.groupby('node1')['date'].transform(lambda x: x.dt.year.isin(l1).any() & x.dt.year.isin(l2).any())]
Option 3: groupby filter
new = ndf.groupby('node1').filter(lambda x: x['date'].dt.year.isin(l1).any() & x['date'].dt.year.isin(l2).any())
Output :
date node1 node2 weight
0 1998-03-01 00:00:01 1 8 1
1 2001-04-01 00:00:01 1 316 1
2 1998-06-01 00:00:01 2 26 1
3 2001-08-01 00:00:01 2 35 1
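
An equivalent way to express the intersection (a sketch, same l1/l2 and ndf as above) is to collect the node1 values seen in each range and intersect the two sets:

in_l1 = set(ndf.loc[ndf['date'].dt.year.isin(l1), 'node1'])
in_l2 = set(ndf.loc[ndf['date'].dt.year.isin(l2), 'node1'])
new = ndf[ndf['node1'].isin(in_l1 & in_l2)]
print(new)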

Iterating through pandas string index turned them into floats

I have a csv file:
SID done good_ecg good_gsr good_resp comment
436 0 1 1
2411 1 1 1
3858 0 1 1
4517 0 1 1 117 min diff between files
9458 1 0 1 ######### error in my script
9754 0 1 1 trigger fehler
#REF!
88.8888888889
Which I load in a pandas dataframe it like this:
df = pandas.read_csv(f ,delimiter="\t", dtype="str", index_col='SID')
I want to iterate through the index and print each one. But when I try
for subj in df.index:
    print(subj)
I get
436.0
2411.0
...
Now there is this '.0' at the end of each number. What am I doing wrong?
I have also tried iterating with iterrows() and have the same problem.
Thank you for any help!
EDIT: Here is the whole code I am using:
import pandas

def write():
    df = pandas.read_csv("overview.csv", delimiter="\t", dtype="str", index_col='SID')
    for subj in df.index:
        print(subj)

write()
Ah. The dtype parameter doesn't apply to the index_col:
>>> !cat sindex.csv
a,b,c
123,50,R
234,51,R
>>> df = pd.read_csv("sindex.csv", dtype="str", index_col="a")
>>> df
b c
a
123 50 R
234 51 R
>>> df.index
Int64Index([123, 234], dtype='int64', name='a')
Instead, read it in without an index_col (None is actually the default, so you don't need index_col=None at all, but here I'll be explicit) and then set the index:
>>> df = pd.read_csv("sindex.csv", dtype="str", index_col=None)
>>> df = df.set_index("a")
>>> df
b c
a
123 50 R
234 51 R
>>> df.index
Index(['123', '234'], dtype='object', name='a')
(I can't think of circumstances under which df.index would have dtype object but when you iterate over it you'd get integers, but you didn't actually show any self-contained code that generated that problem.)
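
A workaround sometimes used, assuming a pandas version where converters run before the index is set, is to keep index_col and force the conversion per column via converters; treat this as a hedged sketch rather than guaranteed behaviour:

df = pd.read_csv("sindex.csv", index_col="a", converters={"a": str})
# df.index should then be Index(['123', '234'], dtype='object', name='a')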
