Pandas replace() on all masked values - python

I want to replace 'bee' with 'ass' in column Data2, in all rows of df selected by the mask m.
import pandas as pd
data = {'Data1': [899, 900, 901, 902],
        'Data2': ['as-bee', 'be-bee', 'bee-be', 'bee-as']}
df = pd.DataFrame(data)
Data1 Data2
0 899 as-bee
1 900 be-bee
2 901 bee-be
3 902 bee-as
wrong = {'Data1':[900,901]}
df1 = pd.DataFrame(wrong)
Data1
0 900
1 901
m = df['Data1'].isin(wrong['Data1'])
df[m]['Data2'].apply(lambda x: x.replace('bee', 'ass'))
1 be-ass
2 ass-be
Name: Data2, dtype: object
It returns the desired changes, but the values in df do not change. Doing df[m]['Data2'] = df[m]['Data2'].apply(lambda x: x.replace('bee', 'ass')) does not help either, as it raises an error.

IIUC, you can do this using either of the two methods below.
Method 1: df.loc[]:
m = df.Data1.isin(df1.Data1)  # boolean mask
df.loc[m, 'Data2'] = df.loc[m, 'Data2'].replace('bee', 'ass', regex=True)
print(df)
Method 2: np.where():
import numpy as np
m = df.Data1.isin(df1.Data1)
df.Data2 = np.where(m, df.Data2.replace('bee', 'ass', regex=True), df.Data2)
print(df)
Data1 Data2
0 899 as-bee
1 900 be-ass
2 901 ass-be
3 902 bee-as
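For context on why the original assignment fails: df[m] returns a copy, so df[m]['Data2'] = ... writes into that temporary copy, never into df (chained indexing). A single .loc call selects rows and column together and writes through to df. A minimal sketch using Series.str.replace, which is equivalent here to replace(..., regex=True):
# df[m]['Data2'] = ... modifies a temporary copy (chained indexing);
# one .loc call writes into df itself
df.loc[m, 'Data2'] = df.loc[m, 'Data2'].str.replace('bee', 'ass', regex=False)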


Erroneous column concatenation Python

I have a data frame where, if the first column is empty, I have to fill it by concatenating the other two columns.
Cuenta CeCo  GLAccount  CeCoCeBe
123 A        123        A
234 S        234        S
NaN          345        B
NaN          987        A
for x in df1["Cuenta CeCo"].isna():
    if x:
        df1["Cuenta CeCo"] = df1["GLAccount"].apply(str) + " " + df1["CeCoCeBe"]
    else:
        df1["Cuenta CeCo"]
TYPES:
df1["Cuenta CeCo"] = dtype('O')
df1["GLAccount"] = dtype('float64')
df1["CeCoCeBe"] = dtype('O')
expected output:
Cuenta CeCo  GLAccount  CeCoCeBe
123 A        123        A
234 S        234        S
345 B        345        B
987 A        987        A
However, when concatenating, it seems to do something strange and gives me different numbers and letters:
Cuenta CeCo
251 O
471 B
791 R
341 O
Could someone help me understand why this happens and how to correct it so I get my expected output?
Iterating over dataframes is typically bad practice and not what you intend. As written, your loop iterates over the boolean values of df1["Cuenta CeCo"].isna(), and each time one is True it reassigns the entire column rather than just that row. (Note also that iterating over a DataFrame itself yields its column names; try
for x in df:
    print(x)
and you will see it print the column headings.)
As for what you're trying to do, try this:
cols = ['Cuenta CeCo', 'GLAccount', 'CeCoCeBe']
mask = df[cols[0]].isna()
df.loc[mask, cols[0]] = df.loc[mask, cols[1]].map(str) + " " + df.loc[mask, cols[2]]
This generates a mask (in this case a series of True and False) that we use to get a series of just the NaN rows, then replace them by getting the string of the second column and concatenating with the third, using the mask again to get only the rows we need.
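One caveat, since the question lists GLAccount as dtype('float64'): map(str) on a float column yields strings like '345.0'. If the intended output is '345 B', cast to int first (a sketch, assuming the masked rows have no NaN in GLAccount):
# cast float64 -> int -> str so '345.0' becomes '345'
df.loc[mask, cols[0]] = df.loc[mask, cols[1]].astype(int).astype(str) + " " + df.loc[mask, cols[2]]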
import pandas as pd
import numpy as np

df = pd.DataFrame([
    ['123 A', 123, 'A'],
    ['234 S', 234, 'S'],
    [np.NaN, 345, 'B'],
    [np.NaN, 987, 'A']
], columns=['Cuenta CeCo', 'GLAccount', 'CeCoCeBe'])

def f(r):
    # keep the existing value when present, otherwise build it
    if pd.notna(r['Cuenta CeCo']):
        return r['Cuenta CeCo']
    else:
        return f"{r['GLAccount']} {r['CeCoCeBe']}"

df['Cuenta CeCo'] = df.apply(f, axis=1)
df
prints
index  Cuenta CeCo  GLAccount  CeCoCeBe
0      123 A        123        A
1      234 S        234        S
2      345 B        345        B
3      987 A        987        A
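For reference, a vectorized alternative to the row-wise apply is Series.fillna, which fills only the missing rows; a minimal sketch against the same df (GLAccount is int64 here, so astype(str) gives clean numbers):
# fillna aligns on the index and only touches the NaN rows
df['Cuenta CeCo'] = df['Cuenta CeCo'].fillna(
    df['GLAccount'].astype(str) + ' ' + df['CeCoCeBe']
)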

check if column is blank in pandas dataframe

I have the following csv file:
A|B|C
1100|8718|2021-11-21
1104|21|
I want to create a dataframe that gives me the date output as follows:
A B C
0 1100 8718 20211121000000
1 1104 21 ""
This means:
if C is empty:
    put double quotes
else:
    format the date as yyyymmddhhmmss (appending 0s for hhmmss)
My code:
df['C'] = np.where(df['C'].empty, df['C'].str.replace('', '""'), df['C'] + '000000')
but it gives me this:
A B C
0 1100 8718 2021-11-21
1 1104 21 0
I have tried another piece of code:
if df['C'].empty:
    df['C'] = df['C'].str.replace('', '""')
else:
    df['C'] = df['C'].str.replace('-', '') + '000000'
OUTPUT:
A B C
0 1100 8718 20211121000000
1 1104 21 0000000
Use dt.strftime:
df = pd.read_csv('data.csv', sep='|', parse_dates=['C'])
df['C'] = df['C'].dt.strftime('%Y%m%d%H%M%S').fillna('""')
print(df)
# Output:
A B C
0 1100 8718 20211121000000
1 1104 21 ""
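As an aside on why the original attempts misbehave: df['C'].empty is a single boolean for the whole Series (True only when it has no elements at all), not an element-wise test, so it never checks individual rows. The per-row mask is df['C'].isna(); a sketch of the element-wise fix, assuming the column was read as plain strings:
import numpy as np
# NaN rows get '""'; the rest get the dashes stripped and '000000' appended
m = df['C'].isna()
df['C'] = np.where(m, '""', df['C'].str.replace('-', '', regex=False) + '000000')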
A good way would be to convert the column into datetime using pd.to_datetime with errors='coerce', then drop the resulting NaT values.
import pandas as pd

x = pd.DataFrame({
    'one': 20211121000000,
    'two': 'not true',
    'three': '20211230'
}, index=[1])
x.apply(lambda x: pd.to_datetime(x, errors='coerce')).T.dropna()
# Output:
1
one 1970-01-01 05:36:51.121
three 2021-12-30 00:00:00.000

Iterating through pandas string index turned them into floats

I have a csv file:
SID done good_ecg good_gsr good_resp comment
436 0 1 1
2411 1 1 1
3858 0 1 1
4517 0 1 1 117 min diff between files
9458 1 0 1 ######### error in my script
9754 0 1 1 trigger fehler
#REF!
88.8888888889
Which I load in a pandas dataframe it like this:
df = pandas.read_csv(f, delimiter="\t", dtype="str", index_col='SID')
I want to iterate through the index and print each one. But when I try
for subj in df.index:
    print(subj)
I get
436.0
2411.0
...
Now there is this '.0' at the end of each number. What am I doing wrong?
I have also tried iterating with iterrows() and have the same problem.
Thank you for any help!
EDIT: Here is the whole code I am using:
import pandas

def write():
    df = pandas.read_csv("overview.csv", delimiter="\t", dtype="str", index_col='SID')
    for subj in df.index:
        print(subj)

write()
Ah. The dtype parameter doesn't apply to the index_col:
>>> !cat sindex.csv
a,b,c
123,50,R
234,51,R
>>> df = pd.read_csv("sindex.csv", dtype="str", index_col="a")
>>> df
b c
a
123 50 R
234 51 R
>>> df.index
Int64Index([123, 234], dtype='int64', name='a')
Instead, read it in without an index_col (None is actually the default, so you don't need index_col=None at all, but here I'll be explicit) and then set the index:
>>> df = pd.read_csv("sindex.csv", dtype="str", index_col=None)
>>> df = df.set_index("a")
>>> df
b c
a
123 50 R
234 51 R
>>> df.index
Index(['123', '234'], dtype='object', name='a')
(I can't think of circumstances under which df.index would have dtype object but when you iterate over it you'd get integers, but you didn't actually show any self-contained code that generated that problem.)
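Alternatively, keep index_col and cast the index afterwards; a minimal sketch:
df = pd.read_csv("sindex.csv", dtype="str", index_col="a")
# cast the integer index to strings in place
df.index = df.index.astype(str)
df.index  # Index(['123', '234'], dtype='object', name='a')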

How to join a Series to a DataFrame?

Is there any way to join a Series to a DataFrame directly?
The join would be on a field of the dataframe and on the index of the series.
The only way I found was to convert the series to a dataframe first, as in the code below.
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['a'] = np.arange(0, 4)
df['b'] = np.arange(100, 104)
s = pd.Series(data=np.arange(100, 103))
# this doesn't work
# myjoin = pd.merge(df, s, how='left', left_on='a', right_index=True)
# this does
s = s.reset_index()
# s becomes a DataFrame
# note you cannot reset the index of a Series in place
myjoin = pd.merge(df, s, how='left', left_on='a', right_on='index')
print(myjoin)
I guess pd.concat (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html) might help, for example with inner/outer joins.
pd.concat((df,s), axis=1)
Out[26]:
a b 0
0 0 100 100
1 1 101 101
2 2 102 102
3 3 103 NaN
In [27]: pd.concat((df,s), axis=1, join='inner')
Out[27]:
a b 0
0 0 100 100
1 1 101 101
2 2 102 102
That's a very late answer, but what worked for me was: build a dataframe with the columns you want to retrieve in your series, name the series like the index you need, append the series to the dataframe (if the series has extra elements, they are added to the dataframe, which in some applications may be convenient), then join the resulting dataframe to the original dataframe on that index. Agreed, it is not direct, but it is still the most convenient way if you have a lot of series, instead of transforming each one into a dataframe first.
Try concat():
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['a'] = np.arange(0, 4)
df['b'] = np.arange(100, 104)
s = pd.Series(data=np.arange(100, 103))

new_df = pd.concat((df, s), axis=1)
print(new_df)
This prints:
a b 0
0 0 100 100
1 1 101 101
2 2 102 102
3 3 103 NaN
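For the join the question actually asks about (a dataframe column against the Series index), note that in more recent pandas versions (0.24+) a named Series can be merged or joined directly; a minimal sketch:
# give the Series a name so merge/join can use it as a column
s_named = s.rename('c')
myjoin = pd.merge(df, s_named, how='left', left_on='a', right_index=True)
# or equivalently:
myjoin = df.join(s_named, on='a')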

Pandas / SQLITE DataFrame plot

I am trying to plot data from SQLite, but I can't get it to work:
p2 = sql.read_sql('select DT_COMPUTE_FORCAST,VALUE_DEMANDE,VALUE_FORCAST from PCE', cnx)
# DataFrame p2 shows the data
DT_COMPUTE_FORCAST VALUE_DEMANDE VALUE_FORCAST
0 27/06/2014 06:00 5.128 5.324
1 27/06/2014 07:00 5.779 5.334
2 27/06/2014 08:00 5.539 5.354
df = pd.DataFrame({'Demande' : p2['VALUE_DEMANDE'],'Forcast' :p2['VALUE_FORCAST']},index=p2['DT_COMPUTE_FORCAST'])
df.plot(title='Title Here')
=> My chart is displayed, but with no values. Could you give me a hint?
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 20109 entries, 27/06/2014 06:00 to 11/05/2015 05:00
Data columns (total 2 columns):
Demande 0 non-null float64
Forcast 0 non-null float64
dtypes: float64(2)
memory usage: 392.8+ KB
Is the following statement correct, or am I missing something?
df = pd.DataFrame({'Demande' : p2['VALUE_DEMANDE'],'Forcast' : p2['VALUE_FORCAST']},index=p2['DT_COMPUTE_FORCAST'])
I think what happens here is that because you pass the data from p2 while using one of its columns as the index, the index values no longer align, so you end up with 0 non-null values. You can get around this by assigning the index after the df creation:
df = pd.DataFrame({'Demande' : p2['VALUE_DEMANDE'],'Forcast' :p2['VALUE_FORCAST']})
and then
df.index = p2['DT_COMPUTE_FORCAST']
Example:
In [160]:
df = pd.DataFrame({'a':np.arange(5), 'b':list('abcde')})
df
Out[160]:
a b
0 0 a
1 1 b
2 2 c
3 3 d
4 4 e
In [161]:
df1 = pd.DataFrame({'a_copy':df['a']}, index=df['b'])
df1
Out[161]:
a_copy
b
a NaN
b NaN
c NaN
d NaN
e NaN
Another way to get around this is to access the .values attribute so that the data is anonymous:
In [162]:
df1 = pd.DataFrame({'a_copy':df['a'].values}, index=df['b'])
df1
Out[162]:
a_copy
b
a 0
b 1
c 2
d 3
e 4
So the following should work:
df = pd.DataFrame({'Demande' : p2['VALUE_DEMANDE'].values,'Forcast' : p2['VALUE_FORCAST'].values},index=p2['DT_COMPUTE_FORCAST'])
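One further note beyond the original answer: DT_COMPUTE_FORCAST holds strings like '27/06/2014 06:00', so the plot's x-axis is just ordered labels. Parsing it first gives a real time axis (a sketch, assuming day-first timestamps):
# parse the day-first timestamp strings into real datetimes before indexing
p2['DT_COMPUTE_FORCAST'] = pd.to_datetime(p2['DT_COMPUTE_FORCAST'], format='%d/%m/%Y %H:%M')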
