Erroneous column concatenation in Python - python

I have a data frame where, if the value in the first column is empty, I have to fill it with the concatenation of the other two columns.
Cuenta CeCo   GLAccount   CeCoCeBe
123 A         123         A
234 S         234         S
NaN           345         B
NaN           987         A
for x in df1["Cuenta CeCo"].isna():
    if x:
        df1["Cuenta CeCo"] = df1["GLAccount"].apply(str) + " " + df1["CeCoCeBe"]
    else:
        df1["Cuenta CeCo"]
Column dtypes:
df1["Cuenta CeCo"]: dtype('O')
df1["GLAccount"]: dtype('float64')
df1["CeCoCeBe"]: dtype('O')
expected output:
Cuenta CeCo   GLAccount   CeCoCeBe
123 A         123         A
234 S         234         S
345 B         345         B
987 A         987         A
However, when concatenating, it seems to do something strange and gives me other numbers and letters:
Cuenta CeCo
251 O
471 B
791 R
341 O
Could someone help me understand why this happens and how to correct it so I get my expected output?

Iterating over DataFrames is typically bad practice and usually not what you intend; note that iterating over a DataFrame directly iterates over its column names. Try
for x in df:
    print(x)
and you will see it print the column headings.
As for what you're trying to do, try this:
cols = ['Cuenta CeCo', 'GLAccount', 'CeCoCeBe']
mask = df[cols[0]].isna()
df.loc[mask, cols[0]] = df.loc[mask, cols[1]].map(str) + " " + df.loc[mask, cols[2]]
This generates a mask (in this case a series of True and False) that we use to get a series of just the NaN rows, then replace them by getting the string of the second column and concatenating with the third, using the mask again to get only the rows we need.
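A minimal end-to-end sketch of this on the question's sample data (with the extra assumption that the float64 GLAccount values are whole numbers, so they can be cast to int before converting to str, which avoids results like "345.0 B"):

import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    'Cuenta CeCo': ['123 A', '234 S', np.nan, np.nan],
    'GLAccount':   [123.0, 234.0, 345.0, 987.0],   # float64, as in the question
    'CeCoCeBe':    ['A', 'S', 'B', 'A'],
})

mask = df1['Cuenta CeCo'].isna()
# cast to int first (assumes whole numbers), then build the "345 B"-style string
df1.loc[mask, 'Cuenta CeCo'] = (
    df1.loc[mask, 'GLAccount'].astype(int).astype(str)
    + ' '
    + df1.loc[mask, 'CeCoCeBe']
)
print(df1)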

import pandas as pd
import numpy as np

df = pd.DataFrame([
    ['123 A', 123, 'A'],
    ['234 S', 234, 'S'],
    [np.nan, 345, 'B'],
    [np.nan, 987, 'A']
], columns=['Cuenta CeCo', 'GLAccount', 'CeCoCeBe'])

def f(r):
    if pd.notna(r['Cuenta CeCo']):
        return r['Cuenta CeCo']
    else:
        return f"{r['GLAccount']} {r['CeCoCeBe']}"

df['Cuenta CeCo'] = df.apply(f, axis=1)
df
prints
   Cuenta CeCo  GLAccount CeCoCeBe
0        123 A        123        A
1        234 S        234        S
2        345 B        345        B
3        987 A        987        A

Related

Translating a dataframe with a second dataframe

I have two text files:
One with translations/aliases of the form:
123 456
2 278
456 99999
...
and another one with three entries per line:
34 456 9900
111 333 444
234 2 562
...
I want to translate the second column, if possible, so for example I would like the output dataframe to have the rows:
34, 99999, 9900
111, 333, 444
234, 278, 562
Reading in the text files works fine. However, I am having problems translating column b.
This is my basic code structure right now:
translation = sc.textFile("transl.txt")\
    .map(lambda line: line.split(" "))

def translate(string):
    x = translation.filter(lambda x: x[0] == string).collect()
    if x == []:
        return string
    return x[0][1]

d = sc.textFile("text.txt")\
    .map(lambda line: line.split(" "))\
    .toDF(["a", "b", "c"])\
    .withColumn("b", translate(d.b))
Everything works fine except for the last line.
I know that applying functions to a column in Spark doesn't work that easily, but I am out of ideas how else to do it.
You can achieve that with a left join. Please have a look at the commented code below:
import pyspark.sql.functions as F

l1 = [
    (123, 456),
    (2, 278),
    (456, 99999)
]
l2 = [
    (34, 456, 9900),
    (111, 333, 444),
    (234, 2, 562)
]

df1 = spark.createDataFrame(l1, ['one1', 'two1'])
df2 = spark.createDataFrame(l2, ['one2', 'two2', 'three2'])

# creates a dataframe with five columns: one1, two1, one2, two2, three2
df = df2.join(df1, df2.two2 == df1.one1, 'left')
# checks if a value is available in your dictionary dataframe; if not, the current value is kept,
# otherwise the value is translated
df = df.withColumn('two2', F.when(F.col('two1').isNull(), F.col('two2')).otherwise(F.col('two1')))
df = df.drop('one1', 'two1')
df.show()
Output:
+----+-----+------+
|one2| two2|three2|
+----+-----+------+
| 111| 333| 444|
| 234| 278| 562|
| 34|99999| 9900|
+----+-----+------+
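As a side note, the when(...).otherwise(...) line can also be written with F.coalesce, which returns the first non-null of its arguments; a sketch with the same column names as above:

# equivalent replacement for the when/otherwise step (sketch)
df = df2.join(df1, df2.two2 == df1.one1, 'left') \
        .withColumn('two2', F.coalesce(F.col('two1'), F.col('two2'))) \
        .drop('one1', 'two1')
df.show()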
A slightly different approach would be to join the two files after importing them as pandas DataFrames. I've shown an example below:
# Sample DataFrames from the provided example
import pandas as pd

translations = pd.DataFrame({
    'Key': [123, 2, 456],
    'Translation': [456, 278, 99999]
})
entries = pd.DataFrame({
    'A': [34, 111, 234],
    'B': [456, 333, 2],
    'C': [9900, 444, 562]
})
After importing the files we can merge them by the lookup keys, using a left join
df = pd.merge(entries, translations, left_on='B', right_on='Key', how='left')
However, this will leave us with a column of NaNs wherever a lookup couldn't be found. To resolve this, we take the value from 'B' in those cases, and at the same time overwrite the original 'B' column with the lookup value.
df['B'] = df['Translation'].mask(pd.isna, df['B'])
Now we need to drop the additional columns to arrive at the result that you requested:
df = df.drop(columns=['Key', 'Translation'])
df will now look like this:
     A      B     C
0   34  99999  9900
1  111    333   444
2  234    278   562
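For reference, the mask step (before the drop) could equivalently be written with fillna, since 'Translation' is NaN exactly where the lookup failed:

# same effect as df['Translation'].mask(pd.isna, df['B'])
df['B'] = df['Translation'].fillna(df['B'])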

Pandas replace() on all masked values

I want to replace 'bee' with 'ass' in the rows of df selected by the mask m below.
import pandas as pd
data = {'Data1': [899, 900, 901, 902],
        'Data2': ['as-bee', 'be-bee', 'bee-be', 'bee-as']}
df = pd.DataFrame(data)
Data1 Data2
0 899 as-bee
1 900 be-bee
2 901 bee-be
3 902 bee-as
wrong = {'Data1':[900,901]}
df1 = pd.DataFrame(wrong)
Data1
0 900
1 901
m = df['Data1'].isin(wrong['Data1'])
df[m]['Data2'].apply(lambda x: x.replace('bee','aas'))
1 be-aas
2 aas-be
Name: Data2, dtype: object
It returns the desired changes, but the values in df does not change. Doing df[m]['Data2']=df[m]['Data2'].apply(lambda x: x.replace('bee','aas')) does not help either as it returns an error.
IIUC, you can do this using:
Method 1: df.loc[]
m = df.Data1.isin(df1.Data1)  # boolean mask
df.loc[m, 'Data2'] = df.loc[m, 'Data2'].replace('bee', 'ass', regex=True)
print(df)
Method 2: np.where()
import numpy as np

m = df.Data1.isin(df1.Data1)
df.Data2 = np.where(m, df.Data2.replace('bee', 'ass', regex=True), df.Data2)
print(df)
Data1 Data2
0 899 as-bee
1 900 be-ass
2 901 ass-be
3 902 bee-as
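For context on why the original chained assignment had no effect: df[m] builds a new, copied DataFrame, so assigning into it never touches df itself. A small illustration:

tmp = df[m]                # df[m] returns a copy of the selected rows
tmp['Data2'] = 'changed'   # modifies only the copy (pandas may emit SettingWithCopyWarning)
# df is unchanged; df.loc[m, 'Data2'] = ... writes through to df in a single step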

Checking data match and mismatch between two columns using python pandas

Sample data was provided as an image (not reproduced here).
Input for file A and file B is given, and the output format is also given. Can someone help me with this?
I'd also be curious to see a clever/pythonic solution to this. My "ugly" solution iterating over index is as follows:
dfa, dfb are the two dataframes, columns named as in example.
dfa = pd.DataFrame({'c1': ['v', 'f', 'h', 'm', 's', 'd'],
                    'c2': ['100', '110', '235', '999', '333', '39'],
                    'c3': ['tech', 'jjj', None, 'iii', 'mnp', 'lf'],
                    'c4': ['hhh', 'scb', 'kkk', 'lop', 'sos', 'kdk']})
dfb = pd.DataFrame({'c1': ['v', 'h', 'm', 'f', 'L', 's'],
                    'c2': ['100', '235', '999', '110', '777', '333'],
                    'c3': ['tech', None, 'iii', 'jkl', '9kdf', 'mnp1'],
                    'c4': ['hhh', 'mckkk', 'lok', 'scb', 'ooo', 'sos1']})
Now let's create lists of indexes to identify the rows that don't match between dfa and dfb
dfa, dfb = dfa.set_index(['c1','c2']), dfb.set_index(['c1','c2'])
mismatch3, mismatch4 = [], []
for i in dfa.index:
    if i in dfb.index:
        if dfa.loc[i,'c3'] != dfb.loc[i,'c3']:
            mismatch3.append(i)
        if dfa.loc[i,'c4'] != dfb.loc[i,'c4']:
            mismatch4.append(i)
mismatch = list(set(mismatch3 + mismatch4))
Now that this is done, we want to rename dfb, perform the join operation on the mismatched indexes, and add the "status" columns based on mismatch3 and mismatch4.
dfb = dfb.rename(index=str, columns={'c3':'b_c3','c4':'b_c4'})
df = dfa.loc[mismatch].join(dfb)
df['c3_status'] = 'match'
df['c4_status'] = 'match'
df.loc[mismatch3, 'c3_status'] = 'mismatch'
df.loc[mismatch4, 'c4_status'] = 'mismatch'
Finally, let's get those columns in the right order :)
result = df[['c3','b_c3','c3_status','c4','b_c4','c4_status']]
Once again, I'd love to see a prettier solution. I hope this helps!
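For what it's worth, here is a vectorised sketch of the same comparison, reusing dfa and the renamed dfb from above (both indexed on ['c1','c2']); the isna() terms treat two missing values as a match, mirroring the loop:

import numpy as np

both = dfa.join(dfb, how='inner')   # only keys present in both files
c3_same = both['c3'].eq(both['b_c3']) | (both['c3'].isna() & both['b_c3'].isna())
c4_same = both['c4'].eq(both['b_c4']) | (both['c4'].isna() & both['b_c4'].isna())
both['c3_status'] = np.where(c3_same, 'match', 'mismatch')
both['c4_status'] = np.where(c4_same, 'match', 'mismatch')
result = both.loc[~(c3_same & c4_same),
                  ['c3', 'b_c3', 'c3_status', 'c4', 'b_c4', 'c4_status']]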
Here are four lines of code that may do what you are looking for:
columns_to_compare =['c2','c3']
dfa['Combo'] = dfa[columns_to_compare].apply(lambda x: ', '.join(x[x.notnull()]), axis = 1)
dfb['Combo1'] = dfb[columns_to_compare].apply(lambda x: ', '.join(x[x.notnull()]), axis = 1)
[i for i,x in enumerate(dfb['Combo1'].tolist()) if x not in dfa['Combo'].tolist()]
explanation
Assume that you want to see what dfb rows are not in dfa, for columns c2 and c3.
To do this, consider the following approach:
Create a column "Combo" in dfa where each row of "Combo" contains a comma separated string, representing the values of the chosen columns to compare (for the row concerned)
dfa['Combo'] = dfa[columns_to_compare].apply(lambda x: ', '.join(x[x.notnull()]), axis=1)
c1 c2 c3 c4 Combo
0 v 100 tech hhh 100, tech
1 f 110 jjj scb 110, jjj
2 h 235 None kkk 235
3 m 999 iii lop 999, iii
4 s 333 mnp sos 333, mnp
5 d 39 lf kdk 39, lf
Apply the same logic to dfb
c1 c2 c3 c4 Combo1
0 v 100 tech hhh 100, tech
1 h 235 None mckkk 235
2 m 999 iii lok 999, iii
3 f 110 jkl scb 110, jkl
4 L 777 9kdf ooo 777, 9kdf
5 s 333 mnp1 sos1 333, mnp1
Create a list containing the required indices from dfb:
[i for i,x in enumerate(dfb['Combo1'].tolist()) if x not in dfa['Combo'].tolist()]
or to show the actual row values (not indices):
[[x] for i,x in enumerate(dfb['Combo1'].tolist()) if x not in dfa['Combo'].tolist()]
Row Index Result
[3, 4, 5]
Row Value Result
[['110, jkl'], ['777, 9kdf'], ['333, mnp1']]

Python 3: Construct a variable by parsing a pandas dataframe

I have the following dataframe with the columns id, start, end, name:
A 7 340 string1
B 12 113 string2
B 139 287 string3
B 301 348 string4
B 379 434 string5
C 41 73 string6
C 105 159 string7
and I am reading this into python3 using pandas:
import pandas
df = pandas.read_csv("table", comment="#", header=None, names=["id", "start", "end", "name"])
Now I need to parse the df and extract for each id the start, end and name into a list of the following format:
mylist = [GraphicFeature(start=XXX, end=YYY, color="#ffffff", label="ZZZ")]
XXX here is the start, YYY is the end, and ZZZ is the "name". The list therefore has as many items as there are rows for that id.
GraphicFeature is just a member name of a module.
I thought of looping over the dataframe like this:
uniq_val = list(df["id"].unique())
for i in uniq_val:
    extracted = df.loc[df["id"] == i]
But how do I construct mylist? (There will be some other plotting commands after constructing the list).
My expected "output" in the loop therefore is:
for id A:
mylist = [GraphicFeature(start=7, end=340, color="#ffffff", label="string1")]
for id B:
mylist = [GraphicFeature(start=12, end=113, color="#ffffff", label="string2"), GraphicFeature(start=139, end=287, color="#ffffff", label="string3"), GraphicFeature(start=301, end=348, color="#ffffff", label="string4"), GraphicFeature(start=379, end=434, color="#ffffff", label="string5")]
for id C:
mylist = [GraphicFeature(start=41, end=73, color="#ffffff", label="string6"), GraphicFeature(start=105, end=159, color="#ffffff", label="string7")]
Using a for loop:
l = [[GraphicFeature(start=x[0], end=x[1], color="#ffffff", label=x[2])
      for x in zip(y.start, y.end, y.name)]
     for _, y in df.groupby('id')]
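A variant of the same idea (sketch) that keeps each id as a dictionary key and indexes the columns explicitly, so the list for a given id can be looked up directly:

mylists = {
    key: [GraphicFeature(start=s, end=e, color="#ffffff", label=n)
          for s, e, n in zip(g["start"], g["end"], g["name"])]
    for key, g in df.groupby("id")
}
mylists["B"]   # the list of GraphicFeature objects for id B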
One approach would be to let
mylists = df.groupby('id').apply(lambda group: group.apply(lambda row: GraphicFeature(start=row['start'], end=row['end'], color='#ffffff', label=row['name']), axis=1).tolist())
Spelling this out a little bit: pandas operations tend to fit together most tidily if one takes a functional-programming approach. We want to turn each row into a GraphicFeature, and in turn we want to turn each group of rows with the same id into a list of GraphicFeature. As such, the above could also be expanded to
def row_to_graphic_feature(row):
    return GraphicFeature(start=row['start'], end=row['end'], color='#ffffff', label=row['name'])

def id_group_to_list(group):
    return group.apply(row_to_graphic_feature, axis=1).tolist()

mylists = df.groupby('id').apply(id_group_to_list)
With your example data:
In [38]: df
Out[38]:
id start end name
0 A 7 340 string1
1 B 12 113 string2
2 B 139 287 string3
3 B 301 348 string4
4 B 379 434 string5
5 C 41 73 string6
6 C 105 159 string7
In [39]: mylists = df.groupby('id').apply(id_group_to_list)
In [40]: mylists['A']
Out[40]: [GraphicFeature(start=7, end=340, color='#ffffff', label='string1')]
In [41]: mylists['B']
Out[41]:
[GraphicFeature(start=12, end=113, color='#ffffff', label='string2'),
GraphicFeature(start=139, end=287, color='#ffffff', label='string3'),
GraphicFeature(start=301, end=348, color='#ffffff', label='string4'),
GraphicFeature(start=379, end=434, color='#ffffff', label='string5')]
In [42]: mylists['C']
Out[42]:
[GraphicFeature(start=41, end=73, color='#ffffff', label='string6'),
GraphicFeature(start=105, end=159, color='#ffffff', label='string7')]

Iterating through pandas string index turned them into floats

I have a csv file:
SID done good_ecg good_gsr good_resp comment
436 0 1 1
2411 1 1 1
3858 0 1 1
4517 0 1 1 117 min diff between files
9458 1 0 1 ######### error in my script
9754 0 1 1 trigger fehler
#REF!
88.8888888889
Which I load in a pandas dataframe it like this:
df = pandas.read_csv(f ,delimiter="\t", dtype="str", index_col='SID')
I want to iterate through the index and print each one. But when I try
for subj in df.index:
    print subj
I get
436.0
2411.0
...
Now there is this '.0' at the end of each number. What am I doing wrong?
I have also tried iterating with iterrows() and have the same problem.
Thank you for any help!
EDIT: Here is the whole code I am using:
import pandas

def write():
    df = pandas.read_csv("overview.csv", delimiter="\t", dtype="str", index_col='SID')
    for subj in df.index:
        print subj

write()
Ah. The dtype parameter doesn't apply to the index_col:
>>> !cat sindex.csv
a,b,c
123,50,R
234,51,R
>>> df = pd.read_csv("sindex.csv", dtype="str", index_col="a")
>>> df
b c
a
123 50 R
234 51 R
>>> df.index
Int64Index([123, 234], dtype='int64', name='a')
Instead, read it in without an index_col (None is actually the default, so you don't need index_col=None at all, but here I'll be explicit) and then set the index:
>>> df = pd.read_csv("sindex.csv", dtype="str", index_col=None)
>>> df = df.set_index("a")
>>> df
b c
a
123 50 R
234 51 R
>>> df.index
Index(['123', '234'], dtype='object', name='a')
(I can't think of circumstances under which df.index would have dtype object but when you iterate over it you'd get integers, but you didn't actually show any self-contained code that generated that problem.)
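With the index kept as strings, iterating now prints the labels without the trailing .0, e.g. for the small sindex.csv frame above:

>>> for subj in df.index:
...     print(subj)
123
234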
