Adding spaces between strings after sum() - python

Assuming that I have the following pandas dataframe:
>>> data = pd.DataFrame({'X': ['a','b'], 'Y': ['c','d'], 'Z': ['e','f']})
>>> data
   X  Y  Z
0  a  c  e
1  b  d  f
The desired output is:
0    a c e
1    b d f
When I run the following code, I get:
>>> data.sum(axis=1)
0    ace
1    bdf
dtype: object
So how do I concatenate columns of strings with a space between them?

Use apply per row with axis=1 and ' '.join:
a = data.apply(' '.join, axis=1)
print (a)
0    a c e
1    b d f
dtype: object
Another solution: add spaces with add, then sum, and finally str.rstrip:
a = data.add(' ').sum(axis=1).str.rstrip()
# same as
# a = (data + ' ').sum(axis=1).str.rstrip()
print (a)
0    a c e
1    b d f
dtype: object
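As a side note, Series.str.cat can also join string columns with a separator (a minimal sketch, assuming the three column names above):
# Sketch: concatenate 'X' with the other columns, inserting a space
a = data['X'].str.cat([data['Y'], data['Z']], sep=' ')
This produces the same 'a c e' / 'b d f' series.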

You can do as follows:
data.apply(lambda x: x + ' ').sum(axis=1)
The output is:
0    a c e
1    b d f
dtype: object

Related

How to combine string from one column to another column at same index in pandas DataFrame?

I was doing a project in NLP.
My input is:
index  name  lst
0      a     c
0            d
0            e
1            f
1      b     g
I need output like this:
index  name  lst  combine
0      a     c    a c
0            d    a d
0            e    a e
1            f    b f
1      b     g    b g
How can I achieve this?
You can use groupby + transform('max') to fill the empty cells with the letter of each group, since letters sort after the empty string. The rest is a simple string concatenation per column:
df['combine'] = df.groupby('index')['name'].transform('max') + ' ' + df['lst']
Used input:
df = pd.DataFrame({'index': [0, 0, 0, 1, 1],
                   'name': ['a', '', '', '', 'b'],
                   'lst': list('cdefg'),
                   })
NB: I treated "index" as a column here; if it is the actual index, use df.index in the groupby.
Output:
   index name lst combine
0      0    a   c     a c
1      0        d     a d
2      0        e     a e
3      1        f     b f
4      1    b   g     b g
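To see why this works, here is the intermediate transform on the used input (illustrative):
# 'max' picks the letter in each group because 'a' > '' and 'b' > ''
print (df.groupby('index')['name'].transform('max'))
0    a
1    a
2    a
3    b
4    b
Name: name, dtype: object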

Perform df.loc on a grouped df

I have a df consisting of person, origin and destination:
df = pd.DataFrame({'PersonID':['1','1','2','2','2','3'],'O':['A','B','C','B','A','X'],'D':['B','A','B','A','B','Y']})
The df:
PersonID  O  D
1         A  B
1         B  A
2         C  B
2         B  A
2         A  B
3         X  Y
I have grouped the df with df_grouped = df.groupby(['O','D']) and matched it with another dataframe, taxi:
TaxiID  O  D
T1      B  A
T2      A  B
T3      C  B
Similarly, I grouped the taxi df by its O and D. Then I merged the two after aggregating and counting the PersonID and TaxiID per O-D pair, to see how many taxis are available for how many people:
O  D  PersonID  TaxiID
      count     count
A  B  2         1
B  A  2         1
C  B  1         1
Now, I want to use df.loc to take only those PersonID that were counted in the merged frame. How can I do this? I've tried to use:
seek = df.loc[df.PersonID.isin(merged['PersonID'])]
but it returns an empty dataframe. What can I do?
Edit: I attach the complete code for this case using dummy data:
df = pd.DataFrame({'PersonID':['1','1','2','2','2','3'],'O':['A','B','C','B','A','X'],'D':['B','A','B','A','B','Y']})
taxi = pd.DataFrame({'TaxiID':['T1','T2','T3'],'O':['B','A','C'],'D':['A','B','B']})
df_grouped = df.groupby(['O','D'])
taxi_grouped = taxi.groupby(['O','D'])
dfm = df_grouped.agg({'PersonID':['count',list]}).reset_index()
tgm = taxi_grouped.agg({'TaxiID':['count',list]}).reset_index()
merged = pd.merge(dfm, tgm, how='inner')
seek = df.loc[df.PersonID.isin(merged['PersonID'])]
Your isin check returns an empty frame because merged['PersonID'] selects the whole MultiIndex sub-frame (the count and list columns), not the IDs themselves. Select the list column by its full tuple and use Series.explode to get scalars out of the nested lists:
seek = df.loc[df.PersonID.isin(merged[('PersonID', 'list')].explode().unique())]
print (seek)
  PersonID  O  D
0        1  A  B
1        1  B  A
2        2  C  B
3        2  B  A
4        2  A  B
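For reference, with the dummy data above, the nested column and its exploded form look like this (illustrative):
# each O-D pair holds one list of PersonIDs
print (merged[('PersonID', 'list')].tolist())
[['1', '2'], ['1', '2'], ['2']]
print (merged[('PersonID', 'list')].explode().unique())
['1' '2']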
For better performance, it is possible to use a set comprehension to flatten the nested lists:
seek = df.loc[df.PersonID.isin(set(z for x in merged[('PersonID', 'list')] for z in x))]
print (seek)
  PersonID  O  D
0        1  A  B
1        1  B  A
2        2  C  B
3        2  B  A
4        2  A  B

How to apply a function while checking a specific column for null values

I am trying to apply a function to the dataframe while checking for NULL values in each row of a specific column.
I have created the function, but I don't understand how to apply it only to the rows that have values.
Input:
   A  B  C  D  E  F
0  f  e  b  a  d  a
1  c  b  a  c  b
2  f  f  a  b  c  c
3  d  c  c  d  c  d
4  f  b  b  b  e  b
5  b  a  f  c  d  a
Expected Output
A B C D E F MATCHES Comments
0 f e b a d a AD, BC Unmatched
1 c b a c b BC Unmatched F is having blank values
2 f f a b c c AD, BC Unmatched
3 d c c d c d ALL MATCHED
4 f b b b e b AD Unmatched
5 b a f c d a AD, BC Unmatched
The script works when we don't have to check for NaN values in the df['F'] column, but when we check for the empty rows in df['F'], it gives an error.
Code I have been trying:
def test(x):
    try:
        for idx in df.index:
            unmatch_list = []
            if not df.loc[idx, 'A'] == df.loc[idx, 'D']:
                unmatch_list.append('AD')
            if not df.loc[idx, 'B'] == df.loc[idx, 'C']:
                unmatch_list.append('BC')
            # etcetera...
            if len(unmatch_list):
                unmatch_string = ', '.join(unmatch_list) + ' Unmatched'
            else:
                unmatch_string = 'ALL MATCHED'
            df.loc[idx, 'MATCHES'] = unmatch_string
    except ValueError:
        print ('error')
    return df

## df = df.apply(lambda x: test(x) if(pd.notna(df['F'])) else x)
for row in df:
    if row['F'].isna() == True:
        row['Comments'] = "F is having blank values"
    else:
        df = test(df)
Please suggest how I can use the function.
You could try something like this:
# get combis
df1 = df.copy().reset_index().melt(id_vars=['index'])
df1 = df1.merge(df1, on=['index', 'value'], how='inner')
df1 = df1[df1['variable_x'] != df1['variable_y']]
df1['combis'] = df1['variable_x'] + ':' + df1['variable_y']
df1 = df1.groupby(['index'])['combis'].apply(list)
# get empty rows
df2 = df.copy().reset_index().melt(id_vars=['index'])
df2 = df2[df2['value'].isna()]
df2 = df2.groupby(['index'])['variable'].apply(list)
# combine
df.join(df1).join(df2)
# A B C ... F combis variable
# 0 f e b ... a [D:F, F:D] NaN
# 1 c b a ... None [A:D, D:A, B:E, E:B] [F]
# 2 f f a ... c [A:B, B:A, E:F, F:E] NaN
# 3 d c c ... d [A:D, A:F, D:A, D:F, F:A, F:D, B:C, B:E, C:B, ... NaN
# 4 f b b ... b [B:C, B:D, B:F, C:B, C:D, C:F, D:B, D:C, D:F, ... NaN
# 5 b a f ... a [B:F, F:B] NaN
# [6 rows x 8 columns]
If you are only interested in the unmatched combinations, you can use this:
import itertools
combis = [x+':'+y for x,y in itertools.permutations(df.columns, 2)]
df.join(df1).join(df2)['combis'].map(lambda lst: list(set(combis) - set(lst)))
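A sketch of how the pieces could be stitched back into the question's MATCHES and Comments columns (column names assumed from the outputs above; this is not part of the original answer):
res = df.join(df1).join(df2)
# unmatched pairs per row, rendered as joined strings (pair labels keep the X:Y form from combis;
# assumes every row has at least one match, as in the sample data)
unmatched = res['combis'].map(lambda lst: sorted(set(combis) - set(lst)))
df['MATCHES'] = unmatched.map(lambda lst: ', '.join(lst) + ' Unmatched' if lst else 'ALL MATCHED')
# rows whose melted 'variable' entry is a list had blank cells
df['Comments'] = res['variable'].map(lambda v: ', '.join(v) + ' is having blank values' if isinstance(v, list) else '')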

Adding a column with the difference between rows

Let's say I have a data frame:
   A  B
0  a  b
1  c  d
2  e  f
and what I am aiming for is to find the difference between the first row of column A and each following row,
Like this:
   A  B  Ic
0  a  b  (a-a)
1  c  d  (a-c)
2  e  f  (a-e)
This is what I tried:
df['dA'] = df['A'] - df['A']
But it doesn't give me the result I needed. Any help would be greatly appreciated.
Select the first value with loc (by index label and column name) or iat (by column name and position) and subtract. (The samples below use numbers in column A, since strings cannot be subtracted.)
df['Ic'] = df.loc[0, 'A'] - df['A']
print (df)
   A  B  Ic
0  4  b   0
1  1  d   3
2  0  f   4
df['Ic'] = df['A'].iat[0] - df['A']
print (df)
   A  B  Ic
0  4  b   0
1  1  d   3
2  0  f   4
Detail:
print (df.loc[0,'A'])
4
print (df['A'].iat[0])
4
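The same idea can also be written with Series.rsub (a sketch, equivalent to the options above):
# reverse subtraction: first value minus each row
df['Ic'] = df['A'].rsub(df['A'].iat[0])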

Count the number of occurrences per column with Pandas

I want to count, per column, the number of occurrences of two different values: the number of nulls and the number of \N in my dataframe. For example, I've got:
   A   B  C   D   E   D
1  \N  1  \N  12  1
2  4      \N  3   0   \N
3      4  M       \N  1
I'm expecting the following result:
A 2
B 1
C 2
D 1
E 1
F 2
I have already succeeded in counting the number of missing values with the following code:
df = pd.read_csv("mypath/myFile", sep=',')
null_value = df.isnull().sum()
But the following code doesn't work:
break_line = df[df == '\N'].count()
return break_line + null_value
I get the following error:
TypeError: Could not compare ['\N'] with block values
One-liner:
ns = df.applymap(lambda x: x == '\\N').sum(axis=0)
null_value + ns
A 2
B 1
C 2
D 1
E 1
F 2
You can simply do the following using applymap:
df.applymap(lambda x: x == '\\N').sum() + df.isnull().sum()
which gives you the desired output:
A 2
B 1
C 2
D 1
E 1
F 2
dtype: int64
Note: You use D twice; I now replaced that by F.
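Side note: on pandas 2.1+, DataFrame.applymap is deprecated in favour of DataFrame.map, so the same idea would read (a sketch, assuming a recent pandas):
df.map(lambda x: x == '\\N').sum() + df.isnull().sum()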
I assume you only want to count values where the string ends with '\N'. If not, you can use str.contains instead.
I use a dictionary comprehension to loop through the columns of the dataframe and a vectorized str function to count the number of rows with \N at the end.
df = pd.DataFrame({'A': ['\\N', 4, None],
                   'B': [1, None, 4],
                   'C': ['\\N', '\\N', 'M'],
                   'D': [12, 3, None],
                   'E': [1, 0, '\\N'],
                   'F': [None, '\\N', 1]})
>>> df
      A    B   C    D   E     F
0    \N    1  \N   12   1  None
1     4  NaN  \N    3   0    \N
2  None    4   M  NaN  \N     1
>>> pd.Series({col: df[col].str.endswith('\\N').sum()
...            if df[col].dtype == 'object' else 0
...            for col in df}) + df.isnull().sum()
A 2
B 1
C 2
D 1
E 1
F 2
dtype: int64
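The str.contains variant mentioned above could look like this (a sketch; regex is disabled so the backslash is matched literally):
>>> pd.Series({col: df[col].str.contains('\\N', regex=False).sum()
...            if df[col].dtype == 'object' else 0
...            for col in df}) + df.isnull().sum()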
A solution which uses only vectorized calculations:
df.isna().sum() + (df == '\\N').sum()
Output:
A 2
B 1
C 2
D 1
E 1
F 2
