Looping a lambda function across multiple pandas columns - python

I am struggling to loop a lambda function across multiple columns.
samp = pd.DataFrame({'ID': ['1', '2', '3'], 'A': ['1C22', '3X35', '2C77'],
                     'B': ['1C35', '2C88', '3X99'], 'C': ['3X56', '2C73', '1X91']})
Essentially, I am trying to add three columns to this dataframe with a 1 if there is a 'C' in the string and a 0 if not (i.e. an 'X').
This function works fine when I apply it as a lambda function to each column individually, but I'm doing this for 40 different columns and the code is (I'm assuming) unnecessarily clunky:
def is_correct(s):
    correct = len(re.findall('C', s))
    return correct

samp['A_correct'] = samp['A'].apply(lambda x: is_correct(x))
samp['B_correct'] = samp['B'].apply(lambda x: is_correct(x))
samp['C_correct'] = samp['C'].apply(lambda x: is_correct(x))
I'm confident there is a way to loop this, but I have been unsuccessful thus far.

You can iterate over the columns you want to check:
import pandas as pd
import re

df = pd.DataFrame({'ID': ['1', '2', '3'], 'A': ['1C22', '3X35', '2C77'],
                   'B': ['1C35', '2C88', '3X99'], 'C': ['3X56', '2C73', '1X91']})

def is_correct(s):
    correct = len(re.findall('C', s))
    return correct

for col in ['A', 'B', 'C']:  # skip 'ID'; only the string columns need the flag
    df[col + '_correct'] = df[col].apply(lambda x: is_correct(x))
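As a side note, the same count is also available as a vectorized string method, which skips the Python-level regex call; a minimal sketch:
for col in ['A', 'B', 'C']:
    df[col + '_correct'] = df[col].str.count('C')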

Let's try apply and join:
samp.join(samp[['A', 'B', 'C']].add_suffix('_correct')
              .apply(lambda x: x.str.contains('C'))
              .astype(int))
Output:
ID A B C A_correct B_correct C_correct
0 1 1C22 1C35 3X56 1 1 0
1 2 3X35 2C88 2C73 0 1 1
2 3 2C77 3X99 1X91 1 0 0
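The same join pattern scales to the 40 real columns; a hedged sketch, where cols stands in for the actual list of column names:
cols = ['A', 'B', 'C']  # stand-in for the 40 real column names
samp = samp.join(samp[cols].apply(lambda s: s.str.contains('C')).astype(int).add_suffix('_correct'))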

Related

Split column in a Dask Dataframe into n number of columns

In a column in a Dask Dataframe, I have strings like this:
column_name_1  column_name_2
a^b^c          j
e^f^g          k^l
h^i            m
I need to split these strings into columns in the same data frame, like this
column_name_1  column_name_2  column_name_1_1  column_name_1_2  column_name_1_3  column_name_2_1  column_name_2_2
a^b^c          j              a                b                c                j
e^f^g          k^l            e                f                g                k                l
h^i            m              h                i                                 m
I cannot figure out how to do this without knowing in advance how many occurrences of the delimiter there are in the data. Also, there are tens of columns in the Dataframe that are to be left alone, so I need to be able to specify which columns to split like this.
My best effort includes something like
df[["column_name_1_1", "column_name_1_2", "column_name_1_3"]] = df["column_name_1"].str.split('^', n=2, expand=True)
But it fails with a
ValueError: The columns in the computed data do not match the columns in the provided metadata
Here are two solutions that work without stack, looping over the selected column names:
cols = ['column_name_1', 'column_name_2']
for c in cols:
    df = df.join(df[c].str.split('^', n=2, expand=True).add_prefix(f'{c}_').fillna(''))
print(df)
  column_name_1 column_name_2 column_name_1_0 column_name_1_1 column_name_1_2  \
0         a^b^c             j               a               b               c
1         e^f^g           k^l               e               f               g
2           h^i             m               h               i

  column_name_2_0 column_name_2_1
0               j
1               k               l
2               m
Or, modifying another solution, build all the split frames first and concatenate them once:
cols = ['column_name_1', 'column_name_2']
dfs = [df[c].str.split('^', n=2, expand=True).add_prefix(f'{c}_').fillna('') for c in cols]
df = pd.concat([df] + dfs, axis=1)
print(df)
  column_name_1 column_name_2 column_name_1_0 column_name_1_1 column_name_1_2  \
0         a^b^c             j               a               b               c
1         e^f^g           k^l               e               f               g
2           h^i             m               h               i

  column_name_2_0 column_name_2_1
0               j
1               k               l
2               m
Unfortunately, using dask.dataframe.Series.str.split with expand=True and an unknown number of splits is not yet supported in Dask; the following returns a NotImplementedError:
import dask.dataframe as dd
import pandas as pd
ddf = dd.from_pandas(
    pd.DataFrame({
        'column_name_1': ['a^b^c', 'e^f^g', 'h^i'], 'column_name_2': ['j', 'k^l', 'm']
    }), npartitions=2
)
# returns NotImplementedError
ddf['column_name_1'].str.split('^', expand=True).compute()
Usually when a pandas equivalent has not yet been implemented in Dask, map_partitions can be used to apply a Python function on each DataFrame partition. In this case, however, Dask would still need to know how many columns to expect in order to lazily produce a Dask DataFrame, as provided with a meta argument. This makes using Dask for this task challenging. Relatedly, the ValueError occurs because column_name_2 requires only 1 split, and returns a Dask DataFrame with 2 columns, but Dask is expecting a DataFrame with 3 columns.
Here is one solution (building on @Fontanka16's answer) if you do know the number of splits ahead of time:
import dask.dataframe as dd
import pandas as pd
ddf = dd.from_pandas(
    pd.DataFrame({
        'column_name_1': ['a^b^c', 'e^f^g', 'h^i'], 'column_name_2': ['j', 'k^l', 'm']
    }), npartitions=2
)

ddf_list = []
num_split_dict = {'column_name_1': 2, 'column_name_2': 1}
for col, num_splits in num_split_dict.items():
    split_df = ddf[col].str.split('^', n=num_splits, expand=True).add_prefix(f'{col}_')
    ddf_list.append(split_df)
new_ddf = dd.concat([ddf] + ddf_list, axis=1)
new_ddf.compute()
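If you would rather do the whole split in a single map_partitions call, as described above, here is a minimal sketch reusing ddf from the snippet above; split_partition and out_cols are hypothetical names, and the maximum number of pieces per column is still assumed to be known up front:
# assumed known up front: how many pieces each column can split into
out_cols = {'column_name_1': 3, 'column_name_2': 2}

def split_partition(pdf):
    # plain pandas inside each partition; reindex pads short splits with NaN
    parts = []
    for col, n in out_cols.items():
        expanded = pdf[col].str.split('^', expand=True).reindex(columns=range(n))
        expanded.columns = [f'{col}_{i}' for i in range(n)]
        parts.append(expanded)
    return pd.concat([pdf] + parts, axis=1)

# meta tells Dask the full column layout before anything is computed
meta = {c: 'object' for c in ddf.columns}
for col, n in out_cols.items():
    meta.update({f'{col}_{i}': 'object' for i in range(n)})

new_ddf = ddf.map_partitions(split_partition, meta=meta)
new_ddf.compute()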

How to separate characters of a column based on its intersection with another column?

There are two columns in my df; the second column contains the data of the first column plus extra characters (letters and/or numbers):
values = {
    'number': [2830, 8457, 9234],
    'nums': ['2830S', '8457M', '923442']
}
df = pd.DataFrame(values, columns=['number', 'nums'])
The extra characters are always after the common characters! How can I separate the characters that are not common between the two columns? I am looking for a simple solution, not a loop to check every character.
Replace the common characters with an empty string:
f_diff = lambda x: x['nums'].replace(x['number'], '')
df['extra'] = df[['number', 'nums']].astype(str).apply(f_diff, axis=1)
print(df)
# Output
number nums extra
0 2830 2830S S
1 8457 8457M M
2 9234 923442 42
Update
If the number values are always the first characters of the nums column, you can use a simpler function:
f_diff2 = lambda x: x['nums'][len(x['number']):]
df['extra'] = df[['number', 'nums']].astype(str).apply(f_diff2, axis=1)
print(df)
# Output
number nums extra
0 2830 2830S S
1 8457 8457M M
2 9234 923442 42
I would delete the prefix of the string. For this you can use the method apply() to apply the following function on each row:
def remove_prefix(text, prefix):
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

df['nums'] = df.apply(lambda x: remove_prefix(x['nums'], str(x['number'])), axis=1)
df
Output:
number nums
0 2830 S
1 8457 M
2 9234 42
If you have Python >= 3.9 you only need this (note the str() around the number, since removeprefix expects a string):
df['nums'] = df.apply(lambda x: x['nums'].removeprefix(str(x['number'])), axis=1)
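Since the per-row work is trivial, a plain zip over the two columns is another lightweight option (a sketch, keeping the prefix assumption and the Python 3.9+ requirement from above):
df['extra'] = [n.removeprefix(str(m)) for m, n in zip(df['number'], df['nums'])]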

How to import a DataFrame of mixed type and organize into columns in Python

I am importing a .txt file via read_table and get a DataFrame similar to
d = ['89278 5857', '1.000e-02', '1.591184e-02', '2.100053e-02', '89300 5857', '4.038443e-01', '4.037924e-01', '4.037336e-01']
df = pd.DataFrame(data = d)
and would like to reorganize it into
r = {'89278 5857': [1.000e-02, 1.591184e-02, 2.100053e-02], '89300 5857': [4.038443e-01, 4.037924e-01, 4.037336e-01]}
rf = pd.DataFrame(data = r)
The .txt file is typically 50k+ rows with an unknown number of '89278 5857' type values.
Thanks!
You can use itertools.groupby:
from itertools import groupby

data, cur_group = {}, None
for v, g in groupby(df[0], lambda k: " " in k):
    if v:
        cur_group = []
        data[next(g)] = cur_group
    else:
        cur_group.extend(g)

df = pd.DataFrame(data)
print(df)
Prints:
89278 5857 89300 5857
0 1.000e-02 4.038443e-01
1 1.591184e-02 4.037924e-01
2 2.100053e-02 4.037336e-01
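As a small follow-up: the grouped values are still strings at this point, so if numbers are wanted, one more conversion mirrors the .astype(float) step in the next answer:
df = df.astype(float)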
Assuming what delineates the start of the next group is a space, here is what I would do:
import numpy

# assumes the single raw column is named 'value' (e.g. df.columns = ['value'])
df.assign(
    key=lambda df: numpy.where(
        df['value'].str.contains(' '),  # what defines each group
        df['value'],
        numpy.nan
    ),
).fillna(
    method='ffill'  # copy the group label down until the next group starts
).loc[
    lambda df: df['value'] != df['key']  # remove the rows that kicked off each group
].assign(
    idx=lambda df: df.groupby('key').cumcount()  # get a row number for each group
).pivot(
    index='idx',  # pivot into the wide format
    columns='key',
    values='value'
).astype(float)  # turn values into numbers instead of strings
And I get:
key 89278 5857 89300 5857
idx
0 0.010000 0.403844
1 0.015912 0.403792
2 0.021001 0.403734
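For completeness, a hedged sketch of the import step the question mentions; the file name, the one-value-per-line layout, and the 'value' column name assumed in the pipeline above are all assumptions here:
import pandas as pd

df = pd.read_table('data.txt', header=None, names=['value'], dtype=str)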

Speed Up Pandas DataFrame Groupby Apply

I have the following code that I found on another post here (and modified slightly). It works great and the output is just as I expect, however I am wondering if anyone has suggestions on speed improvements. I am comparing two dataframes with about 93,000 rows and 110 columns. It takes about 20 minutes for the groupby to complete. I have tried to think of ways to speed it up but haven't come across anything. I am trying to address this now, before my data sizes increase in the future. I am also open to other ways of doing this!
### Function that is called to check values in dataframe groupby
def report_diff(x):
    return 'SAME' if x[0] == x[1] else '{} | {}'.format(*x)
    # return '' if x[0] == x[1] else '{} | {}'.format(*x)

print("Concatenating CSV and XML data together...")
### Concat the dataframes together
df_all = pd.concat(
    [df_csv, df_xml],
    axis='columns',
    keys=['df_csv', 'df_xml'],
    join='outer',
)
print("Done")

print("Swapping column levels...")
### Display keys at the top of each column
df_final = df_all.swaplevel(axis='columns')[df_xml.columns[0:]]
print("Done")

df_final = df_final.fillna('None')

print("Grouping data and checking for matches...")
### Apply report_diff function to each row
df_excel = df_final.groupby(level=0, axis=1).apply(lambda frame: frame.apply(report_diff, axis=1))
You can use np.where to check where df_csv[df_xml.columns] is equal to df_xml: where it is True the value is 'SAME', and otherwise you can join the values of both dataframes as you already do.
SETUP
df_csv = pd.DataFrame({'a':range(4),'b':[0,0,1,1],'c':list('abcd')})
df_xml = pd.DataFrame({'b':[0,2,3,1],'c':list('bbce')})
METHOD
import numpy as np

df_excel = pd.DataFrame(np.where(df_csv[df_xml.columns] == df_xml,  # find where values match
                                 'SAME',                            # True
                                 df_csv[df_xml.columns].astype(str) + ' | ' + df_xml.astype(str)),  # False
                        columns=df_xml.columns,
                        index=df_xml.index)
print (df_excel)
b c
0 SAME a | b
1 0 | 2 SAME
2 1 | 3 SAME
3 SAME d | e
Which is the same result that I got with your method.
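Mapped onto the frames from the question (a sketch; df_csv and df_xml are assumed to share an index, and the fillna('None') step from the question is kept):
left = df_csv[df_xml.columns].fillna('None').astype(str)
right = df_xml.fillna('None').astype(str)
df_excel = pd.DataFrame(np.where(left == right, 'SAME', left + ' | ' + right),
                        columns=df_xml.columns, index=df_xml.index)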

Given a pandas dataframe, is there an easy way to print out a command to generate it?

After running some commands I have a pandas dataframe, eg.:
>>> print df
B A
1 2 1
2 3 2
3 4 3
4 5 4
I would like to print this out so that it produces simple code that would recreate it, eg.:
DataFrame([[2,1],[3,2],[4,3],[5,4]],columns=['B','A'],index=[1,2,3,4])
I tried pulling out each of the three pieces (data, columns and rows):
[[e for e in row] for row in df.iterrows()]
[c for c in df.columns]
[r for r in df.index]
but the first line fails because e is not a value but a Series.
Is there a pre-build command to do this, and if not, how do I do it? Thanks.
You can get the values of the data frame in array format by calling df.values:
df = pd.DataFrame([[2,1],[3,2],[4,3],[5,4]],columns=['B','A'],index=[1,2,3,4])
arrays = df.values
cols = df.columns
index = df.index
df2 = pd.DataFrame(arrays, columns = cols, index = index)
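If the goal is just a string you can paste back in, a minimal sketch built from the same three pieces:
print('DataFrame({}, columns={}, index={})'.format(
    df.values.tolist(), list(df.columns), list(df.index)))
# DataFrame([[2, 1], [3, 2], [4, 3], [5, 4]], columns=['B', 'A'], index=[1, 2, 3, 4])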
Based on #Woody Pride's approach, here is the full solution I am using. It handles hierarchical indices and index names.
from types import MethodType
from pandas import DataFrame, MultiIndex

def _gencmd(df, pandas_as='pd'):
    """
    With this addition to DataFrame's methods, you can use:
        df.command()
    to get the command required to regenerate the dataframe df.
    """
    if pandas_as:
        pandas_as += '.'
    index_cmd = df.index.__class__.__name__
    if type(df.index) == MultiIndex:
        index_cmd += '.from_tuples({0}, names={1})'.format([i for i in df.index], df.index.names)
    else:
        index_cmd += "({0}, name='{1}')".format([i for i in df.index], df.index.name)
    return 'DataFrame({0}, index={1}{2}, columns={3})'.format([[xx for xx in x] for x in df.values],
                                                              pandas_as,
                                                              index_cmd,
                                                              [c for c in df.columns])

# Python 2 original: DataFrame.command = MethodType(_gencmd, None, DataFrame)
DataFrame.command = _gencmd  # in Python 3, assigning the function directly is enough
I have only tested it on a few cases so far and would love a more general solution.
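A quick usage sketch (the exact index class printed depends on your pandas version, so the output is not reproduced here):
df = DataFrame([[2, 1], [3, 2], [4, 3], [5, 4]], columns=['B', 'A'], index=[1, 2, 3, 4])
print(df.command())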
