I have lists of strings within my dataframe columns:
data = [{'column A': '3 item X; 4 item Y; item E of size 7', 'column B': 'item I of size 10; item X has 5 specificities; characteristic W'},
{'column A': '13 item X; item F of size 0; 9 item Y', 'column B': 'item J of size 11; item Y has 8 specificities'}]
df = pd.DataFrame(data)
I want to extract numerical information from the strings that contain integers, for each row.
For instance, I need to create a new column named Size item E that takes the value 7 for the first row of df in column A, since the list contains item E of size 7.
If a value in the list of strings does not contain a number, I just want to encode it as 1 or 0 depending on whether it is present in the original list.
Here is a summary of my desired output:

   Nb item X  Nb item Y  Nb specificities item X  Nb specificities item Y  Size item E  Size item F  Size item I  Size item J  characteristic W
0          3          4                        5                      NaN            7          NaN           10          NaN                 1
1         13          9                      NaN                        8          NaN            0          NaN           11                 0
This is what I have coded so far, applying only 1 rule:
import pandas as pd
import re

def hasNumbers(inputString):
    return any(char.isdigit() for char in inputString)

def transform(df):
    columns = ['column A', 'column B']
    for col in columns:
        temp = df[col].apply(lambda x: str(x).split(';'))
        tokens = set([l for j in temp for l in j])
        for token in tokens:
            try:
                integer = int(re.search(r'\d+', token).group())
            except:
                pass
            if token[0].isdigit():
                df['Nb ' + token.replace('{} '.format(integer), '')] = integer
            # if ...:
            #     ...other rules
            elif hasNumbers(token) == False:
                df[token] = df[col].apply(lambda x: 1 if token in str(x) else 0)
        df = df.drop(col, axis=1)
    return df

df3 = transform(df)
Which returns the following dataframe:
As you can see, I cannot apply my feature extraction by row; it updates the whole pandas Series. Is there any way to update new column values for each row, step by step?
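For reference, single cells can be written row by row with df.loc inside an explicit loop. A minimal sketch for just one rule (the hard-coded regex is my assumption; the answer below shows a cleaner vectorized route):

import re

for idx, row in df.iterrows():
    match = re.search(r'item E of size (\d+)', str(row['column A']))
    if match:
        # df.loc with a scalar index writes a single cell, so only this row is updated
        df.loc[idx, 'Size item E'] = int(match.group(1))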
Don't go for complex functions; pandas has great string-manipulation functions.
Check this code to get the desired output.
import pandas as pd

data = [{'column A': '3 item X; 4 item Y; item E of size 7', 'column B': 'item I of size 10; item X has 5 specificities; characteristic W'},
        {'column A': '13 item X; item F of size 0; 9 item Y', 'column B': 'item J of size 11; item Y has 8 specificities'}]
df = pd.DataFrame(data)

# joining the 2 columns with ';'
df['All Columns joined'] = df[['column A','column B']].apply(lambda x: ';'.join(x), axis=1)
# creating empty dataframe
df_new = pd.DataFrame([])

# desired output logic using the str.extract function
df_new['Nb item X'] = df['All Columns joined'].str.extract(r'([0-9]+) item X', expand=False)
df_new['Nb item Y'] = df['All Columns joined'].str.extract(r'([0-9]+) item Y', expand=False)
df_new['Nb specificities item X'] = df['All Columns joined'].str.extract(r'item X has ([0-9]+) specificities', expand=False)
df_new['Nb specificities item Y'] = df['All Columns joined'].str.extract(r'item Y has ([0-9]+) specificities', expand=False)
df_new['Size item E'] = df['All Columns joined'].str.extract(r'item E of size ([0-9]+)', expand=False)
df_new['Size item F'] = df['All Columns joined'].str.extract(r'item F of size ([0-9]+)', expand=False)
df_new['Size item I'] = df['All Columns joined'].str.extract(r'item I of size ([0-9]+)', expand=False)
df_new['Size item J'] = df['All Columns joined'].str.extract(r'item J of size ([0-9]+)', expand=False)
df_new['characteristic W'] = df['All Columns joined'].str.extract(r'(characteristic W)', expand=False).notnull().astype(int)
df_new
   Nb item X  Nb item Y  Nb specificities item X  Nb specificities item Y  Size item E  Size item F  Size item I  Size item J  characteristic W
0          3          4                        5                      NaN            7          NaN           10          NaN                 1
1         13          9                      NaN                        8          NaN            0          NaN           11                 0
Output of the df_new dataframe.
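If the set of items is not known up front, the same str.extract idea can be driven by a dictionary of patterns instead of one hard-coded line per column. A rough sketch (the rules dict is an assumption to extend for your data; flag-only tokens like characteristic W would still use the notnull().astype(int) trick above):

import re
import pandas as pd

joined = df[['column A', 'column B']].apply(';'.join, axis=1)

# column-name template -> regex capturing the number and the item letter
rules = {
    'Nb item {}': r'(\d+) item ([A-Z])',
    'Size item {}': r'item ([A-Z]) of size (\d+)',
    'Nb specificities item {}': r'item ([A-Z]) has (\d+) specificities',
}

extracted = pd.DataFrame(index=df.index)
for template, pattern in rules.items():
    for idx, text in joined.items():
        for groups in re.findall(pattern, text):
            # the digit group comes first in some patterns and last in others
            value, item = groups if groups[0].isdigit() else groups[::-1]
            extracted.loc[idx, template.format(item)] = int(value)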
Related
I would like to loop over some variable names and the equivalent columns with an added suffix "_plus".
# original dataset
raw_data = {'time': [2,1,4,2],
            'zone': [5,1,3,0],
            'time_plus': [5,6,2,3],
            'zone_plus': [0,9,6,5]}
df = pd.DataFrame(raw_data, columns=['time','zone','time_plus','zone_plus'])
df
#desired dataset
df['time']=df['time']*df['time_plus']
df['zone']=df['zone']*df['zone_plus']
df
I would like to do the multiplication in a more elegant way, through a loop, since I have many variables with this pattern: original name * transformed variable with the _plus suffix
Something similar to this, or better:

my_list = ['time','zone']
for i in my_list:
    df[i] = df[i]*df[i+"_plus"]
Try:
for c in df.filter(regex=r".*(?<!_plus)$", axis=1):
    df[c] *= df[c + "_plus"]

print(df)
Prints:
   time  zone  time_plus  zone_plus
0    10     0          5          0
1     6     9          6          9
2     8    18          2          6
3     6     0          3          5
Or:
for c in df.columns:
    if not c.endswith("_plus"):
        df[c] *= df[c + "_plus"]
raw_data = {'time': [2,1,4,2],
            'zone': [5,1,3,0],
            'time_plus': [5,6,2,3],
            'zone_plus': [0,9,6,5]}
df = pd.DataFrame(raw_data, columns=['time','zone','time_plus','zone_plus'])

# Take every column that doesn't have a "_plus" suffix
cols = [i for i in list(df.columns) if "_plus" not in i]

# Calculate new columns
for col in cols:
    df[str(col + "_2")] = df[col]*df[str(col + "_plus")]
I decided to create the new columns with a "_2" suffix; this way we don't mess up the original data.
for c in df.columns:
    if f"{c}_plus" in df.columns:
        df[c] *= df[f"{c}_plus"]
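If the frame is wide, the Python-level loop can also be avoided entirely. A loop-free sketch (assuming every base column has a _plus partner):

base = [c for c in df.columns if not c.endswith("_plus")]
df[base] = df[base].values * df[[c + "_plus" for c in base]].values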
I have a dataframe with len(df) = 143213 and 47 columns.
I want to add a new column which can have multiple comma-separated values.
Currently, I am doing:
df['1'] = np.where((df.column_1 == 'VALUE') & (df.column_2 != df.column_3), 'AMOUNT ', '')
df['2'] = np.where((df.column_1 == 'VALUE') & (df.column_4 != df.column_5), 'QTY ', '')
df['3'] = np.where((df.column_1 == 'VALUE') & (df.column_2 != df.column_6), 'CC ', '')
df['4'] = np.where((df.column_1 == 'VALUE') & (df.column_12 <= 0), 'MT ', '')
...
df['9'] = np.where((df.column_1 == 'VALUE_2') & (df.column_7 != df.column_9), 'SPP ', '')
df['10'] = np.where((df.column_1 == 'VALUE_2') & (df.column_7 != df.column_10), 'TC', '')
df['11'] = np.where((df.column_1 == 'VALUE_2') & (df.column_11 <= df.column_13) & (df.column_11 > df.column_14) & (df.column_12 <= 0), 'R_AMT ', '')
... and so on
I have created around 33 such columns, df['1'] through df['33'], based on different conditions.
After which, I am doing:
df['new'] = df['1'] + df['2'] + df['3'] + df['4'] + df['5'] + df['6'] + df['7'] + df['8'] + df['9'] + df['10'] + df['11'] + ... + df['33']
df['new1'] = df['new'].str.strip(' ')
df['required'] = df['new1'].apply(lambda x: x.replace(' ', ','))
and then dropping the columns.
Is there a better way to do this? I tried np.select, but it doesn't seem to be a good option when multiple conditions might get satisfied.
I'm not sure whether there is a way to vectorize this using Pandas primitives, but it can be done a bit more efficiently and cleanly than in your example. For example, chained additions like df['1']+df['2']+df['3']+... are not a good idea because every + will allocate a new Series object; also, string additions are not very efficient in Python.
import pandas as pd
import numpy as np
# Create example dataframe
df = pd.DataFrame(dict(col1=[1, 2, 3, 4, 5], col2=[1, 4, 3, 3, 7]))
# Store the output of the conditions as boolean values in an empty
# dataframe.
tmp_df = pd.DataFrame(index=df.index)
tmp_df['EQUAL'] = df['col1'] == df['col2']
tmp_df['EVENSUM'] = ((df['col1'] + df['col2']) % 2 == 0)
print(f'tmp_df:\n{tmp_df}')
# This is the slow part.
# iterrows() returns tuples (index, row_as_series).
df['required'] = [
    ','.join(r[r].index)
    for _, r in tmp_df.iterrows()
]
print(f'df:\n{df}')
Output:
tmp_df:
EQUAL EVENSUM
0 True True
1 False True
2 True True
3 False False
4 False True
df:
col1 col2 required
0 1 1 EQUAL,EVENSUM
1 2 4 EVENSUM
2 3 3 EQUAL,EVENSUM
3 4 3
4 5 7 EVENSUM
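The iterrows() step can likely be vectorized too: since tmp_df holds booleans, a dot product with the column names concatenates the matching labels in one shot (a sketch under that boolean assumption):

# True/False times a label string keeps/drops it; dot() then sums the pieces
df['required'] = tmp_df.dot(tmp_df.columns + ',').str.rstrip(',')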
I have a dataframe df where
df.index[df.item == 'alcohol'][0]
gives me 45, but I want 2.
Please suggest.
Use pandas.Index.get_loc, i.e.:
import pandas as pd
df = pd.DataFrame(columns = ['x'])
df.loc[10] = None
df.loc[20] = None
df.loc[30] = 1
print(df.index.get_loc(30))
>> 2
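Applied to the question's lookup, e.g. (a sketch against the sample frame used in the next answer, indexed [23, 45, 89]):

label = df.index[df.item == 'alcohol'][0]  # 45, an index label
print(df.index.get_loc(label))             # 1, the 0-based position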
If possible, create default index values with reset_index:

df = df.reset_index(drop=True)
out = df.index[df.item == 'alcohol'][0]

# general solution if values are possibly not matched
out = next(iter(df.index[df.item == 'alcohol']), 'not matched')
Solution working with any index values:
out = next(iter(np.where(df.item == 'alcohol')[0]), 'not matched')
Sample:
df = pd.DataFrame({'item': ['food','alcohol','drinks']}, index=[23,45,89])
print (df)
item
23 food
45 alcohol
89 drinks
#test your output
print (df.index[df.item == 'alcohol'][0])
45
#python counts from 0, so for second value get 1
out = next(iter(np.where(df.item == 'alcohol')[0]), 'not matched')
print (out)
1
# condition not matched, so np.where returns an empty array and the default is used
out = next(iter(np.where(df.item == 'a')[0]), 'not matched')
print (out)
not matched
Use index after filtering:
df[df.item == 'alcohol'].index
Index(['row 2'], dtype='object')
If you want the output to be 2 then:
indices = df[df.item == 'alcohol'].index
indices.str[-1:]
Index(['2'], dtype='object')
If want a list:
indices.str[-1:].tolist()
['2']
If the row number has more than 1 digit, use:
indices.str.extract(r'(\d+)', expand=False)
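The extracted digits are strings; a small usage note, if you need integers (under the setup shown below):

indices.str.extract(r'(\d+)', expand=False).astype(int).tolist()
# [2]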
Initial setup:
df = pd.DataFrame({"index":[23,45,89],"item":['food','alcohol','drinks']},
index=['row 1','row 2','row 3'])
df
index item
row 1 23 food
row 2 45 alcohol
row 3 89 drinks
df.loc[df['item'] == 'alcohol'].index

it gives you:

Index(['row 2'], dtype='object')

If you want the "iloc" position:

value_to_find = df.loc[df['item'] == 'alcohol'].index.tolist()[0]
row_indexes = df.index.tolist()
position = row_indexes.index(value_to_find)
print(position)

Note: positions start at 0, so you get 1 here; you are expecting to count from 1, right? If so:

position = row_indexes.index(value_to_find) + 1
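For reference, this manual position lookup can be collapsed into one line with Index.get_loc (a sketch against the same setup):

position = df.index.get_loc(df.loc[df['item'] == 'alcohol'].index[0])
print(position)  # 1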
Imagine a pandas data frame given by
df = pd.DataFrame({
    'id': range(5),
    'desc': ('This is text', 'John Doe ABC', 'John Doe', 'Something JKL', 'Something more'),
    'mfr': ('ABC', 'DEF', 'DEF', 'GHI', 'JKL')
})
which yields
id desc mfr
0 0 This is text ABC
1 1 John Doe ABC DEF
2 2 John Doe DEF
3 3 Something JKL GHI
4 4 Something more JKL
I wish to determine which ids belong to each other. Either they are matched by the mfr column, or an mfr value is contained in the desc column. E.g. id = 1 and 2 are in the same group because their mfr values are equal, but id = 0 and 1 are also in the same group since ABC, the mfr of id = 0, is part of the desc of id = 1.
The resulting data frame should be
id desc mfr group
0 0 This is text ABC 0
1 1 John Doe ABC DEF 0
2 2 John Doe DEF 0
3 3 Something JKL GHI 1
4 4 Something more JKL 1
Is there anyone out there with a good solution for this? I imagine there are no really simple ones, so any is welcome.
I'm assuming 'desc' does not contain multiple 'mfr' values
Solution1:
import numpy as np
import pandas as pd

# original dataframe
df = pd.DataFrame({
    'id': range(5),
    'desc': ('This is text', 'John Doe ABC', 'John Doe', 'Something JKL', 'Something more'),
    'mfr': ('ABC', 'DEF', 'DEF', 'GHI', 'JKL')
})
# for final merge
ori = df.copy()
# max words used in 'desc'
max_len = max(df.desc.apply(lambda x: len(x.split(' '))))
# unique 'mfr' values
uniq_mfr = df.mfr.unique().tolist()
# if the list is shorter than max_len, pad it with nan
def padding(lst, mx):
    for i in range(mx):
        if len(lst) < mx:
            lst.append(np.nan)
    return lst

df['desc'] = df.desc.apply(lambda x: x.split(' ')).apply(padding, args=(max_len,))

# each word makes 1 column
for i in range(max_len):
    newcol = 'desc{}'.format(i)
    df[newcol] = df.desc.apply(lambda x: x[i])
    df.loc[~df[newcol].isin(uniq_mfr), newcol] = np.nan
# merge created columns into 1 by taking 'mfr' values only
df['desc'] = df[df.columns[3:]].fillna('').sum(axis=1).replace('', np.nan)
# create [ABC, ABC] type of column by merging two columns (desc & mfr)
df = df[df.columns[:3]]
df.desc.fillna(df.mfr, inplace=True)
df.desc = [[x, y] for x, y in zip(df.desc.tolist(), df.mfr.tolist())]
df = df[['id', 'desc']]
df = df.sort_values('desc').reset_index(drop=True)
# BELOW IS COMMON WITH SOLUTION2
# from here I borrowed the solution by @mimomu from the URL below (slightly modified)
# to merge tuples based on their common elements
# https://stackoverflow.com/questions/4842613/merge-lists-that-share-common-elements
import itertools

L = df.desc.tolist()
LL = set(itertools.chain.from_iterable(L))
for each in LL:
    components = [x for x in L if each in x]
    for i in components:
        L.remove(i)
        L += [tuple(set(itertools.chain.from_iterable(components)))]

# allocate the merged tuples to 'desc'
df['desc'] = sorted(L)

# grouping by 'desc' value (a tuple can be a key, a list cannot, fyi)
df['group'] = df.groupby('desc').grouper.group_info[0]
# merge with the original
df = df.drop('desc', axis=1).merge(ori, on='id', how='left')
df = df[['id', 'desc', 'mfr', 'group']]
Solution2 (2nd half is common with Solution1):
import numpy as np
import pandas as pd

# original dataframe
df = pd.DataFrame({
    'id': range(5),
    'desc': ('This is text', 'John Doe ABC', 'John Doe', 'Something JKL', 'Something more'),
    'mfr': ('ABC', 'DEF', 'DEF', 'GHI', 'JKL')
})
# for final merge
ori = df.copy()
# unique 'mfr' values
uniq_mfr = df.mfr.unique().tolist()
# make desc entries as lists
df['desc'] = df.desc.apply(lambda x: x.split(' '))
# pick up the mfr value in each desc, otherwise nan
mfr_in_descs = []
for ds, ms in zip(df.desc, df.mfr):
    for i, d in enumerate(ds):
        if d in uniq_mfr:
            mfr_in_descs.append(d)
            break  # append exactly once per row
        if i == (len(ds) - 1):
            mfr_in_descs.append(np.nan)
# create column whose element is like [ABC, ABC]
df['desc'] = mfr_in_descs
df['desc'].fillna(df.mfr, inplace=True)
df['desc'] = [[x, y] for x, y in zip(df.desc.tolist(), df.mfr.tolist())]
df = df[['id', 'desc']]
df = df.sort_values('desc').reset_index(drop=True)
# BELOW IS COMMON WITH SOLUTION1
# from here I borrowed the solution by @mimomu from the URL below (slightly modified)
# to merge tuples based on their common elements
# https://stackoverflow.com/questions/4842613/merge-lists-that-share-common-elements
import itertools

L = df.desc.tolist()
LL = set(itertools.chain.from_iterable(L))
for each in LL:
    components = [x for x in L if each in x]
    for i in components:
        L.remove(i)
        L += [tuple(set(itertools.chain.from_iterable(components)))]

# allocate the merged tuples to 'desc'
df['desc'] = sorted(L)

# grouping by 'desc' value (a tuple can be a key, a list cannot, fyi)
df['group'] = df.groupby('desc').grouper.group_info[0]
# merge with the original
df = df.drop('desc', axis=1).merge(ori, on='id', how='left')
df = df[['id', 'desc', 'mfr', 'group']]
From the 2 solutions above, I get the same resulting df:
id desc mfr group
0 0 This is text ABC 0
1 1 John Doe ABC DEF 0
2 2 John Doe DEF 0
3 3 Something JKL GHI 1
4 4 Something more JKL 1
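For comparison, the same grouping can be phrased much more compactly as connected components of a graph. A sketch using networkx (an extra dependency; it starts again from the original df and assumes mfr codes appear as single whitespace-separated words in desc):

import networkx as nx

G = nx.Graph()
G.add_nodes_from(df['id'])

# connect ids that share an mfr value
for _, grp in df.groupby('mfr'):
    ids = grp['id'].tolist()
    G.add_edges_from(zip(ids, ids[1:]))

# connect ids whose desc mentions some row's mfr value
mfr_values = set(df['mfr'])
for _, row in df.iterrows():
    for word in row['desc'].split():
        if word in mfr_values:
            G.add_edges_from((row['id'], other) for other in df.loc[df['mfr'] == word, 'id'])

# number the components 0, 1, ... by their smallest member id
group_of = {n: g
            for g, comp in enumerate(sorted(nx.connected_components(G), key=min))
            for n in comp}
df['group'] = df['id'].map(group_of)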
I have a list of lists:
list = [
    ['Row 1', 'Value 1'],
    ['Row 2', 'Value 2'],
    ['Row 3', 'Value 3', 'Value 4']
]
And I have a list for dataframe header:
header_list = ['RowID', 'Value']
If I create the DataFrame using df = pd.DataFrame(list, columns = header_list), then Python will throw an error saying Row 3 has more than 2 columns, which is inconsistent with the header_list.
So how can I skip Row 3 when creating the DataFrame? And how can I achieve this with an "in-place" calculation, meaning NOT creating a new list by looping through the original list and appending only the items with length == 2?
Thanks for the help!
First, change the variable name list to L, because list is a Python built-in name.
Then, for the filter, use a list comprehension:
L = [['Row 1','Value 1'], ['Row 2', 'Value 2'], ['Row 3', 'Value 3', 'Value 4']]

# omit all rows whose length != 2
df = pd.DataFrame([x for x in L if len(x) == 2], columns=header_list)
print (df)
RowID Value
0 Row 1 Value 1
1 Row 2 Value 2
# keep only the last 2 values if len != 2
df = pd.DataFrame([x if len(x) == 2 else x[-2:] for x in L], columns=header_list)
print (df)
RowID Value
0 Row 1 Value 1
1 Row 2 Value 2
2 Value 3 Value 4
Or:
# keep only the first 2 values if len != 2
df = pd.DataFrame([x if len(x) == 2 else x[:2] for x in L], columns=header_list)
print (df)
RowID Value
0 Row 1 Value 1
1 Row 2 Value 2
2 Row 3 Value 3
Try the code below:

list1 = [['Row 1','Value 1'], ['Row 2', 'Value 2'], ['Row 3', 'Value 3', 'Value 4']]
dff = pd.DataFrame(list1)
# keep only the first len(header_list) columns
dff = dff[[x for x in range(len(header_list))]]
dff.columns = header_list
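A quick check of the result (output computed from the data above):

print(dff)

   RowID    Value
0  Row 1  Value 1
1  Row 2  Value 2
2  Row 3  Value 3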