I would like to replace the letters with their position number in the alphabet.
import string
import pandas as pd
new_vals = {c: ord(c)-96 for c in string.ascii_lowercase}
df = pd.DataFrame({'Values': ['aaa', 'abc', 'def']})
df['Values_new'] = [''.join(str(new_vals[c]) for c in row) for row in df['Values']]
df is now:
>>> df
Values Values_new
0 aaa 111
1 abc 123
2 def 456
Then you can go in and add your what-seems-like-decimal notation, although the logic there seems a little unclear to me (you have a comma listed above):
df['Values_new'] = [v[0] + '.' + v[1:] for v in df['Values_new']]
Result:
>>> df
Values Values_new
0 aaa 1.11
1 abc 1.23
2 def 4.56
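As an aside, the same mapping can be done in one pass with str.translate; this is just a sketch of an equivalent approach, not part of the original answer:

```python
import string
import pandas as pd

# build a char -> digit-string translation table once,
# then let str.translate apply it to every row
table = str.maketrans({c: str(ord(c) - 96) for c in string.ascii_lowercase})
df = pd.DataFrame({'Values': ['aaa', 'abc', 'def']})
df['Values_new'] = df['Values'].str.translate(table)
```

str.maketrans accepts a dict mapping single characters to replacement strings, so multi-digit positions (e.g. 'z' -> '26') work too.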
I want to remove the all string after last underscore from the dataframe. If I my data in dataframe looks like.
AA_XX,
AAA_BB_XX,
AA_BB_XYX,
AA_A_B_YXX
I would like to get this result
AA,
AAA_BB,
AA_BB,
AA_A_B
You can do this simply using Series.str.split and Series.str.join:
In [2381]: df
Out[2381]:
col1
0 AA_XX
1 AAA_BB_XX
2 AA_BB_XYX
3 AA_A_B_YXX
In [2386]: df['col1'] = df['col1'].str.split('_').str[:-1].str.join('_')
In [2387]: df
Out[2387]:
col1
0 AA
1 AAA_BB
2 AA_BB
3 AA_A_B
pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']})['col'].apply(lambda r: '_'.join(r.split('_')[:-1]))
Explanation:
df = pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']})
Creates
col
0 AA_XX
1 AAA_BB_XX
2 AA_BB_XYX
3 AA_A_B_YXX
Use apply to loop through the column you want to edit.
I split the string at _ and then joined all parts back together, leaving out the last part.
df['col'] = df['col'].apply(lambda r: '_'.join(r.split('_')[:-1]))
print(df)
Results:
col
0 AA
1 AAA_BB
2 AA_BB
3 AA_A_B
If your dataset contains values like AA (values without an underscore), change the lambda like this:
df = pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX', 'AA']})
df['col'] = df['col'].apply(lambda r: '_'.join(r.split('_')[:-1]) if len(r.split('_')) > 1 else r)
print(df)
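Another option (a sketch I'm adding, not part of the answer above) is Series.str.rsplit with n=1, which splits at most once from the right and leaves underscore-free values untouched:

```python
import pandas as pd

df = pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX', 'AA']})
# split once from the right; a value with no underscore yields a one-element
# list, so .str[0] returns it unchanged
df['col'] = df['col'].str.rsplit('_', n=1).str[0]
print(df)
```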
Here is another way of going about it.
import pandas as pd
data = {'s': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']}
df = pd.DataFrame(data)
def cond1(s):
    temp_s = s.split('_')
    if len(temp_s) == 1:
        return s
    return '_'.join(temp_s[:-1])
df['result'] = df['s'].apply(cond1)
Say I have a dataframe like below:
df = pd.DataFrame({0:['Hello World!']}) # here df could have more than one column of data as shown below
df = pd.DataFrame({0:['Hello World!'], 1:['Hello Mars!']}) # or df could have more than one row of data as shown below
df = pd.DataFrame({0:['Hello World!', 'Hello Mars!']})
and I also have a list of column names like below:
new_col_names = ['a','b','c','d'] # here, len(new_col_names) might vary like below
new_col_names = ['a','b','c','d','e'] # but we can always be sure that the len(new_col_names) >= len(df.columns)
Given that, how could I replace the column names in df such that it results something like below:
df = pd.DataFrame({0:['Hello World!']})
new_col_names = ['a','b','c','d']
# result would be like this
a b c d
Hello World! (empty string) (empty string) (empty string)
df = pd.DataFrame({0:['Hello World!'], 1:['Hello Mars!']})
new_col_names = ['a','b','c','d']
# result would be like this
a b c d
Hello World! Hello Mars! (empty string) (empty string)
df = pd.DataFrame({0:['Hello World!', 'Hello Mars!']})
new_col_names = ['a','b','c','d','e']
a b c d e
Hello World! (empty string) (empty string) (empty string) (empty string)
Hello Mars! (empty string) (empty string) (empty string) (empty string)
From reading around StackOverflow answers such as this, I have a vague idea that it could be something like below:
df[new_col_names] = '' # but this returns KeyError
# or this
df.columns=new_col_names # but this returns ValueError: Length mismatch (of course)
If someone could show me a way to overwrite the existing dataframe column names and, at the same time, add new data columns with empty-string values in the rows, I'd greatly appreciate the help.
The idea is to create a dictionary from the existing column names with zip, rename only the existing columns, and then add all the new ones with DataFrame.reindex:
df = pd.DataFrame({0:['Hello World!', 'Hello Mars!']})
new_col_names = ['a','b','c','d','e']
df1 = (df.rename(columns=dict(zip(df.columns, new_col_names)))
.reindex(new_col_names, axis=1, fill_value=''))
print (df1)
a b c d e
0 Hello World!
1 Hello Mars!
df1 = (df.rename(columns=dict(zip(df.columns, new_col_names)))
.reindex(new_col_names, axis=1))
print (df1)
a b c d e
0 Hello World! NaN NaN NaN NaN
1 Hello Mars! NaN NaN NaN NaN
Here is a function that will do what you want. I couldn't find a one-liner, but jezrael did: his answer.
import pandas as pd
# function
def rename_add_col(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    c_len = len(df.columns)
    if c_len == len(cols):
        df.columns = cols
    else:
        df.columns = cols[:c_len]
        df = pd.concat([df, pd.DataFrame(columns=cols[c_len:])])
    return df
# create dataframe
t1 = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', '5', '6'], 'c': ['7', '8', '9']})
a b c
0 1 4 7
1 2 5 8
2 3 6 9
# call function
cols = ['d', 'e', 'f']
t1 = rename_add_col(t1, cols)
d e f
0 1 4 7
1 2 5 8
2 3 6 9
# call function
cols = ['g', 'h', 'i', 'new1', 'new2']
t1 = rename_add_col(t1, cols)
g h i new1 new2
0 1 4 7 NaN NaN
1 2 5 8 NaN NaN
2 3 6 9 NaN NaN
This might help you do it all at once
Use your old DataFrame to recreate another DataFrame with the pd.DataFrame() method, then add the new column names via list addition in the columns parameter.
Note: this adds the new columns with NaN values; the workaround is to follow up with df.fillna('').
pd.DataFrame(df.to_dict() , columns = list(df.columns)+['b','c'])
Hope this helps! Cheers!
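To make that concrete for the question's example, here is a sketch combining a rename with the fillna workaround mentioned above (my own illustration, using the names from the question):

```python
import pandas as pd

df = pd.DataFrame({0: ['Hello World!', 'Hello Mars!']})
new_col_names = ['a', 'b', 'c', 'd', 'e']

# rename the existing columns first, then rebuild with the full column list;
# the brand-new columns come out as NaN, which fillna('') turns into empty strings
df = df.rename(columns=dict(zip(df.columns, new_col_names)))
out = pd.DataFrame(df.to_dict(), columns=new_col_names).fillna('')
print(out)
```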
I am reading data from a csv file like:
import pandas as pd
data_1=pd.read_csv("sample.csv")
data_1.head(10)
It has two columns:
ID detail
1 [{'a': 1, 'b': 1.85, 'c': 'aaaa', 'd': 6}, {'a': 2, 'b': 3.89, 'c': 'bbbb', 'd': 10}]
The detail column is not JSON but a list of dicts, and I want to flatten it so the result looks like this:
ID a b c d
1 1 1.85 aaaa 6
1 2 3.89 bbbb 10
I always get a, b, c, d in the detail column and want to move the final result to a SQL table.
Can someone please help me solve this?
Use a dictionary comprehension with ast.literal_eval to convert the string reprs to lists of dicts and build a DataFrame from each, then use concat and convert the first level of the MultiIndex to the ID column:
import ast
d = {i: pd.DataFrame(ast.literal_eval(d)) for i, d in df[['ID','detail']].to_numpy()}
#for older pandas versions use .values
#d = {i: pd.DataFrame(ast.literal_eval(d)) for i, d in df[['ID','detail']].values}
df = pd.concat(d).reset_index(level=1, drop=True).rename_axis('ID').reset_index()
print (df)
ID a b c d
0 1 1 1.85 aaaa 6
1 1 2 3.89 bbbb 10
Or use a list comprehension with DataFrame.assign for the ID column; the only change necessary is the column order (move the last column to the front):
import ast
L = [pd.DataFrame(ast.literal_eval(d)).assign(ID=i) for i, d in df[['ID','detail']].to_numpy()]
#for older pandas versions use .values
#L = [pd.DataFrame(ast.literal_eval(d)).assign(ID=i) for i, d in df[['ID','detail']].values]
df = pd.concat(L, ignore_index=True)
df = df[df.columns[-1:].tolist() + df.columns[:-1].tolist()]
print (df)
ID a b c d
0 1 1 1.85 aaaa 6
1 1 2 3.89 bbbb 10
EDIT:
For 2 IDs change second solution:
d = [pd.DataFrame(ast.literal_eval(d)).assign(ID1=i1, ID2=i2) for i1, i2, d in df[['ID1','ID2','detail']].to_numpy()]
df = pd.concat(d)
df = df[df.columns[-2:].tolist() + df.columns[:-2].tolist()]
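An alternative sketch (my own addition, not from the answer above) using DataFrame.explode, available in pandas 0.25+; the sample data is reconstructed here for illustration:

```python
import ast
import pandas as pd

df = pd.DataFrame({
    'ID': [1],
    'detail': ["[{'a': 1, 'b': 1.85, 'c': 'aaaa', 'd': 6}, "
               "{'a': 2, 'b': 3.89, 'c': 'bbbb', 'd': 10}]"],
})

# parse the string repr, give each dict its own row, then expand into columns
df['detail'] = df['detail'].apply(ast.literal_eval)
out = df.explode('detail').reset_index(drop=True)
out = pd.concat([out[['ID']], pd.DataFrame(out['detail'].tolist())], axis=1)
print(out)
```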
Imagine a pandas data frame given by
df = pd.DataFrame({
'id': range(5),
'desc': ('This is text', 'John Doe ABC', 'John Doe', 'Something JKL', 'Something more'),
'mfr': ('ABC', 'DEF', 'DEF', 'GHI', 'JKL')
})
which yields
id desc mfr
0 0 This is text ABC
1 1 John Doe ABC DEF
2 2 John Doe DEF
3 3 Something JKL GHI
4 4 Something more JKL
I wish to determine which ids belong to each other. They are matched either by the mfr column, or when an mfr value is contained in the desc column. E.g. id = 1 and 2 are in the same group because their mfr values are equal, but id = 0 and 1 are also in the same group since ABC, the mfr of id = 0, is part of desc in id = 1.
The resulting data frame should be
id desc mfr group
0 0 This is text ABC 0
1 1 John Doe ABC DEF 0
2 2 John Doe DEF 0
3 3 Something JKL GHI 1
4 4 Something more JKL 1
Is there anyone out there with a good solution for this? I imagine there are no really simple ones, so any solution is welcome.
I'm assuming 'desc' does not contain multiple 'mfr' values
Solution1:
import numpy as np
import pandas as pd
# original dataframe
df = pd.DataFrame({
'id': range(5),
'desc': ('This is text', 'John Doe ABC', 'John Doe', 'Something JKL', 'Something more'),
'mfr': ('ABC', 'DEF', 'DEF', 'GHI', 'JKL')
})
# for final merge
ori = df.copy()
# max words used in 'desc'
max_len = max(df.desc.apply(lambda x: len(x.split(' '))))
# unique 'mfr' values
uniq_mfr = df.mfr.unique().tolist()
# if list is less than max len, then pad with nan
def padding(lst, mx):
    for i in range(mx):
        if len(lst) < mx:
            lst.append(np.nan)
    return lst
df['desc'] = df.desc.apply(lambda x: x.split(' ')).apply(padding, args=(max_len,))
# each word makes 1 column
for i in range(max_len):
newcol = 'desc{}'.format(i)
df[newcol] = df.desc.apply(lambda x: x[i])
df.loc[~df[newcol].isin(uniq_mfr), newcol] = np.nan
# merge created columns into 1 by taking 'mfr' values only
df['desc'] = df[df.columns[3:]].fillna('').sum(axis=1).replace('', np.nan)
# create [ABC, ABC] type of column by merging two columns (desc & mfr)
df = df[df.columns[:3]]
df.desc.fillna(df.mfr, inplace=True)
df.desc = [[x, y] for x, y in zip(df.desc.tolist(), df.mfr.tolist())]
df = df[['id', 'desc']]
df = df.sort_values('desc').reset_index(drop=True)
# BELOW IS COMMON WITH SOLUTION2
# from here I borrowed the solution by #mimomu from below URL (slightly modified)
# try to get merged tuple based on the common elements
# https://stackoverflow.com/questions/4842613/merge-lists-that-share-common-elements
import itertools
L = df.desc.tolist()
LL = set(itertools.chain.from_iterable(L))
for each in LL:
    components = [x for x in L if each in x]
    for i in components:
        L.remove(i)
    L += [tuple(set(itertools.chain.from_iterable(components)))]
# allocate merged tuple to 'desc'
df['desc'] = sorted(L)
# grouping by 'desc' value (tuple can be key list cannot be fyi...)
df['group'] = df.groupby('desc').grouper.group_info[0]
# merge with the original
df = df.drop('desc', axis=1).merge(ori, on='id', how='left')
df = df[['id', 'desc', 'mfr', 'group']]
Solution2 (2nd half is common with Solution1):
import numpy as np
import pandas as pd
# original dataframe
df = pd.DataFrame({
'id': range(5),
'desc': ('This is text', 'John Doe ABC', 'John Doe', 'Something JKL', 'Something more'),
'mfr': ('ABC', 'DEF', 'DEF', 'GHI', 'JKL')
})
# for final merge
ori = df.copy()
# unique 'mfr' values
uniq_mfr = df.mfr.unique().tolist()
# make desc entries as lists
df['desc'] = df.desc.apply(lambda x: x.split(' '))
# pick up mfr values in desc column otherwise nan
mfr_in_descs = []
for ds, ms in zip(df.desc, df.mfr):
    for i, d in enumerate(ds):
        if d in uniq_mfr:
            mfr_in_descs.append(d)
            break  # append at most one match per row
        if i == (len(ds) - 1):
            mfr_in_descs.append(np.nan)
# create column whose element is like [ABC, ABC]
df['desc'] = mfr_in_descs
df['desc'].fillna(df.mfr, inplace=True)
df['desc'] = [[x, y] for x, y in zip(df.desc.tolist(), df.mfr.tolist())]
df = df[['id', 'desc']]
df = df.sort_values('desc').reset_index(drop=True)
# BELOW IS COMMON WITH SOLUTION1
# from here I borrowed the solution by #mimomu from below URL (slightly modified)
# try to get merged tuple based on the common elements
# https://stackoverflow.com/questions/4842613/merge-lists-that-share-common-elements
import itertools
L = df.desc.tolist()
LL = set(itertools.chain.from_iterable(L))
for each in LL:
    components = [x for x in L if each in x]
    for i in components:
        L.remove(i)
    L += [tuple(set(itertools.chain.from_iterable(components)))]
# allocate merged tuple to 'desc'
df['desc'] = sorted(L)
# grouping by 'desc' value (tuple can be key list cannot be fyi...)
df['group'] = df.groupby('desc').grouper.group_info[0]
# merge with the original
df = df.drop('desc', axis=1).merge(ori, on='id', how='left')
df = df[['id', 'desc', 'mfr', 'group']]
From 2 solutions above, I get the same results df:
id desc mfr group
0 0 This is text ABC 0
1 1 John Doe ABC DEF 0
2 2 John Doe DEF 0
3 3 Something JKL GHI 1
4 4 Something more JKL 1
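For comparison, here is a shorter sketch of the same grouping using a plain union-find over rows whose key sets intersect (each row's key set is its own mfr plus any mfr value found in its desc). This is my own alternative, under the same assumption that desc contains at most the known mfr tokens:

```python
import pandas as pd
from itertools import combinations

df = pd.DataFrame({
    'id': range(5),
    'desc': ('This is text', 'John Doe ABC', 'John Doe', 'Something JKL', 'Something more'),
    'mfr': ('ABC', 'DEF', 'DEF', 'GHI', 'JKL'),
})

# each row's keys: its own mfr plus any known mfr value appearing in desc
mfrs = set(df['mfr'])
keys = [{m} | (set(d.split()) & mfrs) for d, m in zip(df['desc'], df['mfr'])]

# union-find: merge any two rows whose key sets intersect
parent = list(range(len(df)))

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

for i, j in combinations(range(len(df)), 2):
    if keys[i] & keys[j]:
        parent[find(i)] = find(j)

# factorize the root of each row into consecutive group numbers
df['group'] = pd.factorize([find(i) for i in range(len(df))])[0]
print(df)
```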
I have a column containing values. I want to split it based on a regex. If the regex matches, the original value will be replaced with the left-side of the split. A new column will contain the right-side of a split.
Below is some sample code. I feel I am close but it isn't quite working.
import pandas as pd
import re
df = pd.DataFrame({ 'A' : ["test123","foo"]})
# Regex example to split it if it ends in numbers
r = r"^(.+?)(\d*)$"
df['A'], df['B'] = zip(*df['A'].apply(lambda x: x.split(r, 1)))
print(df)
In the example above I would expect the following output
A B
0 test 123
1 foo
I am fairly new to Python and assumed this would be the way to go. However, it appears that I haven't quite hit the mark. Is anyone able to help me correct this example?
Just based on your own regex:
df.A.str.split(r,expand=True).replace('',np.nan).dropna(thresh=1,axis=1).fillna('')
Out[158]:
1 2
0 test 123
1 foo
df[['A','B']]=df.A.str.split(r,expand=True).replace('',np.nan).dropna(thresh=1,axis=1).fillna('')
df
Out[160]:
A B
0 test 123
1 foo
Your regex is working just fine, use it with str.extract
df = pd.DataFrame({ 'A' : ["test123","foo", "12test3"]})
df[['A', 'B']] = df['A'].str.extract(r"^(.+?)(\d*)$", expand = True)
A B
0 test 123
1 foo
2 12test 3
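A variant (a sketch I'm adding) uses named capture groups, which lets str.extract name the output columns directly:

```python
import pandas as pd

df = pd.DataFrame({'A': ['test123', 'foo']})
# (?P<name>...) groups become the column names of the extracted frame;
# \d* matches the empty string, so non-numeric tails yield '' rather than NaN
out = df['A'].str.extract(r'^(?P<A>.+?)(?P<B>\d*)$')
print(out)
```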
def bar(x):
    els = re.findall(r'^(.+?)(\d*)$', x)[0]
    if len(els):
        return els
    else:
        return x, None

def foo():
    df = pd.DataFrame({'A': ["test123", "foo"]})
    df['A'], df['B'] = zip(*df['A'].apply(bar))
    print(df)
result:
A B
0 test 123
1 foo