Hello, I have a dataframe such as:
COL1 COL2 COL3
G1 1 [[(OK2_+__HELLO,OJ_+__BY),(LO_-__HOLLA,KUOJ_+__BY)]]
G1 2 [[(JU3_+__BO,UJ3_-__GET)]]
How can I apply re.sub(r'.*__', '') within the COL3 sublists
and get a new column without everything before '__':
COL1 COL2 COL3 COL4
G1 1 [[(OK2_+__HELLO,OJ_+__BY),(LO_-__HOLLA,KUOJ_+__BY)]] [[(HELLO,BY),(HOLLA,BY)]]
G1 2 [[(JU3_+__BO,UJ3_-__GET)]] [[(BO,GET)]]
Here is the data:
data= {'COL1': {0: 'G1', 1: 'G1'}, 'COL2': {0: 1, 1: 2}, 'COL3 ': {0: "[[(OK2_+__HELLO,OJ_+__BY),(LO_-__HOLLA,KUOJ_+__BY)]]", 1: "[[(JU3_+__BO,UJ3_-__GET)]]"}}
df = pd.DataFrame.from_dict(data)
Updated data solution
data= {'COL1': {0: 'G1', 1: 'G1'}, 'COL2': {0: 1, 1: 2}, 'COL3 ': {0: "[[(OK2_+__HELLO,OJ_+__BY),(LO_-__HOLLA,KUOJ_+__BY)]]", 1: "[[(JU3_+__BO,UJ3_-__GET)]]"}}
df = pd.DataFrame.from_dict(data)
df['COL4'] = df['COL3 '].str.replace(r"([,(])[^(),]*__", r"\1", regex=True)
df['COL4']
# => 0 [[(HELLO,BY),(HOLLA,BY)]]
# 1 [[(BO,GET)]]
# Name: COL4, dtype: object
The pattern ([,(])[^(),]*__ matches a ( or , delimiter (captured into group 1), then any run of characters other than (, ) and , up to the last __ in that segment; the replacement \1 keeps just the delimiter. Note that recent pandas versions require regex=True for Series.str.replace to treat the pattern as a regex.
Old data solution
You can use ast.literal_eval to turn the strings in the COL3 column into lists of lists and iterate over them while modifying the tuple items:
import ast
import re
import pandas as pd
data= {'COL1': {0: 'G1', 1: 'G1'}, 'COL2': {0: 1, 1: 2}, 'COL3 ': {0: "[[('OK2_+__HELLO','OJ_+__BY'),('LO_-__HOLLA','KUOJ_+__BY')]]", 1: "[[('JU3_+__BO','UJ3_-__GET')]]"}}
df = pd.DataFrame.from_dict(data)
def repl(m):
    result = []
    for l in ast.literal_eval(m):
        ll = []
        for x, y in l:
            # Strip everything up to and including the last '__' in each item
            ll.append((re.sub(r'.*__', '', x), re.sub(r'.*__', '', y)))
        result.append(ll)
    return str(result)
df['COL4'] = df['COL3 '].apply(repl)
df['COL4']
# => 0 [[('HELLO', 'BY'), ('HOLLA', 'BY')]]
# 1 [[('BO', 'GET')]]
You do not need str(result) if you are OK with keeping the result as a list of lists.
I load a CSV document with a varying number of columns, so I get this error:
Expected 12 fields in line 29, saw 13
To avoid this error, I use the hack names=range(24):
df = pd.read_csv(filename, header=None, quoting=csv.QUOTE_NONE, dtype='object', sep=data_file_delimiter, engine='python', encoding='utf-8', names=range(24))
The problem is that I need to know the real number of columns, because I group the data further into a dict:
data = {}
with open(filename) as f:
    for line in f:
        row = line.strip().split(' ')
        if len(row) not in data:
            data[len(row)] = []
        data[len(row)].append(row)
You can get the number of columns with len(df.columns), although with the names=range(24) hack that count includes the padding columns.
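A minimal sketch for counting only the columns that actually hold data, assuming the padding columns created by names=range(24) come out entirely NaN:
real_cols = len(df.dropna(axis=1, how='all').columns)
If you only want to convert a pandas df to a dictionary, there are already many built-in methods, as given below: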
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.5, 0.75]}, index=['row1', 'row2'])
df
col1 col2
row1 1 0.50
row2 2 0.75
df.to_dict()
{'col1': {'row1': 1, 'row2': 2}, 'col2': {'row1': 0.5, 'row2': 0.75}}
# You can specify the return orientation.
df.to_dict('series')
{'col1': row1 1
row2 2
Name: col1, dtype: int64,
'col2': row1 0.50
row2 0.75
Name: col2, dtype: float64}
df.to_dict('split')
{'index': ['row1', 'row2'], 'columns': ['col1', 'col2'],
'data': [[1, 0.5], [2, 0.75]]}
df.to_dict('records')
[{'col1': 1, 'col2': 0.5}, {'col1': 2, 'col2': 0.75}]
df.to_dict('index')
{'row1': {'col1': 1, 'col2': 0.5}, 'row2': {'col1': 2, 'col2': 0.75}}
df.to_dict('tight')
{'index': ['row1', 'row2'], 'columns': ['col1', 'col2'],
'data': [[1, 0.5], [2, 0.75]], 'index_names': [None], 'column_names': [None]}
# You can also specify the mapping type.
from collections import OrderedDict, defaultdict
df.to_dict(into=OrderedDict)
OrderedDict([('col1', OrderedDict([('row1', 1), ('row2', 2)])),
('col2', OrderedDict([('row1', 0.5), ('row2', 0.75)]))])
Taken from the pandas DataFrame.to_dict() documentation.
Given the following df:
data = {'identifier': {0: 'a', 1: 'a', 3: 'b', 4: 'b', 5: 'c'},
        'gt_50': {0: 1, 1: 1, 3: 0, 4: 0, 5: 0},
        'gt_10': {0: 1, 1: 1, 3: 1, 4: 1, 5: 1}}
df = pd.DataFrame(data)
I want to find the nunique of the "identifier" column for each column that starts with "gt_", counting only the rows where that column's value is one.
Expected output:
- gt_50 1
- gt_10 3
I could write a for loop that filters the frame on one gt_ column per iteration and then counts the uniques, but that doesn't feel very clean.
Is there a way to do this cleanly?
Use DataFrame.melt to unpivot the columns selected with DataFrame.filter (those starting with gt_), then keep the rows equal to 1 with DataFrame.query, and finally count unique values per column with DataFrameGroupBy.nunique:
out = (df.melt('identifier', value_vars=df.filter(regex='^gt_').columns)
         .query('value == 1')
         .groupby('variable')['identifier']
         .nunique())
print(out)
variable
gt_10 3
gt_50 1
Name: identifier, dtype: int64
Or:
s = df.set_index('identifier').filter(regex='^gt_').stack()
out = s[s.eq(1)].reset_index().groupby('level_1')['identifier'].nunique()
print(out)
level_1
gt_10 3
gt_50 1
Name: identifier, dtype: int64
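If you prefer something closer to the loop you described, a dict comprehension over the filtered columns gives the same counts (a sketch that returns a plain dict instead of a Series):
out = {c: df.loc[df[c].eq(1), 'identifier'].nunique()
       for c in df.filter(regex='^gt_').columns}
# => {'gt_50': 1, 'gt_10': 3}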
I have a simple DataFrame:
Name Format
0 cntry int
1 dweight str
2 pspwght str
3 pweight str
4 nwspol str
I want a dictionary like this:
{
    "cntry": "int",
    "dweight": "str",
    "pspwght": "str",
    "pweight": "str",
    "nwspol": "str"
}
Where dict["cntry"] would return int or dict["dweight"] would return str.
How could I do this?
How about this:
import pandas as pd
df = pd.DataFrame({'col_1': ['A', 'B', 'C', 'D'], 'col_2': [1, 1, 2, 3], 'col_3': ['Bla', 'Foo', 'Sup', 'Asdf']})
res_dict = dict(zip(df['col_1'], df['col_3']))
Contents of res_dict:
{'A': 'Bla', 'B': 'Foo', 'C': 'Sup', 'D': 'Asdf'}
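Applied to the question's frame with its Name and Format columns, the same pattern gives exactly the requested mapping:
res_dict = dict(zip(df['Name'], df['Format']))
# {'cntry': 'int', 'dweight': 'str', 'pspwght': 'str', 'pweight': 'str', 'nwspol': 'str'}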
You're looking for DataFrame.to_dict()
From the documentation:
>>> df = pd.DataFrame({'col1': [1, 2],
... 'col2': [0.5, 0.75]},
... index=['row1', 'row2'])
>>> df
col1 col2
row1 1 0.50
row2 2 0.75
>>> df.to_dict()
{'col1': {'row1': 1, 'row2': 2}, 'col2': {'row1': 0.5, 'row2': 0.75}}
You can always invert one of the inner dictionaries if it's not mapped the way you'd like:
inv_dict = {v: k for k, v in original_dict['Name'].items()}
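For example, a sketch of how the inversion can produce the Name-to-Format dictionary the question asks for (name_to_format is a hypothetical name, assuming original_dict = df.to_dict() on the question's frame):
original_dict = df.to_dict()
inv_dict = {v: k for k, v in original_dict['Name'].items()}
name_to_format = {name: original_dict['Format'][idx] for name, idx in inv_dict.items()}
# {'cntry': 'int', 'dweight': 'str', 'pspwght': 'str', 'pweight': 'str', 'nwspol': 'str'}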
I think what you want is:
df.set_index('Name').to_dict()['Format']
Since you want to use the values in the Name column as the keys to your dict.
Note that you might want to do:
df.set_index('Name').astype(str).to_dict()['Format']
if you want the values of the dictionary to be strings.
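Either way, for the question's example frame the result is exactly the requested mapping:
{'cntry': 'int', 'dweight': 'str', 'pspwght': 'str', 'pweight': 'str', 'nwspol': 'str'}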
How would you use a dictionary to replace values throughout all the columns in a dataframe?
Below is an example where I attempted to pass the dictionary through the replace function. I have a script that produces variously sized dataframes based on a company's manager/employee structure, so the number of columns varies per dataframe.
import pandas as pd
import numpy as np
df = pd.DataFrame({'col2': {0: 'a', 1: 2, 2: np.nan}, 'col1': {0: 'w', 1: 1, 2: 2}, 'col3': {0: 'w', 1: 1, 2: 2}, 'col4': {0: 'w', 1: 1, 2: 2}})
di = {1: "A", 2: "B"}
print(df)
df = df.replace({di})
print(df)
There is a similar question linked below in which the solution specifies column names, but given that I'm looking at a whole dataframe that would vary in column names/size, I'd like to apply the replace function to the whole dataframe.
Remap values in pandas column with a dict
Thanks!
Don't put {} around your dictionary: that tries to build a set containing it, which throws an error because a dictionary is unhashable and can't be an element of a set. Instead, pass the dictionary in directly:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col2': {0: 'a', 1: 2, 2: np.nan}, 'col1': {0: 'w', 1: 1, 2: 2}, 'col3': {0: 'w', 1: 1, 2: 2}, 'col4': {0: 'w', 1: 1, 2: 2}})
di = {1: "A", 2: "B"}
print(df)
df = df.replace(di)
print(df)
Output:
col1 col2 col3 col4
0 w a w w
1 1 2 1 1
2 2 NaN 2 2
col1 col2 col3 col4
0 w a w w
1 A B A A
2 B NaN B B
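If you instead need to restrict the replacement to particular columns, replace also accepts a nested column-to-mapping dict, for example:
df = df.replace({'col1': di, 'col3': di})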
import pandas as pd
dict = {
    '1': 'Alb',
    '2': 'Bnk',
    '3': 'Cd'
}
df = pd.DataFrame(
    {
        'col1': {0: 20, 1: 2, 2: 10, 3: 2, 4: 44},
        'col2': {0: 'a', 1: 'b', 2: 'c', 3: 'b', 4: 20}
    }
)
I want to replace the col1 value 2 with 'Bnk' where the col2 value == 'b'.
How can this be done?
Thanks
There are several ways to do this, but for clarity you can use apply:
import pandas as pd

mapping = {
    1: 'Alb',
    2: 'Bnk',
    3: 'Cd'
}
df = pd.DataFrame(
    {
        'col1': {0: 20, 1: 2, 2: 10, 3: 2, 4: 44},
        'col2': {0: 'a', 1: 'b', 2: 'c', 3: 'b', 4: 20}
    }
)

def change(data, col2val, mapping):
    # Remap col1 through the mapping only on rows where col2 matches
    if data['col2'] == col2val:
        data['col1'] = mapping[data['col1']]
    return data

new_df = df.apply(change, axis=1, col2val='b', mapping=mapping)
print(new_df)
I also changed the mapping to use integer keys for simplicity (matching the integer values in col1), and renamed it from dict so it no longer shadows the Python built-in.
Output:
col1 col2
0 20 a
1 Bnk b
2 10 c
3 Bnk b
4 44 20
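A vectorized alternative to apply is to select the matching rows with a boolean mask and remap col1 with Series.map; a minimal sketch that is equivalent for this data:
mask = df['col2'].eq('b')
df.loc[mask, 'col1'] = df.loc[mask, 'col1'].map(mapping)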