Hello, I have a dataframe such as:
COL1 COL2 COL3
G1 1 [[(OK2_+__HELLO,OJ_+__BY),(LO_-__HOLLA,KUOJ_+__BY)]]
G1 2 [[(JU3_+__BO,UJ3_-__GET)]]
How can I apply re.sub(r'.*__', '') within the COL3 sublists
and get a new column without everything before '__':
COL1 COL2 COL3 COL4
G1 1 [[(OK2_+__HELLO,OJ_+__BY),(LO_-__HOLLA,KUOJ_+__BY)]] [[(HELLO,BY),(HOLLA,BY)]]
G1 2 [[(JU3_+__BO,UJ3_-__GET)]] [[(BO,GET)]]
Here is the data:
data= {'COL1': {0: 'G1', 1: 'G1'}, 'COL2': {0: 1, 1: 2}, 'COL3 ': {0: "[[(OK2_+__HELLO,OJ_+__BY),(LO_-__HOLLA,KUOJ_+__BY)]]", 1: "[[(JU3_+__BO,UJ3_-__GET)]]"}}
df = pd.DataFrame.from_dict(data)
Updated data solution
data= {'COL1': {0: 'G1', 1: 'G1'}, 'COL2': {0: 1, 1: 2}, 'COL3 ': {0: "[[(OK2_+__HELLO,OJ_+__BY),(LO_-__HOLLA,KUOJ_+__BY)]]", 1: "[[(JU3_+__BO,UJ3_-__GET)]]"}}
df = pd.DataFrame.from_dict(data)
df['COL4'] = df['COL3 '].str.replace(r"([,(])[^(),]*__", r"\1", regex=True)
df['COL4']
# => 0 [[(HELLO,BY),(HOLLA,BY)]]
# 1 [[(BO,GET)]]
# Name: COL4, dtype: object
The pattern ([,(])[^(),]*__ matches a ( or , delimiter (captured into group 1), then any run of characters other than (, ) and , up to the last __ in that segment; the replacement \1 keeps just the delimiter. Note that recent pandas versions require regex=True for Series.str.replace to treat the pattern as a regex.
Old data solution
You can use ast.literal_eval to turn the strings in the COL3 column into lists of lists and iterate over them while modifying the tuple items:
import ast
import re
import pandas as pd
data= {'COL1': {0: 'G1', 1: 'G1'}, 'COL2': {0: 1, 1: 2}, 'COL3 ': {0: "[[('OK2_+__HELLO','OJ_+__BY'),('LO_-__HOLLA','KUOJ_+__BY')]]", 1: "[[('JU3_+__BO','UJ3_-__GET')]]"}}
df = pd.DataFrame.from_dict(data)
def repl(m):
    result = []
    for l in ast.literal_eval(m):
        ll = []
        for x, y in l:
            # Strip everything up to and including the last '__' in each item
            ll.append((re.sub(r'.*__', '', x), re.sub(r'.*__', '', y)))
        result.append(ll)
    return str(result)
df['COL4'] = df['COL3 '].apply(repl)
df['COL4']
# => 0 [[('HELLO', 'BY'), ('HOLLA', 'BY')]]
# 1 [[('BO', 'GET')]]
You do not need str(result) if you are OK with keeping the result as a list of lists.
I load a CSV document with a varying number of columns, so I get this error:
Expected 12 fields in line 29, saw 13
To avoid this error, I use the hack names=range(24):
df = pd.read_csv(filename, header=None, quoting=csv.QUOTE_NONE, dtype='object', sep=data_file_delimiter, engine='python', encoding='utf-8', names=range(24))
The problem is that I need to know the real number of columns, because I group the data further into a dict:
data = {}
with open(filename) as f:
    for line in f:
        row = line.strip().split(' ')
        if len(row) not in data:
            data[len(row)] = []
        data[len(row)].append(row)
You can get the number of columns with len(df.columns), although with the names=range(24) hack that count includes the padding columns.
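A minimal sketch for counting only the columns that actually hold data, assuming the padding columns created by names=range(24) come out entirely NaN:
real_cols = len(df.dropna(axis=1, how='all').columns)
If you only want to convert a pandas df to a dictionary, there are already many built-in methods, as given below: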
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.5, 0.75]}, index=['row1', 'row2'])
df
col1 col2
row1 1 0.50
row2 2 0.75
df.to_dict()
{'col1': {'row1': 1, 'row2': 2}, 'col2': {'row1': 0.5, 'row2': 0.75}}
# You can specify the return orientation.
df.to_dict('series')
{'col1': row1 1
row2 2
Name: col1, dtype: int64,
'col2': row1 0.50
row2 0.75
Name: col2, dtype: float64}
df.to_dict('split')
{'index': ['row1', 'row2'], 'columns': ['col1', 'col2'],
'data': [[1, 0.5], [2, 0.75]]}
df.to_dict('records')
[{'col1': 1, 'col2': 0.5}, {'col1': 2, 'col2': 0.75}]
df.to_dict('index')
{'row1': {'col1': 1, 'col2': 0.5}, 'row2': {'col1': 2, 'col2': 0.75}}
df.to_dict('tight')
{'index': ['row1', 'row2'], 'columns': ['col1', 'col2'],
'data': [[1, 0.5], [2, 0.75]], 'index_names': [None], 'column_names': [None]}
# You can also specify the mapping type.
from collections import OrderedDict, defaultdict
df.to_dict(into=OrderedDict)
OrderedDict([('col1', OrderedDict([('row1', 1), ('row2', 2)])),
('col2', OrderedDict([('row1', 0.5), ('row2', 0.75)]))])
Taken from the pandas DataFrame.to_dict() documentation.
Given the following df:
data = {'identifier': {0: 'a', 1: 'a', 3: 'b', 4: 'b', 5: 'c'},
        'gt_50': {0: 1, 1: 1, 3: 0, 4: 0, 5: 0},
        'gt_10': {0: 1, 1: 1, 3: 1, 4: 1, 5: 1}}
df = pd.DataFrame(data)
I want to find the nunique of the "identifier" column for each column that starts with "gt_", counting only the rows where that column's value is one.
Expected output:
- gt_50 1
- gt_10 3
I could write a for loop that filters the frame on one gt_ column per iteration and then counts the uniques, but that doesn't feel very clean.
Is there a way to do this cleanly?
Use DataFrame.melt to unpivot the columns selected with DataFrame.filter (those starting with gt_), then keep the rows equal to 1 with DataFrame.query, and finally count unique values per column with DataFrameGroupBy.nunique:
out = (df.melt('identifier', value_vars=df.filter(regex='^gt_').columns)
         .query('value == 1')
         .groupby('variable')['identifier']
         .nunique())
print(out)
variable
gt_10 3
gt_50 1
Name: identifier, dtype: int64
Or:
s = df.set_index('identifier').filter(regex='^gt_').stack()
out = s[s.eq(1)].reset_index().groupby('level_1')['identifier'].nunique()
print(out)
level_1
gt_10 3
gt_50 1
Name: identifier, dtype: int64
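If you prefer something closer to the loop you described, a dict comprehension over the filtered columns gives the same counts (a sketch that returns a plain dict instead of a Series):
out = {c: df.loc[df[c].eq(1), 'identifier'].nunique()
       for c in df.filter(regex='^gt_').columns}
# => {'gt_50': 1, 'gt_10': 3}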
I have a simple DataFrame:
Name Format
0 cntry int
1 dweight str
2 pspwght str
3 pweight str
4 nwspol str
I want a dictionary like this:
{
    "cntry": "int",
    "dweight": "str",
    "pspwght": "str",
    "pweight": "str",
    "nwspol": "str"
}
Where dict["cntry"] would return int or dict["dweight"] would return str.
How could I do this?
How about this:
import pandas as pd
df = pd.DataFrame({'col_1': ['A', 'B', 'C', 'D'], 'col_2': [1, 1, 2, 3], 'col_3': ['Bla', 'Foo', 'Sup', 'Asdf']})
res_dict = dict(zip(df['col_1'], df['col_3']))
Contents of res_dict:
{'A': 'Bla', 'B': 'Foo', 'C': 'Sup', 'D': 'Asdf'}
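Applied to the question's frame with its Name and Format columns, the same pattern gives exactly the requested mapping:
res_dict = dict(zip(df['Name'], df['Format']))
# {'cntry': 'int', 'dweight': 'str', 'pspwght': 'str', 'pweight': 'str', 'nwspol': 'str'}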
You're looking for DataFrame.to_dict()
From the documentation:
>>> df = pd.DataFrame({'col1': [1, 2],
... 'col2': [0.5, 0.75]},
... index=['row1', 'row2'])
>>> df
col1 col2
row1 1 0.50
row2 2 0.75
>>> df.to_dict()
{'col1': {'row1': 1, 'row2': 2}, 'col2': {'row1': 0.5, 'row2': 0.75}}
You can always invert one of the inner dictionaries if it's not mapped the way you'd like:
inv_dict = {v: k for k, v in original_dict['Name'].items()}
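For example, a sketch of how the inversion can produce the Name-to-Format dictionary the question asks for (name_to_format is a hypothetical name, assuming original_dict = df.to_dict() on the question's frame):
original_dict = df.to_dict()
inv_dict = {v: k for k, v in original_dict['Name'].items()}
name_to_format = {name: original_dict['Format'][idx] for name, idx in inv_dict.items()}
# {'cntry': 'int', 'dweight': 'str', 'pspwght': 'str', 'pweight': 'str', 'nwspol': 'str'}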
I think what you want is:
df.set_index('Name').to_dict()['Format']
Since you want to use the values in the Name column as the keys to your dict.
Note that you might want to do:
df.set_index('Name').astype(str).to_dict()['Format']
if you want the values of the dictionary to be strings.
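Either way, for the question's example frame the result is exactly the requested mapping:
{'cntry': 'int', 'dweight': 'str', 'pspwght': 'str', 'pweight': 'str', 'nwspol': 'str'}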
How would you use a dictionary to replace values throughout all the columns in a dataframe?
Below is an example where I attempted to pass the dictionary through the replace function. I have a script that produces variously sized dataframes based on a company's manager/employee structure, so the number of columns varies per dataframe.
import pandas as pd
import numpy as np
df = pd.DataFrame({'col2': {0: 'a', 1: 2, 2: np.nan}, 'col1': {0: 'w', 1: 1, 2: 2}, 'col3': {0: 'w', 1: 1, 2: 2}, 'col4': {0: 'w', 1: 1, 2: 2}})
di = {1: "A", 2: "B"}
print(df)
df = df.replace({di})
print(df)
There is a similar question linked below in which the solution specifies column names, but given that I'm looking at a whole dataframe that would vary in column names/size, I'd like to apply the replace function to the whole dataframe.
Remap values in pandas column with a dict
Thanks!
Don't put {} around your dictionary: that tries to build a set containing it, which throws an error because a dictionary is unhashable and can't be an element of a set. Instead, pass the dictionary in directly:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col2': {0: 'a', 1: 2, 2: np.nan}, 'col1': {0: 'w', 1: 1, 2: 2}, 'col3': {0: 'w', 1: 1, 2: 2}, 'col4': {0: 'w', 1: 1, 2: 2}})
di = {1: "A", 2: "B"}
print(df)
df = df.replace(di)
print(df)
Output:
col1 col2 col3 col4
0 w a w w
1 1 2 1 1
2 2 NaN 2 2
col1 col2 col3 col4
0 w a w w
1 A B A A
2 B NaN B B
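If you instead need to restrict the replacement to particular columns, replace also accepts a nested column-to-mapping dict, for example:
df = df.replace({'col1': di, 'col3': di})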
import pandas as pd
dict = {
    '1': 'Alb',
    '2': 'Bnk',
    '3': 'Cd'
}
df = pd.DataFrame(
    {
        'col1': {0: 20, 1: 2, 2: 10, 3: 2, 4: 44},
        'col2': {0: 'a', 1: 'b', 2: 'c', 3: 'b', 4: 20}
    }
)
I want to replace the col1 value 2 with 'Bnk' where the col2 value == 'b'.
How can this be done?
Thanks
There are several ways to do this, but for clarity you can use apply:
import pandas as pd

mapping = {
    1: 'Alb',
    2: 'Bnk',
    3: 'Cd'
}
df = pd.DataFrame(
    {
        'col1': {0: 20, 1: 2, 2: 10, 3: 2, 4: 44},
        'col2': {0: 'a', 1: 'b', 2: 'c', 3: 'b', 4: 20}
    }
)

def change(data, col2val, mapping):
    # Remap col1 through the mapping only on rows where col2 matches
    if data['col2'] == col2val:
        data['col1'] = mapping[data['col1']]
    return data

new_df = df.apply(change, axis=1, col2val='b', mapping=mapping)
print(new_df)
I also changed the mapping to use integer keys for simplicity (matching the integer values in col1), and renamed it from dict so it no longer shadows the Python built-in.
Output:
col1 col2
0 20 a
1 Bnk b
2 10 c
3 Bnk b
4 44 20
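A vectorized alternative to apply is to select the matching rows with a boolean mask and remap col1 with Series.map; a minimal sketch that is equivalent for this data:
mask = df['col2'].eq('b')
df.loc[mask, 'col1'] = df.loc[mask, 'col1'].map(mapping)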