dialog_act is my label column. I need to assign integer values to the labels, like (inform_pricerange=1, inform_area=2, request_food=3, inform_food=4, ...).
The goal is for the column to look like this:
1,2
1,2,3
4
5,2,4
6
CSV (6 rows):
"transcript_id who transcript dialog_act
0 USR I need to find an expensive restauant that's in the south section of the city. inform_pricerange; inform_area;
1 SYS There are several restaurants in the south part of town that serve expensive food. Do you have a cuisine preference? inform_pricerange; inform_area; request_food;
2 USR No I don't care about the type of cuisine. inform_food;
3 SYS Chiquito Restaurant Bar is a Mexican restaurant located in the south part of town. inform_name; inform_area; inform_food;
4 USR What is their address? request_address;
5 SYS There address is 2G Cambridge Leisure Park Cherry Hinton Road Cherry Hinton, it there anything else I can help you with? inform_address;"
How can I do that?
Thanks in advance.
Update
I want to define the values myself, not have them assigned in ascending order:
vals_to_replace = {'inform_pricerange': 1, 'inform_area': 2, 'request_food': 3,
'inform_food': 4, 'inform_name': 5, 'request_address': 6,
'inform_address': 7}
# Split each row into individual labels, map every label to its integer
# code, then rejoin the codes per original row index.
df['dialog_act'] = df['dialog_act'].str.strip(';').str.split('; ').explode() \
                                   .map(vals_to_replace).astype(str) \
                                   .groupby(level=0).apply(', '.join)
print(df)
# Output
dialog_act
0 1, 2
1 1, 2, 3
2 4
3 5, 2, 4
4 6
5 7
Old answer
Explode your column into scalar values, then use pd.factorize:
# Step 1: explode
df1 = df['dialog_act'].str.strip(';').str.split('; ').explode().to_frame()
# Step 2: factorize
df['dialog_act'] = df1.assign(dialog_act=pd.factorize(df1['dialog_act'])[0] + 1) \
.astype(str).groupby(level=0)['dialog_act'].apply(', '.join)
Output:
>>> df
dialog_act
0 1, 2
1 1, 2, 3
2 4
3 5, 2, 4
4 6
5 7
>>> df1
dialog_act
0 inform_pricerange
0 inform_area
1 inform_pricerange
1 inform_area
1 request_food
2 inform_food
3 inform_name
3 inform_area
3 inform_food
4 request_address
5 inform_address
I think you can use the Pandas DataFrame.replace() method. First, convert your table to a Pandas DataFrame. Then:
vals_to_replace = {'inform_pricerange':1, 'inform_area':2, 'request_food':3, 'inform_food': 4}
your_df = your_df.replace({'your_label':vals_to_replace})
There is a similar question about replacing multiple values in one column with pandas.
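For reference, here is a minimal runnable sketch of the replace() approach; it assumes each row holds a single label (for multi-label cells like the ones in this question, the explode-based answers above are still needed):
import pandas as pd

df = pd.DataFrame({'dialog_act': ['inform_pricerange', 'inform_area',
                                  'request_food', 'inform_food']})
vals_to_replace = {'inform_pricerange': 1, 'inform_area': 2,
                   'request_food': 3, 'inform_food': 4}

# The nested dict restricts the replacement to the 'dialog_act' column.
df = df.replace({'dialog_act': vals_to_replace})
print(df)
#    dialog_act
# 0           1
# 1           2
# 2           3
# 3           4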
Related
I am developing a clinical bioinformatics application, and the input this application gets is a data frame that looks like this:
import pandas as pd

df = pd.DataFrame({'store': ['Blank_A09', 'Control_4p', '13_MEG3', '04_GRB10', '02_PLAGL1', 'Control_21q', '01_PLAGL1', '11_KCNQ10T1', '16_SNRPN', '09_H19', 'Control_6p', '06_MEST'],
                   'quarter': [1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2],
                   'employee': ['Blank_A09', 'Control_4p', '13_MEG3', '04_GRB10', '02_PLAGL1', 'Control_21q', '01_PLAGL1', '11_KCNQ10T1', '16_SNRPN', '09_H19', 'Control_6p', '06_MEST'],
                   'foo': [1, 1, 2, 2, 1, 1, 9, 2, 2, 4, 2, 2],
                   'columnX': ['Blank_A09', 'Control_4p', '13_MEG3', '04_GRB10', '02_PLAGL1', 'Control_21q', '01_PLAGL1', '11_KCNQ10T1', '16_SNRPN', '09_H19', 'Control_6p', '06_MEST']})
print(df)
store quarter employee foo columnX
0 Blank_A09 1 Blank_A09 1 Blank_A09
1 Control_4p 1 Control_4p 1 Control_4p
2 13_MEG3 2 13_MEG3 2 13_MEG3
3 04_GRB10 2 04_GRB10 2 04_GRB10
4 02_PLAGL1 1 02_PLAGL1 1 02_PLAGL1
5 Control_21q 1 Control_21q 1 Control_21q
6 01_PLAGL1 2 01_PLAGL1 9 01_PLAGL1
7 11_KCNQ10T1 2 11_KCNQ10T1 2 11_KCNQ10T1
8 16_SNRPN 2 16_SNRPN 2 16_SNRPN
9 09_H19 2 09_H19 4 09_H19
10 Control_6p 2 Control_6p 2 Control_6p
11 06_MEST 2 06_MEST 2 06_MEST
This is a minimal reproducible example, but the real input has an unknown number of columns, in which the first, the third, the fifth, the seventh, etc. "should" be exactly the same.
And this is what I want to check. I want to ensure that these columns have their values in the same order.
I know how to check whether two columns are exactly the same, but I don't know how to extend this check across the whole data frame.
EDIT:
The column names vary; the ones in my example are just placeholders.
Refer to How to check if 3 columns are same and add a new column with the value if the values are same?
Here is code that checks whether multiple columns are the same and returns the indices of the rows where they match:
import numpy as np

arr = df[['store', 'employee', 'columnX']].values  # the matching columns in the example; add as many as you wish
np.where((arr == arr[:, [0]]).all(axis=1))  # indices of rows where all selected columns match
You will need to tweak it for your use case.
Edit
# The 1st, 3rd, 5th, ... columns sit at 0-based positions 0, 2, 4, ...
columns_to_check = [x for x in range(0, len(df.columns), 2)]
arr = df.iloc[:, columns_to_check].values
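If you just need a single yes/no answer for the whole frame, here is a small sketch building on the arr computed above:
# True only when, in every row, every selected column equals the first one.
all_identical = bool((arr == arr[:, [0]]).all())
print(all_identical)  # True for the example data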
If you want an efficient method you can hash each column using pandas.util.hash_pandas_object, making the operation O(n):
pd.util.hash_pandas_object(df.T, index=False)
We clearly see that store/employee/columnX have the same hash:
store 18266754969677227875
quarter 11367719614658692759
employee 18266754969677227875
foo 92544834319824418
columnX 18266754969677227875
dtype: uint64
You can further use groupby to identify the identical values:
df.columns.groupby(pd.util.hash_pandas_object(df.T, index=False))
output:
{ 92544834319824418: ['foo'],
11367719614658692759: ['quarter'],
18266754969677227875: ['store', 'employee', 'columnX']}
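If you only need a single boolean for the odd-positioned columns, you can compare their hashes directly; a sketch building on the same df:
hashes = pd.util.hash_pandas_object(df.T, index=False)

# Columns at 0-based positions 0, 2, 4, ... should all hash identically.
all_identical = hashes.iloc[::2].nunique() == 1
print(all_identical)  # True for the example data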
I am trying to link patients' IDs with patient images. One patient could have more than one image attached to them. I have added a new column, image_ID, to my dataframe that already has patient_ID.
The code I've written below only adds the last image_ID of a patient. How can I duplicate and add rows at known indices (the index that corresponds to the patient ID), so that all other information for the same patient is repeated for each of its images?
Since my shuffled_balanced data frame initially doesn't have the image_ID column, I have created it and set it to 'None'. Please note that if row['patient_ID'] in sample works because patient_ID is part of image_ID.
I am also open to other ways of approaching this.
import os

shuffled_balanced['image_ID'] = 'None'
for dirpath, dirname, filename in os.walk('/SeaExpNFS/images'):
    if dirpath.endswith('20.0'):
        splits = dirpath.split('/')
        sample = splits[-2][:-6]
        for index, row in shuffled_balanced.iterrows():
            if row['patient_ID'] in sample:
                # Overwrites on every match, so only the last image survives.
                shuffled_balanced.at[index, 'image_ID'] = sample
I think you're looking for merge. Say you have two dataframes that look something like this:
import pandas as pd

patient_df = pd.DataFrame({"patient_id": [1, 2, 3, 4, 5],
                           "patient_name": ["Penny",
                                            "Leonard",
                                            "Amy",
                                            "Sheldon",
                                            "Rajesh"]})
img_df = pd.DataFrame({"patient_id": [2, 3, 4, 4, 1],
                       "img_file": ["leonard.jpg",
                                    "amy.jpg",
                                    "sheldon.jpg",
                                    "sheldon2.jpg",
                                    "penny.jpg"]})
>>> patient_df
patient_id patient_name
0 1 Penny
1 2 Leonard
2 3 Amy
3 4 Sheldon
4 5 Rajesh
>>> img_df
patient_id img_file
0 2 leonard.jpg
1 3 amy.jpg
2 4 sheldon.jpg
3 4 sheldon2.jpg
4 1 penny.jpg
You can merge them like so:
>>> patient_df.merge(img_df, on="patient_id", how="outer")
patient_id patient_name img_file
0 1 Penny penny.jpg
1 2 Leonard leonard.jpg
2 3 Amy amy.jpg
3 4 Sheldon sheldon.jpg
4 4 Sheldon sheldon2.jpg
5 5 Rajesh NaN
I have a problem similar to this question, but with the opposite challenge. Instead of a removal list, I have a keep list: a list of strings I'd like to keep. My question is how to use the keep list to filter out the unwanted strings and retain the wanted ones in the column.
import pandas as pd
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "name": [
            "Mitty, Kitty",
            "Kandy, Puppy",
            "Judy, Micky, Loudy",
            "Cindy, Judy",
            "Kitty, Wicky",
        ],
    }
)
ID name
0 1 Mitty, Kitty
1 2 Kandy, Puppy
2 3 Judy, Micky, Loudy
3 4 Cindy, Judy
4 5 Kitty, Wicky
To_keep_lst = ["Kitty", "Kandy", "Micky", "Loudy", "Wicky"]
Use Series.str.findall with Series.str.join:
To_keep_lst = ["Kitty", "Kandy", "Micky", "Loudy", "Wicky"]
df['name'] = df['name'].str.findall('|'.join(To_keep_lst)).str.join(', ')
print(df)
ID name
0 1 Kitty
1 2 Kandy
2 3 Micky, Loudy
3 4
4 5 Kitty, Wicky
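One caveat with the findall approach: joining raw names into a pattern can also match substrings of longer names. Wrapping the pattern in word boundaries avoids that; a small sketch, reusing To_keep_lst and the original df:
import re

pattern = r'\b(?:' + '|'.join(map(re.escape, To_keep_lst)) + r')\b'
df['name'] = df['name'].str.findall(pattern).str.join(', ')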
Use a comprehension to keep only the names in the keep list:
keep_names = lambda x: ', '.join([n for n in x.split(', ') if n in To_keep_lst])
df['name'] = df['name'].apply(keep_names)
print(df)
# Output:
ID name
0 1 Kitty
1 2 Kandy
2 3 Micky, Loudy
3 4
4 5 Kitty, Wicky
Note: @jezrael's answer is much faster than mine.
I am trying to clean data in a dataframe in Python, where I need to concatenate rows in which the data in two columns (name, phone_no) is the same, i.e.
What I have: [image in original post]
Expected result: [image in original post]
P.S. It would be much better if you could provide a sample of the dataset instead of the images. Next time you can use df.to_clipboard() and paste the result as a code snippet in the question for reproducibility.
Now to the answer. You can use pandas groupby and then a custom aggregation.
First, I created a dataset for the example:
df = pd.DataFrame({"A": ["a", "b", "a", "b", "c"],
                   "B": list(map(str, range(5))),
                   "C": list(map(str, range(5, 10)))})
It looks as follows:
A B C
0 a 0 5
1 b 1 6
2 a 2 7
3 b 3 8
4 c 4 9
Then you can concatenate rows with matching keys (in your case the keys are name and phone_no):
gdf = df.groupby("A", as_index=False).agg({
    "B": ",".join,
    "C": ",".join
})
print(gdf)
And the results are as follows:
A B C
0 a 0,2 5,7
1 b 1,3 6,8
2 c 4 9
This question already has answers here: Pandas DENSE RANK (4 answers)
I am trying to generate a unique index column in my dataset.
I have a column in my dataset as follows:
665678, 665678, 665678, 665682, 665682, 665682, 665690, 665690
And I would like to generate a separately indexed column looking like this:
1, 1, 1, 2, 2, 2, 3, 3
I came across the post How to index columns uniquely? that describes exactly what I am trying to do. But since the solutions there are described for R, I wanted to know how I can implement the same in Python using Pandas.
Thanks
Use:
df.groupby('col').ngroup()+1
Output
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 3
dtype: int64
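For completeness, here is a self-contained sketch using the data from the question (the column name 'col' is an assumption):
import pandas as pd

df = pd.DataFrame({'col': [665678, 665678, 665678, 665682,
                           665682, 665682, 665690, 665690]})

# ngroup() numbers each group 0, 1, 2, ... (sorted by group key by
# default); adding 1 makes the numbering start from 1.
df['index_col'] = df.groupby('col').ngroup() + 1
print(df['index_col'].tolist())  # [1, 1, 1, 2, 2, 2, 3, 3]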