I am trying to clean data in a dataframe in Python, where I want to concatenate rows whose values in two columns (name, phone_no) match, i.e.:
What I have: [image in the original post]
Expected result: [image in the original post]
P.S. It would be much better if you could provide a sample of the dataset instead of images. Next time you can use df.to_clipboard() and paste the result as a code snippet in the question for reproducibility.
Now to the answer. You can use pandas groupby and then a custom aggregation.
First I created a dataset for the example:
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "a", "b", "c"], "B": list(map(str, range(5))), "C": list(map(str, range(5, 10)))})
It looks as follows:
A B C
0 a 0 5
1 b 1 6
2 a 2 7
3 b 3 8
4 c 4 9
Then you can concatenate the rows that share the same keys (in your case the keys are name and phone_no):
gdf = df.groupby("A", as_index=False).agg({
    "B": ",".join,
    "C": ",".join
})
print(gdf)
And the results are as follows:
A B C
0 a 0,2 5,7
1 b 1,3 6,8
2 c 4 9
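Mapping this back to the original question, a minimal sketch could look like the following; the column names name, phone_no, and address are assumptions, since the actual data was only shown as images:

import pandas as pd

# Hypothetical stand-in for the data shown in the question's screenshots
df = pd.DataFrame({
    "name": ["Alice", "Alice", "Bob"],
    "phone_no": ["111", "111", "222"],
    "address": ["12 Main St", "PO Box 4", "9 Side Rd"],
})

# Group on both key columns and join the remaining text column(s)
gdf = df.groupby(["name", "phone_no"], as_index=False).agg({"address": ", ".join})
print(gdf)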
I've created my own prediction function (R = C^3 + Z^2 + 3) to predict my target variable. The problem is that I am now dealing with a prediction function rather than a fitted model, so .predict from scikit-learn won't work. But then how can I get my predictions?
def objective(C, Z):
    return C**3 + Z**2 + 3
Here is what you want in pandas:
import pandas as pd

def objective(C, Z):
    return C**3 + Z**2 + 3

data = {'C': [1, 2, 3], 'Z': [4, 5, 6]}
df = pd.DataFrame(data)

# Apply the prediction function row by row
df['R'] = df.apply(lambda x: objective(x.C, x.Z), axis=1)
print(df)
C Z R
0 1 4 20
1 2 5 36
2 3 6 66
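As a side note, since the function only uses arithmetic operators it also works directly on whole columns, so the row-wise apply can be skipped entirely (a small sketch under that assumption):

# Vectorised: C**3 + Z**2 + 3 is evaluated on the entire Series at once
df['R'] = objective(df['C'], df['Z'])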
dialog_act is my label column.
I need to assign integer values to the labels, e.g. (inform_pricerange=1, inform_area=2, request_food=3, inform_food=4, ...).
The goal is output that looks like this:
1,2
1,2,3
4
5,2,4
6
CSV sample (6 rows shown):
"transcript_id who transcript dialog_act
0 USR I need to find an expensive restauant that's in the south section of the city. inform_pricerange; inform_area;
1 SYS There are several restaurants in the south part of town that serve expensive food. Do you have a cuisine preference? inform_pricerange; inform_area; request_food;
2 USR No I don't care about the type of cuisine. inform_food;
3 SYS Chiquito Restaurant Bar is a Mexican restaurant located in the south part of town. inform_name; inform_area; inform_food;
4 USR What is their address? request_address;
5 SYS There address is 2G Cambridge Leisure Park Cherry Hinton Road Cherry Hinton, it there anything else I can help you with? inform_address;
How can I do that?
Thanks in advance.
Update
I want to define the values myself, rather than have them assigned in ascending order.
vals_to_replace = {'inform_pricerange': 1, 'inform_area': 2, 'request_food': 3,
'inform_food': 4, 'inform_name': 5, 'request_address': 6,
'inform_address': 7}
df['dialog_act'] = df['dialog_act'].str.strip(';').str.split('; ').explode() \
.map(vals_to_replace).astype(str) \
.groupby(level=0).apply(', '.join)
print(df)
# Output
dialog_act
0 1, 2
1 1, 2, 3
2 4
3 5, 2, 4
4 6
5 7
Old answer
Try to explode your column into a list of scalar values and use pd.factorize
# Step 1: explode
df1 = df['dialog_act'].str.strip(';').str.split('; ').explode().to_frame()
# Step 2: factorize
df['dialog_act'] = df1.assign(dialog_act=pd.factorize(df1['dialog_act'])[0] + 1) \
.astype(str).groupby(level=0)['dialog_act'].apply(', '.join)
Output:
>>> df
dialog_act
0 1, 2
1 1, 2, 3
2 4
3 5, 2, 4
4 6
5 7
>>> df1
dialog_act
0 inform_pricerange
0 inform_area
1 inform_pricerange
1 inform_area
1 request_food
2 inform_food
3 inform_name
3 inform_area
3 inform_food
4 request_address
5 inform_address
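As a side note (not part of the original answer), pd.factorize also returns the unique labels in order of appearance, so the label-to-code mapping it implies can be recovered:

codes, uniques = pd.factorize(df1['dialog_act'])
label_to_code = {label: i + 1 for i, label in enumerate(uniques)}
print(label_to_code)  # e.g. {'inform_pricerange': 1, 'inform_area': 2, ...}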
I think you can use the pandas DataFrame.replace() method. First, convert your table to a pandas DataFrame. Then:
vals_to_replace = {'inform_pricerange':1, 'inform_area':2, 'request_food':3, 'inform_food': 4}
your_df = your_df.replace({'your_label':vals_to_replace})
I also saw a similar question about this: pandas replace multiple values one column.
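One caveat with this approach: each dialog_act cell holds several semicolon-separated labels, so a plain whole-cell replace will not match anything. A hedged workaround (a sketch, using string values so the '; ' separators are preserved) is a substring/regex replace:

# Assumes the cells look like 'inform_pricerange; inform_area;'
vals_to_replace = {'inform_pricerange': '1', 'inform_area': '2',
                   'request_food': '3', 'inform_food': '4'}
your_df['dialog_act'] = your_df['dialog_act'].replace(vals_to_replace, regex=True)
# e.g. 'inform_pricerange; inform_area;' -> '1; 2;'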
I currently have a CSV dataset with a column that is supposed to contain dates. However, the CSV stores them as 5-digit numbers, and I need to convert these to dd/mm/yyyy dates in Python.
For example, I need to convert 22580 in the CSV to 27/10/2021. I would appreciate a solution that works on a dataframe.
Thanks in advance!
Seems to be days since Jan 1 1960, so you can just create a timedelta with that many days and add it to the base date:
>>> import pandas as pd
>>> from datetime import date, timedelta
>>> df = pd.DataFrame({"a":[1,2,3], 'b':[22580, 22587, 22590]})
>>> df
a b
0 1 22580
1 2 22587
2 3 22590
>>> df['c'] = df['b'].apply(lambda x:date(1960,1,1)+timedelta(days=x))
>>> df
a b c
0 1 22580 2021-10-27
1 2 22587 2021-11-03
2 3 22590 2021-11-06
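A vectorised alternative (a sketch, assuming the same 1960-01-01 origin) lets pandas do the offset arithmetic and then formats the result as the requested dd/mm/yyyy strings:

# Convert the day offsets to datetimes, then format as dd/mm/yyyy strings
df['c'] = pd.to_datetime(df['b'], unit='D', origin='1960-01-01').dt.strftime('%d/%m/%Y')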
Hi guys, I want to change the index of my dataframe based on the result of str.contains.
A quick example:
import pandas as pd
df= pd.DataFrame({'Product':['Banana_1','Banana_2','Orange_a','Orange_b'],
'Value':[5.10,5.00,2.10,2.00]})
df2 = df[df['Product'].str.contains('Banana')]
print(df2)
Is there a way to use the df2 filter to change the index of df?
Thanks
You can control the index based on cell values like the following, which is (I think) along the lines of what you want:
In [28]: df.index = [i if 'Banana' in df.iloc[i,0] else i+len(df) for i in range(len(df))]
In [29]: df
Out[29]:
Product Value
0 Banana_1 5.1
1 Banana_2 5.0
6 Orange_a 2.1
7 Orange_b 2.0
This is what you need:
In [1230]: index_list = df2.index.tolist()
In [1236]: index_map = {}
In [1237]: for i in index_list:
...: index_map[i] = 'myindex'
...:
In [1250]: df.rename(index=index_map, inplace=True)
In [1251]: df
Out[1251]:
Product Value
myindex Banana_1 5.1
myindex Banana_2 5.0
2 Orange_a 2.1
3 Orange_b 2.0
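As a small aside, the loop that builds index_map in the answer above can be collapsed into a dict comprehension; 'myindex' is just the placeholder label that answer uses:

# Relabel only the rows selected by the df2 filter
df.rename(index={i: 'myindex' for i in df2.index}, inplace=True)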
So I'm trying to split a pandas DataFrame into two separate data frames by a single binary variable. The groupby function seems a decent option, except that it doesn't return a data frame but rather a groupby object, which isn't nearly as useful to me, and I can't access any values from within the groupby object. I ran a simple df.groupby('Type') statement and would like to partition the data from there, i.e. output those two groups to two new data frames. Any help would be sincerely appreciated. The last question I posted was met with ridiculously childish admonitions not to post homework questions. Needless to say, this as well as the aforementioned were/are NOT homework, so please spare me such remarks. As always, thanks so much.
If you use groupby, you can iterate through the groups as follows:
g = df.groupby('class')
for k, v in g.groups.items():
    print(k)          # the group key, e.g. 'a'
    print(df.loc[v])  # the group's rows; the values in .groups are the index labels for each group
    print()
a
class data1 data2
0 a -0.173070 141.437719
2 a -0.087673 200.815709
6 a 1.220608 159.456053
8 a 0.428373 -6.491034
9 a -0.123463 -96.898025
c
class data1 data2
5 c -0.358996 162.715982
7 c -1.339496 23.043417
b
class data1 data2
1 b -1.761652 -12.405066
3 b 1.366879 22.988654
4 b 1.125314 60.489373
Note: this output came from an older pandas/Python 2 setup where the order of .groups was not guaranteed; current versions sort the group keys by default, so the groups would come out as a, b, c.
How's this?
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'class': np.random.choice(list('abc'), size=10),
'data1': np.random.randn(10),
'data2': np.random.randn(10) * 100})
df_a = df[df['class']=='a']
df_b = df[df['class']=='b']
df_c = df[df['class']=='c']
print(df, '\n')
print(df_a)
print(df_b)
print(df_c)
Gives:
class data1 data2
0 a -0.173070 141.437719
1 b -1.761652 -12.405066
2 a -0.087673 200.815709
3 b 1.366879 22.988654
4 b 1.125314 60.489373
5 c -0.358996 162.715982
6 a 1.220608 159.456053
7 c -1.339496 23.043417
8 a 0.428373 -6.491034
9 a -0.123463 -96.898025
class data1 data2
0 a -0.173070 141.437719
2 a -0.087673 200.815709
6 a 1.220608 159.456053
8 a 0.428373 -6.491034
9 a -0.123463 -96.898025
class data1 data2
1 b -1.761652 -12.405066
3 b 1.366879 22.988654
4 b 1.125314 60.489373
class data1 data2
5 c -0.358996 162.715982
7 c -1.339496 23.043417
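A further option (not from either answer): collect the groups into a dict of regular DataFrames in one pass, which covers the binary case in the question and any number of groups. Here the example's 'class' column stands in for the question's 'Type' column:

# One regular DataFrame per group key
groups = {k: v for k, v in df.groupby('class')}
df_a = groups['a']
df_b = groups['b']
df_c = groups['c']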