I want to create a combined string based on two columns, one is an integer and the other is a string. I need to combine them to create a string.
I've already tried the solution from this answer (Apply function to create string with multiple columns as argument), but it doesn't give the required output.
I have two columns: prod_no which is an integer and PROD which is a string. So something like
| prod_no | PROD  | out           |
|---------|-------|---------------|
| 1       | PRODA | #Item=1=PRODA |
| 2       | PRODB | #Item=2=PRODB |
| 3       | PRODC | #Item=3=PRODC |
To get the last column, I used the following code:
prod_list['out'] = prod_list.apply(lambda x: "#ITEM={}=={}"
.format(prod_list.prod_no.astype(str), prod_list.PROD), axis=1)
I'm trying to produce the column "out", but the result of that code is odd: the output looks like #Item=0 1 22 3.... I'm specifically trying to implement this with apply and lambda, but I'm also interested in efficient implementations since I'm trying to learn how to write optimized code. Please help :)
This works. Inside the lambda you need to reference the row x, not the whole DataFrame; your version passed the entire prod_no and PROD Series to format(), which is why you got that odd concatenated output.
import pandas as pd
df = pd.DataFrame({"prod_no": [1, 2, 3], "PROD": ["PRODA", "PRODB", "PRODC"]})
df["out"] = df.apply(lambda x: "#ITEM={}=={}".format(x["prod_no"], x["PROD"]), axis=1)
print(df)
Output:
PROD prod_no out
0 PRODA 1 #ITEM=1==PRODA
1 PRODB 2 #ITEM=2==PRODB
2 PRODC 3 #ITEM=3==PRODC
You can also try with zip:
df = df.assign(out=['#ITEM={}=={}'.format(a, b) for a, b in zip(df.prod_no, df.PROD)])
# or directly: df.assign(out='#Item=' + df.prod_no.astype(str) + '==' + df.PROD)
prod_no PROD out
0 1 PRODA #ITEM=1==PRODA
1 2 PRODB #ITEM=2==PRODB
2 3 PRODC #ITEM=3==PRODC
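Since the question also mentions wanting efficient code: for larger frames, skipping the row-wise apply is usually noticeably faster. A minimal sketch, assuming the same three-row frame as above and the single-"=" format from the desired output:
import pandas as pd
df = pd.DataFrame({"prod_no": [1, 2, 3], "PROD": ["PRODA", "PRODB", "PRODC"]})
# Option 1: vectorized string concatenation on the whole columns.
df["out"] = "#Item=" + df.prod_no.astype(str) + "=" + df.PROD
# Option 2: a list comprehension with f-strings, still one pass per row but without apply's overhead.
df["out"] = [f"#Item={n}={p}" for n, p in zip(df.prod_no, df.PROD)]
print(df)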
I have a dataframe currently which resembles this
Source | Destination | Type
A | B | Insert
A | B | Delete
B | C | Insert
What I want to achieve is something like this
Source | Destination | Type
A | B | Insert, Delete
B | C | Insert
I tried grouping by Source and Destination, but I'm a little unsure how to append to Type. Any ideas?
Check the code below:
df.groupby(['Source','Destination']).agg({'Type':'unique'})
Figured out something
df = df.groupby(['Source','Destination'])['Type'].apply(lambda x: ','.join(x)).reset_index()
Make a string separated by ", ":
df.groupby(['Source', 'Destination']).agg(', '.join).reset_index()
Make a list:
df.groupby(['Source','Destination']).agg(list).reset_index()
Joining with ", " and building a string works, but it is not very convenient if you later want to perform other operations on this column's values (such as iterating over the elements).
This creates a set of values.
from typing import Any, Dict

column_map: Dict[str, Any] = {}
column_map["Type"] = lambda x: set(x)
df.groupby(["Source", "Destination"]).agg(column_map)
Source | Destination | Type
A | B | {Insert, Delete}
B | C | {Insert}
If you instead want a list and don't want to eliminate duplicates, just replace set(x) with list(x).
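For reference, a small self-contained sketch that puts the string, list, and set variants side by side, assuming the sample frame from the question:
import pandas as pd
df = pd.DataFrame({
    "Source": ["A", "A", "B"],
    "Destination": ["B", "B", "C"],
    "Type": ["Insert", "Delete", "Insert"],
})
# Comma-separated string per group
as_string = df.groupby(["Source", "Destination"])["Type"].agg(", ".join).reset_index()
# List per group (keeps duplicates and order)
as_list = df.groupby(["Source", "Destination"])["Type"].agg(list).reset_index()
# Set per group (drops duplicates, order not guaranteed)
as_set = df.groupby(["Source", "Destination"])["Type"].agg(set).reset_index()
print(as_string, as_list, as_set, sep="\n\n")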
Hi all.
I have a function that returns two values. One is a list, the other is a double.
I want to use something like this to create two new columns in my df and use .apply to populate those columns on a row by row basis.
def f(a_list):
    # do some stuff to the list
    if(stuff):
        make_new_stuff_happen
    # return results of stuff
    return new_list, a_double

def main():
    df['new_col1'], df['new_col2'] = df.apply(lambda x: f(x['some_col']))
Thanks for any help you can provide.
A few notes:
I think by double you mean a float in Python?
Even for examples, I'd name your function and variables something more meaningful, so it's easier to diagnose.
Maybe this answer will help:
If this is the original dataframe you're working with:
col_1 | col_2 | col_3
-------------------------
1 | 3 | 3
2 | 3 | 4
3 | 1 | 1
You can just have a function like this:
def transform_into_two_columns(original_val_from_row):
    # do some stuff to the value:
    # example 1: multiply the value by 2 (this plays the role of "new_list" in your question)
    original_val_times_2 = original_val_from_row * 2
    # example 2: add 2.1 to the value (this plays the role of "a_double" in your question)
    original_val_plus_2 = original_val_from_row + 2.1
    return original_val_times_2, original_val_plus_2
Then, you can save that function's output to a list:
list_of_tuples = df['col_2'].apply(lambda x: transform_into_two_columns(x)).to_list()
Then, with that list_of_tuples, you can create 2 new columns:
df[['NEW_col_4', 'NEW_col_5']] = pd.DataFrame(list_of_tuples, index=df.index)
Your new dataframe will look like this:
col_1 | col_2 | col_3 | NEW_col_4 | NEW_col_5
---------------------------------------------------
1 | 3 | 3 | 6 | 5.1
2 | 3 | 4 | 6 | 5.1
3 | 1 | 1 | 2 | 3.1
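An alternative sketch for the same unpacking, reusing the hypothetical two-value function above: zip the tuples returned by apply and assign both columns in one step.
import pandas as pd
df = pd.DataFrame({"col_1": [1, 2, 3], "col_2": [3, 3, 1], "col_3": [3, 4, 1]})
def transform_into_two_columns(value):
    # Return two values per input: one scaled, one shifted.
    return value * 2, value + 2.1
# zip(*...) transposes the Series of tuples into two sequences, one per new column.
df["NEW_col_4"], df["NEW_col_5"] = zip(*df["col_2"].apply(transform_into_two_columns))
print(df)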
I have a dataframe with multiple columns, including analysis_date (datetime), and forecast_hour (int). I want to add a new column called total_hours, which is the sum of the hour component of analysis_date plus the corresponding forecast_hour in that row. Here's a visual example:
original dataframe:
analysis_date | forecast_hour
12-2-19-05 | 3
12-2-19-06 | 3
12-2-19-07 | 3
12-2-19-08 | 3
dataframe after calculation:
analysis_date | forecast_hour | total_hours
12-2-19-05 | 3 | 8
12-2-19-06 | 3 | 9
12-2-19-07 | 3 | 10
12-2-19-08 | 3 | 11
Here is the current logic that does what I want:
df['total_hours'] = df.apply(lambda row: row.analysis_date.hour + row.forecast_hour, axis=1)
Unfortunately, this is too slow for my application; it takes around 15 seconds for a dataframe with a few hundred thousand rows. I have tried the swifter library, but it took approximately as long (if not longer) than my current implementation.
apply is slow because it is not vectorized. This should do what you want (assuming df['analysis_date'] is a datetime64):
df['total_hours'] = df['analysis_date'].dt.hour + df['forecast_hour']
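A minimal runnable sketch of that vectorized version, assuming analysis_date is already a datetime64 column (the timestamps below are made up to mirror the example):
import pandas as pd
df = pd.DataFrame({
    "analysis_date": pd.to_datetime([
        "2019-12-02 05:00", "2019-12-02 06:00",
        "2019-12-02 07:00", "2019-12-02 08:00",
    ]),
    "forecast_hour": [3, 3, 3, 3],
})
# .dt.hour pulls the hour out of every timestamp at once; adding the int column is vectorized too.
df["total_hours"] = df["analysis_date"].dt.hour + df["forecast_hour"]
print(df)  # total_hours: 8, 9, 10, 11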
I have a column in my dataframe like this:
columnn
----------------------
[happiness#sad]
[happy ness#moderate]
[happie ness#sad]
and I want to replace "happy ness", "happiness", and "happie ness" with 'happyness'. I am currently using the methods below, but nothing changes. The strings need to match exactly:
happy ness ===> happyness
happiness ===> happyness
happie ness ===> happyness
I tried the two approaches below.
1st approach:
df['column']
df.column = df.column.replace({"happiness": "happyness", "happy ness": "happyness", "happie ness": "happynesss"})
2nd approach:
df['column'] = df['column'].str.replace("happiness", "happyness").replace("happy ness", "happyness").replace("happie ness", "happynesss")
Desired Output:
columnn
----------------------
[happyness,sad]
[happyness,moderate]
[happyness,sad]
This is one approach using replace with regex=True.
Ex:
import pandas as pd
df = pd.DataFrame({"columnn": [["happiness#sad"], ["happy ness#moderate"], ["happie ness$sad"]]})
data = {"happiness":"happyness" ,"happy ness":"happyness" ,"happie ness":"happynesss" }
df["columnn"] = df["columnn"].apply(lambda x: pd.Series(x).replace(data, regex=True).tolist())
print(df)
Output:
columnn
0 [happyness#sad]
1 [happyness#moderate]
2 [happynesss$sad]
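Another possible sketch, working directly on the list elements with one pre-compiled regex; the mapping dict below is an assumption based on the replacements described in the question:
import re
import pandas as pd
df = pd.DataFrame({"columnn": [["happiness#sad"], ["happy ness#moderate"], ["happie ness#sad"]]})
mapping = {"happy ness": "happyness", "happie ness": "happyness", "happiness": "happyness"}
# Longest keys first so the multi-word variants are matched before shorter ones.
pattern = re.compile("|".join(re.escape(k) for k in sorted(mapping, key=len, reverse=True)))
df["columnn"] = df["columnn"].apply(
    lambda lst: [pattern.sub(lambda m: mapping[m.group(0)], s) for s in lst]
)
print(df)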
Try this approach; I think it will work for you.
df['new_col'] = df['column'].replace(to_replace=['happy ness', 'happiness', 'happie ness'],
                                     value=['happyness', 'happyness', 'happyness'])
I've got a basic dictionary that gives me a count of how many times data shows up. e.g. Adam: 10, Beth: 3, ... , Zack: 1
If I do df = pd.DataFrame([dataDict]).T then the keys from the dictionary become the index of the dataframe and I only have one true column of data. I've looked around but haven't found a way to fix this, so any help would be appreciated.
Edit: More detail
The dictionary was formed from a count of another dataframe, e.g. dataDict = df1.Name.value_counts().to_dict()
This is my expected output.
  | Name | Count
--|------|------
0 | Adam | 10
1 | Beth | 3
What I'm getting at the moment is this:
     | Count
-----|------
Adam | 10
Beth | 3
Try reset_index:
dataDict = dict(Adam=10, Beth=3, Zack=1)
df = pd.Series(dataDict).rename_axis('Name').reset_index(name='Count')
df
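If you are starting from the original frame rather than the dictionary, you can also skip the intermediate dict entirely; a minimal sketch, with a made-up df1 standing in for the frame mentioned in the question:
import pandas as pd
# Hypothetical source frame with a Name column.
df1 = pd.DataFrame({"Name": ["Adam"] * 10 + ["Beth"] * 3 + ["Zack"]})
# value_counts gives a Series indexed by Name; rename_axis + reset_index turns it into two columns.
counts = df1["Name"].value_counts().rename_axis("Name").reset_index(name="Count")
print(counts)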