Groupby and join values but keep all columns - python

I have this DataFrame and want to group it and join the ID values.
ID | A_Num | I_Num
--------------------------
001 | A_001 | I_001
002 | A_002 | I_002
003 | A_003 | I_004
005 | A_002 | I_002
Desired Output
ID | A_Num | I_Num
--------------------------
001 | A_001 | I_001
002;005 | A_002 | I_002
003 | A_003 | I_004
Code:
df = df.groupby(['A_Num','I_Num'])['ID'].apply(lambda tags: ';'.join(tags))
df.to_csv('D:\joined.csv', sep=';', encoding='utf-8-sig', quoting=csv.QUOTE_ALL, index=False, header=True)
When I write the DataFrame to a CSV file, I get only the ID column.

Try reset_index():
df=df.groupby(['A_Num','I_Num'])["ID"].apply(lambda tags: ';'.join(tags.values)).reset_index()
This way the aggregation from apply() still runs, and the grouping keys end up as regular columns again instead of the index, so all of the columns are written to the CSV.
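For completeness, a rough end-to-end sketch combining the two snippets above (assuming df still holds the original string columns and 'joined.csv' stands in for the original output path):
import csv
import pandas as pd

out = (df.groupby(['A_Num', 'I_Num'])['ID']
         .apply(lambda tags: ';'.join(tags))
         .reset_index())
out = out[['ID', 'A_Num', 'I_Num']]          # restore the original column order
out.to_csv('joined.csv', sep=';', encoding='utf-8-sig',
           quoting=csv.QUOTE_ALL, index=False, header=True)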

Just another way to do it is:
result= df.groupby(['A_Num', 'I_Num']).agg({'ID': list})
result.reset_index(inplace=True)
result[['ID', 'A_Num', 'I_Num']]
The output is:
Out[37]:
ID A_Num I_Num
0 [001 ] A_001 I_001
1 [002 , 005 ] A_002 I_002
2 [003 ] A_003 I_004
ID contains lists in that case. If you'd rather have strings, just do:
result['ID']= result['ID'].map(lambda lst: ';'.join(lst))
result[['ID', 'A_Num', 'I_Num']]
Which outputs:
Out[48]:
ID A_Num I_Num
0 001 A_001 I_001
1 002;005 A_002 I_002
2 003 A_003 I_004

Group by 'A_Num' and 'I_Num' and then join the IDs within each group.
df.groupby(['A_Num','I_Num']).ID.apply(lambda x: ';'.join(x.tolist())).reset_index()
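If you'd rather skip reset_index() altogether, passing as_index=False keeps the grouping keys as columns from the start (a small sketch, same data as above):
df.groupby(['A_Num', 'I_Num'], as_index=False)['ID'].agg(';'.join)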

Date difference from a list in pandas dataframe

I have a pandas dataframe of text data. I created it by doing a group by and aggregate to get the texts per id, like below. I later calculated the word count.
df = df.groupby('id') \
       .agg({'chat': ', '.join}) \
       .reset_index()
It looks like this:
chat is the collection of the text data per id. created_at holds the dates of the chats, converted to string type.
|id|chat |word count|created_at |
|23|hi,hey!,hi|3 |2018-11-09 02:11:24,2018-11-09 02:11:43,2018-11-09 03:13:22|
|24|look there|2 |2017-11-03 18:05:34,2017-11-06 18:03:22 |
|25|thank you!|2 |2017-11-07 09:18:01,2017-11-18 11:09:37 |
I want to add a chat duration column that gives the difference between the first date and the last date in days, as an integer. If the chat ends the same day, then 1. The new expected column is:
|chat_duration|
|1 |
|3 |
|11 |
Copied to the clipboard, the data looks like this before the group by:
,id,chat,created_at
0,23,"hi",2018-11-09 02:11:24
1,23,"hey!",2018-11-09 02:11:43
2,23,"hi",2018-11-09 03:13:22
If I were doing the entire process, beginning with the unprocessed data:
id,chat,created_at
23,"hi i'm at school",2018-11-09 02:11:24
23,"hey! how are you",2018-11-09 02:11:43
23,"hi mom",2018-11-09 03:13:22
24,"leaving home",2018-11-09 02:11:24
24,"not today",2018-11-09 02:11:43
24,"i'll be back",2018-11-10 03:13:22
25,"yesterday i had",2018-11-09 02:11:24
25,"it's to hot",2018-11-09 02:11:43
25,"see you later",2018-11-12 03:13:22
# create the dataframe with this data on the clipboard
import pandas as pd
df = pd.read_clipboard(sep=',')
set created_at to datetime
df.created_at = pd.to_datetime(df.created_at)
create word_count
df['word_count'] = df.chat.str.split(' ').map(len)
Use groupby and agg to get all chat as one string, created_at as a list, and word_count as a total sum.
df = df.groupby('id').agg({'chat': ','.join , 'created_at': list, 'word_count': sum}).reset_index()
calculate chat_duration
df['chat_duration'] = df['created_at'].apply(lambda x: (max(x) - min(x)).days)
convert created_at to desired string format
If you skip this step, created_at will be a list of datetimes.
df['created_at'] = df['created_at'].apply(lambda x: ','.join([y.strftime("%m/%d/%Y %H:%M:%S") for y in x]))
Final df
| | id | chat | created_at | word_count | chat_duration |
|---:|-----:|:------------------------------------------|:------------------------------------------------------------|-------------:|----------------:|
| 0 | 23 | hi i'm at school,hey! how are you,hi mom | 11/09/2018 02:11:24,11/09/2018 02:11:43,11/09/2018 03:13:22 | 10 | 0 |
| 1 | 24 | leaving home,not today,i'll be back | 11/09/2018 02:11:24,11/09/2018 02:11:43,11/10/2018 03:13:22 | 7 | 1 |
| 2 | 25 | yesterday i had,it's to hot,see you later | 11/09/2018 02:11:24,11/09/2018 02:11:43,11/12/2018 03:13:22 | 9 | 3 |
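One mismatch with the question's expected output: same-day chats come out as 0 here rather than 1. If that matters, a small tweak (my addition, not part of the original steps) is to clip the lower bound:
# count same-day chats as 1 day instead of 0 (assumption based on the expected output)
df['chat_duration'] = df['chat_duration'].clip(lower=1)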
After some tries I got it:
First convert the string to a list:
df['created_at'] = df['created_at'].str.split(',')
Then subtract the earliest date from the latest one (note the format string needs the time part as well):
from datetime import datetime
df['created_at'] = df['created_at'].apply(
    lambda s: (datetime.strptime(max(s), '%Y-%m-%d %H:%M:%S')
               - datetime.strptime(min(s), '%Y-%m-%d %H:%M:%S')).days)
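A pandas-only variant of the same idea that skips strptime (a sketch, assuming created_at still holds the original comma-separated strings and pandas is imported as pd):
# parse each row's date list with pandas, then take the span in days
dates = df['created_at'].str.split(',').apply(pd.to_datetime)
df['chat_duration'] = dates.apply(lambda d: (d.max() - d.min()).days)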
Create a DataFrame by splitting, then subtract the first and last columns converted to datetimes:
df1 = df['created_at'].str.split(',', expand=True).ffill(axis=1)
df['created_at'] = (pd.to_datetime(df1.iloc[:, -1]) - pd.to_datetime(df1.iloc[:, 0])).dt.days

Need a way to overwrite columns in 2 separate pandas dataframes

I have 2 dataframes; both have an identical emails column and each has a unique ID column. The code used to create them looks like this:
import pandas as pd
df = pd.read_excel(r'C:\Users\file.xlsx')
df['healthAssessment'] = df['ltv']*.01*df['Employment.Weight']*df['Income_Per_Year']/df['Debits_Per_Year'].astype(int)
df0 = df.loc[df['receivedHealthEmail'].str.contains('No Email Sent')]
df2 = df0.loc[df['healthAssessment'] > 2.5]
df3 = df2.loc[df['Emails'].str.contains('#')]
print (df)
df4 = df
df1 = df3
receiver = df1['Emails'].astype(str)
receivers = receiver
df1['receivedHealthEmail'] = receiver
print (df1)
the first dataframe it produces looks roughly like this
Unique ID | Emails | receivedHealthEmail| healthAssessment
0 | aaaaaaaaaa#aaaaaa | No Email Sent| 2.443849
1 | bbbbbbbbbbbbb#bbb | No Email Sent| 3.809817
2 | ccccccccccccc#ccc | No Email Sent| 2.952871
3 | ddddddddddddd#ddd | No Email Sent| 2.564398
4 | eeeeeeeeeee#eeeee | No Email Sent| 3.315868
... | ... | ... ...
3294 | no email provided | No Email Sent| 7.674677
the second data frame looks like this
Unique ID | Emails | receivedHealthEmail | healthAssessment
1 | bbbbbbbbbbbbb#bbb| bbbbbbbbbbbbb#bbb| 3.809817
2 | cccccccccccccc#cc| cccccccccccccc#cc| 2.952871
3 | ddddddddddddd#ddd| ddddddddddddd#ddd| 2.564398
4 | eeeeeeeeeee#eeeee| eeeeeeeeeee#eeeee| 3.315868
I need a way to overwrite the receivedHealthEmail column in the first dataframe using the values from the second dataframe. Any help is appreciated.
You can merge the 2 dataframes based on UniqueID:
df = df1.merge(df2, on='UniqueID')
df.drop(columns=['receivedHealthEmail_x', 'healthAssessment_x', 'Emails_x'], inplace=True)
print(df)
UniqueID Emails_y receivedHealthEmail_y healthAssessment_y
0 1 bbbbbbbbbbbbb#bbb bbbbbbbbbbbbb#bbb 3.809817
1 2 cccccccccccccc#cc cccccccccccccc#cc 2.952871
2 3 ddddddddddddd#ddd ddddddddddddd#ddd 2.564398
3 4 eeeeeeeeeee#eeeee eeeeeeeeeee#eeeee 3.315868
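If the goal is to overwrite the values inside the first dataframe (keeping all of its rows) rather than produce a merged copy, DataFrame.update is another option. A rough sketch, assuming the key column is literally named 'UniqueID' in both frames, as in the merge above:
df1 = df1.set_index('UniqueID')
df2 = df2.set_index('UniqueID')
df1.update(df2[['receivedHealthEmail']])   # only rows present in df2 are overwritten
df1 = df1.reset_index()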

I'd like to capture several cell values from a CSV file and write to a new column

So, I'm new to Python. I've been trying to apply the things I've learnt to real-world problems. The task I've set myself is this:
I want to capture the two cell values '01/01/2018' and '08/01/2018' and write them into a new csv file under a Date header. I also want to create a new column which shows the value associated with that date in the original csv file.
Any help would be greatly appreciated or a point in the right direction.
Original table
Hierarchy | Dept | Emp | Alpha | Bravo | Charlie | 01/01/2018 | 08/01/2018|
Hierarchy 1 | Dept 1 | JC | h | o | l | 0 | 2 |
New table
Hierarchy |Dept | Emp | Alpha | Bravo | Charlie | Date |Value |
Hierarchy 1 |Dept 1 | JC | h | o | l | 01/01/2018 | 0 |
Hierarchy 1 |Dept 1 | JC | h | o | l | 08/01/2018 | 2 |
As @ChristianSloper mentions in his comment, pd.melt is designed for this. In your case, here is a one-liner:
df
Hierarchy Dept Emp Alpha Bravo Charlie 01/01/2018 08/01/2018
0 Hierarchy_1 Dept_1 JC h o l 0 2
pd.melt(df,
id_vars=df.columns[:-2],
value_vars=df.columns[-2:],
var_name='Date',
value_name='Value')
Hierarchy Dept Emp Alpha Bravo Charlie Date Value
0 Hierarchy_1 Dept_1 JC h o l 01/01/2018 0
1 Hierarchy_1 Dept_1 JC h o l 08/01/2018 2
OK, I am just going to assume that your table is stored in a csv file, so we will start by reading that in:
import pandas as pd
df = pd.read_csv('mytable.csv',sep='|')
pd.melt(df,
id_vars = ['Hierarchy ', ' Dept ', ' Emp ', ' Alpha ', ' Bravo ', ' Charlie '],
value_vars=[' 01/01/2018 ',' 08/01/2018' ],
var_name='Date',
value_name='Value')
Gives the desired result.
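A small optional refinement (my assumption, not part of the original answer): read_csv with sep='|' keeps the padding around the pipes in the column names, which is why they have to be quoted with spaces above. Stripping the whitespace first lets you slice the columns positionally instead:
df.columns = df.columns.str.strip()   # remove the padding left by sep='|'
pd.melt(df,
        id_vars=df.columns[:-2],
        value_vars=df.columns[-2:],
        var_name='Date',
        value_name='Value')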
With the help of the contributors I have completed my task; below is the code I used.
Thanks to the community for offering help!
"""
Transforms Data into Desired Format
"""
#import pandas module
import pandas as pd
#create variable where df = to data.csv
df = pd.read_csv('data.csv')
#create new variable for df.columns
cols = df.columns
#use .melt() function to complete data manipulation
transformed_df = pd.melt(df,
id_vars=cols[:6],
value_vars=cols[6:])
#Assert Data has been formatted correctly
print(transformed_df)
#create new csv file with new data
transformed_df.to_csv('melted_data.csv')
print("\nData has been Melted!")

Creating a dataframe from a dictionary without the key being the index

I've got a basic dictionary that gives me a count of how many times data shows up. e.g. Adam: 10, Beth: 3, ... , Zack: 1
If I do df = pd.DataFrame([dataDict]).T then the keys from the dictionary become the index of the dataframe and I only have 1 true column of data. I've looked but I haven't found a way around this, so any help would be appreciated.
Edit: More detail
The dictionary was formed from a count function of another dataframe, e.g. dataDict = df1.Name.value_counts().to_dict()
This is my expected output.
  | Name | Count
--|------|------
0 | Adam | 10
1 | Beth | 3
What I'm getting at the moment is this:
     | Count
-----|------
Adam | 10
Beth | 3
Try reset_index():
dataDict = dict(Adam=10, Beth=3, Zack=1)
df = pd.Series(dataDict).rename_axis('Name').reset_index(name='Count')
df
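Since the dict itself comes from value_counts(), you can also skip the dict step entirely; a sketch along the same lines, assuming df1 is the original frame with a Name column, as in the question's edit:
df = df1['Name'].value_counts().rename_axis('Name').reset_index(name='Count')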

What's the most efficient way to accumulate dataframes in pyspark?

I have a dataframe (or it could be any RDD) containing several million rows in a well-known schema like this:
Key | FeatureA | FeatureB
--------------------------
U1 | 0 | 1
U2 | 1 | 1
I need to load a dozen other datasets from disk that contain different features for the same set of keys. Some datasets are up to a dozen or so columns wide. Imagine:
Key | FeatureC | FeatureD | FeatureE
-------------------------------------
U1 | 0 | 0 | 1
Key | FeatureF
--------------
U2 | 1
It feels like a fold or an accumulation where I just want to iterate all the datasets and get back something like this:
Key | FeatureA | FeatureB | FeatureC | FeatureD | FeatureE | FeatureF
---------------------------------------------------------------------
U1 | 0 | 1 | 0 | 0 | 1 | 0
U2 | 1 | 1 | 0 | 0 | 0 | 1
I've tried loading each dataframe then joining but that takes forever once I get past a handful of datasets. Am I missing a common pattern or efficient way of accomplishing this task?
Assuming there is at most one row per key in each DataFrame and all keys are of primitive types, you can try a union with an aggregation. Let's start with some imports and example data:
from itertools import chain
from functools import reduce
from pyspark.sql.types import StructType
from pyspark.sql.functions import col, lit, max
from pyspark.sql import DataFrame
df1 = sc.parallelize([
    ("U1", 0, 1), ("U2", 1, 1)
]).toDF(["Key", "FeatureA", "FeatureB"])
df2 = sc.parallelize([
    ("U1", 0, 0, 1)
]).toDF(["Key", "FeatureC", "FeatureD", "FeatureE"])
df3 = sc.parallelize([("U2", 1)]).toDF(["Key", "FeatureF"])
dfs = [df1, df2, df3]
Next we can extract common schema:
output_schema = StructType(
    [df1.schema.fields[0]] +
    list(chain(*[df.schema.fields[1:] for df in dfs]))
)
and transform all DataFrames:
transformed_dfs = [df.select(*[
    lit(None).cast(c.dataType).alias(c.name) if c.name not in df.columns
    else col(c.name)
    for c in output_schema.fields
]) for df in dfs]
Finally, a union and a dummy aggregation:
combined = reduce(DataFrame.unionAll, transformed_dfs)
exprs = [max(c).alias(c) for c in combined.columns[1:]]
result = combined.repartition(col("Key")).groupBy(col("Key")).agg(*exprs)
If there is more than one row per key but individual columns are still atomic you can try to replace max with collect_list / collect_set followed by explode.
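A follow-up detail, not part of the original answer: after the max aggregation, features a key never had come back as null rather than 0. If the desired output really needs zeros, DataFrame.na.fill can take care of that:
result = result.na.fill(0)   # replace the nulls left by missing features
result.orderBy("Key").show()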
