I'm trying to make the kind of transformation shown in the image below (each Id's Values collapsed into one '|'-joined string, plus a count of how many rows the Id has):
I wrote the code below, but unfortunately I'm not getting the result I'm looking for:
import pandas as pd
df = pd.DataFrame({'Id': ['Id001', 'Id002', 'Id002', 'Id003', 'Id003', 'Id003', 'Id004', 'Id004'],
                   'Values': ['red', 'brown', 'white', 'blue', 'green', 'yellow', 'rose', 'purple']})
out = (df['Values']
       .astype(str)
       .groupby(df['Id'])
       .agg('|'.join)
       .reset_index())
Do you have any suggestions, please?
Try transform
df['Values_grouped'] = df.groupby('Id')['Values'].transform('|'.join)
df['Id_occurrences'] = df.groupby('Id')['Values'].transform('count')
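A self-contained run of that approach on the sample frame from the question, with the expected output shown in comments:

import pandas as pd

df = pd.DataFrame({'Id': ['Id001', 'Id002', 'Id002', 'Id003', 'Id003', 'Id003', 'Id004', 'Id004'],
                   'Values': ['red', 'brown', 'white', 'blue', 'green', 'yellow', 'rose', 'purple']})

# transform broadcasts each group's result back onto every row of that group
df['Values_grouped'] = df.groupby('Id')['Values'].transform('|'.join)
df['Id_occurrences'] = df.groupby('Id')['Values'].transform('count')

print(df.head(3))
#       Id Values Values_grouped  Id_occurrences
# 0  Id001    red            red               1
# 1  Id002  brown    brown|white               2
# 2  Id002  white    brown|white               2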
You're close; you just need to use out to assign the results back to df (it's better not to reset_index() in this case):
import pandas as pd
df = pd.DataFrame({'Id': ['Id001', 'Id002', 'Id002', 'Id003', 'Id003', 'Id003', 'Id004', 'Id004'],
                   'Values': ['red', 'brown', 'white', 'blue', 'green', 'yellow', 'rose', 'purple']})
out = (df['Values']
       .astype(str)
       .groupby(df['Id'])
       .agg('|'.join))
counts = df['Id'].value_counts()
df['Id_occurrences'] = [counts.loc[i] for i in df['Id']]
df['Values_grouped'] = [out.loc[i] for i in df['Id']]
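A slightly more idiomatic variant of those two lookups, using Series.map instead of the list comprehensions:

# map aligns each Id against the counts/out Series without an explicit loop
df['Id_occurrences'] = df['Id'].map(counts)
df['Values_grouped'] = df['Id'].map(out)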
Hi, I am trying to add new columns to a multi-indexed pandas pivot table to do a COUNTIF-style calculation (similar to Excel), depending on whether a level of the column index contains a specific string. This is the sample data:
df = pd.DataFrame({'City': ['Houston', 'Austin', 'Hoover', 'Adak', 'Denver', 'Houston', 'Adak', 'Denver'],
                   'State': ['Texas', 'Texas', 'Alabama', 'Alaska', 'Colorado', 'Texas', 'Alaska', 'Colorado'],
                   'Name': ['Aria', 'Penelope', 'Niko', 'Susan', 'Aria', 'Niko', 'Aria', 'Niko'],
                   'Unit': ['Sales', 'Marketing', 'Operations', 'Sales', 'Operations', 'Operations', 'Sales', 'Operations'],
                   'Assigned': ['Yes', 'No', 'Maybe', 'No', 'Yes', 'Yes', 'Yes', 'Yes']},
                  columns=['City', 'State', 'Name', 'Unit', 'Assigned'])
pivot = df.pivot_table(index=['City', 'State'], columns=['Name', 'Unit'], values=['Assigned'],
                       aggfunc=lambda x: ', '.join(set(x)), fill_value='')
and this is the desired output (in screenshot). Thanks in advance!
try:
temp = pivot[('Assigned', 'Aria', 'Sales')].str.len() > 0
pivot['new col'] = temp.astype(int)
the result: a new 0/1 column appended to the pivot (1 where the Aria/Sales cell is non-empty).
Based on your edit:
import numpy as np
temp = pivot.xs('Sales', level=2, drop_level=False, axis=1).apply(
    lambda x: np.sum([1 if y != '' else 0 for y in x]), axis=1)
pivot[('', 'total sales', 'count how many...')]=temp
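The same count can also be written without the Python-level loop; a small variant, assuming the pivot was built with fill_value='' as above:

# compare every 'Sales' cell against the empty fill value and sum the booleans row-wise
temp = pivot.xs('Sales', level=2, drop_level=False, axis=1).ne('').sum(axis=1)
pivot[('', 'total sales', 'count how many...')] = temp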
I have the following task.
I have this data:
import pandas
import numpy as np
data = {'name': ['Todd', 'Chris', 'Jackie', 'Ben', 'Richard', 'Susan', 'Joe', 'Rick'],
        'phone': [912341.0, np.nan, 912343.0, np.nan, 912345.0, 912345.0, 912347.0, np.nan],
        ' email': ['todd#gmail.com', 'chris#gmail.com', np.nan, 'ben#gmail.com', np.nan, np.nan, 'joe#gmail.com', 'rick#gmail.com'],
        'most_visited_airport': ['Heathrow', 'Beijing', 'Heathrow', np.nan, 'Tokyo', 'Beijing', 'Tokyo', 'Heathrow'],
        'most_visited_place': ['Turkey', 'Spain', np.nan, 'Germany', 'Germany', 'Spain', np.nan, 'Spain']}
df = pandas.DataFrame(data)
What I have to do is, for every feature column (most_visited_airport etc.) and each of its values (Heathrow, Beijing, Tokyo), extract the matching personal information and output it to a file.
E.g., if we look at most_visited_airport and Heathrow,
I need to output three files containing the names, emails and phones of the people whose most visited airport is Heathrow.
Currently, I have this code to do the operation for both columns and all the values:
columns_to_iterate = [x for x in df.columns if 'most' in x]
for each in df[columns_to_iterate]:
    values = df[each].dropna().unique()
    for i in values:
        df1 = df.loc[df[each] == i, 'name']
        df2 = df.loc[df[each] == i, ' email']
        df3 = df.loc[df[each] == i, 'phone']
        df1.to_csv(f'{each}_{i}_{df1.name}.csv')
        df2.to_csv(f'{each}_{i}_{df2.name}.csv')
        df3.to_csv(f'{each}_{i}_{df3.name}.csv')
Is it possible to do this in a more elegant and maybe faster way? Currently I have a small dataset, but I'm not sure this code will perform well with big data. My particular concern is the nested loops.
Thank you in advance!
You could replace the call to unique with a groupby, which not only gets the unique values but also splits up the dataframe for you (groupby skips NaN keys, so no dropna is needed on the grouping column; note the email column is named ' email' with a leading space):
for column in df.filter(regex='^most'):
    for key, group in df.groupby(column):
        for attr in ('name', 'phone', ' email'):
            # write the current attribute, stripping the stray space for the filename
            group[attr].dropna().to_csv(f'{column}_{key}_{attr.strip()}.csv')
You can do it this way.
cols = df.filter(regex='most').columns.values

def func_current_cols_to_csv(most_col):
    places = df[most_col].dropna().unique().tolist()
    csv_cols = ['name', 'phone', ' email']
    return [df[df[most_col] == i][j].dropna().to_csv(f'{most_col}_{i}_{j}.csv', index=False)
            for i in places for j in csv_cols]

[func_current_cols_to_csv(i) for i in cols]
You can also keep the index when writing to CSV (drop index=False); just don't forget to reset it before writing.
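If the nested loops themselves are the worry, a single melt can flatten the feature columns first, so each output file comes from one groupby pass. A sketch along those lines (note the ' email' column name keeps its leading space):

# one row per (person, feature column, feature value); rows with NaN feature values are dropped
melted = df.melt(id_vars=['name', 'phone', ' email'],
                 value_vars=df.filter(regex='^most').columns,
                 var_name='feature', value_name='value').dropna(subset=['value'])

for (feature, value), group in melted.groupby(['feature', 'value']):
    for attr in ('name', 'phone', ' email'):
        group[attr].dropna().to_csv(f'{feature}_{value}_{attr.strip()}.csv', index=False)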
I have the JSON below in data. I want it to look like the Expected Result below it:
import json
import pandas as pd
data = [{'useEventValue': True,
         'eventConditions': [{'type': 'CATEGORY',
                              'matchType': 'EXACT',
                              'expression': 'ABC'},
                             {'type': 'ACTION',
                              'matchType': 'EXACT',
                              'expression': 'DEF'},
                             {'type': 'LABEL',
                              'matchType': 'REGEXP',
                              'expression': 'GHI|JKL'}]}]
Expected Result:
  Category_matchType Category_expression Action_matchType Action_expression Label_matchType Label_expression
0              EXACT                 ABC            EXACT               DEF          REGEXP          GHI|JKL
What I've Tried:
This question is similar, but I'm not using the index the way the OP is. Following this example, I've tried using json_normalize and then using various forms of melt, stack, unstack, pivot, etc. But there has to be an easier way!
# this bit of code produces the below result where I can start using reshaping functions to get to what I need but it seems messy
df = pd.json_normalize(data, 'eventConditions')
       type matchType expression
0  CATEGORY     EXACT        ABC
1    ACTION     EXACT        DEF
2     LABEL    REGEXP    GHI|JKL
We can use json_normalize to read the JSON data as a pandas dataframe, then use stack followed by unstack to reshape it:
df = pd.json_normalize(data, 'eventConditions')
df = df.set_index([df.groupby('type').cumcount(), 'type']).stack().unstack([1, 2])
df.columns = df.columns.map('_'.join)
  CATEGORY_matchType CATEGORY_expression ACTION_matchType ACTION_expression LABEL_matchType LABEL_expression
0              EXACT                 ABC            EXACT               DEF          REGEXP          GHI|JKL
If your data is not too large, you could process the JSON first and then create a dataframe like this:
import pandas as pd
import json
data = [{'useEventValue': True,
         'eventConditions': [{'type': 'CATEGORY',
                              'matchType': 'EXACT',
                              'expression': 'ABC'},
                             {'type': 'ACTION',
                              'matchType': 'EXACT',
                              'expression': 'DEF'},
                             {'type': 'LABEL',
                              'matchType': 'REGEXP',
                              'expression': 'GHI|JKL'}]}]
new_data = {}
for i in data:
    for event in i['eventConditions']:
        for key in event.keys():
            if key != 'type':
                col_name = event['type'] + '_' + key
                # append instead of reassigning: list.append returns None, so the
                # original one-liner would overwrite an existing entry with None
                new_data.setdefault(col_name, []).append(event[key])
df = pd.DataFrame(new_data)
df
Just found a way to do it with Pandas only:
df = pd.json_normalize(data, 'eventConditions')
df = df.melt(id_vars=['type'])
df['type'] = df['type'] + '_' + df['variable']
df.drop(columns=['variable'], inplace=True)
df.set_index('type', inplace=True)
df = df.T
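For reference, the transposed frame should then look roughly like this (column order may differ):

print(df)
# type   CATEGORY_matchType ACTION_matchType LABEL_matchType CATEGORY_expression ACTION_expression LABEL_expression
# value               EXACT            EXACT          REGEXP                 ABC               DEF          GHI|JKL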
So I have some data that I am using the df.explode() method on. I get an error when running the code. I know the error is caused by row 3 not having a corresponding Qty for each location, but how can I handle that error?
Code that returns 'ValueError: cannot reindex from a duplicate axis':
import pandas as pd
import openpyxl

data = {'ITEM': ['Item1', 'Item2', 'Item3'],
        'Locations': ['loc1;loc2', 'loc3', 'loc4;loc5'],
        'Qty': ['100;200', '100', '500']}
df1 = pd.DataFrame(data, columns=['ITEM', 'Locations', 'Qty'])
print(df1)

formatted_df1 = (df1.set_index(['ITEM'])
                 .apply(lambda x: x.str.split(';').explode())
                 .reset_index())
print(formatted_df1)
Code that works (note that the last record has 500;600):
import pandas as pd
import openpyxl

data = {'ITEM': ['Item1', 'Item2', 'Item3'],
        'Locations': ['loc1;loc2', 'loc3', 'loc4;loc5'],
        'Qty': ['100;200', '100', '500;600']}
df1 = pd.DataFrame(data, columns=['ITEM', 'Locations', 'Qty'])
print(df1)

formatted_df1 = (df1.set_index(['ITEM'])
                 .apply(lambda x: x.str.split(';').explode())
                 .reset_index())
print(formatted_df1)
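One way to handle the mismatch is to pad the shorter Qty list so that every row has exactly one quantity per location, and then explode both columns together (multi-column explode requires pandas >= 1.3). A sketch, assuming a missing quantity should simply become None:

import pandas as pd

data = {'ITEM': ['Item1', 'Item2', 'Item3'],
        'Locations': ['loc1;loc2', 'loc3', 'loc4;loc5'],
        'Qty': ['100;200', '100', '500']}
df1 = pd.DataFrame(data)

# split both columns into lists
df1['Locations'] = df1['Locations'].str.split(';')
df1['Qty'] = df1['Qty'].str.split(';')

# pad each Qty list with None up to the length of the matching Locations list
df1['Qty'] = [q + [None] * (len(loc) - len(q))
              for loc, q in zip(df1['Locations'], df1['Qty'])]

# both list columns now have equal per-row lengths, so they can be exploded together
formatted_df1 = df1.explode(['Locations', 'Qty']).reset_index(drop=True)
print(formatted_df1)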
I would like some advice on how to update/insert new data into an already existing data table using Python/Databricks:
# Inserting and updating already existing data
# Original data
import pandas as pd
source_data = {'Customer Number': ['1', '2', '3'],
               'Colour': ['Red', 'Blue', 'Green'],
               'Flow': ['Good', 'Bad', 'Good']}
df1 = pd.DataFrame(source_data, columns=['Customer Number', 'Colour', 'Flow'])
print(df1)
# New data
new_data = {'Customer Number': ['1', '4'],
            'Colour': ['Blue', 'Blue'],
            'Flow': ['Bad', 'Bad']}
df2 = pd.DataFrame(new_data, columns=['Customer Number', 'Colour', 'Flow'])
print(df2)
# What the updated table will look like
updated_data = {'Customer Number': ['1', '2', '3', '4'],
                'Colour': ['Blue', 'Blue', 'Green', 'Blue'],
                'Flow': ['Bad', 'Bad', 'Good', 'Bad']}
df3 = pd.DataFrame(updated_data, columns=['Customer Number', 'Colour', 'Flow'])
print(df3)
What you can see here is that the original data has three customers. I then get 'new_data', which contains an update of customer 1's data and new data for customer 4, who was not in the original data. If you look at 'updated_data' you can see what the final data should look like: customer 1's data has been updated and customer 4's data has been inserted.
Does anyone know where I should start with this? Which module could I use?
I'm not expecting someone to develop the solution for me, I just need a nudge in the right direction.
Edit: the data source is .txt or CSV, the output is JSON, but as I load the data to Cosmos DB it’ll automatically convert so don’t worry too much about that.
Thanks
Current data frame structure and 'pd.update'
With some preparation, you can use the pandas 'update' function.
First, the data frames must be indexed (this is often useful anyway).
Second, the source data frame must be extended by the new indices with dummy/NaN data so that it can be updated.
# set indices of original data frames
col = 'Customer Number'
df1.set_index(col, inplace=True)
df2.set_index(col, inplace=True)
df3.set_index(col, inplace=True)
# extend source data frame by new customer indices
df4 = df1.copy().reindex(index=df1.index.union(df2.index))
# update data
df4.update(df2)
# verify that new approach yields correct results
assert df3.equals(df4)  # all(df3 == df4) would always pass: it iterates column labels
Current data frame structure and 'pd.concat'
A slightly easier approach joins the data frames and removes duplicate rows (sorting by index if wanted). However, the temporary concatenation requires more memory, which may limit the size of the data frames.
df5 = pd.concat([df1, df2])
df5 = df5.loc[~df5.index.duplicated(keep='last')].sort_index()
assert df3.equals(df5)
Alternative data structure
Given that 'Customer Number' is the crucial attribute of your data, you may also consider restructuring your original dictionaries like this:
{'1': ['Red', 'Good'], '2': ['Blue', 'Bad'], '3': ['Green', 'Good']}
Then updating your data simply corresponds to (re)setting the key of the source data with the new data. Typically, working directly on dictionaries is faster than using data frames.
# define function to restructure data, for demonstration purposes only
def restructure(data):
    # transpose original data
    # https://stackoverflow.com/a/6473724/5350621
    vals = data.values()
    rows = list(map(list, zip(*vals)))
    # create new restructured dictionary with customers as keys
    restructured = dict()
    for row in rows:
        restructured[row[0]] = row[1:]
    return restructured
# restructure data
source_restructured = restructure(source_data)
new_restructured = restructure(new_data)
# simply (re)set new keys
final_restructured = source_restructured.copy()
for key, val in new_restructured.items():
    final_restructured[key] = val
# convert to data frame and check results
df6 = pd.DataFrame(final_restructured, index=['Colour', 'Flow']).T
assert df3.equals(df6)
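For flat dictionaries like these, the key-(re)setting loop can also be written in one line with dict unpacking, where keys from the new data overwrite existing ones:

# later entries win, matching the loop above
final_restructured = {**source_restructured, **new_restructured}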
PS: When setting 'df1 = pd.DataFrame(source_data, columns=[...])' you do not need the 'columns' argument because your dictionaries are nicely named and the keys are automatically taken as column names.
You can use set intersection to find the Customer Numbers to update and set difference to find the new Customer Numbers to add.
Then you can first update the rows of the initial data frame by iterating through the intersection of Customer Numbers, and then concatenate to it only the new rows of the data frame with the new values.
# same name column for clarity
cn = 'Customer Number'

# convert Customer Number values into integers to use sets
CusNum_df1 = [int(x) for x in df1[cn].values]
CusNum_df2 = [int(x) for x in df2[cn].values]

# find Customer Numbers to update and to add
CusNum_to_update = list(set(CusNum_df1).intersection(set(CusNum_df2)))
CusNum_to_add = list(set(CusNum_df2) - set(CusNum_df1))

# update rows in initial data frame
for num in CusNum_to_update:
    index_initial = df1.loc[df1[cn] == str(num)].index[0]
    index_new = df2.loc[df2[cn] == str(num)].index[0]
    for col in df1.columns:
        df1.at[index_initial, col] = df2.loc[index_new, col]

# concatenate new rows to initial data frame
for num in CusNum_to_add:
    df1 = pd.concat([df1, df2.loc[df2[cn] == str(num)]]).reset_index(drop=True)
out:
  Customer Number Colour  Flow
0               1   Blue   Bad
1               2   Blue   Bad
2               3  Green  Good
3               4   Blue   Bad
There are many ways, but in terms of readability, I would prefer to do this.
import pandas as pd
dict_source = {'Customer Number': ['1', '2', '3'],
               'Colour': ['Red', 'Blue', 'Green'],
               'Flow': ['Good', 'Bad', 'Good']}
df_origin = pd.DataFrame.from_dict(dict_source)
dict_new = {'Customer Number': ['1', '4'],
            'Colour': ['Blue', 'Blue'],
            'Flow': ['Bad', 'Bad']}
df_new = pd.DataFrame.from_dict(dict_new)
df_result = df_origin.copy()
df_result.set_index('Customer Number', inplace=True)
df_new.set_index('Customer Number', inplace=True)
df_result.update(df_new)  # update number 1
# handle number 4
df_result.reset_index(inplace=True)
df_new.reset_index(inplace=True)
df_result = df_result.merge(df_new, on=list(df_result), how='outer')
print(df_result)
  Customer Number Colour  Flow
0               1   Blue   Bad
1               2   Blue   Bad
2               3  Green  Good
3               4   Blue   Bad
You can use 'Customer Number' as the index and use the update method:
import pandas as pd
source_data = {'Customer Number': ['1', '2', '3'],
               'Colour': ['Red', 'Blue', 'Green'],
               'Flow': ['Good', 'Bad', 'Good']}
df1 = pd.DataFrame(source_data, index=source_data['Customer Number'], columns=['Colour', 'Flow'])
print(df1)
# New data
new_data = {'Customer Number': ['1', '4'],
            'Colour': ['Blue', 'Blue'],
            'Flow': ['Bad', 'Bad']}
df2 = pd.DataFrame(new_data, index=new_data['Customer Number'], columns=['Colour', 'Flow'])
print(df2)
df3 = df1.reindex(index=df1.index.union(df2.index))
df3.update(df2)
print(df3)
  Colour  Flow
1   Blue   Bad
2   Blue   Bad
3  Green  Good
4   Blue   Bad
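A related one-liner worth knowing: with both frames indexed by 'Customer Number' as above, combine_first performs the same upsert, preferring df2's values and filling the gaps from df1 (note it works cell-wise, so a NaN in df2 would be filled from df1):

# df2's rows win; customers present only in df1 are kept as-is
df3 = df2.combine_first(df1)
print(df3)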