I'm trying to improve and clarify some pandas code! Maybe you can give me some ideas? I tried Create multiple columns from multiple return values of a lambda function in python DataFrame but didn't manage to make it work.
I have the following process:
Parse a list of variables
For each value in the list I create an object and generate data
Save the data in columns.
At the moment I think my code is overly complicated; maybe there is a nicer way?
Let's say I have the following csv:
column1, column2
foo , bar
foo2 , bar2
Now the code (functional) is:
for provider in ["val1", "val2"]:
    # build an object per row
    input_df["object"] = input_df.apply(lambda row: SomeClass(row["column1"],
                                                              row["column2"], provider), axis=1)
    # populate a column with data
    input_df[provider] = input_df.apply(lambda row: row["object"].some_method(), axis=1)
    # create generic columns from the values returned by the object method
    input_df = input_df.assign(name=input_df[provider].apply(lambda x: x[0] if x[0] else None),
                               email=input_df[provider].apply(lambda x: x[1] if x[1] else None))
    # rename the generic columns per provider
    input_df.rename(columns={'name': provider + "_info1", 'email': provider + "_info2"}, inplace=True)
    # drop the helper columns (assign back, otherwise drop() has no effect)
    input_df = input_df.drop(columns=[provider, 'object'])
My result is something like this:
column1, column2, val1_info1, val1_info2, val2_info1, val2_info2
foo    , bar    , x1_info1  , x1_info2  , y1_info1  , y1_info2
foo2   , bar2   , x2_info1  , x2_info2  , y2_info1  , y2_info2
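Based on the linked question, this is roughly the direction I was hoping for (an untested sketch; it assumes some_method() returns a (info1, info2) tuple):

for provider in ["val1", "val2"]:
    # build the object and call some_method() in one pass, expanding
    # the returned tuple into two provider-specific columns
    info = input_df.apply(
        lambda row: SomeClass(row["column1"], row["column2"], provider).some_method(),
        axis=1,
        result_type="expand",
    )
    info.columns = [provider + "_info1", provider + "_info2"]
    input_df = input_df.join(info)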
Thanks!
I have a column in a Python df like:
TAGS
{user_type:active}
{session_type:session1}
{user_type:inactive}
How can I efficiently split this column into a separate column for each tag key?
Desired:
TAGS |user_type|session_type
{user_type:active} |active |null
{session_type:session1}|null |session1
{user_type:inactive} |inactive |null
My attempt is only able to do this in a boolean sense (not what I want), and only if I specify the tag columns (which I don't know ahead of time):
mask = df['tags'].apply(lambda x: 'user_type' in x)
df['user_type'] = mask
There are better ways, but this works with what you've got:

import numpy as np

# strip the braces, then take the value after the colon
df['user_type'] = df['tags'].apply(lambda x: x.strip('{}').split(':')[1] if 'user_type' in x else np.nan)
df['session_type'] = df['tags'].apply(lambda x: x.strip('{}').split(':')[1] if 'session_type' in x else np.nan)
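For example, a more general sketch that discovers the tag keys automatically (it assumes exactly one key:value pair per cell):

# split each '{key:value}' string into a key column and a value column
kv = df['tags'].str.strip('{}').str.split(':', n=1, expand=True)
# pivot so every distinct key becomes its own column, NaN where absent
df = df.join(kv.pivot(columns=0, values=1))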
You could parse each TAGS string into a dict and use pandas.json_normalize() to expand every key into its own column (NaN where a row doesn't carry that key):

# turn each '{key:value}' string into a dict, then normalize
tags = df['TAGS'].str.strip('{}').str.split(':', n=1).apply(lambda kv: {kv[0]: kv[1]})
df = df.join(pd.json_normalize(tags))
This is what ended up working for me; I wanted to post a short working example using the json library that helped.
import json
import pandas as pd

# This example assumes a dataframe df with other fields, including a 'tags' column
def js(row):
    # parse a JSON string into a dict; return a dummy mapping for empty rows
    if row:
        return json.loads(row)
    else:
        return {'': ''}

df2 = df.copy()
# Make some dummy tags
df2['tags'] = ['{"user_type":"active","nonuser_type":"inactive"}'] * len(df2['tags'])
df2['tags'] = df2['tags'].apply(js)
# expand each dict into its own columns, then join back to the other fields
df_temp = pd.DataFrame(df2['tags'].values.tolist())
df3 = pd.concat([df2.drop('tags', axis=1), df_temp], axis=1)
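After this, df3 has the original fields plus one column per tag key (here user_type and nonuser_type).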
@Ynjxsjmh your approach reminds me of something I had used in the past, but in this case I got the following error:
AttributeError: 'str' object has no attribute 'values'
@Bing Wang I am a big fan of list comprehensions, but in this case I don't know the names of the columns beforehand.
I have a dataframe with 3 columns, namely cuid, type, errorreason. The errorreason column is empty and I have to fill it with the following logic:
1.) If cuid is unique and type is 'COL', then errorreason is 'NO ERROR' (all unique values are 'NO ERROR').
2.) If cuid is not unique and its types include both 'COL' and 'ROT', then errorreason is 'AD'.
3.) If cuid is not unique and its types include both 'COL' and 'TOT', then errorreason is 'RE'.
4.) In any other case, errorreason is 'Unidentified'.
I have already separated the unique and non-unique values, so the first point is done. I was trying to group by the non-unique values and then apply a function, but I'm stuck on the next points.
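Roughly what I was attempting (just a sketch, not working yet):

# keep only the cuids that appear more than once, then collect their types
non_unique = df[df.duplicated('cuid', keep=False)]
type_lists = non_unique.groupby('cuid')['type'].apply(list)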
This is quite a long solution, but I inserted explanations for each step so that they are clear to you. At the end you obtain your desired output.
import numpy as np
import pandas as pd
# sample data
df = pd.DataFrame({
    'cuid': [100814, 100814, 100815, 100815, 100816],
    'type': ['col', 'rot', 'col', 'tot', 'col']
})
# define function for concatenating 'type' within the same 'cuid'
def str_cat(x):
    return x.str.cat(sep=', ')
# create a lookup dataset that we will merge later on
df_lookup = df.groupby('cuid').agg({
    'cuid': 'count',
    'type': str_cat
}).rename(columns={'cuid': 'unique'})
# create 'error_reason' on this lookup dataset with a case-when-like statement using np.select
# note: the count column was renamed to 'unique' above, so the conditions use it
df_lookup['error_reason'] = np.select(
    [
        (df_lookup['unique'] == 1) & (df_lookup['type'] == 'col'),
        (df_lookup['unique'] > 1) & (df_lookup['type'].str.contains('col')) & (df_lookup['type'].str.contains('rot')),
        (df_lookup['unique'] > 1) & (df_lookup['type'].str.contains('col')) & (df_lookup['type'].str.contains('tot'))
    ],
    [
        'NO ERROR',
        'AD',
        'RE'
    ],
    default='Unidentified'
)
# merge the two datasets
df.merge(df_lookup.drop(columns=['type', 'unique']), on='cuid')
Output
cuid type error_reason
0 100814 col AD
1 100814 rot AD
2 100815 col RE
3 100815 tot RE
4 100816 col NO ERROR
Try to use this:
df.groupby('cuid', as_index=False)['type'].aggregate(list)
I have not tested this solution so let me know if it does not work.
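For instance, a sketch building on that aggregate (using the lowercase column names from the sample data above):

grouped = df.groupby('cuid', as_index=False)['type'].aggregate(list)

def reason(types):
    # classify each cuid by the combination of types it carries
    if types == ['col']:
        return 'NO ERROR'
    if 'col' in types and 'rot' in types:
        return 'AD'
    if 'col' in types and 'tot' in types:
        return 'RE'
    return 'Unidentified'

grouped['errorreason'] = grouped['type'].apply(reason)
df = df.merge(grouped[['cuid', 'errorreason']], on='cuid')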
I'm working with a large dataframe and need a way to dynamically rename column names.
Here's a slow method I'm working with:
# Create a sample dataframe
import pandas as pd

df = pd.DataFrame.from_records([
    {'Name': 'Jay', 'Favorite Color (BLAH)': 'Green'},
    {'Name': 'Shay', 'Favorite Color (BLAH)': 'Blue'},
    {'Name': 'Ray', 'Favorite Color (BLAH)': 'Yellow'},
])
# Current columns are: ['Name', 'Favorite Color (BLAH)']
# ------
# build two lambdas to clean the column names
f_clean = lambda x: x.split('(')[0] if ' (' in x else x
f_join = lambda x: '_'.join(x.split())
df.columns = df.columns.map(f_clean).map(f_join).str.lower()
# Columns are now: ['name', 'favorite_color']
Is there a better method for solving this?
You could define a clean function and just apply it to all the columns using a list comprehension.
def clean(name):
    name = name.split('(')[0] if ' (' in name else name
    name = '_'.join(name.split())
    return name.lower()  # lower-case to match the desired column names

df.columns = [clean(col) for col in df.columns]
It's clear what's happening and not overly verbose.
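With the sample frame above, df.columns then becomes ['name', 'favorite_color'].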
Adding a new row to a dataframe with correct mapping in pandas
Something similar to the above question.
carrier_plan_identifier ... hios_issuer_identifier
1 AUSK ... 99806.0
2 AUSM ... 99806.0
3 AUSN ... 99806.0
4 AUSS ... 99806.0
5 AUST ... 99806.0
I need to pick multiple columns, let's say carrier_plan_identifier, wellthie_issuer_identifier and hios_issuer_identifier.
With these 3 columns I need to run a select query, something like:
select id from table_name where carrier_plan_identifier = 'something' and wellthie_issuer_identifier = 'something' and hios_issuer_identifier = 'something'
I need to add the id column back to my existing dataframe.
Currently, I am doing something like this,
for index, frame in df_with_servicearea.iterrows():
    if frame['service_area_id'] and frame['issuer_id']:
        # reading from medical plans table
        medical_plan_id = getmodeldata.get_medicalplans(sess, frame['issuer_id'],
                                                        frame['hios_plan_identifier'],
                                                        frame['plan_year'],
                                                        frame['group_or_individual_plan_type'])
        frame['medical_plan_id'] = medical_plan_id
        df_with_servicearea.append(frame)
When I do frame['medical_plan_id'] = medical_plan_id, nothing is added. But when I do df_with_servicearea['medical_plan_id'] = medical_plan_id, only the last value of the loop is added to all the rows. I am not sure if this is the correct way to do this.
Update:
After using the line below, I am getting 4 rows instead of the 2 rows that should be there.
df_with_servicearea = df_with_servicearea.append(frame)
wellthie_issuer_identifier ... medical_plan_id
0 UHC99806 ... NaN
1 UHC99806 ... NaN
0 UHC99806 ... 879519.0
1 UHC99806 ... 879520.0
Update 2 - implemented based on Mayank's answer:
Hi Mayank, is something like this what you are suggesting?
for index, frame in df_with_servicearea.iterrows():
    if frame['service_area_id'] and frame['issuer_id']:
        # reading from medical plans table
        df_new = getmodeldata.get_medicalplans(sess, frame['issuer_id'],
                                               frame['hios_plan_identifier'],
                                               frame['plan_year'],
                                               frame['group_or_individual_plan_type'])
        df_new.columns = ['medical_plan_id', 'issuer_id', 'hios_plan_identifier',
                          'plan_year', 'group_or_individual_plan_type']
        new_df = pd.merge(df_with_servicearea, df_new,
                          on=['issuer_id', 'hios_plan_identifier', 'plan_year',
                              'group_or_individual_plan_type'], how='left')
        print(new_df)
My get_medicalplans function, where I call the select query:
def get_medicalplans(self, sess, issuerid, hios_plan_identifier, plan_year,
                     group_or_individual_plan_type):
    try:
        medical_plan = sess.query(MedicalPlan.id, MedicalPlan.issuer_id,
                                  MedicalPlan.hios_plan_identifier,
                                  MedicalPlan.plan_year,
                                  MedicalPlan.group_or_individual_plan_type).filter(
            MedicalPlan.issuer_id == issuerid,
            MedicalPlan.hios_plan_identifier == hios_plan_identifier,
            MedicalPlan.plan_year == plan_year,
            MedicalPlan.group_or_individual_plan_type == group_or_individual_plan_type)
        sess.commit()
        return pd.read_sql(medical_plan.statement, medical_plan.session.bind)
    except Exception:
        # except clause added so the snippet parses; the original handler was not shown
        sess.rollback()
        raise
The simplest solution to your issue is to change the last row into:
df_with_servicearea = df_with_servicearea.append(frame)
However, if you want to add a new column, use:
import numpy as np

df_with_servicearea['medical_plan_id'] = df_with_servicearea.apply(
    lambda row:
        getmodeldata.get_medicalplans(sess,
                                      row['issuer_id'],
                                      row['hios_plan_identifier'],
                                      row['plan_year'],
                                      row['group_or_individual_plan_type'])
        if row['service_area_id'] and row['issuer_id']
        else np.nan,
    axis=1)  # axis=1 so the lambda receives rows, not columns
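Note that get_medicalplans as posted returns a DataFrame (from pd.read_sql), so to store a single id per row you would first need to extract the scalar, e.g. something like result['id'].iloc[0], guarding against empty results.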
Try this:
Considering that you want to update the original df based on the below 3 cols:
1.) Tweak the query which you are firing on the DB to include the columns carrier_plan_identifier, wellthie_issuer_identifier and hios_issuer_identifier in the select clause.
select id,carrier_plan_identifier, wellthie_issuer_identifier,hios_issuer_identifier from table_name where carrier_plan_identifier = 'something' and wellthie_issuer_identifier = 'something' and hios_issuer_identifier = 'something'
2.) Create a dataframe for the above results.
# name the columns so the merge below can find them
df = pd.DataFrame(cur.fetchall(),
                  columns=['id', 'carrier_plan_identifier',
                           'wellthie_issuer_identifier', 'hios_issuer_identifier'])
3.) The above df now has the id column along with the 3 other columns. Merge this df with the original_df on carrier_plan_identifier, wellthie_issuer_identifier and hios_issuer_identifier:
original_df = pd.merge(original_df, df,
                       on=['carrier_plan_identifier', 'wellthie_issuer_identifier',
                           'hios_issuer_identifier'], how='outer')
(I changed the left join to an outer join.)
So, to understand what's happening here: I am joining the query dataframe (df) with the original_df on the columns carrier_plan_identifier, wellthie_issuer_identifier and hios_issuer_identifier, and appending the id column, which was not present before.
Wherever a match is found, the id value from df is copied to original_df; where there is no match, id will be NaN.
You don't have to use any loops, just try out my code. This adds the id column to original_df for all rows that match; rows without a match will have id as NaN.
You can replace NaN with any value like below:
original_df = original_df.fillna("")
Let me know if this helps.
I have the below code that creates a dataframe:
import json

ratings = spark.createDataFrame(
    sc.textFile("myfile.json").map(lambda l: json.loads(l))
)
ratings.registerTempTable("mytable")
final_df = sqlContext.sql("select * from mytable")
The dataframe looks something like this (it has user_id and created_at columns, among others).
I'm storing the created_at and user_id values in lists:
user_id_list = final_df.select('user_id').rdd.flatMap(lambda x: x).collect()
created_at_list = final_df.select('created_at').rdd.flatMap(lambda x: x).collect()
and iterating through one of the lists to call another function:

for i in range(len(user_id_list)):
    status = get_status(user_id_list[i], created_at_list[i])
I want to create a new column in my dataframe called status and fill it with the value returned for each row's user_id and created_at.
I know I need to use this functionality, but I'm not sure how to proceed:

final_df.withColumn('status', 'give the condition here')
Don't create lists. Simply apply a UDF to the dataframe:
import pyspark.sql.functions as F

# the UDF receives the two column values directly
status_udf = F.udf(lambda uid, created: get_status(uid, created))
df = df.select(df.columns + [status_udf(F.col('user_id'),
                                        F.col('created_at')).alias('status')])
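Equivalently, since you mentioned withColumn (same assumptions as above):

final_df = final_df.withColumn('status',
                               status_udf(F.col('user_id'), F.col('created_at')))

Note that F.udf returns StringType by default; if get_status returns something else, pass the type explicitly, e.g. F.udf(get_status, IntegerType()) with from pyspark.sql.types import IntegerType.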