Splitting dictionary inside a Pandas Column into Separate Columns - python

I have data saved in a CSV file. I am querying this data using Python and turning it into a Pandas DataFrame. I have a column called team_members in my dataframe. Each cell holds a list of dictionaries. The column looks like this when called:
dt.team_members[1]
Output:
"[{'name': 'LearnFromScratch', 'tier': 'novice tier'}, {'name': 'Scratch', 'tier': 'novice tier'}]"
I tried to follow this explanation and other similar ones:
Splitting multiple Dictionaries within a Pandas Column
But they do not work.
I want to get a column called name with the names of the members of the team and another with tier of each member.
Can you help me?
Thanks!!

I assume the output of dt.team_members[1] is a list.
If so, you can pass that list directly to the DataFrame constructor, something like:
pd.DataFrame(dt.team_members[1])

You can extract the names by mapping over the list:
list(map(lambda x: x.get("name"), dt.team_members[1]))
If you need a new dataframe, follow @vivek's answer:
pd.DataFrame(dt.team_members[1])

You can try this:
members = dt.team_members[1]
li = []
for member in members:
    # map each member's name to their tier
    li.append({member['name']: member['tier']})
print(li)
The output is:
[{'LearnFromScratch': 'novice tier'}, {'Scratch': 'novice tier'}]
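Note that the output shown in the question is wrapped in double quotes, which suggests each cell is a string read from the CSV rather than a real list. A minimal sketch under that assumption: parse each cell with ast.literal_eval, then explode the lists so each member gets its own row with name and tier columns. The sample data and the team_id column are made up for illustration.

```python
import ast

import pandas as pd

# Hypothetical sample data: each cell is the *string* form of a list of dicts,
# which is what you typically get after a round trip through a CSV file.
dt = pd.DataFrame({
    "team_id": [1, 2],
    "team_members": [
        "[{'name': 'Solo', 'tier': 'expert tier'}]",
        "[{'name': 'LearnFromScratch', 'tier': 'novice tier'}, "
        "{'name': 'Scratch', 'tier': 'novice tier'}]",
    ],
})

# Turn each string into a real Python list of dicts.
parsed = dt["team_members"].apply(ast.literal_eval)

# One row per team member, with a fresh 0..n-1 index.
exploded = dt.assign(team_members=parsed).explode("team_members").reset_index(drop=True)

# Expand each dict into its own columns (name, tier).
members = pd.DataFrame(exploded["team_members"].tolist())

result = pd.concat([exploded.drop(columns="team_members"), members], axis=1)
print(result)
```

If the column already holds real lists, the literal_eval step can be skipped and the rest should work unchanged.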

Related

append new columns with elements of another columns dataframe Python

screenshot of dataframe
I have a dataframe with multiple columns. One of these contains names of French suppliers like "Intermarché", "Carrefour", "Leclerc" (as you can see in the framed column on the attached screenshot). Unfortunately, the names are typed by hand and are not standardized at all. From the "Distributeurs" column, I would like to create a new column with the unified names as a list in each cell, so that I can then use the function .explode() and get one product and one distributor per row. I would like to select about 30 suppliers and put 'other suppliers' for the rest. I feel like I have to use regular expressions and loops, but I'm totally lost. Could someone help me? Thanks
I tried this, but I'm lost:
df['test']=''
distrib_list=["Leclerc","Monoprix",'Franprix','Intermarché','Carrefour','Lidl','Proxi','Grand Frais','Fresh','Cora','Casino',"Relais d'Or",'Biocoop','Métro','Match','Super U','Aldi','Spar','Colruyt','Auchan']
for n in df['Distributeurs']:
    if n in distrib_list:
        df['test'].append
You'll first need to split your Distributeurs column on the comma, with something along the lines of df['Distributeurs'].str.split(','). Once that is done, you can iterate over the rows of your dataframe, getting the idx and the row in question. Then you can iterate over the split Distributeurs cell. I also make the comparison case-insensitive (for Unicode you might need to add things to this if statement).
Then you can create a new dataframe with this information (and add whatever other information you wish) by creating first a list and transforming it into a dataframe. This can be accomplished with something on the lines of:
import pandas as pd

test = []
distList = ['name1', 'name2', 'name3']
data = [['Product1', ['Name1', 'Name2']], ['Product2', ['Name1', 'Name2', 'Name3']], ['Product3', ['Name4', 'Name5']]]
df = pd.DataFrame(data, columns=['Product', 'Distributor'])
for idx, row in df.iterrows():
    for name in row['Distributor']:
        # case-insensitive match against the known distributors
        if name.lower() in distList:
            test.append({'Product': row['Product'], 'Distributor': name})
test_df = pd.DataFrame(test)
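The same idea can be sketched without an explicit row loop. This is only an illustration under some assumptions: the cells are comma-separated strings, the sample products and the "Chez Paul" entry are made up, and matching is exact apart from case and surrounding whitespace. Hand-typed variants such as "E.Leclerc" would still need regular expressions or fuzzy matching on top of this.

```python
import pandas as pd

# Shortened list of known suppliers (the question's list has about 20 entries).
known = ["Leclerc", "Carrefour", "Intermarché"]
# Lowercased name -> canonical spelling, for case-insensitive lookup.
lookup = {name.lower(): name for name in known}

# Hypothetical sample data standing in for the real dataframe.
df = pd.DataFrame({
    "Product": ["P1", "P2"],
    "Distributeurs": ["leclerc, CARREFOUR", "Intermarché,Chez Paul"],
})

def standardize(cell):
    """Split a comma-separated cell and map each name to its canonical form."""
    names = [part.strip() for part in cell.split(",")]
    return [lookup.get(name.lower(), "other suppliers") for name in names]

# One list of standardized names per cell...
df["Distributeurs_clean"] = df["Distributeurs"].apply(standardize)
# ...then one product/distributor pair per row.
one_per_row = df.explode("Distributeurs_clean").reset_index(drop=True)
print(one_per_row[["Product", "Distributeurs_clean"]])
```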

create json dynamic with group by column name from dataframe

I am trying to create datasets from the column names of a dataframe, where I have the columns ['NAME1', 'EMAIL1', 'NAME2', 'EMAIL2', 'NAME3', 'EMAIL3', etc].
I'm trying to split the dataframe on each 'EMAIL' column using a loop, but it's not working properly.
I need it to be a JSON, because there is the possibility that between each 'EMAILn' column there may be a difference in number of columns.
This is my input:
I need this:
This is my code:
for i in df_entities.filter(regex=('^(EMAIL)' + str(i))).columns:
    df_groups = df_temp_1.groupby(i)
    df_detail = df_groups.get_group(i)
    display(df_detail)
What do you recommend I do?
Thank you very much in advance.
Regards
filter returns a copy of your dataframe with only the matching columns, but you're trying to loop over just the column names. Just add .columns:
for i in df_entities.filter(regex=('^(EMAIL)' + str(i))).columns:
    ...  # the trailing .columns is the important part
From your input and desired output, simply call pandas.wide_to_long:
long_df = pd.wide_to_long(
    df_entities.reset_index(),
    stubnames=["NAME", "EMAIL"],
    i="index",
    j="version"
)
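Since the input and desired output are missing from the question, here is a small sketch of what that call does on made-up data (the names and e-mail addresses are assumptions). Each NAMEn/EMAILn pair becomes one row, keyed by the original row index and the numeric suffix, and the result can then be serialized to JSON.

```python
import json

import pandas as pd

# Hypothetical input with two NAME/EMAIL pairs per row.
df_entities = pd.DataFrame({
    "NAME1": ["Ann", "Bob"],
    "EMAIL1": ["ann@example.com", "bob@example.com"],
    "NAME2": ["Cid", "Dee"],
    "EMAIL2": ["cid@example.com", "dee@example.com"],
})

# Reshape: the numeric suffix of each column ends up in the "version" level.
long_df = pd.wide_to_long(
    df_entities.reset_index(),
    stubnames=["NAME", "EMAIL"],
    i="index",
    j="version",
)

# Flatten the (index, version) MultiIndex and serialize one record per row.
records_json = long_df.reset_index().to_json(orient="records")
records = json.loads(records_json)
print(records)
```

From the long format, grouping by "version" gives one dataset per EMAILn group if separate JSON documents are needed.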

Why does `list(<pd.DataFrame>)` return a list of column names?

Let's say df is a typical pandas.DataFrame instance, I am trying to understand how come list(df) would return a list of column names.
The goal here is for me to track it down in the source code to understand how list(<pd.DataFrame>) returns a list of column names.
So far, the best resources I've found are the following:
Get a list from Pandas DataFrame column headers
Summary: There are multiple ways of getting a list of DataFrame column names, and each varies either in performance or idiomatic convention.
SO Answer
Summary: DataFrame follows a dict-like convention, thus coercing with list() would return a list of the keys of this dict-like structure.
pandas.DataFrame source code:
I can't find anything in the source code that shows how list() would create a list of column names.
DataFrames are iterable. That's why you can pass them to the list constructor.
list(df) is equivalent to [c for c in df]. In both cases, DataFrame.__iter__ is called.
When you iterate over a DataFrame, you get the column names.
Why? Because the developers probably thought this is a nice thing to have.
Looking at the source, __iter__ returns an iterator over the attribute _info_axis, which seems to be the internal name of the columns.
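A short sketch of that behaviour: list(), a list comprehension and df.columns all give the same names, and any class whose __iter__ yields its keys behaves the same way under list(). The ColumnsFirst class is a made-up toy for illustration, not how pandas is implemented.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})

# list() and a comprehension both go through DataFrame.__iter__.
from_list = list(df)
from_iter = [c for c in df]
from_cols = df.columns.tolist()

class ColumnsFirst:
    """Toy container (hypothetical) whose iteration yields its keys."""

    def __init__(self, data):
        self._data = data

    def __iter__(self):
        # list(obj) calls this method and collects whatever it yields.
        return iter(self._data.keys())

toy = list(ColumnsFirst({"a": [1], "b": [2]}))
print(from_list, toy)
```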
As you correctly stated in your question, one can think of a pandas dataframe as a list of lists (or, more correctly, a dict-like object).
Take a look at this code, which takes a dict and turns it into a dataframe.
import pandas as pd
# create a dataframe
d = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}
df = pd.DataFrame(d)
print(df)
x = list(df)
print(x)
x = list(d)
print(x)
The result in both cases (for the dataframe df and the dict d) is this:
['col1', 'col2']
['col1', 'col2']
This result confirms your thinking that a "DataFrame follows a dict-like convention".

How to replace a list of longer column names with small names in a dataset?

I have a list of column names that are so long they clutter the code. I want to replace that list of column names with a list of shorter names.
Here's an example of columns:
non_multi_choice = ["what_is_the_highest_level_of_formal_education_that_you_have_attained_or_plan_to_attain_within_the_next_2_years",
    "select_the_title_most_similar_to_your_current_role_or_most_recent_title_if_retired_selected_choice",
    "in_what_industry_is_your_current_employer_contract_or_your_most_recent_employer_if_retired_selected_choice",
    "how_many_years_of_experience_do_you_have_in_your_current_role",
    "what_is_your_current_yearly_compensation_approximate_usd",
    "does_your_current_employer_incorporate_machine_learning_methods_into_their_business",
    "of_the_choices_that_you_selected_in_the_previous_question_which_ml_library_have_you_used_the_most_selected_choice",
    "approximately_what_percent_of_your_time_at_work_or_school_is_spent_actively_coding"]
shorter_names = ["highest_level_of_formal_education",
    "job_title",
    "current_industry",
    "years_of_experience",
    "yearly_compensation",
    "does_your_current_employer_incorporate_machine_learning_methods_into_their_business",
    "which_ml_library_have_you_used_the_most_selected_choice",
    "what_percent_of_your_time_at_work_is_spent_actively_coding"]
I want to replace each name in the first list with the names in second list corresponding to it.
How about something like this:
for index, item in enumerate(non_multi_choice):
    non_multi_choice[index] = shorter_names[index]
If those lists cover all the column names, in the same order as df.columns, then:
df.columns = shorter_names
If not:
df = df.rename(columns={old: new for old, new in zip(non_multi_choice, shorter_names)})
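A minimal sketch of the rename approach on a made-up dataframe (the data values and the extra country column are assumptions). Columns that are not in the mapping are left untouched, and rename returns a new dataframe rather than changing the original in place.

```python
import pandas as pd

long_names = [
    "how_many_years_of_experience_do_you_have_in_your_current_role",
    "what_is_your_current_yearly_compensation_approximate_usd",
]
short_names = ["years_of_experience", "yearly_compensation"]

# Hypothetical data: two long column names plus one column we do not rename.
df = pd.DataFrame({
    long_names[0]: [3],
    long_names[1]: [50000],
    "country": ["FR"],
})

# Pair each long name with its short replacement.
mapping = dict(zip(long_names, short_names))

# rename returns a new dataframe; df itself keeps the long names.
renamed = df.rename(columns=mapping)
print(list(renamed.columns))
```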

How to obtain output files labeled with dictionary keys

I am a python/pandas user and I have multiple dataframes like df1, df2, df3, ...
I want to name them A, B, C, ..., so I wrote the following.
df_dict = {"A":df1, "B":df2,'C':df3,....}
Each dataframe has "Price" column and I want to know the output from the following formula.
frequency=df.groupby("Price").size()/len(df)
I made the following definition and want to obtain outputs from each dataframe.
def Price_frequency(df,keys=["Price"]):
    frequency=df.groupby(keys).size()/len(df)
    return frequency.reset_index().to_csv("Output_%s.txt" %(df),sep='\t')
As a first trial, I did
Price_frequency(df1,keys=["Price"])
but this did not work. It seems %s is wrong.
Ideally, I want output files named as "Output_A.txt", "Output_B.txt"...
If you could help me, I would be grateful for that very much.
A couple of points:
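%s formats its argument with str(), so passing the dataframe itself inserts its printed contents into the file name, not its name. Pass the name as a string instead. In Python 3.6+ you can also use formatted string literals, which you may find more readable.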
Your function doesn't need to return anything here. You are using it to output csv files in a loop. Don't feel the need to add a return statement if it doesn't serve a purpose.
So you can do the following:
def price_frequency(df_dict, df_name, keys=['Price']):
    frequency = df_dict[df_name].groupby(keys).size() / len(df_dict[df_name].index)
    frequency.reset_index().to_csv(f'Output_{df_name}.txt', sep='\t')

df_dict = {'A': df1, 'B': df2, 'C': df3}
# loop over the dictionary keys, which are the names used in the file names
for df_name in df_dict:
    price_frequency(df_dict, df_name, keys=['Price'])
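For reference, a self-contained sketch of the same loop on made-up data. The sample prices, the temporary output directory and the "frequency" column name are assumptions for illustration; in real use you would write to your own folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical dataframes keyed by the name wanted in each file name.
df_dict = {
    "A": pd.DataFrame({"Price": [10, 10, 20, 30]}),
    "B": pd.DataFrame({"Price": [5, 5, 5, 7]}),
}

# Write into a throwaway directory so the example leaves no files behind
# in the working directory.
out_dir = Path(tempfile.mkdtemp())
written = []

for name, frame in df_dict.items():
    # Share of rows at each price.
    frequency = frame.groupby("Price").size() / len(frame)
    path = out_dir / f"Output_{name}.txt"
    frequency.rename("frequency").reset_index().to_csv(path, sep="\t", index=False)
    written.append(path.name)

print(written)
```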
Iterating through the columns will produce one output file per column:
def Price_frequency(df):
    for col in df.columns[2:]:
        frequency = df.groupby(col).size() / len(df)
        # no return here, otherwise the loop would stop after the first column
        frequency.reset_index().to_csv("Output_%s.txt" % col, sep='\t')
Reference: Pandas: Iterate through columns and starting at one column
Note: haven't gotten to test this yet
