create json dynamic with group by column name from dataframe - python

I am trying to create datasets from the column names of a dataframe, where the columns are ['NAME1', 'EMAIL1', 'NAME2', 'EMAIL2', 'NAME3', 'EMAIL3', etc.].
I'm trying to split the dataframe based on the 'EMAIL' columns through a loop, but it's not working properly.
I need it to be a JSON, because there is the possibility that between each 'EMAILn' column there may be a difference in number of columns.
This is my input:
I need this:
This is my code:
for i in df_entities.filter(regex=('^(EMAIL)' + str(i))):
    df_groups = df_temp_1.groupby(i)
    df_detail = df_groups.get_group(i)
    display(df_detail)
What do you recommend I do?
Thank you very much in advance.
Regards

filter returns a copy of your dataframe with only the matching columns, but you're trying to loop over just the column names. Just add .columns:
for i in df_entities.filter(regex=('^(EMAIL)' + str(i))).columns:
...                                                      # ^^^^^^^^ important
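For completeness, a fuller sketch of that loop (note the original regex concatenates str(i) before i is defined; matching every EMAILn column at once with \d+ avoids that, and groupby groups are keyed by cell values, not by the column name passed to get_group):
for col in df_entities.filter(regex=r'^EMAIL\d+$').columns:
    for value, group in df_entities.groupby(col):
        display(group)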

From your input and desired output, simply call pandas.wide_to_long:
long_df = pd.wide_to_long(
    df_entities.reset_index(),
    stubnames=["NAME", "EMAIL"],
    i="index",
    j="version"
)
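Since the goal is JSON, the long frame can then be serialized, e.g. (a sketch; orient="records" yields one JSON object per NAME/EMAIL pair):
result = long_df.reset_index().to_json(orient="records")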

Related

append new columns with elements of another columns dataframe Python

[screenshot of dataframe]
I have a dataframe with multiple columns. One of these contains names of French suppliers like "Intermarché", "Carrefour", "Leclerc" (as you can see in the framed column on the attached screenshot). Unfortunately, the names are typed by hand and are not standardized at all. From the "Distributeurs" column, I would like to create a new column with the names unified into a list per cell, so that I can then use the .explode() function and get one product and one distributor per row. I would like to keep a selection of about 30 suppliers and put 'other suppliers' for the rest. I feel like I have to use regular expressions and loops, but I'm totally lost. Could someone help me? Thanks
I tried this, but I'm lost:
df['test'] = ''
distrib_list = ["Leclerc", "Monoprix", 'Franprix', 'Intermarché', 'Carrefour', 'Lidl', 'Proxi', 'Grand Frais', 'Fresh', 'Cora', 'Casino', "Relais d'Or", 'Biocoop', 'Métro', 'Match', 'Super U', 'Aldi', 'Spar', 'Colruyt', 'Auchan']
for n in df['Distributeurs']:
    if n in distrib_list:
        df['test'].append
You'll need to first split your Distributeurs on the comma by doing something along the lines of df['Distributeurs'].str.split(','). Once that is done you can iterate over the rows of your dataframe, getting the idx and the row in question. Then you can iterate over your split Distributeurs cell. I also make it case-insensitive (for unicode you might need to add things to this if statement).
Then you can create a new dataframe with this information (and add whatever other information you wish) by first creating a list and transforming it into a dataframe. This can be accomplished with something along the lines of:
import pandas as pd

test = []
distList = ['name1', 'name2', 'name3']  # lowercase, for case-insensitive matching
data = [['Product1', ['Name1', 'Name2']],
        ['Product2', ['Name1', 'Name2', 'Name3']],
        ['Product3', ['Name4', 'Name5']]]
df = pd.DataFrame(data, columns=['Product', 'Distributor'])

for idx, x in df.iterrows():
    for i in range(len(x['Distributor'])):
        if x['Distributor'][i].lower() in distList:
            test.append({
                'Product': df['Product'][idx],
                'Distributor': x['Distributor'][i]
            })

test_df = pd.DataFrame(test)
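Alternatively, a vectorized sketch using str.split and .explode(), as the question suggests (the 'Distributeurs' column name, distrib_list, and the 'other suppliers' bucket come from the question; the sample data and the 'Produit' column are made up):
import pandas as pd

df = pd.DataFrame({
    'Produit': ['P1', 'P2'],  # hypothetical product column
    'Distributeurs': ['Carrefour, Leclerc', 'Lidl, Inconnu']
})
distrib_list = ['Leclerc', 'Carrefour', 'Lidl']

# split the hand-typed string into a list, one distributor per element
df['Distributeurs'] = df['Distributeurs'].str.split(',').apply(
    lambda names: [n.strip() for n in names])

# one row per (product, distributor) pair
exploded = df.explode('Distributeurs')

# map anything outside the known list to 'other suppliers'
exploded.loc[~exploded['Distributeurs'].isin(distrib_list), 'Distributeurs'] = 'other suppliers'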

Splitting dictionary inside a Pandas Column into Separate Columns

I have data saved in a csv. I am querying this data using Python and turning it into a Pandas DataFrame. I have a column called team_members in my dataframe. It has a dictionary of values. The column looks like this when called:
dt.team_members[1]
Output:
"[{'name': 'LearnFromScratch', 'tier': 'novice tier'}, {'name': 'Scratch', 'tier': 'novice tier'}]"
I looked at this explanation and other similar ones:
Splitting multiple Dictionaries within a Pandas Column
But they did not work.
I want to get a column called name with the names of the members of the team and another with tier of each member.
Can you help me?
Thanks!!
I assume the output of dt.team_members[1] is a list.
If so, you can directly pass that list to create a dataframe, something like:
pd.DataFrame(dt.team_members[1])
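Note that the quotes around the output in the question suggest the cell is actually a string, not a list; a sketch that handles that case by parsing it first (ast.literal_eval safely evaluates the string representation):
import ast
import pandas as pd

cell = "[{'name': 'LearnFromScratch', 'tier': 'novice tier'}, {'name': 'Scratch', 'tier': 'novice tier'}]"
team = ast.literal_eval(cell)  # now a real list of dicts
df = pd.DataFrame(team)        # columns: name, tier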
You can extract name by mapping over it:
list(map(lambda x: x.get("name"), dt.team_members[1]))
If you need a new dataframe, then follow @vivek's answer:
pd.DataFrame(dt.team_members[1])
You can try this -
members = dt.team_members[1]  # renamed so we don't shadow the built-in list
li = []
for x in range(len(members)):
    t = {members[x]['name']: members[x]['tier']}
    li.append(t)
print(li)
Output is -
[{'LearnFromScratch': 'novice tier'}, {'Scratch': 'novice tier'}]
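The same result as a list comprehension one-liner, for reference (using the members list from above):
li = [{m['name']: m['tier']} for m in members]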

Using pd.DataFrame.replace with an apply function as the replace value

I have several dataframes that have, mixed into some columns, dates in this ASP.NET format "/Date(1239018869048)/". I've figured out how to parse this into Python's datetime format for a given column. However, I would like to put this logic into a function so that I can pass it any dataframe and have it replace all the dates it finds that match a regex, using pd.DataFrame.replace.
something like:
def pretty_dates():
    # Messy logic here

df.replace(to_replace=r'\/Date(d+)', value=pretty_dates(df), regex=True)
The problem with this is that the df being passed to pretty_dates is the whole dataframe, not just the cell that needs to be replaced.
So the concept I'm trying to figure out is whether the value to replace with, when using df.replace, can be a function instead of a static value.
Thank you so much in advance.
EDIT
To try to add some clarity: I have many columns in my dataframe, over a hundred of which contain this date format. I would like to avoid listing out every single column that has a date. Is there a way to apply the function to clean my dates across all the columns in my dataset? So I do not want to clean 1 column but all the hundreds of columns of my dataframe.
I'm sure you can use regex to do this in one step, but here is how to apply it to the whole column at once (note that recent pandas requires regex=True for pattern replacement in str.replace):
s = pd.Series(['/Date(1239018869048)/',
               '/Date(1239018869048)/'], dtype=str)
s = s.str.replace(r'\/Date\(', '', regex=True)
s = s.str.replace(r'\)\/', '', regex=True)
print(s)
0 1239018869048
1 1239018869048
dtype: object
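To go one step further and turn the extracted value into an actual datetime (a sketch; the /Date(...)/ payload is Unix time in milliseconds):
import pandas as pd

s = pd.Series(['/Date(1239018869048)/'])
ms = s.str.extract(r'/Date\((\d+)\)/')[0].astype('int64')  # capture the digits
dates = pd.to_datetime(ms, unit='ms')
print(dates)  # 0   2009-04-06 11:54:29.048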
As far as I understand, you need to apply a custom function to selected cells in a specified column. I hope the following example helps you:
import pandas as pd

df = pd.DataFrame({'x': ['one', 'two', 'three']})
selection = df.x.str.contains('t', regex=True)  # put your regexp here
df.loc[selection, 'x'] = df.loc[selection, 'x'].map(lambda x: x + x)  # do some logic instead
You can apply this procedure to all columns of the df in a loop:
for col in df.columns:
    selection = df.loc[:, col].str.contains('t', regex=True)  # put your regexp here
    df.loc[selection, col] = df.loc[selection, col].map(lambda x: x + x)  # do some logic instead
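Putting the pieces together for the original question, a sketch of a pretty_dates-style function that scans every column and converts matching cells (the function name comes from the question; this assumes the date cells are strings and pandas >= 1.1 for str.fullmatch):
import pandas as pd

DATE_RE = r'/Date\((\d+)\)/'

def pretty_dates(df):
    # Convert every /Date(ms)/ cell, in any column, to a pandas Timestamp
    df = df.copy()
    for col in df.columns:
        if df[col].dtype != object:
            continue  # only object columns can hold the ASP.NET date strings
        mask = df[col].astype(str).str.fullmatch(DATE_RE)
        if mask.any():
            ms = df.loc[mask, col].astype(str).str.extract(DATE_RE)[0].astype('int64')
            df.loc[mask, col] = pd.to_datetime(ms, unit='ms')
    return df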

How to add suffix and prefix to all columns in python/pyspark dataframe

I have a data frame in pyspark with more than 100 columns. What I want to do is, for all the column names, add backticks (`) at the start and end of each column name.
For example:
The column name is testing user; I want `testing user`.
Is there a method to do this in pyspark/python? When we apply the code it should return a data frame.
Use a list comprehension in Python.
from pyspark.sql import functions as F

df = ...
df_new = df.select([F.col(c).alias("`" + c + "`") for c in df.columns])
This method also gives you the option to add custom python logic within the alias() function like: "prefix_"+c+"_suffix" if c in list_of_cols_to_change else c
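For instance, a sketch of that conditional alias (list_of_cols_to_change is the hypothetical list named above):
from pyspark.sql import functions as F

list_of_cols_to_change = ['testing user', 'age']  # hypothetical names
df_new = df.select([
    F.col(c).alias("prefix_" + c + "_suffix" if c in list_of_cols_to_change else c)
    for c in df.columns
])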
To add a prefix or suffix:
1. Refer to df.columns for the list of columns ([col_1, col_2...]). This is the dataframe whose columns we want to suffix/prefix.
df.columns
2. Iterate through the above list and create another list of columns with aliases that can be used inside a select expression.
from pyspark.sql.functions import col
select_list = [col(col_name).alias("prefix_" + col_name) for col_name in df.columns]
3. When using it inside select, do not forget to unpack the list with an asterisk (*). We can assign the result back to the same or a different df.
df.select(*select_list).show()
df = df.select(*select_list)
df.columns will now return the list of new (aliased) columns.
If you would like to add a prefix or suffix to multiple columns in a pyspark dataframe, you could use a for loop and .withColumnRenamed().
As an example, you might like:
def add_prefix(sdf, prefix):
    for c in sdf.columns:
        sdf = sdf.withColumnRenamed(c, '{}{}'.format(prefix, c))
    return sdf
You can amend sdf.columns as you see fit.
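For example (assuming df is an existing Spark DataFrame):
df_prefixed = add_prefix(df, 'pre_')
print(df_prefixed.columns)  # e.g. ['pre_col1', 'pre_col2', ...]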
You can use the withColumnRenamed method of the dataframe to create a new dataframe:
df.withColumnRenamed('testing user', '`testing user`')
Edit: suppose you have a list of columns; you can do it like this -
old = "First Last Age"
new = ["`" + field + "`" for field in old.split()]
df.rdd.toDF(new)
Output:
DataFrame[`First`: string, `Last`: string, `Age`: string]
Here is how one can solve similar problems:
from pyspark.sql.functions import col
df.select([col(col_name).alias('prefix' + col_name + 'suffix') for col_name in df.columns])
I had a dataframe that I duplicated twice then joined together. Since both had the same column names, I used:
from functools import reduce

df = reduce(lambda df, idx: df.withColumnRenamed(list(df.schema.names)[idx],
                                                 list(df.schema.names)[idx] + '_prec'),
            range(len(list(df.schema.names))),
            df)
Every column in my dataframe then had the '_prec' suffix, which allowed me to do sweet stuff.
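For reference, the same bulk rename can be done more simply with DataFrame.toDF, which replaces all column names at once (a sketch):
df = df.toDF(*[c + '_prec' for c in df.columns])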

Create a subset of a DataFrame depending on column name

I have a pandas DataFrame called timedata with different column names, some of which contain the word Vibration, some eccentricity. Is it possible to create a dataframe of just the columns containing the word Vibration?
I have tried using
vib = []
for i in timedata:
    if 'Vibration' in i:
        vib.append(i)  # note: `vib = vib.append(i)` would set vib to None
to then create a DataFrame based on the indices of these columns. This really does not seem like the most efficient way to do it, and I'm sure there must be something simple to do with list comprehension.
EDIT
Dataframe of form:
from numpy.random import randn
from pandas import DataFrame
df = DataFrame({'Ch 1:Load': randn(10), 'Ch 2:Vibration Brg 1T ': randn(10), 'Ch 3:Eccentricity Brg 1H ': randn(10), 'Ch 4:Vibration Brg 2T ': randn(10)})
Sorry, I'm having a slow day! Thanks for any help.
Something like this to manually select all columns with the word "Vibration" in it:
df[[col for col in df.columns if "Vibration" in col]]
You can also do the same with the filter method:
df.filter(like="Vibration")
If you want to do a more flexible filter, you can use the regex option. E.g. to look if "Vibration" or "Ecc" is in the column name:
df.filter(regex='Ecc|Vibration')
newDf = Df.loc[:, ['Vibration']]
or
newDf = Df.loc[:, ['Vibration', 'eccentricity']]
to get more columns (note this only works if the columns are literally named 'Vibration' and 'eccentricity'; for partial matches use the filter approaches above).
To search for a value in a column:
newDf = Df[Df["ColumnName"] == "vibration"]
