I have a dataframe df with three columns "id", "nodes_set", "description", where "nodes_set" is a list of strings.
I am trying to split it into groups based on the values of "nodes_set" as follows:
df_by_nodes_set = df.groupby('nodes_set')
list(df_by_nodes_set)
I think the problem lies in the fact that I am trying to use groupby with lists, but I am not sure how to deal with that.
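Indeed, calling groupby directly on a column whose values are lists usually fails because lists are unhashable (typically with TypeError: unhashable type: 'list').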
The question is a bit unclear, but if you need to group by a list column, the list can be converted into a hashable key, or you can simply concatenate its elements to get an id, like below:
import pandas as pd

df = pd.DataFrame([[i, list(range(i)), 'sample ' + str(i)] for i in range(5)], columns=["id", "nodes_set", "description"])
nodes_set_key = df['nodes_set'].apply(lambda x: '_'.join(map(str, x)))  # join list elements into a string key
df.groupby(nodes_set_key).last()
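Alternatively, a minimal sketch of the hashable-key idea, using the same df as above: converting each list to a tuple makes it hashable, so the tuples can serve directly as group keys:
nodes_set_tuples = df['nodes_set'].apply(tuple)  # tuples are hashable, lists are not
df.groupby(nodes_set_tuples).last()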
I've got a list of three dataframes that I want to filter based on the values in one column that all three of them share. I want to split each dataframe into one sub-dataframe per value in that column, so I want to make 9 dataframes out of 3.
I've tried:
df_list = [df_a, df_b, df_c]
for df_tmp in df_list:
    for i, g in df_tmp.groupby('COLUMN'):
        globals()[str(df_tmp) + str(i)] = g
But I get super weird results. Can someone help me fix that code?
Thanks!
This should give you a list of dictionaries: one dictionary for each of the original dataframes, each containing one dataframe per unique value in 'COLUMN', keyed by that value.
tables = [{'df_' + name: df[df['COLUMN'] == name].copy() for name in df['COLUMN'].unique()} for df in df_list]
So, for example, you can call tables[0] to get the three dataframes derived from df_a, or tables[0]['df_foo'] to get the table from df_a with all the rows with the value 'foo' in the column 'COLUMN'.
Or, if you want to use a dictionary to have all the df associated with keys instead of indexes in a list:
tables = {'df_' + str(i): {'df_' + name: df_list[i][df_list[i]['COLUMN'] == name].copy() for name in df_list[i]['COLUMN'].unique()} for i in range(len(df_list))}
and then you can call them as tables['df_0']['df_foo'].
You can of course create a list of names and use it to assign the keys:
df_names = ['df_a', 'df_b', 'df_c']
tables = {df_name: {'df_' + name: df[df['COLUMN'] == name].copy() for name in df['COLUMN'].unique()} for df_name, df in zip(df_names, df_list)}
And now you do tables['df_a']['df_foo'].
Let's say you choose to use one of the dictionaries and want to apply a single operation to all the dataframes, for example, let's say that each dataframe has a column called 'price' and you want to apply a function called get_discount(), then
for key1 in tables:  # top level corresponding to [df_a, df_b, df_c]
    for key2 in tables[key1]:  # bottom level corresponding to each filtered df
        tables[key1][key2]['price'] = tables[key1][key2]['price'].apply(get_discount)
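For reference, get_discount() is not defined above; a hypothetical version applying a flat 10% discount might look like this:
def get_discount(price):
    # Hypothetical example: apply a flat 10% discount to each price
    return price * 0.9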
My input Data Frame is
Below is my code for creating multiple columns from my single column of data: if the 'Name' column contains 'reporting', then 'reporting' should become a column name, with a value of 1 wherever 'reporting' appears in that row.
I am getting the correct output, but I want to write this code in a dynamic way. Is there another way to do this?
df['reporting']=pd.np.where((df['Name'].str.contains('reporting',regex=False)),1,0)
df['update']=pd.np.where((df['Name'].str.contains('update',regex=False)),1,0)
df['offer']=pd.np.where((df['Name'].str.contains('offer',regex=False)),1,0)
df['line']=pd.np.where((df['Name'].str.contains('line',regex=False)),1,0)
Use Series.str.findall to extract all matching values from the list, building a pattern with \b...\b for word boundaries; then join the matches by | and pass them to Series.str.get_dummies:
L = ["reporting","update","offer","line"]
pat = '|'.join(r"\b{}\b".format(x) for x in L)
df = df.join(df['Name'].str.findall(pat).str.join('|').str.get_dummies())
Or process each column separately; np.where is not necessary here, just convert True/False to 1/0 with Series.astype or Series.view:
for c in L:
    df[c] = df['Name'].str.contains(c, regex=False).astype(int)

for c in L:
    df[c] = df['Name'].str.contains(c, regex=False).view('i1')
Make a list of keywords, iterate the list and create new columns?
import numpy as np  # pd.np is deprecated; use numpy directly

keywords = ["reporting", "update", "offer", "line"]
for word in keywords:
    df[word] = np.where(df['Name'].str.contains(word, regex=False), 1, 0)
I know I can use isin to filter a dataframe. My question is that, when I'm doing this for many dataframes, the code looks a bit repetitive.
For example, below is how I filter some datasets to keep only the rows belonging to a specific set of users.
## filter data
df_order_filled = df_order_filled[df_order_filled.user_id.isin(df_user.user_id)]
df_liquidate_order = df_liquidate_order[df_liquidate_order.user_id.isin(df_user.user_id)]
df_fee_discount_ = df_fee_discount_[df_fee_discount_.user_id.isin(df_user.user_id)]
df_dep_wit = df_dep_wit[df_dep_wit.user_id.isin(df_user.user_id)]
The name of the dataframe is repeated 3 times for each df, which is kind of unnecessary.
How can I simplify my code?
Thanks!
Use a list comprehension with a list of DataFrames:
dfs = [df_order_filled, df_liquidate_order, df_fee_discount_, df_dep_wit]
dfs1 = [x[x.user_id.isin(df_user.user_id)] for x in dfs]
Output is another list with filtered DataFrames.
Another similar idea is to use a dictionary:
dict1 = {'df_order_filled': df_order_filled,
         'df_liquidate_order': df_liquidate_order,
         'df_fee_discount_': df_fee_discount_,
         'df_dep_wit': df_dep_wit}

dict2 = {k: x[x.user_id.isin(df_user.user_id)] for k, x in dict1.items()}
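You can then access each filtered frame by name, e.g. dict2['df_order_filled'].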
I have a problem with a for loop, like below:
list = [1,2,3,4]
for index in list:
    new_df_name = "user_" + index
    new_df_name = origin_df1.join(origin_df2, 'id', 'left')
but the "new_df_name" is just a Variable and String type.
how to realize these?
I assume what you really need is to have a list of dataframes (which do not necessarily need any specific names) and then union them all together.
dataframes = [df1, df2, df3, etc... ]
res_df, tail_dfs = dataframes[0], dataframes[1:]
for df in tail_dfs:
    res_df = res_df.unionAll(df)
Update: an even better option for the union is described in the comments.
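For reference, a compact way to write that union is to fold the list with functools.reduce (a sketch; not necessarily the exact option mentioned in the comments):
from functools import reduce

res_df = reduce(lambda a, b: a.unionAll(b), dataframes)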
I have a data frame in pyspark with more than 100 columns. For all the column names, I would like to add backticks (`) at the start and end of each column name.
For example:
column name is testing user. I want `testing user`
Is there a method to do this in pyspark/python? When we apply the code, it should return a data frame.
Use a list comprehension in Python.
from pyspark.sql import functions as F
df = ...
df_new = df.select([F.col(c).alias("`"+c+"`") for c in df.columns])
This method also gives you the option to add custom python logic within the alias() function like: "prefix_"+c+"_suffix" if c in list_of_cols_to_change else c
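For example, a minimal sketch of that conditional renaming (list_of_cols_to_change and the column names in it are hypothetical):
list_of_cols_to_change = ["user_id", "amount"]  # hypothetical columns to rename
df_new = df.select([F.col(c).alias("prefix_" + c + "_suffix") if c in list_of_cols_to_change else F.col(c) for c in df.columns])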
To add prefix or suffix:
Refer to df.columns for the list of columns ([col_1, col_2...]) of the dataframe whose columns we want to prefix/suffix.
df.columns
Iterate through the above list and create another list of columns with aliases that can be used inside a select expression.
from pyspark.sql.functions import col
select_list = [col(col_name).alias("prefix_" + col_name) for col_name in df.columns]
When using it inside select, do not forget to unpack the list with an asterisk (*). We can assign the result back to the same or a different df for further use.
df.select(*select_list).show()
df = df.select(*select_list)
df.columns will now return the list of new (aliased) columns.
If you would like to add a prefix or suffix to multiple columns in a pyspark dataframe, you could use a for loop and .withColumnRenamed().
As an example, you might like:
def add_prefix(sdf, prefix):
    for c in sdf.columns:
        sdf = sdf.withColumnRenamed(c, '{}{}'.format(prefix, c))
    return sdf
You can amend sdf.columns as you see fit.
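Usage would then look something like this (a sketch, reusing the add_prefix function defined above):
sdf = add_prefix(sdf, "prefix_")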
You can use the withColumnRenamed method of the dataframe to create a new dataframe:
df.withColumnRenamed('testing user', '`testing user`')
Edit: suppose you have a list of columns, you can do something like this:
old = "First Last Age"
new = ["`"+field+"`" for field in old.split()]
df.rdd.toDF(new)
Output:
DataFrame[`First`: string, `Last`: string, `Age`: string]
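If you would rather stay in the DataFrame API instead of converting to an RDD, a sketch of the same renaming with toDF (assuming every column should be wrapped in backticks):
df_backticked = df.toDF(*["`" + c + "`" for c in df.columns])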
Here is how one can solve similar problems:
from pyspark.sql.functions import col

df.select([col(col_name).alias('prefix' + col_name + 'suffix') for col_name in df.columns])
I had a dataframe that I duplicated and then joined with itself. Since both copies had the same column names, I used:
from functools import reduce

df = reduce(lambda df, idx: df.withColumnRenamed(list(df.schema.names)[idx],
                                                 list(df.schema.names)[idx] + '_prec'),
            range(len(list(df.schema.names))),
            df)
Every column in my dataframe then had the '_prec' suffix, which allowed me to do sweet stuff.
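A shorter way to get the same '_prec' suffix on every column (a sketch, not the exact code used above) is a single select with aliases:
from pyspark.sql import functions as F

df = df.select([F.col(c).alias(c + "_prec") for c in df.columns])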