I have a problem with a for loop, like the one below:
ids = [1, 2, 3, 4]
for index in ids:
    new_df_name = "user_" + str(index)
    new_df_name = origin_df1.join(origin_df2, 'id', 'left')
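But new_df_name is just a variable holding a string, so the second assignment simply overwrites it instead of creating a dataframe called user_1, user_2, and so on.
How can I achieve this?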
I assume what you really need is a list of dataframes (which do not necessarily need specific names) that you then union together.
dataframes = [df1, df2, df3, etc... ]
res_df, tail_dfs = dataframes[0], dataframes[1:]
for df in tail_dfs:
    res_df = res_df.unionAll(df)
Update: an even better way to do the union is described in the comments.
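One common way to avoid the explicit loop is to fold the list with functools.reduce. This is only a sketch, and not necessarily the option the comment above refers to. The helper name union_all is made up for illustration; it assumes every element exposes a union method, as PySpark DataFrames do (DataFrame.union), and that all dataframes share the same schema.

```python
from functools import reduce


def union_all(frames):
    """Union a non-empty sequence of DataFrames, left to right.

    Hypothetical helper: it only assumes each element has a
    .union(other) method, as pyspark.sql.DataFrame does.
    """
    frames = list(frames)
    if not frames:
        raise ValueError("union_all needs at least one DataFrame")
    # reduce unions the first two frames, then unions that result
    # with the third frame, and so on.
    return reduce(lambda left, right: left.union(right), frames)


# Usage, assuming df1, df2 and df3 are existing Spark DataFrames:
# res_df = union_all([df1, df2, df3])
```

Note that union matches columns by position, not by name, so the dataframes should list their columns in the same order.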
I would like to yield multiple empty dataframes by a function in Python.
import pandas as pd
df_list = []
def create_multiple_df(num):
    for i in range(num):
        df = pd.DataFrame()
        df_name = "df_" + str(num)
        exec(df_name + " = df ")
        df_list.append(eval(df_name))
    for i in df_list:
        yield i
For example, when I call create_multiple_df(3), I would like df_1, df_2 and df_3 to be returned.
However, it didn't work.
I have two questions:
How to store multiple dataframes in a list (i.e. without evaluating the contents of the dataframes)?
How to yield multiple variable elements from a list?
Thanks!
It's very likely that you do not want to have df_1, df_2, df_3 ... etc. This is often a design pursued by beginners for some reason, but trust me that a dictionary or simply a list will do the trick without the need to hold different variables.
Here, it sounds like you simply want a list comprehension:
dfs = [pd.DataFrame() for _ in range(n)]
This creates n empty dataframes and stores them in a list. To retrieve or modify one, access it by position (positions start at 0). So instead of keeping a dataframe in a variable df_1, you keep it in the list dfs and use dfs[1] to get or edit it.
Another option is a dictionary comprehension:
dfs = {i: pd.DataFrame() for i in range(n)}
It works in a similar fashion: you can access an entry with dfs[0] or dfs[1]. You can even use real names as keys, e.g. {genre: pd.DataFrame() for genre in ['romance', 'action', 'thriller']}, and then use dfs['romance'] or dfs['thriller'] to retrieve the corresponding dataframe.
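Putting the two questions together, a minimal sketch might look like the following. The "df_1" style keys and the iter_frames helper are illustrative choices, not something pandas requires.

```python
import pandas as pd


def create_multiple_df(num):
    """Return a dict of `num` empty DataFrames keyed 'df_1' ... 'df_<num>'."""
    return {f"df_{i}": pd.DataFrame() for i in range(1, num + 1)}


def iter_frames(frames):
    """Yield (name, DataFrame) pairs one at a time."""
    for name, frame in frames.items():
        yield name, frame


frames = create_multiple_df(3)
names = [name for name, _ in iter_frames(frames)]
print(names)  # ['df_1', 'df_2', 'df_3']
```

Each dataframe is stored as an object in the dict, so nothing is evaluated or copied, and no exec or eval is needed.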
I have a dataframe df with three columns, "id", "nodes_set" and "description", where each value in "nodes_set" is a list of strings.
I am trying to split it into groups based on the values in "nodes_set", as follows:
df_by_nodes_set = df.groupby('nodes_set')
list(df_by_nodes_set)
I think the problem lies in the fact that I am trying to use groupby with lists, but I am not sure how to deal with that.
The question is unclear, but if you need to group by a column of lists, note that lists are unhashable and so cannot be used as group keys directly. Convert each list into a hashable key first, for example by concatenating its elements into a string id, as below:
import pandas as pd
df = pd.DataFrame([[i, list(range(i)), 'sample ' + str(i)] for i in range(5)],
                  columns=["id", "nodes_set", "description"])
# Build a string key such as '0_1_2' from each list, then group by it.
nodes_set_key = df['nodes_set'].apply(lambda x: '_'.join(map(str, x)))
df.groupby(nodes_set_key).last()
The result has one row per distinct key; here the keys are '', '0', '0_1', '0_1_2' and '0_1_2_3'.
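An alternative to joining the elements into a string is to convert each list to a tuple, which is hashable and cannot collide the way joined strings can. The sample data below is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "nodes_set": [["a", "b"], ["a", "b"], ["a_b"]],
        "description": ["first", "second", "third"],
    }
)

# Lists are unhashable, but tuples with the same elements are hashable.
nodes_key = df["nodes_set"].apply(tuple)
grouped = df.groupby(nodes_key)

# Joining with "_" would merge ["a", "b"] and ["a_b"] into one key;
# the tuple keys keep them apart.
joined_key = df["nodes_set"].apply(lambda items: "_".join(items))
print(grouped.ngroups)       # 2
print(joined_key.nunique())  # 1
```

If the order of elements inside a list should not matter, sorting before converting (for example tuple(sorted(items))) is one option.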
I am having difficulty creating multiple lists from a list of pandas dataframes:
df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')
...
dfN = pd.read_csv('dfN.csv')
dfs = [df1, df2, ..., dfN]
So far, I am able to convert each dataframe into a list by df1 = df1.values.tolist(). Since I have multiple data frames, I would like to convert each dataframe into a list with a loop.
Appreciate any suggestions!
Use list comprehensions:
dfs = [i.values.tolist() for i in dfs]
You can store the lists the same way you are storing the dataframes:
lists = []
for df in dfs:
    temp_list = df.values.tolist()
    lists.append(temp_list)
This will give you a list of lists. Each list within will be values from a dataframe. Or did I understand the question incorrectly?
Edit: If you wish to name each list, then you can use a dictionary instead? Would be better than trying to create thousands of variables dynamically.
dict_of_lists = {}
for index, df in enumerate(dfs):
    listname = "list" + str(index)
    dict_of_lists[listname] = df.values.tolist()
Use pd.concat to join all the dataframes into one big dataframe (axis=1 places them side by side, column-wise), then convert that to a list:
df_all = pd.concat(dfs,axis=1)
df_all.values.tolist()
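To see how the per-dataframe conversion differs from the pd.concat approach, here is a small sketch. The two tiny dataframes are stand-ins for the CSV files in the question.

```python
import pandas as pd

# Two small stand-in frames; in the question these come from pd.read_csv.
df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"c": [5, 6]})
dfs = [df1, df2]

# One list of rows per dataframe.
per_frame = [df.values.tolist() for df in dfs]
print(per_frame)  # [[[1, 3], [2, 4]], [[5], [6]]]

# One combined list: axis=1 places the frames side by side first.
combined = pd.concat(dfs, axis=1).values.tolist()
print(combined)  # [[1, 3, 5], [2, 4, 6]]
```

The first form keeps the dataframes separate, which matches the question; the second only makes sense when the rows of all dataframes line up.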
I want to handle dataframes in a specific way.
Suppose there is a group of dataframes (mostly produced by splitting in a for loop), like:
df1 = iris.iloc[ : , 0:1]
df2 = iris.iloc[ : , 1:2]
...
dfn = iris.iloc[ : , n-1:n]
I want to group the dataframes like this:
df_group1 = df1, df2, ...., dfn
df_group2 = df1, df3, ...., df(n-1)
First, I do not want to do this manually.
Second, I do not want to use a tuple, dictionary, or list. Using one temporarily while grouping is fine, but the end result should not be a tuple, dictionary, or list, because I want to use df_group in other functions as well.
Third, combining the dataframes is not the point of this question.
Thanks
I have a data frame in pyspark with more than 100 columns. For every column, I would like to add a backtick (`) at the start and at the end of the column name.
For example:
The column name is testing user. I want `testing user`.
Is there a method to do this in pyspark/python? The code should return a data frame.
Use a list comprehension in Python:
from pyspark.sql import functions as F
df = ...
df_new = df.select([F.col(c).alias("`"+c+"`") for c in df.columns])
This method also gives you the option to add custom python logic within the alias() function like: "prefix_"+c+"_suffix" if c in list_of_cols_to_change else c
To add prefix or suffix:
df.columns gives the list of column names ([col_1, col_2, ...]) of the dataframe whose columns we want to prefix or suffix.
df.columns
Iterate through the above list and create another list of aliased columns that can be used inside a select expression.
from pyspark.sql.functions import col
select_list = [col(col_name).alias("prefix_" + col_name) for col_name in df.columns]
When using inside select, do not forget to unpack list with asterisk(*). We can assign it back to same or different df for use.
df.select(*select_list).show()
df = df.select(*select_list)
df.columns will now return list of new columns(aliased).
If you would like to add a prefix or suffix to multiple columns in a pyspark dataframe, you could use a for loop and .withColumnRenamed().
As an example, you might like:
def add_prefix(sdf, prefix):
    for c in sdf.columns:
        sdf = sdf.withColumnRenamed(c, '{}{}'.format(prefix, c))
    return sdf
You can amend sdf.columns as you see fit.
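As a variation on the same idea, the new names can be computed first and applied in one call with DataFrame.toDF, which accepts the new column names positionally. This is only a sketch: add_affixes and its only parameter are hypothetical names introduced here, not part of PySpark.

```python
def add_affixes(sdf, prefix="", suffix="", only=None):
    """Return a new DataFrame with prefix/suffix added to column names.

    Hypothetical helper. `only` is an optional collection of column names
    to rename; when it is None, every column is renamed.
    """
    new_names = [
        f"{prefix}{c}{suffix}" if only is None or c in only else c
        for c in sdf.columns
    ]
    # DataFrame.toDF(*names) returns a new DataFrame whose columns
    # carry the given names, in order.
    return sdf.toDF(*new_names)


# Usage, assuming sdf is an existing Spark DataFrame:
# sdf = add_affixes(sdf, prefix="`", suffix="`")
# sdf = add_affixes(sdf, suffix="_prec", only={"id", "name"})
```

Computing the names up front keeps the renaming logic in plain Python, so it is easy to check without a Spark session.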
You can use the withColumnRenamed method of the dataframe to create a new dataframe:
df.withColumnRenamed('testing user', '`testing user`')
Edit: if you have a list of column names, you can do the following:
old = "First Last Age"
new = ["`"+field+"`" for field in old.split()]
df.rdd.toDF(new)
output :
DataFrame[`First`: string, `Last`: string, `Age`: string]
Here is how one can solve similar problems:
from pyspark.sql.functions import col
df.select([col(col_name).alias('prefix' + col_name + 'suffix') for col_name in df.columns])
I had a dataframe that I duplicated twice and then joined together. Since both copies had the same column names, I used:
from functools import reduce
df = reduce(lambda df, idx: df.withColumnRenamed(list(df.schema.names)[idx],
                                                 list(df.schema.names)[idx] + '_prec'),
            range(len(list(df.schema.names))),
            df)
Every column in my dataframe then had the '_prec' suffix, which allowed me to do sweet stuff.