python dataframe grouping separately - python

I want to handle dataframes in a specific way.
Suppose there is a group of dfs
(mostly divided up inside a 'for' loop)
like
df1 = iris.iloc[ : , 0:1]
df2 = iris.iloc[ : , 1:2]
.
.
.
dfn = iris.iloc[ : , n-1:n]
I want to group dfs like
df_group1 = df1, df2, ...., dfn
df_group2 = df1, df3, ...., df(n-1)
First, I do not want to do this manually.
Second, I do not want the end result to be a tuple, dictionary, or list. Temporary use of these while grouping is fine, but the end result should be something else, because I want to use df_group in other functions as well.
Third, combining the dataframes is not the point of this question.
Thanks

Related

Concatenate non empty dataframes

I have n dataframes that are formed by downloading data from Firestore. The number of dataframes depends on the number of unique values of a variable.
I want to concatenate these dataframes into one final dataframe, but I want to ignore the empty ones. How can I do this?
For example, if I have df1, df2, df3, df4 and df3 is empty, I want to concatenate df1, df2 and df4.
I would use the .empty attribute:
def concat(*args):
    return pd.concat([x for x in args if not x.empty])
df = concat(df1, df2, df3, df4)
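One edge case worth sketching: pd.concat raises a ValueError when it is given an empty list, which is what happens if every dataframe turns out to be empty. The variant below guards against that. The name concat_non_empty and the sample frames are made up for illustration.

```python
import pandas as pd

def concat_non_empty(*frames):
    """Concatenate only the non-empty frames; return an empty DataFrame if none remain."""
    kept = [f for f in frames if not f.empty]
    # pd.concat raises ValueError on an empty list, so handle that case explicitly.
    if not kept:
        return pd.DataFrame()
    return pd.concat(kept, ignore_index=True)

# Illustrative data: df3 has no rows and is skipped.
df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"A": [3]})
df3 = pd.DataFrame({"A": []})
df4 = pd.DataFrame({"A": [4, 5]})

result = concat_non_empty(df1, df2, df3, df4)
all_empty = concat_non_empty(df3)
```

Whether to return an empty DataFrame or raise in the all-empty case depends on what the calling code expects; returning an empty frame is just one option.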

Filling a dataframe with multiple dataframe values

I have about 100 dataframes whose values need to be filled into another big dataframe. I am presenting the question with two dataframes:
import pandas as pd
df1 = pd.DataFrame([1,1,1,1,1], columns=["A"])
df2 = pd.DataFrame([2,2,2,2,2], columns=["A"])
Please note that both dataframes have the same column names.
I have a master dataframe that has repeated index values, as follows:
master_df=pd.DataFrame(index=df1.index)
master_df= pd.concat([master_df]*2)
Expected output:
master_df['A']=[1,1,1,1,1,2,2,2,2,2]
I am using a for loop to replace every n rows of master_df with df1, df2, ..., df100.
Please suggest a better way of doing it.
In fact, df1, df2, ..., df100 are the output of a function whose input is the column A values (1, 2). I was wondering if there is something like
another_df=master_df['A'].apply(lambda x: function(x))
Thanks in advance.
If you want to concatenate the dataframes you could just use pandas concat with a list as the code below shows.
First you can add df1 and df2 to a list:
df_list = [df1, df2]
Then you can concat the dfs:
master_df = pd.concat(df_list)
I used the default value of 0 for 'axis' in the concat function (which is what I think you are looking for), but if you want to concatenate the different dfs side by side you can just set axis=1.
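Since the question says the dataframes come from a function applied to each input value, the list can also be built directly from that function. A rough sketch, where make_frame is only a hypothetical stand-in for the asker's function:

```python
import pandas as pd

def make_frame(value, n_rows=5):
    # Hypothetical stand-in for the function that produces df1, df2, ...
    return pd.DataFrame({"A": [value] * n_rows})

inputs = [1, 2]  # in the question there would be about 100 of these

# Build every frame first, then concatenate once.
master_df = pd.concat([make_frame(v) for v in inputs])
```

Without ignore_index=True the result keeps each frame's own index (0 to 4, repeated), which matches the repeated index of master_df in the question.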

How to assign a String variable to a dataframe name

I have a problem with a for loop, like below:
list = [1,2,3,4]
for index in list:
    new_df_name = "user_" + str(index)
    new_df_name = origin_df1.join(origin_df2,'id','left')
But "new_df_name" is just a variable of string type.
How can I achieve this?
I assume what you really need is a list of dataframes (which do not necessarily have specific names) that you then union together.
dataframes = [df1, df2, df3]  # etc...
res_df, tail_dfs = dataframes[0], dataframes[1:]
for df in tail_dfs:
    res_df = res_df.unionAll(df)
Update:
an even better option for the union is described in a comment.
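If the names themselves matter, a dictionary keyed by the generated string is the usual substitute for creating variables dynamically. The sketch below uses pandas rather than Spark so that it runs on its own (that swap is my assumption, as is the sample data); the same pattern should carry over to Spark dataframes.

```python
import pandas as pd

# Illustrative stand-ins for origin_df1 and origin_df2.
origin_df1 = pd.DataFrame({"id": [1, 2, 3], "x": ["a", "b", "c"]})
origin_df2 = pd.DataFrame({"id": [1, 3], "y": [10, 30]})

frames = {}
for index in [1, 2, 3, 4]:
    # str(index) is needed: "user_" + 1 would raise a TypeError.
    frames["user_" + str(index)] = origin_df1.merge(origin_df2, on="id", how="left")

# Each frame is reachable by name, and all of them can still be combined.
combined = pd.concat(list(frames.values()), ignore_index=True)
```

Here frames["user_2"] plays the role that a variable called user_2 would have played.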

Comparing two dataframe columns and outputting a third

I apologize in advance if this has been covered, I could not find anything quite like this. This is my first programming job (I was previously software QA) and I've been beating my head against a wall on this.
I have 2 dataframes. One of them [df2] is very large (14.6 million lines), and I am iterating through it in chunks. I am trying to compare a column of the same name in each dataframe and, where the values are equal, output a second column from the larger frame.
For example:
if df1['tag'] == df2['tag']:
    df1['new column'] = df2['plate']
I attempted a merge but this didn't output what I expected.
df3 = pd.merge(df1, df2, on='tag', how='left')
I hope I did an okay job explaining this.
[Edit:] I also believe I should mention that df2 and df1 both have many additional columns I do not want to interact with/change. Is it possible to only compare the single columns of two dataframes, and output the third additional column?
You can try an inner merge. Inner-merging df1 with df2 gives you plates only for the rows common to both, and you can then rename the new column in df1 as needed:
df1 = df1.merge(df2, on="tag", how='inner')
df1['new column'] = df1['plate']
del df1['plate']
I hope this works.
As smci mentioned, this is a perfect time to use join/merge. If you're looking to preserve df1, a left join is what you want. So you were on the right path:
df1 = pd.merge(df1,
               df2[['tag', 'plate']],
               on='tag', how='left')
df1 = df1.rename(columns={'plate': 'new column'})
That only compares the tag columns of the two dataframes, and because only 'tag' and 'plate' are selected from df2, its other columns won't matter. It brings over the plate column from df2 and then renames it to whatever you want your new column to be called.
This is totally a case for join/merge. You want to put df2 on the left because it's smaller.
df2.join(df1, on='tag', ...)
You only misunderstood the type of join/merge you want to make:
how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default: ‘left’
how='left' would create an (unwanted) entry for all rows of the LHS df2. That's not quite what you want (if df2 contained other tag values not seen in df1, you'd also get entries for them).
how='inner' would form the intersection of df2 and df1 on the on='tag' field, i.e. you only get entries where df1 contains a valid tag value according to df2.
So:
df3 = df2.join(df1, on='tag', how='inner')
# then reference df3['plate']
or if you only want the 'plate' column in df3 (or some other selection of columns), you can directly do:
df2.join(df1, on='tag', how='inner') ['plate']
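To make the difference between the two join types concrete, here is a small sketch with made-up data. It uses merge on the 'tag' column instead of join (my choice, since DataFrame.join matches the `on` column against the other frame's index), and it selects only 'tag' and 'plate' from df2 so the unrelated columns stay out of the result.

```python
import pandas as pd

# Made-up frames; other1/other2 stand for the columns that should not be touched.
df1 = pd.DataFrame({"tag": ["a", "b", "c"], "other1": [1, 2, 3]})
df2 = pd.DataFrame({"tag": ["a", "c", "d"],
                    "plate": ["P1", "P3", "P4"],
                    "other2": [7, 8, 9]})

lookup = df2[["tag", "plate"]]  # only the key and the column to bring over

# Left merge: every row of df1 is kept, plate is NaN where the tag has no match.
left = df1.merge(lookup, on="tag", how="left")

# Inner merge: only rows whose tag appears in both frames.
inner = df1.merge(lookup, on="tag", how="inner")
```

If a tag occurs more than once in df2, both variants produce one output row per match, so it may be worth dropping duplicates from the lookup first.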

How do I combine two dataframes?

I have an initial dataframe D. I extract two dataframes from it like this:
A = D[D.label == k]
B = D[D.label != k]
I want to combine A and B into one DataFrame. The order of the data is not important. However, when we sample A and B from D, they retain their indexes from D.
DEPRECATED: DataFrame.append and Series.append were deprecated in v1.4.0 and removed in pandas 2.0; use pd.concat instead.
On older pandas versions you can use append:
df_merged = df1.append(df2, ignore_index=True)
And to keep their original indexes, set ignore_index=False.
Use pd.concat to join multiple dataframes:
df_merged = pd.concat([df1, df2], ignore_index=True, sort=False)
Merge across rows:
df_row_merged = pd.concat([df_a, df_b], ignore_index=True)
Merge across columns:
df_col_merged = pd.concat([df_a, df_b], axis=1)
If you're working with big data and need to concatenate multiple datasets, calling concat many times can hurt performance.
If you don't want to create a new df each time, you can instead collect the frames and call concat only once:
frames = [df_A, df_B] # Or perform operations on the DFs
result = pd.concat(frames)
This is pointed out in the pandas docs under concatenating objects (at the bottom of the section):
Note: It is worth noting however, that concat (and therefore append)
makes a full copy of the data, and that constantly reusing this
function can create a significant performance hit. If you need to use
the operation over several datasets, use a list comprehension.
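A minimal sketch of that pattern, with made-up data: build the pieces in a plain Python list inside the loop and concatenate once at the end, rather than growing a dataframe on every iteration.

```python
import pandas as pd

frames = []
for i in range(3):
    # Appending to a Python list is cheap; no dataframe data is copied here.
    frames.append(pd.DataFrame({"batch": [i, i], "value": [i * 10, i * 10 + 1]}))

# One concat call, so the data is copied only once.
result = pd.concat(frames, ignore_index=True)
```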
If you want to update/replace the values of the first dataframe df1 with the values of the second dataframe df2, you can do it with the following steps.
Step 1: Set the index of the first dataframe (df1). set_index returns a new dataframe, so assign the result back:
df1 = df1.set_index('id')
Step 2: Set the index of the second dataframe (df2):
df2 = df2.set_index('id')
Finally, update the dataframe using the following snippet:
df1.update(df2)
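A small end-to-end sketch of these steps with made-up data. Note that set_index returns a new dataframe, so its result is assigned back here, while update modifies df1 in place.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})
df2 = pd.DataFrame({"id": [2, 3], "value": [200.0, 300.0]})

# Align both frames on 'id' so update matches rows by id, not by position.
df1 = df1.set_index("id")
df2 = df2.set_index("id")

# In place: values in df1 are overwritten where df2 has a matching id and column.
df1.update(df2)
```

By default update only copies non-NA values from df2, so missing values in df2 leave df1 untouched.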
To join 2 pandas dataframes by column, using their indices as the join key, you can do this:
both = a.join(b)
And if you want to join multiple DataFrames, Series, or a mixture of them, by their index, just put them in a list, e.g.,:
everything = a.join([b, c, d])
See the pandas docs for DataFrame.join().
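For illustration, a sketch with small made-up frames that share some (not all) index labels. The default how='left' keeps the index of the calling frame.

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2, 3]}, index=["r1", "r2", "r3"])
b = pd.DataFrame({"y": [10, 20]}, index=["r1", "r2"])
c = pd.DataFrame({"z": [100, 300]}, index=["r1", "r3"])

# Join one frame on the index; r3 has no match in b, so y is NaN there.
both = a.join(b)

# Join several frames at once by passing a list.
everything = a.join([b, c])
```

The column names must not overlap when joining this way; otherwise join needs lsuffix/rsuffix (for a single frame) or the columns have to be renamed first.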
# collect excel content into list of dataframes
data = []
for excel_file in excel_files:
    data.append(pd.read_excel(excel_file, engine="openpyxl"))
# concatenate dataframes horizontally
df = pd.concat(data, axis=1)
# save combined data to excel
df.to_excel(excelAutoNamed, index=False)
You can use the above when you are appending horizontally. Hope this helps someone.
Use this code to attach two Pandas Data Frames horizontally:
df3 = pd.concat([df1, df2],axis=1, ignore_index=True, sort=False)
You must specify the axis along which you intend to join the two frames.
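One thing to be aware of, sketched below with made-up frames: when axis=1 is combined with ignore_index=True, it is the column labels that get replaced by 0, 1, ..., not the row index. Leave ignore_index out if the original column names should be kept.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"B": [3, 4]})

# Column names are preserved.
kept = pd.concat([df1, df2], axis=1)

# Column names are replaced by 0 and 1.
renumbered = pd.concat([df1, df2], axis=1, ignore_index=True)
```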
