I want to concatenate the values of columns that share the same name.
I've found solutions for concatenating columns from different dataframes, but not from within a single dataframe.
I also tried splitting the columns into separate dataframes and then concatenating those, but it doesn't work because the column names come out differently. (For example, it shows "apple", "banana", "pizza", "apple.1", "banana.1"...)
Is there any solution to produce output like the below? Thanks!
You can use melt to flatten your dataframe, then pivot to reshape it back to its original shape:

# Strip the ".1", ".2", ... suffixes pandas appends to duplicate column names
df.columns = df.columns.str.rsplit('.').str[0]

# Stack all columns into (variable, value) pairs, number each occurrence of a
# variable to build a row index, then pivot back to wide form
out = df.melt().assign(index=lambda x: x.groupby('variable').cumcount()) \
        .pivot_table('value', 'index', 'variable', fill_value=0) \
        .rename_axis(index=None, columns=None)[df.columns.unique()]
print(out)
# Output
apple banana pizza
0 1 4 4
1 2 3 7
2 3 2 3
3 5 0 1
4 8 0 5
5 9 0 34
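For reference, an input that reproduces this (reconstructed from the output above, so treat the exact values as illustrative) would be:

import pandas as pd

# Duplicate "apple" and "pizza" columns, which pandas renames to
# "apple.1" and "pizza.1" when reading the data
df = pd.DataFrame({
    'apple': [1, 2, 3],
    'banana': [4, 3, 2],
    'pizza': [4, 7, 3],
    'apple.1': [5, 8, 9],
    'pizza.1': [1, 5, 34],
})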
I got two dataframes; simplified, they look like this:

Dataframe A

ID  item
1   apple
2   peach

Dataframe B

ID  flag  price ($)
1   A     3
1   B     2
2   B     4
2   A     2
ID: unique identifier for each item
flag: unique identifier for each vendor
price: varies for each vendor
In this simplified case I want to extract the price values of dataframe B and add them to dataframe A in separate columns depending on their flag value.
The result should look similar to this
Dataframe C

ID  item   price_A  price_B
1   apple  3        2
2   peach  2        4
I tried to split dataframe B into two dataframes based on the flag values and merge them with dataframe A afterwards, but there must be an easier solution.
Thank you in advance! :)
You can use pd.merge and pd.pivot_table for this:
df_C = pd.merge(df_A, df_B, on=['ID']).pivot_table(index=['ID', 'item'], columns='flag', values='price ($)')
df_C.columns = ['price_' + alpha for alpha in df_C.columns]
df_C = df_C.reset_index()
Output:
>>> df_C
ID item price_A price_B
0 1 apple 3 2
1 2 peach 2 4
Alternatively, chain the same merge and pivot_table and use .add_prefix to build the column names (dfa and dfb here are dataframes A and B):

(dfb
 .merge(dfa, on="ID")
 .pivot_table(index=['ID', 'item'], columns='flag', values='price ($)')
 .add_prefix("price_")
 .reset_index()
)
My dataframe has a column of lists and looks like this.
id source
0 3 [nan,nan,nan]
1 5 [nan,foo,foo,nan,foo]
2 7 [ham,nan,ham,nan]
3 9 [foo,foo]
I need to remove duplicates from each list. So I am looking for something like the below.
id source
0 3 [nan]
1 5 [nan,foo]
2 7 [ham,nan]
3 9 [foo]
I tried to use the following code, which didn't work. What do you recommend?
df['source'] = list(set(df['source']))
You can .explode on the source column, .drop_duplicates, and .groupby back:
df = (
    df.explode("source")                 # one row per list element
    .drop_duplicates(["id", "source"])   # drop repeated values within each id
    .groupby("id", as_index=False)       # regroup by id
    .agg(list)                           # collect the values back into lists
)
print(df)
Prints:
id source
0 3 [nan]
1 5 [nan, foo]
2 7 [ham, nan]
3 9 [foo]
Or convert the list to pd.Series, drop duplicates and convert back to list:
df["source"] = df["source"].apply(lambda x: [*pd.Series(x).drop_duplicates()])
print(df)
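If you prefer plain Python, dict.fromkeys also deduplicates while preserving order. A minimal sketch; note it relies on the repeated NaN entries being the same np.nan object (the usual case), since dict lookup short-circuits on object identity:

# dict keys are unique and insertion-ordered, so this keeps first occurrences
df["source"] = df["source"].apply(lambda x: list(dict.fromkeys(x)))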
I have a DataFrame as below:
index text_column
0 ,(Unable_to_see),(concern_code),(concern_color),(Unable_to_see)
1 ,Info_concern,Info_concern
2 ,color_Concern,color_Concern,no_category
3 ,reg_Concern,reg_Concern
I am trying to remove duplicated values completely within each row, including the original occurrence.
I tried this:
df['result'] = [set(x) for x in df['text_column']]
This gives me the values without duplicates, but the original occurrence of each duplicated value is still kept; I need that removed as well.
Desired output:
result
0 (concern_code),(concern_color)
1
2 no_category
3
Any suggestions or advice ?
Version 1: Removing duplicates across all rows:
You can use .drop_duplicates() with the parameter keep=False after splitting and expanding the substrings with .str.split() and .explode().
Then, regroup the entries into their original rows with .groupby() on the row index (level 0). Finally, aggregate and join back the substrings of each original row with .agg() and ','.join:
df['result'] = (df['text_column'].str.split(',')
                .explode()
                .drop_duplicates(keep=False)
                .groupby(level=0).agg(','.join)
               )
.drop_duplicates() with keep=False ensures that all occurrences of a duplicated value are removed, including the first one.
Alternatively, you can also do it with .stack() in place of .explode(), as follows:
df['result'] = (df['text_column'].str.split(',', expand=True)
                .stack()
                .drop_duplicates(keep=False)
                .groupby(level=0).agg(','.join)
               )
Data Input:
(Extra test cases added beyond the sample data in the question:)
text_column
0 (Unable_to_see),(concern_code),(concern_color),(Unable_to_see)
1 Info_concern,Info_concern
2 color_Concern,color_Concern,no_category
3 reg_Concern,reg_Concern
4 ABCDEFGHIJKL
5 ABCDEFGHIJKL
Result:
print(df)
text_column result
0 (Unable_to_see),(concern_code),(concern_color),(Unable_to_see) (concern_code),(concern_color)
1 Info_concern,Info_concern NaN
2 color_Concern,color_Concern,no_category no_category
3 reg_Concern,reg_Concern NaN
4 ABCDEFGHIJKL NaN
5 ABCDEFGHIJKL NaN
Note that the last 2 rows, which contain the same string, are removed as duplicates even though they are in different rows.
Version 2: Removing duplicates within the same row only:
If the scope of removing duplicates is limited to within the same row rather than across all rows, we can achieve this with the following variation:
df['result'] = (df['text_column'].str.split(',', expand=True)
                .stack()
                .groupby(level=0)
                .agg(lambda x: ','.join(x.drop_duplicates(keep=False)))
               )
Data Input:
(Extra test cases added beyond the sample data in the question:)
text_column
0 (Unable_to_see),(concern_code),(concern_color),(Unable_to_see)
1 Info_concern,Info_concern
2 color_Concern,color_Concern,no_category
3 reg_Concern,reg_Concern
4 ABCDEFGHIJKL
5 ABCDEFGHIJKL
Output:
print(df)
text_column result
0 (Unable_to_see),(concern_code),(concern_color),(Unable_to_see) (concern_code),(concern_color)
1 Info_concern,Info_concern
2 color_Concern,color_Concern,no_category no_category
3 reg_Concern,reg_Concern
4 ABCDEFGHIJKL ABCDEFGHIJKL
5 ABCDEFGHIJKL ABCDEFGHIJKL
Note that the last 2 rows with the same string are kept, since they are in different rows.
I would like to merge nine Pandas dataframes together into a single dataframe, doing a join on two columns, controlling the column names. Is this possible?
I have nine datasets. All of them have the following columns:
org, name, items, spend
I want to join them into a single dataframe with the following columns:
org, name, items_df1, spend_df1, items_df2, spend_df2, items_df3...
I've been reading the documentation on merging and joining. I can currently merge two datasets together like this:
ad = pd.DataFrame.merge(df_presents, df_trees,
                        on=['org', 'name'],
                        suffixes=['_presents', '_trees'])
This works great; doing print(list(ad.columns.values)) shows me the following columns:

[u'org', u'name', u'spend_presents', u'items_presents', u'spend_trees', u'items_trees', ...]
But how can I do this for nine dataframes? merge only seems to accept two at a time, and if I do it sequentially, my column names are going to end up very messy.
You could use functools.reduce to iteratively apply pd.merge to each of the DataFrames:
result = functools.reduce(merge, dfs)
This is equivalent to
result = dfs[0]
for df in dfs[1:]:
    result = merge(result, df)
To pass the on=['org', 'name'] argument, you could use functools.partial to define the merge function:
merge = functools.partial(pd.merge, on=['org', 'name'])
Since specifying the suffixes parameter in functools.partial would only allow one fixed choice of suffix, and since here we need a different suffix for each pd.merge call, I think it would be easiest to prepare the DataFrames' column names before calling pd.merge:
for i, df in enumerate(dfs, start=1):
    df.rename(columns={col: '{}_df{}'.format(col, i) for col in ('items', 'spend')},
              inplace=True)
For example,
import pandas as pd
import numpy as np
import functools
np.random.seed(2015)
N = 50
dfs = [pd.DataFrame(np.random.randint(5, size=(N, 4)),
                    columns=['org', 'name', 'items', 'spend']) for i in range(9)]
for i, df in enumerate(dfs, start=1):
    df.rename(columns={col: '{}_df{}'.format(col, i) for col in ('items', 'spend')},
              inplace=True)
merge = functools.partial(pd.merge, on=['org', 'name'])
result = functools.reduce(merge, dfs)
print(result.head())
yields
org name items_df1 spend_df1 items_df2 spend_df2 items_df3 \
0 2 4 4 2 3 0 1
1 2 4 4 2 3 0 1
2 2 4 4 2 3 0 1
3 2 4 4 2 3 0 1
4 2 4 4 2 3 0 1
spend_df3 items_df4 spend_df4 items_df5 spend_df5 items_df6 \
0 3 1 0 1 0 4
1 3 1 0 1 0 4
2 3 1 0 1 0 4
3 3 1 0 1 0 4
4 3 1 0 1 0 4
spend_df6 items_df7 spend_df7 items_df8 spend_df8 items_df9 spend_df9
0 3 4 1 3 0 1 2
1 3 4 1 3 0 0 3
2 3 4 1 3 0 0 0
3 3 3 1 3 0 1 2
4 3 3 1 3 0 0 3
Would doing a big pd.concat() and then renaming all the columns work for you? Something like:
desired_columns = ['items', 'spend']
# df1 keeps org and name; the remaining frames contribute only items and spend
big_df = pd.concat([df1, df2[desired_columns], ..., dfN[desired_columns]], axis=1)

new_columns = ['org', 'name']
for i in range(num_dataframes):
    new_columns.extend(['items_df%i' % i, 'spend_df%i' % i])
big_df.columns = new_columns

This should give you columns like:

org, name, items_df0, spend_df0, items_df1, spend_df1, ..., items_df8, spend_df8

Note that pd.concat(axis=1) aligns rows by index rather than joining on org and name, so this assumes all nine dataframes share the same row order.
I've wanted this as well at times but been unable to find a built-in pandas way of doing it. Here is my suggestion (and my plan for the next time I need it); a sketch follows the list:

1. Create an empty dictionary, merge_dict.
2. Loop through the index you want for each of your data frames and add the desired values to the dictionary with the index as the key.
3. Generate a new index as sorted(merge_dict).
4. Generate a new list of data for each column by looping through merge_dict.items().
5. Create a new data frame with index=sorted(merge_dict) and the columns created in the previous step.

Basically, this is somewhat like a hash join in SQL. It seems like the most efficient approach I can think of and shouldn't take too long to code up. Good luck.
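A minimal sketch of those steps, assuming (purely for illustration) a helper called dict_merge and one value column per frame:

import pandas as pd

def dict_merge(dfs, columns):
    # Steps 1-2: collect each frame's values into a dict keyed by its index
    merge_dict = {}
    for df, col in zip(dfs, columns):
        for idx, value in df[col].items():
            merge_dict.setdefault(idx, {})[col] = value
    # Step 3: the new index is the sorted dictionary keys
    new_index = sorted(merge_dict)
    # Step 4: build one list of data per column (missing entries become None)
    data = {col: [merge_dict[k].get(col) for k in new_index] for col in columns}
    # Step 5: assemble the result
    return pd.DataFrame(data, index=new_index)

# e.g. dict_merge([df_items, df_spend], ['items', 'spend'])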
I have a list of dataframes in python pandas that have the same row names and row values. What I would like to do is produce one dataframe with them inner-joined on the row values. I have looked online and found the merge function, but this isn't working because my rows aren't a column. Does anyone know the best way to do this? Is the solution to take the row values and turn them into a column, and if so, how do you do that? Thanks for the help.
input:
"happy"
userid
1 2
2 8
3 9
"sad"
userid
1 9
2 12
3 11
output:
"sad" "happy"
userid
1 9 2
2 12 8
3 11 9
It looks like your DataFrames are indexed by userid, in which case your merge() should indicate that it wants to join on the index:
In [51]: df1
Out[51]:
"happy"
userid
1 2
2 8
3 9
In [52]: df2
Out[52]:
"sad"
userid
1 9
2 12
3 11
In [53]: pd.merge(df2, df1, left_index=True, right_index=True)
Out[53]:
"sad" "happy"
userid
1 9 2
2 12 8
3 11 9
And if you want to run this over a list of DataFrames, just reduce() them (on Python 3, reduce lives in functools):

from functools import reduce

reduce(lambda x, y: pd.merge(x, y, left_index=True, right_index=True), list_of_dfs)
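pd.concat along the column axis is a standard alternative that achieves the same index-based inner join without the lambda (not part of the answer above, just a common idiom):

# Align all frames on their index, keeping only index labels shared by all
merged = pd.concat(list_of_dfs, axis=1, join='inner')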
Transposing swaps the columns and rows of the DataFrame. If dfs is your list of DataFrames, then:
dfs = [df.T for df in dfs]
will make dfs a list of transposed DataFrames.
Then to merge:
merged = dfs[0]
for df in dfs[1:]:
    merged = pd.merge(merged, df, how='inner')
By default pd.merge merges DataFrames based on all columns shared in common.
Note that transposing requires copying all the data in the original DataFrame into a new DataFrame. It would be more efficient to build the DataFrame in the correct (transposed) format from the beginning (if possible), rather than fixing it later by transposing.
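For instance, a minimal sketch of building the frames keyed the way you want from the start, using the data from the question (the raw-dict form is an assumption for illustration):

import pandas as pd

# Construct each frame with userid as the index from the beginning
happy = pd.DataFrame({'userid': [1, 2, 3], '"happy"': [2, 8, 9]}).set_index('userid')
sad = pd.DataFrame({'userid': [1, 2, 3], '"sad"': [9, 12, 11]}).set_index('userid')

# With the key already in the index, no transposing is needed
out = pd.merge(sad, happy, left_index=True, right_index=True)
print(out)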