Concatenating dataframes by index/columns elements - python

import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.random(36).reshape((9, 4)), index=np.arange(1, 10), columns=['A', 'B', 'C', 'D'])
te = pd.DataFrame(data = np.random.randint(low=1, high=10, size=1000), columns=['Number_test'])
I want to join the two dataframes so that each row of te is matched with the row of df whose index equals its value in the Number_test column.

Use pandas.merge:
pd.merge(df, te, left_index=True, right_on='Number_test')
or
pd.merge(df.reset_index(), te, left_on='index', right_on='Number_test')
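The index-to-column merge above can be checked on a small deterministic example. This is a minimal sketch with made-up values; it starts from te and uses how='left', which is an assumption about the desired result (every row of te kept, in its original order).

```python
import pandas as pd

# Tiny stand-ins for the question's df and te (values are made up)
df = pd.DataFrame({'A': [0.1, 0.2, 0.3]}, index=[1, 2, 3])
te = pd.DataFrame({'Number_test': [3, 1, 3, 2]})

# Match each Number_test value against the index of df;
# how='left' keeps every row of te in its original order
out = te.merge(df, left_on='Number_test', right_index=True, how='left')
print(out)
```

Each row of te picks up the columns of the df row whose index label equals its Number_test value.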

Related

dataframe pivoting and adding and convert category into columns with prefix

I am trying to transform this Dataframe.
To look like the following:
Here is the code to create the sample df
df = pd.DataFrame(data = [[1, 'A', 0, '2021-07-01'],
[1, 'B', 1, '2021-07-02'],
[2, 'D', 3, '2021-07-02'],
[2, 'C', 2, '2021-07-02'],
[2, 'E', 4, '2021-07-02']
], columns = ['id', 'symbol', 'value', 'date'])
symbol_list = [['A', 'B', ''], ['C','D','E']]
The resulting dataframe is grouped by the id field, with the symbol column turned into multiple columns whose ordering follows the user-supplied list.
I was using the .apply() method to construct each row of the above dataframe, but it takes a very long time for 10000+ data points.
I am trying to find a more efficient way to transform the dataframe. I think I will need a pivot function to unstack the dataframe, combined with resetting the index (to turn category values into columns). I would appreciate any help on this!
Use GroupBy.cumcount with DataFrame.unstack to reshape. Then extract the date columns with DataFrame.pop and take the maximum per row, flatten the column names, and finally add the new date column with DataFrame.assign:
df = pd.DataFrame(data = [[1, 'A', 0, '2021-07-01'],
[1, 'B', 1, '2021-07-02'],
[2, 'D', 3, '2021-07-02'],
[2, 'C', 2, '2021-07-02'],
[2, 'E', 4, '2021-07-02']
], columns = ['id', 'symbol', 'value', 'date'])
#IMPORTANT all values from symbol_list are in column symbol (without empty strings)
symbol_list = [['A', 'B', ''], ['C','D','E']]
order = [y for x in symbol_list for y in x if y]
print (order)
['A', 'B', 'C', 'D', 'E']
#convert symbol to an ordered Categorical whose order comes from the flattened list
df['symbol'] = pd.Categorical(df['symbol'], ordered=True, categories=order)
df['date'] = pd.to_datetime(df['date'])
#sorting by id and symbol
df = df.sort_values(['id','symbol'])
df1 = df.set_index(['id',df.groupby('id').cumcount()]).unstack()
date_max = df1.pop('date').max(axis=1)
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
df1 = df1.assign(date = date_max)
print (df1)
   symbol_0 symbol_1 symbol_2  value_0  value_1  value_2       date
id
1         A        B      NaN      0.0      1.0      NaN 2021-07-02
2         C        D        E      2.0      3.0      4.0 2021-07-02
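To see what the cumcount step contributes, here is a minimal sketch on a reduced toy frame (the names toy, pos and wide are illustrative, not from the answer). GroupBy.cumcount numbers the rows within each id, and that number becomes the column position after unstack.

```python
import pandas as pd

# Reduced version of the sample data: only id and symbol
toy = pd.DataFrame({'id': [1, 1, 2, 2, 2],
                    'symbol': ['A', 'B', 'C', 'D', 'E']})

# Position of each row within its id group: 0, 1 for id 1 and 0, 1, 2 for id 2
pos = toy.groupby('id').cumcount()

# Use (id, position) as the index, then move the position level to columns
wide = toy.set_index(['id', pos])['symbol'].unstack()
print(wide)
```

Groups with fewer rows than the largest group are padded with NaN, which is why id 1 has a missing third symbol.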

How to be able to concatenate the values of a column with the name of the other columns in a DataFrame

How to be able to concatenate the values of a column called "ITEM" with the name of the other columns, thus creating new columns.
If I have a dataframe like this:
import pandas as pd
df = pd.DataFrame({
'ITEM': ['Item1', 'Item2', 'Item3'],
'Variable1': [1,1,2],
'Variable2': [2,1,3],
'Variable3':[3,2,4]
})
df
I need to transform this dataframe into another one whose new columns combine the values of ITEM with the names of the other columns (the before and after tables were posted as images).
import pandas as pd
df = pd.DataFrame({
'ITEM': ['Item1', 'Item2', 'Item3'],
'Variable1': [1,1,2],
'Variable2': [2,1,3],
'Variable3':[3,2,4]
})
df = pd.melt(df, id_vars='ITEM',value_vars=['Variable1','Variable2','Variable3'])
df['title'] = df['variable']+'_'+df['ITEM']
df = df[['title','value']].T
df.columns = df.iloc[0]
df = df[1:]
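A slightly more compact variant of the same idea is sketched below. It assumes the goal is a single-row frame whose column names have the form Variable_Item, which is what the code above produces; the names long and wide are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'ITEM': ['Item1', 'Item2', 'Item3'],
    'Variable1': [1, 1, 2],
    'Variable2': [2, 1, 3],
    'Variable3': [3, 2, 4]
})

# Long format: one row per (ITEM, variable) pair
long = df.melt(id_vars='ITEM')

# Build the combined column name, e.g. 'Variable1_Item1'
long['title'] = long['variable'] + '_' + long['ITEM']

# Use the titles as the index, then transpose so they become the columns
wide = long.set_index('title')[['value']].T
print(wide)
```

Setting the index before transposing avoids the separate steps of promoting the first row to the header and then dropping it.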

Pandas groupby multiple columns with values of unique groupings as their own column

Example Dataframe =
df = pd.DataFrame({'ID': [1,1,2,2,2,3,3,3],
'Type': ['b','b','b','a','a','a','a','a']})
I would like to return the counts grouped by ID, with one column for each unique value in Type holding the count of that Type for the grouped row:
pd.DataFrame({'ID': [1,2,3], 'Count_TypeA': [0,2,3], 'CountTypeB': [2,1,0], 'TotalCount': [2,3,3]})
Is there an easy way to do this using the groupby function in pandas?
For this you can use the pandas get_dummies function, which converts a categorical variable into dummy/indicator variables.
Check if this meets your requirements:
import pandas as pd
df = pd.DataFrame({'ID': [1, 1, 2, 2, 2, 3, 3, 3],
'Type': ['b', 'b', 'b', 'a', 'a', 'a', 'a', 'a']})
dummy_var = pd.get_dummies(df["Type"])
dummy_var.rename(columns={'a': 'CountTypeA', 'b': 'CountTypeB'}, inplace=True)
df1 = pd.concat([df['ID'], dummy_var], axis=1)
df_group1 = df1.groupby(by=['ID'], as_index=False).sum()
df_group1['TotalCount'] = df_group1['CountTypeA'] + df_group1['CountTypeB']
print(df_group1)
This will print the following result:
   ID  CountTypeA  CountTypeB  TotalCount
0   1           0           2           2
1   2           2           1           3
2   3           3           0           3
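As an alternative to get_dummies (a different technique from the one in the answer above), pandas.crosstab counts the ID/Type combinations directly. A minimal sketch, with the column names chosen to match the output above:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2, 2, 3, 3, 3],
                   'Type': ['b', 'b', 'b', 'a', 'a', 'a', 'a', 'a']})

# Rows are IDs, columns are the unique Type values, cells are counts
counts = pd.crosstab(df['ID'], df['Type'])

# Rename 'a' -> 'CountTypeA', 'b' -> 'CountTypeB'
counts.columns = [f'CountType{c.upper()}' for c in counts.columns]

# Row total across all types, then turn ID back into a regular column
counts['TotalCount'] = counts.sum(axis=1)
counts = counts.reset_index()
print(counts)
```

This version does not need the type names to be known in advance, since the columns are derived from whatever values appear in Type.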

IndexError: too many indices for array. Numpy + Pandas DataFrame

I expect the DataFrame to output in an 'Excel' type of fashion, but instead, get the index error:
'IndexError: too many indices for array'
import numpy as np
import pandas as pd
from numpy.random import randn
rowi = ['A', 'B', 'C', 'D', 'E']
coli = ['W', 'X', 'Y', 'Z']
df = pd.DataFrame(randn[5, 4], rowi, coli) # data , index , col
print(df)
How do I solve the problem?
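randn is a function, so it has to be called with parentheses, randn(5, 4), rather than indexed with square brackets, randn[5, 4]. Is this what you want: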
df = pd.DataFrame(randn(5, 4), rowi, coli)
Out[583]:
          W         X         Y         Z
A -0.630006 -0.033165 -1.005409 -0.827504
B  0.044278  0.526636  1.082062 -1.664397
C  0.523847 -0.688798 -0.626712  0.149128
D  0.541975 -1.448316 -0.961484 -0.526547
E  0.066888  0.238089  1.180641  0.462298

pandas aggregate data from two data frames

I have two pandas data frames, with some indexes and some column names in common (like partially overlapping time-series related to common quantities).
I need to merge these two dataframes in a single one containing all the indexes and all the values for each index, keeping the values of the left (right) one in case an index-column combination appears in both data frames.
Both merge and join methods are unhelpful as the merge method will duplicate information I don't need and join causes the same problem.
What's an efficient method to obtain the result I need?
EDIT:
If for example I have the two data frames
df1 = pd.DataFrame({
'C1' : [1.1, 1.2, 1.3],
'C2' : [2.1, 2.2, 2.3],
'C3': [3.1, 3.2, 3.3]},
index=['a', 'b', 'c'])
df2 = pd.DataFrame({
'C3' : [3.1, 3.2, 33.3],
'C4' : [4.1, 4.2, 4.3]},
index=['b', 'c', 'd'])
What I need is a method that allows me to create:
merged = pd.DataFrame({
'C1': [1.1, 1.2, 1.3, 'nan'],
'C2': [2.1, 2.2, 2.3, 'nan'],
'C3': [3.1, 3.2, 3.3, 33.3],
'C4': ['nan', 4.1, 4.2, 4.3]},
index=['a', 'b', 'c', 'd'])
Here are three possibilities:
Use concat/groupby: First concatenate both DataFrames vertically. Then group by the index and select the first row in each group.
Use combine_first: Make a new index which is the union of df1 and df2. Reindex df1 using the new index. Then use combine_first to fill in NaNs with values from df2.
Use manual construction: We could use df2.index.difference(df1.index) to find exactly which rows need to be added to df1. So we could manually select those rows from df2 and concatenate them on to df1.
For small DataFrames, using_concat is faster. For larger DataFrames, using_combine_first appears to be slightly faster than the other options:
import numpy as np
import pandas as pd
import perfplot
def make_dfs(N):
    df1 = pd.DataFrame(np.random.randint(10, size=(N,2)))
    df2 = pd.DataFrame(np.random.randint(10, size=(N,2)), index=range(N//2,N//2 + N))
    return df1, df2
def using_concat(dfs):
    df1, df2 = dfs
    result = pd.concat([df1,df2], sort=False)
    n = result.index.nlevels
    return result.groupby(level=range(n)).first()
def using_combine_first(dfs):
    df1, df2 = dfs
    index = df1.index.union(df2.index)
    result = df1.reindex(index)
    result = result.combine_first(df2)
    return result
def using_manual_construction(dfs):
    df1, df2 = dfs
    index = df2.index.difference(df1.index)
    cols = df2.columns.difference(df1.columns)
    result = pd.concat([df1, df2.loc[index]], sort=False)
    result.loc[df2.index, cols] = df2
    return result
perfplot.show(
    setup=make_dfs,
    kernels=[using_concat, using_combine_first,
             using_manual_construction],
    n_range=[2**k for k in range(5,21)],
    logx=True,
    logy=True,
    xlabel='len(df)')
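Applied to the small df1 and df2 from the question, the combine_first option can be sketched as follows. This is only an illustration on the question's sample data; it relies on combine_first taking the union of both indexes and columns and preferring the caller's (left) values.

```python
import pandas as pd

df1 = pd.DataFrame({
    'C1': [1.1, 1.2, 1.3],
    'C2': [2.1, 2.2, 2.3],
    'C3': [3.1, 3.2, 3.3]},
    index=['a', 'b', 'c'])
df2 = pd.DataFrame({
    'C3': [3.1, 3.2, 33.3],
    'C4': [4.1, 4.2, 4.3]},
    index=['b', 'c', 'd'])

# Keep df1's value wherever it exists, fill the remaining cells from df2
merged = df1.combine_first(df2)
print(merged)
```

To keep the right-hand values on overlap instead, swap the roles: df2.combine_first(df1).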
Without seeing your code I can only give a generic answer:
To merge 2 dataframes use
df3 = pd.merge(df1, df2, how='right', on=['col1', 'col2'])
or
a.merge(b, how='right', on=['c1', 'c2'])
