I am trying to update df2 with columns and data from ref_df1 so that my output data frame has all of the columns ['Code', 'Place', 'Product', 'Name', 'Value'], with data pulled from the reference data frame using the Code column values as the key. I am not sure how to get to the output.
import pandas as pd
data1 = {
'Code': [1, 2, 3, 4, 5, 6],
'Name': ['Company1', 'Company2', 'Company3', 'Company4', 'Company5', 'Company6'],
'Value': [200, 300, 400, 500, 600, 700],
}
ref_df1 = pd.DataFrame(data1, columns=['Code', 'Name', 'Value'])
data2 = {
'Code': [1, 2, 1, 3, 4, 1, 6],
'Place': ['A', 'B', 'E', 'G', 'I', 'K', 'L'],
'Product': ['P11', 'P22', 'P12', 'P33', 'P44', 'P13', 'P61'],
}
df2 = pd.DataFrame(data2, columns=['Code', 'Place', 'Product'])
You can merge both data frames. With no key specified, merge joins on the columns the two frames share, which here is just Code:
df2.merge(ref_df1)
#output:
Code Place Product Name Value
0 1 A P11 Company1 200
1 1 E P12 Company1 200
2 1 K P13 Company1 200
3 2 B P22 Company2 300
4 3 G P33 Company3 400
5 4 I P44 Company4 500
6 6 L P61 Company6 700
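If you want to be explicit, you can name the join key and the join type yourself. A minimal sketch (the how='left' choice here is an assumption; it would keep rows of df2 whose Code has no match in ref_df1):

# Explicit key and join type; 'left' preserves every row of df2 even if
# its Code were missing from ref_df1 (not the case in this sample data).
out = df2.merge(ref_df1, on='Code', how='left')
print(out)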
For example, I have the DataFrame:
import pandas as pd

a = [{'name': 'A', 'col_1': 5, 'col_2': 3, 'col_3': 1.5},
{'name': 'B', 'col_1': 4, 'col_2': 2.5, 'col_3': None},
{'name': 'C', 'col_1': 8, 'col_2': None, 'col_3': None},
{'name': 'D', 'col_1': 7, 'col_2': 9, 'col_3': None}]
df = pd.DataFrame(a)
df['col_1'] = df['col_1'].fillna(0)
df['col_2'] = df['col_2'].fillna(0)
df['col_3'] = df['col_3'].fillna(0)
print(df)
I'm trying to calculate the values for the df['col_4'] column, ideally in one line of code, but maybe that is impossible.
The calculation logic is as follows. It works for df['name'] == 'A' and 'B', but not for 'C', and needs to be extended:
df['col_4'] = [i if i != 0 else x - (x * 0.75) for i, x, y in zip(df['col_3'], df['col_2'], df['col_1'])]
The calculation needs to continue: if df['col_3'] and df['col_2'] are both 0, then use y - (y * 0.75), where y comes from df['col_1'].
In the case of df['name'] == 'D', it is first necessary to compare the values in df['col_1'] and df['col_2'] and take the minimum.
I need the following result:
You could also use vectorized Pandas operations, but I kept it in the style you asked for:
df['col_4'] = [t[0] if t[0] > 0 else
               (min(t[1], t[2]) - (min(t[1], t[2]) * 0.75) if bool(t[1]) else t[2] - 0.75 * t[2])
               for t in zip(df['col_3'], df['col_2'], df['col_1'])]
Complete script for checking:
import pandas as pd
a = [{'name': 'A', 'col_1': 5, 'col_2': 3, 'col_3': 1.5},
{'name': 'B', 'col_1': 4, 'col_2': 2.5, 'col_3': None},
{'name': 'C', 'col_1': 8, 'col_2': None, 'col_3': None},
{'name': 'D', 'col_1': 7, 'col_2': 9, 'col_3': None}]
df = pd.DataFrame(a)
df['col_1'] = df['col_1'].fillna(0)
df['col_2'] = df['col_2'].fillna(0)
df['col_3'] = df['col_3'].fillna(0)
df['col_4'] = [t[0] if t[0] > 0 else
               (min(t[1], t[2]) - (min(t[1], t[2]) * 0.75) if bool(t[1]) else t[2] - 0.75 * t[2])
               for t in zip(df['col_3'], df['col_2'], df['col_1'])]
print(df)
Result
name col_1 col_2 col_3 col_4
0 A 5 3.0 1.5 1.500
1 B 4 2.5 0.0 0.625
2 C 8 0.0 0.0 2.000
3 D 7 9.0 0.0 1.750
You can simplify your logic to:
- get the min of col_1/col_2
- if col_3 is 0, use 0.25 * that min value
- else use col_3
import numpy as np

df['col_4'] = np.where(df['col_3'].eq(0),
                       df[['col_1', 'col_2']].min(axis=1).mul(0.25),
                       df['col_3'])
Output:
name col_1 col_2 col_3 col_4
0 A 5 3.0 1.5 1.500
1 B 4 2.5 0.0 0.625
2 C 8 0.0 0.0 0.000
3 D 7 9.0 0.0 1.750
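An equivalent pure-pandas formulation is possible with Series.where, which keeps a value where the condition holds and substitutes the second operand elsewhere. A sketch, avoiding the numpy import:

# Keep col_3 where it is non-zero; otherwise take 25% of the row-wise
# minimum of col_1 and col_2. Equivalent to the np.where version above.
df['col_4'] = df['col_3'].where(df['col_3'].ne(0),
                                df[['col_1', 'col_2']].min(axis=1) * 0.25)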
I have df like this:
import pandas as pd

d = {'col1': ['A', 'B', 'C', 'K', 'L', 'M'], 'col2': ['Open', 'Done', 'Open', 'Open', 'Done', 'Open'], 'col3': [1, 2, 3, 3, 1, 2]}
df = pd.DataFrame(data=d)
I'd like to iterate over col3 while the next row's value is increasing, until the same value reoccurs, and then combine the rows/columns like this:
d = {'col1': ['A', 'B', 'C', 'K', 'L', 'M'], 'col2': ['Open', 'Done', 'Open', 'Open', 'Done', 'Open'], 'col3': [1, 2, 3, 3, 1, 2], 'col4': ['B/Done;C/Open;K/Open', 'C/Open;K/Open', 'None', 'None', 'M/Open', 'None']}
df = pd.DataFrame(data=d)
I have thousands of rows, so I am trying to avoid using a for loop if possible.
I believe you can't perform this in a vectorized way.
Here is a working approach, but using a loop in a custom function:
def combine(series):
    # Accumulate ';'-joined strings over the group (skipping its first
    # element), then reverse so the longest string lands on the group's
    # first row; the group's last row gets None.
    out = []
    for s in series.iloc[1:]:
        out.append(out[-1] + ';' + s if out else s)
    out = out[::-1]
    out.append(None)
    return pd.Series(out, index=series.index)
# A repeated col3 value (diff == 0) ends a run; the double reversal
# numbers the runs from the bottom of the frame up.
group = df['col3'].diff().eq(0)[::-1].cumsum()[::-1]
# Build 'col1/col2' labels and combine them within each run.
df['col4'] = (df.assign(col=df['col1'] + '/' + df['col2'])
                .groupby(group, sort=False)['col']
                .apply(combine)
              )
Output:
col1 col2 col3 col4
0 A Open 1 B/Done;C/Open;K/Open
1 B Done 2 B/Done;C/Open
2 C Open 3 B/Done
3 K Open 3 None
4 L Done 1 M/Open
5 M Open 2 None
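To see what the grouping key looks like on this sample, a quick sketch:

# Rows 0-3 (col3 rising up to the repeated 3) form one group, rows 4-5 another.
group = df['col3'].diff().eq(0)[::-1].cumsum()[::-1]
print(group.tolist())  # [1, 1, 1, 1, 0, 0]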
Suppose I have a pandas dataframe whose last two columns each contain a list (length >= 1). The first one ("mode") holds values that should be appended to the desired header names; the second one ("res") contains the data:
import pandas as pd

df = pd.DataFrame([
{ 'c1': 850, 'c2': 'Ex', 'c3': 300.0, 'c4': 250, 'mode': [0, 1], 'res': [1.525, 1.321] },
{ 'c1': 850, 'c2': 'Ex', 'c3': 300.0, 'c4': 250, 'mode': [0, 1], 'res': [1.526, 1.311] }
])
with the result
c1 c2 c3 c4 mode res
0 850 Ex 300.0 250 [0, 1] [1.525, 1.321]
1 850 Ex 300.0 250 [0, 1] [1.526, 1.311]
Is there a better way to split the dataframe df to get this desired result
c1 c2 c3 c4 res_mode_0 res_mode_1
0 850 Ex 300.0 250 1.525 1.321
1 850 Ex 300.0 250 1.526 1.311
than using loops?
You can try the following code. The advantage is that it works regardless of the number of elements in the list.
df = pd.DataFrame([
{ 'c1': 850, 'c2': 'Ex', 'c3': 300.0, 'c4': 250, 'mode': [0, 1], 'res': [1.525, 1.321] },
{ 'c1': 850, 'c2': 'Ex', 'c3': 300.0, 'c4': 250, 'mode': [0, 1], 'res': [1.526, 1.311] }
])
split_df = pd.DataFrame(df["res"].tolist()).add_prefix("res_mode_")
df = pd.concat([df, split_df], axis=1).drop(["mode", "res"], axis=1)
Output:
c1 c2 c3 c4 res_mode_0 res_mode_1
0 850 Ex 300.0 250 1.525 1.321
1 850 Ex 300.0 250 1.526 1.311
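If the suffixes should be derived from the "mode" lists rather than assumed to be 0..n-1, a sketch along these lines could work on the original df (it assumes every row carries the same "mode" list):

# Name the new columns after the first row's 'mode' entries instead of
# hardcoding a 0..n-1 prefix; assumes all rows share the same modes.
cols = ['res_mode_{}'.format(m) for m in df.loc[0, 'mode']]
split_df = pd.DataFrame(df['res'].tolist(), columns=cols)
df = pd.concat([df, split_df], axis=1).drop(['mode', 'res'], axis=1)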
The most efficient way to do it:
pd.concat([pd.DataFrame(df.pop('your_column').values.tolist()), df], axis=1)
Unfortunately, you will have to use this on each column you need to expand.
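To expand several list columns in one pass, one option is a small loop over the same pattern (a sketch; the column names are the ones from this question):

# Expand each list column into its own prefixed columns, then drop it.
for col in ['mode', 'res']:
    expanded = pd.DataFrame(df.pop(col).values.tolist()).add_prefix(col + '_')
    df = pd.concat([df, expanded], axis=1)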
I have the following dataframe -
import numpy as np
import pandas as pd

df = pd.DataFrame({
'ID': [1, 2, 2, 3, 3, 3, 4],
'Prior': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
'Current': ['a1', 'c', 'c1', 'e', 'f', 'f1', 'g1'],
'Date': ['1/1/2019', '5/1/2019', '10/2/2019', '15/3/2019', '6/5/2019',
'7/9/2019', '16/11/2019']
})
This is my desired output -
desired_df = pd.DataFrame({
'ID': [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4],
'Prior_Current': ['a', 'a1', 'b', 'c', 'c1', 'd', 'e', 'f', 'f1', 'g',
'g1'],
'Start_Date': ['', '1/1/2019', '', '5/1/2019', '10/2/2019', '', '15/3/2019',
'6/5/2019', '7/9/2019', '', '16/11/2019'],
'End_Date': ['1/1/2019', '', '5/1/2019', '10/2/2019', '', '15/3/2019',
'6/5/2019', '7/9/2019', '', '16/11/2019', '']
})
I tried the following -
keys = ['Prior', 'Current']
df2 = (
    pd.melt(df, id_vars='ID', value_vars=keys, value_name='Prior_Current')
    .merge(df[['ID', 'Date']], how='left', on='ID')
)
df2['Start_Date'] = np.where(df2['variable'] == 'Prior', df2['Date'], '')
df2['End_Date'] = np.where(df2['variable'] == 'Current', df2['Date'], '')
df2.sort_values(['ID'], ascending=True, inplace=True)
But this does not seem to be working. Please help.
You can use stack and pivot_table:
k = df.set_index(['ID', 'Date']).stack().reset_index()
df = k.pivot_table(index=['ID', 0], columns='level_2', values='Date',
                   aggfunc=''.join, fill_value='').reset_index()
df.columns = ['ID', 'prior-current', 'start-date', 'end-date']
Output:
ID prior-current start-date end-date
0 1 a 1/1/2019
1 1 a1 1/1/2019
2 2 b 5/1/2019
3 2 c 5/1/2019 10/2/2019
4 2 c1 10/2/2019
5 3 d 15/3/2019
6 3 e 15/3/2019 6/5/2019
7 3 f 6/5/2019 7/9/2019
8 3 f1 7/9/2019
9 4 g 16/11/2019
10 4 g1 16/11/2019
Explanation:
After stack / reset_index, df will look like this:
ID Date level_2 0
0 1 1/1/2019 Prior a
1 1 1/1/2019 Current a1
2 2 5/1/2019 Prior b
3 2 5/1/2019 Current c
4 2 10/2/2019 Prior c
5 2 10/2/2019 Current c1
6 3 15/3/2019 Prior d
7 3 15/3/2019 Current e
8 3 6/5/2019 Prior e
9 3 6/5/2019 Current f
10 3 7/9/2019 Prior f
11 3 7/9/2019 Current f1
12 4 16/11/2019 Prior g
13 4 16/11/2019 Current g1
Now we can use ID and column 0 as the index, level_2 as the columns, and the Date column as the values.
Finally, we need to rename the columns to get the desired result.
My approach is to build up the target df step by step. The first step is an extension of your code using melt() and merge(). The merge is done on the 'Current' and 'Prior' columns to get the start and end dates.
df = pd.DataFrame({
'ID': [1, 2, 2, 3, 3, 3, 4],
'Prior': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
'Current': ['a1', 'c', 'c1', 'e', 'f', 'f1', 'g1'],
'Date': ['1/1/2019', '5/1/2019', '10/2/2019', '15/3/2019', '6/5/2019',
'7/9/2019', '16/11/2019']
})
df2 = (pd.melt(df, id_vars='ID', value_vars=['Prior', 'Current'], value_name='Prior_Current')
         .drop(columns='variable').drop_duplicates().sort_values('ID'))
df2 = df2.merge(df[['Current', 'Date']], how='left', left_on='Prior_Current', right_on='Current').drop(columns='Current')
df2 = df2.merge(df[['Prior', 'Date']], how='left', left_on='Prior_Current', right_on='Prior').drop(columns='Prior')
df2 = df2.fillna('').reset_index(drop=True)
df2.columns = ['ID', 'Prior_Current', 'Start_Date', 'End_Date']
An alternative way is to define a custom function to get the date, then apply it with a lambda:
def get_date(x, col):
    # Return the Date of the first row where df[col] equals x, or '' if absent.
    try:
        return df['Date'][df[col] == x].values[0]
    except IndexError:
        return ''

df2 = (pd.melt(df, id_vars='ID', value_vars=['Prior', 'Current'], value_name='Prior_Current')
         .drop(columns='variable').drop_duplicates().sort_values('ID').reset_index(drop=True))
df2['Start_Date'] = df2['Prior_Current'].apply(lambda x: get_date(x, 'Current'))
df2['End_Date'] = df2['Prior_Current'].apply(lambda x: get_date(x, 'Prior'))
Both variants produce the desired_df shown in the question.
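A quick sketch to check this (note that sort_values is not guaranteed to be stable by default, so sorting both frames the same way first makes the comparison robust):

# Compare content irrespective of the original row order.
lhs = df2.sort_values(['ID', 'Prior_Current']).reset_index(drop=True)
rhs = desired_df.sort_values(['ID', 'Prior_Current']).reset_index(drop=True)
print(lhs.equals(rhs))  # True when the two frames agree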
import pandas as pd
data = {0: {'ID': 'A', 'Qty': 1, 'Type': 'SVGA'},
1: {'ID': 'B', 'Qty': 2, 'Type': 'SVGA'},
2: {'ID': 'B', 'Qty': 2, 'Type': 'XGA'},
3: {'ID': 'C', 'Qty': 3, 'Type': 'XGA'},
4: {'ID': 'D', 'Qty': 4, 'Type': 'XGA'},
5: {'ID': 'A', 'Qty': 1, 'Type': 'LED'},
6: {'ID': 'C', 'Qty': 3, 'Type': 'LED'}}
df = pd.DataFrame.from_dict(data, orient='index')
Is it possible to transform this dataframe into a matrix that, for each pair of Types, sums Qty over the IDs they share?
Expected output:
LED SVGA XGA
LED 4 1 3
SVGA 1 3 2
XGA 3 2 9
It seems like the key here is the "ID" column: the value of each Type-Type cell depends on whether the two Types coexist for the same ID.
So, start with a self-merge on "ID". You can then pivot your result to get your matrix.
merge + crosstab
v = df.merge(df[['ID', 'Type']], on='ID')
pd.crosstab(v.Type_x, v.Type_y, v.Qty, aggfunc='sum')
Type_y LED SVGA XGA
Type_x
LED 4 1 3
SVGA 1 3 2
XGA 3 2 9
merge + pivot_table
df.merge(df[['ID', 'Type']], on='ID').pivot_table(
index='Type_x', columns='Type_y', values='Qty', aggfunc='sum'
)
Type_y LED SVGA XGA
Type_x
LED 4 1 3
SVGA 1 3 2
XGA 3 2 9
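The result is symmetric by construction, since the pair (x, y) co-occurs exactly as often as (y, x). A quick sketch to confirm:

import numpy as np

# The co-occurrence matrix should equal its own transpose.
result = df.merge(df[['ID', 'Type']], on='ID').pivot_table(
    index='Type_x', columns='Type_y', values='Qty', aggfunc='sum')
assert np.array_equal(result.to_numpy(), result.to_numpy().T)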