Python: Rows to Columns Conversion in pandas

I need help converting row answers to columns in Python. A sample dataset is given below.
Thanks for the help.
ID | date       | question_id | Choice_id | answer
1  | 2020-01-01 | 471362125   | NAN       | 29720950
2  | 2020-01-01 | 471362121   | 311470023 | 8
3  | 2020-01-01 | 471362120   | 311470024 | 9
4  | 2020-01-01 | 471362524   | 312472025 | 5
5  | 2020-01-01 | 471362122   | NAN       | Delivery Issue
Expected output:
id | date       | 471362125_NAN | 471362121_311470023 | 471362120_311470024 | 471362524_312472025 | 471362122_NAN
1  | 2020-01-01 | 29720950      | 8                   | 9                   | 5                   | Delivery Issue
I will then rename these columns to the question text using rename in pandas.

You could do it with a brute-force technique: a lot of iloc, setting the column names, appending, and resetting the index. The main idea is that the column names and the first row both come from two columns of the input, so you stitch those together horizontally:
input:
import pandas as pd
df = pd.DataFrame({'ID': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
'date': {0: '2020-01-01',
1: '2020-01-01',
2: '2020-01-01',
3: '2020-01-01',
4: '2020-01-01'},
'question_id': {0: 471362125,
1: 471362121,
2: 471362120,
3: 471362524,
4: 471362122},
'Choice_id': {0: 'NAN',
1: '311470023',
2: '311470024',
3: '312472025',
4: 'NAN'},
'answer': {0: '29720950', 1: '8', 2: '9', 3: '5', 4: 'Delivery Issue'}})
code:
df1 = df.copy()
d = df1['date'].min()
i = df1['ID'].min()
# build the new column labels from question_id and Choice_id
# (this lines up only because the sample has as many rows as columns)
df1.columns = (df1['question_id'].astype(str) + '_' + df1['Choice_id'].astype(str)).to_list()
b = df1.columns.to_list()
# transpose the answer column into a single row and relabel it
a = pd.DataFrame(df1.iloc[:, 4]).T
a.columns = b
# DataFrame.append was removed in pandas 2.0; concat does the same here
df2 = pd.concat([a, df1]).iloc[0:1]
df2 = df2.reset_index()
df2 = df2.rename({'index': 'date'}, axis=1).reset_index()
df2 = df2.rename({'index': 'id'}, axis=1)
df2['date'] = d
df2['id'] = i
df2.columns.names = ['']
df2
output:
id date 471362125_NAN 471362121_311470023 471362120_311470024 471362524_312472025 471362122_NAN
0 1 2020-01-01 29720950 8 9 5 Delivery Issue
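For comparison, a more direct sketch of the same reshape (my own variant, not part of the original answer): build the combined labels, then lay the answer column out as a single row, with no appending needed.
import pandas as pd

# the same sample frame as the input block above
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'date': ['2020-01-01'] * 5,
                   'question_id': [471362125, 471362121, 471362120, 471362524, 471362122],
                   'Choice_id': ['NAN', '311470023', '311470024', '312472025', 'NAN'],
                   'answer': ['29720950', '8', '9', '5', 'Delivery Issue']})

# one label per question/choice pair, one row holding all the answers
labels = (df['question_id'].astype(str) + '_' + df['Choice_id'].astype(str)).to_list()
out = pd.DataFrame([df['answer'].to_list()], columns=labels)
out.insert(0, 'date', df['date'].min())
out.insert(0, 'id', df['ID'].min())
print(out)
This assumes the single-date sample above; with several dates you would first group by date.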

Related

Merging two multiindex dataframes

I have 2 dataframes:
df1 = pd.DataFrame.from_dict({('category', ''): {0: 'A',
1: 'B',
2: 'C',
3: 'D',
4: 'E',
5: 'F',
6: 'G'},
(pd.Timestamp('2021-06-28 00:00:00'),
'metric_1'): {0: 4120.549999999999, 1: 11226.016666666665, 2: 25049.443333333333, 3: 18261.083333333332, 4: 2553.1208333333334, 5: 2843.01, 6: 73203.51333333334},
(pd.Timestamp('2021-06-28 00:00:00'), 'metric_2'): {0: 9907.79,
1: 7614.650000000001,
2: 13775.259999999998,
3: 13158.250000000004,
4: 1457.85,
5: 1089.5600000000002,
6: 38864.9},
(pd.Timestamp('2021-07-05 00:00:00'),
'metric_1'): {0: 5817.319999999998, 1: 10799.45, 2: 23521.51, 3: 22062.350833333334, 4: 1249.5974999999999, 5: 3229.77, 6: 52796.06083333332},
(pd.Timestamp('2021-07-05 00:00:00'), 'metric_2'): {0: 6321.21,
1: 5606.01,
2: 10239.689999999999,
3: 17476.600000000002,
4: 943.7199999999999,
5: 1410.33,
6: 29645.45}}).set_index('category')
df2 = pd.DataFrame.from_dict({'category': {0: 'A',
1: 'B',
2: 'C',
3: 'D',
4: 'E',
5: 'F',
6: 'G'},
1: {0: 36234.035577957984,
1: 69078.07089184562,
2: 128879.5397517309,
3: 178376.63536908248,
4: 9293.956915067887,
5: 8184.780211399392,
6: 177480.74540313095},
2: {0: 37887.581678419825,
1: 72243.67956241772,
2: 134803.02342121338,
3: 186603.8963173654,
4: 9716.385738295368,
5: 8555.606693927,
6: 185658.87577993725}}).set_index('category')
First I change the column names of df2 to be the same as df1:
date_mappings = {
1 : '2021-06-28',
2 : '2021-07-05'}
df2 = df2.rename(columns=date_mappings)
Then I try to merge it
f = lambda x: pd.to_datetime(x)
df = (df2.merge(df1.unstack(), left_index=True, right_index=True).sort_index(axis=1))
But I get an error:
ValueError: Cannot merge a Series without a name
What is my mistake?
My goal is to add the columns from df2 to df1 under each week, so that df1 has 3 metric columns per week instead of 2.
After using
c = [df2.columns.map(date_mappings.get), df2.columns]
df1.join(df2.set_axis(c, axis=1)).sort_index(axis=1)
I get the values appended to the end of the dataframe rather than to the same columns with the same week naming:
Maybe the issue is that df2 holds dates from 2021-06-28 to 2022-06-27 while df1 holds dates from 2020 to today.
[screenshot: the df2 values unwantedly appended to the end of the df]
The idea is to create a MultiIndex in both DataFrames:
date_mappings = {
1 : '2021-06-28',
2 : '2021-07-05'}
# create a MultiIndex in df2 with datetimes in the first level
df2.columns = pd.MultiIndex.from_product([pd.to_datetime(df2.columns.map(date_mappings)),
                                          ['metric_3']])
# remove unused levels (here 'category') so the first level can be converted to datetimes
df1.columns = df1.columns.remove_unused_levels()
df1.columns = df1.columns.set_levels(pd.to_datetime(df1.columns.levels[0]), level=0)
# join together and sort the MultiIndex
df = df1.join(df2).sort_index(axis=1)
print (df)
2021-06-28 2021-07-05 \
metric_1 metric_2 metric_3 metric_1 metric_2
category
A 4120.550000 9907.79 36234.035578 5817.320000 6321.21
B 11226.016667 7614.65 69078.070892 10799.450000 5606.01
C 25049.443333 13775.26 128879.539752 23521.510000 10239.69
D 18261.083333 13158.25 178376.635369 22062.350833 17476.60
E 2553.120833 1457.85 9293.956915 1249.597500 943.72
F 2843.010000 1089.56 8184.780211 3229.770000 1410.33
G 73203.513333 38864.90 177480.745403 52796.060833 29645.45
metric_3
category
A 37887.581678
B 72243.679562
C 134803.023421
D 186603.896317
E 9716.385738
F 8555.606694
G 185658.875780
If you need to drop datetimes greater than the maximum datetime in df1, use:
#change mapping for test
date_mappings = {
1 : '2021-06-28',
2 : '2022-07-05'}
df2.columns = pd.MultiIndex.from_product([pd.to_datetime(df2.columns.map(date_mappings)),
['metric_3']])
df1.columns = df1.columns.remove_unused_levels()
df1.columns = df1.columns.set_levels(pd.to_datetime(df1.columns.levels[0]), level=0)
df2 = df2.loc[:, df2.columns.get_level_values(0) <= df1.columns.get_level_values(0).max()]
print (df2)
2021-06-28
metric_3
category
A 36234.035578
B 69078.070892
C 128879.539752
D 178376.635369
E 9293.956915
F 8184.780211
G 177480.745403
# join together and sort the MultiIndex
df = df1.join(df2).sort_index(axis=1)
print (df)
2021-06-28 2021-07-05
metric_1 metric_2 metric_3 metric_1 metric_2
category
A 4120.550000 9907.79 36234.035578 5817.320000 6321.21
B 11226.016667 7614.65 69078.070892 10799.450000 5606.01
C 25049.443333 13775.26 128879.539752 23521.510000 10239.69
D 18261.083333 13158.25 178376.635369 22062.350833 17476.60
E 2553.120833 1457.85 9293.956915 1249.597500 943.72
F 2843.010000 1089.56 8184.780211 3229.770000 1410.33
G 73203.513333 38864.90 177480.745403 52796.060833 29645.45
Use pd.DataFrame.reindex + pd.DataFrame.join.
reindex has a convenient level parameter that lets you broadcast across index levels that are not present in the other frame.
df1.join(df2.reindex(df1.index, level=0))
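A toy sketch of what that level broadcasting does (hypothetical frames, not the ones from the question): reindex repeats the single-level rows across the missing level so the join lines up row for row.
import pandas as pd

# hypothetical: df_a has a two-level index, df_b only the first level
idx = pd.MultiIndex.from_product([['x', 'y'], [1, 2]], names=['key', 'sub'])
df_a = pd.DataFrame({'v': [1, 2, 3, 4]}, index=idx)
df_b = pd.DataFrame({'w': [10, 20]}, index=pd.Index(['x', 'y'], name='key'))

# reindex(..., level=0) broadcasts df_b's rows across the 'sub' level
print(df_a.join(df_b.reindex(df_a.index, level=0)))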
I am not sure if this is what you want, but you might need to_frame: df1.unstack() returns a Series, and merge cannot join an unnamed Series, which is exactly what the error says.
f = lambda x: pd.to_datetime(x)
df = (df2.merge(df1.unstack().to_frame(), left_index=True, right_index=True).sort_index(level=0))
print(df)

Increasing a value during merge in pandas

I have 2 dataframes
df1
product_id value name
abc 10 a
def 20 b
ggg 10 c
df2, which I get after using df2.groupby(['prod_id'])['code'].count().reset_index():
prod_id code
abc 10
def 20
ggg 10
ooo 5
hhh 1
I want to merge values from df2 to df1 left on product_id, right on prod_id.
To get:
product_id value name
abc 20 a
def 40 b
ggg 20 c
I tried:
pd.merge(df1, df2.groupby(['prod_id'])['code'].count().reset_index(),
left_on='product_id', right_on='prod_id', how='left')
This returns df1 with 2 additional columns, prod_id and code, where code holds the amount by which I would like to increase value in df1. I could add the code column to value and then drop the 2 extra columns, but I would like to avoid that.
Here’s one alternative:
df1['value'] = df1.product_id.map(dict(df2.values)).fillna(0).add(df1.value)
Complete example:
df1 = pd.DataFrame({'product_id': {0: 'abc', 1: 'def', 2: 'ggg'},
'value': {0: 10, 1: 20, 2: 10},
'name': {0: 'a', 1: 'b', 2: 'c'}})
df2 = pd.DataFrame({'prod_id': {0: 'abc', 1: 'def', 2: 'ggg', 3: 'ooo', 4: 'hhh'},
'code': {0: 10, 1: 20, 2: 10, 3: 5, 4: 1}})
df1['value'] = df1.product_id.map(dict(df2.values)).fillna(0).add(df1.value)
OUTPUT:
product_id value name
0 abc 20 a
1 def 40 b
2 ggg 20 c
You could use reindex on df2 with the order of df1's product_id, after the groupby count (without the reset_index), like:
df1['value'] += (
df2.groupby(['prod_id'])
['code'].count()
.reindex(df1['product_id'], fill_value=0)
.to_numpy()
)
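If you would rather stay with pd.merge as in your attempt, a minimal sketch (my own variant, using the same frames): add the matched code into value, then drop the helper columns in one step. Note that dict(df2.values) in the map answer works only because df2 has exactly two columns, so its rows form key/value pairs.
import pandas as pd

df1 = pd.DataFrame({'product_id': ['abc', 'def', 'ggg'],
                    'value': [10, 20, 10],
                    'name': ['a', 'b', 'c']})
df2 = pd.DataFrame({'prod_id': ['abc', 'def', 'ggg', 'ooo', 'hhh'],
                    'code': [10, 20, 10, 5, 1]})

# left-merge, add code into value (0 where there is no match), drop the helpers
merged = df1.merge(df2, left_on='product_id', right_on='prod_id', how='left')
merged['value'] += merged['code'].fillna(0).astype(int)
df1 = merged.drop(columns=['prod_id', 'code'])
print(df1)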

Compare two columns in two dataframes with a condition on another column

I have a multilevel dataframe and I want to compare the values in the secret column, with a condition on the group column. If group is A, we allow the value in the other dataframe to be empty or na. Otherwise, output INVALID for the mismatching ones.
multilevel dataframe:
name secret group
df1 df2 df1 df2 df1 df2
id
1 Tim Tim random na A A
2 Tom Tom tree A A
3 Alex Alex apple apple B B
4 May May file cheese C C
expected output for secret
id name secret group
1 Tim na A
2 Tom A
3 Alex apple B
4 May INVALID C
so far I have:
result_df['result'] = multilevel_df.groupby(level=0, axis=0).apply(lambda x: secret_check(x))
# take care of the rest by comparing column by column
result_df = multilevel_df.groupby(level=0, axis=1).apply(lambda x: validate(x))

def validate(x):
    if x[0] == x[1]:
        return x[1]
    else:
        return 'INVALID'

def secret_check(x):
    if (x['group'] == 'A') and pd.isnull(x['secret']):  # this line is off
        return x[1]
    elif x[0] == x[1]:
        return x[1]
    else:
        return 'INVALID'
Assuming we have the following dataframe:
df = pd.DataFrame({0: {0: 1, 1: 2, 2: 3, 3: 4},
1: {0: 'Tim', 1: 'Tom', 2: 'Alex', 3: 'May'},
2: {0: 'Tim', 1: 'Tom', 2: 'Alex', 3: 'May'},
3: {0: 'random', 1: 'tree', 2: 'apple', 3: 'file'},
4: {0: 'na', 1: '', 2: 'apple', 3: 'cheese'},
5: {0: 'A', 1: 'A', 2: 'B', 3: 'C'},
6: {0: 'A', 1: 'A', 2: 'B', 3: 'C'}})
df
df.columns = pd.MultiIndex.from_tuples([('id',''), ('name', 'df1'), ('name', 'df2'),
('secret', 'df1'), ('secret', 'df2'), ('group', 'df1'), ('group', 'df2')])
df
In[1]:
id name secret group
df1 df2 df1 df2 df1 df2
0 1 Tim Tim random na A A
1 2 Tom Tom tree A A
2 3 Alex Alex apple apple B B
3 4 May May file cheese C C
You can use np.select() to return results based on conditions, .droplevel() to flatten the MultiIndex, and df.loc[:, ~df.columns.duplicated()] to drop duplicate columns. Since we are writing the result into the df1 columns, the df2 columns are no longer needed.
import numpy as np

df[('secret', 'df1')] = np.select([(df[('group', 'df2')] != 'A') &
                                   (df[('secret', 'df1')] != df[('secret', 'df2')])],  # condition 1
                                  [df[('secret', 'df1')] + ' > ' + df[('secret', 'df2')]],  # result 1
                                  df[('secret', 'df2')])  # alternative if conditions are not met
df.columns = df.columns.droplevel(1)
df = df.loc[:,~df.columns.duplicated()]
df
Out[1]:
id name secret group
0 1 Tim na A
1 2 Tom A
2 3 Alex apple B
3 4 May file > cheese C
If I understand you right, you want to mark "secret" in df2 as invalid if the secrets in df1 and df2 differ and the group is not A. There you go:
condition = (df[('secret', 'df1')] != df[('secret', 'df2')]) & \
            (df[('group', 'df1')] != 'A')
df.loc[condition, ('secret', 'df2')] = 'INVALID'
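For the exact expected output from the question ('INVALID' rather than a diff string), a minimal sketch combining both answers, assuming the MultiIndex frame built above:
import numpy as np
import pandas as pd

# same frame as in the first answer; tuple keys build the MultiIndex columns
df = pd.DataFrame({('id', ''): [1, 2, 3, 4],
                   ('name', 'df1'): ['Tim', 'Tom', 'Alex', 'May'],
                   ('name', 'df2'): ['Tim', 'Tom', 'Alex', 'May'],
                   ('secret', 'df1'): ['random', 'tree', 'apple', 'file'],
                   ('secret', 'df2'): ['na', '', 'apple', 'cheese'],
                   ('group', 'df1'): ['A', 'A', 'B', 'C'],
                   ('group', 'df2'): ['A', 'A', 'B', 'C']})

# a row is valid if the secrets match or the group is A (empty/na allowed)
valid = (df[('secret', 'df1')] == df[('secret', 'df2')]) | (df[('group', 'df1')] == 'A')
df[('secret', 'df1')] = np.where(valid, df[('secret', 'df2')], 'INVALID')
df.columns = df.columns.droplevel(1)
df = df.loc[:, ~df.columns.duplicated()]
print(df)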

Merging multiple dataframe lines into aggregate lines

For the following dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': {0: "A", 1: "A", 2: "A", 3: "B"},
                   'Spec1': {0: '1', 1: '3', 2: '5', 3: '1'},
                   'Spec2': {0: '2a', 1: np.nan, 2: np.nan, 3: np.nan}
                  }, columns=['Name', 'Spec1', 'Spec2'])
Name Spec1 Spec2
0 A 1 2a
1 A 3 NaN
2 A 5 NaN
3 B 1 NaN
I would like to aggregate the columns into:
Name Spec
0 A 1,3,5,2a
1 B 1
Is there a more "pandas" way of doing this than just looping and keeping track of the values?
Or using melt
(df.melt('Name')
   .groupby('Name')['value']
   .apply(lambda x: ','.join(x.dropna()))
   .reset_index()
   .rename(columns={'value': 'spec'}))
Out[2226]:
Name spec
0 A 1,3,5,2a
1 B 1
Another way
In [966]: (df.set_index('Name').unstack()
.dropna().reset_index()
.groupby('Name')[0].apply(','.join))
Out[966]:
Name
A 1,3,5,2a
B 1
Name: 0, dtype: object
Group rows by name, combine column values as a list, dropping NaN:
df = df.groupby('Name').agg(lambda x: list(x.dropna()))
Spec1 Spec2
Name
A [1, 3, 5] [2a]
B [1] []
Now merge Spec1 and Spec2 lists. Bring Name back as a column. Name the new Spec column.
df = (df.Spec1 + df.Spec2).reset_index().rename(columns={0:"Spec"})
Name Spec
0 A [1, 3, 5, 2a]
1 B [1]
Finally, convert Spec lists to string representations:
df.Spec = df.Spec.apply(','.join)
Name Spec
0 A 1,3,5,2a
1 B 1

How to turn pandas data frame w/ row per unique term & date into df which contains 1 row per date, w/ unique terms + their values as columns?

I have a CSV, or data frame, that looks something like this, but with several hundred thousand rows:
df = {'Date': {0: '2014-01-01',
1: '2014-01-01',
2: '2014-01-01',
3: '2014-01-02',
4: '2014-01-02'},
'Name': {0: 'John',
1: 'John',
2: 'Rob',
3: 'Mel',
4: 'Rob'},
'Rank': {0: 1, 1: 3, 2: 2, 3: 5, 4: 6},
'Count': {0: 10, 1: 3, 2: 9, 3: 11, 4: 4}}
The names recur for each of the dates, but change in count and rank. Instead of having one row per name for every date, as I do now, I'd like to arrange my data frame so that there is one row per date. That is to say, I'd like my table to look like this:
Date John_Rank Rob_Rank Mel_Rank John_Count Mel_Count Rob_Count
2014-01-01 ... ... ... ... ...
2014-01-02 ... ... ... ... ...
I'd like to use this format to calculate the differences in ranks. I've come up against this several times before, but haven't had this many rows to deal with for a long stretch of dates — I've only done this manually up until now. Any advice would be very much appreciated!!
I think you can use pivot_table with default aggfunc='mean':
import pandas as pd
d = {'Date': {0: '2014-01-01',
1: '2014-01-01',
2: '2014-01-01',
3: '2014-01-02',
4: '2014-01-02'},
'Name': {0: 'John',
1: 'John',
2: 'Rob',
3: 'Mel',
4: 'Rob'},
'Rank': {0: 1, 1: 3, 2: 2, 3: 5, 4: 6},
'Count': {0: 10, 1: 3, 2: 9, 3: 11, 4: 4}}
df = pd.DataFrame(d)
print(df)
Count Date Name Rank
0 10 2014-01-01 John 1
1 3 2014-01-01 John 3
2 9 2014-01-01 Rob 2
3 11 2014-01-02 Mel 5
4 4 2014-01-02 Rob 6
df = pd.pivot_table(df, index='Date', columns='Name')
df.columns = ['_'.join(col).strip() for col in df.columns.values]
print(df)
Count_John Count_Mel Count_Rob Rank_John Rank_Mel Rank_Rob
Date
2014-01-01 6.5 NaN 9 2 NaN 2
2014-01-02 NaN 11 4 NaN 5 6
Or, if you want to swap the levels of the MultiIndex in the columns:
df = pd.pivot_table(df, index='Date', columns='Name')
df.columns = df.columns.swaplevel(0,1)
df.columns = ['_'.join(col).strip() for col in df.columns.values]
print(df)
John_Count Mel_Count Rob_Count John_Rank Mel_Rank Rob_Rank
Date
2014-01-01 6.5 NaN 9 2 NaN 2
2014-01-02 NaN 11 4 NaN 5 6
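Since the stated goal was computing differences in ranks, a short follow-on sketch (continuing from the swapped-level frame just above):
# day-over-day rank changes per name; NaN where a name is absent on a date
print(df.filter(regex='_Rank$').diff())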
