Pandas lookup from one of multiple columns, based on value - python

I have the following DataFrame:
Date best a b c d
1990 a 5 4 7 2
1991 c 10 1 2 0
1992 d 2 1 4 12
1993 a 5 8 11 6
I would like to make a dataframe as follows:
Date best value
1990 a 5
1991 c 2
1992 d 12
1993 a 5
So I am looking to find a value based on another row value by using column names. For instance, the value for 1990 in the second df should lookup "a" from the first df and the second row should lookup "c" (=2) from the first df.
Any ideas?

There is a built-in lookup function that can handle this type of situation (it looks up by row/column). I don't know how optimized it is, but it may be faster than the apply solution.
In [9]: df['value'] = df.lookup(df.index, df['best'])
In [10]: df
Out[10]:
Date best a b c d value
0 1990 a 5 4 7 2 5
1 1991 c 10 1 2 0 2
2 1992 d 2 1 4 12 12
3 1993 a 5 8 11 6 5

You can create a lookup function and apply it to your dataframe row-wise; this isn't very efficient for large dfs, though.
In [245]:
def lookup(x):
    return x[x.best]
df['value'] = df.apply(lookup, axis=1)
df
Out[245]:
Date best a b c d value
0 1990 a 5 4 7 2 5
1 1991 c 10 1 2 0 2
2 1992 d 2 1 4 12 12
3 1993 a 5 8 11 6 5

You can do this using np.where as shown below. I think it will be more efficient:
import numpy as np
import pandas as pd
df = pd.DataFrame([['1990', 'a', 5, 4, 7, 2],
                   ['1991', 'c', 10, 1, 2, 0],
                   ['1992', 'd', 2, 1, 4, 12],
                   ['1993', 'a', 5, 8, 11, 6]],
                  columns=('Date', 'best', 'a', 'b', 'c', 'd'))
arr = df.best.values
cols = df.columns[2:]
for col in cols:
    arr2 = df[col].values
    arr = np.where(arr == col, arr2, arr)
df.drop(columns=cols, inplace=True)
df["values"] = arr
df
Result
Date best values
0 1990 a 5
1 1991 c 2
2 1992 d 12
3 1993 a 5

lookup is deprecated since version 1.2.0. With melt you can 'unpivot' columns to the row axis; by default the column names are stored in a column named variable and their values in value. query returns only the rows where the columns best and variable are equal. drop and sort_values are used to match your requested format.
df_new = (
    df.melt(id_vars=['Date', 'best'], value_vars=['a', 'b', 'c', 'd'])
    .query('best == variable')
    .drop('variable', axis=1)
    .sort_values('Date')
)
Output:
Date best value
0 1990 a 5
9 1991 c 2
14 1992 d 12
3 1993 a 5

A simple solution that uses a mapper dictionary:
vals = df[['a','b','c','d']].to_dict('list')
mapper = {k: vals[v][k] for k,v in zip(df.index, df['best'])}
df['value'] = df.index.map(mapper).to_numpy()
Output:
Date best a b c d value
0 1990 a 5 4 7 2 5
1 1991 c 10 1 2 0 2
2 1992 d 2 1 4 12 12
3 1993 a 5 8 11 6 5

Look up values by row positions and column labels with factorize, because DataFrame.lookup is deprecated since version 1.2.0:
import numpy as np

idx, cols = pd.factorize(df['best'])
df['value'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
print (df)
Date best a b c d value
0 1990 a 5 4 7 2 5
1 1991 c 10 1 2 0 2
2 1992 d 2 1 4 12 12
3 1993 a 5 8 11 6 5
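A similar positional lookup can also be written with Index.get_indexer; a minimal sketch, assuming every label in best appears among the value columns:
import numpy as np

cols = ['a', 'b', 'c', 'd']
pos = pd.Index(cols).get_indexer(df['best'])  # column position of each row's 'best' label
df['value'] = df[cols].to_numpy()[np.arange(len(df)), pos]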


Converting columns to rows in Pandas with a secondary index value

Following up on my previous question:
import pandas as pd
d = pd.DataFrame({'value':['a', 'b'],'2019Q1':[1, 5], '2019Q2':[2, 6], '2019Q3':[3, 7]})
which displays like this:
value 2019Q1 2019Q2 2019Q3
0 a 1 2 3
1 b 5 6 7
How can I transform it into this shape:
Year measure Quarter Value
2019 a 1 1
2019 a 2 2
2019 a 3 3
2019 b 1 5
2019 b 2 6
2019 b 3 7
Use pd.wide_to_long with DataFrame.melt:
df2 = d.copy()
df2.columns = d.columns.str.split('Q').str[::-1].str.join('_')
new_df = (pd.wide_to_long(df2.rename(columns={'value': 'Measure'}),
                          ['1', '2', '3'],
                          j='Year',
                          i='Measure',
                          sep='_')
          .reset_index()
          .melt(['Measure', 'Year'], var_name='Quarter', value_name='Value')
          .loc[:, ['Year', 'Measure', 'Quarter', 'Value']]
          .sort_values(['Year', 'Measure', 'Quarter']))
print(new_df)
Year Measure Quarter Value
0 2019 a 1 1
2 2019 a 2 2
4 2019 a 3 3
1 2019 b 1 5
3 2019 b 2 6
5 2019 b 3 7
This is just an addition for future visitors: when you split columns with expand=True, you get a MultiIndex. This allows reshaping using the stack method.
# set the value column as index
d = d.set_index('value')
# split columns and convert to a MultiIndex
d.columns = d.columns.str.split('Q', expand=True)
# reshape the dataframe
d.stack([0, 1]).rename_axis(['measure', 'year', 'quarter']).reset_index(name='Value')
measure year quarter Value
0 a 2019 1 1
1 a 2019 2 2
2 a 2019 3 3
3 b 2019 1 5
4 b 2019 2 6
5 b 2019 3 7
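For completeness, a sketch of the same reshape using plain melt plus a string split, starting again from the original d (column names are assumed to follow the YYYYQn pattern):
out = d.melt(id_vars='value', var_name='period', value_name='Value')
out[['Year', 'Quarter']] = out['period'].str.split('Q', expand=True)
out = (out.rename(columns={'value': 'Measure'})
          [['Year', 'Measure', 'Quarter', 'Value']]
          .sort_values(['Year', 'Measure', 'Quarter']))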

Extrapolate time series data based on Start and end values, using Python?

I have an Excel sheet with values representing the start and end times of time series data, as shown below. Times are in seconds.
Start_Time  End_Time  Value
0           2         A
2           3         B
3           9         A
9           11        C
I want to extrapolate the values between Start_Time and End_Time and display the value for each second.
Time  Value
0     A
1     A
2     A
3     B
4     A
5     A
6     A
7     A
8     A
9     A
10    C
11    C
Any help to implement it in Python will be appreciated. Thanks.
Setup
You should easily find how to read your Excel sheet with pandas, and the options depend on the file itself, so I won't cover that part here.
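For reference, a minimal read might look like the line below; the filename is hypothetical, and options such as sheet_name depend on your file.
df = pd.read_excel('timeseries.xlsx')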
Below is the reproduction of your sample dataframe, used for the example.
import pandas as pd
df = pd.DataFrame({'Start_Time': [0, 2, 3, 9],
                   'End_Time': [2, 3, 9, 11],
                   'Value': ['A', 'B', 'A', 'C']})
>>> df
Out[]:
End_Time Start_Time Value
0 2 0 A
1 3 2 B
2 9 3 A
3 11 9 C
Solution
(pd.Series(range(df.End_Time.max() + 1), name='Value')  # Create a series over the whole range
 .map(df.set_index('End_Time').Value)                   # Set values from "df"
 .bfill()                                               # Backward-fill NaN values
 .rename_axis('Time'))                                  # Purely cosmetic axis rename
Out[]:
Time
0 A
1 A
2 A
3 B
4 A
5 A
6 A
7 A
8 A
9 A
10 C
11 C
Name: Value, dtype: object
Walkthrough
Create the whole "Time" range
s = pd.Series(range(df.End_Time.max() + 1))
>>> s
Out[]:
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
dtype: int32
Use "End_Time" as index for df
>>> df.set_index('End_Time')
Out[]:
Start_Time Value
End_Time
2 0 A
3 2 B
9 3 A
11 9 C
Map df values to corresponding "End_Time" values from s
s = s.map(df.set_index('End_Time').Value)
>>> s
Out[]:
0 NaN
1 NaN
2 A
3 B
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 A
10 NaN
11 C
dtype: object
Backward-fill the NaN values
s = s.bfill()
>>> s
Out[]:
0 A
1 A
2 A
3 B
4 A
5 A
6 A
7 A
8 A
9 A
10 C
11 C
dtype: object
Then rename_axis('Time') only renames the series axis to match your desired output.
Note that this works here because your data excludes Start_Time: each Value applies up to and including its End_Time.
If instead your data includes Start_Time (i.e., a Value really starts at Start_Time, which is more common), you should change End_Time to Start_Time and bfill() to ffill() (forward-fill), as sketched below.
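A minimal sketch of that inclusive-Start_Time variant, reusing the df from the setup above:
(pd.Series(range(df.End_Time.max() + 1), name='Value')
 .map(df.set_index('Start_Time').Value)  # values placed at their Start_Time
 .ffill()                                # forward-fill up to the next start
 .rename_axis('Time'))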

Sum pandas dataframe column values based on condition of column name

I have a DataFrame with column names of the form x.y, where I would like to sum up all columns with the same value of x without having to name them explicitly. That is, the value of column_name.split(".")[0] should determine their group. Here's an example:
import pandas as pd
df = pd.DataFrame({'x.1': [1,2,3,4], 'x.2': [5,4,3,2], 'y.8': [19,2,1,3], 'y.92': [10,9,2,4]})
df
Out[3]:
x.1 x.2 y.8 y.92
0 1 5 19 10
1 2 4 2 9
2 3 3 1 2
3 4 2 3 4
The result should be the same as this operation, only I shouldn't have to explicitly list the column names and how they should group.
pd.DataFrame({'x': df[['x.1', 'x.2']].sum(axis=1), 'y': df[['y.8', 'y.92']].sum(axis=1)})
x y
0 6 29
1 6 11
2 6 3
3 6 7
Another option: you can extract the prefix from the column names and use it as a group variable:
df.groupby(by = df.columns.str.split('.').str[0], axis = 1).sum()
# x y
#0 6 29
#1 6 11
#2 6 3
#3 6 7
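Note that the axis=1 keyword in groupby is deprecated in recent pandas versions (2.1+); a minimal sketch of an equivalent that transposes, groups by the prefix, and transposes back:
df.T.groupby(df.columns.str.split('.').str[0]).sum().T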
You can first create a MultiIndex by splitting the column names, then group by the first level and aggregate with sum:
df.columns = df.columns.str.split('.', expand=True)
print (df)
   x      y
   1  2   8  92
0  1  5  19  10
1  2  4   2   9
2  3  3   1   2
3  4  2   3   4
df = df.groupby(axis=1, level=0).sum()
print (df)
x y
0 6 29
1 6 11
2 6 3
3 6 7

Adding a column to pandas data frame fills it with NA

I have this pandas dataframe:
SourceDomain 1 2 3
0 www.theguardian.com profile.theguardian.com 1 Directed
1 www.theguardian.com membership.theguardian.com 2 Directed
2 www.theguardian.com subscribe.theguardian.com 3 Directed
3 www.theguardian.com www.google.co.uk 4 Directed
4 www.theguardian.com jobs.theguardian.com 5 Directed
I would like to add a new column which is a pandas series created like this:
Weights = Weights.value_counts()
However, when I try to add the new column using edgesFile[4] = Weights it fills it with NA instead of the values:
SourceDomain 1 2 3 4
0 www.theguardian.com profile.theguardian.com 1 Directed NaN
1 www.theguardian.com membership.theguardian.com 2 Directed NaN
2 www.theguardian.com subscribe.theguardian.com 3 Directed NaN
3 www.theguardian.com www.google.co.uk 4 Directed NaN
4 www.theguardian.com jobs.theguardian.com 5 Directed NaN
How can I add the new column keeping the values?
Thanks!
Dani
You are getting NaNs because the index of Weights does not match up with the index of edgesFile. If you want Pandas to ignore Weights.index and just paste the values in order then pass the underlying NumPy array instead:
edgesFile[4] = Weights.values
Here is an example which demonstrates the difference:
In [14]: df = pd.DataFrame(np.arange(4)*10, index=list('ABCD'))
In [15]: df
Out[15]:
0
A 0
B 10
C 20
D 30
In [16]: s = pd.Series(np.arange(4), index=list('CDEF'))
In [17]: s
Out[17]:
C 0
D 1
E 2
F 3
dtype: int64
Here we see Pandas aligning the index:
In [18]: df[4] = s
In [19]: df
Out[19]:
0 4
A 0 NaN
B 10 NaN
C 20 0
D 30 1
Here, Pandas simply pastes the values in s into the column:
In [20]: df[4] = s.values
In [21]: df
Out[21]:
0 4
A 0 0
B 10 1
C 20 2
D 30 3
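A small aside: .values still works, but .to_numpy() is the spelling the pandas docs now recommend for getting the underlying array:
edgesFile[4] = Weights.to_numpy()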
This is a small example based on your question:
You can add a new column with a column name to an existing DataFrame:
>>> df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['A', 'B', 'C'])
>>> df
A B C
0 1 2 3
1 4 5 6
>>> s = pd.Series([7,8,9])
>>> s
0 7
1 8
2 9
>>> df['D']=s
>>> df
A B C D
0 1 2 3 7
1 4 5 6 8
Or, you can make a DataFrame from the Series and then concat:
>>> df = pd.DataFrame([[1,2,3],[4,5,6]])
>>> df
0 1 2
0 1 2 3
1 4 5 6
>>> s = pd.DataFrame(pd.Series([7,8]))  # if you don't provide a column name, the default name will be 0
>>> s
0
0 7
1 8
>>> df = pd.concat([df,s], axis=1)
>>> df
0 1 2 0
0 1 2 3 7
1 4 5 6 8
Hope this helps.

Duplicating a Pandas DF N times

So right now, if I multiply a list, i.e. x = [1,2,3] * 2, I get x as [1,2,3,1,2,3]. But this doesn't work with Pandas.
So if I want to duplicate a pandas DF, I have to make a column a list and multiply:
col_x_duplicates = list(df['col_x'])*N
new_df = DataFrame(col_x_duplicates, columns=['col_x'])
Then do a join on the original data:
pd.merge(new_df, df, on='col_x', how='left')
This now duplicates the pandas DF N times. Is there an easier, or even quicker, way?
Actually, since you want to duplicate the entire dataframe (and not each element), numpy.tile() may be better:
In [68]: import numpy as np
In [69]: import pandas as pd
In [70]: arr = np.array([[1, 2, 3], [4, 5, 6]])
In [71]: arr
Out[71]:
array([[1, 2, 3],
[4, 5, 6]])
In [72]: df = pd.DataFrame(np.tile(arr, (5, 1)))
In [73]: df
Out[73]:
0 1 2
0 1 2 3
1 4 5 6
2 1 2 3
3 4 5 6
4 1 2 3
5 4 5 6
6 1 2 3
7 4 5 6
8 1 2 3
9 4 5 6
[10 rows x 3 columns]
In [75]: df = pd.DataFrame(np.tile(arr, (1, 3)))
In [76]: df
Out[76]:
0 1 2 3 4 5 6 7 8
0 1 2 3 1 2 3 1 2 3
1 4 5 6 4 5 6 4 5 6
[2 rows x 9 columns]
Here is a one-liner to make a DataFrame with n copies of DataFrame df
n_df = pd.concat([df] * n)
Example:
df = pd.DataFrame(
    data=[[34, 'null', 'mark'], [22, 'null', 'mark'], [34, 'null', 'mark']],
    columns=['id', 'temp', 'name'],
    index=pd.Index([1, 2, 3], name='row')
)
n = 4
n_df = pd.concat([df] * n)
Then n_df is the following DataFrame:
id temp name
row
1 34 null mark
2 22 null mark
3 34 null mark
1 34 null mark
2 22 null mark
3 34 null mark
1 34 null mark
2 22 null mark
3 34 null mark
1 34 null mark
2 22 null mark
3 34 null mark
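A follow-up note: if you would rather not keep the repeated row labels, pd.concat accepts ignore_index=True:
n_df = pd.concat([df] * n, ignore_index=True)  # fresh RangeIndex instead of repeated 'row' labels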
