Generating a new variable based on the values of other variables - python

I have the following data set
import pandas as pd
df = pd.DataFrame({"ID": [1,1,1,1,1,2,2,2,2,2],
"TP1": [1,2,3,4,5,9,8,7,6,5],
"TP2": [11,22,32,43,53,94,85,76,66,58],
"TP10": [114,222,324,443,535,94,385,76,266,548],
"count": [1,2,3,4,10,1,2,3,4,10]})
print (df)
I want a "Final" variable in the df that will be based on the ID, TP and count variable.
The final result will look like following.
import pandas as pd
import numpy as np
df = pd.DataFrame({"ID": [1,1,1,1,1,2,2,2,2,2], "TP1": [1,2,3,4,5,9,8,7,6,5],
"TP2": [11,22,32,43,53,94,85,76,66,58], "TP10": [114,222,324,443,535,94,385,76,266,548],
"count": [1,2,3,4,10,1,2,3,4,10],
"final" : [1,22,np.nan,np.nan,535,9,85,np.nan,np.nan,548]})
print (df)
So for example, the loop of if will do the following
It will look at the ID
Then for 1st ID it should look at value of count, if the value of count is 1
Then if should look at the variable TP1 and its 1st value should be placed in "final" variable.
The look will then look at count 2 for ID 1 and the value of TP2 should come in the "final" variable and so on.
I hope my question is clear. I am looking for a loop because there are 1000 TP variables in the original dataset.
I tried to make a code something like the following but it is utterly rubbish.
for col in df.columns:
if col.startswith('TP') and count == int(col[2:])
df["Final"] = count
Thanks

If my understanding is correct, if count=1 then pick TP1, if count=2 then pick TP2 etc.
This can be done with numpy.select(). Note that I have added the condition if f"TP{x}" in df.columns because not all columns TP1, TP2, TP3, ... TP10 are available in the dataframe. If all are available in your actual dataframe then this if statement is not required.
import numpy as np
conds = [df["count"] == x for x in range(1,11) if f"TP{x}" in df.columns]
output = [df[f"TP{x}"] for x in range(1,11) if f"TP{x}" in df.columns]
df["final"] = np.select(conds, output, np.nan)
print(df)
Output:
ID TP1 TP2 TP10 count final
0 1 1 11 114 1 1.0
1 1 2 22 222 2 22.0
2 1 3 32 324 3 NaN
3 1 4 43 443 4 NaN
4 1 5 53 535 10 535.0
5 2 9 94 94 1 9.0
6 2 8 85 385 2 85.0
7 2 7 76 76 3 NaN
8 2 6 66 266 4 NaN
9 2 5 58 548 10 548.0

Related

Convert data resulted a function into data frame

I have run a function code on python to get output elevation using a code as given below. But I would like to convert the results into a data frame.
depth() #function
Results: 0 49 2
1 50 2.5
2 52 3
3 53 3.5
4 54 4
......
.......
100 102 9
I am facing problems to turn these results into a data frame. I used following codes, but didn't work.
df = pd.DataFrame(columns = ['id', 'Z', 'water level'])
df = df.apply(water_depth())
print(df)
IIUC, you can try:
import pandas as pd
from io import StringIO
data = StringIO("""0 49 2
1 50 2.5
2 52 3
3 53 3.5
4 54 4
100 102 9
""")
df = pd.read_csv(data, sep=' ', header=None)
df.columns = ['id', 'Z', 'water level']
print(df)
Output:
id Z water level
0 0 49 2.0
1 1 50 2.5
2 2 52 3.0
3 3 53 3.5
4 4 54 4.0
5 100 102 9.0
When you have the data saved in a file.csv, you can replace data with 'file.csv'.
The way I'd do it:
dictionary = {
'water_level':[x/10 for x in range(20,91,5)], #example
'Z':[] # generate values of z
'id':[] # generate values of id
}
If you have some special function to generate the values, just create three lists and then create the dictionary.
After that:
pd.DataFrame(dictionary)

Merging dataframes with multiple key columns

I'd like to merge this dataframe:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([[1,10,100],[2,20,np.nan],[3,30,300]], columns=["A","B","C"])
df1
A B C
0 1 10 100
1 2 20 NaN
2 3 30 300
with this one:
df2 = pd.DataFrame([[1,422],[10,72],[2,278],[300,198]], columns=["ID","Value"])
df2
ID Value
0 1 422
1 10 72
2 2 278
3 300 198
to get an output:
df_output = pd.DataFrame([[1,10,100,422],[1,10,100,72],[2,20,200,278],[3,30,300,198]], columns=["A","B","C","Value"])
df_output
A B C Value
0 1 10 100 422
1 1 10 100 72
2 2 20 NaN 278
3 3 30 300 198
The idea is that for df2 the key column is "ID", while for df1 we have 3 possible key columns ["A","B","C"].
Please notice that the numbers in df2 are chosen to be like this for simplicity, and they can include random numbers in practice.
How do I perform such a merge? Thanks!
IIUC, you need a double merge/join.
First, melt df1 to get a single column, while keeping the index. Then merge to get the matches. Finally join to the original DataFrame.
s = (df1
.reset_index().melt(id_vars='index')
.merge(df2, left_on='value', right_on='ID')
.set_index('index')['Value']
)
# index
# 0 422
# 1 278
# 0 72
# 2 198
# Name: Value, dtype: int64
df_output = df1.join(s)
output:
A B C Value
0 1 10 100.0 422
0 1 10 100.0 72
1 2 20 NaN 278
2 3 30 300.0 198
Alternative with stack + map:
s = df1.stack().droplevel(1).map(df2.set_index('ID')['Value']).dropna()
df_output = df1.join(s.rename('Value'))

How to get dataframe of unique ids

I'm trying to group the following dataframe by unique binId and then parse the resulting rows based of 'z' and pick the row with highest value of 'z'. Here is my dataframe.
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3','4','5','6'], 'binId': ['1','2','2','1','1','3'], 'x':[1,4,5,6,3,4], 'y':[11,24,35,16,23,34],'z':[1,4,5,2,3,4]})
`
I tried following code which gives required answer,
def f(x):
tp = df[df['binId'] == x][['binId','ID','x','y','z']].sort_values(by='z', ascending=False).iloc[0]
return tp`
and then,
binids= pd.Series(df.binId.unique())
print binids.apply(f)
The output is,
binId ID x y z
0 1 5 3 23 3
1 2 3 5 35 5
2 3 6 4 34 4
But the execution is too slow. What is the faster way of doing this?
Use idxmax for indices of max and select by loc:
df1 = df.loc[df.groupby('binId')['z'].idxmax()]
Or faster is use sort_values with drop_duplicates:
df1 = df.sort_values(['binId', 'z']).drop_duplicates('binId', keep='last')
print (df1)
ID binId x y z
4 5 1 3 23 3
2 3 2 5 35 5
5 6 3 4 34 4

python pandas select both head and tail

For a DataFrame in Pandas, how can I select both the first 5 values and last 5 values?
For example
In [11]: df
Out[11]:
A B C
2012-11-29 0 0 0
2012-11-30 1 1 1
2012-12-01 2 2 2
2012-12-02 3 3 3
2012-12-03 4 4 4
2012-12-04 5 5 5
2012-12-05 6 6 6
2012-12-06 7 7 7
2012-12-07 8 8 8
2012-12-08 9 9 9
How to show the first two and the last two rows?
You can use iloc with numpy.r_:
print (np.r_[0:2, -2:0])
[ 0 1 -2 -1]
df = df.iloc[np.r_[0:2, -2:0]]
print (df)
A B C
2012-11-29 0 0 0
2012-11-30 1 1 1
2012-12-07 8 8 8
2012-12-08 9 9 9
df = df.iloc[np.r_[0:4, -4:0]]
print (df)
A B C
2012-11-29 0 0 0
2012-11-30 1 1 1
2012-12-01 2 2 2
2012-12-02 3 3 3
2012-12-05 6 6 6
2012-12-06 7 7 7
2012-12-07 8 8 8
2012-12-08 9 9 9
You can use df.head(5) and df.tail(5) to get first five and last five.
Optionally you can create new data frame and append() head and tail:
new_df = df.tail(5)
new_df = new_df.append(df.head(5))
Not quite the same question but if you just want to show the top / bottom 5 rows (eg with display in jupyter or regular print, there's potentially a simpler way than this if you use the pd.option_context context.
#make 100 3d random numbers
df = pd.DataFrame(np.random.randn(100,3))
# sort them by their axis sum
df = df.loc[df.sum(axis=1).index]
with pd.option_context('display.max_rows',10):
print(df)
Outputs:
0 1 2
0 -0.649105 -0.413335 0.374872
1 3.390490 0.552708 -1.723864
2 -0.781308 -0.277342 -0.903127
3 0.433665 -1.125215 -0.290228
4 -2.028750 -0.083870 -0.094274
.. ... ... ...
95 0.443618 -1.473138 1.132161
96 -1.370215 -0.196425 -0.528401
97 1.062717 -0.997204 -1.666953
98 1.303512 0.699318 -0.863577
99 -0.109340 -1.330882 -1.455040
[100 rows x 3 columns]
Small simple function:
def ends(df, x=5):
return df.head(x).append(df.tail(x))
And use like so:
df = pd.DataFrame(np.random.rand(15,6))
ends(df,2)
I actually use this so much, I think it would be a great feature to add to pandas. (No features are to be added to pandas.DataFrame core API) I add it after import like so:
import pandas as pd
def ends(df, x=5):
return df.head(x).append(df.tail(x))
setattr(pd.DataFrame,'ends',ends)
Use like so:
import numpy as np
df = pd.DataFrame(np.random.rand(15,6))
df.ends(2)
You should use both head() and tail() for this purpose. I think the easiest way to do this is:
df.head(5).append(df.tail(5))
In Jupyter, expanding on #bolster's answer, we'll create a reusable convenience function:
def display_n(df,n):
with pd.option_context('display.max_rows',n*2):
display(df)
Then
display_n(df,2)
Returns
0 1 2
0 0.167961 -0.732745 0.952637
1 -0.050742 -0.421239 0.444715
... ... ... ...
98 0.085264 0.982093 -0.509356
99 -0.758963 -0.578267 -0.115865
(except as a nicely formatted HTML table)
when df is df = pd.DataFrame(np.random.randn(100,3))
Notes:
Of course you could make the same thing print as text by modifying display to print above.
On unix-like systems, you can the autoload the above function in all notebooks by placing it in a py or ipy file in ~/.ipython/profile_default/startup as described here.
If you want to keep it to just Pandas, you can use apply() to concatenate the head and tail:
import pandas as pd
from string import ascii_lowercase, ascii_uppercase
df = pd.DataFrame(
{"upper": list(ascii_uppercase), "lower": list(ascii_lowercase)}, index=range(1, 27)
)
df.apply(lambda x: pd.concat([x.head(2), x.tail(2)]))
upper lower
1 A a
2 B b
25 Y y
26 Z z
Associated with Linas Fx.
Defining below
pd.DataFrame.less = lambda df, n=10: df.head(n//2).append(df.tail(n//2))
then you can type only df.less()
It's same as type df.head().append(df.tail())
If you type df.less(2), the result is same as df.head(1).append(df.tail(1))
Combining #ic_fl2 and #watsonic to give the below in Jupyter:
def ends_attr():
def display_n(df,n):
with pd.option_context('display.max_rows',n*2):
display(df)
# set pd.DataFrame attribute where .ends runs display_n() function
setattr(pd.DataFrame,'ends',display_n)
ends_attr()
View first and last 3 rows of your df:
your_df.ends(3)
I like this because I can copy a single function and know I have everything I need to use the ends attribute.

Pandas: Merge or join dataframes based on column data?

I am trying to add several columns of data to an existing dataframe. The dataframe itself was built from a number of other dataframes, which I successfully joined on indices, which were identical. For that, I used code like this:
data = p_data.join(r_data)
I actually joined these on a multi-index, so the dataframe looks something like the following, where Name1 and Name 2 are indices:
Name1 Name2 present r behavior
a 1 1 0 0
2 1 .5 2
4 3 .125 1
b 2 1 0 0
4 5 .25 4
8 1 0 1
So the Name1 index does not repeat data, but the Name2 index does (I'm using this to keep track of dyads, so that Name1 & Name2 together are only represented once). What I now want to add are 4 columns of data that correspond to Name2 data (information on the second member of the dyad). Unlike the "present" "r" and "behavior" data, these data are per individual, not per dyad. So I don't need to consider Name1 data when merging.
The problem is that while Name2 data are repeated to exhaust the dyad combos, the "Name2" column in the data I would now like to add only has one piece of data per Name2 individual:
Name2 Data1 Data2 Data3
1 80 6 1
2 61 8 3
4 45 7 2
8 30 3 6
What I would like the output to look like:
Name1 Name2 present r behavior Data1 Data2 Data3
a 1 1 0 0 80 6 1
2 1 .5 2 61 8 3
4 3 .125 1 45 7 2
b 2 1 0 0 61 8 3
4 5 .25 4 45 7 2
8 1 0 1 30 3 6
Despite reading the documentation, I am not clear on whether I can use join() or merge() for the desired outcome. If I try a join to the existing dataframe like the simple one I've used previously, I end up with the new columns but they are full of NaN values. I've also tried various combinations using Name1 and Name2 as either columns or as indices, with either join or merge (not as random as it sounds, but I'm clearly not interpreting the documentation correctly!). Your help would be very much appreciated, as I am presently very much lost.
I'm not sure if this is the best way, but you could use reset_index to temporarily make your original DataFrame indexed by Name2 only. Then you could perform the join as usual. Then use set_index to again make Name1 part of the MultiIndex:
import pandas as pd
df = pd.DataFrame({'Name1':['a','a','a','b','b','b'],
'Name2':[1,2,4,2,4,8],
'present':[1,1,3,1,5,1]})
df.set_index(['Name1','Name2'], inplace=True)
df2 = pd.DataFrame({'Data1':[80,61,45,30],
'Data2':[6,8,7,3]},
index=pd.Series([1,2,4,8], name='Name2'))
result = df.reset_index(level=0).join(df2).set_index('Name1', append=True)
print(result)
# present Data1 Data2
# Name2 Name1
# 1 a 1 80 6
# 2 a 1 61 8
# b 1 61 8
# 4 a 3 45 7
# b 5 45 7
# 8 b 1 30 3
To make the result look even more like your desired DataFrame, you could reorder and sort the index:
print(result.reorder_levels([1,0],axis=0).sort(axis=0))
# present Data1 Data2
# Name1 Name2
# a 1 1 80 6
# 2 1 61 8
# 4 3 45 7
# b 2 1 61 8
# 4 5 45 7
# 8 1 30 3

Categories