I have a dataframe read from a Stata file and I would like to add a new column to it where each row's entry is a numeric list. How can one accomplish this? I have been trying assignment, but it complains about the index size.
I tried initializing a new column of strings (I also tried integers) and then mapping over it like this, but it didn't work:
testdf['new_col'] = '0'
testdf['new_col'] = testdf['new_col'].map(lambda x : list(range(100)))
Here is a toy example resembling what I have:
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd'], 'start_val': [1,7,9,10], 'end_val' : [3,11, 12,15]}
testdf = pd.DataFrame.from_dict(data)
This is what I would like to have:
data2 = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd'], 'start_val': [1,7,9,10], 'end_val' : [3,11, 12,15], 'list' : [[1,2,3],[7,8,9,10,11],[9,10,11,12],[10,11,12,13,14,15]]}
testdf2 = pd.DataFrame.from_dict(data2)
My final goal is to use explode on that "list" column to duplicate the rows appropriately.
Try this bit of code:
testdf['list'] = pd.Series(np.arange(i, j) for i, j in zip(testdf['start_val'],
                                                           testdf['end_val'] + 1))
testdf
Output:
col_1 col_2 start_val end_val list
0 3 a 1 3 [1, 2, 3]
1 2 b 7 11 [7, 8, 9, 10, 11]
2 1 c 9 12 [9, 10, 11, 12]
3 0 d 10 15 [10, 11, 12, 13, 14, 15]
This uses a generator expression with zip, the pd.Series constructor, and np.arange (numpy imported as np) to create the lists.
If you'd rather stick to using the apply function:
import pandas as pd
import numpy as np
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd'], 'start_val': [1,7,9,10], 'end_val' : [3,11, 12,15]}
df = pd.DataFrame.from_dict(data)
df['range'] = df.apply(lambda row: np.arange(row['start_val'], row['end_val']+1), axis=1)
print(df)
Output:
col_1 col_2 start_val end_val range
0 3 a 1 3 [1, 2, 3]
1 2 b 7 11 [7, 8, 9, 10, 11]
2 1 c 9 12 [9, 10, 11, 12]
3 0 d 10 15 [10, 11, 12, 13, 14, 15]
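Since the stated goal is to explode that list column into one row per value, here is a minimal follow-up sketch of my own (assuming pandas 0.25 or newer, where DataFrame.explode was added), continuing from either answer above:
import pandas as pd

data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd'],
        'start_val': [1, 7, 9, 10], 'end_val': [3, 11, 12, 15]}
testdf = pd.DataFrame(data)

# build the per-row ranges, then turn each list element into its own row
testdf['list'] = [list(range(s, e + 1)) for s, e in zip(testdf['start_val'], testdf['end_val'])]
exploded = testdf.explode('list').reset_index(drop=True)
print(exploded)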
Related
I have a dataframe with multiple columns and I would like to add two more: one for the highest number in each row, and another for the second highest. However, instead of the numbers themselves, I would like to show the names of the columns where they are found.
Assume the following data frame:
import pandas as pd
df = pd.DataFrame({'A': [1, 5, 10], 'B': [2, 6, 11], 'C': [3, 7, 12], 'D': [4, 8, 13], 'E': [5, 9, 14]})
To extract the highest number on every row, I can just apply max(axis=1) like this:
df['max1'] = df[['A', 'B', 'C', 'D', 'E']].max(axis = 1)
This gets me the max number, but not the column name itself.
How can this be applied to the second max number as well?
You can sort the values and assign the top-2 values (this assumes numpy is imported as np):
cols = ['A', 'B', 'C', 'D', 'E']
df[['max2','max1']] = np.sort(df[cols].to_numpy(), axis=1)[:, -2:]
print (df)
A B C D E max2 max1
0 1 2 3 4 5 4 5
1 5 6 7 8 9 8 9
2 10 11 12 13 14 13 14
If you want max1 listed before max2, reverse the order of the sorted columns:
df[['max1','max2']] = np.sort(df[cols].to_numpy(), axis=1)[:, -2:][:, ::-1]
EDIT: To get the top-2 column names and the top-2 values, use:
df = pd.DataFrame({'A': [1, 50, 10], 'B': [2, 6, 11],
'C': [3, 7, 12], 'D': [40, 8, 13], 'E': [5, 9, 14]})
cols = ['A', 'B', 'C', 'D', 'E']
#values in numpy array
vals = df[cols].to_numpy()
#columns names in array
cols = np.array(cols)
#get indices that would sort an array in descending order
arr = np.argsort(-vals, axis=1)
#top 2 columns names
df[['top1','top2']] = cols[arr[:, :2]]
#top 2 values
df[['max1','max2']] = vals[np.arange(arr.shape[0])[:, None], arr[:, :2]]
print (df)
A B C D E top1 top2 max1 max2
0 1 2 3 40 5 D E 40 5
1 50 6 7 8 9 A E 50 9
2 10 11 12 13 14 E D 14 13
Another approach: you can get the first max, then remove it and take the max again to get the second max.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 15, 10], 'B': [2, 89, 11], 'C': [80, 7, 12], 'D': [4, 8, 13], 'E': [5, 9, 14]})
max1 = df.max(axis=1)
maxcolum1 = df.idxmax(axis=1)
max2 = df.replace(np.array(df.max(axis=1)), 0).max(axis=1)
maxcolum2 = df.replace(np.array(df.max(axis=1)), 0).idxmax(axis=1)
df2 = pd.DataFrame({'max1': max1, 'max2': max2, 'maxcol1': maxcolum1, 'maxcol2': maxcolum2})
df.join(df2)
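A further sketch of my own (not from either answer above), using Series.nlargest per row so the top-2 values and their column names come out together:
import pandas as pd

df = pd.DataFrame({'A': [1, 15, 10], 'B': [2, 89, 11], 'C': [80, 7, 12],
                   'D': [4, 8, 13], 'E': [5, 9, 14]})

def top2(row):
    # nlargest keeps the original column labels as the index of the result
    best = row.nlargest(2)
    return pd.Series({'max1': best.iloc[0], 'maxcol1': best.index[0],
                      'max2': best.iloc[1], 'maxcol2': best.index[1]})

print(df.join(df.apply(top2, axis=1)))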
I have a dataframe with 5 columns. I want to look through 3 of them and, for each row, store their values in a dict or list (whichever is more efficient).
Example:
    A   B   C  D  E
1  10  20   9  5  4
2   4  55  14  5  2
3   3   3   9  7  7
I would like to create three lists, like so:
index_1 = [10,20,4]
index_2 = [4,55,2]
index_3 = [3,3,7]
I have no idea how to go forward after looping through the columns:
cols = ['A', 'B', 'E']
for col in cols:
    df[col]
Try:
index_1, index_2, index_3 = [list(row) for row in df[["A", "B", "E"]].values]
Use locals() to create 3 Python variables:
cols = ['A', 'B', 'E']
for idx, col in enumerate(cols, 1):
    locals()[f"index_{idx}"] = df[col].tolist()
>>> index_1
[10, 4, 3]
>>> index_2
[20, 55, 3]
>>> index_3
[4, 2, 7]
We can try:
d = df[['A','B','E']].T.to_dict('list')
Out[227]: {1: [10, 20, 4], 2: [4, 55, 2], 3: [3, 3, 7]}
d[1]
Out[231]: [10, 20, 4]
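A related sketch of my own (not from the answers above): .to_numpy().tolist() gives the same row-wise lists as a plain list of lists, avoiding both unpacking and locals():
import pandas as pd

df = pd.DataFrame({'A': [10, 4, 3], 'B': [20, 55, 3], 'C': [9, 14, 9],
                   'D': [5, 5, 7], 'E': [4, 2, 7]}, index=[1, 2, 3])

# one inner list per row, in row order
rows = df[['A', 'B', 'E']].to_numpy().tolist()
print(rows)  # [[10, 20, 4], [4, 55, 2], [3, 3, 7]]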
I have two pandas dataframes:
df1 = pd.DataFrame({'A': [1, 3, 5], 'B': [3, 4, 5]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [8, 9, 10, 11, 12], 'C': ['K', 'D', 'E', 'F', 'G']})
The index of both dataframes is 'A'.
How can I replace the values of df1's column 'B' with the values of df2's column 'B'?
RESULT of df1:
A B
1 8
3 10
5 12
Maybe dataframe.isin() is what you're searching for:
df1['B'] = df2[df2['A'].isin(df1['A'])]['B'].values
print(df1)
Prints:
A B
0 1 8
1 3 10
2 5 12
One possible solution:
wrk = df1.set_index('A').B
wrk.update(df2.set_index('A').B)
df1 = wrk.reset_index()
The result is:
A B
0 1 8
1 3 10
2 5 12
Another solution, based on merge:
df1 = df1.merge(df2[['A', 'B']], how='left', on='A', suffixes=['_x', ''])\
.drop(columns=['B_x'])
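Another common idiom (a sketch of my own, not from the answers above) is to map df1['A'] through a lookup Series built from df2; any value of 'A' missing from df2 would come back as NaN:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 3, 5], 'B': [3, 4, 5]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [8, 9, 10, 11, 12],
                    'C': ['K', 'D', 'E', 'F', 'G']})

# build a Series indexed by 'A' and look each df1['A'] value up in it
df1['B'] = df1['A'].map(df2.set_index('A')['B'])
print(df1)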
I have a pandas dataframe where one of the columns contains lists of values, like
df = pd.DataFrame({'A': [11,22,33],
'B': [[4,5],[10,11,12], []]})
Now I want to drop all the rows that have empty lists in column 'B'. Can anyone help, please? Thanks.
Using str.len
df[df.B.str.len()!=0]
Out[223]:
A B
0 11 [4, 5]
1 22 [10, 11, 12]
Or
df[df.B.astype(bool)]
Out[225]:
A B
0 11 [4, 5]
1 22 [10, 11, 12]
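If you prefer plain Python len, an equivalent sketch (my own variant, not from the answer above):
import pandas as pd

df = pd.DataFrame({'A': [11, 22, 33],
                   'B': [[4, 5], [10, 11, 12], []]})

# keep only rows whose list in 'B' is non-empty
print(df[df['B'].map(len) > 0])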
My question is similar to one asked here. I have a dataframe and I want to repeat each row of the dataframe k number of times. Along with it, I also want to create a column with values 0 to k-1. So
import pandas as pd
df = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'n' : [ 1, 2, 3],
'v' : [ 10, 13, 8]
})
what_i_want = pd.DataFrame(data={
'id': ['A', 'B', 'B', 'C', 'C', 'C'],
'n' : [ 1, 2, 2, 3, 3, 3],
'v' : [ 10, 13, 13, 8, 8, 8],
'repeat_id': [0, 0, 1, 0, 1, 2]
})
The command below does half of the job. I am looking for a pandas way of adding the repeat_id column.
df.loc[df.index.repeat(df.n)]
Use GroupBy.cumcount and copy to avoid SettingWithCopyWarning:
If you modify values in df1 later, you will find that the modifications do not propagate back to the original data (df), and that pandas raises a warning.
df1 = df.loc[df.index.repeat(df.n)].copy()
df1['repeat_id'] = df1.groupby(level=0).cumcount()
df1 = df1.reset_index(drop=True)
print (df1)
id n v repeat_id
0 A 1 10 0
1 B 2 13 0
2 B 2 13 1
3 C 3 8 0
4 C 3 8 1
5 C 3 8 2
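An alternative sketch of my own (not part of the answer above) that builds repeat_id directly with NumPy instead of groupby:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': ['A', 'B', 'C'], 'n': [1, 2, 3], 'v': [10, 13, 8]})

# repeat each row n times, then number the copies 0..n-1 within each original row
out = df.loc[df.index.repeat(df['n'])].reset_index(drop=True)
out['repeat_id'] = np.concatenate([np.arange(k) for k in df['n']])
print(out)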