I have a dataframe where every two rows are related, and I am trying to give each pair of rows a unique ID. I thought this would be easy, but I cannot figure it out. Let's say I have this dataframe:
df = pd.DataFrame({'Var1': ['A', 2, 'C', 7], 'Var2': ['B', 5, 'D', 9]})
print(df)
  Var1 Var2
0    A    B
1    2    5
2    C    D
3    7    9
I would like to add an ID that would result in a dataframe that looks like this:
df = pd.DataFrame({'ID': [1, 1, 2, 2], 'Var1': ['A', 2, 'C', 7], 'Var2': ['B', 5, 'D', 9]})
print(df)
   ID Var1 Var2
0   1    A    B
1   1    2    5
2   2    C    D
3   2    7    9
This is just a sample; every two rows are related, so I just want the ID column to count 1, 1, 2, 2, 3, 3, and so on.
Thanks for any help!
You can create a sequence first and then divide it by 2 (integer division):
import numpy as np
df['ID'] = np.arange(len(df)) // 2 + 1
df
# Var1 Var2 ID
#0 A B 1
#1 2 5 1
#2 C D 2
#3 7 9 2
I don't think there is a more native Pandas way to do it, but this works...
import pandas as pd
df = pd.DataFrame({'Var1': ['A', 2, 'C', 7], 'Var2': ['B', 5, 'D', 9]})
df['ID'] = 1 + df.index // 2
df[['ID', 'Var1', 'Var2']]
Output:
ID Var1 Var2
0 1 A B
1 1 2 5
2 2 C D
3 2 7 9
I have a dataset with 82 columns and would like to turn all column values into dummy variables using pd.get_dummies, except for the first column, "business_id".
How can I make pd.get_dummies operate only on the other 81 columns?
You can exclude columns based on location by slicing df.columns:
import pandas as pd
df = pd.DataFrame({'A': ['a', 'b', 'a'],
                   'B': ['b', 'a', 'c'],
                   'C': [1, 2, 3],
                   'D': [4, 5, 6]})
df = pd.get_dummies(df, columns=df.columns[1:])
# For Display
print(df.to_string(index=False))
Output:
A B_a B_b B_c C_1 C_2 C_3 D_4 D_5 D_6
a 0 1 0 1 0 0 1 0 0
b 1 0 0 0 1 0 0 1 0
a 0 0 1 0 0 1 0 0 1
For a more general solution, you can filter out particular columns programmatically using filter over your df.columns.
Put whatever column names you want to exclude in columns_to_exclude.
import pandas as pd
df = pd.DataFrame({'A': ['a', 'b', 'a'],
                   'B': ['b', 'a', 'c'],
                   'C': [1, 2, 3],
                   'D': [4, 5, 6]})
columns_to_exclude = ['B', 'D']
# Wrap the filter in list() so get_dummies receives a concrete column list.
df = pd.get_dummies(df, columns=list(filter(
    lambda c: c not in columns_to_exclude,
    df.columns)))
# For Display
print(df.to_string(index=False))
Output:
B  D  A_a  A_b  C_1  C_2  C_3
b  4    1    0    1    0    0
a  5    0    1    0    1    0
c  6    1    0    0    0    1
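For a name-based alternative (a small sketch, not taken from the answers above), pandas' Index.difference gives the complement of the excluded set directly; note that it returns the remaining names sorted:
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'],
                   'B': ['b', 'a', 'c'],
                   'C': [1, 2, 3],
                   'D': [4, 5, 6]})

columns_to_exclude = ['B', 'D']

# Index.difference returns every column label not in the excluded set.
df = pd.get_dummies(df, columns=df.columns.difference(columns_to_exclude))
print(df.to_string(index=False))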
I have data in an Excel file. Here is a sample:
In [1]: import pandas as pd
        df = pd.DataFrame({'T1': ['A', 'B', 'A'],
                           'T1_data': [3, 2, '3K'],
                           'T2': ['B', 'A', 'B'],
                           'T2_data': ['5,2K', 4, 2]})
        df
Out[1]:
  T1 T1_data T2 T2_data
0  A       3  B    5,2K
1  B       2  A       4
2  A      3K  B       2
Expected outputs. I want this:
   T1_count  T1_data  T2_count   T2_data
A         2    3, 3K         1         4
B         1        2         2   5,2K, 2
and this:
   T12_count    T12_data
A          3    3, 3K, 4
B          3  2, 5,2K, 2
I know the simple value_counts(), but I don't know how to produce the tables above. Any help would be really appreciated.
df1 = df['T1'].value_counts()
df1
A 2
B 1
Name: T1, dtype: int64
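A minimal sketch of one way to build both tables (assuming the joined strings shown above are the goal): aggregate each label/data pair with a count plus a string join, then stack the two pairs for the combined version.
import pandas as pd

df = pd.DataFrame({'T1': ['A', 'B', 'A'],
                   'T1_data': [3, 2, '3K'],
                   'T2': ['B', 'A', 'B'],
                   'T2_data': ['5,2K', 4, 2]})

def join_str(s):
    return ', '.join(map(str, s))

# Per-pair summary: how often each label appears in T1/T2, plus its joined data.
t1 = df.groupby('T1')['T1_data'].agg(T1_count='count', T1_data=join_str)
t2 = df.groupby('T2')['T2_data'].agg(T2_count='count', T2_data=join_str)
print(t1.join(t2))

# Combined summary: stack the (label, data) pairs on top of each other, group once.
stacked = pd.concat([
    df[['T1', 'T1_data']].rename(columns={'T1': 'key', 'T1_data': 'data'}),
    df[['T2', 'T2_data']].rename(columns={'T2': 'key', 'T2_data': 'data'}),
])
print(stacked.groupby('key')['data'].agg(T12_count='count', T12_data=join_str))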
Assuming the following dataframe:
variable value
0 A 12
1 A 11
2 B 4
3 A 2
4 B 1
5 B 4
I want to extract the last observation for each variable. In this case, it would give me:
variable value
3 A 2
5 B 4
How would you do this in the most pandas/pythonic way?
I'm not worried about performance; clarity and conciseness are important.
The best way I came up with:
df = pd.DataFrame({'variable': ['A', 'A', 'B', 'A', 'B', 'B'], 'value': [12, 11, 4, 2, 1, 4]})
variables = df['variable'].unique()
new_df = df.drop(index=df.index)  # empty frame with the same columns
for v in variables:
    new_df = pd.concat([new_df, df[df['variable'] == v].tail(1)])
Use drop_duplicates with keep='last':
new_df = df.drop_duplicates('variable', keep='last')
Out[357]:
variable value
3 A 2
5 B 4
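A groupby-based alternative (an equivalent sketch, not part of the answer above): GroupBy.tail(1) also keeps each group's last row and preserves the original index and row order:
import pandas as pd

df = pd.DataFrame({'variable': ['A', 'A', 'B', 'A', 'B', 'B'],
                   'value': [12, 11, 4, 2, 1, 4]})

# tail(1) on the groupby keeps each group's last row, original index intact.
new_df = df.groupby('variable').tail(1)
print(new_df)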
I have a DataFrame in which I want to switch the order of the values in only the third row while keeping the other rows the same.
I have to switch the order under some condition in my project, but here is an example that probably has no real meaning.
Suppose the dataset is
df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})
df
out[1]:
A B C
0 0 5 a
1 1 6 b
2 2 7 c
3 3 8 d
4 4 9 e
I want to have this output, with the A and B values swapped in row 2:
A B C
0 0 5 a
1 1 6 b
2 7 2 c
3 3 8 d
4 4 9 e
How do I do it?
I have tried:
new_order = [1, 0, 2] # specify new order of the third row
i = 2 # specify row number
df.iloc[i] = df[df.columns[new_order]].loc[i] # reorder the third row only and assign new values to df
I observed from the output of the right-hand side that the columns are reordered as I wanted:
df[df.columns[new_order]].loc[i]
Out[2]:
B 7
A 2
C c
Name: 2, dtype: object
But when I assign it back to df, nothing changes. I guess it's because the values are matched back by column name.
Can someone help me? Thanks in advance!
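The guess about name matching is right: when a Series is assigned into a row, pandas aligns it on the column labels, so every value goes straight back to its original column and nothing appears to change. A minimal fix, reusing the question's own setup: strip the labels with to_numpy() so the assignment becomes purely positional.
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})

new_order = [1, 0, 2]  # new column order for the third row
i = 2                  # row position

# to_numpy() drops the index labels, so the values land positionally
# instead of being realigned back under A, B, C.
df.iloc[i] = df.iloc[i].iloc[new_order].to_numpy()
print(df)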
I have a dataframe like this where the columns are the scores of some metrics:
A B C D
4 3 3 1
2 5 2 2
3 5 2 4
I want to create a new column to summarize which metrics each row scored over a set threshold in, using the column name as a string. So if the threshold was A > 2, B > 3, C > 1, D > 3, I would want the new column to look like this:
A B C D NewCol
4 3 3 1 AC
2 5 2 2 BC
3 5 2 4 ABCD
I tried using a series of np.where calls:
df['NewCol'] = np.where(df['A'] > 2, 'A', '')
df['NewCol'] = np.where(df['B'] > 3, 'B', '')
etc.
but realized that each assignment overwrites the column, so any row that didn't meet all four conditions ended up with only the last matching metric, like so:
A B C D NewCol
4 3 3 1 C
2 5 2 2 C
3 5 2 4 ABCD
I am pretty sure there is an easier and correct way to do this.
You could do:
import pandas as pd
data = [[4, 3, 3, 1],
        [2, 5, 2, 2],
        [3, 5, 2, 4]]
df = pd.DataFrame(data=data, columns=['A', 'B', 'C', 'D'])
th = {'A': 2, 'B': 3, 'C': 1, 'D': 3}
df['result'] = [''.join(k for k in df.columns if record[k] > th[k])
                for record in df.to_dict('records')]
print(df)
Output
A B C D result
0 4 3 3 1 AC
1 2 5 2 2 BC
2 3 5 2 4 ABCD
Using dot: compare each row against a Series of thresholds with gt, then take the dot product of the resulting boolean mask with the column names. Since True * 'A' is 'A' and False * 'A' is '', the product concatenates the names of the passing columns for each row.
s = pd.Series([2, 3, 1, 3], index=df.columns)
df.gt(s, axis=1).dot(df.columns)
Out[179]:
0      AC
1      BC
2    ABCD
dtype: object
# df['New'] = df.gt(s, axis=1).dot(df.columns)
Another option that works in an array-oriented fashion; it would be interesting to compare performance.
import pandas as pd
import numpy as np
# Data to test.
data = pd.DataFrame([[4, 3, 3, 1],
                     [2, 5, 2, 2],
                     [3, 5, 2, 4]],
                    columns=['A', 'B', 'C', 'D'])
# Series to hold the thresholds.
thresholds = pd.Series([2, 3, 1, 3], index=['A', 'B', 'C', 'D'])
# Subtract the thresholds from the data (broadcasting on column labels); where a
# value exceeds its threshold, np.where picks the column name, else ''. The
# result is an object array, so sum(axis=1) concatenates the strings row-wise.
data['result'] = np.where(data - thresholds > 0, data.columns, '').sum(axis=1)
print(data)
print(data)
Gives:
A B C D result
0 4 3 3 1 AC
1 2 5 2 2 BC
2 3 5 2 4 ABCD